How do different genes in the human genome express themselves?

It is said that the human genome contains over twenty-five thousand genes. How many of these can express themselves as an external or internal trait in human beings (e.g. eye color, hair color, earlobe attachment)? What functions are the other genes mostly responsible for?

The human genome doesn't contain 30,000 genomes; rather, it contains around 30,000 genes by the early estimates of the Human Genome Project (HGP), and later estimates are closer to 20,000 to 25,000. Human genes are split genes, i.e., long sections of non-coding DNA (introns) sit between short sections of actual coding DNA (exons). Thus a lot of the sequence is non-coding, often called "junk". Genes express themselves through a two-step process. The first step is transcription, in which a DNA segment is copied into mRNA, which then undergoes splicing to remove the non-coding sections. The second step is translation, in which that mRNA is used to build a protein. It is through this conversion of DNA segments into proteins that the various genes express themselves.
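The two steps described above can be sketched in a few lines of Python. This is only a toy illustration: the gene sequence, the exon coordinates, and the five-entry codon table are made up for the example (a real codon table has 64 entries), and the function names are hypothetical.

```python
# Toy model of gene expression: transcription -> splicing -> translation.
# The sequence, exon coordinates, and tiny codon table are illustrative only.

CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}

def transcribe(coding_strand: str) -> str:
    """Return pre-mRNA matching the coding strand, with T replaced by U."""
    return coding_strand.replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (0-based, end-exclusive), discarding introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Two exons separated by a made-up 10-base intron.
gene = "ATGGCT" + "GTAAGTCCAG" + "TGGAAATAA"
exons = [(0, 6), (16, 25)]

pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, exons)
protein = translate(mrna)
print(mrna, protein)
```

Note that the intron is present in the pre-mRNA but absent from the spliced mRNA, which is why only the exon sequence contributes to the protein.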

As for what the other genes do, not all genes are busy determining visible phenotypic traits. A lot of other things are going on: the various biochemical reactions in our body are catalyzed by enzymes, and the production of these enzymes is controlled by genes. To learn more about how this is regulated, look up the lac operon or the tryptophan operon.

Individual Differences Caused By Shuffled Chunks Of DNA In The Human Genome

A study by Yale researchers offers a new view of what causes the greatest genetic variability among individuals -- suggesting that it is due less to single point mutations than to the presence of structural changes that cause extended segments of the human genome to be missing, rearranged, or present in extra copies.

"The focus for identifying genetic differences has traditionally been on point mutations or SNPs -- changes in single bases in individual genes," said Michael Snyder, the Cullman Professor of Molecular, Cellular & Developmental Biology and senior author of the study, which was published in Science Express. "Our study shows that a considerably greater amount of variation between individuals is due to rearrangement of big chunks of DNA."

Although the original human genome sequencing effort was comprehensive, it left regions that were poorly analyzed. Recently, investigators found that even in healthy individuals, many regions in the genome show structural variation. This study was designed to fill in the gaps in the genome sequence and to create a technology to rapidly identify structural variations between genomes at very high resolution over extended regions.

"We were surprised to find that structural variation is much more prevalent than we thought and that most of the variants have an ancient origin. Many of the alterations we found occurred before early human populations migrated out of Africa," said first author Jan Korbel, a postdoctoral fellow in the Department of Molecular Biophysics & Biochemistry at Yale.

To look at structural variants that were shared or different, DNA from two females -- one of African descent and one of European descent -- was analyzed using a novel DNA-based methodology called Paired-End Mapping (PEM). Researchers broke up the genomic DNA into manageable pieces about 3000 bases long, tagged and rescued the paired ends of the fragments, and then analyzed their sequence with a high-throughput, rapid-sequencing method developed by 454 Life Sciences.
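The general idea behind paired-end approaches can be sketched as follows: if the two ends of a roughly 3000-base fragment map to a reference genome much farther apart or much closer together than expected, or in an unexpected orientation, a structural difference is a plausible explanation. The sketch below is a simplified illustration and not the study's actual pipeline; the tolerance value, the class, and the function name are all assumptions made for the example.

```python
from dataclasses import dataclass

EXPECTED_SPAN = 3000  # approximate fragment length mentioned in the text
TOLERANCE = 1000      # hypothetical cutoff, not taken from the study

@dataclass
class PairedEnd:
    left_pos: int                     # reference coordinate of the left end
    right_pos: int                    # reference coordinate of the right end
    expected_orientation: bool = True # False if the ends map in a flipped orientation

def classify(pair: PairedEnd) -> str:
    """Label a pair by comparing its mapped span with the expected span."""
    if not pair.expected_orientation:
        return "possible inversion"
    span = pair.right_pos - pair.left_pos
    if span > EXPECTED_SPAN + TOLERANCE:
        # Ends lie too far apart on the reference: the sample may lack a segment.
        return "possible deletion"
    if span < EXPECTED_SPAN - TOLERANCE:
        # Ends lie too close on the reference: the sample may carry extra sequence.
        return "possible insertion"
    return "concordant"

pairs = [
    PairedEnd(10_000, 13_050),
    PairedEnd(50_000, 58_200),
    PairedEnd(90_000, 90_900),
    PairedEnd(120_000, 123_000, expected_orientation=False),
]
calls = [classify(p) for p in pairs]
print(calls)
```

In practice a single discordant pair proves little; a real analysis would require several independent pairs supporting the same event.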

"454 Sequencing can generate hundreds of thousands of long read pairs that are unique within the human genome to quickly and accurately determine genomic variations," explained Michael Egholm, a co-author of the study and vice president of research and development at 454 Life Sciences.

"Previous work, based on point mutations estimated that there is a 0.1 percent difference between individuals, while this work points to a level of variation between two- and five-times higher," said Snyder.

"We also found 'hot spots' -- particular regions where there is a lot of variation," said Korbel. "While these regions may be still actively undergoing evolution, they are often regions associated with genetic disorder and disease."

"These results will have an impact on how people study genetic effects in disease," said Alex Eckehart Urban, a graduate student in Snyder's group, and one of the principal authors on the study. "It was previously assumed that 'landmarks,' like the SNPs mentioned earlier, were fairly evenly spread out in the genomes of different people. Now, when we are hunting for a disease gene, we have to take into account that structural variations can distort the map and differ between individual patients."

"While it may sound like a contradiction," says Snyder, "this study supports results we have previously reported about gene regulation as the primary cause of variation. Structural variation of large spans of the genome will likely alter the regulation of individual genes within those sequences."

According to the authors, even in healthy people, there are variants in which part of a gene is deleted or sequences from two genes are fused together without destroying the cellular activity with which they are associated. They say these findings show that the "parts list" of the human genome may be more variable, and possibly more flexible, than previously thought.

Other authors from Yale, in addition to primary authors Alex E Urban and Jan Korbel, who is also affiliated with the European Molecular Biology Laboratory in Heidelberg, Germany, are Fabian Grubert, Philip Kim, Dean Palejev, Nicholas Carriero, Andrea Tanzer, Eugenia Saunders, Sherman Weissman, and Mark Gerstein. The research was funded by the National Institutes of Health, a Marie Curie Fellowship, the Alexander von Humboldt Foundation, The Wellcome Trust, Roche Applied Science and the Yale High Performance Computation Center.

Citation: Science: Science Express (on line) September 28, 2007.

Story Source:

Materials provided by Yale University. Note: Content may be edited for style and length.

New Eastern Outlook

It is often said that if it can be imagined, it will inevitably be done. And such a sentiment could not be any truer in terms of applying genetic engineering and synthetic biology to the genomes of our planet’s organisms including humans themselves.

While genetic code can be synthesized and arranged in many ways, perhaps no method has been as promising as the CRISPR-Cas system. From laboratory experiments to emerging software used to create genetic code almost as easily as code for a computer, gene editing has never been easier, opening the door to never-before-possible applications.

Perhaps no technology yet has been poised to change the world so profoundly. All life on Earth, every living organism, now stands the possibility of potentially being “edited” on the most basic genetic level, enhancing or degrading it, but forever changing it.

Gene editing or “gene therapy” performed on children or adults changes the genetic makeup of targeted cells, which, upon dividing, impart this new genetic material to each subsequent new cell. This is why treatments for diseases using gene therapy are often successful with only a single shot. The “treatment” self-replicates perpetually within the patient’s body. Everything from leukemia to congenital genetic defects has been overcome in clinical trials using this method.

As far as science knows, these changes cannot be passed on to the offspring of patients. However, changes made to the genetic makeup of a human at the earliest stages of development can be passed on, spreading genetic changes made in labs to the greater population.

The Biggest Threats: The Jab and Slow Kill

Talk of gene editing usually revolves around its use to treat diseases and produce super-crops and livestock to “save the world.” But as history has shown us, any technology is but a double edged sword. Whatever good it is capable of, it is proportionally capable of just as much bad.

The first and foremost danger of human gene editing in particular is its use in weaponized vaccines. Such fears are founded upon what was revealed by the United Nations during the apartheid government in South Africa where a government program named “Project Coast” actually endeavored to produce vaccines that were race-specific in hopes of sterilizing or killing off its black population.

One example of this interaction involved anti-fertility work. According to documents from RRL [Roodeplaat Research Laboratories], the facility had a number of registered projects aimed at developing an anti-fertility vaccine. This was a personal project of the first managing director of RRL, Dr Daniel Goosen. Goosen, who had done research into embryo transplants, told the TRC that he and Basson had discussed the possibility of developing an anti-fertility vaccine which could be selectively administered—without the knowledge of the recipient. The intention, he said, was to administer it to black South African women without their knowledge.

At the time, the technology to accomplish such a feat never materialized. Now it has.

Another danger is “slow kill.” This would be the process of using gene editing to affect individuals directly or through a genetically modified food supply subtly, infecting or killing off targeted demographic groups over a longer period of time. The advantage of this method would be the ambiguity surrounding what was causing upticks in “cancer” and other maladies brought on by degraded immune systems and overall health.

And while some might be tempted to claim the dangers of this technology being used against populations remains solely in the realm of “Nazi eugenicists” and racist South African regimes, the truth of the matter is even Washington has penned policy papers advocating weapons deployed amid the “world of microbes.”

Mentioned in the US Neo-Conservative Project for a New American Century’s (PNAC) 2000 report titled Rebuilding America’s Defenses it stated:

The proliferation of ballistic and cruise missiles and long-range unmanned aerial vehicles (UAVs) will make it much easier to project military power around the globe. Munitions themselves will become increasingly accurate, while new methods of attack – electronic, “non-lethal,” biological – will be more widely available. (p.71 of .pdf)

Although it may take several decades for the process of transformation to unfold, in time, the art of warfare on air, land, and sea will be vastly different than it is today, and “combat” likely will take place in new dimensions: in space, “cyber-space,” and perhaps the world of microbes. (p.72 of .pdf)

And advanced forms of biological warfare that can “target” specific genotypes may transform biological warfare from the realm of terror to a politically useful tool. (p.72 of .pdf)

Biological warfare that can “target” specific genotypes is precisely what is now possible in the advent of improved gene editing. While many may suspect profit alone drives large pharmaceutical corporations to push vaccines on the global population, in reality, what it may also represent is an attempt by these very conspirators to create a well established globalized medium through which to administer their targeted bioweapons, yet another reason why the matter of human healthcare and biotechnology (and specifically vaccines) is a matter of not just business, but of national security as well.

Overwriting the Planet’s Genetic Heritage

Recently, Chinese scientists have crossed what many Western commentators, scientists and others have claimed is an “ethical line” by applying gene editing to human embryos. Critics have condemned the move specifically because any human “edited” while at their embryonic stage would likely transfer those genetic changes to any offspring they had upon becoming an adult.

Yet many of these critics have been vocal advocates for precisely the same use of biotech, though not for humans, but rather for our food supply. Genetically modified organisms (GMOs), particularly modified crops, transfer their artificially altered genetic code to the next generation. Cross pollination has repeatedly contaminated the fields of farmers not using GMOs, creating an expanding controversy and multiple lawsuits and legal reviews.

In reality, all genetic editing, especially when it alters the genetic material of subsequent generations, represents a potential threat to the genetic heritage of the entire planet with potential consequences we may still not fully understand. In a world where the “science is final” regarding humanity’s impact on the planet’s climate, demanding “urgent action” to stop or reverse it, the absence of a similar impetus behind stopping the contamination of our planet’s genetic heritage seems suspiciously hypocritical if not utterly reckless and even intentional.

Of course, gene editing will be done, with or without the approval of governments and the people they govern. However, measures should be developed and put in place to preserve the natural genetic heritage of the planet, and such measures should be decentralized as much as possible.

The James Bond-esque “Svalbard Global Seed Vault” in the frigid climate of Norway represents a sort of “backup” for many of the planet’s horticultural species, but is controlled by the very interests intentionally destroying the planet’s genomes. It represents essentially a criminal gang preparing to sink the ship, but only after securing for themselves the only lifeboat available.

More lifeboats must be made available and it will require the understanding of policymakers of this emerging technology and the threats it presents, along with national and local policies to hedge against these threats.

The West Trapped in its Own Hypocrisy

Ironically, the West’s own hypocrisy has tied its hands in condemning China’s moves to recklessly alter the human genomes of embryos. Not only is the West’s attitude toward GMOs in general now hurting their case against China, the prevailing attitude in the West that embryos are not even “human” is also critically hypocritical, regardless of how irrational, unscientific and unqualified (however very politically convenient) such an attitude is.

To the West, unborn children are virtually “garbage” to be thrown away on a whim. So the Chinese might be forgiven for thinking it is perfectly ok to experiment recklessly upon them. In reality a human being’s unique genetic code and the metabolic cellular activity that constitutes the beginning of its life … both of which perpetuate themselves uninterrupted until birth and continue on until death, natural or otherwise … begin at conception. As such, experimenting on a human embryo may not superficially “feel” or “look” like human experimentation, but scientifically it is.

The West is quite right about condemning China for its experimentation on human embryos, however its confused self-serving hypocrisy has made this condemnation incoherent and unfortunately irrelevant.

Regardless, those nations still adhering to a sense of both objective science and humanity can and must set a precedent based on the above described realities. They must recognize the threats and abuses this technology poses equally with its benefits. They must educate their populations to understand the difference between the two, and the importance of developing a national biotechnology initiative as a matter of both national security and progress. But above all, they must understand that biotechnology represents the next big revolution, after information technology, and begin building the necessary infrastructure to support it.

Without doing so, nations will find themselves ill-prepared to either capitalize on its benefits or defend against its many and incredibly dangerous abuses.

Weaponization, accidents and even the prospect of globalized corporations finding, then making inaccessible the cures to diseases and conditions affecting millions such as cancer, diabetes and heart disease are all threats we now face, whether we would like to admit it or not. One point the West correctly made upon its hand wringing over China’s most recent and reckless leap forward, was that the matter of biotechnology’s profound impact on the human genome and the genetic heritage of the entire planet is no longer the subject of a “future” scenario. It is a matter of present concern.

Ulson Gunnar, a New York-based geopolitical analyst and writer especially for the online magazine “New Eastern Outlook”.

Applying Genomics

The introduction of DNA sequencing and whole genome sequencing projects, particularly the Human Genome Project, has expanded the applicability of DNA sequence information. Genomics is now being used in a wide variety of fields, such as metagenomics, pharmacogenomics, and mitochondrial genomics. The most commonly known application of genomics is to understand and find cures for diseases.

Predicting Disease Risk at the Individual Level

Predicting the risk of disease involves screening and identifying currently healthy individuals by genome analysis at the individual level. Intervention with lifestyle changes and drugs can be recommended before disease onset. However, this approach is most applicable when the problem arises from a single gene mutation. Such defects only account for about 5 percent of diseases found in developed countries. Most of the common diseases, such as heart disease, are multifactorial or polygenic, which refers to a phenotypic characteristic that is determined by two or more genes, and also environmental factors such as diet. In April 2010, scientists at Stanford University published the genome analysis of a healthy individual (Stephen Quake, a scientist at Stanford University, who had his genome sequenced); the analysis predicted his propensity to acquire various diseases. A risk assessment was done to analyze Quake's percentage of risk for 55 different medical conditions. A rare genetic mutation was found that showed him to be at risk for sudden heart attack. He was also predicted to have a 23 percent risk of developing prostate cancer and a 1.4 percent risk of developing Alzheimer's disease. The scientists used databases and several publications to analyze the genomic data. Even though genomic sequencing is becoming more affordable and analytical tools are becoming more reliable, ethical issues surrounding genomic analysis at a population level remain to be addressed. For example, could such data be legitimately used to charge more or less for insurance or to affect credit ratings?

Genome-wide Association Studies

Since 2005, it has been possible to conduct a type of study called a genome-wide association study, or GWAS. A GWAS is a method that identifies differences between individuals in single nucleotide polymorphisms (SNPs) that may be involved in causing diseases. The method is particularly suited to diseases that may be affected by one or many genetic changes throughout the genome. It is very difficult to identify the genes involved in such a disease using family history information. The GWAS method relies on a genetic database that has been in development since 2002 called the International HapMap Project. The HapMap Project sequenced the genomes of several hundred individuals from around the world and identified groups of SNPs. The groups include SNPs that are located near to each other on chromosomes so they tend to stay together through recombination. The fact that the group stays together means that identifying one marker SNP is all that is needed to identify all the SNPs in the group. There are several million SNPs identified, but identifying them in other individuals who have not had their complete genome sequenced is much easier because only the marker SNPs need to be identified.

In a common design for a GWAS, two groups of individuals are chosen: one group has the disease, and the other group does not. The individuals in each group are matched in other characteristics to reduce the effect of confounding variables causing differences between the two groups. For example, the genotypes may differ because the two groups are mostly taken from different parts of the world. Once the individuals are chosen, and typically their numbers are a thousand or more for the study to work, samples of their DNA are obtained. The DNA is analyzed using automated systems to identify large differences in the percentage of particular SNPs between the two groups. Often the study examines a million or more SNPs in the DNA. The results of GWAS can be used in two ways: the genetic differences may be used as markers for susceptibility to the disease in undiagnosed individuals, and the particular genes identified can be targets for research into the molecular pathway of the disease and potential therapies. An offshoot of the discovery of gene associations with disease has been the formation of companies that provide so-called "personal genomics" that will identify risk levels for various diseases based on an individual's SNP complement. The science behind these services is controversial.
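The case-control comparison described above can be illustrated with a small calculation. The sketch below assumes allele counts have already been tallied for each SNP in the two groups and scores each SNP with a Pearson chi-square statistic on the 2x2 table. The SNP names and counts are invented for the example, and real studies use more elaborate statistics plus corrections for testing a million or more SNPs at once.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical allele counts per SNP:
# (cases with allele, cases without, controls with allele, controls without)
allele_counts = {
    "snp_A": (600, 1400, 400, 1600),  # allele is more common in cases
    "snp_B": (500, 1500, 500, 1500),  # same frequency in both groups
}

stats = {snp: chi_square_2x2(*counts) for snp, counts in allele_counts.items()}

# Rank SNPs from strongest to weakest association.
ranked = sorted(stats, key=stats.get, reverse=True)
for snp in ranked:
    print(snp, round(stats[snp], 2))
```

A large statistic flags a SNP whose frequency differs between the groups. It indicates association only, so a flagged SNP is a starting point for further research and not proof of cause.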

Because GWAS looks for associations between genes and disease, these studies provide data for other research into causes, rather than answering specific questions themselves. An association between a gene difference and a disease does not necessarily mean there is a cause-and-effect relationship. However, some studies have provided useful information about the genetic causes of diseases. For example, three different studies in 2005 identified a gene for a protein involved in regulating inflammation in the body that is associated with a disease causing blindness called age-related macular degeneration. This opened up new possibilities for research into the cause of this disease. A large number of genes have been identified to be associated with Crohn's disease using GWAS, and some of these have suggested new hypothetical mechanisms for the cause of the disease.


Pharmacogenomics

Pharmacogenomics involves evaluating the effectiveness and safety of drugs on the basis of information from an individual's genomic sequence. Personal genome sequence information can be used to prescribe medications that will be most effective and least toxic on the basis of the individual patient's genotype. Studying changes in gene expression could provide information about the gene transcription profile in the presence of the drug, which can be used as an early indicator of the potential for toxic effects. For example, genes involved in cellular growth and controlled cell death, when disturbed, could lead to the growth of cancerous cells. Genome-wide studies can also help to find new genes involved in drug toxicity. The gene signatures may not be completely accurate, but can be tested further before pathologic symptoms arise.


Metagenomics

Traditionally, microbiology has been taught with the view that microorganisms are best studied under pure culture conditions, which involves isolating a single type of cell and culturing it in the laboratory. Because microorganisms can go through several generations in a matter of hours, their gene expression profiles adapt to the new laboratory environment very quickly. On the other hand, many species resist being cultured in isolation. Most microorganisms do not live as isolated entities, but in microbial communities known as biofilms. For all of these reasons, pure culture is not always the best way to study microorganisms. Metagenomics is the study of the collective genomes of multiple species that grow and interact in an environmental niche. Metagenomics can be used to identify new species more rapidly and to analyze the effect of pollutants on the environment (Figure 10.3.3). Metagenomics techniques can now also be applied to communities of higher eukaryotes, such as fish.

Figure 10.3.3: Metagenomics involves isolating DNA from multiple species within an environmental niche. The DNA is cut up and sequenced, allowing entire genome sequences of multiple species to be reconstructed from the sequences of overlapping pieces.
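The reconstruction of longer sequences from overlapping pieces can be illustrated with a toy greedy merge. The sketch below repeatedly joins the two reads that share the longest suffix-to-prefix overlap. It is a minimal illustration under simplifying assumptions (error-free reads, a single source sequence, made-up data and function names). Real metagenomic assemblers must also handle sequencing errors, repeats, and reads from many species at once.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Made-up overlapping fragments of one short sequence.
fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(fragments)
print(contigs)
```

If the fragments came from unrelated sequences with no overlaps, the function would simply return them unmerged.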

Creation of New Biofuels

Knowledge of the genomics of microorganisms is being used to find better ways to harness biofuels from algae and cyanobacteria. The primary sources of fuel today are coal, oil, wood, and other plant products such as ethanol. Although plants are renewable resources, there is still a need to find more alternative renewable sources of energy to meet our population's energy demands. The microbial world is one of the largest resources for genes that encode new enzymes and produce new organic compounds, and it remains largely untapped. This vast genetic resource holds the potential to provide new sources of biofuels (Figure 10.3.4).

Figure 10.3.4: Renewable fuels were tested in Navy ships and aircraft at the first Naval Energy Forum. (credit: modification of work by John F. Williams, US Navy)

Mitochondrial Genomics

Mitochondria are intracellular organelles that contain their own DNA. Mitochondrial DNA mutates at a rapid rate and is often used to study evolutionary relationships. Another feature that makes studying the mitochondrial genome interesting is that in most multicellular organisms, the mitochondrial DNA is passed on from the mother during the process of fertilization. For this reason, mitochondrial genomics is often used to trace genealogy.

Genomics in Forensic Analysis

Information and clues obtained from DNA samples found at crime scenes have been used as evidence in court cases, and genetic markers have been used in forensic analysis. Genomic analysis has also become useful in this field. In 2001, the first use of genomics in forensics was published. It was a collaborative effort between academic research institutions and the FBI to solve the mysterious cases of anthrax (Figure 10.3.5) that was transported by the US Postal Service. Anthrax bacteria were made into an infectious powder and mailed to news media and two U.S. Senators. The powder infected the administrative staff and postal workers who opened or handled the letters. Five people died, and 17 were sickened from the bacteria. Using microbial genomics, researchers determined that a specific strain of anthrax was used in all the mailings; eventually, the source was traced to a scientist at a national biodefense laboratory in Maryland.

Figure 10.3.5: Bacillus anthracis is the organism that causes anthrax. (credit: modification of work by CDC; scale-bar data from Matt Russell)

Genomics in Agriculture

Genomics can reduce the trials and failures involved in scientific research to a certain extent, which could improve the quality and quantity of crop yields in agriculture (Figure 10.3.6). Linking traits to genes or gene signatures helps to improve crop breeding to generate hybrids with the most desirable qualities. Scientists use genomic data to identify desirable traits, and then transfer those traits to a different organism to create a new genetically modified organism, as described in the previous module. Scientists are discovering how genomics can improve the quality and quantity of agricultural production. For example, scientists could use desirable traits to create a useful product or enhance an existing product, such as making a drought-sensitive crop more tolerant of the dry season.

Figure 10.3.6: Transgenic agricultural plants can be made to resist disease. These transgenic plums are resistant to the plum pox virus. (credit: Scott Bauer, USDA ARS)


Proteomics

Proteins are the final products of genes that perform the function encoded by the gene. Proteins are composed of amino acids and play important roles in the cell. All enzymes (except ribozymes) are proteins and act as catalysts that affect the rate of reactions. Proteins are also regulatory molecules, and some are hormones. Transport proteins, such as hemoglobin, help transport oxygen to various organs. Antibodies that defend against foreign particles are also proteins. In the diseased state, protein function can be impaired because of changes at the genetic level or because of direct impact on a specific protein.

A proteome is the entire set of proteins produced by a cell type. Proteomes can be studied using the knowledge of genomes because genes code for mRNAs, and the mRNAs encode proteins. The study of the function of proteomes is called proteomics. Proteomics complements genomics and is useful when scientists want to test their hypotheses that were based on genes. Even though all cells in a multicellular organism have the same set of genes, the set of proteins produced in different tissues is different and dependent on gene expression. Thus, the genome is constant, but the proteome varies and is dynamic within an organism. In addition, RNAs can be alternatively spliced (cut and pasted to create novel combinations and novel proteins), and many proteins are modified after translation. Although the genome provides a blueprint, the final architecture depends on several factors that can change the progression of events that generate the proteome.

Genomes and proteomes of patients suffering from specific diseases are being studied to understand the genetic basis of the disease. The most prominent disease being studied with proteomic approaches is cancer (Figure 10.3.7). Proteomic approaches are being used to improve the screening and early detection of cancer; this is achieved by identifying proteins whose expression is affected by the disease process. An individual protein is called a biomarker, whereas a set of proteins with altered expression levels is called a protein signature. For a biomarker or protein signature to be useful as a candidate for early screening and detection of a cancer, it must be secreted in body fluids such as sweat, blood, or urine, so that large-scale screenings can be performed in a noninvasive fashion. The current problem with using biomarkers for the early detection of cancer is the high rate of false-negative results. A false-negative result is a negative test result that should have been positive. In other words, many cases of cancer go undetected, which makes biomarkers unreliable. Some examples of protein biomarkers used in cancer detection are CA-125 for ovarian cancer and PSA for prostate cancer. Protein signatures may be more reliable than biomarkers to detect cancer cells. Proteomics is also being used to develop individualized treatment plans, which involves the prediction of whether or not an individual will respond to specific drugs and the side effects that the individual may have. Proteomics is also being used to predict the possibility of disease recurrence.
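The false-negative problem mentioned above can be expressed as a simple calculation. The sketch below computes the false-negative rate and the sensitivity of a screening test from counts of detected and missed cases. The numbers are invented for illustration and do not describe CA-125, PSA, or any real test.

```python
def false_negative_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of people with the disease whom the test wrongly reports as negative."""
    affected = true_positives + false_negatives
    if affected == 0:
        raise ValueError("no affected individuals in the sample")
    return false_negatives / affected

def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of people with the disease whom the test correctly flags."""
    affected = true_positives + false_negatives
    if affected == 0:
        raise ValueError("no affected individuals in the sample")
    return true_positives / affected

# Hypothetical screening of 100 people who are known to have the cancer.
detected = 60  # test came back positive
missed = 40    # test came back negative (false negatives)

fnr = false_negative_rate(detected, missed)
sens = sensitivity(detected, missed)
print(f"false-negative rate: {fnr:.0%}, sensitivity: {sens:.0%}")
```

The two values always sum to one, so lowering the false-negative rate is the same thing as raising sensitivity.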

Figure 10.3.7: This machine is preparing to do a proteomic pattern analysis to identify specific cancers so that an accurate cancer prognosis can be made. (credit: Dorie Hightower, NCI, NIH)

The National Cancer Institute has developed programs to improve the detection and treatment of cancer. The Clinical Proteomic Technologies for Cancer and the Early Detection Research Network are efforts to identify protein signatures specific to different types of cancers. The Biomedical Proteomics Program is designed to identify protein signatures and design effective therapies for cancer patients.

Scientists discover new roles for viral genes in the human genome

Singapore – The human genome is the blueprint for human life, but much of this blueprint still remains a mystery. Researchers from A*STAR's Genome Institute of Singapore (GIS) have now discovered that sequences from old viruses that were thought to be useless, might contribute to the earliest cell types in the human life cycle. These newly discovered viral elements can be used to identify new types of embryonic stem cells, opening more possibilities to understanding human development and diseases.

The viral sequences that are the focus of the discovery are similar to retroviruses, but since they are a part of the human genome, they are known as endogenous retroviruses (ERV). ERVs are able to reinsert another copy of their own DNA into the human genome once they are activated. Since they mainly multiply their own DNA, they are sometimes referred to as 'selfish DNA'. Because of their 'selfishness', ERVs are potentially dangerous when they destroy genes that are essential to human life. In a study recently published in Cell Stem Cell, scientists describe that many ERVs are activated in cells from early embryos, but instead of being harmful, they might have become useful over the course of evolution.

Genes that are activated are transcribed into RNA to function. Therefore, scientists investigate the RNAs in the cell to identify active genes. "When we investigated public data from embryonic cells, we found that many RNAs originated from regions in the human genome that are ERVs," explained GIS Fellow Dr Jonathan Göke, who led the study. "We did not only observe isolated events, but systematic activation of these ERVs. Every cell type showed transcription of specific classes, something that is very unlikely to occur by chance".
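
The idea of finding active ERVs by asking which RNAs come from ERV regions can be sketched as an interval-overlap count. This is only an illustration under stated assumptions: the coordinates, the family names `family_A` and `family_B`, and the reads are all invented, and the study's actual pipeline is not described in this text.

```python
from collections import Counter

# Hypothetical ERV annotations: (chromosome, start, end, family),
# using half-open coordinates [start, end).
ERV_REGIONS = [
    ("chr1", 100, 500, "family_A"),
    ("chr1", 900, 1200, "family_B"),
    ("chr2", 50, 300, "family_A"),
]

# Hypothetical aligned RNA reads: (chromosome, start, end).
READS = [
    ("chr1", 120, 170),
    ("chr1", 480, 530),
    ("chr1", 600, 650),   # falls outside every annotated ERV
    ("chr2", 10, 60),
    ("chr1", 1000, 1050),
]


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def count_reads_per_family(reads, regions) -> Counter:
    """Count reads that overlap an annotated ERV, grouped by ERV family."""
    counts = Counter()
    for chrom, start, end in reads:
        for r_chrom, r_start, r_end, family in regions:
            if chrom == r_chrom and overlaps(start, end, r_start, r_end):
                counts[family] += 1
                break  # count each read at most once
    return counts


print(count_reads_per_family(READS, ERV_REGIONS))
```

A real analysis would use an indexed interval structure and proper alignment files; the nested loop here is chosen only for readability.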

"Many ERV elements are only fragments of the full viruses," added Dr Göke. "They maintain the activation sequence, but the RNA that they generate can be very different from the RNA that retroviruses generate". In many cases, these ERV-RNAs are even parts of RNAs generated from other genes. This way, ERVs might have evolved to gain a new function: they might have become a part of the blueprint for human life.

ERVs have been shown to play a role in diseases such as cancer. Because many ERVs are not expressed in the most widely used cell models, and they do not exist in mouse, scientists do not yet fully understand their function. The researchers now showed that a part of the ERVs which functions as activator can be used to identify cells that show expression of these ERV families. Such cells might overcome the limitations of current cell models to study the role and function of ERVs in development and disease.

"These are fascinating findings as the embryonic cells that express these ERV-derived RNAs are fundamental to the human life cycle. Now the big question is what they are actually doing." said Dr Guillaume Bourque, associate professor at the McGill University in Canada, who has worked on ERVs himself for many years. "From research with human embryonic stem cells, we know that ERVs have become essential, so it is quite likely that the ERVs described in this study contribute in a number of ways to human development."

"This is a very exciting study," said Prof Huck-Hui Ng, executive director of the GIS. "The results open up many new opportunities to better understand why and how embryonic cells are different from adult cells, and what role these newly discovered ERV-genes play. Some ERVs may even be involved in the formation of diseases, such as cancer."

Dr Göke's team at the GIS plans to take their research further. "We are now developing new algorithms that will help us identify additional ERVs in the human genome, and we try to isolate cells that express these ERV-RNAs. This way we will be able to study their function and how they contribute to human diseases".


We show that genes with housekeeping (HK) expression contain more divergent promoters than genes with a more restricted tissue expression. Importantly, this property cannot be fully explained by the functional class of the encoded gene products, or by a higher prevalence of CpG islands in HK gene promoters. In addition, we have identified a number of transcription factors that are likely to play a predominant role in the control of HK gene expression. We argue that the lower promoter conservation observed in HK genes could be due to a simpler regulation of gene transcription.

Peter Kerpedjiev needed a crash course in genetics. A software engineer with some training in bioinformatics, he was pursuing a PhD and thought it would really help to know some fundamentals of biology. “If I wanted to have an intelligent conversation with someone, what genes do I need to know about?” he wondered.

Kerpedjiev went straight to the data. For years, the US National Library of Medicine (NLM) has been systematically tagging almost every paper in its popular PubMed database that contains some information about what a gene does. Kerpedjiev extracted all the papers marked as describing the structure, function or location of a gene or the protein it encodes.

Sorting through the records, he compiled a list of the most studied genes of all time — a sort of ‘top hits’ of the human genome, and several other genomes besides.
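
A ranking of this kind can be sketched in a few lines of Python. The table layout below (a `gene` column and a comma-separated `pmids` column) and the sample rows are assumptions made for illustration; they are not the real NLM file format, and the PMIDs are invented.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated records: one row per tagged annotation.
SAMPLE_TSV = (
    "gene\tpmids\n"
    "TP53\t100,101\n"
    "TP53\t101,102\n"   # PMID 101 repeats and must be counted once
    "TNF\t200\n"
    "TNF\t201\n"
    "GRB2\t300\n"
)


def papers_per_gene(tsv_text: str) -> list[tuple[str, int]]:
    """Return (gene, distinct paper count) pairs, most studied first."""
    pmids_by_gene: dict[str, set[str]] = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for pmid in row["pmids"].split(","):
            pmids_by_gene[row["gene"]].add(pmid.strip())
    counts = {gene: len(pmids) for gene, pmids in pmids_by_gene.items()}
    # Sort by descending count, breaking ties alphabetically by gene name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for gene, n_papers in papers_per_gene(SAMPLE_TSV):
    print(gene, n_papers)
```

Collecting PMIDs in a set matters because the same paper can be tagged more than once for a gene.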

Heading the list, he found, is a gene called TP53. Three years ago, when Kerpedjiev first did his analysis, researchers had scrutinized the gene or the protein it produces, p53, in some 6,600 papers. Today, that number is at about 8,500 and counting. On average, around two papers are published each day describing new details of the basic biology of TP53.

Its popularity shouldn’t come as news to most biologists. The gene is a tumour suppressor, and widely known as the ‘guardian of the genome’. It is mutated in roughly half of all human cancers. “That explains its staying power,” says Bert Vogelstein, a cancer geneticist at the Johns Hopkins University School of Medicine in Baltimore, Maryland. In cancer, he says, “there’s no gene more important”.

But some chart-topping genes are less well known, including some that rose to prominence in bygone eras of genetic research, only to fall out of fashion as technology progressed. “The list was surprising,” says Kerpedjiev, now a postdoc studying genomic-data visualization at Harvard Medical School in Boston, Massachusetts. “Some genes were predictable; others were completely unexpected.”

To find out more, Nature worked with Kerpedjiev to analyse the most studied genes of all time (see ‘The top 10’). The exercise offers more than a conversation starter: it sheds light on important trends in biomedical research, revealing how concerns over specific diseases or public-health issues have shifted research priorities towards underlying genes. It also shows how just a few genes, many of which span disciplines and disease areas, have dominated research.

Source: Peter Kerpedjiev/NCBI-NLM

Out of the 20,000 or so protein-coding genes in the human genome, just 100 account for more than one-quarter of the papers tagged by the NLM. Thousands go unstudied in any given year. “It’s revealing how much we don’t know about because we just don’t bother to research it,” says Helen Anne Curry, a science historian at the University of Cambridge, UK.

In and out of fashion

In 2002, just after the first drafts of the human genome were published, the NLM started systematically adding ‘gene reference into function’, or GeneRIF, tags to papers 1 . It has extended that annotation back to the 1960s, sometimes using other databases to help fill in the details. It is not a perfectly curated record. “In general, the data set is somewhat noisy,” says Terence Murphy, a staff scientist at the NLM in Bethesda, Maryland. There’s probably some sampling bias for papers published before 2002, he warns. That means that some genes are over-represented and a few may be erroneously missing. “But it’s not awful,” Murphy says. “As you aggregate over multiple genes, that potentially reduces some of these biases.”

With that caveat noted, the PubMed records reveal a few distinct historical periods in which gene-related papers tended to focus on particular hot topics (see ‘Fashionable genes through the years’). Before the mid-1980s, for example, much genetic research centred on haemoglobin, the oxygen-carrying molecule found in red blood cells. More than 10% of all studies on human genetics before 1985 were about haemoglobin in some way.


At the time, researchers were still building on the early work of Linus Pauling and Vernon Ingram, trailblazing biochemists who pioneered the study of disease at a molecular level with discoveries in the 1940s and 1950s of how abnormal haemoglobin caused sickle-cell disease. Molecular biologist Max Perutz, who won a share in the 1962 Nobel Prize in Chemistry for his 3D map of haemoglobin’s structure, continued to explore how the protein’s shape related to its function for decades afterwards.

According to Alan Schechter, a physician-scientist and senior historical consultant at the US National Institutes of Health in Bethesda, the haemoglobin genes — more than any others at the time — offered “an entryway to understanding and perhaps treating a molecular disease”.

A sickle-cell researcher himself, Schechter says that such genes were a focus of conversation both at major genetics meetings and at blood-disease meetings in the 1970s and early 1980s. But as researchers gained access to new technologies for sequencing and manipulating DNA, they started to move on to other genes and diseases, including a then-mysterious infection that was predominantly striking down gay men.

Even before the 1983 discovery that HIV was the cause of AIDS, clinical immunologists such as David Klatzmann had noticed a peculiar pattern among people with the illness. “I was just struck by the fact that these people had no T4 cells,” recalls Klatzmann, who is now at Pierre and Marie Curie University in Paris. He showed 2 in cell-culture experiments that HIV seemed to selectively infect and destroy these cells, a subset of the immune system’s T cells. The question was: how was the virus getting into the cell?

Klatzmann reasoned that the surface protein (later called CD4) that immunologists used to define this set of cells might also serve as the receptor through which HIV entered the cell. He was right, as he reported 3 in a study published in December 1984, alongside a similar paper 4 from molecular virologist Robin Weiss, then at the Institute of Cancer Research in London, and his colleagues.

Within three years, CD4 was the top gene in the biomedical literature. It remained so from 1987 to 1996, a period in which it accounted for 1–2% of all the tags tallied by the NLM.

That attention stemmed in part from efforts to tackle the emerging AIDS crisis. In the late 1980s, for example, several companies dabbled with the idea of engineering therapeutic forms of the CD4 protein that could mop up HIV particles before they infected healthy cells. But results from small human trials proved “underwhelming”, says Jeffrey Lifson, director of the AIDS and Cancer Virus Program at the US National Cancer Institute in Frederick, Maryland.

An even bigger part of CD4’s popularity had to do with basic immunology. In 1986, researchers realized that CD4-expressing T cells could be subdivided into two distinct populations — one that eliminates cell-infecting bacteria and viruses, and one that guards against parasites such as worms, which cause illness without invading cells. “It was a fairly exciting time, because we really understood very little,” says Dan Littman, an immunologist at the New York University School of Medicine. Just the year before, he had helped to clone the DNA that encodes CD4 and insert it into bacteria 5 , so that vast quantities of the protein could be made for research.

A decade later, Littman also co-led one of three teams to show 6 that to enter cells, HIV uses another receptor alongside CD4: a protein identified as CCR5. These, and a second co-receptor called CXCR4, have remained the focus of intensive, global HIV research ever since, with the goal — as-yet unfulfilled — of blocking the virus’s entry into cells.

Fifteen minutes of fame

By the early 1990s, TP53 was already ascendant. But before it climbed to the top of the human gene ladder, there were a few years in which a lesser-known gene called GRB2 was in the spotlight.

At the time, researchers were starting to identify the specific protein interactions involved in cell communication. Thanks to pioneering work by cell biologist Tony Pawson, scientists knew that some small intracellular proteins contained a module called SH2, which could bind to activated proteins at the surface of cells and relay a signal to the nucleus.

In 1992, Joseph Schlessinger, a biochemist at the Yale University School of Medicine in New Haven, Connecticut, showed 7 that the protein encoded by GRB2 — growth factor receptor-bound protein 2 — was that relay point. It contains an SH2 module as well as two domains that activate proteins involved in cell growth and survival. “It’s a molecular matchmaker,” Schlessinger says.

Other researchers soon filled in the gaps, opening a field of study in signal transduction. And although many other building blocks of cell signalling were soon unearthed — ultimately leading to treatments for cancer, autoimmune disorders, diabetes and heart disease — GRB2 stayed at the forefront and was the top-referenced gene for three years in the late 1990s.

In part, that was because GRB2 “was the first physical connection between two parts of the signal-transduction cascade”, says Peter van der Geer, a biochemist at San Diego State University in California. Furthermore, “it’s involved in so many different aspects of cellular regulation”.

GRB2 is something of an outlier in the most-studied list. It’s not a direct cause of disease nor is it a drug target, which perhaps explains why its moment in the sun was fleeting. “You have some rising stars that fall down very quickly because they have no clinical value,” says Thierry Soussi, a long-time TP53 researcher at the Karolinska Institute in Stockholm and Pierre and Marie Curie University. Genes with staying power usually show some sort of therapeutic potential that attracts funding agencies’ support. “It’s always like that,” Soussi says. “The importance of a gene is linked to its clinical value.”

It can also be linked to certain properties of the gene, such as the levels at which it is expressed, how much it varies between populations and the characteristics of its structure. That’s according to an analysis by Thomas Stoeger, a systems biologist at Northwestern University in Evanston, Illinois, who reported this month at a symposium in Heidelberg, Germany, that he could predict which genes would garner the most attention, simply by plugging such attributes into an algorithm.

Stoeger thinks that the reasons for these associations largely boil down to what he calls discoverability. The popular genes happened to be in hot areas of biology and could be probed with the tools available at the time. “It’s easier to study some things than others,” says Stoeger — and that’s a problem, because vast numbers of genes remain uncharacterized and underexplored, leaving major gaps in the understanding of human health and disease.

Curry also points to “intertwined technical, social and economic factors” shaped by politicians, drugmakers and patient advocates.

Right place, right time

Stoeger has also tracked how the general features of popular genes have changed over time. He found, for example, that in the 1980s, researchers focused largely on genes whose protein products were found outside cells. That’s probably because these proteins were easiest to isolate and study. Only more recently did attention shift towards genes whose products are found inside the cell.

That shift happened alongside the publication of the human genome, says Stoeger. The advance would have opened up a larger percentage of genes to enquiry.

Many of the most explored genes, however, don’t fit these larger trends. The p53 protein, for example, is active inside the nucleus. Yet TP53 became the most studied gene around 2000. It, like many of the genes that came to dominate biological research, was not properly understood after its initial discovery — which may explain why it took several decades after the 1979 characterization of the protein for the gene to rise to the top spot in the literature.

At first, the cancer-research community mistook it for an oncogene — one that, when mutated, drives the development of cancer. It wasn’t until 1989 that Suzanne Baker, a graduate student in Vogelstein’s lab, showed 8 that it was actually a tumour suppressor. Only then did functional studies of the gene really begin to pick up steam. “You can see from the spike in publications that go up essentially at that point that there were a lot of people who were really very interested,” says Baker, now a brain-tumour researcher at the St. Jude Children’s Research Hospital in Memphis, Tennessee.

Research into human cancer also brought scientists to TNF, the runner-up to TP53 as the most-referenced human gene of all time, with more than 5,300 citations in the NLM data (see ‘Top genes’). It encodes a protein — tumour necrosis factor — named in 1975 because of its ability to kill cancer cells. But anticancer action proved not to be TNF’s main function. Therapeutic forms of the TNF protein were highly toxic when tested in people.


The gene turned out to be a mediator of inflammation; its effect on tumours was secondary. Once that became clear in the mid-1980s, attention quickly shifted to testing antibodies that block its action. Now, anti-TNF therapies are mainstays of treatment for inflammatory disorders such as rheumatoid arthritis, collectively pulling in tens of billions of dollars in annual sales worldwide.

“This is an example where the knowledge of the gene and the gene product has relatively quickly changed the health of the world,” says Kevin Tracey, a neurosurgeon and immunologist at the Feinstein Institute for Medical Research in Manhasset, New York.

TP53’s dominance was briefly interrupted by another gene, APOE. First described in the mid-1970s as a transporter involved in clearing cholesterol from the blood, the APOE protein was “seriously considered” as a lipid-lowering treatment for preventing heart disease, says Robert Mahley, a pioneer in the field at the University of California, San Francisco, who tested the approach in rabbits 9 .

Ultimately, the creation of statins in the late 1980s doomed this strategy to the dustbin of pharmaceutical history. But then, neuroscientist Allen Roses and his colleagues found the APOE protein bound up in the sticky brain plaques of people with Alzheimer’s disease. They showed 10 in 1993 that one particular form of the gene, APOE4, was associated with a greatly increased risk of the disease.

This generated much wider interest in the gene. Still, it took time to move up the most-studied chart. “The reception was very cool,” recalls Ann Saunders, a neurogeneticist and chief executive of Zinfandel Pharmaceuticals in Chapel Hill, North Carolina, who collaborated with Roses, her late husband. The amyloid hypothesis, which states that build-up of a protein fragment called amyloid-β is responsible for the disease, was all the rage in the Alzheimer’s-research community at the time. And few researchers seemed interested in finding out what a cholesterol-transport protein had to do with the disease. But the genetic link between APOE4 and Alzheimer’s risk proved “irrefutable”, Mahley says, and in 2001, APOE briefly overtook TP53. It remains in the all-time top five, at least for humans (see ‘Beyond human’).

Beyond human

The US National Library of Medicine has tracked references to genes from dozens of species, including mice, flies and other important model organisms, as well as viruses. Looking at genes from all species, more than two-thirds of the 100 most studied genes over the past 50 years have been human (see ‘The gene menagerie’). But non-human genes do appear quite high on the list. Often, these have a clear link to human health, as with mouse versions of TP53, or env, a viral gene that encodes envelope proteins involved in gaining entry to a cell.


Others became foundational to broader genetic studies. A gene from the fruit fly Drosophila melanogaster known simply as white has been the focus of about 3,600 papers — dating back to when biologist Thomas Hunt Morgan, working at Columbia University in New York City, peered through a hand lens one day in 1910 and saw a single male fly with white eyes instead of red 11 . Because its product causes an easily observable change in the fly, the white gene serves as a marker for scientists looking to map and manipulate the fly genome. It has been involved in many fundamental discoveries 12 , such as the demonstration that large stretches of DNA can be duplicated because of unequal exchange between matching chromosomes.

The most popular non-human gene of all time is actually a spot in the mouse genome whose normal function remains poorly understood. Rosa26 comes from an experiment published 13 in 1991, in which cell biologists Philippe Soriano and Glenn Friedrich used a virus to insert an engineered gene randomly into mouse embryonic stem cells. In one cell line, dubbed ROSA26, the engineered gene seemed to be active at all times and in nearly every cell type. The discovery served as a building block for the creation of tools to make and manipulate transgenic mice. “People started using it like crazy,” recalls Soriano, who is now at the Icahn School of Medicine at Mount Sinai in New York City. So far, the genetic locus known as Rosa26 has been involved in some 6,500 functional studies. It is second only to TP53.

Like other popular genes, APOE is well studied because it’s central to one of the biggest unsolved health problems of the day. But it’s also important because anti-amyloid therapies have mostly flamed out in clinical testing. “I hate saying this, but what helped me were the failed trials,” says Mahley, who this year raised US$63 million for his company E-Scape Bio to develop drugs that target the APOE4 protein. Those failures, he says, forced industry and funding agencies to rethink therapeutic strategies for tackling Alzheimer’s.

There’s the rub: it takes a certain confluence of biology, societal pressure, business opportunity and medical need for any gene to become more studied than any other. But once it has made it to the upper echelons, there’s a “level of conservatism”, says Gregory Radick, a science historian at the University of Leeds, UK, “with certain genes emerging as safe bets and then persisting until conditions change”.

The question now is how conditions might change. What new discoveries might send a new gene up the chart — and knock today’s top genes off their pedestal?

The Human Genome Is Full of Viruses

Viruses are amazing molecular machines that are much tinier than even the smallest cells. We often think of viruses like the flu, chickenpox, or herpes as “external” invaders, but viruses are more inherently associated with human life than we often realize. Even after recovering from an infection there will always be a piece of that virus encoded within your DNA (depending on the type of virus). Approximately 8% of the human genome is made up of endogenous retroviruses (ERVs), which are viral gene sequences that have become a permanent part of the human lineage after they infected our ancient ancestors. And these endogenous retroviruses don’t just sit silently in the genome: their expression has been implicated in diseases like autoimmune disorders and breast cancer.

But endogenous retroviruses don’t only harm our health; they can also be extremely useful for human survival. For example, they play a very important role as an interface between a pregnant mother and her fetus by regulating placental development and function. It has been suggested that viruses are not only necessary for the existence of placental mammals, but also for the existence of life in general. Professor Luis P. Villarreal, the Founding Director of the Center for Virus Research at UC Irvine, says it like this: “So powerful and ancient are viruses, that I would summarize their role in life as ‘Ex Virus Omnia’ (from virus everything).”

Viruses are powerful, ancient, and vital to our existence, but they are extremely simple constructions. They tend to be nothing more than a few pieces: a protein capsid, which is a simple protective shell; a protein called a polymerase, which carries out most of the functions related to replicating the viral genome; and a sequence of nucleotides (either RNA or DNA) that encodes the previously mentioned viral proteins. The image below shows one of the ways that these viral components can be assembled into a unified whole. Unlike a human genome, a viral genome can be thought of as a self-contained model of the entire viral form. Within its RNA or DNA, a virus contains all the instructions necessary to create an entirely new body for itself and to replicate those same instructions. The simplicity and self-contained nature of viruses make them phenomenal tools for biological engineering and medicine.

Viruses are so simple that they don’t always need their own body to survive. They have circadian rhythms like all living things. We experience these rhythms through cycles of sleep and wakefulness, whereas viral rhythms occur as periods of dormancy between rounds of infection. Viruses don’t technically have a body during their dormant phase; they are nothing more than a string of letters in the book of the genome. But as soon as something disturbs their sleep (like a mutation or a new virus invading the host), viruses can awaken and rebuild their physical bodies from a purely genetic form. When the wrong (or right, depending on your perspective) protein manages to leak out of a dormant viral gene, it is as if the virus is suddenly awake again. A new physical body means that it has all the tools necessary to replicate.

Even beyond these rhythmic cycles, certain kinds of viruses don’t need a physical form at all. These disembodied viruses are called transposable elements, or transposons. True viruses have a body made from proteins, but transposons are mobile genetic elements: sequences of DNA that physically move in and out of genomes. For this reason, they are often referred to as “jumping genes.” Transposons do very much the same thing as true viruses, i.e. they copy and paste themselves throughout genomes. They are so similar to true viruses that some endogenous retroviruses (ERVs) are themselves transposons. As stated above, 8% of the human genome is made up of ERVs, but nearly 50% of the human genome is made of transposons! Humans are basically just big piles of viral-like sequences.

Transposons have a disturbing capacity to disrupt important genes by inserting themselves into the DNA sequences. It’s like if a series of words in a book could physically move around from page to page — these words would have a high likelihood of jumping into the middle of a sentence, thereby making it nonsensical. Amazingly, transposons preferentially insert themselves into important and functional genes — as if those jumping words wanted to disrupt the most interesting parts of the book rather than the index or bibliography. This is a powerful evolutionary strategy, since transposons are much more likely to get “read” by a cell if they jump into the middle of an important (and therefore, active) gene.
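
The "jumping words" picture can be imitated with a toy string model. This sketch is only an analogy in code: the sequences are invented, the function name `copy_and_paste` is ours, and real transposition involves enzymes and target-site details that are ignored here.

```python
def copy_and_paste(genome: str, element: str, insert_at: int) -> str:
    """Insert a second copy of `element` into `genome` at position `insert_at`.

    The original copy stays where it is, mimicking "copy and paste"
    transposition in a very simplified way.
    """
    if element not in genome:
        raise ValueError("element must already be present in the genome")
    if not 0 <= insert_at <= len(genome):
        raise ValueError("insertion point lies outside the genome")
    return genome[:insert_at] + element + genome[insert_at:]


# Made-up toy sequences.
TRANSPOSON = "TGCA"
GENE = "CCCCGGGG"
genome = "AAAA" + TRANSPOSON + GENE   # the gene occupies positions 8..16

# The element lands in the middle of the gene (position 12).
mutated = copy_and_paste(genome, TRANSPOSON, 12)

print(mutated)
print("gene still intact:", GENE in mutated)
print("copies of the element:", mutated.count(TRANSPOSON))
```

After the insertion the gene's sequence is no longer contiguous, which is the string-level equivalent of a sentence made nonsensical by words jumping into it.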

Transposons can very easily mess up important genes that we need to survive, so it has been theorized that epigenetic mechanisms evolved to stop transposons from moving around the genome. Furthermore, since transposons can rapidly alter DNA sequences, they are thought to play a major role in the processes of evolution and speciation (how a species evolves into a new form). In plants, transposons become highly active in response to stressful conditions, and this could act as a rapid source of short-term mutation when the environment starts pressuring you to survive or die. In addition, an animal’s genome changes when they are domesticated (like going from a wolf to a dog, or from an aurochs to a cow), and a majority of these changes occur in transposon sequences. No one is really sure why or how this happens, but it is clear that viruses play a very important role in rapid genetic change.

A biological virus (whether it is a true virus, an endogenous retrovirus, or a transposon) can literally lie dormant in a Word document as a string of As, Ts, Cs, and Gs. In other words, viruses can exist independently of genetics, solely in the symbolic dimension of evolution. A virus is nothing more than an idea until it finds a host within which it can replicate itself. Despite their ephemerality, viral sequences are clearly important for our lives as humans. After all, they compose nearly half of our genome and seem to play an important role in our long-term evolution.

In many ways, viruses are eerily reminiscent of the idea of ancient spells, which sit quietly as words in a book until someone utters the mystical syllables and unleashes the magic contained within. Perhaps due to the mysticism of this concept, many scientists and philosophers have a hard time accepting viruses as living things. But, whether or not you classify viruses as living entities, they certainly show us that the line between living things and pure information is a lot fuzzier than we often think…

Copyright © 2019 by Ben L. Callif. Used by permission of S. Woodhouse Books, an imprint of Everything Goes Media. All rights reserved.


Reporter Genes Reveal When and Where a Gene Is Expressed

Clues to gene function can often be obtained by examining when and where a gene is expressed in the cell or in the whole organism. Determining the pattern and timing of gene expression can be accomplished by replacing the coding portion of the gene under study with a reporter gene. In most cases, the expression of the reporter gene is then monitored by tracking the fluorescence or enzymatic activity of its protein product.

As discussed in detail in Chapter 7, gene expression is controlled by regulatory DNA sequences, located upstream or downstream of the coding region, which are not generally transcribed. These regulatory sequences, which control which cells will express a gene and under what conditions, can also be made to drive the expression of a reporter gene. One simply replaces the target gene's coding sequence with that of the reporter gene, and introduces these recombinant DNA molecules into cells. The level, timing, and cell specificity of reporter protein production reflect the action of the regulatory sequences that belong to the original gene (Figure 8-61).
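
The swap described above can be modelled as a simple data transformation. The sketch below is an illustration only: the `Gene` class, the field names, and the short sequences are hypothetical and do not correspond to any real construct or library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gene:
    """A gene reduced to two parts: regulatory DNA and a coding sequence."""
    regulatory: str
    coding: str


def make_reporter_construct(target: Gene, reporter_coding: str) -> Gene:
    """Keep the target gene's regulatory DNA but swap in the reporter's coding sequence."""
    return replace(target, coding=reporter_coding)


# Made-up placeholder sequences.
gene_x = Gene(regulatory="TATAAAGGC", coding="ATGGCCTGA")
reporter_y_coding = "ATGAAATAA"

construct = make_reporter_construct(gene_x, reporter_y_coding)
print(construct)
```

Because the regulatory part is unchanged, whatever controls when and where protein X would be made now controls the reporter instead, which is why the reporter's signal reflects the original gene's expression pattern.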

Figure 8-61

Using a reporter protein to determine the pattern of a gene's expression. (A) In this example the coding sequence for protein X is replaced by the coding sequence for protein Y. (B) Various fragments of DNA containing candidate regulatory sequences are …

Several other techniques, discussed previously, can also be used to determine the expression pattern of a gene. Hybridization techniques such as Northern analysis (see Figure 8-27) and in situ hybridization for RNA detection (see Figure 8-29) can reveal when genes are transcribed and in which tissue, and how much mRNA they produce.

Human Genome Is Much More Than Just Genes

The human genome, the sum total of hereditary information in a person, contains a lot more than the protein-coding genes teenagers learn about in school, a massive international project has found. When researchers decided to sequence the human genome in the late 1990s, they were focused on finding those traditional genes so as to identify all the proteins necessary for life. Each gene was thought to be a discrete piece of DNA; the order of its DNA bases (the well-known "letter" molecules that are the building blocks of DNA) was thought to code for a particular protein. But scientists deciphering the human genome found, to their surprise, that these protein-coding genes took up less than 3% of the genome. In between were billions of other bases that seemed to have no purpose.

Now a U.S.-funded project, called the Encyclopedia of DNA Elements (ENCODE), has found that many of these bases do, nevertheless, play a role in human biology: They help determine when a gene is turned on or off, for example. This regulation is what makes one cell a kidney cell, for instance, and another a brain cell. "There's a lot more to the genome than genes," says Mark Gerstein, a bioinformatician at Yale University.

The insights from this project are helping researchers understand the links between genetics and disease. "We are informing disease studies in a way that would be very hard to do otherwise," says Ewan Birney, a bioinformatician at the European Bioinformatics Institute in Hinxton, U.K., who led the ENCODE analysis.

As part of ENCODE, 32 institutions did computer analyses, biochemical tests, and sequencing studies on 147 cell types—six fairly extensively—to find out what each of the genome's 3 billion bases does. About 80% of the genome is biochemically active, ENCODE's 442 researchers report today in Nature. Some of these DNA bases serve as landing spots for proteins that influence gene activity. Others are converted into strands of RNA that perform functions themselves, such as gene regulation. (RNA is typically thought of as the intermediary messenger molecule that helps make proteins, but ENCODE showed that much of RNA is an end product and is not used to make proteins.) And many bases are simply places where chemical modifications serve to silence stretches of our chromosomes.

ENCODE's results are changing how scientists think about genes. It found that about 76% of the genome's DNA is transcribed into RNA of one sort or another, far more than researchers had originally expected. That DNA includes slightly fewer than 21,000 protein-coding genes (some researchers once estimated we had more than 100,000 such genes); "genes" for 8800 small RNA molecules; "genes" for 9600 long noncoding RNA molecules, each of which is at least 200 bases long; and 11,224 stretches of DNA that are classified as pseudogenes, "dead" genes now known to be active in some cell types or individuals. In addition, efforts to define the beginning, end, and coding regions of these genes revealed that genes can overlap and have multiple beginnings and ends.
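To put the percentages quoted in this article on a common scale, the short sketch below converts them into approximate base counts, using the round figure of 3 billion bases given above. The function name `bases_for_percent` is a hypothetical helper, and the protein-coding entry is an upper bound, since the article says "less than 3%".

```python
GENOME_BASES = 3_000_000_000  # round figure quoted in the article


def bases_for_percent(percent: int, genome_bases: int = GENOME_BASES) -> int:
    """Convert a whole-number percentage of the genome into a base count."""
    return genome_bases * percent // 100


# Percentages as reported in the article; the first is an upper bound.
figures = {
    "protein-coding genes (at most)": 3,
    "transcribed into RNA": 76,
    "biochemically active": 80,
}

for label, percent in figures.items():
    print(f"{label}: about {bases_for_percent(percent):,} bases")
```

The categories overlap (transcribed DNA is part of the biochemically active fraction), so the counts should not be added together.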

The project uncovered 4 million spots in our DNA that act as switches controlling gene activity. Those switches can be both near and far from the gene they regulate and act in different combinations in different cell types to give each cell type a unique genomic identity. In addition, at least some of the RNA strands produced by the genome also help to control how much protein results from a particular gene's activity. Thus, the regulation of a gene is proving much more complex than expected.

These and other findings appear today in six papers in Nature, and 24 in Genome Research and Genome Biology. Two additional papers are published today on Science online. In a database, ENCODE has created a map showing the roles of all the different bases. "It's like Google Maps for the human genome," says Elise Feingold, a program director for the National Human Genome Research Institute in Bethesda, Maryland, which funded ENCODE. With Google Maps one can choose various views to see different aspects of the landscape. Likewise, in the ENCODE map, one can zoom in from the chromosome level to the individual bases and switch from looking at whether those bases yield RNA or are places where DNA-regulatory proteins bind, for example.

This catalog "will change the way people think about and actually use the human genome," says John A. Stamatoyannopoulos, an ENCODE researcher at the University of Washington, Seattle.

Already he and others are harnessing this information—much of which is already publicly available—to learn about genetic influences on disease. Many large-scale studies have linked specific base changes to higher or lower risks for disorders ranging from diabetes to arthritis. Now researchers can look to see whether those variants are involved in regulation of some sort and if so, what genes are being regulated. For his study of cancer and epigenetics, "ENCODE data were fundamental," says Mathieu Lupien, a molecular biologist from the University of Toronto in Canada who was not associated with ENCODE.

9.22 | Genomics and Proteomics

Proteins are the final products of genes and carry out the functions that the genes encode. Proteins are composed of amino acids and play important roles in the cell. All enzymes (except ribozymes) are proteins that act as catalysts to affect the rate of reactions. Proteins are also regulatory molecules, and some are hormones. Transport proteins, such as hemoglobin, help transport oxygen to various organs. Antibodies that defend against foreign particles are also proteins. In the diseased state, protein function can be impaired because of changes at the genetic level or because of direct impact on a specific protein.

A proteome is the entire set of proteins produced by a cell type. Proteomes can be studied using the knowledge of genomes because genes code for mRNAs, and the mRNAs encode proteins. Although mRNA analysis is a step in the right direction, not all mRNAs are translated into proteins. The study of the function of proteomes is called proteomics. Proteomics complements genomics and is useful when scientists want to test hypotheses that were based on genes. Even though all cells of a multicellular organism have the same set of genes, the set of proteins produced in different tissues is different and dependent on gene expression. Thus, the genome is constant, but the proteome varies and is dynamic within an organism. In addition, RNAs can be alternatively spliced (cut and pasted to create novel combinations and novel proteins), and many proteins are modified after translation by processes such as proteolytic cleavage, phosphorylation, glycosylation, and ubiquitination. There are also protein-protein interactions, which complicate the study of proteomes. Although the genome provides a blueprint, the final architecture depends on several factors that can change the progression of events that generate the proteome.
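The "cut and paste" idea behind alternative splicing can be illustrated with a deliberately simplified sketch. It assumes a cassette-exon model in which the first and last exons are always kept and each internal exon is either included or skipped. The function `cassette_isoforms` and the exon labels are hypothetical, and real cells produce only a regulated subset of such combinations.

```python
from itertools import combinations


def cassette_isoforms(exons):
    """List exon combinations that keep the first and last exon and
    any subset of the internal exons, preserving their original order."""
    if len(exons) < 2:
        return [tuple(exons)]
    first, *internal, last = exons
    isoforms = []
    for kept_count in range(len(internal) + 1):
        for kept in combinations(internal, kept_count):
            isoforms.append((first, *kept, last))
    return isoforms


# Hypothetical gene with four exons: two internal exons give 2**2 = 4 combinations.
for isoform in cassette_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
```

Even in this toy model, one gene yields several possible mRNAs, which is one reason the proteome can be larger and more variable than the gene count alone suggests.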

Metabolomics is related to genomics and proteomics. Metabolomics involves the study of small molecule metabolites found in an organism. The metabolome is the complete set of metabolites that are related to the genetic makeup of an organism. Metabolomics offers an opportunity to compare genetic makeup and physical characteristics, as well as genetic makeup and environmental factors. The goal of metabolome research is to identify, quantify, and catalogue all of the metabolites that are found in the tissues and fluids of living organisms.

Cancer Proteomics

Genomes and proteomes of patients suffering from specific diseases are being studied to understand the genetic basis of the disease. The most prominent disease being studied with proteomic approaches is cancer. Proteomic approaches are being used to improve screening and early detection of cancer; this is achieved by identifying proteins whose expression is affected by the disease process. An individual protein is called a biomarker, whereas a set of proteins with altered expression levels is called a protein signature. For a biomarker or protein signature to be useful as a candidate for early screening and detection of a cancer, it must be secreted in body fluids, such as sweat, blood, or urine, so that large-scale screenings can be performed in a non-invasive fashion. The current problem with using biomarkers for the early detection of cancer is the high rate of false-negative results. A false negative is a negative test result that should have been positive. In other words, many cases of cancer go undetected, which makes biomarkers unreliable. Some examples of protein biomarkers used in cancer detection are CA-125 for ovarian cancer and PSA for prostate cancer. Protein signatures may be more reliable than biomarkers to detect cancer cells. Proteomics is also being used to develop individualized treatment plans, which involves the prediction of whether or not an individual will respond to specific drugs and the side effects that the individual may experience. Proteomics is also being used to predict the possibility of disease recurrence.
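The false-negative problem mentioned above can be made concrete with the standard definition: the false-negative rate is the share of true cancer cases that a test misses, FN / (FN + TP), and sensitivity is its complement. The sketch below uses made-up counts purely for illustration; they are not data for CA-125, PSA, or any real test.

```python
def false_negative_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual disease cases that the test reports as negative."""
    diseased = true_positives + false_negatives
    if diseased == 0:
        raise ValueError("no diseased cases to evaluate")
    return false_negatives / diseased


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual disease cases that the test correctly detects."""
    return 1.0 - false_negative_rate(true_positives, false_negatives)


# Hypothetical screening result: 100 patients with cancer, 40 of them missed.
tp, fn = 60, 40
print(f"false-negative rate: {false_negative_rate(tp, fn):.0%}")
print(f"sensitivity: {sensitivity(tp, fn):.0%}")
```

A high false-negative rate means low sensitivity, which is why a single biomarker may be a weak screening tool on its own.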

The National Cancer Institute has developed programs to improve the detection and treatment of cancer. The Clinical Proteomic Technologies for Cancer and the Early Detection Research Network are efforts to identify protein signatures specific to different types of cancers. The Biomedical Proteomics Program is designed to identify protein signatures and design effective therapies for cancer patients.