Decoding ENCODE

Dennis Venema
On September 24, 2012

Fuzzy, but useful

One of the challenges for my students learning biology is summed up in one of my favorite sayings (that I’m sure some students are tired of hearing from me): “All the good concepts are fuzzy.” Take a basic concept like “living” versus “non-living,” for example. Obviously this is a fundamental concept for a biologist, since “biology” means the study of living things. Even here, though, we find that a precise definition of what is “alive” is a hard thing to nail down. While things like humans, dogs and cats obviously qualify (though some days with early lectures I might have my doubts for humans), there are other entities out there that blur the boundary between life and non-life. Viruses, for example, have many of the features of living things, but lack some others. Transposons are less life-like even than viruses, and there are even transposon-like entities that parasitize viruses. Life and non-life are useful concepts, but the precise boundary between them is fuzzy.

More technology = greater fuzz

Often, an increase in technological ability exacerbates the “fuzziness” issue. One example in genetics (that we will later see to be highly relevant to understanding the results of ENCODE) is the concept of “dominant” versus “recessive” for different versions of a given gene. If you recall anything at all about genetics from high school, you might remember learning about Gregor Mendel crossing pea plants that differed in certain characteristics (purple versus white flowers, for example). Mendel deduced that the “particles” that controlled a certain trait (what we would later call “genes”) came in pairs, and that the presence of one type of particle (e.g. the one for purple flowers) could mask the presence of another (in this case the one for white flowers). He deduced that one gene version (what we now call an allele) was dominant over the other one, which in turn was recessive. For Mendel, one determined a dominant / recessive relationship by examining the appearance of a plant with both alleles: whichever allele determined the appearance was the dominant one.

Advances in technology would later do two things to Mendel’s model. First, they would provide deeper insights into what was actually going on at the biochemical level. Second, those deeper insights would cause the concept of “dominant” or “recessive” to become fuzzier. I’ll illustrate what I mean with a (hypothetical, but representative) example.

When Mendel did his work he was limited to what he could observe with the naked eye. Now we have the ability to examine the effects of alleles at much deeper levels than Mendel could. Let’s say, for the sake of the discussion, that the gene Mendel was working with made an enzyme that produced purple pigment. The “purple” allele of the gene (let’s represent it with the symbol “P”) made a fully functioning enzyme: its DNA is copied into mRNA, and that mRNA is used to code for the protein enzyme that does the work of making pigment. The “white” allele (let’s call it “p”), on the other hand, turns out to have a mutation in the protein coding portion of the gene. This single mutation has two effects: it stops translation early, resulting in a protein that is too short and cannot work as an enzyme. The mutation also has an effect on the stability of the mRNA: the mRNA produced by the white allele degrades more readily, resulting in a lower steady-state amount of the mRNA in the cell.

With this background in mind, suppose a scientist performs a series of different tests on a plant that has one purple allele and one white allele (i.e. is “Pp”):

If the scientist looks at the flower color of the Pp plant, she would conclude (as did Mendel) that the p allele is recessive to the P allele, since the Pp plant is as purple as a plant with two purple alleles (PP). This arises because one P allele can produce enough enzyme for complete flower pigmentation.

If the scientist compares the amount of mRNA for this gene between PP, Pp and pp plants, she would notice three different outcomes. PP would have the most, Pp would have less, and pp would have the least. For this test the Pp plant is intermediate between the PP and pp plants. The scientist would conclude that neither the P nor p allele is completely dominant over / recessive to the other (an effect known as “incomplete dominance”).

If the scientist did a test to compare the physical size of the protein enzyme in PP, Pp and pp plants, she would again notice three outcomes. PP plants would have only full-sized enzymes, pp plants would have only small enzyme fragments, and Pp plants would have both sizes, full-sized and small. In this case, the Pp plant shows both character traits (full-sized and small) at the same time. The scientist would conclude that the P and p alleles are both dominant, since both alleles display their version of the trait with neither masking the other in any way (an effect known as “co-dominance”).

So, is the P allele dominant, incompletely dominant, or co-dominant with respect to the p allele? The answer is “yes” – all three apply, but it depends specifically on the details that the new technology is revealing. Which answer is the most meaningful one? Well, it depends on the specific question the researcher is asking. Now that we have the ability to sequence DNA, we can directly observe the nature of all alleles in any given organism, and the presence of other alleles does not interfere with this observation. In effect, modern molecular biology has made all alleles “co-dominant” since all alleles display their “version of the trait” (i.e. their sequence) when they are sequenced. If one was so inclined, one could argue that “recessiveness” is an outdated concept, and that eventually we will determine through sequencing technology that all alleles are co-dominant. While this would be technically true, it would be very misleading. The p allele remains “recessive” in biologically meaningful ways: it is a loss of an enzyme function, and its complete loss has an effect on the appearance of the organism. Plants that have one of each allele (Pp) have the same enzyme content as PP plants. Anyone who would argue that “recessiveness” was no longer a feature of alleles in light of the new sequencing technology would have to address these issues in a meaningful way, since the evidence for “recessiveness” did not simply evaporate when we learned how to sequence genomes. By any measure, Mendel’s ideas of dominance and recessiveness are still useful concepts.
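For readers who find it easier to see this kind of reasoning spelled out, here is a minimal Python sketch of the three tests. Everything in it is an illustrative assumption: the `classify` helper is hypothetical, and the mRNA values are made-up placeholders chosen only so that PP has the most, Pp less, and pp the least. It is a toy model of the hypothetical example above, not an analysis of real data.

```python
def classify(homozygous_P, heterozygote, homozygous_p):
    """Label the P/p relationship for one assay from the PP, Pp and pp readouts."""
    if heterozygote == homozygous_P:
        return "dominant"  # Pp is indistinguishable from PP
    if heterozygote == homozygous_p:
        return "recessive"  # Pp is indistinguishable from pp
    # Set-valued readout: the heterozygote shows both parental forms at once.
    if isinstance(heterozygote, frozenset) and heterozygote == homozygous_P | homozygous_p:
        return "co-dominant"
    # Numeric readout: the heterozygote falls strictly between the two parents.
    if isinstance(heterozygote, (int, float)):
        low, high = sorted((homozygous_P, homozygous_p))
        if low < heterozygote < high:
            return "incompletely dominant"
    return "unclassified"


# Hypothetical readouts (PP, Pp, pp) for each assay; the numbers are placeholders.
assays = {
    "flower color": ("purple", "purple", "white"),
    "mRNA level": (1.0, 0.6, 0.2),
    "protein size": (
        frozenset({"full"}),
        frozenset({"full", "short"}),
        frozenset({"short"}),
    ),
}

results = {name: classify(*readouts) for name, readouts in assays.items()}
for name, label in results.items():
    print(f"{name}: P is {label}")
```

The same pair of alleles receives three different labels depending on which assay is used, which is the point of the example: the label belongs to the combination of alleles and test, not to the alleles alone.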

The relevance to ENCODE

So, how does this all relate to the ENCODE project? It hinges on another very useful, and therefore fuzzy term: “function.” Like “life” and “dominant”, “function” is a useful idea in biology, but much hinges on precisely how it is defined, and the technology used to assess its presence or absence.

The ENCODE definition of “function” is a useful one for the purposes of the large undertaking that this project represents. Specifically, ENCODE was looking for biochemical activity in the genome: the interaction of chromatin proteins with DNA, regions of DNA that are made into RNA, and so on. This is all well and good, for we now have new tools available that allow us to test for these effects – we have new technology that can shed new light on what is going on in the genome.

What these results don’t do, however, is cause the prior lines of evidence relating to non-functional DNA to suddenly disappear. As we saw with the dominance issue, the results from new techniques will need to be integrated into a more complete understanding of the data. We must also have a wider understanding of the strengths and weaknesses of various techniques to answer certain specific kinds of questions.

As a way to illustrate these issues for the ENCODE project, let’s consider the hypothetical example we used to explore the dominance issue. The ENCODE definition of “function” includes any detectable biological activity such as the presence of an mRNA transcript. In our example, the “P” allele (which produces a working protein enzyme) and the “p” allele (which does not) both produce an mRNA transcript. As such, the ENCODE project would identify both alleles as equally functional. In fact, the ENCODE definition of “detectable biological activity” as “function” would not be able to distinguish between these two alleles in any meaningful way, despite the fact that they have real, biological, and obviously functional differences. This is not to criticize the working definition of function adopted by ENCODE, but merely to demonstrate that this definition, while useful in some contexts, has limitations.

These limitations should stand as a caution to any group that wishes to adopt the ENCODE definition as the only viable definition of biological function. To consider our example again, I suspect that many of those opposed to evolution would bristle at the suggestion that the p allele was equally functional to the P allele, given that it represents a clear loss of function in keeping with common Young Earth Creationist, Old Earth Creationist, and Intelligent Design definitions of loss-of-function alleles, and the propensity of these groups to insist that such mutations destroy functional information. Yet what we have seen from these groups, by and large, is a robust embrace of ENCODE and its view of function. I suspect that these groups, in their excitement over the media frenzy declaring the idea of “junk DNA” to be dead, have not yet had time to carefully think through the implications of that embrace.


So far, I have introduced “function” as a particularly useful concept in biology, but also cautioned that, like all good concepts, it has “fuzzy” edges. Indeed, it has lots of similarly “fuzzy” peers in the language of biology: for example, asking a molecular biologist “What is a gene?” or asking an ecologist “What is a species?” is not advisable unless you have an hour or more to devote to the conversation. A discussion of biological “function” could generate a similar conversation.

For most biologists, something biological has function if it contributes to the characteristics of an organism in such a way as to favor its reproduction (usually by favoring its survival). Conversely, for a biologist to claim that some feature of an organism is non-functional, they are claiming that this feature does not contribute to or favor survival or reproduction. To return to our historically-interesting example from yesterday, the wild-type allele of the enzyme responsible for making purple pigment in pea flowers (the “P” allele) is functional since it has an observable effect on the characteristics of the organism that favors its reproduction (attracting pollinators, perhaps). When a mutation arose in this gene to prematurely terminate the synthesis of the protein enzyme, the recessive “p” allele resulted. Biologists would not hesitate to label this allele as a loss-of-function allele, because the function they have in mind is that of making purple pigment. The fact that this allele still produces an mRNA and even a partial protein product would not faze them in the least, since the known biological function of the gene has been disrupted. On the other hand, the ENCODE definition of “function” as “any detectable biological activity” presents things differently: by that standard we would not be able to discern any difference between these two alleles, despite the evidence that one is functional (in the sense above) and the other is not.
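The contrast between the two definitions can be sketched in a few lines of Python. This is a deliberately simplified toy, not ENCODE's actual criteria or pipeline: the `Allele` class, its two fields, and both predicate functions are hypothetical names invented for this illustration, and "detectable activity" is reduced here to a single yes/no for an mRNA transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allele:
    name: str
    makes_mrna: bool            # a transcript is detectable
    makes_working_enzyme: bool  # the protein product can make purple pigment


def functional_by_activity(allele: Allele) -> bool:
    """Activity-based reading: any detectable biochemical activity counts."""
    return allele.makes_mrna


def functional_by_contribution(allele: Allele) -> bool:
    """Stricter reading: the allele must contribute to the organism's traits."""
    return allele.makes_working_enzyme


# The two alleles from the pea flower example.
P = Allele("P", makes_mrna=True, makes_working_enzyme=True)
p = Allele("p", makes_mrna=True, makes_working_enzyme=False)

for allele in (P, p):
    print(
        allele.name,
        "| activity-based:", functional_by_activity(allele),
        "| contribution-based:", functional_by_contribution(allele),
    )
```

Under the activity-based check the two alleles come out identical, while the contribution-based check separates them. Neither check is wrong; they simply answer different questions.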

What this means is that the ENCODE definition of “function” is specific to a context: detecting (any) biochemical activity for a segment of DNA in the genome. As I mentioned in my first post, looking for biochemical activity is a useful and interesting undertaking, and the ENCODE project is impressive in its scope. What it does not do, however, is define “function” in the usual biological sense we have just discussed: that of meaningful contribution to survival and reproduction. In fact, biologists would expect that many DNA sequences that are non-functional in the traditional sense would be detected as “functional” using the ENCODE definition. One such example is that of transposon-derived sequences, which make up nearly 50% of the human genome.

Transposons, and ENCODE

We previously examined transposons in our series on “Junk DNA.” In brief, these are parasitic DNA sequences that serve to replicate themselves and spread within genomes. They have sequences that act to recruit host enzymes for making mRNA and a protein enzyme that acts to copy and/or move the transposon to a new chromosome location. These entities are veritable beehives of biochemical activity, but biologists consider them non-functional (with respect to their hosts) even if they are highly functional (with respect to the transposon). In many cases, however, transposon sequences in mammals are defective: they have picked up mutations such that they no longer make the enzyme they need for movement, or perhaps the mutation ruined one of the DNA sites the enzyme binds to. As before, these sequences are non-functional with respect to their mammalian host (they make no contribution to the host organism at all), and they are non-functional even to themselves (since the transposon cannot replicate any longer). Even such doubly non-functional sequences, however, will retain detectable biochemical activity. Host DNA-binding proteins will still bind to these sequences, mRNA may be produced, and even the transposon enzyme might be partially made as a non-functional protein. These biochemical activities may persist for thousands of generations before additional mutations silence them, so these sequences would still be identified as “functional” according to the ENCODE criteria. Since almost half of the human genome is made up of such repetitive sequences, it’s not surprising that ENCODE found so much “function.” Yes, these sequences have detectable biochemical activity, but that is to be expected, given what we know about transposons. Nor does such activity demonstrate that these sequences are functional in the stricter sense. Indeed, lines of evidence from comparative genomics strongly suggest they are not.

Consider the onion

One such line of evidence is that closely related species can vary widely in the amount of DNA they contain, yet have the same number of genes. For example, some species in the genus Allium (onion, garlic and related plants) can have over five times as much DNA as other species within the same group. The difference is largely in repetitive DNA sequences, such as transposons and transposon fragments. Such observations are challenging to square with the hypothesis that the species with the larger amounts require all of it for function in the strict sense, since the species in the group are all almost exactly the same structurally. If Onion Species B has five times as much DNA as Onion Species A, it does not mean that all of it is necessary to build the body form of Species B. No, the developmental process for building Species B involves laying down the very same structures that we find in Species A, with only slight modifications. So even if all of the “extra” DNA in Species B is doing something biochemically, it doesn’t mean that it is all necessary to build or maintain the body form. Furthermore, we might notice that the onion has over five times as much DNA as humans. Do we really think that it takes five times more functionally necessary DNA to build an onion than it does to make a human being? No. Much of the extra DNA, put simply, may be “functioning” in some way (i.e. biochemically active), but it is highly unlikely that it is functionally necessary. This observation led evolutionary and genome biologist T. Ryan Gregory to propose the “onion test” as a mental check against proposed universal functions for non-coding DNA (using “function” in the strict sense):

“The onion test is a simple reality check for anyone who thinks they have come up with a universal function for non-coding DNA. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?”

The “vitellogenin test”

Whereas the onion test of meaningful function is a broad look across the genomes of a group of related organisms, a complementary strategy is to examine specific cases of DNA that have been widely accepted as being non-functional (i.e. not necessary for the building and maintenance of the body). Indeed, if the argument against the very idea of “non-functional DNA” is to be convincing to most biologists, it needs to address cases where the accumulated evidence for the standard definition of non-functionality is strong. So, with a tip of the hat to Gregory’s “onion test,” I’d also like to propose a test to be used for the claim that “junk DNA” has been shown to be non-existent. Simply put, the test asks: does the claim address the features we observe in the human Vitellogenin 1 pseudogene?

Since this is a pseudogene that may already be familiar to readers from my previous discussions of “junk DNA,” it will serve as a useful example to explore further. For those who have not yet encountered this example, however, I will summarize its relevant features before going on to re-evaluate it in light of ENCODE.

In egg-laying animals, including some mammals like the platypus, the Vitellogenin 1 (Vit 1) gene produces a protein that is used in the formation of egg yolk. Yolk serves as a source of nutrients for the developing embryo once it is cut off from the maternal supply when the eggshell is formed. Placental mammals, like humans, retain a link to their mothers throughout their embryonic development through the placenta, and therefore do not need egg yolk in the same way that egg-laying organisms do.

Several years ago, a group of researchers went looking for remains of Vit 1 gene sequences in humans and other mammals. According to evolutionary theory, all mammals are the descendants of egg-laying ancestors – meaning that, if traced back far enough, placental mammals and modern egg-laying organisms such as birds once were the same ancestral species, with a common genome. Working with this knowledge, the researchers located the Vit 1 gene sequence in chickens, and took note of the sequences on either side of it (for convenience we’ll call them “Gene A” and “Gene B”). They then located these sequences in the human genome, where they also sit side-by-side. Examining the sequence between Gene A and Gene B in the human genome revealed that the mutated remains of the Vit 1 gene were still present in the human genome, in the exact spot that an expectation of common ancestry (in this case, conservation of genome structure, or shared synteny) would predict.

Also, when comparing the Vit 1 pseudogene between various placental mammals, we observe that several of the inactivating mutations (deletions) are common to all, indicating that they occurred in the last common ancestor of these species, and were subsequently inherited.

To sum up, what we observe in the mammalian Vit 1 pseudogenes is as follows:

  1. The function of the Vit 1 gene in egg-laying organisms is well known and well understood.
  2. Placental mammals, including humans, do not require a functional Vit 1 protein product, yet have a Vit 1 sequence that cannot, due to many mutations, perform its known function as a protein involved in yolk formation. In other words, in placental mammals, the Vit 1 gene has suffered a loss of function that renders it a pseudogene.
  3. A Vit 1 pseudogene in placental mammals can be located using predictions based on shared synteny with egg-laying organisms such as chicken.
  4. Placental mammals, including humans, share a number of identical mutations within their Vit 1 pseudogene, indicating that these mutations happened once in a common ancestor, and were inherited from that common ancestor.
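
The logic of the synteny-based search in point 3 can be sketched as a short Python example. It is only a cartoon of the idea: a chromosome segment is reduced to an ordered list of labels, the `region_between` helper is a hypothetical name, and the gene orders shown are simplified stand-ins using the “Gene A” and “Gene B” labels from above. Real analyses compare DNA sequences rather than lists of names.

```python
def region_between(gene_order, flank_1, flank_2):
    """Return the entries lying between two flanking genes.

    Returns None if either flank is missing from gene_order.
    The flanks may appear in either order along the segment.
    """
    try:
        i = gene_order.index(flank_1)
        j = gene_order.index(flank_2)
    except ValueError:
        return None
    if i > j:
        i, j = j, i
    return gene_order[i + 1 : j]


# Hypothetical, highly simplified gene orders for the same chromosome region.
chicken_segment = ["Gene A", "Vit 1", "Gene B"]
human_segment = ["Gene A", "Vit 1 pseudogene remnant", "Gene B"]

# Step 1: note what flanks Vit 1 in the chicken genome.
position = chicken_segment.index("Vit 1")
left_flank = chicken_segment[position - 1]
right_flank = chicken_segment[position + 1]

# Step 2: look between the same two flanks in the human genome.
candidate = region_between(human_segment, left_flank, right_flank)
print("Between", left_flank, "and", right_flank, "in human:", candidate)
```

The prediction is made before looking: the flanking genes tell the researcher where to search, and the search either finds remnants of the gene at that spot or it does not.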

Taken together, these lines of evidence strongly support the conclusion that the Vit 1 gene we observe in placental mammals is non-functional in the strict sense – that it does not contribute to reproduction or survival. The possibility that the Vit 1 sequence in placental mammals might retain some residual biochemical activity (it once was a functional gene, after all) would not change these lines of evidence or the conclusions drawn from them. Moreover, the (however slight) possibility that certain parts of any given pseudogene might have gained an important new function – a process called exaptation that we have discussed previously – does not affect the conclusions drawn from the whole study of Vit 1 as to its origins as a previously functional but now non-functional gene.

Taking the test

Though I have presented much of this evidence about the Vit 1 pseudogene here on BioLogos in the past, I am not yet aware of any other science/faith organization that has addressed this evidence. Web searches for terms such as “junk DNA” or “pseudogene” at various such sites produce a significant number of articles addressing the topic, and all sites examined had at least one page addressing the human GULO / GLO pseudogene as a specific example. Similar searches, including searches for the more generic term “yolk,” failed to reveal any discussion of the Vit 1 pseudogene on any of the websites examined. I would invite these groups, all of whom have recently posted on the ENCODE project to suggest that “junk DNA” is no longer a tenable idea, to “take the test” and offer an explanation for the features we observe in the human Vitellogenin 1 pseudogene.




About the Author

Dennis Venema

Dennis Venema is professor of biology at Trinity Western University in Langley, British Columbia. He holds a B.Sc. (with Honors) from the University of British Columbia (1996), and received his Ph.D. from the University of British Columbia in 2003. His research is focused on the genetics of pattern formation and signaling using the common fruit fly Drosophila melanogaster as a model organism. Dennis is a gifted thinker and writer on matters of science and faith, but also an award-winning biology teacher—he won the 2008 College Biology Teaching Award from the National Association of Biology Teachers. He and his family enjoy numerous outdoor activities that the Canadian Pacific coast region has to offer.