
Dennis Venema
 on December 30, 2011

Is There “Junk” in Your Genome? Exploring Pseudogenes

On “junk DNA”—how genomics can be employed to test for non-functional sequences by comparing sequences between related organisms.



One of the challenges for discussing evolution within evangelical Christian circles is that there is widespread confusion about how evolution actually works. In this installment on “junk DNA”, we explore how genomics can be employed to test for non-functional sequences by comparing sequences between related organisms.

Do genomes have non-functional sequences?

There are various ways to test the hypothesis that certain regions of DNA are non-functional, and in this series we will explore some of them. One way to estimate the fraction of non-functional DNA in a particular genome is to determine which portions of the genome can be freely altered by mutation without consequence to the organism. DNA sequences that cannot be mutated freely without a loss in function are said to be under purifying selection: as mutated forms of this sequence arise in a population, the loss of function associated with the mutated sequence reduces the likelihood that the organism will pass this mutation on to future generations. This type of mutation, in a functional sequence, has deleterious consequences. Another way to put it is that functional sequences are subject to natural selection, which acts as a filter to “purify” the genome at a particular location, but that non-functional sequences are free from the constraints of selection, and “anything goes” with respect to mutation.

Tell us again, Grandpa!

One way to think about this is to consider a humorous story that is told within an extended family (I think every family has these types of stories – I know my kids love to hear certain ones told and retold). Certain incidental details of the story can be altered from telling to telling, and perhaps Uncle Joe tells it a certain way but Uncle Jeff tells it another with respect to those types of details. There are, however, certain features of the story that are absolutely non-negotiable, or the story doesn’t “work” (and telling these parts incorrectly will generate protests and corrections from the kids who know how the story goes and insist that you are not telling it correctly). These types of stories, like genomes, have some bits that can freely change and others that can’t. The bits that can’t change are under constraint and, in biological terms, subject to selection. The same factors apply in more concise form to jokes: some bits can change (and do, as the joke is told and retold) – but some bits cannot (for example, the punch line).

The best way to test for purifying selection is to compare the genomes of related organisms that have been separate species for some time. (To continue our analogy, you could determine what parts of the story are really important by comparing how each of the uncles tells it and listing out the parts that are the same in all the various versions). The genomes in the two species are modified versions of the same genome present in the common ancestor species: they started as virtually identical but have since experienced mutations in different locations over time. Mutations in functional sequences will have been subject to purifying selection to remove loss-of-function mutations, whereas mutations will have freely accumulated in non-functional sequences. The two genomes are thus a collection of similarities and differences, as we have discussed before:

In some ways, comparing the DNA sequence between related organisms is like reading alternative history novels. The hypothesis of common ancestry between similar organisms makes a very straightforward prediction about their genomes: it simply predicts that they were once the same genome, in the same ancestral species. This hypothesis also predicts that these two genomes, having gone their separate ways in the diverged species, will have accumulated changes once they separated. Like an alternative history, each genome has the same backstory, and then a history independent from the other after the point of separation.

These similarities and differences, however, will not be randomly distributed. Sequences subject to purifying selection will have fewer differences than sequences that can freely mutate. Accordingly, when compared side-by-side, the two genomes should have regions where differences are common, and where differences are rare. For example, consider a genome segment in two related species where there is one gene present. This gene has some regions that cannot be changed without significant consequences (the DNA letters that code for the amino acid sequence of the gene product, for example) and some regions that can be mutated without consequence (such as some sequences inside introns, the non-coding segments that separate gene coding segments and are spliced out of the final gene product):

Figure: Junk DNA

What biologists observe when comparing sequences like this between two related organisms is that coding sequences, which obviously are required for the gene’s function, have far fewer differences between them than do sequences found in introns or in between genes. The idea is not that mutations preferentially happen in introns or between genes. Rather, mutations can occur everywhere in the genome, but they are more likely to be selected out of populations if they alter functional sequences.
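To make the comparison concrete, here is a minimal sketch of counting differences per site in two aligned regions. The sequences are invented for illustration and are not real human or mouse data; the function name `difference_rate` is likewise just a label chosen for this example.

```python
def difference_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

# Hypothetical aligned segments from two related species (invented data).
regions = {
    "coding": ("ATGGCCAAGCTGACC", "ATGGCCAAGCTGACT"),
    "intron": ("GTAAGTCCTTAGGCA", "GTAAGACTTTCGGTA"),
}

rates = {name: difference_rate(a, b) for name, (a, b) in regions.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} differences per site")
```

In this toy case the coding segment differs at 1 of 15 positions and the intron segment at 4 of 15. Real analyses work on genome-scale alignments and correct for factors such as local mutation rate, which this sketch ignores.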

The expanding data set

This type of analysis gets easier to do the longer two species have been separated, and the more species one has to compare to each other. Very recently separated species will have a very high degree of genetic similarity simply because neither species has had appreciable mutations to a common ancestral genome. As such it is difficult to pick out the sequences that have been subject to selection, since functional and non-functional sequences are both still highly similar (virtually identical). It is only as species have been separated for a long time that a pattern begins to emerge: sequences that are functional remain “constrained” by purifying selection to remain more similar, and non-functional sequences accumulate mutations in the separate lineages that make them less and less alike.

Now that biologists have access to a wide range of mammalian genomes, this type of analysis has been done on the human genome with ever-increasing precision. Early studies comparing the human genome to other genomes, such as the mouse genome (compared in 2002) and dog genome (2005), suggested that only a small fraction of the human genome was subject to purifying selection (about 5%). Recent work published a few months ago has taken this approach to a whole new level: a genome-wide comparison of 29 mammalian species (!). These results are exciting from a biological perspective because this work helps scientists tease out what bits of the human genome are under selection, and what bits aren’t (which isn’t always obvious, because we don’t always know what sequences are functional or non-functional). This type of approach is non-biased: it requires no prior hypotheses of what types of sequences to look for, but rather simply looks for what has been selected to remain more similar over time. The results, based on the (very nearly) whole-genome sequences of 29 placental mammals, are in keeping with previous estimates: about 5-6% of the human genome is under purifying selection, and the rest appears to be rather free to accumulate changes. As a species, our genome seems to be about 95% incidental details and 5% punch line.
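The multi-species version of this idea can be sketched in a few lines: line up the same segment from several species and ask, column by column, how often the most common base appears. The five sequences below are invented, and published 29-mammal analyses use far more sophisticated statistical models than this simple tally, so treat it only as an illustration of the principle.

```python
from collections import Counter

# Hypothetical alignment of the same short segment in five species (invented data).
alignment = [
    "ACGAGTTC",
    "TCTAGCTA",
    "ACAAGTGC",
    "GCGAGATC",
    "ATGAGTCC",
]

def column_conservation(rows):
    """For each column, return the share of sequences carrying the most common base."""
    scores = []
    for column in zip(*rows):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

scores = column_conservation(alignment)
# Columns where every species carries the same base are candidates for constraint.
invariant = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print("invariant positions (1-based):", invariant)
```

Here only positions 4 and 5 are identical in every sequence. With so few species an invariant column could easily be chance, which is why adding more genomes sharpens the signal.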

So, what sorts of things lurk out there in the “other” 95%? In the next section, we’ll head out into the wilds of the human genome and have a look.

Editor’s Note: Now that you’ve read the essay, see if you can surmise the meaning of the figure at the top. This is a tiny stretch of DNA, 21 bases (units of code) long. Why do you think position #4 shows only an A and position #5 shows only a G, whereas other positions are not restricted in this manner? Pretend that you could represent the genome as a whole in this manner. Of the 3 billion bases in our genome, how many of them would be configured like position #4 or #5? What about the rest? Is the specific base (unit of code) functionally important for that set? Upon what do you base your conclusions? Finally, do the presuppositions of the Intelligent Design Movement and Reasons to Believe pivot on how to interpret this data? How would such proponents interpret the data differently than mainstream biologists?

As we saw in the last section, only a small fraction of the human genome appears to be subject to selection (on the order of 5-6%). The rest appears free to mutate without consequence to mammalian biology, which is good evidence that it performs no particular function. An additional line of evidence in favor of non-functionality in the human genome is the observation that a large fraction of our genetic material is made up of what are known as mobile genetic elements, or “transposons.” These little snippets of DNA are well known and well studied in many organisms, including humans. So, what are they, and what are they up to?

Along for the ride, but looking out for number 1

Non-biologists are usually somewhat taken aback when they learn about transposons. Transposons are small segments of DNA inserted into the genomes of many organisms that are little worlds unto themselves: they have a few genes that serve only to copy themselves and move themselves to new locations in a genome. That’s it! On the scale of biodiversity, transposons are less life-like even than viruses. They are the perfect parasites: using their host to provide resources so they can replicate themselves, and with a “lifestyle” so simple that replication is essentially its only feature. Their origins, like the origins of viruses, are somewhat of a mystery.

Despite their somewhat mysterious nature, transposon sequences make up a staggering 45% or more of our genome. That’s about 1.4 billion DNA base pairs of our genetic material that is recognizable as functional transposons or their mutated, fragmentary remains. Not surprisingly, nearly all transposon sequences in the human genome are not under selection – they are free to accumulate mutations. These mutations have no effect on us since they do not alter any function we require.

Rags to riches: converting transposons to functional sequences

Despite their parasitic nature, sometimes the host species can exploit transposons as a source of genetic novelty. The ability of transposons to copy and spread themselves around in genomes raises the intriguing possibility that they can acquire a function if they land in the right chromosomal area. While it is difficult (though not impossible) for a transposon to acquire a function as gene coding sequence (i.e. becoming a host protein product), it is comparatively easy for a transposon to pick up a function as a regulatory sequence: a segment of DNA that directs when and where a certain host gene product should be made. Transposons contain regulatory sequences for their own genes already, and these sequences can potentially interact with regulatory sequences in the host genome.

Perhaps a review of gene structure and function would be helpful at this point. Genes are portions of the long DNA sequences that make up chromosomes (each chromosome is one very long DNA molecule). As we have seen above, a good proportion of these sequences are either transposons or the defective fragments of transposons, as well as other DNA that is not under selection and is free to mutate. Interspersed in this sea of non-selected sequences are genes: segments of chromosomes that code for protein products that carry out functions within the cell: enzymatic functions, structural functions, and so on. These sequences stand out because they are subject to selection, and thus do not change at the same rate as sequences that are free to mutate (as we discussed previously).

Genes have a typical structure (obviously simplified here somewhat). First off, there is the actual DNA sequence that specifies the protein product sequence (the so-called “coding sequence”, shown in blue). This sequence is usually broken up into segments in mammalian genes, and these sequences are spliced together when the DNA sequence of the gene is transcribed into a “working copy” called mRNA – a short duplicate of the code that can be used by the cell’s machinery to actually build the specified protein.

Figure 2: Junk DNA

In addition to the actual coding sequences, other sequences are needed to tell the cell when and where certain genes should be transcribed into mRNA. Every cell in an organism has the same genes in its chromosomes, but not all are transcribed. Using different genes in different combinations is what makes cells take on distinct roles – for example, cells in your small intestine need different genes (for absorption of nutrients) than do cells of the immune system (for fighting off pathogens). Regulatory sequences make sure any given cell type has the right genes transcribed and made into protein products. Some of these sequences are part of the mRNA transcript (shown in red), and others are not transcribed but are only part of the chromosomal DNA sequence (such as the “promoter” region that directs the enzymes responsible for making the mRNA transcript, shown in blue).

So, what happens when a transposon inserts into the regulatory sequence of a gene? In many cases, this mutation (the insertion event) will cause a problem (perhaps the gene is no longer transcribed in the right way, for example). In some cases, however, the gene can tolerate such an insertion. Regulatory DNA is more able to accept changes than is coding sequence DNA, so it is quite possible that an insertion may not harm the function of a gene.

Figure 3: Junk DNA

In some cases, sequences from the transposon can participate in the regulation of the neighboring gene. If these changes are beneficial, as they sometimes are, then the transposon sequences involved in regulation come under selection. Some parts of the transposon mutate away beyond recognition, and the useful bits remain since they, now being under selection, are not (as) free to mutate. The end result is a gene that has co-opted a fortuitous event (a transposon insertion) and, through mutation and selection, honed it to serve a new function (altered regulation of its product). This is an example of exaptation, the conversion of one function to another through mutation and subsequent selection. In this case the old function (a “self-serving” transposon) has had a portion of its sequence exapted to become part of the host regulatory DNA.

Recent work comparing 29 different mammals has shown there are about 280,000 examples of exapted transposon fragments in mammalian genomes. Despite this large number, the absolute fraction of human DNA that falls into this category is tiny: of our 3 billion base pairs of DNA, only about 7 million are the detectable remnants of exapted transposons. The vast majority of transposons and transposon fragments in the human genome (as we mentioned, totaling around 1.4 billion base pairs) are not under selection and are free to mutate without affecting any function.
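For readers who like to check the arithmetic, the proportions can be worked out directly from the round figures quoted in this series. The inputs are the text's approximate numbers (a 3 billion base pair genome, "45% or more" transposon-derived, about 7 million base pairs exapted), so the results are rough estimates, not measurements.

```python
GENOME_BP = 3_000_000_000    # approximate human genome size used in the text
TRANSPOSON_FRACTION = 0.45   # lower bound quoted in the text ("45% or more")
EXAPTED_BP = 7_000_000       # detectable remnants of exapted transposons

transposon_bp = GENOME_BP * TRANSPOSON_FRACTION
exapted_share_of_genome = EXAPTED_BP / GENOME_BP
exapted_share_of_transposons = EXAPTED_BP / transposon_bp

print(f"transposon-derived DNA: about {transposon_bp / 1e9:.2f} billion bp")
print(f"exapted share of genome: {exapted_share_of_genome:.2%}")
print(f"exapted share of transposon DNA: {exapted_share_of_transposons:.2%}")
```

On these figures the exapted fragments come to roughly a quarter of one percent of the genome, and about half of one percent of all transposon-derived sequence. The 45% lower bound gives 1.35 billion base pairs, consistent with the "around 1.4 billion" quoted for a fraction slightly above 45%.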

The genomic recycling bin

So, transposons are at once a good example of non-functional DNA in genomes (indeed, nearly half of our own genome is made up of them), and an example of how evolutionary processes can convert non-functional DNA into functional DNA through mutation and selection. While I did not discuss exapted transposons in my previous series, this is another clear example of how evolution can produce novel information within the genome: by “recycling” small amounts of its junk to produce new functions. Note well, however: the fact that a small fraction of transposons have been exapted into functional sequences does not “confer” functionality on all transposons. We see the signs of selection on only a tiny minority, and even then typically only on fragmentary remains.

In the next section, we’ll examine another form of non-functional DNA present in genomes: processed pseudogenes.

“Pseudogenes” (literally “false genes”) are generally viewed as sequences in genomes that, though they have high sequence similarity to “real” genes, do not have a function. Historically they were found before the advent of whole-genome sequencing as alternate forms of genes that lacked certain features. Some pseudogenes have characteristics that indicate they are derived from “real” genes – a class of pseudogenes called processed pseudogenes. In this section we’ll discuss the mechanism by which processed pseudogenes arise, and then discuss how a small fraction of them pick up functions and become “real” themselves.

Some assembly required

Processed pseudogenes arise from gene sequences that are transcribed into RNA, and spliced together to form “messenger RNA,” or mRNA, which is what the cell uses to guide protein translation. While I discussed how genes work in the last section, let’s briefly revisit the topic with a view to certain details that we’ll need to understand how processed pseudogenes come to be.

Figure 4: Junk DNA

Genes are segments of DNA on chromosomes housed in the nucleus of the cell. In order to make a specific protein encoded by the gene, a “working copy” of the sequence is transcribed into RNA, which for our purposes you can think of as a single-stranded version of DNA (DNA being double-stranded, of course). This RNA copy is then processed to remove sequences that interrupt the protein sequence code. These sequences are called introns, and they are spliced out of the RNA to produce what is called “messenger RNA” or “mRNA” – since it is now ready to carry the protein sequence code, or “message” out to the place where the protein will actually be constructed (outside the nucleus, in the cytoplasm).

While mRNA is a single-stranded molecule, an enzyme called reverse transcriptase is capable of re-creating a double-stranded DNA copy of it. This enzyme function is not a normal cellular function, since the point of producing mRNA in the first place is to make protein, not DNA. Cells don’t need to make DNA copies of RNA transcripts.

So, why is there reverse transcriptase present in cells at all? The answer, as it turns out, is that this enzyme is part of a type of transposon found in many organisms. We discussed transposons in a previous section. In brief, transposons are self-replicating DNA segments that copy themselves and spread within genomes – a sort of minimalist DNA parasite. One class of transposons called retrotransposons copy themselves into RNA and then back into DNA using reverse transcriptase – so this enzyme is present in cells as a result. On occasion, reverse transcriptase makes a DNA copy of a host cell mRNA instead of its intended target (the transposon RNA):

Figure 5: Junk DNA

This DNA copy of the mRNA may, at a low frequency, re-enter the nucleus and insert itself into a chromosome. The result is a sequence that is highly similar to the original gene, but lacking several key components: introns are missing, obviously, but also the original “parent gene” chromosomal DNA regulatory sequences. Processed pseudogenes, when inserted, have no function the cell requires – indeed, the cell was getting along just fine without them before they inserted. Accordingly, the vast majority of processed pseudogenes in genomes are not under natural selection, but may mutate freely without consequence.
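The steps that produce a processed pseudogene can be sketched with a toy gene. Everything here is invented for illustration: the segment sequences, the function names, and the simplifications (no untranslated regions, no poly-A tail, and strands treated as plain text, not complementary pairs).

```python
# Toy gene as (kind, sequence) segments in chromosomal order. All sequences are invented.
gene = [
    ("promoter", "TATAAA"),
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTTTCAG"),
    ("exon", "AAGCTG"),
    ("intron", "GTGAGTCCTAG"),
    ("exon", "ACCTAA"),
]

def transcribe_and_splice(segments):
    """Return the mRNA: exons only, joined in order, with T written as U."""
    exons = "".join(seq for kind, seq in segments if kind == "exon")
    return exons.replace("T", "U")

def reverse_transcribe(mrna):
    """Return a DNA copy of the mRNA message: U written back as T."""
    return mrna.replace("U", "T")

parent_gene = "".join(seq for _, seq in gene)
mrna = transcribe_and_splice(gene)
processed_copy = reverse_transcribe(mrna)

print("parent gene:   ", parent_gene)
print("processed copy:", processed_copy)
```

The copy that could re-insert into a chromosome carries the joined exons but neither the introns nor the promoter, which is the signature that lets researchers recognize processed pseudogenes.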

Seeking the living among the dead

For a tiny fraction of processed pseudogenes, however, this may not be the end of the story. As we saw previously for transposon insertions, in rare cases the arrival of new DNA sequence at a chromosomal location might alter a cellular function and then be selected for on that basis. So too in this example, with an added twist: the fact that the processed pseudogene and its “parent gene” share a great deal of sequence in common raises the possibility that they could interact as RNA copies. If the processed pseudogene lands in a chromosome location that has regulatory sequences nearby, it might be transcribed into RNA as a result. If this happens, the interactions between the two RNA molecules may alter the regulation of the parent gene. If this new interaction has a selectable benefit, the processed pseudogene in effect has become a new gene in its own right and future mutation and selection may hone this nascent function over time.

Diagram 6: Junk DNA

As we mentioned previously, the fact that transposons can be converted into functional sequences does not “confer” functionality on all transposons. Nor does the (very interesting) finding that “the dead may rise” from pseudogene to gene indicate that all processed pseudogenes are likewise functional or about to become so. Rather, both examples illustrate that new biological information can be obtained through the natural processes of mutation (in this case, duplication and insertion of DNA sequence to a new location in the genome) and subsequent selection.

In the next section, we’ll examine one final form of non-functional DNA present in genomes, and one that is of great discomfort to antievolutionary views: unitary pseudogenes.

In our previous section, we examined processed pseudogenes – transcribed gene copies that randomly insert into genomes. Unitary pseudogenes, however, are different: unlike processed pseudogenes, they are unique sequences in genomes, and not copies. They have the features one expects of “real” genes: regulatory sequences, introns, and protein coding sections – but with mutations that prevent them from being transcribed or translated. Like buildings in various states of repair, there is a similar range for unitary pseudogenes. If they have only been recently inactivated, they will be largely intact – like a recently abandoned building with a few broken windows. Others are further along in their degradation, like a stone building without a roof and grass growing up through the floor. Some are so far gone that one needs to peel back the turf to search for what remains of the foundation. Despite their various states of disrepair, they remain recognizable – in some cases, they can persist for millions of years before they slowly mutate beyond recognition.

The reason for these defective genes is straightforward: the organism that had the original mutation that removed the function of the gene was not significantly impacted by the loss. One example I have previously discussed is the human GLO pseudogene. The functional GLO gene is part of the biochemical pathway for making vitamin C, something that humans and other primates are not able to do: if we don’t get enough in our diet, we get scurvy. In an environment with adequate dietary vitamin C, however, the loss of the GLO gene is no big deal – and mutations that remove its function would not have been a disadvantage. The mutations that remove GLO function in humans are the same mutations we see in other species – they are an example of mutations in a nested hierarchy, the type of pattern that relatedness produces. This indicates that the mutations happened once, in a common ancestral species, and have been inherited by several species that descend from that ancestor, ours included.

So, what’s a defective gene like you doing in a species like this?

While it makes sense that mammals ought to be able to make vitamin C (even if humans and other primates cannot), in some cases pseudogenes seem much more “out of place.” One example from the human genome that we have discussed in the past is the vitellogenin gene, a gene required for egg yolk formation in egg-laying organisms. This gene is present in the human genome as a pseudogene, even though humans are placental mammals – human embryos are nourished through a placenta, not egg yolk. This pseudogene was located in the human genome by predicting that its genomic location relative to its neighboring genes would be retained for a long time, even after its inactivation. Accordingly, researchers found a functional vitellogenin gene in the chicken genome, and noted the genes on either side of it (let’s just call them “Gene A and Gene B” for convenience). Gene A and Gene B are also side by side in the human genome, so the researchers looked between them for the signs of vitellogenin gene remains – and found them in that precise spot, still visible despite approximately 300 million years since we last shared a common ancestor with chickens:

Diagram: Common ancestry with chickens
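The search strategy described above, using neighboring genes as landmarks, can be sketched as follows. The gene orders are hypothetical: "GeneA" and "GeneB" follow the placeholders used in the text, and the other names are invented. In practice researchers examined the raw DNA between the flanking genes for sequence similarity; they did not look up an annotated entry as this simplified lookup does.

```python
def genes_between(gene_order, left, right):
    """Return whatever lies between two flanking genes, in either orientation."""
    i, j = gene_order.index(left), gene_order.index(right)
    lo, hi = sorted((i, j))
    return gene_order[lo + 1:hi]

# Hypothetical gene orders along one chromosome segment in each species.
chicken = ["GeneX", "GeneA", "vitellogenin", "GeneB", "GeneY"]
human = ["GeneQ", "GeneA", "vitellogenin_pseudogene", "GeneB", "GeneR"]

chicken_hit = genes_between(chicken, "GeneA", "GeneB")
human_hit = genes_between(human, "GeneA", "GeneB")

print("between GeneA and GeneB in chicken:", chicken_hit)
print("between GeneA and GeneB in human:  ", human_hit)
```

The point of the sketch is the prediction itself: if gene order is conserved, the flanking genes tell you exactly where to look for the remains.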

Other examples like this abound: whales, for example, have unitary pseudogene remnants of genes devoted to an air-based sense of smell, even in cases where the whale species in question does not have an olfactory organ. A second example from whales is pseudogene remnants of visual pigments adapted for wavelengths of light found in terrestrial settings, not aquatic environments. These examples make perfect sense in light of the terrestrial ancestry of whales, but are challenging to account for from an antievolutionary perspective.

Pseudogenes: evolution’s silver bullet?

Unitary pseudogenes with shared mutations in nested hierarchies among related species are far from the only evidence for evolution, and are not even necessarily the line of evidence most convincing to specialists. Specialists can see the broad pattern of multiple lines of converging evidence that support common ancestry to an extent non-specialists cannot easily appreciate. Unitary pseudogenes, however, are valuable tools for demonstrating a sampling of those lines of evidence, and providing a window into the world of comparative genomics that, to paraphrase Dobzhansky’s famous quote, would make absolutely no sense except in the light of evolution.

Yes, the implications of unitary pseudogenes such as these are easy for even non-specialists to grasp: whales have the defective remnants of genes adapted to terrestrial vision and air-based smelling because they descend from terrestrial ancestors. Placental mammals, including humans, have a defective remnant of a gene used to make egg yolk because they descend from egg-laying ancestors. Unitary pseudogenes share identical mutations across related species because they were inactivated in a common ancestor, and were inherited by every species that descended from that ancestral species.

No special training in genetics is required to appreciate the strength of the evidence that these examples provide. Nor does it require special insight to see that attempts made by antievolutionary groups to refute this evidence face an uphill battle. Its daunting nature notwithstanding, some have undertaken just that task, since the evidence is too compelling to ignore, and too risky to leave unanswered.

Bringing it together: antievolutionary approaches to pseudogenes, unitary and otherwise, miss the mark

Now that we have covered significant ground with respect to what various classes of pseudogenes are and how they arise, we are able to properly evaluate antievolutionary arguments put forward in an attempt to discredit these lines of evidence for evolution. Attempts to discredit unitary pseudogene evidence generally take one or both of the following two approaches, which we will evaluate in turn:

Approach 1: Discuss rare examples of processed pseudogenes that have acquired function, and imply that all pseudogenes, including unitary pseudogenes, will similarly be shown to have function.

This approach is a fairly common one in the antievolutionary literature, and examples abound. We have examined previously how processed pseudogenes may, in rare cases, acquire a function and come under selection. Note well: the vast, vast majority of processed pseudogenes are not functional and are slowly mutating beyond recognition as DNA not under selection. While rare examples that have acquired function are very interesting from a scientific perspective, they do not “confer functionality” on the remainder of processed pseudogenes, let alone on unitary pseudogenes.

The other issue with this argument is that in many cases we know what the function of the unitary pseudogene once was. We know what the function of vitellogenin is, for example – and we can find this gene in modern-day egg-laying animals. When we see the remnants of this sequence in the human genome it is a stretch to argue that it has another, as of yet unknown function. When we see the human pseudogene sitting between two other genes in the human genome in the same order as we observe in the chicken genome, it stretches credibility well past the breaking point.

Approach 2: Claim that unitary pseudogenes with mutations shared across species are the result of non-random mutations that occurred independently in the two species, and are not inherited from a common ancestor.

This argument, though having an appearance of validity, is similarly doomed to frustration. While mutations are not entirely random (certain regions of the genome mutate more readily than others) there is no known mechanism that could create the precise, repeated pattern of shared mutations we observe between related species. The most significant attempt to mount this type of argument against unitary pseudogenes in general was directed at the GLO pseudogene, and I have already discussed the specific details of why that attempt was inadequate. No refinement of that argument, to my knowledge, has been put forward since.

In summary, pseudogenes in general, and unitary pseudogenes in particular, remain a significant thorn in the side of antievolutionary groups.

About the author

Dennis Venema


Dennis Venema is professor of biology at Trinity Western University in Langley, British Columbia. He holds a B.Sc. (with Honors) from the University of British Columbia (1996), and received his Ph.D. from the University of British Columbia in 2003. His research is focused on the genetics of pattern formation and signaling using the common fruit fly Drosophila melanogaster as a model organism. Dennis is a gifted thinker and writer on matters of science and faith, but also an award-winning biology teacher—he won the 2008 College Biology Teaching Award from the National Association of Biology Teachers. He and his family enjoy numerous outdoor activities that the Canadian Pacific coast region has to offer.