One of the challenges in discussing evolution within evangelical Christian circles is that there is widespread confusion about how evolution actually works. In this (intermittent) series, I discuss aspects of evolution that are commonly misunderstood in the Christian community. In this first of several posts on “junk DNA”, we explore how genomics can be employed to test for non-functional sequences by comparing sequences between related organisms. As you finish reading the essay, see if you can figure out the meaning of the figure above. We'll pose a question at the end.
Do genomes have non-functional sequences?
There are various ways to test the hypothesis that certain regions of DNA are non-functional, and in this series we will explore some of them. One way to estimate the fraction of non-functional DNA in a particular genome is to determine which portions of the genome can be freely altered by mutation without consequence to the organism. DNA sequences that cannot be mutated freely without a loss in function are said to be under purifying selection: as mutated forms of this sequence arise in a population, the loss of function associated with the mutated sequence reduces the likelihood that the organism will pass this mutation on to future generations. This type of mutation, in a functional sequence, has deleterious consequences. Another way to put it is that functional sequences are subject to natural selection, which acts as a filter to “purify” the genome at a particular location, but that non-functional sequences are free from the constraints of selection, and “anything goes” with respect to mutation.
Tell us again, Grandpa!
One way to think about this is to consider a humorous story that is told within an extended family (I think every family has these types of stories – I know my kids love to hear certain ones told and retold). Certain incidental details of the story can be altered from telling to telling: perhaps Uncle Joe tells those details one way and Uncle Jeff tells them another. There are, however, certain features of the story that are absolutely non-negotiable, or the story doesn’t “work” (and telling these parts incorrectly will generate protests and corrections from the kids who know how the story goes and insist that you are not telling it correctly). These types of stories, like genomes, have some bits that can freely change and others that can’t. The bits that can’t change are under constraint and, in biological terms, subject to selection. The same factors apply in more concise form to jokes: some bits can change (and do, as the joke is told and retold) – but some bits cannot (for example, the punch line).
The best way to test for purifying selection is to compare the genomes of related organisms that have been separate species for some time. (To continue our analogy, you could determine what parts of the story are really important by comparing how each of the uncles tells it and listing out the parts that are the same in all the various versions). The genomes in the two species are modified versions of the same genome present in the common ancestor species: they started as virtually identical but have since experienced mutations in different locations over time. Mutations in functional sequences will have been subject to purifying selection to remove loss-of-function mutations, whereas mutations will have freely accumulated in non-functional sequences. The two genomes are thus a collection of similarities and differences, as we have discussed before:
In some ways, comparing the DNA sequence between related organisms is like reading alternative history novels. The hypothesis of common ancestry between similar organisms makes a very straightforward prediction about their genomes: it simply predicts that they were once the same genome, in the same ancestral species. This hypothesis also predicts that these two genomes, having gone their separate ways in the diverged species, will have accumulated changes once they separated. Like an alternative history, each genome has the same backstory, and then a history independent from the other after the point of separation.
These similarities and differences, however, will not be randomly distributed. Sequences subject to purifying selection will have fewer differences than sequences that can freely mutate. Accordingly, when compared side-by-side, the two genomes should have regions where differences are common, and where differences are rare. For example, consider a genome segment in two related species where there is one gene present. This gene has some regions that cannot be changed without significant consequences (the DNA letters that code for the amino acid sequence of the gene product, for example) and some regions that can be mutated without consequence (such as some sequences inside introns, the non-coding segments that separate gene coding segments and are spliced out of the final gene product):
What biologists observe when comparing sequences like this between two related organisms is that coding sequences, which are required for the gene’s function, have far fewer differences between them than do sequences found in introns or between genes. The idea is not that mutations preferentially happen in those areas. Rather, mutations can occur anywhere in the genome, but they are more likely to be removed from populations by selection if they alter functional sequences.
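The side-by-side comparison described above can be sketched in a few lines of Python. This is only an illustration: the sequences and region names below are made up for the example (they are not real gene data), and the function name fraction_different is my own. It assumes the two sequences are already aligned and of equal length.

```python
def fraction_different(seq_a, seq_b):
    """Return the fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


# Hypothetical aligned segments from two related species (invented data).
regions = {
    "exon 1": ("ATGGCCAAGTTC", "ATGGCCAAGTTC"),
    "intron": ("GTAAGTCTTAGCAG", "GTAGGACTCAGTAG"),
    "exon 2": ("GGAGACCTGTAA", "GGAGATCTGTAA"),
}

# Tally the differences region by region.
results = {name: fraction_different(a, b) for name, (a, b) in regions.items()}

for name, frac in results.items():
    print(f"{name}: {frac:.0%} of positions differ")
```

In this invented example the intron differs at more positions than either exon, which is the kind of pattern described above. Real analyses also have to handle insertions, deletions, and alignment uncertainty, all of which this sketch ignores.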
The expanding data set
This type of analysis gets easier to do the longer two species have been separated, and the more species one has to compare to each other. Very recently separated species will have a very high degree of genetic similarity simply because neither lineage has had time to accumulate appreciable mutations in the genome inherited from their common ancestor. As such, it is difficult to pick out the sequences that have been subject to selection, since functional and non-functional sequences are both still highly similar (virtually identical). It is only after species have been separated for a long time that a pattern begins to emerge: sequences that are functional remain “constrained” by purifying selection to remain more similar, and non-functional sequences accumulate mutations in the separate lineages that make them less and less alike.
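A toy simulation can make this time-dependence concrete. The sketch below is a deliberately crude model, not a realistic one. Every number in it (sequence length, mutation rate, the chance that a mutation in a constrained site survives selection) is an arbitrary assumption chosen for illustration, and the function names are my own. Two lineages start from the same ancestral sequence; mutations strike every site at the same rate, but most mutations at "constrained" sites are discarded, standing in for purifying selection.

```python
import random

BASES = "ACGT"


def evolve(seq, constrained, steps, mu, keep_prob, rng):
    """Return a mutated copy of seq after the given number of steps.

    Every site mutates with probability mu per step. At constrained sites
    a mutation survives only with probability keep_prob (a crude stand-in
    for purifying selection removing harmful changes).
    """
    seq = list(seq)
    for _ in range(steps):
        for i, base in enumerate(seq):
            if rng.random() >= mu:
                continue  # no mutation at this site this step
            if constrained[i] and rng.random() >= keep_prob:
                continue  # mutation arose but was removed by selection
            seq[i] = rng.choice([b for b in BASES if b != base])
    return seq


def identity(a, b, mask):
    """Fraction of identical positions among the sites where mask is True."""
    pairs = [(x, y) for x, y, m in zip(a, b, mask) if m]
    return sum(x == y for x, y in pairs) / len(pairs)


def simulate(steps, n=2000, mu=0.01, keep_prob=0.05, seed=1):
    """Evolve two lineages from one ancestor; return (constrained, free) identity."""
    rng = random.Random(seed)
    ancestor = [rng.choice(BASES) for _ in range(n)]
    constrained = [i < n // 2 for i in range(n)]  # first half is "functional"
    free = [not c for c in constrained]
    lineage_a = evolve(ancestor, constrained, steps, mu, keep_prob, rng)
    lineage_b = evolve(ancestor, constrained, steps, mu, keep_prob, rng)
    return identity(lineage_a, lineage_b, constrained), identity(lineage_a, lineage_b, free)


for steps in (0, 5, 50, 200):
    c, f = simulate(steps)
    print(f"steps={steps:>3}  constrained identity={c:.2f}  unconstrained identity={f:.2f}")
```

With no time elapsed the two classes of sites are indistinguishable (both identical), and the gap between them widens as the number of steps grows. That is why comparisons between long-separated species are more informative.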
Now that biologists have access to a wide range of mammalian genomes, this type of analysis has been done on the human genome with ever-increasing precision. Early studies comparing the human genome to other genomes, such as the mouse genome (compared in 2002) and dog genome (2005), suggested that only a small fraction of the human genome was subject to purifying selection (about 5%). Recent work published a few months ago has taken this approach to a whole new level: a genome-wide comparison of 29 mammalian species (!). These results are exciting from a biological perspective because this work helps scientists tease out which bits of the human genome are under selection and which bits aren’t (which isn’t always obvious, because we don’t always know what sequences are functional or non-functional). This type of approach is unbiased: it requires no prior hypotheses about what types of sequences to look for, but rather simply looks for what has been selected to remain more similar over time. The results, based on the (very nearly) whole-genome sequences of 29 placental mammals, are in keeping with previous estimates: about 5-6% of the human genome is under purifying selection, and the rest appears to be rather free to accumulate changes. As a species, our genome seems to be about 95% incidental details and 5% punch line.
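When more than two species are available, the same idea extends to looking down each column of a multi-species alignment. The sketch below is a much-simplified illustration, not the method used in the 29-mammal study. The four sequences are invented, and the function name invariant_columns is my own. It simply reports which positions carry the same base in every sequence.

```python
def invariant_columns(aligned):
    """Return the 0-based positions where every aligned sequence has the same base."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment column by column.
    return [i for i, column in enumerate(zip(*aligned)) if len(set(column)) == 1]


# Invented alignment of the same short stretch from four hypothetical species.
alignment = [
    "TCAAGGTCA",
    "GCTAGCTAA",
    "ACAAGTTGA",
    "CCGAGATTA",
]

conserved = invariant_columns(alignment)
print("Positions identical in all species:", conserved)
```

With only a handful of species, some positions will match by chance. Adding more species, and species separated for longer, makes an unchanged position increasingly hard to explain without constraint.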
So, what sorts of things lurk out there in the “other” 95%? In the next post in this series, we’ll head out into the wilds of the human genome and have a look.
Editor's Note: Now that you've read the essay, see if you can surmise the meaning of the figure at the top. This is a tiny stretch of DNA, 21 bases (units of code) long. Why do you think position #4 shows only an A and position #5 shows only a G, whereas other positions are not restricted in this manner? Pretend that you could represent the genome as a whole in this manner. Of the 3 billion bases in our genome, how many of them would be configured like position #4 or #5? What about the rest? Is the specific base (unit of code) functionally important for that set? Upon what do you base your conclusions? Finally, do the presuppositions of the Intelligent Design Movement and Reasons to Believe pivot on how to interpret this data? How would such proponents interpret the data differently than mainstream biologists? Feel free to address these questions in the comment section or, if you prefer, just reflect on them.