In the last few posts in this series, we’ve examined the overall pattern we see when comparing related genomes to one another, and how multiple data sets neatly fit into the same family tree, or phylogeny. In this post, we’ll move on to a deeper understanding of phylogenies, and how it is actually expected that some features of genomes will be at odds with their family trees.
But first, a brief aside: this is a challenging topic, and one that might be confusing at first. Still, if you’ve come this far in this series, you already have the tools you need to understand what’s going on here, and with a little additional effort, you’ll have an even deeper understanding of related genomes than you did before. If, on the other hand, this particular topic remains a bit of a muddle, don’t worry – the rest of the series will not depend on understanding this finer point. Also, be sure to ask questions in the comments if things are unclear.
Let’s return to what is by now a familiar example of a phylogeny: that of humans, chimpanzees, and gorillas:
Phylogenies like this one are also known as “species trees”, since they depict how species are related to one another. A species tree shows us the overall pattern – which species share a common ancestral population more recently, and which share a common ancestral population more distantly in the past. In other words, as we noted in the last post in this series, a phylogeny is a measure of shared history and separate history for any two species. The longer two species have a common history, the more similar they are expected to be, on average. Humans and chimps, for example, continue to share a common history for several million years after the lineage leading to gorillas separates from the (human / chimpanzee) common ancestral population. This shared history is what, on average, makes the chimpanzee and human genomes more similar to each other than either is to the gorilla genome. That “on average” matters, however: individual genes (and their alleles) may have a different history within species as they separate from one another. To examine those histories, we need phylogenies for individual genes – so-called “gene trees.”
If you think back to previous posts (here and here) on how variation (alleles) arises through mutation, it should be fairly intuitive that the same principles that can be used to group species into a phylogeny can also be used to group alleles of a single gene into a phylogeny. For example, consider the DNA sequence of three alleles of the same gene, which we can represent as the “yellow”, “red” and “blue” alleles (the colored boxes). Sequence differences that make these alleles distinct are highlighted in red text:
Using the same principles that we used for species as a whole, we can explain the origin of these three alleles by two mutation events (starting with a given that the yellow allele is the ancestral state):
So, within a population, we can reconstruct the allele history of an individual gene using the same methods we have previously applied to species as a whole.
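To make this concrete, here is a minimal Python sketch of the reasoning. The three sequences are invented for illustration (the actual sequences appear only in the figure), and the assumption that the blue allele arose from the red one, rather than the other way around, is ours. The sketch lists the positions where each allele differs from the ancestral yellow allele, then checks that red and blue share a derived change and that two mutation events account for all three alleles.

```python
# Hypothetical 12-base sequences, made up for illustration only.
ALLELES = {
    "yellow": "ATGGCCTTAGCA",  # assumed ancestral state, as in the post
    "red":    "ATGGTCTTAGCA",  # one change relative to yellow
    "blue":   "ATGGTCTTAGGA",  # the same change, plus one more (assumed order)
}


def derived_sites(ancestral, allele):
    """Return the set of positions where `allele` differs from `ancestral`."""
    if len(ancestral) != len(allele):
        raise ValueError("sequences must be aligned and of equal length")
    return {i for i, (a, b) in enumerate(zip(ancestral, allele)) if a != b}


# Derived (mutated) positions of each allele relative to the yellow allele.
derived = {
    name: derived_sites(ALLELES["yellow"], seq) for name, seq in ALLELES.items()
}

# A change shared by red and blue groups them together on the gene tree.
shared_by_red_and_blue = derived["red"] & derived["blue"]

# Each distinct derived position corresponds to one mutation event.
total_mutations = len(derived["red"] | derived["blue"])

print(derived)
print("shared derived sites:", shared_by_red_and_blue)
print("mutation events needed:", total_mutations)
```

Note that grouping by shared derived changes, rather than by a raw count of differences, is what places red and blue together: in this toy data red is one change away from both yellow and blue, but only red and blue carry the same mutation.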
Speciation with genetic variation along for the ride (or not)
So, mutation is constantly producing new alleles (variation) within populations, and processes such as natural selection and genetic drift work to either increase or decrease the frequency of alleles in populations over time. Also, we have spent considerable time discussing (here, here, here and here) how speciation events occur, starting with populations that separate from one another, and accrue differences over time that may lead to the formation of distinct species. All that remains is to bring these ideas together: to consider what might happen to variation (alleles) within a population as it goes through a speciation event. To do that, let’s track our hypothetical alleles through the speciation events that led to humans, chimpanzees and gorillas.
This species tree has the following populations: the population that is ancestral to all three species, designated “(H,G,C)” for “(Human, Gorilla, Chimpanzee)”; the population ancestral to both humans and chimps, (H,C); and the lineages (populations) that lead to each present-day species after its last speciation event with another species on the phylogeny, (H), (G) and (C):
It’s important to keep in mind that a single line on the phylogeny is in fact a population, and populations can have genetic variation. Let’s place our three alleles into the (H,G,C) population:
Now we are set to explore possibilities for how these alleles will be inherited (or not) through the speciation events that will occur. One possibility is straightforward – all three alleles will be inherited by all three species. This possibility is called “complete lineage sorting” since it represents a perfect segregation of all alleles into all lineages. This requires that all three alleles be present in the subpopulations that divide into separate lineages, and that no alleles be lost over time in any lineage. While this is certainly possible, it is by no means certain. As we have seen, when populations separate it is unlikely that all alleles in the original population will be represented in both subpopulations after the divide. Also, it is possible that selection or genetic drift may cause alleles to be lost over time in one lineage but not another. Anything other than perfect segregation of all alleles into all lineages is called “incomplete lineage sorting” – and for a large genome, it is a given that at least some genes will exhibit this effect.
Incomplete lineage sorting – a worked example
The first challenge to complete lineage sorting that these three alleles will face is the speciation event that separates the (H,C) and (G) lineages. For the purposes of this example, let’s suppose that the red allele is excluded from the population that forms the (H,C) lineage, but that all three alleles persist in the (G) lineage. You will recall that this is an example of the founder effect – a nonrepresentative sampling that can exclude alleles from a new subpopulation by chance:
Now let’s examine one possible scenario following on from the (H,C) / (G) speciation event. In the (G) lineage, the yellow and blue alleles are lost over time. At the (H) / (C) speciation event, both the blue and yellow alleles segregate into both lineages, but in the (C) lineage, the yellow allele is later lost. Similarly, the blue allele is later lost in the (H) lineage:
For this particular gene, then, we have the following final pattern:
And at last we see the issue: the gene tree for these alleles is at odds with the species tree. Recall that in the gene tree, the red and blue alleles are more closely related to each other than they are to the yellow allele:
In the species tree, however, the two closest relatives (chimpanzees and humans) do not have the two most closely related alleles – they have more distantly related alleles.
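The bookkeeping in this worked example can be followed step by step in a short Python sketch. It simply records which alleles are present in each population as the events described above occur; the variable names are ours, and nothing here is simulated or estimated.

```python
# Alleles present in the (H,G,C) ancestral population.
ancestral_hgc = {"yellow", "red", "blue"}

# (G) / (H,C) speciation: the red allele is not sampled into (H,C),
# while all three alleles persist, at first, in (G).
g = set(ancestral_hgc)
hc = ancestral_hgc - {"red"}

# In the (G) lineage, the yellow and blue alleles are lost over time.
g -= {"yellow", "blue"}

# (H) / (C) speciation: both remaining alleles enter both lineages...
h, c = set(hc), set(hc)
# ...and each lineage later loses one of them.
h -= {"blue"}
c -= {"yellow"}

present_day = {"human": h, "chimpanzee": c, "gorilla": g}

# Gene tree: red and blue are the two most closely related alleles.
closest_alleles = {"red", "blue"}
# Species tree: humans and chimpanzees are the two closest species.
closest_species = {"human", "chimpanzee"}

# Which species ended up carrying the two most closely related alleles?
species_with_closest_alleles = {
    species
    for species, alleles in present_day.items()
    if alleles <= closest_alleles
}

# The gene tree is discordant if that pair is not the closest species pair.
discordant = species_with_closest_alleles != closest_species

print(present_day)
print("species carrying red/blue:", species_with_closest_alleles)
print("discordant with species tree:", discordant)
```

Changing which alleles are lost in each lineage is an easy way to see that other outcomes, including concordant ones, are equally possible under the same starting conditions.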
Now that we have worked this example, hopefully the reason behind the discrepancy is clear – there is no guarantee that alleles will sort in a lineage to match up with the overall species pattern. If a gene has variation in a population undergoing speciation events, it is expected that some of the time it will assort with a pattern that does not match the species pattern – in some cases, it will have a gene tree that is “discordant” with the species tree. For a population with thousands of genes with multiple alleles present, it is a given that some alleles will assort into a discordant pattern. Far from being a problem for evolution, discordant trees are predicted by evolution. It would be a problem if we did not observe them – but in fact we do, and as we shall see next time, we observe them in precisely the pattern that matches what we would expect based on species trees.
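For readers who would like to see how often discordance is expected, here is a rough sketch based on a standard result from population genetics (coalescent theory) that this series does not derive. For three species with one gene copy sampled from each, and assuming neutral alleles, a constant population size and no interbreeding after the split, the probability that a gene tree disagrees with the species tree is (2/3) × exp(−t / 2Ne), where t is the number of generations between the two speciation events and Ne is the effective size of the population during that interval. The interval lengths used below are hypothetical, chosen only to show the trend.

```python
import math


def discordance_probability(generations_between_splits, effective_size):
    """Probability that a three-species gene tree is discordant.

    Standard neutral coalescent result, assuming one sampled lineage per
    species, constant population size, and no gene flow after the splits.
    """
    if generations_between_splits < 0:
        raise ValueError("interval must be non-negative")
    if effective_size <= 0:
        raise ValueError("effective population size must be positive")
    # Interval length measured in units of 2 * Ne generations.
    t = generations_between_splits / (2.0 * effective_size)
    # With probability exp(-t) the two sister lineages fail to coalesce
    # in the interval; 2 of the 3 equally likely pairings are discordant.
    return (2.0 / 3.0) * math.exp(-t)


# Hypothetical interval lengths (in generations), for illustration only.
ancestral_size = 50_000
for interval in (0, 100_000, 500_000):
    p = discordance_probability(interval, ancestral_size)
    print(f"{interval:>7} generations between splits: P(discordant) = {p:.4f}")
```

The trend is the point here, not the particular numbers: the shorter the time between two speciation events, and the larger the ancestral population, the more discordant gene trees we expect to find.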
We just introduced the (challenging) concept of discordant gene trees within a species tree that arise through incomplete lineage sorting (ILS). Next, we’ll take a look at one of the interesting implications of ILS – using it to estimate population sizes – before moving on to other topics. (Again, if this topic seems too challenging, feel free to bypass this post – the later posts in this series will not depend on understanding this information.)
We can use ILS to measure population size because discordant trees give us a way to measure the number of alleles present in an ancestral population (which in turn can be used to estimate the number of individuals in that population). Before getting into the details, however, let’s briefly review how speciation is a population-level phenomenon.
As you will recall from previous posts in this series, speciation events get their start when two populations become genetically isolated from one another (either completely, or partially). This allows the average characteristics of the two populations to diverge, which in turn may lead to speciation over time. The point to emphasize here is that both populations are populations – a group of interbreeding organisms of the same species. Populations, as we have seen, are capable of passing on much more genetic diversity than a single individual can – where one individual can carry only two alleles of a given gene, a population can maintain hundreds, or even thousands.
Discordant trees – a window to the past
With this in mind, we can return to our discussion of incomplete lineage sorting and the resulting discordant gene trees nested within a species tree. The example we used previously had gorillas and chimpanzees with more closely related alleles, and the human allele more distantly related:
We are now ready to discuss what we can infer from this pattern, and what it tells us about the (H,G,C) common ancestral population. First, this pattern tells us that the red and blue alleles were present before the chimpanzee lineage separated from the gorilla lineage. Since we know from the species tree that the (gorilla / chimpanzee) common ancestral population is the (H,G,C) common ancestral population, this confirms that the blue and red alleles were part of the variation that this population maintained. The next thing to notice is that the yellow allele is more ancestral – in other words, it has fewer mutations when compared to the red and blue alleles. This means that the yellow allele is older than the red or blue alleles, which places the yellow allele on the phylogeny prior to the (G) / (H,C) speciation event. Also, since humans have the yellow allele, it must have been present in the (H,C) common ancestral population at the point when it separated from the (G) lineage. Taken together, this means that the yellow allele was also present in the (H,G,C) population. In the absence of new mutations (which are excluded in these analyses), there is no way to produce this pattern of inheritance unless all three alleles were present in the (H,G,C) population. Even though the present-day species have only one allele each, we can infer that their shared ancestral population had all three.
So, discordant gene trees are a window to the past that reveal the genetic diversity of an ancestral population – how many alleles it maintained for a given region of the genome. By comparing large sets of genome data from humans, chimpanzees and gorillas, it is possible to get an accurate estimate of population size for the (H,G,C) ancestral population (about 50,000 individuals). This measure, called the effective population size (denoted Ne), is the population size needed to transmit the observed amount of genetic variation from an ancestral population to the present day. The human / chimpanzee (H,C) common ancestral lineage, estimated using the same methods, also numbered about 50,000 individuals over its history.
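To give a feel for how a fraction of discordant gene trees can be turned into a population size, here is a deliberately simplified Python sketch. It inverts a standard coalescent result that is not derived in this series: for three species, the expected discordant fraction is (2/3) × exp(−t / 2Ne), where t is the number of generations between the two speciation events. Published estimates rely on far more elaborate statistical models and much more data, so this is only the simplest possible version of the idea, and the input values below are hypothetical.

```python
import math


def effective_size_from_discordance(discordant_fraction, generations_between_splits):
    """Solve f = (2/3) * exp(-t / (2 * Ne)) for Ne.

    Assumes neutral alleles, constant population size, no gene flow after
    the splits, and a known interval `t` (in generations) between them.
    """
    if not 0.0 < discordant_fraction < 2.0 / 3.0:
        raise ValueError("discordant fraction must lie strictly between 0 and 2/3")
    if generations_between_splits <= 0:
        raise ValueError("interval must be positive")
    # Rearranged: Ne = t / (2 * ln(2 / (3 * f)))
    return generations_between_splits / (
        2.0 * math.log(2.0 / (3.0 * discordant_fraction))
    )


# Hypothetical inputs, for illustration only.
interval = 40_000  # generations between the two speciation events
for fraction in (0.01, 0.10, 0.30):
    ne = effective_size_from_discordance(fraction, interval)
    print(f"discordant fraction {fraction:.2f} -> Ne of roughly {ne:,.0f}")
```

For a fixed interval, a larger discordant fraction implies a larger ancestral population, which is the intuition behind using incomplete lineage sorting as a measuring stick.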
Testing the model with an additional species – the orangutan genome
The sequencing of the orangutan genome (completed in 2011) provided researchers with an opportunity to check these estimates using an additional data set. The orangutan lineage branches off the primate phylogeny from a common ancestral population (i.e. the (H,O,G,C) population, where the “O” stands for orangutan), leaving the (H,G,C) ancestral population, which will undergo further speciation later:
Using prior estimates of (H,G,C) and (H,C) population sizes, the researchers were able to predict in advance that a very small fraction of the human and orangutan genomes should be more closely related to each other – i.e. that incomplete lineage sorting should have produced rare genome regions where the human and orangutan alleles are more similar to each other than to those of other primates. The expected fraction of such (H,O) paired regions (~1.2%) is tiny when compared to the predicted fraction for (H,G) paired regions (around 25%), in large part because humans, chimpanzees and gorillas underwent speciation in a relatively short period of time, whereas the time between the orangutan divergence and the later gorilla divergence is greater. The observed fraction of our genome that more closely matches the orangutan genome is about 0.8% – close to the predicted value, and consistent with the Ne values estimated for the (H,G,C) and (H,C) populations from prior work. In other words, when comparing primate genomes, we see the expected pattern of incomplete lineage sorting – our genome matches chimpanzees most frequently, then gorillas, and then orangutans. (As an aside, it is formally possible that once the gibbon genome is sequenced and analyzed, a trace of incomplete lineage sorting will turn up as (human, gibbon) allele groupings, but it is likely that this fraction of the genome will be too tiny to detect reliably, since gibbons branch off the primate tree well before orangutans do.)
Summing up and looking ahead
Far from being a “problem” for common ancestry, incomplete lineage sorting is an expected consequence of populations undergoing speciation events – and a window into their genetic diversity. The end result within a phylogeny, as we have seen, is a subset of characteristics that have a discordant tree within the species tree. In the next post in this series, we’ll explore another effect that can also produce patterns at odds with a species tree: convergent evolution.