Evolution Basics: Incomplete Lineage Sorting and Ancestral Population Sizes
Note: This series of posts is intended as a basic introduction to the science of evolution for non-specialists. You can see the introduction to this series here. In this post we discuss how comparative genomics data can be used to estimate population sizes at different points within a phylogeny.
In the last post in this series we introduced the (challenging) concept of discordant gene trees within a species tree that arise through incomplete lineage sorting (ILS). In this post, we’ll take a look at one of the interesting implications of ILS – using it to estimate population sizes – before moving on to other topics. (Again, if this topic seems too challenging, feel free to bypass this post and the last – the later posts in this series will not depend on understanding this information.)
We can use ILS to measure population size because discordant trees give us a way to measure the number of alleles present in an ancestral population (which in turn can be used to estimate the number of individuals in that population). Before getting into the details, however, let’s briefly review how speciation is a population-level phenomenon.
As you will recall from previous posts in this series, speciation events get their start when two populations become genetically isolated from one another (either completely, or partially). This allows the average characteristics of the two populations to diverge, which in turn may lead to speciation over time. The point to emphasize here is that both populations are populations – a group of interbreeding organisms of the same species. Populations, as we have seen, are capable of passing on much more genetic diversity than a single individual can – where one individual can carry only two alleles of a given gene, a population can maintain hundreds, or even thousands.
Discordant trees – a window to the past
With this in mind, we can return to our discussion of incomplete lineage sorting and the resulting discordant gene trees nested within a species tree. The example we used previously had a species tree in which humans and chimpanzees are each other’s closest relatives, with the gorilla lineage branching off earlier:
We then described an example of incomplete lineage sorting where gorillas and chimpanzees inherit the most closely related alleles, and humans a more distantly related allele:
We are now ready to discuss what we can infer from this pattern, and what it tells us about the (H,G,C) common ancestral population. First, this pattern tells us that the red and blue alleles were present before the chimpanzee lineage separated from the gorilla lineage. Since we know from the species tree that the (gorilla / chimpanzee) common ancestral population is the (H,G,C) common ancestral population, this confirms that the blue and red alleles were part of the variation that this population maintained.
The next thing to notice is that the yellow allele is more ancestral – in other words, it has fewer mutations when compared to the red and blue alleles. This means that the yellow allele is older than the red or blue alleles, which places it on the phylogeny prior to the (G) / (H,C) speciation event. Also, since humans have the yellow allele, it must have been present in the (H,C) common ancestral population at the point when that population separated from the (G) lineage. Taken together, this means that the yellow allele was also present in the (H,G,C) population. In the absence of new mutations (which are excluded in these analyses), the only way to produce this pattern of inheritance is for all three alleles to have been present in the (H,G,C) population. Even though the present-day species have only one allele each, we can infer that their shared ancestral population had all three.
So, discordant gene trees are a window to the past that reveal the genetic diversity of an ancestral population – how many alleles it maintained for a given region of the genome. By comparing large sets of genome data from humans, chimpanzees and gorillas, it is possible to estimate the population size of the (H,G,C) ancestral population (about 50,000 individuals). This measure, called the effective population size (denoted Ne), is the population size needed to transmit the observed amount of genetic variation from an ancestral population to the present day. The human / chimpanzee (H,C) common ancestral population, estimated using the same methods, also numbered about 50,000 individuals over its history.
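For readers who like to see the arithmetic, the link between population size and discordant trees can be sketched with a standard textbook result from coalescent theory for three species: the expected fraction of discordant gene trees is (2/3) × exp(−T / (2 × Ne)), where T is the number of generations between the two speciation events. This is only a simplified illustration. The studies listed below used a more elaborate coalescent hidden Markov model, and the function names and the example value of T here are hypothetical, chosen for illustration only.

```python
import math

def discordant_fraction(interval_generations, ne):
    """Expected fraction of gene trees that disagree with the species tree."""
    t = interval_generations / (2.0 * ne)  # interval in coalescent units
    return (2.0 / 3.0) * math.exp(-t)

def ne_from_discordance(interval_generations, fraction):
    """Invert the formula above: the Ne implied by a discordant fraction."""
    if not 0.0 < fraction < 2.0 / 3.0:
        raise ValueError("fraction must lie strictly between 0 and 2/3")
    return interval_generations / (2.0 * math.log(2.0 / (3.0 * fraction)))

# Hypothetical illustration values, not taken from the studies cited here.
interval = 100_000  # generations between the two speciation events (assumed)
ne = 50_000         # effective size of the ancestral population

f = discordant_fraction(interval, ne)
print(f"expected discordant fraction: {f:.3f}")
print(f"Ne recovered from that fraction: {ne_from_discordance(interval, f):.0f}")
```

The second function is the direction that matters for this post: given how often gene trees disagree with the species tree, and an estimate of the time between speciation events, one can work backwards to the ancestral population size. A larger population holds on to more alleles for longer, so it leaves more discordant trees behind.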
Testing the model with an additional species – the orangutan genome
The sequencing of the orangutan genome (completed in 2011) provided researchers with an opportunity to check these estimates using an additional data set. The orangutan lineage branches off the primate phylogeny from a common ancestral population (the (H,O,G,C) population, where the “O” stands for orangutan), leaving the (H,G,C) ancestral population, which undergoes speciation later:
Using prior estimates of (H,G,C) and (H,C) population sizes, the researchers were able to predict in advance that a very small fraction of the human and orangutan genomes should be more closely related to each other – i.e. that incomplete lineage sorting should have produced rare genome regions where the human and orangutan alleles are more similar to each other than to other primates. The expected fraction of the genome in such (H,O) paired regions (~1.2%) is tiny when compared to the predicted value for (H,G) paired regions (around 25%), in large part because humans, chimpanzees and gorillas underwent speciation in a relatively short period of time, whereas the time between the orangutan divergence and the later gorilla divergence is greater.
The observed fraction of our genome that more closely matches the orangutan genome is about 0.8% – remarkably close to the predicted value, and consistent with the Ne values estimated for the (H,G,C) and (H,C) populations from prior work. In other words, when comparing primate genomes, we see a pattern of incomplete lineage sorting – as expected, our genome matches chimpanzees most frequently, then gorilla, and then orangutan. (As an aside, it is formally possible that once the gibbon genome is sequenced and analyzed, a trace of incomplete lineage sorting will be found that gives (human, gibbon) allele groupings, but it is likely that this fraction of the genome will be too tiny to detect reliably, since gibbons branch off the primate tree well before orangutans do.)
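The effect of the time between speciation events can be illustrated with a small simulation. The sketch below is a simplified three-species toy model, not the method the researchers used, and it does not reproduce the percentages quoted above. It measures the interval between two speciation events in “coalescent units” (generations divided by 2 × Ne), and the function name and the three interval values are hypothetical. At each simulated genome region, two sister lineages either find their common ancestor within the interval (giving a tree that matches the species tree) or fail to, in which case three lineages enter the older population and pair up at random.

```python
import random

def simulate_discordance(t, n_loci=100_000, seed=0):
    """Fraction of simulated loci whose gene tree disagrees with the species tree.

    t is the time between the two speciation events in coalescent units
    (generations divided by 2 * Ne).
    """
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_loci):
        # Waiting time for the two sister lineages to find a common ancestor.
        if rng.expovariate(1.0) < t:
            continue  # sorted in time: gene tree matches the species tree
        # Otherwise three lineages enter the older population, and each of
        # the three possible pairs is equally likely to join first.
        if rng.randrange(3) != 0:  # pair 0 stands for the species-tree pair
            discordant += 1
    return discordant / n_loci

for t in (0.5, 1.0, 4.0):  # hypothetical short, medium and long intervals
    print(f"t = {t}: discordant fraction ~ {simulate_discordance(t):.3f}")
```

Running it shows the discordant fraction dropping quickly as the interval grows. That is the qualitative reason a long gap between two divergences leaves far fewer discordant regions than a short one.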
Summing up and looking ahead
Far from being a “problem” for common ancestry, incomplete lineage sorting is an expected consequence of populations undergoing speciation events – and a window into their genetic diversity. The end result within a phylogeny, as we have seen, is a subset of characteristics that have a discordant tree within the species tree. In the next post in this series, we’ll explore another effect that can also produce patterns at odds with a species tree: convergent evolution.
For further reading
Hobolth A, et al. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics 3(2): e7.
Hobolth A, et al. (2011). Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Research 21(3): 349.