The Origin of Biological Information, Part 4

By Dennis Venema on Letters to the Duchess

If your heart is right, then every creature is a mirror of life to you and a book of holy learning, for there is no creature - no matter how tiny or how lowly - that does not reveal God’s goodness.

Thomas a Kempis - Of the Imitation of Christ (c.1420)

Lost in (Sequence) Space

In Parts 2 and 3 of this series (see sidebar), we explored two concrete examples of how new structures and functions arose through mutation and natural selection: the ability of E. coli to utilize citrate, which appeared during a controlled laboratory experiment, and the duplication and divergence of a steroid hormone receptor gene that acquired a new hormone binding partner and went on to regulate new processes distinct from those of its predecessor.

Both of these examples were notable for the level of detail with which researchers teased out the intermediates on the path to new functions. Still, at the close of Part 3 we noted that:

Over and against these lines of evidence, however, the Intelligent Design Movement claims that such novelty is inaccessible to random mutation and natural selection. Rather, they claim that functional protein shapes are incredibly rare and therefore so isolated from each other that random mutation and natural selection cannot bridge the vast gulfs between them.

The issue here is that functional proteins seem to be a very small subset of possible proteins. Proteins are chains of repeated units (amino acids), typically one hundred or more units in length. There are 20 amino acids found in proteins, so at every position in a protein chain there are 20 possible choices. So, for a protein with only two amino acids (not even a realistic scenario) there are 20^2 (that is, 400) possible combinations. For a protein with 100 amino acids, there are 20^100 (roughly 10^130) combinations: a vast “sequence space” of possible states, of which only a relative few will be functional.
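The arithmetic behind these numbers is easy to check: 20 choices at each of n positions gives 20 raised to the power n possible chains. The short Python sketch below is only an illustration; the name `sequence_space_size` is ours and does not come from any of the cited work.

```python
AMINO_ACIDS = 20  # number of amino acids found in proteins


def sequence_space_size(length):
    """Count the distinct chains of `length` positions with 20 choices each."""
    return AMINO_ACIDS ** length


two = sequence_space_size(2)        # the two-amino-acid toy case
hundred = sequence_space_size(100)  # a 100-amino-acid protein
digits = len(str(hundred))          # how many decimal digits that number has

print(two)     # 400
print(digits)  # 131, i.e. roughly 10^130
```

Python integers have arbitrary precision, so the 100-residue count is computed exactly rather than approximated.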

As we have seen in Parts 2 and 3, proteins “explore” their sequence space through random mutation. Mutation may produce protein forms that reduce or remove function, changes that are neutral with respect to function, or changes that improve function (or add new functions). Over time, evolution predicts that proteins will “branch” through sequence space – with each modern form connected to a previous form of which it is a modified descendant. The Intelligent Design Movement (IDM), as we have noted, predicts a different pattern: isolated, separately designed (created), functional proteins that lack prior transitional forms.

In other words, the IDM views protein sequence space to be like the diagram on the left. The brown spheres represent functional protein shapes (each of which allows for some small variation within the sphere). These are separated by large gaps of nonfunctional sequences. In contrast, an evolutionary model predicts that modern-day functional sequences (brown spheres) are connected in sequence space by functional intermediates across time (black lines).

The two examples we have already examined in Parts 2 and 3 (citrate metabolism and novel hormone / receptor pairs; see sidebar for links) are strong support for the evolutionary model: in both cases new functions and structures were connected to prior forms (that had different functions) through a series of functional intermediates. The question remains, however: are all proteins so connected? Or are these examples rare exceptions? Certainly, if evolution has produced the diversity in protein form and function that we observe today, this pattern should be common.

Welcome to the Neighborhood

That was the question that recently led two researchers to examine a large number of protein enzymes with known functions: 28,862 different proteins from a wide array of organisms, to be exact. Specifically, the researchers examined “genotype neighborhoods”: proteins that have similar amino acid sequences and group together in sequence space (such as those represented by the spheres in the diagram above). A two-dimensional cross-section of two such spheres can be represented as follows (redrawn from Figure 2 in Ferrada and Wagner, 2010):

Here each sphere has a radius (r), and the two spheres are separated in sequence space by a distance (d). Both the radius and the distance are expressed as percent differences in amino acids. For example, we may consider neighborhoods in which all proteins differ from one another by at most 2% of their amino acids (r = 1 for both). The distance between the two neighborhoods (d) is likewise a percent difference in amino acids (for example, d could be 10%).
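To make r and d concrete, here is a minimal Python sketch of a percent-difference measure between two sequences, along with the geometric test for whether two spheres overlap. It assumes the sequences are already aligned to the same length with no gaps, which is a simplification; the function names and the short example sequences are hypothetical and are not taken from Ferrada and Wagner.

```python
def percent_difference(seq_a, seq_b):
    """Percent of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def neighborhoods_overlap(d, r1, r2):
    """Two spheres overlap when their centers are closer than the sum of their radii."""
    return d < r1 + r2


# Made-up ten-residue sequences that differ at the last position only.
center_1 = "MKTAYIAKQR"
center_2 = "MKTAYIAKQL"
d = percent_difference(center_1, center_2)

print(d)                               # 10.0
print(neighborhoods_overlap(d, 1, 1))  # False: the spheres are well separated
print(neighborhoods_overlap(d, 6, 6))  # True: the spheres overlap
```

Real comparisons between proteins of different lengths need a sequence alignment first, so this sketch should be read as an illustration of the units involved and nothing more.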

Since the data set used by the researchers was for enzymes with known functions, pairs of genotype neighborhoods were assessed to determine if they contained the same enzymatic functions, or distinct functions. For example, if neighborhood 1 contains enzyme functions A, B and C, and neighborhood 2 contains only enzyme functions A and B, then enzyme function C is unique to neighborhood 1. The fraction of unique functions for pairs of genotypic neighborhoods can thus be analyzed as functions of r and d.
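The A, B, C example can be written out with ordinary set operations. In the sketch below, the "fraction of unique functions" is taken to be the share of functions found in exactly one of the two neighborhoods; this is our assumption for illustration, and the paper's exact definition may differ. The function names are likewise hypothetical.

```python
def unique_functions(neigh_1, neigh_2):
    """Return (functions only in neighborhood 1, functions only in neighborhood 2)."""
    return neigh_1 - neigh_2, neigh_2 - neigh_1


def fraction_unique(neigh_1, neigh_2):
    """Share of all observed functions that occur in exactly one neighborhood.

    Assumed definition, for illustration only.
    """
    all_functions = neigh_1 | neigh_2
    if not all_functions:
        return 0.0
    return len(neigh_1 ^ neigh_2) / len(all_functions)


neighborhood_1 = {"A", "B", "C"}
neighborhood_2 = {"A", "B"}

only_1, only_2 = unique_functions(neighborhood_1, neighborhood_2)
print(only_1)  # {'C'}
print(only_2)  # set()
print(fraction_unique(neighborhood_1, neighborhood_2))  # one of three functions is unique
```

Repeating this calculation for many pairs of neighborhoods, grouped by their r and d values, would give the kind of curve the analysis describes.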

In other words, how different do two genotype neighborhoods have to be before new functions are encountered in protein sequence space? Are existing protein families situated in protein space as isolated islands of (independently designed) function in a sea of nonfunctionality, as the IDM predicts? Or can new functions be reached as enzymes explore sequence space through random mutation and natural selection?

Not surprisingly, the researchers found that as the percent amino acid differences (d) increased between two genotype neighborhoods, the fraction of unique functions increased. What was interesting (in terms of assessing the claims of the IDM) was that unique functions can be readily observed even for low values of d. For example, genotype neighborhoods with a 20% difference in amino acids (d = 20) had unique functions over 45% of the time when r was held constant at a 5% difference. Smaller differences, such as d = 10, did not eliminate unique functions (nearly 20% had unique functions; see figures 3A and 3B in Ferrada and Wagner for results for the data set as a whole).

A second interesting result was that even when genotype neighborhoods overlap (i.e. d is less than the sum of the two radii), they still may have unique functions:

This underscores two observations at once. First, highly similar sequences may have different functions (as is well known from other studies). Second, the exploration of sequence space is contingent: even closely related proteins cannot necessarily reach the same potential functions via a short search, depending on their position in their genotype neighborhood. This result is also consistent with what we have seen previously in Parts 2 and 3: neutral mutations that move a sequence within its genotype neighborhood can bring it into reach of new potential functional states. Such neutral mutations were key in opening up future possibilities both for the evolution of citrate metabolism in E. coli and for steroid hormone receptors in vertebrates.

Does maintaining a specific protein structure prevent exploration?

Having obtained this result, the researchers went on to add a constraint to the analysis: they restricted their data set to protein sequences known to fold into a specific structure. They chose a very common protein fold (called a TIM barrel) that many protein sequences can fold into (4,132 sequences in the data set), and that performs many different enzymatic functions (53 distinct chemical reactions currently known). The amino acid sequences that form a TIM barrel can be 100% different (i.e. d = 100) or very similar (d ~ 0). As before, the researchers examined how functions are distributed in sequence space for pairs of genotype neighborhoods, but now restricted to this structure alone (the data for the TIM barrel domain can be seen in Figures 4A and 4B of Ferrada and Wagner; compare with 3A and 3B). Significantly, their results were the same as before. Genotype neighborhoods close to each other still showed different functions, and overlapping neighborhoods contained unique functions. To be certain that this was not an effect specific to the TIM barrel, the researchers repeated the analysis for 36 additional structures, all of which gave similar results.

Put another way, constraining a protein to a particular three-dimensional structure (i.e. protein fold) does not seem to hinder its ability to traverse sequence space and acquire new functions in the process.

Taken together, this paper demonstrates some key findings for how protein sequences, structures and functions are distributed in protein sequence space:

  1. The distribution of protein sequences, structures and functions we observe is strongly consistent with the hypothesis that proteins traverse sequence space and acquire new functions over time through random mutation and selection.

  2. Functional sequences in protein sequence space are distributed such that a significant subset of protein families are close to areas with new functions. In some cases, genotype neighborhoods can overlap where one neighborhood contains functions that the other does not.

  3. Not all areas of a genotype neighborhood are equivalent: neutral mutations within a genotype neighborhood can move a sequence to regions where new functions can be reached, or into areas where those same functions are not accessible.

  4. Constraint on protein structure is not a constraint on acquiring new functions. When the analysis was restricted to a common structure, the same results were obtained (consistent for 37 different structures).

Moreover, this work is based on the largest sample size examined to date (over 28,000 proteins), and thus is much more likely to apply to protein sequence space as a whole than studies (such as those performed by members of the IDM) that attempt to extrapolate from studies of one protein (or a handful of related proteins) to protein sequence space in general. Despite the claims of the IDM, proteins do not appear to be “lost” in sequence space.

In the next post in this series, we’ll examine another line of genomics-based evidence for proteins acquiring new functions over time: the distribution of gene copies with distinct functions (paralogs) in vertebrates.




Venema, Dennis. "The Origin of Biological Information, Part 4" N.p., 25 Apr. 2011. Web. 19 February 2019.



References & Credits

Further reading

Ferrada, E., and Wagner, A. (2010). Evolutionary innovations and the organization of protein functions in genotype space. PLoS ONE 5(11); e14172.

About the Author

Dennis Venema

Dennis Venema is professor of biology at Trinity Western University in Langley, British Columbia. He holds a B.Sc. (with Honors) from the University of British Columbia (1996), and received his Ph.D. from the University of British Columbia in 2003. His research is focused on the genetics of pattern formation and signaling using the common fruit fly Drosophila melanogaster as a model organism. Dennis is a gifted thinker and writer on matters of science and faith, but also an award-winning biology teacher—he won the 2008 College Biology Teaching Award from the National Association of Biology Teachers. He and his family enjoy numerous outdoor activities that the Canadian Pacific coast region has to offer. 
