The Evolutionary Origins of Genetic Information: What Can Evolution Account For?

By Stephen Freeland (guest author)

This post is the third in a four-part series that has been adapted from Stephen Freeland's scholarly essay (available here) on the origin of genetic information.

The description of evolution given above applies once the world contains a genetic material that can influence its own rate of copying by reflecting the environment. In living systems, these remarkable properties are produced by the Central Dogma of molecular biology (see Box 1 below). Perhaps a stronger argument for Intelligent Design is that no natural process could create such a versatile system in the first place?

It is true that at present, evolutionary science does not have a clear, detailed and well-accepted explanation for how the Central Dogma of molecular biology emerged. But does that mean it is time to embrace Intelligent Design as a better approach? By analogy, current medical science has not found the cure for cancer. Taken in isolation, this sound-bite could lead to the misleading view that existing research directions, developed for decades, are best written off as a failure. This would miss an important context. Many aspects of cancer are now being treated with far greater effectiveness than ever before as a result of ongoing research. However, these cures are not robust (all-encompassing) enough to be summarized into the statement “we have found the cure for cancer.” This status is typical of big questions within science: failure to reach the sound-bite goal should not be mistaken for evidence that the research program has failed. Scientific progress is measured by the insights that research produces, and their implications for where we might usefully look next. These insights may even open up new awareness of just how much we do not understand, but characterizing the past few decades of cancer research as an exhaustive search that has ended in failure would be more than premature: it would be actively misleading. This final section of the article offers context to help the reader judge whether a similar situation holds for current research into natural processes that explain the origin of genetic information.

Let us start by making entirely clear what scientists are looking for. As the previous section explains, the challenge is not to find a natural process that can create enough information for a simple genetic system. The universe is replete with information capacity and syntax – from the positions of stars within our galaxy (and billions of others) to the arrangement of atoms in a single grain of sand. Within living systems, most of this information is ignored – so the question is not “where did the information come from” (unless we wish to talk cosmology – a very different subject) but rather “how does nature create systems that focus on some of this natural information?” Put another way, the challenge for understanding the origin of genetic systems is to find how natural processes can simplify a large amount of thermodynamic information into a syntax that displays only the disciplined chemical semantics of a self-replicator.

The exact details of life’s genetic information system came into focus during the middle of the 20th century.1 In 1953 Watson and Crick published the structure of DNA,2 revealing the innate capacity of this molecule to replicate and evolve indefinitely. Thirteen years later, a consortium of scientists published the details of the genetic code by which the information carried by DNA is translated into specific protein sequences.3 The system was so fundamental to understanding life, yet so simple and easy to explain, that it has become known as the Central Dogma of molecular biology (Box 1). However, it was puzzling from an evolutionary perspective. Protein catalysts supervise the construction of individual nucleotides (the building-blocks for making DNA and RNA). Other proteins link these nucleotides into DNA or RNA sequences, depending on their type (deoxyribonucleotides into DNA, and ribonucleotides into RNA). Proteins can perform these roles because each one has just the right chemical properties to catalyze a specific chemical reaction (such as linking a molecule of the nucleotide “A” to T, G or C to start building a genetic message).4 Each protein is a long chain of amino acids (typically several hundred) that have been chemically linked together. The function and shape of a protein emerge spontaneously according to the sequence of these amino acids – just as the meaning of a word is carried (for us) by a sequence of letters drawn from the English alphabet.5 The only way to reliably build the right sequence(s) of amino acids to make the proteins of metabolism is to follow genetic instructions, one code-word (codon) at a time. In other words, for more than three thousand million years, everything living has needed proteins to make genetic information – and needed genetic information to specify how these proteins are to be made.

At the time of discovery, this system looked like an example of what proponents of Intelligent Design might call irreducible complexity: a complex system that cannot evolve from simpler precursors, because any simplification would lose the entire functional value of the system. This perception of an un-evolvable code was further enhanced by the discovery that exactly the same genetic code is at work in organisms as different as human beings and E. coli bacteria (refer back to Figure 1: this is about as genetically different as living organisms can be!). Scientists of the time came to think that one genetic code was universal for all living systems on our planet. This led Francis Crick to propose that the genetic code is a “frozen accident” of evolution,6 universal across life precisely because once it had formed (by some unknown event), it was so fundamental to all biochemistry that it could never change again. Specifically, he pointed out that any change to the rules of genetic coding would be equivalent to a simultaneous mutation in every single gene in the organism (Box 1).7 While evolutionary theory requires that occasional small mutations produce a better fit to the environment, the simultaneous mutation of thousands of genes seems extreme even by the standards of macro-mutationism. However, subsequent science has developed at least three major lines of research that undermine the concept of a frozen accident (and irreducible complexity) for genetic coding.8

First, it has been discovered that the genetic code is not universal. A dozen or so minor variations exist.9 These variations are mostly codes in which one or more genetic codons have altered their amino acid “meanings.” Some involve a more significant change – the addition of a 21st or 22nd amino acid.10 Everything indicates that these genetic codes evolved from the standard genetic code during the past few hundred million years, and continue to evolve today. Arguments for the evolvability of the code are strengthened by the finding that amino acids are assigned to genetic code-words non-randomly. In particular, codons are assigned to amino acids in such a pattern that common mutations produce only minor variations as proteins are decoded. A growing body of evidence connects this feature of the code to the idea that considerable evolution by natural selection has gone into shaping this system.11 Everything suggests that the genetic code is evolved and evolvable after all.12
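One crude way to see the non-random pattern is to count how often a single-letter change to a codon leaves its amino acid meaning untouched. The Python sketch below does only that, assuming the standard genetic code written in the compact 64-letter form commonly used for translation tables (bases ordered T, C, A, G; `*` for stop). The function names are illustrative, and this is a much simpler measure than the analyses in the literature cited above, which weigh how chemically similar the exchanged amino acids are and compare the result against alternative codes.

```python
from itertools import product

BASES = "TCAG"  # base ordering assumed by the compact table below
# Standard genetic code, one letter per codon; '*' marks a stop signal.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(triplet): meaning
    for triplet, meaning in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_neighbours(codon):
    """Yield (position, new_codon) for every one-letter change to a codon."""
    for position in range(3):
        for base in BASES:
            if base != codon[position]:
                yield position, codon[:position] + base + codon[position + 1:]


def synonymous_fraction_by_position():
    """Fraction of one-letter changes that keep the same amino acid,
    reported separately for codon positions 1, 2 and 3.
    Stop codons are skipped as starting points (an illustrative choice)."""
    unchanged = [0, 0, 0]
    total = [0, 0, 0]
    for codon, meaning in GENETIC_CODE.items():
        if meaning == "*":
            continue
        for position, neighbour in single_base_neighbours(codon):
            total[position] += 1
            if GENETIC_CODE[neighbour] == meaning:
                unchanged[position] += 1
    return [u / t for u, t in zip(unchanged, total)]


if __name__ == "__main__":
    for position, fraction in enumerate(synonymous_fraction_by_position(), 1):
        print(f"codon position {position}: {fraction:.3f}")
```

Under these assumptions, changes at the third codon position are far more likely to be silent than changes at the first, and no change at the second position between amino acid codons is silent.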

The second major insight into the origins of genetic coding is that multiple, independent lines of evidence suggest the standard amino acid alphabet of 20 building-blocks grew from a smaller alphabet corresponding to an earlier stage in genetic code evolution. Many variations have been proposed.13 Most derive their views by considering only one or two types of evidence: sophisticated calculations of the amino acid sequences of truly ancient proteins; the repertoire of amino acids found in meteorites; simulations of an early, pre-biological planet Earth; and so on. What is interesting is an unlooked-for match between the broad findings of these different approaches. In particular, different approaches end up dividing the 20 amino acids of modern organisms into 10 that were around in the earliest systems, and 10 that arrived later, as by-products of early biological evolution. The members of each group are remarkably consistent,14 hinting directly at the process by which the genetic code evolved, growing more complex over time from simpler beginnings. Recent findings are also starting to make sense of why natural selection created this particular alphabet of building blocks.15

The third line of insight takes us backwards to the possible origins of genetic coding. Some scientists have used the SELEX approach described in a companion paper by Watts to define mini-sequences of RNA that specifically bind to a particular amino acid.16 Although results have been patchy, some amino acids seem to associate with surprising choosiness to the code-words assigned to them in the standard genetic code. This association suggests that the earliest steps in genetic coding may have been nothing more than simple physical affinities between two types of chemical.

Between them, these insights represent significant progress from the impossibly self-referential system viewed by Crick and those around him just 50 years ago. This half-century of research indicates that the standard genetic code at work in modern cells may be a product of substantial evolution that had taken place by around 3 billion years ago. But perhaps the most interesting progress is that few scientists still regard the emergence of life’s Central Dogma as the origin for genetic information.




Freeland, Stephen. "The Evolutionary Origins of Genetic Information: What Can Evolution Account For?" N.p., 5 Aug. 2013. Web. 17 February 2019.



References & Credits

The content of this post was originally published as part of a paper in the ASA's academic journal, PSCF. It is republished here with permission.

Box 1. An Introduction to Biological Coding and the Central Dogma of Molecular Biology

A code is a system of rules for converting information of one representation into another. For example, Morse Code describes the conversion of information represented by a simple alphabet of dots and dashes to another, more complex alphabet of letters, numbers and punctuation. The code itself is the system of rules that connects these two representations. Genetic coding involves much the same principles, and it is remarkably uniform throughout life: genetic information is stored in the form of nucleic acid (DNA and RNA), but organisms are built by (and to a large extent from) interacting networks of proteins. Proteins and nucleic acids are utterly different types of molecule; thus it is only by decoding genes into proteins that self-replicating organisms come into being, exposing genetic material to evolution. The decoding process occurs in two distinct stages. During transcription, local portions of the DNA double-helix are unwound to expose individual genes as templates from which temporary copies are made (transcribed) in the chemical sister language RNA. These messenger RNA molecules (mRNAs) are then translated into protein.

The language-based terminology reflects the fact that both genes and proteins are essentially 1-dimensional arrays of chemical letters. However, the nucleic acid alphabet comprises just 4 chemical letters (the 4 nucleotides are often abbreviated to ‘A’, ‘C’, ‘G’ and ‘T’ – but see footnote 27), whereas proteins are built from 20 different amino acids. Clearly, no 1:1 mapping can connect nucleotides to amino acids. Instead, nucleotides are translated as non-overlapping triplets known as codons. With 4 chemical letters grouped into codons of length 3, there are 4 × 4 × 4 = 64 possible codons. Each of these 64 codons is assigned to exactly one of 21 meanings (20 amino acids and a ‘stop translation’ signal found at the end of every gene). The genetic code is quite simply the mapping of codons to amino acid meanings. One consequence of this mapping is that most of the amino acids are specified by more than one codon: this is commonly referred to as the redundancy of the code.
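The counting argument above can be checked with a short Python sketch. It assumes the standard genetic code, written in the compact 64-letter form commonly used for translation tables (bases ordered T, C, A, G; one-letter amino acid abbreviations; `*` for the stop signal). The variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # base ordering assumed by the compact table below
# Standard genetic code, one letter per codon; '*' marks a stop signal.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Enumerate all 4 x 4 x 4 triplets in the same order as the table.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# The genetic code is simply the mapping from codon to meaning.
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))

# Redundancy: how many codons share each meaning.
codons_per_meaning = Counter(GENETIC_CODE.values())

if __name__ == "__main__":
    print("possible codons:", len(CODONS))
    print("distinct meanings:", len(codons_per_meaning))
    for meaning, count in sorted(codons_per_meaning.items()):
        print(meaning, count)
```

Under these assumptions the table holds 64 codons and 21 meanings. Only methionine (M) and tryptophan (W) are specified by a single codon, while leucine (L), serine (S) and arginine (R) are each specified by six.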

Although the molecular machinery that produces genetic coding is complex (and indeed, less than perfectly understood), the most essential elements for this discussion are the tRNAs and the ribosome. Each organism uses a set of slightly different tRNAs that each bind a specific amino acid at one end, and recognize a specific codon or subset of codons at the other. As translation of a gene proceeds, appropriate tRNAs bind to successive codons, bringing the desired sequence of amino acids into close, linear proximity where they are chemically linked to form a protein translation product. In this sense, tRNAs are adaptors and translators – between them, they represent the molecular basis of genetic coding. The ribosome is a much larger molecule, comprising both RNA and various proteins, which supervises the whole process of translation. It contains a tunnel through which the ribbon of messenger RNA feeds; somewhere near to the center of the ribosome, a window exposes just enough genetic material for tRNAs to compete with each other to bind the exposed codons.
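The two-stage decoding described in this box can be mimicked in a few lines of Python. This is a deliberately simplified sketch, not a model of the real machinery: the lookup table stands in for the whole set of tRNAs, the loop stands in for the ribosome stepping along the message, transcription is reduced to rewriting T as U (RNA uses the base uracil where DNA uses thymine), and translation is assumed to begin at the first letter. The example gene and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # base ordering assumed by the compact table below
# Standard genetic code, one letter per codon; '*' marks a stop signal.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Codon -> meaning lookup; plays the role of the tRNA adaptors.
GENETIC_CODE = {
    "".join(triplet): meaning
    for triplet, meaning in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Stage 1: copy a DNA coding strand into mRNA letters (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Stage 2: read non-overlapping codons and stop at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")  # table uses DNA letters
        meaning = GENETIC_CODE[codon]
        if meaning == "*":
            break
        protein.append(meaning)
    return "".join(protein)


if __name__ == "__main__":
    toy_gene = "ATGGCCTGGTAA"  # hypothetical gene: Met, Ala, Trp, stop
    messenger = transcribe(toy_gene)
    print(messenger)
    print(translate(messenger))
```

For the hypothetical gene shown, the message is read as the codons AUG, GCC, UGG and UAA, giving the three-letter protein MAW before the stop signal ends translation.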


1. For a fascinating and accessible discussion of the incorrect ideas that paved the way for these discoveries, see: B. Hayes: “The Invention of the Genetic Code,” American Scientist (1998) 86: 8 - 14

2. Watson J.D. and Crick F.H.C. “A Structure for Deoxyribose Nucleic Acid” Nature (1953) 171: 737-738

3. Frisch, L., (ed.) “The Genetic Code”, Cold Spring Harbor Symposia on Quantitative Biology (1966):1 - 747

4. More accurately, “A”, “T”, “G” and “C” refer to the four bases used in genetic coding. Bases are part of a whole nucleotide – the base must be added to a molecule of ribose and a phosphate to form a nucleotide. The ribose-phosphate construction is used as a universal scaffolding with which to join together sequences of bases. This technical differentiation becomes important to the origin of genetic information because bases are relatively easy to produce under prebiotic conditions, full nucleotides much less so. This and other subtleties are described further in a later section, and are explained well in Robert Shapiro’s work (footnote 42).

5. This key insight brought Christian Anfinsen a Nobel prize in 1972, and a brief overview is found in his classic paper: Anfinsen C.B. “Principles that govern the folding of protein chains” Science (1973) 181: 223–230

6. Crick, F. H. C. “The origin of the genetic code”, J. Mol. Biol. (1968) 38: 367 - 379

7. Given that there are only really 64 different rules for converting genetic information into proteins, and an individual protein can be several hundred amino acids in length, most genes use each of these rules many times over.

8. For a much more thorough and technical version of this section including several hundred references to the primary scientific literature, see Freeland S.J. (2009) “Terrestrial Amino Acids and their evolution” in Amino Acids, Peptides and Proteins within Organic Chemistry, Vol. 1 (ed. A. B. Hughes), Wiley VCH.

9. Knight, R. D., Freeland, S. J. and Landweber, L. F. “Rewiring the keyboard: evolvability of the genetic code”, Nature Reviews Genetics (2001) 2: 49-58.

10. For a brief overview, see “The 22nd amino acid.” Atkins JF, Gesteland R. Science. (2002) 296: 1409-10.

11. For an accessible overview of this topic, see Freeland S.J. and Hurst L.D. “Evolution Encoded,” Scientific American (2004) 290:84-91. A more technical and more recent treatment of this topic can be found in Novozhilov AS, Koonin EV. “Exceptional error minimization in putative primordial genetic codes.” Biology Direct (2009) 4:44.

12. For a detailed review, see Koonin EV, Novozhilov AS. “Origin and evolution of the genetic code: the universal enigma.” IUBMB Life (2009) 61:99-111.

13. More than fifty models are considered in Trifonov, E.N. “Consensus temporal order of amino acids and evolution of the triplet code.” Gene (2000) 261:139-51.

14. Compare the similarities in two recent reviews: Higgs, P. G. & Pudritz, R. E. (2009) “A thermodynamic basis for prebiotic amino acid synthesis and the nature of the first genetic code.” Astrobiology 9: 483-90; Cleaves, H.J. (2010) “The origin of the biologically coded amino acids” J. Theor. Biol. 263: 490-498

15. Philip GK, Freeland SJ. “Did evolution select a nonrandom “alphabet” of amino acids?” Astrobiology (2011)11:235-40.

16. The current status of data here is reviewed in Yarus, M., Widmann J.J. and Knight R. “RNA-amino acid binding: a stereochemical era for the genetic code.” J Mol Evol. (2009) 69:406-29.

About the Author

Stephen Freeland

Stephen Freeland is currently the Director for the Individualized Studies program at UMBC. His academic background (a bachelor’s degree in zoology from Oxford, a master’s in biological computation from York University, and a doctorate in genetics from Cambridge) has led him to spend the past twenty years researching the evolution of genetic coding. Steve’s current research explores the evolution of the amino acid “alphabet”—the set of twenty building blocks with which life has been making the proteins of metabolism for more than three billion years. Underlying this research is a growing interest in the cosmological question, “To what degree is life on Earth (or elsewhere) a result of chance?” As the son of a biology teacher who retrained as a Methodist minister, Steve has been blessed with an encouraging environment in which to explore the interface of science and faith since childhood.
