Introduction

Progression of life from the earliest forms to humans is often characterized as a series of major steps or key innovations, each providing a significant new capability to the newly evolved organisms that was lacking in more primitive forms. Debate on such major steps or innovations was re-ignited in recent times by Maynard-Smith (Smith and Szathmary 1995), who focussed on ‘Major Transitions’ as defined by changes in the nature of the individual that was the principle subject of selection, and the nature of information transfer within and between individuals. Maynard-Smith and Szathmary’s work has been followed by many analyses of what the key steps or innovations are on the path from Last Common Ancestor (LCA) to man, the evolutionary mechanism that lead to them, and how likely they would happen again if the ‘tape of life’ were rewound (Gould 1989) or if life evolved on another world. But what is a key innovation, and why do they occur?

There is extensive discussion about why many transitions or key innovations in the evolution of life happened, but these discussions are usually framed in terms that are specific to those key innovations. In this paper we suggest a broad approach that classifies the explanation for the appearance of a key innovation into one of three hypotheses. The approach is applicable to any major evolutionary step. Our approach addresses not the specifics of an evolutionary advance in its geological and environmental context, but rather the potential paths to life’s acquisition of a new capability. We apply this approach to reviewing one of the key innovations in the evolution of complex life, the evolution of the complex genetic controls of eukaryotes. Eukaryotes alone have developed complex, developmentally regulated multicellularity. This paper addresses the key genetic differences between eukaryotes and all other domains of life, which we will refer to here by the rather old-fashioned term “prokaryotes” for convenience unless referring specifically to the bacteria or archaea. We show that our approach can suggest which aspect(s) of eukaryotic gene control circuitry are the critical differences between prokaryotes and eukaryotes. We tentatively identify control logic rather than any feature of control chemistry as being the distinguishing difference between prokaryotes and eukaryotes. In a subsequent paper we will apply the approach more broadly to the appearance and evolution of life on Earth.

Model

A major step in evolution is by definition a rare event. Such an event could be the result of a single, highly unlikely step, or a series of steps. Examples of this second category are well known as “Multiple Hit” processes from the causation of cancer (Hanahan and Weinberg 2011; Loeb et al. 2003) and other diseases (Bains 2000; Leblond et al. 2012; Polyzos et al. 2012; Swerdlow 2012), and are likely familiar to the reader. Less familiar is that Multiple Hit processes are not uniform. The path to an innovation might require a number of specific innovations or steps. This is functionally equivalent to an innovation that requires a single, highly improbable step to occur, and the probability of a multiple serial innovation event is the product of the probabilities of its component steps. By contrast, the path to an innovation may require multiple steps that are selected from a larger pool of possible steps, and not a specific combination of steps. Such a multiple hit process has different probabilities, and hence different kinetics, from a process that requires one, specific combination of steps. To avoid confusion with the general term “Multiple Hit,” we term this second class of a Multiple Hit process “Many Paths”—several different paths can lead to the same outcome. The development of cancer is an example of such a process—several genes need to be mutated to cause a cell to become malignant, but many different combinations of mutations can cause malignancy, and can (within limits) be acquired in different orders.

The different types of explanation have significantly different implications for how evolutionary change occurs with time, and the nature of the innovation [and hence the chance that it would occur again if we ‘rewound the tape of life’ (Gould 1989)] can therefore be inferred from innovation timing, frequency, and mechanism, as we will discuss below. Under our schema, an explanation may be as follows:-

  1. (1)

    A Critical Path Hypothesis. The major event or innovation requires preconditions that take time to develop. However, the time is (at least mostly) determined by the nature of the event and the geological and environmental conditions of the planet, and so once the necessary preconditions exist on the planet then the event will occur in a well-defined timescale.

  2. (2)

    A Random Walk Hypothesis. The major event or innovation is highly unlikely to occur in a specific time step, and the likelihood does not change (substantially) with time. This may be because the event requires a highly improbable event to occur, or a number of highly improbable steps that have to occur in sequence. Thus, substantial time has to elapse before chance events allow the innovation to be made. Once life exists on a planet, ultimately the innovation will occur, but when it occurs is up to chance, and whether it occurs before the planet’s sun leaves the main sequence and renders the planet uninhabitable is not knowable.

  3. (3)

    A Many Paths Hypothesis. The major event or innovation requires many random events to create a complex new function, but many combinations of these can generate the same functional output, even though the genetic or anatomical details of the different outputs are not the same. So once life exists, the chance that the innovation will occur in a given time period is high, but the exact time is not knowable.

Each of these may also fall into a fourth category, termed as “Pulling Up The Ladder.” In this class of explanation, an innovation is likely (either because it is a Critical Path or a Many Paths process), but the results of the innovation destroy the preconditions for its own occurrence. The new organisms “pull up the ladder after themselves.” The endosymbiotic origin of eukaryotic organelles could be a ‘pulling up the ladder’ process, because once the eukaryote ancestor had acquired a proto-mitochondrion, there was no opportunity for it to acquire another.

An example of the Critical Path hypothesis might be the argument that complex animal life depends on aerobic metabolism, and hence on an oxygen atmosphere. Oxygenating the atmosphere and crust of a planet, so that an atmosphere with a high oxygen content can accumulate, takes a long time, perhaps a billion years on Earth (Schulze-Makuch and Irwin 2008). Thus, a geological amount of time might have to elapse between the appearance of oxygenic photosynthesis and the rise of complex animal life [for example, see (Catling et al. 2005)]. Once oxygenic photosynthesis has evolved, the evolution of large, complex animals is highly likely, but after a long delay.

An example of the Random Walk hypothesis might be the argument that the rise of the mammals required that the Therapsid precursors of mammals exist and that a diverse set of open ecological niches existed for them to radiate into. The former was true in the Triassic (Bi et al. 2014), but it took a random event (the Chicxulub impact combined with prior rapid climate change at the end of the Cretaceous) to make the latter happen [reviewed in (Archibald 2011)]. That impact could have happened at the end of the Jurassic, or during the Eocene, or could not have happened yet.

An example of the Many Paths hypothesis is the evolution of imaging vision (Land and Nilsson 2012). Many genes are involved, and a small number [such as Pax6 (Komik 2005) and the opsins (Collin et al. 2003)] are common to many or even all imaging vision systems, speaking to a common, pre-existing light detection apparatus. However, the parallel evolution of the insect, cephalopod, and vertebrate eyes using generally different genetic programs to produce very different anatomical structures shows that functionally equivalent structures for complex imaging can be generated from very different anatomy and genetics, and hence different evolutionary paths.

The reason for classifying innovations in this way is that the three classes of hypothesis have different implications for the likely timing of the events. Figure 1 illustrates the implications of these classes of hypothesis.

Fig. 1
figure 1

Timing implications of the three models. Illustration of our test for the hypotheses explaining the major innovations. The figure illustrates one specific transition, and its probability under three models. Top panel the probability that a transition will occur in any specific 100 My period (Y axis) after a given number of Gigayears (X axis). Critical Path (red)—will inevitably occur once a specific environmental threshold has been passed at 1 Gy. Random walk (green)—equal probability of happening at any time. Many Paths (blue)—will probably happen around 1 Gya. Lower panel Five independent lineages (clades within one planet or life on different planets) and the time of occurrence of that one transition under the Critical Path hypothesis (red), Random walk hypothesis (green), and Many Paths hypothesis (blue). Note that in lineage 3, the transition has not occurred at all in the time available under the Random Walk hypothesis (Color figure online)

  1. (1)

    Critical Path Hypothesis. One set of preconditions is needed for that transition. Once those preconditions (“causes”) are satisfied, the innovation will arise quickly, and will occur on all occasions that the preconditions are satisfied. The preconditions take only time to fulfill, there is no (major) random element in it (however, the time may be very substantial). As a consequence, if an innovation occurs through a Critical Path process more than once, it is likely to follow a similar evolutionary path in the different examples. Thus, independent evolution of a common function in the descendants of a common ancestor is likely to use similar mechanisms.

  2. (2)

    Random Walk Hypothesis. There are no preconditions other than prior existence of life that can achieve the innovation (e.g., nervous systems cannot evolve without cells). The innovation will occur at random, but since it is highly improbable it will not likely occur twice even if the preconditions are satisfied many times.

  3. (3)

    Many Paths Hypothesis. There are no preconditions other that prior existence of life that can achieve the innovation. However, once that precondition is met, the innovation will occur at a fairly reliable time (in generations) afterwards, and so will eventually occur on all occasions that the preconditions are satisfied. If an innovation occurs through a Many Paths process more than once, it is likely to use different mechanisms each time.

A Many Paths process is not functionally the same as a Random Walk process, and, as mentioned above, is only one example of a Multiple Hit process. It is well known (but still surprising) that if many random events have to occur to cause an output, but many combinations of random events can cause the output, then the timing of the output is more predictable than the timing of any of its component, individual events (see (Bains 2000) and (Bains 2004) for examples of the biological implications of this effect). This is a stronger statement of de Duve’s aphorism that “chance does not exclude inevitability” (de Duve 2005). Suggesting a Many Paths hypothesis also has the implication that, if an event can be caused by many combinations of random events, then it will inevitably happen, and it will have a high probability of happening in a defined time period. We can see a way of discriminating between the hypotheses from this formulation. If a major innovation has occurred only once, we might favor the Random Walk hypothesis. If it occurred many times spread through evolutionary time, we may prefer the Many Paths hypothesis. If it occurs many times with a very diverse set of mechanisms, we might also prefer a Many Paths hypothesis. If it occurred many times and in a closely defined time horizon, we may prefer the Critical Path hypothesis. If several independent evolutionary origins result in similar mechanisms (anatomical, molecule, genetic, or other), then the evidence is even stronger for a Critical Path hypothesis. Thus, we can decouple the overall likelihood of a transition in an evolutionary period from the specifics of how that transition actually occurred, providing we can either identify the time course over which multiple examples of the transition occurred (and match those to Figure 1), or identify multiple paths by which specific function has been acquired (and match these to one of the three models above).

We realize that many real world processes can share features of more than one of these. Thus, acquisition of chloroplasts by the Archaeplastida clearly is, to an extent, a critical path process (life had to have evolved the necessary cell types for it to happen), a Random Walk process (the evolution of oxygenesis may be a unique, unreplicated event (Blankenship and Hartman 1998; Holland 2006), a Multiple Path process (any one of many combinations of endosymbionts could have evolved, and indeed many did), and a Pulling Up the Ladder process. The key for discriminating which interpretation is relevant is, in our view, to focus on function, not structure. Throughout this paper we follow (Smith and Szathmary 1995) in being concerned with function, not mechanism or structure. We are seeking to distinguish analogous structures from homologous ones. For example, the mammalian placenta evolved only once (indeed the placental mammals are defined by their placentae). However, placental vivipary has evolved many times (Pollux et al. 2009; Wourms and Lombardi 1992). Some placental reptiles show erosion and invasion of maternal tissue by fetal tissue, resulting in direct fetal contact with the maternal blood stream (Blackburn and Flemming 2012), a feature that used to be thought to be exclusively mammalian. If we define a placenta as the anatomical structure that occurs in mammals, then (by definition) it has only evolved once. But if [after (Mossman 1937)] we define a placenta as “an intimate apposition or fusion of maternal and fetal tissues for sustenance and physiological exchange”, then it has evolved many times.

In this paper, we illustrate this approach with an analysis of the appearance of the complex architecture of eukaryotic genetic control within the framework of the three classes of hypothesis. We show that the evolution of the major components of eukaryotic genetics are Many Paths processes, which may have relied on a single, specific functional difference between the ur-eukaryote and other life. We argue that the appearance of the specifics of chemistry of eukaryotic gene control, such as RNA modulation of chromatin architecture or multiple alternative splicing, is not the key to its functional capability. The key to the eukaryotic genome is its logic, not its chemistry. To simplify the difference greatly, eukaryotic genes are by default ‘off,’ whereas prokaryotic genes are by default ‘on.’ Their default ‘off’ logic allows eukaryotic genomes to be expanded and complexified much more easily than prokaryotic ones, allowing eukaryotes to develop the staggeringly complex genome control architecture we see today.

In order to substantiate this hypothesis we need to show (1) that the difference in control logic is real, and (2) that essentially all aspects of the apparent differences in control chemistry between archaea, bacteria and eukaryotes are in fact different structural implementations of the same functionality. This second point requires us to review every one of the many modes of gene control in eukaryotes, which is (the authors admit) a dull and repetitive task. It is however essential to the logic of the argument. The reader who is willing to accept this point with less than an exhaustive demonstration is recommended to skip to “The Evolutionary Step to Eukaryotic Gene Control” section, which addresses the first part of our argument.

Genetic Control and Genome Complexity

One of the surprises of the human genome project was that humans only have a few more protein coding genes than Drosophila, and only about 5 times as many as E. coli. This shock to our self-esteem has been mitigated in the last decade by the realization that a lot of the DNA in complex organisms is related to gene control rather than to protein coding (Washietl et al. 2007), and that much of the ‘junk’ DNA is actually functional [as suggested 30 years ago on basis even older data from mutation and radiation studies (Bains 1982); however, see also discussion by (Ponting and Hardison 2011)]. All obligate multicellular organisms are eukaryotes, and so it might be postulated that eukaryotic gene control circuitry is uniquely exapted to the evolution of complex life, and as such its appearance represents a bottleneck (a Random Walk event) in the evolution of complex life.

Our objective is not to show that this complexity is not real. Rather, we argue that i) all the basic functionality in eukaryotic gene control has evolved several times with different chemistry in eukaryotes, bacteria and archaea, and ii) the difference between the ur-eukaryote and its prokaryotic contemporaries was genetic logic, not genetic chemistry. The ur-eukaryote probably had a genome not much larger than a modern-day prokaryote (Makarova et al. 2005). The establishment of the extremely complex control genetics of complex eukaryotes by expansion of this original control chemistry is therefore a Many Paths process.

It is now generally believed that the eukaryotic nuclear genome originated from an archaeal-like ancestor (see (Blackstone 2013; Brown and Doolittle 1997; Cavalier-Smith 2010; de Duve 2007; Spang et al. 2015; Weinzierl 2013; Williams et al. 2014) and references therein). Similarities in chemistry between archaea and eukaryotes are explained by common ancestry (i.e., are homologies). Similarly, archaea, prokarya, and eukarya evolved from a presumed single ancestor, the LCA (Doolittle et al. 1996; Glansdorff 2002), which must have had DNA, RNA, and protein synthesis, the structure of a central metabolism, and mechanisms to control all of that machinery. While sequence homology may not have been preserved across 3 billion years, structural similarity (and hence presumed homology) may have been [See for example the actin, MreB and Ta0853 proteins of eukaryotes, prokaryotes, and archaea, respectively (Roeben et al. 2006)], so finding similar molecules performing similar functions in different branches of life may be evidence of common descent, not independent origin.

Here we do not attempt to add to the literature arguing homology from similarity between major domains of life. Rather, we emphasize similarity of function arising from non-homologous (and usually non-similar) chemistry in the domain of the control of genes. Thus, finding a histone fold protein in archaea and eukaryotes is taken as evidence of their common ancestry. Finding proteins with no sequence or structural homology to histones forming the structural basis of kilobase nucleoprotein architecture in bacteria is very hard to explain by homology, and is more parsimoniously explained by independent evolution of that function from a different source chemistry. The presence of imaging lenses in the eyes in mammals and cephalopods does not prove that mammals are cephalopods, nor that their common ancestor had eyes. Rather, it suggests that the evolution of imaging vision is a process that can take multiple paths, and hence is likely to evolve as a function even if the specifics of its implementation are unique to each lineage. We will argue similarly for the components and mechanisms of gene control.

Control of gene activity may be split, entirely for our convenience, into control of transcription, of protein synthesis, and of mRNA and protein turnover. Transcription can further be split into basic RNA polymerase activity, local modulation of transcription, and global control. We will follow this classification, but note that this is not a reflection of a biological hierarchy of control. There is no such hierarchy—all the different processes above are interlinked, such that (for example) ubiquitin-mediated protein turnover directly modulates RNA polymerase activity, the lincRNA-p21 lncRNA induces a wide range of genes, but also suppresses translation, and is degraded by a specific miRNA (Huarte et al. 2010; Yoon et al. 2012). lncRNAs that allosterically regulate cytoplasmic proteins and lncRNAs that are secreted as cell–cell signaling molecules (Geisler and Coller 2013) are known. We will return to the importance of this interlinking below.

Our argument rests on two pillars. We argue that the core functionality of the nucleoprotein structure of eukaryotic gene organization actually evolved independently at least three times, and so is a Many Paths process. We then argue that the types of function that control gene activity within the context of that nucleoprotein structure also have evolved multiple times, although the specific chemical structures that implement the control processes are completely different in each case. We argue therefore that this is also a Many Paths process. These two arguments illustrate the two types of evidence that may support the Many Paths hypothesis.

The next two sections address the first of these assertions, that the core nucleoprotein architecture of life has evolved similar solutions multiple times, and so its appearance is a Many Paths process.

Core Chemical Components

The basic chemical components of eukaryotic gene activity are basal to life, and we will only touch briefly on them here. DNA and RNA synthesis clearly are central to the existence of life, and their origin is part of the Origin Of Life problem; indeed, in the “RNA World” model (Gilbert 1986), RNA synthesis was the origin of life. In all domains of life, genes are controlled by proteins and RNA specifically binding to each other and to DNA, and by chemical modification of proteins, RNA and DNA. All of these features have evolved many times, and are applied in many combinations to control essentially the same input-output logic. Thus (for example) catabolic repression is a common ‘logic circuit’ in the genetic control of metabolism in all domains of life, but has evolved from different proteins and genetic elements in bacteria, archaea and eukaryotes, and probably evolved several times in each lineage (Bini and Blum 2001).

The chemistry of the modification of the core components of gene control have evolved multiple times. DNA base modification is achieved in bacteria, archaea and eukaryotes by unrelated systems (Cao et al. 2003; Chan et al. 2004; Gaspin et al. 2000; Kumar et al. 1994). Modification of proteins to alter their interactions with DNA by protein methylation (Baumann et al. 1994; Eichler and Adams 2005; Martin and Zhang 2005; Reisenauer et al. 1999), acetylation (Eichler and Adams 2005; Yun et al. 2011), S-glutathionylation (Dalle-Donne et al. 2008), ADPribosylation (Fernando Bazan and Koch-Nolte 1997; Pallen et al. 2001) and poly-ADPribosylation (Haasa and Hottinger 2008) have appeared in diverse clades. RNA modification is similarly diverse and universal (addressed below). It is clear that all domains of life independently developed complex genetics from the same base chemical structures.

Nucleoprotein Evolution: Multiple Independent Origins

The structure of nucleoprotein performs two roles in eukaryotes. The first role is to compact a long genome into a small cell. The second is to control the transcription of that genome, through local and global structure. Below we show that both functions have evolved independently several times, and in some cases through co-opting similar chemistry.

DNA Compactification

In all cells DNA condensation is essential to keep the large genome molecule(s) inside the relatively small cell (de Vries 2010; Luijsterburg et al. 2008; Zimmerman and Murphy 1996). The existence of at least three classes of DNA-compacting proteins in prokaryotes (Sandman and Reevem 2001) shows that solutions to this problem have evolved several times (Drlica and Rouviere-Yaniv 1987). Archaea and eukaryotes share DNA-binding proteins with the ‘histone fold’ (Sandman and Reevem 2001), and all domains of life share Alba (“acetylation lowers binding affinity”) proteins that bind and compact DNA (Sandman and Reeve 2005; White and Bell 2002). Some of the archaeal nucleic acid binding proteins are similar to proteins (presumed to be homologues) in eukaryotes, where they function as transcription factors (Mantovani 1999). Several unrelated classes of topoisomerases (Champoux 2001; Corbett and Berger 2004), and integrases and recombinases (Argos et al. 1986; Hallet and Sherratt 1997; Sauer 1994) have evolved to manage the resulting topological problems. In eukaryotes and a few bacteria the compaction solution has included an intracellular membrane-bounded compartment for the DNA (Fuerst 2005; Fuerst et al. 1998). While prokaryotic genomes are usually smaller than eukaryotic genomes, prokaryotic organisms can compact as large an amount of DNA into a prokaryotic cell as is compacted into eukaryotic cells, as shown by the existence of 12 megabase bacterial genomes (Chang et al. 2011) and by polyploid prokaryotic cells (Soppa 2014; Zerulla and Soppa 2014) containing hundreds (Griese et al. 2011) to thousands (Mendell et al. 2008) of copies of megabase genomes in expression-specific structures (Komaki and Ishikawa 2000).

Archaeal and bacterial small, basic, DNA-binding proteins can be deleted (Zhang et al. 1996) without inevitably killing the cells: this is not generally true of eukaryotes. However, Dinoflagellates do not have histones, although they have some histone-like proteins similar to those in bacteria (Wong et al. 2003). They seem to use DNA itself as a structural scaffold for very large chromosome-like structures (Bouligand and Norris 2001). Dinoflagellates are considered to be ‘living histone knockouts’ rather than relics of a primordial, pre-histone eukaryotic gene organization, as their stem groups all have conventional histone chromatin chemistry (de la Espina et al. 2005). Mammalian sperm also replace histone with protamines, although the resulting nuclei show minimal gene expression (Braun 2001; Ward and Coffey 1991). Both examples demonstrate that different routes to packaging a eukaryotic genome into a cell are possible.

We conclude that the DNA compaction solution found by eukaryotes is one of a number of equivalent solutions, and as such its evolution represents a likely Many Paths process. The extent to which the different solutions are highly derived versions of an ancestral genome packaging chemistry present in LCA is unknown at the moment.

Control by Nucleoprotein

Until recently, it was generally believed that eukaryotes control genes through modulation of chromatin structure, and prokaryotes control genes through binding of specific control factors in the classic Jacob and Monod model (Jacob and Monod 1961). This is now understood to be an over-simplification, and that all domains of life use nucleoprotein structure to control genes. This can be done through local or long-range interactions (Luijsterburg et al. 2008).

Eukaryotes have sophisticated mechanism for modulating chromatin structure, which we will discuss in more detail in “Specific Mechanisms of Gene Control in Eukaryotes” section. In this section, we focus on the modification of local nucleoprotein chemistry.

In eukaryotes, ATP-driven chromatin remodeling involves swapping a variety of chromatin components, including removing H2A/H2B histone dimers (Kireeva et al. 2002) by the complexes of the SNF2/SWI2 superfamily to open chromatin for transcription (Mizuguchi et al. 2004; Olave et al. 2002; Shen et al. 2000). Yeast have three such ATPases, mammals seven (Olave et al. 2002). Proteins with sequence similarity to yeast and human SWR1 complex proteins have been found in bacteria and archaea, many of which have been identified as helicases,Footnote 1 i.e. similar, possibly homologous proteins have been co-opted to different functions in different domains.

In eukaryotes, this machinery is targeted to a gene by local chromatin structure, especially methylation and acetylation of histones, primarily H3 and H4 [see reviews in (Bracken et al. 2006; Khalil et al. 2009; Martin and Zhang 2005; Mikkelsen et al. 2007)]. This epigenetic code is then read by specific suites of recognition proteins that direct other enzyme activities to the site (Geng et al. 2012).

This is analogous chemistry to DNA-binding-protein modification in prokaryotes, but its targeting is different. Archaeal histones lack the N-terminal tails that are methylated and acetylated in eukaryotes, but many other proteins that are not related to (and hence are presumably not homologues of) histones (Soppa 2010; Wardleworth et al. 2002), including Alba (Marsh et al. 2005), are acetylated in vivo in bacteria and archaea (White and Bell 2002).

While the complexity of eukaryotic lncRNA-mediated, long-range control is unique to eukaryotes, a much simpler version of the logic of coordination of structure, clustering nucleoprotein chemistry and gene control, implemented with a quite different chemistry, is found in bacteria. The E. coli genome (and a number of other bacteria genomes) is organized into loops of ~10kb by DNA-binding proteins. One of the best characterized is the small, basic protein H-NS. H-NS organizes two groups of genes scattered around the E. coli chromosome into spatially close clusters of co-regulated genes. H-NS binds to DNA (Navarre et al. 2007; Navarre et al. 2006), oligomerizes to bring the distant genes together into one of two close physical clusters (Fang and Rimsky 2008; Wang et al. 2011), and co-ordinates their transcription by facilitating binding of RNA Polymerase (Pol) and accessory regulatory factors (Zhang et al. 1996). H-NS silencing is countered by a variety of mechanisms, including competition from similar proteins, but none involving modification of H-NS (Fang and Rimsky 2008). During DNA replication, the H-NS-coordinated loops are assembled fast, and in a specific order and position in the cell (Viollier et al. 2004). The similarity of all these features to the chromatin features of eukaryotes is obvious, albeit H-NS organizes far fewer genes over smaller distances and for simpler controls. Unrelated proteins appear to perform the same role as H-NS in B. subtilis (Smits and Grossman 2010).

In conclusion, nucleoprotein is found in bacteria, archaea and eukaryotes, and both its role as a DNA compactification system and its role in gene control is found in all kingdoms. The use of different proteins to achieve the same result shows that this was an independent origination of nucleoprotein function in the different domains of life. We conclude that the appearance of nucleoprotein as a substrate for gene control is a Many Paths process.

In the next section, we address the more complex question of whether the mechanism of the control of eukaryotic nucleoprotein is a unique set of capabilities, i.e. probably the outcome of a Random Walk process, or whether it can be considered a Many Paths event as well.

Specific Mechanisms of Gene Control in Eukaryotes

Eukaryotic gene control is astonishingly complex. It is hard to imagine that its origination was not a uniquely unlikely event. In this section we argue that this complexity hides a wealth of duplication, independent origination, and shares a wide range of features with unrelated genetic control systems in prokaryotes. In short, the specifics of gene control in eukaryotes show all the features of the outcome of a Many Paths process.

In order to control a gene, chemical function must be targeted to that gene. For convenience, we will consider local, distant, and global control systems in turn (i.e., systems that act at or within a few helical turns of the start of a gene, at hundreds or thousands of bases from a gene, or that affect every gene, respectively). We consider how gene control is achieved by considering what is doing the targeting. In every case, we will show that there are multiple, independently derived mechanisms that have evolved to achieve the same goal in different organisms, supporting a Many Paths process.

Local Targeting by DNA

DNA is not used widely as a targeting moiety in ‘normal’ genetic function in any domain of life. DNA is usually considered as the target of genetic specificity, not the specifying agent (although this ultimately is a rather semantic distinction). Some integrating DNA viruses and transposons use DNA:DNA interactions as a way to target change to a specific region of the genome. Site-specific recombination in vertebrate immune systems uses short ‘joining signals’ that are necessary and sufficient to direct enzyme-catalyzed recombination of antigen-binding gene precursors (Lewis and Gellert 1989). Recombination systems, directed by DNA sequence, are the mechanism for chromosome terminus replication in some organisms (Levis et al. 1993; Vaillasante et al. 2008). Some bacterial Crispr/Cas systems use DNA targeting (Cao et al. 2003; Chan et al. 2004). So despite their rarity, DNA targeting has evolved independently several times.

Local Targeting by Protein

Direct recognition of genetic elements by proteins has evolved many times in bacteria, archaea, and eukaryotes. The different categories of RNA polymerase, the several classes of structurally distinct DNA-binding proteins, which are found in prokaryotes and eukaryotes [reviewed in (Landschulz et al. 1988; Schwabe et al. 1993)], DNA synthesis initiation factors from bacteria (Messer 2002), all attest to the multiple, parallel evolution of proteins with affinity for specific DNA sequences. We also note that DNA synthesis can be primed by specific proteins (i.e., DNA:protein targeting) in bacterial phages and eukaryotic viruses (Salas 1991), where it presumably evolved independently.

DNA can also be indirectly targeted through sequence-specific chemical modification and subsequent recognition of the modified DNA by proteins that have limited sequence specificity or are sequence agnostic. 5-methyl cytosine is the best known of these modifications, and is generated in all three domains of life by diverse, non-homologous enzymes (Kumar et al. 1994), but 5-hydroxymethylcytosine (Tahiliani et al. 2009), 5-hydroxythymidine (Cliffe et al. 2009), 6-methyl adenine (Wion and Casadesus 2006) are also common across the three domains, and are linked with a diverse range of species-specific genetic controls as well as general cell processes such as DNA replication [see eg (Wion and Casadesus 2006)].

We consider it obvious that direct recognition of DNA and RNA sequences, and of proteins, by proteins has evolved many times, and that the evolution of any specific function achieved by DNA:protein or RNA:protein binding is a Many Paths process.

Local Targeting by RNA in Eukaryotes

Both prokaryotes and eukaryotes use RNA extensively in genetic control chemistry, with metazoa transcribing the majority of their genes into non-coding RNA that is believed to be associated with control function (Washietl et al. 2007). No class of small RNA has a unique function in any domain of life: all have been co-opted from their ‘original’ function to new ones. The development of RNAi in potential therapeutics (Vaishnaw et al. 2010) has accelerated the understanding of short regulatory RNAs, which therefore are classified into several functional classes (Joshua-Tor and Hannon 2010; Cech and Steitz 2014). The longer transcripts are simple called Long Non-Coding RNAs (lncRNAs), a designation that everyone accepts is unsatisfactory, but is inevitable because the function of nearly all these transcripts is unknown. lncRNAs have diverse evolutionary origins (Ponting et al. 2009), some are strongly conserved between species which suggests that they have an essential function (Guttman et al. 2010; Nagano and Fraser 2011; Ponting et al. 2009; Ulitsky et al. 2011) and are not ‘junk DNA’ (Pagel and Johnstone 1992) [although see (Rebollo et al. 2012)]. The roles and mechanisms of a few lncRNAs have been identified (discussed below). It is generally believed that some, probably most of the lncRNAs are involved in genome control (see (Geisler and Coller 2013; Mattick and Gagen 2001; Meister and Tuschi 2004; Mello and Conte 2004; Rinn and Chang 2012; Wang and Chang 2011) for additional reviews of lncRNA biology). Regulatory RNAs are also being discovered in bacteria, both short transcripts (Waters and Storz 2009) as well as ‘classical’ longer antisense RNAs.

Because RNA-based gene controls are so much more extensive in complex eukaryotes than in other organisms, we will dwell more exhaustively on this category of control mechanism. RNA-based targeting systems that use small RNA molecules (<100 bases) are usually classified by their mechanism, related to the protein complexes associated with the RNAs. These broadly classify as follows.

piRNA (Piwi complex-associated). These primarily suppress transposon activity in the germ line of multicellular animals by directing DNA methylation (Malone and Hannon 2009), but are also used for sex determination in Paramecium (Singh et al. 2014) and silk moths (Kiucho et al. 2014), in replacement for the protein factors that determine sex in many other species. The Piwi proteins and associated target RNAs are widely expressed outside the germline in diverse organisms (Ross et al. 2014). Transposons are also silenced through the protein-mediated DNA modification system in eukaryotes (Tahiliani et al. 2009), which has also been re-purposed as part of the antigenic switching machinery in some trypanosomes (Cliffe et al. 2009). piRNAs and associated proteins excise transposons from the ciliate macronucleus in a mechanism reminiscent of the bacteria CRISPR/Cas (Chalker and Yao 2011).

miRNAs are short hairpin RNAs (Meister and Tuschi 2004) and are a major controller of metazoan mRNA stability via the RISC complex (Meister and Tuschi 2004; Nykänen et al. 2001). The Argonaut protein of the siRNA-processing DICER complex is closely similar to the Ago protein of archaea (Song et al. 2004): however, whether they are homologues is still contentious. The function of prokaryotic Ago is not known, but genomic context suggests it is part of a viral defense system parallel to CRISPR/Cas which is similar in role to siRNA in eukaryotes (Makarova et al. 2009). lncRNAs also input into this system (Ruthenburg et al. 2007).

The miRNA repertoire on plants and animals appear to have evolved independently from a common basic mechanism: subsequent bewildering complexity in both lineages has evolved by duplication and divergence of the various components (Shabalina and Koonin 2008). Ctenophores have no miRNA system (Moroz et al. 2014).

siRNA system degrades unwanted transcripts, primarily (in metazoa) viral sequences (Malone and Hannon 2009), but also some endogenous sequences in yeast called Cryptic Unstable Sequences (CUTS) (Houseley 2012; Shabalina and Koonin 2008). The RNAse III component of the siRNA system has analogous proteins in proteobacteria. In Cryptococcus siRNA targets transposon transcripts, using a complex (SCANR) that has proteins similar to a spliceosome protein in eukaryotes (Dumesic et al. 2013). Bacteria have no RNAi system, but some small bacterial RNAs modulate mRNA stability through completely different mechanisms (Görke and Vogel 2008).

For controls through short RNA sequences in eukaryotes, we therefore see

  • Multiple, independent evolution of similar systems (eg animal and plant miRNA)

  • Chemically different systems achieving the same functional goal (eg sex determination on silk moths vs vertebrates)

Both are hallmarks of a Many Paths process.

Targeting by RNAs in Prokaryotes

Prokaryotes are now understood to transcribe dozens or hundreds of non-coding RNAs (Livny et al. 2006; Rivas et al. 2001; Vockenhuber et al. 2011), most of which modulate translation. Most require protein factors such as Hfq to assist imperfect base pair recognition of target RNAs (Papenfort and Vogel 2010; Waters and Storz 2009). Some longer bacterial antisense RNAs span several genes or operons with related function, and provide operon-scale translational control from a single molecule (Sesto et al. 2013). A number of longer bacterial RNAs are now known to have multiple regulatory functions (Papenfort and Vogel 2010), analogous to some locally acting lncRNAs in eukaryotes.

The CRISPR/Cas bacterial system also uses small RNAs as guides for nucleic acid destruction (Jore et al. 2012), but use a different enzyme machinery than RNAi (Hale et al. 2009). Some target incoming ssRNA for destruction, exactly analogously to the RNAi system (Horvath and Barrangou 2010). Others CRISPR/Cas chemistries target dsDNA of invaders, in a mechanism reminiscent of (but evolutionarily unrelated to) small RNA-directed de novo DNA methylation in eukaryotes (Cao et al. 2003; Chan et al. 2004).

RNA also guides base modification enzymes in prokaryotes and eukaryotes. In general, pseudouridine is inserted into rRNA with site-specific enzymes in bacteria, but with broad-specificity enzymes guided by snoRNAs into rRNA (Lafontaine and Tollervey 1998) and mRNA (Carlile et al. 2014) in eukaryotes. Archaea use an intermediate system that comprises guide RNAs and specific proteins, including sequence relatives of eukaryotic snoRNA (Aittaleb et al. 2003), for rRNA O-methylation (Bachellerie et al. 2002; Gaspin et al. 2000). Some promoter-specific DNA modification is guided by siRNA (Cao et al. 2003; Mello and Conte 2004), although much is guided by chromatin structure (discussed below). Again, bacteria use sequence-specific proteins for de novo methylation of DNA, which have regulatory roles as well as roles in the phage defense/restriction systems

Local RNA control is therefore not specific to eukaryotes but has evolved independently in prokaryotes. The elaborate local control systems found in eukaryotes carry out overlapping and mutually replaceable functions, which in at least some cases have evolved independently, and other chemical mechanisms to achieve the same role have evolved independently in different lineages.

Controls via Long RNAs

Higher eukaryotes are unique for their extensive use of long RNAs (arbitrarily, >100 bases) that control gene expression by control of the short- and long-scale chromatin structures. lncRNA can coordinate gene activity locally, or over megabase distances through chromatin folding in eukaryotes (de Santa et al. 2009; Lettice et al. 2003; Nagano and Fraser 2011; Ørom et al. 2010; Smemo et al. 2014; Yao et al. 2010), and the strength of enhancement is not related to the length of the loop (Sanyal et al. 2012; Wang et al. 2013), so this is not an extension of local gene control to longer distances. lncRNA scaffolding targets enzymatic activities to different regions of the genome [reviewed in (Mercer and Mattick 2013; Nagano and Fraser 2011; Rinn 2012; Wilusz et al. 2009)]. Typically lncRNAs will interact with many loci across the genome (Nagano and Fraser 2011), with gene activity requiring a combination of loop topology, protein binding and appropriate chromatin tags in a broad sequence context (Domené et al. 2013; Jin et al. 2013; Taher et al. 2011; Taher et al. 2012; Wang et al. 2013). This level of control interacts with more local levels of control, for example by directing chromatin modifying enzymes to specific regions of the genome (Mercer and Mattick 2013).

lncRNAs target chromatin modification enzymes to add ‘tags’ to chromatin: the chromatin tags in turn are targeted by protein, small and large RNAs. The Polycomb system proteins (PCGPs), that methylate histones over short (Müller and Kassis 2006) or long (Lee et al. 2006; Schwartz et al. 2006; Wang and Chang 2011) distances, are targeted by direct binding to promoters or repressors or recruited to chromatin by lncRNAs which bind both PGCPs and either other proteins or other RNAs (Khalil et al. 2009; Nagano and Fraser 2011; Wilusz et al. 2009). Although Polycomb is usually associated with gene repression, in some cases it has been recruited to gene activation pathways (Gao et al. 2014). Many of these systems are multi-component complexes or multifunctional molecules that recognize a combination of chromatin features (Ruthenburg et al. 2007). lncRNAs can also recruit histone modifying enzymes independently of PolyComb (Camblong et al. 2007; Houseley et al. 2008). After the transcription bubble has passed, Pol-II recruits proteins to epigenetically tag transcribed sequences, so as to repress promoter sequences occurring within the gene (Whitehouse et al. 2007; Yadon et al. 2010). As a side-effect of this, transcription of one RNA blocks transcription of a downstream overlapping transcript, yet another role of an RNA-controlled process (Thebault et al. 2011).

All of these interact and compete with each other for binding to target proteins and microRNAs (Tay et al. 2014). However, this bewildering catalog of complexity does not imply anything unique in eukaryotes, only the extraordinary expansion of capabilities seen to evolve in other systems (siRNA, simple repressor-operator systems) or other domains of life. All the processes which we summarize very briefly above have precedence in preceding paragraphs in terms of targeting DNA and protein modification, DNA transcription, and the associated enzymes through interactions of DNA, RNA, and protein with each other, sometimes in complexes that link distant genes. What is different in metazoan genomes is the amount of this activity, not its nature.

Splicing and Other RNA Roles

RNA splicing is found in all domains of life and it is likely that splicing was a mechanism that LCA had already evolved. Self-splicing Group II introns are not present in eukaryotic nuclear genome (Edgell et al. 2011), spliceosomal introns only in eukaryotes (see (Dumesic et al. 2013; Martin and Koonin 2006; Pyle 2012; Roy and Gilbert 2006) for reviews on the origin of splicesomal introns). While RNA-catalyzed self-splicing is the distinctive hallmark of these introns, their splicing in vivo requires protein maturases that accelerate splicing chemistry and act as RNA chaperone proteins [reviewed in (Lambowitz and Zimmerly 2004; Meng et al. 2005)]. Fourteen nuclear-encoded proteins are required for splicing Chlamydomonas chloroplast Type II ‘self-splicing’ introns, most of which share no similarity with the components of the nuclear spliceosome (Rivier et al. 2001). Combined protein and RNA machinery to rearrange RNA appears to have evolved multiple times, with varying degrees of dependence on the protein component (Meng et al. 2005).

The unrelated mechanism of translational skipping has a similar net effect to RNA splicing; the generation of a protein from non-adjacent regions of a transcript. Short ribosomal frameshifting is found ubiquitously, with different chemistry in different domains speaking to different origins (Belew et al. 2014; Brierley et al. 1989; Chandler and Fayet 1993; Cobucci-Ponzano et al. 2005; Dinman 2012; Lang et al. 2014)

RNA is also central to priming DNA synthesis at specific sites in prokaryotes and eukaryotes. However, a large number of phage and eukaryotic viral examples show that proteins can also prime DNA synthesis (Salas 1991). RNA plays structural roles in ribosomes, telomerase and other structures [reviewed in (Cech and Steitz 2014)], but as these are common to all the forms of life that use those structures, we assume these are homologies, not analogies.

RNAs can compete with DNA-binding proteins, including RNA polymerase in E. coli (Wassarman 2007) but also factors such as steroid receptor proteins in mammals, thus titrating their activity (Martianov et al. 2007; Poliseno et al. 2010; Salmena et al. 2011). tRNAs can also play this role (Kino et al. 2010). The RNAs concerned share no sequence similarity.

Lastly, the interaction of RNA with small molecules to modulate RNA function (‘riboswitches’) has evolved independently many times in all three domains of life (Breaker 2012; Coppins et al. 2007). Riboswitches are built from a large number of distinct motifs with limited or no sequence similarity, some broadly distributed across bacteria and archaea, some quite specific to smaller groups of organisms (Weinberg et al. 2010).

We conclude two things from this very short survey of RNA-based gene control:

  1. (i)

    That RNAs that can bind to DNA, to other RNAs or to proteins to control transcriptional activity have evolved many times

  2. (ii)

    That the functions carried out in metazoa by specific classes of RNA can be carried out by many classes of RNA in different organisms, and in many cases their functions can be performed by proteins in bacteria.

From this, we infer that the origin of function of the RNA-based chemistry of gene circuitry was a Many Path process, with many potential outcomes that would permit the subsequent Critical Path complexification of the full metazoan gene control circuitry.

Other Expression Controls

All branches of life have a wealth of (unrelated) transcription factors to control the process of initiation of RNA synthesis (Baliga et al. 2000; Bell and Jackson 2001). The transcription initiation complex in all domains shows DNA:protein as well as protein:protein interactions, with the overall architecture more similar between archaea and eukaryotes than between bacteria [with respect to the HTH Sigma family of proteins (Helmann and Chamberlin 1988)] and archaea (Soppa 2001; Weinzierl 2013), and consequent similarities in promoter sequences (Miller and Hahn 2006; Rhee and Pugh 2012). The dynamic initiation complex can be as large as a ribosome (Liu et al. 2013). There is some sequence similarity between the polymerases and some of the accessory factors between all domains (Bartlett et al. 2000; Bell and Jackson 2001), but others have evolved independently.

The transition for initiation to elongation can also be a point of control (Nechaev and Adelman 2011) as can termination. RNA polymerase can bind to a promoter and then ‘stall’ (Core et al. 2008; Muse et al. 2007; Nechaev et al. 2010) through several, different mechanisms (FitzGerald et al. 2006; Hendrix et al. 2008; Li and Gilmour 2013). RNA can interact with the translation process in a variety of ways to modulate translation—similar effects using completely different mechanisms have evolved for RNA modulation of translation in bacteria and eukaryotes (Grigg and Ke 2013; Valencia-Sanchez et al. 2006).

Elongation and termination require specific protein complexes, both have substantial differences in the three domains of life (see (Braglia et al. 2005; Jeong et al. 1995; Mischo and Proudfoot 2013) and refs therein): one eukaryotic termination system has similarities to PolyComb (Camblong et al. 2009), and formation of R-loops over G-rich terminators induce antisense transcription of the recently transcribed gene, and hence recruitment of histone methylation and siRNA mechanisms (Skourti-Stathaki et al. 2014). RNA degradation is also controlled, often through polyadenylation. Although the core polynucleotide phosphorylase activity has similarity (and hence presumed homology) between archaea, bacteria and eukaryotes, the polyadenylation complexes differ, and yeast at least has two different polyadenylation-based RNA degradation systems for ‘correct’ and aberrantly folded RNA [reviewed in (Houseley and Tollervey 2008)].

Protein turnover is modulated by a range of systems in prokaryotes (Battesti and Gottesman 2013) and eukaryotes (Geng et al. 2012; Pickart 2001). Protein abundance in eukaryotes (in mammals anyway) is controlled mostly at the level of translation, not protein breakdown (Schwanhausser et al. 2011). The role of the ubiquitin-proteasome system (now known to involve a range of protein tags) is mostly to clear degraded or misfolded proteins (Geiss-Friedlander and Melchior 2007; Schwartz and Hochstrasser 2003).

Thus the same general points apply—there are many systems, overlapping controls, and independent origins for many of them in different lineages.

The Evolutionary Step to Eukaryotic Gene Control

In the previous section, we have emphasized the following:

  1. (i)

    There are multiple types of control of gene activity in eukaryotes that overlap with each other

  2. (ii)

    That different control functions evolved many times, even if their specific chemistry is unique to each example, and that the same general type of genetic function is often carried out by different chemistries in different organisms

  3. (iii)

    That many types of control chemistry in eukaryotes have precedent in bacteria or archaea.

We argue that this shows that the evolution of the complex genome of (say) the metazoa is a Many Paths process, one that takes time but is highly likely to happen. If this is so, why are eukaryotes so obviously more genetically more complex than prokaryotes? Why are there not prokaryotes with as complex genomes as, for example, C. elegans?

We do not have a robust answer to this, but our analytical approach suggests a direction for hypothesizing. One core feature of eukaryotic gene control apparently appeared once early in eukaryotic evolution, and has not appeared in other lineages. Archaeal chromatin and bacterial DNA compaction proteins do not (in general) block transcription (Weinzierl 2013; Xie and Reeve 2004), unlike eukaryotic nucleosomes. [Though some archaeal histone-like proteins inhibit transcription in vitro, these systems are not exact models for the in vivo case (Chang and Luse 1997; Soares et al. 1998)]. Transcription of prokaryotic genes is under the control of sequences that recruit RNA polymerase to a gene, or recruit polymerase-recruiting or blocking proteins in bacteria and in archaea, despite the latter’s having RNA polymerase complexes similar to those in eukaryotes (Geiduschek and Ouhammouch 2005; Reeve 2003). By contrast, in eukaryotic organisms there is a global repression system for all genetic activity, and transcription of eukaryotic DNA requires relieving this global repression by energy-consuming modification of chromatin (Kireeva et al. 2002; Mizuguchi et al. 2004; Olave et al. 2002; Shen et al. 2000) as well as sequence-specific recruitment of specific transcription factors (Kireeva et al. 2005; Li et al. 2007). In short, the logic of eukaryotic chromatin is at a default ‘off’ state, whereas the nucleoprotein in other domains of life is at default ‘on.’

This is reflected in several functional observations. Expression vectors are engineered with specific genetic elements to ensure high levels of gene transcription. Those that function in bacteria and archaea can rely on chromosomal promoters alone, even if they integrate into the genome (although the T7 phage promoter is also popular in E. coli), and other genetic elements are included only to block transcription through repressor/operator control circuits. By contrast, eukaryotic vectors almost always have to contain viral promoters that have evolved to abrogate chromosomal gene control, and also additional viral enhancer sequences or complex chromatin control elements to enhance transcription (McCarty et al. 2004; Miller 1992) even if they replicate as episomes (Mumberg et al. 1995) (summarized in Figure 2). In bacteria, simply having promoter sequences that recruit transcription enzymes is sufficient to ensure transcription; in eukaryotes additional sequences to flag a sequence as transcribable are required.

Fig. 2
figure 2

Overview of expression vector components. Summary of features included in 159 expression vectors that drive expression of inserted genes. Vectors were classified as to whether the primary promoter sequence was viral or chromosomal. Vectors are also classified as to whether other, additional elements controlling expression levels were present—viral promoters, viral enhancers (‘viral-e’), synthetic enhancer elements including complex chromatin modulating synthetic segments (Williams et al. 2005) (‘synth-e’), chromosomal elements, and chromosomal operator elements of the Lac-operon, negative regulatory type (Chr-Op), or no additional elements over the base promoter (‘none’). Vectors are shown by the Domain in which they are designed to express protein. Data of mammalian and bacterial vectors from (EMBL 2015a; EMBL 2015b; Merck Millipore Inc. 2015; Promega Corp. 2015), Archaeal vectors from (Albers et al. 2006; Allers 2010; Allers et al. 2010; Aravalli and Garrett 1997; Contursi et al. 2003; Lucas et al. 2002; Peng et al. 2012; Santangelo et al. 2008; Schreier et al. 1999; Stedman et al. 1999; Zheng et al. 2012). Multiple families of vectors with essentially identical control systems and differing only in gene insertion sites or selectable genes are counted as one entry

Gene duplication is common in all domains of life, but in eukaryotes duplicate genes that have mutated to become pseudogenes are often retained in the genome (Mighell et al. 2000; Vanin 1985), whereas in bacteria and archaea they rarely are (Liu et al. 2004). In eukaryotes a high pseudogene load is the mark of a large genome, in prokaryotes it is the mark of the highly degraded genomes of evolving parasites, such as Mycobacterium leprae and Yersinia pestis [reviewed in (Bentley and Parkhill 2004)]. Even r-strategist eukaryotes like yeast have ~5% of their genome as pseudogenes (Harrison et al. 2002). We see this as supporting evidence for basic differences in control logic. In eukaryotes a gene is by default ‘off’ unless specifically activated, so pseudogenes are almost all transcriptionally silent (Zheng and Gerstein 2007) and hence of little phenotypic relevance. In prokaryotes a gene with a promoter attached is by default ‘on’ unless repressed, and so a mutated gene will have a significant chance of producing an aberrantly functional protein, and pseudogenes are observed to be efficiently selected against (Kuo and Ochman 2010).

Weaker but still intriguing support for the idea that mammalian genes are by default ‘off’ comes from somatic cell fusion experiments. Two decades of this now neglected area of research showed that if two cell lines that show different, differentiation-specific gene expression patterns are fused, the differentiation-specific genes that are expressed in only one originating cell line are usually not expressed in the hybrid [reviewed in (Gourdeau and Fournier 1990; Weiss 1982)]. This phenomenon (termed extinction) suggests that in competition between the expression status of a particular gene between the two genome states (‘on’ in one cell, ‘off’ in the other), the ‘off’ state is usually dominant. Derivative cell lines that have lost chromosomes often show re-expression of the differentiated phenotype, showing that the epigenetic imprinting of the differentiation-specific genes is not over-written, it is just suppressed by more powerful ‘off’ signals. Immortalized cell lines and cell fusion are not physiologically normal states, but the observation supports our general thesis.

Why is this relevant, if the chemistry of gene control can evolve multiple times? In complex organisms, all of the control systems described above interact with each other to define cell- and tissue-specific gene expression patterns (and hence phenotypes). In examples ranging from the mammalian development of white and brown fat (Peirce et al. 2014), and neurogenesis (Jobe et al. 2012; Schouten et al. 2012) to yeast mating type loci control (Buhler and Moazed 2007; Grewal and Rice 2004) we see all of miRNA, piRNA, protein transcription factors, specific DNA sequence elements, histone methylation, and acetylation used in a spaghetti code of interactions to define the biological endpoint. Even apparently highly specific enzymes such as telomerase are found, on closer examination, to have multiple roles in gene control (Li and Tergaonkar 2014).

The complexity of genetic circuits is therefore not just a function of the number of coding and regulatory elements, but of the number of ways they can interact, so that the number of distinct genetic programs is a polynomial function of genome size. It was a well-known observation from the dawn of molecular genetics that most of the genome is not transcribed in most cells of a multicellular body, nor in single-celled organisms most of the time [see for example (Chu et al. 1998; Ghaemmaghami et al. 2003; Menssen et al. 2011; Narlikar et al. 2010; Rabbani et al. 2003; Yamashita et al. 2000)]. To add a new set of genes to a genome, not only must a unique control network for that gene set be created, but a way of not activating all the other genes in the genome must be implemented as well. If the default status of genes is ‘off,’ then this second task is already achieved. If the default state of the genes is ‘on,’ then the first task is easier, but the second requires modulation of the control system for every other gene in the organism Footnote 2 . We note that in eukaryotes (animals, anyway) general release of the chromatin-mediated repression of genes is profoundly toxic ((Frost et al. 2014), and references therein).

Thus, we postulate that the evolution of a genome in which the default expression status was ‘off’ was the key, and a unique, innovation that allowed eukaryotes to evolve the complex control systems that they show today, not the evolution of any of those control systems per se. Whether the evolution of a ‘default off’ logic was a uniquely unlikely, Random Walk event or a probable, Many Paths event is the subject of future work.

Summary and Conclusions

In this paper, we present a simple classification of evolutionary innovations based on what sort of process leads to the appearance of the function that those innovations provide. We suggest that the process of innovation may be classified as follows:

  • Random walk (improbable, unlikely to be duplicated);

  • Many Paths (probable, likely to be duplicated through different mechanisms); and

  • Critical Path (probable, likely to occur multiple times in the same form).

We have sketched the vast field of gene control chemistry to show that all the key functions of gene control in eukaryotes are carried out by multiple classes of molecules, that similar molecules have adopted different functions in eukaryotes, prokaryotes, and archaea, and that there is good evidence for the independent evolution of many control chemistries and processes in different lineages and domains. All these observations support the idea that the development of eukaryotic gene control circuitry was a Many Paths process. Many Paths processes are highly likely to occur within a defined time ‘window’ given suitable environmental conditions; the timescale depends on the pace of the underlying individual component innovations, and the width of the ‘window’ depends on the number of possible innovations and the number of actual innovations needed to achieve the overall function [discussed in more detail in (Bains 2000; Bains 2004)]. As the timing of the appearance of both eukaryotes and of complex, multicellular genomes is controversial, it will be hard to constrain either timescale or window. However, the analysis does not depend on doing so and does suggest that evolution of a complex genome comparable to a modern plant or animal was not an unlikely outcome given the origin of life.

We suggest that a key difference between prokaryotes and eukaryotes is that the nucleoprotein of prokaryotes is by default open to transcription (‘on’), while that in eukaryotes is by default transcriptionally inactive (‘off’). From the appearance of this ‘default off’ state, the evolution of complex genomes was a likely Many Paths process.

We wish to emphasize that our analysis does not remove chance from large-scale evolution. The Chicxulub impact did have a profound impact on macro-faunal evolution (Archibald 2011). However, we should not over-glamorize these unique events, nor postulate that other unseen unique events are key to evolutionary innovation. The specifics of chemistry and topology of individual eukaryotic genomes are undoubtedly both unique and extremely unlikely to evolve twice. The evolution of complex genetic controls in eukaryotes was not deterministic. But the evolution of complex genomes was highly likely.