Abstract
Species concepts have been defined through a number of lenses, but they are almost entirely empirical in nature. This article discusses an interpretation of genomic data through a species classification filter based upon a theoretical genotype–phenotype map with a monophyly requirement, an interpretation that is fundamentally linked to various existing species concepts.
Since the early days of evolutionary biology, many different conceptualizations of what a species is have been generated (see De Queiroz, 2007 for a review). Some of the most popular species concepts include a biological species concept involving reproductive capability between individuals (which may not work for bacteria, although see the discussion below), a phylogenetic species concept (which could be based statistically upon natural breaks in branch lengths between individuals in a phylogeny), an arbitrary sequence distance approximation to this (used in metagenomic data analysis; see Pust and Tümmler 2021 as an example), an observable phenotype species concept (which is not robust to phenotypic plasticity), an ecological species concept based upon the role played in an ecosystem, and relatedly a metabolic species concept (Fasani and Savageau, 2014). Most of these species concepts are empirical, based upon the observation and characterization of individuals together with their clustering.
Here I introduce a theoretical perspective on species that has some relationship to almost all of these species concepts, based upon the genotype–phenotype map. For a genome of size 10^9 base pairs, there are approximately 4^(10^9) possible genomes (see Dervan 1986 for a more precise calculation, given that DNA is double stranded). This, of course, counts only the genomes of precisely 10^9 base pairs. Genome sizes can in principle be as small as the length of an expressed replicator, while a mix of transposable elements and other repetitive DNA, together with whole-genome duplication, has given rise to genomes at least 3 orders of magnitude larger than the human genome (Oliver et al. 2007). Let us arbitrarily use 10^13 base pairs as a revised maximum size; the logic applies to any number chosen here. Let us also add a 5th state, absent, at each position so that the space nests all smaller genomes. That then gives us a space of approximately 5^(10^13) possible genomes. There is some redundancy in this space, which is shown together with a conceptualization in Fig. 1.
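To give a feel for the scale of these numbers, the counts can be expressed as powers of ten by working in logarithms, since the integers themselves are far too large to write out. The following is a minimal sketch; the function name `log10_genome_count` is illustrative only, and the calculation ignores the redundancy and the double-stranded correction mentioned above.

```python
import math

def log10_genome_count(states: int, length: int) -> float:
    """Return log10(states ** length) without constructing the huge integer."""
    return length * math.log10(states)

# Genomes of exactly 10^9 positions with 4 nucleotide states.
fixed = log10_genome_count(4, 10**9)

# Up to 10^13 positions with a 5th "absent" state at each position.
nested = log10_genome_count(5, 10**13)

print(f"4^(10^9)  is about 10^({fixed:.4e})")
print(f"5^(10^13) is about 10^({nested:.4e})")
```

Under these assumptions, the first space has a number of members with roughly 6 × 10^8 decimal digits and the second roughly 7 × 10^12 decimal digits.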
A large subset of these ~5^(10^13) possible genomes will not encode life, so there will be a lot of empty space in the mapping of this to phenotype space. There are some core features necessary for cellular life (excluding viruses), including a replicase, some DNA repair, features of transcription and translation, and some combination of metabolism and transport (see, for example, Mushegian and Koonin 1996; Coleman et al. 2021) that must be encoded somewhere in the set of bases. However, as evolution is happening on this surface, all of the genomes of extant individuals must be connected on the surface by evolutionary processes (this is meant as a genomic extension of a classic model for protein evolution, see Ogbunugafor and Hartl 2016 for a recent discussion of this classic model). Each viable genome will have some combination of observable phenotypes that would correspond to morphological, behavioral, and metabolic species concepts. A tiny subset of them will have ever been observed in Nature. How these viable hypothetical genomes distribute on the space may be connected to the phylogenetic species concept. In emphasizing the importance of monophyly for sufficiently diverged species, one would need to make this part of a species definition based upon the classification of phenotypes. For example, the thylacine and the wolf might be discontiguous in genotype space but give rise to similar enough phenotypes to cluster together from a naïve phenotype-only definition (Rovinsky et al. 2021).
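The difference between a phenotype-only grouping and one that also requires contiguity in genotype space can be illustrated with a toy example. This is a sketch under strong assumptions that are not part of the argument above: genomes are four-letter strings, the only mutation type is a single substitution, the phenotype labels in `toy` are invented, and connectivity through viable genomes of the same phenotype is used as a simple stand-in for the monophyly requirement.

```python
from collections import deque

ALPHABET = "ACGT"

def neighbors(genome: str):
    """Yield every genome that is one point substitution away."""
    for i, base in enumerate(genome):
        for alt in ALPHABET:
            if alt != base:
                yield genome[:i] + alt + genome[i + 1:]

def species_clusters(viable: dict) -> list:
    """Group viable genomes that share a phenotype AND are connected by
    single substitutions passing only through genomes of that phenotype."""
    seen = set()
    clusters = []
    for start in sorted(viable):
        if start in seen:
            continue
        phenotype = viable[start]
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over the viable neighborhood
            current = queue.popleft()
            for nxt in neighbors(current):
                if nxt not in seen and viable.get(nxt) == phenotype:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        clusters.append((phenotype, component))
    return clusters

# Hypothetical viable genomes mapped to hypothetical phenotype labels.
toy = {
    "AAAA": "wolf-like", "AAAC": "wolf-like", "AACC": "wolf-like",
    "GGGT": "wolf-like", "GGTT": "wolf-like",  # convergent but distant
    "ACCC": "salmon-like",
}

clusters = species_clusters(toy)
for phenotype, members in clusters:
    print(phenotype, sorted(members))
```

In this toy space, a phenotype-only rule would merge all five "wolf-like" genomes, whereas the connectivity rule separates them into two groups, analogous to the thylacine and the wolf.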
Embedded in this definition will be the gene content and exact sequences giving rise to sets of activities and expression levels that combine to make pathways and ultimately organisms that interact with their environment and each other. These are the rules that make a human, a human and a salmon, a salmon. Each species is defined by a large set of possible sequences. It will have boundaries and a defined volume that will differ between species. How this volume relates to other life history and population genetic traits is unclear but interesting. The connectivity of the space in viable genomes (through different types of mutational events) will describe both the expected speciation rate out of a particular species over evolutionary time as well as the sequence diversity that is available to sample within a species. See Fig. 2 for a depiction. It should be noted that larger-scale mutational events will naturally lead to empty spaces within a species volume due to the nature of the genotype–phenotype map and this needs to be considered carefully. For example, an insertion of length 1 or 2 toward the beginning of a coding sequence that flips two positions from absent to any nucleotide may be nonviable, while an insertion of length 3 that flips an extra base from absent to any nucleotide may create a viable genome for that species. One then needs to be mathematically precise in defining species boundaries using such an approach if one is to maintain monophyly as a core principle. A mathematical framework that could embed a genomic conceptualization of sequence space has in fact been described by Dress et al. (2010).
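The insertion example can be made concrete with a fixed-length encoding in which unused positions carry the absent state. The sketch below is illustrative only: the symbol `-` for absent, the short sequences, and the viability rule in `toy_viable` (an intact reading frame from a start codon to a single terminal stop codon) are all simplifying assumptions, not a model of real viability.

```python
ABSENT = "-"
STOPS = {"TAA", "TAG", "TGA"}

def expressed(genome: str) -> str:
    """Drop absent positions to obtain the actual nucleotide sequence."""
    return genome.replace(ABSENT, "")

def toy_viable(genome: str) -> bool:
    """Toy rule: starts with ATG, length is a multiple of 3, ends with a
    stop codon, and has no earlier in-frame stop codon."""
    seq = expressed(genome)
    if len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return codons[-1] in STOPS and not any(c in STOPS for c in codons[:-1])

# All four genomes occupy the same 18 positions of the 5-state space.
base = "ATG---GCTGAAACCTAA"  # three absent positions after the start codon
ins1 = "ATGC--GCTGAAACCTAA"  # insertion of length 1: frame shifted
ins2 = "ATGCA-GCTGAAACCTAA"  # insertion of length 2: frame shifted
ins3 = "ATGCAGGCTGAAACCTAA"  # insertion of length 3: frame restored

for name, genome in [("base", base), ("ins1", ins1), ("ins2", ins2), ("ins3", ins3)]:
    print(name, toy_viable(genome))
```

Here `ins1` and `ins2` are adjacent to viable genomes in the encoded space yet fail the toy rule, while `ins3` passes, which is the kind of hole within a species volume described above.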
It was suggested above that a species concept based upon reproduction would not work for bacteria. It should be noted that Diop et al. (2022) suggest that gene flow causes bacteria to behave evolutionarily much more like sexual species. In this model, gene flow occurs much more frequently between organisms with more similar genomes. This reinforces a view of a genome-based species concept for bacterial lineages as well.
Perhaps this conceptualization does not change much in practice, but it does offer the benefit of defining species in the absence of pure observation, as well as potentially defining some mathematical properties of species described in genotypic space. This conceptualization is not yet practically accessible, but given advances in machine learning, theoretical understanding of the first principles driving different layers of the genotype–phenotype map, and genome-wide association studies with large datasets in different species, the actualization of such an approach is conceivable. Already, for dogs and for humans, some level of phenotypic prediction based upon a less than full understanding of genetic processes is possible (see, for example, Brand et al. 2022; Morrill et al. 2022). If nothing else, this is a conceptualization to think about when defining species in various contexts.
References
Brand CM, Colbran LL, Capra JA (2022) Predicting archaic hominin phenotypes from genomic data. Annu Rev Genomics Hum Genet 23:591–612
Coleman GA, Davín AA, Mahendrarajah TA, Szánthó LL, Spang A, Hugenholtz P, Szöllősi GJ, Williams TA (2021) A rooted phylogeny resolves early bacterial evolution. Science 372:eabe0511. https://doi.org/10.1126/science.abe0511
De Queiroz K (2007) Species concepts and species delimitation. Syst Biol 56:879–886. https://doi.org/10.1080/10635150701701083
Dervan PB (1986) Design of sequence-specific DNA-binding molecules. Science 232:464–471. https://doi.org/10.1126/science.2421408
Diop A, Torrance EL, Stott CM, Bobay LM (2022) Gene flow and introgression are pervasive forces shaping the evolution of bacterial species. Genome Biol 23:239. https://doi.org/10.1186/s13059-022-02809-5
Dress A, Moulton V, Steel M, Wu T (2010) Species, clusters and the ‘Tree of life’: a graph-theoretic perspective. J Theor Biol 265:535–542. https://doi.org/10.1016/j.jtbi.2010.05.031
Fasani RA, Savageau MA (2014) Evolution of a genome-encoded bias in amino acid biosynthetic pathways is a potential indicator of amino acid dynamics in the environment. Mol Biol Evol 31:2865–2878. https://doi.org/10.1093/molbev/msu225
Morrill K, Hekman J, Li X, McClure J, Logan B, Goodman L, Gao M, Dong Y, Alonso M, Carmichael E, Snyder-Mackler N, Alonso J, Noh HJ, Johnson J, Koltookian M, Lieu C, Megquier K, Swofford R, Turner-Maier J, White ME, Weng Z, Colubri A, Genereux DP, Lord KA, Karlsson EK (2022) Ancestry-inclusive dog genomics challenges popular breed stereotypes. Science 376:eabk0639. https://doi.org/10.1126/science.abk0639
Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93:10268–10273. https://doi.org/10.1073/pnas.93.19.10268
Ogbunugafor CB, Hartl DL (2016) A new take on John Maynard Smith’s concept of protein space for understanding molecular evolution. PLoS Comput Biol 12:e1005046
Oliver MJ, Petrov D, Ackerly D, Falkowski P, Schofield OM (2007) The mode and tempo of genome size evolution in eukaryotes. Genome Res 17:594–601. https://doi.org/10.1101/gr.6096207
Pust MM, Tümmler B (2021) Identification of core and rare species in metagenome samples based on shotgun metagenomic sequencing, Fourier transforms and spectral comparisons. ISME Commun 1:2. https://doi.org/10.1038/s43705-021-00010-6
Rovinsky DS, Evans AR, Adams JW (2021) Functional ecological convergence between the thylacine and small prey-focused canids. BMC Ecol Evol 21:58. https://doi.org/10.1186/s12862-021-01788-8
Acknowledgements
The author would like to thank Barbara Holland and Louis-Marie Bobay for helpful comments that improved the manuscript.
Ethics declarations
Conflict of interest
The author declares no competing interests.
Additional information
Handling editor: Michelle Meyer.
Cite this article
Liberles, D.A. A Genomic Conceptualization of Species. J Mol Evol 91, 379–381 (2023). https://doi.org/10.1007/s00239-023-10111-6
DOI: https://doi.org/10.1007/s00239-023-10111-6