Since the early days of evolutionary biology, many different conceptualizations of what a species is have been proposed (see de Queiroz 2007 for a review). Some of the most popular species concepts include a biological species concept based on reproductive capability between individuals (which may not work for bacteria, although see the discussion below), a phylogenetic species concept (which could be based statistically upon natural breaks in branch lengths between individuals in a phylogeny), an arbitrary sequence distance threshold that approximates this (used in metagenomic data analysis; see Pust and Tümmler 2021 as an example), an observable phenotype species concept (which is not robust to phenotypic plasticity), an ecological species concept based upon the role played in an ecosystem, and, relatedly, a metabolic species concept (Fasani and Savageau 2014). Most of these species concepts are empirical, based upon the observation and characterization of individuals together with their clustering.

Here I introduce a theoretical perspective on species that has some relationship to almost all of these species concepts, based upon the genotype–phenotype map. For a genome of size 10^9 base pairs, there are approximately 4^(10^9) possible genomes (see Dervan 1986 for a more precise calculation, given that DNA is double stranded). This, of course, describes only the genomes of precisely 10^9 base pairs. Genome sizes can in principle be as small as the length of an expressed replicator. At the other extreme, a mix of transposable elements and other repetitive DNA, together with whole-genome duplication, has given rise to genomes at least 3 orders of magnitude larger than the human genome (Oliver et al. 2007). Let us arbitrarily use 10^13 as a revised maximum size, although the logic applies to any number used here. Let us also add a fifth state, absent, at each position so that the space is nested and includes all smaller genomes. That then gives us a space of approximately 5^(10^13) possible genomes. There is some redundancy in this space, which is shown together with a conceptualization in Fig. 1.
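To give a feel for the scale of these numbers, the short sketch below works with their base-10 logarithms, since the integers themselves are far too large to write out. The function name `log10_space_size` is purely illustrative, and the calculation ignores the double-stranded correction mentioned above.

```python
import math


def log10_space_size(n_states: int, length: int) -> float:
    """Return log10(n_states ** length) without building the huge integer."""
    return length * math.log10(n_states)


# Genomes of exactly 10^9 positions with 4 nucleotide states.
fixed_length_exponent = log10_space_size(4, 10**9)

# Nested space: 10^13 positions with 5 states (4 nucleotides plus "absent").
nested_exponent = log10_space_size(5, 10**13)

print(f"4^(10^9)  is roughly 10^({fixed_length_exponent:.4e})")
print(f"5^(10^13) is roughly 10^({nested_exponent:.4e})")
```

Under these assumptions, the nested space has a decimal exponent about four orders of magnitude larger than that of the fixed-length space, which is driven almost entirely by the longer maximum length rather than by the extra state.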

Fig. 1

The representation of nested sequence space is presented for toy genomes of lengths 1 and 2. Here, for the genome of length 2, if the total possible genome length is 10^13, then i, j, and k have values between 0 and 10^13 such that their sum is 10^13. From this characterization, it becomes obvious that there are many possible genomes of lengths 1 and 2 that are identical but have different representations in sequence space.
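The redundancy can be made concrete with a brute-force toy example. The sketch below is not the author's construction; it assumes a total length of only 3 positions, uses `-` as a stand-in symbol for the absent state, and treats two padded strings as the same genome when they are identical after the absent positions are removed. Under those assumptions, a genome of length n has C(L, n) representations in a space of L positions.

```python
from collections import Counter
from itertools import product
from math import comb

ALPHABET = "ACGT-"  # '-' is an assumed symbol for an absent position


def representation_counts(total_length: int) -> Counter:
    """Count how many padded strings collapse to each gap-free genome."""
    counts = Counter()
    for padded in product(ALPHABET, repeat=total_length):
        genome = "".join(padded).replace("-", "")
        counts[genome] += 1
    return counts


L = 3  # toy total length; the text uses 10^13
counts = representation_counts(L)

print(f"{5**L} padded strings encode {len(counts)} distinct genomes")
for genome in ("A", "AC", "ACG"):
    # Enumerated count next to the binomial coefficient C(L, n).
    print(genome, counts[genome], comb(L, len(genome)))
```

Even at this tiny scale, each genome of length 1 or 2 is represented three times, which is the kind of redundancy the figure depicts.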

A large subset of these ~5^(10^13) possible genomes will not encode life, so there will be a lot of empty space in the mapping of this to phenotype space. There are some core features necessary for cellular life (excluding viruses), including a replicase, some DNA repair, features of transcription and translation, and some combination of metabolism and transport (see, for example, Mushegian and Koonin 1996; Coleman et al. 2021), that must be encoded somewhere in the set of bases. However, because evolution proceeds on this surface, the genomes of all extant individuals must be connected on it by evolutionary processes (this is meant as a genomic extension of a classic model for protein evolution; see Ogbunugafor and Hartl 2016 for a recent discussion of this classic model). Each viable genome will have some combination of observable phenotypes that would correspond to morphological, behavioral, and metabolic species concepts. Only a tiny subset of them will ever have been observed in nature. How these viable hypothetical genomes are distributed in the space may be connected to the phylogenetic species concept. If monophyly is to be emphasized for sufficiently diverged species, it would need to be made an explicit part of any species definition based upon the classification of phenotypes. For example, the thylacine and the wolf might be discontiguous in genotype space but give rise to similar enough phenotypes to cluster together under a naïve phenotype-only definition (Rovinsky et al. 2021).
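The idea that viable genomes are connected by mutational steps can be sketched as a graph problem. The example below is a toy under loud assumptions: the four "viable" padded genomes are invented for illustration, `-` stands for an absent position, and the only mutational move allowed is a change of state at a single position (including absent to nucleotide). It then groups the viable genomes into connected components with a breadth-first search.

```python
from collections import deque


def is_single_step(a: str, b: str) -> bool:
    """True when two padded genomes differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def connected_components(viable: set) -> list:
    """Group viable genomes that are reachable from one another by single steps."""
    unvisited = set(viable)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            linked = {g for g in unvisited if is_single_step(current, g)}
            unvisited -= linked
            component |= linked
            queue.extend(linked)
        components.append(component)
    return components


# Hypothetical viable set: three genomes linked by single steps, one isolated.
VIABLE = {"AC--", "ACA-", "ACAC", "CA-C"}
components = connected_components(VIABLE)
print(sorted(sorted(c) for c in components))
```

In this toy, the isolated genome could not have been reached by evolution from the others under the assumed move set, which is why allowing larger mutational events changes the connectivity of the space.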

Embedded in this definition will be the gene content and exact sequences giving rise to sets of activities and expression levels that combine to make pathways and ultimately organisms that interact with their environment and each other. These are the rules that make a human a human and a salmon a salmon. Each species is defined by a large set of possible sequences. It will have boundaries and a defined volume that will differ between species. How this volume relates to other life history and population genetic traits is unclear but interesting. The connectivity of the space in viable genomes (through different types of mutational events) will describe both the expected speciation rate out of a particular species over evolutionary time and the sequence diversity that is available to sample within a species. See Fig. 2 for a depiction. It should be noted that larger-scale mutational events will naturally lead to empty spaces within a species volume due to the nature of the genotype–phenotype map, and this needs to be considered carefully. For example, an insertion of length 1 or 2 toward the beginning of a coding sequence, which flips one or two positions from absent to any nucleotide, may be nonviable because it shifts the reading frame, while an insertion of length 3, which flips one further position from absent to any nucleotide, restores the reading frame and may create a viable genome for that species. One then needs to be mathematically precise in defining species boundaries using such an approach if one is to maintain monophyly as a core principle. A mathematical framework that could embed a genomic conceptualization of sequence space has in fact been described by Dress et al. (2010).
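The insertion example can be illustrated with a minimal sketch. The coding sequence below is hypothetical, the inserted bases are arbitrary, and retention of the downstream protein sequence is used only as a crude stand-in for viability, which is an assumption of this example rather than a claim about any real gene. Translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate complete codons from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def insert_bases(cds: str, position: int, bases: str) -> str:
    """Return the sequence with `bases` inserted before index `position`."""
    return cds[:position] + bases + cds[position:]


# Hypothetical toy coding sequence: start codon, six codons, stop codon.
ORIGINAL = "ATGGCTGAAACCGGTAAACTGTAA"
proteins = {
    n: translate(insert_bases(ORIGINAL, 3, "C" * n)) for n in (1, 2, 3)
}

print("original:", translate(ORIGINAL))
for n, protein in proteins.items():
    print(f"insertion of length {n}:", protein)
```

In this toy, the insertions of length 1 and 2 scramble or truncate everything downstream of the insertion site, while the insertion of length 3 adds a single amino acid and leaves the rest of the protein intact.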

Fig. 2

A geometric representation of species is presented, where each species is depicted with a different hypersphere-like volume representing all of the possible genomic sequences (including missing nucleotides) that give rise to that species. This shape is expected to be irregular in a high-dimensional space. In the case of speciation between species A and species B, the intersection point of the spheres would be the transitional boundary of speciation in sequence space. In the case of species C with the A–B ancestor, there is a small projection that might be defined as either species C or species B at various points. Here, the ancestor of species A and B would have been more B-like in genotype and phenotype. The phylogenetic tree corresponding to the moves in sequence space is shown at left. The mutations along the dotted arrow would correspond to those that occurred along the dotted branches.

It was suggested above that a species concept based upon reproduction would not work for bacteria. It should be noted that Diop et al. (2022) suggest that gene flow causes bacteria to behave evolutionarily much more like sexual species. In this model, gene flow occurs much more frequently between organisms with more similar genomes. This reinforces a view of a genome-based species concept for bacterial lineages as well.

Perhaps this conceptualization does not change much, but it does offer the benefit of defining species in the absence of pure observation, as well as potentially defining some mathematical properties of species described in genotypic space. This conceptualization cannot currently be put into practice, but given advances in machine learning, theoretical understanding of the first principles driving different layers of the genotype–phenotype map, and genome-wide association studies with large amounts of data in different species, the actualization of such an approach is conceivable. Already, for dogs and for humans, some level of phenotypic prediction based upon a less than full understanding of genetic processes is possible (see, for example, Brand et al. 2022; Morrill et al. 2022). If nothing else, this is a conceptualization to think about when defining species in various contexts.