Short history of microbial taxonomy

The history of microbial taxonomy during the last 100 years is one of a scientific field in which progress and conservatism meet. It is progressive as it incorporates the most advanced technologies, yet conservative because it adheres to standards and rules. A number of technological driving forces were operative during its development as a scientific discipline: the introduction of metabolic and phenotypic characterization of bacteria, numerical analysis of phenotypic data (Sneath and Sokal 1973), DNA–DNA hybridizations (DDH) and %G+C determinations (De Ley 1970), 16S rRNA gene sequencing (Woese and Fox 1977), and multilocus sequence analysis (MLSA) (Gevers et al. 2005) before the introduction of complete genome sequencing (Coenye and Vandamme 2003; Coenye et al. 2005; Thompson et al. 2009). The seminal work of Carl Woese in 1977 on the discovery of the three domains of life, triggered by the use of ribosomal RNAs as evolutionary chronometers, was a new paradigm in microbial taxonomy (Woese and Fox 1977). Today, within the so-called polyphasic taxonomic approach, all schemes include measurements of evolutionary relationships using gene sequences (most notably the 16S rRNA gene) (Yarza et al. 2008) to determine the phylogenetic position of an isolate, combined with chemotaxonomic, physiological, and cultural properties (Colwell 1970; Stackebrandt et al. 2002). Polyphasic taxonomy is based on the phylogenetic framework. A comprehensive practical guide to polyphasic taxonomy has been published by Tindall et al. (2010), who state that novel taxa should be characterized as comprehensively as possible.

Polyphasic microbial taxonomy is recognized as an orthodox field, meaning that the following fixed rules are applied for species delineation: (a) DDH values of at least 70 %; (b) at least 97 % rRNA gene sequence similarity (recently 98.7 % was proposed by Stackebrandt and Ebers 2006); (c) maximum 2 % of G+C span; and (d) differentiating chemotaxonomic and phenotypic features, where great weight is placed on the phenotypic (chemotaxonomic) characterization using specialized technologies such as fatty acid methyl ester (FAME) analysis, polyamines, peptidoglycan types, sphingolipids, and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI–TOF MS). However, in most cases, these features are not very useful for discriminating all species, i.e., all species in each major lineage, nor do they shed light on the biology of the microorganisms. Rosselló-Móra (2012) has given a comprehensive overview of the state of the art in microbial taxonomy, its principles, practice, and most recent developments. This author favors the application of genome sequence information in microbial taxonomy.
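As an illustration only, the three numerical rules (a) to (c) can be written down as a simple check. The sketch below assumes that the measurements are already available as percentages; the class and function names are hypothetical, and rule (d), which requires expert judgment on phenotypic and chemotaxonomic data, is deliberately not encoded.

```python
from dataclasses import dataclass

# Thresholds as listed in the text. The rRNA cutoff is a parameter because
# 98.7 % has also been proposed (Stackebrandt and Ebers 2006).
DDH_MIN = 70.0      # rule (a): minimum DDH value, in percent
RRNA_MIN = 97.0     # rule (b): minimum rRNA gene sequence similarity, in percent
GC_SPAN_MAX = 2.0   # rule (c): maximum G+C span, in percentage points


@dataclass
class PairwiseComparison:
    """Hypothetical container for measurements between two strains."""
    ddh_percent: float
    rrna_identity_percent: float
    gc_a_percent: float
    gc_b_percent: float


def passes_numerical_rules(c: PairwiseComparison, rrna_min: float = RRNA_MIN) -> bool:
    """Return True if rules (a)-(c) all place the two strains in one species."""
    gc_span = abs(c.gc_a_percent - c.gc_b_percent)
    return (
        c.ddh_percent >= DDH_MIN
        and c.rrna_identity_percent >= rrna_min
        and gc_span <= GC_SPAN_MAX
    )


# Toy values, not real measurements.
pair = PairwiseComparison(ddh_percent=82.0, rrna_identity_percent=98.0,
                          gc_a_percent=45.0, gc_b_percent=46.2)
print(passes_numerical_rules(pair))                 # uses the 97 % cutoff
print(passes_numerical_rules(pair, rrna_min=98.7))  # uses the stricter proposal
```

The same pair of strains can pass under the 97 % cutoff and fail under the 98.7 % proposal, which shows how strongly the outcome depends on the chosen fixed threshold.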

We contend that current rules are impeding progress both in the description of new species and in the development of taxonomy as a scientific discipline. First, DDH is still considered a gold standard for species delineation in spite of demonstrations that other techniques such as multilocus sequence analysis (MLSA), average amino acid identity (AAI), and genome-to-genome distance (GGD) are portable and have greater discriminatory power (see, e.g., Gevers et al. 2005; Konstantinidis and Tiedje 2005; Auch et al. 2010). In fact, many journals specializing in taxonomy have not yet accepted alternative techniques to DDH. Moreover, because of technological and methodological hurdles, DDH is only performed by a few laboratories that are highly specialized in taxonomy (Wayne et al. 1987; Stackebrandt and Ebers 2006), and performing DDH experiments might take years, slowing the description of new species considerably. Finally, journals such as Systematic and Applied Microbiology (SAM) and International Journal of Systematic and Evolutionary Microbiology (IJSEM) require the concomitant extensive phenotypic characterization of closely related type strains every time a new species is being described (journals.elsevier.com/systematic-and-applied-microbiology/, ijs.sgmjournals.org/) even if data are available for the same type strains using the same reagents and machines in the same laboratories. Because this is time-consuming and unnecessary, it likely keeps many scientists from formally describing new microbial species. That taxonomy is a conservative science is not a new observation, and it is interesting to note that it took two decades for the acceptance of DNA–DNA hybridization as a reliable standard. However, we do not mean to imply that all rules should be overturned. In fact, deposition of strains in public collections and sequences in public databases must continue no matter what new taxonomic schemes are agreed on.

The failure of polyphasic taxonomy

Let us first state that, in the past, application of polyphasic taxonomy has enabled considerable progress and stability in microbial taxonomy and its nomenclatural legacy will be safeguarded. However, the “gold standards” of polyphasic taxonomy are increasingly outdated since orthodox microbial polyphasic taxonomy is neither able to keep up with the progress in environmental and evolutionary microbiology nor with the needs of clinical microbiologists and epidemiologists. Additionally, there is a mounting uneasiness with the definition of the microbial species itself (including bacteria and archaea).

Polyphasic taxonomy cannot keep up with the explosion in genome sequences, even at the broadest levels of taxonomic classification; at the time of writing, there are about 200 bacterial genomes in GenBank where the phyla are listed as ‘unclassified.’ Further, as tens of thousands of genomes are becoming available, the diversity within a species, much of which arises due to recombination between lineages, has led to the proposal of ‘fuzzy species’ (Fraser et al. 2007; Hanage 2013). Much of the recent progress in microbiology is due to the dramatic plunge in sequencing cost and the increase in sequencing speed; currently, sequencing a hundred small bacterial genomes at 10× coverage costs <$10 (that is, literally a few cents per genome), and third-generation sequencing methodology allows completion of a bacterial genome in a few hours. Additional costs related to, e.g., DNA extraction and library preparation, and bioinformatics (computer time), need to be taken into consideration, but will not undermine the use of genomes in species descriptions. In spite of the potential of genome sequences, only very few studies have applied genome sequences to date for new species descriptions (Moreira et al. 2014a, b).

Senior scientists who contributed to polyphasic taxonomy realize that the principles and practices of present-day polyphasic taxonomy should be questioned and microbial taxonomy be rethought. The role of microbial taxonomy is to provide a framework for reliable identification of organisms in order to learn about their functional role in a particular environment. The need to revisit polyphasic taxonomy has been articulated by Vandamme and Peeters (2014) using the taxonomy of the Burkholderia complex as an example. These authors state that “DDH had been historically introduced to approach whole-genome sequence (WGS)-derived information as closely as possible (Wayne et al. 1987) and now that we have direct access to WGS information, we want it to mimic the results obtained through (physical–chemical) DDH experiments.” This statement exemplifies the paradox of keeping DDH as a standard where attempts are being made to translate the old DDH species threshold into new WGS-based thresholds even though the information derived from the latter techniques is superior to DDH.

Approximately 600 new bacterial and archaeal species are described each year using polyphasic taxonomy (Konstantinidis and Stackebrandt 2013), and at such a pace, it will take centuries to describe even a small fraction of the novel species present in the biosphere. It is therefore clear that on purely pragmatic grounds, we can no longer proceed with the present-day orthodox polyphasic microbial taxonomy as defined by the comprehensive guidelines. We can no longer be “keeping bacterial taxonomy as the playground of a few privileged with full access to a battery of phenotypic, genotypic and chemotaxonomic tools” (Vandamme and Peeters 2014).

Another reason for the failure of polyphasic taxonomy is that the standards for species descriptions are still based on aged approaches that are not appropriate for many of the species that are currently being described. Many of the tests derive from medical microbiology as introduced in the late nineteenth and early twentieth century but are still applied to environmental isolates. Among the many examples that might be given to illustrate that polyphasic taxonomy is failing in the description of biodiversity are the cases of Burkholderia (Vandamme and Peeters 2014), Wolbachia (Ellegaard et al. 2013), and Pseudomonas (Alvarez-Pérez et al. 2013). For Wolbachia, two genetically distinct and irreversibly separated clades were distinguished (Ellegaard et al. 2013), but these cannot be described as species. One of the biggest problems is that polyphasic taxonomy is unable to deal with uncultivated microbes. In the ubiquitous SAR11 (Pelagibacter) clade, which is thought to be the most abundant bacterial group in the world’s oceans, a number of phylotypes are recognized but poorly characterized by cultivation (Giovannoni et al. 2005; Brown et al. 2012). The most extreme case is new biodiversity described by single-cell sequencing to generate reference genomes of uncultured taxa from the marine bacterioplankton, e.g., two uncultured flavobacteria described by Woyke et al. (2009). In these and many other cases, polyphasic taxonomy is of little help in describing novelty. The prokaryotic code should include the description of uncultured organisms based on whole-genome sequences, particularly now with the advent of new technologies such as single-cell genomics.

Whereas monoculture experimental standards and rules have guided the description of bacterial and archaeal species in the past, several colleagues have stressed that the time has come to integrate genomics as a reliable and reproducible standard into the taxonomy of the bacteria and archaea (Lan and Reeves 2000; Doolittle and Papke 2006; Fraser et al. 2009; Whitman 2009; Staley 2009; Klenk and Göker 2010; Zhi et al. 2012; Ellegaard et al. 2013; Chun and Rainey 2014). However, simply incorporating genome sequence data into polyphasic taxonomy as proposed by Ramasamy et al. (2014) might not be sufficient. Indeed, adding genome sequences to the list of key elements defined by Tindall et al. (2010) will not rejuvenate microbial taxonomy. We believe that taxonomists share together with ecologists and phylogenists the responsibility for a description of the microbial world. In fact, with the available genomic technology and sufficient metadata, we can construct the necessary standards and rules to develop robust and fast tools that describe and order microbial diversity.

The re-examination of the microbial species definition

A further and more fundamental failure of polyphasic taxonomy is that it uses a very broad species definition that is not based on an evolutionary species concept (Stackebrandt et al. 2002; Fraser et al. 2009). Recent progress in environmental microbiology has shown that classically described species often comprise assemblages of ecologically and genomically distinct populations. In fact, a species cutoff of 70 % as used in DDH leads to underspeciation within prokaryotes. The 16S rRNA gene, on the other hand, lacks resolution at the species level, even at the 98.7 % level. Universal cutoff levels to delineate species do not make sense since speciation is a dynamic process leading to sister taxa that are separated by variable sequence space (Shapiro and Polz 2014).

Although there is currently no consensus on a species concept for bacteria and archaea (Cohan 2001; Rosselló-Mora and Amann 2001; de Queiroz 2005; Dykhuizen 2005; Nesbø et al. 2006; Staley 2006; Fraser et al. 2007; Achtman and Wagner 2008), taxonomy may nonetheless benefit from an evolutionary framework to order bacterial, archaeal, and eukaryotic microbial diversity into more natural units (Fraser et al. 2009). This framework has been provided by WGS, which allows identification of sequence clusters at high genotypic resolution based on variation in protein-coding genes distributed across the genomes. Importantly, such clusters are consistent with the vernacular notion of a species as a group of organisms that are more similar to each other than to any other species (Polz et al. 2006; Fraser et al. 2009). The discovery of clusters also offers a practicable solution to the species dilemma, i.e., to sidestep it for the moment and to continue with the pragmatic definition of species that emphasizes the existence and description of clusters of coexisting strains that are consistently similar on a genetic and phenotypic basis. This approach is not so different from the present one; it merely shifts the emphasis to molecular data. Such clusters may be defined from multiple different sources of genetic data (core gene sequences, microarrays or whole genomes) and form tractable units to address evolutionary and ecological questions.

The focus on sequence (phylogenetic) clusters as more natural units of organization for bacteria, archaea, and eukaryotic microbes is motivated by the following considerations. First, analyses of environmental isolates and metagenomes have shown that microbial communities consist of genotypic clusters of closely related organisms, with mounting evidence that these clusters display cohesive environmental associations and dynamics that differentiate them from other such clusters coexisting in the same samples (Hunt et al. 2008; Konstantinidis and DeLong 2008; Denef et al. 2010; Caro-Quintero and Konstantinidis 2012; Kashtan et al. 2014), and recent work has shown that it is possible to construct genomic backbone scaffolds for several hundred ‘species’ from a series of metagenomic samples and to then use this for template-based assembly of genomes from individual samples (Nielsen et al. 2014; Mick and Sorek 2014). Second, recent modeling and whole-genome analysis of clusters in the very early stages of divergence has suggested that, in spite of potential for horizontal gene transfer (HGT), selection is required for cluster formation in sympatry (recently reviewed in Polz et al. 2013; Shapiro and Polz 2014). But even if clusters form in allopatry, they are free to diverge ecologically because specific alleles or genes can spread in a population (i.e., location)-specific manner, as seen in individuals from large-scale metagenomic studies (Nielsen et al. 2014).

A fact that any species definition has to contend with is that bacteria and archaea can share genes across any species boundary imposed by taxonomists via HGT (Doolittle and Zhaxybayeva 2009). At face value, this violates the biological species concept as formulated by Mayr (1942). However, there is mounting evidence that many eukaryotes speciate by hybridization and that such events occur frequently (but have somewhat low probability of survival) (Mallet 2008). Moreover, recent population genomic analyses of clusters in the early stages of divergence have shown that although HGT occurs frequently, gene flow discontinuities exist between clusters even if they remain closely related (Cadillo-Quiroz et al. 2012; Shapiro et al. 2012). At least in one case, it was also demonstrated that these gene flow discontinuities are sufficient for adaptive alleles and genes to spread in a cluster-specific manner (Shapiro et al. 2012). It is important to realize that speciation events can be transient and need not necessarily lead to species (Mallet 2008; Wiedenbeck and Cohan 2011; Shapiro and Polz 2014). Hence, it will be a challenge for microbial taxonomists to delineate species that appear to have at least some permanence in the evolutionary spectrum.

A more natural definition of microbial species as proposed in the present text also solves the problem of the frequent observation that even closely related genomes can have high gene content variation that gives rise to at least some level of phenotypic variation. If, as argued above, clusters are gene flow units within which selection acts on gene frequencies, then it is possible that gene content variation arises due to frequency-dependent selection where the fitness of a genotype within a population depends on its frequency (Fig. 1; Cordero and Polz 2014). In fact, genes at low and intermediate frequency may be involved in niche complementarity, social interactions and predator–prey interactions (Cordero and Polz 2014). It has been argued previously that many genes occurring at low frequency within genomes are involved in predation evasion by varying surface antigenicity (Rodriguez-Valera et al. 2009; Cordero and Polz 2014). Moreover, intermediate frequency genes may be involved in frequency-dependent interactions such as public good production and cheating as well as niche-complementation (Cordero and Polz 2014). This may also explain some phenotypic variation frequently observed among closely related genotypes. In the context of taxonomy, it will be important to recognize that some traits may be patchily distributed within a population. For example, any excreted enzyme may act as a public good and invite cheating within the same population or species (Cordero et al. 2012). Phenotypic variation among strains of the same species is a well-known example of possible cheating (Moreira et al. 2014a, b).

Fig. 1

Gene frequencies and the evolutionary and ecological processes, extracted from Cordero and Polz (2014). Populations are recognized as genotypic clusters separated by gene flow boundaries and can have distinct habitats. a High-frequency genes (green and orange arrows; also represented by short black lines in the gene flow map) are primarily maintained by vertical inheritance and homologous recombination. These genes are observed across multiple ecological populations and typically encode core metabolic and housekeeping functions that are independent of the different environments. b High-frequency genes (High*) can also segregate ecological populations. After being gained or lost in a population-specific manner, these genes could follow similar patterns of gene flow as other core genes. They are potentially involved in habitat-specific functions (for example, the adaptation to use either the orange or green substrates as a nutrient source). c Medium-frequency genes flow by vertical inheritance, homologous recombination, and gene loss. As illustrated in the figure, without considering population structure (in other words, that the green and orange genes are derived from two distinct populations), the frequency of these genes would be indistinguishable from that of the High* genes (50 %). Recent studies suggest that some of these genes might be involved in local biological interactions (such as those that are mediated by public goods), which create frequency-dependent selection. d Low-frequency genes reflect extremely high rates of gene turnover, which represents an evolutionary strategy to diversify, often precipitated by negative frequency-dependent selection emerging from interactions with predators (such as phage) or with the immune system (color figure online)
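The point made in panel c, that population-specific High* genes and genuinely medium-frequency genes look identical once population structure is ignored, can be reproduced with a toy calculation. The sketch below assumes gene content is available as one set of gene identifiers per strain; all strain names, gene names, and the function name are invented for illustration.

```python
def gene_frequencies(genomes):
    """Fraction of genomes that carry each gene.

    `genomes` maps a strain name to the set of gene identifiers it carries.
    """
    all_genes = set().union(*genomes.values())
    n = len(genomes)
    return {gene: sum(gene in content for content in genomes.values()) / n
            for gene in sorted(all_genes)}


# Two toy populations of four strains each (hypothetical data).
# "green_use"/"orange_use" stand in for habitat-specific High* genes,
# "flex" for a medium-frequency gene present in half of each population.
green_population = {
    "g1": {"core", "green_use", "flex"},
    "g2": {"core", "green_use"},
    "g3": {"core", "green_use", "flex"},
    "g4": {"core", "green_use"},
}
orange_population = {
    "o1": {"core", "orange_use", "flex"},
    "o2": {"core", "orange_use"},
    "o3": {"core", "orange_use", "flex"},
    "o4": {"core", "orange_use"},
}

# Ignoring population structure means pooling all strains together.
pooled = {**green_population, **orange_population}

within_green = gene_frequencies(green_population)
across_all = gene_frequencies(pooled)

print(within_green["green_use"], within_green["flex"])
print(across_all["green_use"], across_all["flex"])
```

Within the green population the habitat-specific gene is fixed while the flexible gene is at intermediate frequency; in the pooled sample both appear at the same frequency, so the two classes can only be separated when the population boundaries are known.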

In summary, although we do not have an agreed upon species definition for bacteria and archaea, we propose that genotypic (phylogenetic) clusters can serve to easily and quickly formulate hypotheses of species (or populations). The properties of these units can then be further explored by genomics as outlined in the next section. However, we also note that it is often not easy to recognize the exact boundaries of clusters. This is because the extensive history of gene transfer may create “fuzzy” boundaries and nested structure of clusters when, as is typically the practice, analyzing phylogenetic structure using trees of concatenated genes (or genomes) (Hanage et al. 2005; Hanage 2013). A challenge for the future will therefore be to develop robust techniques that, we believe, should be based on analysis of patterns of contemporary gene flow rather than sequence similarity-based clustering.
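To make the boundary problem concrete, the following minimal sketch applies plain similarity-based clustering (single linkage at a fixed identity cutoff) to a handful of invented pairwise identities. It illustrates the similarity-based approach whose limitations are noted above, not the gene-flow-based techniques called for; the strain names, identity values, and function name are assumptions made for the example.

```python
def single_linkage_clusters(names, identity, cutoff):
    """Group strains so that any pair with identity >= cutoff is linked.

    `identity` maps an unordered pair (a, b) to a percent identity.
    Linking is transitive, as in single-linkage clustering.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in identity.items():
        if value >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise genome identities (percent).
strains = ["A", "B", "C", "D"]
identities = {
    ("A", "B"): 99.0, ("A", "C"): 96.0, ("B", "C"): 96.5,
    ("A", "D"): 85.0, ("B", "D"): 85.0, ("C", "D"): 84.0,
}

print(single_linkage_clusters(strains, identities, cutoff=95.0))
print(single_linkage_clusters(strains, identities, cutoff=98.0))
```

With the looser cutoff, strains A, B, and C form one cluster; with the stricter cutoff, C splits off. The clusters are nested, and which level is called a species depends entirely on the chosen cutoff.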

Paradigm shift

Taxonomy must adjust to the genomics era, addressing the needs of its users in microbial ecology and clinical microbiology (Preheim et al. 2011), in a new paradigm of open-access genomic taxonomy (Thompson et al. 2013a). We already witness the tremendous efforts put into initiatives on prokaryote genomics, such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) (Wu et al. 2009; Klenk and Göker 2010), the Genomes OnLine Database (GOLD) (Kyrpides 1999; Pagani et al. 2012), and the Integrated Microbial Genomes system (IMG) (Markowitz et al. 2006, 2014).

Whereas the current divorce between classical taxonomy, evolution, and ecology is hampering progress, the new paradigm of genomic taxonomy provides rapid diagnostics of microbial phenotypes and niches in an open-access manner. Open-access genomic taxonomy embraces the classification of species and builds on many established genomic tools. Examples include genome signatures (e.g., genome-to-genome distance (GGD); Auch et al. 2010), average amino acid identity (AAI) (Rohwer and Edwards 2002), average nucleotide identity (ANI) (Konstantinidis and Tiedje 2005), Karlin genomic signature (Karlin and Burge 1995), supertree analysis (Brown et al. 2001), codon usage bias (Wright 1990), metabolic pathway content, core-genome analysis, pan-genome family trees (Snipen and Ussery 2010), in silico proteome analysis, and genotype-to-phenotype-to-genotype-derived metabolic features, including those features that may inform ecology (e.g., host–microbe interactions and energy/nutrient cycling) and evolution (Dutilh et al. 2013, 2014; Amaral et al. 2014). Only recently have species descriptions begun to include some genome-derived measurements of genetic relatedness based on, e.g., AAI/ANI, always with supporting DDH data, indicating that genomic taxonomy is not yet recognized by major journals as a standard in species descriptions. Also, none of these methods is included in the minimal standards of species description. Open-access genomic taxonomy also embraces the identification of strains based on diagnostic features disclosed in the new species descriptions. The application of genomic taxonomy is providing a predictive operational framework for reliable identification and classification. We argue for an open-access catalog of taxonomic descriptions with prototypes, diagnostic tables, links to culture collections, to genome and gene sequences, and to other phenotypic and ecological databases. Ideally, the open-access taxonomy is based solely on genome sequences that allow both the phylogenetic allocation of new strains and species in the taxonomic space and the phenotypic/metabolic characterization in open online databases.
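Two of the simplest tools in this list, G+C content and the Karlin genomic signature, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: sequences are plain strings, the signature is taken as the dinucleotide relative abundance f_xy / (f_x f_y) computed over a sequence together with its reverse complement, and the distance is the mean absolute difference over the 16 dinucleotides. The function names are hypothetical, and real genomes would call for a dedicated, tested library.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_percent(seq):
    """Percentage of G and C among unambiguous bases (assumes at least one)."""
    counted = [base for base in seq.upper() if base in BASES]
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def dinucleotide_signature(seq):
    """Relative abundance f_xy / (f_x * f_y) for all 16 dinucleotides.

    Counts are pooled over the sequence and its reverse complement, so the
    result does not depend on which strand was supplied.
    """
    mono = {base: 0 for base in BASES}
    di = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for strand in (seq.upper(), reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:  # skips pairs containing ambiguous bases
                di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    signature = {}
    for pair, count in di.items():
        fx, fy = mono[pair[0]] / n_mono, mono[pair[1]] / n_mono
        signature[pair] = (count / n_di) / (fx * fy) if fx and fy else 0.0
    return signature


def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide signatures."""
    sig_a, sig_b = dinucleotide_signature(seq_a), dinucleotide_signature(seq_b)
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / 16.0


# Toy sequences, far too short to be meaningful for real genomes.
at_rich = "ATATATATATGC"
gc_rich = "GCGCGCGCGCAT"
print(gc_percent(at_rich), gc_percent(gc_rich))
print(signature_distance(at_rich, gc_rich))
```

A G+C span such as the one used in rule-based species delineation is simply the absolute difference between two `gc_percent` values, whereas the signature distance compares how dinucleotides deviate from what base composition alone would predict.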

A new species description needs to be based, first of all, on at least one complete genome (Thompson et al. 2013a). In this way, the genomic landscape of the novel bacterium becomes available to microbiologists. Ideally, additional representative genomes of strains belonging to the new species will be included in order to provide information on the intraspecies genomic and phenotypic variation. The species description process needs to be automated and openly available to all, i.e., open access. Genomic taxonomy has already been successfully applied as an alternative for the more traditional species description and re-classification (Thompson et al. 2009; Haley et al. 2010; Thompson et al. 2011a, 2013b; Moreira et al. 2014a, b). For example, the genus Listonella was reclassified as a later heterotypic synonym of the genus Vibrio (Thompson et al. 2011b), and a new taxonomic framework for the genus Prochlorococcus was proposed with the descriptions of new species (Thompson et al. 2013c).

The genome sequence of the new taxa can be used for automatic identification of a microbial species through open-access tools available in a web-based portal. The genome sequences can also allow for the rapid identification of major phenotypic features associated with that organism, and translation of genomic information into phenotype will be increasingly precise with more genomes being annotated. We argue that ultimately the analyses of genes coding for the specific proteins involved in the metabolic pathways responsible for diagnostic features (e.g., Voges–Proskauer reaction, indole production, arginine dihydrolase, ornithine decarboxylase, utilization of myo-inositol, sucrose and l-leucine, and fermentation of d-mannitol, d-sorbitol, l-arabinose, trehalose, cellobiose, d-mannose and d-galactose) may be an alternative to the time-consuming phenotypic characterization using the standard biochemical tests (Karp et al. 2005; Romero et al. 2005; Dutilh et al. 2013; Amaral et al. 2014). Diagnostic phenotypic data are very hard to retrieve and lack portability (see, e.g., Bergey’s Manual, The Prokaryotes). A huge amount of valuable phenotypic data is simply out of reach because it is available only in species description papers, manuals, or handbooks. On the other hand, researchers need electronic portable data in order to push forward different fields of microbiology. By using the genotype-to-phenotype strategy (Fig. 2), it will be possible to leverage genome information to overcome this serious shortcoming of current microbial taxonomy in microbial ecology and clinical microbiology. In addition, relating the wealth of resource-associated data to cultures deposited in microbial Biological Resource Centers will foster academic research and drive innovation in the bio-economy.

Fig. 2

Genotype-to-phenotype approach in genomic prokaryotic taxonomy. A training set (type and reference strains) is subjected to whole-genome sequencing and gene content (including genes coding for the specific enzymes of a given metabolic pathway and the regulator proteins) analysis. Measured phenotypic features of the training set are obtained from the literature (e.g., Bergey’s manual and the Prokaryotes) and compared with the gene content in order to predict phenotypes. The phenotype of new strains is obtained by whole-genome sequencing using the diagnostic gene content defined in the training set (color figure online)

The manner in which phenotypic information is retrieved and presented in new species description and identification schemes will need to change in order to allow for open access of taxonomic data. Metabolic data are of paramount importance to link genomes and phenotype, but data accessibility also needs to be considered. We foresee two quite distinct situations in the process of open-access genomic taxonomy targeting biodiversity characterization. In the case of totally new taxa belonging to, e.g., a new phylum or class, for which metabolic data are scarce or even unknown and the genomic landscape is poorly known, significant efforts will be needed in order to provide experimental in vitro work to underpin the new descriptions. Of course, these are the most interesting cases in the context of biodiversity discovery. On the other hand, in cases of a new species description within a well-studied phylum (e.g., Proteobacteria and Firmicutes), phenotypes may be readily obtained by the genotype-to-phenotype-to-genotype approach. In this context, former microbial taxonomy studies (and the enormous phenotypic information available) performed in the last century will underpin the genotype-to-phenotype-to-genotype strategy. Efforts will be required to implement database-based high-throughput phenotypic methods that provide portable open-access data (Fig. 2). Methods of particular interest are those that reveal the amino acid sequences and the structure of molecules (e.g., secondary metabolite products, virulence factors). In comparison with phenotypic methods, the major advantages of next-generation sequencing are the high throughput, relatively low cost, high information content, high data quality, and portability of data. The data can be easily checked for quality in different stages of taxa description and can also be deposited in public databases. International initiatives such as GEBA are already working on this goal by whole-genome sequencing of all type strains of known species (more than eleven thousand genomes) and by innovative technologies such as single-cell genomics of uncultured microbes for discovery of new biodiversity (Wu et al. 2009; Rinke et al. 2013). In spite of these large ongoing initiatives, most of the current species descriptions in the major specialized journals still use the polyphasic approach, because of the insistence that in vitro DDH and massive phenotyping remain the cornerstones of contemporary microbial taxonomy. The majority of the known type and reference strains still have no genome sequence. Vandamme and Peeters (2014) have proposed a species description based on the full genome sequence and a minimal description of phenotypic characteristics, to be considered sufficient, cost-effective, and appropriate. The importance of increasing the rate of species descriptions is exemplified by the pace at which microbiome projects are advancing the study of culture-independent biodiversity of the most diverse environments and hosts, which leads to the generation of terabytes of DNA sequence in a matter of days (Huang et al. 2014; Franzosa et al. 2014; Li et al. 2014; Nielsen et al. 2014). As the ongoing microbiome projects advance, there will be a growing gap between the field of microbial community diversity and microbial taxonomy. We argue that the open-access genomic taxonomy can help to close this gap by establishing a stable, reproducible, and informative framework. Taxonomy also needs to be affordable. A new species description based on genome sequences will be considerably cheaper and quicker than one based on polyphasic taxonomy.

In silico phenotyping

To distinguish different strains within a bacterial species, or different species within a genus, the field of bacterial taxonomic classification has developed sets of phenotypic tests. Examples of phenotypes that may be measured include metabolism of specific organic compounds, resistance to antibiotics, and phage sensitivity. Specific phenotypic tests suitable for classification can be developed for each taxonomic group. Because microbial phenotypes are the result of metabolic pathways or functions encoded on the genomes of the bacterial strains, the phenotype is a proxy for phylogenetic classification.

In the past decade, great advances have been made in DNA-sequencing technologies. Several competing companies now provide the necessary equipment and chemistry to obtain high-quality draft genome sequences of bacterial strains at affordable prices. Third-generation sequencing will soon allow for sequencing of bacterial genomes in a few hours for a few dollars (Didelot et al. 2012). These genomes contain a wealth of genetic information and enable direct classification with respect to all other sequenced genomes, i.e., without the use of a phenotype as a proxy. Moreover, bioinformatic advances now enable mining of these genome sequences to predict the phenotype of the sequenced strain, known as in silico phenotyping, avoiding costly experimental phenotypic screens that need to be performed in the laboratory. We have recently proposed an approach for in silico genomic phenotyping based on gene content screens (Fig. 2) (Amaral et al. 2014). In this study, genes involved in the molecular pathways leading to the phenotypes were selected and genome sequences screened for the presence of these genes. This allowed us to confidently assign predicted phenotypic classifications to each of the genomes (Amaral et al. 2014); these predictions can be tested experimentally. A large collection of phenotypes and the associated genes is contained in the SEED database (Overbeek et al. 2014). This database contains hundreds of expert-annotated, manually curated subsystems that can be rapidly projected onto new genome sequences, providing an automated approach for in silico prediction of phenotypes.
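The logic of a gene content screen can be sketched as follows, assuming that an annotated genome is available as a set of gene names and that each phenotype has a list of genes considered necessary for it. The marker sets below are illustrative placeholders chosen for the example and are not the curated gene lists of Amaral et al. (2014) or of the SEED subsystems; the function name is likewise hypothetical.

```python
# Illustrative marker sets only (assumption): each phenotype is predicted
# positive when every listed gene is found in the genome annotation.
PHENOTYPE_MARKERS = {
    "indole production": {"tnaA"},
    "arginine dihydrolase": {"arcA", "arcB", "arcC"},
    "Voges-Proskauer reaction": {"budA", "budB"},
}


def screen_gene_content(genome_genes, markers=PHENOTYPE_MARKERS):
    """Predict each phenotype from the presence of its marker genes.

    Returns, per phenotype, the prediction and any missing marker genes so
    that negative calls can be inspected and tested experimentally.
    """
    report = {}
    for phenotype, required in markers.items():
        missing = required - genome_genes
        report[phenotype] = {
            "predicted_positive": not missing,
            "missing_genes": sorted(missing),
        }
    return report


# Hypothetical annotation of a newly sequenced strain.
new_strain = {"tnaA", "arcA", "arcB", "budA", "budB", "recA"}
for phenotype, call in screen_gene_content(new_strain).items():
    print(phenotype, call)
```

Reporting the missing genes alongside each call keeps the prediction auditable: an incomplete pathway in a draft assembly may reflect a gap between contigs rather than a true absence.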

Identifying or predicting the genes that are involved in each phenotype is known as gene-trait matching. Recently, a complete in silico pipeline was outlined for the consistent annotation of bacterial genomes followed by automated gene-trait matching (Dutilh et al. 2013). A condition for this approach is that the trait is consistently measured for all sequenced genomes. By using this approach, dubbed “genome-wide association study for microbes” (GWAS-M), candidate genes contributing to the trait can be obtained. The approach employs a machine-learning tool, and by analyzing a training set of bacteria that differ with respect to the trait, it identifies which genomic variables best explain the trait variation. These genomic variables can then be used to infer the phenotype of a strain based on its genome sequence.
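The idea of gene-trait matching can be illustrated with a minimal sketch. It does not reproduce the machine-learning tool of the cited pipeline; as a deliberately simple stand-in, it ranks genes by the phi coefficient between gene presence and a binary trait across a training set of strains. The gene names and the presence/absence data are hypothetical.

```python
import math


def phi_coefficient(presence, trait):
    """Association between two equal-length lists of booleans (range -1 to 1)."""
    pairs = list(zip(presence, trait))
    n11 = sum(1 for p, t in pairs if p and t)          # gene present, trait present
    n10 = sum(1 for p, t in pairs if p and not t)      # gene present, trait absent
    n01 = sum(1 for p, t in pairs if not p and t)      # gene absent, trait present
    n00 = sum(1 for p, t in pairs if not p and not t)  # gene absent, trait absent
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # gene or trait is constant across strains: no signal
    return (n11 * n00 - n10 * n01) / denom


def rank_genes(gene_presence, trait):
    """Sort genes by the absolute strength of their association with the trait."""
    scores = {g: phi_coefficient(p, trait) for g, p in gene_presence.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical training set: six strains, three candidate genes.
trait = [True, True, True, False, False, False]
gene_presence = {
    "gene_x": [True, True, True, False, False, False],  # tracks the trait exactly
    "gene_y": [True, False, True, False, True, False],  # weakly associated
    "gene_z": [True, True, True, True, True, True],     # core gene, uninformative
}
print([gene for gene, _ in rank_genes(gene_presence, trait)])
# prints ['gene_x', 'gene_y', 'gene_z']
```

A univariate score like this ignores interactions between genes and the shared ancestry of strains, which is one reason a multivariate learner trained on consistently measured traits is preferable in practice.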

Advances in genome sequencing fuel the young field of bioinformatic gene-trait matching, and a few applications have been published thus far. An early example of this approach was based on a comparative genome hybridization (CGH) array, and involved the identification of genes associated with growth on sugars and nitrogen dioxide production in Lactobacillus plantarum (Bayjanov et al. 2012). More recently, a large collection of 274 Vibrio cholerae genomes was mined for genomic variables that explained not phenotypes, but the occurrence of the isolates in three niche dimensions: space, time, and habitat (Dutilh et al. 2014). This study revealed that mobile genetic elements explained most of the variation in all these niche dimensions and may be used to classify the genomes. These examples illustrate the versatility of gene-trait matching and its power for identifying genes associated with specific bacterial traits.

Genome sequencing is not without its drawbacks. Next-generation or ‘second-generation’ sequencing has removed many of the biases of cloning that plagued earlier genome sequences, but whole-genome assembly is often complicated by short reads and the myriad of repeat regions in the microbial genome. Ribosomal RNA operons are frequently present in multiple exact copies, and phage genes, transposons, and insertion elements all contribute to computational confusion during the assembly process. Finishing genomes completely, so that every base is known and error free, is both expensive and time-consuming, typically requiring PCR walking across repeat regions. Consequently, most microbial genomes are only sequenced to “high-quality draft status”, typically meaning <100 contigs. Third-generation sequencing technologies, such as those of Pacific Biosciences and Oxford Nanopore, have the advantage of long reads (10,000 bp or longer), although currently their throughput and base calling accuracy are lower than those of the second-generation machines. Nevertheless, many bacterial genomes have been sequenced and assembled with a single run on Pacific Biosciences machines (Doi et al. 2014; Forde et al. 2014; Shiwa et al. 2014).

Genome annotation is generally based on similarity between predicted proteins in the genome and annotated proteins in the database. Of course, similarity-based annotation systems require that a homolog of the predicted protein be known. Ideally, protein functions should be experimentally verified, but the function of very few proteins has been confirmed in the laboratory. Automated genome annotation is therefore susceptible to errors from missing information.

Genes coding for the proteins responsible for diagnostic phenotypic features can be retrieved using the RAST program and the KEGG metabolic database (http://www.genome.jp/kegg/). The BLASTP algorithm can then be used to identify genes associated with the biochemical pathways, and the program ExPASy translate (ExPASy Bioinformatics Resource Portal) can be used to analyse protein sequences. To automate searches for genes related to phenotypes of interest, specific programs and databases related to different taxonomic groups will need to be developed (73). For instance, amino acid FASTA files with coding sequences of a target phenotypic feature can be used as input in order to verify whether hits are found for the gene (enzyme) being searched in a specific database. Orthologous genes will have the highest BLAST scores, and identity will be >40 % in this type of search. Gene sequence length normally needs to be >70 % of the query length. After these steps, if all the genes (enzymes) involved in a metabolic pathway are present in the genome, the organism is considered positive for a given phenotype; if one or more genes (enzymes) in a metabolic pathway are absent, the organism is considered negative. It is also important to evaluate regulatory genes, global regulators of the different diagnostic phenotypic features/metabolic pathways, presence of indels in the gene sequences, sRNA regulation, and promoter sequences.
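The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the BLASTP hits have already been reduced to tuples of query, subject, percent identity, and alignment length, it uses alignment length as a stand-in for the matched gene length, and all gene and pathway names are hypothetical. Regulatory genes, indels, and promoters are not considered.

```python
# Thresholds taken from the text: identity > 40 % and a length
# greater than 70 % of the query length.
MIN_IDENTITY = 40.0
MIN_QUERY_FRACTION = 0.70


def genes_found(blast_rows, query_lengths):
    """Return the set of query genes with at least one acceptable hit.

    blast_rows: iterable of (query_id, subject_id, percent_identity, alignment_length)
    query_lengths: dict mapping query_id to the query protein length in residues
    """
    found = set()
    for query_id, _subject_id, identity, aln_len in blast_rows:
        # Assumption: alignment length approximates the matched gene length.
        fraction = aln_len / query_lengths[query_id]
        if identity > MIN_IDENTITY and fraction > MIN_QUERY_FRACTION:
            found.add(query_id)
    return found


def predict_phenotypes(pathways, found):
    """Call a phenotype positive only if every gene of its pathway was found."""
    return {name: set(genes) <= found for name, genes in pathways.items()}


# Hypothetical query proteins, hits, and pathway definitions.
query_lengths = {"geneA": 300, "geneB": 450, "geneC": 200}
blast_rows = [
    ("geneA", "subject_1", 62.5, 290),  # passes both thresholds
    ("geneB", "subject_2", 38.0, 440),  # identity too low
    ("geneB", "subject_3", 55.0, 200),  # covers too little of the query
    ("geneC", "subject_4", 45.0, 150),  # passes both thresholds
]
pathways = {
    "phenotype_1": ["geneA", "geneC"],
    "phenotype_2": ["geneA", "geneB"],
}
found = genes_found(blast_rows, query_lengths)
print(predict_phenotypes(pathways, found))
# prints {'phenotype_1': True, 'phenotype_2': False}
```

Because a single missing gene turns a call negative, a gene that falls in an assembly gap of a draft genome would produce a false negative under this rule, so negative calls on low-coverage assemblies deserve manual inspection.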

Despite sources of error (e.g., incomplete DNA sequencing and inaccurate annotations), our knowledge of microbial metabolism encoded in the databases is thorough. For example, in a recent study, we sequenced the genome of Citrobacter sedlakii, a previously unsequenced organism. At 320 contigs, our assembly was a low-quality draft, but using Rapid Annotation using Subsystem Technology (RAST; Aziz et al. 2008), we annotated 1,399 reactions performed by enzyme complexes encoded in the genome. Only five genes were missed due to low sequencing coverage, and six genes were missed due to problems with the assembly and annotation (but were present in the genome upon further inspection; Cuevas et al. in preparation). This suggests that even genomes with relatively low sequence coverage can be used to predict the metabolism that an organism performs, which can then be used in taxonomic assignments.

Statements arguing in favor of a genomic microbial taxonomy

  • Microbial taxonomy is moving from polyphasic taxonomy into a new open-access genomic microbial taxonomy with a set of standardized tools used on a genome sequence. Mere translation of thresholds of polyphasic taxonomy will not contribute to it (Kämpfer and Glaeser 2012; Vandamme and Peeters 2014).

  • The highest priority of a rejuvenated genomic microbial taxonomy is to help better describe microbial diversity and to better serve medical and environmental microbiologists and epidemiologists.

  • As scientists, it is our duty to question the basis of taxonomy, both theory and practice, as well as the validity of the schemes that we produce. Incorporating ecological, phylogenetic, and evolutionary dimensions is needed to define a biologically coherent species concept. Re-establishing the link between phylogenetics and taxonomy will allow a better understanding of microbial speciation (Zhi et al. 2012).

  • It will take time to develop a new coherent prokaryote species concept. A rush for a new species concept is not needed and would be counterproductive. International meetings on the topic might help to open up the discussion. Fortunately, we have the chance to welcome newcomers in the field, such as computer scientists, microbial ecologists, and evolutionary microbiologists. Microbial taxonomy seems to be in excellent shape, particularly in the Asian countries (Tamames and Rosselló-Móra 2012). The challenge now at stake for genomic microbial taxonomy is to examine how the existing genomic databases, bioinformatics tools, and access facilities may be further developed into prototypes to be further tested and discussed. Automated methods such as the ones benchmarked by Larsen et al. (2014) will enable the use of WGS for higher resolution and more phylogenetically accurate classifications. It has been noticed that to date, microbial taxonomy has barely taken into account the wealth of information contained in completely sequenced genomes (Klenk and Göker 2010). This information allows taxonomy and typing to be incorporated into a high-resolution, reproducible, and portable scheme. The developments are expected to take place in parallel with the ongoing conservative practice of polyphasic taxonomy.

  • We propose the following general steps as a roadmap for species description within known genera: First, perform whole-genome sequencing of the novel type and reference strains and calculate genome similarity within species and toward the closest known species by means of MLSA, GGD, and AAI; second, check in the published literature (i.e., species descriptions, Bergey's Manual, and The Prokaryotes) the list of useful discriminatory phenotypic features to be searched for in the genome sequences; third, apply the genotype-to-phenotype approach and define the presence of diagnostic phenotypes on the basis of the presence of the gene sequences, trying to obtain the maximum number of phenotypes based on genome sequences; fourth, perform the most basic phenotypic characterization of the novel strains in vitro, such as cell and colony morphology and growth at different ranges of temperature, pH, and salinity, while avoiding, e.g., FAME, MALDI–TOF, AFLP, and other non-portable fingerprinting techniques; fifth, deposit the genome sequences of the novel type and reference strains in public open-access databases and the cultures in public collections; sixth, write a concise text reporting the major findings obtained in steps 1–5, in a manner that is readily accessible to machines. Automation in the production of texts dealing with descriptions and updates of databases will be a plausible development. Analytical work and bioinformatics are also needed in order to use phenotypic information available in genome sequences. The new system clearly needs new tools to gain information from the genotype to the phenotype and back to the genotype.

  • Specialized journals, e.g., IJSEM and SAM, are starting to get involved in an open scientific discussion on genomic microbial taxonomy and offer a tribune for it (Sen et al. 2014; Chun and Rainey 2014; Ramasamy et al. 2014). This will attract bright young scientists, who are needed for remolding the theory and practice of genomic microbial taxonomy.

  • It is necessary to emphasize that novel strains or strains with novel properties should be deposited in public collections (Stackebrandt et al. 2014). Genome databases are sadly full of sequences without a deposited culture in a recognized Culture Collection (Tamames and Rosselló-Móra 2012).
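The first step of the roadmap proposed in the list above relies on genome similarity measures such as AAI. As a rough sketch of how AAI is commonly computed, the code below averages the percent identity over reciprocal best hits between the proteins of two genomes. It assumes the best hit of every protein in each direction has already been determined (for instance with BLASTP); the protein identifiers and identities are hypothetical, and no alignment-length filter is applied.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Yield (protein_a, protein_b, identity) for reciprocal best hits.

    best_a_to_b: dict mapping protein_a to (protein_b, percent_identity)
    best_b_to_a: dict mapping protein_b to (protein_a, percent_identity)
    """
    for prot_a, (prot_b, ident_ab) in best_a_to_b.items():
        back = best_b_to_a.get(prot_b)
        if back is not None and back[0] == prot_a:
            # Average the two directions, which can differ slightly.
            yield prot_a, prot_b, (ident_ab + back[1]) / 2.0


def average_amino_acid_identity(best_a_to_b, best_b_to_a):
    """Mean identity over reciprocal best hits, or None if there are none."""
    identities = [i for _, _, i in reciprocal_best_hits(best_a_to_b, best_b_to_a)]
    if not identities:
        return None
    return sum(identities) / len(identities)


# Hypothetical best hits between genome A and genome B.
a_to_b = {"a1": ("b1", 90.0), "a2": ("b2", 80.0), "a3": ("b1", 50.0)}
b_to_a = {"b1": ("a1", 90.0), "b2": ("a2", 82.0)}
print(average_amino_acid_identity(a_to_b, b_to_a))
# prints 85.5 (a3 is excluded because b1's best hit is a1)
```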