1 A Brief Introduction to the History of Bacterial Taxonomy

If we open a student textbook of microbiology, we find taxonomy defined as “the science by which organisms are characterized, named, and placed into groups according to several defined criteria”; together with phylogeny, it makes up systematics, “the study of the diversity of organisms and their relationships” (Madigan et al. 2012). However, what lies behind these definitions is much more complex, with hundreds of laboratory techniques applied to the proper characterization of microorganisms, techniques that have themselves evolved over the years. Taxonomy has been practiced ever since humans became aware of their environment and began classifying the organisms around them; here, however, we focus on bacterial systematics, whose origin goes back to the descriptions of “animalcules” made by Antonie van Leeuwenhoek in the second half of the seventeenth century. Little progress was made until the nineteenth century, when the first genera of bacteria were described and the term Bacterium was first used to classify rod-shaped cells (Murray and Holt 2005).

The species concept, so easily defined for higher organisms, has been a source of discussion from the very beginning of prokaryotic taxonomy and still generates controversy among researchers. The first definitions of bacterial species included terms such as close resemblance and essential and distinguishing features, which led taxonomists to differentiate species according to their morphology, source of isolation, and pathogenicity. These features, although useful at the beginning, were later shown to be highly imprecise and subjective (Brenner et al. 2001).

The first principles for bacterial characterization, classification, and identification were proposed in the second half of the twentieth century, when it was recognized that a single character, whatever its importance, is not enough to define a species. Instead, taxonomists proposed the use of a large set of biochemical tests and of multiple strains to better characterize the members of newly defined species. This approach culminated in what was called numerical taxonomy, proposed by Sokal and Sneath (1964). In this method, a wide range of tests, more than 100, were analyzed, and coefficients were established to calculate the similarity between strains and species.
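As a rough illustration of the kind of coefficient used in numerical taxonomy, the following sketch computes two common similarity measures, the simple matching coefficient and the Jaccard coefficient, between two strains scored for a set of binary tests. The strain profiles and function names are invented for the example and are not taken from Sokal and Sneath; a real study would use far more tests.

```python
def simple_matching(profile_a, profile_b):
    """Fraction of tests on which two strains give the same result."""
    if len(profile_a) != len(profile_b):
        raise ValueError("both profiles must cover the same tests")
    agreements = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return agreements / len(profile_a)


def jaccard(profile_a, profile_b):
    """Shared positive tests divided by tests positive in at least one strain."""
    if len(profile_a) != len(profile_b):
        raise ValueError("both profiles must cover the same tests")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical results of eight biochemical tests (1 = positive, 0 = negative).
strain_a = [1, 1, 0, 1, 0, 0, 1, 1]
strain_b = [1, 0, 0, 1, 0, 1, 1, 1]

print(simple_matching(strain_a, strain_b))  # 6 agreements out of 8 tests: 0.75
print(round(jaccard(strain_a, strain_b), 3))  # 4 shared positives out of 6: 0.667
```

The simple matching coefficient counts shared negative results as agreement, whereas the Jaccard coefficient ignores them; which one is more appropriate depends on whether a shared negative is considered informative.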

However, it was not until taxonomists were able to extract nucleic acids from cells that a more “natural classification” based on nucleic acid analysis became possible, allowing bacterial species to be better defined. In the 1960s, the development of methodologies for nucleic acid extraction and renaturation (Marmur 1961; Marmur and Doty 1961) led to the first studies based on DNA homologies and genetic comparisons, in which DNA-DNA and DNA-RNA hybridization were used for bacterial classification. Later, in the 1980s, the profiles of stable low molecular weight RNA (LMW RNA) were proposed for bacterial species differentiation (Höfle 1988). Separating these molecules by a modified electrophoresis technique named Staircase Electrophoresis (SE) increased the resolution of the LMW RNA profiles, allowing the differentiation of genera and species of both prokaryotic and eukaryotic microorganisms (Velázquez et al. 2001).

Also, the invention of the polymerase chain reaction (PCR) by Mullis in 1983 opened the way to several fingerprinting techniques based on the amplification of given DNA fragments or genes. Restriction Fragment Length Polymorphism (RFLP) technology was described in 1980 (Botstein et al. 1980) and then applied to generate DNA profiles, which made it possible to find differences among strains. Other techniques for the genotypic characterization of microorganisms were developed, such as ARDRA (Amplified rDNA Restriction Analysis), RAPD (Random Amplification of Polymorphic DNA), BOX-PCR (based on repetitive BOX elements), and ERIC-PCR (Enterobacterial Repetitive Intergenic Consensus), allowing the microorganisms isolated from a sample to be rapidly classified into genetic groups (Carro and Nouioui 2017).

Apart from these applications, the development of the polymerase chain reaction allowed the use of a gold-standard marker in prokaryotic taxonomy, the 16S rRNA gene. This gene was selected for a group of characteristics that make it a unique taxonomic marker: its size, its slow rate of evolution, and its ubiquity in bacteria. These characteristics made it possible to propose a phylogenetic classification of Prokaryotes based on this gene (Winker and Woese 1991). By 1994, its use was already widespread among taxonomists to generate phylogenetic reconstructions of new taxa. Although the gene by itself has limited resolving power below the genus level, a similarity cutoff value was proposed for species delineation: 97% similarity in the 16S rRNA gene sequence (Stackebrandt and Goebel 1994). This value was later updated to 98.7–99% similarity, depending on the genus (Stackebrandt and Ebers 2006), and validated with empirical datasets and statistical probabilities of failure (Meier-Kolthoff et al. 2013b). Strains presenting values over this range should be analyzed by DNA-DNA hybridization methods to determine whether they represent novel species.

The 16S rRNA gene belongs to the ribosomal operon, which in bacteria also encompasses the 23S and 5S rRNA genes as well as several intergenic regions. In addition to the 16S rRNA gene, the 23S rRNA gene and the intergenic spacer (ITS) between the 16S rRNA and 23S rRNA genes have been used for taxonomic purposes (Ludwig and Schleifer 1994; D’Auria et al. 2006; Yarza et al. 2010). The ITS region contains hypervariable sequence regions that allow the differentiation of bacterial genera, species, and strains (Peix et al. 2005). This region also differs in size among bacterial groups, which facilitates its use for the metagenomic analysis of bacterial populations through a technique named RISA or ARISA (Ranjard et al. 2000). Currently, complete ribosomal operons are used for the identification of bacteria in complex samples by metagenomics (Kerkhof et al. 2017; Cuscó et al. 2019; Martijn et al. 2019).

The 16S rRNA gene contains regions that are highly conserved in bacteria, which allowed its amplification and sequencing with universal primers annealing in these regions (Edwards et al. 1989). Two of these primers were used to obtain TP-RAPD (Two Primers Random Amplified Polymorphic DNA) patterns, which do not depend on the strain and can therefore differentiate among bacterial species (Rivas et al. 2001). Universal primers can also be used to amplify a region of the 16S rRNA gene in Prokaryotes and of the 18S rRNA gene in Eukaryotes called UARR (Universal Amplified Ribosomal Region), which contains the V6, V7, and V8 domains (Rivas et al. 2004). These regions, particularly V6, are useful for the metagenomic analysis of bacterial populations through NGS (Kumar et al. 2011; Temperton and Giovannoni 2012; Tremblay et al. 2015; Yang et al. 2016; Winand et al. 2019).

DNA-DNA hybridization allows the comparison of two complete genomes and therefore the calculation of a similarity percentage between them on the basis of the dissociation of the DNA strands. The rapid development of this technique allowed a numerical threshold to be established for deciding whether two microorganisms belong to the same species, set at 70% in 1987 (Wayne et al. 1987). Several methods were developed to improve this technique and avoid the radioactive labeling used since its introduction in the early 1960s, including filter competition, optical renaturation rates, hydroxyapatite, fluorimetric, and microplate methods (Mehlen et al. 2004). However, the restriction of the technique to highly specialized laboratories and the low reproducibility of results between laboratories have long prompted calls for more stable methods (Carro et al. 2012).

Among those methods, a first proposal was the use of multilocus sequence analysis (MLSA), the analysis of a small set of protein-encoding genes, also known as housekeeping genes (Chimetto Tonon and Moreira 2016). As sequencing technologies became affordable for most laboratories, the use of this technique for the classification of novel taxa increased. However, the first works on this technique showed that the selection of genes affects the results obtained and, what is worse, that the appropriate genes vary among genera, which led to a search for valid genes for each genus to be analyzed (Carro et al. 2012). Nevertheless, once those genes were found for a given genus, the phylogeny based on their concatenated sequences was robust and allowed a better definition of the diversity and evolution of the genus (Adékambi and Drancourt 2004; Guo et al. 2008). More recently, the availability of whole-genome sequences and the development of several overall genome relatedness indices (OGRI), which will be discussed in the next section, have provided the stability sought for DNA-DNA hybridization, since the comparison of genome sequences yields the same values regardless of the laboratory performing it, and tools have been developed to allow the general use of this approach (Chun et al. 2018).

The combination of all these analyses was called polyphasic taxonomy, which integrates phenotypic, environmental, and genotypic characteristics, together with the phylogeny of the strains, to generate a comprehensive view of the microorganisms to be described and thus to classify and identify them properly. Polyphasic taxonomy is constantly evolving as newly developed tools for bacterial characterization are incorporated.

2 Whole-Genome Sequences: How to Use Them in Bacterial Taxonomy

During the last decade, the number of available genomes has increased exponentially (Fig. 1). This explosion is due to several factors, including the improvement of next-generation sequencing (NGS) technologies, which allow a whole draft genome to be generated in less than 48 h at a drastically reduced cost (Kremer et al. 2017), and multigenome sequencing projects such as GEBA (the Genomic Encyclopedia of Bacteria and Archaea), which aimed to fill the gaps in type-strain genomes across the tree of life (Mukherjee et al. 2017). Half of the genomes generated until now correspond to strains of the phylum Proteobacteria, the most abundant phylum, with Firmicutes and Actinobacteria in second and third position, respectively, and only 5% of the genomes belonging to other phyla (Fig. 2). Since early 2019, the genome sequence has been considered a key feature to present for the type strain of every proposed species, according to the instructions for authors of the International Journal of Systematic and Evolutionary Microbiology. This requirement, also adopted by other journals that publish microbial descriptions, has helped to increase the number of available genomes. However, in spite of the potential hidden in the genomic information of those strains, most papers only meet the requirement of producing a draft genome, without trying to extract the useful information it contains about the proposed taxa. But how can we generate good sequencing data? Which data should be used in taxonomic descriptions? How should they be presented? These questions are addressed in the following subsections.

Fig. 1 Number of prokaryotic genomes publicly available generated during the last decade. Source of information: EZBioCloud Statistics

Fig. 2 Number of prokaryotic genomes by phylum. Source of information: EZBioCloud Statistics

2.1 Technologies to Generate Whole-Genome Sequences

A major breakthrough in DNA sequencing arrived with Sanger’s chain-termination technique in 1977, which spread rapidly and was widely used for the following three decades (Heather and Chain 2016). The automation of this technology gave rise to what are now called first-generation DNA sequencing machines, still routinely used in many laboratories. Using this technology, the first genomes of prokaryotic organisms were produced, that of Haemophilus influenzae, completed in 1995, being the first (Fleischmann et al. 1995). The development of this technology allowed the simultaneous sequencing of hundreds of samples, and it was even applied to generate the first human genome sequence (Lander et al. 2001).

Improvements in DNA sequencing methodologies have not stopped, and the following group of technologies has been called next-generation sequencing (NGS) or second-generation DNA sequencing. The earliest of these technologies were based on DNA fixed to a solid phase and on the measurement of pyrophosphate, which is detected when it is converted into ATP by ATP sulfurylase and the ATP is used as substrate by a luciferase whose light emission can be measured. These pyrosequencing technologies were first developed by Roche, allowing the mass parallelization of sequencing reactions (Margulies et al. 2005). Other companies developed their own systems, including Illumina, Ion Torrent, and Life Technologies, with Illumina, through its HiSeq and MiSeq instruments, being one of the most frequently used nowadays. Each of these sequencing platforms presents advantages and disadvantages, as shown in Table 1. Some of these platforms, such as Illumina and SOLiD, are also known as “short-read” technologies, as they generate short reads of between 30 and 500 bp, considerably shorter than those obtained by Sanger sequencing (around 1000 bp) (Kremer et al. 2017). Nevertheless, NGS allowed a drastic reduction in the cost per base, mainly due to the higher sequencing coverage obtained through higher throughput, with millions or billions of DNA strands sequenced in parallel.

Table 1 A comparison of main characteristics of most used next-generation sequencing instruments

The third generation of DNA sequencing arrived with technologies that read single DNA molecules directly, allowing long reads from very limited DNA samples. Among the technologies grouped under this name are single-molecule sequencing and real-time sequencing, with no DNA amplification. The single molecule real-time (SMRT) platform from Pacific Biosciences is currently the most used of this generation of DNA sequencers. PacBio platforms are able to produce very long reads, over 10 kb in length, which are especially useful for the generation of de novo genome sequences (Heather and Chain 2016). However, the main flaws of this technology are its high error rate (Table 1), which has hampered its widespread use, and its higher price compared with other available systems. The second technology emerging in this third generation has been developed by Oxford Nanopore Technologies. What is expected from this technology is the generation of very long reads at very low cost; in addition, the company has presented compact machines such as the MinION, which has the size of a smartphone and allows direct sequencing of material at sampling sites (Loman and Quinlan 2014). The poor quality profiles and high error rates obtained are still the main challenges for these technologies.

The platforms provided by Illumina, Ion Torrent, and PacBio are considered to meet the general standards for the description of new species (Chun et al. 2018). The combination of the most used platform of each generation is also highly encouraged, and some of the most important sequencing centers, such as the Joint Genome Institute, apply this methodology. PacBio platforms yield long sequences, which are key to closing de novo genomes, while Illumina platforms allow the validation of the sequence, owing to their low error rate. This combination offers the best results in bacterial genome sequencing, but its application is often limited by the cost of this double sequencing.

2.2 Minimal Standards for Genomic Data in Taxonomy

Once the sequencing reads have been obtained, it is necessary to assess their quality and to assemble them into contigs and scaffolds, tasks for which a great deal of specific software has been developed (Velvet, SPAdes, QUAST, etc.). NGS platforms provide their own statistics for the raw sequencing data; however, the most important statistics for taxonomic purposes are those obtained from the final assembly. A good review of the available software tools was presented by Kremer and colleagues (Kremer et al. 2017). According to Chun et al. (2018), some of the key parameters that should be checked in a genome assembly and included in the genome description of the strain are:

  • Number of contigs: the perfect number of contigs is one, and accordingly, this number should be as low as possible with the obtained data. However, for taxonomic purposes, this value could be higher if the redundancy or coverage of the data is enough, with values sometimes as high as 600 accepted.

  • N50: this is a good parameter to measure the quality of an assembly. When the contigs are sorted from longest to shortest and their lengths are summed, N50 is the length of the contig at which the running total first reaches 50% or more of the total assembly size.

  • Coverage: another important value is the sequencing depth of coverage, which indicates how many times, on average, each base of the final assembly has been read and is expressed as fold coverage (X). The recommended minimum for good coverage is ≥50X.

  • Genome size: most of the genomes generated until now are not closed, which implies that the genome size, taken as the sum of the lengths of all the contigs, is only an approximation. Nevertheless, this value gives an idea of completeness when compared with other members of the genus and with in vitro estimates.

  • G+C content: this value is also an indicator of the quality, and it should be coherent with the expected data for the strain of study.

  • 16S rRNA sequence: in addition to serving as an indicator of the completeness of the genome sequence, this marker gene should be used to verify the authenticity of the genome by checking that it matches the Sanger sequence obtained from the strain for which the genome has been generated. This helps to avoid mistakes due to strain contamination or mislabeling. Other housekeeping genes could also be used in case of doubt.

  • Contamination of samples: the DNA samples to be sequenced may be contaminated, and contaminant sequences, even in minor amounts, could be incorporated into the final genome sequence. One tool created to detect this is CheckM (Parks et al. 2015), which indicates the percentage of possible contamination in a genome and is also used to study other quality parameters. However, apparent contamination should be interpreted carefully, as lateral gene transfer is a common event in prokaryotes. Another tool, ContEst16S, which focuses on the presence of different 16S rRNA genes in the assembly, has also been developed to find possible contamination (Lee et al. 2017).
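Several of the parameters listed above can be computed directly from the contigs of an assembly. The following sketch shows one way to obtain the number of contigs, the approximate genome size, N50, G+C content, and mean depth of coverage. The function names, the report layout, and the use of total sequenced bases divided by assembly size as a coverage estimate are assumptions made for illustration; dedicated tools such as QUAST or CheckM should be preferred for real data.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of the lengths,
    sorted from longest to shortest, first reaches 50% of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def gc_content(contigs):
    """G+C percentage over all unambiguous bases (A, C, G, T)."""
    gc = 0
    acgt = 0
    for seq in contigs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


def mean_coverage(total_read_bases, assembly_size):
    """Rough average depth: sequenced bases divided by assembly size."""
    return total_read_bases / assembly_size


def assembly_report(contigs, total_read_bases, min_coverage=50.0):
    """Collect the basic statistics discussed in the list above."""
    lengths = [len(seq) for seq in contigs]
    size = sum(lengths)
    coverage = mean_coverage(total_read_bases, size)
    return {
        "contigs": len(contigs),
        "genome_size": size,       # sum of contig lengths, an approximation
        "n50": n50(lengths),
        "gc_percent": gc_content(contigs),
        "coverage": coverage,
        "coverage_ok": coverage >= min_coverage,  # >=50X recommended minimum
    }


# Toy example with two very short, invented "contigs".
print(assembly_report(["GGCC", "ATAT"], total_read_bases=400))
```

In practice the contigs would be read from the FASTA file of the final assembly, and the total number of sequenced bases would come from the read files or from the statistics reported by the sequencing platform.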

On the other hand, describing how the genomes have been generated and making this information available is also essential when genomic data are used in taxonomic descriptions of species. Some of the main points to be taken into account are:

  • Deposit genome information in public databases: two main databases of GenBank/EMBL/DDBJ should be used for the deposit of information:

    • WGS database: assembled and quality checked genome should be deposited to allow comparison between your genome and others.

    • SRA database: raw sequencing data should be deposited too, as they could be used to improve the assembly once more information or better methods are developed in the future.

  • Properly describe the sequencing, assembling, and annotation methods: including the sequencing instrument, the reagents used for library preparation, and all the software used in the process to obtain the final genome.

2.3 Overall Genome Relatedness Index (OGRI)

OGRI is a generic name that groups all the bioinformatic methods designed to replace wet-lab DNA-DNA hybridization (DDH) for the differentiation of species (Chun and Rainey 2014) in a reproducible and objective way. These methods use whole-genome sequences and do not require prior gene annotation. When a new species is proposed, it is compulsory to calculate OGRI or DDH values, with at least one of the known methods, against all closely related species showing a 16S rRNA gene similarity over 98.7%.

Even before genome sequencing was affordable for most microbiology laboratories, the use of in silico methods to replace DDH was proposed (Henz et al. 2004; Konstantinidis and Tiedje 2005). One of the methods used to obtain a digital counterpart of DDH values by computational comparison of genome sequences was the average nucleotide identity (ANI), which represents the mean of the identity values between multiple sets of orthologous regions (Konstantinidis and Tiedje 2005). An initial cutoff value of 94% was proposed to correspond to the traditional 70% for DDH, but this boundary was later adjusted to 95–96% after refining the method and simulating an artificial fragmentation of the genome similar to that occurring in the DDH method (Goris et al. 2007). The implementation of the MUMmer software (ANIm) instead of BLASTN (ANIb) helped to obtain faster results with the ANI method (Richter and Rosselló-Móra 2009). A further refinement, intended to solve the differences between reciprocal ANI values, was a new algorithm, OrthoANI, which uses only orthologous fragment pairs to calculate nucleotide identities (Lee et al. 2016).

Another OGRI, proposed in parallel and widely used, is a distance-type genome relatedness index, the genome BLAST distance phylogeny (GBDP) (Henz et al. 2004), in which two genome sequences are aligned to each other to generate high-scoring segment pairs, to which a specific distance formula is applied. The algorithm was later improved with confidence-interval estimation thanks to a new statistical model proposed by Meier-Kolthoff and colleagues (Meier-Kolthoff et al. 2013a). These implementations generate the digital DDH (dDDH), which mimics the results of classical DDH with confidence-interval estimation, enabling the user to evaluate the outcomes statistically. Accordingly, the species boundary for dDDH values is 70%, the same as the one proposed for classical DDH (Wayne et al. 1987). A web-based tool known as the genome to genome distance calculator (GGDC), which is available online, was implemented to carry out those analyses.

Another distance-type index is maximal unique matches index (MUMi) (Deloger et al. 2009); however, its use has been much more limited, probably because it was proposed to provide higher resolution at the intraspecies level and analyzes the exact matches shared by the two sequences of study.
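To make the species boundaries mentioned in this section concrete, the sketch below averages a list of per-fragment identity values into an ANI estimate and compares ANI and dDDH values with the cutoffs cited above (95–96% for ANI, 70% for dDDH). It is a deliberately simplified illustration: the fragment identities are invented, the function names are hypothetical, and the fragmentation, alignment, and filtering steps performed by real ANI tools are not reproduced.

```python
ANI_BOUNDARY = (95.0, 96.0)   # species boundary range cited for ANI
DDDH_CUTOFF = 70.0            # species boundary cited for DDH and dDDH


def average_nucleotide_identity(fragment_identities):
    """Mean of the percent identities of the matched genome fragments."""
    if not fragment_identities:
        raise ValueError("at least one fragment identity is required")
    return sum(fragment_identities) / len(fragment_identities)


def interpret_ani(ani):
    """Place an ANI value relative to the 95-96% boundary range."""
    low, high = ANI_BOUNDARY
    if ani < low:
        return "different species"
    if ani > high:
        return "same species"
    return "borderline"


def same_species_by_dddh(dddh):
    """True when the dDDH value reaches the 70% species boundary."""
    return dddh >= DDDH_CUTOFF


# Invented identities (in %) for three matched fragments of two genomes.
ani = average_nucleotide_identity([98.0, 97.0, 99.0])
print(ani, interpret_ani(ani))
print(same_species_by_dddh(64.3))
```

Values falling inside the boundary range are reported as borderline rather than forced into one category, since a decision in that zone would normally be supported by additional evidence such as dDDH confidence intervals.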

2.4 Genome Phylogeny

As the genomes of more type strains become available, the generation of phylogenomic trees, either directly from whole-genome sequences or from groups of genes obtained from them, should be compulsory to better determine the phylogenetic relationships of the strains under study with other species of the genus. This is even more important in genera for which the 16S rRNA gene has been shown to be insufficient for species differentiation. In addition, the phylogenomic approach is able to provide a better taxonomic framework for genera and higher taxa. This has been shown in many recently published works, which have analyzed whole phyla and proposed several reclassifications within their orders, families, genera, and species. These analyses include works on whole phyla, such as Bacteroidetes (García-López et al. 2019; Hahnke et al. 2016) and Actinobacteria (Nouioui et al. 2018; Salam et al. 2020), as well as on families, genera, and species, such as Rhodobacteraceae (Simon et al. 2017), Micromonospora (Carro et al. 2018), and the Pseudomonas fluorescens complex (Garrido-Sanz et al. 2016).

Among the available methods for generating whole-genome phylogenies, two main approaches have been proposed: a) the use of core or conserved genes shared by the genomes, whose number can vary from fewer than one hundred to several thousand depending on the method applied, and b) the use of whole-genome sequences, based on amino acids or nucleotides. For the use of conserved genes, several approaches have been proposed, among them the up-to-date bacterial core gene (UBCG) method proposed by Na et al. (2018), which is freely available and increasingly used. This method defines a set of 92 core genes, conserved across all taxonomic ranks of Bacteria, to be concatenated, allowing standard comparisons regardless of the number of strains included in the analysis. Other methods currently applied include CSI Phylogeny (Kaas et al. 2014), a web server that identifies single nucleotide polymorphisms (SNPs) and infers a phylogeny based on their concatenated alignment, and M1CR0B1AL1Z3R (Avram et al. 2019), a web server that finds orthologous groups, aligns them, and generates the corresponding phylogeny. For the use of whole-genome sequences, several approaches have also been proposed. REALPHY was developed by Bertels et al. (2014), and the pipeline is freely available online; in this method, sequences are mapped against reference genomes with Bowtie 2, and the phylogenies are inferred with PhyML. More recently, another web server has been developed to generate whole-genome phylogenies, the Type (Strain) Genome Server (TYGS) (Meier-Kolthoff and Göker 2019), which is increasingly used, probably due to its user-friendly interface, although it is limited at present to the comparison of up to 20 genomes. The TYGS methodology is based on the Genome BLAST Distance Phylogeny (GBDP) method (Meier-Kolthoff et al. 2013a).
The number of methods is continuously increasing, and the methods themselves keep improving as technologies advance; online servers help to share these analyses, which are usually highly demanding in computing resources, with the research community, making them available to anyone with a computer and an internet connection.
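The core-gene approach relies on concatenating the alignments of individual genes into a single alignment from which the tree is inferred. The following sketch illustrates only that concatenation step, under the assumption that each gene has already been aligned separately; it is not the UBCG implementation, the function name and data layout are hypothetical, and padding a missing gene with gaps is just one possible convention.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one concatenated alignment.

    gene_alignments is a list with one dictionary per gene, mapping a
    strain name to its aligned sequence for that gene. A strain lacking
    a gene receives a run of gaps of the same width as that alignment.
    """
    strains = sorted({name for aln in gene_alignments for name in aln})
    parts = {name: [] for name in strains}
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences of one gene must have equal length")
        width = widths.pop()
        for name in strains:
            parts[name].append(aln.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Two invented gene alignments; strain B lacks gene 2 and strain C gene 1.
genes = [
    {"A": "ATG", "B": "ATA"},
    {"A": "CC--", "C": "CCGT"},
]
for strain, sequence in concatenate_alignments(genes).items():
    print(strain, sequence)
```

The resulting concatenated alignment would then be passed to a tree-building program; the choice of program and substitution model is outside the scope of this sketch.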

2.5 Genome Characterization: Where Should One Begin?

A large number of taxonomic papers that include the genome of the corresponding type strain have limited its use to dDDH or ANI calculations, without even generating a genome phylogeny. This is a pity, considering all the information that can be extracted from a genome sequence. We therefore list here some of the analyses that can be performed to improve the use of genomes in taxonomic manuscripts:

  • OGRI calculations: as shown before, several tools are available to calculate the relatedness of the studied strains with close relatives in the genus or family.

  • Construct the genome phylogeny: even if the genomes of some of the species of interest are not available, the genome phylogeny should still be constructed, as the 16S rRNA gene is not enough to resolve closely related strains in many genera, and the phylogeny will give a better idea of the position of the strain. Ideally, the genomes of the closely related type strains should also be obtained, thus completing the genomic information available for the genus. In genera with a well-defined group of housekeeping genes for taxonomy, these genes can also be obtained directly from the genome to generate an MLSA phylogenetic tree that includes all the type strains for which these data are available.

  • Core and pangenome: the core genome refers to the genes that are shared by all known members of a taxonomic group without exception, while the pangenome refers to all the genes contained in all the strains belonging to the same taxonomic group. The pangenome includes the core genome and the accessory genome, which is not necessary for the survival of the species, and can be very large, with hundreds of strains probably needed to complete one (Medini et al. 2005). Both concepts, core and pangenome, are important from an evolutionary point of view and should be analyzed when a relevant number of strains are known for the same species.

  • Biosynthetic gene clusters determination: some tools have been developed to determine, from genome information, the potential capacity of a strain to produce antibiotics or other secondary metabolites. Among them, antiSMASH has been gaining attention and is increasingly used to give a general idea of the potential activity of a new isolate. The first version of this web-based tool was released in 2011 (Medema et al. 2011), and it is now in version 5.0 (Blin et al. 2019). Another tool to look for biosynthetic gene clusters is ABC (Atlas of Biosynthetic gene Clusters), developed by the Joint Genome Institute and available in the IMG (Integrated Microbial Genomes) database (Hadjithomas et al. 2015). This tool is based on predictions for all the genomes available in IMG, and an updated version has recently been released (Palaniappan et al. 2019).

  • Ecological and phenotypic analysis: genome sequences are full of information that helps to better characterize new taxa. However, it is sometimes difficult to decide what to look for or how to find it. Several approaches can be used. For example, we can first decide on a series of characteristics of interest, look for previously described genes with those functions, and then search for homologous genes within the genome. Another approach is to examine the annotated genes of our genomes as a whole and decide which of them should be further characterized or analyzed. Among all the information that can be analyzed in a genome, two main types of genes should be addressed in the description of new species: the genes related to the ecological role of the studied microorganism (adaptation to environmental conditions, interaction with closely related organisms, etc.) and the genes related to its phenotypic abilities. Frequently, when the ability of a microorganism to use a carbon source or produce a specific compound is tested, different results are observed among laboratories and even within the same laboratory (Riesco et al. 2018). Genome analysis, on the other hand, makes it possible to determine the presence or absence of specific genes encoding the production of specific compounds by a strain, although laboratory tests sometimes fail to show a positive result because specific conditions are needed or because the genes are nonfunctional. Several tools and approaches can be used to determine or study those characteristics, such as the SEED viewer, an intuitive and user-friendly platform (Overbeek et al. 2014), used after genome annotation with the RAST server (Aziz et al. 2008).
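The core genome, pangenome, and accessory genome described in the list above map directly onto set operations once the gene families present in each strain are known. The sketch below assumes that orthologous gene families have already been identified and are represented by shared identifiers; the strain names, family names, and function name are invented for the example.

```python
def core_and_pangenome(families_by_strain):
    """Return the (core, pangenome, accessory) sets of gene families.

    families_by_strain maps each strain name to the collection of gene
    family identifiers found in its genome.
    """
    gene_sets = [set(families) for families in families_by_strain.values()]
    if not gene_sets:
        return set(), set(), set()
    pangenome = set().union(*gene_sets)    # families found in any strain
    core = set.intersection(*gene_sets)    # families found in every strain
    accessory = pangenome - core           # families missing from some strain
    return core, pangenome, accessory


# Hypothetical gene family content of three strains of one species.
strains = {
    "strain_1": ["rpoB", "gyrB", "recA", "nifH"],
    "strain_2": ["rpoB", "gyrB", "recA", "chiA"],
    "strain_3": ["rpoB", "gyrB", "recA"],
}
core, pan, accessory = core_and_pangenome(strains)
print(sorted(core))       # ['gyrB', 'recA', 'rpoB']
print(sorted(accessory))  # ['chiA', 'nifH']
print(len(pan))           # 5
```

As more strains are added, the core set can only shrink or stay the same while the pangenome can only grow or stay the same, which is why a relevant number of strains is needed before either is considered well defined.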

3 Metagenomic Analyses: Do They Fit in Classical Taxonomy?

The latest sequencing technologies have led to a huge step forward in microbial ecology studies, allowing the in situ characterization and identification of millions of bacteria that were never detected by classical isolation methods. Nevertheless, metagenomic results depend enormously on taxonomy, as we are only able to identify organisms that are properly described in the literature, while the rest add to the growing “microbial dark matter.” On the other hand, the classification of those organisms is in conflict with classical taxonomy, which requires culturable microorganisms in order to apply polyphasic taxonomy and to deposit strains in two independent culture collections, as needed to fulfill the rules of the International Code of Nomenclature of Prokaryotes (ICNP) for name validation.

Whitman proposed in 2015 the use of genome sequences as the type material for taxonomic descriptions of prokaryotes (Whitman 2015), an article that have generated a debate among the taxonomists of the twenty-first century. As previously exposed, whole-genome sequences have induced a huge evolution in microorganisms classification and characterization in very few years, a fact that none has discussed. However, the absence of a whole organism that could be maintained and all its information reproduced generate uncertainties for the definition of new species, which concern a good number of researchers. At that time, Whitman proposed the possibility to deposit DNA in public collections that should be based on either a clonal population or a single cell. The change of the code to allow gene sequences as type material for the description of prokaryotic species was proposed by Whitman the following year in a taxonomic note (Whitman 2016). In this proposal, he remarks the importance of naming the prokaryotic diversity to allow the communication among researchers from different fields without misidentifications and justified the change in the code to allow the validation of the names from Candidatus taxa, the way in which are known uncultivated microorganisms that could not be validated since their existence is only known based on genome or genes amplification. Many taxonomic groups have been described as Candidatus in the last few years thanks to metagenomic analyses; however, the names proposed have not priority according to the Bacteriological Code of Nomenclature, and therefore, if a strain of these taxa is isolated and described could be given a completely different name. This situation generate two problems, the absence of a list of Candidatus names with the corresponding sequences where it can be checked if they were already described and the generation of different names to classify the same organisms. 
Nevertheless, this problem could be easily solved without introducing the validation of genomes or genes as type material (with the risks that this entails, as we will discuss later), simply by giving priority to Candidatus names, a proposal that has already been made to the International Committee on Systematics of Prokaryotes (ICSP). In this way, uncultivated organisms would carry the Candidatus prefix before the proposed name until a strain of the taxon is cultivated, at which point the proposed species would be incorporated into the standard taxonomy under the same name.

A different proposal was made by Konstantinidis and colleagues in 2017 (Konstantinidis et al. 2017), who suggested creating an independent nomenclature for not-yet-cultivated taxa and proposed a series of standards and guidelines for the description of these taxa, based on genome sequences obtained from single-cell amplification or population binning that would be used as type material. They also proposed that these taxa have their own list of validly published names. Although they proposed some minimal quality standards, they recognized that “metagenomes typically constitute a mosaic of different genotypes of a single population coexisting in the same environment,” which generates some uncertainty about the species that would be described.

Several concerns about the inclusion of gene or genome sequences as type material have been raised by other taxonomists, including quality control, the limited availability of the original source for reproducing the results (indeed, some already described species with valid names are no longer considered as types because the culture was lost at some point), the complexity of maintaining DNA, and the difficulties for culture collections in distributing the DNA to other researchers so that experiments can be repeated. Moreover, a genome sequence could be artificially generated, and even if we trust the integrity of researchers, chimeras beyond their control could be described as real diversity. Some of the arguments against this proposal are summarized in the article by Bisgaard et al. (2019), including:

  • DNA material could be damaged or lost, and as it is not a proliferating material, data will not be reproducible.

  • Species descriptions will need to be revised as their DNA sequences are replaced by new versions with the development of sequencing and assembly technologies.

  • Functional assessments for genomes are limited.

  • Genomic data do not always agree with gene expression, which could generate errors when establishing taxonomic relationships.

  • Minimal standards for new taxa descriptions will be difficult to define, which could lead to a large number of new taxa descriptions based on single DNA sequences, generating taxonomic and nomenclatural chaos.

  • Motivation for the isolation and phenotypic characterization of strains will decrease, and therefore, the study of intra- and interspecies diversity will be reduced.

Similar arguments were also presented by Zamora et al. (2018) to highlight the problems of allowing DNA sequence data as type material in fungal taxa and to make evident the consequences of accepting the proposals to amend the International Code of Nomenclature for algae, fungi, and plants (ICN). They argue that when DNA is used as a type, it serves only as information about one character of an organism rather than as the organism itself, which severely limits characterization; names should be given to organisms, not to their characters. In addition, a major concern raised is the reliability of the DNA sequence data and the lack of a proper method to verify them, which could lead to irreproducible science.

The proposal of a different nomenclature for not-yet-cultivated taxa was also discussed by some authors. Oren and Garrity expressed their concerns by showing several examples of how the unregulated naming of taxa has previously led to chaos (Oren and Garrity 2018), while Overmann et al. argued that, in addition to the arguments previously presented against genomes as type material (technological and conceptual limitations), confusion will be unavoidable if two different nomenclatures are created without links that prevent the generation of synonyms (Overmann et al. 2019). The limited phenotypic information that would be available for not-yet-cultured microorganisms is another point of concern. These authors favor giving priority to Candidatus names when those microorganisms are described with sufficient morphological-cytological, metabolic, and ecological traits to clearly distinguish them from other taxa, to be informative, and to prompt new and more successful cultivation attempts based on the generated information (the so-called “reverse metagenomics”).

According to Konstantinidis et al. (2020), the advantages of using genome sequences as type material outweigh the concerns. In their latest correspondence letter to Environmental Microbiology, they argue that the use of genomes will not weaken the standards of prokaryotic taxonomy, in response to authors who have expressed concerns such as “who will take the time to grow and deposit their strains if a genome sequence is valid.” In addition, they argue that only high-quality genomes should be used as type material, which would avoid the future revision of sequences discussed by Bisgaard et al. (2019).

However, it seems that for the moment, many taxonomists are not convinced, and the type material, which is expected to remain continuously available to researchers, will still be a culturable type strain. Nonetheless, this does not prevent the proposed change of priority in the Code of Nomenclature for Prokaryotes: noncultivable bacteria identified by metagenomics could be proposed as new species and named Candidatus, and that name should be maintained once a strain of the Candidatus taxon is finally isolated and cultivated. This simple solution would allow the generation of a whole new taxonomy based on “uncultivable” strains, whose names could be valid while awaiting the development of the capacity to grow them under laboratory conditions.