Introduction

Prokaryotic taxonomy is pragmatic and evolves gradually as technological advances lead to the discovery of more and more organisms. In the late nineteenth century, bacterial strains were delineated using only phenotypic properties (Cohn 1872), which were soon found insufficient to classify the diverse microorganisms that were subsequently isolated. Hence, physiological, chemotaxonomic and biochemical properties of bacteria were included in the bacterial classification system (Orla-Jensen 1909; Buchanan 1955). This approach was challenged in the 1970s and then strengthened by the inclusion of numerical taxonomy, DNA–DNA hybridization (DDH) (Brenner et al. 1969) and the introduction of polyphasic taxonomy (Colwell 1970). Current taxonomic schemes are based on the polyphasic approach, which uses the 16S rRNA gene and other genomic information such as DDH to measure evolutionary relationships and determine the phylogenetic position of an isolate, and which is supplemented with phenotypic, chemotaxonomic and physiological properties to assess species novelty (Prakash et al. 2007).

The polyphasic approach, despite being widely used today, is in need of rectification due to (1) ambiguity in precise taxonomic assignments in light of genome data analysis, or inconsistency between taxonomic assignments and genomic data, (2) difficulty in classifying closely related strains and (3) discrepancies in the assignments of some organisms, which do not agree with their original assignments. Another major challenge facing present-day prokaryotic taxonomy is the goal of covering a huge landscape of more than 10^6 bacterial and archaeal species (Yarza et al. 2014). So far, taxonomists have characterized only ~15,000 validly named species (Sutcliffe 2015) and many organisms are yet to be validly assigned a taxonomic position. In line with these challenges, another factor which could be crucial is the limited use of genomic information in the taxonomic identification of novel genera or species being introduced. At the genetic level, a species generally includes strains with approximately >70% DNA–DNA relatedness, <5 °C ΔTm, <5 mol% G+C content difference and >97% 16S rRNA gene identity (Wayne et al. 1987). In addition, a recent genome-level study has shown an intra-genus variation of up to 8% in %G+C content (Kumar et al. 2017). Recently, various genomic data analysis tools based on whole genome sequences such as Average Nucleotide Identity (ANI), Average Amino acid Identity (AAI) and in silico Genome to Genome Distance Hybridization (GGDH) have provided scientifically valid taxonomic standards for classifications. Thus, it is anticipated that taxonomy will be more steadily dependent on genomic signatures (Thompson et al. 2009, 2011) rather than relying only upon classical polyphasic characterization. This review focuses on a recent taxonomic workflow linking classical taxonomy with advances in genomics to supplement bacterial systematics. Although genomics data have been applied to microbial systematics for about a decade, the present review highlights recently described methods to accommodate the rapidly expanding field of genomics to classify microbial diversity.

Limitations of traditional taxonomy practices

While traditional methods have been reviewed from time to time, it is necessary to mention the limitations in brief in order to describe newer strategies for bacterial classification. During the past 30 years, microbiologists have relied mostly on single gene sequence information, namely the 16S rRNA gene sequence, for microbial classification. This universal prokaryotic gene is considered to behave as a molecular chronometer (Woese 1987; Vandamme and Peeters 2014; Yarza et al. 2014) and it allows analysis of phylogenetic relationships among distant taxa. The primary reason for its widespread use in the polyphasic approach is the universal presence of the 16S rRNA gene in all bacteria and archaea. The limitations include multiple copy numbers and intra-genomic sequence differences (2–5%) of this gene in some organisms (Schmidt et al. 2001; Acinas et al. 2004; Větrovský and Baldrian 2013; Kim et al. 2014). In addition, it has also been reported that the 16S rRNA gene has low phylogenetic resolution at the species level and poor resolving power for some genera due to its highly conserved nature (Jaspers and Overmann 2004; Hahn et al. 2016).

Despite this ambiguity, sequence variation in variable regions of the 16S rRNA gene sequence provides sufficient diversification to be considered for phylogenetic delineation of different taxa, while conserved regions are used as targets to design primers for polymerase chain reaction or hybridization probes. Apart from the 16S rRNA gene, other genes/genomic regions have also been used as marker genes for the study of taxonomic identification (Sharma et al. 2016b). The 16S-23S rRNA gene internal transcribed spacer sequence has been used to distinguish between Mycobacterium spp. and was found to be useful for species that are not distinguishable by 16S rRNA gene sequences (Roth et al. 1998). Similarly, 23S rRNA gene sequences have been helpful in distinguishing among Streptococcus spp. (Kotilainen et al. 2006). Other genes such as the citrate synthase gene in the genera Bartonella and Rickettsia and heat shock proteins in mycobacterial species have been used to define taxonomic relationships (Pai et al. 1997; Zeaiter et al. 2002; Lee et al. 2003; Fournier et al. 2003; Lassance et al. 2010; Verbeke et al. 2011).

For species delineation, another commonly used technique is DDH, which measures the genetic identity between pools of DNA of different strains. It provides the genetic distance between two organisms based on DNA hybridization percentage. Compared with the traditional radioactive method of DDH determination, a far better technique exploits the microtiter assay for DNA hybridization, which uses fluorimetry to estimate DNA–DNA relatedness using SYBR Green I (Ezaki et al. 1989) and quantitatively determines fluorescence at increasing temperatures using a Real-Time PCR thermal cycler (Gonzalez and Saiz-Jimenez 2005; Rosselló-Móra 2006; Tindall et al. 2010). However, even after such refinements, DDH methods are not free from limitations. For instance, the cut-off of ≥70% has not been applied consistently to all bacterial genera. In the case of Rickettsia species, for example, a DDH of 70% would not distinguish Rickettsia rickettsii, Rickettsia conorii, Rickettsia sibirica and Rickettsia montanensis (Drancourt and Raoult 1994; Sentausa and Fournier 2013). Besides, this technique is based on comparative estimation; hence, no incremental database can be created. Additionally, this technique requires special facilities, is labor intensive and expensive, and lacks reproducibility (Prakash et al. 2007).

Thus, it is evident from the above discussion, that characterization of bacteria cannot be based solely on approaches such as 16S rRNA gene sequence comparison and DDH values. However, to date, complete genome sequences and improved genome annotations have not yet resulted in reliable predictions of metabolic and chemotaxonomic features, as the ability to deduce the chemotaxonomic properties from genomic data is still in its infancy (Sutcliffe and Trujillo 2012). Thus, in the present scenario, we need an integrative method that employs the best aspects of the traditional polyphasic approach with genomic data to infer systematic relationships.

Coupled with advances in sequencing technologies and the availability of a large number of genome sequences (draft and complete), the application of computational tools has provided a further impetus to establish taxonomic schemes based on the evolutionary information contained in genome sequences. In many review articles (Klenk and Göker 2010; Thompson et al. 2011; McDonald et al. 2012), it has been projected that microbial taxonomy will be steadily more dependent on genome sequences rather than relying on the classical polyphasic approach, as will be described in this review. The use of genomic data is not new, and genomic information can be easily harnessed to study inter- and intra-species relationships using concepts such as Karlin genomic signatures, AAI and in silico GGDH (Thompson et al. 2013). With these genome comparison concepts, we can now overcome the limitations posed by the traditional polyphasic approach. In addition, sequencing technologies have now become relatively affordable, easing their use in routine microbial identification (Loman and Pallen 2015).

Several research studies (Auch et al. 2010a, b; Meier-Kolthoff et al. 2013) and symposia discussions targeted at microbial taxonomy and systematics have emphasized the study of genomics to resolve and overcome the limitations of the traditional polyphasic approach (Thompson et al. 2011; Větrovský and Baldrian 2013; Pillonel et al. 2015; Sangal et al. 2016). The topic was also vigorously debated at the Bergey’s International Society for Microbial Systematics symposium held on September 12–16, 2016 in Pune, India (http://bismis.org/?cm=body_bismis2016). The prevailing view among scientists engaged in the taxonomy of prokaryotes is that it is only a matter of time until the current microbial taxonomy scheme moves towards genomic inferences (Bernard et al. 2010; Chun and Rainey 2014). As mentioned above, this shift is primarily due to limitations in the classical polyphasic taxonomy and the constant development of newer sequencing technologies and computational tools (Sangal et al. 2016).

Emergence of high throughput sequencing techniques

Advancements in next-generation sequencing technologies in the last decade have enabled easy access to economically feasible sequencing platforms. As a result, thousands of bacterial genome sequences (more than 87,400) are now available in the public database (https://www.ncbi.nlm.nih.gov/genome/browse/). However, most of these sequences do not represent type strains. Realizing this gap and exploiting the advantage of next-generation sequencing (NGS) technologies, several efforts are being made to sequence the type strains. For instance, the Genomic Encyclopaedia of Bacteria and Archaea (GEBA) was launched in 2007 (Wu et al. 2009) by the Joint Genome Institute with the aim of sequencing 250 type strains from branches of the tree of life with low sequence representation. Since then, two phases of this project have been successfully completed (Kyrpides et al. 2014). The third phase of the GEBA sequencing project is currently under way and encompasses the genome sequencing of soil- and plant-associated bacterial type strains, including newly characterized type strains (Whitman et al. 2015). In any case, we believe that sufficient supporting genomics data should be made available to perform taxonomic characterization.

Minimum computational tools and reliable information required for genome annotations

As mentioned before, the emergence of cost-effective, high-throughput DNA sequencing technologies including Illumina, Solexa, Ion Torrent, Single Molecule Real Time (SMRT, PacBio) and Oxford Nanopore has made it possible to sequence bacterial or archaeal genomes even in general microbiology laboratories. Further advancement of these sequencing technologies coupled with lower sequencing cost has brought genomics to the forefront of modern microbial taxonomy. A new term, taxo-genomics, was coined by Ramasamy et al. (2014) to describe this approach. Taxo-genomics, as highlighted in many reports, will not be a complete analysis of the genome(s) but will depend on a minimum set of bioinformatics tools that can be easily handled and can at the same time provide sufficient information to resolve the taxonomy of a particular strain from draft or complete genomes (Dunlap et al. 2016; Kumar et al. 2015). Based on the currently available sequencing technologies and computational tools, we propose that taxo-genomics should include the use of comparative genomics methodologies, primarily Multi Locus Sequence Typing (MLST) (Miyoshi-Akiyama et al. 2013), Genome to Genome Distance Calculator (GGDC) (Meier-Kolthoff et al. 2013), Average Nucleotide Identity (ANI), Average Amino Acid Identity (AAI) (Konstantinidis and Tiedje 2005), Tetranucleotide frequency (Teeling et al. 2004), Codon Usage Bias (Ran et al. 2014), Pan genomic Analysis (Tettelin et al. 2005) and synteny analysis. However, this concept will need to be continuously evaluated, as we understand that this scenario will change rapidly due to increasing developments in sequencing technologies and computational tools. The use of the above-mentioned minimum tools for taxo-genomics is described below.

Multi locus sequence typing/analysis

Multi-locus sequence typing (MLST) can be used to overcome the limitations of 16S rRNA gene methods and delineate closely related strains (showing >99% 16S rRNA gene sequence identity). This approach was used to characterize bacterial pathogen variants at the sub-species level as early as 1998, when eleven housekeeping gene alleles were employed to reliably identify the major meningococcal lineages associated with invasive disease among Neisseria meningitidis (Maiden et al. 1998). These gene loci were compared and, based on the variation observed, each distinct sequence at a locus was given an allelic identifier. The combination of alleles present across the loci was designated as a sequence type, and the relationship between isolates was then inferred from a comparison of these allelic profiles (Maiden et al. 1998).
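The allele and sequence-type bookkeeping described above can be illustrated with a minimal Python sketch. It is not the scheme of Maiden et al. (1998) or of any MLST database: the locus names, strain names and sequences are hypothetical, and identifiers are simply numbered in order of first appearance rather than looked up in a curated registry.

```python
from typing import Dict, Tuple

def allelic_profiles(seqs_by_strain: Dict[str, Dict[str, str]]) -> Dict[str, Tuple[int, ...]]:
    """Assign an integer allele identifier to every distinct sequence at each
    locus (numbered by first appearance) and return one profile per strain."""
    loci = sorted({locus for seqs in seqs_by_strain.values() for locus in seqs})
    allele_ids: Dict[str, Dict[str, int]] = {locus: {} for locus in loci}
    profiles = {}
    for strain, seqs in seqs_by_strain.items():
        profile = []
        for locus in loci:
            seq = seqs[locus].upper()
            known = allele_ids[locus]
            if seq not in known:
                known[seq] = len(known) + 1  # new allele at this locus
            profile.append(known[seq])
        profiles[strain] = tuple(profile)
    return profiles

def sequence_types(profiles: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Give every distinct allelic profile its own sequence type (ST) number."""
    st_ids: Dict[Tuple[int, ...], int] = {}
    result = {}
    for strain, profile in profiles.items():
        if profile not in st_ids:
            st_ids[profile] = len(st_ids) + 1
        result[strain] = st_ids[profile]
    return result

def shared_alleles(p1: Tuple[int, ...], p2: Tuple[int, ...]) -> int:
    """Number of loci at which two profiles carry the same allele."""
    return sum(a == b for a, b in zip(p1, p2))

# Hypothetical toy data: three strains typed at two loci.
toy = {
    "strain1": {"locusA": "ATGC", "locusB": "GGTA"},
    "strain2": {"locusA": "ATGC", "locusB": "GGTT"},
    "strain3": {"locusA": "ATGC", "locusB": "GGTA"},
}
profiles = allelic_profiles(toy)
types = sequence_types(profiles)
print(profiles, types)
```

In this toy data set strain1 and strain3 share the same profile and therefore the same sequence type, whereas strain2 differs at one locus.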

A variation of MLST known as Multi Locus Sequence Analysis (MLSA) was developed to decipher phylogenies based on the concatenated sequences of various protein-coding housekeeping genes. Although the application of the 16S rRNA gene provides taxonomic resolution at the species level or above, integration of MLSA into prokaryotic taxonomy provides the additional advantage of assignment of taxonomic status at the sub-species level. Studies using MLSA (McTaggart et al. 2010; Glaeser and Kämpfer 2015) employ at least eight protein-coding genes for species delineation. Using the MLST approach, we have proposed several novel species having high 16S rRNA gene sequence identity to their neighbours e.g. Acinetobacter indicus A648T (97.6%) (Malhotra et al. 2012); “Thermus parvatiensis” RLT (99.5%) (Dwivedi et al. 2015) and Fictibacillus halophilus AS8T (99.9%) (Sharma et al. 2016b). However, there is a need to develop consensus among taxonomists to decide a universal set of genes to be employed in MLSA to accurately determine phylogenetic relationships among different species. For MLST, the following methods have been used:

  1. rMLST: This method predicts species based on the use of 53 ribosomal genes (Jolley et al. 2004).

  2. TaxonomyFinder: This method involves species prediction through the use of the proteome specific to taxonomic groups and incorporates data from three databases, namely PfamA, TIGRFAM and Superfamily, to cluster homologous proteins into protein families. This software is freely available at http://cge.cbs.dtu.dk/services/TaxonomyFinder/ (Lukjancenko et al. 2013).

  3. AMPHORA2: This tool uses 31 bacterial and 104 archaeal protein-coding marker genes for phylotyping purposes. It is freely available at https://pitgroup.org/amphoranet/ (Wu and Scott 2012).

Average nucleotide identity (ANI) and average amino acid identity (AAI)

With the wide availability of bacterial genome sequences, the gold standard for identifying genome relatedness, DDH, has been superseded by the more reproducible, fast and easy to implement overall genome relatedness index (OGRI) methods. ANI is the most widely implemented OGRI algorithm for identification and measurement of overall genomic relatedness between two strains (Beaz-Hidalgo et al. 2015; Li et al. 2015a, b; Rosselló-Móra and Amann 2015; Yi and Chun 2015; Lee et al. 2016a, b). For taxonomic identifications, BLASTN-based rather than MUMmer-based ANI has been successfully employed so far (Ramasamy et al. 2014).

Konstantinidis and Tiedje proposed the use of ANI and AAI for comparative genomic analyses, especially in relation to taxo-genomics, as early as 2005 (Konstantinidis and Tiedje 2005). Both of these approaches can be used to delineate inter-genomic distance between closely related prokaryotic species. A closer 16S rRNA similarity score (>98%) has been predicted (Goris et al. 2007), wherein ANI has been applied as more of a traditional tool for bacterial taxonomy. Average Nucleotide Identity by Orthology (OrthoANI) was introduced later, when an analysis using a total of 14,745 genome sequences (representing members of 10 genera) was conducted. A total of 63,690 genome pairs were analyzed and it was found that 55% of these pairs exhibited over 0.1% discrepancy and 1101 pairs showed more than 1% discrepancy between reciprocal ANI values (Lee et al. 2016a, b). This level of discrepancy between reciprocal ANI values is significant enough to affect subsequent taxonomic interpretation, as an ANI value of approximately 95–96% is considered the species boundary (Goris et al. 2007; Richter and Rosselló-Móra 2009; Chun and Rainey 2014).

Use of AAI has also been demonstrated to correlate with MLST. Konstantinidis and Tiedje (2005) studied the relationships among 175 strains using complete genome sequences based on the shared gene content and AAI. In this study, conserved gene content was predicted using a two-way BLAST-based algorithm (tBLASTN). By comparing the result of 16S rRNA- and AAI- based tree constructions, they concluded that AAI offered better resolving power within species and is a good option for phylogenetic comparisons. In fact, this approach has a major role to play in taxo-genomics (Konstantinidis and Tiedje 2005). In any case, it is proposed that rather than using the mean of nucleotide identity values between fragments of the query strain and the genome of the subject strain as a measure of genome relatedness, the mean of two reciprocal ANI values should be employed to measure the genome relatedness as described by Lee et al. (2016a, b).
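The recommendation to use the mean of the two reciprocal ANI values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-fragment identities are taken as already computed by an external aligner, the function names are hypothetical, and the default boundary of 95% is the lower end of the 95–96% range quoted above, not a fixed rule.

```python
from statistics import mean
from typing import Sequence

def directional_ani(fragment_identities: Sequence[float]) -> float:
    """One-way ANI: mean percent identity of the query fragments that aligned
    to the subject genome (identities are assumed to be precomputed)."""
    if not fragment_identities:
        raise ValueError("no aligned fragments")
    return mean(fragment_identities)

def reciprocal_mean_ani(ani_a_vs_b: float, ani_b_vs_a: float) -> float:
    """Symmetric relatedness value: mean of the two reciprocal ANI values."""
    return (ani_a_vs_b + ani_b_vs_a) / 2.0

def reciprocal_discrepancy(ani_a_vs_b: float, ani_b_vs_a: float) -> float:
    """Absolute difference between the two reciprocal ANI values."""
    return abs(ani_a_vs_b - ani_b_vs_a)

def within_species_boundary(ani: float, boundary: float = 95.0) -> bool:
    """True when ANI reaches the chosen species boundary (assumed 95%)."""
    return ani >= boundary

# Hypothetical fragment identities for the two search directions.
a_vs_b = directional_ani([96.0, 95.0, 97.0])
b_vs_a = directional_ani([94.0, 95.0, 96.0])
print(reciprocal_mean_ani(a_vs_b, b_vs_a),
      reciprocal_discrepancy(a_vs_b, b_vs_a))
```

With a one-way value the call on species membership could differ depending on which genome is used as the query; averaging the two directions removes that dependence.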

Genome-to-genome distance calculator as replacement of DDH

The recent advent of in silico genome-to-genome comparison endeavours to replace the cumbersome DDH procedure. Genome-to-Genome Distance Calculator (GGDC) (Auch et al. 2010a) is a procedure used to calculate inter-genomic distances for outlining species relationships. GGDC is based on the Genome Blast Distance Phylogeny (GBDP) program (Henz et al. 2004), which can be used for genome-based species and subspecies delineation. Comparisons among genomes are made pair-wise using alignment programs of GGDC such as high-scoring segment pairs (HSPs) or maximally unique matches (MUMs) (Auch et al. 2010b). Sequences can be directly uploaded on the GGDC web page available on the DSMZ website (http://ggdc.dsmz.de/). The user can choose the sequence similarity search tool from a list of algorithms including BLAST+, NCBI-BLAST, BLAT, BLASTZ, WU-BLAST and MUMMER (Meier-Kolthoff et al. 2013). Following this, distances are deduced based on score calculations. Finally, these values are converted into DDH values so as to make them comparable with laboratory ranges of DDH values. As with DDH, the species delineation cut-off with dDDH (digital DDH) is also 70% (Meier-Kolthoff et al. 2013). Additionally, even subspecies can be delineated using dDDH values for which a cut-off of 79% has been considered appropriate (Meier-Kolthoff et al. 2014a).

GGDC has been used to deduce DDH values in silico for the reclassification of Desulfurococcus mobilis as Desulfurococcus mucosus and the reclassification of Desulfurococcus fermentans and Desulfurococcus kamchatkensis as Desulfurococcus amylolyticus (Perevalova et al. 2016). D. mobilis, D. fermentans and D. kamchatkensis had been characterized earlier using the polyphasic approach, but once their genome sequences were available, their classifications were re-assessed based on DDH values generated using GGDC. It was shown that reassociation values were well above 70% and hence their definitions were revised accordingly. Similarly, Methanocaldococcus bathoardescens has been successfully designated a novel species using purely in silico tools such as ANI, GGDC and synteny analyses (Stewart et al. 2015). ANI and GGDC were also used as replacement tools for DDH in the classifications of Paracoccus sanguinis (McGinnis et al. 2015) and Thermodesulfobium acidiphilum (Frolov et al. 2017). In another study (Lagkouvardos et al. 2016), the mouse gut metagenomic diversity was established and taxonomically assigned using purely in silico tools based on GGDC and %G+C difference analysis. In a recent study, “T. parvatiensis” was distinguished from its closely related neighbour Thermus thermophilus strains (>99% 16S rRNA gene identity) by employing GGDC, ANI, pan-genome and MLST (Tripathi et al. 2017). In various other studies, GGDC has been used for delimitation of both species and subspecies.

Recent additions to the online GGDC 2.1 calculator include the phylogeny pipeline for calculation of taxonomic relationships using individual genes by estimation of pairwise similarities. Another feature available is the %GC difference calculator (Meier-Kolthoff et al. 2014b), which is based on the idea that two strains of a single species cannot differ in %GC by more than one percent. GGDC has also been shown to provide accurate DDH values even from draft genomes (representing 97–99% of the genetic information) (Meier-Kolthoff et al. 2013) using Formula 2 (identities/HSP length). Formula 2 (also the recommended formula) calculates GGDC irrespective of the total length of the genome, whereas the other two formulas, Formula 1 (HSP length/total length) and Formula 3 (identities/total length), depend on the length of the genome and hence must be used only for complete genomes. In addition, GGDC calculates confidence intervals (both model-based and resampling-based/bootstrapping) for each pairwise genome comparison (Meier-Kolthoff et al. 2013) and delivers a value on the same scale as DDH, thus making comparison easy. Hence, instead of performing inconvenient and less reliable wet-lab DDH experiments, there is a growing demand to shift to genome-to-genome distance (GGD) estimations that are rapid, convenient, reliable and accurate.
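The three ratios named above can be written out as a short Python sketch. It only reproduces the ratios as stated in the text and the 70% and 79% cut-offs quoted earlier. It does not reproduce the conversion of these ratios into dDDH values performed by GGDC, and both the `HSP` record and the meaning of `total_length` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HSP:
    """One high-scoring segment pair (hypothetical minimal record)."""
    length: int       # aligned length of the HSP
    identities: int   # identical positions within the HSP

def ggdc_ratios(hsps: List[HSP], total_length: int) -> Dict[str, float]:
    """Ratios behind Formulas 1-3 as described in the text.
    total_length is assumed to be the genome length used for normalisation."""
    hsp_length = sum(h.length for h in hsps)
    identities = sum(h.identities for h in hsps)
    if hsp_length == 0 or total_length <= 0:
        raise ValueError("need at least one HSP and a positive total length")
    return {
        "formula_1": hsp_length / total_length,   # HSP length / total length
        "formula_2": identities / hsp_length,     # identities / HSP length
        "formula_3": identities / total_length,   # identities / total length
    }

def interpret_dddh(dddh_percent: float) -> str:
    """Apply the cut-offs quoted in the text: 70% species, 79% subspecies."""
    if dddh_percent >= 79.0:
        return "same species, same subspecies"
    if dddh_percent >= 70.0:
        return "same species, different subspecies"
    return "different species"

# Hypothetical HSPs from one pairwise genome comparison.
ratios = ggdc_ratios([HSP(1000, 950), HSP(500, 450)], total_length=3000)
print(ratios, interpret_dddh(72.5))
```

Only `formula_2` is independent of `total_length`, which is why it remains usable when part of a draft genome is missing.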

Tetranucleotide frequency

In addition to ANI and AAI, tetranucleotide frequency analysis, based on the differences between two genomes in the frequency of occurrence of oligonucleotides of length four built from the four nucleotides (A, T, G, C), can also be employed to classify a given pair of organisms or to supplement other data (Teeling et al. 2004; Kim et al. 2014). Here, the frequencies of all 256 possible tetranucleotides are calculated for each genome sequence. This alignment-free method has been shown to correlate well with ANI. These methods are more reliable as they consider the whole genome, in contrast to the 16S rRNA gene sequence and single-copy gene approaches like AMPHORA2 (which uses 31 bacterial housekeeping genes and 104 archaeal essential marker genes). This approach acquires major significance when taxonomic marker genes are absent from the genome of a particular organism (Alsop and Raymond 2013). Moreover, these signatures have increased the taxonomic resolving power by differentiating even very closely related (intra-specific) organisms. In our opinion, ANI, AAI and tetranucleotide frequency data can also be merged with codon usage bias and codon preference for bacterial systematics, as described below.
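A minimal Python sketch of the idea is given below. It counts the 256 tetranucleotides in each sequence and compares the two frequency vectors with a Pearson correlation. This is a simplification and not the method of Teeling et al. (2004): it uses raw frequencies on a single strand, with no correction for expected frequencies, and the input sequences are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import List

# All 4^4 = 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_frequencies(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide (overlapping windows);
    windows containing characters other than A, C, G, T are skipped."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation undefined for a constant vector
    return cov / sqrt(var_x * var_y)

# Hypothetical short sequences standing in for whole genomes.
freq_a = tetra_frequencies("ACGTACGTAC")
freq_b = tetra_frequencies("ACGTTCGTAC")
print(pearson(freq_a, freq_b))
```

A correlation close to 1 indicates similar tetranucleotide usage. On real data the vectors would be computed from complete or draft genome sequences rather than short strings.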

Codon usage bias and codon preference

Codon usage bias refers to the difference in the frequency of occurrence of synonymous codons in coding DNA. Alternatively, it can be said that in a wide variety of organisms belonging to a particular species, synonymous codons are used with different frequencies, a phenomenon which is known as codon bias (Hershberg and Petrov 2008; Lal et al. 2016).

In prokaryotes, a wide range of factors is responsible for such selective preference for a particular codon, including gene expression level and percent GC composition. Codon bias is substantially correlated with the level of gene expression, with the strongest influence on the highly expressed genes (Gouy and Gautier 1982; Akashi 2003). Many studies have correlated the expression of genes with selected codon synonyms in organisms like Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli for optimized translation (Stenico et al. 1994; Akashi 1996; Ghaemmaghami et al. 2003; Fraser et al. 2004; Goetz and Fuglsang 2005). The influence of codon usage bias on encoded amino acids is associated with the %GC content, with the strongest effects on regions with high GC content (Li et al. 2015a, b). %GC content is probably determined mostly by genome-wide processes and less by specific selective forces on coding regions (Sueoka 1962). However, amino acid conservation, protein stability, mutational bias for the leading and lagging strands and extensive HGT under selective pressure are some other factors responsible for codon bias (Hershberg and Petrov 2008). Sharp et al. (1986) proposed Relative Synonymous Codon Usage (RSCU), which indicates the most frequently used codons for a specific amino acid relative to its other synonymous codons. However, this estimation has an inherent limitation in distinguishing two closely related organisms, as the demarcation between them rests solely on synonymous codons. With the advancement of computational tools, it is now easy to generate and compare codon usage bias even in closely related organisms in the form of codon usage bias tables. Commonly used genetic codon frequency tables in different expression host organisms are available online as a database (http://www.genscript.com/cgi-bin/tools/codon_freq_table) for the identification of new species (Nakamura et al. 2000).
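RSCU can be computed with a short Python sketch. It uses the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The standard genetic code is assumed, the input is treated as an in-frame coding sequence, and the example sequence is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def rscu(cds: str) -> Dict[str, float]:
    """Relative synonymous codon usage for an in-frame coding sequence.
    Amino acids absent from the sequence and stop codons are omitted."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values: Dict[str, float] = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in synonymous)
        if total == 0:
            continue
        expected = total / len(synonymous)    # equal usage of all synonyms
        for codon in synonymous:
            values[codon] = counts[codon] / expected
    return values

# Hypothetical sequence encoding four alanine residues.
example = rscu("GCTGCTGCTGCC")
print(example)
```

An RSCU above 1 marks a codon used more often than expected under equal usage, and a value below 1 marks one used less often. Tables of such values from two genomes can then be compared codon by codon.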

Codon usage bias creates a pattern by selecting specific codons for an amino acid over others and can be specific for a gene, genes or genomes. However, this naturally occurring phenomenon can be measured and used for classification of a novel species as supplementary information as well as for comparative studies. Beyond the classical 16S rRNA method, Clusters of Orthologous Groups (COG) within species can be verified for codon usage bias to distinguish them in different species. Recently, taxonomists have been able to resolve the longstanding questions about codon bias and this pattern has been linked with the protein synthesis hypothesis and role in prokaryotic systematics (Plotkin and Kudla 2011; Wald et al. 2012; Ran et al. 2014; Babbitt et al. 2015).

Role of pan-genome in taxo-genomics

The pan-genome refers to the sum total of the gene sets in a clade and is composed of both the core and the variable or accessory genome. With the upcoming challenges faced by the classical techniques, pan-genomics seems to be an additional resource to address taxonomic questions. DNA–DNA reassociation studies, multi-locus (16S rRNA and single-copy genes) typing and a variety of single and multigene approaches derive conclusions only on the basis of the core genome, but ignore the accessory genome. Pan-genomic analysis was introduced by Tettelin et al. (2005) in a study where 8 genomes of Streptococcus agalactiae strains were used to obtain the patho-genome of the strains, which led to enhanced development of vaccines and understanding of the virulence markers. Additional studies have been conducted where metagenomic recruitments were used to determine the pan-genome. For instance, Salinibacter genomes were taxonomically placed using metagenome binning of saturated brine shotgun data (Pašić et al. 2009). Similarly, over 30 genomes belonging to Actinobacterium, Nitrosomonadales, Polynucleobacter, Chlorobium, Holophagales, Methylotenera and Desulfobulbus were phylogenetically differentiated using global alignments and metagenome data from a freshwater lake (Bendall et al. 2016). This approach was also employed to characterize genomes of Actinobacteria from brackish waters of the Caspian Sea (Mehrshad et al. 2015). Additionally, a recent metagenomic study identified the core and accessory pathogenomes of opportunistic Cellulosimicrobium cellulans species (Sharma et al. 2016a). Pan-genomics thus helps to determine the accessory genome content, which can be significant for delineating closely related species (Caputo et al. 2015). Future advances in bacterial systematics may also encompass pan-genomic information as a critical factor to delineate novel taxa.
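The split of a pan-genome into core and accessory parts reduces to set operations once genes have been grouped into families. The Python sketch below assumes that this clustering has already been done by an external tool. The strain names and gene family identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

def partition_pan_genome(
    families_by_genome: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pan, core, accessory) gene family sets.
    pan = union over all genomes, core = families present in every genome,
    accessory = pan minus core."""
    if not families_by_genome:
        raise ValueError("at least one genome is required")
    gene_sets = list(families_by_genome.values())
    pan = set().union(*gene_sets)
    core = set(gene_sets[0]).intersection(*gene_sets[1:])
    return pan, core, pan - core

def strain_specific(families_by_genome: Dict[str, Set[str]], strain: str) -> Set[str]:
    """Gene families found in one genome and in no other genome."""
    others = set().union(*(s for name, s in families_by_genome.items()
                           if name != strain))
    return families_by_genome[strain] - others

# Hypothetical gene family content of three strains.
genomes = {
    "strainA": {"fam1", "fam2", "fam3"},
    "strainB": {"fam1", "fam2", "fam4"},
    "strainC": {"fam1", "fam2", "fam3", "fam5"},
}
pan, core, accessory = partition_pan_genome(genomes)
print(sorted(pan), sorted(core), sorted(accessory))
```

The accessory and strain-specific sets carry the information that core-genome methods ignore.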

Importance of synteny analyses for identifying taxonomic relationships

The concept of including all available genetic information for taxonomic purposes was termed as “integrative taxonomy” (Will et al. 2005). Synteny analysis is one such approach that holds potential not only for taxonomic characterization of bacteria but also for studying the phylogenetic and evolutionary relationships among bacteria. Synteny (“same ribbon” in Greek) means comparison of the order of arrangement of genes on a chromosome or a plasmid among different genomes. A generally accepted fact is that closely related genomes will have a similar arrangement of genes (synteny within strains > species > genera). A disordered arrangement of genes (or lack of synteny) may be seen among species of genera having undergone major genetic rearrangements and HGT events (Verma et al. 2017). In order to study synteny, genomes can be visualized linearly using Mauve (Darling et al. 2004) and circular representations can be generated using BRIG (Alikhan et al. 2011). Mauve is a cross-platform GUI genome alignment package that uses an anchored alignment approach to generate multiple alignments of long reads, contigs, scaffolds and whole genomes. It identifies locally collinear blocks among the genomes being aligned. More than two genomes can also be aligned using Mauve. Less synteny is observed as the taxonomic distance increases. Mauve compares the location of these blocks among the genomes, shedding light on the organizational arrangement of genetic blocks and large-scale genomic rearrangements.

M. bathoardescens is one such novel strain for which synteny analysis has paved the way for its classification and differentiation from nearest phylogenetic neighbours and specifically highlighted regions specific to individual genomes (Stewart et al. 2015). In a recent study, Pucker et al. (2016) used a reciprocal best hits (RBH) synteny approach to demonstrate the occurrence of large presence/absence variations (PAVs) in Arabidopsis thaliana genomic strains Nd-1 and Col-0, thus differentiating the two. Tools for analyzing synteny are evolving and recent additions to this tool kit include Multisyn (Baek et al. 2016), Phagonaute (Delattre et al. 2016), Synteny Portal (Lee et al. 2016a, b) and Vector Graph Toolkit of Genome Synteny and Collinearity (VGSC) (Xu et al. 2016), among others. SyntTax (Oberto 2013) is a tool that depicts taxonomy-based synteny arrangement. Synteny studies in the plant family Solanaceae (Wang et al. 2008; Rinaldi et al. 2016) have headed the way of understanding evolution through comparative mapping. In the case of Methanocaldococcus bathoardescens JH146-22, full-genome synteny analysis has been used to validate its taxonomy (Stewart et al. 2015). Among circular visualization methods to view and compare genomes, BRIG (Blast Ring Image Generator) is a powerful and easy to use application. BRIG compares the query genomes with a reference genome using BLAST (Altschul et al. 1990). Multiple genome comparisons can be shown in a single image displaying similarity between reference and query genome sequences concentrically. Visualization can be optimized by using cut-off e-value or minimum percentage identity to filter the results. BRIG comparisons can highlight regions of similarity and demonstrate overall genome similarity visually (Verma et al. 2014), thus helping to deduce relatedness of genomes based on whole genome sequences.

Linking functional genomics to taxonomy and systematics

While functional genomics does not have a direct link with taxonomy and systematics, it plays a significant role in understanding evolutionary relationships among microorganisms. Studying the functional aspects of a genome from its sequence data can serve as a modern approach that goes beyond conventional taxonomic methods (Khanna et al. 2011; Sharma et al. 2014; Puri et al. 2016). Functional genomics deals with the identification of genes and proteins and their interactions in different metabolic pathways. The functional potential of a genome can be analyzed by studying gene expression profiles, small non-coding RNAs, mutations (e.g. single nucleotide polymorphisms), proteomics, DNA methylation and genome-wide association studies (Chen et al. 2009). Functional profiling can be used to draw comparisons between different bacterial species by studying differentially enriched metabolic pathways in the genomes. FragGeneScan (Rho et al. 2010) and Web-MGA (Wu et al. 2011) are used for predicting open reading frames (ORFs) in contigs, whereas the predicted ORFs can be functionally annotated using KAAS (KEGG Automatic Annotation Server; Moriya et al. 2007) and by searching the COG database (Tatusov et al. 2001), which assign KEGG Orthology (KO) numbers and Clusters of Orthologous Groups (COG) categories, respectively. MinPath (Ye and Doak 2009) and FMM (Chou et al. 2009) are used for metabolic reconstruction of pathways. The above-mentioned tools mostly use KEGG as a single reference database and do not provide any statistical output, whereas tools like metaSHARK (Hyland et al. 2006), MEGAN6 and MG-RAST (Meyer et al. 2008) use different databases, namely PRIAM, SEED and eggNOG, respectively. KEGG is an integrated knowledge-based reference database that consists of 17 main databases in four broad groups, namely genomic information, systems information (metabolic pathways), chemical information (metabolites, ligands, enzymes), and health information (drugs and diseases).

Although taxonomic analyses using the 16S rRNA approach have been successful, they fail to explain gaps in evolution caused by the functional replacement of genes, HGT, or gene duplications (Hong et al. 2004). Phylogeny based on metabolic pathway content can account for overall evolutionary processes and therefore helps in understanding adaptation mechanisms in microorganisms inhabiting diverse environmental niches. Functional genomic analyses used to infer phylogenetic relationships show divergent functional profiles among taxa and clades (Chai et al. 2014). This correlation can be used to identify clade-specific cellular functions with both low and high parsimony scores. These clade-specific cellular functions can be used in addition to the conventional approach for taxonomic characterization of novel bacterial species. For instance, Conserved Signature Indels (CSIs) present within protein sequences are used as molecular markers to infer phylogeny and to place species into specific clades.
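
One simple way to compare genomes by pathway content is a presence/absence distance. The sketch below uses the Jaccard distance between sets of annotated pathways; this metric is chosen here purely for illustration and is not claimed to be the method of the studies cited above. The genome names and pathway labels are placeholders, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of pathway identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def pathway_distance_matrix(profiles):
    """Pairwise distances between genomes, keyed by (name_x, name_y)."""
    names = sorted(profiles)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(profiles[x], profiles[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Placeholder profiles: each genome is reduced to the set of pathways annotated in it.
profiles = {
    "genome_A": {"pw_glycolysis", "pw_tca_cycle", "pw_pentose_phosphate"},
    "genome_B": {"pw_glycolysis", "pw_tca_cycle"},
    "genome_C": {"pw_nitrogen_metabolism"},
}
matrix = pathway_distance_matrix(profiles)
for pair in [("genome_A", "genome_B"), ("genome_A", "genome_C")]:
    print(pair, round(matrix[pair], 3))
```

A matrix of this kind could then be passed to any standard distance-based clustering or tree-building method to obtain a pathway-content dendrogram for comparison with the 16S rRNA gene phylogeny.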

Future challenges in the evolving field of taxonomy

The enormous amount of genomic information that is becoming available through cultivation-independent approaches is forcing taxonomists to find ways to deal with uncultured microbial diversity and to explore it further. Metagenomics refers to the study and analysis of genetic material recovered directly from environmental samples. It has been estimated that only 0.1% of the microbial diversity is cultured (Davis et al. 2005); therefore, taxonomic classification of the remaining 99.9% of uncultured microbes using the standard polyphasic methods is very challenging. Metagenomics now plays a key role in identifying abundant bacteria present in environmental samples, some of which are difficult to isolate using canonical techniques. With advances in bio-computational tools, metagenomic data can be taxonomically classified after binning the metagenomic reads or contigs based on species-specific parameters such as tetranucleotide frequency, %GC content and clade-specific markers. The draft genomes generated through this approach can be used to infer important metabolic pathways present in the uncultured organisms, and this information can be further utilized to modify cultivation approaches so that these difficult-to-culture organisms can be enriched by using special media, filtration processes, different temperatures or specific electron donors or acceptors. The use of metagenomics for taxonomic classification of uncultured microorganisms is still at a developing stage and will advance with time. With advancements in technology, modern tools are being developed that can efficiently perform taxonomic classification of the majority of the microorganisms inhabiting any particular environment.
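
Two of the binning parameters mentioned above, %GC content and tetranucleotide frequency, can be computed directly from a contig sequence. The following is a minimal sketch with hypothetical function names; unlike many binning tools, it does not merge reverse-complementary tetranucleotides, and it simply skips windows that contain ambiguous bases.

```python
from collections import Counter
from itertools import product

BASES = set("ACGT")


def gc_content(seq):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides.

    Overlapping windows are counted; windows containing characters
    other than A, C, G or T (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= BASES
    )
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


# Toy contig, far shorter than anything that would be binned in practice.
contig = "ACGTACGT"
profile = tetranucleotide_frequencies(contig)
print(gc_content(contig), profile["ACGT"])
```

Contigs whose 256-dimensional frequency vectors and %GC values lie close together are candidates for the same bin; real binning tools typically combine such composition signals with coverage information and marker genes.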

Using the information gathered from in silico analysis, attempts are being made to culture or isolate the uncultured microbes (Joseph et al. 2003). Classical taxonomy based on the traditional polyphasic approach is shifting to a more advanced approach in association with bioinformatics. Co-culturing bacteria is also an alternative way to effectively culture the uncultured (Stewart 2012), and here metagenomics along with metaproteomics can play a crucial role in deciphering community dynamics. In addition to cultivation and taxonomic classification, proper preservation of environmental samples is also important. To this end, Biological Resource Centres with well-equipped infrastructure for ex situ preservation of native biodiversity for future research, reference, and applications should be established (Prakash et al. 2007).

Conclusions

Taxonomic assignments are critical in deciphering and classifying the rich microbial diversity that is being unraveled by science at an unparalleled pace. Advancements in genomics have made the characterization of organisms and the assignment of taxonomic descriptions more robust and stable. Even with its inherent limitations, 16S rRNA gene-based taxonomic assessment is still the basis for systematics. However, with advancements in NGS methods and the availability of genome data, there has been an incremental development of various genome analysis algorithms, thus moving taxonomic delineation of prokaryotes towards the new era of taxo-genomics.

Despite the debate on genome-based phylogeny (as data on genome-based taxonomy remain scarce even at present), there is a need to consolidate efforts and set norms for characterizing prokaryotic species with basic genomic annotations, which should lead to more distinctive bacterial classification. Overall, efforts focusing on minimalist and/or genomic approaches to identify novel taxa will prevent redundancy in prokaryotic nomenclature. Thus, the fundamental principles of taxonomy must not be abandoned, but supplementing emerging taxonomic studies with genomic data would make the results more robust and conclusive. Based on this review of the classical taxonomic approach, taxo-genomics and upcoming taxonomic methods, we propose a minimum set of parameters and tools for performing taxonomic characterization of prokaryotes (Fig. 1).

Fig. 1 Scheme for bacterial taxonomy linking classical taxonomic approach with current methods of genomics and futuristic diversity analysis concepts based on in silico data analysis