Introduction

With the advent of next-generation sequencing (NGS) technologies, (meta)genomics has transformed microbiology, leading to a deluge of environmental and genome sequence data. With the availability of sequenced bacterial genomes, comparative genomics has become indispensable for elucidating the evolutionary forces active across genera or species; however, it cannot resolve the population-level structure of in situ cohorts in an environment [1–11]. This limitation can be addressed to some extent by using genomics and metagenomics data together to delineate the pan-genome dynamics of a community and thereby reveal the environment-specific lifestyles adopted by bacteria (Fig. 1) [12, 13].

Fig. 1

Schematic representation of the workflow for conducting genomic and metagenomic surveys, both independently and in association, to elucidate community dynamics. Steps marked with red asterisks are optional. ‘G’ labels near bins represent genomes segregated from metagenome data. The assemblers and taxonomic validation techniques named in this figure are only examples of generalized methodologies used in the field and do not reflect any preference. These methodologies have been reviewed by Oulas et al. [114] and Sangwan et al. [24]. (The figure was originally produced for this review. The examples of metagenomic recruitments were reproduced from Sharma et al. [17])

Since its inception, ‘metagenomics’ has largely been used to decipher the overall taxonomic composition of an environment, focusing more on the ‘meta’ and less on the ‘genomics’ part, except for organismal reconstruction at the species/strain or pan-genome level (multiple species of a specific genus) [12, 14]. While numerous studies have used genomics and metagenomics independently, only recently has the potential of mining genome and metagenome datasets together been exploited to unveil complex environment–host interactions. Using (meta)genomics in sync can provide a better understanding of habitat-independent gene acquisitions and the functional contributions of taxa enriched in an environment [15, 16]. (Meta)genomics can also predict in situ growth dynamics, environment-specific lifestyles, seasonal dynamics, and gene-specific and genome-wide sweeps across resident populations [17, 18] (Figs. 1, 2). This review covers the (meta)genomic approaches developed over the past decade, from the metagenome-enabled discovery of the ‘rare biosphere’ [13] until today, and shows how using genomics and metagenomics data conjointly can increase the resolution of investigations into ecological interactions.

Fig. 2

Diagram showing the four major aspects connecting metagenomics (M) and genomics (G) discussed in this review: (a) average genome size estimation from metagenome data to understand genome selection pressure, (b) growth dynamics of a specific bacterium inferred from metagenome data, (c) SNP analysis using temporal metagenomics, and (d) G+C-driven codon usage and protein selection analysis across metagenome-derived genomes. (The figure was originally produced for this review. The concepts for a–d were taken from Nayfach and Pollard [56], Korem et al. [16], Bendall et al. [15], and Ran et al. [74], respectively)

(Meta)genomics Enabled Assessment of Population Splitting Factors

When an organism is known to be highly abundant in, or has been isolated from, a specific environment for which a metagenome dataset is also available, it becomes feasible to recruit metagenome reads onto the reference genome [2, 19]. This can be followed by annotation of the unmapped or under-recruited regions, known as metagenomic islands (MGIs) [17, 20]. MGIs are thought to be mobile genetic elements (MGEs) that form part of the accessory genome and represent highly variable regions among different lineages in the population [19]. The steps that can be followed and their outcomes are summarized in Fig. 1. This analysis is particularly attractive for extreme environments, which are characterized by enrichment of a few dominant taxa (Fig. 1). For instance, chemically contaminated environments such as a hexachlorocyclohexane (HCH) dumpsite have been found to be dominated by Sphingomonads and Pseudomonads [11, 12]. Similarly, soil microcosms are dominated by Rhodanobacter, Burkholderia, and Acidobacteria [13]. Annotating the MGIs highlights the functions that segregate populations within the in situ cohorts of a specific environment, which is not possible using traditional genomics approaches [20] (Table 1). Metagenomic recruitment analyses using environmental data from stressed niches also reveal population dynamics driven by slight environmental perturbations and habitat-specific genomic alterations in indigenous populations [21].

Table 1 List of metagenomic surveys performed in association with the genome reconstruction and reference based recruitments demonstrating different results

Recruitment of metagenome reads onto reference genomes, which fundamentally involves aligning reads to whole microbial genomes using global alignment algorithms, has been used extensively to demonstrate environment-induced variation across genomes (Table 1). Numerous algorithms can be employed for metagenomic recruitment; some of the most widely used software packages and pipelines are discussed in the section below. MGI annotation across the genome of Salinibacter ruber using environmental data from saturated brines revealed a predominance of cell wall biogenesis genes across the island regions [19]. This suggested that the population varies with respect to cell envelopes in saline environments, pointing to a global strategy against phage predation, given the low eukaryotic grazing pressure [19]. A similar study aimed at identifying pathogenicity markers employed MGI annotation by recruiting metagenome reads from healthy individuals onto the genomes of pathogenic bacteria [22]. This led to the mapping of virulence genes, specifically in species with uncharacterized pathogenicity markers such as Shigella and Escherichia coli (Table 1). A study using metagenome recruitment data from an HCH dumpsite led to reconstruction of the last common ancestor of HCH-degrading Sphingobium species after discounting MGIs and genomic islands (GIs), providing evidence for horizontal gene transfer (HGT)-driven acquisition of the HCH-degrading enzyme arsenal, mobilized by environmental pollution [12] (Table 1). In addition, read-based recruitment analysis revealed the pivotal role of an integron and its associated transposase gene (TnpA6100) in enabling HGT as a stress response across the Pseudomonas community inhabiting an HCH-contaminated environment [17]. Recently, temporal (meta)genomics using recruitment demonstrated intra-population heterogeneity across closely related populations of Methylotenera via SNP analysis, unraveling patterns of gene gain and loss over time and highlighting the role of the environment in genome modulation [15] (Fig. 2; Table 1).

Tools for Read-Based Metagenomic Recruitments

Metagenomic recruitment depends largely on alignment algorithms, and multiple aligners exist for recruiting or binning metagenomic reads to a reference genome, each with different execution time and efficiency [25]. The most common formats for raw reads obtained from different sequencing platforms are FASTQ and FASTA. While BLAST and BLAT are the most common alignment algorithms, they are too slow for processing the millions of reads generated from a metagenome [26]. Faster alignment approaches include Mapping and Assembly with Qualities (MAQ) [27], Short Oligonucleotide Analysis Package (SOAPv2) [28], Bowtie [29], Basic Oligonucleotide Alignment Tool (BOAT) [30], and Burrows-Wheeler Aligner (BWA) [31] (Table 2). Bowtie and BWA are both based on the Burrows-Wheeler Transform, which yields a compressed index and makes them faster than BLAST-like programs [31]. MAQ [27] is based on spaced indexing, whereas SOAP uses seeds and a hash look-up table for the query and reference sequences. Of the two, Bowtie is used extensively for very short reads, while BWA is used for mapping low-divergence sequences onto a large reference genome [26]. The Global Alignment Short Sequence Search Tool (GASSST) [32] performs global alignments of both short and long reads against a large reference sequence, and its main advantage is fast gapped alignment (Table 2). FR-HIT [26] also performs fragment recruitment, with a higher tolerance for mismatches and gaps than SOAP [28], BWA [31], and Bowtie [29] (Table 2). Other tools used for assigning reads include MEGAN [33], MetaPhlAn [34], PhymmBL [35], and Kraken [36] (Table 2). MEGAN uses BLAST-based database searches and assigns each read to the lowest common ancestor (LCA) of its hits, while PhymmBL applies Markov models alongside BLAST results to increase the precision of recruitment [35]. MetaPhlAn assigns taxonomy by recruiting only a subset of reads to clade-specific markers instead of whole genomes, which makes it faster than other algorithms on very large metagenome datasets [34]. Kraken, the fastest of these, matches k-mers against entire microbial genomes and achieves relatively high precision (99.20%) at the genus level [36]. All of these software packages and pipelines have been used extensively; however, their precision and efficiency may vary from dataset to dataset (Table 2).
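
The lowest common ancestor idea used for read assignment can be illustrated with a minimal sketch. The toy taxonomy, the `parent` dictionary, and the function names below are hypothetical and chosen only for illustration; this is not the implementation of MEGAN or of any other tool named above.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest taxon shared by the lineages of all given taxa."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    if not common:
        raise ValueError("taxa do not share a common root")
    return common[0]


# Hypothetical toy taxonomy: child -> parent
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Shigella flexneri": "Shigella",
    "Shigella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}

# A read whose hits fall in two genera is assigned to their shared family.
ambiguous_read = lowest_common_ancestor(
    ["Escherichia coli", "Shigella flexneri"], parent
)
# A read with hits to a single species keeps the species-level assignment.
specific_read = lowest_common_ancestor(["Escherichia coli"], parent)
```

The design choice here is conservative: the more taxa a read matches, the higher in the taxonomy it is placed, which trades resolution for fewer false species-level calls.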

Table 2 List of software for recruitment of metagenomic reads

One of the major challenges in metagenomics is binning a microbial community using very short reads. Most of the mapping techniques discussed above depend on 16S rRNA gene databases or essential genes and therefore require relatively long reads. Current taxonomic profiling methodologies remain biased because gene-based approaches depend heavily on correct coding orientation, which cannot be guaranteed when analyzing short metagenome reads. More recently, Freitas et al. [37] used a hierarchical array of unique signatures: their GOTTCHA pipeline uses machine learning to determine unique genomic regions and then assesses the distribution and coverage of these regions [37]. Hence, the choice of software for metagenomic recruitment depends on the type of raw data and the computing resources available (Table 2).

Metagenomic recruitment onto reference genomes highlights the relative abundance of genomes or pan-genomes at the sampling site, using an identity cut-off of 80–95% for alignments to accommodate species-level recruitment [2, 12, 19] (Fig. 1). When exploring the population dynamics of a particular strain, the identity cut-off can be increased to 98–100%. For (meta)genome recruitment, the % identity is generally defined as the number of identities between read and reference divided by the average read length. The threshold has been standardized at 80% for metagenomic binning over a genome, i.e. requiring 80% identity over 80% of the read length [19]. Metagenome tiling can thus highlight the modulation of accessory genomes and, through annotation of the MGIs, delineate the population-splitting factors across bacterial ecotypes (Fig. 1). However, false positives caused by sequencing bias remain a bottleneck. Manual curation therefore remains the most important step; it includes scanning for tRNA genes flanking the island regions, differential G+C content, tetranucleotide frequency, and codon usage skew [19]. This downstream analysis is essential for confirming the accuracy of MGIs after recruitment of metagenome reads onto reference genomes. The unmapped regions are annotated by database searches using programs such as BLAST [38], GHOSTX [39], GHOSTZ [40], BLAT [41], and HMMER [42] against databases such as NCBI-nr [43], KEGG [44], Pfam [45], and SwissProt [46].
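
The identity and read-length cut-offs described above can be sketched as a simple filter over alignment hits. The `Hit` record and the function names are hypothetical, and identity is computed here over the aligned portion of the read, which is an assumption made for illustration rather than the definition used by any particular aligner.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    read_length: int     # full length of the read (bp)
    aligned_length: int  # aligned portion of the read (bp)
    identities: int      # matching positions within the alignment


def passes_recruitment(hit, min_identity=0.80, min_coverage=0.80):
    """Return True if the hit meets both the identity and read-coverage cut-offs."""
    if hit.read_length <= 0 or hit.aligned_length <= 0:
        return False
    identity = hit.identities / hit.aligned_length
    coverage = hit.aligned_length / hit.read_length
    return identity >= min_identity and coverage >= min_coverage


def recruit(hits, min_identity=0.80, min_coverage=0.80):
    """Return the IDs of reads recruited to the reference at the given cut-offs."""
    return [h.read_id for h in hits
            if passes_recruitment(h, min_identity, min_coverage)]


# Hypothetical alignment hits
hits = [
    Hit("r1", 100, 95, 90),  # identity ~0.95, coverage 0.95
    Hit("r2", 100, 60, 60),  # only 60% of the read aligned
    Hit("r3", 100, 90, 70),  # identity ~0.78
]

species_level = recruit(hits)                    # 80% identity over 80% length
strain_level = recruit(hits, min_identity=0.98)  # stricter strain-level cut-off
```

Raising `min_identity` towards 0.98–1.00 mirrors the stricter cut-off suggested for strain-level population analysis.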

De novo Segregation of Metagenome Datasets into Genome Bins

While metagenomic recruitment onto a genome can provide insights into environment-specific genome modulations, assembling near-complete genomes through metagenome binning can reveal the functional contribution of an individual genotype or population in a complex community. Nonetheless, obtaining a high-fidelity bin without cross-contamination at strain- or species-level resolution remains a challenge for a metagenome sequenced at moderate coverage (Fig. 1). Coverage (sequencing depth) remains the most significant parameter when recovering a genome from environmental data [51]. However, continuing advances in NGS are easing this bottleneck, allowing the assembly of genomes with <1% relative abundance in a metagenome [23].

Reference-based recovery of near-complete genomes from metagenomes relies on alignments against reference genomes, but reference databases remain limited compared with the largely unexplored diversity of complex communities [52, 53]. The de novo metagenome segregation approach uses tetranucleotide frequencies, G+C skew, and coverage, which are assumed to be relatively constant across a genome (Fig. 1). However, some genomes have inconsistent base composition and G+C content, which compromises this de novo approach [53]. Another method retrieves sets of specific genes, such as 107 essential genes [54] or 31 bacterial marker genes [55], directly from the metagenome to separate organisms. Although widely used, these methodologies are based on gene abundance, which can be nearly identical for closely related, co-abundant organisms [52]. Recently, 7381 co-abundance gene groups (CAGs) were used to recover genomes from 396 human gut microbiome samples [52]. This methodology was used to assemble 238 microbial genomes belonging to archaea, bacteria, and viruses. For benchmarking, 19 of the sampled individuals consumed fermented milk products containing Bifidobacterium animalis subsp. lactis CNCM I-2494, whose genome has already been sequenced. Using CAGs, the B. animalis genome was reconstructed, and 95% of the B. animalis reference genes were recovered with 99.9% identity to the reference B. animalis subsp. lactis CNCM I-2494 [52]. Hence, co-abundance gene profiles of environmental metagenomes are suggested as a way to segregate taxonomically related microorganisms with higher accuracy than gene-based or composition-based approaches.
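
The composition signals used for de novo segregation can be sketched as follows. This is a simplified illustration that assumes single-strand counting of tetranucleotides (no reverse-complement merging) and a plain Euclidean distance; the function names and toy contigs are hypothetical and do not correspond to any specific binning tool.

```python
from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return normalized frequencies of all 256 tetranucleotides (ACGT only)."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in TETRAMERS)  # ignores windows containing N etc.
    if total == 0:
        return {k: 0.0 for k in TETRAMERS}
    return {k: counts[k] / total for k in TETRAMERS}


def gc_content(sequence):
    """Return the G+C fraction over unambiguous bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def composition_distance(freq_a, freq_b):
    """Euclidean distance between two tetranucleotide frequency profiles."""
    return sum((freq_a[k] - freq_b[k]) ** 2 for k in TETRAMERS) ** 0.5


# Toy contigs: a and b share composition, c does not.
contig_a = "ACGT" * 50
contig_b = "ACGT" * 30
contig_c = "GGCC" * 50

profile_a = tetranucleotide_frequencies(contig_a)
profile_b = tetranucleotide_frequencies(contig_b)
profile_c = tetranucleotide_frequencies(contig_c)

same_bin_distance = composition_distance(profile_a, profile_b)
other_bin_distance = composition_distance(profile_a, profile_c)
```

Contigs with a small composition distance (and similar coverage) are candidates for the same bin; real binners combine these signals with clustering rather than a single pairwise comparison.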

Significance of Average Genome Size (AGS) Estimations Across Metagenome

Besides community profiling, metagenomics can be used to compare the relative abundance of gene families and pathways between sites. To detect gene abundance accurately, it is important to determine the average genome size (AGS) so that variations in gene abundance between metagenomes can be interpreted with statistical confidence [56, 57]. AGS is simply the average size of the genomes present in a metagenome; it can vary between two sites and thereby introduce spurious variations in gene frequency [58]. Consequently, when metagenomes with different AGS are compared, false positives can be observed, or genes can appear stable between sites even when a real difference exists (i.e. false negatives) (Fig. 2a) [59]. AGS is also useful for estimating the evolutionary forces acting on organisms thriving in a particular environment. Bacterial genome size reflects the environmental pressures, community, metabolic preferences, and lifestyle of an organism [60]. For instance, bacteria with relatively large genomes tend to follow a generalist lifestyle, whereas bacteria with smaller genomes follow a more specialized lifestyle [61].

Several software tools estimate AGS across a metagenome. The most widely used is Genome relative Abundance and Average Size (GAAS) [60], which is based on BLAST searches of metagenome reads against a database of microbial genomes including bacteria, archaea, and viruses. This approach is biased because many microorganisms are not represented in the database and the metagenome sample under analysis might contain novel organisms. Hence, GAAS is not a good choice for analyzing a metagenome from a unique niche. Raes et al. [58] devised an approach in which AGS is calculated from the abundance of reads assigned to 35 essential genes, which made it much faster. However, this approach could only analyze metagenomic reads longer than 300 bp, so it cannot be applied accurately to short-read libraries from newer sequencing platforms. To overcome these problems, Nayfach and Pollard [56] designed a pipeline called “MicrobeCensus”. MicrobeCensus determines read density on housekeeping genes and can be applied to metagenome reads as short as 50 bp. The software considers 40 marker genes shared by bacteria and archaea and 114 marker genes for bacteria only [62]. The 40-gene list comprises ribosomal proteins S2, S10, L1, L22, L4, L2, S9, L3, L14B, S5, S19, S7, L16, S13, L15, L25, L6, L11, L5, S12, L29, S3, S11, L10, S8, L18P, S15P, S17, L13, and L24, translation elongation factor EF-2, translation initiation factor IF-2, metalloendopeptidase, ffh signal recognition particle protein, phenylalanyl-tRNA synthetase beta subunit, phenylalanyl-tRNA synthetase alpha subunit, tRNA pseudouridine synthase B, porphobilinogen deaminase, phosphoribosylformylglycinamidine cyclo-ligase, and ribonuclease HII [62]. Further, the accuracy of MicrobeCensus was established using the “Median Unsigned Error” in order to account for errors due to over- and under-representation of sequences.
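
The general idea of marker-based AGS estimation can be sketched as follows: if single-copy marker genes occur once per genome, their coverage approximates the number of genome equivalents sequenced, and AGS is the total sequenced bases divided by that number. This is a deliberately simplified sketch under that assumption, not the algorithm of MicrobeCensus or of any other tool discussed here; the function name, marker names, and numbers are hypothetical.

```python
from statistics import median


def estimate_ags(total_bases, marker_hits):
    """Estimate average genome size (bp) from single-copy marker coverage.

    marker_hits maps a marker name to a tuple of
    (bases aligned to the marker, marker gene length in bp).
    """
    coverages = [bases / length
                 for bases, length in marker_hits.values() if length > 0]
    if not coverages:
        raise ValueError("no marker genes with non-zero length")
    # Median coverage of single-copy markers ~ genome equivalents sequenced.
    genome_equivalents = median(coverages)
    if genome_equivalents <= 0:
        raise ValueError("marker coverage must be positive")
    return total_bases / genome_equivalents


# Hypothetical metagenome: 3 Gbp sequenced, three single-copy markers
marker_hits = {
    "marker_1": (750_000, 750),  # coverage 1000
    "marker_2": (700_000, 700),  # coverage 1000
    "marker_3": (330_000, 300),  # coverage 1100
}
ags = estimate_ags(3_000_000_000, marker_hits)
```

Using the median rather than the mean limits the influence of a marker that is over- or under-represented in the sample.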

Using (Meta)genomics Data in Deciphering Growth Dynamics of Bacteria

Metagenomic sequence data can reveal which microbiota are present in a particular niche such as the gut, a biofilm, or a hot spring. Recently, Korem et al. [16] devised a way to obtain information on the growth dynamics of a particular bacterium/genome enriched in a metagenome. The methodology examines the pattern of sequencing coverage across the genome relative to the origin of replication (Fig. 2b). Bacterial replication initiates at the origin of replication (ori) and proceeds bidirectionally; hence, regions that have already been replicated are present in two copies, in contrast to regions not yet replicated. The same concept was earlier applied only to yeast cells with a coordinated stage of replication [63]; however, it also holds for cells at any stage of replication [64, 65]. Using genome data from multiple bacteria, it was found that the region close to ori is present at a higher copy number in actively growing bacteria than the DNA segment near the terminus [63] (Fig. 2b). The ratio of the copy number of the region near ori to that of the segment near the terminus is termed the peak-to-trough ratio (PTR) and is a direct measure of the growth rate of the bacterium [66]. A PTR greater than 1:1 is a quantitative indicator of active growth. This quantitative relationship was further confirmed experimentally using E. coli (strain K-12) cultures [16]. Patterns similar to those of in vitro cultures were observed across E. coli genomes retrieved from human fecal metagenome samples [52, 67, 68]. Using E. coli genomes from 583 databases, PTR was found to vary from 1 to 2.4, similar to the range obtained in in vitro experiments, i.e. 1–2.6 [16].
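
The PTR calculation can be sketched from binned read coverage along a circular genome. The smoothing scheme (a circular moving average), the window size, and the toy coverage values are assumptions made for illustration; the published method fits the coverage profile more carefully than this sketch does.

```python
def smooth_circular(values, window=3):
    """Moving average over a circular genome (the ends wrap around)."""
    n = len(values)
    half = window // 2
    smoothed = []
    for i in range(n):
        neighbours = [values[(i + j) % n] for j in range(-half, half + 1)]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


def peak_to_trough_ratio(bin_coverage, window=3):
    """Return max/min of the smoothed coverage profile (peak near ori, trough near ter)."""
    if not bin_coverage:
        raise ValueError("coverage profile is empty")
    smoothed = smooth_circular(bin_coverage, window)
    trough = min(smoothed)
    if trough <= 0:
        raise ValueError("trough coverage must be positive")
    return max(smoothed) / trough


# Hypothetical coverage in 12 genome bins: highest at ori (bin 0), lowest at ter (bin 6)
growing = [40, 38, 34, 30, 26, 22, 20, 22, 26, 30, 34, 38]
stationary = [30] * 12

ptr_growing = peak_to_trough_ratio(growing)
ptr_stationary = peak_to_trough_ratio(stationary)
```

A flat profile gives a ratio of 1, whereas a population with many cells mid-replication gives a ratio above 1.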

As an extension of this concept, PTRs were found to track changes following antibiotic treatment. When Citrobacter rodentium was treated with the antibiotic erythromycin, PTRs dropped drastically, and the reduction was evident within 30 min of administration. During recovery from the antibiotic (after washing of the cultures), PTRs increased again. To assess virulent bacterial activity in a disease condition, virulent and non-virulent C. rodentium strains were compared for their PTR patterns. For the first five days, both showed similar PTR values; at 6–9 days post infection, however, PTR values of the virulent strains increased drastically compared with those of the non-virulent strains. This was interpreted as an indication of mucosal adhesion and proliferation by the virulent strains at that stage [16]. Furthermore, after false discovery rate (FDR) correction (corrected P value <0.005), PTRs showed significant correlations with several disorders and metabolic conditions such as Crohn’s disease [52], ulcerative colitis [69], fasting serum insulin, fasting blood glucose, and type II diabetes [68].

Evidence for Genome-Wide Sweeps Using Temporal (Meta)genomics

Microbial communities comprise distinct phylogenetic groups within ecologically coherent populations, owing to high rates of recombination of advantageous genes between members of a population [70]. Time-series metagenomics combined with de novo genome assembly holds great potential for studying this genetic heterogeneity by directly tracking the evolutionary patterns that drive gene gain and loss over time [15]. De novo reconstruction of genomes from metagenome data provides reference genomes onto which the metagenome reads are recruited again to call SNPs, revealing the genetic diversification within discrete populations [71] (Fig. 2c). This has made it possible to examine evolutionary models such as genome-wide sweeps, which had not been studied earlier [72], directly within a single yet phylogenetically diverse environment.

Metagenomic recruitment onto the assembled reference genomes reveals two types of populations: ‘sequence-discrete’ populations, with recruitment at ≥99% identity, and ‘close sympatric’ populations, with recruitment at <90% identity. Sequence-discrete populations represent highly similar genotypes with a low extent of diversity, which can be analyzed for SNPs. Bendall et al. [15] reported an eightfold difference in SNP density (SNPs per Mbp) between two closely related genotypes of the same genus, Methylotenera, assembled from a 9-year study period, indicating remarkable intra-population diversity (Fig. 2c). However, most of the SNPs in discrete populations did not result in amino acid substitutions. Time-series metagenomics can also unveil gene gain and loss patterns within one specific population (reconstructed genome) [15]. An increase in the relative abundance of a specific set of genes over time suggests that the genes were acquired horizontally, whereas a decrease in their frequency suggests that the newly dominant lineage of that population (i.e. genus, phylum, or order) will eventually lack these genes, given a constant functional constraint. Hence, genome-wide studies using temporal metagenomics can provide a clear understanding of both genome-wide and gene-specific sweeps taking place within and between populations, which can in turn inform evolutionary models of the population dynamics of an environment.
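
SNP density from recruited reads can be sketched as a count of polymorphic positions normalized per Mbp. The depth and minor-allele-frequency cut-offs, the pileup layout, and the function names are hypothetical choices for illustration and are not the parameters used in the cited study.

```python
def is_snp(allele_counts, min_depth=10, min_minor_freq=0.05):
    """Call a position polymorphic if the second most common allele is frequent enough."""
    depth = sum(allele_counts.values())
    if depth < min_depth:
        return False
    ordered = sorted(allele_counts.values(), reverse=True)
    if len(ordered) < 2:
        return False
    return ordered[1] / depth >= min_minor_freq


def snps_per_mbp(pileup, genome_length):
    """Return the number of SNP positions per million base pairs of reference."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    n_snps = sum(1 for counts in pileup.values() if is_snp(counts))
    return n_snps * 1_000_000 / genome_length


# Hypothetical pileup: reference position -> allele counts from recruited reads
pileup = {
    100: {"A": 18, "G": 2},  # minor allele at 10% -> SNP
    250: {"C": 20},          # monomorphic
    400: {"T": 5, "C": 3},   # depth below cut-off -> ignored
    900: {"G": 12, "A": 8},  # minor allele at 40% -> SNP
}
density = snps_per_mbp(pileup, genome_length=2_000_000)
```

Computing this density for the same reconstructed genome at each time point gives a simple way to follow diversity within a population across a time series.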

Using Population Genomes to Analyze Differential Codon Usage Preferences

It has recently been established that microbial communities in extreme environments evolve faster and are characterized by strong purifying selection, undergoing genome optimization under specific functional constraints [73]. De novo reconstruction of genomes of uncultivated organisms from extreme-environment metagenomes can be used to explore the relationship between %G+C, codon usage preferences, and protein selection, and thereby to validate the evolutionary pressures acting on bacterial communities in stressed environments (Fig. 2d). To quantify this coupling, the correlation (R) between gene-specific optimal codon frequency (Fopt, an indicator of codon bias) and dN/dS is calculated [74]. The strongest positive coupling has been found in Cyanobacteria and Tenericutes, followed by Firmicutes, Spirochaetes, and others, whereas Actinobacteria show a strong but negative correlation. Generally, a low, negative R highlights the selection of “high-status” genes, which are central to metabolic pathways and thus evolve slowly under overwhelming purifying selection (Fig. 2d). The value of R also shows a significant relationship with %G+C: for genomes with extreme %G+C, Fopt values tend to be higher (Fig. 2d). Thus, the dependence of selection pressure, in terms of codon preferences, on nucleotide composition can provide insights into poorly understood evolutionary patterns. Genomes reconstructed from metagenomes can therefore be used to estimate habitat-specific evolutionary pressure.
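
The coupling measure R can be sketched as a Pearson correlation between per-gene Fopt and dN/dS values. The simplified Fopt below counts all codons in a coding sequence (a fuller treatment would typically exclude codons such as ATG and TGG, which have no synonymous alternative); the set of optimal codons and the per-gene values are hypothetical.

```python
from math import sqrt


def optimal_codon_frequency(cds, optimal_codons):
    """Fraction of codons in a coding sequence that belong to the optimal set."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(codon in optimal_codons for codon in codons) / len(codons)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Fopt for one hypothetical gene: codons CTG, CTG, AAA, CTA
fopt_example = optimal_codon_frequency("CTGCTGAAACTA", {"CTG", "AAA"})

# Hypothetical per-gene values for one genome
fopt_values = [0.30, 0.45, 0.60, 0.75]
dnds_values = [0.40, 0.30, 0.20, 0.10]
coupling_r = pearson_r(fopt_values, dnds_values)
```

In this toy genome, genes with stronger codon bias have lower dN/dS, so R is strongly negative.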

Interestingly, codon usage skew was found to be specific to a metagenome as a whole and independent of the bacterial community enriched in it [75]. This suggests that bacterial genera in the same metagenome can exhibit variable codon usage preferences, but that overall the metagenome is characterized by a cumulative codon bias that differs markedly from that of other metagenome samples, much as observed for a single microbial species or genome [76, 77]. To investigate phyletic independence, species-specific genes common to different metagenome samples were retrieved and distances were calculated between the codon usage preferences of each [78]. The codon usage of the compared lineages varied more between metagenomes than between different species of the same metagenome. Similar analyses have been extended to variable environmental conditions and demonstrated that different species show lower variability in codon usage under constrained environmental conditions [79–81]. Further, genomic data segregated from a metagenome showed consistent codon usage patterns within each genome. Hence, sequence composition across genomes and metagenomes can elucidate evolutionary pressures in terms of codon choice and protein selection.
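
A distance between codon usage preferences can be sketched by pooling codon counts over a set of genes and comparing the resulting frequency profiles. The Euclidean distance, the function names, and the toy gene sets are assumptions for illustration; the cited studies may use a different distance measure.

```python
from collections import Counter


def codon_usage(genes):
    """Pool codon counts over coding sequences and return relative frequencies."""
    counts = Counter()
    for cds in genes:
        usable = len(cds) - len(cds) % 3
        counts.update(cds[i:i + 3].upper() for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def codon_usage_distance(usage_a, usage_b):
    """Euclidean distance between two codon frequency profiles."""
    codons = set(usage_a) | set(usage_b)
    return sum((usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) ** 2
               for c in codons) ** 0.5


# Hypothetical gene sets: two species from one metagenome, one from another
species_1_sample_a = ["CTGCTGAAA"]
species_2_sample_a = ["CTGCTGAAA"]
species_1_sample_b = ["CTACTAAAG"]

within_sample = codon_usage_distance(
    codon_usage(species_1_sample_a), codon_usage(species_2_sample_a)
)
between_samples = codon_usage_distance(
    codon_usage(species_1_sample_a), codon_usage(species_1_sample_b)
)
```

Comparing within-sample and between-sample distances for the same set of shared genes is one way to ask whether codon usage follows the sample or the lineage.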

Application of (Meta)genomics in Clinical Microbiology

HGT-driven variation is a main contributor to the transition of non-pathogenic bacteria into pathogenic bacteria and vice versa. Perna et al. [82] compared pathogenic E. coli O157:H7 with non-pathogenic E. coli K-12, which led to the identification of candidate genes specifically responsible for the pathogenesis of pathogenic E. coli strains. Furthermore, comparative genomic analysis across Bacillus strains revealed the significance of mobile genetic elements in imparting virulence to bacterial strains [83]. Genomic comparison of pathogenic and non-pathogenic Mycobacterium tuberculosis strains revealed a new set of pathways in pathogenic strains that are absent from avirulent strains [84]. The most significant finding of that study was the discovery of alternative metabolic pathways, which shed light on mechanisms of pathogenesis and provide a basis for developing diagnostic markers for tuberculosis [84]. Comparative genomics has thus provided significant information on credible virulence determinants that can be targeted for vaccine development. While comparative genomics has provided insights into the evolution of bacterial pathogenesis, metagenomics has also been used for functional screening of virulence markers across an entire environment [85]. Metagenomic approaches have focused on the predominance of pathogenic bacteria in natural environments such as the human gut. For instance, Sommer et al. [86] used metagenomics data to characterize the antibiotic resistance potential of the healthy human microbiome. A very strong correlation has been established between imbalance of the human microbiota and diseases such as irritable bowel disease [87], obesity [88], and cystic fibrosis [89]. Metagenomics combined with functional screening for potential pathogenicity markers and antibiotic resistance has been used to investigate complex environments [85]. Metagenomic studies have also demonstrated the effects of factors such as environment, geographical location, antibiotics, age, and diet on the human gut ecosystem. However, metagenome sequencing captures not only pathogen sequences but also human genetic sequences [90], which might lead to an incorrect understanding of the pathogenic community hosted by the human body. On the other hand, it has the advantage of providing access to genetic changes that might be taking place in the human body under diseased conditions [91]. Additionally, shotgun sequencing can identify pathogenic species at strain-level resolution, since it is based on whole-genome markers rather than only the 16S rRNA gene [92]. This has already been reported in metagenomic sequencing studies of cholera [93], tuberculosis [94], E. coli [95], and methicillin-resistant Staphylococcus aureus (MRSA) [96].

Another application of metagenomics in clinical microbiology is targeted antimicrobial therapy following accurate diagnosis of the pathogen, which can reduce the side effects associated with broad-spectrum antibiotic regimens [97]. This has led to a significant reduction in mortality among patients with ventilator-associated pneumonia (VAP) [98]. Metagenomics-enabled accurate and rapid diagnosis of infectious diseases, together with an understanding of antibiotic resistance patterns, can empower physicians to use targeted antimicrobial therapies [99]. Therefore, both metagenomics and single-species genomics of pathogenic bacteria can provide insights into pathogenesis and a better understanding of virulence markers, providing a platform for pathogen diagnostics [100].

Application of (Meta)genomics in Fecal Microbial Transplants

Microbiome analysis is the most recent extension of metagenomics and gives it a direct application in human health. The symbiotic relationship between the gut microbiota and human health is well established, and the human intestine is known to harbor around 10^14 microbes from 35,000 different species [101, 102]. Interestingly, the number of microbes is nearly 10 times the number of cells in the human body [103]. The human gut microbiota plays significant roles in the postnatal structural and functional maturation of the gut and in the development of the immune and nervous systems [104–106]. The gut microbiota has also been reported to produce antimicrobial proteins such as cathelicidins, defensins, and C-type lectins [107, 108]. Imbalance of the microbiota can lead to disease states such as antibiotic-associated diarrhea and Clostridium difficile infection (CDI). Fecal microbiota transplantation (FMT) has proven very helpful in restoring the disturbed microbiota. It was first reported by Ge Hong, who used fecal transplantation to treat food poisoning [109]; in modern medicine, it was first used to treat pseudomembranous colitis [110]. Today there are many clinical reports on the use of FMT for conditions such as CDI, autism, depression, inflammatory bowel disease, Parkinson’s disease, multiple sclerosis, and obesity [111]. In this scenario, metagenomics plays a significant role by determining the microbial content of both healthy and diseased guts before and after the transplant. In addition, metagenomics can identify dysbiosis of the human gut microbiome in disease and can also detect novel changes in microbial functions [112].

Computational Challenges in Data Interpretation

With the advent of NGS, the data generated for genomes or metagenomes comprise millions of short reads, which need to be assembled into manageable data (i.e. a genome or metagenome) before any downstream analysis [113]. State-of-the-art assemblers such as Velvet, Ray and ABySS can assemble gigabytes of data into tens of contigs for a genome and thousands of contigs for a metagenome [113]. Broadly, there are two types of assemblers: (1) reference-based and (2) de novo assemblers [114]. Reference-based assemblers can be used when reference genomes are available to order the contigs. These include MIRA4, MetaAMOS and Newbler, which are not computationally intensive and use a closely related reference genome already deposited in databases [115, 116]. This set of assemblers, however, remains biased by the limitations of existing databases and cannot be used when exploring a unique environment [113]. De novo assemblers assemble the raw reads into contigs using graph-based approaches such as the de Bruijn graph, without any reference genome [117]. Tools such as Velvet, MetaVelvet, ABySS, SOAP, SPAdes, Ray Meta and Meta-IDBA are among the most widely used software as of today [28, 118–123]. Because many graph nodes must be processed during read assembly, de novo assemblers are computationally quite intensive, yet they are best suited to exploring unique environments. Metagenome assembly has improved over time but still faces many challenges, mainly due to computational memory constraints and the biological complexity of the data [53]. The population bias introduced by sequencing leads to the predominance of specific genomes and little or no coverage of others [124]. Hence, in order to assemble the data correctly, read coverage needs to be on the higher side (~10×) [124].
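The de Bruijn graph idea mentioned above can be illustrated with a minimal sketch. This is a toy example, not the algorithm implemented by any of the assemblers named here: the function names `build_de_bruijn` and `walk_unbranched` and the three short reads are hypothetical, and real assemblers additionally handle sequencing errors, reverse complements, branching paths and uneven coverage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from `start` while the path has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        (node,) = graph[node]
        if node in visited:  # stop on cycles (e.g. repeats)
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads, chosen only for illustration.
reads = ["ATGGC", "TGGCG", "GGCGT"]
graph = build_de_bruijn(reads, k=3)
contig = walk_unbranched(graph, "AT")
print(contig)  # ATGGCGT
```

Because every distinct (k-1)-mer becomes a node held in memory, the graph grows with the diversity of the sample, which is one reason de novo metagenome assembly is memory-hungry.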
Sequencing errors, such as the incorporation of repeats, have also been challenging for assembly, as they can be mistaken for identical regions within one genome, conserved regions across different species, or homologous segments across closely related strains [125, 126]. Under these circumstances, careful analysis of assembly metrics such as N50, average coverage and total assembly size can be used to assess the quality of an assembly [127]. A detailed discussion of (meta)genome assemblers remains outside the scope of this review; for details, see Refs. [53, 114].
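As a minimal sketch of the assembly metrics named above, the following computes total assembly size, N50 and average coverage from a list of contig lengths. The function names and the example numbers are hypothetical. N50 is taken here under its common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly), and average coverage is approximated as total sequenced bases divided by assembly size.

```python
def assembly_size(contig_lengths):
    """Total number of bases across all contigs."""
    return sum(contig_lengths)


def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    half = assembly_size(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def average_coverage(n_reads, read_length, total_size):
    """Approximate mean depth: sequenced bases divided by assembly size."""
    return n_reads * read_length / total_size


# Hypothetical contig lengths (bp), for illustration only.
contigs = [100, 80, 60, 40, 20]
size = assembly_size(contigs)
print(size)                             # 300
print(n50(contigs))                     # 80
print(average_coverage(30, 100, size))  # 10.0
```

None of these numbers measures correctness on its own: a misassembly that joins unrelated contigs raises N50, so the metrics are best read together.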

We have now moved from an era of restricted data generation into an era in which data interpretation is the main challenge. With the volume of data consistently increasing, there is a need for algorithms that can compare very large datasets. New algorithms are being written every day; however, they require large amounts of memory and specific hardware, which can be challenging to provide. In addition, each goal in (meta)genomics requires its own set of algorithms tailored to the specific objective. The development of data visualization tools has advanced considerably, given the importance of visualization in complete data analysis [128]. Genome analysis tools are largely command-line based and do not work efficiently through a graphical user interface (GUI) because of the high-throughput nature of the data. This hinders the progress of biology labs in this field and encourages collaborations across multidisciplinary labs.

Conclusions

Expansion of the emerging fields of genomics and metagenomics can provide access to the complete genetic content of a bacterium of interest and to the community profile of an environment, respectively. However, the conventional study of genomics requires culture-based bacterial isolation, which introduces a large bias since more than 99% of microorganisms are uncultivable. Metagenomics, in contrast, not only provides the overall taxonomic composition of an environment, showing the presence or absence of microbial entities, but can also target the genome of a single unculturable bacterium (species/strain), bypassing the need for isolation. This review comprehensively surveys the most recent techniques that use genomics and metagenomics data together to provide detailed insights into environmental microbiology (Fig. 2).