
1 Introduction

Microorganisms make up only 1–2% of the mass of a healthy human body, yet they have been suggested to outnumber human cells by 10 to 1 and human genes by 100 to 1. The majority of these microbes inhabit the gut and have a profound influence on human well-being (Bäckhed et al. 2005). It has long been recognized that microbes play major roles in maintaining health and causing illness, but relatively little is known about the role that microbial communities play in human health and disease (Cho and Blaser 2012; Lampe 2008). Much of our current knowledge about the human microbiome comes from culture-based approaches combined with 16S rRNA technology. However, it has to be noted that around 20–60% of the human-associated microbiome is uncultivable (Peterson et al. 2009). Projects such as the Human Microbiome Project and MetaHIT (Qin et al. 2010) were launched with the intention of generating resources that enable a comprehensive characterization of the human microbiota and analysis of its role in human health and disease. Figure 12.1 provides an overview of the methods involved in human microbiome analysis.

Fig. 12.1 Overall workflow of human microbiome analysis

Metagenomics, a term coined by Handelsman et al. (1998), made possible the direct genetic analysis of species that are refractory to culturing methods. Using metagenomics, several types of ecosystems, including extreme and low-diversity environments, have been studied so far (Oulas et al. 2015). Decoding the metagenome and its comprehensive genetic information can also be used to understand the functional properties of a microbial community, beyond studying population ecology. This has provided an enormous capacity for bioprospecting and has allowed the discovery of novel compounds with potential for biotechnological commercialization (Segata et al. 2011). Initially, metagenomics was used mainly to identify novel biomolecules from environmental microbial assemblages (Chistoserdova 2010), but the advent of next-generation sequencing at affordable cost has allowed more comprehensive examination of microbial communities through comparative community metagenomics, metatranscriptomics, and metaproteomics (Simon and Daniel 2010).

In order to disentangle the complex ecosystem functions of microbial communities and fulfill the promise of metagenomics, the comprehensive data sets derived from next-generation sequencing technologies require intensive analysis (Scholz et al. 2011). This demand has created the need for more powerful tools and software with unprecedented potential to shed light on the ecosystem functions and evolutionary processes of microbial communities.

2 Sequence Processing

Compared to conventional Sanger sequencing, next-generation sequencing platforms provide huge amounts of data at much lower recurring cost. Though these technologies share a number of steps, such as template preparation, sequencing and imaging, and data analysis, it is the unique combination of specific protocols that distinguishes one technology from another. This combination also determines the type of data produced by each platform, which poses challenges when comparing platforms on the basis of data quality and cost. As these new sequencing technologies produce hundreds of megabases of data at affordable cost, metagenomics is within the reach of many laboratories. The metagenomic analysis workflow begins with sampling and metadata collection and then proceeds with DNA extraction, library construction, sequencing, read preprocessing, and assembly. Binning is then applied to reads, contigs, or both, and community composition is analyzed with the help of reference databases. Some details of the workflow will differ between sequencing facilities.

One has to take greater care when processing metagenomic data sets than when processing genomic data sets, because metagenomic projects have no fixed end point and lack many of the quality assurance procedures used in genome projects (Kunin et al. 2008).

2.1 Preprocessing

Preprocessing of sequence reads is a critical and largely overlooked aspect of metagenomic analysis. It comprises base calling of the raw data coming off the sequencing machines, vector screening to remove cloning vector sequence, quality trimming to remove low-quality bases (as determined by base calling), and contaminant screening to remove verifiable sequence contaminants. Errors in any of these steps can have greater downstream consequences in metagenomes than in single-genome projects.

2.2 Sources of Bias and Error in 16S rRNA Gene Sequencing and Reducing Sequencing Error Rates

Irrespective of the technology used, scientists need to understand the quality of their data and how to reduce errors that affect downstream analyses. The two main categories of error commonly observed with 16S sequencing are misrepresentation of the relative abundances of microbial populations in a sample (bias) and misrepresentation of the actual sequence itself due to PCR amplification and sequencing (error) (Schloss et al. 2011). Misrepresentation of relative abundances might be due to the DNA extraction method (Miller et al. 1999), PCR primers and cycling conditions, 16S rRNA gene copy number, and the actual community composition of the original sample (Hansen et al. 1998). Misrepresentation of the actual sequence, on the other hand, arises from PCR polymerases, which typically have error rates of one substitution per 10^5–10^6 bases (Cline et al. 1996), the risk of chimera formation (Haas et al. 2011), and errors introduced by the sequencers themselves (Margulies et al. 2005). Because of their relative rates, sequencing errors and chimeras are of the most concern (Schloss et al. 2011).

Sequencing errors can be reduced in the following ways: removing sequences associated with low quality scores, removing ambiguous base calls, removing mismatches to the PCR primer, or removing sequences that are shorter or longer than expected. Denoising and removal of sequences that cannot be taxonomically classified are also employed, although the latter generally reduces the number of spurious OTUs and phylotypes rather than minimizing the actual error rate. Laehnemann et al. (2015) reported an extensive survey of the errors generated by the commonly used high-throughput sequencing platforms.
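As an illustration of these read-filtering criteria, the minimal Biopython sketch below discards reads that contain ambiguous bases, fall outside an expected length range, or have a mean Phred quality below a chosen cutoff. File names, thresholds, and the length range are hypothetical and would be tuned to the platform and amplicon used.

```python
from Bio import SeqIO

# Hypothetical thresholds; adjust to the platform and amplicon being analyzed
MIN_LEN, MAX_LEN = 200, 300      # expected amplicon length range
MIN_MEAN_QUAL = 25               # mean Phred quality cutoff

def passes_filters(record):
    seq = str(record.seq).upper()
    quals = record.letter_annotations["phred_quality"]
    if "N" in seq:                               # ambiguous base calls
        return False
    if not (MIN_LEN <= len(seq) <= MAX_LEN):     # unexpected read length
        return False
    if sum(quals) / len(quals) < MIN_MEAN_QUAL:  # low overall quality
        return False
    return True

kept = (r for r in SeqIO.parse("reads.fastq", "fastq") if passes_filters(r))
SeqIO.write(kept, "reads.filtered.fastq", "fastq")
```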

2.3 Base Calling and Quality Trimming

Base calling involves identifying DNA bases from the readout of a sequencing machine. A widely used base caller is Phred (Ewing et al. 1998). The quality score, q, assigned to a base is related to the estimated probability, p, of erroneously calling that base by the formula q = −10 × log10(p). Thus, a Phred quality score of 20 corresponds to an error probability of 1%. Paracel's TraceTuner (www.paracel.com) and ABI's KB (www.appliedbiosystems.com) are two other frequently used base callers, which behave very similarly to Phred by converting raw data into base calls with associated accuracy probabilities. Since metagenomic assemblies have lower coverage than genome assemblies, errors are more likely to propagate to the consensus. Some post-processing pipelines ignore the base quality scores associated with reads and contigs, and few take positional sequence depth into account as a weighting factor for consensus reliability. As a result, for an average user, low-quality data will be indistinguishable from the rest of the data set, and poor-quality reads that inadvertently pass through to gene prediction may end up in public repositories. Hence, quality trimming is highly recommended.
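To make the relationship between quality scores and error probabilities concrete, the short Python sketch below converts between the two using the formula given above; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score q into the estimated error probability p."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability p into the corresponding score q = -10 * log10(p)."""
    return -10 * math.log10(p)

print(phred_to_error_prob(20))    # 0.01  -> 1% chance the base call is wrong
print(phred_to_error_prob(30))    # 0.001 -> 0.1%
print(error_prob_to_phred(0.01))  # 20.0
```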

2.4 Denoising

Denoising is a computationally intensive process that removes problematic reads and increases the accuracy of the taxonomic analysis. This is critically important for 16S metagenomic data analysis, since sequencing noise may give rise to erroneous OTUs, and the need for it is sequencing platform-specific: Illumina data generally require less denoising than data from other platforms. Though a considerable number of sequences is generally lost, denoising usually results in high-quality sequences (Gaspar and Thomas 2013) at a given level of stringency (Bakker et al. 2012). Notable software packages commonly used to correct amplicon pyrosequencing errors include Denoiser (Reeder and Knight 2010), AmpliconNoise (Quince et al. 2011), Acacia (Bragg et al. 2012), DRISEE (duplicate read inferred sequencing error estimation) (Keegan et al. 2012), JATAC (Balzer et al. 2013), and CorQ (Iyer et al. 2013). Denoiser uses frequency-based heuristics rather than statistical modeling to cluster reads and makes more accurate assessments of alpha diversity when combined with chimera-checking methods. AmpliconNoise is highly effective but computationally intensive and applies an approximate likelihood based on empirically derived error distributions to remove pyrosequencing noise from reads. These two tools do not modify individual reads; rather, both select an "error-free" read to represent the reads in a given cluster. Acacia, on the other hand, is an error-correction tool that reduces the number and complexity of alignments and uses a quicker but less sensitive statistical approach to distinguish between error and genuine sequence differences. DRISEE assesses sequencing quality and provides positional error estimates that can be used to inform read trimming within a sample. The JATAC algorithm identifies duplicate reads based on flowgrams, an approach that has been shown to be superior for noise removal from metagenomic amplicon data and that also allows more effective removal of artificial duplicates. CorQ corrects homopolymer and non-homopolymer insertion and deletion (indel) errors by utilizing inherent base quality in a sequence-specific context.

2.5 Reducing Chimerism

Chimeras are fusion products formed between multiple parent sequences and are falsely interpreted as novel organisms. They are not sequencing errors in the usual sense, because they are not derived from a single reference sequence to which they can be mapped. A few commonly used programs for combating chimerism are Bellerophon, Pintail (Ashelford et al. 2005), ChimeraSlayer (Haas et al. 2011), Perseus (Quince et al. 2011), and Uchime (Edgar et al. 2011). The two algorithms most widely used for 16S chimera detection are Pintail and Bellerophon: the former is used by databases such as the RDP (Cole et al. 2009) and SILVA (Pruesse et al. 2007), and the latter by the GreenGenes 16S rRNA sequence collection (DeSantis et al. 2006). Pintail is generally viewed as a 16S anomaly detection tool rather than a chimera detection tool, but interestingly most anomalies detected by Pintail were chimeras (Ashelford et al. 2005). Perseus, unlike Pintail and Bellerophon, does not use a reference database, but it does require a training set of sequences similar to the sequences under characterization. Uchime outperformed ChimeraSlayer, especially in cases where the chimera has more than two parents, and its performance was comparable to that of Perseus.
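The sketch below illustrates only the general idea behind reference-based chimera detection, not the exact algorithm of any of the tools named above: the two halves of a query are assigned to their best-matching reference "parents", and a query whose halves come from different parents with markedly higher identity than any single full-length match is flagged as a candidate chimera. The crude similarity measure, the fixed midpoint split, and the margin threshold are simplifying assumptions; real tools work from proper alignments.

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Crude similarity score in [0, 1]; real tools use proper pairwise alignments."""
    return SequenceMatcher(None, a, b).ratio()

def best_parent(fragment, refs):
    """Return (reference_id, score) of the best-matching reference fragment."""
    return max(((rid, identity(fragment, rseq)) for rid, rseq in refs.items()),
               key=lambda item: item[1])

def looks_chimeric(query, references, margin=0.05):
    """Flag a query whose two halves favor different parents more strongly than
    any single reference explains the whole read.
    Assumes query and references cover the same aligned region."""
    half = len(query) // 2
    left_id, left_score = best_parent(query[:half],
                                      {r: s[:half] for r, s in references.items()})
    right_id, right_score = best_parent(query[half:],
                                        {r: s[half:] for r, s in references.items()})
    _, full_score = best_parent(query, references)
    return left_id != right_id and min(left_score, right_score) > full_score + margin
```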

3 Sequence Assembly

Shotgun sequencing generates sequences for many small fragments separately; these are then combined into a reconstruction of the original genome by computer programs called genome assemblers. These programs first assemble shorter reads into contigs, which are then oriented into scaffolds that provide a more compact and concise view of the sequenced community. Recent advances in genome sequencing technologies pose new challenges for the assembly process in terms of the volume of data generated, the length of the fragments, and new types of sequencing errors, especially in metagenomics (Pop 2009). Earlier metagenomic assemblies used tools originally designed for conventional whole-genome shotgun sequencing (WGS) projects, with minor parameter modifications (Wooley and Ye 2009), but more recent assemblers have become considerably more robust, specifically in handling samples containing multiple genomes. The assembly process can be approached either as reference-based assembly or as de novo assembly.

3.1 Reference-Based Assembly

In reference-based assembly, contigs are created by mapping reads onto one or more reference genomes belonging to the same species or genus as, or closely related to, the organisms in the sample, for which sequences have already been deposited in online data repositories and databases. Reference-based assembly tools are not computationally intensive and can perform well when metagenomic samples are derived from environments that have been extensively studied. Tools like GS Reference Mapper (Roche), MIRA 4 (Chevreux et al. 2004), AMOS, and MetAMOS (Treangen et al. 2013) are commonly used in metagenomics applications. The assemblies can be visualized using tools such as Tablet (Milne et al. 2009), EagleView (Huang and Marth 2008), and MapView (Bao et al. 2009). Gaps in the query genome(s) of the resulting assembly indicate that the assembly is incomplete or that the reference genomes used are too distantly related to the community under investigation.
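As a toy illustration of how such gaps show up, the pure-Python sketch below computes per-base coverage of a reference from mapped read positions and reports uncovered intervals; the reference length and read coordinates are made up for the example.

```python
def coverage_gaps(reference_length, mapped_reads):
    """mapped_reads: iterable of (start, length) read placements on the reference (0-based).
    Returns a list of (gap_start, gap_end) intervals with zero coverage."""
    depth = [0] * reference_length
    for start, length in mapped_reads:
        for i in range(start, min(start + length, reference_length)):
            depth[i] += 1
    gaps, gap_start = [], None
    for i, d in enumerate(depth + [1]):          # sentinel closes a trailing gap
        if d == 0 and gap_start is None:
            gap_start = i
        elif d > 0 and gap_start is not None:
            gaps.append((gap_start, i))
            gap_start = None
    return gaps

# Hypothetical example: a 50 bp reference covered by three reads
print(coverage_gaps(50, [(0, 20), (15, 20), (40, 10)]))   # -> [(35, 40)]
```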

3.2 De Novo Assembly

De novo assembly, on the other hand, is a computationally expensive process, requiring hundreds of gigabytes of memory and long execution times, which assembles contigs from de Bruijn graphs without any reference genome (Miller et al. 2010). Though tools such as EULER (Pevzner et al. 2001), FragmentGluer (Pevzner et al. 2004), Velvet (Zerbino and Birney 2008), SOAP (Li et al. 2008), ABySS (Simpson et al. 2009), and ALLPATHS (Maccallum et al. 2009) were built for assembling a single genome, they are still used for metagenomics applications today. EULER and ALLPATHS attempt to correct errors in reads prior to assembly, while Velvet and FragmentGluer deal with errors by editing the graphs. These assemblers often underperform on metagenome assemblies because of variation between similar subspecies and genomic sequence similarity between different species; in addition, differences in abundance among species in a sample result in different sequencing depths for individual species. Tools like Genovo (Laserson et al. 2011), MAP (Lai et al. 2012), MetaVelvet (Namiki et al. 2012), MetaVelvet-SL (Afiahayati and Sakakibara 2014), and Meta-IDBA (Peng et al. 2011) manage to create more accurate assemblies, especially from data sets containing a mixture of multiple genomes, by using k-mer frequencies to detect branch points in the de Bruijn graph. Using k-mer thresholds, they decompose the graph into subgraphs and then assemble contigs and scaffolds from the decomposed subgraphs. The IDBA-UD algorithm (Peng et al. 2012) additionally addresses the issue of uneven sequencing depths in metagenomic data by using multiple depth-relative k-mer thresholds to remove erroneous k-mers in both low-depth and high-depth regions.
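To give a flavor of the de Bruijn graph representation these assemblers build on, the minimal sketch below decomposes reads into k-mers, links (k-1)-mer nodes, and walks unambiguous paths into contigs. Real assemblers add error correction, coverage-based graph simplification, paired-end information, and scaffolding; the reads and k value here are toy examples.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def simple_contigs(graph):
    """Greedily extend each node while the path is unambiguous (exactly one successor)."""
    contigs = []
    for start in list(graph):
        node, contig, seen = start, start, {start}
        while len(graph.get(node, ())) == 1:
            nxt = next(iter(graph[node]))
            if nxt in seen:                      # avoid walking around a cycle
                break
            node, contig = nxt, contig + nxt[-1]
            seen.add(nxt)
        contigs.append(contig)
    return contigs

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]        # toy overlapping reads
contigs = simple_contigs(build_de_bruijn(reads, k=4))
print(max(contigs, key=len))                     # -> ATGGCGTGCAAT
```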

4 Analyzing Community Biodiversity

4.1 The Marker Gene

A microbial community is fundamentally a collection of individual cells with distinct genomic DNA. To describe the community, it is impractical to fully sequence every genome in every cell. Hence, microbial ecology has defined a number of unique tags for distinct genomes, called molecular markers. A marker is a small segment of DNA sequence that identifies the genome that contains it, eliminating the need to sequence the entire genome. Although many markers exist, a good marker has some desirable properties: it should be present in every member of a population, discriminate individuals with distinct genomes, and, ideally, differ in proportion to the evolutionary distance between distinct genomes.

By far the most ubiquitous and significant marker (Lane et al. 1985) is the small-subunit (16S) ribosomal RNA gene (Tringe and Hugenholtz 2008), which is the preferred target marker gene for bacteria and archaea. For fungi and eukaryotes, the preferred marker genes are the internal transcribed spacer (ITS) and the 18S rRNA gene, respectively (Oulas et al. 2015). The gold standard (Nilakanta et al. 2014) for 16S data analysis is QIIME (Caporaso et al. 2010). Another popular tool is Mothur (Schloss et al. 2009), which provides the user with a variety of choices by incorporating software such as DOTUR (Schloss and Handelsman 2005), SONS (Schloss and Handelsman 2006a), TreeClimber (Schloss and Handelsman 2006b), and many more algorithms. Other tools include SILVAngs (Quast et al. 2012) and MEGAN (Huson et al. 2007). These marker gene analyses generally involve searching a reference database to find the closest match to an OTU, from which a taxonomic lineage is inferred. Widely used databases for 16S rRNA gene analysis include GreenGenes (DeSantis et al. 2006) and the Ribosomal Database Project (Cole et al. 2007; Cole et al. 2009). Besides 16S, SILVA (Pruesse et al. 2007) also supports analysis of 18S sequences from fungi and eukaryotes, and UNITE (Koljalg et al. 2013) can be used for analyzing ITS.
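A highly simplified sketch of the greedy OTU clustering step that many of these pipelines perform is shown below: each sequence joins the first existing OTU whose representative it matches at or above a chosen identity threshold (97% is a common convention), otherwise it seeds a new OTU. The similarity measure here is a crude ratio, not the alignment-based identity real tools compute, and the data structure is purely illustrative.

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Crude pairwise similarity (0..1); real pipelines use pairwise alignment."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_otu_clustering(sequences, threshold=0.97):
    """Assign each sequence to the first OTU whose representative it matches,
    otherwise open a new OTU. `sequences` maps read id -> sequence string."""
    otus = []
    for seq_id, seq in sequences.items():
        for otu in otus:
            if identity(seq, otu["representative"]) >= threshold:
                otu["members"].append(seq_id)
                break
        else:
            otus.append({"representative": seq, "members": [seq_id]})
    return otus
```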

Unfortunately, few databases are available for analyzing the extremely diverse protists and viruses, for which considerably less sequence information is available than for bacteria. Humans are reported to carry not only viral particles consisting mainly of bacteriophages (Haynes and Rohwer 2011) but also a substantial number of eukaryotic viruses (Virgin et al. 2009). Like the bacterial microbiota, viromes show similar patterns at different stages of human life (Caporaso et al. 2011; Koenig et al. 2010), but the effects of these patterns in the human virome are mostly not understood, although certain bacteriophages in other animals are beneficial to the host (Oliver et al. 2009). The lack of a universal gene present in all viruses makes amplicon-based studies difficult for characterizing the virome in its totality.

5 Analyzing Functional Diversity

This generally involves identifying protein coding sequences from the metagenomic reads and comparing them to a database (for which some functional information is available) to infer function based on similarity to sequences in the database. Besides depicting the functional composition of the community (Looft et al. 2012) or functions that associate with specific environmental or host-physiological variables (Morgan et al. 2012), such analyses may also reveal the presence of novel genes (Nacke et al. 2011) or provide insight into the ecological conditions associated with genes whose function is currently unknown (Buttigieg et al. 2013). Functional annotation of a metagenome involves two non-mutually exclusive steps: gene prediction and gene annotation.

5.1 Gene Prediction

Gene prediction can be performed on assembled or unassembled metagenomic sequences. Metagenomic reads or contigs are scanned to identify protein coding genes (CDSs), as well as CRISPR repeats, noncoding RNAs, and tRNAs. Predicting CDSs from metagenomic reads is a fundamental step of annotation. Gene prediction for metagenomic sequences can be performed in three ways: first, by mapping the metagenomic reads or contigs to a database of gene sequences; second, by protein family classification; and, third, by de novo gene prediction.

Mapping the metagenomic reads or contigs to a database of gene sequences is a straightforward method of identifying coding sequences in a metagenome. This approach can simultaneously provide functional annotation, if functional annotation of the matched gene is available. It is a high-throughput gene prediction procedure, as mapping algorithms rapidly assess whether a genomic fragment is nearly identical to a database sequence. It is generally useful for cataloging the specific genes present in a metagenome but not appropriate for predicting novel or highly divergent genes, owing to the underrepresentation of genomes in sequence databases.

The second method is the most frequently used gene prediction procedure, in which each metagenomic read is translated into all six possible protein coding frames and each of the resulting peptides is compared to a database of protein sequences. Tools like transeq (Rice et al. 2000), USEARCH (Edgar 2010), RAPSearch (Zhao et al. 2011), and lastp (Kielbasa et al. 2011) translate reads prior to conducting protein sequence alignment, whereas algorithms like blastx (Altschul et al. 1997), USEARCH with the ublast option, and lastx (Kielbasa et al. 2011) translate nucleic acid sequences on the fly. As this approach also relies on a database, it can reveal only diverged homologues of known proteins and is not useful for identifying novel types of proteins. Common functional databases include SMART (Schultz et al. 1998), SEED (Overbeek et al. 2005), NCBI nr (Pruitt et al. 2011), the KEGG Orthology (Kanehisa and Goto 2000), COGs (Tatusov et al. 1997), MetaCyc (Caspi et al. 2012), eggNOG (Powell et al. 2011), and Pfam (Punta et al. 2011). Integrated pipelines with built-in functional annotation, such as MG-RAST (Meyer et al. 2008), the MEtaGenome ANalyzer (MEGAN) (Huson et al. 2007), and HUMAnN (Abubucker et al. 2012), are also available to automate these tasks.
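The six-frame translation underlying this approach can be sketched in a few lines of Biopython: each read is translated in the three forward and three reverse-complement frames before protein-level searching. The read below is a made-up example, and stop codons are simply left as '*' characters.

```python
from Bio.Seq import Seq

def six_frame_translation(read):
    """Return the six conceptual translations of a nucleotide read."""
    peptides = []
    for strand in (Seq(read), Seq(read).reverse_complement()):
        for frame in range(3):
            sub = strand[frame:]
            sub = sub[: len(sub) - len(sub) % 3]   # trim to a whole number of codons
            peptides.append(str(sub.translate()))  # '*' marks stop codons
    return peptides

# Hypothetical short read; real reads would typically be ~100-400 bp
for i, pep in enumerate(six_frame_translation("ATGGCCATTGTAATGGGCCGCTG"), 1):
    print(f"frame {i}: {pep}")
```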

Contrary to the two methods above, de novo gene prediction does not rely on a reference database or on sequence similarity. Rather, gene prediction systems are trained on various properties of microbial genes, such as gene length, codon usage, and GC bias. Hence this method can potentially identify novel genes, although it is difficult to determine whether a predicted gene is real or spurious. Tools like MetaGene (Noguchi et al. 2006), MetaGeneAnnotator (Meyer et al. 2008), Glimmer-MG (Kelley et al. 2011), MetaGeneMark (Zhu et al. 2010), FragGeneScan (Rho et al. 2010), Orphelia (Hoff et al. 2009), and MetaGun (Liu et al. 2013) can be used for de novo gene prediction. Yok and Rosen (2011) recommended that gene prediction in metagenomes can be improved when multiple methods are applied to the same data, that is, by following a consensus approach. Though time-consuming, this method tends to be more discriminating than six-frame translation during annotation (Trimble et al. 2012).

RNA genes (tRNA and rRNA) can be predicted using tools like tRNAscan (Lowe and Eddy 1997). tRNA predictions are quite reliable, but rRNA gene predictions are not. Other types of noncoding RNA (ncRNA) genes can be detected by comparison to covariance models (Griffiths-Jones et al. 2005) and sequence-structure motifs (Macke et al. 2001); these methods are computationally intensive and take a long time on metagenomic data sets. ncRNA predictions are usually excluded from downstream analyses because of the complexity arising from their lack of conservation and from the lack of reliable ab initio methods even for isolated genomes.

Errors in gene prediction mainly occur due to chimeric assemblies or frameshifts (Mavromatis et al. 2007). Hence, the quality of gene prediction normally relies on the quality of read preprocessing and assembly. Though gene prediction can be performed on both assembled reads (contigs) and unassembled reads, it is advisable to perform gene calling on both. It has been observed that gene prediction methods applied to accurately assembled sequences predicted more than 90% of genes correctly, whereas predictions made on unassembled reads exhibited lower accuracy (~70%) (Mavromatis et al. 2007).

5.2 Functional Annotation

Functional annotation of metagenomic data sets is made by comparing predicted genes to existing, previously annotated sequences or by context-based annotation. Metagenomic data pose complications when the predicted proteins are short and lack homologues. Resources used for comparing protein sequences include profile alignments of the protein families in TIGRFAMs (Selengut et al. 2007), Pfam (Finn et al. 2008), and COGs (Tatusov et al. 1997), for example via RPS-BLAST (Markowitz et al. 2006). Pfam allows the identification and annotation of protein domains, while the TIGRFAMs database includes models for both domains and full-length proteins. Though COGs also allow annotation of full-length proteins, the database is not updated as frequently as Pfam and TIGRFAMs. It is also recommended not to assign protein function solely on the basis of BLAST results, as there is a potential for error propagation through databases (Kyrpides and Ouzounis 1999). Context-based annotation methods include genomic neighborhood (Overbeek et al. 1999), gene fusion (Marcotte et al. 1999b), phylogenetic profiles (Pellegrini et al. 1999), and coexpression (Marcotte et al. 1999a). Neighborhood analysis performed on metagenomic data, combined with homology searches, inferred specific functions for 76% of a metagenomic data set (83% when nonspecific functions are considered) (Harrington et al. 2007) and is expected to be used for predicting protein function in metagenomic data in the future.

6 Metatranscriptomic Analysis

Metatranscriptome sequencing has recently been employed to identify RNA-based regulation and expression in the human microbiome (Markowitz et al. 2008). Accessing the metatranscriptome of the microbiome through metatranscriptomic shotgun sequencing (RNA-seq) has led to the discovery and characterization of new genes from uncultivated microorganisms under different conditions. A few investigations (Bikel et al. 2015; Franzosa et al. 2014; Gosalbes et al. 2011; Jorth et al. 2014; Knudsen et al. 2016) have combined metatranscriptomics with metagenomics. Several technical issues affecting large-scale application of metatranscriptomics are discussed by Bikel et al. (2015). Though metagenomic and metatranscriptomic data provide extensive information about microbiota diversity, gene content, and potential functions, it is very difficult to say whether the DNA comes from viable cells or whether the predicted genes are expressed at all and, if so, under what conditions and to what extent (Gosalbes et al. 2011).

The bioinformatics pipeline for analyzing data obtained from a metatranscriptomic experiment is similar to the one used in metagenomics. It is likewise divided into two strategies: mapping sequence reads to reference genomes or pathways, to identify the taxonomic classification of active microorganisms and the functionality of their expressed genes, and de novo assembly of new transcriptomes. For de novo assembly, several programs, such as SOAPdenovo (Li et al. 2009), ABySS (Birol et al. 2009), and Velvet-Oases (Schulz et al. 2012), have been reported to be successfully applied to metatranscriptome assembly (Ghaffari et al. 2014; Ness et al. 2011; Schulz et al. 2012; Shi et al. 2011). Trinity (Haas et al. 2013), a program developed specifically for de novo transcriptome assembly from short-read RNA-seq data, is one of the most widely used bioinformatics tools for assembling de novo transcriptomes of different species; it is very efficient and sensitive in recovering full-length transcripts and isoforms (Ghaffari et al. 2014; Luria et al. 2014).

Metatranscriptome analysis involves a stepwise approach for detecting the different RNA types, such as rRNAs, mRNAs, and other noncoding RNAs, enabling researchers to study them individually. The reads can first be compared against a small subunit rRNA reference database (SSUrdb), and the remaining unassigned reads can then be analyzed with a large subunit rRNA reference database (LSUrdb), both compiled from SILVA (Pruesse et al. 2007) or RDP II (Cole et al. 2009). The non-rRNA fraction can then be identified by subtracting the LSU rRNA and SSU rRNA reads from the total reads obtained, and these non-rRNA reads are finally carried forward for functional analyses.
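A minimal sketch of this subtraction step is shown below, assuming the SSU and LSU searches have already produced files of matching read identifiers; the file names are hypothetical.

```python
def read_ids(path):
    """Load one read identifier per line into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

all_reads = read_ids("all_read_ids.txt")        # every read in the experiment
ssu_hits = read_ids("ssu_rRNA_hits.txt")        # reads matching the SSU rRNA database
lsu_hits = read_ids("lsu_rRNA_hits.txt")        # reads matching the LSU rRNA database

non_rrna = all_reads - ssu_hits - lsu_hits      # putative mRNA / other ncRNA reads
print(f"{len(non_rrna)} of {len(all_reads)} reads carried forward for functional analysis")
```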

The functional diversity of the microbiome can be predicted by annotating metatranscriptomic sequences with known functions. cDNA sequences with no significant homology to any of the rRNA databases can be searched against the NCBI nr protein database using BLASTX (Altschul et al. 1997). Sequence reads that contain protein coding genes are identified, and their sequences are compared to the coding sequences in protein databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), protein family annotations (Pfam), Gene Ontology (GO), and Clusters of Orthologous Groups (COG). The function of a query sequence is thus assigned on the basis of its homology to sequences functionally annotated in these databases.
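As an illustration of this homology-based assignment, the sketch below keeps the best hit per query from a BLASTX search, assuming the search was run with the standard 12-column tabular output (-outfmt 6); the input file name and e-value cutoff are hypothetical.

```python
import csv

# Standard 12 columns of BLAST tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(blast_tab, max_evalue=1e-5):
    """Keep the highest-scoring hit per query that passes the e-value cutoff."""
    best = {}
    with open(blast_tab) as handle:
        for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
            if float(row["evalue"]) > max_evalue:
                continue
            current = best.get(row["qseqid"])
            if current is None or float(row["bitscore"]) > float(current["bitscore"]):
                best[row["qseqid"]] = row
    return best

# Hypothetical usage: annotate each read with the subject id of its best nr hit
for query, hit in best_hits("cdna_vs_nr.blastx.tab").items():
    print(query, hit["sseqid"], hit["evalue"])
```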

Pipelines and tools for combining metatranscriptomics with metagenomics include INFERNAL, a powerful tool for predicting small RNAs in metagenomic data (Nawrocki and Eddy 2013). HUMAnN is another automated, offline pipeline that determines the presence/absence and abundance of microbial pathways and gene families in a community directly from metagenomic sequence; it converts sequence reads into coverage and abundance values and finally summarizes the gene families and pathways in a microbial community (Abubucker et al. 2012). Other offline platforms used to analyze metagenomic data include MEGAN (Huson et al. 2007), the IMG/M server (Markowitz et al. 2008), MG-RAST (Meyer et al. 2008), and JCVI Metagenomics Reports (METAREP) (Goll et al. 2010).

7 Statistical Analysis in Metagenomics

Statistical analysis plays a critical role in analyzing and interpreting metagenomic data. Even a seemingly simple metagenomic analysis, such as estimating species diversity, is not straightforward and clearly requires statistical attention because of the artifacts created during sequencing (discussed earlier).

Critical statistical analysis is often preceded by normalization (i.e., normalization to a reference sample), a step that reduces systematic variance and improves the overall performance of downstream statistical analysis. Normalization methods include centering, autoscaling, Pareto scaling, range scaling, vast scaling, log transformation, and power transformation. The appropriate selection of data pretreatment methods and its significance have been discussed by van den Berg et al. (2006).
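As a small illustration of two of these pretreatment methods, the NumPy sketch below applies autoscaling (mean-center each variable and divide by its standard deviation) and Pareto scaling (divide by the square root of the standard deviation) to a samples-by-features abundance matrix; the matrix values are made up for the example.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and divide by its standard deviation (unit variance)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Mean-center each column and divide by the square root of its standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

# Hypothetical abundance matrix: rows = samples, columns = taxa or genes
X = np.array([[10.0, 200.0, 3.0],
              [12.0, 180.0, 5.0],
              [ 8.0, 220.0, 4.0]])
print(autoscale(X))
print(pareto_scale(X))
```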

Robust data processing algorithms for a wide range of analyses are mostly built from repositories available through the open-source R project (http://www.R-project.org) and the R-based Bioconductor project (https://www.bioconductor.org/). These are widely considered to be the most complete collection of up-to-date statistical and machine learning algorithms (Xia et al. 2009). Common statistical analyses include missing value estimation, diversity analysis, and univariate and multivariate analyses such as directions of variance, cluster analysis, etc.

Missing values can be handled by exclusion, replacement, or imputation, using methods such as probabilistic PCA (PPCA), Bayesian PCA (BPCA), and singular value decomposition imputation (SVDImpute) (Stacklies et al. 2007; Steinfath et al. 2008).

Univariate analysis commonly relies on three methods: fold-change analysis, t-tests, and volcano plots. The t-test attempts to determine whether the means of two groups are distinct; from the t-value, a P-value can be calculated to decide whether the distinction is statistically significant. Volcano plots compare the size of the fold change to the statistical significance level (Xia et al. 2009).

Directions of maximum variance can be determined by principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA). PCA is an unsupervised method that aims to find the directions of maximum variance in a data set (X) without referring to the class labels (Y), whereas PLS-DA is a supervised method that uses a multiple linear regression technique to find the direction of maximum covariance between a data set (X) and the class membership (Y). In both PCA and PLS-DA, the original variables are summarized into far fewer variables, called scores, computed as their weighted averages.

Diversity analysis can be performed by estimating alpha diversity, which provides a summary statistic of a single population, or beta diversity, which compares organismal composition between populations. Chao1 (Chao 1984), the abundance-based coverage estimator (ACE) (Chao et al. 1993), and the Jackknife estimator (Heltshe and Forrester 1983) measure alpha diversity, species richness, and evenness (species distribution) expected within a single population; these estimates result in collector's or rarefaction curves (Colwell and Coddington 1994). Alpha diversity is often quantified by the Shannon index (Shannon 1948) or the Simpson index (Simpson 1949). Beta diversity can be measured by simple taxa overlap or quantified by the Bray-Curtis dissimilarity (Bray and Curtis 1957) or UniFrac (Lozupone and Knight 2005).

Two major approaches to cluster analysis are hierarchical and partitional clustering. Hierarchical (also called agglomerative) clustering begins with each sample considered as a separate cluster and then proceeds to combine clusters until all samples belong to one cluster. The result is usually presented as a dendrogram or as a heat map, which displays the actual data values using color gradients. Linkage methods include average linkage, complete linkage, single linkage, and Ward's linkage, while dissimilarity measures include Euclidean distance, Pearson's correlation, and Spearman's rank correlation. Partitional clustering, on the other hand, attempts to directly decompose the data set into a user-specified number of disjoint clusters, using methods such as k-means clustering and self-organizing maps (SOMs). k-Means clustering creates k clusters such that the sum of squares from points to their assigned cluster centers is minimized. A SOM is an unsupervised neural network based on a grid of interconnected nodes, each of which contains a model.
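To make a few of these measures concrete, the sketch below computes the Shannon index, the Simpson index, and the Bray-Curtis dissimilarity from raw count vectors. It is a minimal illustration using toy OTU counts, not a replacement for dedicated ecology packages.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over nonzero proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson's index D = sum(p_i^2); 1 - D is often reported as diversity."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

sample_a = [30, 10, 5, 0, 2]    # toy OTU counts for two samples
sample_b = [25, 0, 10, 8, 1]
print("Shannon A:", round(shannon_index(sample_a), 3))
print("Simpson A:", round(simpson_index(sample_a), 3))
print("Bray-Curtis A vs B:", round(bray_curtis(sample_a, sample_b), 3))
```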

Demands for new statistical methods to support emerging trends in metagenomics have resulted in more efficient implementations and better data visualization to accommodate the tremendous increase in data analysis workloads. Web-based servers, with their user-friendly interfaces, comprehensive data processing options, wide arrays of statistical methods, and extensive support for data visualization and analysis, play a key role. Servers such as GEPAS (Herrero et al. 2003), CARMAweb (Rainer et al. 2006), MG-RAST (Meyer et al. 2008), MEGAN (Huson et al. 2007), QIIME (Caporaso et al. 2010), Mothur (Schloss et al. 2009), and MetaboAnalyst (Xia et al. 2015) are a few worth mentioning. Table 12.1 summarizes some of the commonly used tools in microbiome analysis and their internet resources.

Table 12.1 Selected tools and their resources for microbiome analysis

8 Analysis of Human Microbiome

Since birth, continuous exposure to microbial challenges has shaped the human microbiome, whose perturbation affects both human health and disease (Segal and Blaser 2014). In recent years, knowledge about the composition, distribution, and variation of bacteria in the human body has increased dramatically. Besides external factors such as air, food, and environment, routine activity, habits, and physiology create selective pressures on each organism. In order to understand the influence of the human microbiome, several studies have assessed the microbial composition at different body sites, such as stool, nasal, skin, vaginal, and oral samples, of healthy and unhealthy individuals (Kraal et al. 2014). Determining the extent of the variability of the human microbiome is therefore crucial for understanding the microbiology, genetics, and ecology of the microbiome; it is also useful for practical issues in designing experiments and interpreting clinical studies (Zhou et al. 2014).

Studies demonstrating the feasibility of using the composition of the gut microbiome to detect the presence of precancerous and cancerous lesions (Zackular et al. 2014), relating ethnicity to significant differences in the vaginal microbiome (Fettweis et al. 2014), discovering closely related oligotypes, sometimes differing by as little as a single nucleotide, with dramatically different distributions among oral sites and among individuals (Eren et al. 2014), interrogating the less robustly characterized placental microbiome (Aagaard et al. 2014), and describing altered interactions between intestinal microbes and the mucosal immune system resulting in inflammatory bowel disease (IBD) (Kostic et al. 2014) have taken us to the next level of understanding the human microbiome. Other studies, such as those addressing the etiology and pathogenesis of reflux disorders and esophageal adenocarcinoma (Yang et al. 2014) and the effect of an altered microbiome on pulmonary responses (Segal and Blaser 2014), will certainly be critical and will open doors for future investigations.

9 Conclusion

The human microbiota includes the microorganisms living on the surface of and inside the body, and they are important for the host's health. These communities are highly dynamic and can be influenced by a number of factors such as age, diet, and physiology. Studies have shown that most of the adult human microbiota lives in the gut and follows specific microbial signatures, but with high intraindividual variability over time. Alterations of the human gut microbiome can play a role in disease development; the microbiome could therefore become a potent target for diagnostic and therapeutic applications. Since early microbial studies were based on direct cultivation and isolation of microbes, clinical applications faced several limitations, especially regarding growth conditions. Studies have shown that not all microbes are currently cultivable, and methods developed for cultivable organisms are not suitable for studying the entire microbiome. Metagenomics has enabled the direct genetic analysis of genomes contained within an environmental sample without the need for cultivation. Metagenomic studies using NGS-based methods can be approached by amplifying 16S rRNA genes with specific primers or through whole-genome shotgun sequencing. The 16S sequences identified can be used to describe community relative abundance and/or phylogenetic relationships by clustering into operational taxonomic units (OTUs) using databases of previously annotated sequences. In the whole-genome shotgun sequencing approach, where random primers are used to amplify all microbial genes, the relative abundances of genes and pathways can be determined by comparing the sequences to functional databases.

Next-generation sequencing (NGS) technologies have not only increased the throughput of bases sequenced per run but also reduced sequencing costs. This has had a major impact on the field of metagenomics, where a specific microbiome can now be characterized qualitatively and quantitatively in depth without the selection bias and constraints associated with cultivation methods. Continuous advances in sequencing technologies have not only made it possible to address more complex habitats but have also imposed growing demands on bioinformatic data post-processing. Analyzing the huge amounts of data produced by these technologies has become the bottleneck, especially in larger metagenome projects. From assembly to analysis, bioinformatic post-processing requires dedicated data integration pipelines, some of which have yet to be developed.