
5.1 Introduction

There is growing evidence that the human “microbiome”, the microbes living in intimate association with us, forms a vital part of our biology and plays an important role in both health and disease [1]. Two basic approaches to microbiome analysis are metagenomic methods, which sequence DNA without first identifying which organisms it comes from [2], and 16S rRNA sequencing [3], which sequences a marker (tag) gene to identify the composition of organisms.

Recently, huge amounts of data have been generated by microbiome projects such as the Human Microbiome Project (HMP) [4, 5] and Metagenomics of the Human Intestinal Tract (MetaHIT) [6]. These datasets provide great opportunities to study the unknown world of microbes. Analyzing and mining these data will help us better understand the function and structure of the microbial communities of the human body and, in turn, their relationship to our health [7, 8].

However, the huge data volume, the complexity of microbial communities, and the intricate properties of the data have introduced challenges for microbiome data analysis and mining [9, 10]. Bioinformaticians, including computer scientists, mathematicians, and microbiologists, work together to develop computational approaches to these challenges, focusing roughly on four computational tasks: (1) dimension reduction and visualization approaches to explore and visualize microbiome data, (2) statistical methods to infer true correlations and relationships between microbes and diseases, (3) computational methods to identify and extract microbial interactions from microbiome datasets, and (4) dynamic modeling and time series analysis to model the ecological system in a holistic way.

Metagenomic data analysis is a timely topic, and there is a great need for better algorithms to analyze complex microbiome datasets. These efforts will undoubtedly lead to biological insights into how microbes impact human health. We briefly introduce current advances in the four aspects of microbiome data analysis and mining mentioned above.

5.2 Dimension Reduction and Pattern Identification

After preprocessing, DNA data from metagenomic or 16S rRNA sequencing can be summarized as metagenomic profiles [11], which record the abundance of functional or taxonomic categories in the metagenomic sequences. A metagenomic profile matrix typically covers hundreds of metabolic pathways, thousands of species, or tens of thousands of protein families [12]. Machine learning and multivariate statistics have been applied to the profile matrix to explore and extract complex patterns and correlations [13]. After dimension reduction, metagenomic profiles are usually represented by several “components”, which may facilitate biological interpretation and discovery [14].

For example, principal component analysis (PCA) has frequently been applied to metagenomic profiles to characterize the relationships among metagenomic samples [15]. Another method, multidimensional scaling (MDS), which works from dissimilarities between samples rather than the similarities used in PCA, has been adopted as a standard technique for visualizing the taxonomic relationships in microbial communities [15]. Recently, a nonnegative matrix factorization (NMF) framework has been used to analyze metagenomic profiles and gain a different, complementary perspective on the relationships between functions, environment, and biogeography in global ocean and soil environments [16,17,18].
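As a rough illustration of the two linear projections mentioned above, the following sketch computes PCA scores and a classical (metric) MDS embedding for a samples-by-features profile matrix. The function names are hypothetical, and the choice of Bray-Curtis dissimilarity as the MDS input is an assumption made for illustration; the cited studies may use other dissimilarities and implementations.

```python
import numpy as np

def pca_scores(profile, n_components=2):
    """Project samples (rows) onto the leading principal components."""
    centered = profile - profile.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def bray_curtis(profile):
    """Pairwise Bray-Curtis dissimilarity between nonnegative sample rows."""
    n = profile.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            total = (profile[i] + profile[j]).sum()
            diff = np.abs(profile[i] - profile[j]).sum()
            dist[i, j] = dist[j, i] = diff / total if total > 0 else 0.0
    return dist

def classical_mds(dist, n_components=2):
    """Classical MDS: embed samples from a dissimilarity matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    top = np.clip(eigval[order], 0.0, None)
    return eigvec[:, order] * np.sqrt(top)
```

PCA operates directly on the centered profile matrix, whereas MDS only needs the pairwise dissimilarities, which is why ecological dissimilarities can be plugged in without changing the embedding step.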

Microbiome datasets can be represented by metabolic pathways, taxonomic assignments, or gene families [19]. Data integration approaches can combine these multiple views simultaneously to obtain a comprehensive picture that reveals the underlying data structure shared by the views [20]. A novel variant of symmetric nonnegative matrix factorization (SNMF) [21], called Laplacian regularized joint symmetric nonnegative matrix factorization (LJ-SNMF), has been proposed for this purpose. We conducted extensive experiments on several realistic datasets, including Human Microbiome Project (HMP) data [4, 5]. The experimental results show that the proposed method outperforms other variants of NMF, which suggests that LJ-SNMF is a promising tool for clustering multi-view datasets.
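To make the NMF family of methods more concrete, here is a minimal sketch of plain NMF with the standard multiplicative updates for the Frobenius objective. It is not LJ-SNMF and includes neither the symmetric, joint, nor Laplacian-regularized parts; the function name, iteration count, and random initialization are illustrative assumptions.

```python
import numpy as np

def nmf(profile, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix as profile ~ w @ h with w, h >= 0."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = profile.shape
    w = rng.random((n_rows, rank)) + eps
    h = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep both factors nonnegative.
        h *= (w.T @ profile) / (w.T @ w @ h + eps)
        w *= (profile @ h.T) / (w @ h @ h.T + eps)
    return w, h
```

Each row of `h` can be read as a nonnegative "component" over features, and each row of `w` gives the weights of those components in one sample.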

Furthermore, linear correlation and regression methods, such as Pearson correlation and canonical correlation analysis (CCA) [22], are also employed to investigate the relationships among taxa or functions and their relationships to existing environmental or physiological data (metadata). CCA has been used to investigate the linear relationships between environmental factors and functional categories in the global ocean [23].
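A compact way to see how CCA generalizes Pearson correlation is to compute canonical correlations from orthonormal bases of the two centered data blocks. The sketch below assumes more samples than variables and full column rank in both blocks; the function name is hypothetical, and regularized variants would be needed for typical high-dimensional profiles.

```python
import numpy as np

def canonical_correlations(x, y):
    """Canonical correlations between two blocks of variables.

    x and y hold the same samples in rows (e.g. functional categories
    and environmental factors). Assumes n_samples > n_variables and
    full column rank in both blocks.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered blocks.
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    # Singular values of qx.T @ qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)
```

When each block contains a single variable, the one canonical correlation equals the absolute value of the Pearson correlation between the two variables.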

The vast majority of methods used in current metagenomic analysis assume that structures and relationships in a microbial community are linear. However, the interactions among microbiota are most likely nonlinear, and microbiota most likely lie on a manifold [24] or in a probabilistic space [25, 26] rather than in Euclidean space. These structures can be visualized and explored using only a few components, the intrinsic dimensions discovered by manifold and probabilistic models. This provides a mechanistic understanding of how a microbial community is generated by probabilistic mixing of microbial components, as well as a powerful tool for exploring the temporal dynamics of microbiome composition.

Finally, many kinds of nonlinear relationships, such as taxa-taxa patterns and function-environment correlations, could be investigated using nonlinear statistical methods. We summarize these steps in a computational framework (see Fig. 5.1). The framework is based on our current understanding of metagenomic data, and we will integrate advanced nonlinear dimension reduction and statistical methods into it to discover novel relationships.

Fig. 5.1 Nonlinear analysis framework for metagenomic profiles

5.3 Relationship and Correlation

Another important problem in microbiome analysis is to identify the biomarkers (i.e., bacterial taxa, microbial genes, or pathways) that are associated with disease, where the microbiome data are summarized as the composition of bacterial taxa, protein families, or metabolic pathways at different levels [27]. To discover biomarkers for diseases or environmental factors, the most common approaches use regression techniques that incorporate the complex interaction patterns among species (or gene functions). We have developed a new regression framework called “manifold-constrained regularization” (McRe) [28], which inherits the strength of manifold embedding for the regularization of linear regression. This method can incorporate a species interaction network as prior information to infer novel relationships.
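The general idea of using an interaction network as a prior in regression can be sketched with a graph-Laplacian penalty, which pulls the coefficients of connected species toward each other. This is a generic network-regularized least squares, not the McRe formulation of [28]; the function name, the penalty form, and the parameter `lam` are assumptions for illustration.

```python
import numpy as np

def laplacian_regression(x, y, laplacian, lam):
    """Minimise ||y - x @ beta||^2 + lam * beta.T @ laplacian @ beta.

    laplacian is the graph Laplacian of a (hypothetical) species
    interaction network over the columns of x. Assumes the combined
    matrix x.T @ x + lam * laplacian is invertible.
    """
    lhs = x.T @ x + lam * laplacian
    return np.linalg.solve(lhs, x.T @ y)
```

With `lam = 0` the estimate reduces to ordinary least squares; as `lam` grows, coefficients of species joined by an edge are forced toward a common value.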

Several studies consider regression analysis of microbiome compositional data, where the goal is to identify biomarkers associated with a continuous response such as body mass index (BMI) [9]. Compositional data are multivariate, strictly positive, and constrained to sum to one. Lin et al. proposed a variable selection procedure for such models in high-dimensional settings and derived the weak oracle property of the resulting estimates [29]. Shi et al. developed a penalized estimation procedure for estimating the regression coefficients and selecting variables under linear constraints [30]. Their method provides valid confidence intervals for the regression coefficients, from which p-values measuring statistical significance can be obtained [30]. Randolph et al. formulated a family of regression models that naturally extends the dimension-reduced graphical explorations common in microbiome studies; the method can be viewed as a penalized version of the low-dimensional linear model for compositions [31].
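The unit-sum constraint is commonly handled with a log-contrast model, in which the response is regressed on log-proportions and the coefficients are constrained to sum to zero. The sketch below fits the unpenalized version of such a model through the centered log-ratio (CLR) transform; it omits the penalties and inference procedures of [29, 30], and the function names are hypothetical.

```python
import numpy as np

def clr(composition):
    """Centered log-ratio transform of strictly positive compositions."""
    logx = np.log(composition)
    return logx - logx.mean(axis=1, keepdims=True)

def zero_sum_least_squares(composition, response):
    """Least squares for y = b0 + log(X) @ beta subject to sum(beta) = 0.

    Every row of the CLR matrix sums to zero, so the all-ones vector lies
    in its null space and the minimum-norm least-squares solution
    satisfies the zero-sum constraint automatically.
    """
    z = clr(composition)
    z = z - z.mean(axis=0)              # absorb the intercept
    yc = response - response.mean()
    beta, *_ = np.linalg.lstsq(z, yc, rcond=1e-10)
    return beta
```

Because of the zero-sum constraint, the fitted model does not change when every sample is rescaled, so relative abundances and raw counts with a common scaling give the same coefficients.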

5.4 Networking Microbiome

A network perspective provides unprecedented opportunities for integrating and analyzing big microbiome data to study the structure and function of microbial communities [32]. A microbial interaction network (MIN, e.g., a species-species interaction network) shapes the structure of a microbial community and underlies its ecosystem functions and principles, i.e., the regulation of microbe-mediated biogeochemical processes [33]. Deciphering interspecies interactions in the wet lab is challenging because of the difficulty of coculture experiments and the complicated patterns of species interactions [34]. Knowledge of small-scale microbial interactions, such as pairwise competition, is often scattered across various media, including the PubMed literature, biological databases, and Wikipedia documents, making it difficult to integrate and analyze [35]. Researchers have started to infer pairwise interspecies interactions, such as competitive and cooperative interactions, by leveraging heterogeneous microbial data including metagenomes, microbial genomes, and literature data. These efforts have facilitated the discovery of previously unknown principles of MINs and have helped verify the consistency of, and resolve contradictions in, the application of macroscopic ecological theory to microscopic ecology [36].

Species interact in complex ways, and many types of interaction remain unknown. Previous work on inferring species interactions from metabolic information follows two approaches. Bornstein proposed a computational method for inferring pairwise interactions from the reconstructed metabolic networks of species whose whole-genome sequences are publicly available [37]. The method can identify pairwise competitive and cooperative interactions. The other approach uses flux balance analysis (FBA) models [38] to infer species interactions when a metabolic model of each species (or strain) is available [39].

More than 100 genome-scale metabolic network models have been published. Constraint-based modeling (CBM) has been used to infer three types of potential interaction [40]: negative, where two species compete for shared resources; positive, where metabolites produced by one species are consumed by another, producing a synergistic co-growth benefit; and neutral, where co-growth has no net effect. By simulating the community metabolic network with FBA, we can find key enzymes and reactions in the network, which can act as potential environmental and physiological fingerprints. In a two-species system, the CBM solver determines the type of interaction by comparing the total biomass production rate of the pair (denoted AB) with the sum of the rates of the two species grown individually (denoted A+B). The CBM model is defined in Fig. 5.2, where V_{BM,m} is the maximal biomass production rate in a system m, corresponding to species A and B. When AB << A+B, A and B have a competitive relationship.
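The comparison of AB with A+B can be written as a small decision rule. The sketch below takes the three biomass production rates as given (for example, from separate FBA runs) and labels the pair; the function name, the 10% tolerance, and the reading of "positive" as AB exceeding A+B are illustrative assumptions, not values from [40].

```python
def classify_interaction(rate_a, rate_b, rate_ab, rel_tol=0.1):
    """Label a species pair from biomass production rates.

    rate_a, rate_b: maximal biomass rates when each species grows alone.
    rate_ab: total maximal biomass rate when the two grow together.
    rel_tol: hypothetical tolerance band around AB == A + B.
    """
    expected = rate_a + rate_b
    if expected <= 0:
        raise ValueError("individual growth rates must sum to a positive value")
    ratio = rate_ab / expected
    if ratio < 1 - rel_tol:
        return "competitive"   # AB well below A + B
    if ratio > 1 + rel_tol:
        return "cooperative"   # AB well above A + B
    return "neutral"           # co-growth has no net effect
```

For instance, `classify_interaction(1.0, 1.0, 1.2)` falls in the competitive band because the pair produces only 60% of the summed individual rates.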

Fig. 5.2 A constraint-based model of pairwise interaction

5.5 Dynamics and Time Series Analysis

Microbial abundance dynamics along the time axis can be used to explore complex interactions among microorganisms [41]. Time series data are important for understanding the structure and function of a microbial community and its dynamic response to perturbations of the external environment and physiology [42]. Current studies usually rely on time sequence similarity [43] or on clustering of time series data to discover dynamic microbial interactions; these methods often do not take full advantage of the time sequences, so the interactions among microorganisms cannot be predicted accurately. We have explored a vector autoregression (VAR) model [44] to overcome the limitations of these traditional methods.

Because microbiome data are high-dimensional, the number of samples is far smaller than the number of microorganisms, and direct interaction inference by VAR is not feasible. In our previous studies, we designed several graph regularization-based VAR (GVAR) methods for analyzing the human microbiome. We found that our approach significantly improves modeling performance on several microbiome datasets. The experimental results indicate that graph regularization achieves better performance than other sparse VAR models based on elastic net regularization. However, interpreting the inference results is hard and far from complete. Furthermore, graph regularization, although a classic manifold regularization method, suffers from weak extrapolating ability. A more recent alternative, Hessian regularization [45], which fits the data well and extrapolates nicely to unseen data, will be used to overcome this issue.
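For readers unfamiliar with VAR, the following sketch fits a first-order model x_t = A x_{t-1} + noise, where entry A[i, j] describes the influence of taxon j at the previous time point on taxon i. A simple ridge penalty stands in for the regularizer here; it is not the graph-regularized GVAR method described above, and the function names and parameter values are assumptions.

```python
import numpy as np

def simulate_var1(transition, n_steps, noise_std=0.1, seed=0):
    """Simulate x_t = transition @ x_{t-1} + Gaussian noise."""
    rng = np.random.default_rng(seed)
    n_taxa = transition.shape[0]
    series = np.zeros((n_steps, n_taxa))
    for t in range(1, n_steps):
        noise = rng.normal(scale=noise_std, size=n_taxa)
        series[t] = transition @ series[t - 1] + noise
    return series

def fit_var1_ridge(series, ridge=1e-3):
    """Ridge-regularised least-squares estimate of the VAR(1) matrix."""
    past, future = series[:-1], series[1:]
    n_taxa = series.shape[1]
    gram = past.T @ past + ridge * np.eye(n_taxa)
    # In row form, future ~ past @ A.T, so solve for A.T and transpose.
    return np.linalg.solve(gram, past.T @ future).T
```

The ridge term keeps the linear system solvable when there are fewer time points than taxa, which is the regime that motivates regularized VAR in the first place.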

In the future, state-space models [46] or probabilistic Boolean network models [47] could be used to model large-scale microbiome data. We will extend these methods by integrating information specific to microbiome data. The state-space model is a powerful method for simulating dynamical systems and is widely used in engineering control systems; it is a dynamic time-domain model in which time is the independent variable. The state-space model can be extended to account for time delays in the regulatory relationships among species, rather than only describing the impact of the level of species richness on the internal state and assuming that the internal state evolves independently. A regulatory network with time delays is better suited to microbial interactions, because regulation between microorganisms is often a slow, delayed process rather than an instantaneous one.
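As a point of reference, a standard discrete-time linear state-space model and one possible delayed extension can be written as follows. The symbols are illustrative assumptions rather than notation from [46]: x_t is the internal state, u_t an external input, y_t the observed abundances, w_t and v_t noise terms, and d the maximum delay.

```latex
% Standard linear state-space model
x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + v_t

% One possible extension with delayed regulation up to lag d
x_{t+1} = A x_t + \sum_{k=1}^{d} A_k \, x_{t-k} + B u_t + w_t, \qquad y_t = C x_t + v_t
```

In the delayed form, each matrix A_k captures regulatory effects that take k additional time steps to act, which matches the slow, delayed regulation described above.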

5.6 Conclusion

The data from the Human Microbiome Project (HMP) [4, 5], which include more than 5000 samples with profiles of hundreds of taxonomic or functional categories, were collected from 15 or 18 distinct body sites of 242 individuals. Methodological development for effectively analyzing and mining these data is still in its infancy. Many other microbiome datasets come from studies focusing on disease, diet, and other questions. These data have created a great opportunity for understanding and also a tremendous computational and theoretical challenge. There is a great need to develop novel mathematical and computational methods for finding nonlinear signals and patterns in human-associated microbial metagenomes.

The identification of complex structures and patterns in microbial communities is an essential part of microbial ecology. Such methods are expected to help shed light on the complex relationships among microbes. In the future, nonlinear methods should be considered an important tool in metagenomic analysis, not only because microbial function can be viewed at multiple scales, from individual genomes to communities to global cycles, but also because of the complex interactions across scales.