1 Introduction

Measuring metabolites and interpreting their biological relevance in the context of different experimental conditions are the primary objectives of metabolomics research. To achieve these objectives, two basic steps need to be performed: metabolite identification and functional analysis, with the former providing the necessary inputs for the latter. These two steps need to be executed in a coordinated manner to promote efficient biological understanding. However, significant challenges remain in both steps.

The ultimate goal of metabolomics is to achieve comprehensive and high-throughput metabolome measurement. This goal is hampered by at least three major obstacles: (1) small compounds have diverse chemical properties, making it difficult to assay many metabolites simultaneously using a single analytical platform; (2) there is no effective amplification technique available to facilitate detection of low-abundance metabolites (comparable to PCR for DNA molecules); and (3) many metabolites lack unique spectral signatures that would allow unambiguous compound assignment. Nuclear magnetic resonance (NMR) spectroscopy and gas or liquid chromatography coupled with mass spectrometry (GC- or LC-MS) are commonly used in combination to improve metabolome coverage. Metabolite identification is mainly performed by searching the spectral features against a reference spectral library. However, searching a comprehensive spectral database often leads to many potential hits with similar matching scores, and researchers often need to manually choose the most probable identities based on the context and domain knowledge. This step represents a key bottleneck in current metabolomics studies. Better algorithms and more context-specific databases are needed to enable high-throughput and highly accurate metabolite identification.

Knowing compound identities is the first step toward biological interpretation of metabolomics data. The conventional procedure after this step involves manually looking up the metabolites of interest in different compound databases, reading relevant literature, and finally synthesizing the information into a justifiable biological “story” based on the overall information obtained. This approach is subjective and time-consuming. Over the past decade, many computer-assisted data interpretation strategies have been developed. Among them, functional enrichment analysis using a predefined knowledge database has gained wide acceptance in omics data interpretation. The basic idea is to shift the unit of analysis from a single molecule to groups of functionally related molecules (i.e., those within the same pathway or biological process). This approach directly connects statistical significance with biological interpretation. More advanced algorithms have also been recently implemented that are able to integrate the dependencies and connectivities among different molecules to further reveal the biological insight and to improve system understanding.

Based on their strategies in dealing with metabolite identification and functional analysis, current metabolomics workflows can be summarized into three general categories: the chemometrics approach (also known as untargeted metabolomics), the metabolic profiling approach (also known as targeted or quantitative metabolomics), and the chemo-enrichment analysis approach (Fig. 8.1). The chemometrics approach focuses on identifying and interpreting a subset of spectral features that are found to have changed significantly during the experimental studies; the metabolic profiling approach aims to comprehensively characterize all metabolites in the spectra before subsequent statistical and functional analysis; and the more recent chemo-enrichment analysis approach directly maps spectral features into metabolic pathways/networks and then tests the enrichment of the collective chemical signals generated from these biological processes, which largely avoids the time-consuming step for accurate compound assignment.

Fig. 8.1 The diagram summarizes the three computational strategies for metabolomics data interpretation: the chemometrics approach (top), the metabolic profiling approach (bottom), and the chemo-enrichment analysis approach (middle). The dotted lines delineate the two major steps in the process: metabolite identification and functional analysis. Note that these two steps are integrated into a single one in the chemo-enrichment approach.

This chapter is organized into three sections. The first section introduces the main computational approaches for metabolite identification from common analytical platforms (Fig. 8.1, Step 1); the second section describes the three main bioinformatics approaches for functional enrichment analysis (Fig. 8.1; Step 2); and the last section compares the three metabolomics workflows for biological interpretation. Each section is further organized under subtitles describing the computational concepts, the available bioinformatics tools, and their main features.

2 Metabolite Identification Methods

Although it is possible to determine the identity of a single metabolite de novo through labor-intensive NMR or MS-based methods, this approach is generally infeasible in metabolomics, in which hundreds to thousands of compound species are measured simultaneously. In practice, compound identification is based on matching features from sample spectra against a reference spectral database, and a closely matched hit is considered the putative identity of the corresponding spectral peaks. However, many metabolites do not produce unique, detectable signatures in their NMR or MS spectra to permit unambiguous determination of their identities. The situation is further complicated by the peak shifts and overlaps typical of spectra from complex biological samples. Direct database search tends to yield a high percentage of false positives, and further labor-intensive manual refinement is usually necessary. To improve the efficiency of metabolite identification, two general computational strategies have been employed: (1) limiting the search space to only those biologically and biochemically possible candidates by developing more context-specific spectral databases, and (2) improving the peak assignment algorithms by incorporating prior knowledge based on spectral dependencies, biochemical connectivities, and biological relationships.

2.1 Compound Identification from NMR Spectra

Proton NMR spectroscopy has been widely used in metabolomics studies involving human biofluids. Multiple small-molecule metabolites can be measured simultaneously without prior separation, which greatly simplifies the sample preparation requirements. NMR spectra are highly reproducible, and samples analyzed on one spectrometer will generate near-identical results to those measured on other types of spectrometers. These features have made NMR spectroscopy a platform of choice for large-scale collaborative metabolomics projects.

The Chenomx NMR Suite (Chenomx, Canada) is a widely used metabolomics tool for processing and profiling one-dimensional (1D) proton NMR spectra. The main feature of Chenomx is the integration of a powerful interactive visualization interface with a reference spectral library for over 600 metabolites that are detectable by NMR in common biofluids. Metabolite identification and quantification are achieved through manual peak fitting against those reference spectra. Another widely used commercial tool is the AMIX software package (Bruker Biospin GmbH, Germany), which offers similar features. The company has recently released a software tool (FoodScreener) that supports automated high-throughput targeted metabolomics profiling for wine, honey, and juice using defined spectral libraries.

Compared to commercial tools, public bioinformatics tools for NMR-based metabolomics tend to focus on spectral alignment, binning, and batch processing [1, 2]. They usually lack user-friendly interfaces or comprehensive spectral libraries to support manual compound identification. As public NMR spectral libraries become increasingly available [3, 4], this situation has begun to change. For instance, the Bayesian automated metabolite analyzer for NMR (BATMAN) is an R package designed for deconvolution and quantification of metabolites from 1D proton NMR spectra of complex mixtures [5, 6]. The Bayesian model incorporates characteristic peak patterns of metabolites and also accounts for peak shifts commonly seen in NMR spectra of biological samples. BATMAN can compute relative concentrations of the compounds together with associated uncertainty estimates using a Markov chain Monte Carlo algorithm. The procedure is computationally intensive and usually requires hours of CPU time to process a single spectrum of a common biofluid. Bayesil is a web-based tool that supports automated phasing, referencing, baseline correction, metabolite identification, and quantification for 1D proton NMR metabolomics spectra [7]. The algorithm is based on probabilistic graphical models and prior knowledge of probable biofluid compositions, with built-in support for cerebrospinal fluid (CSF), serum, and plasma. In contrast to BATMAN, Bayesil can process a spectrum in a few minutes with high precision and recall. For excessively overlapped NMR spectra of complex biofluid mixtures, two-dimensional (2D) NMR is often used to help resolve spectral ambiguities for metabolite identification. The Bruker AMIX package (Bruker Biospin GmbH, Germany) also supports 2D NMR analysis. The Java desktop application MetaboMiner and the R package rNMR are two public bioinformatics tools for metabolite identification from 2D NMR spectra [8, 9].

2.2 Compound Identification from GC-MS Spectra

GC-MS offers a high degree of chromatographic resolution and reproducibility. The platform is suitable for measuring volatile, low-molecular mass (<500 Da), and thermally stable compounds such as sugars, fatty acids, and amino acids. For large and polar compounds, chemical derivatization is often employed to improve their volatility and thermal stability. The most commonly used ionization technique in GC-MS is electron ionization, which is very robust and reproducible. The characteristic mass spectral fragmentation patterns can be used to build a spectral library for metabolite identification.

Many software tools are available for metabolite identification and quantification from GC-MS-based metabolomics data. The automated mass spectral deconvolution and identification system (AMDIS) coupled with the National Institute of Standards and Technology (NIST) database is probably the most widely used software package for GC-MS data analysis [10]. AnalyzerPro (SpectralWorks, UK) and ChromaTOF (LECO, USA) are two widely used commercial tools for processing and profiling GC-MS spectra for metabolomics studies. Compared to NMR-based metabolomics, more public bioinformatics tools are available for GC-MS spectral processing, deconvolution, alignment, and compound identification. Popular tools include BinBase [11], MetaQuant [12], MetabolomeExpress [13], MetaboliteDetector [14], and TagFinder [15]. With the availability of public GC-MS spectral databases [16, 17] and our improved knowledge of the metabolite compositions of common biofluids such as CSF, serum, and urine [18–20], GC-MS-based metabolomics is expected to be the most promising platform to deliver automated compound identification and quantification for a broad range of biofluids.

2.3 Compound Identification from LC-MS Spectra

Compared to GC-MS, LC-MS typically has lower chromatographic resolution and reproducibility. However, LC-MS techniques can access a much broader mass range (100–2000 Da) because volatilization or derivatization is not necessary. LC-MS is also a better choice for separating and identifying polar and nonvolatile compounds. Electrospray ionization and atmospheric pressure chemical ionization are the two most common ionization methods used in LC-MS. Both techniques will generate a molecular ion whose mass can be searched against a spectral database of known metabolites for possible identification. However, due to the finite mass accuracy of the MS equipment and the large number of potential formulas, using mass information alone is usually insufficient for metabolite identification [21].
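The ambiguity of mass-only searching can be illustrated with a minimal sketch. The tiny library, the function name match_mz, and the 10 ppm default tolerance below are assumptions made for illustration only; the masses are standard monoisotopic values, and only the [M+H]+ ion is considered.

```python
# Monoisotopic neutral masses (Da) of a few common metabolites.
# This small table is illustrative only, not a real reference library.
LIBRARY = {
    "L-leucine": 131.09463,
    "L-isoleucine": 131.09463,
    "creatine": 131.06948,
    "D-glucose": 180.06339,
    "D-fructose": 180.06339,
}

PROTON_MASS = 1.007276  # mass added to a neutral molecule to form [M+H]+


def match_mz(observed_mz, tolerance_ppm=10.0):
    """Return library compounds whose [M+H]+ m/z lies within the ppm tolerance."""
    hits = []
    for name, neutral_mass in LIBRARY.items():
        theoretical_mz = neutral_mass + PROTON_MASS
        error_ppm = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, round(error_ppm, 2)))
    return hits


# A single observed peak matches both isomers even at a tight tolerance.
print(match_mz(132.1019))
# A looser tolerance (a less accurate instrument) adds a third candidate.
print(match_mz(132.1019, tolerance_ppm=200.0))
```

Isomers such as leucine and isoleucine share the same formula and therefore cannot be separated by mass at any tolerance, which is why additional information (retention time, fragmentation, or the contextual evidence discussed next) is needed.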

To address this issue, many bioinformatics tools employ extra information to improve peak assignment and metabolite identification from LC-MS metabolomics data. One approach incorporates known chemical reactions among candidate compounds based on metabolic pathways/networks to improve annotation, as certain combinations make more biochemical sense when they are detected together. For instance, MI-Pack and ProbMetab are able to use metabolic pathway information obtained from MetaCyc or KEGG to improve metabolite identification [22, 23]. The second approach takes into consideration the dependency structures of multiple peaks (isotopologues, adducts, molecular fragments, and multiply charged ions) derived from each metabolite in an LC-MS spectrum to improve peak annotation. The MetAssign tool has implemented this approach [24]. The core algorithms used in these tools are based on graphical models, with most of them using a Bayesian approach to perform probabilistic annotation of metabolites.

3 Functional Analysis Approaches

Most metabolites can potentially participate in multiple functional roles within a biological system, and it is difficult to pinpoint the biological processes responsible for the profiles observed in a metabolomics experiment. A biological process typically involves a group of molecules. If a biological process is changed in a study, the molecules involved should have a higher chance of being identified as significant by the omics platform. Motivated by this concept, functional analysis has shifted the unit of analysis from a single molecule to a group of functionally related molecules. Instead of testing a single gene or metabolite, researchers now directly evaluate whether a group of molecules (representing a biological process) is consistently changed (enriched). This approach greatly simplifies omics data interpretation and is more sensitive in detecting subtle but consistent changes occurring in a biological process.

Functional analysis requires two components: a knowledge database defining functionally related molecule groups and a statistical algorithm to perform enrichment tests. The popular gene set enrichment analysis (GSEA) tool is shipped with a comprehensive collection of gene sets in the form of the Molecular Signature Database (MSigDB), which has greatly facilitated the subsequent development of tools for enrichment analysis [25, 26]. In metabolomics, apart from the public metabolic pathway databases such as KEGG [27] or MetaCyc [28], a comprehensive collection of functionally related metabolite groups was unavailable until very recently. The first large collection of metabolite sets appeared in 2010 with the publication of the MSEA tool containing >6000 groups of metabolites based on pathways, diseases, genetic variants, and cellular compartments [29]. Another useful resource is the ConceptMetab database containing >16,000 biologically defined metabolite sets developed based on GO, KEGG, and Medical Subject Headings [30]. The ongoing development of ontologies for systematic metabolite annotation is expected to greatly facilitate the development of enrichment analysis tools for metabolomics [31, 32]. Below I will introduce the three main categories of statistical approaches for functional analysis of metabolomics data: over-representation analysis (ORA), metabolite set enrichment analysis (MSEA), and metabolic pathway/network analysis.

3.1 Over-representation Analysis (ORA)

The ORA approach is a traditional strategy for enrichment analysis. It starts with a list of metabolites of interest and tests whether certain metabolite groups appear more often than would be expected by random chance. This type of analysis can be performed using Fisher’s exact test, a chi-square test, a hypergeometric test, or its binomial approximation. To perform ORA, researchers need to first perform a statistical comparison such as t-tests or ANOVA and then select significant metabolites using a certain threshold or criterion (e.g., adjusted p-values <0.05). Fold change values are also sometimes considered during the selection process.
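As a minimal sketch of the hypergeometric variant of ORA, the function below computes the upper-tail probability of observing at least a given overlap between the selected metabolites and a metabolite set. The function name, argument names, and the example counts are hypothetical; real tools additionally correct for testing many sets.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe : number of metabolites that could have been selected
    n_in_set   : number of those belonging to the metabolite set
    n_selected : number of significant metabolites
    n_overlap  : number of significant metabolites inside the set
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical study: 100 measured metabolites, 10 of them in a pathway,
# 12 called significant, 5 of which fall in the pathway.
print(ora_p_value(n_universe=100, n_in_set=10, n_selected=12, n_overlap=5))
```

The choice of universe matters: using all measured metabolites rather than all metabolites in a database usually gives a more conservative and more defensible p-value.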

The ORA approach is flexible and simple to implement. It has been implemented in many metabolomics tools and databases including MSEA, MBRole, MetPA, IMPaLA, MPEA, BiNChE, and ConceptMetab [29–31, 33–36]. A common criticism of the approach relates to the somewhat arbitrary threshold used to decide whether a metabolite is significant. For instance, different cutoffs sometimes lead to different interpretations, and ORA cannot be applied if no significant metabolites are found in a given study. Another limitation is that all metabolites are treated equally after the selection, ignoring their quantitative differences. Despite these shortcomings, ORA remains widely used in omics data interpretation [37].

3.2 Metabolite Set Enrichment Analysis (MSEA)

The MSEA approach has been developed to address the shortcomings associated with ORA. It directly tests the enrichment of functional groups using the complete concentration data without preselection of significant metabolites. MSEA is named after the popular GSEA developed for gene expression data interpretation [26]. The original GSEA approach first uses a univariate method to rank all the genes and then tests whether the ranks in the gene set differ from a uniform distribution, using a weighted Kolmogorov-Smirnov test. The p-value for each gene set is calculated via permutation tests. Since then, many different variations of GSEA have been developed with different performance characteristics [38]. For instance, the GlobalTest method has shown generally improved performance in terms of sensitivity, versatility, and computational efficiency and works especially well if most of the molecules within a group are associated with the phenotype in a modest way [38]. The algorithm is based on a generalized linear model to test whether a group of molecules is significantly associated with a specific phenotype [39].
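The running-sum idea behind the weighted Kolmogorov-Smirnov statistic can be sketched as follows. This is a simplified illustration, not the published GSEA or MSEA implementation: the function names and toy scores are hypothetical, the null distribution is built by drawing random metabolite sets rather than by permuting sample labels, and the set is assumed to contain at least one, but not all, of the measured metabolites.

```python
import random


def enrichment_score(scores, members, weight=1.0):
    """Weighted running-sum statistic over a ranked metabolite list.

    scores  : dict mapping metabolite name -> ranking statistic (e.g. a t-score)
    members : set of metabolite names that belong to the metabolite set
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit_total = sum(abs(scores[m]) ** weight for m in ranked if m in members)
    n_miss = len(ranked) - sum(m in members for m in ranked)
    running, best = 0.0, 0.0
    for m in ranked:
        if m in members:
            running += abs(scores[m]) ** weight / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss  # step down on a miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


def permutation_p_value(scores, members, n_perm=1000, seed=0):
    """Empirical p-value from random metabolite sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(scores, members)
    names = list(scores)
    n_members = sum(m in scores for m in members)
    extreme = 0
    for _ in range(n_perm):
        shuffled = set(rng.sample(names, n_members))
        if abs(enrichment_score(scores, shuffled)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


toy_scores = {
    "citrate": 3.1, "succinate": 2.7, "malate": 2.2, "alanine": 0.4,
    "glycine": 0.1, "serine": -0.2, "lactate": -0.5, "urea": -0.9,
    "creatinine": -1.3, "taurine": -1.8,
}
tca_set = {"citrate", "succinate", "malate"}
print(permutation_p_value(toy_scores, tca_set))
```

Because every metabolite contributes to the ranking, no significance cutoff is needed before the test, which is the main practical difference from ORA.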

Several bioinformatics tools have been implemented to support MSEA for metabolomics data. The web-based MSEA program (now part of MetaboAnalyst) is the first tool with such capacity to support functional analysis for quantitative metabolomics data [29, 40]. Like the original GSEA tool, it contains built-in libraries of defined metabolite sets associated with metabolic pathways, diseases, genetic variations, cellular compartments, etc. The GlobalTest algorithm is used for quantitative enrichment analysis directly from a metabolite concentration table. Another metabolomics tool with MSEA capacity is the MeltDB, which uses a modified GSEA method against the metabolite sets defined by the KEGG metabolic pathways [41]. With improved functional annotations for metabolite sets such as the ConceptMetab and metabolite ontologies [30, 31], more metabolomics tools with MSEA support will be developed in the near future.

3.3 Metabolic Pathway and Network Analysis

In the MSEA approach, groups of molecules labeled with biologically meaningful names are used to organize a large body of our current knowledge, making it a popular approach to aid in omics data interpretation. However, this “flat” representation of knowledge followed by enrichment tests based on group memberships ignores the connectivities and dependencies among molecules as well as the inherent overlaps/hierarchies among different groups. For instance, changes at a central location within a pathway tend to have a larger impact on its overall function than changes far downstream. Integrating functional analysis with pathway/network topology analysis will help improve the accuracy in ranking the resulting list of biological processes.

In gene expression data analysis, TopGO is probably the first method that integrates knowledge about relationships between different GO terms into the calculation of statistical significance to increase the explanatory power of GO enrichment analysis [42]. Signaling pathway impact analysis (SPIA) is another approach that combines the evidence obtained from classical enrichment analysis with a novel type of evidence that utilizes the pathway topology to measure the impact on a given pathway [43, 44]. Both approaches have been shown to provide increased sensitivity and specificity when compared to other methods based solely on enrichment analysis. Many more tools have been implemented to take pathway topology into consideration for enrichment analysis of gene expression data [45]. Applications of similar approaches to metabolomics are currently hampered by two obstacles: firstly, metabolomics can typically measure only a small fraction of any given metabolic pathway at the moment, which greatly limits our ability to evaluate the impact on the overall pathway; secondly, a hierarchical ontology system for metabolite annotation has not yet been established well enough to allow easy plug-in by different bioinformatics tools, as is the case for the gene ontology system. Therefore, current metabolomics tools focus primarily on enrichment analysis and visualization of metabolic pathways. The web-based tool MetPA (now part of MetaboAnalyst) is the first tool that supports both enrichment analysis and topology analysis within the context of KEGG metabolic pathways [36]. MetScape is another tool, implemented as a Cytoscape plug-in, that is able to incorporate prior knowledge of pathways and molecular interactions for metabolomics pathway analysis and network visualization [46].
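One simple way to make the role of topology concrete is to weight each changed metabolite by a centrality measure of its node in the pathway graph. The sketch below uses out-degree as that measure and a made-up five-node pathway; the function name, the choice of centrality, and the example graph are all assumptions for illustration and are not claimed to reproduce the scoring used by any of the tools named above.

```python
def pathway_impact(edges, significant):
    """Fraction of a pathway's total out-degree carried by significant nodes.

    edges       : list of (substrate, product) pairs describing the pathway
    significant : set of metabolites flagged as changed in the experiment
    """
    out_degree = {}
    for substrate, product in edges:
        out_degree[substrate] = out_degree.get(substrate, 0) + 1
        out_degree.setdefault(product, 0)  # make sure end products are counted
    total = sum(out_degree.values())
    if total == 0:
        return 0.0
    return sum(out_degree[m] for m in significant if m in out_degree) / total


# Hypothetical branched pathway: A feeds two branches that rejoin at D.
toy_pathway = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

print(pathway_impact(toy_pathway, {"A"}))  # upstream branch point
print(pathway_impact(toy_pathway, {"E"}))  # terminal product
```

A purely membership-based test would treat a hit on A and a hit on E identically; the topology-weighted score separates them, which is the intuition behind combining enrichment with topology analysis.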

4 Metabolomics Workflows for Biological Interpretation

As indicated in Fig. 8.1, current metabolomics workflows can be largely divided into three general categories based on their strategies for metabolite identification and functional analysis: the chemometrics approach, the metabolic profiling approach, and the chemo-enrichment analysis approach. The chemometrics approach focuses on identifying and interpreting a subset of spectral features that are found to be important within the study. It is relatively high throughput, as only the significant features need to be characterized. This approach is widely used in exploratory metabolomics studies and for discovery of novel biomarkers. A main drawback of this approach is the difficulty of biological interpretation, as a limited number of compounds is usually insufficient to pinpoint the underlying biological processes. In contrast to the chemometrics approach, the metabolic profiling approach aims to characterize all detectable metabolites from the spectral data before subsequent functional analysis. It generally yields better sensitivity, selectivity, and interpretability but is of very limited use for novel biomarker discovery. The main drawback of this approach is that metabolite identification is usually time-consuming and labor intensive. The chemo-enrichment analysis approach has been recently developed to address the limitations associated with both chemometrics and metabolic profiling. It aims to estimate biological activities directly from the spectral features by mapping all possible metabolite matches to metabolic pathways/networks and then comparing the resulting profiles to identify the enriched biological processes.

4.1 The Chemometrics Approach

Chemometrics methods are a class of multivariate statistical methods heavily used in analytical chemistry and, later, in metabolomics. These methods are especially useful for analysis and modeling of high-dimensional complex spectral data in untargeted metabolomics, where features (peaks or spectral bins) are highly correlated. The two most commonly used chemometrics methods are principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA). PCA aims to project high-dimensional data into a low-dimensional space that captures most of the variance in the data. The direction of projection is computed based on the data (X) only, without referring to the experimental conditions (Y). PCA is suitable for obtaining a data overview and for understanding the inherent patterns within the data. There is no guarantee that the directions of maximum variance will be the same as the directions of the variance associated with the experimental conditions. In contrast, PLS-DA aims to project high-dimensional data X into a low-dimensional space that captures most of the covariance between X and Y. It is often used to identify the spectral features that differ across experimental conditions. Orthogonal PLS-DA (OPLS-DA) is a variant of PLS-DA that uses orthogonal signal correction to maximize the explained covariance between X and Y on the first component, while the remaining components capture variance in X that is orthogonal to Y [47].
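The difference between the two projections can be seen in a small numerical sketch. The helper names and the six-sample, two-feature data set below are hypothetical. The first PCA loading is estimated by power iteration on the covariance matrix (a stand-in for the decomposition routines used by real software), and the first PLS weight vector is taken as the normalized product of the centered X transposed with the centered Y, which is the first step of the standard single-response PLS algorithm.

```python
def center(rows):
    """Subtract the column mean from every feature."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def normalize(vec):
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]


def first_pca_loading(rows, n_iter=200):
    """Direction of maximum variance in X, found by power iteration."""
    xc = center(rows)
    n, p = len(xc), len(xc[0])
    cov = [[sum(x[a] * x[b] for x in xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0] * p  # assumes the start vector is not orthogonal to the answer
    for _ in range(n_iter):
        vec = normalize([sum(cov[a][b] * vec[b] for b in range(p))
                         for a in range(p)])
    return vec


def first_pls_weight(rows, labels):
    """Direction of maximum covariance between X and the class labels Y."""
    xc = center(rows)
    y_mean = sum(labels) / len(labels)
    yc = [y - y_mean for y in labels]
    p = len(xc[0])
    return normalize([sum(x[j] * y for x, y in zip(xc, yc)) for j in range(p)])


# Feature 0 varies strongly but is unrelated to the classes;
# feature 1 varies little but separates the two classes.
X = [[10.0, 1.0], [2.0, 1.2], [6.0, 0.9], [9.0, 2.0], [3.0, 2.1], [6.0, 1.9]]
Y = [0, 0, 0, 1, 1, 1]

print(first_pca_loading(X))    # dominated by feature 0
print(first_pls_weight(X, Y))  # dominated by feature 1
```

In this toy case PCA follows the high-variance feature while the PLS weight follows the class-discriminating feature, which is why PCA is used for overview and PLS-DA for feature selection.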

The chemometrics approach is composed of three general steps. A chemometrics method such as PLS-DA or OPLS-DA is first applied to the spectral data to identify significant features associated with the experimental conditions. This step can be performed using several commercial or public tools. The SIMCA-P program (Umetrics, Sweden) is widely used by the metabolomics community. It offers excellent graphic capabilities and comprehensive analysis options for chemometrics methods including PCA, PLS/OPLS-DA, and SIMCA (soft independent modeling of class analogy). MetaboAnalyst is a web-based tool that supports comprehensive metabolomics data processing, normalization, and chemometrics analysis (PCA, PLS-DA, and, more recently, orthogonal PLS-DA) [40, 48, 49]. For users who know how to program in R, many R packages are available for chemometrics analysis [50, 51]. After selection of significant spectral features, the second step is to perform compound identification using the tools and resources described in Sect. 8.2. In the third step, the list of identified metabolites is subjected to ORA to find out which pathways or biological processes are significantly enriched for biological interpretation (Sect. 8.3).

4.2 The Metabolic Profiling Approach

Metabolic profiling is often used to validate and expand upon results obtained from untargeted analysis. It is also increasingly applied to study variations of metabolite concentrations in relatively well-characterized biofluids such as CSF, blood, and urine. Although the process of metabolite identification and quantification is currently a rate-limiting step, this approach offers several distinctive advantages. For instance, metabolic profiling significantly improves statistical power by reducing the number of features from 1000–10,000 spectral peaks to hundreds of metabolites. The manual process also largely removes missing values and spectral noise, which greatly facilitates downstream statistical analysis and biomarker discovery.

The biggest advantage of metabolic profiling is the ease of data interpretation. The complete metabolite concentration table can be directly used for MSEA, metabolic pathway, or network analysis using the tools described in Sect. 8.3. The web-based tool MetaboAnalyst provides extensive functions for functional analysis and interpretation of data generated from the metabolic profiling approach. Importantly, metabolite concentration data are compatible with other omics data and can be analyzed together with them to help pinpoint the biological pathways involved in the experimental conditions. There are several bioinformatics tools that provide support for integrated analysis of metabolomics data with transcriptomics data. For instance, MetaCore (Thomson Reuters, USA) allows joint analysis and visual exploration within its comprehensive collections of pathways and networks [52]. The public tools IMPaLA and MetScape can accept a list of metabolites and a list of genes for joint analysis and visualization on metabolic networks [34, 46]. INMEX is a web-based tool that supports statistical analysis and joint enrichment analysis for data sets from transcriptomics and metabolic profiling studies [53].

4.3 The Chemo-enrichment Analysis Approach

The chemo-enrichment analysis approach is a more recent strategy developed to facilitate high-throughput interpretation of metabolomics data generated from high-resolution LC-MS platforms. The key idea is to redefine the metabolite sets, metabolic pathways, or networks using the spectral features (e.g., m/z) of the corresponding metabolites and then test the enrichment of these “collective chemical signals” within the untargeted metabolomics data. Accurate compound identification is not necessary because errors (i.e., incorrect peak assignments) tend to be randomly distributed, while the true biological signals will be consistent and can be detected by testing the enrichment of their collective chemical signals. The chemo-enrichment approach directly connects spectral features with biological interpretations without explicit compound identification. In practice, metabolite identification is performed post hoc for those enriched biological processes of interest. The approach is useful in metabolomics studies for organisms with well-annotated metabolic pathways and networks.

There are a few tools that offer support for chemo-enrichment analysis. Mummichog is probably the first bioinformatics tool that implemented the concept [54]. It accepts two lists of spectral peaks (i.e., m/z values): a significant peak list (e.g., peaks identified using t-tests) and a reference peak list (all features detected in the MS experiment). The significant peak list is then searched against a database to find all potential matches to metabolic pathways and networks. The result is compared with those obtained from peak lists randomly drawn from the reference peaks to compute statistical significance. The tool is available as a Python program. It has recently been implemented in the popular web-based tool XCMS Online to reach a broader audience [55]. MarVis-Pathway is a more recent stand-alone bioinformatics tool with a chemo-enrichment analysis feature. It employs a hypergeometric-based approach to evaluate the enrichment of metabolic pathways directly from untargeted metabolomics data [56].
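The comparison against randomly drawn peak lists can be sketched in a few lines. This is only an illustration of the permutation idea under strong simplifying assumptions: the function names, the 0.005 Da matching window, and the synthetic peak lists are hypothetical, a pathway is represented as a plain list of expected m/z values, and none of the adduct handling or network modeling of the actual tools is attempted.

```python
import random


def count_hits(peaks, pathway_mz, tol=0.005):
    """Number of pathway m/z values matched by at least one peak."""
    return sum(any(abs(p - m) <= tol for p in peaks) for m in pathway_mz)


def chemo_enrichment_p(significant, reference, pathway_mz, n_perm=2000, seed=0):
    """Empirical p-value for the pathway hits of the significant peak list."""
    rng = random.Random(seed)
    observed = count_hits(significant, pathway_mz)
    as_extreme = 0
    for _ in range(n_perm):
        # Null model: a peak list of the same size drawn from all detected peaks.
        drawn = rng.sample(reference, len(significant))
        if count_hits(drawn, pathway_mz) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)


# Synthetic data: 200 detected features, 4 of which map to the pathway.
reference_peaks = [100.0 + 1.5 * i for i in range(200)]
pathway = [reference_peaks[i] for i in (10, 20, 30, 40)]
significant_peaks = [reference_peaks[i] for i in (10, 20, 30, 55, 71, 90,
                                                  112, 133, 150, 180)]

print(chemo_enrichment_p(significant_peaks, reference_peaks, pathway))
```

Individual matches may be wrong, but a pathway that collects many more matches from the significant list than from random lists of the same size is flagged, and only its peaks then need careful identification.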

5 Summary and Future Perspectives

This chapter introduces several key concepts and recent developments in computational strategies for metabolomics data interpretation. Compound identification constitutes a major bottleneck in current metabolomics studies. Accurate metabolite identification requires manual intervention and additional laboratory experiments. Advances in both analytical platforms and algorithms are paving the way for high-throughput data interpretation. By integrating high-resolution analytics and context-specific reference spectral databases with advanced algorithms that incorporate chemical and biological information, we will be able to achieve accurate and high-throughput metabolite identification and biological interpretation.

Identification of metabolites (accurately or approximately) is a prerequisite for data interpretation. The list of compounds needs to be put into proper biological context by identifying their roles in metabolic pathways, their interconnectivity with other metabolites, links to genetic variations, or associations with pathophysiological conditions. Group-based functional enrichment analysis has been developed to address this issue. This is an active research area with a wide range of tools and implementations available. Given the current limitations of the knowledge databases and the statistical algorithms, the resulting enrichment p-values should be treated as a ranking system for data exploration and hypothesis generation rather than as an absolute cutoff for decision-making.

Compared to transcriptomics, metabolomics is closer to an organism’s phenotype and is more sensitive to environmental perturbations. Small compounds represent the final products of complex interactions between the host genetics and environment. The metabolome includes both the endogenous metabolites produced directly by the host organism and the compounds derived from microbial, xenobiotic, dietary, and other exogenous sources. As a result, metabolomics is increasingly applied to study the impact of diet, gut microbiota, and environmental exposures. Developing novel bioinformatics tools and specialized knowledge databases to support these applications is the new frontier in computational metabolomics.