1 Introduction

Metabolomics is a branch of “omics” science that is focused on the comprehensive characterization of the small molecule metabolites in the metabolome. Clinical metabolomics involves the application of metabolomic techniques toward discovery, diagnosis, and monitoring of human diseases. Clinical metabolomics can also be used to explore and understand the molecular basis of disease and disease mechanisms. As highlighted throughout this book, metabolomics is being successfully used in many areas of medicine and medical genetics, including the diagnosis and monitoring of inborn errors of metabolism (IEMs), the detection of various endocrine disorders, the characterization of neurodegenerative diseases, and the diagnosis of cancers and cardiovascular diseases. The success of metabolomics in these many diverse areas of medicine and medical genetics lies in the fact that metabolites represent the downstream products of upstream events occurring within the genome, the transcriptome, and the proteome. Indeed, metabolites are sometimes called the “canaries of the genome.” Just as canaries were used by miners in the 1800s to serve as sensitive indicators of toxic gases in coal mines, metabolites can serve as remarkably sensitive indicators of problems in the genome. Indeed, a single base change in a gene can lead to a 10,000-fold change in the concentrations of certain metabolites [1]. This exceptional sensitivity is the basis of newborn screening, in which metabolite tests have been used to detect IEMs (such as phenylketonuria) for many decades [2]. Metabolite concentrations are not only very sensitive to what goes on in the genome, but they are also very sensitive to what goes on in the environment. Indeed, metabolite concentrations are heavily influenced by nutrition, physical activity, exposure to environmental chemicals, the time of day, or even the outside temperature [3, 4].

Because metabolites are the end products of complex interactions happening inside the cell (the genome and the transcriptome) and events happening outside the cell (the environment), metabolomics is ideal for assessing the interactions between genes and the environment (i.e., measuring the phenotype). Therefore, metabolomics offers clinicians and medical researchers an ideal route to measure and monitor both human phenotypes and human diseases in “real time.” This gives metabolomics an important advantage over genomics. While the genome can tell you what might happen, the metabolome actually tells you what is happening.

As highlighted throughout this book, there are two distinct “flavors” of metabolomics: (1) targeted metabolomics and (2) untargeted metabolomics. Targeted metabolomics is focused on the identification (and often absolute quantification) of a specific, predefined collection or category of metabolites in a tissue, biofluid, or biological matrix. Because of its ability to achieve absolute quantification, targeted metabolomics is widely used in clinical medicine, biomarker testing/discovery, and disease monitoring. On the other hand, untargeted metabolomics involves the broad, unbiased identification of the maximum number of metabolites or metabolic features in a tissue, biofluid, or biological matrix. Untargeted metabolomics is not reliably quantitative, and so it is more widely used in early-stage biomarker discovery or hypothesis generation applications. Both targeted and untargeted metabolomics can be, and are currently, used in clinical metabolomics. Both can be used to discover biomarkers or biomarker profiles, and both methods can be performed with standard metabolomic platforms such as liquid chromatography mass spectrometry (LC-MS), gas chromatography mass spectrometry (GC-MS), and nuclear magnetic resonance (NMR) systems. All three of these analytical platforms are capable of separating, detecting, and characterizing hundreds, even thousands of chemicals in complex chemical mixtures. In almost all cases, when NMR, GC-MS, or LC-MS instruments are used to analyze clinical materials, they produce spectra or chromatograms consisting of many hundreds to thousands of peaks.

The quantity of data generated by these platforms necessitates the use of computers and a wide variety of bioinformatic software tools. The main bioinformatic challenges in metabolomics, and clinical metabolomics in particular, are (1) determining which peaks in these spectra match to which chemical compounds (metabolite identification); (2) detecting which compounds are significantly altered in concentration or abundance (determining metabolite significance); (3) determining which metabolites or combination of metabolites can serve as robust disease biomarkers (biomarker discovery); and (4) understanding the biological and genetic context for the observed metabolic changes (finding disease mechanisms or causes).

This chapter is intended to provide an overview to some of the bioinformatic tools that can be used to facilitate clinical metabolomics and provide key insights into disease processes. In particular, readers will be introduced to a number of software tools, data resources, and data standards for facilitating compound identification (for both targeted and untargeted metabolomics), for detecting which compounds are significantly altered in abundance, for identifying and assessing metabolite biomarkers or biomarker panels, and for understanding biological and genetic context of the observed metabolite changes. These will include discussions on the Metabolomics Standards Initiative (MSI), software tools for metabolite identification and quantification (such as AMIX, Bayesil, AMDIS, and XCMS), data resources for metabolite annotation such as the Human Metabolome Database (HMDB) and MarkerDB (a biomarker database), data analysis and biomarker discovery tools such as MetaboAnalyst, and resources for interpreting or characterizing disease mechanisms, such as the Small Molecule Pathway Database (SMPDB).

2 Bioinformatic Tools for Metabolite Identification

Depending on the method used (targeted vs. untargeted) and the type of compounds being measured (lipids vs. non-lipids), different levels of metabolite identification can be achieved. According to the Metabolomics Standards Initiative (MSI) [5], there are actually four levels of metabolite identification: (1) positively identified compounds, (2) putatively identified compounds, (3) compounds putatively identified to be part of a compound class, and (4) unknown compounds. Positively identified compounds correspond to those chemicals that have a name, a known structure, a CAS (Chemical Abstracts Service) number, or an InChI (International Chemical Identifier) string. To fall into this category, a compound must be identified using a purified, authentic standard collected under identical or near-identical data collection conditions. For targeted metabolomic studies, most metabolites are identified at a Level 1 standard. On the other hand, for most untargeted studies, achieving Level 1 identification is rare. Putatively identified compounds (Level 2) correspond to those where the compound is identified based on a spectral match (e.g., an MS/MS match) or an alternative parameter match (e.g., retention time and exact mass) to a reference database value. In these cases, an authentic standard is not available, meaning that there is some ambiguity about the compound’s true identity. Certainly, if the compound is known to exist in a human biofluid as indicated by numerous literature reports or data resources such as the HMDB, these Level 2 compound identifications are much stronger and may be considered “near positive.”

The third level of compound identification is typical of many lipids, where the exact structure of the compound cannot be determined, but it is known to belong to a specific class of lipid (a phospholipid or triglyceride) or to be a lipid where the total mass of the acyl chains is known, but the type or position of the individual acyl chains is not (e.g., the phosphatidylcholine PC(38:3)). Level 3 identification is common for untargeted MS-based lipidomic studies. The fourth level of compound identification is the “unknown” category. Again, this is usually only seen with untargeted metabolomic studies. In many cases, a compound is labeled as an “unknown” simply because the investigator has not been very thorough in their analyses or because the software/database being used for compound identification is inadequate, incomplete, or too small.

2.1 Bioinformatic Tools for Metabolite Identification Via NMR

NMR-based metabolomics has been used in clinical metabolomics for a number of years. These include applications in diagnosing IEMs [6], detecting novel genetic disorders [7], and lipoprotein profiling [8, 9]. There are three methods for performing metabolite identification by NMR. One is to manually spike the suspected compound into the sample, collect the NMR spectrum of the spiked biofluid, and confirm that the spiked compound changes the observed NMR spectrum in the expected manner. This method is obviously slow and less than ideal for analyzing large numbers of samples or identifying large numbers of metabolites. The second is to use manual chemical shift “lookup” tables to match observed NMR chemical shifts with known chemical shifts. Many NMR labs compile their own chemical shift reference tables to perform manual assignments. However, there are now several online NMR spectral databases that contain both experimentally measured and (accurately) predicted NMR spectra for thousands of human metabolites. These include the Human Metabolome Database or HMDB [10], the Natural Products Magnetic Resonance Database or NP-MRD [11], and the BioMagResBank or BMRB [12]. The HMDB is a particularly important metabolomic resource for clinical metabolomics. It currently contains the largest collection (>250,000) of known human metabolites along with detailed information on their structures, descriptions, names and synonyms, biological pathways, biofluid concentrations, and disease associations along with their corresponding MS and NMR spectra (to aid in compound identification). In particular, the HMDB has experimentally measured and predicted 1H and 13C NMR spectra for more than 220,000 human metabolites at NMR spectrometer frequencies ranging from 300 MHz to 1000 MHz.
While somewhat smaller and less clinically relevant, the NP-MRD has experimentally measured and computationally predicted 1H and 13C NMR spectra for nearly 90,000 metabolites and natural products at NMR spectrometer frequencies ranging from 300 MHz to 1000 MHz. The BMRB has nearly 1000 commonly occurring metabolites with experimentally measured 1H and 13C NMR spectra at 400 MHz and 600 MHz. All of these databases support web-based, automated metabolite identification and NMR spectral searches and/or peak matching.
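In practice, chemical shift “lookup” matching reduces to a simple tolerance search of the kind sketched below. The shift values and the ±0.03 ppm tolerance are illustrative placeholders, not authoritative reference values.

```python
# Minimal sketch of chemical-shift "lookup table" matching (values illustrative).
# Each reference metabolite is listed with its characteristic 1H chemical shifts (ppm).
REFERENCE_SHIFTS = {
    "lactate":  [1.33, 4.11],
    "alanine":  [1.48, 3.78],
    "glucose":  [3.24, 3.40, 3.53, 3.72, 3.89, 4.64, 5.23],
}

def match_peaks(observed_ppm, reference=REFERENCE_SHIFTS, tol=0.03):
    """Return metabolites whose reference shifts are all found among the
    observed peaks, within +/- tol ppm (a typical matching tolerance)."""
    hits = []
    for name, shifts in reference.items():
        if all(any(abs(p - s) <= tol for p in observed_ppm) for s in shifts):
            hits.append(name)
    return sorted(hits)

# A hypothetical observed peak list containing lactate and alanine signals:
peaks = [1.32, 1.47, 3.77, 4.10]
print(match_peaks(peaks))  # ['alanine', 'lactate']
```

Real assignment also weighs peak multiplicities, intensities, and pH-dependent shift drift, which is why the database-driven tools below are preferred for complex biofluids.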

The third approach to perform metabolite identification via NMR spectroscopy is to use spectral deconvolution. Spectral deconvolution is a computational approach that involves taking a complex spectrum consisting of a mixture with many chemicals and simplifying it into individual spectra of its “pure” chemical components (Fig. 1).

Fig. 1

An image illustrating spectral deconvolution. On the left side is a simplified depiction of how an NMR mixture spectrum can be decomposed into three separate “pure” spectra from three separate compounds (A, B, and C); summing the three spectra together reproduces the spectrum of the mixture at the top. On the right side is a spectral deconvolution performed on a real NMR spectrum, with the list of identified compounds and their concentrations

This process requires a specially constructed spectral database as well as carefully developed spectral fitting software. The spectral database used in NMR spectral deconvolution typically consists of reference 1D 1H or 13C NMR spectra of the pure compound(s) that are known or expected to be in the biological sample of interest. These reference NMR spectra must be collected under exactly the same conditions (same temperature, same solvent, same salt, same pH) under which the biological sample was analyzed.

The fact that most metabolites have distinct, almost invariant chemical shift “fingerprints” made up of several compound-specific peaks is one reason why spectral deconvolution works so well for NMR. Furthermore, the fact that most metabolites have many NMR peaks helps to alleviate the problem of spectral redundancy. To put it another way, it is highly unlikely that any two distinct compounds will have the same number of peaks, chemical shifts, peak intensities, spin couplings, or line shapes in their NMR spectra. Another reason why spectral deconvolution works so well with NMR is that NMR peak intensities provide precise information about compound concentrations. In other words, accurate spectral matches not only lead to exact compound IDs, but they also lead to exact or near-exact concentration determinations.
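A minimal sketch of the underlying idea, fitting the mixture spectrum as a linear combination of reference spectra by least squares, is shown below. Real deconvolution tools also fit peak positions, line widths, pH effects, and baselines; the toy spectra here are assumed to be already aligned and discretized on the same ppm grid.

```python
# Sketch: deconvolve a 1D spectrum as a linear combination of reference spectra.
# Toy data; production tools additionally fit lineshape, shift, and baseline.

def solve(A, b):
    """Solve the small square system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col]:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def deconvolve(mixture, references):
    """Least-squares concentrations: minimize ||mixture - sum(c_k ref_k)||^2
    via the normal equations (R^T R) c = R^T m."""
    k = len(references)
    RtR = [[sum(ri * rj for ri, rj in zip(references[i], references[j]))
            for j in range(k)] for i in range(k)]
    Rtm = [sum(r * m for r, m in zip(references[i], mixture)) for i in range(k)]
    return [max(0.0, c) for c in solve(RtR, Rtm)]  # clamp tiny negatives

# Two toy "reference spectra" and a mixture of 2x the first plus 0.5x the second:
ref_a = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
ref_b = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
mix = [2 * a + 0.5 * b for a, b in zip(ref_a, ref_b)]
print(deconvolve(mix, [ref_a, ref_b]))  # approximately [2.0, 0.5]
```

Because the fitted coefficients scale directly with peak intensity, they map straight onto concentrations once the reference spectra are recorded at known concentrations.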

There are several commercial programs that support NMR spectral deconvolution for metabolite identification, including AMIX (Bruker) and NMR Suite (Chenomx) (for small molecule metabolites). These software packages have large NMR spectral reference libraries consisting of hundreds of metabolites. Newer versions of these packages such as Bruker B.I. QUANT and Bruker FoodScreener as well as NMR Suite (Version 7 and above) now support semiautomatic deconvolution for higher-throughput analysis. In addition to small molecule metabolite analysis for clinical applications, several other companies, including LipoScience, Bruker, and Nightingale, have begun to offer lipoprotein analyses of serum or plasma samples through automated or semiautomated spectral deconvolution. These programs or services generate quantitative, clinically useful data on high-, medium-, and low-density lipoprotein particles (HDL, MDL, and LDL) from human plasma samples.

In addition to these commercial programs or commercial services for clinically based NMR metabolomics, several freely available academic programs have been developed to perform semiautomatic compound identification or quantification via NMR. These include Batman [13], AQuA [14], ASICS 2.0 [15], and rDolphin [16]. Unfortunately, these programs do not support automated data processing, which means a separate software package such as NMRPipe [17] or NMRFx [18] must be used to manually process the data prior to analysis. This requirement for manual processing significantly slows the analysis (hours per spectrum) and adds an element of human error to the analysis pipeline.

Recently, two new NMR spectral deconvolution programs called Bayesil [19] and MagMet [20] have been introduced. These programs support fully automated NMR metabolite identification and quantification. Both Bayesil and MagMet can perform fully automated data processing and spectral deconvolution of one-dimensional (1D) 1H NMR spectra to identify and quantify upward of 50 to 60 compounds in 3–4 min. Just as with the Chenomx NMR Suite, Bayesil and MagMet work with most NMR instrument models and field strengths but are limited to analyzing specific biofluids such as serum, plasma, or fecal water. Both Bayesil (http://bayesil.ca/) and MagMet (https://magmet.ca/) are freely accessible through web servers.

2.2 Bioinformatic Tools for Metabolite Identification Via GC-MS

GC-MS has been used in clinical chemistry for more than 50 years [21]. Indeed, many clinical labs continue to routinely use GC-MS as their go-to platform for metabolite profiling. GC-MS can be used to identify and quantify amino acids, organic acids, hormones, and other important clinical biomarkers. As a result, GC-MS can be used to diagnose and monitor many IEMs or other genetic disorders. While most clinical chemists are reluctant to call GC-MS analysis a form of metabolomics, the simple fact is that clinical GC-MS is one of the most robust and most widely used metabolomic platforms in clinical chemistry.

Just as with NMR-based metabolomics, compound identification via GC-MS is best done through a form of spectral deconvolution. A typical GC-MS spectrum or total ion chromatogram (TIC) from a metabolite mixture will consist of dozens of sharp peaks (corresponding to ion counts) covering an elution time of about 30–45 min. Each peak often consists of one or more EI (electron ionization) mass spectra arising from one or more compounds. A variety of GC-MS data analysis tools such as AMDIS, which stands for Automated Mass Spectral Deconvolution and Identification System [22], MassHunter (Agilent), ChromaTOF (Leco), and AnalyzerPro (SpectralWorks) can be used to identify and quantify metabolites. Once the EI-MS spectra are extracted, metabolite identification is performed in a similar manner to what is done for NMR. Namely, the extracted EI-MS spectra from the biofluid are compared, one at a time, to EI-MS spectral reference libraries containing the EI-MS spectra of thousands of pure, derivatized, and authenticated compounds. This process is done semiautomatically with users making metabolite identification calls based on the information and spectral image overlays that the computer programs provide.

There are three key factors that ultimately determine the quality of a compound identification by GC-MS: (1) the quality of the extracted query spectrum, (2) the quality of the spectral matching algorithm, and (3) the quality and comprehensiveness of the reference spectral database. Unlike NMR, where “false-positive” peaks are extremely rare, GC-MS is frequently plagued with an abundance of false-positive peaks. In some cases, up to 50% of features seen in GC-MS spectra are fragments, adducts, or derivatives of either the column matrix, the derivatization reagents, or of the metabolites themselves. Different software packages tend to handle these spectral artifacts differently. A study by Lu et al. [23] compared three commonly used GC-MS deconvolution packages (AMDIS, ChromaTOF, and AnalyzerPro) using a defined mixture of 35 compounds with widely varying concentrations. It was found that both the AMDIS and ChromaTOF packages produced unusually high numbers of false positives or false/impure spectra, while the AnalyzerPro package generally performed best.
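The library matching step these packages share can be sketched as a dot-product (cosine) comparison between an extracted query spectrum and each reference spectrum. The spectra, compound names, and score threshold below are toy values for illustration only.

```python
# Sketch of library matching for an extracted EI-MS spectrum using the
# cosine (dot-product) score common to GC-MS identification software.
# Spectra are {nominal m/z: relative intensity}; the library below is toy data.
import math

def cosine_score(query, ref):
    """Cosine similarity between two centroided spectra (0..1)."""
    mzs = set(query) | set(ref)
    dot = sum(query.get(m, 0.0) * ref.get(m, 0.0) for m in mzs)
    nq = math.sqrt(sum(v * v for v in query.values()))
    nr = math.sqrt(sum(v * v for v in ref.values()))
    return dot / (nq * nr) if nq and nr else 0.0

def best_match(query, library, min_score=0.7):
    """Return the top-scoring library compound, or None below the threshold."""
    score, name = max((cosine_score(query, ref), name)
                      for name, ref in library.items())
    return (name if score >= min_score else None, round(score, 3))

library = {  # hypothetical reference EI-MS spectra of derivatized metabolites
    "alanine (2TMS)": {73: 100, 116: 85, 147: 30, 190: 15},
    "glycine (3TMS)": {73: 100, 86: 40, 147: 55, 248: 20},
}
query = {73: 100, 116: 80, 147: 33, 190: 12}  # unknown extracted spectrum
print(best_match(query, library))
```

Production tools weight the dot product by m/z (to downweight ubiquitous low-mass fragments such as the TMS ion at m/z 73) and combine it with retention index agreement before reporting a hit.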

Ultimately, the main factor driving the success in compound identification by GC-MS is the size and quality of the EI-MS spectral reference database. The most common and widely used EI-MS resource is the NIST (National Institute of Standards and Technology) EI-MS spectral database. The latest release contains EI-MS spectra for more than 300,000 compounds or derivatized compounds along with retention index (RI) values for another 140,000 compounds. However, most of the NIST compounds are not metabolites nor are they derived from biological materials. The paucity of real metabolites in the NIST library can lead to a number of false-positive identifications, especially if authentic standards are not used to verify the identity of a given compound. On the other hand, the HMDB [10] provides a much larger collection of human metabolites (>250,000) along with a much larger library of corresponding GC-MS data, including almost 2.3 million predicted and experimental EI-MS spectra and nearly seven million retention indices. In particular, the latest version of the HMDB currently contains 6,696,000 accurately predicted RI values for 26,880 parent compounds (and 2.1 million TMS and TBDMS derivatives of those parent compounds) and 2,282,000 predicted and experimental EI-MS spectra. These spectra and RI data can be readily searched, separately or together. However, it is important to note that the quality of most predicted EI-MS spectra is not yet sufficiently high to achieve 100% identification accuracy. Therefore, identifications made using predicted EI-MS data or predicted RI data should always be viewed as putative identifications (Level 2).
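Combined RI/spectral searching of the kind the HMDB supports can be approximated by a retention index pre-filter before any spectral scoring. The library entries and the ±20 RI-unit tolerance below are hypothetical.

```python
# Sketch: use a retention-index (RI) window to pre-filter library candidates
# before spectral matching, as combined RI/EI-MS searches do. Toy values only.
LIBRARY = [  # (name, retention index) -- hypothetical entries
    ("lactic acid (2TMS)", 1065),
    ("alanine (2TMS)",     1100),
    ("oxalic acid (2TMS)", 1120),
    ("glycine (3TMS)",     1310),
]

def ri_candidates(observed_ri, library=LIBRARY, tol=20):
    """Keep only compounds whose library RI lies within +/- tol units."""
    return [name for name, ri in library if abs(ri - observed_ri) <= tol]

print(ri_candidates(1105))  # ['alanine (2TMS)', 'oxalic acid (2TMS)']
```

Narrowing the candidate list this way reduces the false-positive risk inherent to spectral matching alone, which matters most when predicted (rather than experimental) EI-MS spectra are being searched.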

2.3 Bioinformatic Tools for Metabolite Identification Via LC-MS

Over the past three decades, LC-MS has become the most popular analytical platform for clinical chemistry and clinical metabolomics [2, 24]. Indeed, most newborn screening activities in the developed world are done using triple quadrupole (QQQ) tandem mass (MS/MS) spectrometers [2]. Relative to NMR or GC-MS, LC-MS methods offer much greater sensitivity, more comprehensive compound detection, and generally higher throughput. These advantages largely explain its growing popularity. Most LC-MS methods adopted in clinical chemistry laboratories employ targeted approaches that use authentic, isotopically labelled standards to simultaneously identify and quantify a small (15–25) number of high-priority metabolites. Most of these targeted LC-MS methods use defined tables of specific metabolite retention times (for their given liquid chromatography system), metabolite-specific multiple reaction monitoring (MRM) or single reaction monitoring (SRM) peak lists, and multipoint calibration curves for metabolite identification and quantification. A number of different software packages from various LC-MS vendors are available to facilitate targeted LC-MS analysis. These include Analyst (from Sciex), MassHunter (from Agilent), Progenesis QI (from Waters), and TraceFinder/LCQUAN (from Thermo Fisher). In addition to these vendor-specific packages, Biocrates Life Sciences provides a vendor-independent software package called MetIDQ with its targeted metabolomic kits to perform semiautomated MRM-based metabolite identification and quantification. Modern, targeted LC-MS-based metabolomic software and methods typically allow the identification and quantification of up to 700 metabolites in a given sample in less than 30 min.

Untargeted LC-MS-based metabolomics is normally reserved for biomarker discovery as opposed to biomarker testing. A typical LC-MS spectrum from an untargeted metabolomic study will consist of many sharp peaks (corresponding to ion counts) covering an elution time of about 10–35 min. Each peak may consist of one or more ESI (electrospray ionization) m/z values arising from one or more compounds. As a result, untargeted LC-MS metabolomic studies can easily generate a huge number of spectral features or putative compounds (>10,000). This is many times more than what is seen by NMR or GC-MS. Many of these LC-MS features turn out to be noise peaks, column contaminants, in-source fragments, adducts, and isotopic variants. As a result, untargeted LC-MS data typically requires a considerable amount of post-processing and peak consolidation to reduce the number of peaks to a reliable, countable number (preferably <2000 putative compounds).
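A simplified sketch of this kind of peak consolidation, grouping co-eluting features whose m/z spacings match common adduct or isotope differences, is shown below. The adduct mass differences are standard values; the feature list and tolerances are illustrative.

```python
# Sketch of feature consolidation: group co-eluting LC-MS features whose m/z
# spacings match common adduct/isotope differences relative to [M+H]+.
ADDUCT_DELTAS = {  # mass difference from [M+H]+ in positive mode (Da)
    "[M+Na]+": 21.98194,
    "[M+K]+": 37.95588,
    "[M+NH4]+": 17.02655,
    "M+1 isotope": 1.00336,
}

def consolidate(features, rt_tol=0.05, mz_tol=0.005):
    """features: list of (mz, rt) tuples. Returns (kept, annotated), where
    annotated maps each redundant feature to (parent feature, adduct label)."""
    annotated, kept = {}, []
    for mz, rt in sorted(features):  # lightest ion of a group is seen first
        label = None
        for pmz, prt in kept:
            if abs(rt - prt) > rt_tol:
                continue  # not co-eluting
            for name, delta in ADDUCT_DELTAS.items():
                if abs((mz - pmz) - delta) <= mz_tol:
                    label = ((pmz, prt), name)
                    break
            if label:
                break
        if label:
            annotated[(mz, rt)] = label
        else:
            kept.append((mz, rt))
    return kept, annotated

# Hypothetical run: a protonated ion plus its sodium adduct and 13C isotope peak
feats = [(180.0866, 5.20), (181.0900, 5.21), (202.0686, 5.20), (250.10, 7.80)]
kept, ann = consolidate(feats)
print(len(kept), "unique features;", len(ann), "redundant features annotated")
```

Real pipelines (e.g., the CAMERA step often paired with XCMS) also use peak-shape correlation across samples before collapsing features, which this sketch omits.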

Untargeted LC-MS data is often further complicated by the fact that liquid chromatographic data is substantially more variable from run to run than NMR or GC-MS data. As a result, metabolomic data acquired via untargeted LC-MS techniques typically requires additional de-noising, spectral alignment, and spectral averaging to ensure that the correct peaks are being picked and compared. This kind of spectral processing requires sophisticated software that either comes bundled with the LC-MS instrument or which is designed, written, and distributed by highly specialized MS laboratories. Examples of some of the instrument-specific tools include Mass Frontier (Thermo Fisher), MassHunter (Agilent), XCMS-Plus (Sciex), Profile Analysis (Bruker), and Progenesis (Waters). There are also a number of platform-independent freeware systems including XCMS [25], MS-DIAL [26], and MzMine2 [27]. All of these software packages support chromatographic and MS spectral alignment, peak finding, multivariate statistics (for data reduction), parent ion mass matching, molecular formula calculation, and MS/MS spectral matching.

Metabolite identification via accurate parent ion mass (or more correctly the mass-to-charge, m/z) measurement requires the use of very high-resolution MS instruments such as quadrupole time-of-flight (QTOFs), Orbitraps, or Fourier-Transform Ion Cyclotron Resonance (FT-ICR) spectrometers. If a parent ion mass is measured to 4–5 decimal places, which corresponds to a mass accuracy of <5 ppm, it is usually possible to determine the ion’s molecular formula and its putative identity (Level 3 identification) through a chemical formula calculator. Several commercial MS chemical formula calculators exist such as SigmaFit (Bruker), Formula Predictor (Shimadzu), and MassHunter (Agilent) as well as a number of freeware packages including 7-Golden-Rules [28] and SIRIUS [29]. By including restrictions on the types of elements typically found in metabolites as well as requirements on hydrogen/carbon ratios and isotopic abundances, it is often possible to reduce the number of feasible chemical formulas even further [30]. Unfortunately, even with these improvements, parent ion-based metabolite identification is still very risky as there are often many masses or molecular formulae that can still match dozens of metabolites in existing compound databases.
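A chemical formula calculator of this kind can be sketched as a brute-force search over element counts, keeping formulas whose monoisotopic mass falls within the ppm window and whose H/C ratio is chemically plausible (in the spirit of the 7 golden rules [28]). The element ranges and filters below are deliberately simplified for illustration.

```python
# Sketch: brute-force CHNOPS formula search within a ppm window of a measured
# monoisotopic [M+H]+ m/z. Element ranges and the H/C filter are simplified.
MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
        "O": 15.9949146221, "P": 30.97376151, "S": 31.97207069}
PROTON = 1.00727646

def formula_search(mz, ppm=5.0, max_atoms=(30, 60, 6, 15, 3, 3)):
    """Enumerate C,H,N,O,P,S formulas whose neutral mass matches mz - proton.
    Zero counts are kept in the formula string for simplicity (e.g., N0)."""
    target = mz - PROTON
    tol = target * ppm / 1e6
    maxC, maxH, maxN, maxO, maxP, maxS = max_atoms
    hits = []
    for c in range(1, maxC + 1):
        for n in range(maxN + 1):
            for o in range(maxO + 1):
                for p in range(maxP + 1):
                    for s in range(maxS + 1):
                        rest = target - (c * MASS["C"] + n * MASS["N"] +
                                         o * MASS["O"] + p * MASS["P"] +
                                         s * MASS["S"])
                        h = round(rest / MASS["H"])
                        if h < 0 or h > maxH or not 0.2 <= h / c <= 3.1:
                            continue  # crude H/C plausibility filter
                        if abs(rest - h * MASS["H"]) <= tol:
                            hits.append(f"C{c}H{h}N{n}O{o}P{p}S{s}")
    return hits

# Glucose [M+H]+ is ~181.0707; C6H12O6 should be among the matching formulas:
print(formula_search(181.0707))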

The preferred route of metabolite identification for most untargeted LC-MS metabolomic studies is to use both parent ion (or formula) matching and MS/MS spectral matching. MS/MS spectra, with their characteristic fragmentation patterns, provide very useful structural information about molecules. Successful LC-MS/MS spectral matching is critically dependent on having instrument-specific or condition-specific MS/MS product ion fragment libraries. Many of these libraries are bundled with the instrument-specific software packages mentioned earlier. On the other hand, commercial MS/MS databases, such as the NIST20 database and METLIN [31], as well as public MS/MS databases such as MassBank of North America (MoNA) [32] and HMDB [10], are normally used by the freeware packages (XCMS, MS-DIAL, and MzMine) to perform MS/MS spectral matching. Table 1 provides a list of MS/MS databases with experimentally acquired (and predicted) MS/MS spectral data and their relative sizes.

Table 1 A list of MS/MS spectral databases with the reported numbers of compounds and MS/MS spectra

The challenge with using experimentally acquired MS/MS spectra from these MS databases is that each compound is often represented by dozens of different MS/MS spectra collected on different MS instruments under different ionization conditions or at different collision energies. So, while the number of experimentally collected MS spectra is large, the actual number of unique (parent) compounds represented by this diverse collection is quite small. Indeed, it is thought that the current experimental MS/MS spectral collection represents <20% of known or expected human metabolites. Given the striking shortage of experimentally collected MS/MS spectra, a number of investigators have started to use computational tools to predict MS/MS spectra for individual compounds where no experimental MS/MS spectra exist [33, 34]. Many of these in silico predicted MS/MS spectral libraries are now available through the HMDB [10] and the CFM-ID database [34]. These data resources support direct MS/MS spectral searches (to find specific compounds) as well as neutral loss searches (to find related compounds). Other computational MS/MS spectral interpretation tools, including CSI:FingerID [35] and SIRIUS4 [36], allow users to input an experimental MS/MS spectrum and will generate a structure match or a structure class match without the need to match against any predicted MS/MS spectra. These programs use a technique called chemical or spectral fingerprint analysis rather than spectral prediction [34].

Even with the best MS/MS spectral databases (experimentally acquired or computationally predicted) and the best chemical/spectral fingerprint analysis tools, it is still quite difficult to confidently identify (MSI Level 2) more than 400–500 metabolites via untargeted LC-MS-based metabolomics. While the instrumental times for most untargeted metabolomic LC-MS assays are relatively quick (15–20 min per sample), the data analysis times are often quite slow. Indeed, they often run for several hours per sample as most workflows require considerable manual inspection and manual intervention.

3 Bioinformatic Tools for Detecting Metabolite Differences

Regardless of whether targeted or untargeted approaches are used, one of the central goals of any clinical metabolomic study is to determine which peaks or which metabolites are significantly different for those individuals with a disease or condition relative to healthy controls. This comparison between healthy concentration values versus diseased concentration values is how most known disease biomarkers are measured and how new disease biomarkers are discovered. In clinical metabolomics, there are three routes for determining which metabolites are significantly different or significantly differentiating. These include (1) reference-based metabolite differentiation, (2) multivariate metabolite differentiation, and (3) multivariate peak differentiation.

Reference-based metabolite differentiation involves comparing quantitatively measured metabolite concentrations in a given biofluid for a diseased individual (measured via targeted metabolomic methods) with healthy, age-specific and sex-specific reference metabolite concentrations for the same biofluid, as reported in a database or a reference textbook. Reference-based metabolite differentiation is ideal for diagnosing individuals or for identifying biomarkers in individuals afflicted with rare diseases, such as IEMs, or other genetic disorders. This approach is the most widely used method for detecting significant metabolite differences in clinical metabolomics.

The second approach, called multivariate metabolite differentiation, detects metabolite differences by conducting case-versus-control studies using targeted metabolomic methods. This involves quantitatively measuring specific “named” metabolites from biofluid samples collected from multiple individuals with the disease of interest and biofluid samples from multiple healthy individuals (age and sex matched) without the disease. This approach, which requires the use of multivariate statistical techniques, is ideal for discovering multiple metabolite biomarkers for a given disease and for creating robust multi-marker profiles or multi-marker models.

The third approach to detecting metabolite differences is specific to untargeted metabolomics. Like the second approach, it uses a case-versus-control experimental design and multivariate statistics, but the goal is to use multivariate statistics in combination with relative peak intensity differences (as opposed to absolute concentrations of fully identified metabolites) to identify which peaks or features in the untargeted dataset are differentially abundant. The goal is to reduce the initial list of thousands of unidentified features or peaks to a more manageable list of a few dozen features that exhibit strong differences between cases and controls. From this smaller list of yet-to-be-identified features, it is possible to use various techniques (spectral matching, spike-in experiments, etc.) to identify the actual metabolites showing the most significant concentration changes. Multivariate peak differentiation is well suited for novel biomarker discovery and early-stage or putative biomarker identification. However, these putative markers must ultimately be validated using a targeted metabolomic technique that quantitatively measures the presumptive metabolites.
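Reduced to its univariate core, this kind of peak differentiation can be sketched as ranking features by fold change and a Welch t-statistic between cases and controls. Real studies add false-discovery-rate correction and multivariate models (e.g., PLS-DA); the feature IDs and intensities below are simulated.

```python
# Sketch: flag untargeted features whose intensities differ between cases and
# controls, using fold change plus a Welch t-statistic. Simulated toy data.
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t-statistic for two independent samples of unequal variance."""
    vx, vy = variance(xs), variance(ys)
    return (mean(xs) - mean(ys)) / ((vx / len(xs) + vy / len(ys)) ** 0.5)

def differential_features(cases, controls, min_fold=1.5, min_t=3.0):
    """cases/controls: {feature_id: [intensities]}. Returns flagged features
    as (feature_id, fold_change, t_statistic) tuples."""
    flagged = []
    for fid in cases:
        fc = mean(cases[fid]) / mean(controls[fid])
        t = welch_t(cases[fid], controls[fid])
        if (fc >= min_fold or fc <= 1 / min_fold) and abs(t) >= min_t:
            flagged.append((fid, round(fc, 2), round(t, 2)))
    return flagged

cases    = {"F001": [980, 1030, 1010, 995], "F002": [210, 190, 205, 200]}
controls = {"F001": [495, 505, 510, 490],   "F002": [200, 215, 195, 208]}
print(differential_features(cases, controls))  # only F001 passes both filters
```

The surviving features (here, the hypothetical F001) are the ones carried forward for compound identification and eventual targeted validation.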

3.1 Bioinformatic Tools for Reference-Based Metabolite Differentiation

Reference-based metabolite differentiation requires a large and reliable set of human reference metabolite concentrations for different ages, sexes, and biofluids. Currently, the most complete set of healthy metabolite reference values for different biofluids for different ages and sexes is the Human Metabolome Database or HMDB [10]. The HMDB is widely regarded as the most complete open-access database on human metabolites and their disease associations. Currently, the HMDB contains reference concentration values for 3073 metabolites in serum/plasma, 1757 metabolites in urine, 447 metabolites in cerebrospinal fluid, 883 metabolites in saliva, and 1805 metabolites in feces. These values include the literature sources, the age group or age range of the measured cohort, and sex (if available). In many cases, multiple values are provided, as different analytical methods can lead to slight differences in the reported concentration values. Abnormal concentrations are also reported in the HMDB, along with the associated conditions and the corresponding literature sources. Currently, the HMDB contains abnormal metabolite concentration data for more than 660 IEMs and other genetic disorders. Users can easily query the HMDB via its web interface with a specific metabolite name, metabolite structure, or metabolite InChI identifier to obtain the corresponding biofluid concentrations and explore what else is known about the queried metabolite. Likewise, users may also query or browse the HMDB by disease or condition names, and the results will provide the lists of altered metabolites associated with that condition.

Another useful source of reference metabolite concentrations and reference-based metabolite differentiation is MarkerDB [37]. MarkerDB is the world’s most comprehensive open-access biomarker database. It contains more than 26,600 genetic, protein, and metabolite biomarkers for 670 human disorders or conditions. MarkerDB not only provides metabolite concentration data for healthy individuals (with information on age- and sex-specific values), but it also provides data on unhealthy metabolite concentrations (in different biofluids), descriptions of the associated disorders, detailed descriptions of the metabolites, and even biomarker performance measures, such as sensitivity and specificity. In this regard, the disease and biomarker data in MarkerDB are probably more complete than the data in HMDB.

Both HMDB and MarkerDB are primarily designed for querying or browsing a small number (<3) of metabolites and assessing their concentration differences relative to healthy normal values. If multiple (10 or more) metabolites need to be queried to determine if the observed metabolite concentrations are significantly different from normal, it is possible to use an online metabolomic web server called MetaboAnalyst [38]. In particular, the Single Sample Profiling (SSP) option within MetaboAnalyst’s enrichment analysis (EA) module allows users to enter long lists of metabolite names and concentrations, which are then compared against the values reported in the HMDB for a wide variety of biofluids. Metabolites that are higher (H) or lower (L) than the normal reference values are flagged in the resulting output. Hyperlinks to the HMDB compound database entries are also provided. The SSP option within MetaboAnalyst provides users a fast and convenient route to perform reference-based metabolite differentiation from targeted metabolomic data. It also allows the important metabolite features (i.e., those marked with H or L) to be used to generate a biomarker profile for a given disease.
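The logic of SSP-style reference flagging can be sketched in a few lines of Python. The reference ranges below are hypothetical placeholders, not actual HMDB values, and the function simply marks each measured concentration as high (H), low (L), or normal (N):

```python
# Sketch of reference-based flagging in the style of MetaboAnalyst's SSP option.
# The reference ranges are hypothetical placeholders, not real HMDB values.

REFERENCE_RANGES = {          # metabolite -> (low, high), illustrative only
    "glucose":    (3900.0, 6100.0),
    "creatinine": (45.0, 110.0),
    "lactate":    (500.0, 2200.0),
}

def flag_metabolites(measured):
    """Return 'H', 'L', or 'N' (normal) for each measured metabolite."""
    flags = {}
    for name, value in measured.items():
        low, high = REFERENCE_RANGES[name]
        if value > high:
            flags[name] = "H"
        elif value < low:
            flags[name] = "L"
        else:
            flags[name] = "N"
    return flags

sample = {"glucose": 7500.0, "creatinine": 80.0, "lactate": 300.0}
print(flag_metabolites(sample))  # {'glucose': 'H', 'creatinine': 'N', 'lactate': 'L'}
```

A real implementation would also need to map metabolite names to HMDB identifiers and select the reference ranges appropriate for the subject's biofluid, age, and sex.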

3.2 Bioinformatic Tools for Multivariate Metabolite/Peak Differentiation

When conducting biomarker discovery or biomarker validation studies in clinical metabolomics, it is standard practice to perform well-powered case-versus-control studies. Case-control studies often involve hundreds of subjects, thousands of biofluid samples, and the collection of very large metabolomic datasets. As highlighted earlier, any single targeted metabolomic assay can easily generate hundreds of named metabolites and metabolite concentrations. Likewise, an untargeted metabolomic assay can easily generate thousands of un-named “features” or peaks along with their corresponding relative peak intensities. Because the number of variables in these types of clinical studies is so large, special statistical methods must be used to help manage the data, differentiate up- or downregulated metabolites, and address the problems of correlated variables, false positives, and multiple-testing significance. In particular, the techniques that must be used are called multivariate statistics. In multivariate (short for multiple variable) statistics, the variables are called “dimensions.” One of the primary objectives of multivariate statistics is to reduce the number of variables or dimensions so that the problem can be tackled more simply using traditional univariate statistics, such as Student’s t-tests or ANOVA techniques. Multivariate statistics uses a class of mathematical techniques called dimensional reduction methods to make multivariate data look more like univariate (single variable) data. Dimensional reduction allows one to identify the key components in a large multivariate dataset that contain the maximum amount of information or maximize the differences among groups. As a result, dimensional reduction reduces a long list of metabolites to a shorter list of the most significant metabolites. This is the essence of multivariate feature/metabolite differentiation. The most common form of dimensional reduction is called principal component analysis or PCA.

3.3 Principal Component Analysis

Principal component analysis (PCA) is an unsupervised technique that is commonly used for clustering. Clustering is the process of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Clustering helps distinguish groups, such as cases and controls, from one another based on their metabolic parameters. In a more formal “mathematical” sense, PCA determines an optimal linear transformation for a collection of data points such that the properties of that set of data points are most clearly displayed along a small number of coordinate (or principal) axes. Simply put, PCA allows metabolomic researchers to easily plot, visualize, and cluster multiple lists of metabolites and their concentrations based on linear combinations of their shared features. PCA is most commonly used in clinical metabolomics to determine whether one or more samples are different from the others. It also allows one to identify which variables or metabolites contribute most to this difference and whether those metabolites contribute in the same way (i.e., are correlated) or independently (i.e., uncorrelated) from each other. PCA is particularly appealing because it allows one to visually detect sample clusters or groupings. In particular, the results of a PCA are usually discussed in terms of scores and loadings. The scores represent the original data in the new coordinate system, and the loadings are the weights applied to the original data during the projection process. Plotting out the data using two sets of scores (one for the X axis and one for the Y axis) will produce a “scores” plot. The “weightings” of the individual components correspond to a PCA “loadings” plot. With untargeted metabolomic data, the loadings plot can be used to narrow down the list of features or peaks to just a few important ones that need to be identified.
This makes PCA ideal for reducing the number of features in untargeted metabolomic data from thousands to just a few dozen or fewer. It can also help reduce the list of metabolites in targeted metabolomic studies from hundreds to just a dozen or fewer. Furthermore, PCA can be used to identify the most important or most informative metabolites required to generate a biomarker profile for a given disease.

PCA can be easily conducted using a variety of software programs, such as MATLAB or the freely available R project (http://www.r-project.org) via R’s prcomp or princomp commands. PCA can also be performed using freely available, downloadable software packages such as XCMS [25], MS-DIAL [26], MAVEN [39], and GALAXY-M [40], which are frequently used for processing LC-MS data. Freely available web servers also exist that support PCA and other common multivariate statistical techniques. The most widely used web server for multivariate statistical analysis in metabolomics is MetaboAnalyst [38]. MetaboAnalyst, which is freely available, provides an easy-to-use graphical interface that allows users to simply point and click to perform complex multivariate statistical operations or to generate colorful, interactive graphs or tables. Nearly one-third of all published metabolomic papers use MetaboAnalyst in their metabolomic data analysis pipeline.
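For readers who prefer a script-level view, the scores, loadings, and explained variance described above can be computed directly with a singular value decomposition. This is only a minimal Python sketch on a made-up 4 × 3 data matrix; real metabolomic tables have hundreds of columns and usually require careful normalization first:

```python
import numpy as np

# Minimal PCA via singular value decomposition; the data matrix is a toy
# stand-in for a (samples x metabolites) concentration table.
X = np.array([[5.1, 200.0, 0.9],
              [4.8, 210.0, 1.1],
              [7.9, 350.0, 2.0],
              [8.2, 340.0, 2.2]])   # 4 samples, 3 "metabolites"

Xc = X - X.mean(axis=0)             # mean-center each metabolite
# Autoscaling (dividing by the standard deviation) is common in metabolomics
# so that high-concentration metabolites do not dominate the components.
Xs = Xc / Xc.std(axis=0, ddof=1)

U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U * S                      # coordinates of samples in PC space
loadings = Vt                       # weight of each metabolite on each PC

explained = S**2 / np.sum(S**2)     # fraction of variance per component
print(np.round(explained, 3))       # PC1 should capture most of the variance
```

Plotting the first two columns of `scores` against each other gives the familiar PCA "scores" plot, while the rows of `loadings` give the "loadings" plot.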

3.4 Partial Least Squares Discriminant Analysis

PCA is not the only multivariate statistical approach that can be used to identify important metabolites or reduce the number of spectral features. Another type of multivariate statistical method that can be used for this purpose is known as supervised classification. Supervised classifiers are programs or algorithms that require information about the class identities to be provided in advance of running the analysis. In other words, prior knowledge about which samples belong to the “cases” and which samples belong to the “controls” is used to label each of the samples. Examples of supervised classifiers include SIMCA (soft independent modeling by class analogy), PLS-DA (partial least squares discriminant analysis), and OPLS-DA (orthogonal projections to latent structures discriminant analysis). All of these techniques can be used to help convert extensive NMR, LC-MS/MS, and GC-MS metabolite lists (for targeted metabolomics) or their corresponding spectral features (for untargeted metabolomics) into much shorter lists of highly significant metabolites and/or features.

PLS-DA or partial least squares discriminant analysis is often used when PCA techniques do not generate sufficiently distinct clusters or sufficiently distinct metabolite sets. In particular, PLS-DA can be used to enhance the separation between data points in a PCA “scores” plot by essentially rotating the PCA components such that a maximum separation among classes is obtained. This enhanced separation allows one to better understand which variables are most responsible for separating the observed (or apparent) classes. Care must be taken in using PLS-DA methods because these classification techniques can be overtrained. That is, PLS-DA can create convincing clusters or classes that have no statistical meaning (i.e., they over-fit the data). The best way of avoiding these problems is to use permutation (random relabeling) approaches to ensure that the data clusters derived by PLS-DA are real and robust. A number of freely available metabolomic software packages and web servers, such as MetaboAnalyst, are able to perform these permutation tests. Another way of quantitatively assessing a PLS-DA model is to report R2 and/or Q2 values. Both R2 and Q2 are typically reported by metabolomic web servers and software packages such as MetaboAnalyst. R2, often called the coefficient of determination, refers to the goodness of fit or the explained variation, while Q2 refers to the predicted variation or quality of prediction. A poorly fit model will have an R2 of 0.2 or 0.3, while a nicely fit model will have an R2 of 0.7 or 0.8. In practice, Q2 typically tracks very closely to R2. However, if the PLS-DA model becomes over-fit, Q2 reaches a maximum value and then begins to fall. Generally, a Q2 > 0.5 is considered good, while a Q2 of 0.9 is outstanding.
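The R2 and Q2 statistics described above are computed with the same formula (1 − SS_residual/SS_total); the difference is that Q2 uses cross-validated predictions, where each sample is predicted by a model that did not see it during training. A minimal Python sketch, using hypothetical fitted and cross-validated values rather than a real PLS-DA model:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - SS_residual / SS_total: the 'explained variation'."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Class labels for a toy case-vs-control study (1 = case, 0 = control).
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Hypothetical fitted values from the full model (for R2) and from
# cross-validation, where each sample was predicted while left out (for Q2).
y_fit = np.array([0.9, 0.8, 0.85, 0.1, 0.2, 0.15])
y_cv  = np.array([0.8, 0.7, 0.75, 0.3, 0.35, 0.25])

R2 = r_squared(y, y_fit)
Q2 = r_squared(y, y_cv)
print(round(R2, 3), round(Q2, 3))   # Q2 is typically lower than R2
```

A permutation test then repeats the whole fit with randomly relabeled samples; if the permuted Q2 values approach the real one, the apparent class separation is likely an artifact of over-fitting.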

If a robust PLS-DA model can be generated, the set of important metabolites (generated via targeted metabolomics) or features (generated via untargeted metabolomics) arising from the variable importance in projection (VIP) plot can be more easily interpreted than those determined via a PCA loading plot. PLS-DA is generally among the most powerful and useful methods for reducing the number of features in untargeted metabolomic data from thousands to just a few dozen or fewer. PLS-DA is also very effective in reducing the list of important or differential metabolites in targeted metabolomic studies from hundreds to just a dozen or fewer. Furthermore, PLS-DA can be used to identify the most important or most informative metabolites required to generate a biomarker profile for a given disease. The utility of PLS-DA in biomarker development and discovery is discussed in the next section.

4 Bioinformatic Tools for Biomarker Discovery

One of the principal goals of clinical metabolomics is to discover and/or measure metabolite biomarkers of human disease. Biomarkers are typically defined as objectively measurable biological characteristics that can be used to diagnose, monitor, or predict the risk of disease [41]. For example, blood glucose is a standard chemical biomarker for monitoring diabetes, while serum creatinine is a chemical marker for kidney function. Many traditional clinical chemistry biomarkers consist of just a single measured entity. However, metabolomics has allowed clinicians to measure multiple chemicals at once. This means it is now possible to measure multiple biomarkers or develop multi-biomarker panels to predict or diagnose diseases with greatly improved sensitivity and specificity. Indeed, it has long been common practice among physicians to combine multiple physiological biomarkers (age + BMI + triglyceride level + cholesterol level = cardiac disease risk) to improve biomarker sensitivity and specificity. Now, with metabolomics, it is possible to create diagnostic or predictive models from multiple metabolites, which can be used to classify individuals into specific groups (i.e., healthy vs. diseased) with much improved sensitivity and specificity.

Sensitivity and specificity have very formal definitions in biomarker studies and the biomarker literature. In standard case vs. control studies, sensitivity (Sn) is mathematically defined as Sn = TP/(TP + FN), and specificity (Sp) is mathematically defined as Sp = TN/(TN + FP), where TP is the number of true positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. Sensitivity (also known as the true positive rate) can be considered as the probability of a positive test result given that a subject has an actual positive outcome. Specificity (also known as the true negative rate) can be considered as the probability of a negative test result given that a subject has an actual negative outcome. For instance, if a biomarker or biomarker panel has a sensitivity of 0.95 and a specificity of 0.60, then 95% of the subjects who actually have the disease/condition will correctly receive a test score above the decision boundary, but only 60% of the subjects who are actually healthy will correctly receive a test score below the decision boundary. A promising biomarker must have both high sensitivity (i.e., to give a positive test result when the disease is actually present) and high specificity (i.e., to give a negative test result when the disease is absent).
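These definitions translate directly into code. A short Python sketch using toy confusion-matrix counts:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute Sn = TP/(TP + FN) and Sp = TN/(TN + FP) as defined above."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return sn, sp

# Toy confusion counts: 95 of 100 diseased subjects test positive,
# and 60 of 100 healthy subjects test negative.
sn, sp = sensitivity_specificity(tp=95, fn=5, tn=60, fp=40)
print(sn, sp)  # 0.95 0.6
```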

One of the best ways to observe how a decision boundary affects sensitivity and specificity is through a receiver operating characteristic (ROC) curve. A ROC curve shows how the sensitivity and specificity change as the classification decision boundary is varied across the range of available biomarker scores. Because an ROC curve depicts the performance of a biomarker test over the complete range of possible decision boundaries, it allows the optimal specificity and associated sensitivity to be determined by visual inspection. When one evaluates a biomarker using a ROC curve, there is no need to worry about the “data normality” of either the predicted positive or negative score distributions, nor whether the two distributions have equal numbers of subjects or equal variance. As a result, ROC curve analysis is widely considered to be the most objective and statistically valid method for biomarker performance evaluation [42].

ROC curves are often summarized into a single metric known as the “Area Under the Curve” (AUC or AUROC). The AUROC indicates a biomarker model’s ability to discriminate between cases (positive examples) and non-cases (negative examples). If all positive cases are ranked before negative ones (i.e., a perfect classifier), the AUC is 1.0. An AUC of 0.5 is equivalent to randomly classifying subjects as either sick or healthy (i.e., the classifier is of no practical utility). A rough guide for assessing the utility of a biomarker based on its AUROC is as follows: 0.9–1.0 = excellent; 0.8–0.9 = good; 0.7–0.8 = fair; 0.6–0.7 = poor; and 0.5–0.6 = fail (see Fig. 2).
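The AUROC also has a useful probabilistic interpretation: it equals the probability that a randomly chosen case receives a higher score than a randomly chosen control (the Mann-Whitney formulation). A minimal Python sketch with hypothetical biomarker scores:

```python
def auroc(case_scores, control_scores):
    """AUROC as the fraction of (case, control) pairs in which the case
    scores higher; ties count as half (the Mann-Whitney formulation)."""
    wins = 0.0
    for c in case_scores:
        for n in control_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

cases    = [0.9, 0.8, 0.7, 0.6]   # hypothetical biomarker scores
controls = [0.5, 0.4, 0.65, 0.3]
print(auroc(cases, controls))      # perfect ranking would give 1.0
```

Here one case (0.6) ranks below one control (0.65), so the AUROC falls slightly short of a perfect 1.0.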

Fig. 2

A depiction of several different ROC curves for different biomarker tests with the area under the ROC curves indicated. On the bottom left is an example of a perfect biomarker with a perfect ROC curve having an AUROC of 1.0. On the top left is an example of an excellent biomarker profile with an AUROC of 0.9. On the top right is an example of a moderately good biomarker profile with an AUROC of 0.7. On the bottom right is an example of a random biomarker with no predictive or diagnostic capability

Currently, the most useful tool for biomarker discovery, biomarker selection, and for performing sensitivity/specificity analysis (via ROC curve analysis) with metabolomic data is MetaboAnalyst [38]. In particular, the MetaboAnalyst biomarker module supports three common ROC-based analysis modes: (1) classical univariate ROC curve analysis, (2) multivariate ROC curve exploration, and (3) manual biomarker model creation and evaluation. The most popular and useful option is the multivariate ROC curve exploration which supports automated multi-biomarker selection and optimization using Monte Carlo cross validation (MCCV). This allows the biomarker panel’s AUROC to be maximized while minimizing the number of biomarkers being used. MetaboAnalyst will typically generate several biomarker models with different numbers of metabolites and different AUROCs to allow users some choice over what biomarker panel matches their biomarker requirements or performance expectations.
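The idea behind MCCV can be illustrated with a short Python sketch: the data are repeatedly split at random into training and hold-out sets, the hold-out performance is scored, and the results are averaged. The scores and labels below are hypothetical, and for simplicity the "model" is just the biomarker score itself rather than a fitted panel:

```python
import random

def auc(cases, controls):
    """Pairwise AUROC (ties count as half)."""
    pairs = [(c, n) for c in cases for n in controls]
    return sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c, n in pairs) / len(pairs)

def mccv_auc(scores, labels, n_splits=100, holdout=0.3, seed=0):
    """Average hold-out AUROC over repeated random splits (MCCV).
    'scores' are hypothetical biomarker-panel scores; 'labels' are 1/0."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    aucs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test = idx[: max(2, int(holdout * len(idx)))]
        cases = [scores[i] for i in test if labels[i] == 1]
        controls = [scores[i] for i in test if labels[i] == 0]
        if cases and controls:          # need both classes to score a split
            aucs.append(auc(cases, controls))
    return sum(aucs) / len(aucs)

scores = [0.9, 0.85, 0.7, 0.75, 0.3, 0.4, 0.2, 0.35]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(round(mccv_auc(scores, labels), 3))  # well-separated scores give AUC near 1.0
```

In a real MCCV run, a classifier (e.g., PLS-DA or logistic regression) would be refit on each training split and its predictions scored on the hold-out set, which is what allows the biomarker panel to be optimized without over-fitting.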

Four different biomarker modeling options are currently offered with MetaboAnalyst’s Biomarker module: (1) partial least squares discriminant analysis (PLS-DA), (2) support vector machine (SVM), (3) random forests, and (4) logistic regression. The most useful of these four options is the logistic regression model as it provides an equation, or set of equations, incorporating metabolite concentrations that can be universally used for calculating cutoff thresholds or decision boundaries. MetaboAnalyst also generates a number of useful graphs, ROC curves, confidence intervals, and charts to help users assess the selected biomarkers and biomarker models. The simplicity with which biomarker models can be developed (mostly via point-and-click operations) and rich graphical support in MetaboAnalyst within its Biomarker module makes it the ideal tool for biomarker discovery in clinical metabolomics.
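Once fitted, a logistic regression biomarker panel reduces to a simple closed-form equation, which is what makes it so portable for defining cutoff thresholds. The two metabolite names and the coefficients below are hypothetical placeholders, not values from any real model:

```python
import math

# Hypothetical fitted coefficients for a two-metabolite panel; a real model
# would come from training (e.g., in MetaboAnalyst), not from these numbers.
INTERCEPT = -4.0
COEFS = {"metabolite_A": 0.8, "metabolite_B": 1.5}   # per-unit change in log-odds

def disease_probability(conc):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(COEFS[m] * conc[m] for m in COEFS)
    return 1.0 / (1.0 + math.exp(-z))

patient = {"metabolite_A": 2.0, "metabolite_B": 3.0}
p = disease_probability(patient)
print(round(p, 3), "case" if p >= 0.5 else "control")  # 0.5 used as the cutoff here
```

The probability cutoff (0.5 here) plays the role of the decision boundary, and sliding it up or down traces out the ROC curve for the panel.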

It is important to note that the metabolomic data being uploaded into MetaboAnalyst’s Biomarker module should be absolutely quantitative. As with most analytical methods supported by MetaboAnalyst, the metabolite data uploaded into the biomarker module must be properly normalized, scaled, and transformed so that metabolite values are comparable and therefore more robustly analyzable.

5 Bioinformatic Tools for Biomedical Interpretation and Data Integration

Biomarker identification is usually limited to disease diagnosis or disease prediction [37]. Biomarkers are not necessarily intended to help uncover the underlying cause of the disease or explain specific disease mechanisms. Of course, metabolic biomarkers may be strongly associated with disease mechanisms, but without some kind of biological interpretation or some kind of biomedical context, association does not imply causation, nor does it lead to mechanistic insights. To properly determine disease causes or disease mechanisms from clinical metabolomic data, it is often necessary to turn to metabolic pathways or to integrate both genomic and metabolomic data together. Metabolite interpretation via pathway analysis often involves determining whether the identified metabolites belong to a single pathway or a smaller set of related pathways. In many cases, this requires searching or reading carefully through various online metabolic pathway databases.

Metabolic pathway databases provide a centralized collection of schematic pathways that depict the current state of knowledge regarding metabolic (catabolic, anabolic, or signaling) processes that occur within a cell, tissue, or organism. Pathway databases combine large collections of carefully curated metabolite data with large amounts of carefully collected protein and/or genetic data through a series of illustrated enzyme-mediated reactions, receptor-mediated signaling processes, or protein-aided transport activities. These represent the key molecular and cellular activities that underlie all physiological processes. Because pathway databases combine multi-omic (metabolomic, proteomic, genomic) data together along with general information about physiological or biological consequences, these databases can play a key role in the biological analysis or biomedical interpretation of metabolomic data.

Some of the most popular small molecule pathway databases include KEGG [43], the Reactome database [44], the “Cyc” databases [45], WikiPathways [46], the Small Molecule Pathway Database or SMPDB [47], and PathBank [48]. A number of commercial pathway databases also exist, such as BioCarta, TransPath (from BioBase Inc.), and Ingenuity Pathway Analysis (from Ingenuity Systems Inc.). The most useful pathway database for clinical metabolomics is SMPDB, as it offers the largest and most diverse collection of pathways specific to human biology and human diseases. In particular, SMPDB contains 150 signaling pathways, 20,250 disease pathways (covering many IEMs and genetic disorders), 468 drug pathways, and 27,800 metabolic (catabolic/anabolic) pathways.

Most pathway databases support interactive image mapping with hyperlinked information content that allows users to view chemical information (if a compound is clicked) or brief summaries of genes and/or proteins (if a protein or enzyme is clicked). Almost all pathway databases support some kind of limited text search, and a few, such as Reactome, SMPDB, and the “Cyc” databases, support the mapping of gene, protein, and/or metabolite expression data onto pathway diagrams. Only a few pathway databases, such as SMPDB, provide their pathway data in common, machine-readable data exchange formats such as BioPAX [49], SBML [Systems Biology Markup Language] [50], or SBGN-ML [Systems Biology Graphical Notation Markup Language] [51].

Nearly all of the major pathway databases used in metabolomics today (KEGG, the Reactome database, the “Cyc” databases, WikiPathways, and SMPDB) permit users to upload metabolite data and generate highlighted pathway plots indicating the location of key metabolites in a given pathway. Unfortunately, most metabolite/metabolism databases (such as KEGG, the Cyc databases, WikiPathways, and Reactome) only contain anabolic or catabolic pathways associated with endogenous metabolites. Almost no information is provided on metabolite signaling pathways, disease pathways, metabolic diseases (such as phenylketonuria), or drug action pathways (e.g., how aspirin works). As a result, many metabolomic pathway analyses are limited to interpreting complex metabolite data in only the simplest of terms. An important exception is SMPDB, which contains hundreds of human-specific pathways, including dozens of signaling pathways as well as hundreds of disease and drug pathways. Currently, SMPDB is the only open-access database that covers such a broad diversity of human disease or disease mechanism pathways – especially for small molecules. This makes SMPDB one of the most popular tools for interpreting and integrating clinical metabolomic data.

While pathway visualization can provide some important qualitative insight into the biological roles for metabolites detected in a clinical metabolomic study, it is also important to remember that more quantitative tools for pathway analysis also exist. In particular, pathway enrichment and pathway topological analysis are two quantitative methods that can be quite helpful. MetaboAnalyst offers several advanced pathway enrichment analysis procedures along with pathway topological analysis to help identify the most relevant metabolic pathways involved in a given clinical metabolomic study. The pathway analysis module in MetaboAnalyst uses simple point-and-click operations to support three types of analyses: (1) pathway enrichment analysis, (2) pathway topological analysis, and (3) pathway impact analysis. Pathway enrichment analysis can be done using either overrepresentation analysis or metabolite set enrichment analysis via Fisher’s exact test, the hypergeometric test, and GlobalAncova [52]. Pathway topological analysis is based on the centrality measures of a metabolite in a given metabolic network. Centrality is a quantitative measure of the position of a metabolite relative to the other metabolites in a pathway. Centrality can be used to estimate a metabolite’s relative importance or role in a pathway or network diagram. MetaboAnalyst uses relative “betweenness” centrality and “out-degree” centrality to calculate the relative importance of a metabolite.
In terms of centrality, metabolites located on the periphery of a pathway, or those involved in side reactions, have little consequence and are not particularly “central.” On the other hand, metabolites that sit at pathway bottlenecks or that serve as hubs or precursors for many reactions are more “central.” By calculating the topological importance of different metabolites in a given pathway, as well as the enrichment of certain metabolites in a pathway, it is possible to calculate a pathway impact score.
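The overrepresentation analysis mentioned above boils down to a hypergeometric tail probability: given N reference metabolites, K of which belong to a pathway, what is the chance that a hit list of n significant metabolites contains k or more pathway members by chance alone? A minimal Python sketch using only the standard library (all of the counts are hypothetical):

```python
from math import comb

def ora_pvalue(N, K, n, k):
    """Overrepresentation p-value: the probability of drawing k or more
    pathway metabolites (K of N total) in a hit list of size n,
    under the hypergeometric distribution."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical numbers: 1000 metabolites in the reference set, 40 of them in
# the pathway of interest, 50 significant hits, 8 of which fall in the pathway.
p = ora_pvalue(N=1000, K=40, n=50, k=8)
print(f"{p:.2e}")   # a small p-value suggests the pathway is enriched
```

Here the expected overlap by chance is only 50 × 40/1000 = 2 metabolites, so observing 8 yields a very small p-value and flags the pathway as enriched.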

By plotting the pathway impact score against the number of significant metabolites appearing in that pathway, it is possible to generate a plot that illustrates the most important pathways detected from a set of significantly altered metabolites in a given metabolomic experiment (Fig. 3).

Fig. 3
Left: a scatter plot of −log(p) versus pathway impact. Right: a pathway diagram of glycine, serine, and threonine metabolism.

An example of a pathway impact diagram from MetaboAnalyst. For the graph on the left side of the image, the X-axis displays the pathway impact score, while the Y-axis displays the level of enrichment. The size of the colored circles represents the number of metabolites in the illustrated pathway, and the color of the circle indicates its overall significance (with red being most significant and pale yellow being least significant). By clicking on the colored circles, it is possible to see more details about the pathway name, the pathway components, and their topological relationships. This expanded view is shown on the right side of the image with the pathway diagram being taken from KEGG and the individual metabolites being identified with KEGG identifiers

Within MetaboAnalyst, each detected metabolite in this plot is also “clickable,” so that a box-and-whisker plot can be generated to illustrate the metabolite’s concentrations and range between the “case” and “control” samples.

In addition to pathway analysis, there are also a number of other approaches that can be used to interpret, visualize, or explore clinical metabolomic data. One particularly useful approach involves a technique called metabolite set enrichment analysis or MSEA [53]. MSEA is a form of functional enrichment analysis similar to gene set enrichment analysis (GSEA). For metabolite set enrichment to be effective, one usually needs a comprehensive database of metabolic pathways, a database of healthy/diseased metabolite concentrations, or a database with associations between metabolites and SNPs or metabolites and gene expression levels. Ideally, a good MSEA system should have all of these databases and support all of these functional analyses. In this regard, the MSEA module in MetaboAnalyst actually has all of these databases and functional tools, making it particularly useful for clinical interpretation. Another approach to interpreting clinical metabolomic data is to combine it with gene expression or protein expression data [54]. There are a number of bioinformatic tools that support this kind of integration. One example is MetScape [55]. MetScape is a plugin for the widely used open-source network analysis and visualization tool called Cytoscape. MetScape supports the interactive, network-based exploration and visualization of both metabolite and gene expression data by integrating both the KEGG and EHMN (Edinburgh human metabolic network) databases. MetScape allows users to identify enriched pathways from gene/metabolite expression profiling data, build and analyze gene/metabolite networks, and interactively visualize changes in gene/metabolite data. Another integrated “omics” approach that offers similar capabilities is called Integrated Metabolomic and Expression Analysis or INMEX [54]. This web-based tool is now available through MetaboAnalyst. Like MetScape, INMEX makes use of the KEGG pathway database as well as a number of pathways from SMPDB.

How these software tools and resources are used and how the data is eventually interpreted depends somewhat on the knowledge of the user. Naïve analyses performed by a naïve individual will lead to naïve interpretation. Taking the time to read the literature and to discover what else is known (genetically or metabolically) about a given disease or condition will allow for a much more efficient use of the software and a much more intelligent interpretation of the data. In this regard, it is always important to remember that bioinformatics should always be used as an aid to support and extend one’s own biochemical and biological knowledge.

6 Summary

This chapter has provided a high-level overview of the bioinformatic resources needed to analyze clinical metabolomic data. As highlighted at the beginning of this chapter, the main bioinformatic challenges in clinical metabolomics are (1) metabolite identification, (2) determining metabolite significance, (3) biomarker discovery, and (4) finding disease mechanisms or causes. To address these challenges, we introduced and discussed a number of software tools, data resources, and data standards for facilitating compound identification, for detecting which compounds are significantly altered in abundance, for identifying and assessing metabolite biomarkers or biomarker panels, and for understanding biological and genetic context of the observed metabolite changes. In particular, we discussed the Metabolomics Standards Initiative (MSI) for compound identification and introduced software tools for metabolite identification and quantification for NMR, GC-MS, and LC-MS/MS (such as AMIX, Bayesil, AMDIS, and XCMS). We also discussed data resources for metabolite annotation such as the Human Metabolome Database (HMDB) and MarkerDB, as well as data analysis and biomarker discovery tools such as MetaboAnalyst. Finally, we closed the chapter with a discussion on different bioinformatic resources for interpreting or characterizing disease mechanisms, such as the Small Molecule Pathway Database (SMPDB).

The field of clinical metabolomics has grown considerably over the past 10 years, and detailed descriptions of all the bioinformatic tools and resources that have been developed for clinical metabolomics could easily fill several books. This chapter is only intended to serve as an introduction so that individuals who are interested in pursuing clinical metabolomics and using bioinformatic tools for clinical metabolomics can better appreciate what is available, what is possible, and what still needs to be done.