1 Introduction

Spectral library searching is currently the most common approach for compound identification in untargeted metabolomics, and the earliest spectral libraries can be traced back to the 1950s (Zemany, 1950). Metabolite annotation using spectral library searching is based on the concept that molecules undergo fragmentation that creates a reproducible “fingerprint.” Matching against a spectral library of ground truth MS/MS spectra collected with chemical standards of known molecules can then be used to narrow down structural hypotheses. During library searching, experimental MS/MS spectra are annotated by matching against the library MS/MS spectra and transferring compound labels from library to experimental spectra when a high-scoring match is achieved. This is the gold standard for metabolite annotation from MS/MS data only, and it yields a level 2 or level 3 annotation based on the guidelines of the Metabolomics Standards Initiative (Sumner et al., 2007). A level 2 annotation corresponds to library searching resulting in a structural hypothesis for a specific molecule, while a level 3 annotation is a match hypothesis to a molecular family. Isomeric compounds with identical precursor mass, in particular, may result in more than one structural match. For example, it is impossible to distinguish between various stereoisomers of hexenoylcarnitine by MS/MS matching only (Fig. 1). To promote such a level 3 match to a level 1 identification, complementary analytical approaches, such as nuclear magnetic resonance (NMR), are needed. Alternatively, all possible isomers in the molecular family have to be tested under the same mass spectrometry conditions to best determine MS/MS spectrum similarity, in addition to liquid chromatography (LC) co-migration of the compound of interest with the chemical standards to validate whether it elutes with the same peak shape and retention time.

Fig. 1

Representative example of a molecular family level annotation from spectral library searching that matches to hexenoylcarnitine. The MS/MS spectrum contains several diagnostic fragments and neutral losses that make it possible to assign it to the acylcarnitines molecular family, as indicated on the molecular structures (Yan et al., 2020). However, routine spectral library matching cannot distinguish between the 14 potential stereo- and regioisomers, resulting in a level 3 annotation. This highlights the need for new strategies to communicate the results from spectral library searching, as narrowing down to the molecular family, even when the exact molecular identity is unknown, can often already be valuable for biological interpretation. Top is the experimentally observed MS/MS spectrum, with a precursor m/z deviation of 11.6 ppm compared to the calculated m/z of the protonated ions.

Although in proteomics, sequence database searching is the dominant strategy to annotate MS/MS spectra (Eng et al., 2011), the usage of spectral libraries has become increasingly popular for the analysis of peptide MS/MS data as well in recent years (Deutsch et al., 2018; Griss, 2016; Shao & Lam, 2017). Spectral library searching is more sensitive than sequence database searching, achieving a higher rate of spectrum identifications (Zhang et al., 2011), and results from spectral library searching and sequence database searching can be combined to maximize the number of identified MS/MS spectra (Shteynberg et al., 2013). This increased sensitivity is especially relevant for the analysis of data-independent acquisition (DIA) experiments, where mixtures of analytes within large, pre-specified mass ranges are measured, in contrast to data-dependent acquisition (DDA), which attempts to isolate and measure individual analytes (Hu et al., 2016). The resulting complex DIA spectra contain signals from multiple peptides, and most DIA analysis tools require detailed MS/MS fragmentation patterns from reference spectral libraries to annotate peptides.

As the authors of this perspective believe that open and transparent science has strong cascading benefits for the larger scientific community (Wilson et al., 2021) and are most familiar with the GNPS/MassIVE platform (M. Wang et al., 2016), most of the following discussion is contextualized in reference to this resource for untargeted metabolomics analysis. In this context, we discuss the state of spectral libraries for untargeted metabolomics in 2022, describe the essential role of spectral libraries in the development of computational tools, and highlight some open challenges and opportunities for the metabolomics community to address in the coming years.

2 Impact of growing and freely accessible spectral libraries

Over the past decade, MS/MS small molecule spectral libraries have steadily increased in size to include hundreds of thousands to millions of MS/MS spectra and hundreds of thousands of compounds (Fig. 2a). Some of the largest experimental small molecule spectral libraries that are currently available include both commercial libraries, such as the National Institute of Standards and Technology (NIST) tandem mass spectral library (https://chemdata.nist.gov/) and the METLIN Gen2 spectral library (Xue et al., 2020), and open spectral libraries, which also serve as aggregation sites for third-party community spectral libraries, such as the Global Natural Products Social Molecular Networking (GNPS) community spectral libraries (M. Wang et al., 2016) and MassBank of North America (MoNA; https://mona.fiehnlab.ucdavis.edu/). Additionally, mass spectrometry instrument vendors also provide commercial spectral libraries, such as mzCloud (https://www.mzcloud.org/). Excitingly, publicly and freely accessible MS/MS spectral libraries recently saw explosive growth (Fig. 2a).

Fig. 2

Advances in spectral libraries for LC–MS/MS based untargeted metabolomics. a The GNPS community spectral libraries (non-commercial only) have grown from 23,790 MS/MS spectra in 2014 to 586,647 MS/MS spectra in 2022 (September 2022). Concurrently, the number of library spectra that matched to public data has grown from 4,727 MS/MS spectra in 2014 to 127,405 MS/MS spectra in 2022 (22% of the publicly available library spectra have matches to experimental MS/MS spectra in public data). b Fueled by growing spectral libraries, the MS/MS spectrum annotation rate for the GNPS continuous identification mode as part of living data (M. Wang et al., 2016), which periodically reanalyses all public datasets on GNPS/MassIVE with the latest spectral libraries, has increased from 2% of MS/MS spectra on average in 2014 to 13% in 2022

There also exist many other, often subject-specific spectral libraries, including MassBank (Horai et al., 2010) and MassBank EU (https://massbank.eu/MassBank/), the Human Metabolome Database (HMDB) (Wishart et al., 2021), the RIKEN tandem mass spectral database (ReSpect) (Sawada et al., 2012), the monoterpene indole alkaloid database (MIADB) (Fox Ramos et al., 2019), the Critical Assessment of Small Molecule Identification (CASMI) contest libraries (Schymanski & Neumann, 2013), European Molecular Biology Laboratory–Metabolomics Core Facility (EMBL-MCF) (Phapale et al., 2021), the Pacific Northwest National Lab lipids library (Kyle et al., 2017), the National Institutes of Health natural products library (Huang et al., 2019), the Lichen Database (LDB) (Olivier-Jimenez et al., 2019), fungal dereplication (El-Elimat et al., 2013), Chemicalsoft (Dresen et al., 2009), WEIZMASS (Shahaf et al., 2016), MSforID (Oberacher et al., 2011), the reverse metabolomics libraries (Gentry et al., 2021), and many others. Barring access restrictions, these spectral libraries are also often integrated into the aforementioned spectral library aggregation resources, such as GNPS and MoNA. In this case, the SPLASH (SPectraL hASH) mechanism, which assigns unambiguous, database-independent hashed identifiers to MS/MS spectra, can be a useful tool for tracking the provenance of spectral data and detecting duplicate spectra that are shared across multiple data resources (Wohlgemuth et al., 2016), similar to how InChIKeys are used as chemical identifiers.

Several large proteomics spectral libraries exist as well. These include peptide MS/MS spectral libraries for multiple organisms (human, mouse, rat, yeast, etc.) from NIST (https://chemdata.nist.gov/dokuwiki/doku.php?id=peptidew:start), the ProteomeTools project of synthetic human peptide MS/MS spectra (Zolg et al., 2017), and the MassIVE Knowledge Base (MassIVE-KB) of the human proteome (M. Wang et al., 2018a). Different strategies for compiling spectral libraries are exemplified by the ProteomeTools (Zolg et al., 2017) and MassIVE-KB peptide spectral libraries (M. Wang et al., 2018a). On the one hand, ProteomeTools followed the traditional approach to generate a spectral library by synthesizing unique tryptic peptides from the human proteome and acquiring MS/MS data on multiple instrument platforms (Zolg et al., 2017). This was subsequently expanded to include additional tryptic peptides and modified peptides (Zolg et al., 2018), non-tryptic peptides (Wilhelm et al., 2021), and isobarically labeled peptides (Gabriel et al., 2022) to currently consist of more than one million unique synthetic peptides and over 14 million MS/MS spectra. In contrast, MassIVE-KB employed a data-driven approach towards spectral library creation by re-analyzing hundreds of millions to billions of public MS/MS spectra on the MassIVE data repository using sequence database searching (M. Wang et al., 2018a). The most confidently identified MS/MS spectra and their peptide labels were then extracted to create the MassIVE-KB human peptide spectral library, which currently contains 2.5 million unique peptides and 6 million MS/MS spectra (version 2.0.15). Although an equivalent strategy to sequence database searching in proteomics currently does not exist for metabolomics, approaches employed by ProteomeTools and MassIVE-KB demonstrate how alternative strategies can be used to create valuable collections of reference MS/MS spectra. 
Furthermore, as it is not uncommon to observe peptides in metabolomics data, it is conceivable that proteomics libraries can be repurposed to also inform a subset of metabolomics data through creative use of algorithms that find analogs of peptides or peptidic molecules.

Similarly, in untargeted metabolomics, each of the libraries provides complementary MS/MS data and pieces of information. For example, the commercial NIST small molecule spectral library predominantly contains human and plant metabolites, ReSpect contains plant metabolites, and the commercial METLIN library historically contained a significant proportion of lipids and dipeptides (full details on the current composition after its explosive growth (Xue et al., 2020) are unknown as the library and information on the molecules that are part of the library have not been released publicly). The GNPS libraries historically focused on natural products, but they have since grown to include many major publicly available reference libraries, including lipids, drugs, pesticides, primary metabolites, food derived metabolites, common contaminants, and microbial metabolites. Furthermore, these libraries are exchanged with MoNA, MassBank EU, and other resources, such that they are not only leveraged in the GNPS analysis ecosystem but also by other analysis systems such as MZmine (Pluskal et al., 2010), MS-DIAL (Tsugawa et al., 2020), and others. This broad sharing of spectral libraries ensures that untargeted metabolomics analyses can be performed against the largest possible spectral libraries, irrespective of the analysis platform. It should be noted that some spectral libraries, such as NIST and METLIN, are exclusively obtained in a single lab under more consistent experimental conditions, whereas other spectral libraries, such as MoNA and GNPS, are aggregated from community contributions and contain data that has been acquired in multiple labs, using different instruments, instrument platforms, and experimental protocols, and thus are more heterogeneous.

Some metabolomics spectral library resources do not only include direct experimental MS/MS data from pure reference compounds, but also MS/MS spectra that were obtained using computational tools. For example, the MoNA and HMDB spectral libraries are augmented with in silico MS/MS spectra that were simulated using e.g. LipidBlast (MoNA) (Kind et al., 2013) and CFM-ID (HMDB) (F. Wang et al., 2021). Additionally, NIST provides smaller spectral libraries focused on specific types of molecules, such as oligosaccharides (Remoroza et al., 2018, 2020) and acylcarnitines (Yan et al., 2020), that were annotated using analog searching (Burke et al., 2017), a strategy to identify structurally related molecules that differ by a modification by using a very wide precursor mass window (on the order of hundreds of Da), rather than by measuring pure reference standards. GNPS contains secondary reference MS/MS spectra that have been annotated by high-quality matching against the NIST spectral library and “nearest neighbor suspect” MS/MS spectra (Bittremieux et al., 2022a) that were obtained by propagating annotations using molecular networking (Aron et al., 2020) across all public untargeted metabolomics data in the GNPS/MassIVE repository. By propagating annotations from existing spectral libraries to related MS/MS spectra it becomes possible to provide annotations that would otherwise not be accessible to the community. Therefore, these strategies expand the set of putative annotations that can be obtained in untargeted metabolomics experiments. This is especially relevant for molecules for which pure standards are not available, because their structures have never been synthesized or isolated from biological material, or because they cannot be described as a structure (e.g. sodium formate clusters or a specific modification of unknown regio- or stereochemistry). 
However, because the MS/MS spectra are not directly measured from pure reference material, additional care should be taken when interpreting annotations that match such library spectra. In other words, the user has to verify whether the annotations match the data and whether they make sense in the context of the experiment before investing precious time and resources to perform additional validation experiments.

Besides these traditional spectral libraries for untargeted metabolomics that focus on fragmentation data, other libraries that include complementary information or for different data acquisition methods are starting to become available. For example, some spectral libraries contain LC retention time information as well (Stanstrup et al., 2015; Tada et al., 2019), such as the METLIN small molecule retention time dataset (Domingo-Almenara et al., 2019). Additionally, with the increasing integration of ion mobility functionality in modern mass spectrometry instruments, ion mobility libraries that contain reference collision cross section (CCS) measurements are emerging (Zheng et al., 2017; Hernández-Mesa et al., 2018; Righetti et al., 2018; Picache et al., 2018; Schroeder et al., 2019; Z. Zhou et al., 2020). This availability of retention time and CCS reference measurements provides orthogonal information for metabolite annotation from untargeted MS/MS data. Additionally, spectral libraries for alternative data acquisition methods exist. For example, mzCloud organizes MSn spectra into “fragmentation trees,” and the METLIN-MRM spectral library is a multiple-reaction monitoring (MRM) transition repository for small-molecule quantitative mass spectrometry that contains MRM transitions for more than 15,500 unique molecules (Domingo-Almenara et al., 2018).

With the growing commodification of advanced instrumentation capabilities, there is a need for further expansion of alternative spectral libraries. Whereas most LC–MS/MS spectral libraries use collision-induced dissociation (CID) or higher-energy C-trap dissociation (HCD), various other fragmentation techniques, such as electron-induced dissociation (X. Chen et al., 2018), ultraviolet photodissociation (Bowers et al., 1984), charge transfer dissociation (W. D. Hoffmann & Jackson, 2014), and others (Heiles, 2021), can now be used as well. Because different fragmentation techniques can result in dramatically different MS/MS fragmentation patterns, traditional spectral libraries might not be suitable for MS/MS spectral matching of such data and custom libraries will be needed. Even when using CID/HCD fragmentation, different instrument platforms or employing different collision energies can produce MS/MS data that exhibit dissimilar fragmentation behavior. Consequently, it is not always possible to get a spectral match when data is collected differently. Nevertheless, we recommend searching experimental MS/MS data against the broadest possible relevant spectral libraries, irrespective of instrument platform details. Even if the MS/MS spectra differ to some extent, it can still be possible to obtain relevant matches, especially with modern algorithmic techniques that preprocess spectra to try to minimize the effects of experimental variability. Furthermore, some advanced MS/MS fragmentation strategies might enable synergies between previously disparate library generation efforts. For example, CID spectra can contain a non-negligible number of radical fragment ions (K. Chen et al., 2008; Xing & Huan, 2022), and fragmentation mechanisms from electron-induced dissociation techniques show significant similarity to fragmentation events under electron ionization, which is commonly used in gas chromatography mass spectrometry (GC–MS) (Ducati et al., 2021). 
This suggests that it could be possible to repurpose the information content from large amounts of historical spectral libraries that have been generated for GC–MS.

The increasing availability of large-scale and open spectral libraries is driving their growing role in computational mass spectrometry (Aksenov et al., 2017; Stein, 2012; Tsugawa, 2018; Vinaixa et al., 2016). Whereas in untargeted metabolomics experiments, using all commercial and openly available spectral libraries, only 2% of MS/MS spectra could be successfully annotated by spectral library searching less than a decade ago (M. Wang et al., 2016), in 2022 the spectrum annotation rate for untargeted metabolomics on the GNPS platform has increased to 13% (Fig. 2b). This increase by up to an order of magnitude in the number of unique MS/MS spectrum annotations that can be obtained is essential in advancing the amount of biological knowledge that can be achieved using untargeted metabolomics, and has only been possible by tremendous and continued efforts of various stakeholders—both academic and industry—and the metabolomics community at large.

3 Interpreting spectral library searching results

When interpreting spectrum annotations from spectral library searching, it is essential to have a clear understanding of the information that mass spectrometry can and cannot provide (Stein, 2012). For example, mass spectrometry may not always distinguish between isomeric molecules. Although the Metabolomics Standards Initiative provides guidelines to denote the level of identification rigor for reported metabolite identifications (Sumner et al., 2007), these do not fully capture the ambiguity related to isomers (e.g. using an ontology) and do not provide a system to build provenance into the confidence of spectrum annotations. Additionally, MS/MS spectra might not contain sufficiently discriminative information to annotate specific molecules if there are too few fragment ions or no unique fragment ions. Analyzing non-discriminative MS/MS spectra is equivalent to searching a genetic sequence database with a two-mer oligonucleotide, which would result in an excessive number of non-specific matches. Instead, when few ions are available or the sample contains multiple isomers, spectral library annotations might only go up to the molecular family if fragment ions correspond to similar (sub)structures that are shared by related molecules. Therefore, it is recommended that users do not restrict themselves to only the top MS/MS match obtained using spectral library searching, but carefully consider lower ranked MS/MS matches that fall within the user defined inclusion criteria of acceptable errors of MS and MS/MS ions, and minimum number of matching fragment ions. If there are multiple annotations with similar MS/MS match scores that correspond to isomeric molecules or belong to the same molecular family—which usually consists of isomeric structures—additional information is needed to further refine the most likely candidate structures. 
At present, verifying such ambiguity often still involves careful manual investigation by expert users, isolation and NMR confirmation, or purchase or synthesis of all possible structures to validate the assignments. In the future, we anticipate that a new generation of computational mass spectrometry tools that can directly communicate this information to the user will be developed, for example by rolling up spectrum annotations to the family level or indicating spectral evidence of the (sub)structures that can be unambiguously explained. The goal of these tools should be to clearly communicate the maximum amount of knowledge that can be derived from the mass spectral data and then follow up with additional experiments to differentiate among all possible annotations.
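The recommendation to consider all matches within the inclusion criteria, rather than only the top hit, can be illustrated with a minimal sketch. It assumes a simple greedy cosine similarity, a precursor tolerance in ppm, and an absolute fragment tolerance in Da; the `Spectrum` class, the function names, and the default tolerances are hypothetical choices for illustration and do not reproduce the scoring of any specific search tool. The default score and peak-count thresholds follow the common heuristics of a cosine similarity of at least 0.7 and at least 6 matching peaks.

```python
import math
from dataclasses import dataclass


@dataclass
class Spectrum:
    """Hypothetical container for an MS/MS spectrum."""
    name: str
    precursor_mz: float
    peaks: list  # list of (m/z, intensity) tuples


def ppm_error(observed_mz, reference_mz):
    """Signed mass deviation in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def cosine_score(query, reference, fragment_tol=0.02):
    """Greedy cosine similarity; returns (score, number of matched peaks)."""
    # Collect all peak pairs whose m/z values agree within the tolerance.
    pairs = []
    for i, (mz_q, int_q) in enumerate(query.peaks):
        for j, (mz_r, int_r) in enumerate(reference.peaks):
            if abs(mz_q - mz_r) <= fragment_tol:
                pairs.append((int_q * int_r, i, j))
    # Assign the strongest pairs first; each peak is used at most once.
    pairs.sort(reverse=True)
    used_q, used_r = set(), set()
    dot, n_matched = 0.0, 0
    for product, i, j in pairs:
        if i in used_q or j in used_r:
            continue
        used_q.add(i)
        used_r.add(j)
        dot += product
        n_matched += 1
    norm_q = math.sqrt(sum(inten ** 2 for _, inten in query.peaks))
    norm_r = math.sqrt(sum(inten ** 2 for _, inten in reference.peaks))
    if norm_q == 0 or norm_r == 0:
        return 0.0, 0
    return dot / (norm_q * norm_r), n_matched


def search_library(query, library, precursor_tol_ppm=10.0, fragment_tol=0.02,
                   min_score=0.7, min_matched=6):
    """Return ALL library hits that pass the inclusion criteria, best first."""
    hits = []
    for ref in library:
        if abs(ppm_error(query.precursor_mz, ref.precursor_mz)) > precursor_tol_ppm:
            continue
        score, n_matched = cosine_score(query, ref, fragment_tol)
        if score >= min_score and n_matched >= min_matched:
            hits.append((ref.name, score, n_matched))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

When several hits with near-identical scores are returned, for instance for isomers that share the same fragments, the ranked list itself signals that the annotation should be reported at the molecular family level rather than as a single structure.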

In the best case, a library MS/MS spectrum should be measured from only a single, pure reference compound. In practice, during large-scale spectral library generation efforts multiple reference compounds are measured simultaneously to minimize the data acquisition time that is needed. Although it is typically ensured that no near-isobaric compounds are simultaneously measured during such multiplexing of reference compounds to avoid potential confusion when annotating the library spectra, interference during MS data acquisition might still occur. Additionally, other typical quality considerations for mass spectrometry experiments (Bittremieux et al., 2018b), such as the presence of contaminants, carry-over, and other factors that can influence the data can impact spectral library generation.

However, it is also important to be mindful of the biases associated with using pure reference compounds to generate spectral libraries. First, this requires a physical specimen of the pure compound, obtained from commercial sources or through laborious purification of biological samples. Unfortunately, the majority of biological molecules whose structures have been elucidated are not readily available for purchase. An example of this bias is the disproportionately large number of unique matches to medicines and drugs when analyzing human fecal samples, while there are far fewer matches to microbial metabolites, which are not well represented in reference spectral libraries. A second type of bias is via the adduct that is chosen for fragmentation. For example, protonated and sodiated adducts are most frequently considered, with two thirds of positively charged MS/MS spectra in the MoNA and GNPS spectral libraries corresponding to protonated adducts (Fig. 3a-b). However, many other adducts can be formed as well, especially during analysis of heterogeneous biological samples. Therefore, unless a complex background matrix is added to the pure standards, it is likely that an adduct that is observed in an experiment may not have been measured while generating library spectra from a reference compound. This is illustrated by the “ion identity molecular networking” approach, which was recently used to create a propagated spectral library that exhibits a broader coverage of different adducts, multimers, and in-source fragments (Fig. 3c) (Schmid et al., 2021). Nevertheless, because ion identity molecular networking can only find predefined ion forms, and we generally do not know the distribution and diversity of all ion forms that exist yet, several unanswered questions remain. For example, how many ions are protonated, sodiated, or acetonitrile-ammonia ion forms? How many ions are magnesium adducts, heterodimers, or other ion forms that are currently not considered? 
To alleviate these biases, although this is typically not performed, library spectra could be acquired by running pure reference compounds with a more representative background or in a biological matrix and unbiased searches need to be performed to find all ion forms of the standards. Alternatively, as is possible on the GNPS ecosystem, researchers that are experts in the biological systems under investigation can annotate experimental MS/MS spectra directly and add them to the reference libraries.
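To make the limitation of predefined ion forms concrete, the following minimal sketch computes the expected m/z of a few common positive ion forms from a neutral monoisotopic mass and checks which of them can explain an observed precursor. The ion list, the rounded mass shifts, the function names, and the example mass are assumptions chosen for illustration; any ion form that is absent from such a list, such as a magnesium adduct or a heterodimer, would simply go unexplained.

```python
# Illustrative ion forms: (number of molecules, charge, mass shift in Da).
# Mass shifts are approximate values rounded to five decimals.
ION_FORMS = {
    "[M+H]+": (1, 1, 1.00728),
    "[M+NH4]+": (1, 1, 18.03383),
    "[M+Na]+": (1, 1, 22.98922),
    "[M+K]+": (1, 1, 38.96316),
    "[M+2H]2+": (1, 2, 2 * 1.00728),
    "[2M+H]+": (2, 1, 1.00728),
}


def ion_mz(neutral_mass, ion_form):
    """Expected m/z of a neutral monoisotopic mass for a given ion form."""
    n_molecules, charge, shift = ION_FORMS[ion_form]
    return (n_molecules * neutral_mass + shift) / charge


def explain_precursor(observed_mz, neutral_mass, tol_ppm=10.0):
    """List the predefined ion forms that match an observed precursor m/z."""
    matches = []
    for ion_form in ION_FORMS:
        expected = ion_mz(neutral_mass, ion_form)
        if abs(observed_mz - expected) / expected * 1e6 <= tol_ppm:
            matches.append(ion_form)
    return matches


# Usage with an illustrative neutral monoisotopic mass (an assumption).
example_mass = 257.1627
for form in ION_FORMS:
    print(f"{form}: {ion_mz(example_mass, form):.4f}")
```

An empty result from `explain_precursor` does not imply that the precursor is unrelated to the compound; it only shows that none of the predefined ion forms applies.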

Fig. 3

Distribution of ion adducts in public spectral libraries. The majority of positive ion mode MS/MS spectra in MoNA (a) and GNPS (b) are protonated, while other adducts, in-source fragments, multiply charged species, and multimers are minimally represented. c Ion identity molecular networking was used to extract novel reference MS/MS spectra that exhibit overall broader coverage of different adducts, multimers, and in-source fragments (Schmid et al., 2021). Note that these ion forms are found with a predefined inclusion list, rather than a comprehensive search for all ion forms that might be present in untargeted metabolomics data of a biological sample

Another important, yet often overlooked aspect when evaluating spectral library searching results is the confidence that is ascribed to the original library annotation. If an original library spectrum is incorrectly annotated, this error will propagate through all future studies that find a match to this library spectrum. Consequently, even when a match is obtained, the researcher should still make sure that this makes sense in the context of their experiment. Therefore, it is paramount that library spectra are of the highest possible quality and that their provenance is tracked, so that the end user can understand the origin of their spectral library annotations. Having a clear understanding of the provenance of reference MS/MS spectra is especially relevant when spectral libraries are crowd-sourced, with spectral data coming from heterogeneous sources with potentially differing quality levels, although mistakes have also been found in commercial spectral libraries.

To assign quality levels to MS/MS library spectra, several community resources, including GNPS and MoNA, use a rating system. For example, on GNPS, library spectra are categorized based on the source of the MS/MS spectra. “Gold” spectra are derived from synthetic samples that have been characterized using mass spectrometry and an orthogonal analytical method, such as NMR or crystallography, and can only be contributed by privileged users; “silver” spectra are obtained from an isolated or lysate/crude sample with a scientific publication confirming the presence of the molecule in the sample; and “bronze” spectra are other experimental MS/MS spectra that provide evidence for putative or partial annotations. Finally, there are “in silico” spectra that have been produced using computational approaches. The latter are not selected by default when performing spectral library searching using GNPS, however, as we believe that such spectra should be used with extreme caution and generally only give insights into molecular families rather than specific identities. GNPS also allows users to update spectral library annotations if the original submission contained limited details (e.g. someone may have denoted the spectrum as a saccharide but further insights revealed that the specific molecule is azithromycin, or the original submission did not include the molecular structure which was subsequently added) or to correct previously misassigned library spectra. In these cases, the GNPS system always retains a complete record of the full annotation history. Additionally, GNPS allows users to rate the quality of MS/MS matches from spectral library searching using four star (correct), three star (likely correct, e.g. could also be isomers with similar fragmentation patterns), two star (unable to confirm the annotation due to limited information), and one star (incorrect) ratings. 
MoNA assigns a five-star quality rating to all spectra based on the amount of metadata that was provided (ionization mode, instrument model, collision energy, liquid chromatography details, etc.), and top rated MS/MS spectra and a leaderboard of their submitting users are advertised on the MoNA homepage. Additionally, users can rate spectra as being either “clean” or “noisy.” These rating approaches allow users to manage expectations based on the evaluation of the library spectra so that they can make informed decisions based on the veracity of an MS/MS spectrum match, as well as provide feedback to help improve library annotations.

As an example of spectral library curation, a strategy for inter-library comparison was described to detect mis-annotated outliers by visual inspection based on an extensive checklist of potential issues (Wallace et al., 2017). Manual quality annotations have limited scalability, however, as they depend on scarce expert user knowledge and require a significant time investment. Because such domain experts often produce very trustworthy manual spectrum annotations and their expertise is not (yet) translated into community knowledge, this represents a unique opportunity to further improve the quality of spectral libraries. Alternatively, some computational approaches for MS/MS spectral library assessment have been proposed, including spectral entropy. Entropy is often likened to the disorder of a system. For example, there are more disorderly states in which a deck of cards can occur in random order (high entropy) than those in which the deck occurs in sorted order (low entropy). Spectral entropy (Li et al., 2021) was recently proposed as a measure to assess the quality of MS/MS spectra, with lower-quality spectra receiving higher spectral entropies. For example, there are small differences in the spectral entropy distributions of the highly curated NIST spectral library and more heterogeneous spectral libraries from MoNA and GNPS (Fig. 4). Nevertheless, we would argue against a simple maximum spectral entropy cut-off to determine whether MS/MS spectra are of sufficient quality. There is a strong (nonlinear) relationship between spectral entropy and the number of fragment ions, with MS/MS spectra that contain only a few fragment ions getting low spectral entropy scores (Li et al., 2021). Although such spectra might arguably be of higher quality and “cleaner,” this could also indicate that some of the spectra with low spectral entropy contain insufficiently discriminative fragmentation information to achieve sensitive MS/MS annotation. 
Nevertheless, spectral entropy is an interesting criterion to support the automated quality assessment of MS/MS spectra, which forms an open challenge that warrants additional research.
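The dependence of spectral entropy on the number of fragment ions can be seen from a minimal sketch. It assumes that spectral entropy is the Shannon entropy (natural logarithm) of the fragment intensities after removing peaks below 1% of the base peak and normalizing the remaining intensities to sum to one; the function names and the noise threshold parameter are illustrative.

```python
import math


def normalize_intensities(intensities, noise_fraction=0.01):
    """Drop peaks below a fraction of the base peak, then scale to sum to one."""
    if not intensities:
        return []
    base_peak = max(intensities)
    kept = [i for i in intensities if i > 0 and i >= noise_fraction * base_peak]
    total = sum(kept)
    return [i / total for i in kept]


def spectral_entropy(intensities, noise_fraction=0.01):
    """Shannon entropy of the normalized fragment intensities."""
    probabilities = normalize_intensities(intensities, noise_fraction)
    return -sum(p * math.log(p) for p in probabilities)


# A spectrum with n equally intense peaks reaches the maximum entropy ln(n),
# so the attainable entropy grows with the number of fragment ions.
for n_peaks in (1, 2, 5, 20):
    print(n_peaks, round(spectral_entropy([100.0] * n_peaks), 3))
```

Because a spectrum with a single fragment ion always has an entropy of zero under this definition, a low entropy by itself cannot distinguish a clean, informative spectrum from one that carries too little fragmentation information.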

Fig. 4

Spectral entropy distributions for the GNPS, MoNA, and NIST20 spectral libraries. GNPS consists of 497,137 MS/MS spectra from the “ALL_GNPS_NO_PROPOGATED” library (downloaded on 2022-09-08), MoNA contains 145,361 MS/MS spectra from the “LC–MS/MS Spectra” collection (downloaded on 2022-09-08), and NIST20 consists of 1,026,712 MS/MS spectra (high-resolution MS/MS collection). Spectra were processed by removing noise peaks below 1% of the base peak intensity and normalizing fragment intensities to sum to one. a There is a strong relationship between spectral entropy and the number of fragment ions (Spearman correlation 0.963). b Although the NIST20 library contains smaller molecules than GNPS and MoNA, the difference in entropy distributions cannot be directly explained by the weight of the molecules (Spearman correlation 0.095)

Besides the quality of the library spectra, the veracity of the matches between library spectra and experimental spectra is also essential in determining whether to accept spectrum annotations. Typically, spectrum annotations are accepted based on common heuristics, such as a minimum cosine similarity of 0.7 and a minimum of six matching peaks (Scheubert et al., 2017; Wang et al., 2016). However, such heuristics do not provide a statistical confidence estimate of the spectrum annotations, and as such, the numbers of false positives (i.e. incorrectly accepted high-scoring annotations) and false negatives (i.e. missed low-scoring annotations) are unknown. Although not yet widely used, there are emerging strategies for estimating the false discovery rate of MS/MS spectrum annotations. For example, the Passatuto approach constructs a decoy library by modifying MS/MS library spectra based on re-rooted fragmentation trees to enable estimating false discovery rates using a target–decoy strategy (Scheubert et al., 2017). This allows researchers to accept spectrum annotations at a controlled false discovery rate, such that they can decide how many incorrect matches they are willing to include in their results. Although a few other methods to control false discovery rates in metabolomics have been introduced (Palmer et al., 2016; X. Wang et al., 2018a, 2018b; Alka et al., 2022), none are currently routinely used. Statistical control of MS/MS spectrum annotations is an important research area to explore further in order to advance untargeted metabolomics into a highly scalable quantitative technique, and we anticipate that such tools will become routinely accessible in emerging MS/MS-based spectrum annotation software.
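
The difference between heuristic filtering and statistical control can be sketched as follows. This is an illustrative example only: the greedy peak matching, the fragment tolerance `tol`, and all function names are assumptions, and the q-value calculation is a generic target–decoy estimate that presumes a decoy library is already available. It does not reproduce how Passatuto constructs decoys from re-rooted fragmentation trees.

```python
import math


def cosine_match(spec_a, spec_b, tol=0.02):
    """Greedy cosine score between two peak lists [(mz, intensity), ...].

    Returns (score, number of matched peaks). Each peak is matched at most once.
    """
    # Candidate peak pairs within the m/z tolerance, largest product first.
    pairs = sorted(
        ((ia * ib, i, j)
         for i, (ma, ia) in enumerate(spec_a)
         for j, (mb, ib) in enumerate(spec_b)
         if abs(ma - mb) <= tol),
        reverse=True,
    )
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm = (math.sqrt(sum(i * i for _, i in spec_a))
            * math.sqrt(sum(i * i for _, i in spec_b)))
    return (dot / norm if norm else 0.0), len(used_a)


def passes_heuristic(score, n_matched, min_score=0.7, min_peaks=6):
    """Common heuristic: cosine of at least 0.7 and at least six matched peaks."""
    return score >= min_score and n_matched >= min_peaks


def target_q_values(matches):
    """Generic target-decoy q-values for (score, is_decoy) tuples.

    Returns (score, q_value) for target matches, sorted by descending score.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when accepting everything down to this score.
        fdrs.append(decoys / max(targets, 1))
    # q-value: lowest FDR at this score threshold or any more permissive one.
    q_values, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return [(s, q) for (s, d), q in zip(ranked, q_values) if not d]
```

With the heuristic, every match above the thresholds is accepted without knowing how many are wrong. With q-values, a researcher can instead keep all target matches below a chosen error rate, for example `[s for s, q in target_q_values(matches) if q <= 0.05]`.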

4 Spectral libraries as a source of machine learning training data

Besides their primary function for spectrum annotation, spectral libraries are also an extremely valuable resource to develop machine learning approaches for the analysis of mass spectrometry data (Kelchtermans et al., 2014). In proteomics, the availability of high-quality spectral libraries that can be used as large-scale training data has spurred the development of several innovative deep learning tools. For example, Prosit is a deep neural network that was trained on the ProteomeTools library to learn peptide fragmentation patterns and predict MS/MS fragment intensities with high fidelity (Gessulat et al., 2019). MS/MS spectra predicted by Prosit, as well as related tools that were developed in a similar fashion (Tiwary et al., 2019; Xu et al., 2020; X.-X. Zhou et al., 2017), are now regularly used in lieu of experimental spectral libraries, for example, for the analysis of DIA data without the need to acquire a custom spectral library in advance. This illustrates how the important effort of synthesizing and measuring peptide standards provides continuing benefits outside of the original study by enabling the development of deep learning methods that can be used to simulate highly accurate MS/MS spectra for novel peptides to complement experimental spectral libraries. Similarly, MassIVE-KB was recently used to develop the GLEAMS neural network that can efficiently process hundreds of millions of MS/MS spectra at the repository scale to explore the dark proteome (Bittremieux et al., 2022b).

Small molecule spectral libraries are also used as the basis of computational tool and resource development in metabolomics (Krettler & Thallinger, 2021). For example, fragmentation patterns of acylcarnitines were derived from the NIST spectral library using the hybrid search strategy, which could then be used to extract and validate additional related acylcarnitine MS/MS spectra (Yan et al., 2020). The GNPS nearest neighbor suspect spectral library was created in a data-driven fashion by molecular networking of hundreds of millions of public MS/MS spectra on the GNPS repository in combination with reference MS/MS spectra in the GNPS community spectral libraries, and is a unique resource that provides insights into common modifications that molecules can undergo (Bittremieux et al., 2022a). Additionally, high-quality annotated MS/MS spectra in open spectral libraries are increasingly being used to train and validate machine learning methods in metabolomics (Krettler & Thallinger, 2021; Liu et al., 2021). For example, they can be used to learn relationships between MS/MS patterns and molecular (sub)structures (e.g. MS2LDA (van der Hooft et al., 2016), MESSAR (Liu et al., 2020)), develop machine learning-inspired spectrum similarity scores (e.g. Spec2Vec (Huber et al., 2021a), MS2DeepScore (Huber et al., 2021b), SIMILE (Treen et al., 2022)), simulate MS/MS spectra (e.g. CFM-ID (F. Wang et al., 2021)), and predict spectrum annotations (e.g. CSI:FingerID (Dührkop et al., 2015), COSMIC (M. A. Hoffmann et al., 2021), MassGenie (Shrivastava et al., 2021)).

These are inspiring examples of computational advances that are beginning to define the next generation of metabolomics analysis capabilities, which could not have been developed without the availability of comprehensive and high-quality open spectral libraries. Although these are already exciting advances in their own right, we believe that this is only the beginning of a more data-driven approach to computational metabolomics. Especially with the emergence and commodification of deep learning approaches, the availability of large training data is paramount to achieve optimal performance. Deep learning is an extremely powerful class of machine learning models that especially excels in deriving complex patterns from massive amounts of data and discovering otherwise hidden data structures (LeCun et al., 2015). However, care must be taken, as with any learning approach, that the analyses are reproducible and that findings are carefully validated using follow-up studies (Gibney, 2022). In other words, computational tools, including those based on statistics or machine learning, can help investigators formulate hypotheses, but it is critical that any discoveries made with such tools are confirmed using follow-up experiments designed to refute the hypotheses. With this caveat in mind, we anticipate that the continued growth of public spectral libraries will further power the development of creative machine learning and other computational solutions that add to the researcher’s arsenal for understanding the rich data content that untargeted metabolomics provides.

5 Discussion

Spectral libraries are essential knowledge bases that form a bridge between the past and future of metabolomics: they capture the historical achievements of the metabolomics community in structure elucidation to empower the next generation of biological insights. Currently there are two prevalent strategies towards spectral library dissemination: as a commercial product or freely available for public use. Although commercializing spectral libraries can be appealing to offset the significant costs associated with generating them, open spectral libraries that can freely be used and reused provide a larger community benefit to advance science, by enabling biological discoveries and supporting the development of the next generation of computational and machine learning tools. We anticipate that with the ongoing shift towards open science and data FAIRness (Findable, Accessible, Interoperable, Reusable), open spectral libraries will keep growing in the near future to form increasingly comprehensive resources for the metabolomics community.

There are still some challenges associated with generating and using spectral libraries in metabolomics, however. Many spectral libraries have a considerable amount of missing information. When compiling crowd-sourced spectral libraries, there is a trade-off between requiring that all metadata has been unambiguously specified, which entails an additional time commitment and complexity for users submitting their data, and freely accepting contributions. The former results in a higher barrier towards contributing data to community spectral libraries, leading to smaller spectral libraries, while the latter results in less defined spectral libraries. As it is very challenging to completely eliminate all mistakes from spectral libraries, it is of the utmost importance to understand the provenance of spectral library matches. This allows the end user to make informed judgment calls to decide whether the matches should be followed up in subsequent experiments. Popular community spectral libraries, such as GNPS and MoNA, address this dichotomy by using a multi-faceted ranking system to rate individual MS/MS spectra, contributing users, and MS/MS assignments. Furthermore, a critical evaluation of any results by the user, irrespective of the spectral library source, is essential.

Despite their impressive growth in the past few years (Kind et al., 2018; Peisl et al., 2018; Stein, 2012; Vinaixa et al., 2016; Xue et al., 2020), spectral libraries still only cover a minor part of the known chemical space. For example, PubChem (Kim et al., 2021) currently contains information for 112 million unique compounds (September 2022), whereas all metabolomics spectral libraries combined account for less than 1% of those molecules. As spectral library searching can only annotate known molecules with reference MS/MS spectra or related molecules using analog searching, “unknown unknowns,” for which experimental MS/MS spectra do not match any of the reference spectra included in the spectral library, cannot be identified (Stein, 2012). Some spectral library providers have started to integrate in silico MS/MS spectra alongside experimental MS/MS spectra to partially address this issue. As spectrum prediction tools continue to improve, this could be a viable strategy to expand the coverage of spectral libraries. At present, however, we strongly urge caution when accepting annotations based on simulated spectra only. It is still easiest to assess whether an MS/MS spectrum match is acceptable based on the user’s search criteria through manual inspection of experimental MS/MS data. Rather than simulating MS/MS spectra for all 112 million compounds in PubChem, we anticipate that in silico spectra could be a valuable addition for a subset of specific molecular families for which the performance and quality of spectrum prediction tools are well understood and have been carefully validated. 
For example, high-fidelity peptide mass spectra simulated by deep learning-powered spectrum prediction tools are being increasingly incorporated into various proteomics bioinformatics workflows (Gessulat et al., 2019), and the LipidBlast library, which consists of approximately half a million simulated MS/MS spectra, is available through MoNA to annotate lipids (Kind et al., 2013).

Furthermore, there is a mismatch between the compounds included in spectral libraries and the MS/MS spectra observed in experimental data. For example, out of 586,647 MS/MS spectra present in the GNPS community spectral libraries, 22% have been found in experimental datasets deposited to GNPS (Fig. 2a). This indicates that many of the compounds represented in reference libraries are not observed in metabolomics experiments, or that the library MS/MS spectra were created in a different fashion than for experimental data, such as when the preferred ion form is not included (Fig. 3). Notably, even as the public libraries have grown spectacularly over the previous decade, the rate of matched library spectra has remained relatively consistent. This illustrates the previously described bias in the commercial availability of pure reference compounds that are typically used for spectral library creation efforts. It also indicates that many relevant biological compounds are currently still missing from available spectral libraries, and that careful prioritization of the reference compounds to include is an important aspect of generating spectral libraries that provide maximum benefit.

An emerging approach towards creating comprehensive metabolomics knowledge bases is to expand upon traditional spectral libraries by integrating controlled and structured metadata information alongside the mass spectral data. “Reference data-driven metabolomics” uses not only annotated MS/MS spectra, but also all unannotated spectra in combination with metadata-annotated source data (e.g. were the samples derived from foods, personal care products, medications, etc.) as a pseudo spectral library (Gauglitz et al., 2022). This strategy was exemplified by linking approximately 100,000 MS/MS spectra to 3,600 foods. A key aspect of this approach is that the foods are organized in a hierarchical ontology to enable granular downstream analyses of the food origins. For instance, an example path in the food hierarchy consists of “fruit → citrus → lemon → pink lemon.” This enables performing analyses akin to microbiome science, in which the data may be interpreted at the class, genus, species, or even strain level depending on the research question at hand. Although this approach does not produce exact molecular identities, it provides essential insights into the origin of the data by matching against the reference source data, such as food. Reference data-driven metabolomics using the GNPS platform can increase the number of interpreted MS/MS spectra by up to an order of magnitude, and it has been used to obtain empirical assessments of dietary patterns from untargeted metabolomics data (Gauglitz et al., 2022). Metadata-driven analyses can be broadly applied beyond diet readouts to also investigate other exposures (e.g. medications, personal care products, agrichemicals), disease phenotypes, organ system distributions, taxonomic matching, and many other uses. The key aspect that empowers reference data-driven metabolomics is that the spectral data are linked to controlled and curated metadata to be used as a pseudo-reference library. 
A less flexible but related metadata system is available for GC–MS data using BinBase, which covers a limited set of metadata (Lai et al., 2017).
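
To make the role of the hierarchical ontology concrete, the following minimal sketch shows how spectrum matches could be summarized at different levels of a food hierarchy. Apart from the “fruit → citrus → lemon → pink lemon” path mentioned above, all ontology paths, the `rollup` function, and the data layout are hypothetical and only serve to illustrate the idea. They do not reflect the actual implementation on the GNPS platform.

```python
from collections import Counter

# Each entry is the ontology path of a reference sample whose MS/MS spectrum
# matched a spectrum in the study data (hypothetical matches for illustration).
matched_paths = [
    ("fruit", "citrus", "lemon", "pink lemon"),
    ("fruit", "citrus", "lemon"),
    ("fruit", "citrus", "orange"),
    ("vegetable", "root", "carrot"),
]


def rollup(paths, level):
    """Count matches per ontology node at the given depth (1 = most general).

    Paths that are shallower than the requested level are skipped, because
    they carry no information at that level of detail.
    """
    counts = Counter()
    for path in paths:
        if len(path) >= level:
            counts[path[:level]] += 1
    return counts


broad = rollup(matched_paths, 1)   # e.g. fruit versus vegetable
detailed = rollup(matched_paths, 3)  # e.g. lemon versus orange
```

The same matches can thus be interpreted coarsely or finely depending on the research question, analogous to choosing between class-level and strain-level resolution in microbiome analyses.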

Although applied to proteomics, a related approach consists of “spectral archives,” which include MS/MS spectra that have been repeatedly observed, irrespective of whether they could be annotated (Frank et al., 2011). Spectral archives can be built by large-scale clustering of MS/MS spectra across multiple datasets or in an entire repository (Bittremieux et al., 2022b; Frank et al., 2008; Griss et al., 2013, 2016). In this fashion, commonly observed spectra can be grouped and unannotated spectra can be linked across multiple experiments to find correlations with identified compounds (Stein, 2012).

There are also challenges associated with the ever-increasing size of spectral libraries. First, larger libraries are more difficult to process, and better compute infrastructure and optimized algorithms are necessary to handle them (Bittremieux et al., 2018a, 2019). Cloud-based solutions, such as the GNPS analysis platform, have the potential to be extremely scalable while hiding this complexity from the user. For example, GNPS allows users to query their data against 1.2 billion open MS/MS spectra using the Mass Spectrometry Search Tool (MASST) to discover public datasets that contain similar MS/MS spectra (M. Wang et al., 2020; West et al., 2022). Developing and maintaining such platforms requires suitable, continued investments and a team willing to maintain the resources for the benefit of the community. The same is also true for MetaboLights (Haug et al., 2013), Metabolomics Workbench (Sud et al., 2015), HMDB (Wishart et al., 2021), MetaboAnalyst (Pang et al., 2022), MZmine (Pluskal et al., 2010), MS-DIAL (Tsugawa et al., 2020), and other popular untargeted metabolomics resources. To overcome some of these challenges, subscription models or commercial libraries such as NIST, mzCloud, or METLIN Gen2 continue to be needed. Second, interoperability of various tools and resources is important. No official data standard for spectral libraries exists yet. Frequently used spectral library file formats include the mzML (Martens et al., 2011), mzXML (Pedrioli et al., 2004), Mascot Generic Format (MGF), and NIST MSP formats. Unfortunately, some of these formats are only loosely defined and change over time, often without explicit versioning, and spectrum metadata can be encoded in various non-standardized ways, limiting the usability and portability of such spectral library files. 
The Proteomics Standards Initiative of the Human Proteome Organization (HUPO-PSI) (Deutsch et al., 2017), which has previously developed fundamental mass spectrometry data standards such as the mzML peak file format (Martens et al., 2011), is currently working on a specification for spectral libraries (https://github.com/HUPO-PSI/mzSpecLib/). Although the HUPO-PSI primarily develops data standards for proteomics, many of their efforts are relevant for any application of biological mass spectrometry. The HUPO-PSI working groups are open to any community contributions, and interested parties are encouraged to engage in the development of this nascent spectral library format to ensure its full compatibility with applications in metabolomics.
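
To illustrate the portability problem, the following is a minimal sketch of reading spectra from MGF-formatted text. It only relies on the basic layout of the format (spectra delimited by `BEGIN IONS` and `END IONS`, `KEY=value` parameter lines, and whitespace-separated m/z and intensity pairs). The function name, the returned data structure, and the example values are assumptions for illustration, and real-world files may deviate from this layout.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectra.

    Each spectrum is a dict with free-form 'params' (str -> str) and
    'peaks' as a list of (mz, intensity) tuples.
    """
    spectra, current = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comment lines.
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Parameter line: keys are not standardized across libraries,
                # so they are stored as-is (upper-cased) without validation.
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up example record; the values carry no chemical meaning.
example = """BEGIN IONS
PEPMASS=258.17
CHARGE=1+
NAME=example compound
85.03 100.0
60.08 12.5
END IONS
"""
library = parse_mgf(example)
```

Because the parameter keys are free-form, a consumer of such a file cannot know in advance whether, for instance, the compound name or structure is stored under one key or another. A formal spectral library specification would remove this ambiguity.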

In conclusion, we want to re-emphasize the exciting times ahead for spectral libraries in metabolomics. The community has become increasingly aware that capturing metabolomics knowledge in the form of reference MS/MS spectra accelerates discoveries. Existing spectral libraries have grown tremendously in the past few years, and we expect this growth to continue. Bigger libraries, especially those that are freely available for community use, will enable researchers to get more and better annotations from their data and achieve important biological insights. Additionally, it will be possible to develop increasingly powerful machine learning algorithms by training them on large spectral libraries. As some of these machine learning tools will improve the annotation rate in metabolomics and derive more value from existing and new data, this will make it possible to annotate new high-quality MS/MS spectra for inclusion in the next iteration of spectral libraries. As such, the growth of open spectral libraries and development of machine learning tools will proceed in lockstep to power a virtuous cycle and advance metabolomics in the upcoming years.