Introduction

Data processing workflows in non-targeted analysis (NTA) and suspect screening analysis (SSA) routinely identify a small percentage (often <5%) of likely chemical compounds in environmental samples [1, 2]. Improvements in compound identification can enhance exposure assessment, especially when the use of confirmation standards is not practical or possible (at the “tentative” or “probable” degrees of certainty [35]). Online reference databases can be useful for identifying “known unknowns” by searching intrinsic properties, specifically molecular formula and monoisotopic mass, and rank-ordering by the number of associated references or data sources [6, 7]. In this process, the most likely candidate “known unknowns,” which are those compounds unknown to a researcher but known in a reference dataset or resource, are elevated to the top of a search results list. Researchers have previously reported that the freely available chemical database ChemSpider (http://www.chemspider.com/) [8, 9] proved more useful than the Chemical Abstract Service (CAS) RegistrySM when identifying known unknowns, with a key distinction of ChemSpider being the ability to search by monoisotopic mass [7]. Since this initial work, additional studies have reported using ChemSpider (among other databases) to support structure identification [1013]. However, to enhance compound identification strategies, calls have also increased for improvements to open reference databases and analysis workflows (including “one-pass analysis”), and for public sharing of mass spectral data [2, 10, 11, 14].

The United States Environmental Protection Agency (US EPA) is developing a public resource for computational chemistry, toxicology, and exposure research efforts. This freely available resource, known as the CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard; hereafter referred to as the Dashboard), is part of a suite of databases and applications developed by the National Center for Computational Toxicology (https://www.epa.gov/aboutepa/about-national-center-computational-toxicology-ncct), and integrates data from the Distributed Structure-Searchable Toxicity (DSSTox) database (DSSTox_v2) [15]. The underlying database has been expanded, with an emphasis on curation and characterizing data quality, to include hundreds of thousands of chemicals. Recent efforts have involved incorporating specific search tools into the Dashboard to benefit NTA. The Dashboard’s current utilities include the ability to search a reference database of ∼720,000 chemicals by monoisotopic mass and molecular formula. In this research, we evaluated the effectiveness of the Dashboard in the identification of known unknowns, comparing results against those from the de facto freely available online database for mass spectrometry based structure identification, ChemSpider, using the same method of rank-ordering of associated references or data sources reported by Little et al. [7]. Determining the utility of the Dashboard relative to the current standard of freely available chemistry databases will benefit future research applications both within the US EPA and the scientific community as a whole by highlighting the effectiveness of tools designed for NTA users with a new, highly curated chemical reference database.

Methods

A total of 162 chemicals were selected for the assessment of the Dashboard using search and data source rank-ordering techniques (see Electronic Supplementary Material (ESM) Table S1). The selected chemicals (n = 162) were compiled from the Little et al. [7] article that initiated this approach for NTA and from recent environmental and NTA literature. Selected chemicals include pharmaceuticals, dyes, surfactants, chemicals used in manufacturing, and personal care products that have been previously reported in environmental media (water [2, 13, 16], wastewater [16], dust [1], etc.). Monoisotopic masses, formulae, and structural identifiers for all chemicals are reported in the ESM (see Table S1).

The workflow of known unknown identification by data source ranking has been previously described [6, 7]. The same workflow was followed here with minor amendments. Using the Advanced Search option in the Dashboard, a user can enter either a defined mass range (i.e., 263.87 to 263.89 amu) or a single mass with an associated error range (i.e., 263.881 ± 0.005 amu), see ESM Figs. S1S6 for more details. Currently, the Dashboard allows for mass search ranges and error to be entered in atomic mass units (amu) only. Therefore, monoisotopic masses of selected chemicals were searched using the Advanced Search tools in both ChemSpider and the Dashboard with an error of 0.005 amu. Most accurate mass measurement instruments can achieve a standard deviation of 5 ppm or better mass error; in order to be applicable for users with a range of accurate mass capabilities, the error window used in this work (0.005 amu) encompasses at least two standard deviations for all but the highest molecular weight chemicals. Advanced Search results were sorted in descending order by the number of associated references (in ChemSpider per Little et al. [7]) or data sources (in the Dashboard). References in ChemSpider are the number of external IDs for a given chemical and data sources in the Dashboard represent the number of times that a dataset in the DSSTox database contains a particular chemical. Prevalence across many data sources and/or references is indicative, in this context, of a chemical’s relative likelihood of occurrence [7]. The rank of each chemical of interest within the search results after sorting was recorded (Fig. 1). The method was repeated in each application using molecular formulae for every chemical of interest to compare results of formula-based searching to those of mass-based searching.

Fig. 1
figure 1

Advanced search results table in the CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard) after an advanced search of monoisotopic mass 228.115 ± 0.005 amu. Results are ranked in descending order by the number of data sources

For a complete comparison, ranking results in both applications of the 89 chemicals from Little et al. [7] were also evaluated independently to explicitly assess the Dashboard relative to the dataset that initiated this approach. Little et al. [7] also evaluated their workflow on a set of large molecular weight unique commercial polymers not included in the set of 89. For continuity of comparison, these 12 compounds were searched and rank-ordered following the above methods separately from the 162 chemicals.

No modifications to the search parameters or software were made during this study. All methods are demonstrated in the ESM (see Figs. S1S6) and can be repeated in the publicly available Dashboard. Searches were executed in both applications in July 2016. Statistical analyses were conducted in the R Statistical Computing Environment [17].

Results and discussion

Overall rank-ordering

The goal of rank-ordering unidentified chemicals using their monoisotopic mass or molecular formula is to bring the most likely candidate chemicals to the top of the list for either tentative identification or further investigation. Entering monoisotopic masses with an error range of 0.005 amu and ranking by data sources, the average position rank of all 162 chemicals in the Dashboard was 1.31 with the number 1 rank occurring 85% of the time (Table 1). Using ChemSpider, the average position rank across all chemicals was 2.20 with the number 1 rank occurring in 70% of the 162 searches (this average includes the removal of an outlier where the rank of one particular chemical was 201); average position rank in the Dashboard was significantly lower than in ChemSpider (Mann-Whitney U test, p = 0.0005). Formula-based searching yielded improved ranking statistics, consistent with what has been previously reported in the literature [7]. Mean rank position and percentage of chemicals occurring in the number one position improved when searching molecular formulae in both applications and independently, the Dashboard significantly outperformed ChemSpider (p = 0.0083, Table 1 and ESM Tables S2S3). Interestingly, mass-based searching in the Dashboard resulted in similar mean rank position and a higher percentage of chemicals in the number one rank position than formula-based searching using ChemSpider. Chemical formula assignment can vary in certainty with varying mass accuracy. As mass accuracy declines, more potential formulae can be generated from the same monoisotopic mass, introducing more error to formula assignment. Therefore, skipping the step of formula generation and assignment before chemical identification would represent an ideal situation leading toward a one-pass analysis [11]. These data indicate that for the chemicals included in this study, it is just as reliable to directly search the Dashboard using a monoisotopic mass than it would be to attempt to first generate a formula and search ChemSpider using the formula.

Table 1 Summary statistics of rank-ordering all 162 chemicals using data sources or associated references in both the CompTox Chemistry Dashboard and in ChemSpider

Rank-ordering of chemical class

The two largest classes of compounds compiled for this study were pharmaceutical drugs and industrial chemicals. When searching monoisotopic masses, 82 and 76% of pharmaceutical drugs ranked number one using the Dashboard and ChemSpider, respectively (Tables 2 and 3). Pharmaceutical drugs are increasingly important in environmental NTA and risk assessment due to their ubiquitous presence in water and other environmental media [18, 19], and correctly identifying these compounds is important to document for researchers in environmental and human health risk assessment. Greater than 80% of the chemicals in several other compound classes ranked number one using mass-based searches in the Dashboard, including industrial chemicals, steroid hormones, pesticides, and veterinary drugs (Table 2). For those classes containing more than five chemicals, personal care products resulted in the worst average rank position of searched masses in both ChemSpider and the Dashboard. Two chemicals in particular, paraxanthine, a caffeine metabolite, and hexyl dodecanoate, a skin conditioning emollient, fell outside of the top five rank-ordered results when searched by both mass and formula. In the case of paraxanthine, two other more prevalent metabolites of caffeine precede it in the data source ranking. Hexyl dodecanoate has several constitutional isomers, many of which are also emollients, which rank ahead of it in terms of number of sources. This identifies a potential drawback of this rank-ordering workflow in that metabolites and isomers may not be distinguishable by data source ranking alone.

Table 2 Results of searching by monoisotopic mass and rank-ordering by number of data sources in the CompTox Chemistry Dashboard, listed by compound class
Table 3 Results of searching by monoisotopic mass and rank-ordering by number of associated references in ChemSpider, listed by compound class

Comparison to Little et al. datasets

For continuity and comparison, the 89 chemicals used to document ChemSpider’s utility in known unknown identification were analyzed further (Table 4). On this smaller subset, the Dashboard again significantly outperformed ChemSpider (p = 0.009) when searching monoisotopic mass, and the average rank of molecular formula searches were similar (Table 4). A greater number of chemicals ranked number one when rank ordering after a mass search in the Dashboard than after a formula search in ChemSpider, mirroring what was observed on the entire set of 162 chemicals. However, one chemical within the Little et al. [7] list was present in ChemSpider but not in the Dashboard. Tephrosin, a natural toxin, is not contained within the DSSTox database, and therefore not searchable in the Dashboard. Additionally, ChemSpider’s performance based on this analysis did not match that which was previously reported [7]. Specifically, the number of times each chemical ranked number one when searched by molecular formula declined.

Table 4 Summary statistics and rank-ordered position in the CompTox Chemistry Dashboard and ChemSpider of the 89 compound subset from the Little et al. [7] study

A set of 12 large molecular weight chemical compounds (all MW >600 Da) were evaluated separately from the list of 89 in the initial research by Little et al. [7] to determine identification efficacy of unique commercial polymer additives. For a complete assessment, these 12 compounds were separately evaluated following the same methods. Two of the 12 compounds were absent from the Dashboard while all 12 were contained within ChemSpider (see ESM Table S4). By rank-ordering, all of the compounds in this list that were contained in the Dashboard ranked number 1 by both mass and formula searching. However, this does highlight that chemicals outside the domain of the database are not captured in this method, indicating that for true unknowns, other identification processes need to be incorporated.

The number of entries in ChemSpider has doubled since 2012, from 26 million to 57 million today. More entries can be beneficial (as reflected in the omissions in the Dashboard), but it can also interfere with the identification of likely candidate chemicals as reported in this research (Table 4). This is also true for other resources such as PubChem (presently containing more than 90 million chemicals [20]) as well as the Chemical Abstracts Service (CAS) RegistrySM (containing more than 100 million chemicals). A comparison of the number of possible results returned from formula searches in each platform illustrates this complication (see ESM Table S5). For the formula of piperine (C17H19NO3), PubChem returns 20,000 possible results, ChemSpider returns 9000, and the Dashboard returns 100. Based on data source ranking, piperine was the top result in the Dashboard and the fourth highest in ChemSpider. The Dashboard is being developed with a focus on high-quality data of particular value to the environmental sciences and toxicology communities. Large-scale collections of chemicals extracted from patents and chemical vendor collections are not included in the database as support for these efforts is already provided by PubChem and ChemSpider. This approach leads to a cleaner database allowing for more precise known unknown identification.

Ongoing work

Rank-ordering methods

Additional search and rank-order criteria are presently undergoing testing within the CompTox Chemistry Dashboard for further improvements in known unknown chemical identification. Under the premise of this work and the work of others (e.g., [6, 7]), chemicals of interest in environmental media are likely those with the most sources, or are the most “popular” chemicals. Preliminary results indicate that searching the unique InChIKey identifier of chemicals of interest in Google, and rank-ordering the results by the number of result hits, provides an even more accurate identification than using the Dashboard and data sources. These data could be used to enhance or replace data sources within the Dashboard for known unknown investigations. Additionally, rank-order statistics improve when tightening the search window around a monoisotopic mass. Further research developing a sliding mass search scale based on relative monoisotopic mass (i.e., a smaller search window around a smaller mass) could result in more accurate identification of known unknowns.

To further identify chemicals in environmental media, functional use and product occurrence data, as contained in the US EPA’s CPCat database [21], can be incorporated into searching and rank-ordering. Chemical use and function category data, organized with descriptors such as detergent, food-additive, etc., are currently available in the Dashboard. These data may further inform tentative chemical identification through filtering by use category relative to sample medium or through compiled use ranking metrics; testing in the Dashboard is ongoing. Further research to create a weighting-based or tiered ranking approach for identification using all aforementioned criteria as inputs is underway.

MS-ready structures

Charged and salted forms of chemicals contained within chemical reference databases complicate the search and identification process as these forms are not consistent with the form an analyst would detect via high-resolution mass spectrometry in NTA. As an example, the colorant FD&C Blue No. 1 (or Brilliant Blue FCF) is present in both ChemSpider and the Dashboard as a charged molecule with two sodium ions. Therefore, when searching a neutral unidentified monoisotopic mass on both applications, neither resource would return the chemical identified via NTA. Chemical structure curation and standardization can remove duplicates and inconsistencies in structures to allow for cleaner tentative identification. Mansouri et al. developed chemical structure standardization approaches to create quantitative structure–activity relationship (QSAR)-ready structures for use in estrogenic receptor activity screening [22]. This workflow has since been applied to all chemical structures contained in the DSSTox database and exposed in the Dashboard. QSAR-ready structures are neutral, de-salted, and contain no stereochemistry information, and are consistent with the chemical forms detected in mass spectrometry (when corrected for charge-state). In other words, structures standardized into QSAR-ready form happen to offer us MS-ready structures as a benefit. These will be incorporated into the Dashboard, allowing users to be able to easily identify the associated substances, whether they be salts, associated with solvents of hydration, etc. The ability to search MS-ready structures has already been delivered via an iOS mobile app by making our data freely available from the NCCT website (ftp://newftp.epa.gov/COMPTOX/Sustainable_Chemistry_Data/Chemistry_Dashboard). The m/z EPA CompTox app (https://itunes.apple.com/app/m-z-comptox/id1148436331) is already freely available, thereby providing accessibility for NTA users.

API development

Planned developments for the Dashboard include an application programming interface (API) and access to a suite of web services. Programmatic access will allow third parties to investigate and interrogate the data within the database for their own known unknown analyses. Within an investigation of observed chemical features, a user could include ChemSpider for expansive coverage, the Dashboard for focused high-quality data, and even more focused resources like FOR-IDENT (http://for-ident.hswt.de/) [23] for water-specific analyses, among others. Additional capabilities within the API will enable the user to access and incorporate algorithmically generated mass spectral fragmentation resources and metabolite databases for known unknown chemical identification (including spectral library resources like MassBank [24] and mzcloud [25], in silico fragmentation resources like MetFrag [12, 26], and metabolite databases such as Metlin [27]). Chemical metabolites and degradants in environmental media present a difficult problem from an identification perspective. Using the Dashboard to identify known unknowns in the workflow presented here does not include an avenue for metabolites or fragments. However, linking the Dashboard via web services to the open resources available for algorithmically generated metabolites and mass spectra can advance chemical identification in NTA through structure elucidation and metabolite identification.

Conclusions

The Dashboard is a highly curated freely available online reference database that is an effective investigative tool for the identification of known unknowns. Comparisons with the ChemSpider database, a primary database for mass spectrometrists to utilize for structure identification purposes, show better performance overall for the test sets reported here. Expanding the data, functionality and access to support projects within the EPA, and in the scientific community as a whole, will further demonstrate its utility for risk analysis and general chemical identification both as part of larger, more developed workflows and as a stand-alone investigative tool. Future research on expanded utility employing further chemical identification mechanisms will advance the field of NTA and chemical identification in a public arena for widespread use.