Introduction

In recent years, the focus on organic trace pollutants in the aquatic environment has shifted to an increasing number of polar, highly water soluble compounds. The so-called emerging contaminants (ECs) are environmental pollutants that have not yet been considered in environmental screening programs. Most ECs are released by the discharge of municipal, industrial, and agricultural wastewater into surface waters, and afterwards to other environmental compartments such as soil, air, and groundwater. The importance of the issue of ECs has been shown by a survey of various pharmaceuticals in different surface waters in Europe and the USA, which revealed pharmaceuticals as ubiquitously occurring contaminants [1–8]. Recent reviews have provided an excellent overview of the multitude and variety of newly detected contaminants from domestic, commercial, and industrial use, e.g., artificial sweeteners, perfluorinated compounds, pharmaceuticals, hormones, disinfection by-products, UV filters, brominated flame retardants, benzotriazoles, naphthenic acids, siloxanes, musk fragrances, and transformation products (TPs) [9, 10]. The high number and the wide range of chemical structures which have to be considered pose a major challenge for analytical methods to monitor ECs.

In addition, TPs increase the difficulty of the analytical work owing to their high numbers, their often unknown structures, and their unknown impact and fate in the environment [11–16]. TPs are metabolites from human and animal metabolism, and are also produced when anthropogenic pollutants undergo biological, chemical, and photochemical degradation in the environment or in different steps of water treatment, such as biological or chemical processes (e.g., chlorination, ozonation, and advanced oxidation).

To cope with those challenges, there is much interest in prioritizing ECs on the basis of their occurrence and toxicity data to preselect the most relevant compounds [17]. Multiresidue analysis by liquid chromatography (LC)–tandem mass spectrometry (MS/MS) is the common measurement approach to obtain information on the occurrence and fate of ECs in the aquatic environment [18–26]. The increased availability and application of LC–high-resolution mass spectrometry (HRMS) is a considerable step forward, and over the last few years the technique has developed into a powerful tool for screening of environmentally relevant compounds [10]. Liquid chromatographs coupled to quadrupole time-of-flight (Q-TOF) mass spectrometers and linear ion trap instruments with orbitrap technology [27] are mostly applied in environmental analysis [28–30]. High mass resolving power between 20,000 and 100,000 (at full width at half maximum, FWHM), good mass accuracy (often better than 1 ppm), good isotopic abundance accuracy (3–20 %), and high sensitivity in the picomolar to femtomolar range characterize the new generation of HRMS instruments [30, 31]. It has to be emphasized that these figures of merit are often not directly comparable since they depend on several operating parameters, such as the scan speed and mass range, and the properties of the analytes, such as the ionization efficiency and the molecular weight.

HRMS measurements increase the selectivity for screening of known micropollutants in complex matrices but, more importantly for unknown water contaminants, they allow the elemental composition to be deduced from the accurate mass. The key to the deduction of elemental composition is that isotopes of all chemical elements have different mass defects, as exemplified for selected isotopes in Fig. 1. The mass defect is the difference between the nominal and the exact mass of a chemical element. This can lead to positive or negative mass defects, which theoretically allow one to determine a unique elemental composition, even for very complex mixtures, assuming that the mass spectrometer has a mass resolving power and a mass accuracy in the range of 0.1 mDa [32, 33]. Currently, this is generally not achieved in LC–mass spectrometry (MS) even with high-mass-resolving Fourier transform ion cyclotron resonance mass spectrometers. Current Fourier transform ion cyclotron resonance instruments achieve a mass accuracy of approximately 1 mDa or lower, whereas Q-TOF and orbitrap instruments achieve a range of 2–10 mDa depending on the mass [30, 31].
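As a minimal illustration of the quantities discussed above, the following Python sketch computes the mass defect of a few isotopes and converts between absolute (mDa) and relative (ppm) mass accuracy. The function names are hypothetical, the sign convention (exact minus nominal mass) is an assumption, and the isotope masses are rounded standard values.

```python
# Monoisotopic masses in u (rounded standard values) for a few isotopes.
ISOTOPE_MASS = {
    "1H": 1.007825,
    "12C": 12.000000,
    "14N": 14.003074,
    "16O": 15.994915,
    "32S": 31.972071,
    "35Cl": 34.968853,
}


def mass_defect(exact_mass: float) -> float:
    """Exact mass minus nominal mass (assumed sign convention).

    Rounding gives the nominal mass only for single isotopes, not for
    large molecules, so this helper is meant for the table above.
    """
    return exact_mass - round(exact_mass)


def mda_to_ppm(error_mda: float, mz: float) -> float:
    """Convert an absolute mass error in mDa to a relative error in ppm."""
    return error_mda * 1e-3 / mz * 1e6


def ppm_to_mda(error_ppm: float, mz: float) -> float:
    """Convert a relative mass error in ppm to an absolute error in mDa."""
    return error_ppm * 1e-6 * mz * 1e3


for name, mass in ISOTOPE_MASS.items():
    print(f"{name}: mass defect = {mass_defect(mass) * 1e3:+.3f} mDa")
print(f"1 mDa at m/z 500 corresponds to {mda_to_ppm(1.0, 500.0):.1f} ppm")
```

Hydrogen and nitrogen come out with positive mass defects and oxygen, sulfur, and chlorine with negative ones. Because an error of 1 mDa at m/z 500 corresponds to 2 ppm, absolute and relative accuracy figures can only be compared when the mass is stated.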

Fig. 1

The different mass defects of elemental isotopes enable a unique elemental composition for any molecule to be determined from a sufficiently accurate mass measurement. (Adapted from [32])

Therefore, high isotope accuracy is an important feature when using the isotope pattern as a further criterion to select the most probable molecular formula. One has to keep in mind that some chemical elements which may occur in ECs such as fluorine, phosphorus and iodine are monoisotopic and hence not suitable for this approach.
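The deduction of elemental compositions from an accurate mass can be sketched as a brute-force enumeration. The example below is a simplified illustration, not the algorithm of any particular software package: it is restricted to C, H, N, and O, applies no valence or isotope-pattern rules, and the function name, element limits, and the 5 ppm default are assumptions. The input mass of 236.0950 u is an illustrative value for neutral carbamazepine (C15H12N2O).

```python
# Monoisotopic masses in u (rounded standard values).
MASS = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def candidate_formulae(neutral_mass, ppm_tol=5.0,
                       max_c=30, max_h=60, max_n=6, max_o=6):
    """Enumerate CcHhNnOo compositions within ppm_tol of neutral_mass.

    Returns a list of ((c, h, n, o), error_ppm) sorted by absolute error.
    """
    tol = neutral_mass * ppm_tol * 1e-6
    hits = []
    for c in range(max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = neutral_mass - (c * MASS["C"] + n * MASS["N"] + o * MASS["O"])
                if rest < -tol:
                    continue
                # The tolerance is far below half a hydrogen mass,
                # so only the nearest hydrogen count can fit.
                h = round(rest / MASS["H"])
                if not 0 <= h <= max_h:
                    continue
                error = h * MASS["H"] - rest  # calculated minus measured mass
                if abs(error) <= tol:
                    hits.append(((c, h, n, o), error / neutral_mass * 1e6))
    return sorted(hits, key=lambda hit: abs(hit[1]))


for formula, error_ppm in candidate_formulae(236.0950):
    c, h, n, o = formula
    print(f"C{c}H{h}N{n}O{o}: {error_ppm:+.2f} ppm")
```

In practice the list returned by such an enumeration would be narrowed further, for example by comparing the measured isotope pattern of each feature with the pattern calculated for every candidate formula.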

LC-HRMS has been increasingly used for nontarget screening of contaminants in environmental samples. Here we have to distinguish between different nontarget approaches depending on the information on the contaminants available from the sample and from spectral libraries or chemical compound databases. Recent reviews have discussed different analytical approaches for LC-HRMS screening of environmental micropollutants [30], structure elucidation of small molecules by MS in life sciences [31], processing and analysis of metabolomics data [34], and computer tools for structure elucidation in effect-directed analysis and metabolomics [35, 36]. The scope of this critical review is to discuss the possibilities and limits of nontarget screening of ECs, with a focus on recent applications and developments in data evaluation and identification used for nontarget screening. The applications have been chosen mainly from the area of water contaminants. Computer-based methods from the metabolomic field were selected only if they appear applicable in environmental analysis, and if they are not already included in most LC-MS workstations or software packages. The fundamental concept of this review is to demonstrate successful approaches and limits of nontarget screening for ECs. We do not consider recent improvements in instrumentation and ionization techniques or different screening approaches such as target, suspect, and nontarget screening. Recent developments in these fields can be found elsewhere [30, 31].

Applications of nontarget screening in water analysis

Suspect and nontarget screening approaches cannot be strictly distinguished from each other. Generally, in suspect screening, information on possibly occurring compounds is used for the evaluation of HRMS data, whereas real nontarget screening starts without any a priori information. Figure 2 illustrates the workflow for nontarget screening, and includes several or all of the steps that have been described in the recent literature [28–30, 37–39].

Fig. 2

Workflow for the evaluation of the molecular formulae and identification in nontarget screening

Nontarget screening typically starts with the accurate mass from LC-HRMS measurements followed by data processing steps to remove noise, blanks, or artifacts [40]. Next, automated deconvolution is performed to extract peaks of all possible compounds. The mass peaks of different ions of one compound are often merged to one feature (e.g., [M+H]+, [M+Na]+, [M+NH4]+). The resulting data set is then analyzed using statistical methods to evaluate the most relevant features by comparison of different samples and blanks. From the relevant features, the elemental composition is calculated and the most probable molecular formulae are evaluated by matching the isotope pattern. For identification, the molecular formulae are searched for in MS/MS databases or libraries. The retention time is often used as a further criterion to reduce the number of hits [29]. Identification is achieved when the MS fragmentation and retention time of the unknown compound fit to the library spectrum and the retention time of a reference compound.
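The merging of different ions of one compound into a single feature can be sketched as follows. This is a simplified illustration under stated assumptions: each feature is tested as a possible [M+H]+ ion, co-eluting features at the expected [M+Na]+ and [M+NH4]+ masses are attached to it, and the function name, tolerances, and example features are hypothetical. Real deconvolution software uses considerably more elaborate rules.

```python
# Mass shifts (u) from the neutral monoisotopic mass M to the singly charged ion.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M+NH4]+": 18.033826,
}


def group_adducts(features, mz_tol=0.005, rt_tol=0.1):
    """Group co-eluting adduct ions of the same compound.

    features: list of (mz, retention_time_min) tuples.
    Returns a list of dicts mapping adduct name -> feature index, one dict
    per [M+H]+ hypothesis that is supported by at least one other adduct.
    """
    groups = []
    for i, (mz, rt) in enumerate(features):
        neutral = mz - ADDUCT_SHIFTS["[M+H]+"]  # hypothesis: feature i is [M+H]+
        group = {"[M+H]+": i}
        for name, shift in ADDUCT_SHIFTS.items():
            if name == "[M+H]+":
                continue
            for j, (other_mz, other_rt) in enumerate(features):
                same_mass = abs(other_mz - (neutral + shift)) <= mz_tol
                co_eluting = abs(other_rt - rt) <= rt_tol
                if j != i and same_mass and co_eluting:
                    group[name] = j
                    break
        if len(group) > 1:
            groups.append(group)
    return groups


# Illustrative features: two ions of one compound and one unrelated peak.
features = [(237.1022, 8.50), (259.0842, 8.51), (300.0000, 3.00)]
print(group_adducts(features))
```

Requiring both a matching mass difference and a matching retention time keeps unrelated peaks that happen to differ by an adduct shift from being merged.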

If no match in an MS/MS database or library is available, searches in large chemical databases such as PubChem and ChemSpider are performed. This search generally results in several hundred to several thousand hits for a possible structure. MS fragmentation can be used as a criterion to select the most probable hits. Since the chemical databases generally do not contain any MS data, in silico fragmentation has to be used and then the fragments have to be matched against the measured MS fragments [41–43]. This results in a number of proposed compound structures. However, unequivocal identification still needs standards or complementary information from other analysis methods, such as NMR analysis in conjunction with MS fragmentation as was demonstrated for the structural identification of biotransformation products of iodinated X-ray contrast media [44, 45].

Recent articles have illustrated the application of accurate mass measurements, the calculation of molecular formulae, and searches in user-defined or NIST libraries and the Merck index [28, 29, 46, 47]. With use of Q-TOF measurements and a user-defined database with 2,500 water pollutants plus 100 mass spectra, the structures of the three compounds N,N-dicyclohexyl-N-methylamine, carbamazepine, and triphenylphosphine oxide were proposed, but they have not been confirmed with standards [47]. The limited mass resolving power of 5,000 (FWHM) of a Q-TOF instrument resulted in a considerable number of proposed molecular formulae for unknowns, which could be reduced by use of isotope patterns and a database search [46]. Four unknowns were identified and confirmed by matching mass fragments and retention times: the fungicide enilconazole and the herbicides prometryn, terbutryn, and diuron. The proposed structures of the molecular formulae of a further three unknowns (C10H9O2F2S2Cl3, C14H26O4, C13H12N2O2) could not be explained. In wastewater effluent, 463 features of Q-TOF measurements with a high mass accuracy of less than 2 ppm were found and matched to 51 compounds by use of retention times and accurate masses, from which 26 compounds were identified on the basis of data from a user-defined library with accurate masses, isotope patterns, and in-source fragments of 300 pesticides and 80 pharmaceuticals [29]. Seventeen of the detected compounds had no data on in-source fragments and were further subjected to scheduled MS/MS measurements. The compounds included five pesticides, 16 pharmaceuticals, and five metabolites. The retention time was an important criterion in preselection of relevant features and in library matching. The 463 features would have matched 463 potential compounds in the user-defined library had the retention time not been considered. Furthermore, several isomer pairs in the database such as sulfathiazole–ketorolac, theophylline–paraxanthine–theobromine, fluoxetine–nadolol, and ranitidine–clomipramine could only be distinguished by their retention times. The main TPs acetaminophen and azithromycin were identified on the basis of the matching of similar structural moieties and hence fragments with library compounds. A self-created database with a user-defined retention index based on fenuron and chloroxuron as internal standards has been used for the identification of metolachlor oxanilic acid and alachlor minus chloromethane [39]. However, several unknowns could not be explained because of the lack of reference compounds. So far, the application of different approaches of nontarget screening by LC-MS has revealed some promising results. However, it is common that only one or a few compounds of real unknowns could be identified [28, 29, 37–39, 41, 47–53].

In comparison, screening by gas chromatography (GC)–MS and LC–particle beam–MS often resulted in more proposed compounds owing to the availability of comprehensive MS libraries for electron ionization (EI) MS and the often more meaningful EI mass spectra [37, 50, 51, 54–59]. In this context, it is also worth mentioning that the development of direct-EI interfaces could be a complementary technique to obtain EI mass spectra in LC-MS [60, 61]. However, in comparison with EI, soft ionization with electrospray ionization (ESI) facilitates the finding of the molecular mass of the unknowns if there is no library spectrum available [40].

Qualified nontarget screening

Generally, data reduction, MS library search, and the use of further MS and chromatographic data (MS fragmentation, retention time) are most promising. Mueller et al. [62] applied statistical approaches and a search in a user-defined MS library to evaluate potential drinking water contaminants from a landfill leachate. In contrast to using the signal intensity, all features with a retention time and mass-to-charge (m/z) ratio obtained by a Q-TOF full scan were used. Pattern matching of samples with temporal, spatial, or process-based relationships was done by computing the operations union (A ∪ B), intersection (A ∩ B), and complement (A/B) of the different data sets using Venn diagrams. Venn diagrams are used to teach elementary set theory and show all logical relations between a given collection of sets. This helped to reduce the total number of features detected in groundwater affected by a landfill leachate from 1,729 to eight relevant compounds occurring in the raw water taken for drinking water treatment. Three contaminants were then identified as relevant for drinking water quality, since they still persisted after ozone treatment. A search in a user-defined database (DAIOS: Database Assisted Identification of Organic Substances) [63] resulted in 1-adamantylamine, crotamiton, and carbamazepine as the most probable candidates.
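The set operations described above map directly onto Python sets. The sketch below is an illustration only and not the procedure of Mueller et al.: the feature values are invented, the helper name is hypothetical, and features are made comparable by rounding m/z and retention time to coarse bins, which is an assumption that can split features lying close to a bin edge (a tolerance-based comparison would be more robust).

```python
def to_keys(features, mz_decimals=2, rt_decimals=1):
    """Map (m/z, retention time) pairs onto coarse bins so sets can be compared."""
    return {(round(mz, mz_decimals), round(rt, rt_decimals)) for mz, rt in features}


# Invented example features (m/z, retention time in min) for three samples.
leachate = to_keys([(237.1022, 8.52), (152.1434, 5.31),
                    (204.1383, 9.77), (150.0552, 2.12)])
raw_water = to_keys([(237.1019, 8.49), (152.1431, 5.33), (301.1410, 7.02)])
ozonated = to_keys([(237.1021, 8.51)])

all_features = leachate | raw_water   # union, A ∪ B
shared = leachate & raw_water         # intersection, A ∩ B
only_leachate = leachate - raw_water  # complement of B in A
persistent = shared & ozonated        # features still present after ozonation

print(f"{len(all_features)} features in total, {len(shared)} shared, "
      f"{len(persistent)} persistent after ozonation")
```

Only features that occur in every sample of a process chain survive the successive intersections, which is how a list of many features can be reduced to a few relevant ones without using signal intensities.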

Another approach to evaluate relevant compounds was presented by Helbling et al. [15], who investigated TPs of different pharmaceuticals and pesticides in batch experiments for biodegradation. They used a combinatory approach of suspect and nontarget screening. Proposals for suspected TPs were generated by a metabolite prediction software tool (University of Minnesota Pathway Prediction System, UM-PPS) [64] and their accurate masses were matched against measured full-scan MS data. For nontarget screening, Helbling et al. first performed a pattern matching of MS data of original samples at time 0 and at certain degradation times. After further use of a series of mass filters, they were able to reduce the number of measured features to a list of candidate TPs that were formed during the biotransformation experiment. The filters included the m/z ratio, a retention time domain constraint, a background subtraction algorithm, a constrained molecular formula fit, and a plausibility check based on the presence of 13C monoisotopic masses. The resulting candidate TPs were then confirmed or rejected by visual inspection of the mass and tandem mass spectral data regarding the relative abundance of 13C and/or 37Cl monoisotopic masses and/or adduct masses and product ions of each parent compound and TP. As a result, the suspect screening successfully predicted 21 plausible TPs, whereas the nontarget screening resulted in the proposal of 26 TPs. The most probable structures of the five additional TPs were obtained by interpretation of mass spectral data.

A similar approach was used by Kern et al. [65], who looked for plausible microbial TPs of xenobiotics formed in the environment. The examples revealed that if no previous information is available, identification of compounds based only on accurate mass and mass fragmentation data will be very challenging. This was also confirmed for the identification of metabolites of organic water pollutants using LC-HRMS [66]. The successful identification of unknowns is in some way limited by the availability of chemical compound databases and mass spectral libraries with high-resolution information.

LC-HRMS libraries and databases for organic water pollutants

As already mentioned, a simple but reliable approach for the successful identification and confirmation of unknowns is the comparison of measured accurate product/fragment ion mass spectra with accurate mass spectra of authentic reference compounds, as it provides additional selectivity instead of using only the exact mass and the corresponding isotopic pattern. Most of the reported approaches for the identification of unknowns in forensic, metabolomics, and environmental research use user-defined low-resolution and accurate tandem mass spectral or in-source fragmentation libraries [39, 47, 67, 68]. A variety of commercially available and user-defined mass spectral libraries have been developed for certain MS instrument types and settings. Tandem mass spectra from different instrument types are poorly reproducible in comparison with EI mass spectra. Various collision energies are applied in tandem mass spectra, and therefore, the relative intensities of ions differ considerably. However, similar product ion patterns have been observed across a range of collision energies [69]. Consequently, mass spectral matching fails if the signal intensity is a criterion, but it is successful if only the fragment pattern is considered. Several studies have shown the reproducibility and transferability of tandem mass spectra for use with multiple instrument types [31, 70–72] and that instrument-independent tandem mass spectra can be obtained by application of multiple collision energies for fragmentation [73–75]. In 80 % of all cases, the mass spectrum of an unknown compound could be assigned to a structure if it was compared with two or more reference mass spectra recorded with different instruments or with different collision energies. Hence, a considerable collection of tandem mass spectra obtained with different collision energies and with different instruments can improve the overall performance of a successful library search.
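Matching on the fragment pattern alone, without intensities, can be sketched as follows. This is a minimal illustration and not the scoring used by any of the cited libraries: the function name, the 5 mDa tolerance, and the example m/z values are assumptions.

```python
def fragment_match(measured_mz, library_mz, mz_tol=0.005):
    """Fraction of library fragment m/z values found in the measured spectrum.

    Intensities are deliberately ignored, so spectra recorded at different
    collision energies can still match if they share the same fragments.
    """
    if not library_mz:
        return 0.0
    found = sum(
        any(abs(measured - reference) <= mz_tol for measured in measured_mz)
        for reference in library_mz
    )
    return found / len(library_mz)


# Illustrative values: two of the three library fragments are present.
measured = [194.0965, 179.0730, 237.1022]
library = [194.0964, 237.1022, 165.0699]
print(f"match = {fragment_match(measured, library):.2f}")
```

A score of this kind depends only on which fragments occur, so it is unaffected by the intensity differences between instruments and collision energies described above, at the cost of discarding the information that intensities carry.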

Several commercially and publicly available spectral libraries aim to identify compounds independently of the instrument type and settings. Most of the LC-MS libraries have been developed and published by researchers in life sciences (e.g., proteomics and metabolomics). A recent review [36] on computational MS for metabolomics summarized existing compound libraries containing ESI mass spectra, which include the commercially available NIST reference library and the freely accessible metabolite libraries METLIN, Human Metabolome Database (HMDB), and MassBank. Only the NIST reference library, METLIN, and MassBank contain unit-resolution mass spectra and accurate mass spectral data. The NIST reference library released in 2011 contains 85,344 high- and low-resolution tandem mass spectra of 7,172 different ions from 3,877 compounds, and also includes environmentally relevant compounds.

METLIN [73, 76, 77] is a metabolite database currently containing 29,500 accurate tandem mass spectra from 5,327 metabolites. The tandem mass spectra are recorded with one type of Q-TOF instrument in the positive and negative ESI modes using four different collision energies (0, 10, 20, and 40 eV).

MassBank [78] contains 3,357 entries with accurate mass data and tandem mass spectra obtained with different types of instruments, settings, and ionization modes. It has many options for searching for a mass spectrum, such as by peak, compound name, exact mass, molecular formula, substructure, instrument type, single or multiple fragmentation, and type and mode of ionization. One of the major advantages of MassBank is the free accessibility and the possibility to upload both nominal and accurate mass spectra in common and different data formats [36]. This allows the collection of a considerable amount of useful mass spectral information from a broad research community, which might help to improve the overall performance of successful nontarget analysis. Although some data entries on metabolites are certainly useful for research on ECs, the mass spectral information associated with environmental pollutants is rather scarce. Nevertheless, computational techniques and tools for a reliable library search are well developed, and the spectrum-matching tools and search functions are already optimized for ESI-MS/MS library search. Therefore, it might be possible to extend the database with information on environmental pollutants [35, 78]. MassBank is currently being expanded with environmental pollutants using accurate tandem mass spectral data collected by a network of reference laboratories (NORMAN network) [79].

A database for water pollutants with emphasis on nontarget screening is the DAIOS database [80], which contains numeric information on the nominal and accurate masses of precursor and product ions (e.g., from MS/MS, TOF-MS, and Fourier transform MS). The mass spectral data can be searched for precursor and product ions. Additional useful metadata such as information on sampling points, existing production plants, agricultural uses, special urban situations, molecular data, and chromatographic conditions are compiled to constrain the search or to check the plausibility of the compounds searched. DAIOS currently contains about 344 substances, which is a comparably low number, but it is open to extension by further users.

Generally, the amount of accurate mass spectral information for environmental contaminants in currently available accurate mass libraries and databases is far from comprehensive. The situation is quite different for EI mass spectra; the current NIST reference library [81, 82] contains more than 240,000 EI mass spectra from 212,961 compounds, whereas the Wiley Registry (ninth edition) [83] contains 662,000 spectra from 592,000 compounds.

Therefore, an important approach of nontarget screening has to rely on general chemical databases and has to deal with the lack of MS library information.

Nontarget approaches based on comprehensive chemical databases and computer-based fragmentation

The nontarget approach is a rather challenging task if no compound databases or library information is available, and the proposal of compound structures is thus based only on HRMS data. The result of this approach is clearly limited to some extent. A novel approach to query chemical databases for structural interpretation in the metabolomics field was reported by Hill et al. [43]. They used measured monoisotopic molecular weights for 102 test compounds to retrieve candidates from a comprehensive chemical database (PubChem). On average, 272 candidates were proposed for each test compound. With the rule-based software program MassFrontier [84], fragmentation spectra were generated for all candidates and then compared with the experimental collision-induced-dissociation spectra of Q-TOF-MS measurements of the unknown structures. As a result, for 65 of 102 test compounds, the highest-ranking candidate matched the correct structure, for 87 compounds the right structure was within the first 20 candidates, and for 98 of 102 compounds the correct elemental formula was ranked first.

This reveals that matching experimental with computational fragment spectra is a promising approach to rapidly discriminate among compounds with the same molecular formula. This approach was refined and customized by Wolf et al. [42] in the form of the open-source software tool MetFrag. Direct database queries in PubChem, ChemSpider, and KEGG [36, 85–87] can be performed on a Web-based platform. For the possible candidates, a rather fast algorithm for in silico fragmentation is performed using the bond disconnection approach and a small set of rules to describe molecular rearrangements. The candidates are then scored on the basis of the number of fragments explaining the measured peaks and the bond dissociation energy. The higher the bond dissociation energy, the less likely the fragment is considered to be. Further details on MetFrag can be found in Wolf et al. [42], whereas details on the systematic bond disconnection approach can be found in Hill and Mortishire-Smith [84].
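The idea of scoring candidates by explained peaks and bond dissociation energy can be illustrated with a toy function. This is explicitly not the MetFrag scoring function: the linear penalty, the weight, the function name, and all fragment values are assumptions chosen only to show the principle that more explained peaks raise the score and higher bond dissociation energies lower it.

```python
def score_candidate(measured_peaks, predicted_fragments,
                    mz_tol=0.005, bde_weight=0.001):
    """Toy score for one candidate structure.

    measured_peaks: list of measured fragment m/z values.
    predicted_fragments: list of (mz, bde_kj_per_mol) tuples from a
        hypothetical in silico fragmenter, where bde_kj_per_mol is the
        summed dissociation energy of the bonds broken to form the fragment.
    Each explained peak contributes 1 minus a penalty proportional to the
    lowest bond dissociation energy among the fragments that explain it.
    """
    score = 0.0
    for peak in measured_peaks:
        energies = [bde for mz, bde in predicted_fragments
                    if abs(mz - peak) <= mz_tol]
        if energies:
            score += 1.0 - bde_weight * min(energies)
    return score


# Invented example: three measured peaks and three candidate structures.
measured = [194.0965, 180.0810, 152.0000]
candidates = {
    "A": [(194.0964, 350.0), (180.0808, 700.0)],  # explains two peaks
    "B": [(194.0964, 400.0)],                     # explains one peak
    "C": [(194.0964, 350.0), (180.0808, 900.0)],  # two peaks, costlier bonds
}
ranking = sorted(candidates,
                 key=lambda name: score_candidate(measured, candidates[name]),
                 reverse=True)
print(ranking)
```

Candidates that share a molecular formula but differ in structure produce different predicted fragments, so even a crude score of this kind can separate them as long as the measured spectrum contains structure-specific fragments.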

Other more specific approaches are promising to predict and explain some of the fragmentation mechanisms, but have much higher demands on computing time such as the application of ab initio calculations in density functional theory [88].

Other commercially available and freely accessible computational tools (e.g., ACD/Fragmenter, Mass Frontier, Sirius Starburst, and SmartFormula3D) are available to support the identification of unknown compounds. However, MetFrag revealed better performance than MassFrontier on the data set used by Hill et al. [43], with lower standard deviations of the correct ranks [36].

We are interested in how this approach is applicable to environmental contaminants and their metabolites. For this purpose, we selected mass spectral data of 21 ECs and metabolites from contaminated sites [62, 89] and from the compound classes pharmaceuticals [15], diagnostics [45, 90], and pesticides [62, 65]. We used the exact mass of the compounds as input for an upstream search in PubChem with mass windows of 2, 5, and 10 ppm using MetFrag. Between 28 and 2,420 candidate structures were retrieved for each compound (Table 1). By matching one or more exact mass fragments with in silico generated fragments of the first 100 hits, we considerably reduced the number of possible candidates to between 2 and 36, depending on the number of isomers which cannot be distinguished by mass spectral fragmentation. The correct structure was ranked first for eight compounds, and was among the first ten ranks in 18 cases. In four cases the correct compound could not be found in the chemical database (no. 21, the TP of iomeprol; nos. 1, 13, and 14, succinic acid derivatives of benzofuran and methylnaphthalene; see Table 1). Hence, this approach works only for compounds which are listed in a chemical database.

Table 1 Proposed compounds from a MetFrag search based on an upstream search of the accurate mass in PubChem and matching of the in silico fragments of the candidates with the measured fragments

Further information on the sample and its contamination, and on the separation and analysis such as chromatographic retention times and ionization efficiencies would be necessary to further reduce the number of most likely candidates. This has been demonstrated for low-resolution GC-EI-MS and LC-EI-MS data [91]. The retention index and boiling point correlation, octanol–water partition coefficients, steric energies, and the linear solvation energy relationships approach have been further used to limit the number of candidate structures [92]. A combination of computer-based structure generation and mass spectral classifiers has been applied for low-resolution GC-MS data in effect-directed analysis [93]. This approach can be used as an alternative to library search or if a match in the database is not available.

Conclusions

The recent literature on different nontarget screening approaches using LC-HRMS reveals some promising results. In most cases only one or a few compounds could be identified by the nontarget approach, which requires several steps from measurement of data to compound identification. Generally, data reduction, MS library search, and the use of further MS and chromatographic data (MS fragmentation, retention time) are required for successful nontarget analysis. Unequivocal identification of unknowns still requires mass spectral information from authentic reference standards using user-defined, public, or commercially available ESI-MS/MS databases or libraries with high-resolution information.

However, high-resolution mass spectral information associated with environmental pollutants is still scarce. The currently available ESI-MS/MS databases and libraries are still unsuitable for a comprehensive library search, and hence a comprehensive identification of nontargeted analytes is not possible. Freely accessible and publicly available MS libraries with the possibility to upload accurate mass spectra in common and different data formats allow a considerable amount of useful mass spectral information to be collected from a broad research community, which might help to improve the overall performance of successful nontarget analysis. Such libraries with well-developed computational search options already exist for use in metabolomics research.

A big step forward has been achieved with computer-based tools if no MS library or MS database entry is found for a compound. Our examples of selected compounds and metabolites from recent publications have demonstrated that numerous search results from a large chemical database can be limited to only a few by in silico fragmentation. Still, only very few compounds could be identified or tentatively identified in environmental samples by nontarget screening. In most cases the availability of comprehensive MS libraries with a focus on environmental contaminants is the limiting factor. Further information on the analyte characteristics in chromatography and ionization will gain increasing importance in nontarget screening.