Introduction

The chronic discharge of Endocrine Disrupting Compounds (EDCs) into the aquatic environment has raised awareness at the political, regulatory, societal, and scientific levels. This is because EDCs can pose a threat to biodiversity and exert harmful effects at ultra-trace levels [1,2,3]. For example, exposure to natural and synthetic estrogens has been correlated with the feminization of fish populations [4]. Adverse effects on fish reproduction and behavioral modifications were also reported following exposure to synthetic steroid hormones (progestogens and glucocorticoids, respectively) [5, 6].

Three estrogens, namely estrone (E1), 17beta-estradiol (17βE2), and 17alpha-ethinylestradiol (17αEE2), were included in the first Watch List of the European Water Framework Directive (WFD) in order to generate high-quality data on their occurrence in aquatic environments and to contribute to environmental risk assessment [7]. As a consequence, knowledge of their occurrence was much improved [8], and a ubiquitous state of contamination has been reported worldwide for these estrogens [3, 9, 10] together with other steroid hormones [3, 11,12,13,14]. Nonetheless, it should be pointed out that non-estrogenic steroid hormones such as androgens, glucocorticoids, and progestogens have been less investigated to date.

Nowadays, very few multi-class methods can address the challenge of reliable quantification of ultra-trace steroid hormones in waters [15, 16]. Indeed, most of the recently published methods are characterized by inadequate limits of quantification to implement the European WFD [17, 18]. For non-regulated compounds, the challenge resides in quantifying steroid hormones at such low concentration levels, which may pose a risk to environmental and human health. Furthermore, owing to their physicochemical properties, steroid hormones can partition between water and suspended particulate matter (SPM) in water bodies [10, 19, 20]. Even if the European WFD has underlined the importance of analyzing the total fraction of organic micropollutants, including steroid hormones [21], very few monitoring methods can meet this requirement [19, 22]. Nonetheless, this requirement only applies if the European Member States have not empirically demonstrated that the measured concentration of the micro-pollutant in the dissolved fraction is the same as in the total fraction of waters. Indeed, that would prove that targeting the total fraction is irrelevant for environmental monitoring.

Quality of data is crucial since measurement results can underpin key decisions in environmental monitoring. Indeed, reliable data are compulsory for the effectiveness of monitoring, risk assessment, and prioritization, and for the scientific knowledge on steroid hormones. For instance, the cocktail effect and the combined mode of action of steroid hormones need to be further investigated to improve risk assessment [17, 23].

On another note, the comparability of the data at the European level and validation procedures are issues that should be addressed when monitoring chemical substances under the European WFD. With a view to harmonization, a European technical specification, CEN/TS 16800, describes a framework for the validation of analytical methods in environmental media (solids and water). A similar methodology is given in the French NF T90-210 standard for water quality. However, approaches for estimating performance characteristics such as the limit of quantification (LQ) vary (signal-to-noise approach, setting of a maximum allowed tolerance (MAT), etc.) and are rarely described in sufficient detail in the scientific literature. All this contributes to a lack of comparability between the available analytical methods. To assess performance as closely as possible to operational conditions, matrix effects should be considered in the validation. Moreover, it is of fundamental importance to use as many representative samples as possible to obtain a realistic and robust assessment of the performance of the method. However, this aspect is generally poorly addressed in published papers. The impact of matrix effects was scrutinized in another work, which highlighted that internal standards could help reduce detrimental matrix effects in steroid hormone quantification [24]. Three conditions are mandatory to ensure the quality and comparability of measurement results: the establishment of metrological traceability, the demonstration of measurement accuracy, and the estimation of measurement uncertainties. These three conditions have, however, received little attention in the literature.

In this study, a validation approach was developed and optimized for the determination of 21 steroid hormones of four families (androgens, glucocorticoids, estrogens, and progestogens) in whole waters by an isotope dilution two-step solid-phase extraction (SPE) followed by an ultra-performance liquid chromatography separation coupled to tandem mass spectrometry detection (UPLC-MS/MS). Statistical tools described in European standards (CEN/TS 16800 [25] and NF T90-210 [26]) were used to test calibration models and assess the accuracy of the method in representative matrices. Moreover, when possible, certified reference materials (CRMs) traceable to SI units were used to ensure the traceability of measurements. In addition, a thorough evaluation of the measurement uncertainty was performed following the Guide to the expression of Uncertainty in Measurement (GUM) [27]. Finally, to demonstrate its suitability, the method was implemented in a monitoring survey of Belgian surface waters.

Material and methods

Chemicals and reagents

Acetonitrile (ACN) and methanol (MeOH) (Baker Analyzed LC-MS grade) were purchased from Atlantic labo (Bruges, France). Water (ULC/MS) and acetone extra dry were acquired from Biosolve chemicals (Dieuze, France). Ethyl acetate (HPLC Plus, purity = 99.90%) and dansyl chloride (HPLC derivatization, purity = 99%) were acquired from Sigma-Aldrich (Saint-Quentin Fallavier, France). A solution of humic acid was prepared following an existing protocol [28], using humic acid sodium salt (technical grade) also provided by Sigma-Aldrich. Formic acid (AnalaR Normapur®, purity > 99%) and EDTA, Na2 (ethylenediaminetetraacetic acid disodium salt, 0.05 mol L−1 in aqueous solution) were acquired from VWR (Fontenay-sous-Bois, France). Acetic acid (Optima® LC/MS, purity > 99%) was acquired from Thermo Scientific (Illkirch, France). Evian® water, a commercially available natural mineral water in glass bottles, was used as a reference water. An in-house material containing suspended particulate matter (SPM) was used.

Selection of native and isotope-labelled analytical standards

21 native compounds belonging to four classes were selected in a previous study [29]:

  • 2 androgens (ANDRO): 4-androstenedione (AD) and testosterone (T)

  • 6 estrogens (ESTRO): 17alpha-estradiol (17αE2), 17beta-estradiol (17βE2), 17alpha-ethinylestradiol (17αEE2), diethylstilbestrol (DES), estriol (E3), and estrone (E1)

  • 5 glucocorticoids (GLUCO): betamethasone (BET), cortisol (COL), cortisone (COR), dexamethasone (DEX), and prednisolone (PRED)

  • 8 progestogens (PROG): 17alpha-hydroxyprogesterone (17HPT), 21alpha-hydroxyprogesterone (21HPT), chlormadinone acetate (Ac CHLOR), cyproterone acetate (Ac CYP), cyproterone (CYP), drospirenone (DRO), levonorgestrel (LEV), and norethindrone (NOR)

To establish the metrological traceability of measurements, high-purity certified reference materials (CRM) available during the study [30] were selected. CRMs for 17βE2 (CRM 6004-a, purity = 98.4%), COL (CRM 6007-a, purity = 99.3%), and T (CRM 6002-a, purity = 99.8%) were provided by the National Metrology Institute of Japan (NMIJ). CRM for 17HPT (S041, purity = 98.7%) was provided by the National Measurement Institute of Australia (NMIA).

For other native compounds, analytical standards with the highest purity were selected.

The choice of isotope-labelled compounds was a crucial step. Labelled compounds with a sufficient mass difference from the target compounds should be selected to prevent a detrimental impact on quantification. For example, quantification biases were reported when using cortisol-d2: an interference was reported with a natural isotope of cortisol present in the sample. Similar issues were reported with the use of testosterone-d2 [31]. Furthermore, H/D exchanges between labelled compounds and reconstitution solvents have been reported by Davison et al., with increased exchanges at low pH values [32]. Compounds labelled with 13C are generally recognized as more reliable, because the label sits within the carbon backbone of the steroid hormone and is therefore unavailable for any exchange. All these considerations led to systematically rejecting compounds with less than 3 Da of difference from their native analogues and, when possible, to favoring compounds labelled with 13C. Twelve isotope-labelled compounds were finally chosen, considering their labelling quality, physicochemical properties, availability, and cost. Isotope dilution was implemented for every analyte using a labelled analogue (deuterated or 13C) except for 17αE2, E3, BET, COR, COL, 21HPT, Ac CHLOR, Ac CYP, and CYP.

All the native and isotope-labelled compounds (13C and 2H) were purchased as pure analytical standards or solutions (in methanol or dioxane at 0.1 mg mL−1) from LGC standards (Molsheim, France), Merck KGaA (Darmstadt, Germany), Toronto Research Chemicals Inc. (Ontario, Canada), Cambridge Isotope Laboratories Inc. (Tewksbury, USA), and Cayman Chemical (Ann Arbor, USA).

CAS numbers and purity of analytical standards, as well as some physicochemical properties (log Kow and pKa), are available in Supplementary Information for the 21 native compounds (Table S1) and the 12 isotope-labelled compounds (Table S2).

Preparation of stock and working solutions

Individual stock solutions were gravimetrically prepared at about 0.1 mg mL−1 in methanol for all the native and isotope-labelled compounds and were analyzed to check for cross-contamination (absence of another selected compound).

Working solutions of the target compounds and the isotope-labelled compounds were gravimetrically prepared by diluting the individual stock solutions in methanol.

Specifically, for the validation of the method, two independent batches of individual solutions of the native compounds were prepared when possible: the first for the preparation of calibration samples and the second for the spiking of samples.

All stock and working solutions were stored in the dark at −20 ± 5 °C. No instability of compounds in working solutions was detected during a storage period of 5 months.

Measurement procedure

The analytical procedure used in this study was thoroughly described by Mirmont et al. [29]. Each critical step of the measurement procedure was controlled by weighing.

To summarize, whole water samples (1 L) were spiked with EDTA (0.1%; v/v) and isotope-labelled compounds at concentrations ranging between 0.5 and 2.5 ng L−1 for 17αEE2-d4 and LEV-d6, respectively. After extraction on C18 Atlantic® Ready Disk (Biotage, Uppsala, Sweden) using a Horizon Technology SPEDEX®-4790, purification of samples using Supelclean™ LC-NH2 SPE (500 mg, 6 mL) cartridge (Merck, Darmstadt, Germany) was performed.

As two chromatographic runs were required for the quantification of trace compounds in whole water, each extract was divided into two aliquots: one aliquot was used for the analysis of GLUCO in the negative mode, and the analysis of ANDRO and PROG in the positive mode; the other aliquot was dedicated to the analysis of ESTRO in the positive mode after a dansylation step.

Liquid chromatography was performed using an Acquity® UPLC H-Class system (Waters, Guyancourt, France). Separation was achieved using an Accucore™ Biphenyl column (2.6 µm; 2.1 × 100 mm) equipped with an Accucore™ Biphenyl pre-column (2.6 µm; 2.1 × 10 mm) and a pre-filter (0.2 µm) (Thermo Scientific, Waltham, USA).

The UPLC system was coupled to a Xevo TQ-MS® triple quadrupole mass spectrometer (Waters, Guyancourt, France) equipped with an electrospray ionization (ESI) source. The mass spectrometer was set with a capillary voltage of 3 kV in positive mode and −2 kV in negative mode. Source and desolvation temperatures were set at 150 °C and 650 °C, respectively. Desolvation and cone gas flows were set at 1000 and 50 L h−1, respectively. The MS acquisition was performed in Multiple Reaction Monitoring (MRM) mode, and data were processed with TargetLynx™ (Waters).

For each compound, one transition was selected for the quantification, and one transition was selected for the confirmation. Both ions and MS parameters for the 21 target compounds and the 12 isotope-labelled compounds are detailed in Mirmont et al. [29].

Quality controls

To check the initial system performance and to monitor any detrimental loss of sensitivity during the analytical runs, mixtures of target compounds (1 ng mL−1 and 10 ng mL−1) and isotope-labelled compounds (25 ng mL−1) were injected regularly in each analytical run. A 20% tolerance on chromatographic peak areas was set based on previous laboratory experience. Moreover, possible carry-over due to memory effects from one sample to another during analytical runs was scrutinized by the injection of solvent blanks (same conditions as samples) after every sample.

Besides, to check any contamination during the overall analytical workflow, blanks constituted of Evian® water spiked with isotope-labelled compounds at a concentration ranging between 0.5 and 2.5 ng L−1 for 17αEE2-d4 and LEV-d6, respectively, were used in each analytical series.

Solvent and procedural blanks were systematically examined to check for the absence of compounds of interest or interferences.

Positive controls were examined in each analytical run. They were constituted of Evian® water samples spiked with the compounds at concentrations ranging from 0.1 for 17αEE2 to 5.0 ng L−1 for LEV and with the isotope-labelled compounds at concentrations ranging from 0.5 for 17αEE2-d4 to 2.5 ng L−1 for LEV-d6 and NOR-d6.

Confirmation of identification of target compounds was performed, fulfilling the ISO 21253-1:2019 requirements: retention time with a tolerance of 2.5%, monitoring of two distinct transitions and their abundance ratio (based on peak area) with a 30% tolerance between samples and calibration samples [33].
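As an illustration of the identification criteria above, the following minimal Python sketch checks a retention time and a transition abundance ratio against calibration values. The tolerances (2.5% and 30%) come from the text; the function names, the choice to express each tolerance relative to the calibration value, and the numerical inputs in the usage note are assumptions made for this example only.

```python
def within_relative_tolerance(observed, reference, tolerance):
    """True if `observed` deviates from `reference` by at most `tolerance` (a fraction)."""
    return abs(observed - reference) <= tolerance * abs(reference)


def identification_confirmed(rt_sample, rt_calibration,
                             quant_area, conf_area, ratio_calibration,
                             rt_tolerance=0.025, ratio_tolerance=0.30):
    """Check retention time, presence of both transitions, and their area ratio.

    Assumption: tolerances are applied relative to the calibration values.
    """
    # Both transitions must give a signal before a ratio can be computed.
    if quant_area <= 0 or conf_area <= 0:
        return False
    ratio_sample = conf_area / quant_area
    rt_ok = within_relative_tolerance(rt_sample, rt_calibration, rt_tolerance)
    ratio_ok = within_relative_tolerance(ratio_sample, ratio_calibration, ratio_tolerance)
    return rt_ok and ratio_ok
```

For instance, with hypothetical values, `identification_confirmed(5.05, 5.00, 1000.0, 110.0, 0.10)` passes both checks, whereas a missing confirmation transition (`conf_area = 0`) fails regardless of retention time.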

Method validation

Method validation was performed using some of the specific statistical tools provided in the NF T90-210 standard.

Selection of representative samples

Seven water samples with an extensive range of physicochemical properties were selected as representative of environmental conditions to which the method will be applied for monitoring in surface waters. Water samples were collected in 4 different locations in France (Fig. 1): the Rance, the Saône, the Yvette Rivers, and the Créteil Lake. Furthermore, two synthetic samples exempt from all target compounds were used as blanks to assess the accuracy at the LQ level. To mimic environmental waters, these two synthetic samples were prepared by spiking Evian® mineral water with humic acid to a dissolved organic carbon (DOC) concentration of 5 mg L−1 or with suspended particulate matter (SPM) at a concentration of 50 mg L−1. These levels were derived from concentrations of DOC and SPM found in average French surface waters [34].

Fig. 1 Location of the selected sites for method validation (in France) and for environmental monitoring (in Belgium, text in boxes)

The physicochemical properties of the seven water samples used for method validation are given in Supplementary Information (Table S3).

Calibration model

The quantity Q of a target compound in a sample can be calculated thanks to a calibration function described in the following equation (Eq. 1):

$$Q=(a\frac A{A^\ast}+b)Q^\ast$$
(1)

where \({Q}^{*}\) is the quantity of the internal standard (isotope-labelled compound); \(\frac{A}{A*}\) is the chromatographic peak area ratio between the target compound and its internal standard; and a and b are the slope and the y-intercept of the calibration function.
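The calibration function of Eq. 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the numerical values in the usage note are hypothetical and do not come from the study.

```python
def quantity_from_area_ratio(area, area_is, quantity_is, slope, intercept):
    """Eq. 1: Q = (a * A / A* + b) * Q*.

    area, area_is : peak areas of the target compound and its internal standard
    quantity_is   : quantity of internal standard added (Q*)
    slope, intercept : calibration constants a and b
    """
    if area_is <= 0:
        raise ValueError("internal standard peak area must be positive")
    return (slope * area / area_is + intercept) * quantity_is
```

With an illustrative area ratio of 2, a slope of 0.5, a zero intercept, and 2.5 ng of internal standard, the function returns 2.5 ng of target compound.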

Calibration samples with area ratios \(\frac{A}{A*}\) closest to those observed in samples were chosen to custom-build a suitable calibration curve for each target compound. Depending on the target compound and the linearity of the calibration model, one or two calibration models were considered for the quantification. Multipoint calibration curves consisted of 5 to 11 points, and for each compound, the previous equation was determined from the calibration model. Calibration samples were gravimetrically prepared in a mixture of water and acetonitrile (same conditions as for sample analyses). The concentration of target compounds ranged from 0.1 to 62.5 ng mL−1, and the concentration of isotope-labelled compounds ranged from 2.5 to 12.5 ng mL−1. Concentrations of the different compounds in calibration samples are given in Supplementary Information (Table S4).

Calibration models were evaluated in intermediate precision conditions over seven different days, with calibration samples randomly injected in triplicate under repeatability conditions.

Calibration criteria were set as follows: the back-calculated concentrations of calibration samples should be within ± 20% of the nominal value at the lowest concentrated calibration sample, and within ± 15% for the other calibration samples to be considered acceptable.
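The acceptance criteria above can be expressed as a short check on back-calculated concentrations. The sketch below is a minimal illustration; the function name and the example concentrations in the usage note are hypothetical.

```python
def back_calculation_acceptable(nominal, back_calculated,
                                lowest_tolerance=20.0, other_tolerance=15.0):
    """Return one boolean per calibration sample.

    The lowest calibration level is allowed +/- 20% deviation from its
    nominal value, and every other level +/- 15% (values from the text).
    """
    lowest = min(nominal)
    results = []
    for nom, calc in zip(nominal, back_calculated):
        deviation = (calc - nom) / nom * 100.0  # deviation in %
        tolerance = lowest_tolerance if nom == lowest else other_tolerance
        results.append(abs(deviation) <= tolerance)
    return results
```

For nominal levels of 0.1, 1.0, and 10.0 ng mL−1 back-calculated as 0.118, 1.2, and 9.0 ng mL−1, the first and third levels are accepted while the second (about 20% deviation against a 15% limit) is rejected.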

Establishment of measurement traceability to the SI units

Solutions prepared using high-purity CRM for COL at 30.0 ng mL−1, for 17HPT at 3.0 ng mL−1, for T at 1.5 ng mL−1, and for 17βE2 at 0.3 ng mL−1 were analyzed to assess the calibration trueness and establish metrological traceability of measurements to SI units. Bias was calculated for every solution in repeatability conditions. Bias was defined in the following equation (Eq. 2) as:

$$\mathrm{Bias}\;\left(\%\right)=\frac{{\mathrm C}_{\mathrm{measured}}-{\mathrm C}_{\mathrm{theoretical}}}{{\mathrm C}_{\mathrm{theoretical}}}\times100$$
(2)

where Cmeasured is the mean measured concentration in solutions (n = 3) and Ctheoretical is the theoretical concentration in solutions determined gravimetrically.
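Eq. 2 amounts to the following computation, shown here as a minimal Python sketch. The function name and the replicate values in the usage note are illustrative and are not data from the study.

```python
from statistics import mean


def bias_percent(measured, theoretical):
    """Eq. 2: relative bias (%) of the mean measured concentration.

    measured    : replicate measured concentrations (n = 3 in the text)
    theoretical : gravimetrically determined concentration
    """
    return (mean(measured) - theoretical) / theoretical * 100.0
```

For example, three hypothetical replicates of 30.3, 30.6, and 30.9 ng mL−1 against a theoretical value of 30.0 ng mL−1 give a bias of 2%.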

Method accuracy

According to the International Vocabulary of Metrology (VIM), measurement accuracy is defined as the “closeness of agreement between a measured quantity value and a true quantity value of a measurand” [35]. To assess the accuracy of measurement results, the first step is to estimate the measurement precision and bias. The second step consists of setting a maximum allowed tolerance (MAT). For that purpose, the two following inequalities have to be verified (Eq. 3 and Eq. 4):

$$-\mathrm{MAT}\;(\%) < \mathrm{B}\;(\%) - \frac{2{\mathrm s}_{\mathrm{FI}}}{{\mathrm C}_{\mathrm{theoretical}}}\times100$$
(3)
$$\mathrm{B}\;(\%) + \frac{2{\mathrm s}_{\mathrm{FI}}}{{\mathrm C}_{\mathrm{theoretical}}}\times100 < \mathrm{MAT}\;(\%)$$
(4)

where MAT is the maximum allowed tolerance, B is the bias as described in the “Establishment of measurement traceability to the SI units” section, and sFI is the standard deviation in intermediate precision conditions.

The MAT was fixed depending on the compound and the level of concentration. As requested by the NF T90-210 standard, a MAT was set at 60% at the LQ and 35% at concentrations above the LQ.
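The accuracy test can be sketched as follows, reading Eq. 3 and Eq. 4 as the requirement that the interval formed by the bias plus or minus twice the relative intermediate precision standard deviation (expressed in %) lies strictly within ± MAT. That reading, the function name, and the numerical values in the usage note are assumptions made for illustration; the way sFI itself is estimated is not covered here.

```python
def accuracy_within_mat(bias_percent, s_fi, c_theoretical, mat_percent):
    """Check that bias +/- 2 * s_FI (relative, in %) stays within +/- MAT.

    bias_percent  : bias B in %
    s_fi          : intermediate precision standard deviation (concentration units)
    c_theoretical : theoretical concentration (same units as s_fi)
    mat_percent   : maximum allowed tolerance in % (60 at the LQ, 35 above it)
    """
    half_width = 2.0 * s_fi / c_theoretical * 100.0
    lower = bias_percent - half_width
    upper = bias_percent + half_width
    return (-mat_percent < lower) and (upper < mat_percent)
```

With a hypothetical bias of 10% and a relative sFI of 10%, the interval runs from −10% to 30% and satisfies a 35% MAT; raising the relative sFI to 15% pushes the upper limit to 40%, which fails at 35% but passes the 60% tolerance applied at the LQ.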

As no matrix-matched CRMs were available, the pure analytical standards purchased were considered as reference and working solutions were used to gravimetrically spike samples from the seven selected matrices (see “Selection of representative samples” section). Independent duplicate samples (1 L) of each matrix were prepared according to the procedure described in the “Measurement procedure” section and analyzed in intermediate precision conditions for each investigated level of concentration (one operator, seven different days (1 day/matrix), preparation of two replicates in repeatability conditions for each level of concentration).

Measurement uncertainty

Every measurement result should be expressed with its associated expanded uncertainty U to allow the comparison of measurement results. A coverage factor (k) of 2 is generally chosen, which corresponds to a confidence level of approximately 95%.

Measurement uncertainties were evaluated by following the Guide to the expression of Uncertainty in Measurement (GUM).

As a first step, an Ishikawa diagram, given in Fig. 2, was constructed by listing all the parameters and potential sources of errors that may influence the calculation of the concentration of a target compound in a sample. This diagram is a valuable tool in the evaluation of the global uncertainty budget.

Fig. 2 Ishikawa diagram. C, concentration of the target compound in the sample; C*, concentration of the internal standard in the spiking solution; Qlin, modelized quantity ratio \(\frac{Q}{Q*}\) in the sample calculated thanks to constants a and b from the linear calibration model; \(\frac{A}{A*}\), area ratio in the sample; Qet, corrective factor linked to the uncertainty on concentration of calibration samples; \({f}_{syst}\), correction factor linked to the variability of the bias (systematic measurement error); \({m}_{3}\), mass of the empty sample bottle; \({m}_{4}\), mass of the bottle after sampling; \({m}_{2}\), mass of the empty bottle before addition of internal standards; \({m}_{1}\), mass of the bottle after the addition of internal standards; \({f}_{prec}\), correction factor linked to the intermediate precision of the method (random measurement error)

Then, the concentration of a target compound in a sample was expressed in a mathematical form with the following equation (Eq. 5):

$$\mathrm{C} = \frac{{\mathrm{C}}^{*}({Q}_{lin})\mathrm{(}{Q}_{et}\mathrm{)(}{m}_{2}-{m}_{1}\mathrm{)}}{\mathrm{(}{m}_{3}-{m}_{4}\mathrm{)/1000}}+ {f}_{\mathrm{prec}}+ {f}_{\mathrm{syst}}$$
(5)

where C is the concentration of the target compound in the sample in ng L−1, C* is the concentration of the internal standard in the spiking solution in ng g−1, Qlin is the quantity ratio \(\frac{Q}{Q*}\) in the sample calculated using the constants a and b from the linear calibration model, Qet is a corrective factor linked to the uncertainty of the preparation of calibration samples, \({m}_{2}\) is the mass of the empty bottle before addition of internal standards in g, \({m}_{1}\) is the mass of the bottle after the addition of internal standards in g, \({m}_{3}\) is the mass of the empty bottle before sampling in g, \({m}_{4}\) is the mass of the bottle after sampling in g, 1000 is a conversion factor from ng g−1 to ng L−1 assuming that the sample density is equal to that of water, \({f}_{\mathrm{prec}}\) is a correction factor linked to the uncertainty associated with the intermediate precision of the method (random measurement error), and \({f}_{\mathrm{syst}}\) is a correction factor linked to the uncertainty associated with the variability of the bias (systematic measurement error).

Corrective factors Qet, \({f}_{\mathrm{prec}}\), and \({f}_{\mathrm{syst}}\) are respectively equal to 1, 0, and 0. They do not participate in target compound concentration calculation. However, they are described in the mathematical model given in Eq. 5 because their associated relative uncertainties contribute to the global uncertainty budget.

Standard measurement uncertainties of each input value in the quantification model of Eq. 5 were evaluated. \(u({Q}_{\mathrm{lin}})\) was calculated by polynomial regression using the least squares method. \(u({Q}_{et})\) was calculated considering the uncertainty on standard purity (given by the supplier’s certificates) and the weighing uncertainty due to preparations and dilutions. \(u({m}_{i})\), related to all weighed masses, was given by the latest calibration certificate of each scale used. The uncertainty contribution of the working solution of isotope-labelled compounds was not included since the same solution was used for the preparation of the calibration samples and spiked samples. \(u({f}_{\mathrm{prec}})\) is the standard deviation of the independently measured concentrations of the same theoretical concentration in intermediate precision conditions (one operator, seven different days (1 day/matrix investigated), preparation of two replicates in repeatability conditions for each level of concentration). \(u({f}_{\mathrm{syst}})\) is derived from the independent recoveries, under intermediate precision conditions, of the same known quantity of a given compound added to the sample (Eq. 6):

$$\mathrm{u }({f}_{\mathrm{syst}})=\sqrt{{b}_{{\mathrm{rm}}{\mathrm{s}}}^{2}+{u}_{\mathrm{add}}^{2}}$$
(6)

where brms is the root mean square of the deviation from a 100% recovery (%), uadd is the uncertainty in the concentration of the added compound (%), and uadd (%) is obtained from the following equation (Eq. 7):

$${\mathrm u}_{\mathrm {add}} = \sqrt{{\mathrm u}_{\mathrm v}^{2} + {\mathrm u}_{\mathrm {conc}}^{2}}$$
(7)

where uv is the relative uncertainty component of the volume added (with a gravimetric control) (%) and uconc is the relative uncertainty component of the concentration of the spiking solution (%).
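Eq. 6 and Eq. 7 can be illustrated with the short Python sketch below. The function names and the numerical values in the usage note are hypothetical; only the formulas are taken from the text.

```python
from math import sqrt


def rms_bias(recoveries_percent):
    """Root mean square of the deviations from a 100% recovery (b_rms, in %)."""
    deviations = [r - 100.0 for r in recoveries_percent]
    return sqrt(sum(d * d for d in deviations) / len(deviations))


def u_add(u_v, u_conc):
    """Eq. 7: relative uncertainty (%) of the concentration of the added compound."""
    return sqrt(u_v ** 2 + u_conc ** 2)


def u_f_syst(recoveries_percent, u_v, u_conc):
    """Eq. 6: combine b_rms and u_add in quadrature (result in %)."""
    return sqrt(rms_bias(recoveries_percent) ** 2 + u_add(u_v, u_conc) ** 2)
```

For two illustrative recoveries of 97% and 103% (b_rms of 3%), with u_v = 2.4% and u_conc = 3.2% (u_add of 4%), the result is 5%.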

Finally, the combined standard uncertainty was calculated following the general law of propagation of uncertainty using Wincert® software (version 3.13.0311.0026).
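To make the propagation step concrete, the sketch below applies the GUM law of propagation of uncertainty to the model of Eq. 5 using numerical sensitivity coefficients. It is not the Wincert® calculation used in this study: it assumes uncorrelated input quantities, estimates partial derivatives by central finite differences, and uses hypothetical names and input values throughout.

```python
from math import sqrt


def concentration(x):
    """Eq. 5, with the input quantities supplied in a dict."""
    return (x["C_star"] * x["Q_lin"] * x["Q_et"] * (x["m2"] - x["m1"])
            / ((x["m3"] - x["m4"]) / 1000.0)
            + x["f_prec"] + x["f_syst"])


def combined_standard_uncertainty(model, values, uncertainties, rel_step=1e-6):
    """u_c = sqrt(sum((df/dx_i * u(x_i))^2)), assuming uncorrelated inputs."""
    total = 0.0
    for name, u in uncertainties.items():
        step = rel_step * max(abs(values[name]), 1.0)
        up = dict(values)
        up[name] += step
        down = dict(values)
        down[name] -= step
        # Central finite difference as the sensitivity coefficient.
        sensitivity = (model(up) - model(down)) / (2.0 * step)
        total += (sensitivity * u) ** 2
    return sqrt(total)


# Hypothetical inputs: masses in g, C_star in ng/g, f_prec and f_syst in ng/L.
values = {"C_star": 10.0, "Q_lin": 0.5, "Q_et": 1.0,
          "m1": 500.1, "m2": 500.0, "m3": 500.0, "m4": 1500.0,
          "f_prec": 0.0, "f_syst": 0.0}
uncertainties = {"f_prec": 0.03, "f_syst": 0.04}  # only two contributions, for brevity

u_c = combined_standard_uncertainty(concentration, values, uncertainties)
expanded_U = 2.0 * u_c  # coverage factor k = 2
```

In a complete budget, every input of Eq. 5 would carry its own standard uncertainty; the two-term example is kept deliberately small so the quadrature sum is easy to follow.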

Application to environmental monitoring

A sampling campaign was conducted in the Walloon part of the Meuse District in Belgium at four locations in July and December 2020: the Meuse, the Ourthe, the Lesse, and the Sambre rivers (Fig. 1). These monitoring stations were selected for their representativeness of different pressures: livestock farming and urban pressure.

Environmental grab samples were collected in amber glass bottles previously calcined at 450 °C for 4 h. Samples were transported and stored at 4 ± 3 °C. Sample preparation (see the “Measurement procedure” section) was performed within 24 to 48 h after sampling.

Results and discussion

Method validation

Calibration model

The bias of the back-calculated concentrations of calibration samples was considered satisfactory for 16 compounds out of 21 for the seven analytical series performed under intermediate precision conditions. For five compounds, a higher bias was observed. For E3, differences of 22 and 25% were observed at the lowest-concentration calibration sample (E3 concentration of 0.4 ng mL−1). Similar observations were made for 21HPT, COR, and LEV with a bias of 22, 24, and 30%, respectively, at the lowest-concentration calibration sample (concentration of 0.3, 1.0, and 3.0 ng mL−1, respectively). Lastly, for DES, a difference of 25% was observed at a concentration of 5.0 ng mL−1.

These biases may originate from the lack of isotope-labelled compounds for E3, COR, and 21HPT. The quality of labelling may also contribute to these biases.

Establishment of measurement traceability to the SI units

A bias of less than 5% was obtained when standard solutions of high-purity CRM (17βE2, 17HPT, COL, and T) were used, demonstrating the trueness of the calibration. The detailed data are summarized in Table 1. Considering the currently available technical resources, this result was considered satisfactory, even though metrological traceability should be established for every target compound as soon as possible to guarantee the quality and comparability of data.

Table 1 Analysis of high-purity CRM standard solutions

Method accuracy

The accuracy profiles, i.e., graphical representations of accuracy as a function of concentration, starting at the LQ, are shown in Fig. 3 for the 21 target compounds of this study.

Fig. 3 Accuracy at different levels of concentrations for the 21 target compounds of this study

Estrogens

Validated LQs ranged between 0.035 ng L−1 for 17αEE2 and 0.5 ng L−1 for DES. For 17βE2, E1, 17αE2, and E3, an LQ of 0.1 ng L−1 was validated. The confirmation of the identification according to the four criteria detailed in the “Quality controls” section was achieved for all estrogens at the LQ level.

Nonetheless, all target estrogens, except DES, were quantified or detected in non-spiked samples from the Saône and Yvette Rivers and the Créteil Lake. Data were not considered when natural levels were close to or higher than a specific spiking level. These observations highlight one of the struggles in the validation of analytical methods using real matrices: the difficulty of finding matrices exempt from target compounds. This is particularly problematic for the estimation and validation of analytical performance at trace levels, as is the case in this study. Indeed, it is virtually impossible to assess the LQ when natural occurrence levels are higher than or close to the LQ. Finally, for 17βE2, the LQ of the method is adequate, considering that the WFD Annual Average Environmental Quality Standard (AA-EQS) in surface waters was lowered to 0.18 ng L−1 in November 2021 [36]. For 17αEE2, the AA-EQS was reduced to 0.017 ng L−1 in November 2021; the validated LQ of 0.035 ng L−1 does not reach this value but remains a challenging level to achieve. For E1, the 3.6 ng L−1 Predicted No-Effect Concentration (PNEC) is roughly 36 times higher than the validated LQ of 0.1 ng L−1, demonstrating that the method is suitable for the characterization of the aquatic environment. Glineur et al. and Lardy-Fontan et al. have developed methods for the quantification of traces of estrogens in the total fraction of raw surface waters. Lardy-Fontan et al., with an approach similar to the one used in this study, reported a higher LQ of 0.4 ng L−1 for E1 and 17βE2 and a higher LQ of 0.1 ng L−1 for 17αEE2 [10]. Glineur et al. reported lower LQs of 0.021, 0.053, 0.028, and 0.036 ng L−1 for 17αEE2, E1, 17βE2, and E3, respectively; however, they used a signal-to-noise approach to estimate the LQ, and only estrogens were targeted in their method [37].

As shown in Fig. 3, the accuracy of the present method was found to be within the 35% tolerance at the higher concentration levels for the three WFD-regulated estrogens 17αEE2, 17βE2, and E1, as well as for other estrogens (17αE2 and E3). For DES, the 35% tolerance criterion could not be met, and the tolerance had to be extended to 40%.

Lardy-Fontan et al., with an approach similar to that of this study, demonstrated the accuracy of their method using Evian® mineral water and Oise river water, with a 30% tolerance for E1 and 17βE2 at 1.2 ng L−1 and 3.6 ng L−1, respectively, and with a 40% tolerance for 17αEE2 at 0.4 ng L−1 [10]. Glineur et al. fixed a tolerance of 40% and applied the same approach to analyze estrogens in the total fraction of seven natural surface waters. The accuracy was finally demonstrated by Glineur et al. at 0.5 ng L−1 and 2.0 ng L−1 for 17βE2, 17αEE2, and E3 and at 2.0 ng L−1 and 8.0 ng L−1 for E1 [37]. Nonetheless, these two studies only targeted estrogens, contrary to the present work in which four families of steroid hormones are targeted.

Consequently, the present study’s results were considered satisfactory in light of data in the literature, WFD regulation requirements, and the need for knowledge on DES, whose environmental occurrence is poorly documented to date.

Androgens

Validated LQs were 0.5 and 1.0 ng L−1 for T and AD, respectively. At these levels of concentration, the confirmation of the identification of T and AD was possible according to the criteria described in the “Quality controls” section. These LQs are relevant for environmental monitoring considering the 100 and 14 ng L−1 PNEC found in the literature for T and AD, respectively [38].

As shown in Fig. 3, the accuracy of the method was demonstrated with a 35% tolerance at higher levels of concentration for AD and T. To the authors’ knowledge, there is no other study in which androgens were investigated in whole water samples.

Progestogens

As shown in Fig. 3, validated LQs ranged between 0.25 ng L−1 for 21HPT and DRO and 5.0 ng L−1 for NOR. For 17HPT, a LQ of 0.5 ng L−1 was validated. For Ac CYP, CYP, and LEV, a LQ of 1.0 ng L−1 was reached. For Ac CHLOR, a LQ of 2.5 ng L−1 was validated. The confirmation of the identification of all progestogens was possible at the LQ level. For DRO, the validated LQ is relevant considering the 2.0 ng L−1 PNEC [38]. However, for the other target compounds, to date, no PNEC are available, and complementary information on ecotoxicity is still needed to confirm the relevance of these LQs for environmental monitoring.

Shen et al. developed a method for the simultaneous quantification of 61 progestogens in filtered river water, and reported LQs ranging between 0.03 and 0.4 ng L−1 with a signal-to-noise approach [39]. These LQs are lower than in the present work. Nonetheless, it should be pointed out that this method only targeted one class of steroid hormones in filtered samples. To the authors' knowledge, there is no other study in which progestogens were investigated in whole water samples.

For 17HPT, 21HPT, LEV, and DRO, the accuracy of the present method met the 35% tolerance at high concentration levels. With a tolerance of 40%, the accuracy of the method was verified for NOR but not for Ac CYP, Ac CHLOR, and CYP. The present method gives semi-quantitative information on the occurrence of these scarcely documented compounds. Overall, for these compounds, the performances of the method could be improved by implementing isotope dilution, which is a powerful tool for compensating possible losses during sample preparation and/or alteration of the signal during sample analysis [24].

Glucocorticoids

Validated LQs ranged between 0.1 ng L−1 for BET and DEX and 0.5 ng L−1 for COL, COR, and PRED. Confirmation of the identification was impossible for GLUCO because no signal was detected for the confirmation transition, which is about ten times lower than the quantification one. However, the confirmation of the identification of GLUCO was possible at concentrations from 10 ng L−1.

Validated LQs were found to be relevant for PRED and COL, considering their respective 230 and 2000 ng L−1 PNEC [38]. However, more information is needed for other glucocorticoids.

With a signal-to-noise approach, Shen et al. reported lower LQs ranging between 0.01 and 0.13 ng L−1 for the determination of 68 glucocorticoids in filtered river water, but, unlike in the present work, no other class of steroid hormones was investigated in that study [40]. To the authors' knowledge, there is no other study in which glucocorticoids were investigated in whole water samples.

For BET, COR, DEX, and PRED, the accuracy of the present method was demonstrated with a 35% tolerance. Considering a higher tolerance of 40%, the accuracy of the method was verified for COL. The performances of the method were considered satisfactory. Particularly, for COL, performances could be improved by implementing isotope dilution.

In the literature, only a few multi-class and ultra-trace methods were found to give a reliable quantification of steroid hormones in river waters. A method targeting all four investigated classes of compounds, reaching the sub-ng L−1 range in the dissolved fraction of surface waters for T, COL, COR, and NOR with lower LQ at approximately 0.3 ng L−1 and higher levels for all the other compounds investigated, was documented [41]. With a signal-to-noise approach in filtered surface water, Goeury et al. reported a LQ of 0.30 ng L−1 for E1, 17αE2, 17βE2, and E3; of 1.50 ng L−1 for 17αEE2, LEV, and NOR; and of 0.45 ng L−1 for AD and T [42]. It should nonetheless be pointed out that comparing method LQs is a rather complex task since their evaluation has rarely been described in detail (in terms of the matrix used and the statistical approach chosen for data treatment). Moreover, the LQ is often only estimated considering the dissolved fraction of compounds. Furthermore, to the authors' knowledge, there is no other study in which the estimation and validation of LQ were conducted in matrices with such contrasted physicochemical properties as in the present work, even though this methodology allows a more robust assessment of method performances. Indeed, Tavazzi et al. chose Milli-Q water [43], and other authors such as Gong et al. or Goeury et al. chose a single natural matrix [13, 43] for the validation of LQ. The same observations were drawn for the accuracy of analytical methods. These discrepancies in methodology underline once again the difficulty of comparing method performances.

Measurement uncertainty

Expanded measurement uncertainties (k = 2) estimated with the GUM approach are given in Table 2 for each target compound. These values were obtained under operational conditions in intermediate precision conditions (one operator, seven different days (1 day/matrix investigated), preparation of two replicates in repeatability conditions for each concentration level).

Table 2 Expanded measurement uncertainty (k = 2) for 21 target compounds

As shown in Table 2, at the LQ level, uncertainties ranged between 58 and 75% for WFD-regulated estrogens. At 1 ng L−1, uncertainties were reduced to values ranging between 14 and 25% for these compounds. Similar observations were found for 17αE2 and E3. For DES, uncertainties were higher than those for other estrogens, with a value of 45% at 5.0 ng L−1. This is consistent with the performances of the method in terms of accuracy, detailed in the “Method accuracy” section. Moreover, the estimated uncertainty is, for E1, in line with the requirements of the QA/QC Directive (2009/90/EC) [44], which set a maximal acceptable uncertainty of 50% at its AA-EQS (PNEC of 3.6 ng L−1). Concerning 17αEE2 and 17βE2, this prerequisite was not met considering their AA-EQS, updated in November 2021 (lowered to 0.017 and 0.18 ng L−1, respectively). Nonetheless, for 17αEE2 at the challenging concentration of 0.05 ng L−1, uncertainty was only 36%. For 17βE2, uncertainty was only 23% at 0.4 ng L−1. Other steroid hormones showed the same trends, with uncertainties ranging between 27 and 66% at the LQ and between 16 and 57% at higher concentrations. Uncertainties were generally higher for compounds for which accuracy was not verified with a 35% tolerance (Ac CHLOR, Ac CYP, CYP, NOR, and COL).

In most cases, the main contributions to the global uncertainty were the intermediate precision and variability of the bias. However, for some compounds (17αEE2 and Ac CHLOR, for example), the calibration model was also found to contribute to the global uncertainty (see Supplemental Information (Fig. S1)). These results highlighted the advantage of using the GUM approach as it allows a close evaluation of measurement uncertainties thanks to mathematical modelling compared to other approaches like the ISO 11352 one, where only the main sources of uncertainty are considered (empirical approach) [45]. Considering the targeted concentration levels and the variety of target compounds investigated, uncertainties were considered satisfactory.
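To make the notion of contributions to the global uncertainty more concrete, the sketch below combines relative standard uncertainties in quadrature and expands the result with a coverage factor k = 2, which is the usual GUM treatment for uncorrelated input quantities. It is only an illustration: the assumption of uncorrelated components, the function name, and the numerical values are hypothetical and are not taken from Table 2 or from the uncertainty budget actually used in this work.

```python
import math


def expanded_uncertainty(components_pct, k=2.0):
    """Combine uncorrelated relative standard uncertainties (in %) in quadrature.

    Returns the expanded uncertainty (in %) and the share of each
    component in the combined variance.
    """
    variance = sum(u ** 2 for u in components_pct.values())
    shares = {name: u ** 2 / variance for name, u in components_pct.items()}
    return k * math.sqrt(variance), shares


# Hypothetical relative standard uncertainties (%), for illustration only.
components = {
    "intermediate_precision": 8.0,
    "bias_variability": 6.0,
    "calibration_model": 0.0,
}
U, shares = expanded_uncertainty(components)
print(U)       # 20.0
print(shares)  # intermediate precision 0.64, bias variability 0.36, calibration 0.0
```

Because the components add as variances, the largest term dominates the budget: here intermediate precision accounts for 64% of the combined variance although its standard uncertainty is only a third larger than that of the bias.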

To the authors' knowledge, no published method allows uncertainties to be compared for the investigated compounds under comparable conditions. Tavazzi et al. described a method for the characterization of the three WFD-regulated estrogens in the total fraction of surface waters. Using the GUM approach, expanded uncertainties (k = 2) of 12% for 17αEE2 at 0.05 ng L−1 and 6 and 16% at 0.3 ng L−1 for 17βE2 and E1, respectively, were reported. However, these very low expanded uncertainties were estimated under intermediate precision conditions using ultrapure water (Milli-Q®) [43] and not natural waters, as was the case in the present study. Lardy-Fontan et al. used the top-down approach described in ISO 11352:2012 to assess the measurement uncertainty for the analysis of these three compounds in the total fraction of surface waters (Oise River with SPM > 100 mg L−1 and DOC > 5 mg L−1). The reported expanded uncertainty (k = 2) was 50% for 17αEE2 at 0.1 ng L−1 and 35% for 17βE2 and E1 at 0.4 ng L−1 [10].

Considering the few available data in the literature and WFD regulation requirements, the present method showed performances in terms of LQ, accuracy, and uncertainty in real matrices that were considered satisfactory and sufficient for its intended use. Therefore, the developed method offers the potential of becoming a candidate reference method since it shows the best possible performances in the light of the state of the art of available technical resources. Indeed, the method aims at bringing reliable, comparable, and traceable data. Finally, the developed method was implemented in a monitoring survey.

Fitness for purpose for environmental monitoring

Table 3 presents an overview of measured concentrations in samples collected during two campaigns in July and December 2020 on Belgian rivers: the Meuse, the Ourthe, the Lesse, and the Sambre Rivers (Fig. 1). Chromatograms of the detected compounds in the Sambre River during the campaign of December 2020 are available in Supplementary Information (Fig. S2).

Table 3 Measured concentrations of target compounds in Belgian waters

Only five estrogens (17αEE2, E3, 17αE2, 17βE2, and E1) and three glucocorticoids (BET, COL, and COR) were detected in the samples.

The identification of these estrogens was confirmed by the four criteria mentioned in the “Quality controls” section. Nonetheless, it was impossible to confirm the identification of glucocorticoid compounds as no signal was detected for the confirmation transition. Consequently, reported concentrations for these compounds are given only as indications. E1 was the only compound detected in all samples and was found at concentrations ranging between 0.19 ± 0.11 and 1.78 ± 0.39 ng L−1. COL and E3 were detected in at least three samples in each campaign. 17αEE2 was detected in one sample at a very low concentration. Among the few comparable data available, Lardy-Fontan et al. reported concentrations in whole water ranging between 0.4 and 7.71 ng L−1 for E1, between 0.4 and 1.30 ng L−1 for 17βE2, and between 0.1 and 0.23 ng L−1 for 17αEE2, in French surface waters [10]. In the Sambre River, Glineur et al. quantified E1 at 0.67 ng L−1 and only detected 17βE2; E3 and 17αEE2 were not detected [22]. Even if the occurrence of 17αEE2, 17βE2, and E1 has been extensively studied in the literature [46], most published methods only target the dissolved fraction, and few data are available on whole water [37]. Moreover, very little data on the occurrence of GLUCO, PROG, or ANDRO are available in the total fraction of inland European waters. T, E1, 17βE2, 17αEE2, 17HPT, COR, COL, and DEX were detected in surface waters in Northern Italy at levels ranging between 2 and 3 ng L−1 for glucocorticoids and 76 ng L−1 for E1 [47]. T and AD were detected in drinking water collected in Pécs, Hungary, at 0.013 and 0.014 ng L−1, respectively [48]. It is nonetheless difficult to put the data from the present study into perspective with other published works, notably because of the issue of the representativeness of samples [3, 49].

The concentrations reported in this study for 17βE2, E1, and COL were lower than their respective AA-EQS or PNEC. On this basis, a good status of the investigated rivers can be derived for these compounds. However, traces of 17αEE2 were detected at a concentration above the recently updated AA-EQS (0.017 ng L−1 since November 2021), which is lower than the LQ validated in this study (0.035 ng L−1) [36]. Therefore, further work is needed to achieve a LQ that meets the WFD requirement. As no PNEC or EQS exist to date for the other compounds, the potential risk to the environment could not be assessed. In the near future, closer attention should be paid to LEV and NOR since they are good candidates for inclusion in the next watchlist.
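The reasoning applied above, in which a result is compared with a threshold (AA-EQS or PNEC) and no conclusion is possible when the LQ itself exceeds the threshold, can be sketched as follows. This is a hedged illustration rather than the assessment procedure of the WFD: the function name and the returned labels are hypothetical, and only the thresholds, the 17αEE2 LQ, and the highest E1 concentration are taken from the text.

```python
def assess(measured, threshold, lq):
    """Classify one result against a threshold (all values in ng/L).

    `measured` is None when the compound was not quantified.
    """
    if measured is not None:
        return "above threshold" if measured > threshold else "below threshold"
    # Not quantified: compliance can only be concluded when the LQ
    # itself is at or below the threshold.
    return "below threshold" if lq <= threshold else "inconclusive (LQ above threshold)"


# E1: highest concentration reported in this study against the 3.6 ng/L PNEC.
# The LQ value is a hypothetical placeholder; it is unused when a value is measured.
print(assess(1.78, 3.6, lq=0.05))     # below threshold
# 17αEE2 in a sample where it is not quantified: LQ 0.035 ng/L, AA-EQS 0.017 ng/L.
print(assess(None, 0.017, lq=0.035))  # inconclusive (LQ above threshold)
```

The second call shows why a lower LQ is needed for 17αEE2: as long as the LQ remains above the AA-EQS, a non-quantified result cannot demonstrate compliance.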

This study provided information on the occurrence of GLUCO in European rivers. Further monitoring of these compounds is, however, needed to gather more data for their environmental risk assessment.

Conclusions

For the first time, an isotope dilution two-step SPE-LC-MS/MS method for the characterization of natural and synthetic hormones of four classes (ANDRO, GLUCO, ESTRO, PROG) in the total fraction of surface waters was validated according to current French and European standards (NF T90:210 and XP CEN/TS 16800) in representative matrices and proved to be robust and sensitive.

With regard to existing ecotoxicity data (PNEC) and WFD regulation requirements, the performances of the method were considered satisfactory overall in terms of LQ and accuracy. However, only semi-quantitative data could be obtained for six compounds (DES, Ac CHLOR, Ac CYP, CYP, NOR, and COL), bringing a first level of information on these scarcely documented compounds in the context of environmental monitoring. The method performances might be improved through a better purification step or the systematic use of isotope dilution, for example.

Even if very rarely addressed in published scientific papers, the metrological traceability and the estimation of measurement uncertainty are mandatory to ensure data quality, efficient data interpretation, and comparability of measurement results within laboratories and, more generally, across Europe. Therefore, these fundamental metrological aspects were carefully and thoroughly investigated in the present study. Measurement traceability to the SI units was established using available CRMs. Measurement uncertainties were fully evaluated following the “Guide to the expression of Uncertainty in Measurement” (GUM) under intermediate precision conditions and not under repeatability conditions, as is too often the case in published works.

In this study, a detailed methodological validation rarely seen in literature was conducted based on stringent quality controls to analyze these four classes of hormones in surface waters. It also underlined the importance of implementing method validation using representative matrices and the complexity of reaching confident and accurate data at trace levels for such a wide range of target compounds.

The validated method was then successfully applied to environmental monitoring in Belgium. Occurrence of some ESTRO and GLUCO was observed in the sampled Belgian rivers, underlining the need to continue tracking these substances in the aquatic environment to better understand their fate, occurrence, and ecotoxicity, but also to support risk assessment, monitor prioritization, and address the scientific need of characterization for target compounds in the total fraction of surface waters, particularly for non-estrogenic compounds.