Introduction

Solvents are widely used in many sectors of industry and everyday life, for example in detergents, agrochemicals, cosmetics, pharmaceuticals, paints, varnishes, and inks. Since 2007, their use has been increasingly supervised by the European regulation on the Registration, Evaluation, Authorization and restriction of Chemicals (REACH) [1]. Solvents must therefore be registered with the European Chemicals Agency (ECHA, http://echa.europa.eu/), which requires, in particular, ecotoxicological information such as acute aquatic toxicity at different trophic levels (fish, invertebrates, and algae). Acute aquatic toxicity is widely evaluated through EC50 values, which represent the concentration of a chemical producing an effect on 50 % of the tested population within a given test period [2]. The effect is death for fish, immobility for invertebrates, and growth inhibition for algae over 96 h, 48 h, and 72 h, respectively. To limit in vivo testing, REACH and other organizations such as the OECD [3] advocate alternative methods such as quantitative structure-activity relationships (QSARs). QSARs are mathematical models relating the physico-chemical properties of molecules to their toxicity. After validation, QSARs can be applied to other molecules to predict their activity on the basis that similar chemicals have similar activities [4, 5]. Many QSAR models predict EC50 for fish [6, 7], invertebrates [8, 9], and algae [10, 11]. Various descriptors are involved in such models, such as LogP [12, 13], hardness [14], HOMO energy [11, 15], or the Balaban index [8]. QSARs are generally dedicated to specific chemical families, for example benzoic acids [13, 16], alcohols [17], benzenes [18], hydrocarbons [19], or phenols [20], and may or may not take into account the toxic action mechanism involved [10]. Only a few general models have been developed to predict invertebrate or algae EC50 with physico-chemical and theoretical parameters for a large set of chemicals [12, 14, 15, 21, 22]. Other models based on fragment methods also exist [23–25]. However, to our knowledge, no QSAR has been developed for a large dataset of organic solvents.

In a previous study [26], a 4-parameter QSAR was developed to predict the fish LC50 of organic solvents from the octanol-water partition coefficient, LUMO energy, surface tension, and dielectric constant, regardless of the toxic action mode. The purpose of the present study is to complete this previous work by studying the EC50 prediction of organic solvents for the two other trophic levels, namely invertebrates and algae. Indeed, the knowledge of EC50 values for fish, invertebrates, and algae allows the determination of the aquatic predicted no-effect concentration (PNEC-aquatic), which represents the concentration below which exposure to a chemical causes no adverse effects to species in the environment [2]. PNEC-aquatic is of major relevance for environmental risk assessment.

Here, three strategies (random selection, EC50-based selection, and a space-filling approach) were applied to select the solvents of the training sets used for QSAR development. The predictive performance of the QSAR models was compared with that of the LogP-based relations available in the ECOSAR program (http://www.epa.gov/oppt/newchems/tools/21ecosar.htm).

Material and methods

Data set

The experimental acute toxicity was expressed as pEC50 = log(EC50) for both the invertebrate and algae trophic levels, which are representative for the ecotoxicological evaluation of industrial chemicals. EC50 denotes the concentration in mmol/L producing 50 % immobilization of invertebrates or 50 % growth inhibition of the algae population within a test period of 48 h and 72 h, respectively [2]. For each trophic level, different species were considered: Daphnia magna for invertebrates, Scenedesmus subspicatus and Selenastrum capricornutum for algae. EC50 values were collected from the INERIS (http://www.ineris.fr/), ESIS (http://ecb.jrc.ec.europa.eu/esis/), and ECHA (http://echa.europa.eu/) databases to ensure good reliability of the data. When several reliable experimental values were available, the geometric mean was used. The final database includes 154 chemically heterogeneous solvents: 122 solvents were described with invertebrate EC50 values, 75 solvents with algae EC50 values, and 141 with fish LC50 values (see Table 1). The trophic level of the acute toxicity data found in the literature is specified for each solvent. The pEC50 ranged from −2.67 to 2.73 and from −1.72 to 2.91 for the invertebrate and algae trophic levels, respectively. Chemicals may be categorized for the three trophic levels as (1) very toxic (EC/LC50 < 1 mg.L−1), (2) toxic (EC/LC50 < 10 mg.L−1), (3) harmful (EC/LC50 < 100 mg.L−1), or (4) not harmful (EC/LC50 > 100 mg.L−1) (see Table 1).
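The data-handling rules described above (geometric mean of replicate values, pEC50 computed from a concentration in mmol/L, and class thresholds in mg/L) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the treatment of a value of exactly 100 mg/L is an assumption, since the class definitions leave that boundary unspecified.

```python
import math

def geometric_mean(values):
    """Geometric mean of several reliable EC50 values given in the same unit."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def log_ec50_mmol(ec50_mg_per_l, molar_mass_g_per_mol):
    """pEC50 as defined in the text: log10 of EC50 expressed in mmol/L.

    Dividing mg/L by g/mol gives mmol/L.
    """
    return math.log10(ec50_mg_per_l / molar_mass_g_per_mol)

def toxicity_class(ec50_mg_per_l):
    """Ecotoxicological class (1 to 4) from an EC50 or LC50 in mg/L."""
    if ec50_mg_per_l < 1:
        return 1  # very toxic
    if ec50_mg_per_l < 10:
        return 2  # toxic
    if ec50_mg_per_l < 100:
        return 3  # harmful
    return 4      # not harmful (assumption: exactly 100 mg/L falls here)
```

With this sign convention, a low pEC50 corresponds to a high toxicity.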

Table 1 List of the organic solvents in the dataset with the corresponding type of acute toxicity data (F for fish, I for invertebrates and A for algae) and the ecotoxicological class for the three trophic levels assigned according to the European Commission (EC, 1991): (1) very toxic (LC50 < 1 mg.L-1), (2) toxic (LC50 < 10 mg.L-1), (3) harmful (LC50 < 100 mg.L-1), (4) not harmful (LC50 > 100 mg.L-1)

Solvent descriptors

Both physico-chemical and quantum theoretical descriptors were used for solvent description. These 33 descriptors were described in detail in a previous paper [26]. Briefly, the physico-chemical descriptors used are those classically employed for both solvent classification [27, 28] and toxicity QSARs [8, 13]: octanol/water partition coefficient (LogP), molecular weight (Mw), boiling point (bp), density (d), molar volume (Vm), dipole moment (μ), dielectric constant (ε), refractive index (nr), surface tension (γ), vapor pressure (Pvap), Hildebrand parameter (δ), and Hansen solubility parameters (δd, δp, and δh). Descriptor values were found in several databases [29–33] (https://reaxys.com; https://scifinder.cas.org).

The ecotoxicity behavior of chemicals may stem from chemical reactivity and selectivity. Therefore, relevant quantum descriptors were included; they were computed by density functional theory (DFT) after geometry optimization with ADF software (http://www.scm.com). Descriptor calculations were performed with the Perdew, Burke, and Ernzerhof (PBE) [34] generalized gradient approximation (GGA) exchange-correlation functional and a triple zeta (TZP) basis set. These theoretical descriptors were the energy of the highest occupied molecular orbital (HOMO), the energy of the lowest unoccupied molecular orbital (LUMO), the hardness (LUMO − HOMO), the electronegativity ((HOMO + LUMO)/2), the polarisability (α), the maximal (qmax) and minimal (qmin) atomic Mulliken charges, the maximal (Vmax) and minimal (Vmin) electrostatic potential values, the surface area (Surf), the molecular volume (Vol), and the molecule’s ovality (Ov). Three further descriptors were added, related to the electrostatic potential (ESP) computed between −3 eV and +3 eV on the solvent accessible surface around the molecule: the surface Sneg with ESP lower than −0.1 eV, the surface S with ESP ranging from −0.1 eV up to 0.1 eV, and the surface Spos with ESP larger than 0.1 eV. S represents hydrophobic regions while Sneg and Spos represent hydrophilic ones, related to Lewis base or Lewis acid character. Topological descriptors such as the Wiener [35], Balaban [36], Randic [37], and Kier [38] indices were also calculated using ProChemist software (http://pro.chemist.online.fr/).

Model development procedure

Descriptor selection

QSAR models were developed by multiple linear regression (MLR) using the ordinary least squares method. The use of this simple learning method makes such models easy to interpret and to apply. MLR calculations were carried out using the enhanced replacement method (ERM) with Matlab 7.9 software (QSAR/QSPR search algorithms Toolbox; www.mathworks.fr/products/matlab/). The ERM algorithm requires fewer linear regressions than the time-consuming full search method while giving identical results [39]. Descriptor selection was performed with the Kubinyi function (FIT) [40], expressed as:

$$ FIT=\frac{R^2\left(N-d-1\right)}{\left(N+{d}^2\right)\left(1-{R}^2\right)} $$
(1)

where R2 is the determination coefficient, d the number of descriptors selected in the model, and N the number of solvents in the training set. The optimal number of descriptors dopt corresponds to the maximum value of FIT in the plot of FIT vs d. The FIT statistical parameter is preferred to the Fisher ratio F, which is too sensitive to changes in small d values and poorly sensitive to changes in large d values. The choice of the descriptors was confirmed by Student’s t-test at a confidence level of 95 %.
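As an illustration of Eq. 1, the following minimal Python sketch evaluates FIT and picks the number of descriptors that maximizes it. The function names and the R2 values in the usage note are hypothetical and serve only to show the selection rule; they are not results of this study.

```python
def kubinyi_fit(r2, n, d):
    """Kubinyi FIT criterion (Eq. 1) for a model with d descriptors and n solvents."""
    return r2 * (n - d - 1) / ((n + d ** 2) * (1.0 - r2))

def optimal_descriptor_count(r2_by_d, n):
    """Return the d whose FIT value is the largest.

    r2_by_d maps a number of descriptors d to the best R2 found for that d.
    """
    return max(r2_by_d, key=lambda d: kubinyi_fit(r2_by_d[d], n, d))
```

For example, with the made-up values `{1: 0.55, 2: 0.65, 3: 0.70, 4: 0.74, 5: 0.75, 6: 0.755}` and `n = 92`, FIT peaks at d = 4: beyond that point the small gain in R2 no longer offsets the penalty on d.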

Model validation

All the models developed were evaluated through the determination coefficient R2, the adjusted determination coefficient \( {\mathrm{R}}_{\mathrm{a}}^2 \), and the mean absolute error or mean residual (MAE). The predictive power and robustness of the models developed were assessed by internal and external validation techniques.

Fivefold cross-validation was employed for model internal validation. The training set was randomly divided into five subsets of approximately equal size. Four subsets were used as the training set and the last one as the test set. This procedure was repeated five times so that every subset is selected as a test set once. The squared cross-validated correlation coefficient \( {\mathrm{R}}_{\mathrm{CV}}^2 \) was computed.
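The fivefold procedure described above can be sketched in Python with NumPy as follows. This is a sketch under stated assumptions rather than the implementation used in this work: the function name is hypothetical, the model is an ordinary least squares fit with an intercept, and the cross-validated coefficient is assumed to be computed as one minus the ratio of the pooled squared prediction errors to the total sum of squares.

```python
import numpy as np

def r2_cv_fivefold(X, y, n_folds=5, seed=0):
    """Cross-validated R2 of an OLS model (assumed form: 1 - PRESS / SS_total)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    # Random partition of the training set into folds of nearly equal size
    folds = np.array_split(rng.permutation(n), n_folds)
    y_pred = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit on the four remaining folds (column of ones = intercept)
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Predict the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        y_pred[test_idx] = A_test @ coef
    press = np.sum((y - y_pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)
```

Each solvent is predicted exactly once by a model that did not see it during fitting.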

In order to check that the developed QSARs did not depend on a particular distribution of solvents, and in line with a previous work [26], three strategies were used to split the solvent dataset into training and test sets for external validation. Briefly, (1) a random selection of training and test sets was performed and repeated five times with a four-to-one size ratio. (2) A Y-based selection was used to obtain a good representation of all types of solvents in terms of ecotoxicity profile. This strategy ranked the solvents by ascending pEC50 and kept three out of four solvents in the training set. (3) To select a training set representative of the solvent space studied, a space-filling (SF) technique based on the minimax distance criterion was used [41] (cover.design function, fields package, R software (www.r-project.org/main.shtml/)). The profile of the candidate solvents was defined through the DFT descriptors as well as LogP, which is well known to be a relevant characteristic for explaining ecotoxicity. To reduce the dimensions of the solvent profile, principal component analysis (PCA) was first performed, and the solvents were described by their scores on the principal components explaining 90 % of the dataset variance before being screened with the SF algorithm. The solution set was determined after 20 runs for convergence and quality purposes.
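The Y-based selection (strategy 2) can be sketched in a few lines of Python. The function name is hypothetical, and the choice of sending every fourth solvent of the ranked list to the test set is an assumption: the text only states that three out of four solvents are kept for training.

```python
def y_based_split(pec50, test_every=4):
    """Split solvent indices into training and test sets by ranked pEC50.

    Solvents are sorted by ascending pEC50; every `test_every`-th solvent
    of the ranked list goes to the test set, the others to the training set.
    """
    order = sorted(range(len(pec50)), key=lambda i: pec50[i])
    train, test = [], []
    for rank, idx in enumerate(order):
        if rank % test_every == test_every - 1:
            test.append(idx)
        else:
            train.append(idx)
    return train, test
```

Because the test solvents are drawn at regular intervals along the ranked list, both sets span the full toxicity range.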

As reviewed by Chirico and Gramatica [42, 43], several external validation criteria may be used to assess QSAR predictivity and robustness: predictive squared correlation coefficients such as \( {Q}_{F1}^2 \) [44], \( {Q}_{F2}^2 \) [45], and \( {Q}_{F3}^2 \) [46, 47], and other criteria such as \( {r}_m^2 \) [48, 49], the Golbraikh-Tropsha method [50], or the concordance correlation coefficient [42, 51]. Here, the classical squared correlation coefficient \( {Q}_{F1}^2 \), which is advocated in the OECD guidelines [52], was used and is expressed as:

$$ {Q}_{F1}^2=1-\frac{{\displaystyle {\sum}_{i=1}^{n_{test}}}{\left({y}_i-{\widehat{y}}_i\right)}^2}{{\displaystyle {\sum}_{i=1}^{n_{test}}}{\left({y}_i-{\overline{y}}_{train}\right)}^2} $$
(2)

where \( {y}_i \) and \( {\widehat{y}}_i \) are the observed and predicted log(EC50), respectively, of the test set solvents, and \( {\overline{y}}_{train} \) is the mean observed log(EC50) of the training set solvents. To better assess the external predictivity of the developed QSARs, a second external validation criterion was also evaluated, the concordance correlation coefficient (CCC), expressed as:

$$ CCC=\frac{2{\displaystyle {\sum}_{i=1}^{n_{test}}}\left({y}_i-\overline{y}\right)\left({\widehat{y}}_i-\overline{\widehat{y}}\right)}{{\displaystyle {\sum}_{i=1}^{n_{test}}}{\left({y}_i-\overline{y}\right)}^2+{\displaystyle {\sum}_{i=1}^{n_{test}}}{\left({\widehat{y}}_i-\overline{\widehat{y}}\right)}^2+{n}_{test}{\left(\overline{y}-\overline{\widehat{y}}\right)}^2} $$
(3)

where \( \overline{y} \) and \( \overline{\widehat{y}} \) are the mean observed and the mean predicted log(EC50), respectively, of the test set solvents. The CCC criterion was chosen since Chirico and Gramatica [43] demonstrated its high reliability (compared with the other validation criteria) by studying the predictivity of QSARs developed from simulated data with different levels of bias. As recommended by these authors, the acceptance values used for \( {Q}_{F1}^2 \) and CCC were 0.70 and 0.85, respectively.
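The two external validation criteria (Eqs. 2 and 3) translate directly into code. The following Python sketch uses hypothetical function names and plain lists; it is given only to make the role of each term explicit.

```python
def q2_f1(y_obs, y_pred, y_train_mean):
    """Predictive squared correlation coefficient Q2_F1 (Eq. 2).

    y_obs and y_pred refer to the test set; y_train_mean is the mean
    observed response of the training set.
    """
    press = sum((y - yh) ** 2 for y, yh in zip(y_obs, y_pred))
    ss_ref = sum((y - y_train_mean) ** 2 for y in y_obs)
    return 1.0 - press / ss_ref

def ccc(y_obs, y_pred):
    """Concordance correlation coefficient (Eq. 3) on the test set."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    cross = sum((y - mean_obs) * (yh - mean_pred) for y, yh in zip(y_obs, y_pred))
    ss_obs = sum((y - mean_obs) ** 2 for y in y_obs)
    ss_pred = sum((yh - mean_pred) ** 2 for yh in y_pred)
    return 2.0 * cross / (ss_obs + ss_pred + n * (mean_obs - mean_pred) ** 2)

def externally_predictive(q2, ccc_value, q2_min=0.70, ccc_min=0.85):
    """Acceptance rule with the thresholds quoted in the text."""
    return q2 > q2_min and ccc_value > ccc_min
```

The last term of the CCC denominator penalizes a systematic shift between observed and predicted values: predictions that are perfectly correlated with the observations but offset by a constant give CCC < 1.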

Applicability domain

The developed QSARs also require the definition of the corresponding applicability domain (AD) in order to estimate the reliability of the prediction for a new molecule [53]. Only the predicted activities of compounds that fall within this domain may be considered reliable. The AD may be determined using several approaches [54]. Here, we used a common one based on the leverage value of each chemical [55], calculated as follows:

$$ {h}_i={x}_i^T{\left({X}^TX\right)}^{-1}{x}_i $$
(4)

where xi is the descriptor vector of the chemical i and X the model matrix derived from the training set descriptor values. A warning leverage h* is defined and expressed as \( {h}^{*}=\frac{3p}{n_{train}} \) where p is the number of model parameters [54]. A chemical belonging to the training set with both hi > h* and small standardized residuals (smaller than a value of 3 corresponding to 99 % of the normally distributed data) would reinforce the model; while high leverage compounds with large standardized residuals are expected to badly influence the model. The representation of the cross-validated standardized residuals vs the compound leverages is called the Williams plot and allows detecting both the response outliers and the structurally influential chemicals of a model.
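A minimal NumPy sketch of the leverage calculation (Eq. 4) and of the warning threshold is given below. The function names are hypothetical. Whether the model matrix X carries a column of ones for the intercept, and how p is counted accordingly, is left to the caller and is an assumption of this sketch.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i (Eq. 4) for each row of X_query."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pseudo-inverse guards against an ill-conditioned X^T X
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def warning_leverage(p, n_train):
    """Warning leverage h* = 3p / n_train."""
    return 3.0 * p / n_train

def in_applicability_domain(h, std_residual, h_star, residual_limit=3.0):
    """True if a chemical lies inside the Williams plot rectangle."""
    return h <= h_star and abs(std_residual) <= residual_limit
```

When `X_query` is the training matrix itself, the leverages sum to the number of columns of X, which provides a quick sanity check.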

ECOSAR

Several QSAR-based programs are available to predict the ecotoxicological risks associated with chemicals, such as ECOSAR, TOPKAT, DEREK, MCASE, and ADAPT [56, 57]. ECOSAR is a freely available program developed from experimental data by the United States Environmental Protection Agency (US EPA). Moreover, due to its good predictive performance [58, 59], ECOSAR is widely used for chemical risk assessment [10, 16, 58–62]. Thus, in this work, the QSARs implemented in ECOSAR that are based only on the LogP value and dedicated to neutral organics were used to predict invertebrate or algae pEC50. Neutral organic compounds correspond to alcohols, acetals, ketones, ethers, alkyl halides, aryl halides, aromatic hydrocarbons, halogenated aromatic hydrocarbons, halogenated aliphatic hydrocarbons, sulfides, and disulfides.

For invertebrates (Daphnia), the relation between EC50 and LogP was developed from 115 neutral organic compounds (R2 = 0.771):

$$ \mathrm{Log}\ 48\hbox{-} \mathrm{h}\ \mathrm{E}\mathrm{C}50\ \mathrm{Invertebrate}\ \left(\mathrm{mmol}/\mathrm{L}\right) = -0.8157\ \mathrm{logP} + 1.2695 $$
(5)

For algae, the relation between EC50 and LogP was developed from 51 neutral organic compounds (R2 = 0.596):

$$ \mathrm{Log}\ 96\hbox{-} \mathrm{h}\ \mathrm{E}\mathrm{C}50\ \mathrm{Algae}\ \left(\mathrm{mmol}/\mathrm{L}\right) = -0.6271\ \log\ \mathrm{P} + 0.5687 $$
(6)
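For reference, the two ECOSAR relations for neutral organics (Eqs. 5 and 6) can be written as the following Python functions. The function names are hypothetical; the coefficients are those quoted above.

```python
def ecosar_log_ec50_invertebrate(log_p):
    """Log 48-h EC50 (mmol/L) for Daphnia, ECOSAR neutral organics (Eq. 5)."""
    return -0.8157 * log_p + 1.2695

def ecosar_log_ec50_algae(log_p):
    """Log 96-h EC50 (mmol/L) for algae, ECOSAR neutral organics (Eq. 6)."""
    return -0.6271 * log_p + 0.5687
```

Both relations depend on LogP alone, so two solvents with the same LogP receive the same predicted toxicity regardless of their other properties.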

Results and discussion

Data set characterization

The three Hansen solubility parameters were used to summarize the chemical profile of the solvents for graphical representation. In order to observe the solvent space studied in a relatively reliable 2D representation, a PCA was performed using the Hansen descriptors as initial variables (Fig. 1a). As observed in Fig. 1b, PC1 was related to both the polarity and hydrogen bonding parameters, while PC2 reflected the dispersion component of the solubility parameter. The solvent dataset was chemically heterogeneous, as also shown by the large coverage of the Hansen solubility space in Fig. 1a. Moreover, this representation confirms that the whole dataset is well characterized by LC50 or EC50 values for each trophic level: fish [26], invertebrates, and algae. The chi-square test (significance level 0.05) indicated that the distributions of both invertebrate and algae EC50 (Fig. 2) conform closely to the normal distribution: the corresponding p-values were 0.401 and 0.175, respectively. Invertebrate log EC50 (48 h) ranged from −2.7 to 2.7, while algae log EC50 ranged from −1.7 to 2.9.

Fig. 1 Representation of the solvents characterized in the data set by fish LC50 (●), invertebrate EC50 (), and/or algae EC50 (○) in the PC1/PC2 score plot determined by PCA from the three Hansen solubility parameters (a) and the corresponding loading plot (b). PC1 explains 51.8 % of the variation and the second component PC2 34.4 %

Fig. 2 Histograms of both invertebrate and algae pEC50

Comparisons of toxicity of trophic levels

Figure 3 shows the toxicity correlation between each pair of trophic levels. As observed by many authors [21, 63, 64], the highest correlation (R2 = 0.777) was obtained between fish and Daphnia magna, reflecting their similar sensitivity to organic solvents (Fig. 3a). Aniline stood out since its toxicity is much higher toward invertebrates than toward fish (pEC50 = −2.67 and −0.33, respectively). Aniline and its derivatives are considered to be narcotics for fish and more toxic to invertebrates [65]. Lower correlations, similar to each other, were obtained between the toxicity values of algae and fish (R2 = 0.580) and of algae and Daphnia magna (R2 = 0.613) (Fig. 3b and c). These results are in agreement with several studies dedicated to interspecies toxicity correlations [21, 64, 66, 67]. The toxicity of amine compounds (propylamine, tributylamine, n-butylamine, and ethanolamine) was much higher toward algae than toward fish or invertebrates, as observed by Christensen et al. [68] and Escher et al. [69], while the opposite effect was observed for d-limonene, the most lipophilic solvent of the dataset. For the solvent classes that were large enough (halogenated compounds, ethers and orthoesters, acyl compounds, and alcohols), we were not able to identify any species as more or less sensitive than the others. The toxicity of the ten aromatic hydrocarbons studied was high toward the three trophic levels.

Fig. 3 Toxicity of invertebrates vs fish (a), algae vs fish (b), and algae vs invertebrates (c)

QSAR development for invertebrate pEC50

Descriptor selection

The descriptors were selected from each training set by multiple linear regression (ERM) with the FIT criterion, which allows a good compromise between the number of descriptors selected in the model and the quality of the model fit. The maximum values of FIT corresponded to a 4- or 5-parameter QSAR for invertebrate pEC50 prediction, depending on the strategy of training set selection. As expected, LogP was included in all the models developed and was associated with the surface tension and the minimal atomic Mulliken charge (qmin). HOMO energy and/or dielectric constant were also selected. The space-filling design approach allowed the selection of training solvents representative of the whole dataset, with a good coverage of the solvent space. This strategy led to a 4-descriptor model involving LogP, ε, γ, and qmin. From the EC50-based training set, these 4 descriptors were associated with the HOMO energy.

Validation

The QSAR including LogP, ε, γ, and qmin as explanatory variables (see Table 1 in Supplementary material) led to the best regression performances. This model was externally validated. Results were quite similar for all training sets regardless of the selection strategy: the determination coefficients (\( {\mathrm{R}}_{\mathrm{train}}^2 \), \( {\mathrm{R}}_{\mathrm{a},\mathrm{train}}^2 \)) ranged from 0.689 up to 0.752 and \( {\mathrm{MAE}}_{\mathrm{train}} \) between 0.512 and 0.535 (see Table 2). For test sets, regression quality was satisfactory (\( {Q}_{F1}^2 \) > 0.7 and CCC > 0.85), especially with the SF design (\( {Q}_{F1}^2 \) = 0.864, \( {\mathrm{MAE}}_{\mathrm{test}} \) = 0.425, and CCC = 0.907). The developed QSAR was also internally validated with fivefold cross-validation, giving a determination coefficient \( {\mathrm{R}}_{\mathrm{CV}}^2 \) = 0.704 in good agreement with the model.

Table 2 Training and test sets characteristics for the three selection strategies used (defined by using (1) random, (2) EC50-based, and (3) space-filling selections)

The best QSAR for invertebrate pEC50 prediction was the following:

$$ Log\ 48-h\ Inv.\ EC50\ \left( mmol/L\right) = 1.276\left(\pm 0.418\right) - 0.480\left(\pm 0.061\right)\ LogP - 0.048\left(\pm 0.011\right)\gamma + 0.027\left(\pm 0.007\right)\ \varepsilon - 0.951\left(\pm 0.454\right)\ {q}_{min} $$
(7)
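Equation 7 can be applied as in the following Python sketch. The function name is hypothetical, only the central coefficient values are used (the ± terms are ignored), and the descriptors are assumed to be expressed in the same units as in the dataset used to fit the model.

```python
def log_ec50_invertebrate(log_p, gamma, epsilon, q_min):
    """Log 48-h invertebrate EC50 (mmol/L) from Eq. 7 (central coefficients).

    log_p   : octanol/water partition coefficient
    gamma   : surface tension (unit assumed identical to the fitted dataset)
    epsilon : dielectric constant
    q_min   : minimal atomic Mulliken charge
    """
    return (1.276
            - 0.480 * log_p
            - 0.048 * gamma
            + 0.027 * epsilon
            - 0.951 * q_min)
```

All other descriptors being equal, an increase of one LogP unit lowers the predicted log EC50 by 0.480, that is, the predicted EC50 is divided by about three.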

Figure 4a shows the predicted vs experimental invertebrate pEC50 for both training and test sets. In a previous work [26], we already highlighted the significance of LogP, ε, and γ for predicting the fish LC50 of organic solvents. Compared with the QSAR developed here for invertebrate pEC50, only the fourth parameter differed, with the LUMO energy selected instead of qmin. These similar results corroborate that fish and daphnids have similar sensitivity to chemicals, and especially to narcotic compounds, which represented the major part of the solvent dataset. Polar or non-polar narcosis involves non-specific, non-covalent interactions with membranes leading to their disruption. This baseline toxicity mechanism is essentially governed by LogP, so many QSARs (as in the ECOSAR program) use LogP as the single descriptor. As expected and as observed by many authors [14, 24, 70], the negative coefficient of LogP shows that lipophilic chemicals have more toxic effects than hydrophilic ones since they are more bioavailable (transmembrane passage is promoted) and bioaccumulative. The dielectric constant (ε) reflects the ability of a solvent to separate charges and provides a rough measure of its polarity. This parameter appeared to be relevant for explaining toxicity toward daphnids, with a positive coefficient. The negative contribution of γ indicates that a high surface tension would increase chemical toxicity, which may be explained by promoted membrane penetration. The last selected descriptor was the minimal atomic Mulliken charge (qmin), which reflects the ability of the solvent to function as an electron-pair donor (Lewis base). Given its negative coefficient, a qmin of small magnitude (weak Lewis base character) would promote toxicity. One explanation could be that H2O strongly solvates Lewis bases, limiting their reactivity and hence their ecotoxicity. Faucon et al. [14] showed that the ecotoxicity of 96 heterogeneous chemicals toward daphnids increases with the decrease of both LogP and the electronic descriptor hardness (absolute value). They explained the latter behavior by a less favorable solvation of soft compounds, which become more reactive and toxic than hard bases or oxygen anions.

Fig. 4 Predicted vs experimental p(EC50) of invertebrates (a) and algae (b) for training (●) and test (○) solvents

To visualize the applicability domain of the developed QSAR, the Williams plot is shown in Fig. 5a. From this plot, the AD is established inside a squared area within ±3 standard deviations and a leverage threshold h* of 0.13. No chemical of the test set exceeds the warning leverage h*, indicating that their predicted activities can be considered as reliable as those of the training chemicals. The activity of four molecules belonging to the training set was less well predicted but remained satisfactory (standardized residuals < 3). As for fish pLC50 prediction [26], the strongest positive residual was obtained for a glycol solvent (diethylene glycol), which exhibits a very high 48 h Inv. EC50 (48,900 mg/L). Like Papa et al. [71], we expect high EC50 values to be difficult to measure precisely.

Fig. 5 Williams plot for the QSAR model of invertebrate (a) and algae (b) pEC50

The solvents that were the most underestimated by the developed QSAR were amine compounds: aniline, diethylenetriamine, and 1,2-diaminoethane. Similar results were observed for the fish pLC50 prediction of organic solvents [26]. Amine solvents may exhibit toxic action in excess of the narcosis baseline through specific mechanisms called amine narcosis [65]. Figure 5a also shows that ethylene carbonate has a leverage value of 0.41, greater than h*, with a small standardized residual. This solvent may stabilize the model and make it more accurate.

Finally, the QSAR implemented in the ECOSAR program (Eq. 5) was used to predict the Inv. pEC50 of the 85 neutral organic solvents contained in the dataset. A comparison with the performance of the developed 4-parameter model (Eq. 7) showed similar results for these compounds, with R2 = 0.745 and MAE = 0.497 for the ECOSAR relation and R2 = 0.758 and MAE = 0.482 for Eq. 7. However, Eq. 7 remained relatively robust: for the Inv. pEC50 of the remaining 37 reactive or ionizable solvents of the dataset, it led to R2 = 0.639 (MAE = 0.675) against R2 = 0.135 (MAE = 0.860) for the ECOSAR model.

QSAR development for Algae pEC50

As for fish and invertebrate pEC50 prediction, we tried to develop a QSAR for algae pEC50 modeling regardless of the mechanism of toxic action involved. No satisfactory model (R2 around 0.5) was obtained, essentially because of the presence of amine solvents in the dataset. Therefore, we chose to remove all the amine solvents from the initial dataset, namely ethanolamine, propylamine, n-butylamine, tributylamine, 1,2-diaminoethane, morpholine, and aniline. Amine solvents are well known to be highly toxic toward algae, and this behavior is often related to a pH-dependent toxicity [68, 69]. Neuwoehner and Escher [72] suggest that the high toxicity of aliphatic amines in algae is due to a toxicokinetic effect induced by their speciation and not to a specific mechanism of toxic action. The speciation of aliphatic amines would be different in the external medium compared with the algal cell, in which pH remains independent of the external pH.

On the basis of the FIT criterion, the space-filling and EC50-based selections led to a 2-parameter QSAR for algae EC50 prediction involving LogP and LUMO energy (see Table 1 in Supplementary material). QSAR models with only these two descriptors have already been reported by several authors [20, 73, 74].

The 2-parameter QSAR was externally validated, showing robust results for each selection strategy: the determination coefficients (\( {\mathrm{R}}_{\mathrm{train}}^2 \), \( {\mathrm{R}}_{\mathrm{a},\mathrm{train}}^2 \)) ranged from 0.706 up to 0.744 and MAE from 0.427 up to 0.513 for both training and test sets (Table 2). Both external validation criteria, \( {Q}_{F1}^2 \) and CCC, were greater than the acceptance values (0.7 and 0.85, respectively) defined by Chirico and Gramatica [43], indicating that the model may be accepted as externally predictive for new organic solvents. The fivefold cross-validation method used for internal validation led to a satisfactory determination coefficient \( {\mathrm{R}}_{\mathrm{CV}}^2 \) = 0.729.

The best QSAR for algae pEC50 prediction was the following:

$$ Log\ 72-h\ algae\ EC50\ \left( mmol/L\right) = 1.348\left(\pm 0.139\right) - 0.621\left(\pm 0.059\right)\ LogP + 0.249\ \left(\pm 0.095\right)\ LUMO $$
(8)
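Equation 8 can be applied as in the following Python sketch. The function name is hypothetical, only the central coefficient values are used, and the LUMO energy is assumed to be expressed in the same unit as in the dataset used to fit the model. Since amines were removed before fitting, the relation is not meant to be applied to amine solvents.

```python
def log_ec50_algae(log_p, e_lumo):
    """Log 72-h algae EC50 (mmol/L) from Eq. 8 (central coefficients).

    log_p  : octanol/water partition coefficient
    e_lumo : LUMO energy (unit assumed identical to the fitted dataset)
    """
    return 1.348 - 0.621 * log_p + 0.249 * e_lumo
```

A lower LUMO energy (a more electrophilic solvent) or a higher LogP both lower the predicted log EC50, that is, both increase the predicted toxicity.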

Figure 4b shows the predicted vs experimental algae pEC50 both for training and test sets.

As expected and in agreement with many authors [11, 13, 18, 20, 21, 73–76], the negative coefficient of LogP in the QSAR indicates that the higher the lipophilicity, the higher the toxicity of organic solvents toward algae. The positive coefficient of the LUMO energy suggests that highly electrophilic compounds (low LUMO energy) exhibit high toxicity, in agreement with many authors and as observed in the QSAR developed for the acute fish toxicity of solvents [26].

The Williams plot of the algae QSAR is presented in Fig. 5b. As for the QSAR developed for invertebrate pEC50 prediction, the solvents clearly fall within a well-defined domain of applicability (with h* = 0.12). Only d-limonene was outside the AD because of its very high LogP value of 4.83. However, the corresponding predicted activity remains satisfactory, with a standardized residual below 3. Once again, the predicted activity of the test solvents may be considered reliable.

For comparison, the ECOSAR relation (Eq. 6) was used to predict the algae pEC50 of the 67 neutral organic solvents included in our dataset. Although Eq. 6 is based on activities measured over a test period of 96 h instead of 72 h for Eq. 8, Eq. 6 led to MAE = 0.640 and R2 = 0.571, which is in agreement with the determination coefficient R2 = 0.596 characterizing Eq. 6 (see ECOSAR section). By comparison, the developed QSAR (Eq. 8) reached R2 = 0.766 and MAE = 0.459. Moreover, Eq. 8 remained usable to predict the algae pEC50 of the four esters remaining in the dataset, with MAE = 0.492 against 0.886 for the ECOSAR model (Eq. 6).

Conclusions

The prediction of the ecotoxicity profile of organic solvents by QSARs is of major relevance, especially to limit in vivo experiments. The description of aquatic toxicity requires knowledge of the effects of a substance on organisms living in the water, represented by three trophic levels, i.e., vertebrates (fish), invertebrates (crustaceans such as Daphnia spp.), and algae. In a previous work, we developed from a large dataset a reliable 4-parameter QSAR able to predict the fish LC50 of organic solvents, regardless of the mechanism of toxic action involved. Here, to complete this study and better describe the ecotoxicity profile of organic solvents required by the REACH regulation, we used the same approach to develop QSARs for the EC50 prediction of the two other trophic levels, namely invertebrates and algae.

From the experimental activity data found in the literature and in agreement with other studies, organic solvents showed similar toxicity toward fish and invertebrates, while algae exhibited a different behavior. The 4-parameter QSAR developed for invertebrate pEC50 prediction included LogP, surface tension, dielectric constant, and the minimal atomic charge. As expected, this model is very similar to the one developed for fish LC50 prediction, which includes LUMO energy instead of the minimal atomic charge. A 2-parameter QSAR involving LogP and LUMO energy gave good predictions of algae pEC50 for all solvents other than amines, which are well known to exhibit specific toxicity behavior toward algae [68, 69]. These models were obtained from a large solvent dataset and were validated by both internal and external validation techniques, as required by the REACH regulation and according to OECD guidelines [52, 53]. They constitute a major tool for a reliable assessment of the environmental risk related to organic solvents.