Introduction

New chemicals enter the market at a rate of about 1000–2000 per year (Judson et al. 2009). Many synthetic chemicals are produced and released into the environment through daily life (Levet et al. 2016). Some herbicides, such as atrazine and s-triazine, can induce developmental toxicity (Diana et al. 2000; Allran and Karasov 2001; Saka et al. 2017). 4-Nitrophenol, a commercial compound used in many industries, has been reported as a potential carcinogen, teratogen, and mutagen (Mitchell and Waring 2000). These chemicals can directly or indirectly harm humans and aquatic organisms through bioaccumulation or biomagnification. Concern about their toxicity is therefore growing, and it is important to assess the hazard and risk of chemical substances released into the environment (Zhang et al. 2013).

The toxicological information required in risk assessment is usually obtained experimentally. However, most traditional acute toxicity tests using animals or other organisms are material-consuming, time-consuming, and technically difficult (Luis et al. 2007; de Melo et al. 2016; Wang et al. 2019). They also conflict with the reduction, replacement, and refinement (3R) principle of animal experiments (Hamadache et al. 2016). Moreover, there is a large gap between the existing experimental data and the actual need for accurate toxicological information on chemicals for risk assessment. Predictive methods are therefore urgently needed as an alternative to animal experimentation. An in silico method, the quantitative structure-activity relationship (QSAR), has wide application prospects and is one of the best methods to fill this gap (Papa et al. 2013; Zhang et al. 2019). The QSAR method can not only predict and evaluate the ecological toxicity and environmental behavior of unknown chemicals, but also help explore the toxic mechanisms of pollutants and support the environmental risk assessment of organic chemicals (Tsakovska et al. 2008).

Many studies have been carried out to develop QSAR models for predicting the toxicity of various chemicals to aquatic organisms, for example, Vibrio fischeri (V. fischeri). QSAR models were developed to predict the toxicity of a set of 52 aromatic sulfone chemicals against V. fischeri, and the results showed that the toxicity was markedly related to water solubility (de Melo et al. 2016). QSAR techniques with partial least squares (PLS) analysis were adopted to predict the toxicity of alkylated aromatic hydrocarbons towards V. fischeri, and goodness-of-fit was demonstrated by a high statistical value (R2 = 0.956) (Wang et al. 2016b). The toxicity of 24 bromide-based ionic liquids (Br-ILs) against V. fischeri was used to establish a QSAR model with a relatively high correlation coefficient; the results indicated that the toxicity was related to the energy of the lowest unoccupied molecular orbital and the volume of the Br-IL cations (Wang et al. 2015a). Most QSAR models were established for chemicals with similar functional groups or similar uses, such as alcohols (Belanger et al. 2018), anilines (Tugcu and Sacan 2018), phenols (Abbasitabar and Zare-Shahabadi 2017), specifically acting antibiotics (Neale et al. 2017), pesticides (Martin et al. 2017), surfactants in personal care products (Di Nica et al. 2017), and antidepressants (Minguez et al. 2018). However, the application domains of those models are limited to narrow ranges of chemical species, so they cannot be used to predict toxicity for a large number of structurally diverse chemicals. It is therefore necessary to develop general models that cover structurally diverse chemicals (Lessigiarska et al. 2004; Levet et al. 2016; Bakire et al. 2018).

The mode of action (MOA) is essential for understanding toxic mechanisms. It also plays a key role in the development of QSAR models. A variety of methods for assigning chemicals to specific MOAs are available (Könemann 1981; McKim et al. 1987; Verhaar et al. 1992; Russom et al. 1997). The Verhaar scheme (Verhaar et al. 1992) is one of the most widely used methods; it classifies chemicals based on physicochemical properties and structural rules. The Verhaar scheme assigns chemicals to five categories: baseline chemicals, less inert chemicals, reactive chemicals, specifically acting chemicals, and chemicals that cannot be classified using the scheme. Up to now, most QSAR models have been generated for baseline chemicals against different aquatic organisms. A general baseline toxicity QSAR model for acute toxicity to fish embryos has been developed with R2 = 0.97 using the liposome-water partition coefficient (logKlipw) instead of the octanol/water partition coefficient (logKow) (Klüver et al. 2016). A relationship has been found between fish toxicity and theoretical Volsurf molecular descriptors for 36 baseline chemicals with a robustness of R2 = 0.823 (de Moraise Silva et al. 2018). Although a number of QSAR models for baseline or less inert (polar narcotic) compounds have been established, fewer models have been developed for reactive and specifically acting compounds. Developing QSAR models for these compounds is crucial because of their highly toxic effects on humans and environmental organisms.

In this paper, acute luminescence inhibition toxicity data for V. fischeri were compiled for 1221 structurally diverse chemicals with different MOAs. The purpose of the paper is to develop linear and nonlinear QSAR models to predict the toxicity of more organic chemicals with different MOAs to V. fischeri. At the same time, the property or structure factors that contribute to the acute toxicity of organic chemicals to V. fischeri were investigated based on the MOA-based models. In accordance with international principles of QSAR model development (OECD guideline, 2007), the robustness and application domains of the developed models are discussed. This information is valuable for the risk assessment of organic chemicals in the aquatic environment, especially for reactive and specifically acting compounds.

Material and methods

Toxicity data to V. fischeri

The acute toxicity data of organic chemicals to V. fischeri at 15 or 30 min, expressed as the logarithm of the reciprocal of the 50% bioluminescence inhibition concentration (log1/IBC50, with IBC50 in mol/L), were collected from the literature (Kaiser and Palabrica 1991; Cronin and Schultz 1998; Zhao et al. 1998a; Cronin et al. 2000; Dearden et al. 2000; Terasaki et al. 2009; Qin et al. 2010; Aruoja et al. 2011; Jones et al. 2011; Shi et al. 2012; Villa et al. 2012). If data for both endpoints were available, the toxicity value at the 15-min endpoint was preferred. A total of 1221 organic compounds and their toxicity data were collated by (1) removing ions, salts, and mixtures and (2) using the arithmetic mean as the final toxicity value for chemicals with more than one experimental value. The 1221 organic chemicals were divided into five MOAs using Toxtree software (http://ecb.jrc.it/qsar/qsar-tools/index.php?c=TOXTREE). The details of the classification, together with CAS numbers, are reported in Online Resource 2.
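The endpoint preference and averaging rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(cas, exposure_min, value)`, the function name `curate`, and the example values are hypothetical, and the removal of ions, salts, and mixtures is not covered.

```python
from collections import defaultdict
from statistics import mean


def curate(records):
    """Collapse raw records to one log1/IBC50 value per chemical.

    records: iterable of (cas, exposure_min, value) tuples, where
    exposure_min is assumed to be either 15 or 30 (hypothetical layout).
    """
    by_chem = defaultdict(lambda: defaultdict(list))
    for cas, minutes, value in records:
        by_chem[cas][minutes].append(value)

    curated = {}
    for cas, by_time in by_chem.items():
        # Prefer the 15-min endpoint; fall back to 30 min otherwise.
        endpoint = 15 if 15 in by_time else 30
        # Arithmetic mean over repeated experimental values.
        curated[cas] = mean(by_time[endpoint])
    return curated


# Made-up records for two hypothetical chemicals "A" and "B".
records = [
    ("A", 15, 3.0), ("A", 15, 3.4), ("A", 30, 3.9),
    ("B", 30, 2.0), ("B", 30, 2.2),
]
result = curate(records)
```

For chemical "A" the 30-min value is ignored because 15-min data exist, while "B" falls back to the mean of its 30-min values.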

Calculation of molecular parameters

Hydrophobicity, quantified by the logarithm of the octanol/water partition coefficient (logKow), was obtained from the KOWWIN program in EPISuite (version 4.0) (http://www.epa.gov/oppt/exposure/pubs/episuitedl.htm). Where available, measured logKow values were used in preference to calculated values. The Dragon descriptors were calculated with Dragon software (Version 6.0, Talete SRL, Milano, Italy). The MM2 method (Schnur et al. 1991) implemented in ChemBio3D Ultra (Version 12.0) (http://www.cambridgesoft.com/services/) was used to optimize the molecular structures. The initial descriptors were reduced by excluding three types of descriptors, namely, highly correlated descriptors with an absolute pair correlation larger than or equal to 0.95, constant descriptors (relative standard deviation < 0.0001), and descriptors with at least one missing value. A total of 1379 descriptors were thus retained and used for further analysis.
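The three exclusion rules can be illustrated with a small pure-Python sketch. It is not the procedure used by the Dragon software. Several choices here are assumptions: the first descriptor of a correlated pair is the one kept, the relative standard deviation is taken as SD divided by the absolute mean, and the toy descriptor names `d1` to `d5` are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def filter_descriptors(table, r_max=0.95, rsd_min=0.0001):
    """table: dict of descriptor name -> list of values (None = missing)."""
    kept = {}
    for name, col in table.items():
        if any(v is None for v in col):
            continue  # rule 3: at least one missing value
        m, sd = mean(col), pstdev(col)
        # Assumption: RSD = SD / |mean|; a zero mean falls back to the SD.
        rsd = sd / abs(m) if m != 0 else sd
        if rsd < rsd_min:
            continue  # rule 2: (near-)constant descriptor
        if any(abs(pearson(col, other)) >= r_max for other in kept.values()):
            continue  # rule 1: highly correlated with a kept descriptor
        kept[name] = col
    return kept


toy = {
    "d1": [1, 2, 3, 4],
    "d2": [2, 4, 6, 8.1],    # almost collinear with d1
    "d3": [5, 5, 5, 5],      # constant
    "d4": [1, None, 2, 3],   # missing value
    "d5": [4, 1, 3, 2],
}
kept = filter_descriptors(toy)
```

In this toy table only `d1` and `d5` survive the three filters.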

Model development and evaluation

The toxicity data were divided into training and verification sets in a ratio of 4:1 by random data segmentation (RSS) (Lyakurwa et al. 2014b), such that the toxicity values and structures of the studied compounds were well covered. The training set, containing 80% of the chemicals, was used to develop the models, and the verification set was used for external testing. Multilinear regression (MLR) with a stepwise algorithm in SPSS 19.0 software (SPSS Company, Chicago, IL, USA) was used for linear QSAR model development. The support vector machine (SVM), which has been extensively applied in nonlinear analysis, was run in MATLAB 2014 to build the nonlinear QSAR models.
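A random 4:1 split can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_4_to_1`, the fixed seed, and the rounding of the verification-set size are all assumptions.

```python
import random


def split_4_to_1(ids, seed=0):
    """Randomly split identifiers into training and verification sets (4:1)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_ext = round(len(shuffled) / 5)  # assumption: one fifth, rounded
    return shuffled[n_ext:], shuffled[:n_ext]


# 1221 placeholder identifiers standing in for the compiled chemicals.
train, ext = split_4_to_1(range(1221))
```

With 1221 chemicals this yields 977 training and 244 verification chemicals, the set sizes reported for the global model.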

The linear models obtained in the analysis abide by the following principles. Firstly, the number of predictor variables should be kept small to avoid over-fitting. The ratio between the number of chemicals in the training set and the number of selected descriptors should be more than 5:1 (Tropsha et al. 2003). Secondly, the model should have a high adjusted determination coefficient (R2adj) and a low root mean squared error (RMSE). Thirdly, because the probability of highly correlated descriptors rises with the number of descriptors available to the model, the variance inflation factors (VIF) of the parameters should be less than 10 to avoid collinearity. Fourthly, the QUIK rule (Stewart 1989) should be satisfied, i.e., Kx (the intercorrelation of the selected descriptors) < Kyx (the correlation of the x block with y), where x is the selected molecular descriptor matrix and y is the response variable vector (Li et al. 2014; Wang et al. 2015b; Luo et al. 2017). Last but not least, the R2YS and Q2YS of the Y-scrambling technique should be lower than the criteria of 0.3 and 0.05, respectively (Eriksson et al. 2003).
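The numerical acceptance rules listed above can be collected into a simple checklist. The sketch below is illustrative only. The function name `check_linear_model` is hypothetical, and the example uses the statistics reported later in this paper for the baseline model (2). VIF values are not tabulated in the text, so that check is skipped unless values are supplied.

```python
def check_linear_model(n_train, n_desc, kyx, kx, r2_ys, q2_ys, vifs=None):
    """Return pass/fail flags for the acceptance rules of a linear model."""
    report = {
        "chemicals:descriptors > 5:1": n_train / n_desc > 5,
        "QUIK rule (Kx < Kyx)": kx < kyx,
        "Y-scrambling (R2YS < 0.3, Q2YS < 0.05)": r2_ys < 0.3 and q2_ys < 0.05,
    }
    if vifs is not None:  # VIFs are only checked when they are supplied
        report["all VIF < 10"] = all(v < 10 for v in vifs)
    return report


# Statistics reported for model (2), the baseline model.
model2 = check_linear_model(n_train=172, n_desc=7, kyx=0.379, kx=0.345,
                            r2_ys=0.041, q2_ys=-0.072)

# A deliberately poor, made-up case: too few chemicals per descriptor.
poor = check_linear_model(n_train=20, n_desc=7, kyx=0.2, kx=0.3,
                          r2_ys=0.4, q2_ys=0.1, vifs=[2.0, 12.5])
```

All flags are true for model (2), whereas every flag fails for the made-up case.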

The number of the terminated generation of the nonlinear models was set to 100, and the best nonlinear model was selected with minimum variance.

The performance of linear models was evaluated by the statistical parameters. The determination coefficient (R2) and root mean square error (RMSE) mainly reflect the goodness of fit of the models. The Q2LOO (leave-one-out cross-validation) and Q2BOOT (the bootstrap method, to 1/5 cross-validation, repeated 5000 times) were used to evaluate the robustness of the developed models. In addition, the slopes of the experimental value against predicted value or predicted value against experimental value without intercept (expressed as k and k′, respectively) for the validation set were used to evaluate the predictive ability of the models. If 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k′ ≤ 1.15, models were considered acceptable (Golbraikh and Tropsha 2002). Furthermore, the external determination coefficient (R2ext), the external explained variance (Q2ext), and the root mean square error of verification sets (RMSEext) were also adopted to characterize the predictability of the models.
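The external validation statistics can be sketched as below. The slopes follow the through-origin regressions described above. The Q2ext formula, which references the training-set mean, is an assumption because the text does not state which variant was used. The function name and the three data points are hypothetical.

```python
from math import sqrt


def external_stats(y_exp, y_pred, y_train_mean):
    """Slopes through the origin, RMSEext and Q2ext for a verification set."""
    n = len(y_exp)
    sxy = sum(e * p for e, p in zip(y_exp, y_pred))
    k = sxy / sum(p * p for p in y_pred)        # experimental vs predicted
    k_prime = sxy / sum(e * e for e in y_exp)   # predicted vs experimental
    press = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    # Assumption: Q2ext is referenced to the training-set mean.
    ss_tot = sum((e - y_train_mean) ** 2 for e in y_exp)
    return {
        "k": k,
        "k_prime": k_prime,
        "rmse_ext": sqrt(press / n),
        "q2_ext": 1 - press / ss_tot,
    }


# Made-up experimental and predicted log1/IBC50 values.
stats = external_stats([3.0, 4.0, 5.0], [3.1, 3.9, 5.2], y_train_mean=4.0)
acceptable = 0.85 <= stats["k"] <= 1.15 or 0.85 <= stats["k_prime"] <= 1.15
```

The final line applies the acceptance window of 0.85 to 1.15 quoted from Golbraikh and Tropsha (2002).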

The performance of nonlinear models was evaluated through the determination coefficient (R2), the external determination coefficient (R2ext), and the external explained variance of the verification set (Q2ext).

Applicability domain characterization

Application domains (ADs) were characterized using the leverage distance method and the Euclidean distance method. For the leverage distance method, Williams plots of the standardized residual (s) versus the leverage (h) were used to characterize the ADs and to determine whether outliers or influential chemicals exist. Influential chemicals were identified by an hi value larger than h* (3p/n, where n and p are the numbers of chemicals and descriptors, respectively), and outliers were diagnosed by a standardized residual (s) larger than 3 units (Bakire et al. 2018; Li et al. 2014). Similarly, for the Euclidean distance method, plots of the standardized residual (s) versus the Euclidean distance were used to characterize the ADs and to determine whether outliers exist. The largest Euclidean distance in the training set was set as the warning value (d*) (Li et al. 2014).
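The two warning values can be illustrated with a short sketch. The leverage threshold follows the 3p/n expression given above. For the Euclidean method, measuring the distance to the centroid of the training descriptors is an assumption, since the text does not specify the reference point, and the toy descriptor rows are made up.

```python
from math import dist  # available from Python 3.8


def leverage_threshold(n, p):
    """Warning leverage h* = 3p/n as defined in the text."""
    return 3 * p / n


def euclidean_ad(train_rows, query_rows):
    """Return d* and, for each query chemical, whether it lies inside the AD."""
    n = len(train_rows)
    # Assumption: distances are measured from the training-set centroid.
    centroid = [sum(col) / n for col in zip(*train_rows)]
    d_star = max(dist(row, centroid) for row in train_rows)
    return d_star, [dist(row, centroid) <= d_star for row in query_rows]


h_star = leverage_threshold(n=172, p=7)  # training-set size of model (2)

train_rows = [[0, 0], [2, 0], [0, 2], [2, 2]]   # toy descriptor values
d_star, inside = euclidean_ad(train_rows, [[1, 1], [4, 4]])
```

In practice the descriptors would normally be scaled before distances are computed; that step is omitted here for brevity.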

Results and discussion

QSAR models for all the chemicals

MLR analysis was carried out for all the collected chemicals. The best global linear model obtained by the stepwise analysis was

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=3.096+0.766\ \mathrm{S}\mathrm{pMax}4\_\mathrm{Bh}\left(\mathrm{m}\right)+0.166\ \mathrm{logKow}-0.720\ \mathrm{GATS}1\mathrm{i}-0.147\mathrm{CATS}2\mathrm{D}\_02\_\mathrm{DA}-0.468\ \mathrm{GATS}3\mathrm{e}-0.265\ \mathrm{nROH}+1.969\ \mathrm{nRNHR}+1.946\mathrm{B}02\left[\mathrm{S}-\mathrm{Cl}\right]+0.587\ \mathrm{S}-106-0.098\ \mathrm{F}02\left[\mathrm{C}-\mathrm{F}\right] $$
(1)
$$ {\mathrm{n}}_{\mathrm{tra}}=977,{\mathrm{R}}^2=0.500,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.495,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.486,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.797,\mathrm{RMSE}=0.817,\mathrm{Kyx}=0.228,\mathrm{Kx}=0.189,{\mathrm{n}}_{\mathrm{ext}}=244,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.467,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.754,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.433,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.010,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.013 $$

As shown in model (1), ten descriptors were used in the linear equation (Online Resource 1 Table S1). This model could only account for 49.5% of the variance (R2adj), suggesting an unsatisfactory fitting ability. Although the fit and the determination coefficient (R2) could be improved by increasing the number of descriptors, the value of R2ext was no higher than 0.6 even when more descriptors were introduced into the model. A nonlinear model was also developed for all the chemicals using the SVM method, and it was likewise unsatisfactory, with C = 0.594, g = 2.226, R2 = 0.581, and R2ext = 0.533.
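As a consistency check, and assuming the standard definition of the adjusted determination coefficient (the text does not give the formula), the reported R2adj of model (1) follows from R2 = 0.500, ntra = 977, and p = 10 descriptors:

$$ {{\mathrm{R}}^2}_{\mathrm{adj}}=1-\left(1-{\mathrm{R}}^2\right)\frac{{\mathrm{n}}_{\mathrm{tra}}-1}{{\mathrm{n}}_{\mathrm{tra}}-\mathrm{p}-1}=1-\left(1-0.500\right)\frac{977-1}{977-10-1}\approx 0.495 $$

Because the training set is large relative to the number of descriptors, the adjustment is small, and the low value reflects a genuinely poor fit and not over-parameterization.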

Given the unsatisfactory results of the global linear and nonlinear models, the chemicals were classified into four classes with different MOAs (baseline, less inert, reactive, and specifically acting chemicals) according to the modified Verhaar scheme (Enoch et al. 2008b). It should be noted that the MOAs of some chemicals could not be assigned, and these chemicals were not used for model construction in the present study. A total of 606 chemicals were therefore used for MOA-based QSAR model development.

MOA-based linear QSAR models

Linear QSAR model for baseline chemicals

A total of 215 chemicals were assigned to baseline compounds. They were divided into the training and verification sets in the ratio of 4:1. The optimal linear model was constructed by the MLR method and the result is shown as follows:

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=0.55+0.468\ \mathrm{logKow}+0.293\ \mathrm{SpDiam}\_\mathrm{AEA}\left(\mathrm{ed}\right)+0.237\ \mathrm{N}\%-0.741\ \mathrm{O}-057-2.14\ \mathrm{B}09\left[\mathrm{C}-\mathrm{Cl}\right]+0.455\ \mathrm{Eig}04\_\mathrm{AEA}\left(\mathrm{ed}\right)-0.761\ \mathrm{GGI}3 $$
(2)
$$ {\mathrm{n}}_{\mathrm{tra}}=172,{\mathrm{R}}^2=0.787,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.778,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.764,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.783,\mathrm{RMSE}=0.579,\mathrm{Kyx}=0.379,\mathrm{Kx}=0.345,{\mathrm{n}}_{\mathrm{ext}}=43,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.788,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.578,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.786,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.041,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.072 $$

Model (2) contains a total of seven descriptors, and their detailed information is listed in Online Resource 1 Table S2. As the dominant descriptor, logKow accounts for the largest proportion (t = 13.277). This result is consistent with observations published in the literature for nonpolar anesthetic compounds (He et al. 2014). Hydrophobicity, expressed as logP or logKow, is one of the most common descriptors used to predict the toxicity of organic chemicals to organisms, and this property governs the passage of a chemical through the cell membrane. Studies have found that hydrophobicity correlates well with the acute toxicity of chemicals to many aquatic species, such as fish or embryos (Poecilia reticulata (Su et al. 2014), zebrafish embryo (Zhu et al. 2018)), tadpoles (Rana chensinensis and Rana japonica) (Wang et al. 2019), green algae (Bakire et al. 2018), D. magna (Zvinavashe et al. 2009), and Tetrahymena pyriformis (Enoch et al. 2008a). Model (2) indicates that the acute toxicity to V. fischeri is correlated not only with logKow, but also with other factors. In the present study, the determination coefficient (R2) increased from 0.543 to 0.787 when another six descriptors were included in the model. The value of k is 0.986 and k′ is 0.992 for the validation set (details in Online Resource 1 Fig. S4), indicating that the predictive ability of the model is acceptable. The Q2LOO of equation (2) is 0.764 and the Q2BOOT is 0.783, indicating that the model has good statistical performance. Additionally, the R2YS and Q2YS of the Y-scrambling technique are 0.041 and −0.072, respectively, suggesting that the model is acceptable. The plot of experimental and predicted values of log1/IBC50 is shown in Fig. 1a. The Pearson correlation coefficient (expressed as Rp) of experimental and predicted values is 0.886, indicating that the model established for class 1 chemicals is suitable and robust.

Fig. 1

Fitting plots of experimental values and predictive values for baseline, less inert, and specifically acting chemicals (a baseline chemicals by the MLR method; b baseline chemicals by the SVM method; c less inert chemicals by the MLR method; d less inert chemicals by the SVM method; e specifically acting chemicals by the MLR method; f specifically acting chemicals by the SVM method)

It should be noted that the linear model does not apply to highly hydrophobic chemicals with logKow values over 5 (Lee et al. 2013). The 15 PAHs in the data set with logKow > 5 would have a large impact on the development of the baseline model. Therefore, those chemicals were not considered in establishing the baseline model.

Linear QSAR model for less inert chemicals

The model constructed by the MLR method for less inert chemicals was as follows:

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=-4.626+1.951\ \mathrm{EE}\_\mathrm{B}\left(\mathrm{p}\right)+0.804\ \mathrm{logKow}-0.612\ \mathrm{X}2\mathrm{v}-3.98\ \mathrm{Eig}03\_\mathrm{EA}\left(\mathrm{dm}\right)-6.251\ \mathrm{JGI}3+1.662\ \mathrm{CATS}2\mathrm{D}\_05\_\mathrm{DP}+0.102\ \mathrm{GATS}7\mathrm{e} $$
(3)
$$ {\mathrm{n}}_{\mathrm{tra}}=133,{\mathrm{R}}^2=0.737,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.723,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.641,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.790,\mathrm{RMSE}=0.454,\mathrm{Kyx}=0.388,\mathrm{Kx}=0.348,{\mathrm{n}}_{\mathrm{ext}}=33,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.758,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.431,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.729,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.053,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.060 $$

Model (3) contains a total of seven descriptors, and the Y-scrambling test showed that it was acceptable (R2YS = 0.053 and Q2YS = −0.060). Detailed information is listed in Online Resource 1 Table S3. This model could account for 72.3% of the variance (R2adj), indicating a good fitting ability. The value of k is 0.978 and k′ is 1.013 for the validation set (Online Resource 1 Fig. S5), indicating that the predictive ability of the regression for less inert compounds is acceptable. The plot of experimental and predicted values of log1/IBC50 for less inert chemicals is shown in Fig. 1c, and the outcome was acceptable (Rp = 0.858). The observed and predicted log1/IBC50 values, together with the descriptors introduced into model (3), are reported in Online Resource 2. It was found that logKow still has a large influence on the toxicity of less inert chemicals. This has also been observed for other aquatic organisms (Qin et al. 2010; Vighi et al. 2009; Fu et al. 2015). EE_B(p) (t = 8), the second most important descriptor in the model, is an Estrada-like index (logarithmic form) from the Burden matrix weighted by polarizability. The polarizability of less inert chemicals is therefore considered to have a significant effect on their toxicity. The positive relationship indicates that the higher the polarizability of a chemical, the more toxic it is.

Linear QSAR models for reactive chemicals

For reactive compounds, a poor relationship is usually found between toxicity values and descriptors (Zhu et al. 2018). A unified prediction model for reactive chemicals could not be obtained because of the fairly complex structures of the compounds in class 3. It has been shown that the toxicity is influenced by some structural characteristics, such as the number of nitrogen atoms (nN) or carbonyl groups (n(C=O)) (Lyakurwa et al. 2014a). Therefore, the chemicals in class 3 were further divided into three groups according to nN and n(C=O). The results were as follows:

Group 1: nN > 0

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=2.509+21.815\ \mathrm{X}5\mathrm{Av}-2.751\ \mathrm{MATS}3\mathrm{s}+0.252\ \mathrm{GATS}8\mathrm{m}-0.249\ \mathrm{Eig}11\_\mathrm{EA}\left(\mathrm{dm}\right)-0.227\ \mathrm{F}06\left[\mathrm{C}-\mathrm{O}\right]+0.64\ \mathrm{MLOGP} $$
(4)
$$ {\mathrm{n}}_{\mathrm{tra}}=57,{\mathrm{R}}^2=0.801,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.777,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.729,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.757,\mathrm{RMSE}=0.527,\mathrm{Kyx}=0.299,\mathrm{Kx}=0.189,{\mathrm{n}}_{\mathrm{ext}}=14,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.738,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.605,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.715,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.106,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.199 $$

The R2YS and Q2YS of the Y-scrambling technique are 0.106 and −0.199, lower than the criteria of 0.3 and 0.05, respectively, which confirms that the model is acceptable. Six descriptors were introduced into model (4), and detailed information is listed in Online Resource 1 Table S4. Although both MLOGP and logKow represent the octanol/water partition coefficient, the results from different calculation methods can differ slightly. MLOGP and X5Av had fairly large t values, 7.692 and 4.501, respectively (Online Resource 1), demonstrating that the toxicity of chemicals in this group is mainly affected by the hydrophobicity and the average connectivity index of the chemicals. The fitting ability and robustness of the model are acceptable (R2 = 0.801, RMSE = 0.527, Q2LOO = 0.729, Q2BOOT = 0.757). The predictive ability is acceptable, with k = 0.969 and k′ = 1.015 (see Fig. S6 A and B in Online Resource 1). Moreover, the plot of experimental and predicted values of log1/IBC50 for chemicals in group 1 is shown in Fig. S1A in Online Resource 1 with Rp = 0.877, indicating that the model was statistically significant.

Group 2: nN = 0, n(C=O) = 0

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=2.111+2.078\ \mathrm{B}03\left[\mathrm{C}-\mathrm{C}\right]-0.759\ \mathrm{Hy}+1.759\ \mathrm{B}06\left[\mathrm{C}-\mathrm{C}\mathrm{l}\right]-0.946\ \mathrm{B}06\left[\mathrm{O}-\mathrm{F}\right] $$
(5)
$$ {\mathrm{n}}_{\mathrm{tra}}=29,{\mathrm{R}}^2=0.864,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.841,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.749,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.784,\mathrm{RMSE}=0.489,\mathrm{Kyx}=0.360,\mathrm{Kx}=0.191,{\mathrm{n}}_{\mathrm{ext}}=7,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.833,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.614,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.720,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.143,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.237 $$

The model is acceptable, with R2YS = 0.143 and Q2YS = −0.237. There are four descriptors in model (5), and detailed information is listed in Online Resource 1 Table S5. B03[C-C], a 2D atom pair descriptor, was positively correlated with log1/IBC50 and had the largest t value (t = 8.527), indicating that the toxicity of the compounds increased with the increasing number of C–C structure fragments. This trend is opposite to that of Hy, a descriptor of hydrophilicity. Bakire et al. (2018) found that, for reactive chemicals (nN = 0, n(C=O) = 0), only logKow was significantly related to the toxicity to green algae. In contrast to Hy in this model, logKow represents hydrophobicity, the opposite property. This clearly indicates that hydrophobicity is positively related to the toxicity of reactive chemicals without nitrogen atoms and carbonyl groups (nN = 0, n(C=O) = 0). Highly hydrophobic chemicals are more lipophilic and pass through the membrane more easily to cause toxicity. A relatively high negative correlation was found between logKow and Hy (Rp = −0.778). The presence of Hy in place of a hydrophobicity descriptor in model (5) indicates that the more water-soluble a chemical is, the less toxic it is to V. fischeri. The 2D atom pair descriptors B06[C-Cl] and B06[O-F] were also introduced into model (5), and satisfactory performance was obtained (R2 = 0.864, R2ext = 0.833, Q2ext = 0.720). The plot of experimental and predicted values of log1/IBC50 is shown in Fig. S1C in Online Resource 1, and a significant correlation was found (Rp = 0.919). The values of k and k′ (Online Resource 1 Fig. S6 C and D) for the validation set of reactive compounds in group 2 are 0.916 and 1.073, respectively. Model (5) therefore has good predictive ability.

Group 3: nN = 0, n(C=O) > 0

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=-1.714+0.758\ \mathrm{nRCO}-0.482\ \mathrm{nArOR}+11.071\ \mathrm{SpPosA}\_\mathrm{A}-1.799\ \mathrm{Eta}\_\mathrm{F}\_\mathrm{A}+0.727\ \mathrm{F}02\left[\mathrm{O}-\mathrm{Cl}\right]+0.346\ \mathrm{Eig}02\_\mathrm{EA}\left(\mathrm{dm}\right)+0.327\ \mathrm{Eig}06\_\mathrm{EA}\left(\mathrm{dm}\right) $$
(6)
$$ {\mathrm{n}}_{\mathrm{tra}}=69,{\mathrm{R}}^2=0.734,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.704,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.659,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.772,\mathrm{RMSE}=0.326,\mathrm{Kyx}=0.303,\mathrm{Kx}=0.246,{\mathrm{n}}_{\mathrm{ext}}=18,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.447,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.639,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.441,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.096,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.171 $$

The Y-scrambling test indicates that the model is acceptable (R2YS = 0.096 and Q2YS = −0.171). A total of seven descriptors were used in model (6), and detailed information is listed in Online Resource 1 Table S6. The number of carbonyl groups in aliphatic chemicals, represented by nRCO, was positively correlated with log1/IBC50, suggesting that the more carbonyl groups a chemical has, the more toxic it is. Eig06_EA(dm) is an edge adjacency index weighted by the molecular dipole moment, which reflects the polarity of a molecule (Luo et al. 2017). This indicates that the toxicity may be caused by polar interactions of the chemicals with biomacromolecules. The plot of experimental and predicted values of log1/IBC50 is shown in Online Resource 1 Fig. S1E, and Rp was 0.812. The prediction of the model for the training set was satisfactory; however, when it was used to predict the toxicity of the chemicals in the validation set, an unsatisfactory outcome was obtained, with R2ext = 0.447.
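The three-way grouping of reactive chemicals used for models (4) to (6) can be written as a small decision rule. The function name `reactive_group` and its argument names are hypothetical. In practice the atom and group counts would come from the descriptor calculation.

```python
def reactive_group(n_nitrogen, n_carbonyl):
    """Assign a class 3 (reactive) chemical to group 1, 2 or 3.

    Group 1: nN > 0
    Group 2: nN = 0 and n(C=O) = 0
    Group 3: nN = 0 and n(C=O) > 0
    """
    if n_nitrogen > 0:
        return 1
    return 2 if n_carbonyl == 0 else 3
```

Note that nitrogen takes precedence: a chemical containing both nitrogen atoms and carbonyl groups falls into group 1.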

Linear QSAR model for specifically acting chemicals

The model obtained by the MLR method for the class 4 chemicals was as follows:

$$ \mathrm{log}1/{\mathrm{IBC}}_{50}=-10.193+5.839\ \mathrm{SpMAD}\_\mathrm{AEA}\left(\mathrm{dm}\right)-0.179\ \mathrm{F}03\left[\mathrm{C}-\mathrm{N}\right]+3.994\ \mathrm{SpMAD}\_\mathrm{B}\left(\mathrm{p}\right) $$
(7)
$$ {\mathrm{n}}_{\mathrm{tra}}=25,{\mathrm{R}}^2=0.766,{{\mathrm{R}}^2}_{\mathrm{adj}}=0.733,{{\mathrm{Q}}^2}_{\mathrm{LOO}}=0.665,{{\mathrm{Q}}^2}_{\mathrm{BOOT}}=0.729,\mathrm{RMSE}=0.327,\mathrm{Kyx}=0.351,\mathrm{Kx}=0.172,{\mathrm{n}}_{\mathrm{ext}}=6,{{\mathrm{R}}^2}_{\mathrm{ext}}=0.749,{\mathrm{R}\mathrm{MSE}}_{\mathrm{ext}}=0.649,{{\mathrm{Q}}^2}_{\mathrm{ext}}=0.703,{{\mathrm{R}}^2}_{\mathrm{YS}}=0.127,{{\mathrm{Q}}^2}_{\mathrm{YS}}=-0.269 $$

Specifically acting chemicals comprise a variety of chemicals, and their toxicity is mostly attributed to interactions with target receptors. A total of three descriptors were introduced into model (7), namely, SpMAD_AEA(dm), F03[C-N], and SpMAD_B(p). SpMAD_AEA(dm), the spectral mean absolute deviation from the augmented edge adjacency matrix weighted by dipole moment, has the largest influence on toxicity. F03[C-N] is the frequency of C-N at a topological distance of 3, and SpMAD_B(p) is the spectral mean absolute deviation from the Burden matrix weighted by polarizability. The values of R2YS and Q2YS from the Y-scrambling test are 0.127 and −0.269, respectively. The results indicate that model (7) is acceptable for predicting the toxicity of specifically acting chemicals to V. fischeri. The Q2LOO and Q2BOOT values (Q2LOO = 0.665, Q2BOOT = 0.729) indicate that the model has good robustness. The values of the external prediction coefficients (R2ext = 0.749 and Q2ext = 0.703) as well as k and k′ (0.996 and 0.986) reflect the good external predictive ability of the model. The plot of experimental and predicted values of log1/IBC50 is shown in Fig. 1e, and Rp was 0.864.

Comparison with existing linear QSAR models

The existing linear QSAR models for V. fischeri have been collected and are shown in Table 1. Inspection of these models shows that most of them were established for particular chemical species (Cronin et al. 2000; Wang et al. 2016b; Wang et al. 2015a; de Melo et al. 2016). Although a few MOA-based models have been established, most of them are used to predict the toxicity of baseline (nonpolar narcotic) or less inert (polar narcotic) compounds (Zhao et al. 1998b; Li et al. 2015; Wang et al. 2016a). Only one global model has been established. However, the AD of that model is limited because a small data set (102 chemicals) was used to build it (Qin et al. 2010). To our knowledge, this is the first work to investigate linear and nonlinear QSARs for the toxicity of a large number of chemicals to V. fischeri based on their MOAs. More importantly, the models developed cover not only baseline and less inert compounds, but also reactive and specifically acting compounds. If a chemical can be classified into one of the four MOAs (baseline, less inert, reactive, and specifically acting chemicals) based on its structural characteristics, its toxicity to V. fischeri can be predicted using the MOA-based models in this study. It should be noted that, if the MOA of a chemical is unclear and cannot be identified using the Verhaar scheme, its toxicity cannot be well predicted by the models established in this study. In addition, reactive chemicals with nN = 0 and n(C=O) > 0 cannot be predicted either, because a satisfactory model has not been established for this group of chemicals. Compared with the established models presented in the references (Table 1), the ADs of the models in this study have been broadened to chemicals with determined MOAs.

Table 1 QSAR models on the toxicity to V. fischeri from references

Nonlinear QSAR models

The nonlinear QSAR models were also investigated for chemicals based on classification and MOAs using the SVM method. The prediction ability and robustness as well as the calculating parameters are shown in Table 2.

Table 2 Parameters of nonlinear QSAR models

The results in Table 2 show that the nonlinear models developed with SVM performed better than the models developed with the MLR method, except for reactive chemicals without nitrogen atoms and carbonyl groups.

High-influence chemicals and outliers diagnosis

Leverage distance and Euclidean distance methods were used to characterize application domains (ADs). For the model established for baseline compounds, seven chemicals are defined as influential compounds with hi > h* and |s| < 3 using the leverage distance method (see Williams plots in Fig. 2a). Those influential compounds are not outliers. Three compounds (acetic acid, octane in the training set, and 1-chlorooctane in the verification set) are identified as outliers with |s| > 3. The toxicity of 1-chlorooctane and octane is overestimated by the model (Table 3). Outliers can occur for several reasons (Zhao et al. 2009; Hamadache et al. 2016; Wang et al. 2015b; Bakire et al. 2018). First, experimental error or uncertainty may be one reason for the deviation of 1-chlorooctane and octane. The experimental toxicity sometimes does not reflect the “true” toxicity of a compound because of experimental error or uncertainty (Zhao et al. 2009; Hamadache et al. 2016). The toxicity values of octane and 1-chlorooctane (2.11 and 2.57) seem too low compared with those of nonane and heptane (5.93 and 4.96), although they are structurally similar compounds. Second, species sensitivity may be another reason for the outliers. Most straight-chain alkanes belong to the baseline mode for toxicity to fish, whereas more straight-chain alkanes, including octane and 1-chlorooctane, are identified as outliers for toxicity to V. fischeri (Wang et al. 2016a). This means that V. fischeri is not sensitive to all alkanes, which results in more outliers being observed for V. fischeri toxicity than for fish toxicity. The toxicity of acetic acid was underestimated by the model, and the compound was also identified as an outlier. The pKa value of acetic acid is 4.9, whereas the pH of the V. fischeri toxicity test is close to 7, so acetic acid exists in its ionic form under the test conditions. The higher observed toxicity of acetic acid may be ascribed to this higher ionization, which makes it easier for the chemical to enter the cell tissue and interact with the organism, V. fischeri. This phenomenon has been observed in one other study (Zhang et al. 2010). In addition, for a highly hydrophilic compound, the water phases of V. fischeri are the main storage sites, rather than the lipid tissue. This would result in the underestimation of toxicity from log Kow (Wen et al. 2012).
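The degree of ionization invoked here can be checked with the Henderson-Hasselbalch relation for a monoprotic acid. The short sketch below uses a hypothetical helper name and takes the pKa (4.9) and test pH (about 7) quoted above as inputs. Under these values, about 99% of acetic acid is present in the ionized form.

```python
def ionized_fraction_acid(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid present as the anion at a given pH.

    Henderson-Hasselbalch: [A-]/[HA] = 10**(pH - pKa), so
    fraction ionized = 1 / (1 + 10**(pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Values quoted in the text: pKa = 4.9, test pH close to 7.
fraction = ionized_fraction_acid(4.9, 7.0)
print(f"ionized fraction at pH 7: {fraction:.3f}")
```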

Fig. 2

Plots of ADs by the leverage distance method for baseline chemicals (a the MLR method; b the SVM method)

Table 3 Outliers of the model for baseline chemicals

The AD of the nonlinear model for baseline chemicals is shown in Fig. 2b; octane in the training set and 1-chlorooctane in the verification set were again identified as outliers with |s| > 3. These results largely agree with those of the linear model.
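The classification used in the Williams plots can be illustrated with a minimal sketch. It is restricted to a model with a single descriptor and an intercept, for which the leverage has a closed form. It assumes the commonly used warning value h* = 3(p + 1)/n, with p the number of descriptors and n the number of training compounds, and it standardizes residuals by their sample standard deviation. The function name and both of these conventions are assumptions for illustration and are not taken from this study.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def williams_labels(
    x: Sequence[float],
    residuals: Sequence[float],
    s_limit: float = 3.0,
) -> Tuple[List[float], float, List[float], List[str]]:
    """Label compounds for a one-descriptor model with intercept."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Closed-form leverage for simple linear regression with an intercept.
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Assumed warning leverage h* = 3(p + 1)/n with p = 1 descriptor.
    h_star = 3.0 * (1 + 1) / n
    # Assumed standardization: residual divided by the sample standard deviation.
    sd = stdev(residuals)
    s = [r / sd for r in residuals]
    labels = []
    for hi, si in zip(h, s):
        if abs(si) > s_limit:
            labels.append("outlier")        # |s| > 3
        elif hi > h_star:
            labels.append("influential")    # hi > h* and |s| <= 3
        else:
            labels.append("inside")
    return h, h_star, s, labels
```

A compound far from the mean descriptor value but well predicted is therefore flagged as influential without being an outlier, which is the distinction drawn for the baseline model.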

For less inert chemicals, the ADs of the linear and nonlinear models are shown in Fig. S2A and Fig. S2B (Online Resource 1), respectively. For the linear model, four compounds (allylamine, 4-chloro-N-methylaniline, antioxidant 264, and p-aminodiphenylamine) in the training set and 4-n-nonylphenol in the verification set were defined as influential compounds with hi > h* and |s| < 3. Only one compound (4-bromophenol) in the training set, with |s| > 3, is regarded as an outlier of the model.

For reactive chemicals, three compounds in the training set (benzyl benzoate, metolachlor, and dithiocyanomethane) predicted by the linear model are identified as influential chemicals, with leverages exceeding the warning value (h* = 0.368) in group 1 (Fig. S3A and Fig. S3B in Online Resource 1). Similarly, the leverage value of 1′,4′-dichloro-p-xylene in the training set exceeded the warning value in group 2 (Fig. S3C and Fig. S3D in Online Resource 1). However, the predicted results are not significantly affected. The standardized residuals obtained by the MLR method are similar to those obtained using the SVM method, and there is no significant difference between the ADs determined by the two methods. All chemicals are within the ADs by both methods. The Williams plots for the linear and nonlinear models of group 3 are shown in Fig. S3E and Fig. S3F in Online Resource 1. The results reported for toxicity to green algae are consistent with those of the present study: no acceptable models were established for group 3 of reactive chemicals (nN = 0, n(C=O) > 0) (Bakire et al. 2018).

For specifically acting chemicals, the 31 compounds covered by the linear and nonlinear models are all within the AD of model (7) (Fig. S2C and Fig. S2D in Online Resource 1).

Based on the Euclidean distance method, plots of the standardized residual (s) versus Euclidean distance were used to characterize the ADs and to determine whether outliers exist. All plots, for both the linear and the nonlinear models, are shown in Fig. S9 and Fig. S10 (Online Resource 1). The outliers identified by the Euclidean distance method are the same as those identified by the leverage distance method. Because the largest Euclidean distance in the training set is set as the warning value (d*), no influential chemicals exist for any of the models. This difference from the leverage results is probably due to the different warning values adopted by the two methods.
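Why this choice of d* leaves no influential training compounds can be seen from a minimal sketch. It assumes that the Euclidean distance of each compound is measured from the centroid of the training-set descriptors; this reference point, the function name, and the toy descriptor values in the usage lines are assumptions for illustration and are not taken from this study.

```python
import math
from typing import List, Sequence, Tuple


def euclidean_ad(
    train: Sequence[Sequence[float]],
    query: Sequence[Sequence[float]],
) -> Tuple[List[float], float, List[bool]]:
    """Euclidean-distance AD check with d* set to the largest training distance."""
    n_desc = len(train[0])
    # Assumed reference point: centroid of the training descriptors.
    centroid = [sum(row[j] for row in train) / len(train) for j in range(n_desc)]
    d_train = [math.dist(row, centroid) for row in train]
    # Warning value d*: the largest distance found in the training set.
    d_star = max(d_train)
    # A query compound lies inside the AD when its distance does not exceed d*.
    inside = [math.dist(row, centroid) <= d_star for row in query]
    return d_train, d_star, inside


# Toy descriptor values, for illustration only.
d_train, d_star, inside = euclidean_ad(
    train=[[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
    query=[[1.0, 1.0], [5.0, 5.0]],
)
print(d_star, inside)
```

Since d* is the maximum of the training distances, no training compound can exceed it by construction, whereas the leverage warning value is fixed by the number of descriptors and compounds and can be exceeded within the training set.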

Conclusions

This study demonstrates that the global linear and nonlinear models built on all collected acute toxicity data of 1221 chemicals to V. fischeri were unsatisfactory for chemicals with structural diversity and different MOAs. Identification of the MOA is crucial for the establishment of mechanistically based QSAR models. MOA-based linear and nonlinear models have been developed for baseline, less inert, reactive, and specifically acting compounds based on the modified Verhaar classification scheme. QSAR models based on MOAs were more predictive and robust not only for baseline and less inert chemicals, but also for reactive and specifically acting compounds. Compared with the linear models obtained by the MLR method, the nonlinear models obtained by the SVM method performed better. There was no significant difference between the ADs determined by the SVM method and by the MLR method. The toxicity of the widest range of chemicals to V. fischeri could be predicted once the MOA of a chemical was assigned. The descriptors selected in the models reveal that the acute toxicity of baseline compounds is dominated by hydrophobicity, and that chemical polarizability affects the acute toxicity of less inert and reactive chemicals. The application domains of the linear and nonlinear models and the outliers have been discussed and explained. The models developed in this paper can be used to predict toxicity not only for baseline and less inert compounds, but also for reactive and specifically acting compounds. This information is valuable for the risk assessment of organic chemicals in the aquatic environment, especially for reactive and specifically acting compounds.