Introduction

Eco-toxicity data for nonreactive organic pollutants (personal care products, food, pesticides, and pharmaceuticals) are important for the development and improvement of chemical technology (Concu et al. 2017; Castillo-Garit et al. 2016; Kleandrova et al. 2014a, b). The release of chemical contaminants into the aquatic environment (Baun et al. 2000; Sánchez-Bayo 2006; Parvez et al. 2008) and into air (Raevsky et al. 2011) poses serious threats to environmental quality and to human health and is recognized as a global problem (Kleandrova et al. 2014a, b; Castillo-Garit et al. 2008; Papa et al. 2005; de Morais e Silva et al. 2018). In addition, ionic liquids are an important class of organic pollutants because of their use in everyday life (Peric et al. 2015; Ma et al. 2015). Another source of eco-toxic pollutants is the massive use of petroleum-derived organic solvents (Perales et al. 2017). Finally, nanomaterials have become an additional source of eco-toxic effects (Nowack and Mitrano 2018). Thus, the development of databases, together with predictive models of eco-toxicity for nonreactive pollutants, is an important task of biochemistry and medicinal chemistry.

The aim of this study is to evaluate the CORAL software (Toropova and Toropov 2014) as a possible tool for building predictive models of eco-toxicity. The index of ideality of correlation (IIC) (Toropova and Toropov 2017; Toropov and Toropova 2017; Toropov et al. 2018; Toropov and Toropova 2018) is examined as a criterion of the predictive potential of the CORAL models of eco-toxicity.

Method

Data

The experimental values of EC50 (effective molar concentration, mol/L) are represented by their negative decimal logarithm, pEC50. The data were taken from the literature (de Morais e Silva et al. 2018). These numerical data (n = 111) were randomly distributed into the training (n = 28), invisible training (n = 27), calibration (n = 29), and external validation (n = 27) sets. Table 1 confirms that the percentage of identical distribution between the splits is not large.
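The random distribution described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `random_split`, the seed, and the use of integer compound identifiers 1 to 111 are hypothetical and are not taken from the CORAL software.

```python
import random

def random_split(ids, sizes, seed=0):
    """Shuffle compound identifiers and cut them into named subsets.

    `sizes` maps a set name to the number of compounds it receives.
    """
    shuffled = list(ids)
    if sum(sizes.values()) != len(shuffled):
        raise ValueError("set sizes must add up to the number of compounds")
    random.Random(seed).shuffle(shuffled)  # seeded, hence reproducible
    split, start = {}, 0
    for name, size in sizes.items():
        split[name] = shuffled[start:start + size]
        start += size
    return split

# Set sizes as reported in the text; the identifiers 1..111 are placeholders.
SIZES = {"training": 28, "invisible training": 27,
         "calibration": 29, "validation": 27}
split = random_split(range(1, 112), SIZES, seed=1)
print({name: len(members) for name, members in split.items()})
```

Changing the seed gives a different split, which is one simple way to obtain several independent splits of the same data.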

Table 1 Percentage of identical distribution of compounds into the training, invisible training, calibration, and validation sets

Optimal descriptor

The optimal descriptor (Toropova and Toropov 2014) used here is calculated as follows:

$$ DCW\left({T}^{\ast },{N}^{\ast}\right)=\sum \limits_{k=1}^{NA} CW\left({S}_k\right)+\sum \limits_{k=1}^{NA-1} CW\left(S{S}_k\right) $$
(1)

Sk is a “SMILES-atom,” i.e., one symbol (e.g., “C,” “N,” and “O”) or two symbols that cannot be examined separately (e.g., “Cl” and “Si”); SSk is a combination of two SMILES-atoms. CW(Sk) and CW(SSk) are the so-called correlation weights of these SMILES attributes. The numerical values of CW(Sk) and CW(SSk) are calculated with the Monte Carlo method, i.e., an optimization procedure that maximizes a target function (TF).
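A minimal sketch of Eq. 1 may help to make the descriptor concrete. It is not the CORAL implementation: the tokenizer handles only a few two-symbol atoms, the correlation weights in `example_cw` are invented for illustration, and treating attributes without a weight as contributing zero is an assumption made here.

```python
import re

# Two-symbol SMILES-atoms must be matched before single symbols.
# This short pattern is a simplification assumed for illustration.
TOKEN = re.compile(r"Cl|Br|Si|.")

def smiles_atoms(smiles):
    """Split a SMILES string into SMILES-atoms (S_k)."""
    return TOKEN.findall(smiles)

def dcw(smiles, cw):
    """Sum the correlation weights of all S_k and of all adjacent pairs SS_k.

    Attributes that have no entry in `cw` contribute 0 (an assumption).
    """
    atoms = smiles_atoms(smiles)
    pairs = [a + b for a, b in zip(atoms, atoms[1:])]  # NA - 1 pairs
    return (sum(cw.get(a, 0.0) for a in atoms)
            + sum(cw.get(p, 0.0) for p in pairs))

# Hypothetical correlation weights, not taken from the study.
example_cw = {"C": 0.5, "Cl": 1.25, "O": -0.25, "CC": 0.125, "CCl": 0.75}

print(smiles_atoms("CCCl"))     # ['C', 'C', 'Cl']
print(dcw("CCCl", example_cw))  # 3.125
```

For “CCCl” the three SMILES-atoms contribute 0.5 + 0.5 + 1.25 and the two pairs contribute 0.125 + 0.75, which gives 3.125.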

The QSAR models are calculated with the Monte Carlo optimization of the target functions TF1 and TF2:

$$ {TF}_1={r}_{TRN}+{r}_{iTRN}-\left|{r}_{TRN}-{r}_{iTRN}\right|\ast 0.1 $$
(2)
$$ {TF}_2={TF}_1+{IIC}_{CLB}\ast 0.1 $$
(3)

Here, rTRN and riTRN are the correlation coefficients between the observed and predicted endpoint for the training and invisible training sets, respectively.
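The target function of Eq. 2 can be sketched as follows, assuming that the correlation coefficient is the Pearson coefficient. The function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tf1(obs_trn, calc_trn, obs_itrn, calc_itrn):
    """Eq. 2: reward both correlations and penalize their difference."""
    r_trn = pearson_r(obs_trn, calc_trn)
    r_itrn = pearson_r(obs_itrn, calc_itrn)
    return r_trn + r_itrn - abs(r_trn - r_itrn) * 0.1
```

The penalty term lowers the target function when the model fits the training set much better than the invisible training set, so the optimization is steered away from weights that fit only the training set.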

The IICCLB is calculated with data on the calibration (CLB) set as follows:

$$ {IIC}_{CLB}={r}_{CLB}\frac{\min \left({}^{-}{MAE}_{CLB},{}^{+}{MAE}_{CLB}\right)}{\max \left({}^{-}{MAE}_{CLB},{}^{+}{MAE}_{CLB}\right)} $$
(4)
$$ {}^{-}{MAE}_{CLB}=\frac{1}{{}^{-}N}\sum \limits_{k=1}^{{}^{-}N}\left|{\varDelta}_k\right|,\kern0.5em {\varDelta}_k<0;\kern0.5em {}^{-}N\ \mathrm{is}\ \mathrm{the}\ \mathrm{number}\ \mathrm{of}\ {\varDelta}_k<0 $$
(5)
$$ {}^{+}{MAE}_{CLB}=\frac{1}{{}^{+}N}\sum \limits_{k=1}^{{}^{+}N}\left|{\varDelta}_k\right|,\kern0.5em {\varDelta}_k\ge 0;\kern0.5em {}^{+}N\ \mathrm{is}\ \mathrm{the}\ \mathrm{number}\ \mathrm{of}\ {\varDelta}_k\ge 0 $$
(6)
$$ {\varDelta}_k={\mathrm{observed}}_k-{\mathrm{calculated}}_k $$
(7)

Here, observed_k and calculated_k are the experimental and calculated values of pEC50 for the k-th compound.
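The definition of the IIC (Eqs. 4 to 7) can be sketched as follows. The function names are hypothetical, the Pearson coefficient is assumed for r, and returning zero when all residuals have the same sign is a choice made here, since the equations leave that case undefined.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]  # Eq. 7
    neg = [abs(d) for d in deltas if d < 0]
    pos = [abs(d) for d in deltas if d >= 0]
    if not neg or not pos:
        return 0.0  # assumption: one-sided residuals give IIC = 0
    mae_neg = sum(neg) / len(neg)  # Eq. 5
    mae_pos = sum(pos) / len(pos)  # Eq. 6
    r = pearson_r(observed, calculated)
    return r * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)  # Eq. 4
```

The ratio of the two mean absolute errors equals 1 only when overestimation and underestimation are balanced, so the IIC is lowered for a model whose errors are systematically shifted to one side even when the correlation coefficient is high.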

Having the numerical data on the CW(Sk) and CW(SSk), the predictive model is calculated by the least squares method with compounds from the training set:

$$ p{EC}_{50}={C}_0+{C}_1\ast DCW\left({T}^{\ast },{N}^{\ast}\right) $$
(8)
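The one-variable least squares fit behind Eq. 8 can be sketched as follows. The function names and the numerical values in the example are placeholders, not data from the study.

```python
def fit_line(dcw_values, pec50_values):
    """Ordinary least squares for pEC50 = C0 + C1 * DCW."""
    n = len(dcw_values)
    mx = sum(dcw_values) / n
    my = sum(pec50_values) / n
    sxx = sum((x - mx) ** 2 for x in dcw_values)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dcw_values, pec50_values))
    c1 = sxy / sxx
    c0 = my - c1 * mx
    return c0, c1

def predict(dcw_value, c0, c1):
    """Apply the fitted model to the descriptor of one compound."""
    return c0 + c1 * dcw_value

# Placeholder descriptor values and endpoints lying exactly on y = 1 + 2x.
c0, c1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(c0, c1)  # 1.0 2.0
```

Only the training set enters the fit; the other sets are then predicted with the same C0 and C1.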

Results and discussion

Models for pEC50 were built for three random splits with each of the two target functions, TF1 (Eq. 2) and TF2 (Eq. 3).

In the case of TF1 these models are the following:

$$ \mathrm{pEC}_{50}=1.732\left(\pm 0.027\right)+0.3695\left(\pm 0.0047\right)\ast \mathrm{DCW}\left(1,2\right) $$
(9)
$$ \mathrm{pEC}_{50}=1.842\left(\pm 0.042\right)+0.3694\left(\pm 0.0063\right)\ast \mathrm{DCW}\left(1,6\right) $$
(10)
$$ \mathrm{pEC}_{50}=1.784\left(\pm 0.023\right)+0.4488\left(\pm 0.0046\right)\ast \mathrm{DCW}\left(1,2\right) $$
(11)

In the case of TF2, these models are the following:

$$ \mathrm{pEC}_{50}=1.582\left(\pm 0.048\right)+0.3745\left(\pm 0.0069\right)\ast \mathrm{DCW}\left(1,15\right) $$
(12)
$$ \mathrm{pEC}_{50}=1.366\left(\pm 0.054\right)+0.2766\left(\pm 0.0052\right)\ast \mathrm{DCW}\left(1,15\right) $$
(13)
$$ \mathrm{pEC}_{50}=2.009\left(\pm 0.036\right)+0.4891\left(\pm 0.0091\right)\ast \mathrm{DCW}\left(1,15\right) $$
(14)

Table 2 contains the statistical characteristics of the models calculated with Eqs. 9-14. Comparison of these models with the model from the literature (de Morais e Silva et al. 2018) shows that the CORAL models are better for the external validation set.

Table 2 The statistical characteristics of models for eco-toxicity

Figure 1 compares the co-evolution of the correlations between observed and calculated pEC50 for the training, invisible training, and calibration sets. The absence of overtraining is the main difference between the optimization with TF2 and the optimization with TF1, and it is an advantage of the optimization with TF2.

Fig. 1 Co-evolution of correlations between observed pEC50 and calculated pEC50 for the training (white circle), invisible training (dark circle), and calibration (white triangle) sets when applying target function TF1 (Eq. 2) and TF2 (Eq. 3)

The concordance correlation coefficient (CCC) (I-Kuei Lin 1989) and the average <Rm2> (Roy et al. 2009; Ojha et al. 2011) are widely used criteria of the predictive potential of a QSAR model. That is, if there are model-1 and model-2 and CCC-1 is larger than CCC-2, then model-1 should have better predictive potential for external compounds. Analogously, if Rm2-1 is larger than Rm2-2, then model-1 should have better predictive potential for external compounds. The same principle applies to the IIC: a larger value of the IIC should be observed for the model with better predictive potential. The CCC and <Rm2> give the correct recommendation for the pairs of models built with TF1 and TF2 for splits #1 and #3, but for split #2 these criteria give the wrong recommendation (Table 2). The IIC gives the correct recommendation for all three splits. Thus, CCC (I-Kuei Lin 1989), <Rm2> (Roy et al. 2009; Ojha et al. 2011), and IIC (Toropova and Toropov 2017; Toropov and Toropova 2017; Toropov et al. 2018; Toropov and Toropova 2018) are different criteria of predictive potential.
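For comparison with the IIC, the concordance correlation coefficient can be sketched as follows, using the usual definition with population (1/n) variances and covariance. The function name is hypothetical, and the <Rm2> metric is not shown.

```python
def ccc(observed, calculated):
    """Concordance correlation coefficient between two sequences."""
    n = len(observed)
    mx = sum(observed) / n
    my = sum(calculated) / n
    sxx = sum((x - mx) ** 2 for x in observed) / n
    syy = sum((y - my) ** 2 for y in calculated) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, calculated)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

# A constant shift of one unit keeps the Pearson r at 1 but lowers the CCC.
print(ccc([1, 2, 3], [2, 3, 4]))
```

The CCC penalizes any departure from the line observed = calculated through the term (mx - my)^2, whereas the IIC penalizes an imbalance between positive and negative residuals, so the two criteria can rank a pair of models differently.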

The supplementary materials confirm that the CORAL approach complies with the OECD principles: Table S1 contains the definition of the domain of applicability; Table S2 contains the mechanistic interpretation of the CORAL model in terms of SMILES attributes that promote an increase or a decrease of pEC50; Table S3 contains the observed and calculated pEC50 together with the distribution into the training, invisible training, calibration, and validation sets.

Conclusions

The CORAL software is a suitable tool for building predictive models of eco-toxicity for the compounds examined here. The target function TF2 gives models with better predictive potential than models based on the Monte Carlo optimization with TF1. This advantage of the IIC was checked with three random splits. Hence, the IIC can be a useful criterion of the predictive potential of QSAR models of eco-toxicity.