
1 Introduction

The QSAR world has undergone profound changes since the pioneering work of Corwin Hansch, considered the founder of modern QSAR modeling [1, 2]. The main change is reflected in the growth of a parallel and quite different conceptual approach to modeling the relationship between a chemical’s structure and its activity/properties.

In the Hansch approach, still widely applied and followed by many QSAR modelers (for instance, [3–5]), molecular structure is represented by only a few molecular descriptors (typically log Kow, Hammett constants, HOMO/LUMO energies, some steric parameters) selected personally by the modeler and inserted in the QSAR equation to model a studied endpoint. Alternatively, in a different approach, chemical structure is represented, in a first preliminary step, by a large number of theoretical molecular descriptors; in a second step, the descriptors best correlated with the response are selected by different chemometric methods and, finally, included in the QSAR model (the algorithm), the fundamental aim being the optimization of model performance for prediction.

According to the Hansch approach, descriptor selection is guided by the modeler’s conviction of having a priori knowledge of the mechanism of the studied activity/property. The modeler presumes to assign mechanistic meaning to each molecular descriptor used, chosen from among a limited pool of potential modeling variables. These descriptors are normally well known and used repeatedly (for instance, log Kow is a universal parameter mimicking cell membrane permeation, thus it is used in models for toxicity, but it is also related to various partition-based properties such as bioconcentration/bioaccumulation and the soil sorption coefficient; HOMO/LUMO energies are often selected for modeling chemical reactivity, etc.).

On the other hand, the “statistical” approach, parallel to the previous so-called “mechanistic” one, is based on the fundamental conviction that the QSAR modeler should not influence descriptor selection a priori and personally through mechanistic assumptions. Instead, unbiased mathematical tools should be applied to select, from a wide pool of input descriptors, those most correlated to the studied response. The number and typology of the available input descriptors must be as large and diverse as possible in order to guarantee that any aspect of the molecular structure can be represented; different descriptors are different ways, or perspectives, to view a molecule. Descriptor selection should be performed by mathematical approaches that maximize, as an optimization parameter, the predictive power of the QSAR model, since the real utility of any model is its predictivity.

The first aim of any modeler, in both the mechanistic and statistical approaches, should be the validation of the QSAR model for predictive purposes. In fact, a QSAR model must first of all be a real model, robust and predictive, to be considered reliable; only a stable and predictive model can be usefully interpreted for its mechanistic meaning, even though this is not always easy or feasible [6]. In statistical QSAR modeling, however, interpretation is a second step.

QSAR model validation has been recognized by specific OECD expert groups as a crucial and urgent requirement in recent years, and this has led to the development, for regulatory purposes, of the “OECD principles for the validation of (Q)SAR models” (http://www.oecd.org/document/23/0,3343,fr_2649_34365_33957015_1_1_1_1,00.html).

The need for this important action was mainly due to the recent new chemicals policy of the European Commission (REACH: Registration, Evaluation, Authorization and restriction of Chemicals) (http://europa.eu.int/comm/environment/chemicals/reach.htm), which explicitly states the need to use (Q)SAR models to reduce experimental testing (including animal testing). Obviously, to meet the requirements of the REACH legislation (see also Chapter 13) it is essential to use (Q)SAR models that produce reliable estimates, i.e., validated (Q)SAR models. Thus, a reliable QSAR model must be associated with the following information: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness, and predictivity; (5) a mechanistic interpretation, if possible.

In the following sections, some crucial points of the statistical approach to QSAR modeling, as applied by the author’s group, are put into context according to the OECD principles, which map onto the steps of the chemometric approach.

2 A Defined Endpoint (OECD Principle 1)

The most common regulatory endpoints, associated with OECD test guidelines, are related to (a) physico-chemical properties (such as melting and boiling points, vapor pressure, Kow, Koc, water solubility); (b) environmental fate (such as biodegradation, hydrolysis, atmospheric oxidation, bioaccumulation); (c) human health (acute oral, acute inhalation, acute dermal, skin irritation, eye irritation, skin sensitization, genotoxicity, reproductive and developmental toxicity, carcinogenicity, specific organ toxicity (e.g., hepatotoxicity, cardiotoxicity)); and (d) ecological effects (acute fish, acute daphnid, alga, long-term aquatic, and terrestrial toxicity) of chemicals.

The various experimental endpoints that have been modelled by the QSAR Research Unit of Insubria University are described in the following sections, after the discussion of the main methodological topics. A distinction will be made between single endpoints and cumulative endpoints, which take into account the simultaneous contribution of different properties or activities.

3 An Unambiguous Algorithm (OECD Principle 2)

The algorithms used in (Q)SAR modeling should be described thoroughly, so that users understand exactly how an estimated value was produced and can reproduce the calculations exactly, also for new chemicals, if desired.

When the studied endpoint needs to be modelled using more than one descriptor (selected by different approaches), multivariate techniques are applied. As there can be multiple steps in estimating the endpoint of a chemical, it is important that the nature of the algorithms used be unambiguous, as required by OECD Principle 2.

3.1 Chemometric Methods

3.1.1 Regression Models

Regression analysis is the use of statistical methods for modeling a dependent variable Y, a quantitative measure of response (e.g., boiling point, LD50), in terms of predictors X (independent variables or molecular descriptors).

There are many different multivariate methods for regression analysis, more or less widely applied in QSAR studies: multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), artificial neural networks (ANNs), and fuzzy clustering and regression are among the more commonly used approaches for regression modeling.

Although all QSAR models (linear and non-linear) are based on algorithms, the most common regression method, which describes models by completely transparent and easily reproducible mathematical equations, is multiple linear regression (MLR), in particular the ordinary least squares (OLS) method. This method has been applied by the author in her QSAR studies; for some of the most recent papers, see [7–28] and Chapter 6. Some of these models are commented on in the following paragraphs.

The correlation of the variables in the model must be controlled carefully (for instance, by applying the QUIK rule [29], sketched below), and the problem of possible overfitting [30], common also to other modeling methods, must be checked by statistical validation methods that verify robustness and predictivity. The selection of descriptors in MLR can be performed either a priori by the model developer on a mechanistic basis or by evolutionary techniques such as genetic algorithms. In this second approach, the model developer should try to interpret the selected descriptors mechanistically, but only after model development and statistical validation for predictivity.
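As an illustration of this correlation filter, the following is a minimal sketch, assuming the usual eigenvalue-based definition of Todeschini’s K multivariate correlation index [29]; the data and the helper names (k_index, quik_rule) are hypothetical, and the 5% threshold anticipates the QUIK rule as stated later in this chapter.

```python
import numpy as np

def k_index(M):
    """Multivariate correlation index K of the columns of M
    (0 = uncorrelated variables, 1 = perfectly correlated)."""
    lam = np.linalg.eigvalsh(np.corrcoef(M, rowvar=False))
    lam = np.clip(lam, 0.0, None)        # guard against tiny negative eigenvalues
    p = lam.size
    frac = lam / lam.sum()
    return np.abs(frac - 1.0 / p).sum() / (2.0 * (p - 1) / p)

def quik_rule(X, y, delta_k=0.05):
    """QUIK filter: accept the model only if K(X+Y) exceeds K(X) by delta_k."""
    kxx = k_index(X)
    kxy = k_index(np.column_stack([X, y]))
    return kxy - kxx >= delta_k, kxx, kxy

# hypothetical example: 30 chemicals, 4 descriptors, correlated response
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([0.8, -0.5, 0.3, 0.1]) + rng.normal(scale=0.3, size=30)
ok, kxx, kxy = quik_rule(X, y)
print(f"Kxx={kxx:.3f}  Kxy={kxy:.3f}  passes QUIK: {ok}")
```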

3.1.2 Classification Models

Another common problem in QSAR analysis is the prediction of group membership from molecular descriptors. In the simplest case, chemicals are categorized into two or more groups depending on their activity, each group indicated by the same value of a categorical variable: active/inactive or, for instance, toxic/non-toxic.

Classification models are quantitative models based on relationships between independent variables X (in this case molecular descriptors) and a categorical response variable of integer numerical values, each representing the class of the corresponding sample.

The term “quantitative” refers to the numerical values of the variables necessary to classify the chemicals into the qualitative classes (a categorical response), and it specifies the quantitative meaning of a QSAR-based classification process.

Such classification, also called supervised pattern recognition, is the assignment, on the basis of a classification rule, of chemicals to one of the classes defined a priori (or of groups of chemicals in the training set). Thus, the goal of a classification method is to develop a classification rule (by selecting the predictor variables) based on a training set of chemicals of known classes, so that the rule can be applied to a test set of compounds of unknown classes. A wide range of classification methods exists, including discriminant analysis (DA; linear, quadratic, and regularized DA), soft independent modeling of class analogy (SIMCA), k-nearest neighbors (k-NN), classification and regression trees (CART), artificial neural networks, support vector machines, etc.

The QSAR Research Unit of Insubria University has developed some satisfactory, validated, and usable classification models (for instance, among the more recent, [16, 31–35]) by applying different classification methods, mainly classification and regression trees (CART) [36, 37], k-nearest neighbors (k-NN) [38], and artificial neural networks (in particular, Kohonen maps or self-organizing maps (SOM) [39–41]).

CART is a non-parametric, unbiased classification strategy with automatic stepwise variable selection. As the final output, CART displays a binary, immediately applicable classification tree; each non-terminal node corresponds to a discriminant variable (with the threshold value of that molecular descriptor) and each terminal node corresponds to a single class. To classify a chemical, at each binary node the tree branch matching the value of the chemical on the corresponding splitting descriptor must be followed.
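As a minimal illustration (not the software actually used in the cited studies), scikit-learn’s DecisionTreeClassifier, an optimized CART-type implementation, can grow and display such a binary tree on hypothetical data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical data: 100 chemicals, 5 descriptors, binary toxicity class
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # toy rule: "toxic" if positive

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# the fitted tree is itself the classification rule: each internal node tests
# one descriptor against a threshold, each terminal leaf assigns a class
print(export_text(tree, feature_names=[f"D{i}" for i in range(5)]))
print(tree.predict(rng.normal(size=(3, 5))))    # classify new chemicals
```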

The k-NN method is a non-parametric, unbiased classification method that searches for the k nearest neighbors of each chemical in a data set. The compound under study is assigned to the majority class among its k nearest chemicals. k-NN is applied to autoscaled data with a priori probability proportional to the size of the classes; the predictive power of the model is checked for k between 1 and 10.
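A minimal sketch of this scheme on hypothetical data, with autoscaling and a screen of k between 1 and 10 as described above (the toy data and the 5-fold accuracy criterion are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))                 # hypothetical descriptors
y = (X[:, 0] - X[:, 1] > 0).astype(int)      # hypothetical classes

# autoscaling + k-NN, screening k = 1..10
for k in range(1, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k:2d}  CV accuracy={acc:.3f}")
```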

Counter-propagation artificial neural networks (CP-ANNs), built on Kohonen maps, are supervised classification methods. The input variables (molecular descriptors) calculated for the studied chemicals provide the input for the net, the Kohonen layer. The architecture of the net is N × N × p, where p is the number of input variables; each p-dimensional weight vector is a neuron. Thus, the neurons are vectors of weights corresponding to the input variables. During learning, the n chemicals are presented to the net, one at a time, a fixed number of times (epochs); each chemical is assigned to the cell for which the distance between the chemical vector and the neuron is minimal. The target values (i.e., the classes to be modelled) are given to the output layer (the top-map: a two-dimensional plane of response), which has the same topological arrangement of neurons as the Kohonen layer. The position of each chemical is projected to the output layer and the weights are corrected so that they fit the output values (classes) of the corresponding chemicals. The Kohonen ANN automatically adapts itself in such a way that similar input objects are associated with topologically close neurons in the top-map; chemical similarity decreases with increasing topological distance.

The trained network can be used for predictions: a new object in the Kohonen layer will lie on the neuron with the most similar weights, and this position is then projected to the top-map, which provides the predicted output value. It is important to remember that the Kohonen top-map has toroidal geometry: each neuron has the same number of neighbors, including the neurons on the borders of the top-map.
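The following is a compact, illustrative numpy sketch of a Kohonen layer with a toroidal neighborhood and a counter-propagation-style class assignment; it is a simplification of the method described (map size, learning-rate schedule, and data are hypothetical), not the implementation used by the author’s group:

```python
import numpy as np

def toroidal_dist2(grid, bmu_pos, N):
    """Squared grid distance from every neuron to the winner on a torus,
    so that border neurons have the same number of neighbors."""
    d = np.abs(grid - bmu_pos)
    d = np.minimum(d, N - d)                 # wrap-around (toroidal) distance
    return (d ** 2).sum(axis=1)

def train_som(X, N=5, epochs=40, lr0=0.5, seed=0):
    """Minimal Kohonen layer: N x N neurons with p-dimensional weight vectors."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(N * N, X.shape[1]))
    grid = np.column_stack(np.divmod(np.arange(N * N), N))
    sigma0 = N / 2.0
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)
        sigma = max(sigma0 * (1.0 - t / epochs), 0.5)
        for x in rng.permutation(X):
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))        # winning neuron
            h = np.exp(-toroidal_dist2(grid, grid[bmu], N) / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)                     # pull neighborhood
    return W

# hypothetical training data: 60 chemicals, 5 descriptors, two classes
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

# counter-propagation-style output layer: each neuron is labeled with the
# majority class of the training chemicals that map onto it
W = train_som(X)
bmu_of = lambda x: np.argmin(((W - x) ** 2).sum(axis=1))
bmus = np.array([bmu_of(x) for x in X])
neuron_class = np.full(len(W), -1)           # -1 = neuron with no chemicals
for n in np.unique(bmus):
    neuron_class[n] = np.bincount(y[bmus == n]).argmax()

x_new = rng.normal(size=5)
print("predicted class of new chemical:", neuron_class[bmu_of(x_new)])
```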

According to the OECD principles, for a QSAR model to be acceptable for making regulatory decisions, it must be clearly defined and continuously applicable, in such a way that the calculations for the prediction of the endpoint can be reproduced by everyone and applied to new chemicals. The unambiguous algorithm is characterized not only by the mathematical method of calculation used, but also by the specific molecular descriptors required in the model equation. Thus, the exact procedure used to calculate the descriptors, including compound pre-treatment (e.g., energy minimization, partial charge calculation), the software employed, and the variable selection method used for QSAR model development should be considered integral parts of the overall definition of an unambiguous algorithm.

3.2 Theoretical Molecular Descriptors

It has become quite common to use a wide set of molecular descriptors of different kinds (experimental and/or theoretical), able to capture all the structural aspects of a chemical, to translate the molecular structure into numbers. The various descriptors are different ways, or perspectives, to view a molecule, taking into account the various features of its chemical structure: not only one-dimensional (e.g., simple counts of atoms and groups), but also two-dimensional, from a topological graph, or three-dimensional, from a minimum-energy conformation. Livingstone has published a survey of these approaches [42]. Much software calculates broad sets of different theoretical descriptors from SMILES, 2D graphs, or 3D x,y,z-coordinates. Frequently used descriptor-calculation software includes ADAPT [43], OASIS [44], CODESSA [45], DRAGON [46], and MolConnZ [47]. It has been estimated that more than 3000 molecular descriptors are now available, and most of them have been summarized and explained [48–50]. The great advantage of theoretical descriptors is that they can be calculated homogeneously by a defined software for all chemicals, even those not yet synthesized, the only requirement being a hypothesized chemical structure. This peculiarity explains their wide and successful use in QSAR modeling. The DRAGON software has always been used in models developed by the author’s group. In the version most frequently used by the author (5.4), 1664 molecular descriptors of the following typologies were calculated: (a) 0D: 48 constitutional (atom and group counts); (b) 1D: 14 charge descriptors; (c) 1D: 29 molecular properties; (d) 2D: 119 topological; (e) 2D: 47 walk and path counts; (f) 2D: 33 connectivity indices; (g) 2D: 47 information indices; (h) 2D: 96 various autocorrelations from the molecular graph; (i) 2D: 107 edge adjacency indices; (j) 2D: 64 Burden descriptors (BCUT eigenvalues); (k) 2D: 21 topological charge indices; (l) 2D: 44 eigenvalue-based indices; (m) 3D: 41 Randic molecular profiles; (n) 3D: 74 geometrical descriptors; (o) 3D: 150 radial distribution functions; (p) 3D: 160 3D-MoRSE descriptors; (q) 3D: 99 weighted holistic invariant molecular descriptors (WHIMs) [51–53]; (r) 3D: 197 geometry, topology, and atom-weights assembly (GETAWAY) descriptors [54, 55]; (s) 154 functional groups; (t) 120 atom-centered fragments. The list and meaning of the molecular descriptors are provided with the DRAGON package, and the calculation procedures are explained in detail, with related literature references, in the Handbook of Molecular Descriptors by Todeschini and Consonni [50] and in Chapter 3. The DRAGON software is continuously updated with new descriptors.

3.3 Variable Selection and Reduction. The Genetic Algorithm Strategy for Variable Selection

The existence of a huge number of different molecular descriptors, experimental or theoretical, to describe chemical structure is a great resource, as it allows QSAR modelers (particularly those working with the statistical approach) to have different X-variables available that take into account each structural feature in various ways. In principle, all possible combinations of the X-variables should be investigated to find the most predictive QSAR model; however, this is computationally very demanding, mainly for reasons of time.

Molecular descriptors that are only different views of the same molecular aspect are often highly correlated. Thus, when dealing with a large number of highly correlated descriptors, variable selection is necessary to find a simple and predictive QSAR model, based on as few, and as weakly inter-correlated, descriptors as possible. First, objective selection is applied using only the independent variables (X): descriptors to discard are identified by tests for identical values and by pairwise correlations, retaining descriptors less correlated to one another (a minimal sketch follows).
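A minimal sketch of such an objective (unsupervised) pre-reduction on a hypothetical descriptor table; the 0.95 correlation cut-off is an illustrative assumption:

```python
import numpy as np
import pandas as pd

def objective_selection(X: pd.DataFrame, r_max=0.95):
    """Unsupervised pre-reduction: drop constant descriptors, then one of
    each pair whose absolute pairwise correlation exceeds r_max."""
    X = X.loc[:, X.nunique() > 1]                 # remove identical/constant values
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > r_max).any()]
    return X.drop(columns=to_drop)

# hypothetical descriptor table: D2 nearly duplicates D0, D3 is constant
rng = np.random.default_rng(4)
df = pd.DataFrame({"D0": rng.normal(size=20), "D1": rng.normal(size=20)})
df["D2"] = df["D0"] * 1.001
df["D3"] = 1.0
print(objective_selection(df).columns.tolist())   # -> ['D0', 'D1']
```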

Secondly, modeling variable selection methods, which additionally use the dependent variable values (Y), are applied to this pre-reduced set of descriptors to further reduce it to the true modeling set, optimized not only in fitting but, most importantly, in prediction. Such selection is performed by alternative variable selection methods.

Several strategies for variable subset selection have been applied in QSAR (stepwise regressions, forward selection, backward elimination, simulated annealing, evolutionary and genetic algorithms, among those most widely applied). A comparison of these methods [56] has demonstrated the advantages, and the success, of genetic algorithms (GAs) as a variable selection procedure for QSAR studies.

GAs are a particular kind of evolutionary algorithm (EA), shown to be able to solve complex optimization problems in a number of fields, including chemistry [57–59]. They apply the natural principles of the evolution of species in the biological world: the assumption that conditions leading to better results will prevail over poorer ones, and that improvement can be obtained by different kinds of recombination of the independent variables, i.e., reproduction, mutation, and crossover. The quality of each selected solution is measured by a fitness function that has to be optimized.

Genetic algorithms, first proposed as a strategy for variable subset selection in multivariate analysis by Leardi et al. [60] and applied to QSAR modeling by Rogers and Hopfinger [61], are a very effective tool with many merits compared to other methods. GAs are now widely and successfully applied in QSAR approaches, where quite a number of molecular descriptors are available, in various modified versions depending on the way reproduction, crossover, mutation, etc. are performed [62–66].

In variable selection for QSAR studies, a bit set to 1 denotes a variable (molecular descriptor) included in the regression model, and a bit set to 0 one that is excluded. A population of 0/1 bit strings (each of length equal to the total number of candidate variables) is evolved following genetic algorithm rules, maximizing the predictive power of the models (verified by the explained variance in prediction, \( {\rm{Q}}_{{\rm{cv}}}^2 \), or by the root mean squared error of prediction, RMSEcv). Only the models with the highest predictive power are finally retained and further analyzed with additional validation techniques.
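A minimal, illustrative sketch of this bit-string scheme, with one-point crossover, bit-flip mutation, and leave-one-out \( {\rm{Q}}^2 \) as the fitness function; the population size, number of generations, and mutation rate are hypothetical choices, not the settings used in the cited studies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2_loo(X, y):
    """Explained variance in prediction: Q2 = 1 - PRESS / TSS."""
    y_hat = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def ga_select(X, y, pop=20, gens=25, p_mut=0.05, seed=0):
    """Evolve 0/1 bit strings (1 = descriptor included) maximizing Q2_LOO."""
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]
    genes = rng.random((pop, n_var)) < 0.2            # sparse initial strings

    def fitness(g):
        return q2_loo(X[:, g], y) if g.any() else -np.inf

    for _ in range(gens):
        order = np.argsort([fitness(g) for g in genes])[::-1]
        genes = genes[order]                          # best models first
        for i in range(pop // 2, pop):                # replace the worst half
            a, b = genes[rng.integers(0, pop // 2, size=2)]
            cut = rng.integers(1, n_var)              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_var) < p_mut        # bit-flip mutation
            genes[i] = child
    fits = [fitness(g) for g in genes]
    best = genes[int(np.argmax(fits))]
    return best, max(fits)

# hypothetical data: only descriptors 0 and 3 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=40)
mask, q2 = ga_select(X, y)
print("selected descriptors:", np.flatnonzero(mask), " Q2_LOO =", round(q2, 3))
```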

Whereas EAs search for the global optimum and end up with only one or very few results [64, 65, 67], GAs simultaneously create many different results of comparable quality in larger populations of models with more or less the same predictive power. Within a given population the selected models can differ in the number and kind of variables. Similar descriptors, able to capture some specific aspect of chemical structure, can be selected by the GA in alternative combinations for modeling the response; thus, similarly performing models can be considered as different perspectives arriving at essentially the same conclusion. Owing to this, the GA-based approach has no single “best” set of descriptors related to the Y-dependent variable: there is a population of good models of similar performance, which could also be combined in consensus modeling approaches [18, 19] to obtain averaged predictions.

Different rules can be adopted to select the final preferred “best” models. In the author’s research the QUIK (Q under influence of K) rule [29] is always applied as the first filter, to reject models whose descriptors are multi-collinear without real prediction power or with only “apparent” prediction power (chance correlation). According to this rule, only models with a K multivariate correlation, calculated on the X+Y block, at least 5% greater than the K correlation of the X-block are considered statistically significant and are checked for predictivity (both internally, by different cross-validations, and externally, on chemicals that did not participate in model development).

Another important parameter that must be considered is the root mean squared error (RMSE), which summarizes the overall error of the model; it is calculated as the square root of the sum of squared errors, in fitting (RMSE) or in prediction (RMSEcv and RMSEp), divided by the number of compounds. The best model has the smallest RMSE and very similar RMSE values for the training and external prediction chemicals, highlighting the model’s generalizability [68].
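In formula terms, with \(y_i\) the experimental responses, \(\hat{y}_i\) the corresponding calculated (RMSE), cross-validated (RMSEcv), or externally predicted (RMSEp) responses, and n the number of compounds in the set considered:

$${\rm{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^n {\left( {y_i - \hat y_i } \right)^2 } }}{n}}$$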

4 Applicability Domain (OECD Principle 3)

The third OECD principle takes into consideration another crucial problem: the definition of the applicability domain (AD) of a QSAR model. Even a robust, significant, and validated QSAR model cannot be expected to reliably predict the modelled property for the entire universe of chemicals. In fact, only predictions for chemicals falling within the domain of the developed model can be considered reliable interpolations rather than model extrapolations. This topic was dealt with at a recent workshop where several different approaches for linear and non-linear models were proposed [69].

The AD is a theoretical spatial region defined by the model descriptors and the modelled response, and is thus determined by the nature of the chemicals in the training set, represented in each model by the specific molecular descriptors. To clarify recent doubts [70], it is important to note that each QSAR model has its own specific AD, based not just on the kind of chemicals included in the training set but also on the values of the specific descriptors used in the model itself; these descriptors, in turn, depend on the type of the training chemicals.

As explained above, a population of MLR models of similar good quality, developed by variable selection with a genetic algorithm [66], can include 100 different models built on the same training set but based on different descriptors: even if developed on the same chemicals, the AD for new chemicals can differ from model to model, depending on the specific descriptors. Through the leverage approach [71] (shown below) it is possible to verify whether a new chemical lies within the model domain (in which case the predicted data can be considered interpolated, with uncertainty at least similar to that of the training chemicals, and thus more reliable) or outside the domain (in which case the predicted data are extrapolated by the model and must be considered of increased uncertainty, thus less reliable). If a chemical is outside the model domain, a warning must be given. Leverage is used as a quantitative measure of the model applicability domain and is suitable for evaluating the degree of extrapolation; it represents a sort of compound “distance” from the model experimental space (the structural centroid of the training set). It is a measure of the influence a particular chemical’s structure has on the model: chemicals close to the centroid are less influential in model building than extreme points. A compound with high leverage in a QSAR model would reinforce the model if the compound is in the training set, but such a compound in the test set could have unreliable predicted data, the result of substantial extrapolation of the model.

The prediction should be considered unreliable for compounds in the test set with high-leverage values (h>h*, the critical value being h*=3p/n, where p is the number of model variables plus one and n is the number of the objects used to calculate the model). When the leverage value of a compound is lower than the critical value, the probability of accordance between predicted and actual values is as high as that for the training set chemicals. Conversely, a high-leverage chemical is structurally distant from the training chemicals, thus it can be considered outside the AD of the model. To visualize the AD of a QSAR model, the plot of standardized cross-validated residuals (R) vs. leverage (Hat diagonal) values (h) (the Williams plot) can be used for an immediate and simple graphical detection of both the response outliers (i.e., compounds with cross-validated standardized residuals greater than three standard deviation units, >3σ) and structurally influential chemicals in a model (h>h*).
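A minimal sketch of the leverage calculation on hypothetical data (h as the diagonal of the hat matrix, h* = 3p/n as defined above); in a Williams plot these h values would be paired with the standardized cross-validated residuals:

```python
import numpy as np

def hat_diagonal(X_model):
    """Leverages: diagonal of H = X (X'X)^-1 X', with an intercept column."""
    X1 = np.column_stack([np.ones(len(X_model)), X_model])
    XtX_inv = np.linalg.inv(X1.T @ X1)
    return np.einsum("ij,jk,ik->i", X1, XtX_inv, X1), XtX_inv

# hypothetical training matrix: 50 chemicals, 3 model descriptors
rng = np.random.default_rng(2)
X_train = rng.normal(size=(50, 3))
h, XtX_inv = hat_diagonal(X_train)

p = X_train.shape[1] + 1          # number of model variables plus one
h_star = 3 * p / len(X_train)     # critical value h* = 3p/n

# leverage of a structurally extreme new chemical: h = x (X'X)^-1 x'
x_new = np.array([1.0, 4.0, -4.0, 4.0])      # leading 1 for the intercept
h_new = x_new @ XtX_inv @ x_new
print(f"h* = {h_star:.3f}  h_new = {h_new:.3f}  inside AD: {h_new <= h_star}")
```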

It is important to note that the AD of a model cannot be verified by studying only a few chemicals; in such cases [72] it is impossible to draw generalizable conclusions on the applicability of the model itself.

Figure 12-1 shows the Williams plot of a model for compounds that act as polar narcotics to Pimephales promelas [26]. As an example, the toxicity of chemical no. 347 is incorrectly predicted (>3σ) and this test chemical is also completely outside the AD of the model, as defined by the vertical Hat line (high leverage value); thus it is both a response outlier and a high-leverage chemical. Two other chemicals (squares at h ≈ 0.35) slightly exceed the critical hat value (vertical line) but are close to three chemicals of the training set (rhombi) that were slightly influential in the model development: the predictions for these test chemicals can be considered as reliable as those of the training chemicals. The toxicity of chemical no. 283 is incorrectly predicted (>3σ), but in this case it belongs to the model AD, being within the cut-off value of Hat; this erroneous prediction can probably be attributed to error or variability in the experimental data rather than to molecular structure or the model.

Figure 12-1. Williams plot for an externally validated model for the toxicity to Pimephales promelas of polar narcotics. Cut-off value: 2.5 h* (with copyright permission from [26])

5 Model Validation for Predictivity (OECD Principle 4)

Model validation must always be used to avoid “overfitted” models, i.e., models where too many variables, useful only for fitting the training data, have been selected, and to avoid the selection of variables randomly correlated (by chance) with the dependent response. Particular care must be taken against overfitting [30]; subsets with the fewest variables are favored, as the chance of finding “apparently acceptable” models increases with an increasing number of X-variables, and the proportion of variables selected by chance correlation could also increase [73]. The ratio of chemicals to variables should always be higher than five for a small data set, and the number of descriptors must be as low as possible for bigger data sets too (according to Ockham’s razor: “avoid complexity if not necessary”).

Therefore, a set of models of similar performance, verified by leave-one-out (LOO) validation, needs to be further validated by leave-more-out (LMO) cross-validation or by bootstrap [74, 75]. This is done to avoid overestimation of the model’s predictive power by \( {\rm{Q}}_{{\rm{LOO}}}^2 \) [76, 77] and to verify the stability of the model predictivity (robustness). Response permutation testing (Y-scrambling) [6] or other resampling techniques are also applied, to exclude the possibility that the developed model is based on descriptors related to the response only by chance. Finally, for the most stringent evaluation of model applicability to the prediction of new chemicals, external validation (verified by \( {\rm{Q}}_{{\rm{EXT}}}^2 \) or \( {\rm{R}}_{{\rm{EXT}}}^2 \)) of all models is recommended as the last step after model development, for the assessment of the true predictive ability [6, 10, 78].
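A minimal sketch of two of these checks, leave-one-out \( {\rm{Q}}^2 \) and response permutation testing, on hypothetical data (20 scramblings is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2(y, y_hat):
    """Explained variance in prediction: 1 - PRESS / TSS."""
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))                        # hypothetical descriptors
y = X @ np.array([1.0, -0.7, 0.4]) + rng.normal(scale=0.3, size=40)

# internal validation by leave-one-out cross-validation
q2_loo = q2(y, cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut()))
print("Q2_LOO =", round(q2_loo, 3))

# response permutation testing (Y-scrambling): with the response randomly
# reordered, Q2 should collapse to ~0 or below if the model is not fortuitous
scrambled = []
for _ in range(20):
    ys = rng.permutation(y)
    scrambled.append(q2(ys, cross_val_predict(LinearRegression(), X, ys,
                                              cv=LeaveOneOut())))
print("mean Q2 after scrambling =", round(float(np.mean(scrambled)), 3))
```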

The preferred model will be that with the highest prediction parameter values and the most balanced results between the cross-validation parameters on the training chemicals (\( {\rm{Q}}_{{\rm{cv}}}^2 \), \( {\rm{Q}}_{{\rm{LMO}}}^2 \), \( {\rm{Q}}_{{\rm{BOOT}}}^2 \)), verified during descriptor selection, and the predictive power (\( {\rm{Q}}_{{\rm{EXT}}}^2 \) or \( {\rm{R}}_{{\rm{EXT}}}^2 \)), verified later on the external prediction chemicals.

The limiting problem for efficient external validation of a QSAR model is, obviously, data availability. Given the availability of a sufficiently large number (never fewer than five chemicals, or 20% of the training set) of really new and reliable experimental data, the best proof of an already developed model’s accuracy is to test its performance on these additional data, at the same time checking the chemical AD. However, data on new experimentally tested compounds (of useful quantity and quality) are usually difficult to obtain for external validation purposes; thus, in the absence of additional data, external validation by a priori splitting of the available data can be usefully applied to define the actual predictive power of the model more precisely.

5.1 Splitting of the Data Set for the Construction of an External Prediction Set

In the absence of new additional data, we assume that there is less data than is actually available; this is the reason for splitting the data in a reasonable way (commented on below) into a training set and a prediction set of “momentarily forgotten chemicals.”

Thus, before model development, the available input data set can be split adequately, by different procedures, into a training set (for model development) and a prediction set (never used for variable selection or model development, but used exclusively, and only once, for the assessment of model predictivity after model development). The underlying goal is to ensure that the training and prediction sets separately span the whole descriptor space occupied by the entire data set, and that the chemical domains of the two sets are not too dissimilar [77, 79–81], as it is impossible for a model to be applied outside its chemical domain and still give reliable predictions. The composition of the training and prediction sets is therefore of crucial importance. The best splitting must guarantee that the training and prediction sets are scattered over the whole area occupied by representative points in the descriptor space (representativity), and that the training set is distributed over the area occupied by representative points for the whole data set (diversity). The more widely applied splitting methodologies are based on structural similarity analysis (for instance, Kennard–Stone, duplex, and D-optimal distance designs [11–13, 17, 18, 20, 21, 81, 82], or the self-organizing map (SOM)/Kohonen-map ANN [17, 18, 20, 21, 26, 27, 35, 39, 41, 80]); a minimal Kennard–Stone sketch is given after this paragraph. Alternatively, to split the available data without any structural bias, random selection through activity sampling can be applied. Random splitting is highly useful if applied iteratively in splitting for internal CV validation, and it can be considered quite similar to real-life situations, but it can give very variable results when applied to external validation, depending greatly on the set dimension and representativity [80, 83, 84]. In addition, in this last case there is a greater probability of selecting, for the prediction set, chemicals outside the structural AD of the model; the predictions for these chemicals could then be unreliable, simply because they are extrapolated by the model.
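A minimal sketch of the Kennard–Stone splitting named above, on hypothetical descriptors (the 20/10 split size is illustrative):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone selection: start from the two most distant points, then
    repeatedly add the candidate farthest from the already-selected set."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    picked = list(np.unravel_index(D.argmax(), D.shape))   # the two extremes
    rest = [i for i in range(len(X)) if i not in picked]
    while len(picked) < n_train:
        # for each remaining point, its distance to the nearest selected one
        d_min = D[np.ix_(rest, picked)].min(axis=1)
        picked.append(rest.pop(int(d_min.argmax())))
    return np.array(picked), np.array(rest)

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 4))                 # hypothetical descriptor matrix
train_idx, test_idx = kennard_stone(X, n_train=20)
print(len(train_idx), "training /", len(test_idx), "prediction chemicals")
```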

5.2 Internal and External Validation

External validation should be applied to any proposed QSAR model to determine both its generalizability for new chemicals (which, obviously, must belong to the model AD) and the “realistic” predictive power of the model [6, 83–85]. The model must be tested on a sufficiently large number of chemicals not used during its development; at least 20% of the complete data set is recommended, but the most stable models (of easily modelled endpoints) can also be checked on a prediction set larger than the training set [19, 85]. This avoids “supposed” external validation based on too few chemicals [72]; in fact, it has been demonstrated that if the test set consists of only a small number of compounds, there is an increased possibility of chance correlation between the predicted and observed responses [79].

Populations of models developed using evolutionary techniques for descriptor selection frequently contain models with high internal predictivity, verified by internal validation methods (LOO, LMO, bootstrap), that are nevertheless less predictive, or even completely unpredictive, externally. The statistical approach to QSAR modeling always carefully checks this possibility by externally validating any model stable in cross-validation before proposing it. In fact, cross-validation is a necessary but not sufficient validation approach for really predictive models [6, 77–79]. In relation to this crucial point of QSAR model validation, there is a wide debate, with discordant opinions in the QSAR community, concerning the different outcomes of internal and external validation of QSAR models. A mini-review dealing with this problem has recently been published by the author [84]; it examines OECD Principles 2, 3, and 4, paying particular attention to the differences between internal and external validation. The theoretical constructs are illustrated with examples taken from both the literature and personal experience, derived also from a recent report for the European Centre for the Validation of Alternative Methods (ECVAM) on “Evaluation of different statistical approaches to the validation of Quantitative Structure–Activity Relationships” [83].

Since GAs simultaneously create many different, similarly acceptable models in a population, the user can choose the “best model” according to need: the possibility of having reliable predictions for some chemicals rather than others, the interpretability of the selected molecular descriptors, the presence of different outliers, etc.

In the statistical approach the best model is selected by maximizing all the internal CV validation parameters, applying CV in the proper way and at the proper step. Then, only the good models (\({\rm{Q}}_{{\rm{LOO}}}^2\)>0.7), stable and internally predictive (with similar values of all the different CV-Q2 parameters), are subjected to external validation on the a priori split prediction set.

In our work we always select, from among the best externally predictive models, those with the smallest number of response outliers and structurally influential chemicals, especially in the prediction set.

5.3 Validation of Classification Models

To assess the predictive ability of classification models, the percentages of misclassified chemicals, expressed as the error rate (ER%) and the error rate in prediction (ERcv%), are calculated by the leave-one-out method (where each chemical is taken out of the training set once and predicted by the model). Comparison with the no-model error rate (NoMER) is used to evaluate model performance. NoMER reflects the object distribution over the defined classes before applying any classification method, and is calculated as the error rate obtained by assigning all the objects to the largest class. This provides a reference parameter to evaluate the actual efficiency of a classifier: the greater the difference between NoMER and the actual ER, the better the model performance.
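A minimal numeric sketch of ER% and NoMER% on hypothetical class assignments:

```python
import numpy as np

def error_rates(y_true, y_pred):
    """ER% and the no-model error rate NoMER% (all objects assigned
    to the largest class)."""
    er = 100 * np.mean(y_true != y_pred)
    nomer = 100 * (1 - np.bincount(y_true).max() / len(y_true))
    return er, nomer

# hypothetical assignments for 10 chemicals in two classes
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
er, nomer = error_rates(y_true, y_pred)
print(f"ER% = {er:.1f}, NoMER% = {nomer:.1f}")   # 20.0 vs 40.0
```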

The outputs of a classification model are the class assignments and the misclassification matrix, which shows how well the classes are separated. The goodness of classification models is also assessed by the following parameters: accuracy or concordance (the proportion of correctly classified chemicals), sensitivity (the proportion of active chemicals predicted to be active), specificity (the proportion of non-active chemicals predicted to be non-active), false negatives (the proportion of active chemicals falsely predicted as non-active), and false positives (the proportion of non-active chemicals falsely predicted as active). Depending on the intended application of the predictive tool, the classification model can be optimized in either direction. In drug design the objective is to obtain a high specificity, as a false positive prediction could result in the loss of a valuable candidate. In the regulatory environment, for safety assessment and consumer protection, the precautionary principle must be applied, so an optimization of sensitivity is desirable, as every false negative compound could result in a lack of protection and consequently pose a risk for the user.
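A minimal sketch of these measures on hypothetical binary assignments (class 1 = active), using the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical predictions; class 1 = active, class 0 = non-active
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   :", (tp + tn) / len(y_true))
print("sensitivity:", tp / (tp + fn))   # active predicted as active
print("specificity:", tn / (tn + fp))   # non-active predicted as non-active
```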

6 Molecular Descriptor Interpretation, If Possible (OECD Principle 5)

Regarding the interpretability of the descriptors, it is important to take into account that the modelled response is frequently the result of a series of complex biological or physico-chemical mechanisms; thus it is very difficult, and reductionist, to ascribe too much importance to the mechanistic meaning of the molecular descriptors used in a QSAR model. Moreover, it must also be highlighted that in multivariate models such as MLR models, even though the interpretation of the individual molecular descriptors can certainly be useful, it is only the combination of the selected set of descriptors that models the studied endpoint. If the main aim of QSAR modeling is to fill gaps in the available data, the modeler’s attention should be focused on model quality. In relation to this point, Livingstone, in an interesting perspective paper [42], states: “The need for interpretability depends on the application, since a validated mathematical model relating a target property to chemical features may, in some cases, be all that is necessary, though it is obviously desirable to attempt some explanation of the ‘mechanism’ in chemical terms, but it is often not necessary, per se.” Zefirov and Palyulin [78] took the same position, differentiating predictive QSARs, where attention is essentially focused on the best prediction quality, from descriptive QSARs, where the major attention is paid to descriptor interpretability.

The author’s approach to QSAR modeling is illustrated in the following sections of this chapter through the modeling of environmental endpoints. The approach starts with statistical validation for predictivity and continues with further interpretation of the mechanistic meaning of the selected descriptors, but only where possible, as set down by the fifth OECD principle [6]. Therefore, the application domain of this approach (the “statistical approach”) is mainly the production of predicted data (predictive QSAR), strongly verified for reliability; such data can be usefully applied to screen and rank chemicals, providing priority lists.

7 Environmental Single Endpoints

7.1 Physico-chemical Properties

Organic chemicals now need to be characterized by many parameters, both because of the registration policies required of chemical industries (see, for example, the new European REACH policy) and for an understanding of the environmental behavior of chemicals present as pollutants in various compartments. Unfortunately, there is an enormous lack of knowledge for many important endpoints, such as various physico-chemical properties (for instance, melting point, boiling point, aqueous solubility, volatility, hydrophobicity, various partition coefficients), environmental reactivity and derived persistence, toxicity, and mutagenicity. This lack of knowledge calls for a predictive approach to the assessment of chemicals, such as QSAR modeling.

A set of various physico-chemical properties for important classes of environmental pollutants, such as PAHs [86], haloaromatics [87], PCBs [88], and chemicals of EEC Priority List 1 [89], has been modelled using the weighted holistic invariant molecular (WHIM) descriptors [51–53, 90, 91]. WHIM descriptors are theoretical three-dimensional molecular indices that contain information, in terms of size, shape, symmetry, and atom distribution, on the whole molecular structure. These indices are calculated from the (x, y, z) coordinates of a molecule under different weighting schemes by principal component analysis, and they represent a very general approach to describing molecules in a unitary conceptual framework, independent of molecular alignment. Their meaning is defined by the mathematical properties of the algorithm used for their calculation, and their application in QSAR modeling has been very successful. A recent paper [92] again highlighted that, contrary to erroneous statements in the literature [93, 94], one set of WHIM descriptors, the k descriptors, is very useful in discriminating the shape of chemicals and can thus be used to study structural similarity.

Since then other physico-chemical properties have been modelled successfully by combining different kinds of theoretical molecular descriptors (mono-dimensional, bi-dimensional, and three-dimensional) calculated by the DRAGON software [46]: the basic physico-chemical properties of organic solvents [95], esters [15] and brominated flame retardants, mainly polybromodiphenyl ethers (PBDE) [24], the soil sorption coefficient (Koc) for pesticides [19, 96] (discussed below in Section 12.7.1.1).

A general classification of 152 organic solvents has been proposed [95] by applying the k-nearest neighbor method and counter-propagation artificial neural networks (CP-ANNs), in particular Kohonen maps. A good separation into five classes was obtained by a net of architecture 20×20×4 (200 iterations), based on simple molecular descriptors (unsaturation index, UI; hydrophilicity factor, Hy; average atomic composition, AAC; and the number of nitrogen atoms in the molecular structure, nN). The performances were very satisfactory: ER% = 4.4 and ERcv% = 11.4, to be compared with the no-model error rate NoMER% = 69.5.

7.1.1 Soil Sorption of Pesticides

Sorption processes play a major role in determining the environmental fate, distribution, and persistence of chemicals. An important parameter when studying soil mobility and environmental distribution of chemicals is the soil sorption coefficient, expressed as the ratio between chemical concentration in soil and in water, normalized to organic carbon (Koc).

Many QSAR papers on soil sorption coefficient prediction have been published and reviewed by some authors [85, 96–104].

The proposed models were mainly based on correlations with the octanol/water partition coefficient (Kow) and water solubility (Sw); others were based on theoretical molecular structure descriptors. A recent paper by the author dealt with the log Koc of a heterogeneous set of 643 organic non-ionic compounds [19]; the response range was more than six log units, and prediction was made by a statistically validated QSAR modeling approach based on MLR and theoretical molecular descriptors, selected by GA from the DRAGON pool (see Eq. 12-1). The high generalizability of one of the proposed models (scatter plot in Figure 12-2) was verified on external chemicals by adequately splitting the available set of experimental data (by SOM, and also randomly) into a very reduced representative training set (less than 15% of the original data set) for model development and a large prediction set (more than 85% of the original data) used only for the inspection of model performance.

$$\begin{array}{l}\log {\rm{K}}_{{\rm{oc}}} = -2.19(\pm 0.30) + 2.10(\pm 0.14)\,{\rm{VED1}} - 0.34(\pm 0.04)\,{\rm{nHAcc}}\\ \qquad\qquad - 0.31(\pm 0.05)\,{\rm{MAXDP}} - 0.33(\pm 0.12)\,{\rm{CIC0}}\end{array}$$
$${\rm{n(training)}} = 93 \quad {\rm{R}}^2 = 0.82 \quad {\rm{Q}}_{{\rm{cv}}}^2 = 0.80 \quad {\rm{Q}}_{{\rm{BOOT}}}^2 = 0.79 \quad {\rm{RMSE}} = 0.523 \quad {\rm{RMSEp}}_{{\rm{LOO}}} = 0.523$$
$${\rm{n(prediction\ set)}} = 550 \quad {\rm{Q}}_{{\rm{EXT}}}^2 = 0.78 \quad {\rm{RMSEp}}_{{\rm{EXT}}} = 0.560$$
(12-1)
Figure 12-2. Plot of experimental vs. predicted log Koc for Eq. (12-1). The values for the training and prediction set chemicals are labeled differently and the outliers are numbered; the dotted lines indicate the 3σ interval (with copyright permission from [19])

The proposed models show good stability, robustness, and predictivity when verified by internal validation (cross-validation by LOO and bootstrap) and also by external validation on a much larger data set. The stability of RMSE/RMSEp across the training and prediction sets is further proof of model predictivity. The chemical applicability domain was verified by the Williams plot: nine response outliers and three structurally influential chemicals were highlighted (numbered in Figure 12-2).

The selected molecular descriptors have a clear mechanistic meaning: they are related to the molecular size of the chemical and to its electronic features relevant to soil partitioning, as well as to the chemical’s ability to form hydrogen bonds with water. Combining different models from the GA population also allowed predictions to be proposed from a consensus model that, compared with published models and EPISuite predictions [105], is always among the best. The proposed models fulfill the fundamental points set down by the OECD principles for the regulatory acceptability of a QSAR and could be reliably used as scientifically valid models in the REACH program.

The application of a single, general QSAR model, based on theoretical molecular descriptors, to a large set of heterogeneous compounds could be very useful for screening big data sets and for designing new, environmentally friendly chemicals as safer alternatives to dangerous ones.

7.2 Tropospheric Reactivity of Volatile Organic Compounds with Oxidants

The troposphere is the principal recipient of volatile organic compounds (VOCs) of both anthropogenic and biogenic origin. An indirect measure of the persistence of organic compounds in the atmosphere, and therefore a necessary parameter in environmental exposure assessment, is the rate at which these compounds react. The tropospheric lifetime of most organic chemicals, deriving from terrestrial emissions, is controlled by their degradation reaction with the OH radical and ozone during the daytime and NO3 radicals at night.

In recent years, several QSAR/QSPR models predicting oxidation rate constants with tropospheric oxidants have been published, and the different approaches to molecular description and the adopted methodology have been compared [13, 14, 18, 23, 106–117].

The most widely used method for estimating tropospheric degradation by hydroxyl radicals, implemented in AOPWIN of EPISUITE [118], is Atkinson’s fragment contribution method [107]. New general MLR models of the OH radical reaction rate for a wide and heterogeneous data set of 460 volatile organic compounds (VOCs) were developed by the author’s group [18]. The special features of these models, in comparison to others, are the selection of theoretical molecular descriptors by a genetic algorithm as the variable subset selection procedure, their applicability to heterogeneous chemicals, and their validation for predictive purposes by both internal and external validation. External validation was performed by splitting the original data set by two different methods, a statistical experimental design procedure (D-optimal distance) and the Kohonen self-organizing map (SOM), in order to verify the impact that structural heterogeneity (in the split of chemicals into training and prediction sets) has on model performance and to compare the consequences for model predictivity. D-optimal design, in which the most dissimilar chemicals are always selected for the training set, leads to models with better predictive performance than models developed on the training set selected by SOM. The chemical applicability domain of the models and the reliability of the predictions are always verified by the leverage approach. The best proposed predictive model is based on four molecular descriptors and has the following equation (Eq. 12-2):

$$\begin{array}{l}\log {\rm{k(OH)}} = 5.15(\pm 0.35) - 0.66(\pm 0.03)\,{\rm{HOMO}} + 0.33(\pm 0.03)\,{\rm{nX}}\\ \qquad\qquad - 0.37(\pm 0.04)\,{\rm{CIC0}} + 0.13(\pm 0.02)\,{\rm{nCaH}}\end{array}$$
$${\rm{n(training)}} = 234 \quad {\rm{R}}^2 = 0.83 \quad {\rm{Q}}^2 = 0.82 \quad {\rm{Q}}_{{\rm{LMO(50\%)}}}^2 = 0.81 \quad {\rm{RMSE}} = 0.473$$
$${\rm{n(test)}} = 226 \quad {\rm{Q}}_{{\rm{EXT}}}^2 = 0.81 \quad {\rm{RMSEp}} = 0.484 \quad {\rm{K}}_{{\rm{xx}}} = 33.8\% \quad {\rm{K}}_{{\rm{xy}}} = 44.6\%$$
(12-2)

It is evident from the statistical parameters that the proposed model has good stability, robustness, and predictivity, verified by internal (LOO and LMO cross-validation) and also external validation. The influential chemicals are mainly the highly fluorinated ones, which have a strong structural peculiarity that the model is not able to capture. Figure 12-3 plots the experimental values vs. those predicted by Eq. (12-2).

Figure 12-3. Plot of experimental and predicted log k(OH) values for the model externally validated by experimental design splitting. The training and test set chemicals are labeled differently; the outliers and influential chemicals are highlighted. The dotted lines indicate the 3σ interval (with copyright permission from [18])

The availability in the GA population of several models, similarly reliable for response prediction, also allowed a consensus model to be proposed; it provides better predicted data than the majority of the individual models, taking into account more of the unique aspects of a particular structure.

While good models for OH rate constants have been proposed in the literature for various chemical classes [107, 110–113, 115, 117], the modeling of reactivity with NO3 radicals is more problematic. Most published QSAR models were obtained from separate training sets for aliphatic and aromatic compounds; the rate constants of aliphatic chemicals with NO3 radicals were successfully predicted [106, 108, 109], but the models for aromatic compounds do not appear as satisfactory, often being only local models built on very small training sets and, consequently, without any reasonable applicability for data prediction.

New general QSAR models for predicting oxidation rate constants (kNO3) for heterogeneous sets containing both aliphatic and aromatic compounds, based on a few theoretical molecular descriptors (for instance, HOMO, the number of aromatic rings, and an autocorrelation descriptor, MATS1m), were recently developed by the author’s group [13, 23]. The models have high predictivity even on external chemicals, obtained by splitting the available data using different methods. The availability of molecular descriptors for all chemicals (even those not yet synthesized), the good prediction performance of models applicable to a wide variety of aromatic and aliphatic chemicals, and the possibility of verifying the chemical applicability domain by the leverage approach make these models useful for producing reliable estimates of NO3 radical rate constants when experimental values are not available.

The author has also proposed a predictive QSAR model of the reaction rate with ozone for 125 heterogeneous chemicals [14]. The model, based on molecular descriptors again selected by GA (the HOMO–LUMO gap plus four molecular descriptors from DRAGON), has good predictive performance, also verified by statistical external validation on 42 chemicals not used for model development (\( {\rm{Q}}_{{\rm{EXT}}}^2 \)=0.904, average RMS error = 0.77 log units). This model appears more predictive than the model previously proposed by Pompe and Veber [114], a six-parameter MLR model developed on 116 heterogeneous chemicals and based on molecular descriptors calculated by the CODESSA software and selected by a stepwise procedure; the predictive performance of that model was verified only internally, by cross-validation with 10 groups (Q2=0.83), with an average RMS error of 0.99 log units.

7.3 Biological Endpoints

7.3.1 Bioconcentration Factor

The bioconcentration factor (BCF) is an important parameter in environmental assessment, as it estimates the tendency of a chemical to concentrate and, consequently, to accumulate in an organism. The most common, and oldest, QSAR method for estimating chemical bioconcentration is to establish correlations between BCF and chemical hydrophobicity using Kow, the n-octanol/water partition coefficient. A comparative study of BCF models based on log Kow was performed by Devillers et al. [119]. Different models for BCF using theoretical molecular descriptors have been developed by several groups [120–124] and also by the author’s group [8, 9, 27], with particular attention, as usual, to external predictivity and the chemical applicability domain.

An example is the model reported by the following equation (12-3):

$$\begin{array}{l}\log {\rm{BCF}} = -0.74(\pm 0.35) + 2.55(\pm 0.13)\,{}^{\rm{V}}{\rm{I}}_{{\rm{D,deg}}}^{\rm{M}} - 1.09(\pm 0.11)\,{\rm{HIC}}\\ \qquad\qquad - 0.42(\pm 0.03)\,{\rm{nHAcc}} - 1.22(\pm 0.17)\,{\rm{GATS1e}} - 1.55(\pm 0.34)\,{\rm{MATS1p}}\end{array}$$
$${\rm{n(training)}} = 179 \quad {\rm{R}}^2 = 0.81 \quad {\rm{Q}}_{{\rm{LOO}}}^2 = 0.79 \quad {\rm{Q}}_{{\rm{BOOT}}}^2 = 0.79 \quad {\rm{RMSE}}_{{\rm{train}}} = 0.56 \quad {\rm{RMSE}}_{{\rm{cv}}} = 0.58$$
$${\rm{n(prediction)}} = 59 \quad {\rm{Q}}_{{\rm{EXT}}}^2 = 0.87 \quad {\rm{RMSE}}_{{\rm{prediction}}} = 0.57$$
(12-3)

7.3.2 Toxicity

Acute aquatic toxicity. The European Union’s so-called “List 1” of priority chemicals dangerous for the aquatic environment (more than 100 heterogeneous chemicals) was modelled for ecotoxicological endpoints (aquatic toxicity to bacteria, algae, Daphnia, fish, and mammals) [89] by different theoretical descriptors, mainly WHIM. In addition, WHIM descriptors were also satisfactory in the modeling of a smaller set of toxicity data on Daphnia (49 compounds, including amines, chlorobenzenes, and organotin and organophosphorus pesticides) [125].

An innovative strategy for selecting compounds with a similar toxicological mode of action, a key problem in the study of chemical mixtures, was proposed within the PREDICT European Research Project [126]. A complete representation of the chemical structures of phenylureas and triazines by different molecular descriptors (1D structural, 2D topological, 3D WHIM) allowed a preliminary exploration of structural similarity based on principal components analysis (PCA), multidimensional scaling (MDS), and hierarchical clustering. The use of a genetic algorithm to select the molecular descriptors most relevant in modeling the toxicity data makes it possible both to develop good predictive toxicity models and to select the most similar phenylureas and triazines, by applying chemometric approaches based only on molecular similarity related to the toxicological mode of action.

The Duluth data set of toxicity data for P. promelas was recently studied by the author’s group [26], and new statistically validated MLR models were developed to predict the aquatic toxicity of chemicals classified according to their mode of action (MOA). In addition, a unique general model for direct toxicity prediction (the DTP model) was developed, in order to propose a predictive tool with a wide applicability domain, applicable independently of any a priori knowledge of the MOA of the chemicals.

The externally validated general-DTP log P-free model, reported below (Eq. 12-4) with statistical parameters, was developed on a training set of 249 compounds and applied for the prediction of the toxicity of 200 external chemicals, obtained by splitting the data by SOM (scatter plot in Figure 12-4):

$$\log(1/\mathrm{LC}_{50})_{96\,\mathrm{h}} = -2.54 + 0.91\,\mathrm{WA} + 6.2\,\mathrm{Mv} + 0.21\,\mathrm{nCb}^{-} + 0.08\,\text{H-046} - 0.19\,\mathrm{MAXDP} - 0.33\,\mathrm{nN}$$
$$n_{\mathrm{training}} = 249,\quad R^2 = 0.79,\quad Q^2_{\mathrm{LOO}} = 0.78,\quad Q^2_{\mathrm{BOOT}} = 0.78,\quad \mathrm{RMSE} = 0.595$$
$$n_{\mathrm{test}} = 200,\quad Q^2_{\mathrm{EXT}} = 0.71,\quad \mathrm{RMSE}_{\mathrm{cv}} = 0.613,\quad \mathrm{RMSE}_{\mathrm{p}} = 0.64$$
(12-4)
Figure 12-4. Plot of experimental and predicted toxicity values (Pimephales promelas) of the externally validated general-DTP log P-free model developed on a training set of 249 compounds (with copyright permission from [26]).
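The SOM splitting mentioned above can be approximated with any self-organizing-map implementation. The sketch below is a minimal illustration, assuming the third-party MiniSom package and a hypothetical descriptor matrix; the idea is simply to draw test compounds from every region of descriptor space.

```python
import numpy as np
from minisom import MiniSom  # assumed third-party package (pip install minisom)

rng = np.random.default_rng(0)
X = rng.normal(size=(449, 6))          # hypothetical descriptor matrix

som = MiniSom(7, 7, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)

# Group compounds by their winning map cell, then move part of each cell's
# members into the test set so the split covers the whole descriptor space.
cells = {}
for i, x in enumerate(X):
    cells.setdefault(som.winner(x), []).append(i)
test_idx = [i for members in cells.values()
            for i in members[: max(1, len(members) // 2)]]
train_idx = sorted(set(range(len(X))) - set(test_idx))
```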

Chronic toxicity: mutagenicity. The potential for mutagenicity of chemicals of environmental concern, such as aromatic amines and PAHs, is of high relevance; many QSAR models, based on the mechanistic approach, have been published on this topic and reviewed by Benigni [5, 127].

With regard to this important topic, our group has published useful MLR models, always verified for their external predictivity on new chemicals, for the Ames test results on amines [12] and nitro-PAHs [20]. Externally validated classification models, by k-NN and CART, were also developed for the mutagenicity of benzo-cyclopentaphenanthrenes and chrysenes, determined by the Ames test [128], and for PAH mutagenicity, determined in human B-lymphoblastoid cells [35].

Endocrine Disruption. A large number of environmental chemicals, known as endocrine disruptor chemicals (EDCs), are suspected of disrupting endocrine functions by mimicking or antagonizing natural hormones. Such chemicals may pose a serious threat to the health of humans and wildlife; they are thought to act through a variety of mechanisms, mainly estrogen receptor-mediated mechanisms of toxicity. Under the new European legislation REACH (http://europa.eu.int/comm/environment/chemicals/reach.htm), EDCs will require an authorization to be produced and used, if safer alternatives are not available. However, it is practically impossible to perform thorough toxicological tests on all potential xenoestrogens, thus QSAR modeling has been applied by many other authors in recent years [129–142], providing promising methods for the estimation of a compound’s estrogenic activity.

QSAR models of the estrogen receptor binding affinity of a large data set of heterogeneous chemicals have also been built in our laboratory using theoretical molecular descriptors [21, 33], giving full consideration, during model construction and assessment, to the new OECD principles for the regulatory acceptance of QSARs. A data set of 128 NCTR compounds (EDKB, http://edkb.fda.gov/databasedoor.html) was studied, including several different chemical categories, such as steroidal estrogens, synthetic estrogens, antiestrogens, phytoestrogens, other miscellaneous steroids, alkylphenols, diphenyl derivatives, organochlorines, pesticides, alkylhydroxybenzoate preservatives (parabens), phthalates, and a number of other miscellaneous chemicals. An unambiguous multiple linear regression (MLR) algorithm was used to build the models, with the modeling descriptors selected by a genetic algorithm (Table 12-1 presents the statistical parameters of the best-selected model). The predictive ability of the model was validated, as usual, by both internal and external validation, and the applicability domain was checked by the leverage approach to verify prediction reliability.
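As a schematic illustration of GA-based descriptor selection (synthetic data, not the published model), with leave-one-out Q² of the candidate MLR model as the fitness function, a common though not unique choice:

```python
# A schematic GA for descriptor selection: fitness = Q2_LOO of an MLR model
# on the selected columns. Data here are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n, p = 60, 20                                  # compounds, candidate descriptors
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + 0.3 * rng.normal(size=n)

def q2_loo(mask):
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return -np.inf
    pred = cross_val_predict(LinearRegression(), X[:, cols], y, cv=LeaveOneOut())
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def random_mask(k=5):                          # start from 5-descriptor models
    m = np.zeros(p, dtype=bool)
    m[rng.choice(p, size=k, replace=False)] = True
    return m

pop = [random_mask() for _ in range(20)]
for generation in range(25):
    pop.sort(key=q2_loo, reverse=True)
    pop = pop[:10]                             # keep the fittest half
    while len(pop) < 20:                       # uniform crossover + mutation
        a, b = rng.choice(10, size=2, replace=False)
        child = np.where(rng.random(p) < 0.5, pop[a], pop[b])
        child[rng.integers(p)] ^= True         # flip one random bit
        pop.append(child)

best = max(pop, key=q2_loo)
print("selected descriptors:", np.flatnonzero(best),
      "Q2_LOO =", round(q2_loo(best), 3))
```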

Table 12-1. The MLR model between the structural descriptors and the log RBA of estrogens

Twenty-one chemicals of the Kuiper data set [143] were used for external validation, with the following highly satisfactory results: \( {\rm{R}}_{{\rm{pred}}}^2 \) = 0.778, \( {\rm{Q}}_{{\rm{EXT}}}^2 \) = 0.754, and an RMSE of prediction of 0.559 (Figure 12-5).

Figure 12-5. Predicted log RBA values vs. experimental values for the original data set of estrogens (NCTR data set) and the external prediction set (Kuiper’s data set) (with copyright permission from [21]).
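The leverage approach used above to check prediction reliability flags a query chemical as outside the applicability domain when its leverage exceeds the warning value h* = 3p′/n (p′ = number of model parameters, n = training compounds). A minimal numpy sketch, with hypothetical design matrices:

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i (X'X)^-1 x_i' for each query row; both matrices
    include the intercept column of ones used by the MLR model."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum('ij,jk,ik->i', X_query, xtx_inv, X_query)

rng = np.random.default_rng(0)
Xtr = np.hstack([np.ones((128, 1)), rng.normal(size=(128, 5))])  # hypothetical
Xq = np.hstack([np.ones((21, 1)), rng.normal(size=(21, 5))])

h = leverages(Xtr, Xq)
h_star = 3 * Xtr.shape[1] / Xtr.shape[0]       # warning leverage 3p'/n
print("outside the applicability domain:", np.flatnonzero(h > h_star))
```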

The results of several validation paths using different splitting methods performed in parallel (D-optimal design, SOM, random sampling on the activity) give additional proof that the proposed QSAR model is robust and satisfactory (\( {\rm{R}}_{{\rm{pred}}}^2 \) range: 0.761–0.807), thus providing a feasible and practical tool for the rapid screening of the estrogenic activity of organic compounds suspected of being endocrine disruptor chemicals.

On the same topic, satisfactory predictive models for EDC classification, based on different classification methods, have recently been proposed [33]. In this study, QSAR models were developed to quickly and effectively identify possible estrogen-like chemicals, based on 232 structurally diverse chemicals from the NCTR database (training set), using several non-linear classification methodologies (least-squares support vector machine (LS-SVM), counter-propagation artificial neural network (CP-ANN), and k-nearest neighbor (kNN)) applied to molecular structural descriptors. The models were validated externally with 87 chemicals (prediction set) not included in the training set. All three methods gave satisfactory prediction results for both the training and prediction sets; the most accurate model was obtained by the LS-SVM approach. A highly important feature of all these models is their low false-negative percentage, useful in a precautionary approach. Our models were also applied to about 58,000 discrete organic chemicals from the US-EPA; about 76% were predicted, by each model, not to bind to an estrogen receptor.
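As an illustration of this classification workflow (not the authors’ code), a kNN screen with an external check of the false-negative rate might look like the sketch below, on synthetic data shaped like the NCTR training/prediction sets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(232, 10)); y_tr = rng.integers(0, 2, 232)  # 1 = binder
X_ex = rng.normal(size=(87, 10));  y_ex = rng.integers(0, 2, 87)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
pred = knn.predict(X_ex)

# False negatives (actives predicted inactive) are the costly errors in a
# precautionary screening setting, hence the metric highlighted in the text.
fn_rate = np.sum((y_ex == 1) & (pred == 0)) / np.sum(y_ex == 1)
print(f"external false-negative rate: {fn_rate:.1%}")
```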

The results obtained indicate that the proposed QSAR models are robust and widely applicable, and could provide a feasible and practical tool for the rapid screening of potential estrogens. This is very useful information for prioritizing chemicals for more expensive assays: the 40,300 compounds predicted negative by all the models in common could be excluded from the pool of potential estrogens without experiments, and with high accuracy (a low false-negative rate).

A review on the applications of machine learning algorithms in the modeling of estrogen-like chemicals has been recently published [144].

8 Modeling More than a Single Endpoint

8.1 PC Scores as New Endpoints: Ranking Indexes

The environment is a highly complex system in which many parameters are of contemporaneous relevance: the understanding, rationalization, and interpretation of their covariance are the principal pursuit of any environmental researcher. Indeed, environmental chemistry deals with the behavior of chemicals in the environment, a behavior regulated by many different variables such as physico-chemical properties, chemical reactivity, and biological activity.

The application of explorative methods of multivariate analysis to various topics of environmental concern allows a combined view that generates an ordination and grouping of the studied chemicals, in addition to the discovery of relationships among the variables. Any problem related to chemical behavior in the environment can be analyzed by multivariate explorative techniques, the outcome being a screening and ranking of the chemicals according to the studied properties, reactivities, or activities and, finally, the proposal of an index.

This was the starting point, and also the central core, of most of the author’s 15 years of QSAR modeling research at Insubria University.

The significant combinations of variables from multivariate analysis can be used as score values (cumulative indexes) and modelled as new endpoints by the QSAR approach, in order to exploit already available information on chemical behavior and to propose models able to predict such behavior for chemicals for which this information is not yet known, or even for new chemicals before their synthesis. In fact, our QSAR approach, both for modeling quantitative responses by regression methods and qualitative responses by classification methods, is based on theoretical molecular descriptors that can be calculated for any drawn chemical starting from its atomic coordinates, thus without knowledge of any experimental parameter.

8.2 Multivariate Explorative Methods

The principal aim of any explorative technique is to capture the information available in a multivariate context and condense it into a more easily interpretable form (a score value or a graph). From these exploratory tools, a more focused investigation can then be made of the chemicals of greatest concern, directing or suggesting the next investigative steps. Some of the more commonly used exploratory techniques, as applied in environmental chemistry and ecotoxicology, are commented on here.

8.2.1 Principal Component Analysis

Probably the most widely known and used explorative multivariate method is principal component analysis (PCA) [145, 146] (Chapter 6). In PCA, linear combinations of the studied variables are created that explain, to the greatest possible degree, the variation in the original data. The first principal component (PC1) accounts for the maximum amount of data variance that a single derived variable can capture, while subsequent PCs account for successively smaller portions of the original variance. The principal components are derived in such a way that they are orthogonal. It is good practice, especially when the original variables have different scales, to derive the principal components from the standardized data (mean 0 and standard deviation 1), i.e., via the correlation matrix; in this way all the variables are treated as if of equal importance, regardless of their scale of measurement. To be useful, the first two PCs should account for a substantial proportion of the variance in the original data, so that they can be considered sufficiently representative of the main information in the data, while the remaining PCs condense irrelevant information or even experimental noise. A PCA is commonly represented by a score plot, a loading plot, or a biplot, the last defined as the joint representation of the rows and columns of a data matrix: points (scores) represent the chemicals, and vectors or lines represent the variables (loadings). The length of a vector indicates the information associated with the variable, while the cosine of the angle between two vectors reflects their correlation.

In our environmental chemistry studies, PCA has been widely used for screening and ranking purposes in many contexts: (a) tropospheric degradability of volatile organic compounds (VOCs) [11, 17, 106]; (b) mobility in the atmosphere or long-range transport of persistent organic pollutants (POPs) [16, 31, 147]; (c) environmental partitioning tendency of pesticides [7, 32]; (d) POP and PBT screening [10, 24, 34, 147–149].
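In code, correlation-matrix PCA reduces to a singular value decomposition of the column-standardized data. A minimal sketch returning the quantities used in the plots just described:

```python
import numpy as np

def pca(X):
    """PCA on the correlation matrix: standardize columns, then SVD.
    Returns scores (chemicals), loadings (variables), explained variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                      # principal component scores
    loadings = Vt.T                     # variable loadings, one column per PC
    explained = s**2 / np.sum(s**2)     # fraction of total variance per PC
    return scores, loadings, explained
```

A biplot then simply overlays scores[:, :2] as points and loadings[:, :2] as vectors.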

In addition, this multivariate approach was adopted to study the aquatic toxicity of EU priority-list chemicals on different endpoints [150] and of esters [25], endocrine-disrupting activity based on three different endpoints [33], and the abiotic oxidation of phenols in an aqueous environment [9].

8.2.2 QSAR Modeling of Ranking Indexes

Tropospheric Persistence/Degradability of Volatile Organic Compounds (VOCs). Studies have been made of the screening/ranking of volatile organic chemicals according to their tendency to degrade in the troposphere. Indeed, as the atmospheric persistence of a chemical depends mainly on the rates of its degradation reactions with oxidants, the contemporaneous variation and influence of the rate constants for degradation by OH and NO3 radicals and by ozone (kOH, kNO3, and kO3), in determining the inherent tendency to degradability, were explored by principal component analysis (PCA).

In a preliminary study, the experimental data allowed the ranking of a set of 65 heterogeneous VOCs for which all the degradation rate constants were known; an atmospheric persistence index (ATPIN) was defined and modelled by theoretical molecular descriptors [11]. Later, the application of our MLR models, developed for each studied degradation rate constant (kNO3, kO3, and kOH) [13, 14, 18], allowed a similar PC analysis (Figure 12-6) of a much larger set of 399 chemicals.

Figure 12-6. Score plot and loading plot of the first two principal components of the PCA of three rate constants (kOH, kNO3, kO3) for 399 chemicals (labeled according to chemical class). ATDIN: ATmospheric Degradability INdex. Cumulative explained variance: 95.3%; explained variance of PC1 (ATDIN) = 80.9% (with copyright permission from [17]).

This new, more informative index (the PC1 score of Figure 12-6, 80.9% of explained variance, newly defined as ATDIN, the atmospheric degradability index), based on a wider set of more structurally heterogeneous chemicals, was also satisfactorily modelled by MLR based on theoretical molecular descriptors and externally validated (\( Q^2 \) = 0.94; \( {\rm{Q}}_{{\rm{EXT}}}^2 \) = 0.92) (scatter plot in Figure 12-7) [17].

Figure 12-7. Regression line for the externally validated model of ATPIN (ATmospheric Persistence INdex, the opposite of ATDIN). Training and test set chemicals are highlighted differently; outliers and influential chemicals are named (with copyright permission from [17]).

Mobility in Atmosphere and Long-Range Transport of Persistent Organic Pollutants (POPs). The intrinsic tendency of compounds toward global mobility in the atmosphere has been studied, since it is a necessary property for the evaluation of the long-range transport (LRT) of POPs [16, 31]. As the mobility potential of a chemical depends on the various physico-chemical properties of a compound, principal component analysis was used to explore the contemporaneous variation and influence of all the properties selected as being the most relevant to LRT potential (such as vapor pressure, water solubility, boiling point, melting point, temperature of condensation, various partition coefficients among different compartments; for instance, Henry’s law constant, octanol/water partition coefficient, soil sorption coefficient, octanol/air partition coefficient).

A simple interpretation of the obtained PC1 is as a scoring function of the intrinsic tendency toward global mobility. We proposed this PC1 score for ranking the 82 possible POPs into four a priori classes: high, relatively high, relatively low, and low mobility.

These classes have been successfully modelled by the CART method, based on four theoretical molecular descriptors (two Kier and Hall connectivity indexes, molecular weight, and the sum of electronegativities), with only 6% errors in cross-validation. The main aim was to develop a simple and rapid framework to screen, rank, and classify even new organic chemicals according to their intrinsic global mobility tendency, solely from knowledge of their chemical structure.

An analogous approach was previously applied to a subset of 52 POPs to define a long-range transport (LRT) index derived from the PC1 score, on the basis of physico-chemical properties and additionally taking into account atmospheric half-life data [147].

Environmental partitioning tendency of pesticides. The partitioning of pesticides into different environmental compartments depends mainly on the physico-chemical properties of the studied chemical, such as the organic carbon partition coefficient (Koc), the n-octanol/water partition coefficient (Kow), water solubility (Sw), vapor pressure (Vp), and Henry’s law constant (H). To rank and classify the 54 studied pesticides, belonging to various chemical categories, according to their distribution tendency in the various media, we applied [32] a combination of two multivariate approaches: principal component analysis (Figure 12-8) for ranking, and hierarchical cluster analysis for defining four a priori classes of environmental behavior (1. soluble, 2. volatile, 3. sorbed, and 4. non-volatile/medium class; circles in Figure 12-8).

Figure 12-8. Score plot and loading plot of the first two principal components of the PCA of five physico-chemical properties (Koc, Kow, Sw, Vp, and Henry’s law constant) for 54 pesticides. Cumulative explained variance: 94.6%; explained variance of PC1: 70.1% (with copyright permission from [32]).
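The class-definition step just described (hierarchical clustering of the standardized properties into four groups) can be sketched on placeholder data as follows; Ward linkage is one reasonable choice, and the original study’s settings may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(0)
props = rng.normal(size=(54, 5))   # placeholder for log Koc, Kow, Sw, Vp, H

Z = linkage(zscore(props, ddof=1), method='ward')   # cluster standardized rows
classes = fcluster(Z, t=4, criterion='maxclust')    # four a priori classes
```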

The pesticides were finally assigned to the four defined classes by different classification methods (CART, k-NN, RDA) using theoretical molecular descriptors (the CART tree, for example, is reported in Figure 12-9). Two of the selected molecular descriptors are quite easily interpretable: (a) MW encodes information on molecular size, and it is well known that large molecules have the greatest tendency to bind, by van der Waals forces, to the organic component of soil, making them the most sorbed in organic soils but the least soluble in water (Class 3); (b) the ability of a chemical to form hydrogen bonds with water molecules (encoded in the molecular descriptor nHDon) results in the higher solubility of the Class 1 pesticides, while chemicals forming fewer hydrogen bonds are the most volatile (Class 2). The last descriptor, the topological index J, which discriminates Class 4 (the medium-behavior pesticides), is not easily interpretable.

Figure 12-9. Classification tree by classification and regression tree (CART) of mobility classes for 54 pesticides. Error rate (ER): 11.11%; ER in prediction: 18.53%; NoMER: 62.96% (with copyright permission from [32]).
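A CART tree of this kind can be grown with any standard implementation. Below is a minimal scikit-learn sketch on placeholder data: the descriptor values and class labels are randomly generated, and only the three descriptor names come from the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(54, 3))          # placeholder values for MW, nHDon, J
y = rng.integers(1, 5, size=54)       # mobility classes 1-4

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(cart, feature_names=['MW', 'nHDon', 'J']))  # the tree rules
```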

A wider, heterogeneous, and quite representative data set of pesticides of different chemical classes (acetanilides, carbamates, dinitroanilines, organochlorides, organophosphates, phenylureas, triazines, triazoles), already studied for Koc modeling [96], has also undergone PC analysis of various environmental partitioning properties (solubility, volatility, partition coefficients, etc.) in order to study leaching tendency [7]. The resultant macrovariables, the PC1 and PC2 scores, called the leaching index (LIN) and the volatility index (VIN), have been proposed as cumulative indexes of environmental partitioning in different media. These two indexes were modelled by theoretical molecular descriptors with satisfactory predictive power (\( Q^2 \) leave-30%-out = 0.85 for LIN). Such an approach allows rapid screening and pre-determination of the environmental distribution of pesticides, starting only from molecular structure, without any a priori knowledge of physico-chemical properties.

The proposed LIN index was used in a comparative analysis with the GUS and LEACH indexes to highlight the pesticides most dangerous to the aquatic compartment among those widely used in Uzbekistan, in the Amu-Darya river basin [151].

POPs and PBTs. QSAR approaches based on molecular structure have recently been proposed for prioritizing chemicals for persistence, in particular a screening and ranking method for the global half-life of persistent organic pollutants (POPs) [10, 24, 148, 149].

Persistence in the environment is an important criterion in prioritizing hazardous chemicals and in identifying new persistent organic pollutants (POPs). Degradation half-life in various compartments is among the more commonly used criteria for studying environmental persistence, but the limited availability of experimental data or reliable estimates is a serious problem. Available half-life data for degradation in air, water, sediment, and soil for a set of 250 organic POP-type chemicals have been combined in a multivariate approach by principal component analysis. This PCA distributes the studied compounds according to their cumulative, or global, half-life and their relative persistence in the different media, yielding a ranking of the studied organic pollutants by relative overall half-life.

The biplot of the first and second components is reported in Figure 12-10, where the chemicals (points, or scores) are distributed according to their environmental persistence, represented by the linear combination of their half-lives in the four selected media (loadings, shown as lines). The cumulative explained variance of the first two PCs is 94%, and PC1 alone provides the largest part, 78%, of the total information. The loading lines show the importance of each variable in the first two PCs.

Figure 12-10. Principal component analysis of half-life data for 250 organic compounds in the various compartments (air, water, sediment, and soil) (PC1–PC2 explained variance: 94%). P = persistent (with copyright permission from [10]).

It is interesting to note that all the half-life values (lines) are oriented in the same direction along the first principal component; thus PC1, derived from a linear combination of the half-lives in the different media, is a new macro-variable condensing a chemical’s tendency to environmental persistence. PC1 ranks the compounds according to their cumulative half-life and discriminates between them with regard to persistence: chemicals with high half-life values in all the media (highlighted in the PCA graph) are located to the right of the plot, in the zone of globally higher persistence (chemicals very persistent in every medium); chemicals with a lower global half-life fall to the left of the graph, being persistent in no medium (labeled in Figure 12-10) or in only one; chemicals persistent in two or three media are located in the intermediate zone of Figure 12-10.
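In code, such a ranking index is just the first PCA score. The sketch below is a minimal illustration, assuming log10-transformed half-lives (a plausible preprocessing step, not stated in the original) and a sign convention that makes larger scores mean greater persistence:

```python
import numpy as np

def persistence_ranking(half_lives):
    """PC1 score of standardized log half-lives (rows = chemicals,
    columns = air, water, sediment, soil) as a global persistence index."""
    L = np.log10(half_lives)
    Z = (L - L.mean(axis=0)) / L.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = U[:, 0] * s[0]
    if Vt[0].sum() < 0:        # flip sign so all loadings point "persistent"
        pc1 = -pc1
    return pc1

rng = np.random.default_rng(0)
toy = 10 ** rng.normal(2, 1, size=(250, 4))        # hypothetical half-life data
most_persistent_first = np.argsort(persistence_ranking(toy))[::-1]
```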

PC2, although less informative (explained variance 16%), is also interesting: it separates the compounds more persistent in air (upper part of Figure 12-10, regions 1 and 2), i.e., those with higher LRT potential, from those more persistent in water, soil, and sediment (lower part, regions 3 and 4).

A deeper analysis of the distribution of the studied chemicals gives some interesting results and confirms experimental evidence: to the right, among the chemicals very persistent in all compartments (full triangles in Figure 12-10), we find most of the compounds recognized as POPs by the Stockholm Convention [152]. Highly chlorinated PCBs and hexachlorobenzene are among the most persistent compounds in our reference scenario. All these compounds are grouped in Region 1 owing to their globally high persistence, especially in air. The less chlorinated PCBs (PCB-3 and PCB-21) fall in the zone of very persistent chemicals, but not in the upper part of Region 1, owing to their lower persistence in air compared with the highly chlorinated congeners. p,p′-DDT, p,p′-DDE, and o,p′-DDE, highly chlorinated dioxins and dioxin-like compounds, as well as the pesticides toxaphene, lindane, chlordane, dieldrin, and aldrin, fall in Region 3 (chemicals highly persistent mainly in compartments other than air).

A global half-life index (GHLI), obtained from existing knowledge of generalized chemical persistence over a wide scenario of 250 chemicals, and whose reliability was verified through comparison with multimedia model results and empirical evidence, was proposed from this PC analysis [10]. This global index, the PC1 score, was then modelled as a cumulative endpoint using a QSAR approach based on theoretical molecular descriptors; a simple and robust regression model, externally validated for its predictive ability [6, 84], has been derived. The original set of available data was first randomly split into training and prediction sets: 50% of the compounds were put into the prediction set (125 compounds), while the other 50% was used to build the QSPR model by MLR. Given below (Eq. 12-5) is the best QSPR model, selected by statistical approaches, with its statistical parameters (Figure 12-11 shows the plot of GHLI values from PCA vs. predicted GHLI values):

$$\begin{array}{ll}\mathrm{GHL\;Index} = -3.12(\pm 0.77) + 0.33(\pm 0.045)\,\mathrm{X0v} + 5.1(\pm 0.99)\,\mathrm{Mv} - 0.32(\pm 0.061)\,\mathrm{MAXDP}\\ \qquad\qquad\quad -\, 0.61(\pm 0.10)\,\mathrm{nHDon} - 0.5(\pm 1.15)\,\mathrm{CIC0} - 0.61(\pm 0.13)\,\text{O-060}\end{array}$$
$$n_{\mathrm{training}} = 125,\quad R^2 = 0.85,\quad Q^2_{\mathrm{LOO}} = 0.83,\quad Q^2_{\mathrm{BOOT}} = 0.83,\quad \mathrm{RMSE} = 0.76,\quad \mathrm{RMSE}_{\mathrm{cv}} = 0.70$$
$$n_{\mathrm{prediction}} = 125,\quad R^2_{\mathrm{EXT}} = 0.79,\quad \mathrm{RMSE}_{\mathrm{p}} = 0.78$$
(12-5)
Figure 12-11. Scatter plot of the GHLI values calculated by PCA vs. the values predicted by the model. The GHLI values for the training and prediction set chemicals are labeled differently. The diagonal dotted lines indicate the 2.5σ interval; response outliers are numbered. Vertical and horizontal dotted lines identify the cut-off value of GHLI = 1 for highly persistent chemicals (with copyright permission from [10]).

This model presents good internal and external predictive power, a result that must be highlighted as proof of model robustness and real external predictivity. The only really dangerous zone in the proposed model is the underestimation zone (circled in Figure 12-11).

The application of this model, which uses only a few structural descriptors, could allow fast preliminary identification and prioritization of as-yet-unrecognized POPs, solely from knowledge of their molecular structure. The proposed multivariate approach is particularly useful not only for screening and early prioritization of environmental persistence for pollutants already on the market, but also for compounds not yet synthesized, which could represent safer alternatives and replacement solutions for recognized POPs. No method other than QSAR is applicable for detecting the potential persistence of new compounds.

Similarly, highly predictive classification models, based on k-NN, CART, and CP-ANN, have been developed and can be usefully applied for POP pre-screening. The a priori classes have been defined by applying hierarchical cluster analysis to the half-life data [34].

An approach analogous to GHLI has been successfully applied to the PCA combination of the above cumulative half-life data for persistence (GHLI), fish bioconcentration data, and acute toxicity data for P. promelas, in order to propose, and then model by a QSPR approach, a combined index of PBT behavior [24, 148, 149]. A simple model, based on easily calculable molecular descriptors and with high external predictivity (\( {\rm{Q}}_{{\rm{EXT}}}^2 \) > 0.8), has been developed and will be published. This PBT index can also be applied to chemicals lacking any experimental data, and even to compounds not yet synthesized.

These QSAR-based tools, validated for their predictivity on new chemicals, could help to highlight POP and PBT behavior even for chemicals not yet synthesized, and could be usefully applied under the new European Regulation REACH, which requires more demanding authorization steps for PBTs and the design of safer alternatives. The results of our predictions were comparable with those from the US-EPA PBT profiler (http://www.epa.gov/pbt/tools/toolbox.htm).

9 Conclusions

A statistical approach to QSAR modeling, based on heterogeneous theoretical molecular descriptors and chemometric methods and developed with the fundamental aim of predictive application, has been introduced and discussed in this review. Several applications to environmentally relevant topics related to organic pollutants, performed by the Insubria QSAR Research Unit over the last 15 years, have been presented. Different endpoints related to physico-chemical properties, persistence, bioaccumulation, and toxicity have been modelled, not only singly but also as combined endpoints obtained by multivariate analysis, an approach that is innovative and highly useful for ranking and prioritization purposes. All the proposed models are characteristically checked for predictive performance and for their applicability domain, even for new chemicals that never participated in model development. Fulfillment of the “OECD principles for QSAR validation” is a guarantee of the reliability of the predicted data obtained by our models and of their possible applicability in the context of REACH.