Introduction

Toxicology deals with the quantitative assessment of toxic effects to organisms in relation to the level, duration and frequency of exposure. Various segments of the population come in contact with toxic chemicals due to misuse (e.g., accidental poisoning), but also through manufacturing, drug and food consumption. Additionally, people working in various jobs (e.g., painters and applicators of pesticides) are exposed to toxic substances. In general, exposure to toxic substances is to be avoided [1].

As the experimental determination of toxicological properties is a costly and time-consuming process, it is essential to develop mathematical predictive relationships to theoretically quantify toxicity [2, 3]. Quantitative structure-toxicity relationship (QSTR) studies can provide a useful tool for achieving this goal, given the successful applications of quantitative structure-activity relationships (QSARs) in several scientific areas, such as pharmacology, chemistry and environmental research. Based on a training database containing measured toxicity potencies of compounds and a number of molecular descriptors, QSTRs can be used to predict the toxicity of chemical compounds that are not included in the database [4–6].

For the formal description of relationships between activity measures and structural descriptors of compounds, various statistical techniques can be used. The most frequently used are multiple linear regression (MLR) and partial least squares (PLS). Several other statistical techniques have been used in QSAR, including discriminant analysis, principal component analysis (PCA) and factor analysis, cluster analysis, multivariate analysis, and adaptive least squares [7–9]. Neural network (NN) techniques have also been used successfully in QSAR [10–16]. The NN methodologies are generally used when the relationships cannot be described accurately by linear functions [17].

The goal of the present study is to determine the efficiency of a newly introduced radial basis function (RBF) training methodology in predicting the toxicity of compounds. The methodology uses the fuzzy-means clustering technique to determine the number and the locations of the hidden node centers [18]. Compared to traditional training techniques, the method employed in this work is much faster, since it does not involve any iterative procedure; it uses only one tuning parameter; and it is repeatable, i.e., it does not depend on a random initial selection of centers. The RBF method is applied to a data set of 221 phenols, and the results indicate that, given appropriate descriptors as inputs, it can be used as an efficient new technique for predicting toxicity with significant accuracy.

Materials and methods

To obtain a successful QSTR, it is essential that all data used in the training and validation procedure are of high quality. High-quality data should derive from the same endpoint and protocol and ideally should be measured in the same laboratory [19]. The data set used in this study fulfills this criterion.

Toxicity data

This data set consists of 221 phenols and their corresponding toxicity data to the ciliate Tetrahymena pyriformis in terms of log(1/IGC50) (mmol/L). The toxicity values were taken from the literature [20] and are shown in Table 1. The phenols are structurally heterogeneous and represent a variety of mechanisms of toxic action. The dataset consists of polar narcotics, weak acid respiratory uncouplers, pro-electrophiles and soft electrophiles.

Table 1 Predicted values [log(1/IGC50)] for the training and the test set

Molecular descriptors

The molecular descriptors used to derive the model were taken from the literature [20] and include the logarithm of the octanol/water partition coefficient (\( \log K_{\text{ow}} \)), the acidity constant (\( \text{p}K_{\text{a}} \)), the energies of the highest occupied and lowest unoccupied molecular orbitals (\( E_{\text{HOMO}} \) and \( E_{\text{LUMO}} \), respectively) and the hydrogen bond donor number (\( N_{\text{hdon}} \)). All these descriptors are related to the toxic effect of the compounds studied.

Statistical analysis (QSAR development)

In this section, we present the basic characteristics of the RBF NN architecture and the training method used to develop the QSAR NN models.

RBF network topology and node characteristics

RBF networks consist of three layers: the input layer, the hidden layer and the output layer. The input layer collects the input information and formulates the input vector \( \mathbf{x} \). The hidden layer consists of \( L \) hidden nodes, which apply nonlinear transformations to the input vector. The output layer delivers the NN responses to the environment. A typical hidden node \( l \) in an RBF network is described by a vector \( \hat{\mathbf{x}}_{l} \) (the node center), equal in dimension to the input vector, and a scalar width \( \sigma_{l} \). The activity \( v_{l}(\mathbf{x}) \) of the node is calculated as the Euclidean norm of the difference between the input vector and the node center and is given by

$$ v_{l}(\mathbf{x}) = \left\| \mathbf{x} - \hat{\mathbf{x}}_{l} \right\| $$
(1)

The response of the hidden node is determined by passing the activity through the radially symmetric Gaussian function:

$$ f_{l} ({\textbf{x}}) = \exp {\left( { - \frac{{v_{l} ({\textbf{x}})^{2} }} {{\sigma _{l} ^{2} }}} \right)} $$
(2)

Finally, the output values of the network are computed as linear combinations of the hidden layer responses:

$$ \hat{y} = g(\mathbf{x}) = \sum\limits_{l = 1}^{L} f_{l}(\mathbf{x})\, w_{l} $$
(3)

where \( [w_{1}, w_{2}, \ldots, w_{L}] \) is the vector of weights, which multiply the hidden node responses in order to calculate the output of the network.
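As an illustration of Eqs. (1)–(3), the following minimal Python sketch evaluates the network output for a single input vector. The function name `rbf_output` and its argument layout are assumptions made for this example and are not taken from the original implementation.

```python
import math

def rbf_output(x, centers, widths, weights):
    """Evaluate Eqs. (1)-(3) for one input vector x (hypothetical helper)."""
    y_hat = 0.0
    for center, sigma, w in zip(centers, widths, weights):
        # Eq. (1): activity = Euclidean distance between input and node center
        v = math.dist(x, center)
        # Eq. (2): Gaussian response of the hidden node
        f = math.exp(-(v ** 2) / (sigma ** 2))
        # Eq. (3): weighted sum of the hidden-node responses
        y_hat += w * f
    return y_hat

# Two hidden nodes in a two-dimensional input space (illustrative values only).
y = rbf_output([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0]], [1.0, 1.0], [2.0, 3.0])
```

In this toy call the first node sits exactly on the input, so its response is 1, while the second node lies one width away and contributes a response of exp(-1).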

RBF network training methodology

Training methodologies for the RBF network architecture are based on a set of input–output training pairs \( (\mathbf{x}(k),\, y(k)) \), \( k = 1, 2, \ldots, K \). The training procedure used in this work consists of three distinct phases:

(i) Selection of the network structure and calculation of the hidden-node centers using the fuzzy-means clustering algorithm [18]. The algorithm is based on a fuzzy partition of the input space, which is produced by defining a number of triangular fuzzy sets on the domain of each input variable. The centers of these fuzzy sets produce a multidimensional grid on the input space. A rigorous selection algorithm chooses the most appropriate knots of the grid, which are used as hidden node centers in the RBF network model produced. The idea behind the selection algorithm is to place the centers in the multidimensional input space so that there is a minimum distance between the center locations. At the same time, the algorithm assures that for any input example in the training set there is at least one selected hidden node that is close enough according to a distance criterion. It must be emphasized that, in contrast to both the k-means [21] and the c-means clustering [22] algorithms, the fuzzy-means technique does not need the number of clusters to be fixed before the execution of the method. Moreover, due to the fact that it is a one-pass algorithm, it is extremely fast even if a large database of input–output examples is available. Furthermore, the fuzzy-means algorithm needs only one tuning parameter, which is the number of fuzzy sets that are used to partition each input dimension.

(ii) Following the determination of the hidden-node centers, the widths of the Gaussian activation functions are calculated using the p-nearest neighbor heuristic [23]:

$$ \sigma_{l} = \left( \frac{1}{p} \sum\limits_{i = 1}^{p} \left\| \hat{\mathbf{x}}_{l} - \hat{\mathbf{x}}_{i} \right\|^{2} \right)^{1/2} $$
(4)

where \( \hat{\mathbf{x}}_{1}, \hat{\mathbf{x}}_{2}, \ldots, \hat{\mathbf{x}}_{p} \) are the \( p \) node centers nearest to the hidden node \( l \). The parameter \( p \) is selected so that many nodes are activated when an input vector is presented to the NN model.

(iii) The connection weights are determined using linear regression between the hidden-layer responses and the corresponding output training set.
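The three phases can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the published fuzzy-means algorithm [18]: the coverage test in `select_centers` (distance measured in grid-spacing units and normalized by the number of inputs) is an assumed stand-in for the distance criterion, and all function names are hypothetical. The weights are obtained from the normal equations, which is adequate only for small, well-conditioned problems.

```python
import math

def select_centers(X, n_sets):
    """Phase (i), simplified: one pass over the data, picking grid knots as centers."""
    dims = range(len(X[0]))
    lo = [min(x[i] for x in X) for i in dims]
    hi = [max(x[i] for x in X) for i in dims]
    # Spacing between fuzzy-set centers per input (1.0 if the input is constant).
    step = [(hi[i] - lo[i]) / (n_sets - 1) or 1.0 for i in dims]

    def scaled_dist(x, c):
        # ASSUMED distance criterion: grid-spacing units, normalized by dimension.
        return math.sqrt(sum(((x[i] - c[i]) / step[i]) ** 2 for i in dims) / len(x))

    centers = []
    for x in X:
        if any(scaled_dist(x, c) <= 1.0 for c in centers):
            continue  # example is already close enough to a selected center
        # Otherwise select the grid knot nearest to this example.
        centers.append(tuple(lo[i] + round((x[i] - lo[i]) / step[i]) * step[i]
                             for i in dims))
    return centers

def nearest_neighbor_widths(centers, p):
    """Phase (ii), Eq. (4): RMS distance to the p nearest other centers.
    Requires at least two centers."""
    widths = []
    for l, c in enumerate(centers):
        d2 = sorted(math.dist(c, o) ** 2 for j, o in enumerate(centers) if j != l)
        nearest = d2[:p]
        widths.append(math.sqrt(sum(nearest) / len(nearest)))
    return widths

def hidden_responses(x, centers, widths):
    """Gaussian hidden-node responses, Eqs. (1)-(2)."""
    return [math.exp(-math.dist(x, c) ** 2 / s ** 2) for c, s in zip(centers, widths)]

def fit_weights(X, y, centers, widths):
    """Phase (iii): least-squares weights from the normal equations (F^T F) w = F^T y."""
    F = [hidden_responses(x, centers, widths) for x in X]
    L = len(centers)
    # Augmented matrix [F^T F | F^T y].
    A = [[sum(f[r] * f[c] for f in F) for c in range(L)]
         + [sum(f[r] * t for f, t in zip(F, y))] for r in range(L)]
    for col in range(L):  # Gaussian elimination with partial pivoting
        piv = max(range(col, L), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, L):
            m = A[r][col] / A[col][col]  # fails if the system is singular
            A[r] = [a - m * b for a, b in zip(A[r], A[col])]
    w = [0.0] * L
    for r in reversed(range(L)):  # back substitution
        w[r] = (A[r][L] - sum(A[r][c] * w[c] for c in range(r + 1, L))) / A[r][r]
    return w

def train_rbf(X, y, n_sets, p):
    """Run the three phases in sequence; returns (centers, widths, weights)."""
    centers = select_centers(X, n_sets)
    widths = nearest_neighbor_widths(centers, p)
    return centers, widths, fit_weights(X, y, centers, widths)
```

Because the center selection is a single deterministic pass, repeated runs on the same data in the same order give the same network, which is the repeatability property noted for the method. In this sketch the result can depend on the order of the training examples.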

Results

To evaluate and compare the performance of the RBF training methodology presented in this work, the data set was first split into a training set and a validation set in a ratio of approximately 80:20 (180 and 41 compounds, respectively) using the Kennard–Stone algorithm [24], which has gained increasing popularity for splitting data sets into two subsets. The algorithm starts by finding the two samples that are farthest apart in terms of the input variables under some metric, e.g., the Euclidean distance. These two samples are removed from the original data set and placed in the calibration set (here, the training set). The selection is then repeated, each time moving the remaining sample that lies farthest from those already selected, until the calibration set contains the desired number of samples. The advantages of this algorithm are that the calibration samples map the measured region of the variable space completely with respect to the induced metric and that the test samples all fall inside the measured region. The training and validation compounds are clearly indicated in Table 1.

Both the RBF network and the MLR models were developed using exactly the same training set. The validation set was not involved in any way during the training phase. The results are shown in Table 1, which lists the predictions of the two models for both the training and the external examples. The same results are shown graphically in Figs. 1, 2, 3 and 4, where the experimental toxicity is plotted against the predictions of the RBF network and the MLR model. Each figure reports the corresponding coefficient of determination (\( R^{2} \)); the values indicate a much higher correlation between experimental and predicted values for the RBF network than for the MLR model. The fitted linear (MLR) equation for the prediction of toxicity is the following:

$$ \begin{aligned} \log(1/\text{IGC}_{50}) & = 0.5617\,\log K_{\text{ow}} + 0.0026\,\text{p}K_{\text{a}} - 0.8792\,E_{\text{LUMO}} \\ & \quad + 0.7995\,E_{\text{HOMO}} + 0.2734\,N_{\text{hdon}} + 6.2044, \\ n & = 180,\quad R^{2} = 0.6022,\quad \text{RMS} = 0.5352. \end{aligned} $$
(5)
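The Kennard–Stone split used to form the training and validation sets can be sketched as follows. This is a minimal illustration of the procedure as it is commonly described, assuming the Euclidean metric and at least two samples to select; the function name `kennard_stone` is hypothetical and the code is not the implementation used in this study.

```python
import math

def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices; assumes n_select >= 2."""
    n = len(X)
    # Start with the two samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: math.dist(X[ij[0]], X[ij[1]]))
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_select and remaining:
        # Move the remaining sample whose nearest selected sample is farthest away.
        nxt = max(remaining,
                  key=lambda k: min(math.dist(X[k], X[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Toy one-dimensional example: select three of four samples for calibration.
train_idx, test_idx = kennard_stone([(0.0,), (1.0,), (5.0,), (10.0,)], 3)
```

In the toy example the two extreme samples are chosen first, so the sample left for testing lies inside the range spanned by the calibration samples.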
Fig. 1 Experimental versus predicted toxicity using the RBF methodology for the training set (180 compounds)

Fig. 2 Experimental versus predicted toxicity using the MLR methodology for the training set (180 compounds)

Fig. 3 Experimental versus predicted toxicity using the RBF methodology for the test set (41 compounds)

Fig. 4 Experimental versus predicted toxicity using the MLR methodology for the test set (41 compounds)

To compare the performance of the modeling schemes further, their predictive ability was also evaluated by the leave-one-out (LOO) cross-validation procedure. A number of modified data sets were created by deleting in each case one object from the data. An RBF network and an MLR model were developed in each case based on the remaining data and were validated using the object that had been deleted. Consequently, 221 RBF networks and 221 MLR models were built, each omitting one compound. Figures 5 and 6 show the experimental toxicity versus the predictions produced by the RBF NN models and the multiple regression technique, using the LOO cross-validation procedure. The corresponding coefficients of determination \( R^{2}_{\text{CV}} \) indicate again that the models derived from the RBF methodology have a higher predictive potential. The comparison between the RBF and the MLR methods is summarized in Table 2. In all cases, the RBF models proved to be markedly more accurate than the MLR models. The predictive ability of both modeling techniques could be improved if different models were developed for each of the several mechanisms of action, but in this paper we concentrated on building a single model for each methodology that can predict toxicity across the variety of mechanisms included in the data set.
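The LOO procedure can be sketched generically as below. The `fit` and `predict` callables stand for any modeling scheme (RBF or MLR); the trivial mean predictor is only a placeholder to keep the example self-contained. The definition of \( R^{2}_{\text{CV}} \) as one minus the ratio of the prediction error sum of squares to the total sum of squares is a common convention and is assumed here, since the exact formula is not stated in the text.

```python
def loo_predictions(X, y, fit, predict):
    """Leave-one-out: refit on all samples but one, then predict the omitted sample."""
    preds = []
    for k in range(len(X)):
        X_train = X[:k] + X[k + 1:]
        y_train = y[:k] + y[k + 1:]
        model = fit(X_train, y_train)
        preds.append(predict(model, X[k]))
    return preds

def r2_cv(y, preds):
    """Assumed definition: 1 - PRESS / total sum of squares."""
    mean_y = sum(y) / len(y)
    press = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / ss_tot

# Placeholder model (hypothetical): always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

X_demo = [(0.0,), (1.0,), (2.0,)]
y_demo = [1.0, 2.0, 3.0]
preds_demo = loo_predictions(X_demo, y_demo, fit_mean, predict_mean)
q2_demo = r2_cv(y_demo, preds_demo)
```

With the placeholder model the cross-validated coefficient is negative, which shows that this statistic, unlike the fitted \( R^{2} \), penalizes a model that predicts omitted samples worse than the overall mean would.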

Fig. 5 Experimental versus predicted toxicity with cross validation (RBF methodology)

Fig. 6 Experimental versus predicted toxicity with cross validation (MLR methodology)

Table 2 Summary of the results produced by the different methods

Finally, it should be noted that the MATLAB programming language was used to implement all the training and testing procedures. The computational time required to build the NN models on a Pentium IV 3 GHz processor was always less than 0.2 s. It should also be emphasized that the RBF training method was developed in-house, so no commercial packages were used to develop the NN models. The complete QSTR models can be made available to interested readers.

Discussion and conclusions

In this work, we presented a novel QSTR methodology based on the RBF NN architecture. The method was illustrated using a data set of 221 phenols and compared with standard MLR. Validation of the different QSTR methodologies was based on two evaluation procedures. In the first, the data were split into a training and a validation set, and the model generated using the training set was used to predict toxicity in the validation set. The second was the standard LOO cross-validation procedure. Together, these procedures assessed the accuracy of the models not only by their fit to the training data but also by their ability to predict compounds not used in training.

The RBF NN models were produced using the fuzzy-means training method, which is fast and repeatable, in contrast to most traditional training techniques. The model generated for the data set required five descriptors. In terms of the \( R^{2} \), \( R^{2}_{\text{CV}} \) and RMS values, the RBF models proved to have a significant predictive potential. The results illustrate that the RBF NN architecture can be used to derive QSTRs that are more accurate and generalize better than linear regression models, at the expense of greater model complexity compared with the simple structure of a linear model. The proposed method could serve as a substitute for costly and time-consuming experiments for determining toxicity.