Prediction of toxicity using a novel RBF neural network training methodology

Melagraki, Georgia; Afantitis, Antreas; Makridima, Kalliopi; Sarimveis, Haralambos; Igglessi-Markopoulou, Olga

doi:10.1007/s00894-005-0032-8


Published: 08 November 2005

Volume 12, pages 297–305, (2006)

Journal of Molecular Modeling


Abstract

A neural network methodology based on the radial basis function (RBF) architecture is introduced in order to establish quantitative structure-toxicity relationship models for the prediction of toxicity. The dataset used consists of 221 phenols and their corresponding toxicity values to Tetrahymena pyriformis. Physicochemical parameters and molecular descriptors are used to provide input information to the models. The performance and predictive abilities of the RBF models are compared to standard multiple linear regression (MLR) models. The leave-one-out cross-validation procedure and validation through an external test set produce statistically significant R² and RMS values for the RBF models, which prove considerably more accurate than the MLR models.


Introduction

Toxicology deals with the quantitative assessment of toxic effects on organisms in relation to the level, duration and frequency of exposure. Various segments of the population come into contact with toxic chemicals through misuse (e.g., accidental poisoning), but also through manufacturing and through drug and food consumption. Additionally, people working in various jobs (e.g., painters and applicators of pesticides) are exposed to toxic substances. In general, exposure to toxic substances is to be avoided [1].

As the experimental determination of toxicological properties is a costly and time-consuming process, it is essential to develop mathematical predictive relationships to theoretically quantify toxicity [2, 3]. Quantitative structure-toxicity relationship (QSTR) studies can provide a useful tool for achieving this goal, given the successful applications of quantitative structure-activity relationships (QSARs) in several scientific areas, such as pharmacology, chemistry and environmental research. Based on a training database containing measured toxicity potencies of compounds and a number of molecular descriptors, QSTRs can be used to predict the toxicity of chemical compounds that are not included in the database [4–6].

For the formal description of relationships between activity measures and structural descriptors of compounds, various statistical techniques can be used. Among them the most frequently used are multiple linear regression (MLR) and partial least squares (PLS). Several other statistical techniques have been used in QSAR, including discriminant analysis, principal component analysis (PCA) and factor analysis, cluster analysis, multivariate analysis, and adaptive least squares [7–9]. Neural network (NN) techniques have also been used successfully in QSAR [10–16]. The NN methodologies are generally used when the relationships cannot be interpreted accurately by linear functions [17].

The goal of the present study is to determine the efficiency of a newly introduced RBF training methodology in predicting the toxicity of compounds. The methodology uses the fuzzy-means clustering technique to determine the number and the locations of the hidden node centers [18]. Compared to traditional training techniques, the method employed in this work is much faster since it does not involve any iterative procedure, utilizes only one tuning parameter and is repeatable, i.e., it does not depend on a random initial selection of centers. The RBF method is applied to a data set of 221 phenols and the results indicate that it can be used as an efficient new technique for predicting toxicity with significant accuracy, using appropriate descriptors as inputs.

Materials and methods

To obtain a successful QSTR, it is essential that all data used in the training and validation procedure are of high quality. High-quality data should derive from the same endpoint and protocol and ideally should be measured in the same laboratory [19]. The data set used in this study fulfills this criterion.

Toxicity data

This data set consists of 221 phenols and their corresponding toxicity data to the ciliate Tetrahymena pyriformis in terms of log(1/IGC₅₀) (mmol/L). The toxicity values were taken from the literature [20] and are shown in Table 1. The phenols are structurally heterogeneous and represent a variety of mechanisms of toxic action. The dataset consists of polar narcotics, weak acid respiratory uncouplers, pro-electrophiles and soft electrophiles.

Table 1 Predicted values [log(1/IGC₅₀)] for the training and the test set


Molecular descriptors

The molecular descriptors used to derive the model were taken from the literature [20] and include the logarithm of the octanol/water partition coefficient ($\log K_{\mathrm{ow}}$), the acidity constant ($\mathrm{p}K_{\mathrm{a}}$), the energies of the highest occupied and lowest unoccupied molecular orbitals ($E_{\mathrm{HOMO}}$ and $E_{\mathrm{LUMO}}$, respectively) and the hydrogen bond donor number ($N_{\mathrm{hdon}}$). All these descriptors are related to the toxic effect of the compounds studied.

Statistical analysis (QSAR development)

In this section, we present the basic characteristics of the RBF NN architecture and the training method used to develop the QSAR NN models.

RBF network topology and node characteristics

RBF networks consist of three layers: the input layer, the hidden layer and the output layer. The input layer collects the input information and formulates the input vector $\mathbf{x}$. The hidden layer consists of $L$ hidden nodes, which apply nonlinear transformations to the input vector. The output layer delivers the NN responses to the environment. A typical hidden node $l$ in an RBF network is described by a center vector $\hat{\mathbf{x}}_{l}$, equal in dimension to the input vector, and a scalar width $\sigma_{l}$. The activity $v_{l}(\mathbf{x})$ of the node is calculated as the Euclidean norm of the difference between the input vector and the node center and is given by

$$ v_{l}(\mathbf{x}) = \left\| \mathbf{x} - \hat{\mathbf{x}}_{l} \right\| $$

(1)

The response of the hidden node is determined by passing the activity through the radially symmetric Gaussian function:

$$ f_{l}(\mathbf{x}) = \exp\left( -\frac{v_{l}(\mathbf{x})^{2}}{\sigma_{l}^{2}} \right) $$

(2)

Finally, the output values of the network are computed as linear combinations of the hidden layer responses:

$$ \hat{y} = g(\mathbf{x}) = \sum\limits_{l = 1}^{L} f_{l}(\mathbf{x})\, w_{l} $$

(3)

where $[w_{1}, w_{2}, \ldots, w_{L}]$ is the vector of weights, which multiply the hidden node responses in order to calculate the output of the network.
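As a minimal illustration of Eqs. (1)–(3), the forward pass of such a network can be sketched in Python with NumPy. The function names `rbf_hidden_responses` and `rbf_predict`, and the small numerical example, are hypothetical and chosen only for this sketch; the original implementation was written in MATLAB and is not reproduced here.

```python
import numpy as np

def rbf_hidden_responses(X, centers, widths):
    """Gaussian responses f_l(x) of Eqs. (1)-(2) for every row of X."""
    X = np.atleast_2d(X)
    # v[k, l] = ||x_k - x_hat_l||  (Eq. 1)
    v = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # f[k, l] = exp(-v^2 / sigma_l^2)  (Eq. 2)
    return np.exp(-(v ** 2) / (widths[None, :] ** 2))

def rbf_predict(X, centers, widths, weights):
    """Network output y_hat = sum_l f_l(x) * w_l  (Eq. 3)."""
    return rbf_hidden_responses(X, centers, widths) @ weights

# Toy example with two hidden nodes in a two-dimensional input space
# (illustrative numbers only, not taken from the paper).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([1.0, 2.0])
weights = np.array([0.5, -1.0])
y = rbf_predict(np.array([[0.0, 0.0]]), centers, widths, weights)
```

For the input placed exactly on the first center, the first node responds with 1 and the second with exp(-2/4), so the output is 0.5 - exp(-0.5).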

RBF network training methodology

Training methodologies for the RBF network architecture are based on a set of input–output training pairs $(\mathbf{x}(k), y(k))$, $k = 1, 2, \ldots, K$. The training procedure used in this work consists of three distinct phases:

(i) Selection of the network structure and calculation of the hidden-node centers using the fuzzy-means clustering algorithm [18]. The algorithm is based on a fuzzy partition of the input space, which is produced by defining a number of triangular fuzzy sets on the domain of each input variable. The centers of these fuzzy sets produce a multidimensional grid on the input space. A rigorous selection algorithm chooses the most appropriate knots of the grid, which are used as hidden node centers in the RBF network model produced. The idea behind the selection algorithm is to place the centers in the multidimensional input space so that there is a minimum distance between the center locations. At the same time, the algorithm assures that for any input example in the training set there is at least one selected hidden node that is close enough according to a distance criterion. It must be emphasized that, in contrast to both the k-means [21] and the c-means clustering [22] algorithms, the fuzzy-means technique does not need the number of clusters to be fixed before the execution of the method. Moreover, due to the fact that it is a one-pass algorithm, it is extremely fast even if a large database of input–output examples is available. Furthermore, the fuzzy-means algorithm needs only one tuning parameter, which is the number of fuzzy sets that are used to partition each input dimension.

(ii) Following the determination of the hidden-node centers, the widths of the Gaussian activation functions are calculated using the p-nearest neighbor heuristic [23]:

$$ \sigma_{l} = \left( \frac{1}{p} \sum\limits_{i = 1}^{p} \left\| \hat{\mathbf{x}}_{l} - \hat{\mathbf{x}}_{i} \right\|^{2} \right)^{1/2} $$

(4)

where $\hat{\mathbf{x}}_{1}, \hat{\mathbf{x}}_{2}, \ldots, \hat{\mathbf{x}}_{p}$ are the $p$ nearest node centers to the hidden node $l$. The parameter $p$ is selected so that many nodes are activated when an input vector is presented to the NN model.

(iii) The connection weights are determined using linear regression between the hidden-layer responses and the corresponding output training set.
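The three phases above can be sketched in Python with NumPy under some loudly stated assumptions. Phases (ii) and (iii) follow Eq. (4) and ordinary least squares directly. Phase (i) is replaced here by a deliberately simplified stand-in, `grid_knot_centers`, which keeps every grid knot that is nearest to at least one training example; it is not the fuzzy-means selection algorithm of [18], whose details are not given in this text. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def grid_knot_centers(X, n_sets):
    """Simplified stand-in for phase (i); NOT the fuzzy-means algorithm of [18].
    Puts n_sets (>= 2) equally spaced knots on each input and keeps every
    knot that is the nearest knot of at least one training example."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)            # guard constant inputs
    idx = np.rint((X - lo) / span * (n_sets - 1))     # nearest knot per example
    knots = np.unique(idx, axis=0)                    # one center per used knot
    return lo + knots / (n_sets - 1) * span

def p_nearest_widths(centers, p):
    """Phase (ii): Eq. (4), RMS distance from each center to its p nearest centers."""
    if not 1 <= p < len(centers):
        raise ValueError("p must satisfy 1 <= p < number of centers")
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d_near = np.sort(d, axis=1)[:, 1:p + 1]           # drop the zero self-distance
    return np.sqrt(np.mean(d_near ** 2, axis=1))

def fit_weights(X, y, centers, widths):
    """Phase (iii): least-squares regression of y on the hidden-layer responses."""
    v = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    F = np.exp(-(v ** 2) / (widths ** 2))
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w, F

# Synthetic demonstration data (not the phenol toxicity data set).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
centers = grid_knot_centers(X, n_sets=4)
widths = p_nearest_widths(centers, p=2)
weights, F = fit_weights(X, y, centers, widths)
rms = np.sqrt(np.mean((F @ weights - y) ** 2))
```

None of the three steps involves iteration or random initialization, which is consistent with the speed and repeatability claimed for the method, although the stand-in for phase (i) will generally select different centers than the published algorithm.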

Results

In order to evaluate and compare the performance of the RBF training methodology presented in this work, the data set was initially split into a training and a validation set in a ratio of approximately 80:20 (180 and 41 compounds, respectively). For that, the Kennard–Stone algorithm [24] was used, which has gained increasing popularity for splitting data sets into two subsets. The algorithm starts by finding the two samples that are farthest apart from each other on the basis of the input variables in terms of some metric, e.g., the Euclidean distance. These two samples are removed from the original data set and put into the calibration data set. At each subsequent step, the remaining sample that is farthest from the samples already selected (i.e., the one with the largest minimum distance to them) is moved into the calibration set, and this is repeated until the desired number of samples has been reached. The advantages of this algorithm are that the calibration samples map the measured region of the variable space completely with respect to the induced metric and that the test samples all fall inside the measured region.

The training and validation compounds are clearly indicated in Table 1. Both the RBF network and the MLR model were developed based on exactly the same training set. The validation set was not involved in any way during the training phase. The results are shown in Table 1, where the predictions of the two models are given for both the training and the external examples. The same results are shown in graphical format in Figs. 1, 2, 3 and 4, where the experimental toxicity is plotted against the predictions of the RBF network and the MLR model. In each figure the corresponding coefficient of determination ($R^{2}$) is presented; these values indicate a much higher correlation between experimental and predicted values for the RBF network methodology. The full linear (MLR) equation for the prediction of toxicity is the following:

$$ \begin{aligned} \log(1/\mathrm{IGC}_{50}) &= 0.5617\,\log K_{\mathrm{ow}} + 0.0026\,\mathrm{p}K_{\mathrm{a}} - 0.8792\,E_{\mathrm{LUMO}} \\ &\quad + 0.7995\,E_{\mathrm{HOMO}} + 0.2734\,N_{\mathrm{hdon}} + 6.2044, \\ n &= 180,\quad R^{2} = 0.6022,\quad \mathrm{RMS} = 0.5352. \end{aligned} $$

(5)
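The Kennard–Stone split used to produce the training and validation sets can be sketched as follows. This is a minimal NumPy version of the standard maximin formulation of [24] with the Euclidean distance; the function name `kennard_stone` and the one-dimensional toy data are hypothetical, and the code is not the authors' MATLAB implementation.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Split row indices of X into (selected, remaining) by the Kennard-Stone rule.
    Assumes 2 <= n_select <= len(X) and Euclidean distance on the inputs."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its closest selected sample.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        # Move the sample that is farthest from the selected set.
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Toy one-dimensional example (illustrative values only).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [5.0]])
train_idx, test_idx = kennard_stone(X_toy, n_select=3)
```

In the toy example the two extreme points (0 and 10) are picked first, then the point at 5, which lies farthest from both; the two points left for the test set fall inside the range covered by the selected ones.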

To compare the performance of the modeling schemes further, their predictive ability was also evaluated by the leave-one-out (LOO) cross-validation procedure. A number of modified data sets were created by deleting in each case one object from the data. An RBF network and an MLR model were developed in each case based on the remaining data and were validated using the object that had been deleted. Consequently, 221 RBF networks and MLR models were built, each time deleting one compound from the training set. Figures 5 and 6 show the experimental toxicity versus the predictions produced by the RBF NN models and the multiple regression technique, using the LOO cross-validation procedure. The corresponding coefficients of determination $R^{2}_{\mathrm{CV}}$ indicate again that the models derived from the RBF methodology have a higher predictive potential. The comparison between the RBF and the MLR methods is summarized in Table 2. In all cases, the RBF models proved to be remarkably more accurate than the MLR models. The predictive abilities of both modeling techniques can be improved if different models are developed for each one of the several different mechanisms of action, but in this paper we concentrated on building a single model for each methodology that can predict toxicity for the variety of mechanisms that are included in the data set.
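The LOO procedure described above can be sketched for the MLR case as follows. The definitions of $R^{2}$ and RMS used in `r2_and_rms` are the common textbook ones and are an assumption, since the text does not state them explicitly; all function names and the noise-free synthetic data are hypothetical. Any other model (for instance an RBF network) could be plugged in through the `fit` and `predict` arguments.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares linear model with an intercept (last coefficient)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def loo_predictions(X, y, fit, predict):
    """Refit the model once per object and predict the object left out."""
    preds = np.empty(len(y))
    for k in range(len(y)):
        keep = np.arange(len(y)) != k
        model = fit(X[keep], y[keep])
        preds[k] = predict(model, X[k:k + 1])[0]
    return preds

def r2_and_rms(y, y_pred):
    """Assumed definitions: R^2 = 1 - SS_res / SS_tot, RMS = sqrt(SS_res / n)."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / len(y))

# Noise-free synthetic data with five inputs (not the phenol data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ np.array([0.5, 0.0, -0.9, 0.8, 0.3]) + 6.0
r2_cv, rms_cv = r2_and_rms(y, loo_predictions(X, y, fit_mlr, predict_mlr))
```

Because the synthetic target is exactly linear, every left-out object is predicted essentially without error here; on real data the cross-validated values are typically lower than the training-set values.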

Table 2 Summary of the results produced by the different methods


Finally, it should be noted that the MATLAB programming language was used to implement all the training and testing procedures. The computational time required to build the NN models on a Pentium IV 3 GHz processor was always less than 0.2 s. It should also be emphasized that the RBF training method was developed in-house, so no commercial packages were used to develop the NN models. The complete QSTR models can be made available to interested readers.

Discussion and conclusions

In this work, we presented a novel QSTR methodology based on the RBF NN architecture. The method was illustrated using a data set of 221 phenols and compared with standard MLR. Validation of the different QSTR methodologies was based on two evaluation procedures. In the first method the data were split into a training and a validation set and the model generated using the training set was used to predict toxicity in the validation set. The second method was the standard LOO cross-validation procedure. The modeling procedures used in this work illustrated the accuracy of the models produced, not only by calculating their fitness on sets of training data but also by testing the predicting abilities of the models.

The RBF NN models were produced based on the fuzzy-means training method, which is fast and repeatable, in contrast to most traditional training techniques. The model generated for the data set required five descriptors. In terms of the $R^{2}$, $R^{2}_{\mathrm{CV}}$ and RMS values, the RBF models proved to have a significant predictive potential. The results obtained illustrate that the RBF NN architecture can be used to derive QSTRs that are more accurate and have better generalization capabilities than linear regression models, at the expense of increased model complexity compared to the simple structure of a linear model. The method proposed could be a substitute for costly and time-consuming experiments for determining toxicity.

References

1. Lu FC, Kacew S (2002) Lu’s basic toxicology. Taylor & Francis, London
2. Karcher W, Devillers J (1990) SAR and QSAR in environmental chemistry and toxicology: scientific tool or wishful thinking? In: Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht, pp 1–12
3. Nendza M (1998) Structure-activity relationships in environmental sciences, ecotoxicology series 6. Chapman & Hall, London
4. Schultz TW, Netzeva TI, Cronin MTD (2003) SAR QSAR Environ Res 14:59–81
5. Netzeva TI, Schultz TW, Aptula AO, Cronin MTD (2003) SAR QSAR Environ Res 14:265–283
6. Zahouily M, Rhihil A, Bazoui H, Sebti S, Zakarya D (2002) J Mol Model 8:168–172
7. Cronin MTD, Aptula AO, Duffy JC, Netzeva TI, Rowe PH, Valkova IV, Schultz TW (2002) Chemosphere 49:1201–1221
8. Ren S (2003) Chemosphere 53:1053–1065
9. Bukard U (2003) Methods for data analysis. In: Gasteiger J, Engel Th (eds) Chemoinformatics. Wiley VCH, Weinheim, pp 439–485
10. Devillers J (1996) Neural networks in QSAR and drug design. Academic Press, London
11. Afantitis A, Melagraki G, Makridima K, Alexandridis A, Sarimveis H, Iglessi-Markopoulou O (2005) J Mol Struct: Theochem 716:193–198
12. Devillers J (2004) SAR QSAR Environ Res 15:237–249
13. Kaiser KLE (2003) Quant Struct-Act Relat 22:1–5
14. Kaiser KLE (2003) J Mol Struct: Theochem 622:85–95
15. Gasteiger J (2003) Handbook of chemoinformatics: from data to knowledge, vol 3. Wiley VCH, Weinheim
16. Zupan J, Gasteiger J (1999) Neural networks in chemistry and drug design. Wiley VCH, Weinheim
17. Debnath AK (2001) Quantitative structure-activity relationship (QSAR): a versatile tool in drug design. In: Ghose AK, Viswanadhan VN (eds) Combinatorial library design and evaluation: principles, software tools, and applications in drug discovery. Marcel Dekker, New York, pp 73–129
18. Sarimveis H, Alexandridis A, Tsekouras G, Bafas G (2002) Ind Eng Chem Res 41:751–759
19. Lessigiarska I, Cronin MTD, Worth AP, Dearden JC, Netzeva TI (2004) SAR QSAR Environ Res 15:169–190
20. Aptula AO, Netzeva TI, Valkova IV, Cronin MTD, Schultz TW, Kuhne R, Schuurmann G (2002) Quant Struct-Act Relat 21:12–22
21. Darken C, Moody J (1990) Fast adaptive K-means clustering: some empirical results. IEEE INNS Int Joint Conf Neural Netw 2:233–238
22. Dunn JC (1974) J Cybernet 3:32–57
23. Leonard JA, Kramer MA (1991) Radial basis function networks for classifying process faults. IEEE Control Syst 11:31–38
24. Kennard RW, Stone LA (1969) Technometrics 11:137–148


Acknowledgements

G.M. wishes to thank the Greek State Scholarship Foundation for a doctoral assistantship.

Author information

Authors and Affiliations

School of Chemical Engineering, National Technical University of Athens, 9 Heroon Polytechniou Str., Zografou Campus, Athens, 15780, Greece
Georgia Melagraki, Antreas Afantitis, Kalliopi Makridima, Haralambos Sarimveis & Olga Igglessi-Markopoulou


Corresponding author

Correspondence to Haralambos Sarimveis.


About this article

Cite this article

Melagraki, G., Afantitis, A., Makridima, K. et al. Prediction of toxicity using a novel RBF neural network training methodology. J Mol Model 12, 297–305 (2006). https://doi.org/10.1007/s00894-005-0032-8


Received: 31 March 2005
Accepted: 25 July 2005
Published: 08 November 2005
Issue Date: February 2006
DOI: https://doi.org/10.1007/s00894-005-0032-8
