Abstract
The Carcinogenicity Reliability Database (CRDB) was constructed by collecting experimental carcinogenicity data on about 1,500 chemicals from six sources, including the IARC and NTP databases, and by ranking the reliability of those data into six unified categories. A diverse set of 911 organic chemicals was selected from the database for QSAR modeling, and 1,504 molecular descriptors were calculated from their 3D molecular structures using the Dragon software. Positive (carcinogenic) and negative (non-carcinogenic) chemicals containing various substructures were counted using atom and functional group count descriptors, and the ratio of positives to negatives was tested for statistical significance for each substructure. Among substructures that biomedical studies have identified as responsible for carcinogenicity, very few were judged to be strongly related to carcinogenicity. To develop QSAR models that predict the carcinogenicity of a wide variety of chemicals with satisfactory performance, the relationship between the carcinogenicity data with improved reliability and a subset of significant descriptors selected from the 1,504 Dragon descriptors was analyzed with a support vector machine (SVM) method. The classification function (SVC) for weighted data in the LIBSVM program was used to classify chemicals into two carcinogenic categories (positive or negative), with weights set according to the reliability of the carcinogenicity data. The quality and stability of the models were tested with a dual cross-validation procedure. As a first step, a single SVM model was developed for all 911 chemicals using 250 selected descriptors; it achieved an overall accuracy, i.e., the fraction of correct positive and negative predictions, of about 70%.
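The overall accuracy quoted above is the share of chemicals whose positive or negative class is predicted correctly. As a minimal sketch, the following shows how overall accuracy (OA) and related measures such as the Matthews correlation coefficient (MCC) follow from the four confusion counts (TP, TN, FP, FN). The function name `confusion_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute OA, sensitivity, specificity and MCC from confusion counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Overall accuracy: correctly predicted positives and negatives over all chemicals.
    oa = (tp + tn) / total

    # Sensitivity and specificity are undefined when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")

    # Matthews correlation coefficient; set to 0.0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "OA": oa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "MCC": mcc,
    }


# Hypothetical counts for illustration only (not taken from the paper).
example = confusion_metrics(tp=40, tn=30, fp=20, fn=10)
print(example)
```

Reporting MCC alongside OA can be useful when positives and negatives are unevenly represented, because OA alone may look favorable for a model that mostly predicts the larger class.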
To improve the accuracy of the final model, the 911 chemicals were classified into 20 mutually overlapping subgroups according to the substructures they contain, a specific SVM model was optimized for each subgroup, and the predicted carcinogenicity of each of the 911 chemicals was determined by the majority of the outputs of the corresponding SVM models. The model based on grouping the chemicals into 20 substructure subgroups predicts the carcinogenicity of a wide variety of chemicals with a satisfactory overall accuracy of approximately 80%.
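The majority-vote step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the subgroup names, the toy threshold "models" standing in for the trained SVM models, and the tie-breaking rule (`tie_value`) are all hypothetical, since the abstract does not say how ties are resolved.

```python
from typing import Callable, Dict, List, Sequence

# A subgroup model maps a descriptor vector to +1 (positive) or -1 (negative).
Model = Callable[[Sequence[float]], int]


def majority_vote(
    descriptors: Sequence[float],
    memberships: List[str],
    models: Dict[str, Model],
    tie_value: int = 1,
) -> int:
    """Combine the outputs of the models for every subgroup a chemical belongs to."""
    votes = [models[group](descriptors) for group in memberships if group in models]
    if not votes:
        raise ValueError("chemical does not belong to any modeled subgroup")
    total = sum(votes)
    if total > 0:
        return 1
    if total < 0:
        return -1
    # Assumption: ties default to tie_value (positive here); the source does not specify this.
    return tie_value


# Toy stand-ins for per-subgroup SVM models (hypothetical names and thresholds).
toy_models: Dict[str, Model] = {
    "aromatic_amine": lambda x: 1 if x[0] > 0.5 else -1,
    "halogen": lambda x: 1 if x[1] > 0.5 else -1,
    "nitro": lambda x: 1 if x[0] + x[1] > 0.9 else -1,
}

chemical = [0.8, 0.2]  # hypothetical descriptor vector
prediction = majority_vote(chemical, ["aromatic_amine", "halogen", "nitro"], toy_models)
print(prediction)
```

Because the subgroups overlap, a chemical containing several substructures receives one vote per matching subgroup model, so the final call reflects all models relevant to that chemical rather than a single global model.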
Abbreviations
- QSAR: Quantitative structure–activity relationship
- PAH: Polycyclic aromatic hydrocarbons
- PTC: Predictive Toxicology Challenge
- EL: Ensemble learning
- ML: Machine learning
- NTP: US National Toxicology Program
- FDA: US Food and Drug Administration
- DB: Database
- ANN: Artificial neural network
- SVM: Support vector machine
- IARC: International Agency for Research on Cancer
- EU: European Union
- EPA: US Environmental Protection Agency
- ACGIH: American Conference of Governmental Industrial Hygienists
- JSOH: Japan Society for Occupational Health
- PRTR-MSDS: Pollutant Release and Transfer Register–Material Safety Data Sheet
- CRDB: Carcinogenicity Reliability Database
- CC: Correlation coefficient
- SVC: Support vector classification function in the LIBSVM program
- RBF: Radial basis kernel function
- CV: Cross-validation
- LOO: Leave-one-out
- AUROC: Area under the receiver operating characteristic curve
- MCC: Matthews correlation coefficient
- OA: Overall accuracy
- TP: True positive
- TN: True negative
- FP: False positive
- FN: False negative
Cite this article
Tanabe, K., Lučić, B., Amić, D. et al. Prediction of carcinogenicity for diverse chemicals based on substructure grouping and SVM modeling. Mol Divers 14, 789–802 (2010). https://doi.org/10.1007/s11030-010-9232-y
DOI: https://doi.org/10.1007/s11030-010-9232-y