Abstract
Quantitative structure–property/activity relationships (QSPRs/QSARs) are used to predict the physicochemical properties and biological activities of substances, on the premise that the property or activity of a new or untested substance can be inferred from the molecular structure or other properties of similar compounds whose properties/activities have already been assessed. Traditional QSPR/QSAR models based on physicochemical properties and molecular information are less successful in predicting endpoints of substances such as nanomaterials, because data measured under the same conditions are scarce. A new approach that uses eclectic information as descriptors to predict the endpoints of such materials was developed in the CORAL software (http://www.insilico.eu/coral). In this approach, the physicochemical properties and experimental conditions of a substance are represented by so-called quasi-SMILES, character-based representations derived from the traditional Simplified Molecular Input Line Entry System (SMILES). Thus, a main advantage of quasi-SMILES is that eclectic data increase the amount of data available for developing QSPR/QSAR models. This chapter provides instructions on how to use the CORAL software to build QSPR/QSAR models based on quasi-SMILES.
1 Introduction
The quantitative structure–activity/property relationship (QSAR/QSPR) approach is of considerable importance in food chemistry [1, 2], environmental chemistry [3], modern chemistry [4,5,6], biochemistry [7], nanotechnology [8, 9], and drug design [10, 11]. The QSAR/QSPR approach is the mathematical and computerized search for compounds with desired activities/properties using chemical intuition and experience. Once a structure–activity/property correlation has been established, any number of compounds, including those not yet synthesized, can be easily screened on a computer to select structures with the desired activity/properties. The most promising compounds can then be selected for synthesis and experimental testing [12]. A QSAR/QSPR study therefore saves cost and time in the development of new molecules as drugs, materials, additives, or for any other purpose. While finding successful structure–activity models is not an easy task, the recent increase in the number of papers in QSPR/QSAR research clearly indicates the rapid evolution of this area. To obtain a significant correlation, it is very important to use appropriate descriptors, whether they are theoretical, empirical, or derived from readily available experimental properties of the structures [12]. Some descriptors reflect simple molecular properties and can therefore give insight into the physicochemical nature of the activity/property under consideration.
Considering the growth of nanotechnology, modeling the properties of nanoparticles (NPs) and their toxicity to living organisms is very important [13,14,15]. Because it is difficult to conduct toxicological experiments or measure the physical properties of NPs on a case-by-case basis, QSPR/QSAR is an attractive, computationally efficient alternative that saves time, cost, and animal sacrifice. The first part of implementing a nano-QSPR/QSAR model comprises data collection (descriptors and endpoints) and data processing. The dataset can be obtained from the literature, databases, experiments, or a combination of several sources. To construct nano-QSPR/QSAR models, it is therefore important to identify a set of descriptors that can accurately represent both the properties of the NPs and the experimental conditions.
In recent years, Simplified Molecular Input Line Entry System (SMILES) and quasi-SMILES descriptors have been examined by several researchers for QSPR/QSAR modeling [16,17,18,19]. SMILES represents the molecular structure, whereas quasi-SMILES can represent the molecular structure together with physicochemical properties and exposure conditions [8, 20, 21]. The SMILES of a molecule is based on a set of rules that allow a molecular structure to be written as a sequence of atom and bond symbols; a quasi-SMILES appends the physicochemical properties and experimental conditions as a string of characters after the SMILES symbols.
2 Principles of QSPR/QSAR Models
Although QSPR/QSAR modeling has been used for over five decades, many studies still do not follow the Organisation for Economic Co-operation and Development (OECD) guidelines. Figure 8.1 summarizes the best practices for each step of the QSPR/QSAR approach, based on models in the peer-reviewed literature. Dearden et al. have reported a detailed description of common errors in QSPR/QSAR research [22].
According to OECD guidelines, if a QSPR/QSAR study is to be reliable, the following five principles must be met: (i) a well-defined endpoint, (ii) an unambiguous algorithm, (iii) a defined applicability domain (AD), (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) a mechanistic interpretation, if possible.
3 Monte Carlo Technique for Nano-QSPR/QSAR
3.1 SMILES and Quasi-SMILES
SMILES is a chemical notation system designed by Weininger et al. [23, 24]. Following the principles of molecular graph theory, SMILES uses a very small, natural grammar to specify precise structural features. The SMILES notation is also suitable for fast machine processing. Quasi-SMILES is an alternative to SMILES that additionally encodes the physicochemical properties and experimental conditions of a substance.
3.2 The Main Step for QSPR/QSAR Modeling by SMILES or Quasi-SMILES
CORrelation And Logic (CORAL) software (http://www.insilico.eu/coral) offers two ways of building QSPR/QSAR models: based on SMILES or based on quasi-SMILES. The preparation of the input data for the CORAL software is described below.
3.2.1 Dataset Preparation for Models Based on SMILES
A SMILES string represents a two-dimensional molecular graph as a one-dimensional string that can encode the connectivity and chirality of a molecule. In most cases, a single structure can be written as many different valid SMILES strings. Canonical SMILES gives a single ‘canonical’ form for any particular molecule. The molecular structures of the desired compounds can be converted to canonical SMILES with software such as Open Babel or ACD/ChemSketch. Figure 8.2a, b shows sample input data for CORAL software based on SMILES and quasi-SMILES, respectively. The first column indicates the set, the second the compound ID, the third the SMILES/quasi-SMILES, and the last the desired property/activity.
3.2.2 Dataset Preparation for Models Based on Quasi-SMILES
To build QSPR/QSAR models for data obtained under different physicochemical and/or experimental conditions, one can use quasi-SMILES instead of the SMILES of the molecules. Dataset preparation for quasi-SMILES is the same as for SMILES; only the SMILES is replaced by the quasi-SMILES.
3.2.3 Quasi-SMILES Definition for Various Datasets/Endpoints
A quasi-SMILES is a sequence of symbols that represents not only the molecular structure but also the different conditions that can affect the endpoint under investigation. Eclectic data can include physical conditions such as temperature, pressure, and the assay used to obtain a property endpoint, or cell line type, exposure time, concentration, etc. for an activity endpoint. The type and number of eclectic data can differ between datasets.
A quasi-SMILES may be built from eclectic conditions only [4, 13] or from a combination of SMILES and eclectic conditions [5, 8]. Continuous eclectic conditions can be normalized by the following equation before codes are assigned:
where \(E_{i}\) is the ith value of the physicochemical parameter E, min(\(E_{i}\)) is the minimum value of E, and max(\(E_{i}\)) is the maximum value of E.
According to Table 8.1, the number of unique values of each parameter was less than 10; therefore, each parameter could be coded in the quasi-SMILES by a single character, a digit between zero and nine.
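The coding step can be illustrated with a short Python sketch. It assumes that the normalization is the usual min-max scaling to the interval [0, 1], which is consistent with the symbols defined above but is not confirmed by the text, and that a normalized value is turned into a single digit by simple binning. The function names and the nanoparticle sizes are hypothetical and do not come from CORAL.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] (assumed min-max form of the normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_digit_code(norm_value, n_levels=10):
    """Map a normalized value in [0, 1] to one character, '0' to '9'."""
    level = min(int(norm_value * n_levels), n_levels - 1)
    return str(level)


sizes_nm = [10.0, 25.0, 40.0, 100.0]  # hypothetical nanoparticle sizes
codes = [to_digit_code(x) for x in min_max_normalize(sizes_nm)]
print(codes)
```

Each condition then contributes one character to the quasi-SMILES, so a record with six coded conditions yields a six-character string (possibly appended to a SMILES).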
A further development of the CORAL software (CORAL-2020) allows experimental conditions to be expressed through groups of symbols enclosed in parentheses. Table 8.2 compares the codes of the latest version (CORAL-2020) with those of the older version of CORAL for creating quasi-SMILES in recently proposed models for the cytotoxicity of metal oxide NPs [4]. The CORAL-2020 codes are more transparent and therefore more convenient for the user. Table 8.2 lists the codes used for the cell line, method, exposure time, concentration, nanoparticle size, and metal oxide type. Table 8.3 gives examples of quasi-SMILES built from these codes.
Toropov and Toropova developed a QSAR model based on the new version of CORAL for the toxicity of ZnO NPs [14]. The experimental data, taken from the literature, are toxicity assessments of ZnO NPs and ZnO NPs coated with polyethylene glycol (PEG), administered to rats by intraperitoneal injection (50, 100, 200 mg/kg) for one month. The toxic effects on renal factors, including creatinine, uric acid, and blood urea nitrogen, were measured 15 and 30 days after injection. Table 8.4 shows the quasi-SMILES attributes together with the experimental conditions. Table 8.5 gives examples of quasi-SMILES built from these conditions, together with the related activity.
Toropova et al. developed a new nano-QSAR model based on quasi-SMILES for predicting the toxicity of nano-mixtures to Daphnia magna [25]. Binary mixtures of TiO2 NPs with one second component, namely AgNO3, Cd(NO3)2, Cu(NO3)2, CuSO4, Na2HAsO4, NaAsO2, benzylparaben, or benzophenone-3, were investigated. The quasi-SMILES contain the following information: (1) the second component of the mixture, represented by SMILES; (2) the core diameter of the TiO2 NPs; (3) the zeta potential of the TiO2 NPs; (4) the mole fraction of the TiO2 NPs; (5) the mole fraction of the mixed substance; and (6) the exposure time. Figure 8.3 shows the transformation of the experimental conditions and the substance into the quasi-SMILES.
3.2.4 Model Development
Model development comprises several steps, all of which can be carried out within CORAL software; no additional software is required for data partitioning, descriptor generation, or model validation. The following sections describe the main steps of QSPR/QSAR modeling using CORAL software.
3.2.5 Dataset Splitting
After the preparation and curation of the dataset, the next step in building a QSAR/QSPR model for an endpoint with CORAL software (http://www.insilico.eu/coral) is to load the data as an array of lines. Each line consists of four components:
- The first column is the set type: ‘+’, ‘−’, ‘#’, and ‘*’ indicate the active training, passive training, calibration, and validation sets, respectively (Fig. 8.2).
- The second column, written immediately after the set type without a space, is the number or ID of the compound.
- The third column is the quasi-SMILES (or SMILES).
- The last column is the endpoint value.
After the input file has been prepared, the dataset is split randomly by CORAL software into active training, passive training, calibration, and validation sets, with the desired percentage of compounds in each set.
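The line layout and the random split described above can be sketched in Python. This is only an illustration: it assumes that the three fields are separated by whitespace, and the helper names, the example record, and the equal 25% fractions are hypothetical rather than taken from CORAL.

```python
import random

# Set markers described in the text.
SET_NAMES = {"+": "active training", "-": "passive training",
             "#": "calibration", "*": "validation"}


def parse_line(line):
    """Parse one record: set marker fused with the ID, structure, endpoint."""
    first, structure, endpoint = line.split()
    marker = first[0].replace("\u2212", "-")  # accept a typographic minus
    return {"set": SET_NAMES[marker], "id": first[1:],
            "structure": structure, "endpoint": float(endpoint)}


def random_split(records, fractions=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Randomly reassign records to the four sets in the given fractions."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    names = list(SET_NAMES.values())
    result, start = [], 0
    for k, (name, frac) in enumerate(zip(names, fractions)):
        # The last set takes whatever remains after rounding.
        stop = n if k == len(names) - 1 else min(n, start + round(frac * n))
        result += [dict(rec, set=name) for rec in shuffled[start:stop]]
        start = stop
    return result


example = parse_line("#12 O=[Zn]%11%20 1.37")  # hypothetical record
```

Fixing the seed makes a split reproducible, which is convenient when several splits of the same dataset are compared.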
3.2.6 Monte Carlo Optimization Process
A quasi-SMILES is a sequence of attributes, and each attribute is assigned a coefficient called a correlation weight. Monte Carlo optimization adjusts the numerical values of the correlation weights so as to maximize the predictive potential of the model. Figure 8.4 shows the flowchart of one cycle of the Monte Carlo optimization of the correlation weights (n is the number of correlation weights that contribute to model construction).
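The idea of stochastic optimization of correlation weights can be illustrated with a deliberately simplified Python sketch. It is not the CORAL routine shown in Fig. 8.4: it assumes a plain random-order search in which each weight is nudged up or down by a fixed step and the change is kept only if the correlation between the summed weights and the endpoint improves on a single training set. All names and the toy data are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def optimize_weights(samples, endpoints, n_epochs=30, step=0.1, seed=0):
    """Random-order search over correlation weights (simplified sketch).

    samples: one list of attribute symbols per quasi-SMILES.
    Returns (weights, initial_score, final_score).
    """
    rng = random.Random(seed)
    symbols = sorted({s for sample in samples for s in sample})
    weights = {s: rng.uniform(-1.0, 1.0) for s in symbols}

    def score(w):
        # Descriptor = sum of the weights of the symbols in each sample.
        dcw = [sum(w[s] for s in sample) for sample in samples]
        return pearson_r(dcw, endpoints)

    initial = best = score(weights)
    for _ in range(n_epochs):
        order = symbols[:]
        rng.shuffle(order)  # visit the weights in random order
        for s in order:
            for delta in (step, -step):
                old = weights[s]
                weights[s] = old + delta
                trial = score(weights)
                if trial > best:
                    best = trial  # keep the improving move
                    break
                weights[s] = old  # otherwise undo it
    return weights, initial, best


# Hypothetical toy data: three symbols and made-up endpoint values.
toy_samples = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "a", "b"], ["c", "c"]]
toy_endpoints = [1.0, 2.1, 2.9, 1.2, 4.0]
weights, r_start, r_end = optimize_weights(toy_samples, toy_endpoints)
```

In a real model the score would be one of the target functions discussed next, computed on the active and passive training sets, rather than a single correlation coefficient.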
CORAL software provides several target functions (TFs) for Monte Carlo optimization [25,26,27,28,29]; four of them, TF0 to TF3, are introduced here.
Here \(r_{\text{AT}}\) and \(r_{\text{PT}}\) are the correlation coefficients between the experimental and predicted endpoints for the active and passive training sets, respectively. The empirical constant C and the weights \(W_{\text{IIC}}\) and \(W_{\text{CII}}\) take fixed numerical values [1, 18, 30,31,32,33].
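For orientation, the four target functions are usually written in the CORAL literature in the form below. The display is a reconstruction from the symbols defined in this section and should be checked against the cited papers; \(\text{IIC}_{\text{C}}\) and \(\text{CII}_{\text{C}}\) denote the index of ideality correlation and the correlation intensity index computed on the calibration set.

```latex
\[
\begin{aligned}
\text{TF}_{0} &= r_{\text{AT}} + r_{\text{PT}} - \left| r_{\text{AT}} - r_{\text{PT}} \right| \times C \\
\text{TF}_{1} &= \text{TF}_{0} + \text{IIC}_{\text{C}} \times W_{\text{IIC}} \\
\text{TF}_{2} &= \text{TF}_{0} + \text{CII}_{\text{C}} \times W_{\text{CII}} \\
\text{TF}_{3} &= \text{TF}_{0} + \text{IIC}_{\text{C}} \times W_{\text{IIC}} + \text{CII}_{\text{C}} \times W_{\text{CII}}
\end{aligned}
\]
```

Written this way, TF0 rewards high correlation on both training sets while penalizing a gap between them, and TF1 to TF3 add calibration-set terms to it.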
\(\text{IIC}_{\text{C}}\) is the index of ideality correlation, calculated on the calibration set as follows:
Here \({\text{Obs}}_{i}\) and \({\text{Calc}}_{i}\) are the experimental and predicted endpoint values for the \(i{\text{th}}\) compound.
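A minimal Python sketch of this index follows. It assumes the form commonly reported in the CORAL literature: the correlation coefficient of the set multiplied by the ratio of the smaller to the larger of two mean absolute errors, one taken over compounds with negative residuals (Obs − Calc < 0) and one over compounds with non-negative residuals. The function names are hypothetical, and the handling of perfect predictions is a choice made here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def iic(obs, calc):
    """Index of ideality correlation for one set (assumed literature form)."""
    deltas = [o - c for o, c in zip(obs, calc)]
    neg = [abs(d) for d in deltas if d < 0]
    pos = [abs(d) for d in deltas if d >= 0]
    # Mean absolute errors of the negative and non-negative residuals.
    mae_neg = sum(neg) / len(neg) if neg else 0.0
    mae_pos = sum(pos) / len(pos) if pos else 0.0
    largest = max(mae_neg, mae_pos)
    if largest == 0.0:
        return pearson_r(obs, calc)  # perfect predictions: ratio taken as 1
    return pearson_r(obs, calc) * min(mae_neg, mae_pos) / largest
```

The index equals the correlation coefficient when over- and under-predictions are balanced and falls towards zero when the residuals are one-sided, so it penalizes systematic bias that the correlation coefficient alone does not detect.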
The correlation intensity index (CII), like the IIC, was developed to improve the quality of the Monte Carlo optimization used to build QSPR/QSAR models. CII is formulated as follows:
where \(R^{2}\) is the coefficient of determination for all endpoints of the set and \(R_{i}^{2}\) is the coefficient of determination for the set without the ith compound. Therefore, if \(\Delta R_{i}^{2}\) is greater than zero, the ith compound is in ‘opposition’ to the correlation between the experimental and calculated values of the set.
A small sum of \(\Delta R_{i}^{2}\) means a more ‘intensive’ correlation.
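The leave-one-out logic behind the CII can be sketched in Python. The sketch assumes the form usually given in the CORAL literature, namely \(\Delta R_{i}^{2} = R_{i}^{2} - R^{2}\) and CII equal to one minus the sum of the positive \(\Delta R_{i}^{2}\) values; this should be verified against the cited papers. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sxx * syy
    return sxy ** 2 / denom if denom > 0 else 0.0


def cii(obs, calc):
    """Correlation intensity index for one set (assumed literature form)."""
    r2_all = r_squared(obs, calc)
    protest = 0.0
    for i in range(len(obs)):
        # Coefficient of determination without the i-th compound.
        r2_without_i = r_squared(obs[:i] + obs[i + 1:], calc[:i] + calc[i + 1:])
        delta = r2_without_i - r2_all
        if delta > 0:
            protest += delta  # compound i opposes the correlation
    return 1.0 - protest
```

A set in which no compound improves the correlation when removed gives a CII of one; every 'opposing' compound lowers the value.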
The CORAL model for an endpoint (EP) is defined by the following equation:
\(C_{0}\) and \(C_{1}\) are regression coefficients, T is a threshold, and N is the number of optimization cycles. DCW(T, N) is defined by the following equation:
where \(S_{k}\) is a symbol of the quasi-SMILES line and CW(\(S_{k}\)) is the correlation weight of \(S_{k}\).
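Read together, these definitions describe a one-variable linear model, \(\text{EP} = C_{0} + C_{1} \times \text{DCW}(T, N)\), in which the descriptor is the sum of the correlation weights of the symbols, \(\text{DCW}(T, N) = \sum_{k} \text{CW}(S_{k})\). A minimal Python sketch under that reading is given below. The tokenized quasi-SMILES, the weights, and the coefficients are hypothetical, and treating symbols without a weight as blocked (weight zero) is an assumption.

```python
def dcw(symbols, weights):
    """Sum the correlation weights of the symbols of one quasi-SMILES."""
    # Symbols missing from the table are treated as blocked (weight 0.0).
    return sum(weights.get(s, 0.0) for s in symbols)


def predict_endpoint(symbols, weights, c0, c1):
    """Linear model: endpoint = c0 + c1 * DCW."""
    return c0 + c1 * dcw(symbols, weights)


# Hypothetical tokenized quasi-SMILES and correlation weights.
symbols = ["O", "=", "[Zn]", "%11"]
weights = {"O": 0.5, "=": -0.25, "[Zn]": 1.0}
print(predict_endpoint(symbols, weights, c0=0.1, c1=2.0))
```

Because the model is linear in DCW, \(C_{0}\) and \(C_{1}\) can be obtained by ordinary least squares once the correlation weights have been optimized.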
3.2.7 Applicability Domain
The AD of QSPR/QSAR models in CORAL software is determined in two steps, based on the distribution of SMILES or quasi-SMILES features in the training and calibration sets:
Step 1: the statistical defect (\(d_{k}\)) is calculated for each involved (unblocked) SMILES or quasi-SMILES feature (\(S_{k}\)) used to build the model, with the following equation:
Here, P(\(S_{k}\)) and P′(\(S_{k}\)) are the probabilities of \(S_{k}\) in the active training and calibration sets, respectively; N(\(S_{k}\)) and N′(\(S_{k}\)) are the frequencies of \(S_{k}\) in the active training and calibration sets, respectively.
Step 2: the statistical defect of each quasi-SMILES (\(D_{i}\)) is defined according to the following equation:
Here \(N_{A}\) is the number of non-blocked features in the quasi-SMILES.
A quasi-SMILES falls within the AD if:
where \(\overline{D}\) is the average statistical defect of the training set.
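The two steps can be sketched in Python. The sketch assumes the forms usually reported for CORAL: \(d_{k} = |P(S_{k}) - P'(S_{k})| / (N(S_{k}) + N'(S_{k}))\), \(D_{i}\) as the sum of \(d_{k}\) over the features of the quasi-SMILES, and the criterion \(D_{i} < 2 \times \overline{D}\). Counting a feature once per quasi-SMILES and giving features seen in neither set a defect of 1.0 are further assumptions; all names and data are hypothetical.

```python
def feature_defects(train, calib):
    """Statistical defect d_k of every feature seen in either set."""
    n_train, n_calib = len(train), len(calib)
    defects = {}
    for feature in {f for sample in train + calib for f in sample}:
        n1 = sum(feature in sample for sample in train)  # N(S_k)
        n2 = sum(feature in sample for sample in calib)  # N'(S_k)
        defects[feature] = abs(n1 / n_train - n2 / n_calib) / (n1 + n2)
    return defects


def sample_defect(sample, defects, unseen=1.0):
    """Statistical defect D_i: sum of d_k over the features of one sample."""
    return sum(defects.get(f, unseen) for f in sample)


def in_domain(sample, train, defects):
    """True if D_i is below twice the mean defect of the training set."""
    mean_train = sum(sample_defect(s, defects) for s in train) / len(train)
    return sample_defect(sample, defects) < 2 * mean_train


# Hypothetical feature lists for active training and calibration sets.
train = [["A", "B"], ["A", "C"], ["A", "B"]]
calib = [["A", "B"], ["A", "C"]]
defects = feature_defects(train, calib)
```

A quasi-SMILES built mostly from features that are equally common in both sets has a small defect, whereas rare or unevenly distributed features push it outside the domain.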
3.2.8 Model Validation
Validation, the fourth OECD principle, is essential for checking the robustness, predictivity, and reliability of any QSPR/QSAR model. CORAL software offers three approaches for examining the robustness, reliability, and predictive potential of QSPR/QSAR models:
- Internal validation
- External validation
- Y-scrambling or data randomization.
Various statistical criteria, such as the determination coefficient (\(R^{2}\)), concordance correlation coefficient (CCC), cross-validated correlation coefficient (\(Q^{2}\)), \(Q_{F1}^{2}\), \(Q_{F2}^{2}\), \(Q_{F3}^{2}\), standard error of estimation (s), mean absolute error (MAE), Fisher ratio (F), root-mean-square error (RMSE), \(R_{{\text{m}}}^{2}\), and the average \(R_{{\text{m}}}^{2}\) metric (\(\overline{{R_{{\text{m}}}^{2} }}\)), are calculated to validate the QSPR/QSAR models built by Monte Carlo optimization in CORAL software. Table 8.6 lists the mathematical equations of these statistical benchmarks of the predictive potential of CORAL models.
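Several of these criteria are simple to compute directly. The Python sketch below uses the standard textbook definitions of RMSE, MAE, CCC, and the three external-validation \(Q^{2}\) metrics; the exact forms in Table 8.6 may differ in notation, and the function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _press(obs, pred):
    """Sum of squared prediction errors."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred))


def rmse(obs, pred):
    return math.sqrt(_press(obs, pred) / len(obs))


def mae(obs, pred):
    return _mean([abs(o - p) for o, p in zip(obs, pred)])


def ccc(obs, pred):
    """Lin's concordance correlation coefficient."""
    mo, mp = _mean(obs), _mean(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return 2 * sxy / (sxx + syy + len(obs) * (mo - mp) ** 2)


def q2_f1(obs_ext, pred_ext, obs_train):
    """Reference variance taken around the training-set mean."""
    ref = sum((o - _mean(obs_train)) ** 2 for o in obs_ext)
    return 1 - _press(obs_ext, pred_ext) / ref


def q2_f2(obs_ext, pred_ext):
    """Reference variance taken around the external-set mean."""
    ref = sum((o - _mean(obs_ext)) ** 2 for o in obs_ext)
    return 1 - _press(obs_ext, pred_ext) / ref


def q2_f3(obs_ext, pred_ext, obs_train):
    """Errors and reference variance each scaled by their set size."""
    tss_train = sum((o - _mean(obs_train)) ** 2 for o in obs_train)
    return 1 - (_press(obs_ext, pred_ext) / len(obs_ext)) / (tss_train / len(obs_train))
```

The three \(Q^{2}\) variants differ only in the reference variance, which is why they can disagree when the external set covers a narrower endpoint range than the training set.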
3.2.9 Mechanistic Interpretation
The fifth OECD principle concerns the mechanistic interpretation of the QSPR/QSAR model, if possible. Model interpretation is used to identify the critical attributes that influence the endpoint; new compounds can then be designed on the basis of these attributes. In QSPR/QSAR modeling with CORAL software, the structural attributes (\(S_{k}\)) that recur in three or more different splits are used for the mechanistic interpretation [39,40,41,42]. According to previous studies, these structural attributes are divided into three categories:
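In published CORAL studies the attributes are commonly grouped by the sign of their correlation weights across the optimization runs or splits: promoters of increase (positive in all runs), promoters of decrease (negative in all runs), and undefined attributes (mixed signs). Assuming that convention, the classification can be sketched in Python as follows; the function name, attribute names, and weights are hypothetical.

```python
def classify_attribute(weights_per_split):
    """Classify one attribute from its correlation weights in several splits."""
    if all(w > 0 for w in weights_per_split):
        return "promoter of increase"
    if all(w < 0 for w in weights_per_split):
        return "promoter of decrease"
    return "undefined"


# Hypothetical correlation weights of three attributes in three splits.
attribute_weights = {
    "[Zn]": [0.8, 1.1, 0.6],
    "%11": [-0.4, -0.9, -0.2],
    "O": [0.3, -0.1, 0.5],
}
labels = {name: classify_attribute(w) for name, w in attribute_weights.items()}
print(labels)
```

Only attributes with a consistent sign support a mechanistic statement; attributes with mixed signs carry no clear direction of effect.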
4 Examples of Quasi-SMILES-Based QSPR/QSAR Models
Some examples of QSAR/QSPR models based on quasi-SMILES and built with CORAL software using different TFs are presented in Table 8.7.
5 Conclusion and Future Direction
QSPR/QSAR modeling based on SMILES and quasi-SMILES with CORAL software is useful for large datasets. QSPR/QSAR modeling in CORAL software generally follows the five OECD principles. Further principles that reflect the nature of the nanomaterial under investigation may be defined in practice for nano-QSPR/QSAR. For example, such principles should take into account the test conditions and the quality of the equipment used.
Using CORAL software to build QSPR/QSAR models for nanomaterials under different conditions is simple, and the resulting models are easy to apply and interpret. The target functions TF0–TF3 help to find reliable correlation weights, which is one of the important capabilities of CORAL for building QSPR/QSAR models. The type and number of input features can change the performance of a QSAR/QSPR model. One shortcoming of CORAL software is that the user can use only the descriptors built into CORAL; descriptors produced by other descriptor generators cannot be added.
In CORAL software, the Monte Carlo algorithm is the only method for finding correlation weights; other optimization algorithms might improve the performance of quasi-SMILES-based QSPR/QSAR models. Data splitting in CORAL software is random; the option of other splitting methods could increase the validity of the models. Because the correlation weights of the descriptors are obtained by a stochastic Monte Carlo procedure, consensus modeling over several runs can improve the predictions.
Abbreviations
- AD: Applicability Domain
- CCC: Concordance Correlation Coefficient
- CORAL: CORrelation And Logic
- CII: Correlation Intensity Index
- EP: Endpoint
- F: Fisher ratio
- IIC: Index of Ideality Correlation
- MAE: Mean Absolute Error
- NPs: Nanoparticles
- OECD: Organisation for Economic Co-operation and Development
- QSAR: Quantitative Structure–Activity Relationship
- QSPR: Quantitative Structure–Property Relationship
- RMSE: Root-Mean-Square Error
- SMILES: Simplified Molecular Input Line Entry System
- TF: Target Function
References
Ahmadi S, Ghanbari H, Lotfi S, Azimi N (2021) Mol Divers 25(1):87–97. https://doi.org/10.1007/s11030-019-10026-9
Achary PGR, Toropova AP, Toropov AA (2019) Food Res Int 122:40–46. https://doi.org/10.1016/j.foodres.2019.03.067
Kumar A, Kumar P (2021) J Hazard Mater 402:123777. https://doi.org/10.1016/j.jhazmat.2020.123777
Ahmadi S (2020) Chemosphere 242:125192. https://doi.org/10.1016/j.chemosphere.2019.125192
Ahmadi S, Ketabi S, Qomi M (2022) New J Chem 46:8827–8837. https://doi.org/10.1039/D2NJ00596D
Lotfi S, Ahmadi S, Kumar P (2021) RSC Adv 11:33849–33857. https://doi.org/10.1039/D1RA06861J
Ahmadi S, Khazaei MR, Abdolmaleki A (2014) Med Chem Res 23:1148–1161. https://doi.org/10.1007/s00044-013-0716-z
Ahmadi S, Aghabeygi S, Farahmandjou M, Azimi N (2021) Struct Chem 32:1893–1905. https://doi.org/10.1007/s11224-021-01748-4
Kumar P, Kumar A (2021) Nanotoxicology 15:1199–1214. https://doi.org/10.1080/17435390.2021.2008039
Ghasedi N, Ahmadi S, Ketabi S, Almasirad A (2022) J Recept Signal Transduct 42:418–428. https://doi.org/10.1080/10799893.2021.1988971
Ahmadi S, Moradi Z, Kumar A, Almasirad A (2022) J Recept Signal Transduct 42:361–372. https://doi.org/10.1080/10799893.2021.1957932
Karelson M, Lobanov VS, Katritzky AR (1996) Chem Rev 96:1027–1044. https://doi.org/10.1021/cr950202r
Ahmadi S, Toropova AP, Toropov AA (2020) Nanotoxicology 14:1118–1126. https://doi.org/10.1080/17435390.2020.1808252
Toropov AA, Toropova AP (2021) Sci Total Environ 772:145532. https://doi.org/10.1016/j.scitotenv.2021.145532
Toropov AA, Toropova AP (2020) Sci Total Environ 737:139720. https://doi.org/10.1016/j.scitotenv.2020.139720
Ahmadi S, Akbari A (2018) SAR QSAR Environ Res 29:895–909. https://doi.org/10.1080/1062936X.2018.1526821
Lotfi S, Ahmadi S, Kumar P (2022) RSC Adv 12:24988–24997. https://doi.org/10.1039/D2RA03936B
Jafari K, Fatemi MH, Toropova AP, Toropov AA (2022) Chemom Intell Lab Syst 222:104500. https://doi.org/10.1016/j.chemolab.2022.104500
Toropov A, Toropova A, Lombardo A, Roncaglioni A, Lavado G, Benfenati E (2021) SAR QSAR Environ Res 32:463–471. https://doi.org/10.1080/1062936X.2021.1914156
Choi J-S, Trinh TX, Yoon T-H, Kim J, Byun H-G (2019) Chemosphere 217:243–249. https://doi.org/10.1016/j.chemosphere.2018.11.014
Lotfi S, Ahmadi S, Zohrabi P (2020) Struct Chem 31:2257–2270. https://doi.org/10.1007/s11224-020-01568-y
Dearden JC, Cronin MTD, Kaiser KLE (2009) SAR QSAR Environ Res 20:241–266. https://doi.org/10.1080/10629360902949567
Weininger D (1988) J Chem Inf Comput Sci 28:31–36. https://doi.org/10.1021/ci00057a005
Weininger D, Weininger A, Weininger JL (1989) J Chem Inf Comput Sci 29:97–101. https://doi.org/10.1021/ci00062a008
Toropova AP, Toropov AA, Fjodorova N (2022) NanoImpact 28:100427. https://doi.org/10.1016/j.impact.2022.100427
Kumar P, Kumar A, Lal S, Singh D, Lotfi S, Ahmadi S (2022) J Mol Struct 1265:133437. https://doi.org/10.1016/j.molstruc.2022.133437
Azimi A, Ahmadi S, Kumar A, Qomi M, Almasirad A (2022) Polycycl Aromat Compd 1–21. https://doi.org/10.1080/10406638.2022.2067194
Ahmadi S, Lotfi S, Afshari S, Kumar P, Ghasemi E (2021) SAR QSAR Environ Res 32:1013–1031. https://doi.org/10.1080/1062936X.2021.2003429
Ahmadi S, Mehrabi M, Rezaei S, Mardafkan N (2019) J Mol Struct 1191:165–174. https://doi.org/10.1016/j.molstruc.2019.04.103
Nimbhal M, Bagri K, Kumar P, Kumar A (2020) Struct Chem 31:831–839. https://doi.org/10.1007/s11224-019-01468-w
Toropova AP, Duchowicz PR, Saavedra LM, Castro EA, Toropov AA (2020) Mol Inform 39:1900070. https://doi.org/10.1002/minf.201900070
Toropova AP, Toropov AA, Carnesecchi E, Benfenati E, Dorne JL (2020) Environ Sci Pollut Res 27:13339–13347. https://doi.org/10.1007/s11356-020-07820-6
Kumar P, Kumar A (2021) J Mol Struct 1246:131205. https://doi.org/10.1016/j.molstruc.2021.131205
Shayanfar A, Shayanfar S (2014) Eur J Pharm Sci 59:31–35. https://doi.org/10.1016/j.ejps.2014.03.007
Consonni V, Ballabio D, Todeschini R (2009) J Chem Inf Model 49:1669–1678. https://doi.org/10.1021/ci900115y
Roy K, Kar S (2014) Eur J Pharm Sci 62:111–114. https://doi.org/10.1016/j.ejps.2014.05.019
Lin LI-K (1992) Biometrics 48:599. https://doi.org/10.2307/2532314
Rücker C, Rücker G, Meringer M (2007) J Chem Inf Model 47:2345–2357. https://doi.org/10.1021/ci700157b
Manisha, Chauhan S, Kumar P, Kumar A (2019) SAR QSAR Environ Res 30:145–159. https://doi.org/10.1080/1062936X.2019.1568299
Kumar P, Kumar A, Sindhu J, Lal S (2019) Drug Res (Stuttg) 69:159–167. https://doi.org/10.1055/a-0652-5290
Kumar P, Kumar A, Sindhu J (2019) SAR QSAR Environ Res 30:63–80. https://doi.org/10.1080/1062936X.2018.1564067
Kumar P, Kumar A, Sindhu J (2019) SAR QSAR Environ Res 30:525–541. https://doi.org/10.1080/1062936X.2019.1629998
Toropov AA, Toropova AP, Benfenati E, Gini G, Leszczynska D, Leszczynski J (2012) Anticancer Agents Med Chem 12:807–817. https://doi.org/10.2174/187152012802650255
Nesmerak K, Toropov AA, Toropova AP, Kohoutova P, Waisser K (2013) Eur J Med Chem 67:111–114. https://doi.org/10.1016/j.ejmech.2013.05.031
Veselinović AM, Milosavljević JB, Toropov AA, Nikolić GM (2013) Eur J Pharm Sci 48:532–541. https://doi.org/10.1016/j.ejps.2012.12.021
Toropov AA, Kjeldsen F, Toropova AP (2022) Chemosphere 303:135086. https://doi.org/10.1016/j.chemosphere.2022.135086
Trinh TX, Choi J-S, Jeon H, Byun H-G, Yoon T-H, Kim J (2018) Chem Res Toxicol 31:183–190. https://doi.org/10.1021/acs.chemrestox.7b00303
Leone C, Bertuzzi EE, Toropova AP, Toropov AA, Benfenati E (2018) Chemosphere 210:52–56. https://doi.org/10.1016/j.chemosphere.2018.06.161
Toropova AP, Toropov AA, Benfenati E, Korenstein R, Leszczynska D, Leszczynski J (2015) Environ Sci Pollut Res 22:745–757. https://doi.org/10.1007/s11356-014-3566-4
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Ahmadi, S., Azimi, N. (2023). Quasi-SMILES-Based QSPR/QSAR Modeling. In: Toropova, A.P., Toropov, A.A. (eds) QSPR/QSAR Analysis Using SMILES and Quasi-SMILES. Challenges and Advances in Computational Chemistry and Physics, vol 33. Springer, Cham. https://doi.org/10.1007/978-3-031-28401-4_8
DOI: https://doi.org/10.1007/978-3-031-28401-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28400-7
Online ISBN: 978-3-031-28401-4
eBook Packages: Chemistry and Materials Science