
1 Introduction

The quantitative structure–activity/property relationship (QSAR/QSPR) approach is of considerable importance in food chemistry [1, 2], environmental chemistry [3], modern chemistry [4,5,6], biochemistry [7], nanotechnology [8, 9], and drug design [10, 11]. It is a mathematical, computer-aided search for compounds with desired activities or properties, guided by chemical intuition and experience. Once a structure–activity/property correlation has been established, any number of compounds, including those not yet synthesized, can be screened on a computer to select structures with the desired activity or properties. The most promising compounds can then be chosen for synthesis and experimental testing [12]. QSAR/QSPR studies therefore save cost and time in the development of new molecules as drugs, materials, additives, or for any other purpose. Although finding successful structure–activity models is not easy, the recent increase in the number of QSPR/QSAR papers indicates rapid evolution in this area. To obtain a significant correlation, it is essential to use appropriate descriptors, whether they are theoretical, empirical, or derived from readily available experimental properties of the structures [12]. Some descriptors reflect simple molecular properties and can therefore give insight into the physicochemical nature of the activity/property under consideration.

Given the growth of nanotechnology, modeling the properties of nanoparticles (NPs) and their toxicity to living organisms is very important [13,14,15]. Because it is difficult to conduct toxicological experiments or to measure the physical properties of NPs case by case, QSPR/QSAR offers a computationally efficient alternative that saves time, cost, and animal sacrifice. The first part of implementing a nano-QSPR/QSAR model consists of data collection (descriptors and endpoints) and data processing. The dataset can be obtained from the literature, databases, experiments, or a combination of several sources. To construct nano-QSPR/QSAR models, it is therefore important to identify a set of descriptors that can accurately represent both the properties of the NPs and the experimental conditions.

In recent years, the Simplified Molecular Input Line Entry System (SMILES) and quasi-SMILES descriptors have been examined by several researchers for QSPR/QSAR modeling [16,17,18,19]. SMILES represents the molecular structure, whereas quasi-SMILES can represent the molecular structure together with physicochemical properties and exposure conditions [8, 20, 21]. The SMILES of a molecule follows a set of rules that allow a molecular structure to be written as a sequence of atom and bond symbols; quasi-SMILES appends the physicochemical properties and experimental conditions as a string of characters after the SMILES symbols.

2 Principles of QSPR/QSAR Models

Although QSPR/QSAR modeling has been used for over five decades, many studies still do not follow the Organisation for Economic Co-operation and Development (OECD) guidelines. Figure 8.1 summarizes the best practices for each step of the QSPR/QSAR approach, drawn from models in the peer-reviewed literature. Dearden et al. have given a detailed description of common errors in QSPR/QSAR research [22].

Fig. 8.1 General flowchart for QSPR/QSAR modeling, divided into two parts: model training and testing

According to OECD guidelines, if a QSPR/QSAR study is to be reliable, the following five principles must be met: (i) a well-defined endpoint, (ii) an unambiguous algorithm, (iii) a defined applicability domain (AD), (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) a mechanistic interpretation, if possible.

3 Monte Carlo Technique for Nano-QSPR/QSAR

3.1 SMILES and Quasi-SMILES

SMILES is a chemical notation system designed by Weininger et al. [23, 24]. Following the principles of molecular graph theory, SMILES uses a very small, natural grammar to specify structural features precisely, and the notation is well suited to fast machine processing. Quasi-SMILES is an alternative to SMILES for substances whose physicochemical properties and experimental conditions must also be taken into account.

3.2 The Main Step for QSPR/QSAR Modeling by SMILES or Quasi-SMILES

CORrelation And Logic (CORAL) software (http://www.insilico.eu/coral) offers two options for building QSPR/QSAR models: models based on SMILES and models based on quasi-SMILES. The preparation of the input data for CORAL software is described below.

3.2.1 Dataset Preparation for Models Based on SMILES

A SMILES string represents a two-dimensional molecular graph as a one-dimensional string that can show the connectivity and chirality of a molecule. In most cases, many different valid SMILES strings can be written for the same structure; canonical SMILES gives a single ‘canonical’ form for any particular molecule. The molecular structures of the compounds of interest are converted to canonical SMILES using software such as Open Babel or the ACD/ChemSketch program. Figure 8.2a, b shows sample data based on SMILES and quasi-SMILES, respectively, as input for CORAL software. The first column indicates the set, the second is the compound ID, the third is the SMILES or quasi-SMILES, and the last column is the property/activity of interest.

Fig. 8.2 Sample of data based on (a) SMILES and (b) quasi-SMILES as input for CORAL software

3.2.2 Dataset Preparation for Models Based on Quasi-SMILES

To build QSPR/QSAR models that account for different physicochemical properties and/or experimental conditions of a substance, one can use quasi-SMILES instead of the SMILES of the molecules. Dataset preparation for quasi-SMILES is the same as for SMILES, except that the SMILES is replaced by the quasi-SMILES.

3.2.3 Quasi-SMILES Definition for Various Datasets/Endpoints

Quasi-SMILES is a sequence of symbols that represents not only the molecular structure but also the different conditions that can affect the endpoint under investigation. These eclectic data can include physical conditions such as temperature, pressure, and the assay used to obtain an endpoint, or the cell line type, exposure time, concentration, etc. used to obtain an activity. The type and number of eclectic data can differ between datasets.

Quasi-SMILES may be built from eclectic conditions only [4, 13] or from a combination of SMILES and eclectic conditions [5, 8]. Before codes are assigned, continuous eclectic conditions can be normalized by the following equation:

$$ {\text{Norm}}\left( {E_{i} } \right) = \frac{{\min \left( {E_{i} } \right) + E_{i} }}{{\min \left( {E_{i} } \right) + \max \left( {E_{i} } \right)}} $$
(8.1)

Here, \(E_{i}\) is the ith value of the physicochemical parameter E, min(\(E_{i}\)) is the minimum value of E, and max(\(E_{i}\)) is the maximum value of E.
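As a minimal sketch of Eq. 8.1, the normalization can be written in a few lines of Python. The function name `normalize` and the exposure times are hypothetical and serve only to illustrate the arithmetic; they are not taken from CORAL.

```python
def normalize(values):
    """Apply Eq. 8.1 to every value: (min + E_i) / (min + max)."""
    lo, hi = min(values), max(values)
    denom = lo + hi
    if denom == 0:
        raise ValueError("min + max is zero, so Eq. 8.1 is undefined")
    return [(lo + v) / denom for v in values]


# Hypothetical exposure times in hours, for illustration only.
exposure_times = [12.0, 24.0, 48.0]
normalized = normalize(exposure_times)
```

For these values the function returns 0.4, 0.6, and 1.0. As Eq. 8.1 is written, the largest value maps to 1, while the smallest maps to 2·min/(min + max) rather than to 0.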

According to Table 8.1, each parameter takes fewer than 10 unique values; each value can therefore be coded in the quasi-SMILES as a single character, a digit between zero and nine.

Table 8.1 Division of standardized physicochemical features into classes 1–9 according to their values

A further development of the CORAL software (CORAL-2020) allows experimental conditions to be displayed as groups of symbols enclosed in parentheses. Table 8.2 compares the codes of the latest version (CORAL-2020) with those of the old version of CORAL for creating quasi-SMILES in recently proposed models for the cytotoxicity of metal oxide NPs [4]. The CORAL-2020 codes are quite transparent and therefore more convenient for the user. Table 8.2 lists the codes used for the cell line, method, exposure time, concentration, nanoparticle size, and metal oxide type. Table 8.3 gives examples of quasi-SMILES obtained from these codes.

Table 8.2 Codes used for the cell line, method, exposure time, concentration, nanoparticle size, and metal oxide type to convert the experimental data into quasi-SMILES [4]
Table 8.3 Some examples for quasi-SMILES extracted by codes indicated in Table 8.2

Toropov and Toropova developed a QSAR model for the toxicity of ZnO NPs based on the new version of CORAL [14]. The experimental data, taken from the literature, are toxicity assessments of ZnO NPs and of ZnO NPs coated with polyethylene glycol (PEG), administered to rats by intraperitoneal injection (50, 100, 200 mg/kg) for one month. The toxic effects on renal factors, including creatinine, uric acid, and blood urea nitrogen, were measured 15 and 30 days after injection. Table 8.4 shows the quasi-SMILES attributes together with the experimental conditions. Table 8.5 gives examples of quasi-SMILES obtained from these conditions, with the related activity.

Table 8.4 Codes used as fragments of quasi-SMILES and their meaning
Table 8.5 Some examples for quasi-SMILES extracted by codes presented in Table 8.4

Toropova et al. developed a new nano-QSAR model based on quasi-SMILES for predicting the toxicity of nano-mixtures to Daphnia magna [25]. Binary mixtures of TiO2 NPs with one second component, namely AgNO3, Cd(NO3)2, Cu(NO3)2, CuSO4, Na2HAsO4, NaAsO2, benzylparaben, or benzophenone-3, were investigated. The quasi-SMILES contain the following information: (1) the second component of the mixture, represented by SMILES; (2) the core diameter of the TiO2 NPs; (3) the zeta potential of the TiO2 NPs; (4) the mole fraction of TiO2 NPs; (5) the mole fraction of the mixed substance; and (6) the exposure time. Figure 8.3 shows the transformation of the experimental conditions and substance into quasi-SMILES.

Fig. 8.3 Transfer of experimental data (experimental conditions and substance) into quasi-SMILES [25]

3.2.4 Model Development

Model development consists of several steps, all of which can be carried out within CORAL software; no additional software is required for data partitioning, descriptor generation, or model validation. The following sections describe the main steps of QSPR/QSAR modeling with CORAL software.

3.2.5 Dataset Splitting

After the dataset has been prepared and curated, the next step in building a QSAR/QSPR model for an endpoint with CORAL software (http://www.insilico.eu/coral) is to load an array of lines. Each line consists of four components:

  • The first column is the type of set: ‘+’, ‘−’, ‘#’, and ‘*’ indicate the active training, passive training, calibration, and validation sets, respectively (Fig. 8.2).

  • The second column, written immediately after the set type without a space, is the number or ID of the compound.

  • The third column is the SMILES or quasi-SMILES.

  • The last column is the endpoint value.

Once the input file has been prepared, the dataset is split randomly by CORAL software into active training, passive training, calibration, and validation sets, with the desired percentage for each set.
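The four-column layout and the random split can be sketched in Python as follows. This is an illustration rather than CORAL's own reader: it assumes whitespace-separated columns, an ASCII '-' (or the typographic '−') for the passive training set, and quasi-SMILES without embedded spaces. The function names, compound IDs, and endpoint values are hypothetical.

```python
import random
from collections import Counter

SET_NAMES = {"+": "active training", "-": "passive training",
             "#": "calibration", "*": "validation"}


def parse_line(line):
    """Split one input line into (set marker, ID, quasi-SMILES, endpoint)."""
    first, quasi_smiles, endpoint = line.split()
    marker, compound_id = first[0], first[1:]
    if marker == "\u2212":          # typographic minus, treated as '-'
        marker = "-"
    if marker not in SET_NAMES:
        raise ValueError(f"unknown set marker: {marker!r}")
    return marker, compound_id, quasi_smiles, float(endpoint)


def random_split(records, fractions, seed=0):
    """Reassign set markers at random according to the requested fractions."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    markers = list(fractions)
    result, start = [], 0
    for j, marker in enumerate(markers):
        if j == len(markers) - 1:   # last set takes whatever is left
            end = len(shuffled)
        else:
            end = start + round(fractions[marker] * len(shuffled))
        result += [(marker,) + rec[1:] for rec in shuffled[start:end]]
        start = end
    return result


# Hypothetical input lines, for illustration only.
lines = [
    "+1 CCO 0.52", "+2 CCN 0.61", "-3 CCC 0.40", "-4 CCCl 0.77",
    "#5 C=O 0.35", "#6 CC=O 0.48", "*7 OCCO 0.58", "*8 c1ccccc1 0.90",
]
records = [parse_line(line) for line in lines]
resplit = random_split(records, {"+": 0.25, "-": 0.25, "#": 0.25, "*": 0.25}, seed=1)
counts = Counter(marker for marker, *_ in resplit)
```

Fixing the seed makes a given split reproducible, which is convenient when several splits of the same dataset are compared.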

3.2.6 Monte Carlo Optimization Process

A quasi-SMILES is a sequence of attributes, each of which is assigned a coefficient called a correlation weight. Monte Carlo optimization adjusts the numerical values of these correlation weights so as to maximize the predictive potential of the model. Figure 8.4 shows the flowchart of one cycle of the Monte Carlo optimization of correlation weights (n is the number of correlation weights that contribute to model construction).

Fig. 8.4 Flowchart of one cycle of the Monte Carlo optimization for finding correct correlation weights (n is the number of correlation weights that contribute to model construction)
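The idea of one optimization cycle can be illustrated with a short Python sketch. It is not a transcription of the CORAL flowchart: it assumes that each of the n correlation weights is visited once in random order, is perturbed by a fixed step, and keeps the change only if the target function increases. The names `monte_carlo_cycle`, `toy_target`, and `goal`, and the toy target itself, are hypothetical stand-ins.

```python
import random


def monte_carlo_cycle(weights, target, step, rng):
    """Visit every correlation weight once in random order and keep only
    the perturbations that raise the target function."""
    best = target(weights)
    keys = list(weights)
    rng.shuffle(keys)
    for key in keys:
        old = weights[key]
        for delta in (step, -step):
            weights[key] = old + delta
            score = target(weights)
            if score > best:        # improvement: keep the new weight
                best = score
                break
            weights[key] = old      # no improvement: restore the old weight
    return best


# Toy stand-in for a target function: closeness to a known set of weights.
goal = {"C": 0.5, "O": -0.25, "=": 1.0}


def toy_target(w):
    return -sum((w[k] - goal[k]) ** 2 for k in goal)


rng = random.Random(0)
weights = {key: 0.0 for key in goal}
history = [monte_carlo_cycle(weights, toy_target, 0.25, rng) for _ in range(10)]
```

In a real model the toy target would be replaced by a target function computed from the training and calibration sets, and the step would typically be reduced as the optimization proceeds.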

CORAL software provides several target functions (TFs) for the Monte Carlo optimization [25,26,27,28,29]; four of them are introduced below:

$$ {\text{TF}}_{0} = r_{{{\text{AT}}}} + r_{{{\text{PT}}}} - \left| {r_{{{\text{AT}}}} - r_{{{\text{PT}}}} } \right| \times C $$
(8.2)
$$ {\text{TF}}_{1} = {\text{TF}}_{0} + {\text{IIC}}_{{\text{C}}} \times W_{{{\text{IIC}}}} $$
(8.3)
$$ {\text{TF}}_{2} = {\text{TF}}_{0} + {\text{CII}}_{{\text{C}}} \times W_{{{\text{CII}}}} $$
(8.4)
$$ {\text{TF}}_{3} = {\text{TF}}_{0} + {\text{IIC}}_{{\text{C}}} \times W_{{{\text{IIC}}}} + {\text{CII}}_{{\text{C}}} \times W_{{{\text{CII}}}} $$
(8.5)

\(r_{{{\text{AT}}}}\) and \(r_{{{\text{PT}}}}\) are the correlation coefficients between the experimental and predicted endpoints for the active and passive training sets, respectively. The empirical constant C and the weights \(W_{{{\text{IIC}}}}\) and \(W_{{{\text{CII}}}}\) take fixed numerical values [1, 18, 30,31,32,33].
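The four target functions can be sketched in Python, assuming that TF1, TF2, and TF3 each extend TF0 by the IIC term, the CII term, or both. The numerical values of C, W_IIC, and W_CII below are placeholders chosen for illustration; they are not stated in this chapter and should not be read as CORAL defaults.

```python
def tf0(r_at, r_pt, c):
    """Eq. 8.2: reward high and balanced correlation on both training sets."""
    return r_at + r_pt - abs(r_at - r_pt) * c


def tf1(r_at, r_pt, c, iic_c, w_iic):
    """TF0 plus the weighted index of ideality of correlation."""
    return tf0(r_at, r_pt, c) + iic_c * w_iic


def tf2(r_at, r_pt, c, cii_c, w_cii):
    """TF0 plus the weighted correlation intensity index."""
    return tf0(r_at, r_pt, c) + cii_c * w_cii


def tf3(r_at, r_pt, c, iic_c, w_iic, cii_c, w_cii):
    """TF0 plus both weighted indices."""
    return tf0(r_at, r_pt, c) + iic_c * w_iic + cii_c * w_cii


# Hypothetical inputs; C, W_IIC and W_CII are placeholders.
r_at, r_pt, c = 0.90, 0.80, 0.1
iic_c, cii_c = 0.70, 0.85
w_iic, w_cii = 0.5, 0.3
scores = {
    "TF0": tf0(r_at, r_pt, c),
    "TF1": tf1(r_at, r_pt, c, iic_c, w_iic),
    "TF2": tf2(r_at, r_pt, c, cii_c, w_cii),
    "TF3": tf3(r_at, r_pt, c, iic_c, w_iic, cii_c, w_cii),
}
```

The penalty term in TF0 grows with the gap between the two training sets, so a model that fits the active training set much better than the passive one scores lower than a balanced model with the same sum of correlations.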

\({\text{IIC}}_{{\text{C}}}\) is the index of ideality of correlation. It is obtained from the calibration set as follows:

$$ {\text{IIC}}_{{\text{C}}} = r_{{\text{C}}} \frac{{{\text{min}}\left( {{^{-}{\text{MAE}}}_{{\text{C}}} ,{^{+}{\text{MAE}}}_{{\text{C}}} } \right)}}{{{\text{max}}\left( {{^{-}{\text{MAE}}}_{{\text{C}}} ,{^{+}{\text{MAE}}}_{{\text{C}}} } \right)}} $$
(8.6)
$${^{-}{\text{MAE}}}_{{\text{C}}} = \frac{1}{{{^{-}N}}}\sum \left| {\Delta_{i} } \right|, {^{-}N}\,{\text{is}}\,{\text{the}}\,{\text{number}}\,{\text{of}}\,\Delta_{i} < 0 $$
(8.7)
$${^{+}{\text{MAE}}}_{{\text{C}}} = \frac{1}{{{^{+}N}}}\sum \left| {\Delta_{i} } \right|, {^{+}N}\,{\text{is}}\,{\text{the}}\,{\text{number}}\,{\text{of}}\,\Delta_{i} \ge 0 $$
(8.8)
$$ \Delta_{i} = {\text{Obs}}_{i} - {\text{Calc}}_{i} $$
(8.9)

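Here, \(r_{{\text{C}}}\) is the correlation coefficient between the experimental and predicted endpoints for the calibration set, and \({\text{Obs}}_{i}\) and \({\text{Calc}}_{i}\) are the experimental and predicted endpoint values for the \(i{\text{th}}\) compound.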
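Equations 8.6–8.9 can be sketched in Python as follows, taking the correlation coefficient to be the Pearson coefficient (an assumption). The helper names and the endpoint values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def iic(observed, calculated):
    """Index of ideality of correlation for one set (Eqs. 8.6-8.9)."""
    deltas = [o - c for o, c in zip(observed, calculated)]   # Eq. 8.9
    neg = [abs(d) for d in deltas if d < 0]
    pos = [abs(d) for d in deltas if d >= 0]
    if not neg or not pos:
        raise ValueError("both negative and non-negative deviations are needed")
    mae_neg = sum(neg) / len(neg)                            # Eq. 8.7
    mae_pos = sum(pos) / len(pos)                            # Eq. 8.8
    ratio = min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
    return pearson(observed, calculated) * ratio             # Eq. 8.6


# Hypothetical calibration-set endpoints, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
balanced = [1.5, 1.5, 3.5, 3.5]      # over- and under-prediction of equal size
unbalanced = [0.9, 2.3, 2.9, 4.1]    # over-predictions twice as large on average
iic_balanced = iic(obs, balanced)
iic_unbalanced = iic(obs, unbalanced)
```

When positive and negative deviations have the same mean absolute error, the ratio is 1 and the index equals the correlation coefficient; any imbalance between the two lowers it.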

The correlation intensity index (CII), like the IIC criterion, was developed to improve the quality of the Monte Carlo optimization used to build QSPR/QSAR models. CII is formulated as follows:

$$ {\text{CII}} = 1 - \sum\limits_{i} \max \left( {\Delta R_{i}^{2} ,0} \right) $$
(8.10)
$$ \Delta R_{i}^{2} = R_{i}^{2} - R^{2} $$
(8.11)

where \(R^{2}\) is the coefficient of determination for the whole set and \(R_{i}^{2}\) is the coefficient of determination for the set without the ith compound; only positive values of \(\Delta R_{i}^{2}\) contribute to the sum. If \(\Delta R_{i}^{2}\) is greater than zero, the ith compound acts in ‘opposition’ to the correlation between the experimental and calculated values of the set, because removing it improves the correlation.

A small sum of \(\Delta R_{i}^{2}\) means a more ‘intensive’ correlation.
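A leave-one-out sketch of the CII in Python is given below. It assumes that the coefficient of determination is the squared Pearson correlation between experimental and calculated values, and it counts only positive values of ΔR²; the function names and data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def cii(observed, calculated):
    """Correlation intensity index: 1 minus the sum of positive Delta R_i^2."""
    r2_all = r_squared(observed, calculated)
    total = 0.0
    for i in range(len(observed)):
        obs_i = observed[:i] + observed[i + 1:]        # set without compound i
        calc_i = calculated[:i] + calculated[i + 1:]
        delta = r_squared(obs_i, calc_i) - r2_all      # Eq. 8.11
        total += max(delta, 0.0)                       # negative values count as 0
    return 1.0 - total


# Hypothetical endpoints, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
cii_perfect = cii(obs, [1.0, 2.0, 3.0, 4.0, 5.0])
cii_outlier = cii(obs, [1.0, 2.0, 3.0, 4.0, 0.0])   # last compound opposes the trend
```

A set in which no single compound degrades the correlation gives a CII of 1, whereas a compound whose removal sharply raises R² pulls the index down.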

The CORAL model for an endpoint (EP) is defined by the following equation:

$$ {\text{EP}} = C_{0} + C_{1} \times {\text{DCW}}\left( {T,N} \right) $$
(8.12)

\(C_{0}\) and \(C_{1}\) are regression coefficients, T is a threshold, and N is the number of optimization cycles. DCW(T, N) is defined as follows:

$$ {\text{DCW}}\left( {T,N} \right) = \sum {\text{CW}}\left( {S_{k} } \right) $$
(8.13)

where \(S_{k}\) is a symbol (attribute) of the quasi-SMILES line and CW(\(S_{k}\)) is the correlation weight of \(S_{k}\).
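Equations 8.12 and 8.13 amount to a one-variable linear regression on the sum of correlation weights, which can be sketched in Python as follows. The sketch assumes that each quasi-SMILES has already been split into a list of attributes and that attributes without a correlation weight contribute zero; the weights, attribute lists, and endpoint values are hypothetical.

```python
def dcw(attributes, cw):
    """Eq. 8.13: sum of correlation weights over the attributes of one line.
    Attributes without a weight are assumed to contribute zero."""
    return sum(cw.get(a, 0.0) for a in attributes)


def fit_line(x, y):
    """Ordinary least squares for y = c0 + c1 * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    c1 = sxy / sxx
    return my - c1 * mx, c1


# Hypothetical correlation weights and training records.
cw = {"C": 0.5, "O": -0.25, "=": 1.0}
training = [(["C", "C", "O"], 2.5), (["C", "=", "O"], 3.5), (["C", "O"], 1.5)]

descriptors = [dcw(attrs, cw) for attrs, _ in training]
c0, c1 = fit_line(descriptors, [ep for _, ep in training])

# Eq. 8.12 applied to a new line; "X" has no weight and contributes zero.
predicted = c0 + c1 * dcw(["C", "C", "=", "O", "X"], cw)
```

Because the descriptor is a plain sum, the contribution of every attribute to the predicted endpoint can be read directly from its correlation weight and the slope.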

3.2.7 Applicability Domain

The AD of QSPR/QSAR models in CORAL software is determined in two steps, based on the distribution of SMILES or quasi-SMILES features in the training and calibration sets:

Step 1: the statistical defect (\(d_{k}\)) is calculated for each (unblocked) SMILES or quasi-SMILES feature (\(S_{k}\)) involved in building the model, using the following equation:

$$ d_{k} = \frac{{\left| {P\left( {S_{k} } \right) - P^{\prime}\left( {S_{k} } \right)} \right|}}{{N\left( {S_{k} } \right) + N^{\prime}\left( {S_{k} } \right)}} $$
(8.14)

Here, P(\(S_{k}\)) and P′(\(S_{k}\)) are the probabilities of \(S_{k}\) in the active training and calibration sets, respectively; N(\(S_{k}\)) and N′(\(S_{k}\)) are the frequencies of \(S_{k}\) in the active training and calibration sets, respectively.

Step 2: the statistical defect of each quasi-SMILES (\(D_{i}\)) is defined as the sum of the defects of its features:

$$ D_{i} = \mathop \sum \limits_{k = 1}^{{N_{{\text{A}}} }} d_{k} $$
(8.15)

Here, \(N_{{\text{A}}}\) denotes the number of non-blocked features in the quasi-SMILES.

Quasi-SMILES falls in the AD if:

$$ D_{i} < 2 \times \overline{D} $$
(8.16)

where \(\overline{D}\) represents average statistical defect of the training set.
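The two steps and the threshold of Eq. 8.16 can be sketched in Python. The sketch assumes that the frequency of a feature is the number of quasi-SMILES in a set that contain it, that its probability is this frequency divided by the size of the set, and that every listed feature of a quasi-SMILES is summed in Eq. 8.15. The function names and feature lists are hypothetical.

```python
def feature_defects(train, calib):
    """Eq. 8.14 for every feature seen in either set.

    train, calib: lists of feature lists (one list per quasi-SMILES).
    """
    features = {f for record in train + calib for f in record}
    defects = {}
    for f in features:
        n_train = sum(f in record for record in train)   # N(S_k)
        n_calib = sum(f in record for record in calib)   # N'(S_k)
        p_train = n_train / len(train)                   # P(S_k)
        p_calib = n_calib / len(calib)                   # P'(S_k)
        defects[f] = abs(p_train - p_calib) / (n_train + n_calib)
    return defects


def total_defect(record, defects):
    """Eq. 8.15: sum of d_k over the features of one quasi-SMILES."""
    return sum(defects[f] for f in record)


def in_domain(record, defects, train):
    """Eq. 8.16: D_i must be below twice the mean defect of the training set."""
    mean_d = sum(total_defect(r, defects) for r in train) / len(train)
    return total_defect(record, defects) < 2 * mean_d


# Hypothetical feature lists, for illustration only.
train = [["C", "O"], ["C", "N"], ["C", "O"], ["C", "Cl"]]
calib = [["C", "O"], ["C", "Cl"]]
defects = feature_defects(train, calib)
status = {tuple(r): in_domain(r, defects, train)
          for r in (["C", "O"], ["C", "Cl"], ["C", "N"])}
```

In this example the feature "N" occurs once in the training set and never in the calibration set, so the quasi-SMILES containing it exceeds the threshold and falls outside the domain. A feature that appears in neither set has no defect under Eq. 8.14, and the sketch raises a `KeyError` for it rather than guessing a value.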

3.2.8 Model Validation

Validation, the fourth OECD principle, is an intrinsic component for checking the robustness, predictability, and reliability of any QSPR/QSAR model. CORAL software offers three approaches for examining the robustness, reliability, and predictive potential of QSPR/QSAR models:

  • Internal validation

  • External validation

  • Y-scrambling or data randomization.

Various statistical criteria are calculated to validate the QSPR/QSAR models built by Monte Carlo optimization in CORAL software: the determination coefficient (\(R^{2}\)), concordance correlation coefficient (CCC), cross-validated correlation coefficient (\(Q^{2}\)), \(Q_{F1}^{2}\), \(Q_{F2}^{2}\), \(Q_{F3}^{2}\), standard error of estimation (s), mean absolute error (MAE), Fisher ratio (F), root-mean-square error (RMSE), \(R_{{\text{m}}}^{2}\), and the average \(R_{{\text{m}}}^{2}\) metric (\(\overline{{R_{{\text{m}}}^{2} }}\)). Table 8.6 gives the mathematical definitions of these statistical benchmarks of predictive potential for CORAL models.

Table 8.6 Mathematical formulation of different statistical benchmarks of the predictive potential for CORAL models
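A few of these criteria can be sketched in plain Python. The formulas below are the commonly used textbook definitions, written here as an assumption rather than copied from Table 8.6: \(R^{2}\) is taken as the squared Pearson correlation, and \(Q_{F1}^{2}\) compares the external prediction error with the spread of the external observations around the training-set mean. The function names and data are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def _sums(obs, pred):
    """Means and centered sums shared by r_squared and ccc."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return mo, mp, sxy, sxx, syy


def r_squared(obs, pred):
    """Squared Pearson correlation."""
    _, _, sxy, sxx, syy = _sums(obs, pred)
    return sxy * sxy / (sxx * syy)


def ccc(obs, pred):
    """Concordance correlation coefficient."""
    mo, mp, sxy, sxx, syy = _sums(obs, pred)
    return 2 * sxy / (sxx + syy + len(obs) * (mo - mp) ** 2)


def q2_f1(obs_ext, pred_ext, train_mean):
    """Q2_F1 for an external set, relative to the training-set mean."""
    press = sum((o - p) ** 2 for o, p in zip(obs_ext, pred_ext))
    spread = sum((o - train_mean) ** 2 for o in obs_ext)
    return 1 - press / spread


# Hypothetical external-set values, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 1.5, 3.5, 3.5]
report = {
    "RMSE": rmse(obs, pred),
    "MAE": mae(obs, pred),
    "R2": r_squared(obs, pred),
    "CCC": ccc(obs, pred),
    "Q2F1": q2_f1(obs, pred, train_mean=2.5),
}
```

Unlike \(R^{2}\), the CCC also penalizes systematic shifts between observed and predicted values, which is why the two can differ for the same set.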

3.2.9 Mechanistic Interpretation

The fifth OECD principle concerns mechanistic interpretation of the QSPR/QSAR model, where possible. Model interpretation identifies the attributes that critically influence the endpoint, and new compounds can then be designed on the basis of these attributes. In QSPR/QSAR modeling with CORAL software, the structural attributes (\(S_{k}\)) that recur across three or more different splits are used for the mechanistic interpretation [39,40,41,42]. According to previous studies, these structural attributes are divided into three categories:

  • Increasing factor if the CW(Sk) is positive in all splits and in three attempts,

  • Decreasing factor if the CW(Sk) is negative in all splits and in three attempts,

  • Undefined attributes if the CW(Sk) is both positive and negative [43,44,45].
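The three categories can be expressed as a small Python function that inspects the sign of the correlation weight of one attribute across all runs and splits. The function name, labels, and weights are hypothetical; a weight of exactly zero is treated here as 'undefined', which is an assumption.

```python
def classify_attribute(cw_values):
    """Classify one attribute from its correlation weights over all runs."""
    if all(w > 0 for w in cw_values):
        return "increase"
    if all(w < 0 for w in cw_values):
        return "decrease"
    return "undefined"


# Hypothetical correlation weights from three runs, for illustration only.
cw_by_attribute = {
    "C": [0.5, 0.7, 0.2],
    "O": [-0.3, -0.1, -0.6],
    "=": [0.4, -0.2, 0.1],
}
roles = {attr: classify_attribute(ws) for attr, ws in cw_by_attribute.items()}
```

Only attributes with a stable sign are read as promoters or suppressors of the endpoint; an attribute whose sign changes between runs is left uninterpreted.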

4 Examples of Quasi-SMILES-Based QSPR/QSAR Models

Some examples of QSAR/QSPR models based on quasi-SMILES, built with CORAL software using different TFs, are presented in Table 8.7.

Table 8.7 Some examples of QSAR/QSPR models based on quasi-SMILES with CORAL software using different TFs

5 Conclusion and Future Direction

QSPR/QSAR modeling based on SMILES and quasi-SMILES with CORAL software is useful for large datasets. QSPR/QSAR modeling in CORAL software generally follows the five OECD principles. In practice, additional principles that reflect the nature of the nanomaterial under investigation may be defined for nano-QSPR/QSAR. For example, such principles should take into account the test conditions and the quality of the equipment used.

Building QSPR/QSAR models for nanomaterials under different conditions with CORAL software is simple, and the resulting models are easy to use for prediction and to interpret. The target functions TF0–TF3 help to find reliable correlation weights, which is one of the important capabilities of CORAL for building sound QSPR/QSAR models. The type and number of input features can change the performance of a QSAR/QSPR model. CORAL software has one shortcoming, however: the user can use only the descriptors generated by CORAL, and descriptors produced by other descriptor generators cannot be added.

CORAL software offers only the Monte Carlo algorithm for finding correlation weights; supporting other algorithms could improve the performance of quasi-SMILES QSPR/QSAR models. Data splitting in CORAL software is random, and the option of using other splitting methods could increase the validity of the models. Because the correlation weights of the descriptors are calculated by a Monte Carlo approach, consensus modeling could considerably improve the prediction results.