
1 Introduction

1.1 History

The history of molecular descriptors, understood as a feature vector for each compound, is closely tied to the concept of molecular structure [1]. The years between 1860 and 1880 were marked by a strong dispute about the theory of molecular structure, which arose from studies on substances showing optical isomerism and from Kekulé's studies (1861–1867) on the structure of benzene [2].

Today, many chemical, physical, and biological characteristics of compounds are interpreted on the principle that they are consequences of the compounds' structural descriptors.

In 1868, Crum-Brown and Fraser [3] introduced the first formulation of a relationship between the bioactivity or property of a chemical (Φ) and its chemical constitution (C), expressed by the following equation:

$$\Phi = f\left( C \right)$$
(2.1)

Based on this concept, many studies were conducted on the relationship of molecular descriptors to observed properties, including the relationship between the anesthetic power of various aliphatic alcohols and their carbon chain length and molecular weight [4], between the color of disubstituted benzenes and the ortho, meta, or para orientation of their substituents [5], and between narcotic toxicity and solubility in water [6].

One of the most attractive quantitative structure–activity relationship (QSAR) approaches is the Hammett equation [7]. In 1937, Hammett showed a linear relationship between the rate constants of a series of methyl ester reactions with N(CH3)3 and the ionization equilibrium constants of the related carboxylic acids in aqueous solution at ambient temperature. The linear relationship between the ionization constant of the ester containing a substituent X in the meta (m) or para (p) orientation (KX) and the ionization constant of the unsubstituted ester (KH) is defined by the following formula:

$$\log \left( {\frac{{K_{X} }}{{K_{H} }}} \right) = \rho \cdot \sigma_{X},$$
(2.2)

where \(\sigma_{X}\) is the substituent constant, written as \(\sigma_{m}\) or \(\sigma_{p}\) for a substituent in the m or p position, respectively, and ρ is the reaction constant. The absolute value of σ, which varies for each substituent, measures the global electronic effect exerted on the reaction center by the presence of substituent X. The sign of σ is positive for electron-withdrawing and negative for electron-donating substituents. The electronic induction effect and the electronic resonance effect are denoted by \(\sigma_{I}\) and \(\sigma_{R}\), respectively; the resonance constant referred to the unsubstituted aromatic ring is represented by \(\sigma_{R}^{0}\). In this case, Hammett's equation is defined by the following equation:

$$\log \left( {\frac{{K_{X} }}{{K_{H} }}} \right) = \rho_{I} \cdot \sigma_{I} + \rho_{R} \cdot \sigma^{0}_{R}$$
(2.3)
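Equations (2.2) and (2.3) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the numerical inputs (a reaction constant of 1.0 and approximate para substituent constants of 0.78 for a nitro group and −0.27 for a methoxy group) are illustrative assumptions rather than values given in this chapter.

```python
def hammett_log_ratio(rho: float, sigma: float) -> float:
    """Eq. (2.2): log10(K_X / K_H) = rho * sigma_X."""
    return rho * sigma


def hammett_dual_log_ratio(rho_i: float, sigma_i: float,
                           rho_r: float, sigma_r0: float) -> float:
    """Eq. (2.3): separate inductive and resonance contributions."""
    return rho_i * sigma_i + rho_r * sigma_r0


# Illustrative inputs only (assumed, not taken from the chapter).
RHO = 1.0                # reaction constant of a reference reaction
SIGMA_P_NITRO = 0.78     # approximate para constant, electron-withdrawing
SIGMA_P_METHOXY = -0.27  # approximate para constant, electron-donating

for name, sigma in [("p-NO2", SIGMA_P_NITRO), ("p-OCH3", SIGMA_P_METHOXY)]:
    log_ratio = hammett_log_ratio(RHO, sigma)
    # A positive sigma raises K_X above K_H; a negative sigma lowers it.
    print(f"{name}: log(KX/KH) = {log_ratio:+.2f}, KX/KH = {10 ** log_ratio:.2f}")
```

With a positive ρ, the sign of σ alone decides whether the substituted compound has a larger or smaller constant than the unsubstituted reference.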

1.2 QSPR/QSAR Modeling

In cheminformatics, a QSPR/QSAR model, either qualitative or quantitative, is a mathematical function that can be used to describe the connection between the molecular structures of a series of chemical compounds and their physicochemical properties/biological activities [8,9,10,11,12,13,14].

This field of knowledge assumes that the activity or property of a compound depends on its structural features, which affect its overall activities and properties [15,16,17,18,19].

Despite the formal differences between different methodologies, each QSPR/QSAR method is based on a QSPR/QSAR table that can be generalized as presented in Fig. 2.1 [20].

Fig. 2.1 Flowchart of the combinatorial QSAR methodology (datasets, molecular descriptors, multiple training and test sets, methods, models, prediction, and validation)

The differences in various QSPR/QSAR studies can be explained in the following terms:

  • Endpoint value

  • Molecular descriptors

  • Optimization algorithms.

Endpoint values, as dependent variables, can generally be of three types:

  • Continuous

These endpoints are real values covering a certain range, e.g., physicochemical properties of compounds such as boiling point and melting point, or IC50 values and binding constants.

  • Categorical-related

These are classes of activities covering a certain range of values, e.g., active and inactive compounds, or adjacent classes of metabolic stability such as unstable, moderately stable, and stable.

  • Categorical-unrelated

These are classes of endpoints that do not relate to each other in any continuum, e.g., compounds that belong to different pharmacological categories, or compounds that are categorized as drugs vs. non-drugs.

Understanding this classification is important because the choice of descriptor types and modeling methods is often determined by the type of endpoint. In general, the latter two types require classification modeling methods, whereas the former type allows the use of linear regression modeling. The corresponding data analysis methods are called classification or continuous QSPR/QSAR.
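The difference between related and unrelated categorical endpoints shows up in how they are encoded before modeling. The following Python sketch uses hypothetical labels for four compounds: ordered classes are mapped to integers that preserve their order, while unrelated classes are one-hot encoded so that no artificial order is implied.

```python
# Hypothetical endpoint labels for four compounds (illustration only).
stability = ["unstable", "stable", "moderately stable", "stable"]
pharm_class = ["antiviral", "analgesic", "antiviral", "antibiotic"]

# Categorical-related endpoint: the classes lie on a continuum,
# so an integer code that preserves their order is meaningful.
ORDER = ["unstable", "moderately stable", "stable"]
ordinal = [ORDER.index(label) for label in stability]

# Categorical-unrelated endpoint: the classes have no natural order,
# so each class gets its own 0/1 column (one-hot encoding).
classes = sorted(set(pharm_class))
one_hot = [[int(label == c) for c in classes] for label in pharm_class]

print(ordinal)
print(classes)
print(one_hot)
```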

1.3 Molecular Descriptors

Chemical descriptors as independent features in QSPR/QSAR modeling are usually classified into the following two types:

  • Continuous

Many descriptors are continuous, such as molecular weight and the various molecular connectivity indices.

  • Categorical-related

Categorical descriptors include counts of functional groups and binary descriptors that indicate the presence or absence of a chemical functional group or an atom in a molecule.

1.3.1 Types of Molecular Descriptors

Molecular descriptors can be obtained from different representations of molecules. Knowing various types of descriptors is also critical for a fundamental understanding of QSPR/QSAR modeling because, as mentioned above, any modeling requires establishing a relationship between the chemical similarity of compounds and their target properties [21,22,23,24]. Chemical similarity is calculated in descriptor space using various similarity metrics [25]. For example, in the case of continuous molecular descriptors, the Euclidean distance in the descriptor space is an advisable choice of similarity metric, while in the case of binary descriptors metrics such as the Tanimoto coefficient or the Manhattan distance seem more appropriate.
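The three similarity metrics named above can be sketched in a few lines of Python. The descriptor vectors and fingerprints below are made-up illustrations, and the function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two continuous descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance; for 0/1 vectors it counts differing bits."""
    return sum(abs(a - b) for a, b in zip(x, y))


def tanimoto(a, b):
    """Tanimoto coefficient of two binary fingerprints (1 means identical bit sets)."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 1.0


# Made-up example data.
d_cont = euclidean([0.0, 0.0], [3.0, 4.0])
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 0, 1]
d_bits = manhattan(fp_a, fp_b)
s_bits = tanimoto(fp_a, fp_b)
print(d_cont, d_bits, round(s_bits, 3))
```

Note that Euclidean and Manhattan are distances (smaller means more similar), whereas the Tanimoto coefficient is a similarity (larger means more similar).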

The level of detail of molecular structure representations ranges from 0D to 4D.

0D Descriptors

The 0D models are the simplest molecular representation and hold no information about atom connections. The chemical formula, which lists the atom types and their occurrences within a molecule, is independent of any information about the molecular structure. Therefore, molecular descriptors obtained from the chemical formula are called 0D descriptors. The most common examples are atom-type counts, number of atoms, molecular weight (MW), and any function of atomic properties.
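As a minimal sketch of 0D descriptors, the following Python code derives atom counts, the total number of atoms, and the molecular weight from a chemical formula alone. The parser is a simplifying assumption: it handles only flat formulas without parentheses, and the small table of standard atomic masses covers only four elements.

```python
import re

# Standard atomic masses (g/mol), rounded; only a few elements for illustration.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def atom_counts(formula):
    """Count atoms per element in a flat formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    """Sum of atomic masses weighted by the atom counts."""
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())


counts = atom_counts("C2H6O")      # ethanol and dimethyl ether share this formula
n_atoms = sum(counts.values())
mw = molecular_weight("C2H6O")
print(counts, n_atoms, round(mw, 3))
```

Because ethanol and dimethyl ether have the same formula, they receive identical 0D descriptors, which illustrates that this level carries no structural information.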

1D Descriptors

The substructure list representation can be classified as a 1D description and consists of structural fragments of a molecule such as functional groups, bonds, rings, and substituents. Therefore, 1D descriptors do not carry full information about the molecular structure. These descriptors are insensitive to conformational changes and hence do not distinguish between isomers.

2D Descriptors

The 2D models include knowledge about the structure of the compound on the basis of its structural formula [26]. These models reflect only the topology of the molecule and are very common. Their advantage is that the topological model of the molecular structure implicitly contains information about the possible conformations of the molecule.

The internal atomic arrangement of compounds is evaluated by topological parameters [27]. They originate from the topological representation of molecules and can be regarded as structure-explicit descriptors. These parameters numerically encode information on molecular shape, size, branching, and the presence of heteroatoms and multiple bonds. They describe the connectivity of atoms through the nature of the chemical bonds.

Topological descriptors perform well in modeling diverse biological, physicochemical, and pharmacokinetic properties. A topological representation of the molecule is available as a molecular graph. This graph is defined in mathematical terms as \(G = \left( {V, E} \right)\), where V is a set of vertices corresponding to the atoms of the molecule and E is a set of edges, each representing a binary relation (a bond) between a pair of vertices.

A chemical graph is a non-numerical representation of the molecule, so a numerical representation of the graph is required for computing topological parameters [28].

Some common 2D descriptors are listed below together with their descriptions.
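One common numerical representation of the graph \(G = (V, E)\) uses an adjacency matrix and a topological distance matrix. The Python sketch below builds both for the hydrogen-suppressed graph of 2-methylbutane; the atom numbering is an arbitrary assumption made for the example.

```python
from collections import deque

# Hydrogen-suppressed graph of 2-methylbutane (arbitrary atom numbering):
# chain 0-1-2-3 with a methyl group (atom 4) on atom 1.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
n = 5

# Adjacency matrix: entry [u][v] is 1 when atoms u and v are bonded.
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1


def distance_matrix(adjacency):
    """Topological distances (number of bonds on the shortest path) via BFS."""
    size = len(adjacency)
    dist = [[0] * size for _ in range(size)]
    for source in range(size):
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            node, d = queue.popleft()
            dist[source][node] = d
            for nb in range(size):
                if adjacency[node][nb] and nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return dist


degrees = [sum(row) for row in adjacency]   # vertex degree = number of neighbors
distances = distance_matrix(adjacency)
print(degrees)
for row in distances:
    print(row)
```

Most of the topological indices in this section are simple functions of the vertex degrees or of the distance matrix.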

Wiener (W) Index

The Wiener index (W), a structure descriptor based on the classical molecular graph, has become one of the most widely applied descriptors in QSAR/QSPR approaches [29]. It is defined as the sum of the numbers of edges on the shortest paths between all pairs of vertices in a chemical graph.

When the graph G is a tree T, the Wiener index W(G) is given by the following equation:

$$W\left( G \right) = \mathop \sum \limits_{e \in E\left( G \right)} n_{1} \left( {e|G} \right)n_{2} \left( {e|G} \right)$$
(2.4)

where \(n_{1} \left( {e|G} \right)\) and \(n_{2} \left( {e|G} \right)\) are the numbers of vertices of G lying closer to one endpoint of the edge e and to its other endpoint, respectively.
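The two routes to the Wiener index, the sum of all shortest-path distances and the edge formula of Eq. (2.4), can be checked against each other. The Python sketch below does this for the hydrogen-suppressed tree of 2-methylbutane; the helper names and atom numbering are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def build_adj(edges):
    """Adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Topological distances from one vertex to all others."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def wiener_by_distances(edges):
    """Sum of shortest-path lengths over all unordered vertex pairs."""
    adj = build_adj(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    return sum(dist[u][v] for u, v in combinations(adj, 2))


def wiener_by_edges(edges):
    """Eq. (2.4): sum over edges of n1 * n2 (valid for trees)."""
    adj = build_adj(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    total = 0
    for u, v in edges:
        n1 = sum(1 for w in adj if dist[w][u] < dist[w][v])  # closer to u
        n2 = sum(1 for w in adj if dist[w][v] < dist[w][u])  # closer to v
        total += n1 * n2
    return total


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]  # 2-methylbutane
w_dist = wiener_by_distances(isopentane)
w_edge = wiener_by_edges(isopentane)
print(w_dist, w_edge)
```

Both functions return 18 for this tree. The agreement holds only for acyclic graphs, which is why Eq. (2.4) is stated for trees.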

Hyper-Wiener Index (WW)

The hyper-Wiener index of a chemical tree T is defined as the sum of the n1n2 products over all pairs of vertices u, v of T, that is, over all paths [30]. For comparison, the Wiener index W is the path number, defined as the sum of the distances between any two atoms in the molecule, counted in bonds. WW is calculated by multiplying the number of atoms on one side of each path by the number of atoms on the other side and summing these values over all paths. The Wiener index is thus restricted to bonds, whereas in the hyper-Wiener index bonds are replaced by paths.

Modified Wiener Index (W*)
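For trees, the path-based definition of the hyper-Wiener index coincides with a distance-based form, WW = ½ Σ (d + d²) summed over all unordered vertex pairs, where d is the topological distance. This distance-based form is not given in the chapter; it is used in the Python sketch below because it is simpler to compute, and the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_adj(edges):
    """Adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Topological distances from one vertex to all others."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def hyper_wiener(edges):
    """Distance-based hyper-Wiener index: half the sum of d + d^2 over all pairs."""
    adj = build_adj(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    twice = sum(dist[u][v] + dist[u][v] ** 2 for u, v in combinations(adj, 2))
    return twice // 2   # d + d^2 is always even, so the division is exact


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]  # 2-methylbutane
ww = hyper_wiener(isopentane)
print(ww)
```

For 2-methylbutane the paths of length 1, 2, and 3 contribute 18, 8, and 2 to the sum of n1n2 products, giving 28, which matches the value returned by the distance-based form.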

Bond contribution is determined by using the reciprocal of the number of atoms on each side of the bond [31].

Novel Wiener Index

It is obtained as an additive bond quantity, where each bond contribution is the product of the numbers of atoms lying closer to each of the two endpoints of the bond [32].

Connectivity Indices

Connectivity indices are structural invariants that are widely used in structure–property and structure–activity studies. They are based on graph-theoretical invariants that were introduced to calculate the branching index of alkanes [33].

Kier and Hall extended these indices and introduced valence connectivity indices to differentiate heteroatoms. Today, these indices have been applied to a wide range of biological and physicochemical properties [34]. Randic [35] proposed several requirements for topological indices: (i) they should be well-correlated with at least one property; (ii) have a structural interpretation; (iii) be normal and self-determining; (iv) be easily applied to local structure; (v) be free of empirical properties; and (vi) be independent of other parameters.
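The first-order connectivity (branching) index can be sketched as follows, assuming the usual Randić form in which each bond contributes the reciprocal square root of the product of the degrees of its two atoms. The formula is not written out in the chapter, and the function names are hypothetical.

```python
import math


def vertex_degrees(edges):
    """Number of bonded neighbors of each atom in the hydrogen-suppressed graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def randic_index(edges):
    """First-order connectivity index: sum over bonds of 1 / sqrt(d_u * d_v)."""
    deg = vertex_degrees(edges)
    return sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges)


n_pentane = [(0, 1), (1, 2), (2, 3), (3, 4)]
isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]  # 2-methylbutane

chi_linear = randic_index(n_pentane)
chi_branched = randic_index(isopentane)
print(round(chi_linear, 4), round(chi_branched, 4))
```

The branched isomer gives the smaller value, which is the sense in which the index measures branching.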

Higher Order Connectivity

These indices are weighted paths, where a higher weight is given to terminal bonds and a lower weight to less exposed internal bonds [36].

Kier Shape

This descriptor defines shape indices from molecular graphs. The shape of a molecule is characterized by the number of atoms and their bonding pattern, evaluated at various orders [37].

Balaban Index

It is also one of the most discriminating molecular descriptors. Its value is independent of the molecular size or the number of rings [38].

Zagreb Indices

These descriptors were among the first topological indices and were introduced in connection with the total π-energy of conjugated molecules. They are mainly used to characterize molecular size, flexibility, degree of branching, and overall shape [39].

Augmented Zagreb Index (AZI)

This index is based on the atom-bond connectivity (ABC) index. Its extreme values in chemical trees have been studied, and it can be used to obtain upper and lower bounds for chemical trees [40].
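The degree-based indices of the last two subsections can be sketched together. The Python code below assumes the standard definitions, which the chapter does not write out: the first Zagreb index M1 is the sum of squared vertex degrees, the second Zagreb index M2 is the sum over bonds of the product of the two degrees, and AZI is the sum over bonds of [d_u d_v / (d_u + d_v − 2)]³.

```python
def vertex_degrees(edges):
    """Number of bonded neighbors of each atom."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def first_zagreb(edges):
    """M1: sum of squared vertex degrees."""
    return sum(d * d for d in vertex_degrees(edges).values())


def second_zagreb(edges):
    """M2: sum over bonds of the product of the endpoint degrees."""
    deg = vertex_degrees(edges)
    return sum(deg[u] * deg[v] for u, v in edges)


def augmented_zagreb(edges):
    """AZI: sum over bonds of (d_u * d_v / (d_u + d_v - 2)) cubed."""
    deg = vertex_degrees(edges)
    total = 0.0
    for u, v in edges:
        denom = deg[u] + deg[v] - 2
        if denom == 0:
            # Happens only for an isolated two-atom graph.
            raise ValueError("AZI is undefined for an isolated edge")
        total += (deg[u] * deg[v] / denom) ** 3
    return total


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]  # 2-methylbutane
m1 = first_zagreb(isopentane)
m2 = second_zagreb(isopentane)
azi = augmented_zagreb(isopentane)
print(m1, m2, azi)
```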

Hosoya Index (Z)

It is used to construct QSAR/QSPR models that describe physical properties [41].

Modified Hosoya Index (Z*)

The frequency of occurrence of single C–C bonds in disjoint bond patterns is considered [42].
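The chapter does not define Z explicitly. Under the usual definition, the Hosoya index counts all sets of mutually non-adjacent bonds in the molecular graph, including the empty set. A minimal recursive Python sketch follows; it is exponential in the number of bonds and is meant for small graphs only.

```python
def hosoya_index(edges):
    """Count all sets of pairwise disjoint bonds, including the empty set.

    Recursion on the first bond e = (u, v): either e is left out, or e is
    chosen and every other bond touching u or v is discarded.
    """
    if not edges:
        return 1
    (u, v), rest = edges[0], edges[1:]
    without_e = hosoya_index(rest)
    with_e = hosoya_index([e for e in rest if u not in e and v not in e])
    return without_e + with_e


n_pentane = [(0, 1), (1, 2), (2, 3), (3, 4)]
isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]  # 2-methylbutane

z_linear = hosoya_index(n_pentane)
z_branched = hosoya_index(isopentane)
print(z_linear, z_branched)
```

For linear alkanes the values follow the Fibonacci sequence, and branching lowers the count.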

Autocorrelation Indices

An autocorrelation index is a function of spatial separation and is particularly advantageous for QSAR/QSPR studies [43].

Szeged (SZ)

It is obtained as an additive bond quantity, where each bond contribution is the product of the numbers of atoms lying closer to each of the two endpoints of the bond [44].

Most of these parameters belong to the topological descriptors. They have therefore been widely used in QSAR/QSPR modeling to determine the structural similarity or dissimilarity of chemical compounds.
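The bond-additive description above applies the n1·n2 product of Eq. (2.4) to any connected graph, including cyclic ones. The Python sketch below computes it for an acyclic example and for a six-membered ring; the helper names and example graphs are assumptions chosen for illustration.

```python
from collections import deque


def build_adj(edges):
    """Adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Topological distances from one vertex to all others."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def szeged_index(edges):
    """Sum over bonds of n1 * n2; atoms equidistant from both ends are not counted."""
    adj = build_adj(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    total = 0
    for u, v in edges:
        n1 = sum(1 for w in adj if dist[w][u] < dist[w][v])
        n2 = sum(1 for w in adj if dist[w][v] < dist[w][u])
        total += n1 * n2
    return total


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]          # acyclic
six_ring = [(i, (i + 1) % 6) for i in range(6)]         # cyclohexane skeleton

sz_tree = szeged_index(isopentane)
sz_ring = szeged_index(six_ring)
print(sz_tree, sz_ring)
```

For the acyclic graph the result equals the Wiener index (18), whereas for the six-membered ring it is 54, twice the sum of all pairwise distances (27), so the two indices differ once rings are present.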

Topological Maximum Cross Correlation (TMACC)

Topological Maximum Cross Correlation (TMACC) descriptors were developed in 2007 [45]. They are generated from atomic properties determined by molecular topology and are based on concepts derived from autocorrelation descriptors. The interpretability of TMACC descriptors was demonstrated by Spowage et al. through QSAR modeling of angiotensin-converting enzyme (ACE) and dihydrofolate reductase (DHFR) inhibitors [46]. Altogether, TMACC revealed specific properties for C domain-selective ACE inhibition, which was an improvement on prior QSAR studies [46].

Physicochemical descriptors are the physical and chemical features of a molecule that are evaluated by examining its 2D structure. These features play a main role in characterizing the disposition of a drug in the body. Favorable characteristics of a drug can enhance its effect and thus its market value.

Therefore, investigating these features of a drug not only contributes to the overall assessment of drug safety but also plays a significant role in drug discovery by optimizing the selected compounds. It is thus necessary to pay attention to properties such as solubility, permeability, and lipophilicity that can ensure optimal potency, and to select candidate compounds with suitable physicochemical properties.

The lipophilicity of a drug describes its affinity for a lipophilic environment. It is an essential feature in the movement of drugs in the body, which includes intestinal absorption, membrane permeation, protein binding, and distribution among multiple tissues [47].

Generally, a drug with low lipophilicity exhibits poor absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties [48]. Many studies of in vitro cellular permeability have demonstrated its connection to lipophilicity and to other parameters, such as molecular size, hydrophilicity, hydrogen bonding, and degree of ionization. These factors are recognized to have a considerable role in the intestinal absorption of a molecule. Molecular size is a major factor influencing biological processes such as intestinal absorption.

Hydrogen bond donors and lipophilicity play considerable roles in predicting human intestinal permeability [49]. A high MW is associated with reduced permeability. Solubility in water plays a significant role in the distribution of drugs, their permeation through biological membranes, and their transport and sorption.

3D Descriptors

The 3D QSAR models [50,51,52,53] provide complete structural data, including the composition, topology, and spatial (steric) form of the molecule, for only one conformer. These models are the most common. Geometrical descriptors are computed from the 3D coordinates of the atoms in a given molecule. Compared with topological descriptors, they carry more information and have greater discriminating power for similar chemical structures and molecular conformations [54].

In addition, they also contain information obtained from atomic van der Waals areas and their contribution to the molecular surface. In spite of their high information content, these parameters normally have drawbacks.

Geometrical descriptors require geometry optimization and therefore carry a computational overhead. For flexible molecules, which can adopt several conformations, additional information is available and can be exploited; however, this can increase the complexity considerably. In addition, most grid-based descriptors require alignment rules to make molecules comparable. Different classes of descriptors can be distinguished within the set of geometrical descriptors [54].

A variety of 3D descriptors is available, some of which are listed below:

3D-Molecular Representation of Structures Based on Electron Diffraction (MoRSE)

MoRSE descriptors have been shown to have good modeling power for various biological and physicochemical properties and can also be used to simulate infrared spectra [55].

Weighted Holistic Invariant Molecular (WHIM)

WHIM descriptors are applied to obtain 3D information about molecular shape, size, symmetry, and atom distribution and have been used to model several physicochemical and toxicological properties. At least ten distinct types of WHIM descriptors capturing distinct molecular characteristics have been developed [54].

3D Autocorrelation

Using the autocorrelation function, these parameters are computed at individual points on the molecular surface. They are unique for a specific geometry, sensitive to conformational change, and invariant to roto-translation [56].

GEometry, Topology, Atom-Weights AssemblY (GETAWAY)

These parameters are based on spatial autocorrelation formulas, in which the atoms are weighted by van der Waals volume, atomic mass, and electronegativity together with 3D information. Depending on the information content and the matrix operator used, seven types of GETAWAY descriptors have been proposed so far [54].

4D Descriptors

In 3D descriptors, the choice of the analyzed conformer is often arbitrary. The most adequate description of the molecular structure is provided by 4D-QSAR models [57]. These models are similar to 3D models, but the structural information is considered for a set of conformers (in essence, the fourth dimension) rather than for one fixed conformation.

A representation of the molecular descriptors used in QSPR/QSAR modeling is shown in Fig. 2.2.

Fig. 2.2 Representation of molecular descriptors used in QSPR/QSAR modeling (geometrical, topological, thermodynamic, constitutional, and electronic descriptors)

1.3.2 Molecular Descriptors’ Resources

To obtain a meaningful correlation in QSAR studies, suitable descriptors must be used, whether they are empirical, theoretical, or derived from easily accessible experimental features of the molecules. Many descriptors reflect simple molecular features and thus can provide insight into the physicochemical nature of the property or activity under study.

Quantum Chemical Descriptors. Quantum chemical computations are an important source of new molecular descriptors that can, in principle, represent all electronic and geometrical properties of molecules and their interactions.

Quantum chemical and molecular modeling techniques make it possible to define a large number of molecular and local quantities that characterize the shape, reactivity, and binding properties of an entire molecule as well as of its fragments and substituents.

In recent years, quantum chemical parameters have become significant in QSAR models, helping researchers explain the biological activities and toxicity mechanisms of various chemicals. In past decades, semiempirical calculations were the primary way to generate descriptors owing to the limitations of the available software and hardware. Recent advances in computational hardware and the development of efficient algorithms have helped to extend molecular quantum mechanical computations. In particular, parameters derived from density functional theory (DFT) and hybrid density functional calculations (e.g., mPW1PW91) have excellent potential because they are more accurate than semiempirical procedures and perform well in geometrical, electrostatic, and orbital energy calculations [58,59,60,61].

Because many theoretical descriptors encode a large amount of well-defined physical information, their use in the design of training sets in QSAR studies offers two significant advantages: (a) molecules and their various fragments and substituents can be characterized directly on the basis of their molecular structure alone, and (b) the proposed mechanism of action can be accounted for directly in terms of the chemical reactivity of the studied compounds [62]. As a result, the derived QSAR models contain information on the nature of the intermolecular interactions involved in determining the biological or other properties of the investigated compounds. The most commonly used quantum chemical descriptors can be classified as follows:

Geometry Descriptors. The bond lengths, bond angles, and dihedral angles of the common parent fragment should be the same for all molecules in the series.

Atomic Charges. According to classical chemical theory, all chemical interactions are either orbital (covalent) or electrostatic (polar) in nature. The electric charges in the molecule are clearly the driving force of the electrostatic interactions. Indeed, local electron densities or charges have been shown to be important in a large number of physicochemical properties and chemical reactions. Therefore, charge-based descriptors have been broadly used as indicators of chemical reactivity or as measures of weak intermolecular interactions. Numerous quantum chemical descriptors are derived from partial charges. Partial atomic charges are known as indicators of static chemical reactivity [63]. The computed σ- and π-electron densities on a specific atom characterize the possible orientation of the chemical interactions and, hence, are often regarded as indices of directional reactivity. In contrast, total electron densities and net charges on atoms are regarded as indicators of non-directional reactivity. Several sums of absolute or squared values of partial charges have also been used to characterize intermolecular interactions, e.g., solute–solvent interactions [64,65,66].

Molecular Orbital Energies. The energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) are very popular quantum chemical descriptors. It has been shown [67] that these orbitals play an important role in governing various chemical reactions and in determining electronic band gaps in solids. They are also responsible for the formation of several charge transfer complexes [63, 68]. According to the frontier molecular orbital (FMO) theory of chemical reactivity, the formation of a transition state is due to the interaction between the frontier orbitals (HOMO and LUMO) of the reacting species [69]. Treating the frontier molecular orbitals separately from the other orbitals is therefore based on the general principles governing the nature of chemical reactions [69]. The HOMO energy is directly related to the ionization potential and characterizes the susceptibility of the molecule to attack by electrophiles. The LUMO energy is directly related to the electron affinity and characterizes the susceptibility of the molecule to attack by nucleophiles. Both the HOMO and the LUMO energies are important in radical reactions [70, 71]. The concept of soft and hard nucleophiles and electrophiles is also related to the relative energies of the HOMO and LUMO orbitals.

Soft nucleophiles have high-energy HOMOs, and hard nucleophiles have low-energy HOMOs. Soft electrophiles have low-energy LUMOs, and hard electrophiles have high-energy LUMOs [72]. The HOMO–LUMO gap, i.e., the energy difference between the HOMO and the LUMO, is a major stability indicator [73].

$$E_{{{\text{gap}}}} = E_{{{\text{LUMO}}}} - E_{{{\text{HOMO}}}}$$
(2.5)

A large HOMO–LUMO gap indicates high stability of the molecule, in the sense of its lower reactivity in chemical reactions [67]. The HOMO–LUMO gap has also been used as an estimate of the lowest excitation energy of the molecule. However, this approximation ignores electronic reorganization in the excited state and hence may often lead to incorrect theoretical results. The hardness (η) and softness (S) are also defined on the basis of the HOMO–LUMO energy gap:

$$\eta = \frac{{\left( {E_{{{\text{LUMO}}}} - E_{{{\text{HOMO}}}} } \right)}}{2}$$
(2.6)
$$S = \frac{1}{2\eta }$$
(2.7)

Activation hardness distinguishes between the rates of reaction at various sites of the molecule and is therefore relevant for predicting orientation effects [67]. The qualitative description of hardness is closely connected to polarizability, as a reduction in the energy gap normally results in an easier polarization of the molecule [74].

Frontier Orbital Densities. Frontier orbital electron densities on atoms provide a useful means for the detailed description of donor–acceptor interactions [71, 75]. According to the theory of frontier electron reactivity, most chemical reactions happen at the location and in the direction where the overlap of the HOMO and LUMO of the respective reactants can be maximized [69].

In the case of a donor molecule, both the ionization potential (IE) and the HOMO density (electrophilic electron density, \(f_{r}^{E}\)) are critical to charge transfer:
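Equations (2.5) to (2.7) translate directly into code. In the Python sketch below, the orbital energies are made-up values in eV chosen for illustration, and the function names are hypothetical.

```python
def homo_lumo_gap(e_homo, e_lumo):
    """Eq. (2.5): E_gap = E_LUMO - E_HOMO."""
    return e_lumo - e_homo


def hardness(e_homo, e_lumo):
    """Eq. (2.6): eta = (E_LUMO - E_HOMO) / 2."""
    return (e_lumo - e_homo) / 2.0


def softness(e_homo, e_lumo):
    """Eq. (2.7): S = 1 / (2 * eta)."""
    return 1.0 / (2.0 * hardness(e_homo, e_lumo))


# Made-up orbital energies in eV (illustration only).
E_HOMO = -6.0
E_LUMO = -1.0

gap = homo_lumo_gap(E_HOMO, E_LUMO)
eta = hardness(E_HOMO, E_LUMO)
s = softness(E_HOMO, E_LUMO)
print(gap, eta, s)
```

With these definitions the softness is simply the reciprocal of the gap, so a small gap means a soft, easily polarized molecule.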

$$f_{r}^{E} = \sum_{n} \left( C_{\text{HOMO},n} \right)^{2} ;\;\;\; C_{\text{HOMO},n} \;\text{are the atomic orbital coefficients in the HOMO}$$
(2.8)
$${\text{IE}} = - E_{{{\text{HOMO}}}}$$
(2.9)

and in the case of an acceptor molecule, the LUMO density (nucleophilic electron density, \(f_{r}^{N}\)) and the electron affinity (EA) are critical [63].

$$f_{r}^{N} = \sum_{n} \left( C_{\text{LUMO},n} \right)^{2} ;\;\; C_{\text{LUMO},n} \;\text{are the atomic orbital coefficients in the LUMO}$$
(2.10)
$${\text{EA}} = - E_{{{\text{LUMO}}}}$$
(2.11)

These descriptors have been applied in QSAR studies to characterize drug–receptor interaction sites. To compare the reactivities of different molecules, the frontier electron densities should be normalized by the energy of the corresponding frontier molecular orbitals; molecules with lower ionization potentials are then predicted to be more reactive as nucleophiles. The absolute electronegativity (χ), electrophilicity index (ω), and electron charge transfer (∆N) are also determined from the ionization potential (I) and electron affinity (A), where μ = −χ is the electronic chemical potential and η is the hardness:

$$\chi = \frac{{\left( {I + A} \right)}}{2}\quad {\text{absolute electronegativity}}$$
(2.12)
$$\omega = \frac{{\mu^{2} }}{2\eta }\quad {\text{electrophilicity index}}$$
(2.13)
$$\Delta N = \frac{{(\mu_{B} - \mu_{A} )}}{{2\left( {\eta_{A} + \eta_{B} } \right)}}\quad {\text{electron charge transfer}}$$
(2.14)
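The following Python sketch evaluates Eqs. (2.12) to (2.14), taking I and A from Eqs. (2.9) and (2.11). It assumes that the chemical potential μ is the negative of χ and that the hardness is (I − A)/2, which is Eq. (2.6) rewritten with those two relations. The orbital energies are made-up values in eV, and the function names are hypothetical.

```python
def reactivity_indices(e_homo, e_lumo):
    """Return (chi, mu, eta, omega) from frontier orbital energies."""
    ionization = -e_homo                      # Eq. (2.9)
    affinity = -e_lumo                        # Eq. (2.11)
    chi = (ionization + affinity) / 2.0       # Eq. (2.12)
    mu = -chi                                 # assumed: chemical potential = -chi
    eta = (ionization - affinity) / 2.0       # Eq. (2.6) written with I and A
    omega = mu ** 2 / (2.0 * eta)             # Eq. (2.13)
    return chi, mu, eta, omega


def charge_transfer(mu_a, eta_a, mu_b, eta_b):
    """Eq. (2.14): fraction of electrons transferred between A and B."""
    return (mu_b - mu_a) / (2.0 * (eta_a + eta_b))


# Made-up frontier orbital energies in eV for two species A and B.
chi_a, mu_a, eta_a, omega_a = reactivity_indices(-6.0, -1.0)
chi_b, mu_b, eta_b, omega_b = reactivity_indices(-5.0, 1.0)

delta_n = charge_transfer(mu_a, eta_a, mu_b, eta_b)
print(chi_a, mu_a, eta_a, omega_a)
print(round(delta_n, 4))
```

A positive ∆N here means that electrons flow toward the species with the lower chemical potential, which is A in this example.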

Molecular Polarizability. The polarization of a molecule by an external electric field [76] is given in terms of the nth-order susceptibility tensors of the molecular bulk. The first-order term is referred to as the polarizability (α):

$$\alpha = \frac{1}{3}\left( {\alpha_{xx} + \alpha_{yy} + \alpha_{zz} } \right)$$
(2.15)

The second-order term is called the first hyperpolarizability, and so on. The most significant property of the molecular polarizability is its relation to the molecular bulk or molar volume [73]. Polarizability values have been shown to be related to hydrophobicity and thus to other biological activities [77,78,79]. In addition, the electronic polarizability of molecules shares common features with the electrophilic super-delocalizability [80]. The first-order polarizability tensor includes information about possible inductive interactions in the molecule [70, 73, 81, 82]. The total anisotropy of the polarizability (second-order term) characterizes the properties of a molecule as an electron acceptor:

$$\beta^{2} = \frac{1}{2}\left[ \left( \alpha_{xx} - \alpha_{yy} \right)^{2} + \left( \alpha_{yy} - \alpha_{zz} \right)^{2} + \left( \alpha_{zz} - \alpha_{xx} \right)^{2} \right]$$
(2.16)
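Equations (2.15) and (2.16) use only the diagonal elements of the polarizability tensor. A minimal Python sketch follows, with made-up tensor components in arbitrary units and hypothetical function names.

```python
def mean_polarizability(a_xx, a_yy, a_zz):
    """Eq. (2.15): average of the diagonal polarizability components."""
    return (a_xx + a_yy + a_zz) / 3.0


def polarizability_anisotropy_sq(a_xx, a_yy, a_zz):
    """Eq. (2.16): beta^2 from the pairwise differences of the diagonal components."""
    return 0.5 * ((a_xx - a_yy) ** 2 + (a_yy - a_zz) ** 2 + (a_zz - a_xx) ** 2)


# Made-up diagonal components (arbitrary units, illustration only).
alpha = mean_polarizability(10.0, 12.0, 14.0)
beta_sq = polarizability_anisotropy_sq(10.0, 12.0, 14.0)
print(alpha, beta_sq)

# A perfectly isotropic molecule has zero anisotropy.
print(polarizability_anisotropy_sq(5.0, 5.0, 5.0))
```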

Dipole Moment and Polarity Indices. The polarity of a molecule is essential for several physicochemical properties, and a large number of descriptors have been suggested to estimate polarity effects. For instance, molecular polarity accounts for chromatographic retention on a polar stationary phase [65, 83]. The dipole moment (μ) is the most obvious descriptor and is often used to describe the polarity of the molecule [64, 65, 70, 81, 84]. Other polarity indices are the difference between net charges on atoms (∆) [68, 84] and the topological electronic index (TE) [68]:

$$T_{E} = \mathop \sum \limits_{ij,i \ne j} \frac{{\left| {q_{i} - q_{j} } \right|}}{{r_{ij}^{2} }}$$
(2.17)

The quadrupole moment tensor can also be applied as an index to characterize possible electrostatic interactions. However, such tensors depend on the choice of the coordinate system, and thus the orientation of the common parent fragment must be the same for all molecules in the series [70].
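The topological electronic index of Eq. (2.17) can be sketched from atomic charges and 3D coordinates. The Python code below assumes that the sum runs over unordered atom pairs (i < j); summing over ordered pairs would simply double the value. The charges and coordinates are made-up illustrations, not computed data.

```python
from itertools import combinations


def topological_electronic_index(charges, coords):
    """Eq. (2.17): sum of |q_i - q_j| / r_ij^2 over unordered atom pairs (assumed)."""
    total = 0.0
    for i, j in combinations(range(len(charges)), 2):
        # Squared interatomic distance r_ij^2.
        r_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        total += abs(charges[i] - charges[j]) / r_sq
    return total


# Made-up three-atom example: one negative center and two equal positive atoms.
charges = [-0.8, 0.4, 0.4]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

t_e = topological_electronic_index(charges, coords)
print(round(t_e, 3))
```

The two equally charged atoms contribute nothing to the sum, so only the charge separations between unlike atoms, weighted by the inverse squared distance, determine the index.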

Energy. The total energy computed by quantum mechanical methods has been presented as a good descriptor in several cases [64, 68, 85, 86].

In addition, thermodynamic parameters, including entropy (S°), internal energy (Eth), enthalpy (H°), free energy (G°), zero-point vibrational energy (ZPE), and constant-volume heat capacity (CV°), can be computed from quantum mechanical frequency calculations. Reaction enthalpy (∆H), entropy (∆S), and free energy (∆G) can be calculated from the differences in heats of formation, entropies, and free energies of formation between reactants and products or between conjugate forms [87, 88]. The protonation energy, defined as the difference between the total energies of the protonated and neutral forms of the molecule, can be regarded as a good measure of the strength of hydrogen bonds (the higher the energy, the stronger the bond) and can be used to identify the position of the most favorable hydrogen bond acceptor [89].

Other Descriptors. The descriptors considered above form the bulk of the quantum chemical descriptors effectively used in QSAR/QSPR studies. Other descriptors have also been designed that do not fall into the categories mentioned above, such as vibrational frequencies and NMR chemical shifts.

1.3.3 Empirical and Experimental Descriptors

Quantum chemical and molecular modeling techniques allow the description of many molecular and local quantities that determine the reactivity, binding features, and shape of a molecule as well as of its moieties and substituents. Theoretical molecular descriptors can be combined in a principled way with empirical descriptors, such as Hammett's substituent constants (σm and σp) [90, 91], Swain–Lupton's field and resonance constants (F and R) [92], the hydrophobic constant (П) [92], Taft's steric parameter (Es) [92], and Verloop's steric parameters [90, 91], and with experimental descriptors (substituent-induced chemical shifts, molecular weight, and molecular refractivity (MR) [92]). Table 2.1 lists the empirical and experimental descriptors.

Table 2.1 List of empirical and experimental descriptors

The substituent descriptors mentioned above can be grouped into three main clusters: (a) descriptors that capture the effects of the substituent on the aromatic ring (electronic charges on the ring carbon atoms, resonance and field substituent constants, and substituent-induced chemical shifts); (b) descriptors characterizing the bulk properties of the substituents (Verloop's steric parameters and the molecular refractivity), which cluster with theoretical descriptors describing the polarizability of the substituents, the molecular polarizability anisotropy, the dispersion interaction terms (IP*ANIS, IP*ΣПmol), and the electrophilic super-delocalizability of the substituent; the hydrophobic parameter П lies close to this cluster, as do the solvent-accessible hydrophobic surface of the substituent and the electrophilic super-delocalizability and polarizability of the benzene ring; and (c) experimental and theoretical dipole moments of the molecules and substituents, together with their squares.

The abbreviations are defined as follows:

IP = ionization potential derived from the AM1 wave function.

ANIS = anisotropy of the molecular polarizability.

IP*ANIS = product of the molecular ionization potential and the anisotropy of the molecular polarizability.

IP*ΣПmol = product of the molecular ionization potential and the sum of the self-atom polarizability over all the atoms of the molecule.

ΣПXX = sum of the self-atom polarizability values of the substituent atoms.

ΣПmol = sum of the self-atom polarizability over all the atoms of the molecule.

Σ\(S_{X}^{H}\) = sum of the electrophilic super-delocalizability on the substituent atoms.

Σ\(S_{E,X}\) = sum of the electrophilic super-delocalizability (computed over all the occupied molecular orbitals) on the substituent atoms.

Σ\(S_{N,X}\) = sum of the nucleophilic super-delocalizability (computed over all the unoccupied molecular orbitals) on the substituent atoms.

Principal component analysis of these descriptors gives the following picture: (a) Hammett substituent constants, substituent-induced chemical shifts, and Taft and Lupton’s resonance constants are modeled by the first component, whose major contributions come from the electronic charges of the carbon atoms of the benzene ring, the electrophilic super-delocalizability of the benzene ring, and the energies of the frontier molecular orbitals; (b) Verloop’s steric descriptors and the molecular refractivity, along with the substituent van der Waals volumes and molecular weight, are modeled by the second principal component, which includes theoretical parameters describing polarizability (ΣПXX, ANIS, ΣПmol), dispersion forces (IP*ΣПmol, IP*ANIS), and substituent reactivity indices (Σ\(S_{X}^{H}\), Σ\(S_{E,X}\), and Σ\(S_{N,X}\)). These latter indices probably reflect the contribution of the molecular orbital extension to molecular shape; (c) the third component models the lipophobic descriptor λar and the lipophilic descriptor П. The parameters that contribute to this component are the dipole moments (including the group dipole moment, μar) and their square terms, the solvent-accessible surfaces of the substituent, the energy difference between the HOMO and the LUMO (GAP), the π-symmetry component of the electronic charges, and the polarizability of the ring.

However, λar and П are not modeled solely by this component, as they also contribute significantly to the first and the third components, respectively. This suggests that more than one type of substituent effect determines the values of these parameters. The same holds for the steric descriptor Es, which is modeled by both the first and the second components. These findings are consistent with other studies aimed at modeling П [96] and Es [97] and support the composite character of these empirical parameters.

Empirical scales called principal properties (PPs), which describe the physicochemical features of the twenty naturally coded amino acids, were recently developed by Sjostrom and Wold [98].

Sjostrom et al. applied the PPs in the same way to categorize several types of signal peptides of different lengths [99]. Carlson and co-workers have used principal component analysis (PCA) of multivariate characterization (MVC) data to derive PPs describing the physicochemical properties of organic solvents [100], Lewis acids in organic synthesis [101], amines in the Willgerodt–Kindler reaction [102], and aldehydes/ketones [103].

These PPs are now heavily used in their laboratory to explore the scope and limitations of new organic reactions. PPs of amino acids may be suitable, for instance, for the screening of peptides [104]. Extending the PPs to a large number of aromatic substituents has been a goal of researchers, but unfortunately it is very difficult to find experimental information that has been evaluated in a consistent manner for a large number of substituents. Therefore, the next best kind of data must be used: well-known and widely used physicochemical parameters that are available for a large number of substituents.

The empirical parameters used to characterize a class of monosubstituted benzenes were П, MR, σm, σp [92, 105], and the Verloop descriptors L and B1–B4 [106]. The Verloop parameters B1–B4, derived from STERIMOL calculations, are normally listed in order of increasing magnitude. The variables were chosen to describe steric bulk (MR), hydrophobicity (П), the shape of each substituent (Verloop parameters), and electronic properties (sigmas).

In this case, the variables fall into three known groups: hydrophobicity/bulk, electronic, and size.

The numerical values of the loadings show that the first component is strongly related to steric bulk and hydrophobicity, because the length (L), molecular refractivity, and П have the largest contributions. The second component is dominated by the two electronic descriptors, σm and σp, while the third component again mainly reflects hydrophobicity (П) but also shape, since L and B1–B4 (Verloop parameters) [106] have relatively large contributions.
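The kind of loading analysis described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of the cited studies: the function name `pca_loadings` is hypothetical, PCA is done by singular value decomposition of autoscaled data, and the data matrix is random placeholder data, so the printed loadings will not reproduce the pattern reported in the text.

```python
import numpy as np


def pca_loadings(X, n_components=3):
    """PCA via SVD on autoscaled (mean-centered, unit-variance) data.

    Returns loadings (variables x components), scores (samples x components),
    and the fraction of variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Z @ loadings
    explained = (s**2 / np.sum(s**2))[:n_components]
    return loadings, scores, explained


# Descriptor names follow the text; the values are RANDOM PLACEHOLDERS,
# standing in for a real substituents x descriptors table.
names = ["PI", "MR", "sigma_m", "sigma_p", "L", "B1", "B2", "B3", "B4"]
rng = np.random.default_rng(0)
X = rng.normal(size=(30, len(names)))

loadings, scores, explained = pca_loadings(X, n_components=3)
for k in range(loadings.shape[1]):
    # Report the variable with the largest absolute loading on component k.
    top = names[int(np.argmax(np.abs(loadings[:, k])))]
    print(f"PC{k + 1}: {explained[k]:.2%} of variance, largest loading on {top}")
```

With a real descriptor table in place of `X`, the variables with the largest absolute loadings on each component indicate what that component mainly describes.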

Since the biological screening of chemical substances is both expensive and time-consuming, it is essential to develop a tool for the statistical design of the set of compounds used in a screening experiment. The principal properties are well suited for this purpose because they are few and orthogonal.
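One possible way to use a few orthogonal principal properties for compound selection is to pick the candidates that lie closest to the corners of a two-level factorial design in PP space. The sketch below assumes this strategy for illustration; it is not stated in the text, and the function name, candidate labels, and score values are all hypothetical placeholders.

```python
import math
from itertools import product


def select_design_points(scores, levels=(-1.0, 1.0)):
    """For each corner of a two-level full factorial design in score space,
    pick the not-yet-selected candidate closest (Euclidean) to that corner.

    `scores` maps a candidate label to its tuple of principal-property values.
    """
    n_dims = len(next(iter(scores.values())))
    chosen = []
    for corner in product(levels, repeat=n_dims):
        best = min(
            (name for name in scores if name not in chosen),
            key=lambda name: math.dist(scores[name], corner),
        )
        chosen.append(best)
    return chosen


# Placeholder two-dimensional PP scores for six hypothetical candidates.
pp_scores = {
    "A": (0.9, 1.1),
    "B": (-1.2, 0.8),
    "C": (0.1, 0.0),
    "D": (1.0, -0.9),
    "E": (-0.8, -1.1),
    "F": (0.2, 0.3),
}
selected = select_design_points(pp_scores)
print(selected)
```

Candidates near the center of the score space (here "C" and "F") are left out, so the selected set spans the property space with only four compounds.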

2 Descriptors for Nano-QSPR/QSAR

Over the past few decades, nano-based technology has become one of the top research areas in all fields of science and technology. A wide variety of consumer products now contain nanoscale materials, typically defined as species having at least one dimension of 100 nm or less. Currently, nanotechnology is integrated into various fields, including biomedicine, the pharmaceutical industry, the food industry, environmental protection, solar batteries, energy, information and communication, heavy industry, and consumer goods. However, it seems that we are only at the beginning of the “nano-industrial revolution.” Because of the unique electrical, optical, magnetic, thermal, and chemical properties of nanomaterials, the range of their possible applications is likely to expand rapidly.

Some recent papers report evident toxicity of selected nanoparticles and highlight the potential risk associated with the development of nano-engineering. Currently, there are many gaps in nanomaterial data. Predictive nano-QSAR/QSPR is one of the most promising methods used by cheminformaticians to predict the activity/property of nanomaterials. We believe that some of the missing data that are crucial for environmental risk assessment can be obtained using computational chemistry, saving the time and cost of conducting experiments. It is worth noting that the nano-QSPR/QSAR approach should be employed to predict not only activity responses (e.g., toxicity) but also many important physicochemical properties (e.g., water solubility, n-octanol/water partition coefficient, vapor pressure). These physicochemical properties affect the absorption, distribution, and metabolism of the compound in the organism, as well as its environmental transport and fate.

In nano-QSPR/QSAR modeling, one of the important requirements for building a valid model is the choice of suitable descriptors. In general, there are more than 5000 different descriptors for the characterization of molecular structure, from zero- to four-dimensional (0D–4D). Only a few of the traditional descriptors can characterize nanostructures. Some reports [107, 108] indicate that the existing descriptors are not sufficient to express the specific physical and chemical properties of nanoparticles. Therefore, new and more suitable types of descriptors for characterizing nanoparticles should be developed.

In addition to the computed features used for QSPR/QSAR modeling, experimentally derived features may also be employed as descriptors for nano-QSAR development (Fig. 2.3). Experimental descriptors seem to be especially useful for expressing size distribution, aggregation mode, shape, porosity, and surface disorder. Moreover, the combination of experimental results with a numerical approach can be used to define new descriptors. For instance, images obtained by scanning electron microscopy (SEM), transmission electron microscopy (TEM), or atomic force microscopy (AFM) might be processed with new chemometric methods of image analysis. First, a series of pictures of different particles of a nanostructure is taken. Then, the images are numerically averaged and converted into a matrix of numerical values that correspond to each pixel's grayscale intensity or red, green, and blue (RGB) color value. Further descriptors can then be derived from this matrix (e.g., the shape descriptor can be obtained as the sum of the nonzero elements in the matrix, and the porosity as the sum of the relative differences between each pixel and its “neighbors”) [109].
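The image-based procedure outlined above can be sketched as follows. This is a minimal illustration under explicit assumptions: the function names are hypothetical, the images are tiny placeholder grayscale arrays rather than real micrographs, "neighbors" are taken to be the four directly adjacent pixels, and the relative difference between two pixels is taken as |p − q| / max(|p|, |q|). None of these choices is specified in the text.

```python
import numpy as np


def average_images(images):
    """Pixel-wise mean of equally sized grayscale images."""
    return np.mean(np.stack([np.asarray(im, dtype=float) for im in images]), axis=0)


def shape_descriptor(matrix):
    """Sum of the nonzero elements (zeros contribute nothing to the sum)."""
    return float(np.asarray(matrix, dtype=float).sum())


def porosity_descriptor(matrix, eps=1e-12):
    """Sum of relative differences between horizontally and vertically
    adjacent pixels; each neighboring pair is counted once.
    ASSUMPTION: relative difference = |p - q| / max(|p|, |q|), 0 if both are 0.
    """
    m = np.asarray(matrix, dtype=float)
    total = 0.0
    for a, b in ((m[:, :-1], m[:, 1:]), (m[:-1, :], m[1:, :])):
        denom = np.maximum(np.abs(a), np.abs(b))
        rel = np.where(denom > eps, np.abs(a - b) / np.maximum(denom, eps), 0.0)
        total += float(rel.sum())
    return total


# Two placeholder 2x2 "images" of the same nanostructure (illustrative only).
images = [
    [[0, 2], [4, 4]],
    [[0, 4], [4, 4]],
]
avg = average_images(images)
shape_value = shape_descriptor(avg)
porosity_value = porosity_descriptor(avg)
```

For RGB images, the same functions could be applied to each color channel separately, giving one descriptor value per channel.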

Fig. 2.3 Experimental characteristics as descriptors in nano-QSAR research [110]. The infographic depicts experimental properties such as area, diameter, volume, crystal structure, aggregation state, hydrophobicity, surface charge, and elemental composition.

Undoubtedly, proper characterization of nanoparticle structure is currently one of the most challenging tasks in nano-QSAR. Although more than five thousand QSAR descriptors have been defined to date, they may be insufficient to express the supramolecular phenomena governing the unusual activity/property of nanomaterials. Consequently, much more effort is needed in this area.

3 SMILES and Quasi-SMILES Descriptors

The CORrelation And Logic (CORAL) software (http://www.insilico.eu/coral/), developed by Alla Toropova and Andrey Toropov, is used to build QSPR/QSAR models using Simplified Molecular Input Line Entry System (SMILES) [61, 111,112,113,114,115,116] and quasi-SMILES descriptors. SMILES is a chemical notation system designed by Weininger et al. [117, 118]. Based on the principles of molecular graph theory, SMILES uses a very small, natural grammar to specify precise structural features. The SMILES notation is also suitable for high-speed machine processing [119, 120].

Over the last two decades, there have been numerous reports on the QSAR/QSPR modeling of nanomaterials and other compounds using the CORAL software. This approach provides a simple representation of molecular structures. There are defined equivalences between the representation of a molecular structure by a diagram and by SMILES notation. However, one should also be aware of their significant differences [121]. SMILES can be produced by popular software such as ChemSketch, Biovia, and ChemDraw [122].

The activity/property of nanomaterials can be predicted from SMILES [123,124,125]. Quasi-SMILES is an alternative to SMILES-based optimal descriptors for building predictive models for nanomaterials and other materials that takes the experimental conditions into account. A quasi-SMILES may consist of eclectic conditions only [126, 127] or of a combination of SMILES and eclectic conditions [128, 129]. Continuous eclectic conditions can be normalized by the following equation before codes are assigned:

$${\text{Norm}}\left( {P_{i} } \right) = \frac{{\min \left( {P_{i} } \right) + P_{i} }}{{\min \left( {P_{i} } \right) + \max \left( {P_{i} } \right)}}$$
(2.18)

where \(P_{i}\) is the value of the physicochemical parameter P, min(\(P_{i}\)) is the minimum value of P, and max(\(P_{i}\)) is the maximum value of P.

According to Table 2.2, the number of unique values for each parameter was less than 10; therefore, each parameter could be coded in the quasi-SMILES representation by a single character, a digit between zero and nine.
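The normalization of Eq. (2.18) and the subsequent assignment of a single-digit class can be sketched as follows. The `norm` function implements the equation exactly as written above; the binning in `class_code` (equal-width classes 1–9 on the interval from 0 to 1) is an assumption made for illustration, since the actual class boundaries are those of Table 2.2. The function names and the example values are hypothetical.

```python
def norm(p, p_min, p_max):
    """Eq. (2.18): (min(P) + P) / (min(P) + max(P))."""
    return (p_min + p) / (p_min + p_max)


def class_code(norm_value, n_classes=9):
    """Map a normalized value to a single-digit class 1..n_classes.

    ASSUMPTION: equal-width bins on [0, 1]; the real boundaries are
    defined in Table 2.2 of the source and are not reproduced here.
    """
    if not 0.0 <= norm_value <= 1.0:
        raise ValueError("normalized value must lie in [0, 1]")
    return min(n_classes, int(norm_value * n_classes) + 1)


# Placeholder values of one eclectic condition (illustrative only).
values = [1.0, 5.0, 10.0]
p_min, p_max = min(values), max(values)
codes = [class_code(norm(v, p_min, p_max)) for v in values]
print(codes)
```

Note that, as written, Eq. (2.18) gives 1 at the maximum of P but 2·min(P)/(min(P) + max(P)) rather than 0 at the minimum, so for non-negative data the lowest classes are only reached when min(P) is small relative to max(P).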

Table 2.2 Division of standardized physicochemical features into classes 1–9 according to their values

3.1 Quasi-SMILES Examples in Peer-Reviewed Papers

Table 2.3 shows an example of the construction codes for the quasi-SMILES. Based on the data shown in Table 2.3, the quasi-SMILES can be generated, which can be used to build a model according to the optimal descriptors. Table 2.4 indicates some examples for quasi-SMILES generated by codes shown in Table 2.3.

Table 2.3 Codes used for the cell line, method, exposure time, concentration, size of nanoparticles, and type of metal oxide to convert various information of experimental data into quasi-SMILES [126]
Table 2.4 Some examples for quasi-SMILES produced by codes indicated in Table 2.3
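The construction of a quasi-SMILES from such a code table amounts to looking up one code per experimental condition and concatenating the codes in a fixed order. The sketch below illustrates this idea only: the condition labels and code symbols are hypothetical placeholders and are not the codes of Table 2.3.

```python
# HYPOTHETICAL code table for illustration; the actual symbols are those
# listed in Table 2.3 of the source.
CODES = {
    "cell_line": {"line_1": "A", "line_2": "B"},
    "method": {"assay_1": "C", "assay_2": "D"},
    "exposure_time": {"short": "1", "long": "2"},
    "concentration": {"low": "3", "high": "4"},
    "size": {"small": "5", "large": "6"},
    "metal_oxide": {"oxide_1": "X", "oxide_2": "Y"},
}

# Fixed order in which the condition codes are concatenated.
FIELD_ORDER = (
    "cell_line",
    "method",
    "exposure_time",
    "concentration",
    "size",
    "metal_oxide",
)


def build_quasi_smiles(record, codes=CODES, order=FIELD_ORDER):
    """Concatenate the code of each experimental condition in a fixed order.

    Raises KeyError if a field or a condition value has no code assigned.
    """
    return "".join(codes[field][record[field]] for field in order)


record = {
    "cell_line": "line_2",
    "method": "assay_1",
    "exposure_time": "long",
    "concentration": "low",
    "size": "large",
    "metal_oxide": "oxide_1",
}
quasi_smiles = build_quasi_smiles(record)
print(quasi_smiles)
```

Keeping the field order fixed ensures that the same position in every quasi-SMILES always refers to the same kind of condition.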

A recently reported QSPR analysis of metal–organic frameworks (MOFs) by Ahmadi et al. applied quasi-SMILES parameters, including the Brunauer–Emmett–Teller (BET) specific surface area, pore volume, pressure, and temperature, to predict the CO2 adsorption of MOFs [128]. Tables 2.5 and 2.6 show the ranges of the eclectic data and the corresponding quasi-SMILES codes, respectively.

Table 2.5 Lower and upper levels of CO2 capture capacity, BET, pore volume, pressure (bar), and temperature (K) [128]
Table 2.6 Defined quasi-SMILES codes for the eclectic conditions (normalized BET, normalized pore volume, normalized pressure, and normalized temperature) of the CO2 capture capacity of MOFs [128]

In the 2019 version of the CORAL software (code-2019), the groups of symbols %10–%99, which are reserved in ordinary SMILES for representing complex ring systems, were used as codes for the quasi-SMILES (Table 2.6). The disadvantage of this version of quasi-SMILES is that the results are difficult for a user to interpret.

Further development of the CORAL software (CORAL-2020) allows experimental conditions to be represented by groups of symbols enclosed in parentheses. Table 2.7 compares the codes of the latest version (CORAL-2020) with those of the old version of CORAL for creating quasi-SMILES in recently proposed models of mutagenic potential. The CORAL-2020 codes are clearly more transparent and thus more convenient for a user.

Table 2.7 Definition of the eclectic conditions used to construct quasi-SMILES [130]

Toropov et al. reported a toxicity model based on four eclectic data: the form of the silver nanoparticles (bare, coat, or cons), the organism (Daphnia magna or zebrafish), the size (nm), and the zeta potential (mV) [131]. Here, “bare” denotes nanoparticles without any coating, “coat” (coating) denotes nanoparticles with a shell, and “cons” denotes nanoparticles that include coating material descriptors (Table 2.8).

Table 2.8 Some quasi-SMILES used to generate the nano-QSAR model for pLC50 [131]
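To illustrate how a quasi-SMILES can be turned into a numerical descriptor, the sketch below splits a string into its symbols and sums a weight for each one, then applies a one-variable linear model. The additive form (a descriptor equal to the sum of correlation weights, followed by endpoint = C0 + C1 × descriptor) is assumed here as the usual description of CORAL-type optimal descriptors. The example string, the square-bracket delimiter for condition groups, the weights, and the coefficients are all hypothetical; in practice the weights would come from a Monte Carlo optimization, which is not shown.

```python
import re

# A token is a bracketed group, a two-digit %NN code, or a single character.
# ASSUMPTION: condition groups are delimited by square brackets.
TOKEN = re.compile(r"\[[^\]]+\]|%\d{2}|.")


def tokenize(quasi_smiles):
    """Split a quasi-SMILES string into its symbols (attributes)."""
    return TOKEN.findall(quasi_smiles)


def dcw(quasi_smiles, weights, default=0.0):
    """Descriptor of correlation weights: sum of the weights of all symbols.
    Symbols without an assigned weight contribute `default`."""
    return sum(weights.get(tok, default) for tok in tokenize(quasi_smiles))


def predict(quasi_smiles, weights, c0, c1):
    """One-variable linear model: endpoint = c0 + c1 * DCW."""
    return c0 + c1 * dcw(quasi_smiles, weights)


# HYPOTHETICAL quasi-SMILES and correlation weights (illustrative only).
example = "[bare][Dap]%11%10"
weights = {"[bare]": 0.5, "[Dap]": 1.0, "%11": -0.25, "%10": 0.75}

tokens = tokenize(example)
descriptor = dcw(example, weights)
endpoint = predict(example, weights, c0=1.0, c1=0.5)
```

Because each symbol carries its own weight, the sign and size of a weight indicate whether the corresponding condition tends to increase or decrease the predicted endpoint, which is what makes such models easy to interpret.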

4 Software for Generation of Molecular Descriptors

Over the last two decades, the growing interest in property/activity prediction has led to the release of many software products to the market and open-source domains for scientists working in the field of QSPR/QSAR modeling. Table 2.9 shows some popular software for calculating molecular descriptors. In addition, some of them are complex packages that also include modules for QSPR/QSAR modeling, statistical analysis, and data visualization.

Table 2.9 List of software packages for the calculation of molecular descriptors

5 Conclusion and Future Direction

Molecular descriptors are a critical component of the methodological toolbox for quantitative structure–property/activity relationship (QSPR/QSAR) modeling and are widely used to describe the structures of chemical compounds in the design of new compounds. Predictive and reliable QSPR/QSAR models depend on accurate descriptors, and accurate predictions can save the time and cost needed to design new compounds with the desired property/activity.

In this chapter, the main classes of theoretical molecular descriptors including 0D, 1D, 2D, 3D, and 4D descriptors are described. The most significant progress over the last few years in chemometrics, cheminformatics, and bioinformatics has led to new strategies for finding new molecular descriptors. Here, some of the most common molecular descriptors and some new molecular descriptors especially for design and QSPR/QSAR modeling of nanocomposites have been highlighted.

In nano-QSPR/QSAR modeling, the datasets reported in many publications are small and not readily usable for model building. In addition, nanomaterials exhibit high complexity and heterogeneity in their structures, which makes data collection and processing more challenging than in traditional QSPR/QSAR. Quasi-SMILES descriptors are one solution to this challenge and have been introduced as new descriptors combining SMILES and eclectic conditions. These novel descriptors yield models with transparent, interpretable equations, with correlation weights calculated by Monte Carlo optimization using the CORAL software.

Finally, a list of the most commonly used software packages for calculating molecular descriptors is reviewed here.