Introduction

A growing body of research in recent years has applied computational methods to predict the physicochemical properties and biological activities of chemical compounds. Quantitative structure−activity relationship (QSAR) (Dearden 2016) modeling is a popular in silico technique that establishes a quantitative correlation between structural features (known as descriptors) and a known response (activity/property/toxicity) for a set of molecules using various chemometric methodologies. QSAR has evolved at the crossroads of chemistry, statistics, biology, and toxicology. Its main aim is to identify and optimize new leads, shortening the time and reducing the expenditure of drug discovery (Hsu et al. 2017). The fundamental assumption of QSAR modeling is that a chemical structure possesses unique features (geometric, steric, and electronic properties) that are responsible for its physical, chemical, and biological properties.

The European Union (EU) envisaged that QSAR models would increasingly be used for hazard and risk assessments of chemicals (Commission of the European Communities 2001). It is also necessary to create and apply QSARs to address animal welfare concerns by replacing, reducing, and refining animal testing in toxicological assessments. In November 2004, the European Commission and the OECD (Organisation for Economic Co-operation and Development) member countries adopted principles for the validation of QSAR models intended for regulatory assessment of chemical safety (OECD 2004). According to the agreed OECD guidelines, a QSAR model should be developed with

  • (a) A defined endpoint,

  • (b) An unambiguous algorithm to guarantee model transparency,

  • (c) A defined domain of applicability,

  • (d) Proper measures of validation including internal performance (as determined by goodness-of-fit and robustness) and predictivity (as represented by external validation), and

  • (e) Possible mechanistic interpretation.

Validation is crucial for the development and application of any QSAR model: it confirms the reliability of the developed model and the acceptability of each step of model development. The debate between internal and external validation prevails among QSAR practitioners (Roy 2007). Some QSAR studies have reported an inconsistency between internal and external predictivity (Novellino et al. 1995; Norinder 1996); that is, high internal predictivity may be accompanied by low external predictivity and vice versa (Kubinyi 1998). Nevertheless, external validation is generally considered the ‘gold standard’ for checking the predictive potential of QSAR models. Some researchers consider cross-validation more appropriate for checking the predictive ability of QSAR models, since it circumvents the loss of information caused by splitting the dataset into training and test sets (Héberger 2017). Several validation metrics (discussed later) are used to check the quality of predictions generated by regression-based and classification-based QSAR models (Gramatica and Sangion 2016; Todeschini et al. 2016).

The present review discusses several prediction reliability tools that explore various strategies for determining model reliability and predictivity. We discuss tools that perform model-building through a double cross-validation approach on large and small datasets. Furthermore, we explain the utility of intelligent selection of multiple models and various forms of consensus prediction. We also describe a tool that uses a similarity-based reliability scoring approach to gauge the quality of predictions for a new query compound and to ensure the reliability of the developed models. Finally, we report a similarity-based quantitative read-across tool that addresses the quality of predictions both quantitatively and qualitatively.

Predictive QSAR model development approaches

Modern QSAR methods use multiple descriptors combined with both linear and non-linear modeling approaches, with a strong emphasis on rigorous model validation to afford robust and predictive QSAR models. Several studies, together with our own experience of QSAR model development and validation, led us to establish the general QSAR workflow described in Fig. 1. The figure illustrates the classical QSAR model development algorithm, which includes: (a) collection of pertinent data with a defined endpoint, (b) descriptor calculation and data pre-treatment, (c) model development through analysis of the correlation between the response data and the calculated descriptors, (d) validation of the model, and (e) design and prediction of the activity of new query molecules. The QSAR modeling scheme is described briefly in the following section.

(i) Dataset preparation and data curation: One of the most challenging parts of QSAR is dataset collection with a “defined endpoint”, as explained in OECD principle 1. The intent is to ensure the transparency of the endpoint aimed for in prediction models, considering that a given endpoint may depend on the experimental protocol and the experimental conditions. Data curation is an essential and time-consuming step in the QSAR model development process. Erroneous data (in both chemical structures and biological data) retrieved from online sources require strict curation to avoid false or non-predictive models (Ambure and Cordeiro 2020).

(ii) Calculation of molecular descriptors: The molecular structures used for QSAR modeling need to be translated into numbers, i.e., molecular descriptors. A molecular descriptor is an encoded representation of the information about a chemical compound in the form of numerical values derived from its chemical constitution, allowing the correlation of chemical structure with physical properties, chemical reactivity, or biological activity (Consonni and Todeschini 2010). In a QSAR model, the descriptors of a molecule, each describing a specific aspect of the structure, are the predictors (X) of the dependent variable (Y). A QSAR study uses a variety of descriptors that can be classified into different dimensions or categories, as shown in Table 1.

Fig. 1 Schematic representation of QSAR methodology according to OECD guidelines

Table 1 Types of 0D-3D descriptors used in the QSAR study
(iii) Dataset division: A predictive model's performance must be assessed by dividing the dataset into a training set and a test set. Only the training set molecules are used for developing the QSAR model, and the external predictivity of the model is examined using the test set compounds. In developing the QSAR model, the training set should be selected such that it encompasses a wide chemical domain, and the test set compounds should lie within the chemical space of the training set. Dataset division can be performed with different methods, including (a) Euclidean distance (diversity-based) (Golmohammadi et al. 2012), (b) Kennard–Stone (Kennard and Stone 1969), (c) k-means clustering (Likas et al. 2003), and (d) sorted response (Roy 2018).
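As an illustration of a rational division scheme, a minimal Python sketch of the Kennard–Stone (maximum–minimum distance) selection on a descriptor matrix might look as follows; the descriptor matrix and the training set size are hypothetical placeholders, not part of any of the tools discussed here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, n_train):
    """Select n_train row indices from descriptor matrix X (n_samples x n_descriptors)
    using the Kennard-Stone (max-min distance) algorithm."""
    dist = cdist(X, X)                                   # pairwise Euclidean distances
    # start with the two most distant compounds
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_train:
        # distance of each remaining compound to its closest already-selected compound
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # pick the compound that is farthest from the selected set
        chosen = remaining[int(np.argmax(min_dist))]
        selected.append(chosen)
        remaining.remove(chosen)
    return selected                                      # indices of training set compounds

# usage with hypothetical data: 100 compounds, 5 descriptors, 75 in the training set
X = np.random.rand(100, 5)
train_idx = kennard_stone(X, 75)
test_idx = [i for i in range(len(X)) if i not in train_idx]
```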

(iv) Feature selection: Feature selection is a vital step that identifies the predictor variables most relevant to the response variable. It decreases model complexity, reduces the risk of overfitting or overtraining, and helps select the most critical descriptors from a pool of hundreds or thousands. In this way, the dimensionality of the input descriptors is minimized without loss of essential information (Goodarzi et al. 2012). The selected descriptors are finally used to build a mathematical model linking them to the biological activity of the corresponding compounds. In line with the OECD guidelines, several feature selection techniques, often guided by a mechanistic basis, have been applied, including genetic algorithms, genetic function approximation (GFA), forward selection, backward elimination, stepwise regression, simulated annealing, etc.
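To make the idea concrete, a minimal sketch of greedy forward selection driven by cross-validated R² is given below; the scoring criterion, the stopping rule, and the data are illustrative assumptions rather than the procedure of any specific tool.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=5):
    """Greedy forward selection: at each step add the descriptor (column of X)
    that gives the largest cross-validated R^2 of an MLR model."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            q2 = cross_val_score(LinearRegression(), X[:, cols], y,
                                 cv=5, scoring="r2").mean()
            scores.append((q2, j))
        q2_best, j_best = max(scores)
        if q2_best <= best_score:            # stop when no descriptor improves the model
            break
        best_score = q2_best
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# usage with hypothetical data: 60 compounds, 200 candidate descriptors
X, y = np.random.rand(60, 200), np.random.rand(60)
chosen = forward_selection(X, y)
```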

(v) Model development algorithms: OECD principle 2 states that a QSAR model should be developed using an “unambiguous algorithm” (Directorate 2007). The rule focuses on transparency in model-building, rendering the model reproducible by others and making it possible to reproduce the endpoint estimates. This embraces the methods implemented during data pre-treatment, division of data, feature selection, and model development. Linear modeling techniques include multiple linear regression (MLR) (Pope and Webster 1972; De and Roy 2018), ordinary least squares (OLS), partial least squares (PLS) (Wold et al. 2001), principal component analysis (PCA) (Abdi and Williams 2010), principal component regression (PCR), etc.

In QSAR, model-building tools can be grouped into two major categories: regression-based and classification-based approaches. Regression-based approaches are applicable when both the dependent variable (response) and the independent variables (molecular descriptors) are quantitative (Roy et al. 2015a, b). In classification-based modeling, a relationship between the descriptors and graded values of the response variable(s) is established; here, the response is provided in Boolean form, such as active/inactive or positive/negative, or as categories (as in linear discriminant analysis, logistic regression, and cluster analysis).
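For orientation, a regression-based model could be fitted along the following lines using scikit-learn; the training data are random placeholders and the two-component PLS model is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

# hypothetical training data: 50 compounds, 4 selected descriptors
X_train, y_train = np.random.rand(50, 4), np.random.rand(50)

mlr = LinearRegression().fit(X_train, y_train)              # MLR model
pls = PLSRegression(n_components=2).fit(X_train, y_train)   # PLS with 2 latent variables

print("MLR coefficients:", mlr.coef_)
print("PLS R^2 on training set:", pls.score(X_train, y_train))
```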

(vi) Determination of the domain of applicability: One of the most essential checkpoints in QSAR modeling is determining the applicability domain (AD) of a model, as explained in OECD principle 3. The applicability domain denotes the physicochemical space (covering both the response and the chemical structure space) within which a QSAR model can predict with a certain degree of reliability (Roy et al. 2015a, b). This space is defined by the features of the training set compounds, and it is mandatory to examine whether a test set molecule lies within it in order to judge whether its prediction is reliable. The AD concept is used to avoid unjustified extrapolation of property predictions.
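One common way to operationalize an AD check (not the specific standardization approach used by the tools discussed later) is the leverage method; a minimal sketch, assuming pre-treated descriptors, is given below.

```python
import numpy as np

def leverage_ad(X_train, X_test):
    """Leverage (hat-matrix) based applicability domain check.
    A test compound with leverage h > h* = 3(p+1)/n is flagged as outside the AD."""
    n, p = X_train.shape
    XtX_inv = np.linalg.pinv(X_train.T @ X_train)
    h_star = 3.0 * (p + 1) / n                       # common warning leverage threshold
    h_test = np.einsum("ij,jk,ik->i", X_test, XtX_inv, X_test)   # diagonal of H for test set
    return h_test <= h_star                          # True = inside AD

# hypothetical data
X_train, X_test = np.random.rand(40, 3), np.random.rand(10, 3)
print(leverage_ad(X_train, X_test))
```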

(vii) QSAR model validation: Before a QSAR model is interpreted and used to predict the biological responses of untested compounds, it needs to be validated. Here, the model's predictive power is established, and its ability to reproduce the biological activities of untested compounds is measured. In consonance with the fourth OECD principle, statistical validation of models in terms of goodness-of-fit, robustness, and predictivity is an extremely important step during QSAR model development, and it is crucial if the models are to be used for virtual screening. Each validation parameter aims to judge the accuracy of prediction, i.e., to determine whether the experimental value is close to the model-derived value. The model fitness, determined using the coefficient of determination or correlation coefficient for the training set, measures the degree of correlation achieved between the experimental (Yexp) and calculated (Ycalc) response values. Data fitting does not confirm the predictability of a model but instead demonstrates its statistical quality. Different internal and external validation metrics for both regression and classification modeling are used to check model prediction quality, as discussed in the following sections.

(viii) Mechanistic interpretation: The fifth OECD principle focuses on identifying the features of the variables that may contribute to a more thorough understanding of the response being modeled. Chemicals acting through a specific mechanism can be designed and developed with confidence only by using structural analogues. However, furnishing mechanistic information may not always be feasible. The principle suggests that the modeler should report any such information that is available, facilitating future research on that endpoint. A mechanistic interpretation from the literature can also be added; the fifth OECD principle therefore encourages reporting such information to enrich the physicochemical understanding of the response being modeled.

Regression and classification validation metrics

The reliability of a developed QSAR model is confirmed through the validation process. The quality of the input data, dataset diversity, predictability for an external set, applicability domain determination, and mechanistic interpretability are also checked through various validation metrics. QSAR model validation can be classified into two major types: (a) internal validation and (b) external validation. Internal validation involves activity prediction for the molecules/compounds used to generate the model, followed by estimation of metrics quantifying the precision of these predictions. Internal validation relies on cross-validation approaches (Konovalov et al. 2008), in which the internal data are partitioned into calibration (training) and validation (test) subsets: the calibration set is used for model-building, and the validation set is used for assessing model predictivity. The prediction capability and applicability of a QSAR model for newly designed or untested molecules are assessed using external validation metrics. In most cases, some compounds from the original dataset are set aside for validation purposes when true external data points are limited or unavailable.

Regression-based validation metrics

One of the main quality metrics for checking the goodness-of-fit of a regression model is the determination coefficient \(\left({R}^{2}\right)\), which measures how well the fitted values reproduce the observed data. The maximum possible value of \({R}^{2}\) is 1, which corresponds to a perfect correlation.

Adjusted \({R}^{2}\) (\({R}_{adj}^{2}\)) is a modified version of the determination coefficient and is also known as the explained variance. The \({R}_{adj}^{2}\) parameter incorporates information on the number of samples and the number of independent variables used in the model.

For internal validation of a regression-based QSAR model, the leave-one-out cross-validated metric (\({Q}_{LOO}^{2}\)) is calculated. Here, the original training set of n compounds is modified by removing one compound, and a model is developed with the remaining n−1 compounds; the activity of the omitted compound is then predicted using this model. This cycle is repeated until each training set compound has been eliminated once and predicted activity data have been obtained for all the training set compounds. The model predictivity is then measured using the predicted residual sum of squares (\(\mathrm{PRESS}\)) and the cross-validated \({R}^{2}\) (\({Q}^{2}\)) (Table 2). The \(\mathrm{PRESS}\) value is defined as the sum of squared differences between the experimental and leave-one-out predicted data, and the standard deviation of error of predictions (\(\mathrm{SDEP}\)) is calculated from the \(\mathrm{PRESS}\) value (Table 2). A model is considered satisfactory if the value of \({Q}^{2}\) is higher than the conventional threshold of 0.6. However, numerous reports have suggested that leave-one-out prediction should be considered neither the ultimate standard for judging the predictive power of models nor a basis for model selection (Konovalov et al. 2007; Veerasamy et al. 2011), because structural redundancy in the training set can lead to overfitting and overestimation in LOO (Höltje and Sippl 2001). Leave-many-out (LMO) or leave-some-out (LSO) cross-validation may be a better alternative: a part of the training data (m compounds, 1 ≤ m < n, where n is the sample size) is held out, a model is developed using the remaining compounds, and the hold-out compounds are predicted. This cycle continues until all the compounds have been predicted, and the predicted values are used for the calculation of \({Q}_{\mathrm{LMO}}^{2}\). The LMO technique therefore partly reflects external validation within an internal validation framework.
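A minimal sketch of the LOO procedure, computing \(\mathrm{PRESS}\), \({Q}^{2}\), and \(\mathrm{SDEP}\) for an MLR model with scikit-learn, is shown below; the data are placeholders, and the formulas follow the standard definitions quoted above (they should be checked against Table 2).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2_loo(X, y):
    """Leave-one-out cross-validated Q^2, PRESS and SDEP for an MLR model."""
    y_loo = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
    press = np.sum((y - y_loo) ** 2)                 # predicted residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)
    q2 = 1.0 - press / ss_tot
    sdep = np.sqrt(press / len(y))                   # standard deviation of error of prediction
    return q2, press, sdep

# hypothetical training data
X, y = np.random.rand(30, 3), np.random.rand(30)
print(q2_loo(X, y))
```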

Table 2 Validation metrics for regression modeling

Although \({Q}_{\mathrm{LOO}}^{2}\) provides a measure of model robustness, it may not be sufficient to characterize the performance of the model in predicting new query/test compounds. Furthermore, \({Q}_{\mathrm{LOO}}^{2}\) can overestimate model quality as a result of structural redundancy in the training set data. Thus, the performance of a model on an external dataset is considered mandatory for the judgment of predictivity. The metric employed for judging external predictivity is termed predictive \({R}^{2}\), \({R}_{\mathrm{pred}}^{2}\), or \({Q}_{\mathrm{ext} \left(F1\right)}^{2}\). The \({Q}_{\mathrm{ext} \left(F1\right)}^{2}\) metric is characterized by a minimum threshold value of 0.6, i.e., models showing a value of more than 0.6 are considered externally predictive, with the ideal value being 1.0. Schüürmann and co-workers (Schüürmann et al. 2008) defined another external validation metric, \({Q}_{\mathrm{ext} \left(F2\right)}^{2}\), for judging the predictivity of a model using the test set. Consonni et al. (2009) introduced a further external validation metric, \({Q}_{\mathrm{ext} \left(F3\right)}^{2}\). This metric measures model predictivity, is sensitive to the selection of the training dataset, and tends to penalize models fitted to a very homogeneous dataset even if the predictions are close to the truth; its threshold value is also 0.6.
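These external metrics can be computed directly from the observed and predicted responses; the sketch below uses the commonly cited definitions (training-set mean in the denominator for \({Q}_{\mathrm{ext}(F1)}^{2}\), test-set mean for \({Q}_{\mathrm{ext}(F2)}^{2}\), and per-compound scaling for \({Q}_{\mathrm{ext}(F3)}^{2}\)), which should be checked against Table 2; the data are hypothetical.

```python
import numpy as np

def external_q2(y_train, y_test, y_test_pred):
    """External predictivity metrics Q^2_F1, Q^2_F2 and Q^2_F3 (standard definitions)."""
    press = np.sum((y_test - y_test_pred) ** 2)
    q2_f1 = 1 - press / np.sum((y_test - y_train.mean()) ** 2)
    q2_f2 = 1 - press / np.sum((y_test - y_test.mean()) ** 2)
    q2_f3 = 1 - (press / len(y_test)) / (np.sum((y_train - y_train.mean()) ** 2) / len(y_train))
    return q2_f1, q2_f2, q2_f3

# hypothetical observed/predicted responses
y_train = np.array([5.2, 6.1, 4.8, 7.3, 5.5, 6.6])
y_test, y_test_pred = np.array([5.0, 6.4, 7.1]), np.array([5.3, 6.0, 6.8])
print(external_q2(y_train, y_test, y_test_pred))
```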

Another metric that checks the model reliability is the concordance correlation coefficient (CCC) metric (Chirico and Gramatica 2011). It measures both precision and accuracy, detecting the distance of the observations from the fitting line and the degree of deviation of the regression line from that passing through the origin, respectively. Any deviation of the regression line from the concordance line (line passing through the origin) gives a value of CCC smaller than 1. The desirable threshold value for CCC is 0.85.

The root-mean-square error in prediction \(\left({\mathrm{RMSE}}_{p}\right)\) provides another measure of external validation. This metric is comparatively simple and directly reflects the prediction errors for the test set observations relative to the total number of test set samples. A lower value of this metric is desirable for good external predictivity.

The \({r}_{\mathrm{m }}^{2}\) metrics: the training set mean and the distance of each compound's response value from that mean play a decisive role in computing the \({Q}^{2}\) values. The \({Q}^{2}\) value increases as the denominator of its expression increases; thus, even for a considerable deviation between the predicted and observed response values, satisfactory \({Q}^{2}\) values may be obtained if the molecules exhibit a considerably broad range of response data. Using the regression-through-origin approach, Roy et al. (2012) introduced a new metric, \({r}_{\mathrm{m }}^{2}\) or modified \({r}^{2}\), that penalizes the \({r}^{2}\) value of a model when there is a large deviation between \({r}^{2}\) (the squared correlation coefficient between the observed (Y-axis) and predicted (X-axis) values with intercept) and \({r}_{0}^{2}\) (the squared correlation coefficient between the observed and predicted values without intercept) (Table 2).
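A small sketch of this penalty, assuming the commonly cited form \({r}_{m}^{2}={r}^{2}\times (1-\sqrt{|{r}^{2}-{r}_{0}^{2}|})\) (scaled and averaged variants also exist, and the exact expression should be checked against Table 2), is given below with placeholder data.

```python
import numpy as np

def rm2(y_obs, y_pred):
    """r_m^2 metric: r^2 penalized by the difference between r^2 (with intercept)
    and r0^2 (regression through the origin), using the commonly cited form
    r_m^2 = r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    # regression of observed (Y) on predicted (X) forced through the origin
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)
    ss_res0 = np.sum((y_obs - k * y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    r02 = 1 - ss_res0 / ss_tot
    return r2 * (1 - np.sqrt(abs(r2 - r02)))

# hypothetical observed and predicted responses
y_obs = np.array([5.0, 6.4, 7.1, 4.6, 5.9])
y_pred = np.array([5.3, 6.0, 6.8, 4.9, 6.1])
print(rm2(y_obs, y_pred))
```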

MAE-based criteria: Roy et al. (2016) showed that the conventional correlation-based external validation metrics (\({Q}_{\mathrm{ext} \left(F1\right)}^{2}\), \({Q}_{\mathrm{ext} \left(F2\right)}^{2}\)) often provide a biased judgment of model predictivity, since such metrics are influenced by factors such as the response range and the distribution of the data. The authors therefore defined a set of criteria using the simple ‘mean absolute error’ (MAE) and the corresponding standard deviation (σ) of the predicted residuals to judge the external predictivity of models. Note that \(\mathrm{MAE}= \frac{1}{n}\times \sum \left|{Y}_{\mathrm{obs}}-{Y}_{\mathrm{pred}}\right|,\) where \({Y}_{\mathrm{obs}}\) and \({Y}_{\mathrm{pred}}\) are the observed and predicted response values of the test set comprising n compounds. The response range of the training set compounds is employed to define the threshold values. Furthermore, the authors proposed applying the ‘MAE-based criteria’ to 95% of the test set data, removing the 5% of data with the highest predicted residuals to preclude a biased judgment of prediction quality due to outlier predictions. The following criteria are used:

i. Good predictions: in simple terms, an error of up to 10% of the training set range is acceptable, while an error of more than 20% of the training set range is very high. Thus, the criterion for good predictions is as follows:

$$\mathrm{MAE} \le 0.1 \times \text{training set range}\quad \text{and}\quad \left( \mathrm{MAE} + 3\sigma \right) \le 0.2 \times \text{training set range}.$$

Here, σ indicates the standard deviation of the absolute errors for the test data. For a normal distribution, mean ± 3σ covers 99.7% of the data points.

ii. Bad predictions: an MAE value of more than 15% of the training set range is considered high, while an error higher than 25% of the training set range is judged as very high. Thus, a prediction is considered bad when

$$\mathrm{MAE} > 0.15 \times \text{training set range}\quad \text{or}\quad \left( \mathrm{MAE} + 3\sigma \right) > 0.25 \times \text{training set range}.$$

Predictions that do not fall under either of the above two conditions may be considered of moderate quality. These criteria are applied for judging the quality of test set predictions when there are at least 10 data points (to ensure statistical reliability) and there is no systematic error in the model predictions.
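A minimal sketch of the scheme, using a hypothetical function name and applying the thresholds quoted above to 95% of the test data, could look as follows.

```python
import numpy as np

def mae_criteria(y_obs, y_pred, train_range, trim=0.05):
    """Classify external prediction quality using the MAE-based criteria of
    Roy et al. (2016), applied to 95% of the test data (the 5% of compounds with
    the highest absolute errors are removed first)."""
    abs_err = np.abs(np.asarray(y_obs) - np.asarray(y_pred))
    keep = np.sort(abs_err)[: int(np.ceil(len(abs_err) * (1 - trim)))]   # drop worst 5%
    mae, sigma = keep.mean(), keep.std(ddof=1)
    if mae <= 0.10 * train_range and (mae + 3 * sigma) <= 0.20 * train_range:
        return "good"
    if mae > 0.15 * train_range or (mae + 3 * sigma) > 0.25 * train_range:
        return "bad"
    return "moderate"

# hypothetical test set predictions, training set response range = 4 log units
print(mae_criteria([5.1, 6.2, 4.8, 7.0, 5.5], [5.0, 6.5, 4.9, 6.7, 5.8], train_range=4.0))
```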

Randomisation of response (Y-scrambling): randomisation is an assessment to ensure that the developed QSAR model is not the result of chance correlation, thereby giving an idea of model robustness (Rücker et al. 2007). In this technique, the validation metrics are checked by repeated permutation of the response data (Y) of the n training set compounds while the X (descriptor) matrix is kept unchanged. The calculations are repeated with the randomized activities, followed by a probabilistic examination of the results. Every run yields estimates of \({R}^{2}\) and \({Q}^{2}\), which are recorded. For an acceptable QSAR model, the average correlation coefficient (\({R}_{r}\)) of the randomized models should be clearly lower than the correlation coefficient (\(R\)) of the non-random model. The difference between the mean squared correlation coefficient of the randomized models (\({R}_{r}^{2}\)) and that of the non-random model (\({R}^{2}\)) is captured through the \({R}_{p}^{2}\) calculation (\({R}_{p}^{2}={R}^{2}\times \sqrt{{R}^{2}-{R}_{r}^{2}}\)). A robust QSAR model should have an \({R}_{p}^{2}\) value greater than 0.5. Under ideal conditions, the average \({R}^{2}\) of the randomized models, i.e., \({R}_{r}^{2}\), should be zero, and in that case the penalized metric should equal the \({R}^{2}\) of the developed QSAR model; since the original formula does not satisfy this condition, Todeschini proposed the corrected formula \({}^{c}{R}_{p}^{2}=R\times \sqrt{{R}^{2}-{R}_{r}^{2}}\) (Todeschini 2010).
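A minimal Y-scrambling sketch for an MLR model is shown below; the number of permutation runs is arbitrary, the data are placeholders, and the corrected \({}^{c}{R}_{p}^{2}\) is computed with the formula quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def y_randomization(X, y, n_runs=100, seed=0):
    """Y-scrambling: refit the model on permuted responses and compare the mean
    randomized R^2 (R_r^2) with the R^2 of the real model; also report the
    corrected penalized metric cR_p^2 = R * sqrt(R^2 - R_r^2)."""
    rng = np.random.default_rng(seed)
    model = LinearRegression()
    r2_real = model.fit(X, y).score(X, y)
    r2_rand = []
    for _ in range(n_runs):
        y_perm = rng.permutation(y)                      # scramble the response only
        r2_rand.append(model.fit(X, y_perm).score(X, y_perm))
    rr2 = float(np.mean(r2_rand))
    crp2 = np.sqrt(r2_real) * np.sqrt(max(r2_real - rr2, 0.0))
    return r2_real, rr2, crp2

# hypothetical data
X, y = np.random.rand(40, 4), np.random.rand(40)
print(y_randomization(X, y))
```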

Classification-based QSAR validation metrics

In a binary classification model, several validation metrics are utilized to evaluate the model's performance in terms of accurate qualitative prediction of the dependent variable. Classification models are generally assessed using a statistical method that is based on the Bayesian approach (Ghosh et al. 2020). A binary classification model is typically a two-class model, i.e., positive and negative, or active and inactive. The results obtained can be arranged in a contingency table (also known as confusion matrix) (Table 3). The statistical metrics explaining the quality of a classification model are given below and in Table 4.

Table 3 Contingency table or confusion matrix for classification modeling
Table 4 Validation metrics for classification modeling

In classification QSAR modeling, the predictions are classified into four main categories: (a) true positives (TP), (b) true negatives (TN), (c) false positives (FP), and (d) false negatives (FN) (Table 3). Researchers have used a variety of statistical tests to assess the performance and classification capability of classifier models. Sensitivity (Sn) is the fraction of active compounds correctly predicted, expressed as the ratio of true positives to the total number of positive data. Specificity (Sp) is the ratio of true negatives to the total number of negative data. Accuracy (Acc) is the fraction of correctly predicted compounds. Precision indicates the accuracy of a predicted class (the ratio between the true positives and the total predicted positives), and the F-measure is the harmonic mean of recall (sensitivity) and precision. Higher values of recall and precision give a higher F-measure, implying better classification.

G-means is a combination term that includes Sn and Sp into a single parameter merged via the geometric mean. This allows an easy assessment of the model’s ability to distinguish between active or inactive samples.

Cohen’s kappa (κ) can be utilized to determine the concordance between the predicted classifications and the known classifications (Cohen 1960). It is a measure of the degree of agreement and returns a value from − 1 (total disagreement) through 0 (random classification) to 1 (total agreement).

The Matthews correlation coefficient (MCC) measures the quality of binary classifications and allows different classifiers to be compared. Whenever the numbers of positive and negative compounds are unequal, sensitivity, specificity, and accuracy alone are not reliable. MCC uses all four values (TP, TN, FP, and FN) and is calculated directly from the confusion matrix to provide a more balanced evaluation of predictions. Like Cohen’s kappa, MCC ranges from − 1 to 1.
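All of these metrics follow directly from the confusion matrix; a small self-contained sketch (with an arbitrary example matrix) is given below, using the standard textbook formulas.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Common classification validation metrics computed from the confusion matrix."""
    sn = tp / (tp + fn)                          # sensitivity (recall)
    sp = tn / (tn + fp)                          # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)        # accuracy
    prec = tp / (tp + fp)                        # precision
    f1 = 2 * prec * sn / (prec + sn)             # F-measure
    gmeans = math.sqrt(sn * sp)                  # G-means
    # Cohen's kappa: observed vs. chance agreement
    n = tp + tn + fp + fn
    p_obs = acc
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Precision": prec,
            "F1": f1, "G-means": gmeans, "Kappa": kappa, "MCC": mcc}

# example confusion matrix: 40 TP, 35 TN, 5 FP, 10 FN
print(classification_metrics(40, 35, 5, 10))
```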

Prediction reliability detection tools

As discussed earlier, the process of QSAR modeling consists of three important steps: model development, model selection, and model interpretation. The model development process involves various feature selection practices, including stepwise multiple linear regression (S-MLR), genetic algorithms, genetic function approximation, etc. Model selection is based on the identification of the best model (judged by validation metric values) from a set of alternative models. When it comes to the reliability of QSAR/QSPR models, validation is essential: after a model has been selected, several internal and external validation metrics are assessed, which help demonstrate the actual predictive performance of the chosen model. Several QSAR research groups have suggested external validation to be the gold standard for demonstrating the predictive ability of a model (Golbraikh and Tropsha 2002; Gramatica and Sangion 2016; Gramatica 2020). Consensus predictions from multiple models have been introduced to achieve lower predicted residuals for query compounds (Roy et al. 2015b; Khan et al. 2019a; Roy et al. 2019). In the following sections, we discuss various tools from the DTC Laboratory (https://sites.google.com/site/kunalroyindia/home/qsar-model-development-tools) that help assess the prediction ability of one or more QSAR models.

(i) Double cross-validation (version 2.0) tool

The most common scheme of external validation is the hold-out method. Here, the original dataset is divided into a training set and a test set: the training set is used for model-building, followed by model selection based on internal validation metrics, while the test set is used for model validation through external validation metrics. This approach ensures that the test set is never used during the model-building procedure and remains unseen by the developed model. However, a single training set does not guarantee optimal feature selection, since a fixed training set composition biases the selection of features. This issue is more apparent for MLR models than for partial least squares (PLS) or principal component regression (PCR) models, which are more robust and generalized methods. Baumann and Baumann (2014) discussed the concept of double cross-validation (DCV), which Roy and Ambure implemented in a tool (Roy and Ambure 2016) in which the training set is further divided into ‘n’ combinations of calibration and validation sets. The tool is freely available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. The algorithm comprises two nested cross-validation loops (Bates et al. 2021), namely the outer loop and the inner loop (Fig. 2). In the outer loop, the data points are split arbitrarily into disjoint subsets known as training set compounds and test set compounds. The training set is utilized in the inner loop for model development and model selection, and the test set is used exclusively for checking model predictivity. In the inner loop, the training set is further split into k calibration and validation sets by applying the k-fold cross-validation technique (Wainer and Cawley 2021): the training data are initially segregated into k subsets, from which k iterations of calibration and validation sets are prepared. At each iteration, a different subset is excluded and used as the validation set, while the remaining k−1 subsets are used as the calibration set. The data are passed through a stratification process, i.e., a data rearrangement that helps maintain data uniformity (each fold is representative of the whole dataset). Each k-fold calibration set is then used to develop multiple linear regression (MLR) models, while the respective validation sets are used to determine the prediction errors. The tool provides two methods of feature selection: stepwise multiple linear regression (S-MLR) (Maleki et al. 2014; Ojha and Roy 2018) and genetic algorithm MLR (GA-MLR) (Leardi 2001). The prediction error is checked using the mean absolute error (MAE95%) (Roy et al. 2016). There is also a provision for the generation of PLS models in the tool. The models in the inner loop are selected based on three major criteria, as follows:

i) The models with the lowest MAE value (on the validation set) are selected.

ii) Consensus predictions of the top models are selected based on the MAE values of the validation sets.

iii) The best descriptor combination is searched out from the top models.
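A compact sketch of the nested (double) cross-validation idea, assuming MLR models, exhaustive descriptor subsets, and MAE as the inner-loop selection criterion, is given below; it illustrates the outer/inner loop structure rather than the tool's exact feature selection or consensus options, and all data are placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

def double_cross_validation(X, y, k_inner=5, n_features=3, seed=1):
    """Outer loop: hold out a test set. Inner loop: k-fold calibration/validation
    splits of the training set pick the descriptor subset with the lowest mean
    absolute error (MAE); the chosen model is then assessed on the untouched test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    best_subset, best_mae = None, np.inf
    for subset in combinations(range(X.shape[1]), n_features):   # candidate descriptor sets
        maes = []
        for cal_idx, val_idx in KFold(n_splits=k_inner, shuffle=True,
                                      random_state=seed).split(X_tr):
            model = LinearRegression().fit(X_tr[np.ix_(cal_idx, subset)], y_tr[cal_idx])
            pred = model.predict(X_tr[np.ix_(val_idx, subset)])
            maes.append(np.mean(np.abs(y_tr[val_idx] - pred)))
        if np.mean(maes) < best_mae:
            best_subset, best_mae = subset, np.mean(maes)
    final = LinearRegression().fit(X_tr[:, best_subset], y_tr)
    mae_test = np.mean(np.abs(y_te - final.predict(X_te[:, best_subset])))
    return best_subset, best_mae, mae_test

# hypothetical data: 60 compounds, 8 candidate descriptors
X, y = np.random.rand(60, 8), np.random.rand(60)
print(double_cross_validation(X, y))
```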

Fig. 2 Schematic diagram of the double cross-validation algorithm (colour figure online)

Researchers have found the DCV approach to be reliable and useful, and it has been successfully employed in various applications, for example, quantitative structure–property relationship (QSPR) modeling of the sweetness potency of organic chemicals (Ojha and Roy 2018), development of nano-QSAR models for TiO2-based photocatalysts (Mikolajczyk et al. 2018), inhalational toxicity modeling (Nath et al. 2022), modeling of diagnostic agents (De et al. 2019; De et al. 2020, 2022; De and Roy 2020, 2021), etc.

(ii) Intelligent consensus predictor tool

A well-validated QSAR model engages different classes of descriptors, which accentuate different features of the molecular structures. Individual QSAR models may exaggerate a few important features, undervalue others, and overlook some significant characteristic features. Roy et al. (2018b) therefore proposed an “intelligent” selection of multiple models to enhance the quality of predictions for query compounds. The software helps judge the performance of consensus predictions compared with the quality obtained from the individual MLR models, based on the MAE-based criteria (95%). The tool “Intelligent Consensus Prediction” is available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. The tool takes as input multiple individual models (M1, M2, M3, etc.) derived using different combinations of descriptors from the training set. Four ways of consensus prediction are described in the work:

  • (i) Consensus model 0 (CM0): it provides a simple average of predictions from all input individual models.

  • (ii) Consensus model 1 (CM1): it is the average of predictions from all individual qualified models. It is calculated from the arithmetic average of predicted response values attained from the ‘n’ qualified models for test compounds rather than from all existing individual models.

  • (iii) Consensus model 2 (CM2): it is the weighted average prediction (WAP) from all qualified individual models. In CM2, the average is evaluated by giving a proper weightage to the qualified models for a particular test set compound.

  • (iv) Consensus model 3 (CM3): compound-wise best selection of predictions from the qualified individual models. The best model for a particular test compound is selected based on its cross-validated mean absolute error (\({\mathrm{MAE}}_{\mathrm{CV}}\)); the model with the lowest \({\mathrm{MAE}}_{\mathrm{CV}}\) value is considered the best for that test set compound.
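For orientation, a loose sketch of the averaging ideas behind CM0 and CM2 (and a simplified, non-compound-wise stand-in for CM3) is shown below; the qualification of models and the exact weighting scheme used by the tool are not reproduced here, and all inputs are hypothetical.

```python
import numpy as np

def consensus_predictions(preds, mae_cv):
    """Illustrative consensus schemes:
    preds  - array of shape (n_models, n_test) with predictions from individual models
    mae_cv - cross-validated MAE of each model, used here to weight the average."""
    cm0 = preds.mean(axis=0)                              # CM0: plain average of all models
    weights = 1.0 / np.asarray(mae_cv)                    # lower error -> higher weight
    weights = weights / weights.sum()
    cm2 = np.average(preds, axis=0, weights=weights)      # CM2-style weighted average
    cm3 = preds[np.argmin(mae_cv)]                        # simplified: globally best model
                                                          # (the tool's CM3 is compound-wise)
    return cm0, cm2, cm3

# three hypothetical models predicting five test compounds
preds = np.array([[5.1, 6.0, 4.7, 7.2, 5.6],
                  [5.3, 6.4, 4.9, 6.8, 5.4],
                  [4.9, 6.1, 4.6, 7.0, 5.7]])
mae_cv = [0.30, 0.25, 0.40]
print(consensus_predictions(preds, mae_cv))
```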

The tool further provides additional selection criteria which include:

(a) Euclidean distance cut-off: this is used to find a fitting model for predicting a test set compound, where the 10 most similar training compounds are selected based on the Euclidean distance score. The user can set a Euclidean cut-off ranging from 0 to 1 to restrict the selection to only those training set compounds with a Euclidean distance score less than or equal to the set cut-off value.

(b) Applicability domain: the AD check establishes whether the test/query compound lies within the chemical space of the model. A simple standardization approach is used for AD determination.

(c) Dixon Q test: this test can be employed to spot and remove a response outlier from the selected similar training set compounds.

The complete calculation method is demonstrated in the published article by Roy et al., and the methodology is outlined in Fig. 3. The ICP method has found good application in the prediction of pharmaceuticals (Khan et al. 2019a), organic chemicals and dyes (Roy et al. 2019; Khan and Roy 2019; Ghosh and Roy 2019; Ojha et al. 2020), aquatic toxicity (Hossain and Roy 2018), inhalational toxicity (Nath et al. 2022), polymer properties (Khan et al. 2018), etc.

(iii) Prediction Reliability Indicator tool

Fig. 3 “Intelligent Consensus Prediction” algorithm

A QSAR model is developed from the physicochemical features of an appropriately designed training set with experimentally derived response data, and the model is validated using one or more test set(s) for which experimental response data are also available. Whether such a model can provide reliable predictions for a completely new dataset (a true external set) is quite an interesting question. Roy et al. (2018a, b) developed a scheme (Fig. 4) to define the reliability of predictions from QSAR models for new query compounds and implemented the method in a tool called “Prediction Reliability Indicator”, freely available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. The tool is applicable to predictions from MLR and PLS models. The work aimed at formulating a set of rules/criteria that ultimately enable the user to estimate the quality of predictions for individual test (external) compounds. The quality of predictions for a test/external set can vary: a model may predict some external compounds well while showing moderate to bad/unreliable predictions for others. Keeping this variation in mind, the authors proposed three rules/criteria that help classify the quality of predictions for individual test/external set compounds into good, moderate, and poor/unreliable ones. The three rules are briefly discussed below:

(a) Rule/criterion 1: the scoring is based on the quality of the leave-one-out predictions of the 10 training compounds closest to a test/external compound. The 10 most similar training compounds are identified for each test/query compound (based on Euclidean distance similarity), after which the mean absolute LOO prediction error (\({\mathrm{MAE}}_{\mathrm{LOO}}\)) is calculated for these 10 closest compounds. A test/query compound whose closest training compounds have the lowest \({\mathrm{MAE}}_{\mathrm{LOO}}\) is considered well predicted and receives the highest prediction score (prediction score = 3). Test/query compounds whose close training compounds have medium \({\mathrm{MAE}}_{\mathrm{LOO}}\) values receive a moderate score (prediction score = 2), and those whose close training compounds have high \({\mathrm{MAE}}_{\mathrm{LOO}}\) values receive the lowest score (prediction score = 1). The MAE-based criteria (Roy et al. 2016), involving \({\mathrm{MAE}}_{\mathrm{LOO}}\) and the standard deviation (\({\sigma }_{\mathrm{LOO}}\)) of the absolute prediction errors, are applied here for scoring the compounds.

(b) Rule/criterion 2: scoring based on the similarity-based AD using the standardization method. The applicability domain (AD) of a model plays an important role in identifying the uncertainty in the prediction of a specific (test/query) chemical by that model, based on how similar the test/query compound is to the training set compounds. When a test/query compound is similar to only a small fraction of the training compounds, or to none of them, the prediction is considered unreliable, i.e., the compound does not fall within the training set AD. Here, a simple AD based on the standardization approach (Roy et al. 2015a, b) is employed.

(c) Rule/criterion 3: scoring based on the proximity of the prediction to the mean of the observed/experimental training set responses. It has been observed that the quality of fit or prediction is better for compounds (training or test) whose experimental response values lie close to the training set observed response mean. Thus, in rule/criterion 3, the authors proposed to assess the prediction quality of a test compound based on the closeness of its predicted response value to the training set observed/experimental response mean. The predicted response value (\({Y}_{\mathrm{pred}}^{\mathrm{test}}\)) of each test compound is first obtained using the training set model, and then this \({Y}_{\mathrm{pred}}^{\mathrm{test}}\) value is compared with the training set experimental response mean (\({Y}_{\mathrm{mean}}^{\mathrm{train}}\)) and the corresponding standard deviation (\({\sigma }^{\mathrm{train}}\)). The scoring is done in the following manner:

Fig. 4 Methodology applied for scoring test/query compounds in the “Prediction Reliability Indicator” tool

(i) A test compound with a \({Y}_{\mathrm{pred}}^{\mathrm{test}}\) value falling within the range \({Y}_{\mathrm{mean}}^{\mathrm{train}}\pm 2{\sigma }^{\mathrm{train}}\), i.e., \({(Y}_{\mathrm{mean}}^{\mathrm{train}}+2{\sigma }^{\mathrm{train}}) \ge {Y}_{\mathrm{pred}}^{\mathrm{test}} \ge ({Y}_{\mathrm{mean}}^{\mathrm{train}}-2{\sigma }^{\mathrm{train}})\), is assumed to be well (good) predicted by the model and thus receives a score of 3.

(ii) A test compound with a \({Y}_{\mathrm{pred}}^{\mathrm{test}}\) value falling within the range \({(Y}_{\mathrm{mean}}^{\mathrm{train}}+3{\sigma }^{\mathrm{train}}) \ge {Y}_{\mathrm{pred}}^{\mathrm{test}} \ge ({Y}_{\mathrm{mean}}^{\mathrm{train}}-3{\sigma }^{\mathrm{train}})\) but outside the range \({Y}_{\mathrm{mean}}^{\mathrm{train}}\pm 2{\sigma }^{\mathrm{train}}\) is presumed to be moderately predicted by the model and thus receives a score of 2.

(iii) A test compound with a \({Y}_{\mathrm{pred}}^{\mathrm{test}}\) value falling outside the range \({Y}_{\mathrm{mean}}^{\mathrm{train}}\pm 3{\sigma }^{\mathrm{train}}\), i.e., \({Y}_{\mathrm{pred}}^{\mathrm{test}} > {(Y}_{\mathrm{mean}}^{\mathrm{train}}+3{\sigma }^{\mathrm{train}})\) or \({Y}_{\mathrm{pred}}^{\mathrm{test}} < ({Y}_{\mathrm{mean}}^{\mathrm{train}}-3{\sigma }^{\mathrm{train}})\), is assumed to be poorly predicted by the model and thus receives a score of 1.

Furthermore, after these three criteria are checked, a weighting scheme is employed to compute a composite score for judging the prediction quality of each test compound using all three individual scores. The composite score is defined as follows:

$${\text{Composite score}} = W_{1} \times {\text{score}}_{{{\text{rule1}}}} + W_{2} \times {\text{score}}_{{{\text{rule2}}}} + W_{3} \times {\text{score}}_{{{\text{rule3}}}} .$$

Here, \({\mathrm{score}}_{\mathrm{rule}1}, {\mathrm{score}}_{\mathrm{rule}2}\), and \({\mathrm{score}}_{\mathrm{rule}3}\) represent the scores obtained after applying the respective rules, whereas \({W}_{1}\), \({W}_{2}\), and \({W}_{3}\) indicate the weights (automatic or user-provided) given to each of the three individual scores. The PRI tool thus offers a composite score that can act as a marker of the prediction quality of a true external test compound. The tool has found application for the prediction of external set/query compounds in many areas, viz., endocrine-disrupting chemicals (Khan et al. 2019b), metal oxide nanoparticles (De et al. 2018), organic chemicals (Khan and Roy 2019; Khan et al. 2019c; De et al. 2020, 2022; Nath et al. 2022), etc.
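A minimal sketch of rule/criterion 3 and the composite score is given below; the function names and the weights are illustrative assumptions, not the tool's defaults.

```python
import numpy as np

def rule3_score(y_pred_test, y_train):
    """Rule/criterion 3 of the Prediction Reliability Indicator scheme (as described
    above): score each prediction by its closeness to the training response mean."""
    mean, sd = np.mean(y_train), np.std(y_train, ddof=1)
    dev = np.abs(np.asarray(y_pred_test) - mean)
    return np.where(dev <= 2 * sd, 3, np.where(dev <= 3 * sd, 2, 1))

def composite_score(s1, s2, s3, w=(0.5, 0.25, 0.25)):
    """Weighted composite of the three rule scores; the weights here are purely
    illustrative."""
    return w[0] * np.asarray(s1) + w[1] * np.asarray(s2) + w[2] * np.asarray(s3)

# hypothetical example
y_train = np.array([4.5, 5.0, 5.2, 6.1, 5.8, 4.9])
s3 = rule3_score([5.1, 8.9, 6.0], y_train)
print(s3, composite_score([3, 2, 1], [3, 3, 2], s3))
```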

(iv) Small dataset modeler (version 1.0.0) tool

Various specialized datasets involving nanomaterials, catalyst properties, radiosensitizer molecules, etc., contain a small number of data points, for which division of the data into training and test sets may not produce robust and predictive models. A small dataset with 25–50 compounds cannot be used for conventional double cross-validation, since dividing the dataset into training and test sets, and further into calibration and validation sets, is not feasible. Ambure et al. have developed a tool called the Small Dataset Modeler, version 1.0.0 (http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/), solely for small datasets; it includes a double cross-validation approach to develop models for a small number of data points without dividing the dataset into training and test sets (Ambure et al. 2019) (Fig. 5). Here, the whole input set (containing n compounds) enters a loop where it is repeatedly split into calibration and validation sets (as in the inner loop of the DCV). All possible combinations (k) of validation sets of r compounds and calibration sets of n−r compounds are generated. The tool asks the user for the number of compounds (r) in the validation set, based on which all probable combinations of calibration and validation sets are produced. Multiple linear regression (MLR) models are generated from the calibration set compounds using the genetic algorithm–multiple linear regression (GA-MLR) method of variable selection (Devillers 1996; Venkatasubramanian and Sundaram 2002), while the validation sets are employed to judge the predictive ability of the models. Numerous important internal (\({R}^{2}, {R}_{\mathrm{adj}}^{2}, {Q}_{\mathrm{LMO}}^{2},{\mathrm{MAE}}_{\mathrm{LOO}},{r}_{m}^{2}\left(\mathrm{LOO}\right)\)) and external (\({Q}_{F1}^{2}, {Q}_{F2}^{2}, {r}_{m}^{2}\left(\mathrm{test}\right),\mathrm{ CCC}, {\mathrm{MAE}}_{\mathrm{test}}\)) validation metrics are computed in the exhaustive DCV procedure for all the candidate models. The tool is also designed to develop partial least squares regression (PLS-R) models based on the descriptors selected in the MLR models. The final top model can be selected in any of the following five recommended ways:

  • (i) Any model (MLR/PLS) with the smallest MAE (95%) in the validation set is chosen.

  • (ii) Any model (MLR/PLS) with the smallest MAE (95%) in the modeling set is chosen.

  • (iii) Any model (MLR/PLS) with the highest \({Q}_{\mathrm{Leave}-\mathrm{Many}-\mathrm{Out}}^{2}\) (modeling set) is chosen.

  • (iv) Implementing consensus predictions using the best models that are chosen depending on the MAE (95%) in the validation sets. Consensus predictions can be of two types: (a) Using a simple arithmetic average of predictions of the best models. (b) Using a weighted average of predictions (WAP) by assigning proper weights to the top chosen models depending on the mean absolute error obtained from leave-one-out cross-validation,\({\mathrm{MAE}}_{\mathrm{cv}}\left(95\mathrm{\%}\right)\).

  • (v) A pool of exclusive descriptors from the best models with the smallest \(\mathrm{MAE }\left(95\mathrm{\%}\right)\) obtained from the validation set is again employed to build models. In the case of MLR, the best descriptor combinations are put through the best subset selection method, whereas in the case of PLS, the descriptors selected in the top models are pooled together for a PLS run.

Fig. 5 Methodology behind the “Small Dataset Modeler” (version 1.0.0) tool to perform QSAR modeling for a small set of data points

The method proposed in the “Small Dataset Modeler” tool relies on internal divisions of small datasets within the DCV technique without setting aside any test set. The approach integrates data curation, the exhaustive DCV technique, and suitable modeling strategies entailing consensus predictions to develop models principally for small sets of data points; the methodology is schematically presented in Fig. 5. Small dataset modeling has found use in environmental toxicity modeling, including the acute toxicity of antifungal agents toward fish species (Nath et al. 2021) and soil ecotoxicity (Lavado et al. 2022), radiosensitization modeling (De and Roy 2020), modeling of Hepatitis C virus protein inhibitors (Ejeh et al. 2021), and modeling of anesthetics causing GABA inhibition (Stošić et al. 2020).
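To illustrate the exhaustive splitting at the heart of this approach, a minimal sketch is given below; descriptor selection, consensus prediction, and the full metric suite of the tool are deliberately omitted, and all data are placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def exhaustive_small_dataset_cv(X, y, r=4):
    """Exhaustive calibration/validation splitting for a small dataset: every
    combination of r compounds forms a validation set and the remaining n-r
    compounds form the calibration set; the validation MAE of an MLR model is
    recorded for each split."""
    n = len(y)
    results = []
    for val_idx in combinations(range(n), r):
        cal_idx = [i for i in range(n) if i not in val_idx]
        model = LinearRegression().fit(X[cal_idx], y[cal_idx])
        mae = np.mean(np.abs(y[list(val_idx)] - model.predict(X[list(val_idx)])))
        results.append((val_idx, mae))
    return sorted(results, key=lambda t: t[1])           # best (lowest MAE) splits first

# hypothetical small dataset: 15 compounds, 3 descriptors, validation sets of 4
X, y = np.random.rand(15, 3), np.random.rand(15)
print(exhaustive_small_dataset_cv(X, y)[:3])
```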

(v) Read-Across-v3.1 tool

The read-across methodology has gained immense attention in recent years because it is a non-testing approach that can be utilized for data-gap filling. The basic aim of the read-across technique is to predict endpoint information for one or more chemicals (the target chemicals) using data for the same endpoint from other substances (the source chemicals), based on the similarity principle. The method is widely used as an alternative tool for hazard assessment to fill data gaps (ECHA 2011). Read-across-based predictions are particularly fitting for small datasets (limited source compounds); hence, the approach has provided promising results in nanosafety assessment, where data are limited. Chatterjee and co-workers (2022) developed a new prediction-oriented quantitative read-across approach based on certain similarity principles, and the reported work verifies the efficiency of the newly developed read-across algorithm in filling nanosafety data gaps. A tool has been developed to facilitate the implementation of the approach (Fig. 6) for quantitative read-across; it is available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home. The tool allows users to optimize different hyperparameters, including the similarity kernel functions and the distance and similarity thresholds, to obtain the best quality of quantitative predictions. Three types of similarity estimation are implemented: Euclidean distance, the Gaussian kernel function, and the Laplacian kernel function. The algorithm developed in this study was optimized using three small nanotoxicity datasets (n ≤ 20). It is based on two basic steps: (a) finding the 10 most similar training compounds for each query or test compound, and (b) calculating the weighted average prediction for each test compound from its most similar training set compounds. Hyperparameters such as the sigma and gamma values of the Gaussian and Laplacian kernel functions have been optimized, and the effect of the number of close training compounds on the prediction quality has been evaluated; 2–5 close training compounds can efficiently predict the toxicity of query compounds. Another feature incorporated in the tool is a distance threshold for the Euclidean distance similarity estimation and a similarity threshold for the Gaussian and Laplacian kernel similarity estimations; the best predictions were generated at a distance threshold of 0.4–0.5 and a similarity threshold of 0.00–0.05. The algorithm is easy to use, proficient, and an expert-independent alternative method for nanoparticle toxicity prediction, which can further assist in data-gap filling and prioritization. Version 3.1 of the tool also computes classification-based validation metrics and generates a receiver operating characteristic (ROC) curve for the predictions, which can be used to estimate the uncertainty of predictions. The tool is also applicable to endpoints other than nanotoxicity, for example the activity/toxicity/property of organic compounds in general.
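A minimal sketch of the similarity-weighted prediction step is given below, assuming a Gaussian kernel on scaled descriptors; the exact kernel parameterization, the thresholds, and the hyperparameter optimization of the actual tool are not reproduced here, and all inputs are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist

def read_across_predict(X_train, y_train, X_query, k=5, sigma=1.0, sim_threshold=0.0):
    """Similarity-based quantitative read-across sketch: for each query compound,
    take the k most similar (Gaussian-kernel) training compounds whose similarity
    exceeds sim_threshold and return their similarity-weighted average response."""
    d = cdist(X_query, X_train)                      # Euclidean distances
    sim = np.exp(-(d ** 2) / (2 * sigma ** 2))       # Gaussian kernel similarity
    preds = []
    for s in sim:
        idx = np.argsort(s)[::-1][:k]                # k nearest source compounds
        idx = idx[s[idx] > sim_threshold]
        w = s[idx] / s[idx].sum()
        preds.append(float(np.dot(w, y_train[idx])))
    return np.array(preds)

# hypothetical scaled descriptors for source (training) and target (query) compounds
X_train, y_train = np.random.rand(15, 4), np.random.rand(15)
X_query = np.random.rand(3, 4)
print(read_across_predict(X_train, y_train, X_query))
```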

Fig. 6 Quantitative read-across algorithm

Future perspectives

Over the past few decades, the QSAR methodology has received both praise and criticism regarding its reliability, limitations, successes, and failures. The above discussion of the DTC Laboratory tools provides methods and information relating to QSAR model development and validation, pointing out current trends, unresolved problems, and persistent challenges associated with the evolution of QSAR. There remains scope for further refining the present tools, such as including the computation of Golbraikh and Tropsha’s criteria (Golbraikh and Tropsha 2002) in the Double Cross Validation tool and the computation of leave-many-out cross-validation (\({Q}_{\mathrm{LMO}}^{2}\)) for both the Double Cross Validation tool and the Small Dataset Modeler tool (PLS version). Additionally, there is an opportunity to incorporate an uncertainty measure of predictions in the read-across tool, which would improve the reliability of quantitative predictions for untested molecules.

Conclusion

The QSAR domain has expanded substantially in the past few years as databases and their applications have grown. As the field of QSAR evolves through the decades, it is necessary to evaluate the effectiveness of QSAR models in predicting the behavior of new molecules. A QSAR model stands on the pillars of the various validation metrics used to assess the quality of a predictive model and to portray the true picture of the prediction errors. The present review explains various internal and external validation metrics necessary for assessing model predictivity. Furthermore, a brief description of several innovative QSAR modeling tools developed by the Drug Theoretics and Cheminformatics (DTC) Laboratory (https://sites.google.com/site/kunalroyindia/home/qsar-model-development-tools) is given to aid better selection and development of models. These tools address various aspects such as the selection of the training set, the model development methodology, model selection techniques, the use of multiple models, and the scoring of query compounds, and these improvements have helped enhance the quality of predictions of QSAR models. The tools greatly assist in estimating the reliability of predictions for untested chemicals when experimental data are unavailable. However, most of these tools cannot be used for classification-based/graded data, being best suited to quantitative models such as MLR and PLS regression. Furthermore, the tools play a major role in various fields, predicting chemicals relevant to the pharmaceutical industry, cosmeceuticals, polymer chemistry, diagnostic agents, dyes, nano-chemistry, food chemistry, etc.