
1 Introduction

The limitations of in vivo and in vitro approaches for determination of the biological activity of chemicals have fostered the development of in silico approaches [1]. In silico predictive toxicology is designed to complement experimental efforts with a view toward improving the quality of toxicity predictions for safety assessment while decreasing the associated time, cost, and ethical conflicts (animal testing) [2,3,4]. Methodology for in silico predictive toxicology has been dominated by (quantitative) structure–activity or structure–toxicity relationship [(Q)SAR or (Q)STR] modeling (hereafter called SAR). Traditional SAR models describe a relationship between the chemical structure of molecules (numerically encoded as molecular descriptors) and their activity against a specific biological target [1]. This is achieved by establishing a trend in the molecular descriptor space that links to a biological activity. Thus, all SAR models are developed on the similarity principle: molecules with similar structures (and, consequently, similar descriptors) are expected to have similar biological activity [4, 5]. A SAR model to predict toxicity (T) is given in Eq. (1)

$$T = g(D_{f})$$
(1)

where \(D_{f}\) represents the feature space of molecular descriptors encoding chemical properties and \(g\) is a function that relates \(T\) to \(D_{f}\) [2]. The accuracy of the model, that is, of the function \(g\), depends on choosing the most representative set of molecular descriptors, those that encode the properties of the molecules most useful for prediction.

Molecular descriptors, being numerical features extracted from molecular structures, are the most common variables used for SAR-based toxicity prediction modeling [6]. The information encoded by descriptors depends on the molecular representation or “dimensionality” of the compound as well as the algorithm used to calculate the descriptors [7]. One-dimensional (1D) descriptors are scalars encoding physicochemical properties (molecular weight, logP) and constitutional parameters, such as number of atoms, bond count, atom type, ring count, and fragment counts. 1D descriptors are insensitive to the topology of the molecule and tend to be similar for distinct compounds; as a result, they are often used in combination with other descriptors. Two-dimensional (2D) descriptors are more frequently used for chemical space description. 2D descriptors, including topological indices and structural fragments, are calculated from the connection table (chemical graph) representation of a molecule. They are not only independent of the conformation of the molecule but also graph invariant (insensitive to the renumbering of graph nodes). Three-dimensional (3D) descriptors provide a more complete characterization of molecular structures. 3D descriptors require conformational searching and can discriminate between isomers; this comes at the price of being computationally expensive. The ability to discriminate between isomers can translate to less redundant features. Examples of 3D descriptors include geometric, electrostatic, quantum chemical, and WHIM and GETAWAY descriptors. Four-dimensional (4D) descriptors are much like 3D descriptors but evaluate multiple structural conformations simultaneously. Fingerprints are another form of molecular descriptors [7,8,9]. Commonly used fingerprints include the Molecular ACCess System (MACCS) [10] substructure fingerprints, PubChem [11], and extended-connectivity fingerprints (ECFP) [12]. These fingerprints and 2D descriptors were widely used in the Tox21 data challenge [13], where the winning submissions used over 2500 predefined features covering a wide range of data from topological and physical properties to fingerprints [14].
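For readers who wish to experiment, the following sketch computes a few 1D descriptors and the MACCS and ECFP (Morgan) fingerprints for a single molecule. It assumes the open-source RDKit toolkit is installed, and the SMILES string is purely illustrative.

```python
# A minimal descriptor/fingerprint sketch; RDKit is assumed to be available.
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys, AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used only as an example

# 1D descriptors: scalar physicochemical/constitutional properties
mol_weight = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
n_rings = Descriptors.RingCount(mol)

# 2D fingerprints: MACCS keys (predefined substructures) and an
# extended-connectivity (Morgan) fingerprint hashed to a fixed length
maccs = MACCSkeys.GenMACCSKeys(mol)                        # 167-bit vector
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

print(mol_weight, logp, n_rings, maccs.GetNumOnBits(), ecfp4.GetNumOnBits())
```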

As shown above, the chemical structures used in SAR modeling are characterized by many molecular descriptors; it is common to generate thousands of descriptors for a single molecule [14]. It is well known that the accuracy of predictive models does not increase with the dimensionality of the data, as overfitting tends to become an issue [15,16,17]. High-dimensional spaces are prone to include irrelevant and noisy features [18], and SARs developed using such features tend to focus on the peculiarities of individual molecules and fail to generalize [19]. For a given library, each descriptor adds a dimension to the n-dimensional chemical space, and every molecule in the library is assigned a coordinate determined by its values for all the descriptors. A reduction in the dimensionality of the chemical space correlates with an increasing similarity between molecules, which matters because the underlying assumption in SAR modeling is that molecules with similar structures should have similar activity [20, 21]. Thus, one of the most important tasks prior to modeling is dimension reduction: keeping the most relevant descriptors that carry the maximum amount of biologically meaningful information for predicting the desired toxicity end point. Shen et al. [13] demonstrated the usefulness of feature selection for toxicity prediction, particularly for interpreting the role of the features; by reducing the feature space, they were able to pinpoint MolRef and AlogP as the most important descriptors for predicting the toxicity of aromatic compounds.

In simple terms, dimensionality reduction is considered desirable for activity prediction modeling for the following reasons [22]:

  (i) Employing fewer descriptors means that the model can focus on the information important for establishing a relationship, thus improving prediction accuracy and reducing overfitting (models with many features enjoy more discriminating power during training but are often not generalizable).

  (ii) As the number of features decreases, the interpretability of certain models increases.

  (iii) Computational costs are reduced significantly, since the complexity of many learning algorithms grows faster than linearly with the number of features [19, 23].

  (iv) Elimination of irrelevant descriptors can help remove activity cliffs [7].

  (v) Machine learning algorithms are statistical in nature and hence suffer from the “curse of dimensionality”, which is common to optimization problems, as described by Bellman [24].

As the dimensionality increases, the amount of data needed to develop generalizable models increases exponentially [25, 26]. SAR data rarely have an abundance of labeled molecules and, as such, the final model and resulting toxicity prediction will benefit from a reduction in dimension, since a smaller dimension means fewer samples are required during training. The optimal subset of a feature space is one that has the fewest dimensions yet offers the best learning accuracy [26]. Two techniques used to alleviate the challenges of high dimensionality in SAR datasets are feature selection and feature extraction.

In this review, we discuss methods for both feature selection and feature extraction, as well as their applications in SAR modeling. In the next two sections, we discuss feature selection and feature extraction methods in turn. In the last section, we highlight important aspects that must be considered when attempting feature space reduction, such as the stability and validation of the methods.

2 Feature Selection

Feature selection works by selecting, on the basis of certain relevance criteria, a subset of features from the original feature set and removing irrelevant features without altering the original representation of the data [18, 26,27,28]. The physical meanings of the features are retained.

Mathematically, given a descriptor space \(X = \{ x_{i},\; i = 1, \ldots, n \}\), the task is to find a subset \(Y_{k}\) (with k < n) that maximizes an objective function \(J(Y_{k})\), for example the probability \(P\) that a compound is correctly predicted as active or inactive, as in Eq. (2).

$$Y_{k} = \left\{ x_{(1)}, x_{(2)}, \ldots, x_{(k)} \right\} = \mathop{\mathrm{argmax}}\limits_{Y_{k} \subseteq X} J\left( Y_{k} \right)$$
(2)

Thus, the ultimate goal of feature selection is to define a subset \(Y_{k}\) of relevant descriptors (obtained from the initial set of descriptors \(X\)) which holds the most useful molecular structure information for learning the underlying pattern present in the data.

One pronounced benefit of feature selection is that it can be used to avoid overfitting. Models with high dimension offer many degrees of freedom and tend to learn random patterns and noise instead of the important underlying relationships between descriptors and the target end point [29, 30]. Many feature selection algorithms have been documented. Broadly, these algorithms can be grouped into three categories depending on the availability of class labels for the training set: supervised [22, 25, 28, 31], semi-supervised [18, 32], and unsupervised [18, 33]. The choice of an appropriate method depends on the learning algorithm to be employed and the data to be used [34]. The focus of this review is on supervised feature selection methods. Supervised feature selection requires that the entire training dataset be labeled; selection is achieved by eliminating descriptors that have a low correlation with the toxicity end point to be predicted [28]. Feature selection methods applied to supervised tasks can be classified into filter, wrapper, and embedded methods [28]. We discuss each of these methods and further describe hybrid [35, 36] and ensemble [37,38,39] methods, which blend the methods listed earlier. These methods are illustrated in Fig. 7.1.

Fig. 7.1

An illustration of different feature selection methods: a filter, b wrapper, c embedded, d hybrid, e ensemble

2.1 Filter

Filter methods evaluate the relevance of a feature based on its intrinsic properties and are completely independent of the learning algorithm [18, 27, 28, 40]. The majority of filter methods are univariate, where each feature is considered independently of the rest of the feature space. Multivariate methods, such as correlation-based scores and paired t-scores, have also been used to assess the relevance of feature pairs and how well they synergize to enhance prediction of the desired end point [41]. Filter methods are computationally efficient and fast in comparison with wrapper methods. Their lack of dependence on any learning algorithm means that the features they select can be used with almost any learning algorithm; however, this independence often results in varied performance across different learning algorithms [28]. Many statistical filter criteria also assume that the data to which they are applied are normally distributed [40]. By not taking the learning algorithm into consideration, filter methods ignore the heuristics and biases of these algorithms, which may impair their predictive abilities [25].

Filter methods use feature ranking and filtering as the basis for selection. Features are first evaluated and ranked according to a criterion; a threshold is then applied, and all features scoring above it are considered relevant for predicting the end point [18, 28, 41], as shown in Fig. 7.1a. The elimination of low-variance and highly correlated descriptors is a common filtering technique applied to SAR datasets [14, 23, 42]. Several criteria have been employed for filtering descriptors, including the variance score [32], correlation coefficient [25, 34], Fisher score [28, 43], and information gain [44].
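As a simple illustration, the sketch below chains two common filter steps, a variance threshold followed by univariate ranking, using scikit-learn; the synthetic data stand in for a descriptor matrix X and binary toxicity labels y, and the thresholds are illustrative choices.

```python
# A minimal filter-method sketch with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

# Step 1: drop near-constant descriptors (variance score criterion)
var_filter = VarianceThreshold(threshold=0.01)
X_var = var_filter.fit_transform(X)

# Step 2: rank the remaining descriptors with a univariate criterion
# (here the ANOVA F-score) and keep the top 50
kbest = SelectKBest(score_func=f_classif, k=50)
X_filtered = kbest.fit_transform(X_var, y)

print(X.shape, "->", X_filtered.shape)
```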

2.2 Wrapper

Wrapper methods use learning algorithms to evaluate the relevance of a feature, where the learning algorithm’s error rate or accuracy is treated as the objective function/criterion for evaluating a feature. A wrapper method begins by selecting a subset of the features heuristically or sequentially, and then a learning algorithm of choice is used to evaluate this subset. This process of subset generation and testing is repeated until the desired objective function is achieved [27, 28] (Fig. 7.1b). Wrappers tend to perform better than filters in selecting features since they consider feature dependencies and directly incorporate the specific biases and heuristics of the learning algorithm into the selection process. However, this implies that the selected features are unlikely to be optimal for any other classifiers [18].

The size of the search space for m features is \(O(2^{m})\) [28]. Since exhaustively evaluating the subsets of such a search space is an NP-hard problem, the computational inefficiency of wrappers becomes evident on larger datasets. However, search algorithms have been proposed for selecting near-optimal subsets of the feature space. Broadly, we consider two groups of search strategies for wrappers: sequential and heuristic selection algorithms [25].

2.2.1 Sequential Selection Algorithms

Sequential selection can be achieved in two ways: forward selection and backward elimination. Sequential forward selection (SFS) begins with an empty set of features, and features are progressively incorporated into larger and larger subsets (one at a time) until no further improvement is recorded in the evaluation criterion. A backward elimination algorithm begins with the full set of features and iteratively eliminates the least relevant features [28].

The sequential floating forward selection (SFFS) [45, 46] algorithm has been suggested as an improvement over SFS because it includes flexible backtracking capabilities. Like SFS, SFFS adds one feature at a time as determined by the objective function. After each addition, it backtracks by removing one feature at a time from the current subset and re-evaluating; if the objective function improves, the removed feature is left out, and the algorithm moves on to add a new feature. This process continues iteratively until the desired goal is met with the fewest number of features.
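The sketch below illustrates plain sequential forward selection with scikit-learn's SequentialFeatureSelector; the k-nearest-neighbor estimator and the target of 10 features are illustrative choices. Floating variants such as SFFS are not provided by scikit-learn but are available in third-party packages (for example, mlxtend's selector with a floating option).

```python
# A minimal sequential-forward-selection (wrapper) sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=5),
                                n_features_to_select=10,
                                direction="forward",   # "backward" for elimination
                                cv=5)
sfs.fit(X, y)
print("Selected descriptor indices:", sfs.get_support(indices=True))
```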

2.2.2 Heuristic Selection Algorithms

Heuristic search algorithms evaluate different subsets to optimize the objective function. Subsets can be generated by searching a solution space or by generating candidate solutions to the optimization problem, with the learning algorithm’s performance serving as the objective function [25]. Simulated annealing (SA) [47] and genetic algorithms (GA) [48], two widely used heuristic algorithms, can be used to find a subset of features for wrappers, and a hybrid of the two has also been suggested [49]. In a GA, each chromosome is a bit string whose bits indicate whether a feature is included or not. SA, a stochastic algorithm, searches for the global optimum of a function by repeatedly improving an initial solution with small local perturbations until no such perturbation yields an improvement in the objective function; the process is randomized so that occasional, intentional deviations from the current solution are accepted, which lessens the probability of becoming stuck in local optima. The use of GA to preselect descriptor subsets for SAR modeling of artificial and real data was shown to be successful in [13], where 2D descriptors were employed to discriminate between active and inactive compounds. Particle swarm optimization (PSO) [47] and ant colony optimization (ACO) [50] algorithms may also be employed for heuristic subset search; for instance, the ACO algorithm has been shown to be useful for selecting descriptors for predicting cyclooxygenase inhibitors [50].
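As a rough illustration of a GA-driven wrapper (not a reference implementation of any published method), the sketch below evolves bit-string chromosomes whose fitness is the cross-validated accuracy of a simple classifier on the selected descriptor columns; the population size, mutation rate, and logistic-regression evaluator are arbitrary illustrative choices.

```python
# A self-contained GA wrapper sketch on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           random_state=0)
n_features, pop_size, n_generations = X.shape[1], 20, 15

def fitness(mask):
    """Cross-validated accuracy of the learner on the selected descriptor columns."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

# Each chromosome is a bit string; bit j == 1 means "include descriptor j"
population = rng.integers(0, 2, size=(pop_size, n_features))

for _ in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[-pop_size // 2:]]  # keep the fittest half
    children = []
    while len(children) < pop_size - len(parents):
        p1, p2 = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)                      # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < 0.02                   # bit-flip mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    population = np.vstack([parents] + children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("Selected", int(best.sum()), "of", n_features, "descriptors")
```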

2.3 Embedded

Embedded feature selection methods incorporate feature selection into the model training process. Embedded feature learning, much like wrapper methods, takes potential dependencies among features into consideration while being more computationally efficient and less prone to overfitting than wrappers [18, 27, 28, 41]. A common embedded approach is the random forest, an ensemble of decision-tree learners, such as trees built with ID3 or C4.5, that have a built-in mechanism for feature selection [28, 51]. The base learners, i.e., decision trees, consider each feature in the feature space individually and assign it an importance based on how much it contributes to the model attaining an optimal fit. Features with the lowest importance are discarded, and the forest with the fewest features and highest predictive performance is selected [28] (Fig. 7.1c). Using the top 20 molecular descriptors from the random forest predictor importance method, Newby et al. [44] obtained more accurate decision tree classification models in most cases, compared to the use of filter methods such as information gain, chi-square, and greedy search.
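A minimal sketch of embedded selection via random-forest importances with scikit-learn is shown below; the cutoff of 20 descriptors mirrors the example above but is otherwise arbitrary, as are the synthetic data.

```python
# Embedded selection via random-forest feature importances (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=200, n_informative=15,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Impurity-based importances; keep the 20 highest-ranked descriptors
top20 = np.argsort(forest.feature_importances_)[::-1][:20]
X_reduced = X[:, top20]
print("Top descriptor indices:", top20)
```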

Pruning is another embedded feature selection approach that has been applied to neural networks as well as classical learning algorithms, notably support vector machines (SVMs) [25]. For instance, SVM-recursive feature elimination (SVM-RFE) begins with all the features and recursively removes those that do not contribute positively to the model’s predictive accuracy. To determine the optimal number of features for an RFE-based model, cross-validation is used to evaluate the candidate subsets and select the one with the best performance. Hence, RFE can select the best features for a specific learning algorithm, but it is considered computationally expensive because it traverses the features one after the other [41]. Weighted kernels [49] and regularization methods [52], such as the lasso, ridge, and elastic net, have also gained prominence.
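The following sketch shows SVM-RFE with cross-validated selection of the subset size, using scikit-learn's RFECV; the linear SVM, step size, and synthetic data are illustrative choices.

```python
# SVM-RFE with cross-validated subset-size selection (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

selector = RFECV(estimator=LinearSVC(max_iter=5000),
                 step=1,            # remove one feature per iteration
                 cv=5,
                 scoring="accuracy")
selector.fit(X, y)
print("Optimal number of descriptors:", selector.n_features_)
```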

2.4 Hybrid and Ensemble Feature Selection

Hybrid methods for feature selection combine at least two different methods, usually applied in succession, and attempt to exploit the strengths of the constituent methods while offsetting their individual weaknesses. The most frequently reported combination in the literature is that of filter and wrapper methods, whose use has been widely reported for biomedical data [35]. Hsu et al. [49] separately filtered two sets of features using the F-score or information gain as the filtering criterion; the resulting features were combined and further treated with wrappers (Fig. 7.1d). They reported improved predictions compared with using filters alone and a decreased computational time compared with using wrappers only. Reddy et al. [53] applied a hybrid GA-based descriptor optimization technique to consistently select descriptor subsets that represented the whole initial descriptor space. The weights of the selected subsets were analyzed to understand the contribution of each feature to the prediction of HIV protease inhibitors, revealing the role of hydrophobic interactions and demonstrating the interpretability of the method.

Ensemble methods apply a feature selection method to different subsets obtained with subsampling strategies such as bootstrapping; the features selected from each subset are then aggregated using means, weights, or simple linear aggregation [38, 39] (Fig. 7.1e). This approach is often used to deal with the perturbation sensitivity and instability experienced by most feature selection methods. Seijo-Pardo et al. [39] provided an in-depth discussion of ensemble methods for feature selection. Dutta et al. [54] proposed an ensemble descriptor selection that searches for descriptor subsets using a genetic algorithm whose objective function is a linear combination of the root-mean-square error (RMSE) of all the models in the ensemble; the resulting model showed good performance on the PDGFR and COX-2 datasets. A 96% reduction in noise and an improvement in performance were reported by Zhu et al. [55], who used a recursive random forest to rule out a quarter of the least important descriptors at each iteration. This approach performed better than the least absolute shrinkage and selection operator (LASSO). The authors highlighted that the difference between the prediction performance of random forest and LASSO resulted mainly from the use of variables selected by different strategies, rather than from differences between the learning algorithms.
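A rough sketch of ensemble feature selection by data perturbation is given below: the same filter is applied to many bootstrap samples, and descriptors are retained by selection frequency. The filter, the number of bootstraps, and the consensus threshold are illustrative, not tied to any of the cited studies.

```python
# Ensemble (bootstrap-aggregated) feature selection sketch with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)
n_bootstraps, k = 50, 20
counts = np.zeros(X.shape[1])

for b in range(n_bootstraps):
    Xb, yb = resample(X, y, random_state=b)          # bootstrap sample of the data
    selected = SelectKBest(f_classif, k=k).fit(Xb, yb).get_support()
    counts += selected                               # vote for each selected descriptor

# Keep descriptors selected in at least 60% of the bootstrap runs
stable_subset = np.where(counts / n_bootstraps >= 0.6)[0]
print("Consensus subset:", stable_subset)
```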

We have summarized the characteristics, strengths, and weaknesses of the five classes of feature selection methods described above in Table 7.1 in order to assist a user in choosing the appropriate tool based on user-specific requirements and/or goals.

Table 7.1 A summary of feature selection techniques

3 Feature Extraction

The algorithms employed for the mathematical representation of molecular descriptors and fingerprints are independent of molecular size, allowing the generation of a fixed-length set of descriptors for every molecule regardless of its size [7]. The generation of fixed-length vectors can introduce redundant descriptors for certain molecules within a library. An optimized feature set obtained by feature extraction can minimize redundancy, noise, and correlation between descriptors, and consequently yield classifiers with improved prediction accuracy [20].

A mathematical description of feature extraction is as follows: given a descriptor space \(x \in R^{n}\), find a mapping \(y = f(x)\) that produces a transformed feature vector \(y \in R^{k}\) with k < n. The vector \(y\) should preserve most of the molecular information contained in \(R^{n}\). The goal is to reduce the dimension without negatively impacting the prediction performance; an optimal mapping \(y = f(x)\) is one that minimizes the prediction error.

Feature extraction transforms the initial feature space to a new, lower dimension feature space by combining the features in the original space. As a result, it is difficult to associate the new features with the old. Further analysis, such as feature importance explanation, becomes very difficult as there is no physical meaning for the newly mapped features that are obtained from feature extraction. Here we discuss some commonly used feature extraction techniques.

3.1 Principal Component Analysis

Principal component analysis (PCA) is a multivariate, nonparametric method employed for dimensionality reduction [56, 57]. It works by forming linear combinations of the features, referred to as principal components, that capture the maximum variance. At its core, PCA is centered on determining the eigenvectors of the input data’s covariance matrix. This linear transformation can minimize redundancy and reduce the number of features, thereby concentrating the information in the resulting features. Each principal component is a combination of several original features. The principal components are also uncorrelated: the first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible [26]. A detailed discussion of the different applications of PCA in SAR modeling was provided in [57]. Klepsch et al. [58] applied PCA to a curated P-glycoprotein inhibitor data set of 1608 compounds, where the first two principal components were reported to explain 71.7% of the variance in the dataset. This approach was applied to classification, and an analysis of the effect of the initial descriptors on these two components showed that hydrophobic information, such as the number of aromatic bonds and the partition coefficient, was the major contributor to the principal components. According to [59], 2-aryl-1,3,4-thiadiazole derivatives were classified into distinct clusters of active or inactive molecules when PCA was performed instead of using all of the calculated descriptors.

Considering that principal components are combinations of the original features, all the original features are still represented within the components. This is useful for model interpretation because knowing which original features contribute to a component can reveal the types of features that are closely related. A key limitation of PCA is that it cannot handle data with complicated structure that is not well represented in a linear subspace [60]. Kernel PCA (KPCA) [61, 62] was designed as the nonlinear counterpart of PCA: kernel functions implicitly perform a nonlinear mapping of the input space into a feature space, and linear PCA is then performed in that feature space. KPCA-generated vectors have been used to train SVM models [59], and KPCA was shown to be efficient over a wide range of virtual screening dataset inputs using MACCS and ECFP fingerprints. It was also observed that the KPCA embedding depended largely on the properties of the underlying representation, as its performance on the ECFP fingerprint varied with the hashing employed.
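A minimal sketch of linear PCA and kernel PCA with scikit-learn is shown below; the number of components and the RBF kernel with its gamma value are illustrative choices, and the synthetic matrix stands in for a descriptor table.

```python
# PCA and kernel PCA with scikit-learn on synthetic descriptor data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to descriptor scale

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_std)
print("Variance explained by the first two components:",
      pca.explained_variance_ratio_[:2].sum())

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.01)
X_kpca = kpca.fit_transform(X_std)
```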

3.2 Autoencoder

Autoencoders [63, 64] are unsupervised neural networks with an odd number of hidden layers that can be applied for nonlinear feature extraction. They are trained with the backpropagation algorithm to reproduce their input at the output layer by minimizing the reconstruction error between the output and the input. The network architecture can be designed such that the middle layer is smaller, i.e., has fewer nodes than the input and output layers (Fig. 7.2). In that case, the network is forced to learn a compact representation (embedding) of the input data [65]. In an early work, Hinton et al. [17] demonstrated that autoencoders generated embeddings of images from which the images could be reconstructed. A major drawback of autoencoders is that the physical meaning of the features is lost, limiting theoretical insight. They are also complex to train because they typically require a large amount of training data and a search over many possible hyperparameter values. Blaschke et al. [66] employed generative autoencoders to design new molecules in silico based on the recreated output layer. Burgoon [67] used autoencoders to screen chemicals for potential estrogenic activity by projecting the two neurons in the middle layer onto a Cartesian plane. The application of autoencoders to toxicity prediction, especially for feature extraction, has not been widely reported and remains an opportunity for future research.

Fig. 7.2

An autoencoder indicating the reduced dimension in the middle layer
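A minimal sketch of an undercomplete autoencoder for descriptor compression is given below, assuming TensorFlow/Keras is available; the 2048-bit input (e.g., a hashed fingerprint), layer sizes, and training settings are illustrative, and random bits stand in for real fingerprints.

```python
# An undercomplete autoencoder sketch with Keras (TensorFlow assumed installed).
import numpy as np
from tensorflow.keras import layers, models

input_dim, embedding_dim = 2048, 32
X = (np.random.rand(1000, input_dim) > 0.9).astype("float32")  # stand-in for fingerprints

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(256, activation="relu")(inputs)
bottleneck = layers.Dense(embedding_dim, activation="relu")(encoded)  # reduced middle layer
decoded = layers.Dense(256, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, bottleneck)      # used to extract embeddings later

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)   # target equals input

embeddings = encoder.predict(X)                 # compressed 32-dimensional features
print(embeddings.shape)
```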

3.3 Linear Discriminant Analysis

Like PCA, linear discriminant analysis (LDA) [65, 68] is a linear transformation technique commonly used for dimensionality reduction. However, LDA is supervised, since the discriminative power of the features is taken into consideration. LDA computes an optimal transformation (projection) of the input data onto a line such that the classes are separated as clusters. The goal of the projection is to ensure maximum class discrimination by minimizing the within-class distance while maximizing the between-class distance [26]. A weakness of LDA is that if the distribution of a dataset is significantly non-Gaussian, the LDA projections will not preserve any complex structure in the data [69], and the resulting features may not have good discriminative power. Ren et al. [70] used LDA in a stepwise forward manner to extract features from a combined pool of experimental data and chemical structure-based descriptors for predicting the aquatic toxicity mode of action. In that work, logistic regression showed better predictive performance than LDA using the extracted features, with a 7.3% improvement over previously reported classification rates.
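A minimal sketch of LDA as a supervised projection with scikit-learn is shown below; for a binary end point at most one discriminant axis exists, hence n_components = 1, and the synthetic data are illustrative.

```python
# Supervised projection with LDA (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)   # projection maximizing between-class separation
print(X_proj.shape)                # (300, 1)
```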

In addition to the techniques mentioned above, there are also spectral and manifold learning methods, such as t-distributed stochastic neighbor embedding (t-SNE) [71], multi-dimensional scaling (MDS) [72], spectral embedding [73], and Isomap [74]. Manifold learning, a class of unsupervised nonlinear algorithms, assumes that the dimensionality of a dataset is only artificially high and thus attempts to uncover its intrinsic low-dimensional structure. Typically, these algorithms first compute the similarities between points to build a nearest-neighbor graph and then solve an eigenproblem to embed the high-dimensional points into a lower-dimensional space [75].
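As a brief illustration, the sketch below embeds a synthetic descriptor matrix into two dimensions with scikit-learn's t-SNE, as is typically done for visualizing a chemical space; the perplexity value is an illustrative choice.

```python
# Nonlinear manifold embedding with t-SNE (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, _ = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)   # (300, 2)
```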

4 Miscellaneous

4.1 Feature Stability

It is common to use the performance of a model as the metric for evaluating the suitability of a feature reduction algorithm, so it is an obvious choice to optimize the selection process for the best prediction power possible. However, the stability, or degree of variance, of feature selection methods becomes a crucial concern when the task at hand goes beyond optimizing prediction accuracy to include improving interpretability. A simple scenario is the use of substructure-based descriptors for SAR modeling: a substructure that is very relevant for prediction is commonly interpreted as a major contributor to the activity of the molecule, implying a potential research target. However, many feature selection algorithms tend to be unstable and will yield a different subset if a small perturbation is applied (i.e., when new training samples are added or some training samples are removed). If every perturbation results in wide variation in the selected subset, it is difficult to conclude that a feature is important to the molecule’s activity.

Kalousis et al. [76] defined the stability of a feature selection algorithm as “the robustness of the feature subset the algorithm produces in the presence of perturbations in training sets drawn from the same generating distribution.” Essentially, stability quantifies how different training sets affect the variation in the selected feature subset; hence, a similarity measure is often employed to quantify it. A reliable algorithm should produce the same or a similar subset for any perturbation of the training data. Alelyani et al. [77] performed experiments to investigate the causes of instability and reported that the dimension, sample size, and distribution of the training data influence stability: a larger sample size translated into improved stability, while larger dimensions had a negative effect. Thus, researchers should pay attention to the characteristics of the training dataset. Certain algorithms are also more prone to instability than others: ReliefF-based feature selection is affected by the order of samples in a training set, while stochastic search algorithms like GA that use random initialization parameters tend to yield unstable subsets [78, 79]. Various metrics for measuring stability have been proposed [78]. To mitigate the stability problem, ensemble selection strategies matched to the technicalities of the selection algorithm in use have been suggested [78, 80, 81]; these include bootstrap sampling, random data partitioning, parameter randomization, or a combination of several of these. Developing feature selection algorithms that are both stable and highly predictive is still an open and challenging area. SAR-based toxicity prediction stands to gain a lot from such techniques, which can improve the speed and accuracy of predictions for regulatory as well as lead optimization purposes.
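One simple way to quantify stability, sketched below, is to apply the same filter to bootstrap samples of the data and compute the average pairwise Jaccard similarity of the selected subsets; the filter, subset size, and number of bootstraps are illustrative, and other stability indices (e.g., Kuncheva's) can be substituted.

```python
# A Jaccard-based stability estimate for a filter selector (scikit-learn).
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

subsets = []
for b in range(20):
    Xb, yb = resample(X, y, random_state=b)                  # perturbed training set
    idx = SelectKBest(f_classif, k=15).fit(Xb, yb).get_support(indices=True)
    subsets.append(set(idx))

jaccard = [len(s1 & s2) / len(s1 | s2) for s1, s2 in combinations(subsets, 2)]
print("Mean pairwise Jaccard stability:", np.mean(jaccard))
```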

4.2 Validation of Feature Selection

In selecting the optimal feature subset, it is common to evaluate the performance of a learner based on its prediction error. A very common and overlooked mistake is to select features using the entire dataset as a preprocessing step. While this appears to be obviously wrong, it has been reported that many researchers, especially in the biomedical fields, continue to make this mistake and successfully publish in top-ranking journals [82, 83]. If a test set is to be used to evaluate the performance of a feature set, it must not be involved in the feature selection step as that will result in a selection bias that will yield overly optimistic performance estimates. This is because the features used will have an unfair advantage since they were chosen based on all of the samples. As a result, the model would have gained insight into the features which are more important in the test set. This challenge is more common with wrapper methods [83].

In many practical cases of SAR-based toxicity modeling, there is rarely a large number of compounds across the different end points to be predicted. This makes it difficult to set aside a reasonable batch of data for evaluation purposes. Methods such as cross-validation and bootstrap sampling can be used to avoid sampling bias [34, 82, 83]; cross-validation techniques like leave-one-out cross-validation (LOOCV) and the k-fold method have been suggested. Feature selection must be done in the inner loop of the cross-validation procedure; hence, for a k-fold technique the algorithm takes the following form [82] (a minimal code sketch of this procedure is given after the list):

  (i) Randomly shuffle the dataset.

  (ii) Randomly split the dataset into K folds.

  (iii) For each fold k = 1, 2, …, K:

    (a) Perform feature selection, using all the data except the kth fold, to obtain an optimal subset with good univariate correlation with the desired end point.

    (b) Using the selected features, build a multivariate model with all the data except the kth fold.

    (c) Evaluate the model on the kth fold.

  (iv) Aggregate the performance across all K folds to obtain an unbiased evaluation.
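A minimal sketch of this procedure with scikit-learn is given below: placing the univariate selector inside a Pipeline guarantees that, in every fold, selection is fitted only on the training portion and never sees the held-out fold. The selector, learner, and k = 20 are illustrative choices.

```python
# Avoiding feature selection bias by nesting selection inside cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),        # step (iii)(a), fitted per fold
    ("model", LogisticRegression(max_iter=1000)),    # step (iii)(b)
])

scores = cross_val_score(pipe, X, y, cv=5)           # steps (iii)(c) and (iv)
print("Unbiased CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```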

5 Summary

QSAR-based predictive toxicity modeling methods are faced with input spaces of thousands of features. To improve the ability of a learner to find a generalizable relationship between molecular descriptors and the toxicity end point of interest, it is expedient to provide the learning algorithm with the minimum number of descriptors while ensuring that the resulting model is interpretable and computationally inexpensive to build. The relevance of a descriptor is assessed by its ability to discriminate between classes in qualitative classification or by its correlation with the continuous end point in quantitative prediction.

In this review, we have discussed different feature selection and extraction methods applicable to SAR-based toxicity modeling and highlighted the strengths and weaknesses of each method. The choice of method should largely depend on the available dataset, and we suggest beginning a new task by establishing baseline performance values from a number of methods, since no single approach is universally superior. Where the importance of individual descriptors is sought, feature selection methods such as filter, wrapper, and embedded methods, or their combinations (hybrid and ensemble), are applicable. Feature extraction methods transform the features into a lower dimension while altering their physical meaning, so more analysis may be required to interpret the extracted features. The stability of selected features and proper feature subset validation are often overlooked. Feature selection bias can be avoided by embedding the feature selection process within the inner loop of a cross-validation procedure, which prevents overly optimistic performance estimates. Although dimensionality reduction has been shown to improve model performance, there is still room for improvement in evaluating and validating feature selection and extraction methods and their stability. For the sake of reproducibility, researchers are encouraged to publish the important parameters of the feature selection or extraction methods they employ, such as the threshold for a variance score. Regardless of the choice of features (molecular descriptors, fingerprints, or a combination) used for modeling, SAR models can benefit from dimensionality reduction techniques.