
1 Introduction

The limitations of in vivo and in vitro approaches for determination of the biological activity of chemicals have fostered the development of in silico approaches [1]. In silico predictive toxicology is designed to complement experimental efforts with a view toward improving the quality of toxicity predictions for safety assessment while decreasing the associated time, cost, and ethical conflicts (animal testing) [2,3,4]. Methodology for in silico predictive toxicology has been dominated by (quantitative) structure–activity or structure–toxicity relationship [(Q)SAR or (Q)STR] modeling (hereafter called SAR). Traditional SAR models describe a relationship between the chemical structure of molecules (numerically encoded as molecular descriptors) and their activity against a specific biological target [1]. This is achieved by establishing a trend in the molecular descriptor space that links to a biological activity. Thus, all SAR models are developed on the similarity principle: molecules with similar structures (and, consequently, similar descriptors) are expected to have similar biological activity [4, 5]. A SAR model to predict toxicity (T) is given in Eq. (1)

$$T = g(D_{f})$$
(1)

where \(D_{f}\) represents the feature space of molecular descriptors encoding chemical properties and \(g\) is a function that relates \(T\) to \(D_{f}\) [2]. The accuracy of the model, that is, of the function \(g\), depends on choosing the most representative set of molecular descriptors, those that encode the properties of the molecules most useful for prediction.

Molecular descriptors, being numerical features extracted from molecular structures, are the most common variables used for SAR-based toxicity prediction modeling [6]. The information encoded by descriptors depends on the molecular representation or “dimensionality” of the compound as well as the algorithm used to calculate the descriptors [7]. One-dimensional (1D) descriptors are scalars encoding physicochemical properties (molecular weight, logP) and constitutional parameters, such as number of atoms, bond count, atom type, ring count, and fragment counts. 1D descriptors are insensitive to the topology of the molecule and tend to be similar for distinct compounds; as a result, they are often used in combination with other descriptors. Two-dimensional (2D) descriptors are more frequently used for chemical space description. 2D descriptors, including topological indices and structural fragments, are calculated from the connection table (chemical graph) representation of a molecule. They are not only independent of the conformation of the molecule but also graph invariant (insensitive to the renumbering of graph nodes). Three-dimensional (3D) descriptors provide a more complete characterization of molecular structures. 3D descriptors require conformational searching and can discriminate between isomers; this comes at the price of being computationally expensive. The ability to discriminate between isomers can translate to less redundant features. Examples of 3D descriptors include geometric, electrostatic, quantum chemical, and WHIM and GETAWAY descriptors. Four-dimensional (4D) descriptors are much like 3D descriptors but evaluate multiple structural conformations simultaneously. Fingerprints are another form of molecular descriptors [7,8,9]. Commonly used fingerprints include the Molecular ACCess System (MACCS) [10] substructure fingerprints, PubChem [11], and extended-connectivity fingerprints (ECFP) [12]. These fingerprints and 2D descriptors were widely used in the Tox21 data challenge [13], where the winning submissions used over 2500 predefined features covering a wide range of data from topological and physical properties to fingerprints [14].
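For readers who wish to experiment, the following sketch computes a few 1D descriptors and the MACCS and ECFP (Morgan) fingerprints for a single molecule. It assumes the open-source RDKit toolkit is installed, and the SMILES string is purely illustrative.

```python
# A minimal descriptor/fingerprint sketch; RDKit is assumed to be available.
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys, AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used only as an example

# 1D descriptors: scalar physicochemical/constitutional properties
mol_weight = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
n_rings = Descriptors.RingCount(mol)

# 2D fingerprints: MACCS keys (predefined substructures) and an
# extended-connectivity (Morgan) fingerprint hashed to a fixed length
maccs = MACCSkeys.GenMACCSKeys(mol)                        # 167-bit vector
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

print(mol_weight, logp, n_rings, maccs.GetNumOnBits(), ecfp4.GetNumOnBits())
```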

As shown above, the chemical structures used in SAR modeling are characterized by many molecular descriptors; it is common to generate thousands of descriptors for a single molecule [14]. It is well known that the accuracy of predictive models does not increase with the dimensionality of the data, as overfitting tends to become an issue [15,16,17]. High-dimensional spaces are prone to include irrelevant and noisy features [18], and SARs developed using such features tend to focus on the peculiarities of individual molecules and fail to generalize [19]. For a given library, each descriptor adds a dimension to the n-dimensional chemical space, and every molecule in the library is assigned a coordinate determined by its values for all the descriptors. A reduction in the dimensionality of the chemical space correlates with an increasing similarity between molecules, which matters because the underlying assumption in SAR modeling is that molecules with similar structures should have similar activity [20, 21]. Thus, one of the most important tasks prior to modeling is dimension reduction: keeping the most relevant descriptors that carry the maximum amount of biologically meaningful information for predicting the desired toxicity end point. Shen et al. [13] demonstrated the usefulness of feature selection for toxicity prediction, particularly for interpreting the role of the features; by reducing the feature space, they were able to pinpoint MolRef and AlogP as the most important descriptors for predicting the toxicity of aromatic compounds.

In simple terms, dimensionality reduction is considered desirable for activity prediction modeling for the following reasons [22]:

  (i) Employing fewer descriptors means that the model can focus on the information important for establishing a relationship, thus improving prediction accuracy and reducing overfitting (models with many features enjoy more discriminating power during training but are often not generalizable).

  (ii) As the number of features decreases, the interpretability of certain models increases.

  (iii) Computational costs are reduced significantly, since the complexity of many learning algorithms grows faster than linearly with the number of features [19, 23].

  (iv) Elimination of irrelevant descriptors can help remove activity cliffs [7].

  (v) Machine learning algorithms are statistical in nature and hence suffer from the “curse of dimensionality”, which is common to optimization problems, as described by Bellman [24].

As the dimensionality increases, the amount of data needed to develop generalizable models increases exponentially [25, 26]. SAR data rarely have an abundance of labeled molecules and, as such, the final model and resulting toxicity prediction will benefit from a reduction in dimension, since a smaller dimension means fewer samples are required during training. The optimal subset of a feature space is one that has the fewest dimensions yet offers the best learning accuracy [26]. Two techniques used to alleviate the challenges of high dimensionality in SAR datasets are feature selection and feature extraction.

In this review, we discuss methods for both feature selection and feature extraction, as well as their applications in SAR modeling. In the next two sections, we discuss feature selection and feature extraction methods in turn. In the last section, we highlight important aspects that must be considered when attempting feature space reduction, such as the stability and validation of the methods.

2 Feature Selection

Feature selection works by selecting, on the basis of certain relevance criteria, a subset of features from the original feature set and removing irrelevant features without altering the original representation of the data [18, 26,27,28]. The physical meanings of the features are retained.

Mathematically, given a descriptor space \(X = \{ x_{i},\; i = 1, \ldots, n \}\), the task is to find a subset \(Y_{k}\) (with k < n) that maximizes an objective function \(J(Y_{k})\), for example the probability \(P\) that a compound is correctly predicted as active or inactive, as in Eq. (2).

$$Y_{k} = \left\{ x_{(1)}, x_{(2)}, \ldots, x_{(k)} \right\} = \mathop{\mathrm{argmax}}\limits_{Y_{k} \subseteq X} J\left( Y_{k} \right)$$
(2)

Thus, the ultimate goal of feature selection is to define a subset \(Y_{k}\) of relevant descriptors (obtained from the initial set of descriptors \(X\)) which holds the most useful molecular structure information for learning the underlying pattern present in the data.

One pronounced benefit of feature selection is that it can be used to avoid overfitting. Models with high dimension offer many degrees of freedom and tend to learn random patterns and noise instead of the important underlying relationships between descriptors and the target end point [29, 30]. Many feature selection algorithms have been documented. Broadly, these algorithms can be grouped into three categories depending on the availability of class labels for the training set: supervised [22, 25, 28, 31], semi-supervised [18, 32], and unsupervised [18, 33]. The choice of an appropriate method depends on the learning algorithm to be employed and the data to be used [34]. The focus of this review is on supervised feature selection methods. Supervised feature selection requires that the entire training dataset be labeled; selection is achieved by eliminating descriptors that have a low correlation with the toxicity end point to be predicted [28]. Feature selection methods applied to supervised tasks can be classified into filter, wrapper, and embedded methods [28]. We discuss each of these methods and further describe hybrid [35, 36] and ensemble [37,38,39] methods, which blend the methods listed earlier. These methods are illustrated in Fig. 7.1.

Fig. 7.1

An illustration of different feature selection methods: a filter, b wrapper, c embedded, d hybrid, e ensemble

2.1 Filter

Filter methods evaluate the relevance of a feature based on its intrinsic properties and are completely independent of the learning algorithm [18, 27, 28, 40]. The majority of filter methods are univariate, where each feature is considered independently of the rest of the feature space. Multivariate methods, such as correlation-based scores and paired t-scores, have also been used to assess the relevance of feature pairs and how well they synergize to enhance prediction of the desired end point [41]. Filter methods are computationally efficient and fast in comparison with wrapper methods. Their lack of dependence on any learning algorithm means that the features they select can be used with almost any learning algorithm; however, this independence often results in varied performance across different learning algorithms [28]. Many statistical filter criteria also assume that the data to which they are applied are normally distributed [40]. By not taking the learning algorithm into consideration, filter methods ignore the heuristics and biases of these algorithms, which may impair their predictive abilities [25].

Filter methods use feature ranking and filtering as the basis for selection. Features are first evaluated and ranked according to a criterion; a threshold is then applied, and all features scoring above it are considered relevant for predicting the end point [18, 28, 41], as shown in Fig. 7.1a. The elimination of low-variance and highly correlated descriptors is a common filtering technique applied to SAR datasets [14, 23, 42]. Several criteria have been employed for filtering descriptors, including the variance score [32], correlation coefficient [25, 34], Fisher score [28, 43], and information gain [44].
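As a simple illustration, the sketch below chains two common filter steps, a variance threshold followed by univariate ranking, using scikit-learn; the synthetic data stand in for a descriptor matrix X and binary toxicity labels y, and the thresholds are illustrative choices.

```python
# A minimal filter-method sketch with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

# Step 1: drop near-constant descriptors (variance score criterion)
var_filter = VarianceThreshold(threshold=0.01)
X_var = var_filter.fit_transform(X)

# Step 2: rank the remaining descriptors with a univariate criterion
# (here the ANOVA F-score) and keep the top 50
kbest = SelectKBest(score_func=f_classif, k=50)
X_filtered = kbest.fit_transform(X_var, y)

print(X.shape, "->", X_filtered.shape)
```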

2.2 Wrapper

Wrapper methods use learning algorithms to evaluate the relevance of a feature, where the learning algorithm’s error rate or accuracy is treated as the objective function/criterion for evaluating a feature. A wrapper method begins by selecting a subset of the features heuristically or sequentially, and then a learning algorithm of choice is used to evaluate this subset. This process of subset generation and testing is repeated until the desired objective function is achieved [27, 28] (Fig. 7.1b). Wrappers tend to perform better than filters in selecting features since they consider feature dependencies and directly incorporate the specific biases and heuristics of the learning algorithm into the selection process. However, this implies that the selected features are unlikely to be optimal for any other classifiers [18].

The size of the search space for m features is \(O(2^{m})\) [28]. Since exhaustively evaluating the subsets of such a search space is an NP-hard problem, the computational inefficiency of wrappers becomes evident on larger datasets. However, search algorithms have been proposed for selecting near-optimal subsets of the feature space. Broadly, we consider two groups of search strategies for wrappers: sequential and heuristic selection algorithms [25].

2.2.1 Sequential Selection Algorithms

Sequential selection can be achieved in two ways: forward selection and backward elimination. Sequential forward selection (SFS) begins with an empty set of features, and features are progressively incorporated into larger and larger subsets (one at a time) until no further improvement is recorded in the evaluation criterion. A backward elimination algorithm begins with the full set of features and iteratively eliminates the least relevant features [28].

The sequential floating forward selection (SFFS) [45, 46] algorithm has been suggested as an improvement over SFS because it includes flexible backtracking capabilities. Like SFS, SFFS adds one feature at a time as determined by the objective function. After each addition, it backtracks by removing one feature at a time from the current subset and re-evaluating; if the objective function improves, the removed feature is left out, and the algorithm moves on to add a new feature. This process continues iteratively until the desired goal is met with the fewest number of features.
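The sketch below illustrates plain sequential forward selection with scikit-learn's SequentialFeatureSelector; the k-nearest-neighbor estimator and the target of 10 features are illustrative choices. Floating variants such as SFFS are not provided by scikit-learn but are available in third-party packages (for example, mlxtend's selector with a floating option).

```python
# A minimal sequential-forward-selection (wrapper) sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=5),
                                n_features_to_select=10,
                                direction="forward",   # "backward" for elimination
                                cv=5)
sfs.fit(X, y)
print("Selected descriptor indices:", sfs.get_support(indices=True))
```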

2.2.2 Heuristic Selection Algorithms

Heuristic search algorithms evaluate different subsets to optimize the objective function. Subsets can be generated by searching a solution space or by generating candidate solutions to the optimization problem, with the learning algorithm’s performance serving as the objective function [25]. Simulated annealing (SA) [47] and genetic algorithms (GA) [48], two widely used heuristic algorithms, can be used to find a subset of features for wrappers, and a hybrid of the two has also been suggested [49]. In a GA, each chromosome is a bit string whose bits indicate whether a feature is included or not. SA, a stochastic algorithm, searches for the global optimum of a function by repeatedly improving an initial solution with small local perturbations until no such perturbation yields an improvement in the objective function; the process is randomized so that occasional, intentional deviations from the current solution are accepted, which lessens the probability of becoming stuck in local optima. The use of GA to preselect descriptor subsets for SAR modeling of artificial and real data was shown to be successful in [13], where 2D descriptors were employed to discriminate between active and inactive compounds. Particle swarm optimization (PSO) [47] and ant colony optimization (ACO) [50] algorithms may also be employed for heuristic subset search; for instance, the ACO algorithm has been shown to be useful for selecting descriptors for predicting cyclooxygenase inhibitors [50].
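As a rough illustration of a GA-driven wrapper (not a reference implementation of any published method), the sketch below evolves bit-string chromosomes whose fitness is the cross-validated accuracy of a simple classifier on the selected descriptor columns; the population size, mutation rate, and logistic-regression evaluator are arbitrary illustrative choices.

```python
# A self-contained GA wrapper sketch on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           random_state=0)
n_features, pop_size, n_generations = X.shape[1], 20, 15

def fitness(mask):
    """Cross-validated accuracy of the learner on the selected descriptor columns."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

# Each chromosome is a bit string; bit j == 1 means "include descriptor j"
population = rng.integers(0, 2, size=(pop_size, n_features))

for _ in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[-pop_size // 2:]]  # keep the fittest half
    children = []
    while len(children) < pop_size - len(parents):
        p1, p2 = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)                      # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < 0.02                   # bit-flip mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    population = np.vstack([parents] + children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("Selected", int(best.sum()), "of", n_features, "descriptors")
```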

2.3 Embedded

Embedded feature selection methods incorporate feature selection into the model training process. Embedded feature learning, much like wrapper methods, takes potential dependencies among features into consideration while being more computationally efficient and less prone to overfitting than wrappers [18, 27, 28, 41]. A common embedded approach is the random forest, an ensemble of decision-tree learners, such as trees built with ID3 or C4.5, that have a built-in mechanism for feature selection [28, 51]. The base learners, i.e., decision trees, consider each feature in the feature space individually and assign it an importance based on how much it contributes to the model attaining an optimal fit. Features with the lowest importance are discarded, and the forest with the fewest features and highest predictive performance is selected [28] (Fig. 7.1c). Using the top 20 molecular descriptors from the random forest predictor importance method, Newby et al. [44] obtained more accurate decision tree classification models in most cases, compared to the use of filter methods such as information gain, chi-square, and greedy search.
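A minimal sketch of embedded selection via random-forest importances with scikit-learn is shown below; the cutoff of 20 descriptors mirrors the example above but is otherwise arbitrary, as are the synthetic data.

```python
# Embedded selection via random-forest feature importances (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=200, n_informative=15,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Impurity-based importances; keep the 20 highest-ranked descriptors
top20 = np.argsort(forest.feature_importances_)[::-1][:20]
X_reduced = X[:, top20]
print("Top descriptor indices:", top20)
```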

Pruning is another embedded feature selection approach that has been applied to neural networks as well as classical learning algorithms, notably support vector machines (SVMs) [25]. For instance, SVM-recursive feature elimination (SVM-RFE) begins with all the features and recursively removes those that do not contribute positively to the model’s predictive accuracy. To determine the optimal number of features for an RFE-based model, cross-validation is used to evaluate the candidate subsets and select the one with the best performance. Hence, RFE can select the best features for a specific learning algorithm, but it is considered computationally expensive because it traverses the features one after the other [41]. Weighted kernels [49] and regularization methods [52], such as the lasso, ridge, and elastic net, have also gained prominence.
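The following sketch shows SVM-RFE with cross-validated selection of the subset size, using scikit-learn's RFECV; the linear SVM, step size, and synthetic data are illustrative choices.

```python
# SVM-RFE with cross-validated subset-size selection (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

selector = RFECV(estimator=LinearSVC(max_iter=5000),
                 step=1,            # remove one feature per iteration
                 cv=5,
                 scoring="accuracy")
selector.fit(X, y)
print("Optimal number of descriptors:", selector.n_features_)
```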

2.4 Hybrid and Ensemble Feature Selection

Hybrid methods for feature selection combine at least two different methods, usually applied in succession, and attempt to exploit the strengths of the constituent methods while offsetting their individual weaknesses. The most frequently reported combination in the literature is that of filter and wrapper methods, whose use has been widely reported for biomedical data [35]. Hsu et al. [49] separately filtered two sets of features using the F-score or information gain as the filtering criterion; the resulting features were combined and further treated with wrappers (Fig. 7.1d). They reported improved predictions compared with using filters alone and a decreased computational time compared with using wrappers only. Reddy et al. [53] applied a hybrid GA-based descriptor optimization technique to consistently select descriptor subsets that represented the whole initial descriptor space. The weights of the selected subsets were analyzed to understand the contribution of each feature to the prediction of HIV protease inhibitors, revealing the role of hydrophobic interactions and demonstrating the interpretability of the method.

Ensemble methods apply a feature selection method to different subsets obtained with subsampling strategies such as bootstrapping; the features selected from each subset are then aggregated using means, weights, or simple linear aggregation [38, 39] (Fig. 7.1e). This approach is often used to deal with the perturbation sensitivity and instability experienced by most feature selection methods. Seijo-Pardo et al. [39] provided an in-depth discussion of ensemble methods for feature selection. Dutta et al. [54] proposed an ensemble descriptor selection that searches for descriptor subsets using a genetic algorithm whose objective function is a linear combination of the root-mean-square error (RMSE) of all the models in the ensemble; the resulting model showed good performance on the PDGFR and COX-2 datasets. A 96% reduction in noise and an improvement in performance were reported by Zhu et al. [55], who used a recursive random forest to rule out a quarter of the least important descriptors at each iteration. This approach performed better than the least absolute shrinkage and selection operator (LASSO). The authors highlighted that the difference between the prediction performance of random forest and LASSO resulted mainly from the use of variables selected by different strategies, rather than from differences between the learning algorithms.
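A rough sketch of ensemble feature selection by data perturbation is given below: the same filter is applied to many bootstrap samples, and descriptors are retained by selection frequency. The filter, the number of bootstraps, and the consensus threshold are illustrative, not tied to any of the cited studies.

```python
# Ensemble (bootstrap-aggregated) feature selection sketch with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)
n_bootstraps, k = 50, 20
counts = np.zeros(X.shape[1])

for b in range(n_bootstraps):
    Xb, yb = resample(X, y, random_state=b)          # bootstrap sample of the data
    selected = SelectKBest(f_classif, k=k).fit(Xb, yb).get_support()
    counts += selected                               # vote for each selected descriptor

# Keep descriptors selected in at least 60% of the bootstrap runs
stable_subset = np.where(counts / n_bootstraps >= 0.6)[0]
print("Consensus subset:", stable_subset)
```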

We have summarized the characteristics, strengths, and weaknesses of the five classes of feature selection methods described above in Table 7.1 in order to assist a user in choosing the appropriate tool based on user-specific requirements and/or goals.

Table 7.1 A summary of feature selection techniques

3 Feature Extraction

The algorithms employed for the mathematical representation of molecular descriptors and fingerprints are independent of molecular size, allowing the generation of a fixed-length set of descriptors for every molecule regardless of its size [7]. The generation of fixed-length vectors can introduce redundant descriptors for certain molecules within a library. An optimized feature set obtained by feature extraction can minimize redundancy, noise, and correlation between descriptors, and consequently yield classifiers with improved prediction accuracy [20].

A mathematical description of feature extraction is as follows: given a descriptor space \(x \in R^{n}\), find a mapping \(y = f(x)\) that produces a transformed feature vector \(y \in R^{k}\) with k < n. The vector \(y\) should preserve most of the molecular information contained in \(R^{n}\). The goal is to reduce the dimension without negatively impacting the prediction performance; an optimal mapping \(y = f(x)\) is one that minimizes the prediction error.

Feature extraction transforms the initial feature space to a new, lower dimension feature space by combining the features in the original space. As a result, it is difficult to associate the new features with the old. Further analysis, such as feature importance explanation, becomes very difficult as there is no physical meaning for the newly mapped features that are obtained from feature extraction. Here we discuss some commonly used feature extraction techniques.

3.1 Principal Component Analysis

Principal component analysis (PCA) is a multivariate, nonparametric method employed for dimensionality reduction [56, 57]. It works by forming linear combinations of the features, referred to as principal components, that capture the maximum variance. At its core, PCA is centered on determining the eigenvectors of the input data’s covariance matrix. This linear transformation can minimize redundancy and reduce the number of features, thereby concentrating the information in the resulting features. Each principal component is a combination of several original features. The principal components are also uncorrelated: the first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible [26]. A detailed discussion of the different applications of PCA in SAR modeling was provided in [57]. Klepsch et al. [58] applied PCA to a curated P-glycoprotein inhibitor data set of 1608 compounds, where the first two principal components were reported to explain 71.7% of the variance in the dataset. This approach was applied to classification, and an analysis of the effect of the initial descriptors on these two components showed that hydrophobic information, such as the number of aromatic bonds and the partition coefficient, was the major contributor to the principal components. According to [59], 2-aryl-1,3,4-thiadiazole derivatives were classified into distinct clusters of active or inactive molecules when PCA was performed instead of using all of the calculated descriptors.

Considering that principal components are combinations of the original features, all the original features are still represented within the components. This is useful for model interpretation because knowing which original features contribute to a component can reveal the types of features that are closely related. A key limitation of PCA is that it cannot handle data with complicated structure that is not well represented in a linear subspace [60]. Kernel PCA (KPCA) [61, 62] was designed as the nonlinear counterpart of PCA: kernel functions implicitly perform a nonlinear mapping of the input space into a feature space, and linear PCA is then performed in that feature space. KPCA-generated vectors have been used to train SVM models [59], and KPCA was shown to be efficient over a wide range of virtual screening dataset inputs using MACCS and ECFP fingerprints. It was also observed that the KPCA embedding depended largely on the properties of the underlying representation, as its performance on the ECFP fingerprint varied with the hashing employed.
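A minimal sketch of linear PCA and kernel PCA with scikit-learn is shown below; the number of components and the RBF kernel with its gamma value are illustrative choices, and the synthetic matrix stands in for a descriptor table.

```python
# PCA and kernel PCA with scikit-learn on synthetic descriptor data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to descriptor scale

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_std)
print("Variance explained by the first two components:",
      pca.explained_variance_ratio_[:2].sum())

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.01)
X_kpca = kpca.fit_transform(X_std)
```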

3.2 Autoencoder

Autoencoders [63, 64] are unsupervised neural networks with an odd number of hidden layers that can be applied for nonlinear feature extraction. They are trained with the backpropagation algorithm to reproduce their input at the output layer by minimizing the reconstruction error between the output and the input. The network architecture can be designed such that the middle layer is smaller, i.e., has fewer nodes than the input and output layers (Fig. 7.2). In that case, the network is forced to learn a compact representation (embedding) of the input data [65]. In an early work, Hinton et al. [17] demonstrated that autoencoders generated embeddings of images from which the images could be reconstructed. A major drawback of autoencoders is that the physical meaning of the features is lost, limiting theoretical insight. They are also complex to train because they typically require a large amount of training data and a search over many possible hyperparameter values. Blaschke et al. [66] employed generative autoencoders to design new molecules in silico based on the recreated output layer. Burgoon [67] used autoencoders to screen chemicals for potential estrogenic activity by projecting the two neurons in the middle layer onto a Cartesian plane. The application of autoencoders to toxicity prediction, especially for feature extraction, has not been widely reported and remains an opportunity for future research.

Fig. 7.2

An autoencoder indicating the reduced dimension in the middle layer
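A minimal sketch of an undercomplete autoencoder for descriptor compression is given below, assuming TensorFlow/Keras is available; the 2048-bit input (e.g., a hashed fingerprint), layer sizes, and training settings are illustrative, and random bits stand in for real fingerprints.

```python
# An undercomplete autoencoder sketch with Keras (TensorFlow assumed installed).
import numpy as np
from tensorflow.keras import layers, models

input_dim, embedding_dim = 2048, 32
X = (np.random.rand(1000, input_dim) > 0.9).astype("float32")  # stand-in for fingerprints

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(256, activation="relu")(inputs)
bottleneck = layers.Dense(embedding_dim, activation="relu")(encoded)  # reduced middle layer
decoded = layers.Dense(256, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, bottleneck)      # used to extract embeddings later

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)   # target equals input

embeddings = encoder.predict(X)                 # compressed 32-dimensional features
print(embeddings.shape)
```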

3.3 Linear Discriminant Analysis

Like PCA, linear discriminant analysis (LDA) [65, 68] is a linear transformation technique commonly used for dimensionality reduction. However, LDA is supervised, since the discriminative power of the features is taken into consideration. LDA computes an optimal transformation (projection) of the input data onto a line such that the classes are separated as clusters. The goal of the projection is to ensure maximum class discrimination by minimizing the within-class distance while maximizing the between-class distance [26]. A weakness of LDA is that if the distribution of a dataset is significantly non-Gaussian, the LDA projections will not preserve any complex structure in the data [69], and the resulting features may not have good discriminative power. Ren et al. [70] used LDA in a stepwise forward manner to extract features from a combined pool of experimental data and chemical structure-based descriptors for predicting the aquatic toxicity mode of action. In that work, logistic regression showed better predictive performance than LDA using the extracted features, with a 7.3% improvement over previously reported classification rates.
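A minimal sketch of LDA as a supervised projection with scikit-learn is shown below; for a binary end point at most one discriminant axis exists, hence n_components = 1, and the synthetic data are illustrative.

```python
# Supervised projection with LDA (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)   # projection maximizing between-class separation
print(X_proj.shape)                # (300, 1)
```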

In addition to the techniques mentioned above, there are also spectral and manifold learning methods, such as t-distributed stochastic neighbor embedding (t-SNE) [71], multi-dimensional scaling (MDS) [72], spectral embedding [73], and Isomap [74]. Manifold learning, a class of unsupervised nonlinear algorithms, assumes that the dimensionality of a dataset is only artificially high and thus attempts to uncover its intrinsic low-dimensional structure. Typically, these algorithms first compute the similarities between points to build a nearest-neighbor graph and then solve an eigenproblem to embed the high-dimensional points into a lower-dimensional space [75].
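As a brief illustration, the sketch below embeds a synthetic descriptor matrix into two dimensions with scikit-learn's t-SNE, as is typically done for visualizing a chemical space; the perplexity value is an illustrative choice.

```python
# Nonlinear manifold embedding with t-SNE (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, _ = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)   # (300, 2)
```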

4 Miscellaneous

4.1 Feature Stability

It is common to use the performance of a model as the metric for evaluating the suitability of a feature reduction algorithm, so it is an obvious choice to optimize the selection process for the best prediction power possible. However, the stability, or degree of variance, of feature selection methods becomes a crucial concern when the task at hand goes beyond optimizing prediction accuracy to include improving interpretability. A simple scenario is the use of substructure-based descriptors for SAR modeling: a substructure that is very relevant for prediction is commonly interpreted as a major contributor to the activity of the molecule, implying a potential research target. However, many feature selection algorithms tend to be unstable and will yield a different subset if a small perturbation is applied (i.e., when new training samples are added or some training samples are removed). If every perturbation results in wide variation in the selected subset, it is difficult to conclude that a feature is important to the molecule’s activity.

Kalousis et al. [76] defined the stability of a feature selection algorithm as “the robustness of the feature subset the algorithm produces in the presence of perturbations in training sets drawn from the same generating distribution.” Essentially, stability quantifies how different training sets affect the variation in the selected feature subset; hence, a similarity measure is often employed to quantify it. A reliable algorithm should produce the same or a similar subset for any perturbation of the training data. Alelyani et al. [77] performed experiments to investigate the causes of instability and reported that the dimension, sample size, and distribution of the training data influence stability: a larger sample size translated into improved stability, while larger dimensions had a negative effect. Thus, researchers should pay attention to the characteristics of the training dataset. Certain algorithms are also more prone to instability than others: ReliefF-based feature selection is affected by the order of samples in a training set, while stochastic search algorithms like GA that use random initialization parameters tend to yield unstable subsets [78, 79]. Various metrics for measuring stability have been proposed [78]. To mitigate the stability problem, ensemble selection strategies matched to the technicalities of the selection algorithm in use have been suggested [78, 80, 81]; these include bootstrap sampling, random data partitioning, parameter randomization, or a combination of several of these. Developing feature selection algorithms that are both stable and highly predictive is still an open and challenging area. SAR-based toxicity prediction stands to gain a lot from such techniques, which can improve the speed and accuracy of predictions for regulatory as well as lead optimization purposes.
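One simple way to quantify stability, sketched below, is to apply the same filter to bootstrap samples of the data and compute the average pairwise Jaccard similarity of the selected subsets; the filter, subset size, and number of bootstraps are illustrative, and other stability indices (e.g., Kuncheva's) can be substituted.

```python
# A Jaccard-based stability estimate for a filter selector (scikit-learn).
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

subsets = []
for b in range(20):
    Xb, yb = resample(X, y, random_state=b)                  # perturbed training set
    idx = SelectKBest(f_classif, k=15).fit(Xb, yb).get_support(indices=True)
    subsets.append(set(idx))

jaccard = [len(s1 & s2) / len(s1 | s2) for s1, s2 in combinations(subsets, 2)]
print("Mean pairwise Jaccard stability:", np.mean(jaccard))
```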

4.2 Validation of Feature Selection

In selecting the optimal feature subset, it is common to evaluate the performance of a learner based on its prediction error. A very common and overlooked mistake is to select features using the entire dataset as a preprocessing step. While this appears to be obviously wrong, it has been reported that many researchers, especially in the biomedical fields, continue to make this mistake and successfully publish in top-ranking journals [82, 83]. If a test set is to be used to evaluate the performance of a feature set, it must not be involved in the feature selection step as that will result in a selection bias that will yield overly optimistic performance estimates. This is because the features used will have an unfair advantage since they were chosen based on all of the samples. As a result, the model would have gained insight into the features which are more important in the test set. This challenge is more common with wrapper methods [83].

In many practical cases of SAR-based toxicity modeling, there is rarely a large number of compounds across the different end points to be predicted. This makes it difficult to set aside a reasonable batch of data for evaluation purposes. Methods such as cross-validation and bootstrap sampling can be used to avoid sampling bias [34, 82, 83]; cross-validation techniques like leave-one-out cross-validation (LOOCV) and the k-fold method have been suggested. Feature selection must be done in the inner loop of the cross-validation procedure; hence, for a k-fold technique the algorithm takes the following form [82] (a minimal code sketch of this procedure is given after the list):

  (i) Randomly shuffle the dataset.

  (ii) Randomly split the dataset into K folds.

  (iii) For each fold k = 1, 2, …, K:

    (a) Perform feature selection, using all the data except the kth fold, to obtain an optimal subset with good univariate correlation with the desired end point.

    (b) Using the selected features, build a multivariate model with all the data except the kth fold.

    (c) Evaluate the model on the kth fold.

  (iv) Aggregate the performance across all K folds to obtain an unbiased evaluation.
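A minimal sketch of this procedure with scikit-learn is given below: placing the univariate selector inside a Pipeline guarantees that, in every fold, selection is fitted only on the training portion and never sees the held-out fold. The selector, learner, and k = 20 are illustrative choices.

```python
# Avoiding feature selection bias by nesting selection inside cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),        # step (iii)(a), fitted per fold
    ("model", LogisticRegression(max_iter=1000)),    # step (iii)(b)
])

scores = cross_val_score(pipe, X, y, cv=5)           # steps (iii)(c) and (iv)
print("Unbiased CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```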

5 Summary

QSAR-based predictive toxicity modeling methods are faced with input spaces of thousands of features. To improve the ability of a learner to find a generalizable relationship between molecular descriptors and the toxicity end point of interest, it is expedient to provide the learning algorithm with the minimum number of descriptors while ensuring that the resulting model is interpretable and computationally inexpensive to build. The relevance of a descriptor is assessed by its ability to discriminate between classes in qualitative classification or by its correlation with the continuous end point in quantitative prediction.

In this review, we have discussed different feature selection and extraction methods applicable to SAR-based toxicity modeling and highlighted the strengths and weaknesses of each method. The choice of method should largely depend on the available dataset, and we suggest beginning a new task by establishing baseline performance values from a number of methods, since no single approach is universally superior. Where the importance of individual descriptors is sought, feature selection methods such as filter, wrapper, and embedded methods, or their combinations (hybrid and ensemble), are applicable. Feature extraction methods transform the features into a lower dimension while altering their physical meaning, so more analysis may be required to interpret the extracted features. The stability of selected features and proper feature subset validation are often overlooked. Feature selection bias can be avoided by embedding the feature selection process within the inner loop of a cross-validation procedure, which prevents overly optimistic performance estimates. Although dimensionality reduction has been shown to improve model performance, there is still room for improvement in evaluating and validating feature selection and extraction methods and their stability. For the sake of reproducibility, researchers are encouraged to publish the important parameters of the feature selection or extraction methods they employ, such as the threshold for a variance score. Regardless of the choice of features (molecular descriptors, fingerprints, or a combination) used for modeling, SAR models can benefit from dimensionality reduction techniques.