Introduction

In recent years, there has been a significant upsurge in the number and diversity of molecular structure-characterizing features, also known as molecular descriptors (MDs), implemented in various educational and commercial computational programs [19]. This increase is not necessarily advantageous, however, as it engenders a high-dimensional space, which usually has a detrimental influence on the performance of regression and classification algorithms. Moreover, an exhaustive search of the entire MD space for the subsets of features that best describe a specified molecular property carries a high computational cost, and such exploration may lead to the selection of features that aggravate data overfitting [10]. The challenge of dealing with high-dimensional data is not limited to chemoinformatics: high-throughput data matrices obtained in genomics with microarray technology, metabolomics, proteomics, and text analysis are typical examples of datasets characterized by the “small sample-many features” problem [10]. It is thus important to develop procedures that filter out noisy, redundant, or highly correlated variables without affecting the learning performance. Dimensionality reduction usually improves the quality of models (especially their predictive power) and of the information extracted from them, in addition to permitting greater computational efficiency. Optimal classification models should ideally discriminate the molecules belonging to different classes, created on the basis of specified molecular properties or activities, the most common example being binary classifiers.

Unfortunately, there exists no universally superior feature selection algorithm, since a method unsuitable for a particular application may perform ideally in another. This conclusion is also known as the “no free lunch theorem” [11]. As a result, many computational methods for feature selection have been proposed in the literature, and these are basically divided into filters, wrappers, and embedded methods. While wrapper and embedded methods incorporate learning algorithms in their settings, filter methods rely on the intrinsic tendencies of the data as criteria for feature selection. Although it is known that filter methods may induce the selection of sets with redundant features, their key advantage is that the selected features are not adapted to a specific predictor algorithm and are thus suitable for dimensionality reduction tasks as well.

Feature selection methods follow two primary objectives: (1) obtain the best low-dimensional representation of data matrices when no dependent (response) variables are available (or considered); the algorithms designed for this purpose are known as unsupervised methods, examples of which include cluster analysis, principal component analysis, and Shannon’s entropy ranking, among others [1, 12, 13]; (2) screen for the features that best correlate with response variables for classification and regression; the procedures employed for this objective are collectively denominated supervised methods, with typical examples including information gain, relief [14], Pearson correlation coefficients, and the Fisher ratio, among others.

On the whole, feature selection methods are applications of diverse theoretical concepts and methods aimed at evaluating data patterns (or tendencies) as well as relations among features and/or instances. This article focuses exclusively on information-theoretic methods for feature selection, from both a supervised and an unsupervised perspective. Information-theoretic functions have increasingly attracted attention as solid tools for various chemometric tasks. Consequently, a comprehensive review of the literature on information theory-based unsupervised and supervised feature selection measures has been performed, and these algorithms have been condensed into a free computational program denominated IMMAN (acronym for Information theory-based CheMoMetrics ANalysis). In addition, information-theoretic parameters previously used to define the information content of a molecular graph are adapted for use as alternative criteria for rank-based feature selection approaches. Moreover, the concept of symmetrical Kullback–Leibler entropy (SKL), also known as Jeffreys information [15, 16], is introduced as an objective measure of the divergence between two probability distributions.

Program design

IMMAN offers a user-friendly graphical interface stratified into sections according to the different information-theoretic concepts and is developed in the Java programming language. Programs developed in this language may be executed on different architectures and operating systems, such as Windows, Linux, or OS X. The amount of RAM required by the IMMAN software depends on the size of the data to be analyzed. Batch files (.bat) and shell scripts (.sh) are provided to augment the maximum heap size of the Java engine. Depending on the RAM available on the system, the program can be executed via these files (or scripts) to grant a larger heap size to the Java engine and avoid out-of-memory errors. The default Java heap size lies in the range 16–64 MB, which is normally too small.

The IMMAN system comprises two major classes that manage the datasets over which computations are to be performed. The DataSet class is a representation of the files that are introduced to the system for processing. This class contains attributes such as the file name, an array of identification codes for the instances, the number of instances, and the number of variables contained, among others. Additionally, within this class there is a list of objects of the Variable class. These objects hold the attributes of the variables in the dataset to be analyzed, including the variable name, the position the variable occupies in the dataset, an array of instance values for the variable, and statistical parameters such as the maximum, minimum, median, and standard deviation, which are calculated when the object is created. The Variable class is the central axis of the system; it contains all the operations related to the information-theoretic parameters. Overall, the system comprises a total of 64 classes contained in 17 packages.

The accepted input file formats for IMMAN are tab- and comma-separated value files (.txt, .csv). In designing this application, we considered it useful to allow computations over multiple dataset files at the same time, so that features or datasets from different sources can be compared. Several data pre-processing functions are provided, such as missing-value replacement (i.e., with the minimum, maximum, mean, geometric mean, median, or a user-defined numeric value) or feature deletion, data selection (in the sense that not all loaded datasets need be used for computations), data browsing or inspection, and feature-based dataset partitioning (in cases where the user desires to operate with a reduced number of features). In addition, for supervised feature selection tasks, an option is provided for converting continuous response variables into categorical ones to serve as class variables. Figure 1 provides an image of IMMAN’s graphical user interface.
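As a rough illustration of these pre-processing steps, the following Python sketch shows one way to replace missing values in a single feature column and to convert a continuous response into binary class labels; the function names and the percentile-based cut are illustrative assumptions and do not reproduce IMMAN’s internal implementation.

```python
import numpy as np

def impute_missing(x, strategy="mean", fill_value=None):
    """Replace missing values (NaN) in one feature column.
    Strategies mirror the options listed above: 'min', 'max', 'mean',
    'median', 'geometric_mean', or 'value' for a user-defined number."""
    x = np.asarray(x, dtype=float).copy()
    observed = x[~np.isnan(x)]
    if strategy == "value":
        fill = fill_value
    elif strategy == "geometric_mean":
        fill = np.exp(np.mean(np.log(observed)))   # assumes strictly positive values
    else:
        fill = {"min": np.min, "max": np.max,
                "mean": np.mean, "median": np.median}[strategy](observed)
    x[np.isnan(x)] = fill
    return x

def to_classes(y, percentile=50):
    """Convert a continuous response into binary class labels with a
    percentile cut (a hypothetical rule, not IMMAN's exact behavior)."""
    y = np.asarray(y, dtype=float)
    return (y >= np.percentile(y, percentile)).astype(int)
```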

Fig. 1

IMMAN’s graphical user interface (GUI)

It is important to remark that IMMAN allows single-parameter ranking or ensemble (multi-criteria) ranking that uses the former as base ranking methods. The ensemble ranking procedure may be performed using either the scores (values of the analyzed features) or the positions they occupy in the ranking, following a specified amalgamation criterion (i.e., product, mean, geometric mean, or sum). In addition, beyond the rank-based unsupervised and supervised computations, graphical visualizations of these outcomes are available. The unsupervised graphical analysis options include the Shannon’s distribution, histogram, importance, and correlation graphs, while supervised graphical analysis offers the importance and correlation graphs only.
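The ensemble ranking just described can be sketched as follows; the function assumes that every base method returns a relevance score per feature, with higher values meaning more relevant (scores would be negated beforehand for measures where the opposite holds, e.g., EDSE), and the amalgamation rules mirror the options listed above.

```python
import numpy as np

def ensemble_rank(scores, use_positions=False, rule="product"):
    """Amalgamate several base rankings into one ensemble ranking.

    scores        : (n_features, n_methods) array of relevance scores,
                    assumed positive with higher values meaning more relevant.
    use_positions : if True, combine rank positions (1 = best) instead of scores.
    rule          : 'product', 'mean', 'geometric_mean', or 'sum'.
    Returns feature indices ordered from most to least relevant."""
    s = np.asarray(scores, dtype=float)
    if use_positions:
        order = np.argsort(-s, axis=0)                  # best feature first, per method
        s = np.empty_like(order, dtype=float)
        np.put_along_axis(s, order,
                          np.arange(1, s.shape[0] + 1, dtype=float)[:, None], axis=0)
    combine = {"product": lambda a: np.prod(a, axis=1),
               "mean": lambda a: np.mean(a, axis=1),
               "geometric_mean": lambda a: np.exp(np.mean(np.log(a), axis=1)),
               "sum": lambda a: np.sum(a, axis=1)}[rule]
    combined = combine(s)
    # Positions: smaller combined value is better; scores: larger is better.
    return np.argsort(combined) if use_positions else np.argsort(-combined)
```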

Calculations performed may be exported to the report and, if desired, saved as Tab Separated Value files (.txt). Figure 2 is an illustrative work flow of operations performed with the IMMAN program. For more information see http://mobiosd-hub.com/imman-soft/.

Fig. 2

Workflow for the operations carried out with the IMMAN program

Theory

Unsupervised variable ranking-based feature selection approaches

In unsupervised paradigms, the feature selection algorithms are not guided by objective functions but rather rely on a mathematical quantification of the intrinsic properties of the dataset. Although in recent years, owing to the enormous growth in the amount of unlabeled data, there has been increasing interest in unsupervised feature selection methods, these remain a handful compared to the bulk of supervised algorithms. Unsupervised feature selection algorithms improve the performance of clustering algorithms and benefit genomic microarray data analysis and similarity/dissimilarity studies of molecular compounds [17–19]. The unsupervised feature selection methods implemented in IMMAN are now briefly discussed.

Shannon’s entropy and related entropic measures

Shannon’s entropy (SE) and scaled Shannon’s entropy (sSE) were proposed by Godden and Bajorath [20, 21] as parameters for quantifying the information content, and thus the variability, of MDs. These entropic measures may therefore be used as criteria to rank the features in a dataset and, if a threshold value is defined, a subset of features is retained for use in building correlation and/or classification models. A brief recapitulation of the theoretical aspects of this methodology is available as supporting information (SI1).

On the other hand, while SE has previously been used as a measure of the structural and/or compositional diversity of molecules through the so-called information theory-based MDs (or information indices), there exists a series of other information-theoretic parameters, mathematically related to Shannon’s entropy, that have traditionally been used for the same purpose but not for feature selection tasks. These parameters include the negentropy (nSE), Brillouin redundancy index (rSE), Gini index (gSE), and informational energy content (iSE) [1, 22]. Here, these parameters are likewise adapted for use in evaluating the information content of features. Table 1 shows the mathematical definitions of these parameters, as implemented in the IMMAN software.

Table 1 Information-theoretic parameters implemented in the IMMAN software

These parameters follow a fixed, data-independent discretization scheme (unsupervised equal-interval binning), in which a predetermined number of bins is defined. Although the number of discrete intervals is user-defined, SE maximization is advised, in the sense that the number of bins should allow an (approximately) equal distribution of instances across the discrete intervals. It should be noted that although this discretization approach has often been criticized for being liable to bad cuts leading to an uneven distribution of instances, studies have shown that equal-interval binning can yield excellent results, for example with the naïve Bayes classifier [23].
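As a minimal sketch of how such bin-based measures can be computed, the following snippet applies equal-interval binning to one feature and evaluates SE together with several related quantities; since Table 1 is not reproduced here, the scaled, negentropy, and redundancy forms below (all relative to the maximum entropy, log2 of the number of bins) are assumptions based on common literature definitions and may differ in detail from IMMAN’s.

```python
import numpy as np

def bin_probabilities(x, n_bins=1000):
    """Equal-interval (equal-width) binning; returns bin occupancy probabilities."""
    counts, _ = np.histogram(np.asarray(x, dtype=float), bins=n_bins)
    p = counts / counts.sum()
    return p[p > 0]                        # empty bins contribute nothing to SE

def entropic_measures(x, n_bins=1000):
    """SE and related quantities for one feature (forms assumed, see text)."""
    p = bin_probabilities(x, n_bins)
    se = -np.sum(p * np.log2(p))           # Shannon's entropy (bits)
    se_max = np.log2(n_bins)               # maximum attainable entropy
    return {"SE": se,
            "sSE": se / se_max,            # scaled Shannon's entropy
            "nSE": se_max - se,            # negentropy (assumed form)
            "rSE": 1.0 - se / se_max,      # Brillouin redundancy (assumed form)
            "gSE": 1.0 - np.sum(p ** 2),   # Gini index
            "iSE": np.sum(p ** 2)}         # informational energy content
```

Features would then be ranked, for example, by decreasing SE, and a cut-off applied to retain a subset.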

Singular value decomposition entropy (SVDEi)

This entropy measure evaluates the contribution of the \(i\)th feature to the dataset entropy following a leave-one-out (LOO) setting [24, 25]. Let \(S_{j}\) represent the singular values of the matrix \(\mathbf{A}_{[n\times m]}\) of n instances and m features. Then the \(S_{j}^{2}\) are the eigenvalues of the \(n \times n\) matrix \(\mathbf{A}\mathbf{A}^{\mathrm{t}}\). The dataset entropy is defined by

$$\begin{aligned} E(\mathbf{A})=-\frac{1}{\log N}\sum _{j=1}^N \frac{S_{j}^{2}}{S_{T}}\log \frac{S_{j}^{2}}{S_{T}}, \end{aligned}$$
(1)

where \(S_{T}\) denotes the total sum of the \(S_{j}^{2}\) values. Therefore, the contribution of the \(i\)th feature to the dataset entropy is defined as follows:

$$\begin{aligned} \mathrm{SVDE}_{i} =E(\mathbf{A})-E(\mathbf{A}'), \end{aligned}$$
(2)

where \(\mathbf{A}'\) denotes the matrix A without the analyzed feature. In this sense, features may be ranked according to their relative contribution to the dataset entropy.
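A minimal Python sketch of Eqs. (1) and (2) is given below; it assumes that the matrix A has already been pre-processed (any scaling or normalization step applied before the SVD is not reproduced here) and takes N in Eq. (1) to be the number of singular values.

```python
import numpy as np

def dataset_svd_entropy(A):
    """Normalized entropy of the squared singular values of A (Eq. 1)."""
    s2 = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(s2))

def svd_entropy_contributions(A):
    """Leave-one-out contribution of every feature to the dataset entropy (Eq. 2)."""
    A = np.asarray(A, dtype=float)
    e_full = dataset_svd_entropy(A)
    return np.array([e_full - dataset_svd_entropy(np.delete(A, i, axis=1))
                     for i in range(A.shape[1])])
```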

Degenerative entropy ratio (DGSE) and degenerated value (DV)

The DV is a diversity measure based on the number of instances characterized differently by features in a data matrix and is defined as

$$\begin{aligned} \hbox {DV}({X})=\frac{\hbox {Instances}_{\mathrm{total}} -\hbox {Instances}_{\mathrm{different}} }{\hbox {Instances}_{\mathrm{total}}}. \end{aligned}$$
(3)

The DV varies between 0 and 1, with the lower bound (DV = 0) corresponding to the ideal case where X assigns different values for all instances while the upper bound (DV = 1) stands for maximum degeneracy. Note that the DV is not strictly an entropic measure. Its inclusion is justified by its relationship with the DGSE as shown below. The DGSE evaluates the feature variability according to the degenerated value and is expressed as follows:

$$\begin{aligned} \mathrm{DGSE}(X)=\mathrm{DV}(X)\cdot \frac{\mathrm{SE}(X)_{\mathrm{DV}(X)}}{\mathrm{SE}(X)_{\mathrm{inst}}}, \end{aligned}$$
(4)

where DV(X) is the degenerated value for variable X, and \(\mathrm{SE}(X)_{\mathrm{DV}(X)}\) and \(\mathrm{SE}(X)_{\mathrm{inst}}\) are Shannon’s entropies of X computed using, as the number of discrete intervals, the DV and the number of instances, respectively.
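The following sketch implements Eqs. (3) and (4); because the binning underlying \(\mathrm{SE}(X)_{\mathrm{DV}(X)}\) is open to interpretation, the snippet assumes the number of discrete intervals equals the number of distinct values of X, which is a labeled assumption rather than IMMAN’s documented behavior.

```python
import numpy as np

def shannon_entropy(x, n_bins):
    counts, _ = np.histogram(np.asarray(x, dtype=float), bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def degenerated_value(x):
    """DV(X) = (n_total - n_different) / n_total  (Eq. 3)."""
    x = np.asarray(x)
    return (x.size - np.unique(x).size) / x.size

def dgse(x):
    """DGSE(X) as in Eq. (4).  The number of bins for SE(X)_DV is assumed here
    to be the number of distinct values of X (an interpretation, not a
    documented convention)."""
    x = np.asarray(x, dtype=float)
    se_dv = shannon_entropy(x, n_bins=max(np.unique(x).size, 1))
    se_inst = shannon_entropy(x, n_bins=x.size)
    return degenerated_value(x) * se_dv / se_inst if se_inst > 0 else 0.0
```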

Euclidean distance-based entropy (EDSE)

The EDSE is computed from the distances among instances, as a measure of the clustering tendency of the instances according to variable X, and is expressed as follows [26]:

$$\begin{aligned} \mathrm{EDSE}(X) = -\sum _{i} \sum _{j} \left[ D_{ij} \log _2 D_{ij} + (1-D_{ij})\log _2 (1-D_{ij}) \right] , \end{aligned}$$
(5)

where \(D_{ij}\) is the normalized distance, in the range [0.0, 1.0], between the instances \(X_i\) and \(X_j\). The EDSE is low for data with a clustering tendency and high otherwise, a characteristic that makes it suitable for unsupervised feature selection procedures. Optimality is associated with a minimum EDSE for subsets of features.
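A minimal sketch of Eq. (5) for a single feature follows; it sums each unordered pair of instances once (which only rescales the measure relative to a double sum over all ordered pairs), uses min-max normalization of the pairwise distances, and clips distances away from 0 and 1 to keep the logarithms finite — all implementation choices not specified in the text.

```python
import numpy as np

def edse(x):
    """Euclidean distance-based entropy of one feature (Eq. 5)."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])            # pairwise distances for one feature
    if d.max() > 0:
        d = d / d.max()                            # min-max normalization to [0, 1]
    i, j = np.triu_indices(x.size, k=1)            # each unordered pair counted once
    dij = np.clip(d[i, j], 1e-12, 1 - 1e-12)       # keep the logarithms finite
    return -np.sum(dij * np.log2(dij) + (1 - dij) * np.log2(1 - dij))
```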

Altogether, ten unsupervised rank-based feature selection methods are discussed. These are essentially divided into the discretization scheme-based algorithms (e.g., SE, sSE, nSE, rSE, gSE, and iSE), and those that do not follow any discretization procedure (e.g., SVDE, DGSE, DV, and EDSE). These algorithms provide an important arsenal of tools for unsupervised feature selection and dimensionality reduction.

Supervised feature selection algorithms

Supervised feature selection algorithms estimate the functional dependency between features and class labels. Based on the feature evaluation method, these algorithms are divided into two main groups: feature subset selection and feature ranking. While the former assesses the discrimination power of subsets of features, the latter evaluates individual features weighted by their degree of relevance. The supervised algorithms in the IMMAN software belong to the latter group. Note that with this approach a simple filtering criterion is followed (using a defined threshold value or the first \(k\) features), and no heuristic search strategy or learning scheme is employed.

Differential Shannon’s entropy

To analyze the variability of compound populations, Godden and Bajorath [21] proposed a parameter denominated “differential Shannon’s entropy.” The initial definition was conceived for the comparison of two compound populations. In this report, however, this definition is generalized to \(n\) datasets and defined as

$$\begin{aligned} \mathrm{DSE}=\mathrm{SE}_{1,2,3\ldots n} -(\mathrm{SE}_1 +\mathrm{SE}_2 +\mathrm{SE}_3 +\cdots +\mathrm{SE}_n )/n, \end{aligned}$$
(6)

where “\(\mathrm{SE}_{1, 2, 3\ldots n}\)” is the SE calculated for the combination of the \(n\) compound datasets under consideration. DSE is a measure of the complementarity of \(n\) compound collections with regard to the descriptor under analysis. The application of this measure to feature selection tasks is straightforward: in place of compound datasets, class-based partitions are considered. The default configuration is for binary class labels, adjustable to \(n\) classes. Note that IMMAN provides an option for transforming a continuous Y response into a categorical one, following a percentile-based rule. The usability of DSE in the identification of features useful for compound classification tasks has been demonstrated; see ref. [27].

However, it should also be clarified that, in strictly information-theoretic terms, the term “differential entropy” refers to entropy computations for continuous sources (or variables), rather than for the discrete variables used in the definition adopted here.
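A minimal sketch of the class-partition form of Eq. (6) is given below; the number of bins and the use of base-2 logarithms are assumptions consistent with the rest of the section.

```python
import numpy as np

def shannon_entropy(values, n_bins=1000):
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def differential_se(x, y, n_bins=1000):
    """DSE of feature x with respect to class labels y (Eq. 6): the SE of the
    pooled values minus the mean SE of the class-wise partitions."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    class_entropies = [shannon_entropy(x[y == c], n_bins) for c in np.unique(y)]
    return shannon_entropy(x, n_bins) - np.mean(class_entropies)
```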

Mutual information differential Shannon’s entropy (MI-DSE)

The MI-DSE is introduced as a modification of DSE to select class-specific features when there exists notable differences in compound class sizes [28]. The MI-DSE is mathematically expressed as follows:

$$\begin{aligned} {\text {MI-DSE}}(X)=\mathrm{SE}_\mathrm{norm} (X,Y)-\frac{\mathrm{SE}(X)-\mathrm{SE}(Y)}{2}, \end{aligned}$$
(7)

where \(\mathrm{SE}_{\mathrm{norm}}(X, Y)\) is the normalized entropy calculated on the two compound classes X and Y combined, and SE(X) and SE(Y) are Shannon’s entropies of X and Y, respectively.
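The snippet below evaluates Eq. (7) literally as printed; the normalization of the pooled entropy by log2 of the number of bins is an assumption, and ref. [28] should be consulted for the authoritative formulation.

```python
import numpy as np

def shannon_entropy(values, n_bins=1000, normalized=False):
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    se = -np.sum(p * np.log2(p))
    return se / np.log2(n_bins) if normalized else se

def mi_dse(class_x, class_y, n_bins=1000):
    """Eq. (7) evaluated literally: normalized SE of the pooled classes minus
    half the difference of the class-wise entropies."""
    pooled = np.concatenate([np.asarray(class_x, float), np.asarray(class_y, float)])
    return (shannon_entropy(pooled, n_bins, normalized=True)
            - (shannon_entropy(class_x, n_bins) - shannon_entropy(class_y, n_bins)) / 2)
```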

Symmetric Kullback–Leibler entropy: Jeffreys information

Kullback–Leibler entropy or divergence is a heuristic measure of the “distance” between two probability distributions \(f (x_{i})\) and \(g (x_{i})\), and it is defined as [15, 29, 30]

$$\begin{aligned} D_{KL} (F\big \Vert G)= \sum _{i=1}^m f(x_i )\log \left( \frac{f(x_i )}{g(x_i)}\right) , \end{aligned}$$
(8)

where \(f (x_{i})\) is the experimental (real) distribution and \(g (x_{i})\) is the theoretical (approximative) distribution. Viewed from a source-coding perspective, the Kullback–Leibler entropy defines the additional number of bits needed to encode independent draws of a discrete (or continuous) variable with probability distribution \(f (x_{i})\) when a different distribution \(g (x_{i})\) is used [15, 29, 30]. The Kullback–Leibler divergence, however, presents one shortcoming: it is a subjective parameter, i.e., its value depends on which distribution is considered experimental and which theoretical. In other words, the Kullback–Leibler divergence is asymmetrical and thus is not an ideal parameter for comparing probability distributions when a distinction between experimental and theoretical variables is inapplicable. To eliminate this limitation, a symmetric function of the Kullback–Leibler divergence, denominated Jeffreys information (JI), is used and defined as [16]

$$\begin{aligned} \mathrm{JI}(F\big \Vert G)=D_{KL} (F\big \Vert G) +D_{KL} (G\big \Vert F). \end{aligned}$$
(9)

Contrary to the Kullback–Leibler divergence, JI is an “unbiased” measure of the distance/dissimilarity of variable distributions between compound classes and is thus applicable in rank-based supervised feature selection tasks. High JI values are associated with strong functional dependence between features and class labels, and low JI values with weak dependence.
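The following sketch computes Eqs. (8) and (9) for one descriptor measured in two compound classes; the shared bin grid and the small epsilon added to empty bins (to avoid division by zero) are implementation choices, not part of the original definition.

```python
import numpy as np

def jeffreys_information(class_a, class_b, n_bins=100, eps=1e-12):
    """Symmetric Kullback-Leibler divergence (Eqs. 8 and 9) between the
    distributions of one descriptor in two compound classes."""
    pooled = np.concatenate([class_a, class_b]).astype(float)
    edges = np.histogram_bin_edges(pooled, bins=n_bins)   # shared bin grid
    f = np.histogram(class_a, bins=edges)[0] / len(class_a) + eps
    g = np.histogram(class_b, bins=edges)[0] / len(class_b) + eps
    d_fg = np.sum(f * np.log(f / g))                      # D_KL(F || G)
    d_gf = np.sum(g * np.log(g / f))                      # D_KL(G || F)
    return d_fg + d_gf                                    # JI(F || G), Eq. (9)
```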

Information gain (IG), gain ratio (GR), and symmetrical uncertainty (SU)

The IG of attribute X with respect to class attribute Y refers to the reduction in uncertainty about X once Y is known (or, equivalently, about Y once X is known, since the measure is symmetric) and is mathematically defined as follows [23]:

$$\begin{aligned} \mathrm{IG}(X| Y)=H(X)-H(X| Y ), \end{aligned}$$
(10)

where H(X) and \(H(X\vert Y)\) are the entropy of attribute X and the conditional entropy of X given Y, respectively.

The GR is a normalization of the IG that compensates for its preference for attributes with a large number of values and is defined as [31]

$$\begin{aligned} \mathrm{GR}(X,Y)=\frac{\mathrm{IG}(X| Y)}{H(X,Y)}, \end{aligned}$$
(11)

where \(\mathrm{IG}(X\vert Y)\) and H(X, Y) are the information gain of X given Y and the intrinsic information (the entropy of the distribution of instances into branches), respectively.

The SU compensates for IG’s bias toward attributes with more values and is mathematically expressed as [32]

$$\begin{aligned} \mathrm{SU}(X,Y)=2\,\frac{\mathrm{IG}(X\big |Y)}{\mathrm{SE}(X)+\mathrm{SE}(Y)}. \end{aligned}$$
(12)

Symmetrical Uncertainty normalizes its value to the range [0, 1], where 0 indicates that the attributes are completely independent while 1 indicates that each attribute predicts the values of the other.

It should be highlighted that while IG, GR, and SU have previously been reported in the literature and implemented in several feature selection software packages, the difference in IMMAN’s approach is that an unsupervised equal-interval discretization procedure is employed. Note that while the number of discrete intervals is pre-defined by the user, SE maximization is advised, in the sense that the number of bins should allow an (approximately) equal distribution of instances across the discrete intervals.
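A compact sketch of Eqs. (10)–(12) under equal-interval discretization is given below; note that the gain ratio here is normalized by the joint entropy H(X, Y), following Eq. (11) as printed, whereas other formulations divide by the split information of X alone.

```python
import numpy as np

def _discretize(x, n_bins):
    """Unsupervised equal-interval binning of a continuous attribute."""
    edges = np.histogram_bin_edges(np.asarray(x, dtype=float), bins=n_bins)
    return np.digitize(x, edges[1:-1])            # bin labels 0 .. n_bins-1

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ig_gr_su(x, y, n_bins=10):
    """Information gain, gain ratio, and symmetrical uncertainty (Eqs. 10-12)
    of a continuous attribute x with respect to class labels y."""
    xd = _discretize(x, n_bins)
    y_codes = np.unique(y, return_inverse=True)[1]
    h_x, h_y = _entropy(xd), _entropy(y_codes)
    h_xy = _entropy(xd * (y_codes.max() + 1) + y_codes)   # joint entropy H(X, Y)
    ig = h_x + h_y - h_xy                                 # = H(X) - H(X | Y)
    gr = ig / h_xy if h_xy > 0 else 0.0
    su = 2.0 * ig / (h_x + h_y) if (h_x + h_y) > 0 else 0.0
    return ig, gr, su
```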

Sample case studies

The primary objective of these studies is to exemplify the practical utility of IMMAN in chemometric tasks. The studies are divided into three parts: case studies I and II demonstrate the application of the SE measure in comparative studies of families of MDs and of MD-calculating software, respectively, from an unsupervised feature selection perspective. Case study III, on the other hand, deals with the application of the unsupervised and supervised feature selection tools in classification tasks. Comparisons with the WEKA software are made.

Case study I

In this section, we compare the performance of the families of DRAGON’s MDs [2] using the SE measure, following the premise that high-SE features are sensitive to progressive changes in chemical structures and thus generally suitable for correlation studies, whereas low-SE features are not. For this study, the PrimScreen1 diversity dataset (available at http://www.otavachemicals.com/component/docman/doc_download/19-primscreen-1-db) was employed. Some MD families were grouped together into bigger families, forming a total of 13 super-families. Using a discretization scheme of 1000 bins, SE values were computed and the best 111 MDs in each family were graphically represented for analysis. This cut-off value was not arbitrarily chosen; rather, it was determined by the family with the smallest number of variables. The probability-based normalization procedure is not advised when working with dataset files that differ in their numbers of variables, as it gives a biased graphic impression for a dataset file with many more variables than the others. Figure 3 shows a graphic representation of the family-wise SE distribution for DRAGON’s MDs.

Fig. 3

Graphic representation for Shannon’s entropy distribution of DRAGON’s MDs

It is interesting to note that, generally, 3D MD families show superior entropy distributions, with the 3D-MoRSE (3D molecule representation of structures based on electron diffraction) MDs presenting the best performance, while the worst entropy distributions are observed for the fingerprint-based MDs (i.e., 2D binary and frequency fingerprints) and the super-family 0D-1D & Others (molecular properties and charge descriptors). This is a logical result, since 3D MDs take into account the nature and connectivity of the atoms as well as the overall spatial configuration of the molecule, contrary to the fingerprint-based and 0D-1D MDs, which are generally concerned with identifying atom types, functional groups, or substituents of interest in a molecule; the latter are thus insensitive to structural or conformational changes and therefore present high degeneracy. On the other hand, it is also informative to assess the representativity, in terms of SE values, of the overall best twenty MDs of the commercial software DRAGON [2]. Table 2 shows the SE, sSE, nSE, rSE, gSE, iSE, DGSE, DV, and EDSE values for the best twenty MDs ranked according to SE, using a discretization scheme of 1000 bins.

Table 2 The twenty best MDs of commercial software DRAGON, ranked according to Shannon’s entropy

As can be observed, the highest representativity is achieved by the Randic molecular profiles (derived from the distance distribution moments of the geometry matrix) with five MDs (i.e., DP12, DP14, DP15, SP14, and SP16), followed by four MDs each for the Connectivity and Topological index families, i.e., (X2, X3sol, XMOD, X0v) and (Xu, S1K, S2K, RHyDp), respectively. The GETAWAY and Geometric indices are represented by two MDs each [(HTv, H3p) and (G1, G2), respectively], and finally one Molecular property (i.e., AMR), one Eigenvalue-based index (i.e., SEig), and one Constitutional MD (i.e., Sp). These indices (or families) could be recommended as “ideal” features to be taken into account in molecular modeling and in the drug discovery process. It is not surprising that these MD families have been successfully applied in the modeling and prediction of a wide range of physicochemical, pharmacological, and toxicological properties for several molecular datasets [33–44]. The AMR index, in particular, is well known and frequently used in QSPR/QSAR models, with a comprehensible physicochemical interpretation.

Additionally, it is of interest to evaluate the degree of correlation among the unsupervised feature selection methods. To this end, for each feature selection method the best 100 MDs were selected and used to create an incidence matrix (n \(\times \) m), where n is the number of variables in the original dataset and m the number of feature selection parameters. A Pearson correlation analysis was then performed. Table 3 shows the pair-wise correlation coefficients for the unsupervised feature selection methods.

Table 3 Correlation coefficients between unsupervised feature selection parameters

As can be observed, the parameters that follow an equal-interval discretization scheme (with the exception of gSE) are highly correlated, which is a logical result given that their mathematical definitions are generally related to SE. This result suggests that these parameters should not be used concurrently in simple feature ranking. Nonetheless, their utility is appreciated in an ensemble-based feature ranking scheme, where the magnitudes (scores) or positions yielded by these parameters for a set of features influence the final ranking from a multi-criteria perspective. For example, if we consider the values of the highly correlated parameters SE, sSE, nSE, and iSE for the 20 best MDs (ranked according to SE) in a score-based ensemble scheme, using the product as the amalgamation rule (rSE is left out because it is negatively related to the relevance of the features), a rather different ranking pattern is achieved (see the “Ensemble” column in Table 2). Similarly, as expected, the DV and DGSE are highly correlated as well, given that the latter is derived from the former. On the other hand, the parameters that do not follow the equal-interval discretization procedure are weakly correlated among themselves as well as with the bin-based unsupervised feature selection parameters. The orthogonality of the feature selection parameters is a desirable attribute, as it enables the assessment of distinct tendencies (or patterns) in the dataset matrices and thus the retrieval of dissimilar information.

Case study II

Secondly, we performed a study to compare the performance, in entropy terms, of the most prominent MD-calculating software programs. This study is relevant given that several MD-calculating programs exist, and such abundance presents a dilemma when it comes to choosing the “ideal” MD-computing software for a particular study. Ideal in this sense refers to the software with the most variable MDs. As in the former case, using a discretization scheme of 1000 bins, SE values were calculated for each software package (over sets of MDs previously computed on the PrimScreen1 dataset). Figure 4 illustrates the graphic representation of the Shannon’s entropy distribution of the best 170 variables (cut-off determined by BlueDesc) for each of the MD-calculating computer programs.

Fig. 4

Graphic representation for Shannon’s entropy distribution of MD calculating software

As can be observed, the most favorable distribution is provided by the DRAGON software, with 100 % of the compared MDs presenting SE values greater than 8.7 bits, followed by MOLD2 and PADEL, respectively. This result suggests that DRAGON comprises a bigger pool of highly variable MDs, sensitive to molecular structural differences, than the rest of the MD-calculating software. On the other hand, the worst Shannon’s entropy distribution is shown by the POWER MV software. The apparently poor performance of POWER MV is attributed to the fact that the majority of the MDs contained in this software are mainly atom-type counts, and these possess weak discriminating power among similar molecular structures. It should be highlighted that the goal of these sample studies is to give simple illustrations of possible case studies with IMMAN rather than to establish a strict ranking of the MD-computing software. Stronger and more compelling inferences in this direction would require studies with other sets of diverse databases.

Case study III: IMMAN versus WEKA

In this sub-section, we wish to compare the performance of the feature selection methods implemented in IMMAN and WEKA on the basis of the significance of the subsets of features obtained with the different approaches. Two important differences exist between the IMMAN and WEKA algorithms: (1) while IMMAN possesses both unsupervised and supervised methods, WEKA offers supervised algorithms exclusively; (2) another key difference lies in the discretization methods followed in each case. The IMMAN algorithms employ an equal-interval binning procedure [23], in which the instances are distributed into discrete intervals (bins) of equal size according to their numeric values. The WEKA algorithms, on the other hand, employ a supervised discretization scheme based on an entropy minimization heuristic, following a recursive binary splitting procedure to obtain multiple intervals for continuous-valued attributes (multi-interval discretization). For details, see ref. [45]. For this study the Arcene dataset is used. This dataset is freely available at the UCI Machine Learning Repository [46] and is one of the datasets proposed in the NIPS 2003 Feature Selection Challenge [47]. The Arcene dataset exemplifies cases where the number of instances is small with respect to the number of features (high-dimensional data), a common problem in chemoinformatics and/or bioinformatics applications. In particular, this dataset typifies a two-class classification problem aimed at distinguishing cancer patterns from normal ones in mass spectrometric data.

In the first study, IMMAN’s unsupervised and supervised feature selection approaches as well as WEKA’s algorithms are used, separately, to obtain subsets of 15 variables, which are then employed to build classification models with k-nearest neighbor (KNN1) and support vector machine (SVM) classifiers, and the percentages of correct classification are compared [23]. Tables 4 and 5 show the percentages of correct classification for IMMAN and WEKA using the KNN1 and SVM classifiers, respectively.

Table 4 Comparison of the percentages of correct classification of KNN1-based classification models using IMMAN’s and WEKA’s feature selection approaches
Table 5 Comparison of the percentages of correct classification of SVM-based classification models using IMMAN’s and WEKA’s feature selection approaches

As expected, supervised methods generally perform better than unsupervised methods with both the KNN1 and SVM classifiers, as the former favor features that are linked to the class labels. Using KNN1, it is observed that WEKA’s supervised feature selection algorithms generally show a slight edge over the IMMAN methods (see Table 4), although some parameters exhibit performance comparable to WEKA’s algorithms, such as EDSE (77 %), SU (81 %), and DSE (77 %), one of them being an unsupervised method (i.e., EDSE). As for SVM, IMMAN’s IG offers the highest percentage of correct classification (82 %), followed by WEKA’s GR (81 %) and SU (80 %); see Table 5. This result suggests that IMMAN’s algorithms are effective in the selection of subsets of features with good classification accuracy.
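For readers who wish to reproduce this kind of evaluation, the following sketch builds KNN1 and SVM models on a selected subset of columns using scikit-learn; reading KNN1 as k = 1, the use of 10-fold cross-validation, and the default SVM kernel are assumptions, since the exact validation protocol of the case study is not detailed here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def subset_accuracy(X, y, selected, cv=10):
    """Mean accuracy of KNN1 and SVM models built on the selected columns.
    `selected` holds the 15 feature indices returned by any ranking method;
    10-fold cross-validation and the default RBF kernel are assumptions."""
    Xs = np.asarray(X, dtype=float)[:, selected]
    knn1 = KNeighborsClassifier(n_neighbors=1)       # KNN1 read as k = 1
    svm = SVC()                                      # default RBF kernel
    return {"KNN1": cross_val_score(knn1, Xs, y, cv=cv).mean(),
            "SVM":  cross_val_score(svm, Xs, y, cv=cv).mean()}
```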

One of the key applications of unsupervised methods is data dimensionality reduction. In the second experiment, we use the unsupervised rank-based methods as pre-filters. For each entropic measure, the mean is determined and used as the threshold value (cut-off). The retained sets of features are then filtered with the IMMAN and WEKA supervised feature selection tools, separately, to obtain 15-variable subsets for each algorithm. The final subsets of variables are then validated using the KNN1 and SVM classifiers. Tables 6 and 7 show the percentages of correct classification for combinations of supervised and unsupervised feature selection approaches, with the KNN1 and SVM classifiers, respectively (note that only the pairings that yield the best percentages of correct classification are shown; for the results of all combinations see supporting information SI2).

Table 6 Comparison of the percentages of correct classification for combinations of unsupervised and supervised methods using KNN1 Classifier
Table 7 Comparison of the percentages of correct classification for combinations of unsupervised and supervised methods using SVM Classifier

Generally, the use of unsupervised methods as pre-filters improves the performance of the IMMAN and WEKA supervised feature selection tools, for both the KNN1 and SVM classifiers. Nonetheless, it is worth noting that while the use of unsupervised parameters yields minimal improvements in the performance of the IMMAN supervised algorithms with the KNN1 classifier, significant improvements are achieved with WEKA (compare Tables 4, 6). On the other hand, with the SVM classification method, impressive improvements are observed for the IMMAN supervised feature selection tools; for example, the percentage of correct classification for JI rises from 56 % (see Table 5) to 85 % (see Table 7), yielding superior performance to the WEKA algorithms. This improvement is consistent with the nonexistence of a universally superior feature selection tool and advocates the use of combinations of unsupervised and supervised methods to obtain subsets with a more solid and easily interpretable knowledge structure, amenable to greater classification accuracy.
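The two-stage procedure of this second experiment can be sketched as follows; the helper assumes that both the unsupervised scores and the supervised scoring function treat larger values as more relevant (scores such as EDSE or DV, for which smaller is better, would be negated beforehand), and the function names are illustrative only.

```python
import numpy as np

def prefilter_then_select(X, y, unsup_scores, sup_score_fn, k=15):
    """Two-stage selection: keep features whose unsupervised score exceeds the
    mean score (the cut-off used above), then rank the survivors with a
    supervised criterion and keep the top k.  `unsup_scores` is a per-feature
    array (e.g., SE values) and `sup_score_fn(x, y)` is any supervised measure
    from the Theory section; both are placeholders."""
    X = np.asarray(X, dtype=float)
    unsup_scores = np.asarray(unsup_scores, dtype=float)
    kept = np.flatnonzero(unsup_scores > unsup_scores.mean())     # mean threshold
    sup = np.array([sup_score_fn(X[:, j], y) for j in kept])
    return kept[np.argsort(-sup)[:k]]                             # final feature indices
```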

In the third experiment, we evaluate the influence of combining variables filtered independently with the IMMAN and WEKA feature selection tools (or with pairings of unsupervised and supervised methods). To this end, an \(n \times m\) incidence matrix is constructed, with \(n\) being the 1000 features in the Arcene dataset and \(m\) the feature selection tools, including combinations of unsupervised and supervised parameters. A \(k\)-means cluster analysis (using a \(k\)-value of 15) is performed on this data matrix, and for each cluster the algorithm (variable) closest to the centroid is picked, yielding a total of 15 dissimilar feature selection methods. The sets of features originally filtered by these 15 algorithms from the Arcene data matrix are then mixed in two- to seven-fold combinations, and the resulting sets of variables are used to build classification models with the KNN1 and SVM methods. Table 8 shows the parameters (or combinations thereof) selected for each cluster, as well as comparisons of their percentages of correct classification prior to and after performing the two-fold variable mixing procedure. For a matrix showing the percentages of correct classification for all pair-wise combinations of the 15 feature selection tools, see supporting information (SI3). As can be observed in Table 8, two-fold mixing of features obtained with dissimilar algorithms (from a clustering-tendency perspective) yields significant improvements in the classification accuracy of the models. This is a logical result, since mixing features obtained with dissimilar algorithms provides an information structure with a wider span of the data structure patterns, which directly influences the performance of the classifier algorithms.

Table 8 Percentages of correct classification obtained using the two-fold feature mixing scheme according to KNN1 and SVM classifiers
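A possible implementation of the selection of dissimilar feature selection methods via k-means is sketched below; it assumes a 0/1 incidence matrix (rows are features, columns are methods, 1 meaning the method retained that feature) and clusters the method profiles, which is one reasonable reading of the procedure described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def dissimilar_methods(incidence, n_clusters=15, seed=0):
    """Pick one representative feature selection method per k-means cluster.
    `incidence` is the n_features x n_methods 0/1 matrix described above;
    methods are clustered on their selection profiles and the one closest
    to each centroid is returned."""
    profiles = np.asarray(incidence, dtype=float).T          # one row per method
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(profiles)
    picked = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(profiles[members] - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return sorted(picked)                                     # column indices of methods
```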

Figure 5 shows the percentages of correct classification obtained using up to seven-fold feature mixing schemes with the KNN1 classifier. As can be observed, in all cases the percentages of correct classification improve, reaching up to 92 % correct classification for the six- and seven-fold mixing schemes.

Fig. 5

Percentages of correct classification obtained using feature mixing schemes and KNN1 classifier

Conclusions

A free computational program for chemometric analysis, designated IMMAN, has been developed to provide valuable information theory-based tools for unsupervised and supervised feature selection tasks. The software is written in the Java programming language and can be executed on different operating systems. The usability of IMMAN has been illustrated with sample case studies, showing satisfactory behavior. In forthcoming releases, we intend to implement other information-theoretic formulations as measures of the correlation and dependence among variables, in order to deal with the possible redundancy among features, a key handicap of rank-based methods. However, it is known that combinations of the m top-ranked features, considered individually, do not necessarily yield the best subset of features, a fact enunciated in the literature as “the m best features are not the best m features” [48, 49]. Therefore, other goals include incorporating feature subset selection algorithms into the IMMAN approach, as well as learning algorithms into the feature selection procedure.

Supporting information available

A brief recapitulation of the theoretical aspects of the SE method and the percentages of correct classification for all pair-wise combinations of supervised and unsupervised feature selection methods are freely available to all interested users as supplementary material via the Internet at http://link.springer.com/journal/11030. The IMMAN computational program, user manual, and codes may be freely downloaded via the Internet at http://mobiosd-hub.com/imman-soft/.