1 Introduction

The classification task is one of the most relevant types of supervised learning in the knowledge discovery scenario [8]. A previously trained classification model automatically assigns a class label to an instance, based on the values of its features. In many important real-world problems, each instance in the dataset can be described as a binary feature vector, such that each feature takes either a “positive” or a “negative” value, indicating the presence or the absence of a property, respectively, in the object being classified. Note that, in this scenario, positive values are intuitively more informative than negative values in general. After all, a positive feature value has a clear and well-defined meaning, whilst a negative feature value represents very vague information, in the sense that it just tells us that the object being classified does not have a certain property, without providing any clue about which properties the object does have. Therefore, in this work we prioritize the selection of positive feature values over negative feature values when learning classification models.

More specifically, this work addresses hierarchical feature spaces, where binary features are related via generalization-specialization relationships. In addition, the addressed feature spaces are sparse, i.e., in general the instances contain much fewer positive than negative feature values. In a generalization-specialization hierarchy, also known as “IS-A” hierarchy, for any given instance t, if a feature x has positive value in t, denoted (x = 1), then all ancestors of x in the feature hierarchy also have positive value in t. In contrast, if a feature x has negative value in t, denoted (x = 0), then all descendants of x in the feature hierarchy also have negative value in t.
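To make this constraint concrete, the following minimal Python sketch (with a small hypothetical tag hierarchy; the names echo the text-classification example discussed below) propagates positive values upwards to ancestors; the symmetric rule would propagate negative values downwards to descendants.

```python
# Hypothetical tag hierarchy, given as a child -> parents mapping (a small DAG).
PARENTS = {
    "News": [],
    "Economy": ["News"],
    "Politics": ["News"],
    "Sports": ["News"],
}

def ancestors(feature, parents=PARENTS):
    """All ancestors of `feature` in the hierarchy."""
    result, stack = set(), list(parents[feature])
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents[p])
    return result

def expand_positive(positive_features, parents=PARENTS):
    """Positive values propagate upwards: every ancestor of a positive feature is positive."""
    expanded = set(positive_features)
    for f in positive_features:
        expanded |= ancestors(f, parents)
    return expanded

print(expand_positive({"Economy"}))   # -> {'Economy', 'News'} (set order may vary)
```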

Some examples of data commonly characterized by hierarchical and sparse feature spaces, where positive feature values are in general more informative than negative values, are text [11] and biological data [29, 31], which are two of the most investigated types of machine learning applications.

For example, in the text classification problem, an article may be characterized by a set of tags describing its content. In this case, one general feature (e.g., News) may be associated with one or more specialized features (e.g., Economy, Politics and Sports). In addition, knowing that a document contains a certain word like Economics (positive feature value) provides us with clear information about the document’s contents, whilst knowing that the document does not contain a certain word like Economics (negative feature value) provides us with much less information about the document’s contents.

Similarly, in bioinformatics problems where each instance represents a gene, each gene may be associated with terms derived from an ontology of biological processes or functions. Hence, a general feature (e.g., biological process) would be the ancestor of more specific features (e.g., reproduction or biological regulation). In addition, a gene annotation indicating that the gene is involved in say DNA repair (positive feature value) provides us with much more information than a lack of DNA repair annotation (negative feature value) for that gene.

Many important real-world datasets have a large number of features, many of which are not crucial for predicting the correct class. Some features can be redundant (highly correlated with each other) or irrelevant for predicting the class variable, decreasing the classifier’s predictive accuracy, making the learning process slower, and reducing the comprehensibility of the results.

Feature selection methods have been successfully employed to cope with these problems. They aim at selecting a reduced subset of features that is sufficient to predict the target class, ideally increasing the predictive accuracy of the classifier [17]. Although many methods address this problem [6, 12, 15, 17, 18, 22, 32], only a few of them exploit the hierarchical information in order to improve their effectiveness [11, 20, 24, 29,30,31]. Existing hierarchical feature selection methods usually find a suitable subset of features by keeping those features with higher relevance values and removing redundancy among hierarchically related features.

In this work, we focus on hierarchical and sparse feature spaces from different domains that share a singular characteristic: a positive feature value is always much more informative than a negative feature value, as briefly discussed earlier and discussed in more detail later. Despite this interesting characteristic of positive feature values, none of the previously proposed feature selection methods for hierarchical and sparse feature spaces has prioritized the selection of positive feature values. Hence, in this work we hypothesize that the selection of positive feature values tends to increase the predictive accuracy of the classifier, and we propose feature selection methods that prioritize positive feature values.

The main contribution of this work is the proposal of a novel lazy feature selection method for hierarchical and sparse feature spaces which relies on the higher relevance of features with positive values for the classification task. The basic idea of this method is to select, for each test instance, a subset containing the most specific positive feature values in the hierarchy as well as their relevant ancestors. To assess the quality of the subset of positive feature values, we introduce a new lazy version of a relevance measure that evaluates the predictive relevance of a feature value for the current test instance.

The main related work involves two other lazy feature selection methods proposed in the literature, the HIP and MR methods [31]. In essence, the proposed feature selection method, called Select Relevant Positive Feature Values (RPV), differs from HIP and MR in three major ways. First, RPV uses a new relevance measure, proposed in the current work. Second, RPV selects only positive feature values for each instance, whilst the HIP and MR methods select both positive and negative feature values for each instance. Third, comparing RPV vs. HIP, HIP selects only the most specific positive features, whilst RPV selects not only the most specific features but also some of their relevant ancestors; and comparing RPV vs. MR, RPV removes only features which are hierarchically redundant, whilst MR in general also removes some features that are not hierarchically redundant. These differences are explained in more detail in Section 3 (Related Work), after a description of HIP, MR and other hierarchical feature selection methods.

The proposed feature selection method prioritizing positive feature values is evaluated in 17 datasets. These datasets are mainly from the area of bioinformatics, but they also include other types of application domains, in particular two datasets involving the classification of sports tweets, one dataset involving the classification of news headings, one dataset involving the classification of URLs, and finally one dataset classifying cities into categories of “liveability”. The results of experiments with these diverse application domains show that the proposed hierarchical feature selection method outperforms both traditional feature selection methods and recent state-of-the-art hierarchical feature selection methods regarding predictive accuracy, whilst also selecting smaller feature subsets.

The remainder of this paper is organized as follows. Section 2 defines the hierarchical feature selection problem and briefly discusses feature selection methods. Section 3 reviews related work. Section 4 introduces the new relevance measure and the novel lazy restrictive hierarchical feature selection method. Section 5 presents experimental results. Section 6 presents conclusions and research directions.

2 Hierarchical feature selection for classification

The classification problem can be defined as follows. Let \(X = \{ X_{1}, \dots , X_{d}\}\) be a set of d predictive features and \(L = \{l_{1}, \dots , l_{q}\}\) be a set of q class labels, where q ≥ 2. Let \(D = \{(x_{1},y_{1}), (x_{2},y_{2}), \dots , (x_{N},y_{N})\}\) be a dataset with N instances, where each \(x_{i}\) corresponds to a vector \((x_{i1}, x_{i2}, \dots , x_{id})\) storing the values of the d features in X for the i-th instance, and each \(y_{i} \in L\) corresponds to a single target class. The goal of the classification task is to learn a classifier from D that, given an unlabelled instance \(t = (x, ?)\), predicts its class label \(y \in L\).

The quality of the feature set has a huge impact on the predictive performance of classification algorithms [17]. Feature selection methods aim at improving the predictive performance of the classifier by selecting a subset containing relevant and non-redundant features. Relevant features are those that are useful for predicting the target class variable, and non-redundant features are those that are not highly correlated with other features.

Feature selection methods can be categorized into embedded, wrapper and filter methods [17]. Embedded methods are incorporated into the classification algorithm, selecting features during the learning of a classification model. Wrapper and filter methods are instead used in a data pre-processing step. Wrapper methods measure the relevance of a feature subset by evaluating the predictive accuracy of a classifier built using that subset. Hence, they select features tailored for the target classification algorithm, but they tend to be very time-consuming. By contrast, filter methods evaluate the predictive power of features in a generic way, by using a relevance measure that is independent of the target algorithm. Filter methods tend to be much faster and more scalable than wrapper methods. We focus on filter methods in this work. For an example of a wrapper approach see [3].

In some scenarios, the i-th instance is defined as a d-dimensional binary feature vector \((x_{i1}, x_{i2}, \dots , x_{id})\) with \(x_{ij} \in \{0,1\}\) for all 1 ≤ j ≤ d. When the feature set X is hierarchically structured, we call it a hierarchical feature space, which can be represented as a Directed Acyclic Graph (DAG). In this DAG, a vertex (node) represents a feature and an edge represents a generalization-specialization relationship between features. In this sense, an edge \((X_{a} \rightarrow X_{b})\) indicates that \(X_{a}\) is a parent of \(X_{b}\) and \(X_{b}\) is a child of \(X_{a}\). More generally, a feature \(X_{a}\) is an ancestor (descendant) of a different feature \(X_{b}\) if and only if there is a sequence of edges leading from \(X_{a}\) to \(X_{b}\) (from \(X_{b}\) to \(X_{a}\)) in the feature DAG, or feature tree in some scenarios. The root node is the most general feature, while the leaf nodes are the most specific ones. Note that this structure produces a hierarchical redundancy among features, since a specific feature value logically implies the values of all its ancestors or descendants: all ancestors of positive-valued features have positive values and all descendants of negative-valued features have negative values.

For example, to classify an instance where the feature set is formed of Gene Ontology (GO) terms, if the instance is annotated with the GO term “multicellular organism reproduction”, then that instance is considered annotated with the more general GO terms “reproduction” and “multicellular organism process”. Conversely, if an instance is not annotated with the GO term “reproduction” (i.e., the feature “reproduction” has a negative value), then the instance is considered not to be annotated with the GO term “multicellular organism reproduction” (i.e., the child feature is guaranteed to have a negative value too, in that instance).

Hierarchical feature selection methods exploit characteristics of the feature DAG to improve predictive accuracy, typically by removing hierarchically redundant features [24, 31]. Note that, even though we are dealing with hierarchical features, the information about the hierarchical structure (represented by a feature DAG) is used only by the hierarchical feature selection methods. That is, the hierarchical structure is used to enhance the feature selection process, helping to identify a better set of features to be selected. After the feature selection step, the data is treated as a “flat” dataset (the hierarchical structure is no longer considered), and traditional classification methods can then be used to make predictions.

Feature selection methods (as well as classification methods) can be categorized as eager or lazy. Eager methods select a subset of features based on the training instances; then, a model trained with the selected features is used to predict the class of any test instance. By contrast, lazy methods select a feature subset tailored for each test instance [1, 22], by observing the feature values (but not the class, of course) in that test instance. In this work, the main motivation to adopt the lazy learning approach is the ability to select a set of relevant positive feature values specifically tailored for each test instance.

Based on these definitions, our proposed hierarchical method can be categorized as a filter feature selection method which follows the lazy learning paradigm.

3 Related work

Traditional (non-hierarchical) feature selection methods, like the well-known eager Correlation-based Feature Selection (CFS) [6] and ReliefF [12] methods, can be used in hierarchical feature spaces by ignoring the hierarchical relationships among features. However, this is intuitively a sub-optimal approach. Hence, a few methods that directly exploit such hierarchical relationships to improve performance have been recently proposed, as follows.

SHSEL [24] is a hierarchical feature selection method that performs eager learning. SHSEL assumes that, if two features are directly hierarchically related (one is a parent of the other), they are usually highly correlated and tend to be similarly relevant for classification. Hence, for each pair of directly hierarchically related features, SHSEL removes the most specific feature if the correlation between them is higher than a user-defined threshold. Then, using only the remaining features, it keeps for each path in the hierarchy the features whose relevance is higher than the average relevance of features in that path. Moreover, Lu et al. proposed the Greedy Top-Down (GTD) search strategy [20], which selects the most relevant features in each path from each leaf to the root node in the hierarchy. Likewise, an eager learning hierarchical method called Tree-Based Feature Selection (TSEL) [11] has been used in the special case of tree-structured features. Previous work showed that SHSEL achieves better performance than TSEL and GTD [24].

Some hierarchical methods proposed in the literature are based on the lazy learning paradigm, such as the Select Hierarchical Information-Preserving Features (HIP) method [31], the Select Most Relevant Features (MR) method [31], and the hybrid Select Hierarchical Information-Preserving and Most Relevant Features (HIP-MR) method [31]. Since the hybrid HIP-MR obtained worse results than its base methods HIP and MR in [31], it is not considered further here. Next, we briefly describe HIP and MR.

The HIP method eliminates hierarchical redundancy by selecting only the “core” features in the current test instance—i.e., features whose values are non-redundant since they cannot be inferred from the values of other features. In other words, HIP selects the subset of the most specific positive-valued features (which imply their ancestors) and the most general negative-valued features (which imply their descendants). The values of the features selected by HIP for an instance imply the values of all other features for that instance, so it ensures that hierarchical redundancy is completely eliminated. However, HIP does not take into account the relevance of the selected features.
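As an illustration only (not the authors' implementation), the selection rule just described can be sketched as follows, assuming each instance is given as a dictionary of binary feature values and the hierarchy as parent/children adjacency lists.

```python
# For a given test instance, keep positive-valued features with no positive child
# ("most specific" positives) and negative-valued features with no negative parent
# ("most general" negatives); the values of all other features are implied by these.

def hip_select(instance, parents, children):
    """instance: dict feature -> 0/1; parents/children: feature -> list of features."""
    selected = []
    for f, v in instance.items():
        if v == 1 and not any(instance[c] == 1 for c in children.get(f, [])):
            selected.append(f)        # most specific positive-valued feature
        elif v == 0 and not any(instance[p] == 0 for p in parents.get(f, [])):
            selected.append(f)        # most general negative-valued feature
    return selected
```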

In a similar vein, the MR method not only eliminates hierarchical redundancy but also selects features with higher relevance. For each feature in the DAG, MR considers all paths between the feature and the root (for positive feature values) or between the feature and the leaves (for negative values). Then, the most relevant feature in each path is kept. However, unlike HIP, in general MR does not select all “core” features, i.e., it removes some hierarchically non-redundant features.

The proposed RPV method (described in Section 4) shares with HIP a certain focus on more specific positive feature values, but there are three important differences between these methods. First, HIP selects both positive and negative feature values, whereas RPV only selects positive feature values. Second, among positive feature values, HIP selects only the most specific ones; whilst RPV selects not only the most specific feature values, but also some of their relevant ancestors in the feature hierarchy. Third, RPV uses a new measure of feature value relevance (introduced in this paper), whilst HIP does not use any such relevance measure.

In this work, we compare our proposed method against the state-of-the-art hierarchical feature selection methods HIP, MR and SHSEL, as well as against the traditional (non-hierarchical) feature selection methods CFS and ReliefF.

Note there are also other types of hierarchical feature selection methods, often discussed in the literature under the name of structured feature selection, as reviewed in [5, 19]. However, in general those methods have been proposed for the regression task (using a variation of the Lasso method that produces a sparse linear model), rather than for the classification task addressed in this paper.

It is important to highlight that the hierarchical feature selection task addressed in this paper should not be confused with the kind of hierarchical feature learning performed in deep learning processes. Deep neural networks involve hierarchical feature construction, where, during the training of the neural net, features are hierarchically learnt across the layers of the network [23]. On the other hand, in the problem discussed in this work, the hierarchy of features is predefined, and it is provided as an input to the feature selection algorithm. The point is not to learn or construct new features; the point is to select the best possible subset of features, among the original feature set, exploiting generalization-specialization information associated with the predefined feature hierarchy.

4 The proposed hierarchical feature selection method

This section presents our new relevance measure and the new feature selection method for hierarchical and sparse feature spaces.

4.1 Lazy feature relevance measure

In general, how to assess the relevance (or predictive power) of a feature plays an important role in the design of a good feature selection method. Many different functions have been proposed to cope with this issue, such as the Information Gain [2], the Mutual Information [28], the R measure [26], etc.

The R measure, first proposed by [26], was adjusted by [31] to assess the predictive power of features in hierarchical feature selection. As shown in (1), where k is the number of classes, the R measure calculates the relevance of a binary feature X based on the differences between the conditional probabilities of each class \(c_{i}\) given the feature values \(x_{1}\) and \(x_{2}\).

$$ R(X) = \sum\limits_{i=1}^{k}[P(c_{i}|x_{1}) - P(c_{i}|x_{2})]^{2} $$
(1)

Note that (1) is an eager relevance measure, but a feature may be useful or not depending on its value in the test instance currently being classified [22]. Our proposed feature selection method considers that taking into account the feature values (specifically the positive values) of the current test instance may help to identify a subset of high-quality features for that particular instance, in the spirit of lazy learning. For this reason, we propose a new feature relevance measure, named Lazy Relevance Measure (LazyR), which assesses the predictive power of a given feature X taking a specific value x in the current test instance. As defined in (2), LazyR calculates the relevance of X with value x as the sum, over the k classes, of the squared differences between the conditional probability of each class \(c_{i}\) given the specific feature value x and the class probability \(\frac{1}{k}\) associated with a uniform distribution, ignoring the other values of X, since they do not occur in the current test instance. This measure takes its highest value when the feature value x is perfectly correlated with one of the k classes, and its lowest value when the conditional probability of each class \(c_{i}\) is exactly \(\frac{1}{k}\).

$$ LazyR(X = x) = \sum\limits_{i=1}^{k}\left[ P(c_{i}|x) - \frac{1}{k} \right]^{2} $$
(2)

The LazyR measure has some benefits over eager measures. Eager relevance measures (e.g., R and Information Gain) assess the relevance of all values of a feature for discriminating among class labels. In contrast, LazyR assesses the relevance of a specific feature value. Consider, e.g., a feature A with positive and negative values, where the positive value discriminates well among class labels and the negative value does not. An eager relevance measure could assign a low score to feature A, leading to its removal. In contrast, our lazy relevance measure would keep A in the model if the instance being classified has a positive value for A, and remove it if the instance has a negative value; this is a principled, data-driven decision.
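As an illustration, the following sketch estimates LazyR as defined in (2) by simple frequency counts over the training set; the variable names and the handling of feature values unseen in training are our own assumptions, not part of the original formulation.

```python
import numpy as np

# A sketch of the LazyR measure in (2), assuming a binary feature matrix X_train
# (rows = training instances, columns = features) and an integer-encoded class
# vector y_train; conditional probabilities are estimated by frequency counts.

def lazy_r(X_train, y_train, feature_idx, value, n_classes):
    """Relevance of the event (feature == value) for the current test instance."""
    mask = X_train[:, feature_idx] == value
    if mask.sum() == 0:
        return 0.0                                   # value never observed: no evidence
    relevance = 0.0
    for c in range(n_classes):
        p_c_given_value = np.mean(y_train[mask] == c)
        relevance += (p_c_given_value - 1.0 / n_classes) ** 2
    return relevance

# For a test instance t (a binary vector), the relevance of feature j in t would be
# lazy_r(X_train, y_train, j, t[j], k); RPV only ever evaluates it for positive values.
```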

4.2 The proposed lazy and restrictive hierarchical feature selection method

We designed a new feature selection method for hierarchical and sparse feature spaces called Select Relevant Positive Feature Values (RPV). The intuition for this method is twofold. First, in sparse feature spaces, positive feature values are more informative and easier to interpret than negative values. That is, since positive feature values are quite rare, they provide more relevant and more meaningful information than negative values. For instance, in text mining, a document is typically described by features representing the presence (positive value) or absence (negative value) of words in that document, and the class indicates the document’s subject. The presence of the word “teacher” is relevant for predicting that the document’s class is “Education”, but the absence of the word “teacher” is neither relevant for classification nor particularly meaningful, since it conveys only very broad information. Second, the generalization-specialization structure of hierarchical feature spaces creates hierarchical redundancy among features, which intuitively reduces predictive accuracy. RPV exploits generalization-specialization relationships in order to eliminate this hierarchical redundancy, which should improve predictive accuracy.

More specifically, our method adopts the following ideas: (i) it relies on a restrictive selection approach, in which only positive feature values are selected, which might increase the accuracy of the classifier; (ii) it tries to identify a specific subset of relevant positive features for each instance t in the test set, following the lazy paradigm; (iii) taking the hierarchy into account, it selects the most specialized positive feature values as well as those positive feature values whose relevance is higher than (or equal to) the relevance of all their positive descendants.

We now show, theoretically, that Naïve Bayes—a classifier used in related work [24, 29, 31] and also employed in our experiments—tends to give larger influence to positive feature values than to negative feature values in sparse datasets, which is in agreement with the ideas behind the proposed feature selection method.

Consider the log-odds ratio form of Naïve Bayes (for binary classes \(c_{1}\) and \(c_{2}\)):

$$ \ln \frac{P(c_{1}|X)}{P(c_{2}|X)} = \ln \frac{P(c_{1})}{P(c_{2})} + \sum\limits_{i=1}^{d} \ln \frac{P(x_{i}|c_{1})}{P(x_{i}|c_{2})}, $$
(3)

which predicts class \(c_{1}\) for the current instance if \(\ln \frac{P(c_{1}|X)}{P(c_{2}|X)} > 0\), and predicts class \(c_{2}\) otherwise. The summation term of this formula can be divided into two parts: \({\sum}_{i+=1}^{d+} \ln \frac{P(x_{i+}|c_{1})}{P(x_{i+}|c_{2})}\) and \({\sum}_{i-=1}^{d-} \ln \frac{P(x_{i-}|c_{1})}{P(x_{i-}|c_{2})}\), where i+ and i− index the positive and negative feature values in the current instance, respectively; d+ and d− are the numbers of positive and negative feature values in the current instance, respectively; and d+ + d− = d. In the case of very sparse features, each term in the second summation (over the d− negative feature values) will tend to zero. This is because the vast majority of instances take the negative value for a highly sparse feature, so the terms \(P(x_{i-}|c_{1})\) and \(P(x_{i-}|c_{2})\) will tend to have similar values (both close to 1), and therefore each term \(\ln \frac{P(x_{i-}|c_{1})}{P(x_{i-}|c_{2})}\) will be close to zero. That is, negative feature values have little influence on the Naïve Bayes prediction. On the other hand, for positive feature values, the terms \(P(x_{i+}|c_{1})\) and \(P(x_{i+}|c_{2})\) will in general have quite different values, and so the summation of the terms \(\ln \frac{P(x_{i+}|c_{1})}{P(x_{i+}|c_{2})}\) over the d+ positive feature values will tend to be large in magnitude, rather than close to zero. That is, positive feature values tend to have a larger influence than negative feature values in the Naïve Bayes formula.
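A small numeric illustration of this argument, with hypothetical conditional probabilities for a single very sparse binary feature:

```python
import math

# Hypothetical conditional probabilities for one very sparse binary feature.
p_neg_c1, p_neg_c2 = 0.98, 0.99      # P(x = 0 | c1), P(x = 0 | c2): both close to 1
p_pos_c1, p_pos_c2 = 0.02, 0.01      # P(x = 1 | c1), P(x = 1 | c2)

print(math.log(p_neg_c1 / p_neg_c2))   # ~ -0.01: a negative value barely moves the log-odds
print(math.log(p_pos_c1 / p_pos_c2))   # ~ +0.69: a positive value contributes a much larger term
```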

The RPV method works as follows. Given a test instance t, RPV first evaluates the relevance of each feature in t. Then, it identifies the list of ancestors of each positive feature value, using the feature DAG. After that, RPV marks every negative feature value in t for removal. For each positive feature \(X_{i}\) in t, RPV evaluates each of its ancestors and marks for removal those whose relevance is lower than the relevance of \(X_{i}\). At the end of the process, RPV removes every feature marked for removal, and the remaining features are used in the lazy classification of the current test instance.

Algorithm 1 describes how RPV works in detail. This algorithm produces as output a subset of features named SelectedFeatSubSet. In the initialization phase (lines 1 to 5), the ancestors and the relevance value (measured by LazyR) for each feature in the DAG are computed and stored into the respective Ancestors and Relevance arrays (indexed by the features’ ids). Also, the Status array is initialized with the “Selected” value for all features.

Algorithm 1: Pseudocode of the RPV feature selection method

The main phase of RPV works as follows. In line 7, for each feature \(X_{i}\) in the DAG, the function Value(\(X_{i}\), t) returns the value of \(X_{i}\) in the test instance t. If the returned value is positive, RPV looks at each ancestor \(A_{j}\) of \(X_{i}\) in the DAG and marks for removal (by setting the Status flag) those with a relevance value lower than the relevance of \(X_{i}\) (lines 8 to 12). In line 14, every feature with a negative value in t is marked for removal, since negative values are much less informative than positive values, as discussed earlier. In lines 17 and 18, the feature subset SelectedFeatSubSet receives all features whose Status is still “Selected”, and this subset is returned by the algorithm. Then, a lazy classifier is executed for test instance t using only the selected features. Note that, after each feature’s Status is initialized to “Selected”, the Status of a feature can only be changed to “Removed” in lines 10 and 14, and once this change is made, that feature’s Status is never set back to “Selected” by the algorithm. Hence, the result of the algorithm does not depend on the order in which the features are processed.

The RPV algorithm is executed for each test instance in a lazy learning fashion, but note that, in order to save time, the values of the Ancestors and Relevance arrays can be pre-computed in an eager fashion and stored to be accessed whenever a new instance needs to be classified.
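For concreteness, the selection procedure of Algorithm 1 can be sketched as follows; this is a simplified illustration based on the description above (identifiers are ours), not the authors' WEKA implementation.

```python
# A simplified sketch of Algorithm 1 (RPV) for one test instance t. `parents` encodes the
# feature DAG (feature -> list of parent features) and `relevance[f]` holds the (possibly
# pre-computed) LazyR value of f's value in t.

def ancestors(feature, parents):
    """All ancestors of `feature` in the DAG (transitive closure of the parent relation)."""
    result, stack = set(), list(parents.get(feature, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, []))
    return result

def rpv_select(t, parents, relevance):
    """t: dict mapping each feature to its binary value (0/1) in the test instance."""
    status = {f: "Selected" for f in t}                   # initialization phase
    for f, value in t.items():
        if value == 1:
            # ancestors of a positive feature are also positive; drop the less relevant ones
            for a in ancestors(f, parents):
                if relevance[a] < relevance[f]:
                    status[a] = "Removed"
        else:
            status[f] = "Removed"                         # negative values are never kept
    return [f for f in t if status[f] == "Selected"]

# The returned subset (the most specific positive values plus their sufficiently relevant
# ancestors) is then passed to a lazy classifier to classify the instance t.
```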

Figure 1 illustrates how RPV works. In this figure, each vertex represents a feature, and the numbers on the right and left side of each node represent, respectively, the feature’s value (1 for positive, 0 for negative) and the relevance of that feature value. After RPV’s initialization phase, each feature in the DAG (denoted by letters A to N) is processed in turn. When A, B, D, E, F and I are processed, their Status will be set to “Removed”, since their values are “0”. When C (with value “1”) is processed, RPV sets to “Removed” the Status of C’s ancestors in the DAG whose relevance (LazyR) value is lower than C’s relevance—i.e., G and N are marked for removal. When H is processed, L and N (ancestors with lower relevance than H) are marked for removal, and M will also be marked for removal when J is processed. After processing all features, the only ones selected (never marked for removal) are features C, K, H and J.

Fig. 1 Example of a feature DAG showing the subset of features selected by RPV

The reader might question whether the rule “if a feature A is an ancestor of a feature B, then LazyR(A) > LazyR(B)” always holds. The counter-example presented in Fig. 2 shows that it does not. Consider that feature A is an ancestor of feature B (so the value B = 1 in an instance implies the value A = 1 in that instance); we show that in this case LazyR(B) > LazyR(A). Recall that LazyR evaluates how well the value of a feature, as it appears in the instance to be classified, is correlated with the class variable, so LazyR is higher when that feature value describes the class well. In this case the features are binary, so the possible feature values are 0 and 1, and we evaluate the relevance of each of these two values for each feature. According to (2), with k = 2 we have 1/k = 0.5; then LazyR(B = 1) = (1 − 0.5)² + (0 − 0.5)² = 0.5 and LazyR(A = 1) = (1 − 0.5)² + (0.5 − 0.5)² = 0.25, i.e., LazyR(B = 1) > LazyR(A = 1). As a second example in Fig. 2, consider that C is an ancestor of D (the value D = 1 in an instance implies the value C = 1 in that instance). In this case, LazyR(C = 1) = (0 − 0.5)² + (1 − 0.5)² = 0.5 and LazyR(D = 1) = (0 − 0.5)² + (0.5 − 0.5)² = 0.25, so LazyR(C = 1) > LazyR(D = 1), since the value C = 1 is much more discriminative with respect to the class variable than D = 1. These values show that LazyR is not tied to the position of a feature in the hierarchy; it depends on the discriminative power of the feature value. The position of a feature in the hierarchy is much more related to the redundancy among feature values than to their relevance.

Fig. 2 Example of a dataset with two features and three instances where A is an ancestor of B and LazyR(B = 1) > LazyR(A = 1)

Figure 3 depicts the end-to-end process of employing the RPV feature selection method in the pre-processing stage of the classification procedure. Since RPV is a lazy filter feature selection method, it uses the feature hierarchy and the training dataset to automatically select a subset of features for subsequent use in the classification of a given test instance t. Note that this feature selection process follows the lazy paradigm, i.e., the filter procedure is tailored to each instance that passes through the RPV method. After the feature selection procedure is executed, a lazy classifier (such as Naïve Bayes or the Nearest Neighbor classifier) is used to predict the class of instance t using only the subset of the original features selected by the RPV method.

Fig. 3 Diagram illustrating the inputs and output of the RPV method for a given test instance, and also showing that the selected feature subset (RPV’s output) is used by a lazy classifier to classify that instance

The RPV method presents some appealing characteristics: (i) it selects only positive feature values, which are more informative than negative values; (ii) it uses a lazy relevance measure specifically adapted to assess the relevance of a feature value in the current test instance; (iii) since it selects only positive values, it tends to select fewer features than the other methods used in our experiments (as shown later); (iv) it selects the most specific positive feature values and some of their most relevant ancestors.

4.3 An analysis of the worst-case time complexity of RPV

The worst-case time complexity of RPV can be calculated as follows. First, note that in the worst case the feature DAG has N nodes (features) and \(\frac{N \cdot (N-1)}{2}\) edges, i.e., each feature is linked to all features except its own descendants. Hence, line 2 of Algorithm 1 has worst-case time complexity \(O(N)\), but this line is executed N times, since it is within a for loop, which takes \(O(N^{2})\). Line 3 of Algorithm 1 requires the value of the LazyR (Lazy Relevance) measure for each feature, which is pre-computed before Algorithm 1 is run, with a time complexity of \(O(NM)\), where N is the number of features and M is the number of training instances. Lines 3 and 4 take constant time (simple assignments), and so they can be ignored in the analysis, since their time complexity is dominated by that of line 2. During RPV’s feature elimination procedure, performed by the nested loop starting in line 6, in the worst case the relevance value of the most specific feature is compared (in line 9) to the relevance of N − 1 ancestors, the second most specific feature is compared to N − 2 ancestors, and so on, until all features have been evaluated. The other lines in the nested loop, lines 10 and 14, do not change the time complexity associated with this loop. In total, in the worst case, the nested loop starting at line 6 of Algorithm 1 performs \(\frac{N \cdot (N-1)}{2}\) comparisons, i.e., \(O(N^{2})\).

Hence, in addition to the \(O(NM)\) time to pre-compute all LazyR values, Algorithm 1 takes \(O(N^{2}) + O(N^{2})\) time, so its total cost is \(O(NM + N^{2})\). Note that Algorithm 1 is executed once for each test instance to be classified. Hence, the total worst-case time complexity of the RPV feature selection method is \(O(NM + tN^{2})\), where N is the number of features, M is the number of training instances, and t is the number of test instances to be classified.

Note however, that in practice the time taken by RPV tends to be much smaller than suggested by this worst-case analysis, because in practice the number of ancestors of each feature is usually much smaller than the theoretical maximum of N − 1 (a key assumption in the above analysis).

5 Computational experiments

5.1 Datasets

In this work, the proposed method was evaluated on 17 distinct datasets, 12 from the bioinformatics domain and 5 from other classification domains.

Following the same methodology described in [29, 31], we created 12 datasets of ageing-related genes, involving the effect of genes on an organism’s longevity. These datasets were created by integrating data from the Human Ageing Genomic Resources (HAGR) GenAge database (version: Build 17) [21] and the Gene Ontology (GO) database (version: 2015-10-10) [27]. HAGR is a database of ageing- and longevity-associated genes in four model organisms: C. elegans (worm), D. melanogaster (fly), M. musculus (mouse) and S. cerevisiae (yeast). The GO database provides information about three ontology types: biological process (BP), molecular function (MF) and cellular component (CC). Each ontology contains a separate set of GO terms (features), i.e., a distinct feature hierarchy (a DAG). For each of the 4 model organisms, we created 3 datasets, one for each feature type (feature hierarchy), denoted BP, CC and MF. Hence, each dataset contains instances (genes) from a single model organism. Each instance is formed of a set of binary features indicating whether or not the gene is annotated with each GO term in the GO hierarchy, and a binary class variable indicating whether the instance is positive (“pro-longevity” gene) or negative (“anti-longevity” gene) according to the HAGR database. That is, the class variable indicates whether a gene has the effect of extending or reducing the lifespan of the organism, which is important information for biologists trying to understand the process of ageing. In order to avoid overfitting, GO terms which occurred in fewer than three genes were discarded.

Note that GO terms are a particularly suitable type of predictive feature in the context of our experiments, because they have a clear hierarchical structure, they are available for different types of biological hierarchies (involving biological process, molecular function and cellular component information), and they have also been used in benchmark datasets in previous research in this area, as mentioned earlier. In addition, GO terms are popular in bioinformatics because they allow biologists to describe the biological properties of genes in a standardized way, as well as allowing the description of gene properties at different levels of abstraction. Hence, genes whose properties are known in detail can be annotated with very specific GO terms (near the bottom of the GO hierarchy), whilst genes whose properties are mainly unknown can be annotated with very high-level GO terms (near the top of the GO hierarchy). This makes GO terms a flexible approach for representing biological knowledge.

The remaining 5 datasets, previously used in a related work [24], represent different classification tasks, with features and hierarchies extracted from either the Open Directory Project or DBpedia [14]. These datasets are described below.

  • Tweets T and Tweets C: in these datasets, the task is to identify sports-related tweets, where each tweet can be either related to sports (positive class) or not related to sports (negative class). The hierarchy and features were generated by extracting types (in Tweets T) and categories (in Tweets C) from DBpedia.

  • NY Daily: this dataset is a set of news headings augmented with DBpedia’s types, where the classification task is to identify a sentiment variable (positive/negative).

  • Stumbleupon (Stb.upon): Stumbleupon is a user-curated web content discovery engine that recommends relevant, high-quality pages and media to its users, based on their interests. The aim is to classify each webpage into the “ephemeral” class if it is visited only during specific periods of time, or into the “evergreen” class if it is visited over a long period of time.

  • Cities: this dataset was generated from a list of the most and the least liveable cities according to the Mercer survey augmented with DBpedia types. The task is to classify each city into low, medium and high liveability, where liveability is a variable that combines many factors (such as political stability, medical supplies and services, censorship, among others) to measure to what extent it is desirable to live in a city.

Information about the datasets is shown in Table 1. For each of the four model organisms, there are three rows, each showing information about one dataset (one per feature hierarchy). The first column identifies the group of datasets for each of the four organisms or the general-domain datasets. The second column shows the feature hierarchies used to build the bioinformatics datasets or, in the last five rows, the names of the datasets from general (non-bioinformatics) domains. The other columns show, respectively, the number of features (#features), the number of edges in the feature DAGs (#edges), the number of instances (#instances), the percentage of positive-class instances (% Pos), the percentage of negative-class instances (% Neg), and the percentage of positive feature values (% Pos feat values). In the last row, columns 6 and 7 show the class distribution (low, medium and high) for the Cities dataset.

Table 1 Detailed information about the datasets used in the experiments

5.2 Experimental methodology

We implemented our RPV method and other methods used in this work within the open-source WEKA data mining tool [7]. The datasets used in the experiments and the program code of the RPV method are available at http://github.com/pablonsilva/RPV. The methods were evaluated on the 17 datasets described earlier. The lazy k-NN with Euclidean distance (with k = 1) and a lazy version of Naïve Bayes (NB) (both from WEKA) were used as classification algorithms for all evaluated feature selection methods, and the predictive accuracy was measured by 10-fold cross-validation.

It is worth mentioning that Naïve Bayes has also been used in previous work on hierarchical feature selection [20, 24, 30, 31], as well as k-NN [30, 31]. In addition, in some preliminary experiments with the datasets used in this work and without employing any feature selection method, Naïve Bayes achieved the best predictive accuracy, followed by 1-NN, when compared with other traditional classification algorithms, namely SVM (with various types of kernel), Random Forest and Decision Trees (C4.5). The results of these preliminary experiments are reported in Appendix A.

As shown in Table 1, the majority of the datasets have imbalanced class distributions, so we evaluated the methods’ predictive accuracy using the Geometric Mean (GM) of sensitivity and specificity, the Area Under the Precision-Recall Curve (AUCPR), and the Area Under the Receiver Operating Characteristic curve (AUROC) [8, 10]. The GM is defined in (4), and was also used in [29,30,31].

$$ GM = \sqrt{Sensitivity * Specificity} $$
(4)

GM takes into account the balance between the sensitivity and the specificity of the classifier. Sensitivity (or true positive rate) is the proportion of positive-class instances correctly predicted as positive, whereas specificity (or true negative rate) is the proportion of negative-class instances correctly predicted as negative. The precision-recall curve plots the precision of the classifier as a function of its recall, and the area under this curve (AUCPR) is used to evaluate the classifier (the higher the better). The ROC curve is created by plotting the true positive rate (TPR), also known as sensitivity, against the false positive rate (FPR); again, the higher the area under this curve, the better the performance of the classifier.
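For illustration, the sketch below computes GM from a confusion matrix and uses scikit-learn's AUROC and average-precision functions; average precision is used here as a common estimate of the area under the precision-recall curve, since the paper does not state which estimator was employed (the labels and scores are toy values).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score

# Toy labels, scores and thresholded predictions for a binary task (1 = positive class).
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
gm = np.sqrt(sensitivity * specificity)        # Geometric Mean, as in (4); 0.75 for these toy values

auroc = roc_auc_score(y_true, y_score)             # area under the ROC curve
aucpr = average_precision_score(y_true, y_score)   # average precision, an AUCPR estimate
```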

To determine whether the differences in performance are statistically significant, we ran the Friedman test and the Holm post-hoc test [9], as recommended by Demsar [4]. First, the Friedman test is executed with the null hypothesis that the performances of all methods are equivalent; the alternative hypothesis is that there is a difference among the results of the methods as a whole, without identifying specific pairs of methods with significantly different results. If the null hypothesis is rejected, we run the Holm post-hoc test (which corrects for multiple hypothesis testing) to compare the results of the proposed RPV method (with the LazyR measure) against each of the other methods. Both the Friedman and the Holm tests were used at the 0.05 significance level in all our experiments.
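The protocol can be illustrated as follows, using placeholder scores (one row per dataset, one column per method, with RPV in column 0); the post-hoc z-statistics follow the standard Friedman average-rank comparison described by Demsar, and Holm's correction is applied via statsmodels. This is a sketch of the procedure, not the authors' scripts.

```python
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
scores = rng.random((17, 6))                       # placeholder scores: 17 datasets x 6 methods

stat, p_friedman = friedmanchisquare(*scores.T)    # omnibus test over all methods

if p_friedman < 0.05:
    n_datasets, k = scores.shape
    # average rank of each method over the datasets (rank 1 = best score on a dataset)
    avg_ranks = rankdata(-scores, axis=1).mean(axis=0)
    se = np.sqrt(k * (k + 1) / (6.0 * n_datasets))
    z = (avg_ranks[1:] - avg_ranks[0]) / se        # RPV (column 0) vs. each other method
    p_raw = 2 * norm.sf(np.abs(z))
    reject, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
```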

5.3 Results

Tables 2, 3, 4, 5, 6, 7, 8 and 9 report the predictive accuracy results for three experiments: the first one (Tables 2, 3 and 4) compares our proposed RPV method to baseline approaches; the second one (Tables 5 and 6) compares RPV against the well-known traditional (non-hierarchical) feature selection methods CFS and ReliefF; and the third one (Tables 7, 8 and 9) compares RPV against state-of-the-art hierarchical feature selection methods. These results are discussed in the following three subsections.

Table 2 Comparing RPV with different feature relevance measures against baseline methods in terms of AUCPR—in %
Table 3 Comparing RPV with different feature relevance measures against baseline methods in terms of AUROC—in %
Table 4 Comparing RPV with different feature relevance measures against baseline methods in terms of GM—in %
Table 5 Comparing RPV with the traditional feature selection methods CFS and ReliefF in terms of AUCPR, AUROC and GM using NB as base classifier—in %
Table 6 Comparing RPV with the traditional feature selection methods CFS and ReliefF in terms of AUCPR, AUROC and GM using 1-NN as base classifier—in %
Table 7 Comparing the proposed RPV against state-of-the-art feature selection methods in terms of AUCPR—in % values
Table 8 Comparing the proposed RPV against state-of-the-art feature selection methods in terms of AUROC—in % values
Table 9 Comparing the proposed RPV against state-of-the-art feature selection methods in terms of GM—in % values

5.3.1 Comparison against baseline feature selection approaches

This first experiment aims to evaluate the effectiveness of the two main characteristics of the proposed RPV method, i.e., its focus on selecting only a subset of positive feature values and its new lazy relevance measure.

Tables 2, 3 and 4 report the AUCPR, AUROC and GM results, respectively. In these tables, columns 3 to 8 show the results of six methods using Naïve Bayes, and the last six columns show the results of the same six methods using 1-NN. The six methods are our RPV method using the LazyR relevance measure, two RPV versions with different relevance measures, and three baseline methods. The first baseline is the base classifier using no feature selection method (No FS), i.e., using the full set of predictive features. We also implemented two baseline lazy non-hierarchical feature selection methods: one that selects all features with a positive value in the current test instance (All-Pos), and another that selects all features with a negative value in the current test instance (All-Neg). Moreover, in order to evaluate the benefit of our proposed feature relevance measure (LazyR), we compare our RPV (with the LazyR measure) against two other RPV versions: the first uses the original eager relevance measure R (RPV-R), defined in (1), and the second uses the traditional Information Gain measure (RPV-IG).

The last two rows of Tables 2, 3 and 4 show, for each method, its average rank (Avg. Rank) and its number of wins (#Win). The lower the Avg. Rank, the better (higher) the AUCPR, AUROC or GM value. Note that the Avg. Rank and #Win values for the six methods are computed separately for each of the two classifiers (NB and 1-NN). For each classifier, the highest AUCPR, AUROC or GM value for each dataset is highlighted in bold type; this bold highlighting is also used to indicate the best methods in the other result tables. In the row right below each of Tables 2, 3 and 4, the symbol ≻ represents a statistically significant difference between methods, such that {a}≻{b,c} means that a is significantly better than b and c.

Considering the results for the AUCPR measure in Table 2, RPV-LazyR obtained the best #Win and the best Avg. Rank values for both Naïve Bayes and 1-NN. For these two algorithms, the Holm post-hoc test indicated that RPV-LazyR is significantly better than No FS and All-Neg. Additionally, for NB, RPV-LazyR is significantly better than RPV-R.

The AUROC results in Table 3 show that All-Pos obtained the best #Win for both NB and 1-NN. For NB, the best Avg. Rank was obtained by All-Pos, while RPV-IG obtained the best result for 1-NN. For both classifiers, RPV-LazyR obtained the second best Avg. Rank. The Holm post-hoc test indicated that RPV-LazyR is significantly better than All-Neg for both classifiers and significantly better than No FS for 1-NN. Note that, for both classifiers, there is no significant difference between the AUROC results of All-Pos and RPV-LazyR.

The GM results in Table 4 show that, for Naïve Bayes, RPV-LazyR obtained the smallest (best) Avg. Rank among all six methods. It also obtained the highest GM value in 7 out of the 17 datasets. The Holm post-hoc test indicated that RPV-LazyR is significantly better than No FS and All-Neg. For 1-NN, RPV-LazyR achieved the best Avg. Rank, but the No FS baseline achieved the highest #Win. Moreover, the Holm post-hoc test indicated that RPV-LazyR is significantly better than All-Neg.

In summary, the results reported in Tables 2, 3 and 4 involve six comparison settings, i.e., three predictive performance measures times two classifiers. Regarding the Avg. Ranks, RPV-LazyR was the best method in four settings (for both the NB and 1-NN classifiers with both the AUCPR and GM measures), All-Pos was the best method in one setting (for NB with AUROC) and RPV-IG was the best in one setting (for 1-NN with AUROC). Regarding the #Wins, RPV-LazyR was the best method in three settings (for both classifiers with AUCPR and for NB with GM), All-Pos was the best method in two settings (for both classifiers with AUROC), and No FS was the best approach in one setting (with 1-NN and GM).

Note that the full set of positive feature values has much higher predictive power than the full set of negative values and than the full set of features, since All-Pos obtained much better AUCPR, AUROC and GM Avg. Rank values than All-Neg and No FS. This suggests that positive feature values are more informative than negative feature values. Also, RPV-LazyR has both the best average rank and the highest number of wins for Naïve Bayes using AUCPR and GM, and for 1-NN using AUCPR. RPV-LazyR also obtained the second highest number of wins for both classifiers using AUROC. Hence, selecting the most relevant features using the LazyR relevance measure increases the predictive power when compared with the baseline methods. This result also indicates the benefit of using our proposed LazyR measure, rather than the R or IG measures. Therefore, the RPV method using LazyR was chosen for the next experiments, due to its highest overall predictive accuracy.

5.3.2 Comparison against traditional feature selection methods

In the second experiment, the RPV method (RPV with LazyR) is evaluated against two non-hierarchical feature selection methods: CFS using WEKA’s default parameters and best-first search (with lookup set to 1 and search termination set to 5) and ReliefF (with a threshold set to 0.01, sigma set to 2 and k set to 10). Tables 5 and 6 show the results for the Naïve Bayes and 1-NN classifiers, respectively. These tables are divided into three partitions, showing results for the AUCPR, AUROC and GM measures.

As shown in the last two rows of the tables, RPV (with LazyR) obtained the best Avg. Rank and by far the highest #Win for all six combinations of the two classifiers and the three performance measures. The Holm post-hoc test indicated that RPV is significantly better than both CFS and ReliefF for three of those six combinations, whilst in the remaining three combinations RPV is significantly better than ReliefF only.

5.3.3 Comparison against state-of-the-art hierarchical feature selection methods

In this experiment, the RPV method is evaluated against five recent hierarchical feature selection methods: GTD, TSEL, SHSEL (using Information Gain, with a threshold set to 0.99 [24]), HIP and MR—all reviewed in Section 3. GTD, TSEL, HIP and MR have no user-defined parameters. MR could also use LazyR instead of the original R measure; however, previous experiments (not reported here) showed that R is the best relevance measure for MR, so we use the original MR with R in the following experiments.

Table 7 shows that RPV achieves both the best Avg. Rank and the highest #Win for both NB and 1-NN, in terms of AUCPR. For NB, the Holm post-hoc test indicates that RPV is statistically better than TSEL, SHSEL and MR. For 1-NN, the Holm test indicates that RPV is significantly better than all the other five hierarchical feature selection methods.

Considering the results for the AUROC measure in Table 8, RPV obtained the best Avg. Rank and #Win for both NB and 1-NN classifiers. According to the Holm post-hoc test, for NB, RPV is statistically superior to four out of five state-of-the-art feature selection methods (GTD, TSEL, SHSEL and MR), while, for 1-NN, RPV obtained statistically superior results to three methods (GTD, TSEL and HIP).

Table 9 shows that RPV achieves both the best Avg. Rank and by far the highest #Win for both Naïve Bayes and 1-NN, in terms of GM. For NB, the Holm post-hoc test indicates that RPV is statistically superior to all five hierarchical methods. For 1-NN, the post-hoc test indicates that RPV had a statistically better GM than GTD, TSEL, SHSEL and HIP.

Note that there is a large difference of performance between RPV and HIP, the two best methods overall. RPV achieves both a higher number of wins and a better average rank than HIP in all experiments reported in Tables 78 and 9. Also, RPV achieved statistically better results than HIP in four out of the six comparison scenarios (three measures times two classifiers).

Summarizing, using the AUCPR, AUROC and GM measures and both the Naïve Bayes and 1-NN classifiers, RPV achieved a better average rank and a higher number of wins than the hierarchical methods GTD, TSEL, SHSEL, HIP and MR. In all comparisons RPV was statistically superior to TSEL, and in five of the six comparisons RPV was statistically superior to SHSEL. When compared to GTD, HIP and MR, in some cases the differences were not statistically significant; however, in all these cases RPV clearly outperformed these hierarchical methods in terms of both average rank and number of wins.

In addition, by analyzing the results of RPV in Tables 2 through 9, we can observe that overall RPV performed particularly well on the five ’General’ datasets, which, broadly speaking, are the largest datasets used in our experiments in terms of number of features, number of edges in the feature hierarchy and number of instances, as can be observed in Table 1.

5.3.4 Running time performance

Figure 4 shows the results of the experiments that measured the methods’ runtimes on a computer with 4 GB of RAM and an Intel Core i5 1.6 GHz CPU. This figure reports the mean, over the 17 datasets, of the ratio of each method’s runtime to the runtime of the RPV method, used as a baseline. This mean ratio is used because the runtimes vary greatly in magnitude across the 17 datasets, so a direct mean over the raw runtime values would be misleading. Regarding training times, Figure 4 shows the mean ratio of the training time of each method over RPV’s training time. For eager methods, training time includes the time spent selecting features and building the Naïve Bayes model. For lazy methods, it includes the time for pre-computing the probabilities used by Naïve Bayes and for computing, for each feature in the dataset, its ancestors and descendants (for HIP and RPV), the paths from the feature to the root and to the leaves (for MR), and its relevance value (for MR and RPV).

Fig. 4 Mean ratios of a method’s training and test times over RPV’s corresponding times, evaluated for the CFS, ReliefF, GTD, TSEL, SHSEL, HIP and MR methods

Figure 4 also shows the mean ratio of the testing time of each method over RPV’s testing time. For eager methods, testing time includes the time required to classify one instance with Naïve Bayes or 1-NN. For lazy methods, it includes the time required to select features for, and classify, one instance.

CFS, ReliefF, GTD, TSEL and SHSEL are eager methods and find a single subset of features during the training step which will be used to classify all test instances, while the lazy methods HIP, MR and RPV find a subset of features for each test instance in the classification step. Hence, as can be observed in Fig. 4, RPV has the best training time when compared with all other methods but HIP, which is 20% faster to train than RPV. Regarding the testing times, RPV is the fastest among the three lazy methods, but it is slower than most eager methods.

5.3.5 Evaluating the feature space compression

The selection of a small subset of relevant features for each instance may improve the interpretability of the model’s predictions, since only relevant features are used to justify each prediction. So, we report the percentage of features selected by each method in Table 10. Again, the table’s first two columns show the feature hierarchies (GO term types) used to build the datasets and the name of the general domain datasets. The following columns show the percentage of features selected by each method. The last two rows show the number of wins (#Win) and the average percentage of features selected (Avg. %) across all 17 datasets.

Table 10 Average percentage of features selected by CFS, ReliefF, GTD, TSEL, SHSEL, HIP, MR, All-Neg, All-Pos and RPV

The results show that, in general, RPV selects a smaller feature subset than all other nine methods. RPV (with the LazyR measure) selects, on average, only 3.9% of all features per instance, whilst achieving the highest predictive accuracy in general. Note that the percentage of features selected by RPV is about half of that selected by All-Pos. This means that not all positive feature values are relevant, and that the correct identification of relevant positive feature values increases the predictive accuracy of Naïve Bayes and 1-NN; in this work, we introduced the LazyR measure to identify such features. Although the HIP method (which selects the set of the most specific positive-valued features and the most general negative-valued features) achieved a good predictive performance, the RPV method obtained both a higher predictive performance and a much smaller selected feature subset.

Note that all lazy feature selection methods (RPV, HIP and MR) select a separate set of features for each test instance. This has the advantage of improving the interpretability of the classification of individual instances, when a user wants to identify the most relevant features for the current instance; but it has the disadvantage of not providing a unique set of relevant features for the dataset as a whole.

5.3.6 Brief remarks on the most frequently selected features in the bioinformatics datasets

We have ranked all features (GO terms) in each bioinformatics dataset in decreasing order of their selection frequency by the RPV method. The full rankings are available at http://github.com/pablonsilva/RPV. In this subsection we briefly focus only on the top 15 features in the four datasets (one per organism) containing only BP GO terms, since this type of GO term is broadly easier to interpret than MF and CC GO terms. The top 15 BP GO terms for each of these datasets are available in the Supplementary Tables at the above GitHub website.

Two examples of the relevance of some of these GO terms to the biology of ageing are as follows. First, the term GO:0006412 (“Translation”) was ranked seventh in the “yeast” dataset. As evidence supporting the relevance of this term, a reduction in the levels of 60S ribosomal subunits led to a significant increase in yeast’s replicative lifespan as shown in [25]. As a second example, in the “fly” dataset, several stress-response related GO terms—e.g., GO:0033554 (“cellular response to stress”) and GO:0006950 (“response to stress”)—were among the top-15 terms. Stress response has been shown to be enhanced in a mutant line with extended longevity [16], and stress-response functions are enriched in genes with enhanced rhythmicity of expression in late life (“late-life cyclers”) [13].

6 Conclusion and future work

This paper presented a novel lazy feature selection method for hierarchical and sparse feature spaces, based on the hypothesis that positive feature values provide more meaningful and accurate information, even though they are present to a small extent in each instance. Our method, named Select Relevant Positive Feature Values (RPV), has some interesting properties: (i) it selects rare but informative and relevant positive features; (ii) it selects smaller feature subsets; and (iii) it is based on a new lazy feature relevance measure (LazyR), which assesses the predictive power of a feature value specifically in the current test instance being classified.

The computational experiments involved 17 real-world datasets and six different classification scenarios, namely six combinations of two different classifiers (Naïve Bayes and 1-NN) times three different predictive accuracy measures (AUCPR, AUROC and the Geometric Mean of Sensitivity and Specificity). The results have shown that the proposed RPV method obtained in general the best predictive accuracy across those six classification scenarios. Overall, the proposed RPV method (with the LazyR measure) was compared, across the above six scenarios, against 12 other feature selection approaches: five hierarchical feature selection methods, two traditional (non-hierarchical) feature selection methods, three baseline approaches, and two other variants of RPV (not using the LazyR measure).

The results of statistical significance tests have shown that RPV obtained predictive accuracies significantly better than another approach in 44 out of the 72 cases (61.1% of all the cases).

In addition, in none of those cases was RPV’s predictive accuracy significantly worse than the accuracy of any other feature selection approach. Furthermore, RPV selected, in general, the smallest subset of features among all evaluated feature selection methods. This is also desirable, since each instance is classified using its own small set of relevant features; hence each instance’s classification is justified by a more specific feature subset, improving prediction interpretability.

In addition, the hypothesis that selecting positive feature values might increase the predictive accuracy of the classifier (mentioned earlier) is supported by two types of results: first, the fact that RPV, which selects only a subset of positive feature values (i.e., it never selects negative feature values), obtained by far the best predictive accuracy results; and second, the fact that the results of the All-Pos baseline method, which selects all positive feature values (and no negative values), were clearly better than the results of the All-Neg method, which selects all negative feature values (and no positive values), as discussed in Section 5.3.1.

Table 11 Results from the statistical analysis of the experiments presented in Sections 5.3.1, 5.3.2 and 5.3.3

In future work, we plan to evaluate the behaviour of the proposed RPV method in each application domain, analysing the usefulness of the selected features with the assistance of specialists from those domains.