Keywords

1 Introduction

Nowadays, the number of new applications of multi-label learning is steadily increasing, hence, the researchers are very interested to develop novel methods and new ideas related to multi-label classification and structuring of the label/output space. Multi-label classification (MLC) is the task of learning from data examples where each example can be associated with multiple labels. MLC deals with a label dependencies and relations which is orthogonal with existing traditional methods which take into account label independence and learn independent functions from mapping from input space to the output (label) space. The different application problems include video and image annotations (new movie clips, genres), predicting genes and proteins (functional genomics), classification of a tweets and music into emotions, text classification (web-pages, bookmarks, e-mails,...) and others.

The MLC task is typically approached either by decomposing the MLC problem into multiple single class classification problems (i.e., problem transformation methods) or by modifying the algorithms to consider the multiple classes jointly (i.e., algorithm adaptation methods) [12]. In an extensive experimental study Madjarov et al. [7] show that the landscape of MLC methods is not simple: on some datasets problem transformation methods achieve top performance while on other datasets the algorithm adaptation methods are top performing. Furthermore, the study recommends the use of two algorithms for benchmarking: RF-PCT (Random forests of predictive clustering trees, an algorithm adaptation method) [5] and HOMER (Hierarchy Of Multi-label learnERs, a problem transformation method) [13]. The latter divides the label space into subspaces and then constructs classifiers for each of the subspace (e.g., label power set classifiers). This hints that the best performance might be obtained in between the spectrum of the algorithm adaptation and problem transformation methods. In other words, state-of-the-art MLC performance might be obtained by transforming the original MLC problem into several MLC problems and then learn predictive models (preferably using algorithm adaptation methods).

A crucial step in developing methods for output decomposition for MLC is the creation of the subspaces. More specifically, the goal is to find a dependency structure and consider jointly the labels that are inter-dependent. The construction of the output structure of the labels can be very tedious and expensive process, especially if domain experts are needed to complete the task. Moreover, selection of the representation language of the dependencies can be complicated task on its own. Typically, these dependencies are represented as hierarchies of labels [6]. The hierarchies can then be constructed in a data-driven manner using the descriptive space and/or the label space. This presents automatic and relatively efficient process to obtain the representation of the potential dependencies in the label space.

Madjarov et al. [6] present an extensive study of different data-driven methods for constructing label hierarchies for multi-label classification by using the label co-occurence matrix. More precisely, the hierarchies are constructed using four clustering algorithms, agglomerative clustering with single and complete linkage, balanced k-means and predictive clustering trees applied on the label co-occurrences (see Fig. 1, left table).

Next, Szymansky et al. [11] address the question whether data-driven approach using label co-occurrence graph is significantly better than a random choice in label space division for multi-label classification as performed by RAkELd. Their results show that in almost all cases data-driven partitioning outperforms the baseline RAkELd in all evaluation measures, but Hamming loss.

In this study, we build upon the idea of decomposition of the output space and we present a different approach for data-driven structuring of label space in multi-label classification. Our approach constructs the label hierarchy by clustering the per label feature rankings. Namely, instead of using the original label space consisting of label co-occurrences (see Fig. 1, left table), we calculate a feature importance/ranking scores of the features for each label by using the GENIE3 method for feature importance calculation coupled with the random forest ensemble learning method [1, 3] (see Fig. 1, right table).

The obtained structure is then used as the label hierarchy and the MLC task is addressed as hierarchical multi-label classification (HMLC) [5, 15]. We thus evaluate whether considering the dependency in the label space can provide better predictive performance than addressing MLC as a flat problem. In other words, we investigate whether considering the MLC task as a hierarchical MLC task can yield better predictive performance. Our approach is illustrated through the example in Fig. 1. The table on the left hand-side shows the construction of the label hierarchy using the label co-occurrence (as performed in [6, 11]), while the table on the right hand-side shows our proposed method for constructing the label hierarchy.

Fig. 1.
figure 1

Excerpt from the original emotions dataset showing the output space consists of label co-occurrences (left table) and the space consists of ranks of the features for each of the labels, separately (right table). The former is obtained with structuring the original label set using feature ranking.

We perform an experimental evaluation using 8 benchmark datasets from different domains: text, image, music and video classification, and gene function prediction. The predictive performance of the methods is assessed using 13 different evaluation measures used in the context of MLC (6 threshold dependent and 7 threshold independent).

The obtained results indicate that using the methods for creating the hierarchies using feature ranking can yield a better predictive performance as compared to the original flat MLC methods without the hierarchy. Moreover, using the hierarchy constructed by structuring of the output space using the feature rankings of the labels gives better predictive performance compared to using the hierarchy obtained using the label co-occurrences.

The reminder of this paper is organized as follows. Section 2 presents the background work, i.e., discussion on the tasks of multi-label classification and hierarchical multi-label classification methods. Then, in Sect. 3, we present the structuring of the output space using feature ranking. In Sect. 4, we show the experimental design. The results obtained from the experiments are presented and discussed in Sect. 5. Finally, Sect. 6 concludes this paper.

2 Background

In this section, we first define the task of multi-label classification and then the task of hierarchical multi-label classification. Multi-label learning considers learning from examples which are associated to more than one label coming from a predefined set of labels containing all possible labels. There are two types of multi-label learning tasks: multi-label classification and multi-label ranking. The main goal of multi-label classification is to create a predictive model that will output a set of relevant labels for a given, previously unseen example. Multi-label ranking, on the other hand, can be understood as learning a model that, for each unseen examples, associates a list of rankings (preferences) on the labels from a given set of possible labels and a bipartite partition of this set into relevant and irrelevant labels. An extensive bibliography of methods for multi-label learning can be found in [7, 14] and the references therein.

The task of multi-label learning can be defined as follows [5]. The input space \(\mathcal {X}\) consists of vectors of values of nominal or numeric data types i.e., \(\forall x_{i} \in \mathcal {X}, x_{i} = (x_{i1}, x_{i2}, \dots x_{iD})\), where D is a number of descriptive attributes. The output space \(\mathcal {Y}\) consists of a subset of a finite set of disjoint labels \(\mathcal {L} = \{\lambda _{1}, \lambda _{2}, \dots , \lambda _{Q}\}\) (Q > 1 and \(\mathcal {Y} \subseteq \mathcal {L}\)). Given this, each example is a pair of a vector and a set from the input and output space, respectively. All of the examples then form the set of examples (i.e., the dataset) E. The goal is then to find a function \(h: \mathcal {X} \rightarrow 2^{\mathcal {L}}\) such that from the input space assigns a set of labels to each example.

The main difference between multi-label classification and hierarchical multi-label classification (HMLC) is that in the latter the labels from the label space are organized into a hierarchy. A given example labeled with a given label it is also labeled with all its parent labels (known as the hierarchy constraint). Furthermore, an example can be labeled with multiple labels, simultaneously. That means a several paths can be followed from the root node in order to arrive at a given label.

Here, the output space \(\mathcal {Y}\) is defined with a label hierarchy \((\mathcal {L}, \le _{h})\), where \(\mathcal {L}\) is a set of labels and \(\le _{h}\) is a partial order parent-child relationship structured as a tree (\(\forall \lambda _{1}, \lambda _{2} \in \mathcal {L}: \lambda _{1} \le _{h} \lambda _{2}\) if and only if \(\lambda _{1}\) is a parent of \(\lambda _{2}\)) [5]. Each example from the set of examples E is a pair of a vector and a set from the input and output space respectively, where the set satisfies the hierarchy constraint, i.e., \(E = \{(x_{i},\mathcal {Y}_{i}) | x_{i} \in \mathcal {X}, \mathcal {Y} \subseteq \mathcal {L}, \lambda \in \mathcal {Y}_{i} \Rightarrow \forall \lambda ' \le _{h} \lambda : \lambda ' \in \mathcal {Y}_{i}, 1 \le i \le N \}\), where N is a number of examples in E. Same conditions as in multi-label classification should be satisfied for the quality criterion q (high predictive performance and low computational cost). In [9], an extensive bibliography is given, where the HMLC task is presented across different application domains.

3 Structuring of Label Spaces Using Feature Ranking

In this section, we explain our method for structuring the label space using feature ranking and we describe the different clustering algorithms used in this work. Our proposed method for label space structuring is outlined in procedure StructuringLabelSpaceFR in Table 1. First, we take the original training dataset \(D^{train}\) and using random forest method with GENIE3 feature importance, we create feature rankings for each label separately. We then construct a dataset \(D^{ranks}\) consisting of the feature rankings. Next, we obtain a hierarchy using one of the clustering algorithms described bellow. The hierarchy is then used to pre-process the datasets and obtain their hierarchical variants \(D^{train}_{H}\) and \(D^{test}_{H}\). At the end, we learn the HMLC predictive models.

Table 1. The algorithm for structuring the label space using feature rankings per label.

In our approach, described in a procedure StructuringLabelSpaceFR (Table 1), we can see that additional step, compare to the algorithm given by Madjarov et al. [6], is the function CreateFimp at line 4, which increases the theoretical complexity of the procedure. According to the dimensionality of the space which is going to be clustered using the function Clustering at line 5, one dimension in the space consists of label co-occurrences is the number of examples (instances) which means that in case of more complex datasets with large number of examples, the clustering procedure will take more of the time in order to create a hierarchy. From the other side, the procedure of creating the hierarchy using feature rankings has a dimension which depends of the feature space cardinality. Typically, the feature space cardinality is much smaller than the number of examples. It means that clustering of the rankings will finish faster than clustering of the label co-occurrences for datasets with large number of examples but small number of features, which is a case in most of the benchmarks datasets available. Consequently, although we have additional function in our procedure of structuring of the output space, for more complex datasets with high number of examples and smaller number of features, the clustering procedure, i.e., the hierarchy creation will be completed in a reasonable time, thus compensating for obtaining the feature rankings.

We next describe the procedures for obtaining the feature rankings. Random forests as ensemble method for predictive modeling are originally proposed by Breiman [1]. The empirical analysis of their use as feature ranking methods has been studied by Verikas et al. [16]. The random forests are constructed by first performing bootstrap sampling on the data and then building a decision tree for each bootstrap sample. The decision trees are constructed by taking the best split at each level, from a randomly selected feature subset.

Huynh-Thu et al. [3] propose to use the reduction of the variance in the output space at each test node in the tree (the resulting algorithm is named GENIE3). Namely, the variables that reduce the variance of the output more are, consequently, more important than the ones that reduce the variance less. Hence, for each descriptive variable we measure the reduction of variance it produces when selected as splitting variable. If a variable is never selected as splitting variable then its importance will be 0.

The GENIE3 algorithm has been heavily evaluated for single-target regression tasks (e.g., for gene regulatory network reconstruction). The basic idea adopted for future ranking is the same of that proposed in GENIE3, but we use random forest of predictive clustering trees (PCTs) for building the ensemble. The result is a feature ranking algorithm that works for different types of structure output prediction tasks (including MLC and HMLC).

Furthermore, we discuss the different clustering methods used to obtain the hierarchies of the labels. For achieving a good performance of the HMLC methods, it is critical to generate label hierarchies that more closely capture the relations among the labels. The only constraint when building the hierarchy is that we should take care about the leaves of the label hierarchies. They need to define the original MLC task. In particular, the labels from the original MLC problem represent the leaves of the label hierarchy, while the labels in internal nodes of the tree are so-called meta-labels. Meta-labels model the potential relations among the original labels.

For obtaining the hierarchies, we use four different clustering methods (two agglomerative and two divisive):

  • agglomerative clustering with single linkage;

  • agglomerative clustering with complete linkage;

  • balanced k-means clustering (divisive) and

  • predictive clustering trees (divisive).

Agglomerative clustering algorithms consider each example as separate cluster at the beginning and then iteratively merge pairs of clusters based on their distance metric (linkage). If we use the maximal distance of two examples from the clusters \(C_{1}\) and \(C_{2}\), then this type of agglomerative clustering is using complete linkage, i.e., max\(\{dist(c_{1},c_{2}) : c_{1} \in C_{1}, c_{2} \in C_{2}\}\). If we use the minimal distance between two clusters, then the agglomerative clustering approach is with single linkage i.e., min\(\{dist(c_{1},c_{2}) : c_{1} \in C_{1}, c_{2} \in C_{2}\}\).

Balanced k-means is top-down approach for clustering. First, all labels from the label space \(\mathcal {L}\) are in one common cluster at the top node of the hierarchy. Then, the procedure consecutively divides (splits) this cluster into k disjoint sub-clusters (\(k < |\mathcal {L}_{n} |\)) using k-means clustering. The division also is concerned with the number of examples in each cluster: the algorithm outputs clusters with approximately equal size [13]. The procedure recursively is repeated on each sub-cluster (meta-label) until we have n different clusters consisting of one label from the label space \(\mathcal {L}\). In other words, our label space \(\mathcal {L}\) is covered by leaves of the hierarchy obtained by the balanced k-means clustering approach.

We also use predictive clustering trees to construct the label hierarchies. More specifically, the setting from the predictive clustering framework used in this work is based on treating the target space as descriptive space, i.e., the target space is also a descriptive space. Descriptive/target variables are used to provide descriptions for the obtained clusters. Here, the focus is using predictive clustering framework on the task of clustering instead of predictive modelling [2, 4]. The obtained hierarchies using agglomerative clustering (single and complete linkage) and using predictive clustering trees for emotions dataset are shown in Fig. 2.

Fig. 2.
figure 2

Hierarchies obtained using agglomerative single (top-left), agglomerative complete (top-right), balanced K-means clustering (bottom - left) and PCTs (bottom - right) clustering methods for emotions dataset.

We next present the predictive clustering trees (PCTs) - the modelling framework we used throughout this work. PCTs are a generalization of decision trees towards the tasks of predicting structured outputs, including both MLC and HMLC. In order to apply PCTs to the task of HMLC, Vens et al. [15] define the variance and the prototype as follows. First, the set of labels for each example is represented as a vector of binary components. If the example belongs to the class \(c_{i}\) then the i’th component of the vector is 1 and 0, otherwise. The variance of a set of examples E is thus defined as follows:

$$\begin{aligned} Var(E) = \frac{1}{|E|} \cdot \sum _{i=1}^{|E|} dist(\varGamma _{i},\overline{\varGamma })^{2} \end{aligned}$$
(1)

where \(\overline{\varGamma } = \frac{1}{|E|} \cdot \sum _{i=1}^{|E|} \varGamma _{i}\).

In other words, the variance Var(E) in (1) represents the average squared distance between each example’s class vector (\(\varGamma _{i}\)) and the mean class vector of the set (\(\overline{\varGamma }\)). When we talk about HMC, then the similarity at higher levels of the hierarchy are more important than the similarity at lower levels. This is reflected with the distance term used in (1), which is weighted Euclidean distance:

$$\begin{aligned} dist(\varGamma _{1},\varGamma _{2}) = \sqrt{\sum _{s=1}^{|\varGamma |} \theta (c_{s}) \cdot (\varGamma _{1,s}-\varGamma _{2,s})^{2}} \end{aligned}$$

where \(\varGamma _{i,s}\) is the s’th component of the class vector \(\varGamma _{i}\) of the instance \(E_{i}\), \(|\varGamma |\) is the size of the class vector, and the class weights \(\theta (c) = \theta _{0}\cdot \) avg\(_{j} \{\theta (p_{j}(c))\}\), where \(p_{j}(c)\) is j’th parent of the class c and \(0<\theta _{0}<1\). The class weights \(\theta (c)\) decrease with the depth of the class in the hierarchy thus making the differences in the lower parts of the hierarchy less influential to the overall score.

Random forests of PCTs for HMLC are considered in the same way as the random forest of PCTs for MLC. In the case of HMLC, the ensemble is a set of PCTs for HMLC. A new example is classified by taking a majority vote from the combined predictions of the member classifiers. The prediction of the random forest ensemble of PCTs for HMLC follows the hierarchy constraint (if the example is labeled with a given label then is automatically labeled with all its ancestor-labels).

4 Experimental Design

The aim of our study is to address the following questions:

  1. (i)

    Whether feature ranking on the label (output) space in the MLC task can be used to construct good label hierarchies?

  2. (ii)

    Which clustering method yields better hierarchy?

  3. (iii)

    How this scales from single model to ensemble of models?

  4. (iv)

    Can we achieve better predictive models with using a hierarchies obtained by structuring the feature ranking or co-occurrences space?

In order to answer the above questions, we use eight multi-label classification benchmark problems from different domains. We have 3 datasets from text classification, 4 datasets from multimedia, includes movie clips and genres classification and 1 dataset from biology. All datasets are predefined by other researchers (typically the data owners) and divided into train and test subsets. The basic information and statistics about these datasets are given in Table 2.

Table 2. Statistics of used benchmark tasks in terms of application domain (domain), number of training examples (#tr.e), testing examples (#t.e), number of descriptors (D), total number of labels (L) and number of labels per example.

In our experiments, we use 13 different evaluation measures, as presented in [7, 14]. These are divided into two groups: 6 threshold dependent/example based measures (hamming loss, accuracy, precision, recall, \(F_{1}\) score) and 7 threshold independent measures out of which three ranking-based (one-error, coverage and ranking-loss) and four areas under ROC and PRC curves (AUROC, AUPRC, wAUPRC and pooledAUPRC). The threshold independent measures are typically used in HMLC and they do not require a (pre)selection of thresholds and calculating a prediction [15]. All of the above measures offer different viewpoints on the results from the experimental evaluation.

Hamming loss is an example-based evaluation measure that evaluate how many times a pair of example and its label are misclassified. One-error is a ranking-based evaluation measure that evaluates how many times the top-ranked label does not exist in a set of relevant labels of the example. Coverage evaluates how far, on average, we need to go down the list of label ranks in order to cover all relevant labels of given example. Ranking loss evaluates the average fraction of the label pairs that are reversely ordered for the given example. Precision and recall are very important measures defined for binary classification tasks with classes of positive and negative examples. Precision is a proportion of positive prediction that are correct, and recall is the proportion of positive examples that correctly predicted as positive. \(F_{1}\) score is the harmonic mean between precision and recall. Accuracy for each instance is defined as the proportion of correctly predicted labels over total number of labels for that instance. Overall accuracy is the average across all instances. A precision-recall curve (PR curve) is a curve that represent the precision as a function of its recall. AUPRC (area under the PR curve) is the area between the PR curve and the recall axis. wAUPRC evaluates the weighted average of the areas under the individual (per class) PR-curves. If choosing some threshold, we transform the multi-label problem into binary problems with considering binary classifier as a couple (instance, class) and predicting whether that instance belongs to that class, we can obtain PR curves that differ depend of the varying the threshold. The area under the average PR curve (from all different threshold curves) is called pooledAUPRC. From the other side, if we consider the space of true positive rates (sensitivity) versus false positive rates (fall-out) then the curve considers the sensitivity as a function of the fall-out is called ROC-curve. The are under this ROC-curve is the evaluation measure called AUROC.

The majority of our experiments are performed using the CLUS software package (https://sourceforge.net/projects/clus/), which implements the predictive clustering framework, including PCTs, random forests of PCTs and feature ranking [5, 10]. A hierarchical tree defined by the used clustering methods in HMLC setting are defined as tree shaped hierarchies. We use the same values for k in balanced k-means clustering algorithm, as suggested in [7].

For obtaining a hierarchy using the agglomerative clustering method we use the R software package (function agnes() from the cluster package. For more info, see https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/agnes.html). We use the MATLAB software package to create hierarchies with balanced k-means clustering which is based on Hungarian (Munkres’) assignment algorithm to assign the examples to the clusters [8]. We use Euclidean distance metric in all our algorithms that require distance. Moreover, for random forest for feature ranking we use GENIE3 as a feature importance method based on variable selection with ensembles of PCTs [3].

In order to make a comparative analysis with the results obtained by the study by Madjarov et al. [6], we repeated their experiments on the same experimental setting with the experiments we perform for feature ranking.

5 Results

In this section, we present the obtained results from the experiments we performed using our novel proposed method for structuring the output space. In our study, as an output space, we consider the space consisting of label co-occurrences (as presented by Madjarov et al. [6]) and the space consisting of feature ranks for each label, respectively. We compare the following methods for hierarchy construction:

  • flat MLC problem without considering a hierarchy in the label space (FlatMLC);

  • agglomerative clustering with single linkage (AggSingle);

  • agglomerative clustering with complete linkage (AggComplete)

  • balanced k-means clustering (BKmeans)

  • clustering using predictive clustering trees (ClusPCTs).

Since we have two different models (single PCTs model and random forest of PCTs) and two different structured output spaces, we show separately the results for single PCTs (Fig. 3) and random forest of PCTs (Fig. 4). In order to distinguish between using either single tree or random forest of PCTs and different methods of structuring the output space (label co-occurrences and feature rankings), we use prefixes (PCT- and RF-) and suffixes (-CO and -FR) before and after the hierarchy construction method name, respectively. For example, RF-AggComplete-CO refers to the agglomerative clustering method with complete linkage of the output space of label co-occurrences using random forest of PCTs for model creation. Then, PCT-ClusPCTs-FR refers to the clustering method with PCTs of the output space consists of feature rankings per label using single PCTs for model creation, etc.

Fig. 3.
figure 3

Results with the 13 performance measures for single PCTs from experiments performed on 8 different datasets. The best results obtained per measure per dataset are highlighted.

Fig. 4.
figure 4

Results with the 13 performance measures for Random Forest from experiments performed on 8 different datasets. The best results obtained per measure per dataset are highlighted.

Observing the results obtained using single PCTs (Fig. 3), we can note that there is no clear winner across all evaluation measures and datasets. In the case of threshold independent measures, such as AUPRC, AUROC, wAUPRC and pooledAUPRC, we can see that hierarchies created using clustering of the output space consisting of feature rankings perform the best for enron, emotions, mediamill and yeast datasets. Considering the scene and corel5k datasets, we can observe that they perform the best according to AUROC, AUPRC and pooledAUPRC, but not for wAUPRC. PCT-BKmeans-FR outperforms the other algorithms for hierarchy creation in the emotions dataset according to the most of the evaluation measures but not according to one-error. Moreover, the hierarchies created clustering the feature rankings outperform the other algorithms considering the ML performance measures (ML F1 measure, ML accuracy, ML precision and ML recall) in 5 out of the 8 datasets.

Generally, structuring the output space consisting of feature rankings for each label yields better predictive performance compared to the structuring the output space consisting of label co-occurrences considering most of the evaluation measures in almost all datasets. For the corel5k dataset only, we can see that both have similar performance. If we consider medical and tmc2007 datasets, we can see that structuring the output space does not improve the performance as compared to the flat MLC task, where there is no hierarchy considered. All in all, we can conclude that using the hierarchies, the predictive performance can be improved.

The results obtained when random forests are used as predictive models are given in Fig. 4. These results present a different situation as compared to the results obtained when single PCTs are used as predictive models. First of all, the predictive performance is improved as compared to the single PCTs for large majority of the cases. Most notably, the performance for the threshold independent measures (AUPRC, AUROC, wAUPRC and pooledAUPRC) for the mediamill and tmc2007 datsets are improved for almost twice, which is consistent to the notion from the literature that ensembles of PCTs improve the performance over single predictive models. Hierarchies created with clustering of the space consisting of feature rankings outperform both hierarchies obtained using label co-occurrences and flat MLC for the threshold independent measures on the medical, enron and emotions datasets. RF-BKmeans-FR performs the best for medical dataset in seven evaluation measures. Considering the hierarchies obtained with clustering the space of label co-occurrences, we can note that they outperform the other methods for the corel5k dataset. Using hierarchies (i.e., label dependences) rather than flat multi-label task improves the predictive performance generally for most of the evaluation measures, but not for (ML F1 measure, ML accuracy, ML precision and ML recall) in the emotions and scene datasets.

Finally, in our study we also considered training errors i.e., the errors made in the learning phase. There, in a large majority of the cases, the original FlatMLC method performed the best. This means that other methods we use for constructing the hierarchies do not overfit as the original one. This is another advantage of methods for construction the hierarchies identified from the obtained results.

6 Conclusions and Further Work

In this work, we have presented an approach for hierarchy construction and structuring the output (label) space by using feature ranking. More specifically, we cluster the feature rankings to obtain a hierarchical representation of the potential relations existing among the different labels. We then address the task of MLC as a task of HMLC. Moreover, we compare our approach with the approach of clustering the space consisting of label co-occurrences [6].

We investigated four clustering methods for hierarchy creation, agglomerative clustering with single and complete linkage, balanced k-means and clustering using predictive clustering trees (PCTs). The resulting problem was then approached as a HMLC problem using PCTs and random forests of PCTs for HMLC. We used eight benchmark datasets to evaluate the performance.

The results reveal that the best methods for hierarchy construction are agglomerative clustering methods and balanced k-means. Compared to the original MLC method where there is no hierarchy this improves the performance in most of the datasets. In four datasets, the hierarchies obtained by clustering the label space consisting of feature rankings improve the predictive performance compared to the hierarchies obtained by clustering the space consisting of label co-occurrences. Similar conclusions, but to a lesser extent, can be made for the random forests of PCTs for HMLC - in many of the cases (datasets and evaluation measures) the predictive models exploiting the hierarchy of labels yield better predictive performance. Finally, by considering the training error performance, we find that original MLC models overfit more than the HMLC models.

For further work, we plan to make more extensive evaluation on more datasets with diverse properties and to try more different feature ranking methods. Furthermore, we assume that potential improvement of the performance can be achieved with cutting the hierarchies based on some conditions such as density, distribution or distance between nodes. Moreover, we plan to include a comparison to network approaches given by Szymanski et al. [11]. Finally, we plan to extend this approach to other tasks, such as multi-target regression.