1 Introduction

In traditional classification problems, such as binary or multi-class classification, each instance in the training data is associated with only one class label. In contrast, multi-label learning (MLL) addresses problems in which each instance is associated with multiple class labels simultaneously [1]. This learning paradigm is often found in real-world application scenarios, such as automatic image annotation, text classification, and movie genre classification [2]. For example, in image annotation, a picture may be tagged with "streets" and "buildings"; in text classification, an article may include multiple topics, such as "soccer," "sports," and "games"; and in movie genre classification, a movie may simultaneously belong to the genres of "comedy," "drama," and "romance."

A crucial difference between MLL and traditional classification problems is that in MLL, the labels are not entirely independent but have correlations [3, 4]. For example, if an image is known to carry the label "car," it is likely to also carry the label "road," and exploiting such label correlations can significantly improve the generalization performance of the classifier. The primary methods for exploring label correlation in multi-label data include label co-occurrence, Fisher discriminant, and metric-based approaches.

Multi-label datasets (MLDs) are typically large-scale and contain much redundant or noisy data [5], which can negatively impact the classifier's performance. First, computing with large-scale data requires more time and memory. Second, many algorithms are fragile to noise. Therefore, it is necessary to reduce the size of MLDs, especially for instance-based classifiers trained on large datasets. Currently, techniques for obtaining a reduced set of instances are divided into two main strategies: Prototype Selection (PS) [6,7,8] and Prototype Generation (PG) [9]. A prototype of a dataset is usually a representative instance of a class label, described by a statistic such as the centroid or mean. The purpose of PS is to select existing instances from a given dataset to form a subset of prototypes, while PG constructs artificial instances to replace the original set of instances. The primary constraint of the prototype-based approach is that the predictive performance of the reduced set of instances should be at least comparable to that of the original dataset.

PS methods have been extensively studied in traditional classification but are just beginning to emerge in MLL. Most existing methods for multi-label prototype selection are based on problem transformation [10], whose basic idea is to transform the MLD into single-label data; BR-eNN is one such algorithm. However, a significant drawback of BR-eNN is that it ignores the correlation between labels. Although LP-eNN [10] exploits label correlation by converting label sets into label combinations, its classification accuracy remains unsatisfactory.

Motivated by the above discussion, this paper proposes a new algorithm for prototype selection from MLDs called CO-GCNN (multi-label prototype selection with Co-Occurrence and Generalized Condensed Nearest Neighbor). CO-GCNN is a problem transformation-based algorithm that incorporates pairwise label correlation as a constraint in the data transformation process, which allows the selected prototypes to better characterize the original dataset. The prototype selection step is performed by GCNN (Generalized Condensed Nearest Neighbor), which adopts a stronger absorption criterion than CNN (Condensed Nearest Neighbor) and can achieve better accuracy than other instance-based data reduction methods [11]. To evaluate the performance of the proposed method, it is compared with four state-of-the-art algorithms on six benchmark datasets. The experimental results show that (1) better classification performance can be achieved with the reduced data than with the original data, and (2) CO-GCNN outperforms other problem-transformation-based multi-label prototype selection algorithms.

This paper is organized as follows. Section 2 briefly reviews related work on MLL and PS. Section 3 provides the details of the proposed CO-GCNN algorithm. Section 4 reports the experimental setup and results. Finally, Sect. 5 concludes and indicates several issues for future work.

2 Related work

2.1 Multi-label learning

Existing MLL methods can be categorized from two different perspectives. The first is based on the origin of the algorithm design ideology, dividing MLL methods into problem transformation (PT) methods and algorithm adaptation (AA) methods. PT methods transform the MLD into single-label data and use existing learning methods to solve the classification problem; since they transform the data to fit the algorithms, they are also referred to as data transformation (DT) methods in some literature. AA methods, on the other hand, adapt existing single-label classification algorithms so that they can handle multi-label classification problems directly [1,2,3].

The second perspective depends on how the algorithm utilizes label correlation. According to the order of label correlations exploited, methods can be divided into first-order, second-order, and high-order approaches. First-order approaches decompose a multi-label classification problem into a set of binary classification problems, using only the information of a single label, with each label corresponding to one subproblem. Second-order approaches improve the classifier's performance by exploiting relationships between pairs of labels. High-order approaches exploit correlations among three or more labels.

Following these two perspectives, the existing algorithms for MLL can be summarized as follows:

  1. PT-based first-order methods: the Binary Relevance (BR) method is a typical algorithm of this type [1]. It uses a one-versus-all (OVA) strategy to transform the MLL problem into a series of independent binary classification problems, where each class label corresponds to a subproblem. The label-specific features method (LIFT) is also based on the OVA strategy and generates label-specific features for classification by clustering the positive and negative instances of each class label [12].

  2. PT-based second-order methods: a typical representative is Ranking by Pairwise Comparison (RPC), which creates a dataset for each label pair, trains a two-class classification model on each, and then integrates the test results of these models to obtain the final multi-label classification [13]. The Calibrated Label Ranking (CLR) method transforms a label ranking into a collection of relevant and irrelevant labels by introducing a virtual label as a split point for each sample, using the BR method to learn a classification model for each label with the virtual label's predicted value as the threshold [14].

  3. PT-based high-order approaches: the Label Powerset (LP) algorithm transforms the MLD into single-label data by treating each label combination as a new single label (see the sketch after this list), after which any single-label base classifier can perform the classification. This approach implicitly considers label correlation. The Random k-Labelsets (Rakel) algorithm [15] improves on LP by creating multiple random label subsets of the same size k and constructing a single-label classifier for each resulting label combination; the outputs of these classifiers are then aggregated into the final result.

  4. AA-based second-order approaches: the Rank loss-SVM (Rank-SVM) [16] is an improved multi-label classification algorithm based on the classic SVM. It introduces a label ranking loss into the SVM framework, making it suitable for processing MLDs.

  5. AA-based high-order approaches: the Ensembles of Classifier Chains (ECC) [17] improves the classical Classifier Chain (CC) algorithm. When training the classifier for each subsequent label, CC uses the prediction results of all previous labels as additional features; ECC then obtains the final result by ensembling the predictions of multiple classifier chains.

2.2 Prototype selection

Prototype selection, referred to as instance selection in some literature [6, 10], has been widely applied in single-label learning. However, prototype selection for MLDs is still at an early research stage, and existing studies leave considerable room for improvement.

The Multi-Label edited Nearest Neighbor (MLeNN) algorithm [7] is the first PS method applied to MLDs. It is an extension of the single-label PS method eNN [18]. The eNN algorithm traverses all examples in the training data, discarding instances whose class differs from that of their nearest neighbors and retaining the rest as prototypes. The KADT algorithm [8] is also inspired by eNN and uses the Hamming-loss metric to determine relevant and irrelevant instances. Binary Relevance-edited Nearest Neighbor (BR-eNN) [10] is an extension of eNN that uses a PT-based OVA strategy to transform the MLD into single-label data and then employs eNN to select prototypes. Label Powerset-edited Nearest Neighbor (LP-eNN) [10] treats each label combination in the MLD as a new label and then selects prototypes using the eNN algorithm.
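Since eNN underlies several of these methods, the following sketch shows the classic single-label editing rule under the usual reading: an instance is retained only if its class agrees with the majority class of its k nearest neighbors. The Euclidean metric, integer class codes, and function name are our assumptions.

```python
import numpy as np

def enn_select(X, y, k=3):
    """Return indices of instances retained by the eNN editing rule."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    keep = []
    for i in range(len(X)):
        nn = np.argsort(dist[i])[:k]          # k nearest neighbors of x_i
        if y[i] == np.bincount(y[nn]).argmax():
            keep.append(i)                    # consistent with its neighborhood
    return np.array(keep)
```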

The above methods for multi-label prototype selection either ignore label correlation (MLeNN, KADT, and BR-eNN) or perform unsatisfactorily (LP-eNN). In the next section, we introduce the proposed CO-GCNN method.

3 The proposed approach

Formally speaking, let \({\mathbb{R}}^{d}\) denote the \(d\)-dimensional instance space and \({\mathcal{X}} = \left\{ {{\mathbf{x}}_{1} ,...,{\mathbf{x}}_{n} } \right\}\) the set of \(n\) training instances, where each instance \({\mathbf{x}}_{i}\) is represented by a \(d\)-dimensional feature vector \({\mathbf{x}}_{i} = \left[ {f_{i1} ,f_{i2} ,...,f_{id} } \right]^{{ \top }}\). \({\mathcal{Y}} = \left\{ {l_{1} ,l_{2} ,...,l_{q} } \right\}\) denotes the label space of \(q\) possible class labels, and the set of labels associated with instance \({\mathbf{x}}_{i}\) is \(Y_{i} \subseteq {\mathcal{Y}}\). Given the multi-label training dataset \({\mathcal{D}} = \left\{ {({\mathbf{x}}_{i} ,Y_{i} )|1 \le i \le n} \right\}\), the task of MLL is to learn a classifier \(h:{\mathbb{R}}^{d} \to 2^{{\mathcal{Y}}}\) from \({\mathcal{D}}\) that predicts the set of labels \(Y^{*}\) relevant to an unseen instance \({\mathbf{x}}^{*}\). MLL prototype selection aims to select an optimal representative subset \({\mathcal{U}} \subseteq {\mathcal{D}}\), such that a classifier derived from \({\mathcal{U}}\) achieves comparable or better performance than one derived from \({\mathcal{D}}\).
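For concreteness, the code sketches in the remainder of this section assume the following NumPy representation, which is our convention rather than part of the formal definition: X is an (n, d) feature matrix and Y an (n, q) binary indicator matrix with Y[i, k] = 1 iff \(l_{k} \in Y_{i}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q = 8, 5, 4                        # instances, features, labels
X = rng.normal(size=(n, d))              # row i is x_i = [f_i1, ..., f_id]
Y = rng.integers(0, 2, size=(n, q))      # Y[i, k] = 1 iff l_k is in Y_i
```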

Following the usual prototype selection approach based on PT techniques, the original training dataset \({\mathcal{D}}\) is first divided into sets of positive and negative training instances for each class label using the OVA strategy. Specifically, let \(l_{k} \in {\mathcal{Y}}\) be any class label. The set of positive training instances \({\mathcal{P}}_{k}\) and negative training instances \({\mathcal{N}}_{k}\) for the label \(l_{k}\) is formed as follows:

$$ \left\{ {\begin{array}{*{20}c} {{\mathcal{P}}_{k} = \left\{ {{\mathbf{x}}_{i} \left| {\left( {{\mathbf{x}}_{i} ,Y_{i} } \right)} \right. \in {\mathcal{D}},l_{k} \in Y_{i} } \right\}} \\ {{\mathcal{N}}_{k} = \left\{ {{\mathbf{x}}_{i} \left| {\left( {{\mathbf{x}}_{i} ,Y_{i} } \right)} \right. \in {\mathcal{D}},l_{k} \notin Y_{i} } \right\}} \\ \end{array} } \right., $$
(1)

namely, \({\mathcal{P}}_{k}\) is the set of positive instances, i.e., all examples containing the class label \(l_{k}\), and \({\mathcal{N}}_{k}\) is the opposite. However, this simple OVA-based strategy ignores the correlation between labels, since each label is processed independently; selecting prototypes or inducing classifiers from this type of partition may therefore lead to suboptimal results. To make the best use of the correlation information, we improve Eq. (1) by adding correlation constraints. Specifically, let \(l_{u}\) and \(l_{v}\) \((l_{u} ,l_{v} \in {\mathcal{Y}},u \ne v)\) denote any pair of labels, and let \(M \in {\mathbb{R}}^{q \times q}\) be the co-occurrence matrix, which counts how often the labels \(l_{u}\) and \(l_{v}\) co-occur in the label sets \(Y_{i}\) of the training instances. Each element \(m_{u,v}\) of the matrix \(M\) is calculated as follows:

$$ m_{u,v} = \sum\limits_{i = 1}^{n} {\mathbb{I}} (l_{u} \in Y_{i} ){\mathbb{I}}(l_{v} \in Y_{i} ), $$
(2)

where "\({\mathbb{I}}(\pi )\)" denotes the indicator function, for any predicate \(\pi\), if \(\pi\) holds, then \({\mathbb{I}}(\pi ) = 1\), otherwise \({\mathbb{I}}(\pi ) = 0\).

Furthermore, let \(S \in {\mathbb{R}}^{q \times q}\) denote the co-occurrence rate matrix of labels \(l_{u}\) and \(l_{v}\) over the whole label space; each element \(s_{u,v}\) of the matrix \(S\) is calculated as follows:

$$ s_{u,v} = \frac{{m_{u,v} }}{\sigma }, $$
(3)

where \(\sigma = q(q - 1)/2\) is the total number of label pairs in the label space, and \(S\) is a symmetric matrix whose diagonal elements equal 1. Inserting \(s_{u,v}\) as a constraint into Eq. (1) yields a new division of positive and negative training instances that accounts for label correlation:

$$ \left\{ \begin{gathered} {\mathcal{P}}_{{u,v}} = \left\{ {{\mathbf{x}}_{i} \left| {\left( {{\mathbf{x}}_{i} ,Y_{i} } \right) \in {\mathcal{D}},l_{u} \in Y_{i} ,l_{v} \in Y_{i} ,s_{{u,v}} \ge \tau } \right.} \right\} \hfill \\ {\mathcal{N}}_{{u,v}} = \left\{ {{\mathbf{x}}_{i} \left| {\left( {{\mathbf{x}}_{i} ,Y_{i} } \right) \in {\mathcal{D}},l_{u} \in Y_{i} ,l_{v} \in Y_{i} ,s_{{u,v}} < \tau } \right.} \right\} \hfill \\ \end{gathered} \right., $$
(4)

here, \(\tau\) is a pre-defined co-occurrence rate threshold, set to \(\tau = 0.3\) in the experiments as a practical value. Because the division in Eq. (4) exploits pairwise label correlation, CO-GCNN is a second-order method. In the positive instance sets obtained from Eq. (4), each instance carries information about a pair of labels rather than a single label.
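The following sketch implements Eqs. (3) and (4) under the same conventions. Since \(s_{u,v}\) does not depend on the individual instance, all instances co-occurring on a given pair fall wholly into \({\mathcal{P}}_{u,v}\) or \({\mathcal{N}}_{u,v}\); aggregating the pairwise sets into a single positive/negative split for the subsequent GCNN step is our reading of the procedure, not an explicit statement in the text.

```python
import numpy as np

def correlated_partition(Y, tau=0.3):
    """Return (positive, negative) instance indices following Eqs. (3)-(4)."""
    Y = np.asarray(Y, dtype=int)
    n, q = Y.shape
    sigma = q * (q - 1) / 2                  # total number of label pairs
    S = (Y.T @ Y) / sigma                    # co-occurrence rate matrix (Eq. 3)
    pos, neg = set(), set()
    for u in range(q):
        for v in range(u + 1, q):
            both = np.where((Y[:, u] == 1) & (Y[:, v] == 1))[0]
            (pos if S[u, v] >= tau else neg).update(both.tolist())
    return sorted(pos), sorted(neg)
```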

After obtaining the sets of positive and negative training instances that incorporate label correlation, CO-GCNN performs the PS process using the GCNN algorithm. GCNN improves on CNN by replacing the weak absorption criterion with a strong one, thus achieving higher reduction accuracy.

Denote by \({\mathcal{U}}^{0} = \{ {\mathbf{x}}_{0} \}\) the initial set of prototypes, where \({\mathbf{x}}_{0}\) is a randomly chosen instance from \({\mathcal{X}}\) whose label set is \(Y_{0}\). The GCNN and CNN algorithms execute the PS process by traversing all the elements of \({\mathcal{X}}\) and adding to \({\mathcal{U}}\) as prototypes the instances that satisfy the following condition: for the nearest neighbor \(N({\mathbf{x}}_{0} )\) of \({\mathbf{x}}_{0}\) (with label set \(Y_{{N({\mathbf{x}}_{0} )}}\)), if \(Y_{{N({\mathbf{x}}_{0} )}} \ne Y_{0}\), then \(N({\mathbf{x}}_{0} )\) is added to \({\mathcal{U}}^{0}\); otherwise, the instance \(N({\mathbf{x}}_{0} )\) is said to be absorbed. The algorithm terminates in one of two ways: either all elements of \({\mathcal{X}}\) have been scanned or absorbed, or the maximum number of iterations \(c\) is reached. The final selected prototype set is denoted \({\mathcal{U}}^{*}\). For an instance \({\mathbf{x}}_{k}\), the following equation gives the weak absorption criterion used in CNN:

$$ \left\| {{\mathbf{x}}_{k} - {\mathbf{n}}} \right\| - \left\| {{\mathbf{x}}_{k} - {\mathbf{p}}} \right\| > 0 $$
(5)

here, \({\mathbf{n}}\) and \({\mathbf{p}}\) denote the nearest prototypes of a different class and of the same class as \({\mathbf{x}}_{k}\), respectively.

The GCNN replaces the weak absorption criterion in Eq. (5) with the following strong absorption criterion:

$$\left\| {{\mathbf{x}}_{k} - {\mathbf{n}}} \right\| - \left\| {{\mathbf{x}}_{k} - {\mathbf{p}}} \right\| > \rho \delta _{n} ,\rho \in [0,1)$$
(6)

where \(\delta_{n} = \min \left\{ {\left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|:Y_{i} \ne Y_{j} {\text{, and }}{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} \in {\mathcal{X}}} \right\}\), and \(\rho\) is a tradeoff parameter whose value is discussed in detail in Sect. 4.1.4.
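A minimal sketch of the GCNN selection loop with the strong absorption criterion of Eq. (6); the Euclidean metric, the scan order, and the treatment of instances lacking a same-class or different-class prototype are our assumptions.

```python
import numpy as np

def gcnn_select(X, y, rho=0.99, c=100):
    """Return prototype indices; x_k is absorbed when Eq. (6) holds."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # delta_n: distance of the closest unlike-class pair (assumes >= 2 classes)
    delta_n = dist[y[:, None] != y[None, :]].min()
    protos = [0]                                 # U^0: a seed prototype
    for _ in range(c):                           # at most c iterations
        changed = False
        for k in range(n):
            if k in protos:
                continue
            same = [dist[k, j] for j in protos if y[j] == y[k]]
            other = [dist[k, j] for j in protos if y[j] != y[k]]
            # Eq. (6): absorbed iff ||x_k - n|| - ||x_k - p|| > rho * delta_n
            absorbed = (bool(same) and bool(other)
                        and min(other) - min(same) > rho * delta_n)
            if not absorbed:
                protos.append(k)                 # unabsorbed instances join U
                changed = True
        if not changed:                          # everything scanned or absorbed
            break
    return np.array(protos)
```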

Table 1 summarizes the pseudo-code of the complete CO-GCNN procedure. In Table 1, Steps 1 to 5 compute the label co-occurrence rates; Step 6 generates the positive and negative training example sets incorporating label correlation; Steps 8 to 12 generate the prototype set.

Table 1 The pseudo-code of CO-GCNN
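For orientation, the sketches above might compose as follows; this driver is our reconstruction of Table 1's steps, not the authors' released code, and it assumes both the positive and negative sets are non-empty.

```python
import numpy as np

def co_gcnn(X, Y, tau=0.3, rho=0.99):
    """Steps 1-6: correlated partition; Steps 8-12: GCNN condensation."""
    pos, neg = correlated_partition(Y, tau)      # label co-occurrence split
    idx = np.array(pos + neg)
    binary = np.r_[np.ones(len(pos), int), np.zeros(len(neg), int)]
    kept = gcnn_select(X[idx], binary, rho=rho)  # strong absorption criterion
    return idx[kept]                             # prototype indices into D
```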

4 Experimental study

4.1 Configuration

This subsection introduces the experimental datasets, the comparison algorithms, evaluation metrics, and the parameter settings.

4.1.1 Datasets

A total of six open-source real-world benchmark MLDs were used for the experiments. These datasets are extensively used for performance evaluation of multi-label learning algorithms. Detailed characteristics of these datasets are listed in Table 2, including the number of instances, feature dimension, number of labels, label cardinality (the average number of labels per instance), and application domain.

Table 2 Characteristics of the benchmark datasets

4.1.2 Comparison algorithms

After selecting the prototypes, we used ML-kNN (k = 10) as the base classifier to perform the label prediction task [19]. ML-kNN also serves as a comparison baseline to assess the classifier's performance before and after instance reduction. Besides ML-kNN, several other state-of-the-art comparison algorithms were considered in the experiments; their parameter settings are listed below.

Note that the parameters of these comparison algorithms are set according to the recommendations in the respective references.

  • BR-eNN: A binary relevance-based prototype selection algorithm for MLL. This method transforms MLDs using binary relevance and performs the prototype selection process with eNN. Parameter setting: \(k = 5\) (the number of nearest neighbors) [10].

  • MLeNN: A heuristic multi-label undersampling algorithm based on eNN. Parameter settings: \(k = 3\) (the number of nearest neighbors), \(HT = 0.75\) (label set difference threshold) [7].

  • KADT: A prototype selection method for MLDs based on nearest neighbor information and a local evaluation criterion. Parameter setting: \(k = 3\) (the number of nearest neighbors) [8].

  • LP-eNN: A prototype selection method for MLDs based on the eNN algorithm and label powerset. Parameter setting: \(k = 3\) (the number of nearest neighbors) [10].

4.1.3 Evaluation metrics

Performance evaluation of multi-label classification algorithms is more complicated than that of traditional single-label classification algorithms. Because each instance is associated with multiple labels simultaneously, the evaluation measures that apply to single-label classification, such as accuracy, precision, and recall, cannot be used directly. Several criteria can be used to evaluate the performance of MLL algorithms, and a consensus has yet to be reached on which metrics are best. In this work, we select the following five evaluation metrics most commonly used in MLL. For a given test set \({\mathcal{S}} = \left\{ {({\mathbf{x}}_{i} ,Y_{i} )|1 \le i \le p} \right\}\) and classifier \(h( \cdot )\), each metric is defined in Eqs. (7)-(11) and briefly described. Details of these metrics are given in the literature [2, 20].

  • Hamming loss (HL):

    $$ \downarrow HL = \frac{1}{p}\sum\limits_{i = 1}^{p} {\frac{1}{{|{\mathcal{Y}}|}}} \left| {h\left( {{\mathbf{x}}_{i} } \right)\Delta Y_{i} } \right|{,} $$
    (7)

here, \(\Delta\) denotes the symmetric difference between two label sets, and \(| \cdot |\) denotes the cardinality of a set. The Hamming loss evaluates the degree of inconsistency between the predicted and actual labels.

  • One-error (OE):

    $$ \downarrow OE = \frac{1}{p}\sum\limits_{i = 1}^{p} {\left[\kern-0.15em\left[ {\arg \max_{{l_{k} \in {\mathcal{Y}}}} h\left( {{\mathbf{x}}_{i} ,l_{k} } \right) \notin Y_{i} } \right]\kern-0.15em\right]}{,} $$
    (8)

    here, for predicate \(\pi\), \([[\pi ]]\) equals 1 if \(\pi\) holds and 0 otherwise. The one-error counts how often the top-ranked predicted label is not among the actual labels of an instance.

  • Rank-loss (RL):

    $$ \begin{gathered} \downarrow RL = \frac{1}{p}\sum\limits_{i = 1}^{p} {\frac{1}{{\left| {Y_{i} } \right|\left| {\overline{Y}_{i} } \right|}}} \cdot \left| {{\mathcal{R}}_{i} } \right|,\quad {\text{where}} \hfill \\ {\mathcal{R}}_{i} = \left\{ {\left( {l_{1} ,l_{2} } \right)\mid h\left( {{\mathbf{x}}_{i} ,l_{1} } \right) \le h\left( {{\mathbf{x}}_{i} ,l_{2} } \right),\left( {l_{1} ,l_{2} } \right) \in Y_{i} \times \overline{Y}_{i} } \right\}{.} \hfill \\ \end{gathered} $$
    (9)

The RL measures how often irrelevant labels are ranked above relevant labels, where \(\overline{Y}_{i}\) denotes the complement of \(Y_{i}\) in \({\mathcal{Y}}\).

  • Coverage (CV):

    $$ \downarrow CV = \frac{1}{q}\left(\frac{1}{p}\sum\limits_{i = 1}^{p} {\max_{{l_{k} \in Y_{i} }} } {\text{rank}}_{h} \left( {{\mathbf{x}}_{i} ,l_{k} } \right) - 1\right). $$
    (10)

The CV evaluates how many steps are needed, on average, to move down the predicted label sequence to find all the relevant labels of the instances.

  • Accuracy (AC):

    $$ \uparrow AC = \frac{1}{p}\sum\limits_{i = 1}^{p} {\frac{{\left| {Y_{i} \cap h({\mathbf{x}}_{i} )} \right|}}{{\left| {Y_{i} \cup h({\mathbf{x}}_{i} )} \right|}}} {,} $$
    (11)

here, \(| \cdot |\) again denotes the cardinality of a set. The AC measures the average overlap (Jaccard similarity) between the predicted and actual label sets of an instance.

Note that, for each evaluation metric, the up arrow “\(\uparrow\)” means “the greater the better,” and the down arrow “\(\downarrow\)” means “the smaller the better.”
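For reference, below are hedged implementations of Eqs. (7)-(11), assuming Y_true is a (p, q) binary matrix, scores holds the real-valued outputs \(h({\mathbf{x}}_{i} ,l_{k} )\), each instance has at least one relevant and one irrelevant label, and Y_pred is a thresholded binary prediction; these representation choices are ours.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):                # Eq. (7), lower is better
    return float(np.mean(Y_true != Y_pred))

def one_error(Y_true, scores):                   # Eq. (8)
    top = scores.argmax(axis=1)                  # top-ranked label per instance
    return float(np.mean(Y_true[np.arange(len(Y_true)), top] == 0))

def rank_loss(Y_true, scores):                   # Eq. (9)
    per_instance = []
    for yt, sc in zip(Y_true, scores):
        rel, irr = np.where(yt == 1)[0], np.where(yt == 0)[0]
        bad = sum(sc[r] <= sc[j] for r in rel for j in irr)
        per_instance.append(bad / (len(rel) * len(irr)))
    return float(np.mean(per_instance))

def coverage(Y_true, scores):                    # Eq. (10)
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1  # 1-based label ranks
    worst = [r[y == 1].max() for r, y in zip(ranks, Y_true)]
    return float((np.mean(worst) - 1) / Y_true.shape[1])

def accuracy(Y_true, Y_pred):                    # Eq. (11), higher is better
    inter = (Y_true & Y_pred).sum(axis=1)        # assumes integer 0/1 matrices
    union = (Y_true | Y_pred).sum(axis=1)
    return float(np.mean(inter / np.maximum(union, 1)))
```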

4.1.4 Parameter settings

To investigate the effect of the parameter \(\rho\) on performance, we tested five values (\(\rho = 0, 0.25, 0.5, 0.75\), and \(0.99\)) on the six datasets. Note that the strong absorption criterion in Eq. (6) degenerates to the weak criterion in Eq. (5) when \(\rho = 0\). Figure 1 shows the variation of AC and the reduction rate (RR) for the various values of \(\rho\), where RR denotes the proportion of selected prototypes to the original instance set, i.e., \(RR = |{\mathcal{U}}^{*} |/|{\mathcal{X}}|\); both AC and RR are expressed as percentages in Fig. 1.

Fig. 1 The variation of AC and RR for various values of \(\rho\)

The results presented in Fig. 1 lead to the following conclusions:

(1) Overall, as the value of \(\rho\) increases, the accuracy also increases.

(2) As \(\rho\) increases, the criterion for selecting prototypes becomes more lenient, leading to a larger set of retained prototypes.

(3) In the "genbase," "Medical," and "enron" datasets, when \(\rho\) is too small (i.e., \(\rho = 0\) and \(\rho = 0.25\)), the accuracy is too low. The algorithm cannot perform classification well, and this is because the criterion for selecting prototypes is too strict, resulting in too few selected prototypes and causing overfitting. Although the other three datasets also show low accuracy under these two values, the data fluctuation is less significant than in these three datasets. In the "emotions" dataset, the maximum and minimum accuracy difference is only about 1.3%, which indicates that different distributions and scales of datasets have different sensitivities to the parameter \(\rho\).

(4) For the same value of \(\rho\), the size of the prototype set differs across datasets, indicating that the effect of prototype selection is more significant for sparser datasets.

(5) The optimal results for AC on each dataset were obtained when \(\rho = 0.99\).

In subsequent experiments, the value of \(\rho\) was set to 0.99 to achieve optimal accuracy of the CO-GCNN algorithm.

4.2 Experimental results

We repeated the experiments ten times and calculated the mean values. Tables 3, 4, 5, 6, and 7 report the detailed results on the benchmark datasets for the five metrics. The best performance among the algorithms is shown in boldface, and the values in parentheses are rankings.

Table 3 Summary of the results for HL
Table 4 Summary of the results for OE
Table 5 Summary of the results for RL
Table 6 Summary of the results for CV
Table 7 Summary of the results for AC

To show more intuitively how CO-GCNN compares with the other algorithms, Table 8 reports the win/tie/loss counts of CO-GCNN against each comparison algorithm across the metrics.

Table 8 Win/tie/loss counts on the classification performance of CO-GCNN against each comparison algorithm

From the experimental results in Tables 3, 4, 5, 6, 7, and 8, we can draw the following conclusions:

  • No single algorithm outperforms all the others on every metric;

  • CO-GCNN's better results compared with ML-kNN illustrate that inducing the classifier from the reduced dataset improves on using the raw data directly.

  • Compared with the three PS algorithms that ignore label correlation (BR-eNN, MLeNN, and KADT), CO-GCNN achieves a significant lead, showing that considering label correlation in the prototype selection process is beneficial.

  • LP-eNN also considers label correlation, but its performance is unsatisfactory, which indicates that simply combining labels is inappropriate for multi-label prototype selection.

  • The comparison with the four algorithms BR-eNN, MLeNN, LP-eNN, and KADT illustrates that the set of prototypes selected by CO-GCNN is the best among the compared methods.

In addition, Fig. 2 shows the overall average ranking of the comparison algorithms. According to Fig. 2, the performance ranking of these algorithms is: CO-GCNN > ML-kNN > KADT > BR-eNN > MLeNN > LP-eNN.

Fig. 2
figure 2

Overall average ranking for each comparison algorithm

5 Conclusion

This paper proposed CO-GCNN, a novel multi-label prototype selection algorithm based on label correlation and the generalized condensed nearest neighbor. Its main contributions are (1) improving the quality of the selected prototypes by considering label correlation during prototype selection for multi-label data, and (2) using the problem transformation method to enable the GCNN algorithm to handle multi-label data. The experimental results validate that the method effectively reduces the number of instances in multi-label datasets while improving classification performance. In future work, we will continue to develop new multi-label prototype selection algorithms; combining prototype selection with feature selection techniques is also a promising research direction. Furthermore, how to handle class imbalance during prototype selection needs to be investigated.