1 Introduction

Multi-label learning aims to predict multiple labels simultaneously for an instance. It has received considerable attention due to its applications in a wide range of domains, such as image recognition, natural language processing, and complex networks. For instance, multi-label learning for fundus images can effectively improve the accuracy of diagnosis [1]; with increasing information on the internet, multi-label learning for online text can enhance retrieval accuracy [2]; and multiple labels on user profiles can be helpful for individual recommendations and marketing on social networks [3].

Compared with single-label problems, multi-label data often have large feature spaces. The number of features can reach tens of thousands when describing semantics [4,5,6], and some features can be redundant or irrelevant for classification. Moreover, a high-dimensional feature space often harms classification performance. Therefore, many algorithms rely on feature compression techniques to obtain a low-dimensional representation of multi-label data, effectively improving the performance of multi-label classification. Mutual information is widely used for feature compression and enables efficient multi-label classification [7, 8]. However, most feature compression algorithms construct an identical low-dimensional feature space for all labels [9, 10]; in other words, different labels share the same feature representation. In multi-label problems, labels reflect different semantics, so each label has its own distinguishing features, named label-specific features. These features are the most relevant to their corresponding labels and therefore the most appropriate for discriminating them. Zhang and Wu [11] introduced the concept of label-specific features in the LIFT method in 2015. Although there are variants of LIFT, such as LETTER [12], LSDM [13], and LF-LPLC [14], research on label-specific features is ongoing.

In multi-label problems, labels do not occur independently but exhibit dependence: some labels tend to appear together in many instances, while others rarely co-occur. Exploiting label correlation is conducive to learning a more efficient and robust classification model [15]; it is especially important when a label has few positive instances. Making full use of label correlation has therefore become a major research direction in multi-label classification and plays an important role in many algorithms [16,17,18,19]. Given the complexity of real-world label correlation, some correlations are global while others are local. Although most existing algorithms consider either global [20,21,22] or local [15, 23, 24] label correlation for multi-label learning, few take both into account.

As discussed above, label-specific features and label correlation are two important characteristics for multi-label learning. We unify them and propose label-specific features with global and local label correlation (LFGLC), which calculates global and local label correlation from label co-occurrence and neighborhood information, respectively, and adds the correlations to linear regression with the ℓ1-norm to learn label-specific features for each label.

LFGLC makes the following contributions:

  1. LFGLC integrates both global and local label correlation to select label-specific features. To the best of our knowledge, this is the first work that combines the two to make the selected features more discriminative.

  2. The linear regression model learned by LFGLC can be applied simultaneously to multi-label classification and feature selection.

  3. Experiments on multiple data sets of different sizes and domains show that LFGLC outperforms several multi-label classification and feature selection algorithms in terms of both example-based and label-based evaluation metrics.

The rest of this article is organized as follows. Section 2 reviews related work on label correlation and label-specific features for multi-label learning. Section 3 describes LFGLC. Experimental results and analysis are shown in Section 4. Section 5 presents conclusions.

2 Related work

Our work is related to label correlation and label-specific features for multi-label learning. We review algorithms based on each in turn.

2.1 Label correlation

Making full use of label correlation is a major research direction in multi-label classification [16,17,18,19, 25]. Label correlation strategies can be divided into three types [26]: first-order, second-order, and high-order. First-order algorithms ignore label correlation completely, including BR [27], LIFT [11], and MLkNN [28]. Second-order algorithms consider pairwise label correlation, such as CPNL [29], PCT [30], and GBRAML [31]. High-order algorithms consider correlations among more than two labels, such as CC [20], BNCC [32], and MLMF [33].

In terms of the perspective from which label correlation is extracted or used, label correlation based algorithms can be categorized into global, local, and global-local combined relation algorithms.

2.1.1 Global relation algorithm

A global relation algorithm, such as CC [20], CLSF [21] and A-GCN [22], assumes that label correlation is global; in other words, the correlation between labels holds across all training data. For example, CC puts all labels in a random sequence and learns the corresponding binary classifiers in accordance with this sequence, where the outputs of the classifiers for previous labels are added to a label's original feature space as new features. The label correlation is thus constructed on all training data. A-GCN uses a label graph with word embeddings to learn global label correlations.
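
To make the chaining mechanism concrete, the following minimal sketch trains one binary classifier per label along a given order, using scikit-learn's LogisticRegression as an illustrative stand-in for the base learner (the experiments in this paper pair CC with LIBSVM). As in the standard CC implementation, the ground-truth values of earlier labels serve as the extra features during training, and the chained predictions are used at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chain(X, Y, order):
    """Train one binary classifier per label along `order`, augmenting the feature
    space with the values of the labels that come earlier in the chain.
    Assumes every label has both positive and negative training instances."""
    chain, X_aug = [], X.copy()
    for j in order:
        clf = LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j])
        chain.append(clf)
        X_aug = np.hstack([X_aug, Y[:, j].reshape(-1, 1)])  # true label becomes a new feature
    return chain

def predict_chain(chain, order, X):
    """At test time, the predictions of earlier labels play the role of the new features."""
    Y_hat = np.zeros((X.shape[0], len(order)), dtype=int)
    X_aug = X.copy()
    for clf, j in zip(chain, order):
        Y_hat[:, j] = clf.predict(X_aug)
        X_aug = np.hstack([X_aug, Y_hat[:, j].reshape(-1, 1)])
    return Y_hat
```

scikit-learn also provides a ready-made implementation of this scheme in sklearn.multioutput.ClassifierChain.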

2.1.2 Local relation algorithm

In a local relation algorithm, label correlation exists in only part of the training data. For example, "apple" and "fruit" have a strong relation in gourmet magazines, while "apple" and "digital equipment" often occur together in technical journals. Obviously, the dependence relations between labels exist only in some of the data in this case. If such label correlation is extracted or used from the global perspective, unnecessary and even misleading constraints will be imposed on all instances, which degrades the performance of classification models. LPLC [15] considers label correlation locally, finding the positive and negative label correlations of each label over all training instances. Then, for each test instance, the maximum posterior probability is used for prediction based on the local positive and negative label correlations of its k-nearest neighbors. Ma [23] divides the training data into several groups whose instances share different label correlations.

2.1.3 Global-local combined relation algorithm

Global-local combined relation algorithms consider both global and local label correlation to establish a high-efficiency classification model. For example, GLOCAL [34] learns both global and local label correlation by manifold regularization. GLkEL [35] selects the most correlated k-labelsets from label space by approximated joint mutual information to evaluate global label correlation. Then, it clusters the training data into different groups and evaluates the local label correlation in each group.

2.2 Label-specific features

Multi-label algorithms often use an identical feature representation to build classification models for different labels; in other words, different labels share the same feature matrix in the learning process. The LIFT [11] algorithm, however, assumes that each label has its own representation. The concept of label-specific features is thus clearly different from traditional feature compression.

At present, there are two main ways to construct label-specific features: feature extraction and feature selection. The former is represented by LIFT, the latter by LLSF [36].

2.2.1 Feature extraction based label-specific features

LIFT [11] constructs label-specific features for each label through feature extraction. Specifically, for each label, instances associated with that label are viewed as positive instances and the remaining instances as negative. K-means [37] is used to cluster the positive and negative instance sets separately, and the distances from each original instance to the resulting cluster centers form the new features. Binary classifiers are then learned on these label-specific features. Because the distributions of positive and negative instances differ across labels, the reconstructed label-specific features differ as well. Extensive experiments have demonstrated the effectiveness of LIFT, which has inspired a number of subsequent algorithms.
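
A minimal sketch of this LIFT-style mapping for a single label is given below, assuming dense numpy arrays and at least one positive and one negative instance per label; tying the number of clusters to the smaller of the two sets through a ratio parameter follows the usual LIFT setting, but the concrete value 0.1 is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def lift_features(X, y, ratio=0.1):
    """Map X (n x d) to label-specific features for one label y in {0,1}^n (LIFT-style sketch)."""
    pos, neg = X[y == 1], X[y == 0]
    m = max(1, int(np.ceil(ratio * min(len(pos), len(neg)))))  # clusters per set
    centers_pos = KMeans(n_clusters=m, n_init=10, random_state=0).fit(pos).cluster_centers_
    centers_neg = KMeans(n_clusters=m, n_init=10, random_state=0).fit(neg).cluster_centers_
    centers = np.vstack([centers_pos, centers_neg])
    # label-specific representation: distances to all positive and negative cluster centers
    return cdist(X, centers)  # shape (n, 2m)
```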

Based on LIFT, LF-LPLC [14] integrates label-specific features and local pairwise label correlation, expanding the specific features of each label with related features from correlated labels. This enriches the labels' semantic information and alleviates the class-imbalance problem to some extent. LETTER [12] extracts label-specific features at the instance and feature levels: at the instance level, sparse and prototype constraints are used to find more discriminative instance centers; at the feature level, clustering is used to find feature centers from the original features of positive and negative instances. The final label-specific features are composed of the centers extracted at both levels. Related work includes LSDM [13], ELIFT [38], and so on.

2.2.2 Feature selection based label-specific features

The above algorithms all adopt feature transformation to extract label-specific features. In contrast, the LLSF algorithm proposed by Huang [36] learns label-specific features through feature selection. LLSF assumes that each label is related to only some of the original features and expresses this sparsity through linear regression with an ℓ1 constraint. Nonzero regression parameters indicate that the corresponding features are label-specific, while the others are not.
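
Setting the label-correlation term aside, the core mechanism of this family, one sparse linear model per label whose nonzero coefficients mark label-specific features, can be sketched with scikit-learn's Lasso as below. This is not the LLSF optimization itself (LLSF couples the per-label problems through label correlation), and the regularization strength is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_label_specific_features(X, Y, alpha=0.01):
    """Fit one L1-regularized regressor per label; nonzero weights mark label-specific features."""
    d, l = X.shape[1], Y.shape[1]
    W = np.zeros((d, l))
    for j in range(l):
        W[:, j] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, j]).coef_
    # indices of label-specific features for each label
    specific = [np.flatnonzero(W[:, j]) for j in range(l)]
    return W, specific
```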

The objective function of LLSF further assumes that strongly correlated labels share more label-specific features than weakly correlated labels. Since LLSF implements feature selection through linear regression, it can also learn binary classification models based on the selected features. MCUL [39] likewise applies ℓ1-norm regularization to the coefficient matrix to learn sparse label-specific features, so as to handle missing and completely unobserved labels. NSLSF [40] argues that the sparsity assumption does not hold in some applications and proposes a feature selection based approach that translates logical labels into numeric labels to convey more semantic information and embed label correlation; linear regression with an ℓ1 constraint then characterizes the discriminability of label-specific features based on the numeric labels.

3 Proposed algorithm

Given a multi-label data set D = {(xi,yi)|1 ≤ i ≤ n} with n instances, we denote the feature matrix by X = [x1,x2,…,xn]T ∈ Rn×d, where d is the dimension of the features, and let \(Y=[\boldsymbol {y}_{1},\boldsymbol {y}_{2},\ldots ,\boldsymbol {y}_{n}]^{T}\in \{ 0,1\}^{n\times l}\) denote the label matrix, where l is the number of labels. If yij = 1, instance xi belongs to label yj; otherwise yij = 0. LFGLC uses linear regression with the ℓ1-norm to find label-specific features for each label, based on which it transforms multi-label classification into several binary classification problems. To further improve classification performance, both global and local label correlation are taken into account. The label-specific features selected from the original features achieve higher discriminability for classification.

As shown in Fig. 1, the main process of LFGLC consists of three parts: global label correlation calculation, local label correlation calculation, and label-specific feature selection. In the global label correlation calculation, pairwise label correlation is computed from label co-occurrence frequencies. In the local label correlation calculation, label correlation is computed from each instance and its neighbors. In label-specific feature selection, linear regression with an ℓ1-norm penalty on the parameters and constraints derived from the label correlations is employed to select features for each label.

Fig. 1

Global and local label correlation and the label-specific features of each label are represented in different colors. The thickness of the connection between labels indicates the strength of correlation. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

3.1 Label-specific feature selection

As mentioned above, label-specific features are more discriminative and thus support effective classification models for all labels. We use linear regression with the ℓ1-norm to find label-specific features, as introduced in LLSF [36]. The objective function can be formulated as:

$$ \min_{W} L(W)+R(W), $$
(1)

where L(W) is formulated as:

$$ L(W)=\frac{1}{2}{\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W+\boldsymbol{b}-\boldsymbol{y}_{i}\|_{2}, $$
(2)

where W = [w1,w2,…,wl] ∈ Rd×l denotes the coefficient matrix of the linear regression, and b = [b1,b2,…,bl] ∈ Rl denotes the bias. The bias b can be absorbed into the coefficient matrix W by appending a constant feature of value 1 to each instance in X. Then L(W) can be simplified to:

$$ \begin{aligned} L(W)&=\frac{1}{2}{\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W-\boldsymbol{y}_{i}\|_{2}\\ &=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y)), \end{aligned} $$
(3)

where D is a diagonal matrix and its diagonal element dii is formulated as:

$$ d_{ii}=\frac{1}{\|\boldsymbol{x}_{i}W-\boldsymbol{y}_{i}\|_{2}}. $$
(4)

To select features for each label, an ℓ1-norm penalty is imposed on the coefficient matrix W,

$$ R(W)=\|W\|_{1}, $$
(5)

which makes the coefficient matrix W sparse. For each wi = [wi1,wi2,…,wid]T, the value of wij indicates the discriminability of the j-th feature for label yi. If wij = 0, the j-th feature is not helpful for label yi; otherwise it can be regarded as a label-specific feature of label yi.

The above label-specific feature selection does not consider label correlation. Next, we utilize both global and local label correlation to further constrain the coefficients.

3.2 Global label correlation calculation

As introduced in Section 2, labels do not occur independently but exhibit dependence in multi-label problems, and exploiting label correlation can effectively improve the performance of multi-label classifiers. Similar to LLSF [36], we assume that two strongly correlated labels share more label-specific features than weakly correlated labels. The inner product between the corresponding coefficient vectors will then be large when two labels are strongly correlated, and small otherwise. We denote the global label correlation term by GC(W), as shown in (6).

$$ GC(W)=\frac{1}{2}{\sum}_{i=1}^{l}{\sum}_{j=1}^{l}c_{ij}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}=\frac{1}{2}Tr(WCW^{T}), $$
(6)
$$ c_{ij}=\frac{\vert \boldsymbol{y}^{i}{\Delta}\boldsymbol{y}^{j}\vert}{\vert \boldsymbol{y}^{i}\cup\boldsymbol{y}^{j}\vert}, $$
(7)

where cij is the correlation coefficient between labels yi and yj, and y^i and y^j denote the i-th and j-th columns of Y, viewed as the sets of instances carrying the corresponding labels. As can be seen from (7), cij is small when labels yi and yj co-occur frequently, and large otherwise.
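
Because Y is a binary matrix, the whole matrix C can be computed with a few matrix operations; the following numpy sketch of (7) treats each column of Y as the set of instances carrying the corresponding label.

```python
import numpy as np

def global_label_correlation(Y):
    """C[i, j] = |y^i Δ y^j| / |y^i ∪ y^j| computed from the binary label matrix Y (n x l)."""
    Y = Y.astype(float)
    inter = Y.T @ Y                          # |y^i ∩ y^j|
    counts = Y.sum(axis=0)                   # |y^i|
    union = counts[:, None] + counts[None, :] - inter
    sym_diff = union - inter                 # |y^i Δ y^j| = |y^i ∪ y^j| - |y^i ∩ y^j|
    with np.errstate(divide="ignore", invalid="ignore"):
        C = np.where(union > 0, sym_diff / union, 0.0)
    return C
```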

3.3 Local label correlation calculation

As discussed in Section 2, label correlation may also exist only in part of the instances, and a classification model that considers local label correlation can better fit real-world problems. Motivated by previous work [28], an instance tends to have labels similar to those of its neighbors. The proposed algorithm first finds the k-nearest neighbors of each instance by Euclidean distance. Then, within the neighborhood of each instance, the label probabilities are calculated by (8).

$$ \boldsymbol{N}_{i}=\frac{1}{k}{\sum}_{j\in\mathcal{N}(i)}\boldsymbol{y}_{j}, $$
(8)

where Ni denotes the vector of label probabilities in the neighborhood of instance xi, and \(\mathcal{N}(i)\) is the index set of the k-nearest neighbors of xi. Based on (8), the local label correlation term LC(W) can be formulated as:

$$ \begin{aligned} {}LC(W){}&={\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W-\boldsymbol{N}_{i}\|_{2}^{2}\\ &=\|XW-N\|_{F}^{2}. \end{aligned} $$
(9)
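
A sketch of (8) with scikit-learn's NearestNeighbors is given below; whether an instance counts among its own neighbors is not specified above, so excluding it here is an assumption, as is k = 10.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_label_probabilities(X, Y, k=10):
    """N[i] = mean label vector of the k nearest neighbors of x_i, as in (8)."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    _, idx = nn.kneighbors(X)          # idx[:, 0] is the instance itself
    neighbor_idx = idx[:, 1:]          # drop the instance itself
    return Y[neighbor_idx].mean(axis=1)  # shape (n, l)
```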

3.4 Optimization via accelerated proximal gradient

According to the definitions of the above terms, we unify label-specific feature selection with global and local label correlation, and the whole objective function of LFGLC can be written as:

$$ \begin{aligned} F(W)&=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y))\\ &+\frac{\alpha}{2}Tr(WCW^{T})+\frac{\beta}{2}\|XW-N\|_{F}^{2}+\gamma\|W\|_{1}, \end{aligned} $$
(10)

where α, β, and γ are nonnegative parameters that control the contribution of each term. The objective function is a convex optimization problem. To handle the nonsmoothness caused by the ℓ1-norm, the accelerated proximal gradient method is employed. The general accelerated proximal gradient method splits the objective function into the following two parts [43]:

$$ \min_{W} F(W)=f(W)+g(W), $$
(11)

where f(W) and g(W) are convex; f(W) is smooth while g(W) is nonsmooth. Moreover, f(W) has a Lipschitz continuous gradient: ∥∇f(W1) − ∇f(W2)∥ ≤ Lf∥W1 − W2∥, where Lf is the Lipschitz constant. f(W) and g(W) are formulated as:

$$ \begin{aligned} f(W)&=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y))\\ &+\frac{\alpha}{2}Tr(WCW^{T})+\frac{\beta}{2}\|XW-N\|_{F}^{2}, \end{aligned} $$
(12)
$$ g(W)=\gamma\|W\|_{1}. $$
(13)

∇f(W) denotes the gradient of f(W) and can be calculated by:

$$ \nabla f(W) = X^{T}DXW-X^{T}DY+\alpha WC+\beta (X^{T}XW-X^{T}N). $$
(14)

Given W1 and W2, and letting ΔW = W1 − W2, we have

$$ \begin{aligned} &\|\nabla f(W_{1})-\nabla f(W_{2})\|_{F}^{2}\\ &=\|X^{T}DX{\Delta} W+\alpha{\Delta} WC+\beta X^{T}X{\Delta} W\|_{F}^{2}\\ &\leq 3\|X^{T}DX{\Delta} W\|_{F}^{2}+3\|\alpha {\Delta} WC\|_{F}^{2}+3\|\beta X^{T}X{\Delta} W\|_{F}^{2}\\ &\leq 3\|X^{T}DX\|_{2}^{2}\|{\Delta} W\|_{F}^{2}+3\|\alpha C\|_{2}^{2}\|{\Delta} W\|_{F}^{2}\\&+3\|\beta X^{T}X\|_{2}^{2}\|{\Delta} W\|_{F}^{2}\\ &=3(\|X^{T}DX\|_{2}^{2}+\|\alpha C\|_{2}^{2}+\|\beta X^{T}X\|_{2}^{2})\|{\Delta} W\|_{F}^{2}, \end{aligned} $$
(15)

thus, the Lipschitz constant Lf can be calculated as:

$$ L_{f}=\sqrt{3(\|X^{T}DX\|_{2}^{2}+\|\alpha C\|_{2}^{2}+\|\beta X^{T}X\|_{2}^{2})}. $$
(16)

The pseudocode for the optimization of LFGLC is summarized in Algorithm 1; steps 6-12 form the iteration of the accelerated proximal gradient method. Previous work [42] showed that for a sequence bt satisfying \({b_{t}^{2}}-b_{t}\leq b_{t-1}^{2}\), the convergence rate can be improved to \(\mathcal {O}(t^{-2})\) by setting \(W^{(t)}= W_{t}+\frac {b_{t-1}-1}{b_{t}}(W_{t}-W_{t-1})\), where Wt is the coefficient matrix obtained at the t-th iteration. The ℓ1-norm is handled by the soft-thresholding operator S𝜖[w] in each iteration, and the step size 𝜖 of the soft-thresholding operator is set to \(\frac {\gamma }{L_{f}}\) in this optimization process.
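
The following compact sketch puts the optimization together: the reweighting matrix D from (4), the gradient (14), the Lipschitz constant (16), the momentum sequence bt, and the soft-thresholding operator S𝜖 for the ℓ1 term. Zero initialization, the stopping rule, the iteration limit, and the hyperparameter defaults are illustrative assumptions; C and N are the matrices computed as in Sections 3.2 and 3.3.

```python
import numpy as np

def soft_threshold(W, eps):
    """Elementwise soft-thresholding operator S_eps[w] = sign(w) * max(|w| - eps, 0)."""
    return np.sign(W) * np.maximum(np.abs(W) - eps, 0.0)

def lfglc_train(X, Y, C, N, alpha=1.0, beta=1.0, gamma=0.1, max_iter=100, tol=1e-6):
    """Minimize the objective (10) by accelerated proximal gradient (sketch)."""
    d, l = X.shape[1], Y.shape[1]
    W = W_prev = np.zeros((d, l))
    b = b_prev = 1.0
    for _ in range(max_iter):
        # reweighting matrix D from (4); the small constant avoids division by zero
        resid = np.linalg.norm(X @ W - Y, axis=1)
        D = np.diag(1.0 / np.maximum(resid, 1e-12))
        XtDX = X.T @ D @ X
        # Lipschitz constant (16), using spectral norms
        Lf = np.sqrt(3 * (np.linalg.norm(XtDX, 2) ** 2
                          + np.linalg.norm(alpha * C, 2) ** 2
                          + np.linalg.norm(beta * (X.T @ X), 2) ** 2))
        # momentum: search point built from the two latest iterates
        V = W + ((b_prev - 1.0) / b) * (W - W_prev)
        grad = (XtDX @ V - X.T @ D @ Y
                + alpha * V @ C
                + beta * (X.T @ X @ V - X.T @ N))           # gradient (14) at V
        W_new = soft_threshold(V - grad / Lf, gamma / Lf)    # proximal step for gamma*||W||_1
        converged = np.linalg.norm(W_new - W) <= tol * max(1.0, np.linalg.norm(W))
        b_prev, b = b, (1.0 + np.sqrt(1.0 + 4.0 * b ** 2)) / 2.0  # satisfies b_t^2 - b_t = b_{t-1}^2
        W_prev, W = W, W_new
        if converged:
            break
    return W
```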

After learning the coefficient matrix W, the prediction for test data Xt is obtained as sign(St − τ), where St = XtW and τ is the threshold, set to 0.5 in LFGLC. The pseudocode of the test procedure is summarized in Algorithm 2.
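
The test procedure then reduces to thresholding the real-valued scores, as in the following sketch (τ = 0.5 as above):

```python
import numpy as np

def lfglc_predict(X_test, W, tau=0.5):
    """Predict binary labels by thresholding S_t = X_t W at tau (Algorithm 2 sketch).
    If a constant-1 feature was appended during training (Section 3.1), X_test must be
    augmented in the same way before calling this function."""
    S = X_test @ W
    return (S > tau).astype(int)
```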

Algorithm 1 Optimization of LFGLC
Algorithm 2 Test procedure of LFGLC

3.5 Discussion

Note that LFGLC is related to LLSF [36] and JFSC [41]. LLSF uses cosine similarity to calculate global label correlation and adds the correlation among labels to linear regression with the ℓ1-norm to select label-specific features for each label. Based on LLSF, JFSC adds a Fisher discriminant-based regularization term to obtain a large inter-class distance and a small intra-class distance for classification. These algorithms perform competitively in label-specific feature selection and classification, but they neglect local label correlation. Different from them, LFGLC takes both global and local label correlation into consideration: global label correlation is calculated from the label co-occurrence frequency between label pairs, local label correlation is calculated from each instance and its neighbors, and both are added to linear regression with the ℓ1-norm to select more discriminative label-specific features for multi-label classification.

3.6 Complexity analysis

The complexity of the optimization of LFGLC comprises initialization and iteration. In initialization, initializing the coefficient matrix W1 (step 1) has complexity \(\mathcal {O}(nd^{2}+d^{3}+ndl+d^{2}l)\). Step 2 calculates the global label correlation matrix C, which needs \(\mathcal {O}(nl^{2})\). In step 3, calculating N involves finding the k-nearest neighbors of each instance and computing the label probabilities in each neighborhood, which needs \(\mathcal {O}(n^{2}d+nkd)\). In each iteration, step 5 calculates the diagonal matrix D with complexity \(\mathcal {O}(ndl)\), step 6 calculates the Lipschitz constant Lf with complexity \(\mathcal {O}(nd^{2}+d^{3}+l^{3})\), and step 8 calculates the gradient of f(W) with complexity \(\mathcal {O}(nd^{2}+d^{2}l+ndl+dl^{2})\).

4 Experiments

4.1 Data sets

Experiments were conducted on 12 real-world multi-label data sets, described in Table 1, where "Cardinality" is the average number of labels per instance. All these data sets can be obtained from the mulan, lamda, and meka repositories.

Table 1 Experimental data sets

4.2 Evaluation metrics

For a given test data set Dt = {(xi,yi)|1 ≤ i ≤ nt}, the ground-truth label vector of the i-th instance xi is represented as \(\boldsymbol {y}_{i}\in \{ 0,1\}^{l}\), and \(\hat {\boldsymbol {y}}_{i}\in \{ 0,1\}^{l}\) denotes its predicted label vector. Seven multi-label evaluation metrics are used to evaluate LFGLC, divided into two types: example-based and label-based evaluation metrics.

Example-based evaluation metrics:

Accuracy measures Jaccard similarity between ground-truth and predicted labels:

$$ Accuracy=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\boldsymbol{y}_{i}\cup\hat{\boldsymbol{y}_{i}}\vert}. $$
(17)

Precision is the proportion of predicted positive labels that are correct:

$$ Precision=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\hat{\boldsymbol{y}_{i}}\vert}. $$
(18)

Recall is the proportion of ground-truth positive labels that are correctly predicted:

$$ Recall=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\boldsymbol{y}_{i}\vert}. $$
(19)

F1 is the harmonic mean of Precision and Recall, computed per instance and then averaged:

$$ F_{1}=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{2\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}}_{i}\vert}{\vert\boldsymbol{y}_{i}\vert+\vert\hat{\boldsymbol{y}}_{i}\vert}. $$
(20)

Exact-Match evaluates the fraction of instances whose predicted label vector exactly matches the ground truth, where \(\mathbb{I}(\cdot)\) denotes the indicator function:

$$ Exact\text{-}Match=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\mathbb{I}(\boldsymbol{y}_{i}=\hat{\boldsymbol{y}}_{i}). $$
(21)
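
A numpy sketch of the example-based metrics (17)-(21) is given below, assuming binary indicator matrices; the guards against empty label sets are an implementation assumption that the formulas leave implicit.

```python
import numpy as np

def example_based_metrics(Y_true, Y_pred):
    """Accuracy, Precision, Recall, F1 and Exact-Match over binary matrices of shape (n_t, l)."""
    inter = np.sum(Y_true * Y_pred, axis=1).astype(float)
    union = np.sum(np.clip(Y_true + Y_pred, 0, 1), axis=1).astype(float)
    true_sz = Y_true.sum(axis=1).astype(float)
    pred_sz = Y_pred.sum(axis=1).astype(float)
    safe = lambda a, b: np.where(b > 0, a / np.maximum(b, 1), 0.0)  # avoid division by zero
    return {
        "Accuracy":    np.mean(safe(inter, union)),
        "Precision":   np.mean(safe(inter, pred_sz)),
        "Recall":      np.mean(safe(inter, true_sz)),
        "F1":          np.mean(safe(2 * inter, true_sz + pred_sz)),
        "Exact-Match": np.mean(np.all(Y_true == Y_pred, axis=1)),
    }
```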

Label-based evaluation metrics: Two metrics average performance across all labels:

$$ Macro\ F_{1}=\frac{1}{l}{\sum}_{j=1}^{l}F_{1}(TP_{j},FP_{j},TN_{j},FN_{j}), $$
(22)
$$ Micro\ F_{1} = F_{1}\left( {\sum}_{j=1}^{l}TP_{j},{\sum}_{j=1}^{l}FP_{j},{\sum}_{j=1}^{l}TN_{j},{\sum}_{j=1}^{l}FN_{j}\right), $$
(23)

where TPj, FPj, TNj, and FNj are the numbers of true positive, false positive, true negative, and false negative instances with respect to label yj, respectively, and F1(·) computes the F1 score from these counts.
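
The label-based metrics (22) and (23) can likewise be computed from per-label counts, as sketched below; scikit-learn's f1_score with average='macro' or average='micro' computes the same quantities for binary indicator matrices.

```python
import numpy as np

def label_based_f1(Y_true, Y_pred):
    """Macro F1 averages per-label F1 scores; Micro F1 pools the counts over all labels."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0).astype(float)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0).astype(float)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0).astype(float)
    # F1 from counts: 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0
    f1 = lambda tp_, fp_, fn_: np.where(2 * tp_ + fp_ + fn_ > 0,
                                        2 * tp_ / np.maximum(2 * tp_ + fp_ + fn_, 1e-12), 0.0)
    macro = float(np.mean(f1(tp, fp, fn)))
    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))
    return macro, micro
```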

4.3 Comparison methods

We compared the multi-label classification performance of LFGLC with several state-of-the-art algorithms:

BR [27] transforms multi-label classification to several binary classification tasks without considering label correlation, where each binary classifier corresponds to one label.

CC [20] puts all labels in a random sequence, and in accordance with each label, learns the corresponding binary classifier. For each label, the binary classifier outputs of its previous labels are added as new features.

MLkNN [28] finds the k-nearest neighbors of each instance in Euclidean space and predicts each label by maximum a posteriori estimation based on the number of neighbors carrying that label. The parameter k is set to 10.

Lasso [43] uses linear regression with the ℓ1-norm to select features from the original feature space according to nonzero regression coefficients, while neglecting label correlation. Parameter α is searched in {2^{-10}, 2^{-9}, …, 2^{10}}.

LLSF [36] uses cosine similarity to calculate pairwise label correlation, which is added to linear regression with the ℓ1-norm to select label-specific features for each label. Parameters α and β are searched in {2^{-10}, 2^{-9}, …, 2^{10}}, and ρ is searched in {0.1, 1, 10}.

Based on LLSF, JFSC [41] uses a Fisher discriminant-based regularization term to achieve a large inter-class distance and a small intra-class distance for classification. Parameters α, β, and γ are searched in {4^{-5}, 4^{-4}, …, 4^{5}}, and η is searched in {0.1, 1, 10}.

Based on LLSF, NSLSF [40] translates logical labels into numeric labels to convey more semantic information and embed label correlations. Parameters α and β are set the same as for LLSF, and ρ is set to 0.5.

The search ranges of the parameters α, β, γ, and η in LFGLC are the same as for JFSC. The number of nearest neighbors k is set to 10.

BR, MLkNN, and Lasso are first-order algorithms that do not consider label correlation; LLSF, JFSC, and NSLSF are second-order algorithms with global label correlation; and CC can be regarded as a high-order algorithm with global label correlation. Lasso, LLSF, JFSC, and NSLSF are feature selection based label-specific features algorithms. For fair comparison, the parameters of these algorithms are set according to the suggestions in their original papers. LIBSVM [44] with a linear kernel is employed as the base binary classifier for BR and CC, with the parameter C set to 1. Likewise, the threshold of LFGLC is set to 0.5, the same as for the other comparison algorithms. An appropriate threshold could of course be tuned for each algorithm; however, since this paper mainly studies the effect of label correlation and label-specific features on multi-label classification, threshold learning is left for future work.

4.4 Results of multi-label classification

The experiments used 5-fold cross-validation on each data set to evaluate multi-label classification performance. The average results of each algorithm on the 12 data sets under the 7 evaluation metrics are summarized in Tables 2 and 3, with the best result in each row in bold. "↑" after a metric denotes that a larger value indicates better performance. For each evaluation metric, an "Ave.rank" row reports the average rank of each algorithm over all data sets; a smaller rank indicates better performance. To reflect the average ranks more intuitively, the average rank and overall average rank are depicted in Fig. 2. According to the experimental results in Tables 2 and 3, the observations are summarized as follows:

  1. The proposed algorithm clearly outperforms the first-order algorithms (BR, MLkNN, Lasso) on all evaluation metrics, probably because LFGLC considers label correlation in multi-label classification while these first-order algorithms do not. Hence, considering label correlation can effectively improve multi-label classification performance.

  2. The second-order (LLSF, JFSC, NSLSF) and high-order (CC) algorithms consider global label correlation but neglect local label correlation. LFGLC outperforms these algorithms, which indicates the effectiveness of local label correlation for multi-label classification. Compared with the similar algorithms LLSF and JFSC, the proposed algorithm obtains more suitable regression parameters for classification.

  3. LFGLC statistically performs better on Accuracy, Precision, Recall, F1, Exact-Match, Macro F1, and Micro F1 over all data sets, which validates the superiority of the proposed algorithm. It is worth mentioning that Precision and Recall are generally in tension. Because LFGLC considers the possibility that each instance has labels similar to those of its neighbors, more labels tend to appear in its predictions, so it performs particularly well on Recall.

Table 2 Experimental results (mean± std(rank)) of different comparison algorithms on Accuracy, Precision, Recall, F1 and Exact-Match
Table 3 Experimental results (mean± std(rank)) of different comparison algorithms on Macro F1 and Micro F1
Fig. 2

Results of the rank of all algorithms. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

Figure 2 shows the overall average ranks of the algorithms, which can be ordered as LFGLC ≻ NSLSF ≻ JFSC ≻ LLSF ≻ CC ≻ Lasso ≻ BR ≻ MLkNN. In summary, the proposed algorithm performs competitively in multi-label classification against the other comparison algorithms.

To analyze the statistical performance among these algorithms systematically, the Friedman test [45] was conducted. Table 4 summarizes the Friedman statistic FF and the critical value for each evaluation metric. The null hypothesis of equivalent performance among the comparison algorithms is rejected at significance level α = 0.10 for each evaluation metric.

Table 4 Friedman statistic FF and the critical value at the 0.10 significance level for each evaluation metric

To analyze the relative performance among these algorithms, the post-hoc Nemenyi test [45] was conducted with LFGLC as the control algorithm. The performance of the control algorithm and a comparison algorithm is deemed significantly different if their average ranks differ by at least one critical difference, CD = qα√(k(k+1)/(6N)); with k = 8 algorithms and N = 12 data sets, √(k(k+1)/(6N)) = 1 and CD = 2.780 in this paper at α = 0.10. Figure 3 shows the CD diagrams for each evaluation metric. In each subfigure, any comparison algorithm whose average rank is within one CD of that of LFGLC is connected to it; otherwise, its performance is considered significantly different from that of LFGLC.

Fig. 3

Nemenyi test. Algorithms whose performance is not significantly different from that of LFGLC are connected (CD = 2.780 at the 0.10 significance level)


4.5 Results of multi-label feature selection

To evaluate the feature selection performance for multi-label learning, the proposed algorithm was compared with 4 feature selection based label-specific features algorithms (Lasso, LLSF, JFSC, NSLSF) on 5 multi-label data sets (cal500, emotions, medical, image, education), due to space limitations. For each label, features are selected from the original features according to the largest weights in the corresponding regression coefficients.
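
This selection step amounts to ranking features by the magnitude of their regression weights for each label, as in the following sketch (the selection ratio is a parameter of the experiment):

```python
import numpy as np

def select_top_features(W, ratio=0.2):
    """Return, for each label, the indices of the top `ratio` fraction of features by |weight|."""
    d, l = W.shape
    k = max(1, int(round(ratio * d)))
    # sort features by descending absolute weight and keep the first k indices per label
    return [np.argsort(-np.abs(W[:, j]))[:k] for j in range(l)]
```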

In this experiment, 5-fold cross-validation was conducted on each data set and the results were evaluated in terms of Accuracy, F1, Exact-Match, Macro F1, and Micro F1 for all algorithms; because F1 is the harmonic mean of Precision and Recall, we report F1 rather than Precision and Recall separately. The parameters of the algorithms are set as in the multi-label classification experiments. The top {10%, 20%, …, 50%} of the original features are taken as the selected features, and LIBSVM [44] with a linear kernel is employed as the base classifier for all algorithms. Figure 4 displays the average results of the algorithms on each data set, with "ALL" (using all original features without selection) as a baseline. According to the experimental results, the following observations can be made:

  1. All feature selection algorithms generally outperform the baseline "ALL", which indicates that feature selection can improve the performance of multi-label classification to some extent.

  2. Label-specific features learned by LFGLC generally perform better than those of the other comparison algorithms. Specifically, Lasso conducts feature selection without considering label correlation, while LLSF, JFSC, and NSLSF learn label-specific features with only global label correlation. The results indicate that considering label correlation is useful for feature selection, and that considering both global and local label correlation can further improve its performance.

  3. The performance of feature selection shows different trends on different data sets. For most data sets, the best performance is obtained at some intermediate number of selected features, perhaps because selecting too few features misses important ones, while selecting too many introduces useless features that degrade performance.

Fig. 4

Results of the performance of feature selection. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

4.6 Global and local label correlation

To intuitively show the global and local label correlation learned by LFGLC, the global pairwise label correlation matrix and the neighborhood label probabilities learned from the Image data set are depicted in Fig. 5. From Fig. 5(a) we observe that, globally, "mountains" is correlated with "trees" and "sea" is correlated with "sunset". For local label correlation (Fig. 5(b)), we randomly selected 10 neighborhoods, whose label correlations differ: for example, "mountains" is correlated with "sea" in neighborhood 7, while "sea" is correlated with "sunset" in neighborhood 8. This further illustrates the complexity of label correlation.

Fig. 5

Global and local label correlation on Image data set. The label index from small to large represents “desert”, “mountains”, “sea”, “sunset” and “trees”. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

4.7 Parameter sensitivity analysis

The objective function of LFGLC has several terms whose contributions are controlled by the parameters α, β, and γ, so the performance of LFGLC changes with their values. We conducted a parameter sensitivity analysis of LFGLC on the yeast data set. The parameter values corresponding to the best performance are first selected; then one parameter is fixed and the other two are varied in {4^{-5}, 4^{-4}, …, 4^{5}}.

Figure 6 shows the average results of 5-fold cross-validation in terms of Accuracy, F1, Exact-Match, Macro F1, and Micro F1. The best performance on Accuracy, F1, and Micro F1 is mostly obtained at the endpoints of the parameter ranges, whereas for Exact-Match and Macro F1 some intermediate parameter values achieve the best performance. The results show that the performance of LFGLC is sensitive to parameter changes in some intervals, and different metrics generally achieve their best performance at different parameter values. Thus, to obtain the best performance on a given data set, we suggest tuning the parameters on a validation set, although such a search is time-consuming, especially for large-scale data sets.

Fig. 6

Parameter sensitivity analysis of LFGLC. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

5 Conclusion

In this paper, we propose LFGLC, a new label-specific feature selection and multi-label classification algorithm that accounts for the complexity of real-world label correlation. Both global and local label correlation are taken into account to learn more discriminative label-specific features; for each label, label-specific features are selected from the original features according to the nonzero regression coefficients. Experimental results show that combining global and local label correlation is useful for multi-label learning, and the proposed algorithm achieves competitive performance against several algorithms in both multi-label classification and feature selection. Considering that correlations between different labels are not equally strong, in future work we will try to model label correlation more faithfully for multi-label learning.