1 Introduction

Multi-label learning aims to predict multiple labels simultaneously for an instance. It has received considerable attention due to its applications in a wide range of domains, such as image recognition, natural language processing, and complex networks. For instance, multi-label learning for fundus images can effectively improve the accuracy of diagnosis [1]; with increasing information on the internet, multi-label learning for online text can enhance retrieval accuracy [2]; and multiple labels on user profiles can be helpful for individual recommendations and marketing on social networks [3].

Compared with single-label problems, multi-label data often have large feature spaces. The number of features can reach tens of thousands when describing semantics [4,5,6], and some features can be redundant or irrelevant for classification. Moreover, a high-dimensional feature space often harms classification performance. Therefore, many algorithms rely on feature compression techniques to obtain a low-dimensional representation of multi-label data, effectively improving the performance of multi-label classification. Mutual information is widely used for feature compression and enables efficient multi-label classification [7, 8]. However, most feature compression algorithms construct an identical low-dimensional feature space for all labels [9, 10]; in other words, different labels share the same feature representation. In multi-label problems, labels reflect different semantics, so each label has its own distinguishing features, named label-specific features. These features are the most relevant to their corresponding labels and therefore the most appropriate for discriminating them. Zhang and Wu [11] introduced the concept of label-specific features in the LIFT method in 2015. Although there are variants of LIFT, such as LETTER [12], LSDM [13], and LF-LPLC [14], research on label-specific features is ongoing.

In multi-label problems, labels do not occur independently but exhibit dependence: some labels tend to appear together in many instances, while others rarely co-occur. Exploiting label correlation is conducive to learning a more efficient and robust classification model [15]; it is especially important when a label has few positive instances. Making full use of label correlation has therefore become a major research direction in multi-label classification and plays an important role in many algorithms [16,17,18,19]. Given the complexity of real-world label correlation, some correlations are global while others are local. Although most existing algorithms consider either global [20,21,22] or local [15, 23, 24] label correlation for multi-label learning, few take both into account.

As discussed above, label-specific features and label correlation are two important characteristics for multi-label learning. We unify them and propose label-specific features with global and local label correlation (LFGLC), which calculates global and local label correlation from label co-occurrence and neighborhood information, respectively, and adds the correlations to linear regression with the ℓ1-norm to learn label-specific features for each label.

LFGLC makes the following contributions:

  1. LFGLC integrates both global and local label correlation to select label-specific features. To the best of our knowledge, this is the first work that combines the two to make the selected features more discriminative.

  2. The linear regression model learned by LFGLC can be applied simultaneously to multi-label classification and feature selection.

  3. Experiments on multiple data sets of different sizes and domains show that LFGLC outperforms several multi-label classification and feature selection algorithms in terms of both example-based and label-based evaluation metrics.

The rest of this article is organized as follows. Section 2 reviews related work on label correlation and label-specific features for multi-label learning. Section 3 describes LFGLC. Experimental results and analysis are shown in Section 4. Section 5 presents conclusions.

2 Related work

Our work is related to label correlation and label-specific features for multi-label learning. We review algorithms based on each in turn.

2.1 Label correlation

Making full use of label correlation is a major research direction in multi-label classification [16,17,18,19, 25]. Label correlation strategies can be divided into three types [26]: first-order, second-order, and high-order. First-order algorithms ignore label correlation completely, including BR [27], LIFT [11], and MLkNN [28]. Second-order algorithms consider pairwise label correlation, such as CPNL [29], PCT [30], and GBRAML [31]. High-order algorithms consider correlations among more than two labels, such as CC [20], BNCC [32], and MLMF [33].

In terms of the perspective from which label correlation is extracted or used, label correlation based algorithms can be categorized into global, local, and global-local combined relation algorithms.

2.1.1 Global relation algorithm

A global relation algorithm, such as CC [20], CLSF [21] and A-GCN [22], assumes that label correlation is global; in other words, the correlation between labels holds across all training data. For example, CC puts all labels in a random sequence and learns the corresponding binary classifiers in accordance with this sequence, where the outputs of the classifiers for previous labels are added to a label's original feature space as new features. The label correlation is thus constructed on all training data. A-GCN uses a label graph with word embeddings to learn global label correlations.
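
To make the chaining mechanism concrete, the following minimal sketch trains one binary classifier per label along a given order, using scikit-learn's LogisticRegression as an illustrative stand-in for the base learner (the experiments in this paper pair CC with LIBSVM). As in the standard CC implementation, the ground-truth values of earlier labels serve as the extra features during training, and the chained predictions are used at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chain(X, Y, order):
    """Train one binary classifier per label along `order`, augmenting the feature
    space with the values of the labels that come earlier in the chain.
    Assumes every label has both positive and negative training instances."""
    chain, X_aug = [], X.copy()
    for j in order:
        clf = LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j])
        chain.append(clf)
        X_aug = np.hstack([X_aug, Y[:, j].reshape(-1, 1)])  # true label becomes a new feature
    return chain

def predict_chain(chain, order, X):
    """At test time, the predictions of earlier labels play the role of the new features."""
    Y_hat = np.zeros((X.shape[0], len(order)), dtype=int)
    X_aug = X.copy()
    for clf, j in zip(chain, order):
        Y_hat[:, j] = clf.predict(X_aug)
        X_aug = np.hstack([X_aug, Y_hat[:, j].reshape(-1, 1)])
    return Y_hat
```

scikit-learn also provides a ready-made implementation of this scheme in sklearn.multioutput.ClassifierChain.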

2.1.2 Local relation algorithm

In a local relation algorithm, label correlation exists in only part of the training data. For example, "apple" and "fruit" have a strong relation in gourmet magazines, while "apple" and "digital equipment" often occur together in technical journals. Obviously, the dependence relations between labels exist only in some of the data in this case. If such label correlation is extracted or used from the global perspective, unnecessary and even misleading constraints will be imposed on all instances, which degrades the performance of classification models. LPLC [15] considers label correlation locally, finding the positive and negative label correlations of each label over all training instances. Then, for each test instance, the maximum posterior probability is used for prediction based on the local positive and negative label correlations of its k-nearest neighbors. Ma [23] divides the training data into several groups whose instances share different label correlations.

2.1.3 Global-local combined relation algorithm

Global-local combined relation algorithms consider both global and local label correlation to establish a high-efficiency classification model. For example, GLOCAL [34] learns both global and local label correlation by manifold regularization. GLkEL [35] selects the most correlated k-labelsets from label space by approximated joint mutual information to evaluate global label correlation. Then, it clusters the training data into different groups and evaluates the local label correlation in each group.

2.2 Label-specific features

Multi-label algorithms often use an identical feature representation to build classification models for different labels; in other words, different labels share the same feature matrix in the learning process. The LIFT [11] algorithm, however, assumes that each label has its own representation. The concept of label-specific features is thus clearly different from traditional feature compression.

At present, there are two main ways to construct label-specific features: feature extraction and feature selection. The former is represented by LIFT, the latter by LLSF [36].

2.2.1 Feature extraction based label-specific features

LIFT [11] constructs label-specific features for each label through feature extraction. Specifically, for each label, instances associated with that label are viewed as positive instances and the remaining instances as negative. K-means [37] is used to cluster the positive and negative instance sets separately, and the distances from each original instance to the resulting cluster centers form the new features. Binary classifiers are then learned on these label-specific features. Because the distributions of positive and negative instances differ across labels, the reconstructed label-specific features differ as well. Extensive experiments have demonstrated the effectiveness of LIFT, which has inspired a number of subsequent algorithms.
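
A minimal sketch of this LIFT-style mapping for a single label is given below, assuming dense numpy arrays and at least one positive and one negative instance per label; tying the number of clusters to the smaller of the two sets through a ratio parameter follows the usual LIFT setting, but the concrete value 0.1 is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def lift_features(X, y, ratio=0.1):
    """Map X (n x d) to label-specific features for one label y in {0,1}^n (LIFT-style sketch)."""
    pos, neg = X[y == 1], X[y == 0]
    m = max(1, int(np.ceil(ratio * min(len(pos), len(neg)))))  # clusters per set
    centers_pos = KMeans(n_clusters=m, n_init=10, random_state=0).fit(pos).cluster_centers_
    centers_neg = KMeans(n_clusters=m, n_init=10, random_state=0).fit(neg).cluster_centers_
    centers = np.vstack([centers_pos, centers_neg])
    # label-specific representation: distances to all positive and negative cluster centers
    return cdist(X, centers)  # shape (n, 2m)
```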

Based on LIFT, LF-LPLC [14] integrates label-specific features and local pairwise label correlation, expanding the specific features of each label with related features from correlated labels. This enriches the labels' semantic information and alleviates the class-imbalance problem to some extent. LETTER [12] extracts label-specific features at the instance and feature levels: at the instance level, sparse and prototype constraints are used to find more discriminative instance centers; at the feature level, clustering is used to find feature centers from the original features of positive and negative instances. The final label-specific features are composed of the centers extracted at both levels. Related work includes LSDM [13], ELIFT [38], and so on.

2.2.2 Feature selection based label-specific features

The above algorithms all adopt feature transformation to extract label-specific features. In contrast, the LLSF algorithm proposed by Huang [36] learns label-specific features through feature selection. LLSF assumes that each label is related to only some of the original features and expresses this sparsity through linear regression with an ℓ1 constraint. Nonzero regression parameters indicate that the corresponding features are label-specific, while the others are not.
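
Setting the label-correlation term aside, the core mechanism of this family, one sparse linear model per label whose nonzero coefficients mark label-specific features, can be sketched with scikit-learn's Lasso as below. This is not the LLSF optimization itself (LLSF couples the per-label problems through label correlation), and the regularization strength is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_label_specific_features(X, Y, alpha=0.01):
    """Fit one L1-regularized regressor per label; nonzero weights mark label-specific features."""
    d, l = X.shape[1], Y.shape[1]
    W = np.zeros((d, l))
    for j in range(l):
        W[:, j] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, j]).coef_
    # indices of label-specific features for each label
    specific = [np.flatnonzero(W[:, j]) for j in range(l)]
    return W, specific
```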

The objective function of LLSF further assumes that strongly correlated labels share more label-specific features than weakly correlated labels. Since LLSF implements feature selection through linear regression, it can also learn binary classification models based on the selected features. MCUL [39] likewise applies ℓ1-norm regularization to the coefficient matrix to learn sparse label-specific features, so as to handle missing and completely unobserved labels. NSLSF [40] argues that the sparsity assumption does not hold in some applications and proposes a feature selection based approach that translates logical labels into numeric labels to convey more semantic information and embed label correlation; linear regression with an ℓ1 constraint then characterizes the discriminability of label-specific features based on the numeric labels.

3 Proposed algorithm

Given a multi-label data set D = {(xi,yi)|1 ≤ i ≤ n} with n instances, we denote the feature matrix by X = [x1,x2,…,xn]T ∈ Rn×d, where d is the dimension of the features, and let \(Y=[\boldsymbol {y}_{1},\boldsymbol {y}_{2},\ldots ,\boldsymbol {y}_{n}]^{T}\in \{ 0,1\}^{n\times l}\) denote the label matrix, where l is the number of labels. If yij = 1, instance xi belongs to label yj; otherwise yij = 0. LFGLC uses linear regression with the ℓ1-norm to find label-specific features for each label, based on which it transforms multi-label classification into several binary classification problems. To further improve classification performance, both global and local label correlation are taken into account. The label-specific features selected from the original features achieve higher discriminability for classification.

As shown in Fig. 1, the main process of LFGLC consists of three parts: global label correlation calculation, local label correlation calculation, and label-specific feature selection. In the global label correlation calculation, pairwise label correlation is computed from label co-occurrence frequencies. In the local label correlation calculation, label correlation is computed from each instance and its neighbors. In label-specific feature selection, linear regression with an ℓ1-norm penalty on the parameters and constraints derived from the label correlations is employed to select features for each label.

Fig. 1

Global and local label correlation and the label-specific features of each label are represented in different colors. The thickness of the connection between labels indicates the strength of correlation. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

3.1 Label-specific feature selection

As mentioned above, label-specific features are more discriminative and thus support effective classification models for all labels. We use linear regression with the ℓ1-norm to find label-specific features, as introduced in LLSF [36]. The objective function can be formulated as:

$$ \min_{W} L(W)+R(W), $$
(1)

where L(W) is formulated as:

$$ L(W)=\frac{1}{2}{\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W+\boldsymbol{b}-\boldsymbol{y}_{i}\|_{2}, $$
(2)

where W = [w1,w2,…,wl] ∈ Rd×l denotes the coefficient matrix of the linear regression, and b = [b1,b2,…,bl] ∈ Rl denotes the bias. The bias b can be absorbed into the coefficient matrix W by appending a constant feature of value 1 to each instance in X. Then L(W) can be simplified to:

$$ \begin{aligned} L(W)&=\frac{1}{2}{\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W-\boldsymbol{y}_{i}\|_{2}\\ &=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y)), \end{aligned} $$
(3)

where D is a diagonal matrix and its diagonal element dii is formulated as:

$$ d_{ii}=\frac{1}{\|\boldsymbol{x}_{i}W-\boldsymbol{y}_{i}\|_{2}}. $$
(4)

To select features for each label, an ℓ1-norm penalty is imposed on the coefficient matrix W,

$$ R(W)=\|W\|_{1}, $$
(5)

which makes the coefficient matrix W sparse. For each wi = [wi1,wi2,…,wid]T, the value of wij indicates the discriminability of the j-th feature for label yi. If wij = 0, the j-th feature is not helpful for label yi; otherwise it can be regarded as a label-specific feature of label yi.

The above label-specific feature selection does not consider label correlation. Next, we utilize both global and local label correlation to further constrain the coefficients.

3.2 Global label correlation calculation

As introduced in Section 2, labels do not occur independently but exhibit dependence in multi-label problems, and exploiting label correlation can effectively improve the performance of multi-label classifiers. Similar to LLSF [36], we assume that two strongly correlated labels share more label-specific features than weakly correlated labels. The inner product between the corresponding coefficient vectors will then be large when two labels are strongly correlated, and small otherwise. We denote the global label correlation term by GC(W), as shown in (6).

$$ GC(W)=\frac{1}{2}{\sum}_{i=1}^{l}{\sum}_{j=1}^{l}c_{ij}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}=\frac{1}{2}Tr(WCW^{T}), $$
(6)
$$ c_{ij}=\frac{\vert \boldsymbol{y}^{i}{\Delta}\boldsymbol{y}^{j}\vert}{\vert \boldsymbol{y}^{i}\cup\boldsymbol{y}^{j}\vert}, $$
(7)

where cij is the correlation coefficient between labels yi and yj, and y^i and y^j denote the i-th and j-th columns of Y, viewed as the sets of instances carrying the corresponding labels. As can be seen from (7), cij is small when labels yi and yj co-occur frequently, and large otherwise.
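
Because Y is a binary matrix, the whole matrix C can be computed with a few matrix operations; the following numpy sketch of (7) treats each column of Y as the set of instances carrying the corresponding label.

```python
import numpy as np

def global_label_correlation(Y):
    """C[i, j] = |y^i Δ y^j| / |y^i ∪ y^j| computed from the binary label matrix Y (n x l)."""
    Y = Y.astype(float)
    inter = Y.T @ Y                          # |y^i ∩ y^j|
    counts = Y.sum(axis=0)                   # |y^i|
    union = counts[:, None] + counts[None, :] - inter
    sym_diff = union - inter                 # |y^i Δ y^j| = |y^i ∪ y^j| - |y^i ∩ y^j|
    with np.errstate(divide="ignore", invalid="ignore"):
        C = np.where(union > 0, sym_diff / union, 0.0)
    return C
```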

3.3 Local label correlation calculation

As discussed in Section 2, label correlation may also exist only in part of the instances, and a classification model that considers local label correlation can better fit real-world problems. Motivated by previous work [28], an instance tends to have labels similar to those of its neighbors. The proposed algorithm first finds the k-nearest neighbors of each instance by Euclidean distance. Then, within the neighborhood of each instance, the label probabilities are calculated by (8).

$$ \boldsymbol{N}_{i}=\frac{1}{k}{\sum}_{j\in\mathcal{N}(i)}\boldsymbol{y}_{j}, $$
(8)

where Ni denotes the vector of label probabilities in the neighborhood of instance xi, and \(\mathcal{N}(i)\) is the index set of the k-nearest neighbors of xi. Based on (8), the local label correlation term LC(W) can be formulated as:

$$ \begin{aligned} {}LC(W){}&={\sum}_{i=1}^{n}\|\boldsymbol{x}_{i}W-\boldsymbol{N}_{i}\|_{2}^{2}\\ &=\|XW-N\|_{F}^{2}. \end{aligned} $$
(9)
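
A sketch of (8) with scikit-learn's NearestNeighbors is given below; whether an instance counts among its own neighbors is not specified above, so excluding it here is an assumption, as is k = 10.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_label_probabilities(X, Y, k=10):
    """N[i] = mean label vector of the k nearest neighbors of x_i, as in (8)."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    _, idx = nn.kneighbors(X)          # idx[:, 0] is the instance itself
    neighbor_idx = idx[:, 1:]          # drop the instance itself
    return Y[neighbor_idx].mean(axis=1)  # shape (n, l)
```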

3.4 Optimization via accelerated proximal gradient

According to the definitions of the above terms, we unify label-specific feature selection with global and local label correlation, and the whole objective function of LFGLC can be written as:

$$ \begin{aligned} F(W)&=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y))\\ &+\frac{\alpha}{2}Tr(WCW^{T})+\frac{\beta}{2}\|XW-N\|_{F}^{2}+\gamma\|W\|_{1}, \end{aligned} $$
(10)

where α, β, and γ are nonnegative parameters that control the contribution of each term. The objective function is a convex optimization problem. To handle the nonsmoothness caused by the ℓ1-norm, the accelerated proximal gradient method is employed. The general accelerated proximal gradient method splits the objective function into the following two parts [43]:

$$ \min_{W} F(W)=f(W)+g(W), $$
(11)

where f(W) and g(W) are convex; f(W) is smooth while g(W) is nonsmooth. Moreover, f(W) has a Lipschitz continuous gradient: ∥∇f(W1) − ∇f(W2)∥ ≤ Lf∥W1 − W2∥, where Lf is the Lipschitz constant. f(W) and g(W) are formulated as:

$$ \begin{aligned} f(W)&=\frac{1}{2}Tr((XW-Y)^{T}D(XW-Y))\\ &+\frac{\alpha}{2}Tr(WCW^{T})+\frac{\beta}{2}\|XW-N\|_{F}^{2}, \end{aligned} $$
(12)
$$ g(W)=\gamma\|W\|_{1}. $$
(13)

∇f(W) denotes the gradient of f(W) and can be calculated by:

$$ \nabla f(W) = X^{T}DXW-X^{T}DY+\alpha WC+\beta (X^{T}XW-X^{T}N). $$
(14)

Given W1 and W2, and letting ΔW = W1 − W2, we have

$$ \begin{aligned} &\|\nabla f(W_{1})-\nabla f(W_{2})\|_{F}^{2}\\ &=\|X^{T}DX{\Delta} W+\alpha{\Delta} WC+\beta X^{T}X{\Delta} W\|_{F}^{2}\\ &\leq 3\|X^{T}DX{\Delta} W\|_{F}^{2}+3\|\alpha {\Delta} WC\|_{F}^{2}+3\|\beta X^{T}X{\Delta} W\|_{F}^{2}\\ &\leq 3\|X^{T}DX\|_{2}^{2}\|{\Delta} W\|_{F}^{2}+3\|\alpha C\|_{2}^{2}\|{\Delta} W\|_{F}^{2}\\&+3\|\beta X^{T}X\|_{2}^{2}\|{\Delta} W\|_{F}^{2}\\ &=3(\|X^{T}DX\|_{2}^{2}+\|\alpha C\|_{2}^{2}+\|\beta X^{T}X\|_{2}^{2})\|{\Delta} W\|_{F}^{2}, \end{aligned} $$
(15)

thus, the Lipschitz constant Lf can be calculated as:

$$ L_{f}=\sqrt{3(\|X^{T}DX\|_{2}^{2}+\|\alpha C\|_{2}^{2}+\|\beta X^{T}X\|_{2}^{2})}. $$
(16)

The pseudocode for the optimization of LFGLC is summarized in Algorithm 1; steps 6-12 form the iteration of the accelerated proximal gradient method. Previous work [42] showed that for a sequence bt satisfying \({b_{t}^{2}}-b_{t}\leq b_{t-1}^{2}\), the convergence rate can be improved to \(\mathcal {O}(t^{-2})\) by setting \(W^{(t)}= W_{t}+\frac {b_{t-1}-1}{b_{t}}(W_{t}-W_{t-1})\), where Wt is the coefficient matrix obtained at the t-th iteration. The ℓ1-norm is handled by the soft-thresholding operator S𝜖[w] in each iteration, and the step size 𝜖 of the soft-thresholding operator is set to \(\frac {\gamma }{L_{f}}\) in this optimization process.
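
The following compact sketch puts the optimization together: the reweighting matrix D from (4), the gradient (14), the Lipschitz constant (16), the momentum sequence bt, and the soft-thresholding operator S𝜖 for the ℓ1 term. Zero initialization, the stopping rule, the iteration limit, and the hyperparameter defaults are illustrative assumptions; C and N are the matrices computed as in Sections 3.2 and 3.3.

```python
import numpy as np

def soft_threshold(W, eps):
    """Elementwise soft-thresholding operator S_eps[w] = sign(w) * max(|w| - eps, 0)."""
    return np.sign(W) * np.maximum(np.abs(W) - eps, 0.0)

def lfglc_train(X, Y, C, N, alpha=1.0, beta=1.0, gamma=0.1, max_iter=100, tol=1e-6):
    """Minimize the objective (10) by accelerated proximal gradient (sketch)."""
    d, l = X.shape[1], Y.shape[1]
    W = W_prev = np.zeros((d, l))
    b = b_prev = 1.0
    for _ in range(max_iter):
        # reweighting matrix D from (4); the small constant avoids division by zero
        resid = np.linalg.norm(X @ W - Y, axis=1)
        D = np.diag(1.0 / np.maximum(resid, 1e-12))
        XtDX = X.T @ D @ X
        # Lipschitz constant (16), using spectral norms
        Lf = np.sqrt(3 * (np.linalg.norm(XtDX, 2) ** 2
                          + np.linalg.norm(alpha * C, 2) ** 2
                          + np.linalg.norm(beta * (X.T @ X), 2) ** 2))
        # momentum: search point built from the two latest iterates
        V = W + ((b_prev - 1.0) / b) * (W - W_prev)
        grad = (XtDX @ V - X.T @ D @ Y
                + alpha * V @ C
                + beta * (X.T @ X @ V - X.T @ N))           # gradient (14) at V
        W_new = soft_threshold(V - grad / Lf, gamma / Lf)    # proximal step for gamma*||W||_1
        converged = np.linalg.norm(W_new - W) <= tol * max(1.0, np.linalg.norm(W))
        b_prev, b = b, (1.0 + np.sqrt(1.0 + 4.0 * b ** 2)) / 2.0  # satisfies b_t^2 - b_t = b_{t-1}^2
        W_prev, W = W, W_new
        if converged:
            break
    return W
```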

After learning the coefficient matrix W, the prediction for test data Xt is obtained as sign(St − τ), where St = XtW and τ is the threshold, set to 0.5 in LFGLC. The pseudocode of the test procedure is summarized in Algorithm 2.
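
The test procedure then reduces to thresholding the real-valued scores, as in the following sketch (τ = 0.5 as above):

```python
import numpy as np

def lfglc_predict(X_test, W, tau=0.5):
    """Predict binary labels by thresholding S_t = X_t W at tau (Algorithm 2 sketch).
    If a constant-1 feature was appended during training (Section 3.1), X_test must be
    augmented in the same way before calling this function."""
    S = X_test @ W
    return (S > tau).astype(int)
```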

Algorithm 1 Optimization of LFGLC
Algorithm 2 Test procedure of LFGLC

3.5 Discussion

Note that LFGLC is related to LLSF [36] and JFSC [41]. LLSF uses cosine similarity to calculate global label correlation and adds the correlation among labels to linear regression with the ℓ1-norm to select label-specific features for each label. Based on LLSF, JFSC adds a Fisher discriminant-based regularization term to obtain a large inter-class distance and a small intra-class distance for classification. These algorithms perform competitively in label-specific feature selection and classification, but they neglect local label correlation. Different from them, LFGLC takes both global and local label correlation into consideration: global label correlation is calculated from the label co-occurrence frequency between label pairs, local label correlation is calculated from each instance and its neighbors, and both are added to linear regression with the ℓ1-norm to select more discriminative label-specific features for multi-label classification.

3.6 Complexity analysis

The complexity of the optimization of LFGLC comprises initialization and iteration. In initialization, initializing the coefficient matrix W1 (step 1) has complexity \(\mathcal {O}(nd^{2}+d^{3}+ndl+d^{2}l)\). Step 2 calculates the global label correlation matrix C, which needs \(\mathcal {O}(nl^{2})\). In step 3, calculating N involves finding the k-nearest neighbors of each instance and computing the label probabilities in each neighborhood, which needs \(\mathcal {O}(n^{2}d+nkd)\). In each iteration, step 5 calculates the diagonal matrix D with complexity \(\mathcal {O}(ndl)\), step 6 calculates the Lipschitz constant Lf with complexity \(\mathcal {O}(nd^{2}+d^{3}+l^{3})\), and step 8 calculates the gradient of f(W) with complexity \(\mathcal {O}(nd^{2}+d^{2}l+ndl+dl^{2})\).

4 Experiments

4.1 Data sets

Experiments were conducted on 12 real-world multi-label data sets, described in Table 1, where "Cardinality" is the average number of labels per instance. All these data sets can be obtained from the mulan, lamda, and meka repositories.

Table 1 Experimental data sets

4.2 Evaluation metrics

For a given test data set Dt = {(xi,yi)|1 ≤ i ≤ nt}, the ground-truth label vector of the i-th instance xi is represented as \(\boldsymbol {y}_{i}\in \{ 0,1\}^{l}\), and \(\hat {\boldsymbol {y}}_{i}\in \{ 0,1\}^{l}\) denotes its predicted label vector. Seven multi-label evaluation metrics are used to evaluate LFGLC, divided into two types: example-based and label-based evaluation metrics.

Example-based evaluation metrics:

Accuracy measures Jaccard similarity between ground-truth and predicted labels:

$$ Accuracy=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\boldsymbol{y}_{i}\cup\hat{\boldsymbol{y}_{i}}\vert}. $$
(17)

Precision is the proportion of predicted positive labels that are correct:

$$ Precision=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\hat{\boldsymbol{y}_{i}}\vert}. $$
(18)

Recall is the proportion of ground-truth positive labels that are correctly predicted:

$$ Recall=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}_{i}}\vert}{\vert\boldsymbol{y}_{i}\vert}. $$
(19)

F1 is the harmonic mean of Precision and Recall, computed per instance and then averaged:

$$ F_{1}=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\frac{2\vert\boldsymbol{y}_{i}\cap\hat{\boldsymbol{y}}_{i}\vert}{\vert\boldsymbol{y}_{i}\vert+\vert\hat{\boldsymbol{y}}_{i}\vert}. $$
(20)

Exact-Match evaluates the fraction of instances whose predicted label vector exactly matches the ground truth, where \(\mathbb{I}(\cdot)\) denotes the indicator function:

$$ Exact\text{-}Match=\frac{1}{n_{t}}{\sum}_{i=1}^{n_{t}}\mathbb{I}(\boldsymbol{y}_{i}=\hat{\boldsymbol{y}}_{i}). $$
(21)
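
A numpy sketch of the example-based metrics (17)-(21) is given below, assuming binary indicator matrices; the guards against empty label sets are an implementation assumption that the formulas leave implicit.

```python
import numpy as np

def example_based_metrics(Y_true, Y_pred):
    """Accuracy, Precision, Recall, F1 and Exact-Match over binary matrices of shape (n_t, l)."""
    inter = np.sum(Y_true * Y_pred, axis=1).astype(float)
    union = np.sum(np.clip(Y_true + Y_pred, 0, 1), axis=1).astype(float)
    true_sz = Y_true.sum(axis=1).astype(float)
    pred_sz = Y_pred.sum(axis=1).astype(float)
    safe = lambda a, b: np.where(b > 0, a / np.maximum(b, 1), 0.0)  # avoid division by zero
    return {
        "Accuracy":    np.mean(safe(inter, union)),
        "Precision":   np.mean(safe(inter, pred_sz)),
        "Recall":      np.mean(safe(inter, true_sz)),
        "F1":          np.mean(safe(2 * inter, true_sz + pred_sz)),
        "Exact-Match": np.mean(np.all(Y_true == Y_pred, axis=1)),
    }
```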

Label-based evaluation metrics: Two metrics average performance across all labels:

$$ Macro\ F_{1}=\frac{1}{l}{\sum}_{j=1}^{l}F_{1}(TP_{j},FP_{j},TN_{j},FN_{j}), $$
(22)
$$ Micro\ F_{1} = F_{1}\left( {\sum}_{j=1}^{l}TP_{j},{\sum}_{j=1}^{l}FP_{j},{\sum}_{j=1}^{l}TN_{j},{\sum}_{j=1}^{l}FN_{j}\right), $$
(23)

where TPj, FPj, TNj, and FNj are the numbers of true positive, false positive, true negative, and false negative instances with respect to label yj, respectively, and F1(·) computes the F1 score from these counts.
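
The label-based metrics (22) and (23) can likewise be computed from per-label counts, as sketched below; scikit-learn's f1_score with average='macro' or average='micro' computes the same quantities for binary indicator matrices.

```python
import numpy as np

def label_based_f1(Y_true, Y_pred):
    """Macro F1 averages per-label F1 scores; Micro F1 pools the counts over all labels."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0).astype(float)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0).astype(float)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0).astype(float)
    # F1 from counts: 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0
    f1 = lambda tp_, fp_, fn_: np.where(2 * tp_ + fp_ + fn_ > 0,
                                        2 * tp_ / np.maximum(2 * tp_ + fp_ + fn_, 1e-12), 0.0)
    macro = float(np.mean(f1(tp, fp, fn)))
    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))
    return macro, micro
```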

4.3 Comparison methods

We compared the multi-label classification performance of LFGLC with several state-of-the-art algorithms:

BR [27] transforms multi-label classification to several binary classification tasks without considering label correlation, where each binary classifier corresponds to one label.

CC [20] puts all labels in a random sequence, and in accordance with each label, learns the corresponding binary classifier. For each label, the binary classifier outputs of its previous labels are added as new features.

MLkNN [28] finds the k-nearest neighbors of each instance in Euclidean space and predicts each label by maximum a posteriori estimation based on the number of neighbors carrying that label. The parameter k is set to 10.

Lasso [43] uses linear regression with the ℓ1-norm to select features from the original feature space according to nonzero regression coefficients, while neglecting label correlation. Parameter α is searched in {2^{-10}, 2^{-9}, …, 2^{10}}.

LLSF [36] uses cosine similarity to calculate pairwise label correlation, which is added to linear regression with the ℓ1-norm to select label-specific features for each label. Parameters α and β are searched in {2^{-10}, 2^{-9}, …, 2^{10}}, and ρ is searched in {0.1, 1, 10}.

Based on LLSF, JFSC [41] uses a Fisher discriminant-based regularization term to achieve a large inter-class distance and a small intra-class distance for classification. Parameters α, β, and γ are searched in {4^{-5}, 4^{-4}, …, 4^{5}}, and η is searched in {0.1, 1, 10}.

Based on LLSF, NSLSF [40] translates logical labels into numeric labels to convey more semantic information and embed label correlations. Parameters α and β are set the same as for LLSF, and ρ is set to 0.5.

The search ranges of the parameters α, β, γ, and η in LFGLC are the same as for JFSC. The number of nearest neighbors k is set to 10.

BR, MLkNN, and Lasso are first-order algorithms that do not consider label correlation; LLSF, JFSC, and NSLSF are second-order algorithms with global label correlation; and CC can be regarded as a high-order algorithm with global label correlation. Lasso, LLSF, JFSC, and NSLSF are feature selection based label-specific features algorithms. For fair comparison, the parameters of these algorithms are set according to the suggestions in their original papers. LIBSVM [44] with a linear kernel is employed as the base binary classifier for BR and CC, with the parameter C set to 1. Likewise, the threshold of LFGLC is set to 0.5, the same as for the other comparison algorithms. An appropriate threshold could of course be tuned for each algorithm; however, since this paper mainly studies the effect of label correlation and label-specific features on multi-label classification, threshold learning is left for future work.

4.4 Results of multi-label classification

The experiments used 5-fold cross-validation on each data set to evaluate multi-label classification performance. The average results of each algorithm on the 12 data sets under the 7 evaluation metrics are summarized in Tables 2 and 3, with the best result in each row in bold. "↑" after a metric denotes that a larger value indicates better performance. For each evaluation metric, an "Ave.rank" row reports the average rank of each algorithm over all data sets; a smaller rank indicates better performance. To reflect the average ranks more intuitively, the average rank and overall average rank are depicted in Fig. 2. According to the experimental results in Tables 2 and 3, the observations are summarized as follows:

  1. The proposed algorithm clearly outperforms the first-order algorithms (BR, MLkNN, Lasso) on all evaluation metrics, probably because LFGLC considers label correlation in multi-label classification while these first-order algorithms do not. Hence, considering label correlation can effectively improve multi-label classification performance.

  2. The second-order (LLSF, JFSC, NSLSF) and high-order (CC) algorithms consider global label correlation but neglect local label correlation. LFGLC outperforms these algorithms, which indicates the effectiveness of local label correlation for multi-label classification. Compared with the similar algorithms LLSF and JFSC, the proposed algorithm obtains more suitable regression parameters for classification.

  3. LFGLC statistically performs better on Accuracy, Precision, Recall, F1, Exact-Match, Macro F1, and Micro F1 over all data sets, which validates the superiority of the proposed algorithm. It is worth mentioning that Precision and Recall are generally in tension. Because LFGLC considers the possibility that each instance has labels similar to those of its neighbors, more labels tend to appear in its predictions, so it performs particularly well on Recall.

Table 2 Experimental results (mean± std(rank)) of different comparison algorithms on Accuracy, Precision, Recall, F1 and Exact-Match
Table 3 Experimental results (mean± std(rank)) of different comparison algorithms on Macro F1 and Micro F1
Fig. 2

Results of the rank of all algorithms. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

Figure 2 shows the overall average ranks of the algorithms, which can be ordered as LFGLC ≻ NSLSF ≻ JFSC ≻ LLSF ≻ CC ≻ Lasso ≻ BR ≻ MLkNN. In summary, the proposed algorithm performs competitively in multi-label classification against the other comparison algorithms.

To analyze the statistical performance among these algorithms systematically, the Friedman test [45] was conducted. Table 4 summarizes the Friedman statistic FF and the critical value for each evaluation metric. The null hypothesis of equivalent performance among the comparison algorithms is rejected at significance level α = 0.10 for each evaluation metric.

Table 4 Friedman statistic FF and the critical value at the 0.10 significance level for each evaluation metric

To analyze the relative performance among these algorithms, the post-hoc Nemenyi test [45] was conducted with LFGLC as the control algorithm. The performance of the control algorithm and a comparison algorithm is deemed significantly different if their average ranks differ by at least one critical difference, CD = qα√(k(k+1)/(6N)); with k = 8 algorithms and N = 12 data sets, √(k(k+1)/(6N)) = 1 and CD = 2.780 in this paper at α = 0.10. Figure 3 shows the CD diagrams for each evaluation metric. In each subfigure, any comparison algorithm whose average rank is within one CD of that of LFGLC is connected to it; otherwise, its performance is considered significantly different from that of LFGLC.

Fig. 3

Nemenyi test. Algorithms whose performance is not significantly different from that of LFGLC are connected (CD = 2.780 at the 0.10 significance level)


4.5 Results of multi-label feature selection

To evaluate the feature selection performance for multi-label learning, the proposed algorithm was compared with 4 feature selection based label-specific features algorithms (Lasso, LLSF, JFSC, NSLSF) on 5 multi-label data sets (cal500, emotions, medical, image, education), due to space limitations. For each label, features are selected from the original features according to the largest weights in the corresponding regression coefficients.
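
This selection step amounts to ranking features by the magnitude of their regression weights for each label, as in the following sketch (the selection ratio is a parameter of the experiment):

```python
import numpy as np

def select_top_features(W, ratio=0.2):
    """Return, for each label, the indices of the top `ratio` fraction of features by |weight|."""
    d, l = W.shape
    k = max(1, int(round(ratio * d)))
    # sort features by descending absolute weight and keep the first k indices per label
    return [np.argsort(-np.abs(W[:, j]))[:k] for j in range(l)]
```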

In this experiment, 5-fold cross-validation was conducted on each data set and the results were evaluated in terms of Accuracy, F1, Exact-Match, Macro F1, and Micro F1 for all algorithms; because F1 is the harmonic mean of Precision and Recall, we report F1 rather than Precision and Recall separately. The parameters of the algorithms are set as in the multi-label classification experiments. The top {10%, 20%, …, 50%} of the original features are taken as the selected features, and LIBSVM [44] with a linear kernel is employed as the base classifier for all algorithms. Figure 4 displays the average results of the algorithms on each data set, with "ALL" (using all original features without selection) as a baseline. According to the experimental results, the following observations can be made:

  1. All feature selection algorithms generally outperform the baseline "ALL", which indicates that feature selection can improve the performance of multi-label classification to some extent.

  2. Label-specific features learned by LFGLC generally perform better than those of the other comparison algorithms. Specifically, Lasso conducts feature selection without considering label correlation, while LLSF, JFSC, and NSLSF learn label-specific features with only global label correlation. The results indicate that considering label correlation is useful for feature selection, and that considering both global and local label correlation can further improve its performance.

  3. The performance of feature selection shows different trends on different data sets. For most data sets, the best performance is obtained at some intermediate number of selected features, perhaps because selecting too few features misses important ones, while selecting too many introduces useless features that degrade performance.

Fig. 4

Results of the performance of feature selection. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

4.6 Global and local label correlation

To intuitively show the global and local label correlation learned by LFGLC, the global pairwise label correlation matrix and the neighborhood label probabilities learned from the Image data set are depicted in Fig. 5. From Fig. 5(a) we observe that, globally, "mountains" is correlated with "trees" and "sea" is correlated with "sunset". For local label correlation (Fig. 5(b)), we randomly selected 10 neighborhoods, whose label correlations differ: for example, "mountains" is correlated with "sea" in neighborhood 7, while "sea" is correlated with "sunset" in neighborhood 8. This further illustrates the complexity of label correlation.

Fig. 5

Global and local label correlation on Image data set. The label index from small to large represents “desert”, “mountains”, “sea”, “sunset” and “trees”. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

4.7 Parameter sensitivity analysis

The objective function of LFGLC has several terms whose contributions are controlled by the parameters α, β, and γ, so the performance of LFGLC changes with their values. We conducted a parameter sensitivity analysis of LFGLC on the yeast data set. The parameter values corresponding to the best performance are first selected; then one parameter is fixed and the other two are varied in {4^{-5}, 4^{-4}, …, 4^{5}}.

Figure 6 shows the average results of 5-fold cross-validation in terms of Accuracy, F1, Exact-Match, Macro F1, and Micro F1. The best performance on Accuracy, F1, and Micro F1 is mostly obtained at the endpoints of the parameter ranges, whereas for Exact-Match and Macro F1 some intermediate parameter values achieve the best performance. The results show that the performance of LFGLC is sensitive to parameter changes in some intervals, and different metrics generally achieve their best performance at different parameter values. Thus, to obtain the best performance on a given data set, we suggest tuning the parameters on a validation set, although such a search is time-consuming, especially for large-scale data sets.

Fig. 6

Parameter sensitivity analysis of LFGLC. Note: For ease of reading and understanding of these images, please read the corresponding online version of the paper (colors have been retained)

5 Conclusion

In this paper, we propose LFGLC, a new label-specific feature selection and multi-label classification algorithm that accounts for the complexity of real-world label correlation. Both global and local label correlation are taken into account to learn more discriminative label-specific features; for each label, label-specific features are selected from the original features according to the nonzero regression coefficients. Experimental results show that combining global and local label correlation is useful for multi-label learning, and the proposed algorithm achieves competitive performance against several algorithms in both multi-label classification and feature selection. Considering that correlations between different labels are not equally strong, in future work we will try to model label correlation more faithfully for multi-label learning.