1 Introduction

As an important branch of artificial intelligence, machine learning has been widely applied in practice, especially in intelligent applications. Many algorithms have been proposed, such as those of Zhou et al. [1], Bencherif et al. [2], Souza et al. [3], and Bae et al. [4], which mainly use fuzzy neural networks to solve related problems. Traditional supervised learning assigns a single label to each object, but in reality an object is often associated with multiple labels; this setting is called multi-label classification. Multi-label learning has been widely used in web page classification [5], image annotation [6], biological analysis [7] and other fields.

There are two main approaches to multi-label classification: problem transformation and algorithm adaptation. Problem transformation converts a multi-label classification task into a series of single-label classification tasks, as in MLSVM [6] (Multi-label Support Vector Machine); however, MLSVM does not take label correlations into account. In contrast, algorithm adaptation modifies an existing single-label classification algorithm to handle the multi-label case. For example, ML-KNN [8] (Multi-Label K Nearest Neighbor) introduces maximum a posteriori (MAP) estimation into the traditional KNN algorithm, but it also ignores label correlations. Many algorithms that exploit label correlations have therefore been proposed, such as [9,10,11,12,13,14]. In fact, there is also a mutual-exclusion phenomenon between labels, known as negative correlation; for example, both Huang et al. [15] and Wu et al. [16] combine positive and negative label correlations. Some scholars have also proposed methods for the class-imbalance problem in multi-label learning, such as [17,18,19].

In classification tasks, the curse of dimensionality often arises when the feature space is very large, and feature selection is one way to address it. Many feature selection algorithms have been proposed under the assumption that the selected features are relevant to all class labels, such as [20,21,22,23,24]. However, each class label may in fact be determined by a specific subset of features. For example, in text classification, if a document contains words such as "basketball" and "ping-pong ball", it is more likely to be labeled as "sport"; such words act as label-specific features. To exploit this, several algorithms have been proposed to select label-specific features [25,26,27].

Unfortunately, many state-of-the-art methods perform poorly when some labels are missing. Many multi-label classification algorithms rest on the assumption that all class labels are observed, which previous research has shown to be inappropriate: in many practical applications, it is difficult to obtain complete label information.

To handle missing labels in classification tasks, many algorithms have been proposed, such as [28,29,30,31,32,33], and they have proved effective. However, these algorithms use an identical data representation, composed of all features, to discriminate all class labels. How to recover missing labels using positive and negative label correlations learned from incomplete training data, and at the same time select label-specific features for each class label, therefore remains a hot spot in multi-label learning with missing labels. The motivation of this article is summarized as follows.

  • Each class label may be determined by only a subset of features in the original feature space.

  • Many algorithms deal with missing labels using only positive label correlations and ignore negative label correlations.

To address these two problems, we propose a novel algorithm that jointly selects label-specific features and recovers missing labels for multi-label learning by exploiting label correlations in a unified model. Compared with other algorithms, the innovation of the proposed method is that when completing the missing label matrix, we use not only the positive correlations between labels but also the negative correlations. The main contributions are summarized as follows.

  • This is the first model that combines label-specific feature selection, missing-label recovery, and both positive and negative label correlations.

  • The positive and negative label correlations can be learned directly from the incomplete training datasets and used to complete the label matrix, while the multi-label classification model is built by iteratively selecting label-specific features.

  • Our proposed algorithm achieves competitive performance against existing state-of-the-art algorithms in the experiments of this paper.

The rest of this paper is organized as follows. Section 2 reviews related work on label-specific features, missing labels and label correlations. In Section 3, we propose two versions of the algorithm. Section 4 gives the optimization of the proposed algorithm. The experimental setup and the conclusion are presented in Sections 5 and 6, respectively.

2 Related work

2.1 Multi-label learning with label-specific feature

In multi-label learning, classification results benefit when label-specific features can be found. Zhang et al. [25] proposed LIFT, which clusters the positive and negative instances of each class label, computes the distances between the original data and the cluster centers, and uses these distances as additional features. Similarly, LSDM [34] applies clustering with different ratio parameters to the positive and negative instances of each class label and builds label-specific features from distance information and spatial topology information. However, neither of them considers label correlations. By contrast, LSF-CI [35] selects label-specific features for each class label by modeling label correlations in the label space and the feature space simultaneously.

Huang et al. [26] and Weng et al. [36] proposed LLSF-DL and LF-LPLC respectively: the former jointly learns label-specific features and class-dependent labels by sparse stacking, while the latter combines label-specific features with local pairwise label correlations. Although these algorithms can handle label-specific features, they do not perform particularly well when labels are missing. Some scholars consider the issue from other angles, such as [37, 38]. NSLSF [37] translates logical labels into numerical ones and uses the numerical labels to extract label-specific features. FSSL [38] selects label-specific features for each newly arrived label by designing inter-class discrimination and intra-class neighbor recognition.

2.2 Multi-label learning with missing labels

It is inappropriate to assume that all labels are observed, and many algorithms have therefore been designed for missing labels, such as [28,29,30]. Wu et al. [28] proposed a new model for multi-label learning with missing labels and showed that it alleviates the missing-label problem. Maxide [29] speeds up matrix completion by exploiting side information, under the assumption that the matrix to be completed and the side-information matrix share latent information. SVMNN [30] is based on SVM; it constructs two graphs to capture sample smoothness and class smoothness respectively, and then combines the two factors in its loss function. The algorithms mentioned above assume that it is unknown which labels of each instance are missing, whereas [32, 33] assume that the locations of the missing labels are partially known. SSWL [33] is a semi-supervised method that recovers missing labels by considering both instance similarity and label similarity. iMVML [32] explores a shared subspace for multi-view multi-label learning, learning the subspace from incomplete views together with weak labels, label correlations and a predictor in this subspace. Besides, some frameworks handle label correlations, missing labels and other issues simultaneously, such as [31, 39]. LSML [39] handles multi-label tasks with missing labels, label correlations and label-specific feature selection; it is similar to our method in that it selects label-specific features through l1 regularization. MLMLFS [31] jointly performs feature selection with missing labels and models the classifier in a unified framework, using l2,p regularization to select the most discriminative features and a manifold regularization to explore label correlations. In short, how to find the label correlations is the most critical issue.

2.3 Multi-label learning with label correlations

Whether for missing labels or feature selection, label correlations play an important role and have been shown to improve the classification ability of multi-label learning algorithms; examples are ML-LOC [10] and Glocal [13]. ML-LOC obtains local label correlations by clustering and incorporates them into the feature space, while Glocal extends ML-LOC by exploiting both global and local label correlations. However, these are only positive label correlations. There is also mutual exclusion between labels, called negative correlation, as in RAkEL [12], LPLC [15] and CPNL [16]. LPLC explores positive and negative pairwise label correlations during training by computing, on the nearest-neighbor sets, the maximum posterior probability of each class label given another. Similarly, RAkEL is an improved version of [14]; it incorporates positive and negative correlations into the k-labelsets by calculating the cosine similarity of two matrices. CPNL, in contrast, does not obtain positive and negative label correlations through explicit calculation but models them in a unified framework. In this paper, we therefore propose to jointly perform label-specific feature selection and exploit label correlations for multi-label learning with missing labels.

3 The proposed model

3.1 Notations

For an arbitrary matrix W, Wi and Wj denote the i-th row and the j-th column of W, respectively, and WT denotes the transpose of W. In multi-label learning, we are given a dataset \( D={\left\{\left({\mathbf{x}}_i,{\mathbf{y}}_i\right)\right\}}_{i=1}^n \), where xi ∈ Rm denotes the feature vector of the i-th instance and yi ∈ {0, 1}l its ground-truth label vector; m is the number of features and l the number of class labels associated with an instance. For simplicity, the training data are written as matrices X = [x1, x2, …, xn]T ∈ Rn × m and Y = [y1, y2, …, yn]T ∈ {0, 1}n × l, where yij = 1 indicates that the i-th instance is associated with the j-th label, and yij = 0 indicates that the i-th instance is not associated with the j-th label or that this label is missing.

3.2 Label-specific feature selection

Given the dataset, we assume that each class label is determined only by a certain subset of specific features from the original feature space. To select these features, i.e., the label-specific features, l1 regularization is used in a linear model. Naturally, the label-specific features are identified by the non-zero entries of each wi ∈ Rm, and the optimization problem is defined as

$$ \underset{\mathbf{W}}{\min}\frac{1}{2}{\left|\left|\mathbf{XW}-\mathbf{Y}\right|\right|}_F^2+{\lambda}_1{\left|\left|\begin{array}{c}\mathbf{W}\end{array}\right|\right|}_1 $$
(1)

where W = [w1, w2, …, wl] ∈ Rm × l is the regression coefficient matrix and λ1 is the trade-off parameter. \( {\left\Vert \ast \right\Vert}_F^2 \) denotes the squared Frobenius norm, i.e., the sum of the squares of all elements of the matrix ∗.
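To make Eq. (1) concrete, here is a minimal numpy sketch (synthetic data and an illustrative λ1, not values from the paper) that solves the l1-regularized problem by proximal gradient descent and reads off the label-specific features of each label as the non-zero entries of the corresponding column of W. It is a sketch of the idea, not the paper's implementation.

```python
# Minimal sketch of Eq. (1): l1-regularized least squares solved by proximal
# gradient (ISTA). Data, lambda_1 and the step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 100, 20, 4                      # instances, features, labels
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) > 0.7).astype(float)

lam1 = 0.1
L = np.linalg.norm(X.T @ X, 2)            # Lipschitz constant of the smooth part
W = np.zeros((m, l))

def soft_threshold(A, t):
    """Element-wise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

for _ in range(200):
    grad = X.T @ (X @ W - Y)              # gradient of 0.5*||XW - Y||_F^2
    W = soft_threshold(W - grad / L, lam1 / L)

# Label-specific features: the non-zero entries of each column of W.
for k in range(l):
    print(f"label {k}: features {np.flatnonzero(np.abs(W[:, k]) > 1e-8)}")
```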

3.3 Learning multi-label with missing label

In previous studies, it is assumed that the labels of the given training data are complete and fully observable. However, this assumption often does not hold: since labeling is a huge undertaking, missing labels have become a hot topic in multi-label learning. To solve this problem, we propose to recover the incomplete label matrix into a complete one by making full use of label correlations. First, we assume that any unobserved (missing) label can be inferred from its relationships with the observed labels. Let P ∈ Rl × l be the label-correlation matrix, where pij represents the degree of correlation between the i-th label and the j-th label. Following the idea of Huang et al. [39], we assume that each class label may be correlated with only a subset of class labels, and therefore add an l1-norm regularizer over P. The optimization problem can be rewritten as

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{YP}\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_4\left\|\mathbf{YP}-\mathbf{Y}\right\|_F^2 \\
s.t.\quad & \mathbf{P}\ge 0
\end{aligned} $$
(2)
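To illustrate the intuition behind Eq. (2), the toy sketch below completes an incomplete label matrix through a hand-crafted correlation matrix P; both Y and P here are illustrative values, not learned ones.

```python
# Illustrative sketch of the idea behind Eq. (2): an incomplete label matrix Y
# is completed through a label-correlation matrix P as Y P. Toy values only.
import numpy as np

# Three labels; label 0 and label 1 are strongly correlated.
P = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# The second instance has label 0 observed but label 1 missing (coded as 0).
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

scores = Y @ P                 # correlation-propagated label scores
Y_completed = (scores >= 0.5).astype(int)
print(Y_completed)             # the missing entry (row 1, label 1) is recovered
```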

3.4 Learning positive label correlations

In multi-label classification, accuracy can be greatly improved if the relationships between labels are taken into account. We therefore incorporate pairwise label correlations by constraining the Euclidean distances between pairs of coefficient vectors with the learned label-correlation matrix P. If two labels yi and yj are positively correlated, i.e., pij or pji is large, their label-specific features are more likely to be the same; accordingly, the corresponding regression coefficients wi and wj should be similar, and vice versa. Based on this, we add a regularizer built from the Euclidean distance between wi and wj as follows,

$$ \begin{aligned}
\mathrm{POS} &= \frac{1}{2}\sum_{i,j=1}^{l} p_{ij}\left\|\mathbf{W}^i-\mathbf{W}^j\right\|_2^2 \\
&= \frac{1}{2}\left(\sum_{i,j=1}^{l} p_{ij}\left(\left\|\mathbf{W}^i\right\|_2^2+\left\|\mathbf{W}^j\right\|_2^2\right)-2\sum_{i,j=1}^{l} p_{ij}\left\langle \mathbf{W}^i,\mathbf{W}^j\right\rangle\right) \\
&= tr\left(\mathbf{W}{\mathbf{D}}_p{\mathbf{W}}^T\right)- tr\left(\mathbf{W}\mathbf{P}{\mathbf{W}}^T\right) \\
&= tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right)
\end{aligned} $$
(3)

where Lp = Dp − P ∈ Rl × l is the Laplacian matrix and Dp is the diagonal matrix whose diagonal entries are given by P1, with 1 the all-ones vector. Thus, the optimization problem can be rewritten as,

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{YP}\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_4\left\|\mathbf{YP}-\mathbf{Y}\right\|_F^2+{\lambda}_5\, tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right) \\
s.t.\quad & \mathbf{P}\ge 0
\end{aligned} $$
(4)
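As a sanity check on the Laplacian identity used in Eq. (3), the following sketch verifies numerically that 0.5 * Σij pij ||W^i - W^j||_2^2 equals tr(W Lp W^T) for a symmetric non-negative P; the matrices are random placeholders.

```python
# Numerical check of the identity in Eq. (3) with L_p = D_p - P.
import numpy as np

rng = np.random.default_rng(1)
m, l = 6, 4
W = rng.standard_normal((m, l))           # columns are label-wise coefficient vectors
P = rng.random((l, l))
P = (P + P.T) / 2                          # symmetric label-correlation matrix

D_p = np.diag(P @ np.ones(l))              # D_p = diag(P 1)
L_p = D_p - P                              # graph Laplacian of P

pairwise = 0.5 * sum(P[i, j] * np.sum((W[:, i] - W[:, j]) ** 2)
                     for i in range(l) for j in range(l))
trace_form = np.trace(W @ L_p @ W.T)
print(np.isclose(pairwise, trace_form))    # True
```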

3.5 Learning negative label correlations

In fact, the label correlations discussed above are positive label correlations. There is also mutual exclusion between labels, called negative label correlation [16]. For example, for "aquarium", the relationship with "seal" is relatively close, whereas the relationship with "tiger" is relatively weak. Similarly to the positive case, we learn a negative label-correlation matrix N: if two labels yi and yj are negatively correlated, i.e., nij or nji is large, their label-specific features are less likely to coincide, so the corresponding regression coefficients wi and wj should be dissimilar; in other words, wi should be close to −wj, and the Euclidean distance between wi and wj should be large. Therefore, we add the following regularizer,

$$ \begin{aligned}
\mathrm{NEG} &= \frac{1}{2}\sum_{i,j=1}^{l} n_{ij}\left\|\mathbf{W}^i-\left(-\mathbf{W}^j\right)\right\|_2^2 \\
&= \frac{1}{2}\left(\sum_{i,j=1}^{l} n_{ij}\left(\left\|\mathbf{W}^i\right\|_2^2+\left\|\mathbf{W}^j\right\|_2^2\right)+2\sum_{i,j=1}^{l} n_{ij}\left\langle \mathbf{W}^i,\mathbf{W}^j\right\rangle\right) \\
&= tr\left(\mathbf{W}{\mathbf{D}}_n{\mathbf{W}}^T\right)+ tr\left(\mathbf{W}\mathbf{N}{\mathbf{W}}^T\right) \\
&= tr\left(\mathbf{W}{\mathbf{L}}_n{\mathbf{W}}^T\right)
\end{aligned} $$
(5)

where Ln = Dn + N ∈ Rl × l and Dn is the diagonal matrix whose diagonal entries are given by N1, with 1 the all-ones vector. Note that we assume the overall relationship between labels is a combination of positive and negative label correlations; for simplicity, we adopt the linear combination P + N here. Thus, the final optimization problem can be rewritten as,

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P},\mathbf{N}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_3\left\|\mathbf{N}\right\|_1 \\
& +{\lambda}_4\left\|\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)-\mathbf{Y}\right\|_F^2+{\lambda}_5\, tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right)+{\lambda}_6\, tr\left(\mathbf{W}{\mathbf{L}}_n{\mathbf{W}}^T\right) \\
s.t.\quad & \mathbf{P}\ge 0,\ \mathbf{N}\ge 0
\end{aligned} $$
(6)
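To keep the roles of the six trade-off parameters explicit, the following hedged numpy sketch simply evaluates the objective of Eq. (6) for given W, P and N; all inputs and λ values are illustrative assumptions, not learned quantities.

```python
# Evaluating the objective of Eq. (6) for given W, P, N (illustrative inputs).
import numpy as np

def lslc_objective(X, Y, W, P, N, lams):
    l1, l2, l3, l4, l5, l6 = lams
    labels = Y @ (P + N)                              # correlation-combined label matrix
    D_p = np.diag(P.sum(axis=1)); L_p = D_p - P       # positive-correlation Laplacian
    D_n = np.diag(N.sum(axis=1)); L_n = D_n + N       # negative-correlation counterpart
    loss = 0.5 * np.linalg.norm(X @ W - labels, 'fro') ** 2
    loss += l1 * np.abs(W).sum() + l2 * np.abs(P).sum() + l3 * np.abs(N).sum()
    loss += l4 * np.linalg.norm(labels - Y, 'fro') ** 2
    loss += l5 * np.trace(W @ L_p @ W.T) + l6 * np.trace(W @ L_n @ W.T)
    return loss

rng = np.random.default_rng(2)
n, m, l = 50, 10, 5
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) > 0.7).astype(float)
W = rng.standard_normal((m, l)) * 0.1
P = rng.random((l, l)) * 0.1
N = rng.random((l, l)) * 0.1
print(lslc_objective(X, Y, W, P, N, lams=(0.1,) * 6))
```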

4 Model optimization and prediction

In this section, considering that the objective function is non-smooth, the accelerated proximal gradient method is applied, following [39]. For simplicity, let Φ denote the model parameters, that is, Φ = (W, P, N). Then the objective function (6) can be written as,

$$ \underset{\boldsymbol{\Phi}}{\min}\left\{F\left(\boldsymbol{\Phi} \right)\equiv f\left(\boldsymbol{\Phi} \right)+g\left(\boldsymbol{\Phi} \right)\right\} $$
(7)

where

$$ {\displaystyle \begin{array}{cc}f\left(\boldsymbol{\Phi} \right)=& \frac{1}{2}{\left|\left|\mathbf{XW}-\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)\right|\right|}_F^2+{\lambda}_4{\left|\left|\begin{array}{c}\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)-\mathbf{Y}\end{array}\right|\right|}_F^2\\ {}& +{\lambda}_5 tr\left(\mathbf{W}{\mathbf{L}}_{\mathbf{P}}{\mathbf{W}}^T\right)+{\lambda}_6 tr\left(\mathbf{W}{\mathbf{L}}_{\mathbf{N}}{\mathbf{W}}^T\right)\end{array}} $$
(8)
$$ g\left(\boldsymbol{\Phi} \right)={\lambda}_1{\left|\left|\mathbf{W}\right|\right|}_1+{\lambda}_2{\left|\left|\begin{array}{c}\mathbf{P}\end{array}\right|\right|}_1+{\lambda}_3{\left|\left|\begin{array}{c}\mathbf{N}\end{array}\right|\right|}_1 $$
(9)

Note that although both f(Φ) and g(Φ) are convex, the latter is not smooth. First, we use the following function defined in [40].

$$ {Q}_L\left(\boldsymbol{\Phi}, {\boldsymbol{\Phi}}^i\right)=f\left({\boldsymbol{\Phi}}^i\right)+\left\langle \nabla f\left({\boldsymbol{\Phi}}^i\right),\boldsymbol{\Phi} -{\boldsymbol{\Phi}}^i\right\rangle +\frac{L}{2}{\left\Vert \boldsymbol{\Phi} -{\boldsymbol{\Phi}}^i\right\Vert}_F^2+g\left(\boldsymbol{\Phi} \right) $$
(10)

Here, for any L ≥ Lf, QL(Φ, Φi) ≥ F(Φ) always holds, where Lf is the Lipschitz constant of ∇f. The proximal algorithm then minimizes a sequence of separable quadratic approximations of F(Φ). Similar to [40], Φ can be obtained by minimizing QL(Φ, Φi), computed as follows,

$$ {\displaystyle \begin{array}{c}{\boldsymbol{\Phi}}^{\ast }=\arg \underset{\boldsymbol{\Phi}}{\min }{Q}_{\mathbf{L}}\left(\boldsymbol{\Phi}, {\boldsymbol{\Phi}}^i\right)\\ {}\kern1.65em =\arg \underset{\boldsymbol{\Phi}}{\min }g\left(\boldsymbol{\Phi} \right)+\frac{L}{2}{\left|\left|\boldsymbol{\Phi} -{G}^i\right|\right|}_F^2\end{array}} $$
(11)

where

$$ {\mathbf{G}}^i={\boldsymbol{\Phi}}^i-\frac{1}{\mathbf{L}}\bigtriangledown f\left({\boldsymbol{\Phi}}^i\right) $$
(12)

The convergence can be accelerated by setting \( {\boldsymbol{\Phi}}^i={\boldsymbol{\Phi}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\boldsymbol{\Phi}}_i-{\boldsymbol{\Phi}}_{i-1}\right) \) for a sequence αi satisfying \( {\alpha}_{i+1}^2-{\alpha}_{i+1}\le {\alpha}_i^2 \), where Φi is the i-th iterate of Φ.
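The overall APG scheme of Eqs. (10)-(12) can be sketched generically as follows; grad_f, prox_g, L and the initial point are placeholders supplied by the caller, so this is an illustration of the update structure rather than the paper's exact Algorithm 1.

```python
# A generic accelerated-proximal-gradient (FISTA-style) skeleton matching
# Eqs. (10)-(12): extrapolate with the alpha sequence, take a gradient step of
# length 1/L, then apply the proximal operator of g.
import numpy as np

def apg(Phi0, grad_f, prox_g, L, n_iter=100):
    Phi_prev = Phi_cur = Phi0
    alpha_prev, alpha_cur = 1.0, 1.0
    for _ in range(n_iter):
        # extrapolation point Phi^i = Phi_i + ((alpha_{i-1} - 1)/alpha_i)(Phi_i - Phi_{i-1})
        Phi_ex = Phi_cur + ((alpha_prev - 1.0) / alpha_cur) * (Phi_cur - Phi_prev)
        G = Phi_ex - grad_f(Phi_ex) / L               # Eq. (12)
        Phi_next = prox_g(G, 1.0 / L)                 # Eq. (11)
        # alpha update satisfying alpha_{i+1}^2 - alpha_{i+1} <= alpha_i^2
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha_cur ** 2)) / 2.0
        Phi_prev, Phi_cur = Phi_cur, Phi_next
        alpha_prev, alpha_cur = alpha_cur, alpha_next
    return Phi_cur

# Example usage on the lasso-type subproblem min_W 0.5*||XW - Y||_F^2 + lam*||W||_1:
# apg(np.zeros((m, l)),
#     grad_f=lambda W: X.T @ (X @ W - Y),
#     prox_g=lambda A, t: np.sign(A) * np.maximum(np.abs(A) - lam * t, 0.0),
#     L=np.linalg.norm(X.T @ X, 2))
```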

4.1 Update W

First, we update W with P and N fixed; the derivative of function (8) with respect to W is

$$ {\bigtriangledown}_{\mathbf{W}}f\left(\boldsymbol{\Phi} \right)={\mathbf{X}}^T\mathbf{XW}-{\mathbf{X}}^T\mathbf{YP}-{\mathbf{X}}^T\mathbf{YN}+{\lambda}_5\mathbf{W}\left({\mathbf{L}}_{\mathrm{p}}+{\mathbf{L}}_{\mathrm{p}}^T\right)+{\lambda}_6\mathbf{W}\left({\mathbf{L}}_{\mathrm{n}}+{\mathbf{L}}_{\mathrm{n}}^T\right) $$
(13)

Then, according to Eq. (11), W can be updated by

$$ {\mathbf{W}}^i={\mathbf{W}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{W}}_i-{\mathbf{W}}_{i-1}\right) $$
(14)
$$ {\mathbf{W}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{W}}^i-\frac{1}{L}{\bigtriangledown}_{\mathrm{W}}f\left({\mathbf{W}}^i,\mathbf{P},\mathbf{N}\right)\right) $$
(15)

where ε is the step size. Because the term involving W in g(Φ) is an l1-norm regularizer, the element-wise soft-thresholding operator is adopted to solve it, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({w}_{ij}\right)={\left(|{w}_{ij}|-\epsilon \right)}_{+}\operatorname{sign}\left({w}_{ij}\right) $$
(16)

where (⋅)+ =  max (⋅, 0).
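As a short illustration, the element-wise operator of Eq. (16) can be coded directly; the matrix and ε below are toy values.

```python
# Element-wise soft-thresholding operator of Eq. (16), used in the W update.
import numpy as np

def prox_l1(A, eps):
    """prox_eps(a) = (|a| - eps)_+ * sign(a), applied element-wise."""
    return np.maximum(np.abs(A) - eps, 0.0) * np.sign(A)

W = np.array([[0.8, -0.05], [-0.3, 0.02]])
print(prox_l1(W, eps=0.1))   # small entries are zeroed, large ones shrink toward 0
```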

4.2 Update P

Similarly, we update P with W and N fixed; the derivative of function (8) with respect to P is

$$ {\bigtriangledown}_{\mathrm{P}}f\left(\boldsymbol{\Phi} \right)=\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{P}+\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{N}-{\mathbf{Y}}^T\mathbf{XW}-{\lambda}_4{\mathbf{Y}}^T\mathbf{Y} $$
(17)

Thus, P can be updated by

$$ {\mathbf{P}}^i={\mathbf{P}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{P}}_i-{\mathbf{P}}_{i-1}\right) $$
(18)
$$ {\mathbf{P}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{P}}^i-\frac{1}{L}{\bigtriangledown}_{\mathrm{P}}f\left(\mathbf{W},{\mathbf{P}}^i,\mathbf{N}\right)\right) $$
(19)

where ε is the step size. Because the term involving P in g(Φ) is a nonnegative l1-norm regularizer, the nonnegative soft-thresholding operator is adopted, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({p}_{ij}\right)={\left({p}_{ij}-\epsilon \right)}_{+} $$
(20)

where (⋅)+ =  max (⋅, 0).

4.3 Update N

Similarly, we update N with W and P fixed; the derivative of function (8) with respect to N is

$$ {\bigtriangledown}_{\mathrm{N}}f\left(\Phi \right)=\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{P}+\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{N}-{\mathbf{Y}}^T\mathbf{XW}-{\lambda}_4{\mathbf{Y}}^T\mathbf{Y} $$
(21)

Thus, N can be updated by

$$ {\mathbf{N}}^i={\mathbf{N}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{N}}_i-{\mathbf{N}}_{i-1}\right) $$
(22)
$$ {\mathbf{N}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{N}}^i-\frac{1}{L}{\nabla}_{\mathbf{N}}f\left(\mathbf{W},\mathbf{P},{\mathbf{N}}^i\right)\right) $$
(23)

where ε is the step size. Because the term involving N in g(Φ) is a nonnegative l1-norm regularizer, the nonnegative soft-thresholding operator is adopted, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({n}_{ij}\right)={\left({n}_{ij}-\epsilon \right)}_{+} $$
(24)

where (⋅)+ =  max (⋅, 0). Note that the derivative of function (8) with respect to P is the same as that with respect to N, because we assume the label relationship is the simplest linear combination of P and N, that is, P + N. All optimization steps are summarized in Algorithm 1, called LSLC for simplicity.

figure a (Algorithm 1: LSLC)
figure b (Algorithm 2: LSLCnew)
figure c (Algorithm 3: model prediction)
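For reference, the block-wise gradients of Eqs. (13), (17) and (21) that drive the algorithms can be transcribed directly into numpy; the functions below are a sketch and assume inputs with the shapes of Section 3.1.

```python
# Block-wise gradients of f(Phi), transcribed from Eqs. (13), (17) and (21).
import numpy as np

def grad_W(X, Y, W, P, N, L_p, L_n, lam5, lam6):
    # Eq. (13)
    return (X.T @ X @ W - X.T @ Y @ P - X.T @ Y @ N
            + lam5 * W @ (L_p + L_p.T) + lam6 * W @ (L_n + L_n.T))

def grad_P(X, Y, W, P, N, lam4):
    # Eq. (17); Eq. (21) for N has the same form because P and N enter f only
    # through the linear combination P + N.
    return ((1 + lam4) * Y.T @ Y @ P + (1 + lam4) * Y.T @ Y @ N
            - Y.T @ X @ W - lam4 * Y.T @ Y)

grad_N = grad_P   # identical expression, as noted in the text
```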

4.4 Lipschitz constant

Given two sets of model parameters Φ1 = (W1, P1, N1) and Φ2 = (W2, P2, N2), we have

$$ \begin{aligned}
\left\|\nabla f\left({\boldsymbol{\Phi}}_1\right)-\nabla f\left({\boldsymbol{\Phi}}_2\right)\right\|_F^2
&= \left\| \begin{array}{l}
\mathbf{X}^T\mathbf{X}\Delta\mathbf{W}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{P}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{N}+{\lambda}_5\Delta\mathbf{W}\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)+{\lambda}_6\Delta\mathbf{W}\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\\
\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}\\
\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}
\end{array}\right\|_F^2 \\
&= \left\|\mathbf{X}^T\mathbf{X}\Delta\mathbf{W}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{P}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{N}+{\lambda}_5\Delta\mathbf{W}\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)+{\lambda}_6\Delta\mathbf{W}\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_F^2 \\
&\quad + 2\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}\right\|_F^2 \\
&\le 5\left\|\mathbf{X}^T\mathbf{X}\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{P}\right\|_F^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{N}\right\|_F^2 \\
&\quad + 5\left\|{\lambda}_5\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2+5\left\|{\lambda}_6\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2 \\
&\quad + 3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{P}\right\|_F^2+3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{N}\right\|_F^2+3\left\|\mathbf{Y}^T\mathbf{X}\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2 \\
&\le \left(5\left\|\mathbf{X}^T\mathbf{X}\right\|_2^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2+5\left\|{\lambda}_5\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)\right\|_2^2+5\left\|{\lambda}_6\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_2^2+3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\right)\left\|\begin{array}{c}\Delta\mathbf{W}\\ \Delta\mathbf{P}\\ \Delta\mathbf{N}\end{array}\right\|_F^2
\end{aligned} $$
(25)

Thus, the Lipschitz constant of optimization problem (6) can be calculated by

$$ {\mathbf{L}}_f=\sqrt{5{\left|\left|{\mathbf{X}}^{\mathbf{T}}\mathbf{X}\right|\right|}_2^2+5{\left|\left|{\mathbf{X}}^{\mathbf{T}}\mathbf{Y}\right|\right|}_2^2+5{\left|\left|\begin{array}{c}{\lambda}_5\left({\mathbf{L}}_{{\mathbf{p}}^{\prime }}\right)\end{array}\right|\right|}_2^2+5{\left|\left|\begin{array}{c}{\lambda}_6\left({\mathbf{L}}_{{\mathrm{n}}^{\prime }}\right)\end{array}\right|\right|}_2^2+3{\left|\left|\begin{array}{c}\left(1+{\lambda}_4\right){\mathbf{Y}}^{\mathbf{T}}\mathbf{Y}\end{array}\right|\right|}_2^2} $$
(26)

where \( {\mathbf{L}}_{{\mathbf{p}}^{\prime }}={\mathbf{L}}_{\mathbf{p}}+{\mathbf{L}}_{\mathbf{p}}^{\mathbf{T}} \) and \( {\mathbf{L}}_{{\mathbf{n}}^{\prime }}={\mathbf{L}}_{\mathbf{n}}+{\mathbf{L}}_{\mathbf{n}}^{\mathbf{T}} \).
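A direct transcription of Eq. (26) into numpy might look as follows; the inputs are placeholders for the training matrices and the current Laplacian matrices.

```python
# Computing the Lipschitz constant of Eq. (26) with spectral (2-)norms.
import numpy as np

def lipschitz_constant(X, Y, L_p, L_n, lam4, lam5, lam6):
    def s(A):
        return np.linalg.norm(A, 2) ** 2               # squared spectral norm
    return np.sqrt(5 * s(X.T @ X)
                   + 5 * s(X.T @ Y)
                   + 5 * s(lam5 * (L_p + L_p.T))
                   + 5 * s(lam6 * (L_n + L_n.T))
                   + 3 * s((1 + lam4) * (Y.T @ Y)))
```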

4.5 Time complexity

In this part, we briefly analyze the time complexity of the optimization algorithm. Consider the matrices X ∈ Rn × m, Y ∈ {0, 1}n × l, W ∈ Rm × l, P ∈ Rl × l, N ∈ Rl × l and L ∈ Rl × l, where n denotes the number of instances, l the number of labels and m the dimension of the instances. In Algorithm 1 (LSLC), computing the positive and negative label correlation matrices Lp and Ln in step 3 takes O(l2); the other key steps lie in the calculation of the Lipschitz constant and the matrix derivatives. In step 4, calculating the Lipschitz constant takes O(m3 + l3 + mnl). The matrix derivatives are computed in steps 6, 9 and 12: the derivative with respect to W takes O(nm2 + ml2 + lm2 + mnl), and since the derivatives with respect to P and N are similar, each of them takes O(nl2 + l3 + ml2 + mnl). As mentioned above, the cost of calculating the Lipschitz constant can be reduced by finding an appropriate L ≥ Lf [40], which takes O(ml2 + m2n + m2l + nl2 + mnl + nl + l2 + l3). These steps are summarized in Algorithm 2, called LSLCnew for simplicity.

4.6 Model prediction

In this section, we use the two optimization algorithms described above to obtain the matrix W. For an unlabeled test set Dts, we predict its labels by calculating the score matrix Sc = DtsW and then applying Sign(Sc − τ) with a given threshold τ, where τ is set to 0.5 in the experiments. The prediction process is summarized in Algorithm 3.
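A minimal sketch of this prediction step, assuming X_test and the learned W are numpy arrays:

```python
# Prediction as described in Section 4.6: score the test instances with the
# learned W and threshold at tau = 0.5.
import numpy as np

def predict(X_test, W, tau=0.5):
    Sc = X_test @ W                       # score matrix, one column per label
    return (np.sign(Sc - tau) > 0).astype(int)
```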

5 Experiment

5.1 Data sets

The proposed algorithms are evaluated on benchmark multi-label data sets; their statistics are shown in Table 1. For each data set, we randomly select 80% of the data as training data and the rest as test data. The experiment is repeated ten times and five-fold cross validation is used.

Table 1 The statistical information of data sets

5.2 Evaluation measures

To evaluate the performance of the experimental results, we use seven evaluation metrics. For simplicity, let t denote the number of test instances, let Yi and \( {\overset{\sim }{\mathbf{Y}}}_i \) denote the sets of ground-truth labels and predicted labels of the i-th instance, respectively, and let \( {Z}_j^{\mathbf{P}} \) and \( {Z}_j^{\mathbf{N}} \) denote the sets of positive and negative instances of the j-th class label. \( {C}_{{\mathrm{x}}_i}\left(y\right) \) is the confidence score of xi for label y. The evaluation metrics are computed as follows,

Hamming Loss (HL) evaluates the fraction of mismatched labels between the ground-truth label set Yi and the predicted label set \( {\overset{\sim }{\mathbf{Y}}}_i \).

$$ HL=\frac{1}{t}\sum \limits_{i=1}^t\left(\frac{1}{l}\left|{Y}_i\triangle {\overset{\sim }{Y}}_i\right|\right) $$
(27)

where △ denotes the symmetric difference between the two label sets.

Subset Accuracy (SA) evaluates, in a stricter way, the exact agreement between the ground-truth labels Yi and the predicted labels \( {\overset{\sim }{\mathbf{Y}}}_i \).

$$ SA=\frac{1}{t}\sum \limits_{i=1}^t\left\Vert {\mathbf{Y}}_i={\overset{\sim }{\mathbf{Y}}}_i\right\Vert $$
(28)

where ‖·‖ is the indicator function.

Micro F1 (MF1) is an extended version of the F1 measure; it treats each entry of the label matrix as a single binary prediction, regardless of the distinction between labels.

$$ MF1=\frac{2\sum \limits_{j=1}^l\sum \limits_{i=1}^t{Y}_{ij}{\overset{\sim }{Y}}_{ij}}{\sum \limits_{j=1}^l\sum \limits_{i=1}^t{Y}_{ij}+\sum \limits_{j=1}^l\sum \limits_{i=1}^t{\overset{\sim }{Y}}_{ij}} $$
(29)

Average Precision (AP) evaluates the average fraction of relevant labels that are ranked higher than a particular relevant label.

$$ AP=\frac{1}{t}\sum \limits_{i=1}^t\frac{1}{\left|{Y}_i\right|}\sum \limits_{y\in {Y}_i}\frac{\left|\left\{{y}^{\prime}\in {Y}_i\ \mid\ {rank}_{x_i}\left({y}^{\prime}\right)\le {rank}_{x_i}\left(y\right)\right\}\right|}{{rank}_{x_i}\left(y\right)} $$
(30)

where \( {rank}_{x_i}(y) \) denotes the rank of label y when all labels are sorted in descending order of \( {C}_{x_i}\left(\cdot \right) \).

Ranking Loss (RL) evaluates the fraction of label pairs for which an unrelated label (a label in the complementary set \( {\overline{Y}}_i \) of Yi) is ranked above a related label.

$$ RL=\frac{1}{t}\sum \limits_{i=1}^t\frac{1}{\left|{Y}_i\right|\left|{\overline{Y}}_i\right|}\left|\left\{\left({y}^{\prime },{y}^{{\prime\prime}}\right)\ \mid\ {C}_{x_i}\left({y}^{\prime}\right)\le {C}_{x_i}\left({y}^{{\prime\prime}}\right),\left({y}^{\prime },{y}^{{\prime\prime}}\right)\in {Y}_i\times {\overline{Y}}_i\right\}\right| $$
(31)

One Error (OE) evaluates the fraction of instances whose top-ranked label is not among the ground-truth labels.

$$ OE=\frac{1}{t}\sum \limits_{i=1}^t\left\Vert \left[\arg \underset{y}{\max }\ {C}_{x_i}\left(y\right)\right]\notin {Y}_i\right\Vert $$
(32)

where ‖·‖ is the indicator function.

Average AUC (AUC) evaluates, averaged over all class labels, the fraction of positive-negative instance pairs in which the positive instance receives a higher confidence score than the negative one.

$$ AUC=\frac{1}{l}\sum \limits_{j=1}^l\frac{\left|\left\{\left({x}^{\prime },{x}^{{\prime\prime}}\right)\ \mid\ {C}_{x^{\prime }}\left({y}_j\right)\ge {C}_{x^{{\prime\prime} }}\left({y}_j\right),\left({x}^{\prime },{x}^{{\prime\prime}}\right)\in {Z}_j^p\times {Z}_j^n\right\}\right|}{\left|{Z}_j^p\right|\left|{Z}_j^n\right|} $$
(33)

Among these evaluation metrics, smaller values are better for HL, RL and OE, whereas larger values are better for SA, MF1, AP and AUC.
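For concreteness, a minimal numpy sketch of three of the bipartition-based metrics (HL, SA and MF1) on toy 0/1 matrices is given below; the example matrices are illustrative only.

```python
# Minimal sketch of Hamming Loss, Subset Accuracy and Micro-F1 for 0/1 matrices.
import numpy as np

def hamming_loss(Y, Y_pred):
    return np.mean(Y != Y_pred)                       # fraction of mismatched label entries

def subset_accuracy(Y, Y_pred):
    return np.mean(np.all(Y == Y_pred, axis=1))       # exact-match ratio per instance

def micro_f1(Y, Y_pred):
    tp = np.sum((Y == 1) & (Y_pred == 1))
    return 2 * tp / (Y.sum() + Y_pred.sum())          # Eq. (29)

Y      = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y, Y_pred), subset_accuracy(Y, Y_pred), micro_f1(Y, Y_pred))
```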

5.3 Compared algorithms

The two versions of the algorithm proposed in this paper are compared with the following state-of-the-art multi-label learning algorithms. For fairness and efficiency, the parameters of all compared algorithms are configured as suggested in their original papers.

MLKNN [8] It introduces maximum a posteriori probability (MAP) estimation to deal with multi-label classification based on the traditional KNN algorithm. The parameter k is searched in {3, 5, …, 21}.

LPLC [15] It explores local positive and negative pairwise label correlations based on the traditional KNN algorithm. The parameter k is searched in {3, 5, …, 21} and the trade-off parameter α in {0.1, 0.2, …, 1}.

LIFT [25] It learns label-specific features through clustering. Here, the base learner is a linear-kernel SVM and r = 0.1.

LLSF [26] It learns label-specific features. The regularization parameter λ is searched in {2−10, …, 210}.

Glocal [13] It explores both global and local label correlations when recovering missing labels. The parameter λ = 1, λ1 to λ5 are searched in {2−5, …, 21}, k in {0.1l, …, 0.6l}, and g in {2−10, …, 210}.

LSML [39] It directly learns label-specific features with missing labels. All parameters are searched in {2−5, …, 21}.

LSLC and LSLCnew These are the two versions of the algorithm proposed in this paper. They directly learn label-specific features under missing labels while considering both positive and negative label correlations. All parameters are searched in {2−10, …, 21}.

5.4 Comparison experiments

In the comparison experiments, we vary the missing rate of class labels from 10% to 60% in steps of 10%, and all data are normalized for fairness. For every compared algorithm, the average results on the eleven data sets in terms of each evaluation metric are reported in Tables 2, 3, 4, 5, 6, 7, and 8. Owing to space limitations, only the average values are given here; the complete results (mean ± std) are recorded in the uploaded zip file containing 11 Excel files. To better evaluate the performance of the algorithms, the Friedman test [41] and the Holm test [42] are used. Table 10 summarizes the Friedman statistic FF and the corresponding critical value for every metric. As shown in Table 10, at significance level α = 0.05, the null hypothesis that the compared algorithms perform equivalently is rejected for all evaluation metrics. Consequently, the Nemenyi test [41] is employed to test whether our method LSLC is competitive against the compared algorithms. The performance of two algorithms is significantly different if their average ranks differ by at least one critical difference \( CD={q}_{\alpha}\sqrt{\frac{k\left(k+1\right)}{6N}} \). Here, qα = 3.031 at significance level α = 0.05, so CD = 1.2924 (k = 8, N = 66). The detailed CD diagrams are shown in Fig. 1. The Holm test is based on the ranking of multiple classifiers. First, the statistic \( Z=\frac{Rank_i-{Rank}_j}{SE} \), which follows the standard normal distribution, is calculated, where \( SE=\sqrt{\frac{k\left(k+1\right)}{6N}} \) and Ranki is the average rank of the i-th classifier over all data sets. The corresponding probability p is then obtained from Z and sorted in ascending order. Finally, p is compared with α/(k − i) to decide whether the null hypothesis (that the two classifiers perform the same) is rejected. Table 9 summarizes the Holm test results.
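The critical difference used above can be reproduced with a few lines; the values q_α = 3.031, k = 8 and N = 66 are those reported in the text.

```python
# Reproducing the critical difference of the Nemenyi test in Section 5.4:
# CD = q_alpha * sqrt(k(k+1)/(6N)).
import numpy as np

q_alpha, k, N = 3.031, 8, 66
CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(round(CD, 4))   # 1.2924, matching the value reported in the text
```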

Table 2 Experimental results of each comparing algorithm on eleven data sets in terms of HL. For HL, the smaller the better. Best results are highlighted in bold
Table 3 Experimental results of each comparing algorithm on eleven data sets in terms of SA. For SA, the bigger the better. Best results are highlighted in bold
Table 4 Experimental results of each comparing algorithm on eleven data sets in terms of MF1. For MF1, the bigger the better. Best results are highlighted in bold
Table 5 Experimental results of each comparing algorithm on eleven data sets in terms of AP. For AP, the bigger the better. Best results are highlighted in bold
Table 6 Experimental results of each comparing algorithm on eleven data sets in terms of OE. For OE, the smaller the better. Best results are highlighted in bold
Table 7 Experimental results of each comparing algorithm on eleven data sets in terms of RL. For RL, the smaller the better. Best results are highlighted in bold
Table 8 Experimental results of each comparing algorithm on eleven data sets in terms of AUC. For AUC, the bigger the better. Best results are highlighted in bold
Fig. 1 The detailed Friedman test (CD) diagrams of the experimental results

Table 9 The Holm test result

Based on the results of the comparison experiments, the following observations can be made:

  • LSLC can outperform other state-of-the-art multi-label learning algorithms in some cases under different label missing rates. According to Tables 2, 3, 4, 5, 6, 7, and 8, we analyze the experimental results from two directions. Horizontally, among all the evaluation metrics, LSLC and LSLCnew perform slightly better on most data sets than the other comparison algorithms, and an analysis of Fig. 1 reveals that LSLC and LSLCnew rank relatively far to the left in the CD diagrams compared with the other algorithms. Vertically, the results of MLKNN, LPLC, LIFT, LLSF and Glocal fluctuate obviously as the label missing rate increases, whereas the results of LSML and LSLC change relatively smoothly and degrade only slowly within a small range. The main cause lies in the algorithms themselves: LPLC considers only positive label correlations; LIFT and LLSF consider only label-specific features; MLKNN, LPLC and LLSF cannot handle missing labels; and although Glocal takes label correlations and missing labels into account, it cannot select label-specific features.

  • It can be seen from Table 9 that the performance of LSLC differs from that of the MLKNN, LPLC, LIFT, LLSF and Glocal classifiers. Although LSLC and LSML are judged equivalent by the Holm test, the results in Fig. 1 indicate that LSLC performs better than LSML. LSML can handle missing labels, label correlations and label-specific features, but it uses only positive label correlations. In LSLC, the positive and negative label correlations are first learned from the incomplete training data, missing labels are then recovered using the learned correlations, and finally the multi-label classification model is built by iteratively selecting label-specific features.

  • LSLCnew still shows better performance compared with LSLC and is faster than LSLC according to the experimental results on all data sets.

In conclusion, our proposed algorithm shows competitive performance compared with other existing state-of-the-art algorithms for multi-label classification with missing labels (Table 10).

Table 10 The Friedman statistic FF and critical value

5.5 Convergence

In our proposed algorithm, the accelerated proximal gradient (APG) method is used to solve the objective loss function quickly. The loss function value w.r.t. the number of iterations on the 'medical' data set is shown in Fig. 2. An analysis of Fig. 2 reveals that the loss drops rapidly in the early iterations and then tends to be stable. A similar phenomenon is observed on the other data sets.

Fig. 2 Visualization of the loss function value w.r.t. the number of iterations over the 'medical' dataset

5.6 Analysis of model parameters

In this section, we analyze the learned model parameters, namely the label-specific features and the label correlations. First, we run LSLC on the 'medical' dataset with a 60% label missing rate and visualize the matrix W, shown in Fig. 3, where the horizontal and vertical axes represent the feature index and the label index, respectively. As can be seen from Fig. 3, the label-specific features (shown in white) can be selected, and each class label is determined by only a certain subset of features from the original feature space. At the same time, label-specific feature selection relies on the label matrix being complete, so the label correlations play an important role in handling the missing-label task; they are obtained through the learned parameters P and N as described in Sections 3.4 and 3.5, respectively. Similarly, we run LSLC on the 'medical' dataset and visualize the matrices P and N in Figs. 4 and 5, where both axes represent the label index and the white points indicate the non-zero entries of P and N. Observing Fig. 4, we conclude that each label has positive correlations with only a few other labels; the same phenomenon is observed for the negative correlations in Fig. 5.

Fig. 3 Visualization of the label-specific features over the 'medical' dataset

Fig. 4 Visualization of positive label correlations over the 'medical' dataset

Fig. 5 Visualization of negative label correlations over the 'medical' dataset

To see intuitively the impact of positive and negative label correlations on the classification results, we degenerate LSLC into two variants, LSLC-P and LSLC-N: the former considers only positive label correlations, while the latter considers only negative label correlations. For simplicity, we run these algorithms on the 'medical', 'stack-chemistry' and 'rcv1subsets1' datasets with a 60% label missing rate. As shown in Table 11, considering both positive and negative label correlations effectively strengthens the classification ability.

Table 11 The experimental results of these models. For HL, RL and OE, the smaller the better; whereas for AP and AUC, the bigger the better. Best results are highlighted in bold

5.7 Sensitivity to parameters

In this section, we analyze the three parameters λ1, λ5 and λ6, where λ1 controls the sparsity of the label-specific features, λ5 controls the positive label correlations and λ6 controls the negative label correlations. We conduct the experiment on the 'medical' data set with 60% missing labels, dynamically varying two of the parameters at a time while fixing the other four at their optimal values. We randomly select 80% of the data as training data and the rest as test data; the experiment is repeated ten times with five-fold cross validation. The results for the MF1 and AUC metrics are summarized in Fig. 6. The analysis reveals that LSLC is obviously sensitive to the trade-off parameters λ1, λ5 and λ6 within the examined range: when any of the three exceeds a certain range, the performance of LSLC is weakened.

Fig. 6 The sensitivity to parameters with 60% missing rate of class labels

6 Conclusion

In this paper, we have proposed a novel algorithm named LSLC, which handles label-specific features for multi-label learning with missing labels by taking both positive and negative label correlations into consideration. For missing labels, LSLC completes the missing label matrix through the positive and negative label correlations, and the completed label matrix then drives label-specific feature selection. Besides, APG is applied to solve the objective function iteratively and quickly. Furthermore, the experimental results show that both positive and negative label correlations contribute to the final classification performance. After a series of experiments on eleven benchmark data sets, our algorithm shows better performance and can better select label-specific features in multi-label classification tasks with missing labels.