1 Introduction

As an important branch of artificial intelligence, machine learning has been widely applied in practice, especially in intelligent applications. Many algorithms have been proposed, such as those of Zhou et al. [1], Bencherif et al. [2], Souza et al. [3], and Bae et al. [4], which mainly use fuzzy neural networks to solve related problems. Traditional supervised learning assigns a single label to each object, but in reality an object is often associated with multiple labels; this setting is called multi-label classification. Multi-label learning has been widely used in web page classification [5], image annotation [6], biological analysis [7] and other fields.

There are two main approaches to multi-label classification: problem transformation and algorithm adaptation. Problem transformation converts a multi-label classification task into a series of single-label classification tasks, as in MLSVM [6] (Multi-label Support Vector Machine); however, MLSVM does not take label correlations into account. In contrast, algorithm adaptation modifies an existing single-label classification algorithm to handle the multi-label case. For example, ML-KNN [8] (Multi-Label K Nearest Neighbor) introduces maximum a posteriori (MAP) estimation into the traditional KNN algorithm, but it also ignores label correlations. Many algorithms that exploit label correlations have therefore been proposed, such as [9,10,11,12,13,14]. In fact, there is also a mutual-exclusion phenomenon between labels, known as negative correlation; for example, both Huang et al. [15] and Wu et al. [16] combine positive and negative label correlations. Some scholars have also proposed methods for the class-imbalance problem in multi-label learning, such as [17,18,19].

In classification tasks, the curse of dimensionality often arises when the feature space is very large, and feature selection is one way to address it. Many feature selection algorithms have been proposed under the assumption that the selected features are relevant to all class labels, such as [20,21,22,23,24]. However, each class label may in fact be determined by a specific subset of features. For example, in text classification, if a document contains words such as "basketball" and "ping-pong ball", it is more likely to be labeled as "sport"; such words act as label-specific features. To exploit this, several algorithms have been proposed to select label-specific features [25,26,27].

Unfortunately, many state-of-the-art methods perform poorly when some labels are missing. Many multi-label classification algorithms rest on the assumption that all class labels are observed, which previous research has shown to be inappropriate: in many practical applications, it is difficult to obtain complete label information.

To handle missing labels in classification tasks, many algorithms have been proposed, such as [28,29,30,31,32,33], and they have proved effective. However, these algorithms use an identical data representation, composed of all features, to discriminate all class labels. How to recover missing labels using positive and negative label correlations learned from incomplete training data, and at the same time select label-specific features for each class label, therefore remains a hot spot in multi-label learning with missing labels. The motivation of this article is summarized as follows.

  • Each class label may be determined by only a subset of features in the original feature space.

  • Many algorithms deal with missing labels using only positive label correlations and ignore negative label correlations.

To address these two problems, we propose a novel algorithm that jointly selects label-specific features and recovers missing labels for multi-label learning by exploiting label correlations in a unified model. Compared with other algorithms, the innovation of the proposed method is that when completing the missing label matrix, we use not only the positive correlations between labels but also the negative correlations. The main contributions are summarized as follows.

  • This is the first model that combines label-specific feature selection, missing-label recovery, and both positive and negative label correlations.

  • The positive and negative label correlations can be learned directly from the incomplete training datasets and used to complete the label matrix, while the multi-label classification model is built by iteratively selecting label-specific features.

  • Our proposed algorithm achieves competitive performance against existing state-of-the-art algorithms in the experiments of this paper.

The rest of this paper is organized as follows. Section 2 reviews related work on label-specific features, missing labels and label correlations. In Section 3, we propose two versions of the algorithm. Section 4 gives the optimization of the proposed algorithm. The experimental setup and the conclusion are presented in Sections 5 and 6, respectively.

2 Related work

2.1 Multi-label learning with label-specific feature

In multi-label learning, classification results benefit when label-specific features can be found. Zhang et al. [25] proposed LIFT, which clusters the positive and negative instances of each class label, computes the distances between the original data and the cluster centers, and uses these distances as additional features. Similarly, LSDM [34] applies clustering with different ratio parameters to the positive and negative instances of each class label and builds label-specific features from distance information and spatial topology information. However, neither of them considers label correlations. By contrast, LSF-CI [35] selects label-specific features for each class label by modeling label correlations in the label space and the feature space simultaneously.

Huang et al. [26] and Weng et al. [36] proposed LLSF-DL and LF-LPLC respectively: the former jointly learns label-specific features and class-dependent labels by sparse stacking, while the latter combines label-specific features with local pairwise label correlations. Although these algorithms can handle label-specific features, they do not perform particularly well when labels are missing. Some scholars consider the issue from other angles, such as [37, 38]. NSLSF [37] translates logical labels into numerical ones and uses the numerical labels to extract label-specific features. FSSL [38] selects label-specific features for each newly arrived label by designing inter-class discrimination and intra-class neighbor recognition.

2.2 Multi-label learning with missing labels

It is inappropriate to assume that all labels are observed, and many algorithms have therefore been designed for missing labels, such as [28,29,30]. Wu et al. [28] proposed a new model for multi-label learning with missing labels and showed that it alleviates the missing-label problem. Maxide [29] speeds up matrix completion by exploiting side information, under the assumption that the matrix to be completed and the side-information matrix share latent information. SVMNN [30] is based on SVM; it constructs two graphs to capture sample smoothness and class smoothness respectively, and then combines the two factors in its loss function. The algorithms mentioned above assume that it is unknown which labels of each instance are missing, whereas [32, 33] assume that the locations of the missing labels are partially known. SSWL [33] is a semi-supervised method that recovers missing labels by considering both instance similarity and label similarity. iMVML [32] explores a shared subspace for multi-view multi-label learning, learning the subspace from incomplete views together with weak labels, label correlations and a predictor in this subspace. Besides, some frameworks handle label correlations, missing labels and other issues simultaneously, such as [31, 39]. LSML [39] handles multi-label tasks with missing labels, label correlations and label-specific feature selection; it is similar to our method in that it selects label-specific features through l1 regularization. MLMLFS [31] jointly performs feature selection with missing labels and models the classifier in a unified framework, using l2,p regularization to select the most discriminative features and a manifold regularization to explore label correlations. In short, how to find the label correlations is the most critical issue.

2.3 Multi-label learning with label correlations

Whether for missing labels or feature selection, label correlations play an important role and have been shown to improve the classification ability of multi-label learning algorithms; examples are ML-LOC [10] and Glocal [13]. ML-LOC obtains local label correlations by clustering and incorporates them into the feature space, while Glocal extends ML-LOC by exploiting both global and local label correlations. However, these are only positive label correlations. There is also mutual exclusion between labels, called negative correlation, as in RAkEL [12], LPLC [15] and CPNL [16]. LPLC explores positive and negative pairwise label correlations during training by computing, on the nearest-neighbor sets, the maximum posterior probability of each class label given another. Similarly, RAkEL is an improved version of [14]; it incorporates positive and negative correlations into the k-labelsets by calculating the cosine similarity of two matrices. CPNL, in contrast, does not obtain positive and negative label correlations through explicit calculation but models them in a unified framework. In this paper, we therefore propose to jointly perform label-specific feature selection and exploit label correlations for multi-label learning with missing labels.

3 The proposed model

3.1 Notations

For an arbitrary matrix W, Wi and Wj denote the i-th row and the j-th column of W, respectively, and WT denotes the transpose of W. In multi-label learning, we are given a dataset \( D={\left\{\left({\mathbf{x}}_i,{\mathbf{y}}_i\right)\right\}}_{i=1}^n \), where xi ∈ Rm denotes the feature vector of the i-th instance and yi ∈ {0, 1}l its ground-truth label vector; m is the number of features and l the number of class labels associated with an instance. For simplicity, the training data are written as matrices X = [x1, x2, …, xn]T ∈ Rn × m and Y = [y1, y2, …, yn]T ∈ {0, 1}n × l, where yij = 1 indicates that the i-th instance is associated with the j-th label, and yij = 0 indicates that the i-th instance is not associated with the j-th label or that this label is missing.

3.2 Label-specific feature selection

Given the dataset, we assume that each class label is determined only by a certain subset of specific features from the original feature space. To select these features, i.e., the label-specific features, l1 regularization is used in a linear model. Naturally, the label-specific features are identified by the non-zero entries of each wi ∈ Rm, and the optimization problem is defined as

$$ \underset{\mathbf{W}}{\min}\frac{1}{2}{\left|\left|\mathbf{XW}-\mathbf{Y}\right|\right|}_F^2+{\lambda}_1{\left|\left|\begin{array}{c}\mathbf{W}\end{array}\right|\right|}_1 $$
(1)

where W = [w1, w2, …, wl] ∈ Rm × l is the regression coefficient matrix and λ1 is the trade-off parameter. \( {\left\Vert \ast \right\Vert}_F^2 \) denotes the squared Frobenius norm, i.e., the sum of the squares of all elements of the matrix ∗.
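To make Eq. (1) concrete, here is a minimal numpy sketch (synthetic data and an illustrative λ1, not values from the paper) that solves the l1-regularized problem by proximal gradient descent and reads off the label-specific features of each label as the non-zero entries of the corresponding column of W. It is a sketch of the idea, not the paper's implementation.

```python
# Minimal sketch of Eq. (1): l1-regularized least squares solved by proximal
# gradient (ISTA). Data, lambda_1 and the step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 100, 20, 4                      # instances, features, labels
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) > 0.7).astype(float)

lam1 = 0.1
L = np.linalg.norm(X.T @ X, 2)            # Lipschitz constant of the smooth part
W = np.zeros((m, l))

def soft_threshold(A, t):
    """Element-wise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

for _ in range(200):
    grad = X.T @ (X @ W - Y)              # gradient of 0.5*||XW - Y||_F^2
    W = soft_threshold(W - grad / L, lam1 / L)

# Label-specific features: the non-zero entries of each column of W.
for k in range(l):
    print(f"label {k}: features {np.flatnonzero(np.abs(W[:, k]) > 1e-8)}")
```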

3.3 Learning multi-label with missing label

In previous studies, it is assumed that the labels of the given training data are complete and fully observable. However, this assumption often does not hold: since labeling is a huge undertaking, missing labels have become a hot topic in multi-label learning. To solve this problem, we propose to recover the incomplete label matrix into a complete one by making full use of label correlations. First, we assume that any unobserved (missing) label can be inferred from its relationships with the observed labels. Let P ∈ Rl × l be the label-correlation matrix, where pij represents the degree of correlation between the i-th label and the j-th label. Following the idea of Huang et al. [39], we assume that each class label may be correlated with only a subset of class labels, and therefore add an l1-norm regularizer over P. The optimization problem can be rewritten as

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{YP}\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_4\left\|\mathbf{YP}-\mathbf{Y}\right\|_F^2 \\
s.t.\quad & \mathbf{P}\ge 0
\end{aligned} $$
(2)
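To illustrate the intuition behind Eq. (2), the toy sketch below completes an incomplete label matrix through a hand-crafted correlation matrix P; both Y and P here are illustrative values, not learned ones.

```python
# Illustrative sketch of the idea behind Eq. (2): an incomplete label matrix Y
# is completed through a label-correlation matrix P as Y P. Toy values only.
import numpy as np

# Three labels; label 0 and label 1 are strongly correlated.
P = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# The second instance has label 0 observed but label 1 missing (coded as 0).
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

scores = Y @ P                 # correlation-propagated label scores
Y_completed = (scores >= 0.5).astype(int)
print(Y_completed)             # the missing entry (row 1, label 1) is recovered
```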

3.4 Learning positive label correlations

In multi-label classification, accuracy can be greatly improved if the relationships between labels are taken into account. We therefore incorporate pairwise label correlations by constraining the Euclidean distances between pairs of coefficient vectors with the learned label-correlation matrix P. If two labels yi and yj are positively correlated, i.e., pij or pji is large, their label-specific features are more likely to be the same; accordingly, the corresponding regression coefficients wi and wj should be similar, and vice versa. Based on this, we add a regularizer built from the Euclidean distance between wi and wj as follows,

$$ \begin{aligned}
\mathrm{POS} &= \frac{1}{2}\sum_{i,j=1}^{l} p_{ij}\left\|\mathbf{W}^i-\mathbf{W}^j\right\|_2^2 \\
&= \frac{1}{2}\left(\sum_{i,j=1}^{l} p_{ij}\left(\left\|\mathbf{W}^i\right\|_2^2+\left\|\mathbf{W}^j\right\|_2^2\right)-2\sum_{i,j=1}^{l} p_{ij}\left\langle \mathbf{W}^i,\mathbf{W}^j\right\rangle\right) \\
&= tr\left(\mathbf{W}{\mathbf{D}}_p{\mathbf{W}}^T\right)- tr\left(\mathbf{W}\mathbf{P}{\mathbf{W}}^T\right) \\
&= tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right)
\end{aligned} $$
(3)

where Lp = Dp − P ∈ Rl × l is the Laplacian matrix and Dp is the diagonal matrix whose diagonal entries are given by P1, with 1 the all-ones vector. Thus, the optimization problem can be rewritten as,

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{YP}\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_4\left\|\mathbf{YP}-\mathbf{Y}\right\|_F^2+{\lambda}_5\, tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right) \\
s.t.\quad & \mathbf{P}\ge 0
\end{aligned} $$
(4)
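As a sanity check on the Laplacian identity used in Eq. (3), the following sketch verifies numerically that 0.5 * Σij pij ||W^i - W^j||_2^2 equals tr(W Lp W^T) for a symmetric non-negative P; the matrices are random placeholders.

```python
# Numerical check of the identity in Eq. (3) with L_p = D_p - P.
import numpy as np

rng = np.random.default_rng(1)
m, l = 6, 4
W = rng.standard_normal((m, l))           # columns are label-wise coefficient vectors
P = rng.random((l, l))
P = (P + P.T) / 2                          # symmetric label-correlation matrix

D_p = np.diag(P @ np.ones(l))              # D_p = diag(P 1)
L_p = D_p - P                              # graph Laplacian of P

pairwise = 0.5 * sum(P[i, j] * np.sum((W[:, i] - W[:, j]) ** 2)
                     for i in range(l) for j in range(l))
trace_form = np.trace(W @ L_p @ W.T)
print(np.isclose(pairwise, trace_form))    # True
```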

3.5 Learning negative label correlations

In fact, the label correlations discussed above are positive label correlations. There is also mutual exclusion between labels, called negative label correlation [16]. For example, for "aquarium", the relationship with "seal" is relatively close, whereas the relationship with "tiger" is relatively weak. Similarly to the positive case, we learn a negative label-correlation matrix N: if two labels yi and yj are negatively correlated, i.e., nij or nji is large, their label-specific features are less likely to coincide, so the corresponding regression coefficients wi and wj should be dissimilar; in other words, wi should be close to −wj, and the Euclidean distance between wi and wj should be large. Therefore, we add the following regularizer,

$$ \begin{aligned}
\mathrm{NEG} &= \frac{1}{2}\sum_{i,j=1}^{l} n_{ij}\left\|\mathbf{W}^i-\left(-\mathbf{W}^j\right)\right\|_2^2 \\
&= \frac{1}{2}\left(\sum_{i,j=1}^{l} n_{ij}\left(\left\|\mathbf{W}^i\right\|_2^2+\left\|\mathbf{W}^j\right\|_2^2\right)+2\sum_{i,j=1}^{l} n_{ij}\left\langle \mathbf{W}^i,\mathbf{W}^j\right\rangle\right) \\
&= tr\left(\mathbf{W}{\mathbf{D}}_n{\mathbf{W}}^T\right)+ tr\left(\mathbf{W}\mathbf{N}{\mathbf{W}}^T\right) \\
&= tr\left(\mathbf{W}{\mathbf{L}}_n{\mathbf{W}}^T\right)
\end{aligned} $$
(5)

where Ln = Dn + N ∈ Rl × l and Dn is the diagonal matrix whose diagonal entries are given by N1, with 1 the all-ones vector. Note that we assume the overall relationship between labels is a combination of positive and negative label correlations; for simplicity, we adopt the linear combination P + N here. Thus, the final optimization problem can be rewritten as,

$$ \begin{aligned}
\underset{\mathbf{W},\mathbf{P},\mathbf{N}}{\min}\quad & \frac{1}{2}\left\|\mathbf{XW}-\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)\right\|_F^2+{\lambda}_1\left\|\mathbf{W}\right\|_1+{\lambda}_2\left\|\mathbf{P}\right\|_1+{\lambda}_3\left\|\mathbf{N}\right\|_1 \\
& +{\lambda}_4\left\|\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)-\mathbf{Y}\right\|_F^2+{\lambda}_5\, tr\left(\mathbf{W}{\mathbf{L}}_p{\mathbf{W}}^T\right)+{\lambda}_6\, tr\left(\mathbf{W}{\mathbf{L}}_n{\mathbf{W}}^T\right) \\
s.t.\quad & \mathbf{P}\ge 0,\ \mathbf{N}\ge 0
\end{aligned} $$
(6)
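To keep the roles of the six trade-off parameters explicit, the following hedged numpy sketch simply evaluates the objective of Eq. (6) for given W, P and N; all inputs and λ values are illustrative assumptions, not learned quantities.

```python
# Evaluating the objective of Eq. (6) for given W, P, N (illustrative inputs).
import numpy as np

def lslc_objective(X, Y, W, P, N, lams):
    l1, l2, l3, l4, l5, l6 = lams
    labels = Y @ (P + N)                              # correlation-combined label matrix
    D_p = np.diag(P.sum(axis=1)); L_p = D_p - P       # positive-correlation Laplacian
    D_n = np.diag(N.sum(axis=1)); L_n = D_n + N       # negative-correlation counterpart
    loss = 0.5 * np.linalg.norm(X @ W - labels, 'fro') ** 2
    loss += l1 * np.abs(W).sum() + l2 * np.abs(P).sum() + l3 * np.abs(N).sum()
    loss += l4 * np.linalg.norm(labels - Y, 'fro') ** 2
    loss += l5 * np.trace(W @ L_p @ W.T) + l6 * np.trace(W @ L_n @ W.T)
    return loss

rng = np.random.default_rng(2)
n, m, l = 50, 10, 5
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) > 0.7).astype(float)
W = rng.standard_normal((m, l)) * 0.1
P = rng.random((l, l)) * 0.1
N = rng.random((l, l)) * 0.1
print(lslc_objective(X, Y, W, P, N, lams=(0.1,) * 6))
```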

4 Model optimization and prediction

In this section, considering that the objective function is non-smooth, the accelerated proximal gradient method is applied, following [39]. For simplicity, let Φ denote the model parameters, that is, Φ = (W, P, N). Then the objective function (6) can be written as,

$$ \underset{\boldsymbol{\Phi}}{\min}\left\{F\left(\boldsymbol{\Phi} \right)\equiv f\left(\boldsymbol{\Phi} \right)+g\left(\boldsymbol{\Phi} \right)\right\} $$
(7)

where

$$ {\displaystyle \begin{array}{cc}f\left(\boldsymbol{\Phi} \right)=& \frac{1}{2}{\left|\left|\mathbf{XW}-\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)\right|\right|}_F^2+{\lambda}_4{\left|\left|\begin{array}{c}\mathbf{Y}\left(\mathbf{P}+\mathbf{N}\right)-\mathbf{Y}\end{array}\right|\right|}_F^2\\ {}& +{\lambda}_5 tr\left(\mathbf{W}{\mathbf{L}}_{\mathbf{P}}{\mathbf{W}}^T\right)+{\lambda}_6 tr\left(\mathbf{W}{\mathbf{L}}_{\mathbf{N}}{\mathbf{W}}^T\right)\end{array}} $$
(8)
$$ g\left(\boldsymbol{\Phi} \right)={\lambda}_1{\left|\left|\mathbf{W}\right|\right|}_1+{\lambda}_2{\left|\left|\begin{array}{c}\mathbf{P}\end{array}\right|\right|}_1+{\lambda}_3{\left|\left|\begin{array}{c}\mathbf{N}\end{array}\right|\right|}_1 $$
(9)

Note that although both f(Φ) and g(Φ) are convex, the latter is not smooth. First, we use the following function defined in [40].

$$ {Q}_L\left(\boldsymbol{\Phi}, {\boldsymbol{\Phi}}^i\right)=f\left({\boldsymbol{\Phi}}^i\right)+\left\langle \nabla f\left({\boldsymbol{\Phi}}^i\right),\boldsymbol{\Phi} -{\boldsymbol{\Phi}}^i\right\rangle +\frac{L}{2}{\left\Vert \boldsymbol{\Phi} -{\boldsymbol{\Phi}}^i\right\Vert}_F^2+g\left(\boldsymbol{\Phi} \right) $$
(10)

Here, for any L ≥ Lf, QL(Φ, Φi) ≥ F(Φ) always holds, where Lf is the Lipschitz constant of ∇f. The proximal algorithm then minimizes a sequence of separable quadratic approximations of F(Φ). Similar to [40], Φ can be obtained by minimizing QL(Φ, Φi), computed as follows,

$$ {\displaystyle \begin{array}{c}{\boldsymbol{\Phi}}^{\ast }=\arg \underset{\boldsymbol{\Phi}}{\min }{Q}_{\mathbf{L}}\left(\boldsymbol{\Phi}, {\boldsymbol{\Phi}}^i\right)\\ {}\kern1.65em =\arg \underset{\boldsymbol{\Phi}}{\min }g\left(\boldsymbol{\Phi} \right)+\frac{L}{2}{\left|\left|\boldsymbol{\Phi} -{G}^i\right|\right|}_F^2\end{array}} $$
(11)

where

$$ {\mathbf{G}}^i={\boldsymbol{\Phi}}^i-\frac{1}{\mathbf{L}}\bigtriangledown f\left({\boldsymbol{\Phi}}^i\right) $$
(12)

The convergence can be accelerated by setting \( {\boldsymbol{\Phi}}^i={\boldsymbol{\Phi}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\boldsymbol{\Phi}}_i-{\boldsymbol{\Phi}}_{i-1}\right) \) for a sequence αi satisfying \( {\alpha}_{i+1}^2-{\alpha}_{i+1}\le {\alpha}_i^2 \), where Φi is the i-th iterate of Φ.
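The overall APG scheme of Eqs. (10)-(12) can be sketched generically as follows; grad_f, prox_g, L and the initial point are placeholders supplied by the caller, so this is an illustration of the update structure rather than the paper's exact Algorithm 1.

```python
# A generic accelerated-proximal-gradient (FISTA-style) skeleton matching
# Eqs. (10)-(12): extrapolate with the alpha sequence, take a gradient step of
# length 1/L, then apply the proximal operator of g.
import numpy as np

def apg(Phi0, grad_f, prox_g, L, n_iter=100):
    Phi_prev = Phi_cur = Phi0
    alpha_prev, alpha_cur = 1.0, 1.0
    for _ in range(n_iter):
        # extrapolation point Phi^i = Phi_i + ((alpha_{i-1} - 1)/alpha_i)(Phi_i - Phi_{i-1})
        Phi_ex = Phi_cur + ((alpha_prev - 1.0) / alpha_cur) * (Phi_cur - Phi_prev)
        G = Phi_ex - grad_f(Phi_ex) / L               # Eq. (12)
        Phi_next = prox_g(G, 1.0 / L)                 # Eq. (11)
        # alpha update satisfying alpha_{i+1}^2 - alpha_{i+1} <= alpha_i^2
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha_cur ** 2)) / 2.0
        Phi_prev, Phi_cur = Phi_cur, Phi_next
        alpha_prev, alpha_cur = alpha_cur, alpha_next
    return Phi_cur

# Example usage on the lasso-type subproblem min_W 0.5*||XW - Y||_F^2 + lam*||W||_1:
# apg(np.zeros((m, l)),
#     grad_f=lambda W: X.T @ (X @ W - Y),
#     prox_g=lambda A, t: np.sign(A) * np.maximum(np.abs(A) - lam * t, 0.0),
#     L=np.linalg.norm(X.T @ X, 2))
```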

4.1 Update W

First, we update W with P and N fixed; the derivative of function (8) with respect to W is

$$ {\bigtriangledown}_{\mathbf{W}}f\left(\boldsymbol{\Phi} \right)={\mathbf{X}}^T\mathbf{XW}-{\mathbf{X}}^T\mathbf{YP}-{\mathbf{X}}^T\mathbf{YN}+{\lambda}_5\mathbf{W}\left({\mathbf{L}}_{\mathrm{p}}+{\mathbf{L}}_{\mathrm{p}}^T\right)+{\lambda}_6\mathbf{W}\left({\mathbf{L}}_{\mathrm{n}}+{\mathbf{L}}_{\mathrm{n}}^T\right) $$
(13)

Then, according to Eq. (11), W can be updated by

$$ {\mathbf{W}}^i={\mathbf{W}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{W}}_i-{\mathbf{W}}_{i-1}\right) $$
(14)
$$ {\mathbf{W}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{W}}^i-\frac{1}{L}{\bigtriangledown}_{\mathrm{W}}f\left({\mathbf{W}}^i,\mathbf{P},\mathbf{N}\right)\right) $$
(15)

where ε is the step size. Because the term involving W in g(Φ) is an l1-norm regularizer, the element-wise soft-thresholding operator is adopted to solve it, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({w}_{ij}\right)={\left(|{w}_{ij}|-\epsilon \right)}_{+}\operatorname{sign}\left({w}_{ij}\right) $$
(16)

where (⋅)+ =  max (⋅, 0).
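As a short illustration, the element-wise operator of Eq. (16) can be coded directly; the matrix and ε below are toy values.

```python
# Element-wise soft-thresholding operator of Eq. (16), used in the W update.
import numpy as np

def prox_l1(A, eps):
    """prox_eps(a) = (|a| - eps)_+ * sign(a), applied element-wise."""
    return np.maximum(np.abs(A) - eps, 0.0) * np.sign(A)

W = np.array([[0.8, -0.05], [-0.3, 0.02]])
print(prox_l1(W, eps=0.1))   # small entries are zeroed, large ones shrink toward 0
```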

4.2 Update P

Similarly, we update P with W and N fixed; the derivative of function (8) with respect to P is

$$ {\bigtriangledown}_{\mathrm{P}}f\left(\boldsymbol{\Phi} \right)=\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{P}+\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{N}-{\mathbf{Y}}^T\mathbf{XW}-{\lambda}_4{\mathbf{Y}}^T\mathbf{Y} $$
(17)

Thus, P can be updated by

$$ {\mathbf{P}}^i={\mathbf{P}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{P}}_i-{\mathbf{P}}_{i-1}\right) $$
(18)
$$ {\mathbf{P}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{P}}^i-\frac{1}{L}{\bigtriangledown}_{\mathrm{P}}f\left(\mathbf{W},{\mathbf{P}}^i,\mathbf{N}\right)\right) $$
(19)

where ε is the step size. Because the term involving P in g(Φ) is a nonnegative l1-norm regularizer, the nonnegative soft-thresholding operator is adopted, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({p}_{ij}\right)={\left({p}_{ij}-\epsilon \right)}_{+} $$
(20)

where (⋅)+ =  max (⋅, 0).

4.3 Update N

Similarly, we update N with W and P fixed; the derivative of function (8) with respect to N is

$$ {\bigtriangledown}_{\mathrm{N}}f\left(\Phi \right)=\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{P}+\left(1+{\lambda}_4\right){\mathbf{Y}}^T\mathbf{Y}\mathbf{N}-{\mathbf{Y}}^T\mathbf{XW}-{\lambda}_4{\mathbf{Y}}^T\mathbf{Y} $$
(21)

Thus, N can be updated by

$$ {\mathbf{N}}^i={\mathbf{N}}_i+\frac{\alpha_{i-1}-1}{\alpha_i}\left({\mathbf{N}}_i-{\mathbf{N}}_{i-1}\right) $$
(22)
$$ {\mathbf{N}}^{i+1}={\mathbf{prox}}_{\epsilon}\left({\mathbf{N}}^i-\frac{1}{L}{\nabla}_{\mathbf{N}}f\left(\mathbf{W},\mathbf{P},{\mathbf{N}}^i\right)\right) $$
(23)

where ε is the step size. Because the term involving N in g(Φ) is a nonnegative l1-norm regularizer, the nonnegative soft-thresholding operator is adopted, computed as follows,

$$ {\mathbf{prox}}_{\epsilon}\left({n}_{ij}\right)={\left({n}_{ij}-\epsilon \right)}_{+} $$
(24)

where (⋅)+ =  max (⋅, 0). Note that the derivative of function (8) with respect to P is the same as that with respect to N, because we assume the label relationship is the simplest linear combination of P and N, that is, P + N. All optimization steps are summarized in Algorithm 1, called LSLC for simplicity.

figure a (Algorithm 1: LSLC)
figure b (Algorithm 2: LSLCnew)
figure c (Algorithm 3: model prediction)
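For reference, the block-wise gradients of Eqs. (13), (17) and (21) that drive the algorithms can be transcribed directly into numpy; the functions below are a sketch and assume inputs with the shapes of Section 3.1.

```python
# Block-wise gradients of f(Phi), transcribed from Eqs. (13), (17) and (21).
import numpy as np

def grad_W(X, Y, W, P, N, L_p, L_n, lam5, lam6):
    # Eq. (13)
    return (X.T @ X @ W - X.T @ Y @ P - X.T @ Y @ N
            + lam5 * W @ (L_p + L_p.T) + lam6 * W @ (L_n + L_n.T))

def grad_P(X, Y, W, P, N, lam4):
    # Eq. (17); Eq. (21) for N has the same form because P and N enter f only
    # through the linear combination P + N.
    return ((1 + lam4) * Y.T @ Y @ P + (1 + lam4) * Y.T @ Y @ N
            - Y.T @ X @ W - lam4 * Y.T @ Y)

grad_N = grad_P   # identical expression, as noted in the text
```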

4.4 Lipschitz constant

Given two sets of model parameters Φ1 = (W1, P1, N1) and Φ2 = (W2, P2, N2), we have

$$ \begin{aligned}
\left\|\nabla f\left({\boldsymbol{\Phi}}_1\right)-\nabla f\left({\boldsymbol{\Phi}}_2\right)\right\|_F^2
&= \left\| \begin{array}{l}
\mathbf{X}^T\mathbf{X}\Delta\mathbf{W}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{P}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{N}+{\lambda}_5\Delta\mathbf{W}\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)+{\lambda}_6\Delta\mathbf{W}\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\\
\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}\\
\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}
\end{array}\right\|_F^2 \\
&= \left\|\mathbf{X}^T\mathbf{X}\Delta\mathbf{W}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{P}-\mathbf{X}^T\mathbf{Y}\Delta\mathbf{N}+{\lambda}_5\Delta\mathbf{W}\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)+{\lambda}_6\Delta\mathbf{W}\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_F^2 \\
&\quad + 2\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{P}+\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\Delta\mathbf{N}-\mathbf{Y}^T\mathbf{X}\Delta\mathbf{W}\right\|_F^2 \\
&\le 5\left\|\mathbf{X}^T\mathbf{X}\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{P}\right\|_F^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{N}\right\|_F^2 \\
&\quad + 5\left\|{\lambda}_5\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2+5\left\|{\lambda}_6\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2 \\
&\quad + 3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{P}\right\|_F^2+3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\left\|\Delta\mathbf{N}\right\|_F^2+3\left\|\mathbf{Y}^T\mathbf{X}\right\|_2^2\left\|\Delta\mathbf{W}\right\|_F^2 \\
&\le \left(5\left\|\mathbf{X}^T\mathbf{X}\right\|_2^2+5\left\|\mathbf{X}^T\mathbf{Y}\right\|_2^2+5\left\|{\lambda}_5\left({\mathbf{L}}_p+{\mathbf{L}}_p^T\right)\right\|_2^2+5\left\|{\lambda}_6\left({\mathbf{L}}_n+{\mathbf{L}}_n^T\right)\right\|_2^2+3\left\|\left(1+{\lambda}_4\right)\mathbf{Y}^T\mathbf{Y}\right\|_2^2\right)\left\|\begin{array}{c}\Delta\mathbf{W}\\ \Delta\mathbf{P}\\ \Delta\mathbf{N}\end{array}\right\|_F^2
\end{aligned} $$
(25)

Thus, the Lipschitz constant of optimization problem (6) can be calculated by

$$ {\mathbf{L}}_f=\sqrt{5{\left|\left|{\mathbf{X}}^{\mathbf{T}}\mathbf{X}\right|\right|}_2^2+5{\left|\left|{\mathbf{X}}^{\mathbf{T}}\mathbf{Y}\right|\right|}_2^2+5{\left|\left|\begin{array}{c}{\lambda}_5\left({\mathbf{L}}_{{\mathbf{p}}^{\prime }}\right)\end{array}\right|\right|}_2^2+5{\left|\left|\begin{array}{c}{\lambda}_6\left({\mathbf{L}}_{{\mathrm{n}}^{\prime }}\right)\end{array}\right|\right|}_2^2+3{\left|\left|\begin{array}{c}\left(1+{\lambda}_4\right){\mathbf{Y}}^{\mathbf{T}}\mathbf{Y}\end{array}\right|\right|}_2^2} $$
(26)

where \( {\mathbf{L}}_{{\mathbf{p}}^{\prime }}={\mathbf{L}}_{\mathbf{p}}+{\mathbf{L}}_{\mathbf{p}}^{\mathbf{T}} \) and \( {\mathbf{L}}_{{\mathbf{n}}^{\prime }}={\mathbf{L}}_{\mathbf{n}}+{\mathbf{L}}_{\mathbf{n}}^{\mathbf{T}} \).
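A direct transcription of Eq. (26) into numpy might look as follows; the inputs are placeholders for the training matrices and the current Laplacian matrices.

```python
# Computing the Lipschitz constant of Eq. (26) with spectral (2-)norms.
import numpy as np

def lipschitz_constant(X, Y, L_p, L_n, lam4, lam5, lam6):
    def s(A):
        return np.linalg.norm(A, 2) ** 2               # squared spectral norm
    return np.sqrt(5 * s(X.T @ X)
                   + 5 * s(X.T @ Y)
                   + 5 * s(lam5 * (L_p + L_p.T))
                   + 5 * s(lam6 * (L_n + L_n.T))
                   + 3 * s((1 + lam4) * (Y.T @ Y)))
```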

4.5 Time complexity

In this part, we briefly analyze the time complexity of the optimization algorithm. Consider the matrices X ∈ Rn × m, Y ∈ {0, 1}n × l, W ∈ Rm × l, P ∈ Rl × l, N ∈ Rl × l and L ∈ Rl × l, where n denotes the number of instances, l the number of labels and m the dimension of the instances. In Algorithm 1 (LSLC), computing the positive and negative label correlation matrices Lp and Ln in step 3 takes O(l2); the other key steps lie in the calculation of the Lipschitz constant and the matrix derivatives. In step 4, calculating the Lipschitz constant takes O(m3 + l3 + mnl). The matrix derivatives are computed in steps 6, 9 and 12: the derivative with respect to W takes O(nm2 + ml2 + lm2 + mnl), and since the derivatives with respect to P and N are similar, each of them takes O(nl2 + l3 + ml2 + mnl). As mentioned above, the cost of calculating the Lipschitz constant can be reduced by finding an appropriate L ≥ Lf [40], which takes O(ml2 + m2n + m2l + nl2 + mnl + nl + l2 + l3). These steps are summarized in Algorithm 2, called LSLCnew for simplicity.

4.6 Model prediction

In this section, we use the two optimization algorithms described above to obtain the matrix W. For an unlabeled test set Dts, we predict its labels by calculating the score matrix Sc = DtsW and then applying Sign(Sc − τ) with a given threshold τ, where τ is set to 0.5 in the experiments. The prediction process is summarized in Algorithm 3.
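A minimal sketch of this prediction step, assuming X_test and the learned W are numpy arrays:

```python
# Prediction as described in Section 4.6: score the test instances with the
# learned W and threshold at tau = 0.5.
import numpy as np

def predict(X_test, W, tau=0.5):
    Sc = X_test @ W                       # score matrix, one column per label
    return (np.sign(Sc - tau) > 0).astype(int)
```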

5 Experiment

5.1 Data sets

The proposed algorithms are evaluated on benchmark multi-label data sets; their statistics are shown in Table 1. For each data set, we randomly select 80% of the data as training data and the rest as test data. The experiment is repeated ten times and five-fold cross validation is used.

Table 1 The statistical information of data sets

5.2 Evaluation measures

To evaluate the performance of the experimental results, we use seven evaluation metrics. For simplicity, let t denote the number of test instances, let Yi and \( {\overset{\sim }{\mathbf{Y}}}_i \) denote the sets of ground-truth labels and predicted labels of the i-th instance, respectively, and let \( {Z}_j^{\mathbf{P}} \) and \( {Z}_j^{\mathbf{N}} \) denote the sets of positive and negative instances of the j-th class label. \( {C}_{{\mathrm{x}}_i}\left(y\right) \) is the confidence score of xi for label y. The evaluation metrics are computed as follows,

Hamming Loss (HL) evaluates the fraction of mismatched labels between the ground-truth label set Yi and the predicted label set \( {\overset{\sim }{\mathbf{Y}}}_i \).

$$ HL=\frac{1}{t}\sum \limits_{i=1}^t\left(\frac{1}{l}\left|{Y}_i\triangle {\overset{\sim }{Y}}_i\right|\right) $$
(27)

where △ denotes the symmetric difference between the two label sets.

Subset Accuracy (SA) evaluates, in a stricter way, the exact agreement between the ground-truth labels Yi and the predicted labels \( {\overset{\sim }{\mathbf{Y}}}_i \).

$$ SA=\frac{1}{t}\sum \limits_{i=1}^t\left\Vert {\mathbf{Y}}_i={\overset{\sim }{\mathbf{Y}}}_i\right\Vert $$
(28)

where ‖·‖ is the indicator function.

Micro F1 (MF1) is an extended version of the F1 measure; it treats each entry of the label matrix as a single binary prediction, regardless of the distinction between labels.

$$ MF1=\frac{2\sum \limits_{j=1}^l\sum \limits_{i=1}^t{Y}_{ij}{\overset{\sim }{Y}}_{ij}}{\sum \limits_{j=1}^l\sum \limits_{i=1}^t{Y}_{ij}+\sum \limits_{j=1}^l\sum \limits_{i=1}^t{\overset{\sim }{Y}}_{ij}} $$
(29)

Average Precision (AP) evaluates the average fraction of relevant labels that are ranked higher than a particular relevant label.

$$ AP=\frac{1}{t}\sum \limits_{i=1}^t\frac{1}{\left|{Y}_i\right|}\sum \limits_{y\in {Y}_i}\frac{\left|\left\{{y}^{\prime}\in {Y}_i\ \mid\ {rank}_{x_i}\left({y}^{\prime}\right)\le {rank}_{x_i}\left(y\right)\right\}\right|}{{rank}_{x_i}\left(y\right)} $$
(30)

where \( {rank}_{x_i}(y) \) denotes the rank of label y when all labels are sorted in descending order of \( {C}_{x_i}\left(\cdot \right) \).

Ranking Loss (RL) evaluates the fraction of label pairs for which an unrelated label (a label in the complementary set \( {\overline{Y}}_i \) of Yi) is ranked above a related label.

$$ RL=\frac{1}{t}\sum \limits_{i=1}^t\frac{1}{\left|{Y}_i\right|\left|{\overline{Y}}_i\right|}\left|\left\{\left({y}^{\prime },{y}^{{\prime\prime}}\right)\ \mid\ {C}_{x_i}\left({y}^{\prime}\right)\le {C}_{x_i}\left({y}^{{\prime\prime}}\right),\left({y}^{\prime },{y}^{{\prime\prime}}\right)\in {Y}_i\times {\overline{Y}}_i\right\}\right| $$
(31)

One Error (OE) evaluates the fraction of instances whose top-ranked label is not among the ground-truth labels.

$$ OE=\frac{1}{t}\sum \limits_{i=1}^t\left\Vert \left[\arg \underset{y}{\max }\ {C}_{x_i}\left(y\right)\right]\notin {Y}_i\right\Vert $$
(32)

where ‖·‖ is the indicator function.

Average AUC (AUC) evaluates, averaged over all class labels, the fraction of positive-negative instance pairs in which the positive instance receives a higher confidence score than the negative one.

$$ AUC=\frac{1}{l}\sum \limits_{j=1}^l\frac{\left|\left\{\left({x}^{\prime },{x}^{{\prime\prime}}\right)\ \mid\ {C}_{x^{\prime }}\left({y}_j\right)\ge {C}_{x^{{\prime\prime} }}\left({y}_j\right),\left({x}^{\prime },{x}^{{\prime\prime}}\right)\in {Z}_j^p\times {Z}_j^n\right\}\right|}{\left|{Z}_j^p\right|\left|{Z}_j^n\right|} $$
(33)

Among these evaluation metrics, smaller values are better for HL, RL and OE, whereas larger values are better for SA, MF1, AP and AUC.
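For concreteness, a minimal numpy sketch of three of the bipartition-based metrics (HL, SA and MF1) on toy 0/1 matrices is given below; the example matrices are illustrative only.

```python
# Minimal sketch of Hamming Loss, Subset Accuracy and Micro-F1 for 0/1 matrices.
import numpy as np

def hamming_loss(Y, Y_pred):
    return np.mean(Y != Y_pred)                       # fraction of mismatched label entries

def subset_accuracy(Y, Y_pred):
    return np.mean(np.all(Y == Y_pred, axis=1))       # exact-match ratio per instance

def micro_f1(Y, Y_pred):
    tp = np.sum((Y == 1) & (Y_pred == 1))
    return 2 * tp / (Y.sum() + Y_pred.sum())          # Eq. (29)

Y      = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y, Y_pred), subset_accuracy(Y, Y_pred), micro_f1(Y, Y_pred))
```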

5.3 Compared algorithms

The two versions of the algorithm proposed in this paper are compared with the following state-of-the-art multi-label learning algorithms. For fairness and efficiency, the parameters of all compared algorithms are configured as suggested in their original papers.

MLKNN [8] It introduces maximum a posteriori probability (MAP) estimation to deal with multi-label classification based on the traditional KNN algorithm. The parameter k is searched in {3, 5, …, 21}.

LPLC [15] It explores local positive and negative pairwise label correlations based on the traditional KNN algorithm. The parameter k is searched in {3, 5, …, 21} and the trade-off parameter α in {0.1, 0.2, …, 1}.

LIFT [25] It learns label-specific features through clustering. Here, the base learner is a linear-kernel SVM and r = 0.1.

LLSF [26] It learns label-specific features. The regularization parameter λ is searched in {2−10, …, 210}.

Glocal [13] It explores both global and local label correlations when recovering missing labels. The parameter λ = 1, λ1 to λ5 are searched in {2−5, …, 21}, k in {0.1l, …, 0.6l}, and g in {2−10, …, 210}.

LSML [39] It directly learns label-specific features with missing labels. All parameters are searched in {2−5, …, 21}.

LSLC and LSLCnew These are the two versions of the algorithm proposed in this paper. They directly learn label-specific features under missing labels while considering both positive and negative label correlations. All parameters are searched in {2−10, …, 21}.

5.4 Comparison experiments

In the comparison experiments, we vary the missing rate of class labels from 10% to 60% in steps of 10%, and all data are normalized for fairness. For every compared algorithm, the average results on the eleven data sets in terms of each evaluation metric are reported in Tables 2, 3, 4, 5, 6, 7, and 8. Owing to space limitations, only the average values are given here; the complete results (mean ± std) are recorded in the uploaded zip file containing 11 Excel files. To better evaluate the performance of the algorithms, the Friedman test [41] and the Holm test [42] are used. Table 10 summarizes the Friedman statistic FF and the corresponding critical value for every metric. As shown in Table 10, at significance level α = 0.05, the null hypothesis that the compared algorithms perform equivalently is rejected for all evaluation metrics. Consequently, the Nemenyi test [41] is employed to test whether our method LSLC is competitive against the compared algorithms. The performance of two algorithms is significantly different if their average ranks differ by at least one critical difference \( CD={q}_{\alpha}\sqrt{\frac{k\left(k+1\right)}{6N}} \). Here, qα = 3.031 at significance level α = 0.05, so CD = 1.2924 (k = 8, N = 66). The detailed CD diagrams are shown in Fig. 1. The Holm test is based on the ranking of multiple classifiers. First, the statistic \( Z=\frac{Rank_i-{Rank}_j}{SE} \), which follows the standard normal distribution, is calculated, where \( SE=\sqrt{\frac{k\left(k+1\right)}{6N}} \) and Ranki is the average rank of the i-th classifier over all data sets. The corresponding probability p is then obtained from Z and sorted in ascending order. Finally, p is compared with α/(k − i) to decide whether the null hypothesis (that the two classifiers perform the same) is rejected. Table 9 summarizes the Holm test results.
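The critical difference used above can be reproduced with a few lines; the values q_α = 3.031, k = 8 and N = 66 are those reported in the text.

```python
# Reproducing the critical difference of the Nemenyi test in Section 5.4:
# CD = q_alpha * sqrt(k(k+1)/(6N)).
import numpy as np

q_alpha, k, N = 3.031, 8, 66
CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(round(CD, 4))   # 1.2924, matching the value reported in the text
```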

Table 2 Experimental results of each comparing algorithm on eleven data sets in terms of HL. For HL, the smaller the better. Best results are highlighted in bold
Table 3 Experimental results of each comparing algorithm on eleven data sets in terms of SA. For SA, the bigger the better. Best results are highlighted in bold
Table 4 Experimental results of each comparing algorithm on eleven data sets in terms of MF1. For MF1, the bigger the better. Best results are highlighted in bold
Table 5 Experimental results of each comparing algorithm on eleven data sets in terms of AP. For AP, the bigger the better. Best results are highlighted in bold
Table 6 Experimental results of each comparing algorithm on eleven data sets in terms of OE. For OE, the smaller the better. Best results are highlighted in bold
Table 7 Experimental results of each comparing algorithm on eleven data sets in terms of RL. For RL, the smaller the better. Best results are highlighted in bold
Table 8 Experimental results of each comparing algorithm on eleven data sets in terms of AUC. For AUC, the bigger the better. Best results are highlighted in bold
Fig. 1 The detailed Friedman test (CD) diagrams of the experimental results

Table 9 The Holm test result

Based on the results of the comparison experiments, the following observations can be made:

  • LSLC can outperform other state-of-the-art multi-label learning algorithms in some cases under different label missing rates. According to Tables 2, 3, 4, 5, 6, 7, and 8, we analyze the experimental results from two directions. Horizontally, among all the evaluation metrics, LSLC and LSLCnew perform slightly better on most data sets than the other comparison algorithms, and an analysis of Fig. 1 reveals that LSLC and LSLCnew rank relatively far to the left in the CD diagrams compared with the other algorithms. Vertically, the results of MLKNN, LPLC, LIFT, LLSF and Glocal fluctuate obviously as the label missing rate increases, whereas the results of LSML and LSLC change relatively smoothly and degrade only slowly within a small range. The main cause lies in the algorithms themselves: LPLC considers only positive label correlations; LIFT and LLSF consider only label-specific features; MLKNN, LPLC and LLSF cannot handle missing labels; and although Glocal takes label correlations and missing labels into account, it cannot select label-specific features.

  • It can be seen from Table 9 that the performance of LSLC differs from that of the MLKNN, LPLC, LIFT, LLSF and Glocal classifiers. Although LSLC and LSML are judged equivalent by the Holm test, the results in Fig. 1 indicate that LSLC performs better than LSML. LSML can handle missing labels, label correlations and label-specific features, but it uses only positive label correlations. In LSLC, the positive and negative label correlations are first learned from the incomplete training data, missing labels are then recovered using the learned correlations, and finally the multi-label classification model is built by iteratively selecting label-specific features.

  • LSLCnew still shows better performance compared with LSLC and is faster than LSLC according to the experimental results on all data sets.

In conclusion, our proposed algorithm shows competitive performance compared with other existing state-of-the-art algorithms for multi-label classification with missing labels (Table 10).

Table 10 The Friedman statistic FF and critical value

5.5 Convergence

In our proposed algorithm, the accelerated proximal gradient (APG) method is used to solve the objective loss function quickly. The loss function value w.r.t. the number of iterations on the 'medical' data set is shown in Fig. 2. An analysis of Fig. 2 reveals that the loss drops rapidly in the early iterations and then tends to be stable. A similar phenomenon is observed on the other data sets.

Fig. 2 Visualization of the loss function value w.r.t. the number of iterations over the 'medical' dataset

5.6 Analysis of model parameters

In this section, we analyze the learned model parameters, namely the label-specific features and the label correlations. First, we run LSLC on the 'medical' dataset with a 60% label missing rate and visualize the matrix W, shown in Fig. 3, where the horizontal and vertical axes represent the feature index and the label index, respectively. As can be seen from Fig. 3, the label-specific features (shown in white) can be selected, and each class label is determined by only a certain subset of features from the original feature space. At the same time, label-specific feature selection relies on the label matrix being complete, so the label correlations play an important role in handling the missing-label task; they are obtained through the learned parameters P and N as described in Sections 3.4 and 3.5, respectively. Similarly, we run LSLC on the 'medical' dataset and visualize the matrices P and N in Figs. 4 and 5, where both axes represent the label index and the white points indicate the non-zero entries of P and N. Observing Fig. 4, we conclude that each label has positive correlations with only a few other labels; the same phenomenon is observed for the negative correlations in Fig. 5.

Fig. 3 Visualization of the label-specific features over the 'medical' dataset

Fig. 4 Visualization of positive label correlations over the 'medical' dataset

Fig. 5 Visualization of negative label correlations over the 'medical' dataset

To see intuitively the impact of positive and negative label correlations on the classification results, we degenerate LSLC into two variants, LSLC-P and LSLC-N: the former considers only positive label correlations, while the latter considers only negative label correlations. For simplicity, we run these algorithms on the 'medical', 'stack-chemistry' and 'rcv1subsets1' datasets with a 60% label missing rate. As shown in Table 11, considering both positive and negative label correlations effectively strengthens the classification ability.

Table 11 The experimental results of these models. For HL, RL and OE, the smaller the better; whereas for AP and AUC, the bigger the better. Best results are highlighted in bold

5.7 Sensitivity to parameters

In this section, we analyze the three parameters λ1, λ5 and λ6, where λ1 controls the sparsity of the label-specific features, λ5 controls the positive label correlations and λ6 controls the negative label correlations. We conduct the experiment on the 'medical' data set with 60% missing labels, dynamically varying two of the parameters at a time while fixing the other four at their optimal values. We randomly select 80% of the data as training data and the rest as test data; the experiment is repeated ten times with five-fold cross validation. The results for the MF1 and AUC metrics are summarized in Fig. 6. The analysis reveals that LSLC is obviously sensitive to the trade-off parameters λ1, λ5 and λ6 within the examined range: when any of the three exceeds a certain range, the performance of LSLC is weakened.

Fig. 6 The sensitivity to parameters with 60% missing rate of class labels

6 Conclusion

In this paper, we have proposed a novel algorithm named LSLC, which handles label-specific features for multi-label learning with missing labels by taking both positive and negative label correlations into consideration. For missing labels, LSLC completes the missing label matrix through the positive and negative label correlations, and the completed label matrix then drives label-specific feature selection. Besides, APG is applied to solve the objective function iteratively and quickly. Furthermore, the experimental results show that both positive and negative label correlations contribute to the final classification performance. After a series of experiments on eleven benchmark data sets, our algorithm shows better performance and can better select label-specific features in multi-label classification tasks with missing labels.