1 Introduction

With the rapid development of the Internet and digital acquisition equipment in recent years, the scale of data that needs to be analyzed and processed in classification problems has increased dramatically. These data may contain not only single-label data but also a large amount of multi-label data. In single-label data, each instance is associated with exactly one label, and the labels are mutually exclusive; in multi-label data, a sample may belong to several labels simultaneously, and the labels overlap and are correlated with one another. Research on multi-label learning, with applications in text classification, image annotation, video classification, biology, and other fields, has therefore attracted the attention of many scholars. In the practical applications mentioned above, multi-label data usually have thousands of features or more, which complicates data analysis, decision-making, screening, and prediction [1]. For example, redundant and irrelevant features may degrade the performance of classifiers [2]. To alleviate these problems, we select an optimal subset of relevant features; this procedure is called feature selection. Feature selection offers many advantages for learning algorithms, including reducing measurement cost and storage requirements, shortening training time, avoiding the curse of dimensionality, and reducing overfitting [3, 4]. Therefore, multi-label feature selection has become a research hot spot.

Feature selection methods are usually categorized according to the use of label information and the search strategy [5]. Based on the search strategy, feature selection can be divided into three categories: filter [6,7,8], wrapper [9, 10], and embedded [11, 12]. Embedded methods combine the advantages of filter and wrapper methods: they embed the feature selection process in the learning process and, because they do not evaluate feature subsets iteratively, they are more efficient than wrapper methods [1].

A study of state-of-the-art models shows that most of them are based on linear mapping [13, 14] and information theory. With the progress of research and the introduction of manifold learning [15,16,17], multi-label feature selection methods are continually being improved.

A multi-label feature selection method based on mutual information and label correlation is proposed in [18]. A new multi-label feature selection method based on label redundancy, called LRFS, is proposed in [19]; it divides labels into independent and dependent labels and analyzes the differences between them. Kernel alignment is introduced into multi-label learning to measure the consistency between the feature space and the label space, and a multi-label feature selection method that automatically learns the importance of labels is proposed in [20]. An extended adaptive least absolute shrinkage and selection operator (EALasso) feature selection method is presented in [21]. It preserves the properties of identifying the correct subset model and achieving optimal estimation accuracy, and an iterative optimization algorithm with a theoretical convergence proof is given. Some researchers propose a multi-label feature selection method with multiple regularizations (MDFS) [22].

They compute the correlation between features and local labels and use an objective function that includes L21-norm regularization. By combining linear mapping with manifold learning and L21-norm regularization, multi-label feature selection via feature manifold learning and sparsity regularization (MSSL) is proposed in [23]. A robust multi-label feature selection with dual-graph regularization (DRMFS) is proposed in [24]. To improve robustness, the model applies the L21-norm to the loss and combines the label manifold with the feature manifold, so that it considers not only the correlations between features but also the correlations between labels; finally, the L21-norm is used to enforce row sparsity of the weight matrix.

To sum up, linear regression is often used in multi-label feature selection models. However, since labels are binary, it is usually not appropriate to assume a linear relationship between the sample space and the label space. Moreover, multi-label feature selection ultimately serves multi-label classification, so from the perspective of classification, logistic regression is better suited to the multi-label feature selection model. The reasons are as follows:

  1) For multi-label data, the label (dependent variable) takes discrete values (0 or 1), which is better suited to logistic regression.

  2) Logistic regression is a generalized linear model; it introduces nonlinearity into the model, which improves the expressive ability of the model and its ability to fit the data.

  3) Linear regression directly models the relationship between the dependent and independent variables, while logistic regression models the relationship between the probability of the dependent variable taking a particular value and the independent variables.

Reasons 1 and 2 show that logistic regression is more suitable for data classification than linear regression and can be applied to a variety of data distributions, including distributions of positive and negative samples; reason 3 shows that logistic regression is more robust than linear regression.

Motivated by this problem, some scholars replace least-squares regression with logistic regression and improve the algorithm by refining the regularization term. A correlation logistic regression model (CorrLog), which extends the traditional logistic regression model to multi-label image classification, is put forward in [25]. A feature subset selection algorithm for mixed-integer optimal logistic regression is proposed in [26]; it formulates a mixed-integer linear optimization problem that approximates the logistic loss function piecewise linearly and can be solved with standard integer optimization software. A robust logistic regression method based on Lq-norm regularization with q ∈ [0, 1] is proposed in [27], which is a feasible and effective feature selection method.

However, existing multi-label feature selection algorithms based on logistic regression ignore the feature manifold structure. Among the methods above, the algorithms in [1, 3, 4, 19, 20] pay attention to the feature manifold structure but ignore the fitting of label information, and the algorithms in [21, 23] ignore how the feature manifold structure relates to the label information. Therefore, this study combines the logistic regression model with feature-graph regularization, label-graph regularization, and L21-norm sparse regularization to solve the multi-label feature selection problem. The main contributions of this paper are as follows:

  1) The assumption of a linear relationship between the sample space and the label space does not hold in most cases. Therefore, this paper uses logistic regression to construct the multi-label feature selection model.

  2) Feature selection should be based on both the sample matrix and the label matrix, yet few existing models consider both. Therefore, the feature manifold and the label manifold are incorporated into the multi-label feature selection model constructed by logistic regression, and the L2,1-norm sparse constraint is added to construct "Multi-label feature selection based on logistic regression and manifold learning" (LMFS).

  3) The optimal solution of the model is derived, an optimization algorithm is designed, and the convergence of the algorithm is proved.

  4) Extensive experiments are designed and carried out on eight classical multi-label data sets, comparing LMFS with five state-of-the-art multi-label feature selection algorithms (DRMFS, SCLS, etc.) and a baseline; the results demonstrate the effectiveness of the LMFS algorithm.

The rest of this paper is organized as follows. Section 2 presents the multi-label feature selection model. In Section 3, the model is solved, an iterative algorithm for multi-label feature selection is proposed, and its time complexity is analyzed. In Section 4, comprehensive experiments on eight classical data sets show that the proposed algorithm is superior to the compared algorithms. Finally, conclusions and future work are presented in Section 5.

2 Problem description

2.1 Logistic regression model

Suppose a multi-label data set \(D={\{(d_{i},y_{i})\}}_{i=1}^{n}\) consists of n independent and identically distributed samples. Let X = [x1;x2;⋯ ;xn] ∈ Rn×d be the augmented data matrix, where xi = [1,di], and let Y = [y1;y2;⋯ ;yn] ∈ Rn×m be the label matrix, where m is the number of classes. The value of yij is 0 or 1, indicating whether the i-th sample is associated with the j-th class. In logistic regression, the posterior probability that sample xi belongs to the j-th class is:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & Pr(y_{ij}=1|x_{i})=g(x_{i} w_{j})=\frac{exp(x_{i} w_{j})}{{1+exp(x_{i} w_{j})}} \end{array} \end{array} $$
(1)

Thus the posterior probability that sample xi does not belong to the j-th class is:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & Pr(y_{ij}=0|x_{i})=1-g(x_{i} w_{j})=\frac{1}{{1+exp(x_{i} w_{j})}} \end{array} \end{array} $$
(2)

where W = [w1,w2,⋯ ,wm] ∈ Rd×m and wj is the j-th column vector of the coefficient matrix W.

If the maximum likelihood estimation method is used to estimate the coefficient matrix, then the likelihood function (joint probability distribution) of the logistic regression on the multi-label data set is:

$$ \begin{array}{@{}rcl@{}} P(W)=\prod\limits_{j=1}^{m}\prod\limits_{i=1}^{n} g(x_{i} w_{j})^{y_{ij}} (1-g(x_{i} w_{j}))^{1-y_{ij}} \end{array} $$
(3)

Since it is inconvenient to maximize P(W) directly, W is obtained by minimizing the negative log-likelihood L(W) of the logistic regression:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} L(W)&{=-}{\sum}_{j=1}^{m}{\sum}_{i=1}^{n} [y_{ij}\ln(g(x_{i} w_{j})){+}(1-y_{ij})\ln(1{-}g(x_{i} w_{j}))]\\ & {=-}{\sum}_{j=1}^{m}{\sum}_{i=1}^{n} [y_{ij} x_{i} w_{j}+\ln(1{-}g(x_{i} w_{j}))] \end{array} \end{array} $$
(4)
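
For concreteness, the negative log-likelihood in (4) can be evaluated directly from the augmented data matrix X, the label matrix Y, and a candidate coefficient matrix W. The following is a minimal NumPy sketch under these definitions; the function and variable names are ours, not from the original paper.

```python
import numpy as np

def sigmoid(z):
    # numerically stable logistic function g(z) = exp(z) / (1 + exp(z))
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def neg_log_likelihood(X, Y, W):
    """L(W) in Eq. (4): X is n x d (augmented), Y is n x m with 0/1 entries, W is d x m."""
    G = sigmoid(X @ W)             # posterior probabilities g(x_i w_j)
    eps = 1e-12                    # guard against log(0)
    return -np.sum(Y * np.log(G + eps) + (1.0 - Y) * np.log(1.0 - G + eps))
```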

2.2 Sparse constraint

The logistic regression model may suffer from ill-posed problems such as overfitting, multi-collinearity, and infinitely many solutions, which result in incorrect estimation of the coefficient matrix [28]. A widely used strategy to address this is to add a penalty term to L(W), which aims to obtain a stable and accurate logistic regression model on high-dimensional data. The penalized objective is usually expressed as follows, where β is the regularization parameter:

$$ \begin{array}{@{}rcl@{}} {min_{W}} L(W)+\beta R(W) \end{array} $$
(5)

The i-th row vector Wi of the coefficient matrix W can be regarded as a vector that measures the importance of the i-th feature. Let fi ∈ Rn be the i-th feature vector of the data matrix; the data matrix X can then be written as X = [f1,f2,⋯ ,fd].

The penalty term R(W) takes various forms for different purposes; for example, the L1-norm is often used to induce sparsity, while the L2-norm is often used to promote stability. Because ∥Wi∥2 is generally used to measure the importance of feature fi, and in order to better distinguish the importance of features, we take the L21-norm as the regularization term R(W), which induces row sparsity of the coefficient matrix and is robust to outliers [29]. The objective optimization problem can thus be written as:

$$ \begin{array}{@{}rcl@{}} min_{W} L(W)+\frac{\beta }{ 2} \|W\|_{2, 1} \end{array} $$
(6)

where, \(\|W\|_{2, 1}={\sum }_{i=1}^{d} \|W_{i}\|_{2}\).
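
As an illustration of how the L21-norm drives feature selection, the sketch below computes ∥W∥2,1 and ranks features by their row norms ∥Wi∥2; it is a small illustrative snippet under our own naming, not code from the paper.

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1} = sum over rows i of ||W_i||_2
    return np.sum(np.linalg.norm(W, axis=1))

def feature_ranking(W, k):
    # rank features by the row norms ||W_i||_2 and return the indices
    # of the k most important features (largest norms first)
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)[:k]
```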

2.3 Feature manifold learning

Note that in formula (6) every coefficient vector wj shares the same regularization parameter β; however, following the idea of binary decomposition, a single regularization parameter β may not be appropriate for all coefficient vectors. In addition, the features can be regarded as being sampled from a low-dimensional manifold, called the feature manifold [2, 15, 16], and exploring its geometry is an important technique for capturing the structure of the feature weights. We assume that if features fi and fj are close on the manifold, then their weight vectors Wi and Wj should also be close. Therefore, a feature-graph regularization is constructed, which adjusts the regularization of the coefficient vectors according to the similarity between features fi and fj. Its expression is as follows:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} &~~\frac{1}{ 2} {\sum}_{i=1}^{d}{\sum}_{j=1}^{d} \|W_{i}-W_{j}\|_{2}^{2} S_{ij}\\ &{} =\frac{1}{ 2} {\sum}_{i=1}^{d}{\sum}_{j=1}^{d} (W_{i}-W_{j})(W_{i}-W_{j})^{T} S_{ij}\\ &{} ={\sum}_{i=1}^{d} W_{i} {W_{i}^{T}} M_{ii}-{\sum}_{i=1}^{d}{\sum}_{j=1}^{d} W_{i} {W_{j}^{T}} S_{ij}\\ &{} =Tr(W^{T} (M-S)W)\\ &{} =Tr(W^{T} L_{S} W) \end{array} \end{array} $$
(7)

where M ∈ Rd×d is a diagonal matrix whose i-th diagonal element is \(M_{ii}={\sum }_{j=1}^{d} S_{ij}\). LS = M − S is the Laplacian matrix of the feature similarity matrix S, and Sij, the element in the i-th row and j-th column of S, represents the similarity between features fi and fj. There are many ways to construct the feature similarity matrix S, for example:

By using a kernel function, a feature similarity matrix S can be constructed, where t ∈ R:

$$ \begin{array}{@{}rcl@{}} S_{ij}{=}\begin{cases} exp(- \frac{\|f_{i}-f_{j}\|_{2}^{2} }{{t}}),\!\!\!&if~\ f_{i}{\in} N_{K} (f_{j})\ or\ f_{j}{\in} N_{K} (f_{i})\\ 0,&others \end{cases} \end{array} $$
(8)

where NK(∗) denotes the k-nearest-neighbor set of ∗. With the feature-graph regularization, the feature selection problem becomes:

$$ \begin{array}{@{}rcl@{}} min_{W} L(W) +\frac{\lambda }{ 2} Tr(W^{T} L_{S} W)+\frac{\beta }{ 2} \|W\|_{2, 1} \end{array} $$
(9)
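
The feature graph in (8) and the Laplacian LS = M − S used in (7) and (9) can be built as follows. This is a rough NumPy sketch under the stated assumptions (k-nearest neighbors combined with an RBF kernel of width t); the helper names are ours.

```python
import numpy as np

def knn_rbf_similarity(F, k=5, t=1.0):
    """Similarity matrix as in Eq. (8). F holds the vectors to compare as rows
    (for the feature graph, pass X.T so that each row is a feature vector f_i)."""
    d = F.shape[0]
    sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)  # pairwise ||f_i - f_j||_2^2
    S = np.exp(-sq / t)
    # keep only k-nearest-neighbour entries (symmetric OR rule), zero elsewhere
    knn = np.argsort(sq, axis=1)[:, 1:k + 1]
    mask = np.zeros((d, d), dtype=bool)
    mask[np.repeat(np.arange(d), k), knn.ravel()] = True
    mask |= mask.T
    S = np.where(mask, S, 0.0)
    np.fill_diagonal(S, 0.0)
    return S

def graph_laplacian(S):
    # L_S = M - S with M_ii = sum_j S_ij, as used in Eq. (7)
    return np.diag(S.sum(axis=1)) - S
```

For the feature graph one would call S = knn_rbf_similarity(X.T, k, t) and L_S = graph_laplacian(S); the regularizer in (9) is then np.trace(W.T @ L_S @ W).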

2.4 Label manifold learning

We also want to fit the label information while fitting the manifold structure. Let f(xiW) = [g(xiw1),g(xiw2),⋯ ,g(xiwm)], f(xiW) ∈ Rm. We assume that if the labels yi and yj are close, then the probabilities f(xiW) and f(xjW) produced by the logistic regression model should also be close; since g(xiwj) is positively correlated with xiwj, f(xiW) is positively correlated with xiW, and thus xiW should be close to xjW. Therefore, a label-graph regularization is constructed, which adjusts the coefficient matrix W according to the similarity between labels yi and yj so that W better fits the label information. Its expression is as follows:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} &~ \frac{1}{ 2} {\sum}_{i=1}^{n}{\sum}_{j=1}^{n} \|x_{i} W{-}x_{j} W\|_{2}^{2} A_{ij}\\ &{} =\!\frac{1}{ 2} {\sum}_{i=1}^{n}{\sum}_{j=1}^{n} (x_{i} W{-}x_{j} W)(x_{i} W-x_{j} W)^{T} A_{ij}\\ &{} {=}\!{\sum}_{i=1}^{n} x_{i} W (x_{i} W)^{T} P_{ii}{-}\!{\sum}_{i=1}^{n}{\sum}_{j=1}^{n} x_{i} W (x_{j} W)^{T} A_{ij}\\ & {}=\!Tr(W^{T} X^{T} (P{-}A)XW)\\ & {}=\!Tr(W^{T} X^{T} L_{A} XW)\\ & {}=Tr(W^{T} \overline{L_{A}} W) \end{array} \end{array} $$
(10)

where \(\overline {L_{A}}=X^{T} L_{A} X\), and P ∈ Rn×n is a diagonal matrix whose i-th diagonal element is \(P_{ii}={\sum }_{j=1}^{n} A_{ij}\). LA = P − A is the Laplacian matrix of the label similarity matrix A, and Aij, the element in the i-th row and j-th column of A, represents the similarity between labels yi and yj. The label similarity matrix A can be constructed in many ways, for example:

By using a kernel function, a label similarity matrix A can be constructed, where t ∈ R:

$$ \begin{array}{@{}rcl@{}} A_{ij}=\begin{cases} exp(- \frac{\|y_{i}-y_{j}\|_{2}^{2} }{{t}}),&if~\ y_{i}\in N_{K} (y_{j})\ or\ y_{j}\in N_{K} (y_{i})\\ 0,&others \end{cases} \end{array} $$
(11)
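
Assuming the helpers from the previous sketch, the label graph of (11) and the projected Laplacian \(\overline{L_A}=X^T L_A X\) of (10) can be obtained analogously; X, Y, and W below stand for the matrices defined in Section 2.1, so this is only an illustrative snippet.

```python
# Reusing knn_rbf_similarity and graph_laplacian from the previous sketch:
A = knn_rbf_similarity(Y, k=5, t=1.0)    # label similarity, Eq. (11): rows of Y are label vectors y_i
L_A = graph_laplacian(A)                 # L_A = P - A
L_A_bar = X.T @ L_A @ X                  # projected Laplacian used in Eq. (10)
label_reg = np.trace(W.T @ L_A_bar @ W)  # Tr(W^T L_A_bar W)
```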

Regarding the different ways of computing the feature similarity matrix S and the label similarity matrix A and their impact on multi-label feature selection, we carried out a simple analysis on the Image and Emotion data sets, searching the parameter range [0.001, 0.01, 0.1, 1, 10, 100, 1000] and reporting the best result. As shown in Fig. 1 below, the different methods behave similarly. Therefore, in the experimental part we use the kernel function to learn the feature similarity matrix and the label similarity matrix.

Fig. 1

Average precision comparison of different similarity matrix methods when ML-KNN is used as the basic classifier. (the higher the result, the better)

With the label-graph regularization added, the feature selection problem is finally transformed into:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & min_{W} L(W) {+}\frac{\lambda }{ 2} Tr(W^{T} L_{S} W){+}\frac{\beta }{ 2} \|W\|_{2, 1}+\frac{\gamma }{ 2} Tr(W^{T} \overline{L_{A}} W) \end{array} \end{array} $$
(12)
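
Putting the pieces together, the value of the objective in (12) can be evaluated as below, reusing the neg_log_likelihood and l21_norm sketches from above; this is only an illustrative assembly, not the authors' implementation.

```python
def lmfs_objective(X, Y, W, L_S, L_A_bar, lam, beta, gamma):
    """Value of the objective in Eq. (12) for a given coefficient matrix W."""
    return (neg_log_likelihood(X, Y, W)
            + 0.5 * lam * np.trace(W.T @ L_S @ W)
            + 0.5 * beta * l21_norm(W)
            + 0.5 * gamma * np.trace(W.T @ L_A_bar @ W))
```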

3 Problem solving and proof of convergence

3.1 Problem solving

Due to the non-smoothness of the L21-norm, it is difficult to find a closed-form solution of the optimization problem in (12) directly. According to [29], this problem can be solved in another way. When Wi ≠ 0 (i = 1, 2,⋯ ,d), the derivative of ∥W∥2, 1 with respect to W is:

$$ \begin{array}{@{}rcl@{}} \frac{\partial (\|W\|_{2, 1})}{ \partial W}=2HW \end{array} $$
(13)

where H ∈ Rd×d is a diagonal matrix whose i-th diagonal element is:

$$ \begin{array}{@{}rcl@{}} H_{ii}=\frac{1}{ 2 \|W_{i}\|_{2}} \end{array} $$
(14)
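
In practice, (14) is undefined when a row Wi becomes exactly zero; a common safeguard (our addition, not part of the original derivation) is to lower-bound the row norm by a small ε, as in the sketch below.

```python
def update_H(W, eps=1e-8):
    # H from Eq. (14); eps avoids division by zero when a row W_i is (near) zero,
    # a standard safeguard for L21-norm reweighting (our addition, not in the paper).
    row_norms = np.linalg.norm(W, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```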

Therefore, the derivative of the L21-norm term can also be regarded as the derivative of Tr(WTHW) with H fixed. Since ∥W∥2, 1 is convex, this reweighted formulation can be used to find an approximate solution of (12). Thus the objective function is transformed into:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} &obj(W)=L(W) +\frac{\lambda }{ 2} Tr(W^{T} L_{S} W)\\ &~~~~~~~~~~~~~~~~~+\frac{\beta }{ 2} Tr(W^{T} HW)+\frac{\gamma }{ 2} Tr(W^{T} \overline{L_{A}} W) \end{array} \end{array} $$
(15)

This problem can be solved alternately: given an H, compute W with the current H, and then update H from the newly computed W.

Since (15) is differentiable, it can be solved by the Newton–Raphson algorithm. The first derivative of (15) with respect to W is:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} &\frac{\partial (obj(W))}{ \partial W}=-X^{T} [Y-G(XW)]\\ &~~~~~~~~~~~~~~~~~~~+\lambda L_{S} W+\beta HW+\gamma \overline{L_{A}} W \end{array} \end{array} $$
(16)

where G(XW) = [f(x1W);f(x2W);⋯ ;f(xnW)]. The second derivative of (15) with respect to W is:

$$ \begin{array}{@{}rcl@{}} \frac{\partial^{2} (obj(W))}{ \partial W \partial W^{T}}=X^{T} UX+\lambda L_{S}+\beta H+\gamma \overline{L_{A}} \end{array} $$
(17)

where:

$$ \begin{array}{@{}rcl@{}} U=diag \sum\limits_{j=1}^{m} [(1-g(x_{i} w_{j}))g(x_{i} w_{j})] \end{array} $$
(18)

with i = 1, 2,⋯ ,n indexing the samples.

The update formula for W is:

$$ \begin{array}{@{}rcl@{}} W^{t+1} =W^{t} -(\frac{\partial^{2} (obj(W))}{ \partial W \partial W^{T}})^{-1} \frac{\partial (obj(W))}{ \partial W} \end{array} $$
(19)
Algorithm 1 The iterative optimization procedure of LMFS
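
Since Algorithm 1 appears only as a figure in the original, the following NumPy sketch of the iterative procedure is our own reconstruction from (13)–(19), reusing the helpers defined in the earlier sketches (sigmoid, knn_rbf_similarity, graph_laplacian, update_H); it is an illustration under stated assumptions (a single shared d × d Hessian approximation across labels, zero initialization of W, identity initialization of H), not the authors' reference implementation.

```python
import numpy as np

def lmfs(X, Y, lam=1.0, beta=1.0, gamma=1.0, k=5, t=1.0, n_iter=50):
    """Sketch of the LMFS iteration: returns the coefficient matrix W (d x m)."""
    d = X.shape[1]
    m = Y.shape[1]
    S = knn_rbf_similarity(X.T, k, t)          # feature graph, Eq. (8)
    L_S = graph_laplacian(S)                   # L_S = M - S
    A = knn_rbf_similarity(Y, k, t)            # label graph, Eq. (11)
    L_A_bar = X.T @ graph_laplacian(A) @ X     # projected Laplacian, Eq. (10)
    W = np.zeros((d, m))                       # initial coefficients (our choice)
    H = np.eye(d)                              # initial H, as suggested in the text
    for _ in range(n_iter):
        G = sigmoid(X @ W)                                          # posterior probabilities
        U = np.diag(np.sum(G * (1.0 - G), axis=1))                  # Eq. (18), one entry per sample
        grad = (-X.T @ (Y - G) + lam * L_S @ W
                + beta * H @ W + gamma * L_A_bar @ W)               # Eq. (16)
        hess = X.T @ U @ X + lam * L_S + beta * H + gamma * L_A_bar  # Eq. (17)
        W = W - np.linalg.solve(hess, grad)                         # Newton step, Eq. (19)
        H = update_H(W)                                             # Eq. (14), from the new W
    return W
```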

In the LMFS algorithm, each iteration mainly updates U, \(\frac {\partial (obj(W))}{ \partial W}\), \(\frac {\partial ^{2} (obj(W))}{ \partial W \partial W^{T}}\), W, and H. In each iteration, the complexity of updating U is O(mn2), the complexity of updating \(\frac {\partial (obj(W))}{ \partial W}\) is O(d2n), the complexity of updating \(\frac {\partial ^{2} (obj(W))}{ \partial W \partial W^{T}}\) is O(d2), the complexity of updating W is O(d2m), and the complexity of updating H is O(dm). The LMFS algorithm iterates t times in total, so its overall complexity is O(t(d2n + d2m + d2 + nm2 + dm)), where t is not large. Therefore, the running time of the LMFS algorithm is mainly affected by the dimension d of the data, the number of labels m, and the number of samples n in the data set.

3.2 Proof of convergence

In this section, we prove that the iterative procedure shown in Algorithm 1 is convergent. In the t-th iteration, we have:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & W^{t+1}=arg min_{W} L(W) +\frac{\lambda }{ 2} Tr(W^{T} L_{S} W)\\ & ~~~~~~~~~~~~~+\frac{\beta }{ 2} Tr(W^{T} H^{t} W)+\frac{\gamma }{ 2} Tr(W^{T} \overline{L_{A}} W) \end{array} \end{array} $$
(20)

where \(H_{ii}^{t} =\frac {1}{ 2 \|{W_{i}^{t}}\|_{2}} (i=1, 2,\cdots ,d)\), so we have:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & L(W^{t+1}) +\frac{\lambda }{ 2} Tr((W^{t+1})^{T} L_{S} W^{t+1})\\ & ~~~~~~~~~~~~~~~+\frac{\beta }{ 2} Tr((W^{t+1})^{T} H^{t} W^{t+1})\\ &~~~~~~~~~~~~~~~+\frac{\gamma }{ 2} Tr((W^{t+1})^{T} \overline{L_{A}} W^{t+1})\leq L(W^{t})\\ &~~~~~~~~~~~~~~~+\frac{\lambda }{ 2} Tr((W^{t})^{T} L_{S} W^{t})\\ & ~~~~~~~~~~~~~~~+\frac{\beta }{ 2} Tr((W^{t})^{T} H^{t} W^{t})+\frac{\gamma }{ 2} Tr((W^{t})^{T} \overline{L_{A}} W^{t}) \end{array} \end{array} $$
(21)

That is:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & L(W^{t+1}) +\frac{\lambda }{ 2} Tr((W^{t+1})^{T} L_{S} W^{t+1})\\ & ~~~~~~~~~~~~~~~+\frac{\gamma }{ 2} Tr((W^{t+1})^{T} \overline{L_{A}} W^{t+1})\\ &~~~~~~~~~~~~~~~+\frac{\beta }{ 2} {\sum}_{i=1}^{d} \frac{\|W_{i}^{t+1}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \leq L(W^{t})\\ &~~~~~~~~~~~~~~~+\frac{\lambda }{ 2} Tr((W^{t})^{T} L_{S} W^{t})\\ & ~~~~~~~~~~~~~~~+\frac{\gamma }{ 2} Tr((W^{t})^{T} \overline{L_{A}} W^{t})+\frac{\beta }{ 2} {\sum}_{i=1}^{d} \frac{\|{W_{i}^{t}}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \end{array} \end{array} $$
(22)

It can be further transformed into:

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} & L(W^{t+1}) +\frac{\lambda }{ 2} Tr((W^{t+1})^{T} L_{S} W^{t+1})\\ & ~~~~~~~~~~~~~~~+\frac{\gamma }{ 2} Tr((W^{t+1})^{T} \overline{L_{A}} W^{t+1})\\ & ~~~~~~~~~~~~~~~+\frac{\beta }{ 2} \|W^{t+1}\|_{2, 1} -\frac{\beta }{ 2} (\|W^{t+1}\|_{2, 1}\\ &~~~~~~~~~~~~~~~-{\sum}_{i=1}^{d} \frac{\|W_{i}^{t+1}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}})\leq L(W^{t})\\ &~~~~~~~~~~~~~~~+\frac{\lambda }{ 2} Tr((W^{t})^{T} L_{S} W^{t})+\frac{\gamma }{ 2} Tr((W^{t})^{T} \overline{L_{A}} W^{t})\\ & ~~~~~~~~~~~~~~~+\frac{\beta }{ 2} \|W^{t}\|_{2, 1} -\frac{\beta }{ 2} (\|W^{t}\|_{2, 1}-{\sum}_{i=1}^{d} \frac{\|{W_{i}^{t}}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}}) \end{array} \end{array} $$
(23)

According to the inequality \(\sqrt {a} -\frac {a}{ 2\sqrt {b}} \leq \sqrt {b} -\frac {b}{ 2\sqrt {b}}\) for any positive numbers a and b, we have:

$$ \begin{array}{@{}rcl@{}} \|W_{i}^{t+1}\|_{2}-\frac{\|W_{i}^{t+1}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \leq \|{W_{i}^{t}}\|_{2}-\frac{\|{W_{i}^{t}}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \end{array} $$
(24)

Summing over i = 1, 2,⋯ ,d, we get:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{i=1}^{d} \left( \|W_{i}^{t+1}\|_{2}-\frac{\|W_{i}^{t+1}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}}\right) \leq \sum\limits_{i=1}^{d} \left( \|{W_{i}^{t}}\|_{2}-\frac{\|{W_{i}^{t}}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}}\right) \end{array} $$
(25)

Which implies:

$$ \begin{array}{@{}rcl@{}} \|W^{t+1}\|_{2, 1} - \sum\limits_{i=1}^{d} \frac{\|W_{i}^{t+1}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \leq \|W^{t}\|_{2, 1}-\sum\limits_{i=1}^{d} \frac{\|{W_{i}^{t}}\|_{2}^{2}}{ 2 \|{W_{i}^{t}}\|_{2}} \end{array} $$
(26)

Combining (23) and (26) shows that the objective value of (12) is non-increasing over the iterations. Since the objective is bounded below, the convergence of Algorithm 1 is proved.

4 Experiments and results

In order to verify the effectiveness of the LMFS algorithm, the experiments use eight public data sets and compare its performance with several state-of-the-art methods and a baseline. ML-KNN [30] is selected as the representative multi-label classification algorithm for evaluation.

4.1 Dataset and experimental setup

The experiment uses eight public data sets from four different areas. The specific parameters of each data set are shown in Table 1:

Table 1 Dataset information

In terms of experimental environment, all experiments were run on: Microsoft Windows 7, processor: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz 2.40GHz, memory: 4.00GB, programming software: Matlab R2016a.

To verify the effectiveness of the proposed feature selection method, the following state-of-the-art feature selection algorithms are compared:

  1) Baseline: the results on the various evaluation metrics obtained by applying ML-KNN directly to the data set without any feature selection.

  2) DRMFS [24]: a robust multi-label feature selection with dual-graph regularization, which uses a feature graph and a label graph to guide the sparsity between and within rows of the weight matrix, and the L2,1-norm to promote its global properties and robustness.

  3) SCLS [31]: a multi-label feature selection method based on scalable criteria.

  4) MDMR [32]: multi-label feature selection based on an evaluation metric that combines mutual information with maximum dependency and minimum redundancy.

  5) PMU [33]: a multi-label feature selection algorithm based on mutual information, which selects features according to the dependency between the selected features and the labels.

  6) FIMF [34]: a fast multi-label feature selection method based on information-theoretic feature ranking; a scoring function evaluating the importance of features is derived from information theory and its computational cost is analyzed.

In order to ensure the fairness of the experiments, the parameters are set as follows:

The number of nearest neighbors K for the multi-label classification algorithm ML-KNN is set to 10, and the smoothing parameter s is set to 1. For MDMR, PMU, and FIMF, we discretize the data sets using equal-width intervals [35]. For FIMF, we set Q = 10 and b = 2. For DRMFS and LMFS, we use (8) to calculate the similarity matrix between features. The above settings are the default settings of the algorithms. In addition, for DRMFS and the other comparison algorithms, the regularization parameters of all methods are tuned by a "grid search" strategy over [0.001, 0.01, 0.1, 1, 10, 100, 1000]. For the feature dimension, we set the number of selected features to [10, 15, 20, 25, 30, 35, 40, 45, 50]. The maximum number of iterations for all iterative algorithms is fixed at 50, and the neighborhood size K is set to 5. For all multi-label feature selection algorithms, the experiments report the best results obtained with the optimal parameters.
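
The tuning protocol described above can be sketched as a plain grid search. In the snippet below, evaluate_mlknn, X_train, Y_train, X_test, and Y_test are hypothetical placeholders for training ML-KNN (K = 10, s = 1) on the selected features and returning Average precision; the original experiments were run in Matlab, so this is only an outline of the procedure, not the authors' script.

```python
import itertools
import numpy as np

param_grid = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
feature_counts = [10, 15, 20, 25, 30, 35, 40, 45, 50]

best = (-np.inf, None)
for lam, beta, gamma in itertools.product(param_grid, repeat=3):
    W = lmfs(X_train, Y_train, lam=lam, beta=beta, gamma=gamma, k=5, n_iter=50)
    for n_feat in feature_counts:
        # top-ranked features by ||W_i||_2 (a bias row in W, if present, should be excluded)
        selected = feature_ranking(W, n_feat)
        # evaluate_mlknn is a placeholder: train ML-KNN on the selected features
        # and return Average precision on the test split
        score = evaluate_mlknn(X_train[:, selected], Y_train,
                               X_test[:, selected], Y_test)
        if score > best[0]:
            best = (score, (lam, beta, gamma, n_feat))
```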

4.2 Evaluation metrics

The performance evaluation of multi-label learning systems differs from that of single-label learning systems, and the evaluation criteria are more complicated. The experiments use five evaluation criteria available with ML-KNN: Hamming loss, Ranking loss, One-error, Coverage, and Average precision. The five evaluation criteria are defined as follows:

Suppose there is a test data set \(D={\{(x_{i},y_{i})\}}_{i=1}^{n}\), where n is the number of test samples. Given a test sample xi, the binary label vector predicted by the multi-label classifier is denoted as h(xi), and the rank of the l-th label in the predicted ranking is denoted as ranki(l).

1) Hamming loss: evaluates the percentage of mislabeled labels, i.e., a label belonging to the instance is not predicted or a label not belonging to the instance is predicted. The smaller the value, the better the performance.

$$ \begin{array}{@{}rcl@{}} HL(D)=\frac{1}{ n} \sum\limits_{i=1}^{n} \frac{1}{ m} \|h(x_{i})\bigtriangleup y_{i}\|_{1} \end{array} $$
(27)

where △ represents the symmetric difference between the two sets, and returns those values that appear only in one of the sets, HL(D) ∈ [0, 1].

2) Ranking loss: evaluates the proportion of reverse-order label pairs, that is, the case where unrelated labels are more relevant than related labels. The smaller the value, the better the performance.

$$ \begin{array}{@{}rcl@{}} RL(D)=\frac{1}{ n} \sum\limits_{i=1}^{n} \frac{1}{ {1_{m}^{T}} y_{i} {1_{m}^{T}} \overline{y_{i}}} \sum\limits_{l:{y_{i}^{l}} =1} \sum\limits_{l^{\prime} : y_{i}^{l^{\prime}}=0} \delta(rank_{i} (l)\geq rank_{i} (l^{\prime})) \end{array} $$
(28)

where \(\overline {y_{i}}\) is the complement of yi in Y. RL(D) ∈ [0, 1].

3) One-error: evaluates the proportion of samples whose top-ranked predicted label is not among the true relevant labels. The smaller the value, the better the performance.

$$ \begin{array}{@{}rcl@{}} OE(D)=\frac{1}{ n} \sum\limits_{i=1}^{n} \delta (y_{i}^{l_{i}}=0) \end{array} $$
(29)

where \(l_{i}=\arg \min _{l\in [1,m]} rank_{i}(l)\) and δ(⋅) is the indicator function, OE(D) ∈ [0, 1].

4) Coverage: evaluates how many steps, on average, one needs to go down the ranked label list to cover all the true relevant labels. The smaller the value, the better the performance.

$$ \begin{array}{@{}rcl@{}} CV(D)=\frac{1}{ n} \sum\limits_{i=1}^{n} arg max_{l: {y_{i}^{l}} =1} rank_{i} (l)-1 \end{array} $$
(30)

where CV (D) ∈ [1,m − 1].

5) Average precision: evaluates the average fraction of relevant labels that are ranked above a particular relevant label. The larger the value, the better the performance.

$$ \begin{array}{@{}rcl@{}} AP(D)=\frac{1}{ n} {\sum}_{i=1}^{n} \frac{1}{ {1_{m}^{T}} y_{i}} {\sum}_{l: {y_{i}^{l}} =1} \frac{prec_{i} (l)}{ rank_{i} (l)} \end{array} $$
(31)

where \(prec_{i} (l)={\sum }_{l^{\prime } : y_{i}^{l^{\prime }}=1} \delta (rank_{i} (l)\geq rank_{i} (l^{\prime }))\) and AP(D) ∈ [0, 1].
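
As a sketch of how two of these metrics can be computed from a predicted binary label matrix and real-valued relevance scores, the following follows (27) and (31); the helper names and the score convention (larger score means higher relevance, rank 1 is the top) are our assumptions.

```python
import numpy as np

def hamming_loss(Y_pred, Y_true):
    # Eq. (27): fraction of label positions that disagree (symmetric difference)
    return np.mean(Y_pred != Y_true)

def average_precision(scores, Y_true):
    """Eq. (31): scores is n x m with larger = more relevant; Y_true is n x m binary."""
    n, m = Y_true.shape
    ap = 0.0
    for i in range(n):
        rank = np.argsort(np.argsort(-scores[i])) + 1      # rank 1 = most relevant label
        rel = np.flatnonzero(Y_true[i])                     # indices of relevant labels
        if rel.size == 0:
            continue
        # prec_i(l): relevant labels ranked at or above l, divided by rank_i(l)
        prec = [np.sum(rank[rel] <= rank[l]) / rank[l] for l in rel]
        ap += np.mean(prec)
    return ap / n
```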

4.3 Experimental results

The proposed multi-label feature selection algorithm has been tested on eight public data sets with extensive experiments. Comparing it with several state-of-the-art algorithms, we use Hamming loss, Ranking loss, One-error, Coverage, and Average precision to evaluate the performance of the above multi-label feature selection methods. Tables 2, 3, 4, 5 and 6 show the best results of all the feature selection methods obtained with the optimal parameters. In these tables, the best performance is indicated in bold and the second-best performance is underlined. Tables 2, 3, 4, 5 and 6 report the Hamming loss, Ranking loss, One-error, Coverage, and Average precision of the different algorithms on each data set.

Table 2 Hamming loss comparison of different algorithms under each data set
Table 3 Ranking loss comparison of different algorithms under each data set
Table 4 One-error comparison of different algorithms under each data set
Table 5 Coverage comparison of different algorithms under each data set
Table 6 Average precision comparison of different algorithms under each data set

First of all, feature selection is effective: it reduces the number of features, shortens the classifier's running time, and improves the performance of the classification algorithm. Secondly, as shown in Tables 2, 3, 4, 5 and 6, although the performance of the LMFS algorithm on the Business, Scene and Yeast data sets is slightly weaker, on Business and Scene it is second only to the Baseline. On the other data sets, the LMFS algorithm achieves the best performance.

In order to visually show the relative performance of the LMFS algorithm and the comparison algorithms, Figs. 2, 3, 4, 5 and 6 show the performance of all multi-label feature selection algorithms. As the number of selected features changes, the metric values also change; therefore, the x-axis represents the number of features selected by each feature selection algorithm, and the y-axis represents the value of the evaluation metric after classification with the selected features. These results show that the proposed LMFS algorithm is superior to the compared methods on almost all data sets in most cases.

Fig. 2

Hamming loss comparison of different feature selection methods when ML-KNN is used as the basic classifier. (the lower the result, the better)

Fig. 3

Ranking loss comparison of different feature selection methods when ML-KNN is used as the basic classifier. (the lower the result, the better)

Fig. 4

One-error comparison of different feature selection methods when ML-KNN is used as the basic classifier. (the lower the result, the better)

Fig. 5

Coverage comparison of different feature selection methods when ML-KNN is used as the basic classifier. (the lower the result, the better)

Fig. 6

Average precision comparison of different feature selection methods when ML-KNN is used as the basic classifier. (the higher the result, the better)

Specifically, Figs. 2 to 6 show the Hamming loss, Ranking loss, One-error, Coverage, and Average precision of the different feature selection methods when ML-KNN is used as the base classifier. From subfigures a, b, c, e, f and g of Fig. 2, subfigures a, b, e, f and g of Fig. 3, subfigures a, b, e, f and g of Fig. 4, and subfigures a, b, c, d, e, f and g of Fig. 5, we can see that the curve of the LMFS algorithm is lower than those of all comparison algorithms; even in the remaining subfigures of Figs. 2 to 5, the curves of the LMFS algorithm are significantly lower than those of the SCLS, MDMR, PMU, and FIMF algorithms. In some subfigures of Figs. 2 and 4, only the LMFS curves are below the baseline. From subfigures a, b, c, e, f and g of Fig. 6, we can see that the curve of LMFS is significantly higher than those of all comparison algorithms, and in subfigures a and f only the LMFS curves are above the baseline. It can thus be seen that the proposed LMFS algorithm can effectively remove irrelevant or redundant features.

In addition, to explore the influence of the parameters on the performance of the LMFS algorithm, we choose two different kinds of data sets: the image data set Scene and the biological data set Yeast. For the parameters λ, β and γ, we fix two of them to 1 and examine the influence of the remaining parameter on the performance of the LMFS algorithm for different numbers of selected features. The parameter range is set to [0.001, 0.01, 0.1, 1, 10, 100, 1000], and the number of selected features is set to [10, 15, 20, 25, 30, 35, 40, 45, 50]. The experimental results are shown in Fig. 7.

Fig. 7

The change of Average precision with parameters in Enron and Scene data set

Specifically, the performance of the algorithm changes with the parameters. As shown in Fig. 7, the optimal parameter range differs between data sets. For example, on the Scene data set the optimal range of parameter β is [10, 100], the optimal range of parameter λ is [10, 1000], and the optimal range of parameter γ is [0.01, 0.1]. Because of the different underlying structures of the data sets, the algorithm is more sensitive to the parameters on the Yeast data set than on the Scene data set. At the same time, in order to explore the influence of the nearest-neighbor parameter K on the performance of the algorithm, we let K = [5, 6, 7, 8, 9, 10] and carried out experiments on three data sets, Business, Emotion and Computers. The experimental results are shown in Fig. 8; we can see that the performance of the algorithm is sensitive to K.

Fig. 8

The influence of the nearest neighbor parameter k on the performance of the algorithm

As shown in Fig. 9, the horizontal axis represents the ranking of the multi-label feature selection algorithms under each metric; from left to right, the performance of the algorithms improves, and the best-performing algorithm is on the far right. We report the results of the Bonferroni-Dunn test in the form of average rank diagrams: groups of algorithms that are not significantly different (at significance level 0.1) are connected, and two algorithms differ significantly if the difference of their average ranks exceeds the critical difference (CD) [36]. Although the LMFS algorithm is not significantly different from the DRMFS, SCLS, and MDMR algorithms on all metrics, it is consistently significantly different from the PMU and FIMF algorithms, and it consistently ranks on the right side. Therefore, compared with the other methods, the LMFS algorithm shows better performance.

Fig. 9

Bonferroni-Dunn test results in the form of average rank diagrams. Groups of feature selection algorithms that are not significantly different (at p = 0.1) are connected

5 Conclusions and outlook

In this paper, logistic regression is combined with feature manifold learning, sparse regularization, and label-graph regularization to study the multi-label feature selection problem. Sparsity has been widely used in regression-based feature selection methods. Existing feature selection methods based on logistic regression fail to consider the geometric structure of the feature manifold when estimating the regression coefficients, and existing linear-regression-based feature selection methods fail to relate the geometric structure of the feature manifold to the label information. To overcome these shortcomings, we embed feature-graph regularization and label-graph regularization into the logistic-regression-based multi-label feature selection problem, so that the regression coefficients are smooth with respect to the feature manifold without losing the relevant label information. We also design an iterative update algorithm and prove its convergence. A direction for future work is to extend this method to semi-supervised feature selection.