1 Introduction

Multi-label learning (MLL) is a form of supervised learning in which the learning algorithm learns from a set of instances, each of which can belong to multiple classes, and must then be able to predict a set of class labels for a new instance [1]. In the past decade, this learning problem has received a lot of attention because of many real-world applications, e.g., protein function classification [2, 3], speech recognition [4, 5], music categorization [6, 7] and image annotation [8, 9]. Due to the complex nature of this type of data, learning from it is more difficult than from binary or multi-class data. MLL methods can generally be divided into the following two categories [10].

  • Problem transformation methods: These methods transform the MLL problem into one or more single-label classification problems, which can be solved using existing single-label learning methods. An essential property of problem transformation methods is that they are algorithm independent. Binary relevance [10], label powerset [11] and calibrated label ranking [12] are some of the algorithms which use problem transformation methods for MLL.

  • Algorithm adaptation methods: These methods handle MLL problems directly by adapting popular learning techniques to deal with multi-label data. The support vector machine [13,14,15], nearest neighbor [16], neural network [17, 18] and decision tree [19, 20] are some of the machine learning techniques which have been extended for MLL. The proposed method is one of the algorithm adaptation methods and extends SVM to learn multi-label data.

Despite the existing work on MLL, there is still a need to develop more efficient methods in terms of precision and the other metrics described in Sect. 5.2. Because the structural information of data may contain useful prior domain knowledge for training a classifier, in this paper we first present an improvement on MLTSVM [15], called the structural twin support vector machine for multi-label learning (ML-STSVM). In this method, we embed the data structure of each class into the corresponding optimization problem of MLTSVM, using the same clustering technique as S-TWSVM [21]. The least squares SVM [22] solves a system of linear equations instead of a QP problem, which improves both the generalization capability and the running speed. For this reason, we also present a least squares version of ML-STSVM (following the idea proposed in [22, 23]), called ML-SLSTSVM. In this algorithm, we modify the primal QPPs of TWSVM in a least squares sense and solve them with equality constraints instead of the inequality constraints of TWSVM.

To thoroughly evaluate the performance of the proposed approach, comparative studies over both synthetic and real-world multi-label datasets have been conducted. Experimental results show that ML-SLSTSVM achieves superior performance against several state-of-the-art multi-label learning algorithms. In summary, the main contribution of ML-SLSTSVM is a ranking-based SVM that extends the MLTSVM method [15] by incorporating the structural information of training samples to improve the generalization capability and by using the least squares idea [22] to improve both generalization capability and running speed.

This paper is organized as follows. In Sect. 2, the MLL setting and related works are introduced. Some preliminaries are given in Sect. 3. The proposed algorithm is described in detail in Sect. 4. In Sect. 5, some simulation results are discussed. Finally, the conclusions are drawn in Sect. 6.

2 Related works

Let \(X \subseteq {\mathbb{R}}^{d}\) be an instance space of d-dimensional features and \({\mathcal{Y}} = \left\{ {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{L} } \right\}\) be a finite set of labels or classes. Each instance \(x \in X\) is associated with a set of class labels \(Y \subseteq {\mathcal{Y}}\). We are given a set of training examples consisting of n instances, \(D = \left\{ {\left( {x_{1} ,Y_{1} } \right), \ldots ,\left( {x_{n} ,Y_{n} } \right)} \right\}\), where each instance \(x_{i} \in X\) is drawn independently and identically distributed (i.i.d.) from an unknown distribution \({\mathcal{D}}\) and its label set \(Y_{i} \subseteq {\mathcal{Y}}\) is known [11]. The goal of multi-label classification is to learn the categories' properties from labeled examples and find a multi-label classifier \(g:X \to 2^{{\mathcal{Y}}}\) that maps an instance x to its set of labels such that specific performance criteria are optimized. Here, \(2^{{{\mathcal{Y} }}}\) is the powerset of \({ \mathcal{Y}}\), i.e., the set of all subsets of \({ \mathcal{Y}}\). To facilitate our discussion, we summarize the major symbols and notations used in this paper in Table 1. Others are defined at their first appearance as required.

Table 1 Symbols and notations used in this paper

There is another approach to learning multi-label data, called label ranking. In label ranking, the goal is to learn to predict a total order, a ranking, of all possible labels for a new example. In other words, the task is to learn a function \(f:X \times {\mathcal{Y}} \to {\mathbb{R}}\) that, for a new instance x, assigns a real number to each label \(y \in {\mathcal{Y}}\), such that labels more relevant to x receive greater values of f. In label ranking, a threshold must also be determined, based on which relevant labels are separated from irrelevant ones. It should be noted that in this paper, the proposed method builds a label ranking model to learn multi-label data.

As mentioned in the Introduction, current MLL algorithms have been developed with two strategies: problem transformation and algorithm adaptation. In the following, we introduce some well-known works on both strategies and especially focus on the SVM-based methods used for the MLL problem.

The binary relevance (BR) approach [10] decomposes the multi-label learning task into several independent binary classification problems, one for each label in the label set. This method determines the labels of a new instance by aggregating the predictions of all the binary problems. Since the BR method does not model dependencies between labels, its predictive performance is limited.

To address this deficiency (i.e., to model label dependency), new methods have been developed. One of these methods is the label powerset (LP) method [11]. In LP, each distinct combination of labels in a dataset is considered to be a single class. Thus, LP transforms a multi-label problem into a multi-class problem with up to \(2^{\left| L \right|}\) different classes, where L is the set of labels in the original problem and \(2^{L}\) is the powerset of all labels in L. The predictive performance of LP is greater than that of BR, but its main drawback is high computational complexity (the number of label combinations grows exponentially with the number of labels).
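As a concrete illustration, the following Python sketch shows the core of the LP transformation: every distinct label subset observed in the data is mapped to one class of an ordinary multi-class problem. The function name and data layout here are illustrative only, not part of the original LP implementation.

```python
def label_powerset_transform(Y):
    """Map each distinct label subset in Y (a list of label sets) to one
    multi-class id; returns the class ids and the subset-to-class mapping."""
    subset_to_class = {}
    class_ids = []
    for labels in Y:
        key = frozenset(labels)
        if key not in subset_to_class:
            subset_to_class[key] = len(subset_to_class)
        class_ids.append(subset_to_class[key])
    return class_ids, subset_to_class

# Example: three instances over the label set {1, 2, 3}
Y = [{1, 2}, {3}, {1, 2}]
print(label_powerset_transform(Y))  # ([0, 1, 0], {frozenset({1, 2}): 0, frozenset({3}): 1})
```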

Ranking by pairwise comparison (RPC) [24] is another approach that uses the problem transformation strategy. This method learns L(L − 1)/2 binary models, one for each pair of labels. To predict a new instance with RPC, all the models are invoked and a ranking is obtained by counting the votes received by each label. This method has good predictive performance, but its time complexity is quadratic in the number of labels.

Another strategy for classifying multi-label data is algorithm adaptation. In the following, we review some works that use it. In [18], an extension of the backpropagation algorithm called backpropagation for multi-label learning (BP-MLL) is proposed. It uses multiple binary outputs as the label variables and defines a novel error function that captures the characteristics of multi-label learning. In [20], an extension of the decision tree algorithm C4.5 is presented to handle multi-label problems. It modifies the entropy to consider instances associated with multiple labels. In [19], an algorithm based on a Bayesian approach is proposed for multi-label textual data classification. It constructs a probabilistic generative model of the multiple labels associated with each document.

Multi-label k nearest neighbor (MLkNN) [16] is a k nearest-neighbor method adapted for MLL. This method classifies a new instance by voting over the labels found among its neighbors. In [25], the authors extend the extreme learning machine (ELM) technique to multi-label learning by proposing an ELM-based thresholding method. In [26], an extension of the AdaBoost algorithm called BoosTexter is presented for the MLL problem. It maintains a set of weights over both training examples and labels to handle multiple labels.

Some works on MLL are general-purpose and can be used to learn different types of data [15, 16, 27, 28]. Other works are specific-purpose and learn a specific type of data (e.g., image, sound and gene expression data) [29,30,31,32,33,34]. Many works in the literature are proposed for image classification. Since deep neural networks, and more specifically convolutional neural networks, have shown promising results in image classification, most of these MLL works have used them. Multi-label classification for image annotation [8], pedestrian attribute classification [30], facial attribute classification [31] and social image understanding [29, 32] are examples of this type of work.

Extreme multi-label learning (XML) is another problem in the area of MLL, and it has become increasingly important with the boom of big data [35]. The objective in XML is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Many existing MLL methods are not suitable for XML because of the large label space (which can contain millions of labels). Tree-based methods [36, 37] and embedding-based methods [38, 39] are two general types of methods in this area. It should be noted that our proposed method is general-purpose and does not claim to solve XML problems. In this paper, we extend SVM to handle multi-label data; therefore, in the following we introduce some SVM-based methods for MLL problems.

Many well-known approaches extend the traditional SVM to the MLL problem, e.g., RankSVM [13], RankCVM [14] and MLTSVM [15]. RankSVM is proposed by extending the multi-class SVM and uses a novel ranking loss function as its empirical loss. This method solves a quadratic programming problem to rank the label set of new instances. A drawback of RankSVM is its high computational complexity due to the large number of variables in its quadratic programming problem. To address this drawback, RankCVM was proposed. This method combines RankSVM with CVM to construct a novel SVM-type multi-label classifier that has the same optimization form as the binary CVM.

We cite [15] as the key reference for the proposed method, since we freely use some of the terminology and ideas presented in it. It proposes a novel twin support vector machine for multi-label learning called MLTSVM. This method, which is an extension of the twin SVM (TWSVM) [40], determines multiple nonparallel hyperplanes to capture the multi-label information embedded in the data. MLTSVM solves L quadratic programming problems (L is the number of labels) to obtain L nonparallel proximal hyperplanes, and each problem is similar to the binary TWSVM. Note that the optimization is done in such a way that the kth hyperplane is closer to the instances with label k and is as far as possible from the others.

3 Preliminaries

Since the proposed method is based on SVM, in this section we briefly introduce SVM and some extensions of it.

3.1 Support vector machine

The support vector machine (SVM) was originally proposed by Vapnik [41] for binary classification. SVM implements structural risk minimization (SRM), in contrast to other machine learning approaches, such as conventional artificial neural networks, which aim at empirical risk minimization (ERM). (SRM minimizes an upper bound on the generalization error, whereas ERM minimizes the error on the training data.) Recently, SVMs have been extended to solve multi-class and MLL problems.

The main idea of the SVM classifier is to obtain a linear decision boundary that maximizes a separation margin. In other words, SVM tries to find a discriminant hyperplane by solving an optimization problem to separate two classes of patterns.

We consider the problem of classifying n points in the d-dimensional real space \({\mathbb{R}}^{d}\), represented by the n × d matrix X, according to the membership of each point \(X_{i}\) in class + 1 or − 1 as specified by a given n × n diagonal matrix D with ones or minus ones along its diagonal. For this problem, the SVM discriminant function is \(d\left( x \right) = w^{'} x + b\), and it is obtained by minimizing the following objective function.

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w,b,\xi } \frac{1}{2}\left\| w \right\|^{2} + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}:D\left( {Xw + eb} \right) \ge e - \xi \quad \xi \ge 0 \\ \end{aligned}$$
(1)

where \(\xi\) is the n-dimensional vector of slack variables, c is a penalty parameter and e is a vector of ones of appropriate dimensions. Figure 1 shows the linear separation of data points using the SVM classifier. As shown in this figure, the discriminant hyperplane lies halfway between the two sets of points and has maximum margin.

Fig. 1 Linear separation of data points using SVM classifier
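For illustration, the following minimal Python sketch fits the soft-margin formulation (1) on toy data using scikit-learn's SVC (assumed to be available in the reader's environment); the parameter C plays the role of the penalty parameter c, and the toy data are entirely synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_pos = rng.randn(20, 2) + 2.0                 # class +1 points
X_neg = rng.randn(20, 2) - 2.0                 # class -1 points
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # soft-margin SVM of problem (1)
w, b = clf.coef_.ravel(), clf.intercept_[0]    # discriminant d(x) = w'x + b
print(w, b, clf.predict([[1.5, 1.5]]))
```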

The SVM training algorithm performs linear classification and builds a model that assigns new examples to one class or the other. When data points are not linearly separable, we can use the nonlinear SVM, in which data points are mapped to a higher-dimensional feature space where the examples become linearly separable. This mapping is complex, and the training time grows rapidly with the input data dimension. To tackle this drawback, we can use the kernel trick [42]. The kernel trick provides a way to work in a feature-rich space without explicitly mapping the points to a higher dimension. This is done by using an appropriate kernel function, which describes the hyperplane in higher dimensions. Therefore, the art of using nonlinear SVMs is to choose the correct kernel function, which is well suited to the underlying problem.

SVM has advantages such as good generalization capability, few restrictions on training set size and robustness to noise [43], but the main challenge of the traditional SVM is the high computational complexity of its quadratic programming problem (QPP) [43]. The computational complexity of SVM is \(O\left( {n^{3} } \right)\), where n denotes the number of training samples. This drawback restricts the application of SVM to large-scale problem domains.

3.2 Twin support vector machine

The study in [40] proposed a novel binary classifier, the twin support vector machine (TWSVM), whose training is approximately four times faster than that of SVM.

This method seeks two nonparallel hyperplanes such that each hyperplane is close to the patterns of one class and, simultaneously, far away from the patterns of the other class. The TWSVM hyperplanes have the form (2) and are obtained by minimizing the objective functions (3) and (4).

$$x^{'} w_{1} + b_{1} = 0,x^{'} w_{2} + b_{2} = 0$$
(2)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w1,b1,\xi } \frac{1}{2} \left( {Aw_{1} + eb_{1} } \right)^{'} \left( {Aw_{1} + eb_{1} } \right) + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: - ( \, Bw_{1} + eb_{1} ) \, + \xi \ge e , \xi \ge 0 \\ \end{aligned}$$
(3)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w2,b2,\xi } \frac{1}{2} \left( {Bw_{2} + eb_{2} } \right)^{'} \left( {Bw_{2} + eb_{2} } \right) + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: (Aw_{2} + eb_{2} ) \, + \xi \ge e ,\xi \ge 0 \\ \end{aligned}$$
(4)

where A and B are the data points belonging to classes + 1 and − 1, respectively, and \(\xi\) is the vector of slack variables. Figure 2 shows how the linear TWSVM separates the training data points. Once the hyperplanes are found, a new data instance is classified based on its distance to each of the two hyperplanes.

Fig. 2 Separation of training data using linear TWSVM
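The decision rule can be sketched as follows: assuming the two hyperplanes (w1, b1) and (w2, b2) have already been obtained, a new point is assigned to the class whose hyperplane is nearer in perpendicular distance. This is only a sketch of the distance-based rule described above.

```python
import numpy as np

def twsvm_predict(x, w1, b1, w2, b2):
    """Assign x to class +1 or -1 according to the nearer TWSVM hyperplane."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)   # distance to the class +1 hyperplane
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)   # distance to the class -1 hyperplane
    return +1 if d1 <= d2 else -1
```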

3.3 Structural twin SVM

Using prior information about the training data can help in designing more powerful classifiers, and the structural information of the data points is one such source of prior knowledge [44]. For this reason, some SVM-based methods have been presented that use the covariance matrices of the data. STSVM, or structural TWSVM, is one of these methods and was introduced in 2013 [21].

In this method, which is an extension of TWSVM, the derived decision hyperplanes tend to align with the direction of data dispersion. The hyperplanes have the form (5) and are obtained by solving the objective functions (6) and (7).

$$f_{ + } \left( x \right) = w_{ + }^{'} x + b_{ + } = 0 ,\quad f_{ - } \left( x \right) = w_{ - }^{'} x + b_{ - } = 0$$
(5)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{ + } ,b_{ + } ,\xi }} \frac{1}{2}\left\| {Aw_{ + } + e_{ + } b_{ + } } \right\|^{2} + c_{1} e_{ - }^{'} \xi + \frac{1}{2}c_{2} \left( {\left\| {w_{ + } } \right\|^{2} + b_{ + }^{2} } \right) + \frac{1}{2}c_{3} \left( {w_{ + }^{'} \varSigma_{ + } w_{ + } } \right) \\ & \quad {\text{s}}.{\text{t}}: - (Bw_{ + } + e_{ - } b_{ + } ) + \xi \ge e_{ - } ,\quad \xi \ge 0 \\ \end{aligned}$$
(6)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{ - } ,b_{ - } ,\xi }} \frac{1}{2}\left\| {Bw_{ - } + e_{ - } b_{ - } } \right\|^{2} + c_{4} e_{ + }^{'} \xi + \frac{1}{2}c_{5} \left( {\left\| {w_{ - } } \right\|^{2} + b_{ - }^{2} } \right) + \frac{1}{2}c_{6} \left( {w_{ - }^{'} \varSigma_{ - } w_{ - } } \right) \\ & \quad {\text{s}}.{\text{t}}: (Aw_{ - } + e_{ + } b_{ - } ) + \xi \ge e_{ + } ,\quad \xi \ge 0 \\ \end{aligned}$$
(7)

where \(c_{1} , \ldots ,c_{6}\) are the pre-specified penalty factors, \(e_{ + } ,e_{ - }\) are vectors of ones of appropriate dimensions, \(\xi\) is the vector of slack variables, \(\varSigma_{ + } = \varSigma_{p1} + \cdots + \varSigma_{pm} ,\varSigma_{ - } = \varSigma_{n1} + \cdots + \varSigma_{nk}\), and \(\varSigma_{pi}\) and \(\varSigma_{nj}\) are, respectively, the covariance matrices corresponding to the ith and jth clusters of the positive and negative classes, i = 1, …, m, j = 1, …, k.
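As a rough sketch of how the structural term can be built, the snippet below sums the within-cluster covariance matrices of one class. S-TWSVM obtains the clusters with Ward's linkage clustering [21]; here scikit-learn's Ward agglomerative clustering is used only as a simple stand-in, and the number of clusters is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def class_structure_covariance(X_class, n_clusters=2):
    """Sum of within-cluster covariance matrices of one class (the Sigma term)."""
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X_class)
    d = X_class.shape[1]
    sigma = np.zeros((d, d))
    for c in range(n_clusters):
        pts = X_class[labels == c]
        if len(pts) > 1:                      # covariance needs at least two points
            sigma += np.cov(pts, rowvar=False)
    return sigma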

3.4 Least square twin SVM

The least squares twin support vector machine (LSTSVM) is another extension of SVM that solves two modified primal problems of TWSVM, instead of the two dual problems usually solved [23]. This method requires just the solution of two systems of linear equations for both the linear and nonlinear cases, while TWSVM requires solving two quadratic programming problems (QPPs). The training time of LSTSVM is much shorter than that of TWSVM, and this advantage allows large datasets to be classified easily. The LSTSVM model can be formulated as:

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w_{1},b_{1}} \frac{1}{2} \left\| {\left( {Aw_{1} + eb_{1} } \right)} \right\|^{2} + \frac{c}{2}\xi^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: - (Bw_{1} + eb_{1} ) \, + \xi = e \end{aligned}$$
(8)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w_{2},b_{2}} \frac{1}{2} \left\| {\left( {Bw_{2} + eb_{2} } \right)} \right\|^{2} + \frac{c}{2}\xi^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: (Aw_{2} + eb_{2} ) + \xi = e \end{aligned}$$
(9)
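Substituting the equality constraints into the objectives gives the hyperplanes in closed form; for example, for (8) this yields \(\left[ {w_{1} ;b_{1} } \right] = - \left( {G^{'} G + (1/c)H^{'} H} \right)^{ - 1} G^{'} e\) with \(H = \left[ {A\;e} \right]\) and \(G = \left[ {B\;e} \right]\). The following NumPy sketch (derived here from (8), not quoted from [23]) computes the first hyperplane.

```python
import numpy as np

def lstsvm_first_hyperplane(A, B, c=1.0):
    """Closed-form solution of problem (8): with H = [A e] and G = [B e],
    substituting the equality constraint and setting the gradient to zero
    gives [w1; b1] = -(G'G + (1/c) H'H)^{-1} G'e (a derived sketch)."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])
    G = np.hstack([B, np.ones((B.shape[0], 1))])
    e = np.ones((B.shape[0], 1))
    theta = -np.linalg.solve(G.T @ G + (1.0 / c) * (H.T @ H), G.T @ e)
    return theta[:-1].ravel(), float(theta[-1, 0])   # w1, b1
```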

3.5 Multi-label twin support vector machine

As mentioned in the previous section, an SVM-based method called MLTSVM is presented in [15]. This method obtains as many proximal hyperplanes as there are labels in the dataset. Each hyperplane has the form (10), where wk and bk are the normal vector and the bias term of the kth proximal hyperplane, respectively. These hyperplanes are obtained by minimizing the objective function (11), where \(c_{k}\) and \(\lambda_{k}\) are the pre-specified penalty factors and \(\xi\) is the vector of slack variables.

$$f_{k} \left( x \right) = w^{'}_{k} x + b_{k} = 0 ,\quad k \in \left\{ {1,2, \ldots ,L} \right\}$$
(10)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{\prime}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2} \lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad {\text{s}}.{\text{t}}: - (w^{ '}_{k} x_{j} + b_{k} ) \ge 1 - \xi_{j} ,\quad \xi_{j} \ge 0 \,\forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(11)

Using the Lagrangian function and the Karush–Kuhn–Tucker (KKT) conditions, and by introducing the Lagrange multipliers α, the dual QPP of (11) can be represented as

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2} \mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I} \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} ,\quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(12)

After obtaining the solutions αj from the dual problem (12), the kth proximal hyperplane (k = 1, …, L) can be constructed according to (13).

$$\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I} \right)^{ - 1} \left( {\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j} } \right)$$
(13)

4 Proposed method

In this section, we first introduce a structural version of MLTSVM [15], called the structural twin SVM for multi-label learning (ML-STSVM), and then propose its least squares version, called the structural least squares twin support vector machine for multi-label learning (ML-SLSTSVM). Similar to MLTSVM, let us use \(I_{k}\) and \(I_{{\bar{k}}}\) to denote two complementary index sets: if the instance \(x_{i}\) is associated with the kth label, then \(i \in I_{k}\); otherwise, \(i \in I_{{\bar{k}}}\).

4.1 ML-STSVM

4.1.1 Linear ML-STSVM

For the linear case, the ML-STSVM determines a nonparallel hyperplane for each label

$$f_{k} \left( x \right) = W_{k}^{'} x + b_{k} = 0 ;\,k = 1, \ldots ,L$$
(14)

where L is the number of labels, the weight vector \(W_{k} \in {\mathbb{R}}^{d}\) determines the hyperplane’s orientation and the bias \(b_{k} \in {\mathbb{R}}\) determines the hyperplane’s offset relative to the coordinate system. Here, each hyperplane \(f_{k}\) is closer to the instances with label k and is as far as possible from the others. These nonparallel hyperplanes can be obtained by solving the following optimization problem for each k = 1, 2, …, L.

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) \\ & \quad {\text{s}}.{\text{t}}: - (w^{'}_{k} x_{j} + b_{k} ) \ge 1 - \xi_{j} , \xi_{j} \ge 0 \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(15)

where \(c_{k}\), \(\lambda_{k}\) and \(\eta_{k}\) are the pre-specified penalty factors, \(\xi_{j}\) is the slack variable of each training instance and \(\varSigma_{k} = \varSigma_{{k_{1} }} + \cdots + \varSigma_{{k_{nk} }}\), where \(\varSigma_{{k_{i} }}\) is the covariance matrix corresponding to the ith cluster in the training data associated with the kth label, i = 1, 2, …, nk.

The first term in the objective function is the sum of squared distances from the hyperplane to the points of the kth label. The second term penalizes the slack variables of the instances not associated with label k. The third term is an \(\ell_{2}\) regularization term (weighted by \(\lambda_{k}\)) that controls model complexity [45], and the last term embodies the structural information of each label.

The Lagrangian corresponding to the problem (15) is given by

$$L = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} \left( {w^{'}_{k} x_{j} + b_{k} + 1 - \xi_{j} } \right) - \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \beta_{j} \xi_{j}$$
(16)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and \(\beta = \beta_{1} , \ldots ,\beta_{{\left| {I_{{\bar{k}}} } \right|}}\) are the Lagrange multipliers. According to the KKT conditions (Karush–Kuhn–Tucker), we know that there exist Lagrange multipliers satisfying:

$$\frac{\partial L}{{\partial w_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) x_{i} + \lambda_{k} w_{k} + \eta_{k} \varSigma_{k} w_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} x_{j} \alpha_{j} = 0$$
(17)
$$\frac{\partial L}{{\partial b_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) + \lambda_{k} b_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} = 0$$
(18)
$$\frac{\partial L}{{\partial \xi_{j} }} = 0 \Rightarrow c_{k} - \alpha_{j} - \beta_{j} = 0 ; j \in I_{{\bar{k}}}$$
(19)
$$- \left( {w^{'}_{k} x_{j} + b_{k} } \right) \ge 1 - \xi_{j} ; \xi_{j} \ge 0 \quad \forall j \in I_{{\bar{k}}}$$
(20)
$$\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} \left( {w^{'}_{k} x_{j} + b_{k} + 1 - \xi_{j} } \right) = 0$$
(21)
$$\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \beta_{j} \xi_{j} = 0$$
(22)
$$\alpha_{j} \ge 0 , \beta_{j} \ge 0 ; \quad\forall j \in I_{{\bar{k}}}$$
(23)

According to (19) and (23):

$$0 \le \alpha_{j} \le c_{k}$$

Obviously, combining (17) and (18) leads to:

$$\mathop \sum \limits_{{i \in I_{k} }} \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{i}^{'} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \lambda_{k} \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \eta_{k} \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right]\alpha_{j} = 0$$
(24)

If we define \(z_{i} = \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\), \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\) and \(\theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right]\), then Eq. (24) can be rewritten as:

$$\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} \theta_{k} + \lambda_{k} \theta_{k} + \eta_{k} s_{k} \theta_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j} = 0$$
(25)
$$\Rightarrow \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)\theta_{k} = - \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j}$$
(26)
$$\Rightarrow \theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{\varvec{j} \in \varvec{I}_{{\bar{\varvec{k}}}} }} z_{j} \alpha_{j} } \right)$$
(27)

Substituting (27) and (19) into (16), we obtain the dual formulation of (15):

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2}\mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} ,\quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(28)

We can rewrite this problem in the following unified matrix form:

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{\alpha } e^{\prime}\alpha - \frac{1}{2}\alpha^{\prime}Q\alpha \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha \le c_{k} e \\ \end{aligned}$$
(29)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and Q is defined by

$$Q = G_{k} \left( {H^{'}_{k} H_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k}$$
(30)
$$G_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{{\bar{k}}} }} } & e \\ \end{array} } \right] , H_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{k} }} } & e \\ \end{array} } \right]$$
(31)

In Eq. (31), \(X_{{I_{k} }} = \{ X_{i} |i \in I_{k} \}\) (instances associated with the kth label) and \(X_{{I_{{\bar{k}}} }} = \{ X_{i} |i \in I_{{\bar{k}}} \}\) (the other instances).

After optimizing the dual problem (29) for each label k = 1, …, L and obtaining \(\alpha\), we can substitute it in Eq. (27) and construct the decision strategies (14).
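As an illustrative sketch, the routine below builds Q according to (30)-(31), solves the box-constrained dual (29) with a plain projected gradient ascent (any off-the-shelf QP solver could be substituted; the step size and iteration count are ad hoc assumptions), and then recovers \([w_{k} ;b_{k} ]\) from (27).

```python
import numpy as np

def ml_stsvm_hyperplane(X_k, X_rest, c_k, lam_k, eta_k, Sigma_k,
                        n_iter=5000, lr=1e-3):
    """One proximal hyperplane of linear ML-STSVM (a sketch of (27)-(31))."""
    H = np.hstack([X_k, np.ones((X_k.shape[0], 1))])        # rows z_i, i in I_k
    G = np.hstack([X_rest, np.ones((X_rest.shape[0], 1))])  # rows z_j, j in the complement
    m = H.shape[1]
    s_k = np.zeros((m, m)); s_k[:-1, :-1] = Sigma_k          # structural block of (25)
    M_inv = np.linalg.inv(H.T @ H + lam_k * np.eye(m) + eta_k * s_k)
    Q = G @ M_inv @ G.T                                      # eq. (30)
    alpha = np.zeros(G.shape[0])
    for _ in range(n_iter):                                  # projected gradient ascent on (29)
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, c_k)
    theta = -M_inv @ (G.T @ alpha)                           # eq. (27)
    return theta[:-1], theta[-1]                             # w_k, b_k
```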

Now, we can use these decision strategies to determine the labels of test data. For this purpose, we must determine a threshold Tk (k = 1, …, L) for each label and then assign the label k to a new instance x if the distance between x and fk(x) is less than or equal to Tk.

$$d_{k} \left( x \right) = {\text{distance}}\left( {x,f_{k} \left( x \right)} \right) = \left| {W_{k}^{ '} x + b_{k} } \right|/\left\| {W_{k} } \right\|$$
(32)

To determine the thresholds Tk, we use the idea proposed in [15]: we set Tk = T for all k = 1, …, L, where \(T = { \hbox{min} }\left( {1/\left\| {W_{k} } \right\|} \right)\), k = 1, …, L.
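A minimal sketch of this prediction rule, assuming the hyperplane parameters have already been trained:

```python
import numpy as np

def ml_stsvm_predict(x, W, b):
    """Predict the label set of x by rule (32): label k is assigned when the
    distance to the kth hyperplane is at most T = min_k 1/||W_k||.
    W is a list of weight vectors and b the corresponding list of biases."""
    norms = np.array([np.linalg.norm(w) for w in W])
    T = np.min(1.0 / norms)
    dists = np.array([abs(w @ x + bk) for w, bk in zip(W, b)]) / norms
    return [k for k, d in enumerate(dists) if d <= T]
```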

4.1.2 Nonlinear ML-STSVM

In this subsection, we extend the linear ML-STSVM to the nonlinear case by considering the following kernel generated surfaces:

$$f_{k} \left( x \right) = \varvec{U}_{\varvec{k}}^{ '} {\text{K}}\left( {\varvec{X},x} \right) + b_{k} = 0;\quad k = 1, \ldots ,L$$
(33)

where X denotes all the training instances, \(\varvec{U}_{\varvec{k}}\) is the weight vector in the kernel space and \(K\left( {.,.} \right)\) is a properly chosen kernel. The primal QPPs of the nonlinear ML-STSVM corresponding to the surfaces (33) are given in (34).

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{U_{k} ,b_{k} ,\xi_{j} }} \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {\varvec{U}^{\varvec{'}}_{k} K\left( {\varvec{X},x_{i} } \right) + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {\varvec{U}_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {\varvec{U}^{ '}_{k} \varSigma_{k}^{\varPhi } \varvec{U}_{k} } \right) \\ & \quad {\text{s}}.{\text{t}}: - \left( {\varvec{U}^{ '}_{k} K(\varvec{X},x_{j} ) + b_{k} } \right) \ge 1 - \xi_{j} , \quad \xi_{j} \ge 0 \,\forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(34)

where \(\varSigma_{k}^{\varPhi } = \varSigma_{{k_{1} }}^{\varPhi } + \cdots + \varSigma_{{k_{nk} }}^{\varPhi }\), and \(\varSigma_{{k_{i} }}^{\varPhi }\) is the covariance matrix corresponding to the ith cluster in the training data associated with the kth label, i = 1, 2, …, nk, where the clusters are obtained by the kernel Ward's linkage clustering [21]. Similar to the previous subsection, we can use Lagrange multipliers to derive the dual formulation of the QPP (34) and its solution as follows.

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2}\mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} , \quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(35)

where \(z_{i} = \left[ {\begin{array}{*{20}c} {{\rm K}(\varvec{X},x_{i} )} \\ 1 \\ \end{array} } \right]\) and \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k}^{\varPhi } } & 0 \\ 0 & 0 \\ \end{array} } \right]\).

$$\theta_{k} = \left[ {\begin{array}{*{20}c} {\varvec{U}_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{\varvec{j} \in \varvec{I}_{{\bar{k}}} }} \varvec{z}_{\varvec{j}}\varvec{\alpha}_{\varvec{j}} } \right)$$
(36)

The unified matrix form of (35) can be written as (37).

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{\alpha } e^{'} \alpha - \frac{1}{2}\alpha^{\prime}Q\alpha \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha \le c_{k} e \\ \end{aligned}$$
(37)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and Q is defined by

$$Q = G_{k} \left( {H^{'}_{k} H_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k}$$
(38)
$$G_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{{\bar{k}}} }} } \right)} & e \\ \end{array} } \right] ,\quad H_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{k} }} } \right)} & e \\ \end{array} } \right]$$
(39)

After optimizing the dual problem (37) for each label k = 1, …, L and obtaining \(\alpha\), we can substitute it into Eq. (36) and construct the decision surfaces (33). Similar to the linear case, we can use these surfaces to determine the labels of test data based on the distance between the data and the surfaces.

4.2 ML-SLSTSVM

In this subsection, we introduce the least squares version of ML-STSVM, called the structural least squares twin support vector machine for multi-label learning (ML-SLSTSVM). Following the idea of PSVM [46], the decision functions of ML-SLSTSVM are obtained extremely quickly and simply by solving the primal problems directly.

4.2.1 Linear ML-SLSTSVM

Here we modify the primal problem (15) of the linear ML-STSVM in a least squares sense, as in (40), with the inequality constraints replaced by equality constraints.

$$\begin{aligned} \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F & = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {\xi_{j} } \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) \\ & \quad \quad {\text{s}}.{\text{t}}: - \left( {w^{ '}_{k} x_{j} + b_{k} } \right) = 1 - \xi_{j} \quad \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(40)

where the variables are similar to those defined for (15). On substituting the equality constraint of (40) into its objective function, we get the following unconstrained optimization problem.

$$L = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {w^{'}_{k} x_{j} + b_{k} + 1} \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{ '}_{k} \varSigma_{k} w_{k} } \right)$$
(41)

Setting the gradient of (41) with respect to \(w_{k}\) and \(b_{k}\) to zero gives:

$$\frac{\partial L}{{\partial w_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) x_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left( {w^{ '}_{k} x_{j} + b_{k} + 1} \right) x_{j} + \lambda_{k} w_{k} + \eta_{k} \varSigma_{k} w_{k} = 0$$
(42)
$$\frac{\partial L}{{\partial b_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left( {w^{ '}_{k} x_{j} + b_{k} + 1} \right) + \lambda_{k} b_{k} = 0$$
(43)

Combining (42) and (43) leads to:

$$\mathop \sum \limits_{{i \in I_{k} }} \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{i} } & 1 \\ \end{array} } \right] \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{j} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right] + \lambda_{k} \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \eta_{k} \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right] \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = 0$$
(44)

Defining \(z_{i} = \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\), \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\) and \(\theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right]\), then Eq. (44) can be rewritten as:

$$\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} \theta_{k} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} \theta_{k} + \lambda_{k} \theta_{k} + \eta_{k} s_{k} \theta_{k} = - \,c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j}$$
(45)
$$\Rightarrow \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} + \lambda_{k} I + \eta_{k} s_{k} } \right)\theta_{k} = - \,c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j}$$
(46)

Finally, the solution of (40) can be obtained by (47)

$$\Rightarrow \theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \,c_{k} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} } \right)$$
(47)

Equation (47) can be rewritten in the following unified matrix form:

$$\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \,c_{k} \left( {H^{'}_{k} H_{k} + c_{k} G^{\prime}_{k} G_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k} e$$
(48)

where \(G_{k}\) and \(H_{k}\) are:

$$G_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{{\bar{k}}} }} } & e \\ \end{array} } \right] , H_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{k} }} } & e \\ \end{array} } \right]$$
(49)

By solving Eq. (48), we obtain the decision strategies (14) and use them, together with the threshold rule of Sect. 4.1.1, to determine the labels of test data.
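Since (48) only involves one linear system per label, the whole linear training procedure can be sketched in a few lines of NumPy. The label matrix layout and the way the structural matrices \(\varSigma_{k}\) are supplied are assumptions made for illustration only.

```python
import numpy as np

def ml_slstsvm_train_linear(X, Y, c, lam, eta, Sigma):
    """Linear ML-SLSTSVM training sketch following (48)-(49).
    X: (n, d) data, Y: (n, L) binary label matrix, Sigma: list of L structural
    matrices (e.g. summed within-cluster covariances per label, d x d each)."""
    n, d = X.shape
    L = Y.shape[1]
    Z = np.hstack([X, np.ones((n, 1))])          # augmented instances z_i = [x_i; 1]
    W, B = [], []
    for k in range(L):
        H = Z[Y[:, k] == 1]                      # instances carrying label k (I_k)
        G = Z[Y[:, k] == 0]                      # the remaining instances (I_k-bar)
        s_k = np.zeros((d + 1, d + 1)); s_k[:d, :d] = Sigma[k]
        e = np.ones((G.shape[0], 1))
        A = H.T @ H + c * (G.T @ G) + lam * np.eye(d + 1) + eta * s_k
        theta = -c * np.linalg.solve(A, G.T @ e)     # eq. (48)
        W.append(theta[:d].ravel())
        B.append(float(theta[d, 0]))
    return W, B
```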

4.2.2 Nonlinear ML-SLSTSVM

In this subsection, we propose the nonlinear ML-SLSTSVM by introducing a kernel function \(K\left( {x,x^{'} } \right) = \left( {\varPhi \left( x \right) \cdot \varPhi \left( {x^{'} } \right)} \right)\) and the corresponding transformation \(\varvec{X} = \varPhi \left( x \right)\), where \(\varvec{X} \in \varvec{H}\) and H is a Hilbert space. In order to extend our algorithm to nonlinear cases, we consider the kernel-generated surfaces (33) instead of planes.

Then the optimization problem of nonlinear ML-SLSTSVM is constructed as follows:

$$\begin{aligned} \mathop {\hbox{min} }\limits_{{\varvec{U}_{k} ,b_{k} ,\xi_{j} }} F & = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {\varvec{U}^{\varvec{'}}_{k} K\left( {\varvec{X}, x_{i} } \right) + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {\xi_{j} } \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {\varvec{U}_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad + \frac{1}{2}\eta_{k} \left( {\varvec{U}^{ '}_{k} \varSigma_{k}^{\varPhi } \varvec{U}_{k} } \right) \\ & \quad \quad {\text{s}}.{\text{t}}: - \left( {\varvec{U}^{ '}_{k} K\left( {\varvec{X},x_{j} } \right) + b_{k} } \right) = 1 - \xi_{j} \quad \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(50)

Similar to the linear case, we can obtain the unknown variables of the decision surfaces (33) (\(\varvec{U}_{k}\) and \(b_{k}\)) by substituting the equality constraint of (50) into its objective function and then differentiating the resulting function with respect to \(\varvec{U}_{k}\) and \(b_{k} .\)

$$\left[ {\begin{array}{*{20}c} {\varvec{U}_{k} } \\ {b_{k} } \\ \end{array} } \right] = - c_{k} \left( {H^{'}_{k} H_{k} + c_{k} G^{'}_{k} G_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k} e$$
(51)

where \(G_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{{\bar{k}}} }} } \right)} & e \\ \end{array} } \right]\), \(H_{k} = \left[ {\begin{array}{*{20}c} {K(\varvec{X},X_{{I_{k} }} ) } & e \\ \end{array} } \right]\) and \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k}^{\varPhi } } & 0 \\ 0 & 0 \\ \end{array} } \right]\).

After solving Eq. (51), we obtain the normal vector and the bias term of the kth proximal surface (33). Therefore, we can construct a prediction strategy similar to that of the linear case to predict the labels of unseen instances.
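A corresponding sketch for the kernel case, following (51), is given below; the Gaussian kernel of Sect. 5.3 is assumed, and the structural matrices \(\varSigma_{k}^{\varPhi }\) are supplied as precomputed (n × n) matrices. This is an illustrative reading of (51), not the exact implementation used in the experiments.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / gamma^2), the kernel used in Sect. 5.3."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma ** 2)

def ml_slstsvm_train_kernel(X, Y, c, lam, eta, Sigma_phi, gamma=1.0):
    """Nonlinear ML-SLSTSVM sketch following (51): K(X, .) replaces the raw
    features, so U_k carries one weight per training instance."""
    n = X.shape[0]
    L = Y.shape[1]
    K = gaussian_kernel(X, X, gamma)             # (n, n) kernel matrix
    Z = np.hstack([K, np.ones((n, 1))])          # rows [K(x_i, X); 1]
    U, B = [], []
    for k in range(L):
        H = Z[Y[:, k] == 1]
        G = Z[Y[:, k] == 0]
        s_k = np.zeros((n + 1, n + 1)); s_k[:n, :n] = Sigma_phi[k]
        e = np.ones((G.shape[0], 1))
        A = H.T @ H + c * (G.T @ G) + lam * np.eye(n + 1) + eta * s_k
        theta = -c * np.linalg.solve(A, G.T @ e)     # eq. (51)
        U.append(theta[:n].ravel())
        B.append(float(theta[n, 0]))
    return U, B
```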

5 Experiments and results

In this section, the datasets used are first introduced; then, the evaluation criteria for measuring the performance of MLL algorithms are described. Next, the parameter setting is described, and finally, the experimental results and a parameter sensitivity analysis are given. All experiments have been implemented in MATLAB R2010b on a personal computer (PC) with an Intel (R) Core-i5 processor (2.3 GHz) and 4 GB of random-access memory (RAM).

5.1 Benchmark datasets

To evaluate the proposed ML-SLSTSVM algorithm, we have used several synthetic and real-world datasets. All synthetic data are generated by Mldatagen according to some predefined parameters, such as the generation strategy (hyperspheres or hypercubes), the numbers of relevant, irrelevant and redundant features, the number of instances and the number of labels. We add 5% noise to the labels of each instance to make the learning tasks more challenging. The real datasets Emotion, Birds, Yeast, Flags and Medical are widely used for evaluating multi-label learning methods. These datasets represent a wide range of domains (including music, text, image and biology), sizes (from 194 to 2417), features (from 19 to 1449) and labels (from 19 to 45).

All real-world datasets (summarized in Table 3) were downloaded from the MULAN multi-label dataset repository [47]. Note that all samples are normalized such that the continuous features lie in the range [0, 1] before learning. Tables 2 and 3 show the synthetic and real datasets used in more detail.

Table 2 Synthetic dataset statistics
Table 3 Real dataset statistics

5.2 Evaluation criteria

There are different criteria for evaluating the performance of multi-label classifiers. In this paper, we use five standard criteria, which are explained in more detail below. In all criteria, n, L, \(y_{i}\) and \(\bar{y}_{i}\) denote, respectively, the number of instances, the number of labels, the set of labels relevant to the ith instance and the set of labels irrelevant to it. In addition, \(f_{y} \left( x \right)\) is a real-valued function (\(f:X \times {\mathcal{Y}} \to {\mathbb{R}}\)) that returns the confidence of y being a proper label of x, and \({\text{rank}}_{f} \left( {x,y} \right)\) returns the rank of y in \({\mathcal{Y}}\) according to the descending order induced by \(f_{.} \left( x \right)\).

5.2.1 Hamming loss

This criterion indicates the fraction of instance–label pairs that are incorrectly predicted.

$${\text{Hloss}} = \frac{1}{n \times L}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{L} \left( {h_{j} \left( {x_{i} } \right) \ne y_{i}^{j} } \right)$$
(52)

5.2.2 Ranking loss

This metric is used for ranking-based algorithms and measures the average fraction of label pairs that are ordered incorrectly, i.e., an irrelevant label is ranked higher than a relevant one.

$${\text{Rloss}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\frac{1}{{\left| {{\mathbf{y}}_{{\mathbf{i}}} } \right|\left| {{\bar{\mathbf{y}}}_{{\mathbf{i}}} } \right|}}\left| {\left\{ {\left( {y^{'} ,y^{''} } \right)|f_{{y^{'} }} \left( {x_{i} } \right) \le f_{{y^{''} }} \left( {x_{i} } \right),\left( {y^{'} ,y^{''} } \right) \in {\mathbf{y}}_{{\mathbf{i}}} \times {\bar{\mathbf{y}}}_{{\mathbf{i}}} } \right\}} \right|} \right)$$
(53)

5.2.3 Coverage

This measure evaluates how many steps are needed, on average, to move down the ranked label list so as to cover all the true labels of an instance.

$${\text{Coverage}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \mathop {\hbox{max} }\limits_{{y \in {\mathbf{y}}_{{\mathbf{i}}} }} {\text{rank}}_{f} \left( {x_{i} ,y} \right) - 1$$
(54)

5.2.4 One error

The one error evaluates the fraction of examples for which the top-ranked label computed by the classification algorithm is not in the relevant label set.

$$\begin{aligned} {\text{Oerror}} & = \left( {1/n} \right)\mathop \sum \limits_{i = 1}^{n} H\left( {x_{i} } \right) \\ H\left( {x_{i} } \right) & = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\, \mathop {\text{argmax}}\limits_{{y \in {\mathcal{Y}}}} f_{y} \left( {x_{i} } \right) \in {\mathbf{y}}_{{\mathbf{i}}} } \hfill \\ 1 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \end{aligned}$$
(55)

5.2.5 Average precision

Average precision evaluates the average fraction of relevant labels ranked higher than a particular label \(y \in {\mathbf{y}}_{{\mathbf{i}}} .\)

$${\text{A}}.{\text{precision}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\frac{1}{{\left| {{\mathbf{y}}_{{\mathbf{i}}} } \right|}}\mathop \sum \limits_{{y \in {\mathbf{y}}_{{\mathbf{i}}} }} \frac{{\left| {\left\{ {\left( {y^{'} \in {\mathbf{y}}_{{\mathbf{i}}} } \right)|{\text{rank}}_{f} \left( {x_{i} ,y^{'} } \right) \le {\text{rank}}_{f} \left( {x_{i} ,y} \right)} \right\}} \right|}}{{{\text{rank}}_{f} \left( {x_{i} ,y} \right)}}} \right)$$
(56)

For all the previous measures except average precision, smaller values indicate better performance; for average precision, larger values are better.
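For reference, the five criteria (52)-(56) can be implemented as short NumPy functions operating on a binary label matrix Y and a real-valued score matrix F (one row per instance, one column per label). These are straightforward sketches rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def hamming_loss(Y, Yhat):
    """Eq. (52): fraction of instance-label pairs predicted incorrectly."""
    return float(np.mean(Y != Yhat))

def ranking_loss(Y, F):
    """Eq. (53): average fraction of (relevant, irrelevant) pairs ordered wrongly."""
    losses = []
    for y, f in zip(Y, F):
        rel, irr = np.where(y == 1)[0], np.where(y == 0)[0]
        if len(rel) == 0 or len(irr) == 0:
            continue
        losses.append(np.mean([f[a] <= f[b] for a in rel for b in irr]))
    return float(np.mean(losses))

def coverage(Y, F):
    """Eq. (54): steps down the ranked list needed to cover all true labels."""
    ranks = (-F).argsort(1).argsort(1) + 1          # rank 1 = highest score
    covs = [ranks[i, Y[i] == 1].max() - 1 for i in range(len(Y)) if (Y[i] == 1).any()]
    return float(np.mean(covs))

def one_error(Y, F):
    """Eq. (55): fraction of instances whose top-ranked label is not relevant."""
    top = F.argmax(1)
    return float(np.mean([Y[i, top[i]] == 0 for i in range(len(Y))]))

def average_precision(Y, F):
    """Eq. (56): average fraction of relevant labels ranked above each relevant label."""
    ranks = (-F).argsort(1).argsort(1) + 1
    scores = []
    for i in range(len(Y)):
        rel = np.where(Y[i] == 1)[0]
        if len(rel) == 0:
            continue
        scores.append(np.mean([np.sum(ranks[i, rel] <= ranks[i, y]) / ranks[i, y] for y in rel]))
    return float(np.mean(scores))
```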

5.3 Parameters setting

It is clear that the performance of SVM-based classification methods depends on the choice of parameters [48]. In these simulations, cross-validation was used to find the best parameters for each dataset. Due to the better performance of the kernel version, in our experiments we only consider the Gaussian kernel function \(K\left( {x_{i} ,x_{j} } \right) = { \exp }\left( { - \left\| {x_{i} - x_{j} } \right\|^{2} /\gamma^{2} } \right)\), as it yields good generalization performance. To reduce the complexity of parameter selection, we set \(C_{i} = C ,\eta_{i} = \eta \, {\text{and}}\, \lambda_{i} = \lambda\) for all i = 1, 2, …, L. The values of the parameters (C, \(\eta\) and λ) are selected from the set \(\left\{ {2^{ - 10} ,2^{ - 9} , \ldots ,2^{9} ,2^{10} } \right\}\) by cross-validation, and \(\gamma\) is fixed to 1.
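A generic sketch of this selection procedure is shown below; `train_eval` stands for any cross-validated training/evaluation routine (a placeholder, not part of the paper) that returns a score to be minimized, e.g., ranking loss.

```python
import itertools
import numpy as np

def grid_search(train_eval, param_grid=None):
    """Pick (C, eta, lam) from the grid {2^-10, ..., 2^10} used in Sect. 5.3.
    `train_eval(C, eta, lam)` is an assumed callable returning a cross-validated
    score to minimize for the given parameter triple."""
    if param_grid is None:
        param_grid = [2.0 ** p for p in range(-10, 11)]
    best = (None, np.inf)
    for C, eta, lam in itertools.product(param_grid, repeat=3):
        score = train_eval(C, eta, lam)
        if score < best[1]:
            best = ((C, eta, lam), score)
    return best
```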

5.4 Experimental results

In order to evaluate the proposed ML-SLSTSVM, we investigate its efficiency using the above criteria on the presented datasets. In our experiments, we focus on the comparison between our ML-SLSTSVM and several state-of-the-art multi-label learning methods, including calibrated label ranking (CLR) [11], MLkNN [16], RankSVM [13] and MLTSVM [15]. Since we have used the same datasets and algorithms as in [15] to evaluate the performance of our proposed approach, we use the results reported therein.

Table 4 shows the mean and standard deviation of the tenfold cross-validation results of each classifier on the synthetic and real datasets, where the best result on each dataset is shown in boldface. To compare these methods statistically, we choose the nonlinear ML-SLSTSVM as a baseline and test whether our method is statistically better than each of the remaining four methods.

Table 4 Predictive performance of each comparing algorithm (mean + SD) on the real-world and synthetic datasets

We performed statistical significance tests using the Wilcoxon rank-sum test [49] with α = 0.05 to assess the statistical significance of the comparisons for each evaluation metric. The result of the test is presented at the end of each comparison. If the differences in the results are statistically significant, a “≪” symbol is shown in the tables. In these tables, the symbols ↑ and ↓ after the title of each evaluation metric define its expected behavior, with ↓ meaning that the lowest values are the best ones and ↑ meaning that the highest values are the best ones.

The results show that ML-SLSTSVM outperforms the other approaches in all five metrics. We can also see that the performance of the proposed algorithm is statistically significantly better than that of the other compared algorithms in terms of all evaluation metrics. For example, on the Flags dataset, our method achieves 2.8%, 4.3%, 7.7%, 2.4% and 6.3% relative improvements in terms of the five evaluation criteria over MLTSVM, which obtains the second-best results on this dataset.

Table 5 presents the average CPU time of our method and the other compared methods. We used the Wilcoxon signed-rank test, again with α = 0.05, to assess the statistical significance of the CPU time comparison. The results of the test are presented at the bottom of Table 5. From the table, we can see that, in terms of computational cost, the nonlinear ML-SLSTSVM method takes slightly more time than the MLkNN and MLTSVM methods; however, the difference is not significant.

Table 5 Average CPU time (mean + SD)

5.5 Sensitivity analysis

In the ML-SLSTSVM model, there are three main parameters. The first is C, which penalizes the classification error and controls the trade-off between allowing training errors and forcing rigid margins. The second is η, which regulates the relative importance of the structural information within the clusters. Finally, the third is λ, the regularization parameter.

In order to examine the sensitivity of these parameters with respect to the classification accuracy of ML-SLSTSVM, we conducted a set of experiments by varying one parameter and fixing the other two to the values selected by cross-validation. The results on ranking loss and average precision are illustrated in Fig. 3. (The x-axis of all charts represents the log2 of the parameter value.) ML-SLSTSVM behaved similarly on Hamming loss, one error and coverage; due to space limitations, those results are not presented here.

Fig. 3 Effect of parameters including C, λ and η on the performance of ML-SLSTSVM on the real datasets

According to the first row in Fig. 3, parameter C is the most sensitive of the three parameters in our study; hence, selecting a suitable value of C can significantly improve the performance of the algorithm. As shown in Fig. 3, as the value of C increases, performance (both ranking loss and average precision) first improves and then worsens. The best value for this parameter is \(2^{6}\) for the Flags dataset and close to \(2^{-1}\) for the other datasets.

As depicted in the second row of Fig. 3, large values of the parameter λ reduce the performance, i.e., ranking loss and average precision. More precisely, increasing the value of λ leads to a slight improvement in performance at first, which then diminishes as λ becomes larger. The decline is considerably larger for the Birds dataset than for the others. The best value for λ is \(2^{4}\) for Flags and close to \(2^{-2}\) for the other datasets. This means that the regularization term (minimization of the hyperplane parameters w and b) is important for obtaining the best classifier on the Flags dataset and is relatively ineffective for the others.

Finally, as shown in the third row of Fig. 3, different values of the parameter η do not have a significant effect on ML-SLSTSVM, except on the Birds and Flags datasets, where the average precision can increase by up to 7% with a proper selection of this parameter. This means that the structural information of these two datasets contains useful prior domain knowledge for training the classifier. In addition, since the ranking loss values are small, changing the value of η does not have much effect on them, and the ranking loss curves are fairly smooth (except for the Flags dataset). The best value of η is close to \(2^{-6}\) for the Emotion, Birds and Yeast datasets and \(2^{2}\) and \(2^{5}\) for the Medical and Flags datasets, respectively.

6 Conclusion

For the MLL problem, a new algorithm, termed ML-SLSTSVM, is proposed in this paper. This algorithm is a ranking-based SVM that extends the MLTSVM method [15] by considering the structural information of the training samples and by using the least squares idea. ML-SLSTSVM seeks a proximal hyperplane for each label, where the kth hyperplane is closer to the instances with label k and is as far as possible from the others. We only need to solve systems of linear equations for both the linear and nonlinear cases, rather than the QPPs of MLTSVM. Experiments on nine synthetic and real-world multi-label datasets show that, in terms of the evaluation metrics described in Sect. 5.2, ML-SLSTSVM outperforms several well-established multi-label learning algorithms.