1 Introduction

Multi-label learning (MLL) is a form of supervised learning in which the learning algorithm learns from a set of instances, each of which can belong to multiple classes, and must then be able to predict a set of class labels for a new instance [1]. In the past decade, this learning problem has received a lot of attention because of many real-world applications, e.g., protein function classification [2, 3], speech recognition [4, 5], music categorization [6, 7] and image annotation [8, 9]. Due to the complex nature of this type of data, learning from it is more difficult than from binary or multi-class data. MLL methods can generally be divided into the following two categories [10].

  • Problem transformation methods: These methods transform the MLL problem into one or more single-label classification problems, which can be solved using existing single-label learning methods. An essential property of problem transformation methods is that they are algorithm independent. Binary relevance [10], label powerset [11] and calibrated label ranking [12] are some of the algorithms which use problem transformation methods for MLL.

  • Algorithm adaptation methods: These methods handle MLL problems directly by adapting popular learning techniques to deal with multi-label data. The support vector machine [13,14,15], nearest neighbor [16], neural network [17, 18] and decision tree [19, 20] are some of the machine learning techniques which have been extended for MLL. The proposed method is one of the algorithm adaptation methods and extends SVM to learn multi-label data.

Despite the existing work on MLL, there is still a need to develop more efficient methods in terms of precision and the other metrics described in Sect. 5.2. Because the structural information of data may contain useful prior domain knowledge for training a classifier, in this paper we first present an improvement on MLTSVM [15], called the structural twin support vector machine for multi-label learning (ML-STSVM). In this method, we embed the data structure of each class into the corresponding optimization problem of MLTSVM, using the same clustering technique as S-TWSVM [21]. The least squares SVM [22] solves a system of linear equations instead of a QP problem, which improves both the generalization capability and the running speed. For this reason, we also present a least squares version of ML-STSVM (following the idea proposed in [22, 23]), called ML-SLSTSVM. In this algorithm, we modify the primal QPPs of TWSVM in a least squares sense and solve them with equality constraints instead of the inequality constraints of TWSVM.

To thoroughly evaluate the performance of the proposed approach, comparative studies over both synthetic and real-world multi-label datasets have been conducted. Experimental results show that ML-SLSTSVM achieves superior performance against several state-of-the-art multi-label learning algorithms. In summary, the main contribution of ML-SLSTSVM is a ranking-based SVM that extends the MLTSVM method [15] by incorporating the structural information of training samples to improve the generalization capability and by using the least squares idea [22] to improve both generalization capability and running speed.

This paper is organized as follows. In Sect. 2, the MLL setting and related works are introduced. Some preliminaries are given in Sect. 3. The proposed algorithm is described in detail in Sect. 4. In Sect. 5, some simulation results are discussed. Finally, the conclusions are drawn in Sect. 6.

2 Related works

Let \(X \subseteq {\mathbb{R}}^{d}\) be an instance space of d-dimensional features and \({\mathcal{Y}} = \left\{ {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{L} } \right\}\) be a finite set of labels or classes. Each instance \(x \in X\) is associated with a set of class labels \(Y \subseteq {\mathcal{Y}}\). We are given a set of training examples consisting of n instances, \(D = \left\{ {\left( {x_{1} ,Y_{1} } \right), \ldots ,\left( {x_{n} ,Y_{n} } \right)} \right\}\), where each instance \(x_{i} \in X\) is drawn independently and identically distributed (i.i.d.) from an unknown distribution \({\mathcal{D}}\) and its label set \(Y_{i} \subseteq {\mathcal{Y}}\) is known [11]. The goal of multi-label classification is to learn the categories' properties from labeled examples and find a multi-label classifier \(g:X \to 2^{{\mathcal{Y}}}\) that maps an instance x to its set of labels such that specific performance criteria are optimized. Here, \(2^{{{\mathcal{Y} }}}\) is the powerset of \({ \mathcal{Y}}\), i.e., the set of all subsets of \({ \mathcal{Y}}\). To facilitate our discussion, we summarize the major symbols and notations used in this paper in Table 1. Others are defined at their first appearance as required.

Table 1 Symbols and notations used in this paper

There is another approach to learning multi-label data, called label ranking. In label ranking, the goal is to learn to predict a total order, a ranking, of all possible labels for a new example. In other words, the task is to learn a function \(f:X \times {\mathcal{Y}} \to {\mathbb{R}}\) that, for a new instance x, assigns a real number to each label \(y \in {\mathcal{Y}}\), such that labels more relevant to x receive greater values of f. In label ranking, a threshold must also be determined, based on which relevant labels are separated from irrelevant ones. It should be noted that in this paper, the proposed method builds a label ranking model to learn multi-label data.

As mentioned in the Introduction, current MLL algorithms have been developed with two strategies: problem transformation and algorithm adaptation. In the following, we introduce some well-known works on both strategies and especially focus on the SVM-based methods used for the MLL problem.

The binary relevance (BR) approach [10] decomposes the multi-label learning task into several independent binary classification problems, one for each label in the label set. This method determines the labels of a new instance by aggregating the predictions of all the binary problems. Since the BR method does not model dependencies between labels, its predictive performance is limited.

To address this deficiency (i.e., to model label dependency), new methods have been developed. One of these methods is the label powerset (LP) method [11]. In LP, each distinct combination of labels in a dataset is considered to be a single class. Thus, LP transforms a multi-label problem into a multi-class problem with up to \(2^{\left| L \right|}\) different classes, where L is the set of labels in the original problem and \(2^{L}\) is the powerset of all labels in L. The predictive performance of LP is greater than that of BR, but its main drawback is high computational complexity (the number of label combinations grows exponentially with the number of labels).
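As a concrete illustration, the following Python sketch shows the core of the LP transformation: every distinct label subset observed in the data is mapped to one class of an ordinary multi-class problem. The function name and data layout here are illustrative only, not part of the original LP implementation.

```python
def label_powerset_transform(Y):
    """Map each distinct label subset in Y (a list of label sets) to one
    multi-class id; returns the class ids and the subset-to-class mapping."""
    subset_to_class = {}
    class_ids = []
    for labels in Y:
        key = frozenset(labels)
        if key not in subset_to_class:
            subset_to_class[key] = len(subset_to_class)
        class_ids.append(subset_to_class[key])
    return class_ids, subset_to_class

# Example: three instances over the label set {1, 2, 3}
Y = [{1, 2}, {3}, {1, 2}]
print(label_powerset_transform(Y))  # ([0, 1, 0], {frozenset({1, 2}): 0, frozenset({3}): 1})
```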

Ranking by pairwise comparison (RPC) [24] is another approach that uses the problem transformation strategy. This method learns L(L − 1)/2 binary models, one for each pair of labels. To predict a new instance with RPC, all the models are invoked and a ranking is obtained by counting the votes received by each label. This method has good predictive performance, but its time complexity is quadratic in the number of labels.

Another strategy for classifying multi-label data is algorithm adaptation. In the following, we review some works that use it. In [18], an extension of the backpropagation algorithm called backpropagation for multi-label learning (BP-MLL) is proposed. It uses multiple binary outputs as the label variables and defines a novel error function that captures the characteristics of multi-label learning. In [20], an extension of the decision tree algorithm C4.5 is presented to handle multi-label problems. It modifies the entropy to consider instances associated with multiple labels. In [19], an algorithm based on a Bayesian approach is proposed for multi-label textual data classification. It constructs a probabilistic generative model of the multiple labels associated with each document.

Multi-label k nearest neighbor (MLkNN) [16] is a k nearest-neighbor method adapted for MLL. This method classifies a new instance by voting over the labels found among its neighbors. In [25], the authors extend the extreme learning machine (ELM) technique to multi-label learning by proposing an ELM-based thresholding method. In [26], an extension of the AdaBoost algorithm called BoosTexter is presented for the MLL problem. It maintains a set of weights over both training examples and labels to handle multiple labels.

Some works on MLL are general-purpose and can be used to learn different types of data [15, 16, 27, 28]. Other works are specific-purpose and learn a specific type of data (e.g., image, sound and gene expression data) [29,30,31,32,33,34]. Many works in the literature are proposed for image classification. Since deep neural networks, and more specifically convolutional neural networks, have shown promising results in image classification, most of these MLL works have used them. Multi-label classification for image annotation [8], pedestrian attribute classification [30], facial attribute classification [31] and social image understanding [29, 32] are examples of this type of work.

Extreme multi-label learning (XML) is another problem in the area of MLL, and it has become increasingly important with the boom of big data [35]. The objective in XML is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Many existing MLL methods are not suitable for XML because of the large label space (which can contain millions of labels). Tree-based methods [36, 37] and embedding-based methods [38, 39] are two general types of methods in this area. It should be noted that our proposed method is general-purpose and does not claim to solve XML problems. In this paper, we extend SVM to handle multi-label data; therefore, in the following we introduce some SVM-based methods for MLL problems.

Many well-known approaches extend the traditional SVM to the MLL problem, e.g., RankSVM [13], RankCVM [14] and MLTSVM [15]. RankSVM is proposed by extending the multi-class SVM and uses a novel ranking loss function as its empirical loss. This method solves a quadratic programming problem to rank the label set of new instances. A drawback of RankSVM is its high computational complexity due to the large number of variables in its quadratic programming problem. To address this drawback, RankCVM was proposed. This method combines RankSVM with CVM to construct a novel SVM-type multi-label classifier that has the same optimization form as the binary CVM.

We cite [15] as the key reference for the proposed method, since we freely use some of the terminology and ideas presented in it. It proposes a novel twin support vector machine for multi-label learning called MLTSVM. This method, which is an extension of the twin SVM (TWSVM) [40], determines multiple nonparallel hyperplanes to capture the multi-label information embedded in the data. MLTSVM solves L quadratic programming problems (L is the number of labels) to obtain L nonparallel proximal hyperplanes, and each problem is similar to the binary TWSVM. Note that the optimization is done in such a way that the kth hyperplane is closer to the instances with label k and is as far as possible from the others.

3 Preliminaries

Since the proposed method is based on SVM, in this section we briefly introduce SVM and some extensions of it.

3.1 Support vector machine

The support vector machine (SVM) was originally proposed by Vapnik [41] for binary classification. SVM implements structural risk minimization (SRM), in contrast to other machine learning approaches, such as conventional artificial neural networks, which aim at empirical risk minimization (ERM). (SRM minimizes an upper bound on the generalization error, whereas ERM minimizes the error on the training data.) Recently, SVMs have been extended to solve multi-class and MLL problems.

The main idea of the SVM classifier is to obtain a linear decision boundary that maximizes a separation margin. In other words, SVM tries to find a discriminant hyperplane by solving an optimization problem to separate two classes of patterns.

We consider the problem of classifying n points in the d-dimensional real space \({\mathbb{R}}^{d}\), represented by the n × d matrix X, according to the membership of each point \(X_{i}\) in class + 1 or − 1 as specified by a given n × n diagonal matrix D with ones or minus ones along its diagonal. For this problem, the SVM discriminant function is \(d\left( x \right) = w^{'} x + b\), and it is obtained by minimizing the following objective function.

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w,b,\xi } \frac{1}{2}\left\| w \right\|^{2} + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}:D\left( {Xw + eb} \right) \ge e - \xi \quad \xi \ge 0 \\ \end{aligned}$$
(1)

where \(\xi\) is the n-dimensional vector of slack variables, c is a penalty parameter and e is a vector of ones of appropriate dimensions. Figure 1 shows the linear separation of data points using the SVM classifier. As shown in this figure, the discriminant hyperplane lies halfway between the two sets of points and has maximum margin.

Fig. 1 Linear separation of data points using SVM classifier
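For illustration, the following minimal Python sketch fits the soft-margin formulation (1) on toy data using scikit-learn's SVC (assumed to be available in the reader's environment); the parameter C plays the role of the penalty parameter c, and the toy data are entirely synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_pos = rng.randn(20, 2) + 2.0                 # class +1 points
X_neg = rng.randn(20, 2) - 2.0                 # class -1 points
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # soft-margin SVM of problem (1)
w, b = clf.coef_.ravel(), clf.intercept_[0]    # discriminant d(x) = w'x + b
print(w, b, clf.predict([[1.5, 1.5]]))
```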

The SVM training algorithm performs linear classification and builds a model that assigns new examples to one class or the other. When data points are not linearly separable, we can use the nonlinear SVM, in which data points are mapped to a higher-dimensional feature space where the examples become linearly separable. This mapping is complex, and the training time grows rapidly with the input data dimension. To tackle this drawback, we can use the kernel trick [42]. The kernel trick provides a way to work in a feature-rich space without explicitly mapping the points to a higher dimension. This is done by using an appropriate kernel function, which describes the hyperplane in higher dimensions. Therefore, the art of using nonlinear SVMs is to choose the correct kernel function, which is well suited to the underlying problem.

SVM has advantages such as good generalization capability, few restrictions on training set size and robustness to noise [43], but the main challenge of the traditional SVM is the high computational complexity of its quadratic programming problem (QPP) [43]. The computational complexity of SVM is \(O\left( {n^{3} } \right)\), where n denotes the number of training samples. This drawback restricts the application of SVM to large-scale problem domains.

3.2 Twin support vector machine

The study in [40] proposed a novel binary classifier, the twin support vector machine (TWSVM), whose training is approximately four times faster than that of SVM.

This method seeks two nonparallel hyperplanes such that each hyperplane is close to the patterns of one class and, simultaneously, far away from the patterns of the other class. The TWSVM hyperplanes have the form (2) and are obtained by minimizing the objective functions (3) and (4).

$$x^{'} w_{1} + b_{1} = 0,x^{'} w_{2} + b_{2} = 0$$
(2)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w1,b1,\xi } \frac{1}{2} \left( {Aw_{1} + eb_{1} } \right)^{'} \left( {Aw_{1} + eb_{1} } \right) + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: - ( \, Bw_{1} + eb_{1} ) \, + \xi \ge e , \xi \ge 0 \\ \end{aligned}$$
(3)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w2,b2,\xi } \frac{1}{2} \left( {Bw_{2} + eb_{2} } \right)^{'} \left( {Bw_{2} + eb_{2} } \right) + ce^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: (Aw_{2} + eb_{2} ) \, + \xi \ge e ,\xi \ge 0 \\ \end{aligned}$$
(4)

where A and B are the data points belonging to classes + 1 and − 1, respectively, and \(\xi\) is the vector of slack variables. Figure 2 shows how the linear TWSVM separates the training data points. Once the hyperplanes are found, a new data instance is classified based on its distance to each of the two hyperplanes.

Fig. 2 Separation of training data using linear TWSVM
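The decision rule can be sketched as follows: assuming the two hyperplanes (w1, b1) and (w2, b2) have already been obtained, a new point is assigned to the class whose hyperplane is nearer in perpendicular distance. This is only a sketch of the distance-based rule described above.

```python
import numpy as np

def twsvm_predict(x, w1, b1, w2, b2):
    """Assign x to class +1 or -1 according to the nearer TWSVM hyperplane."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)   # distance to the class +1 hyperplane
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)   # distance to the class -1 hyperplane
    return +1 if d1 <= d2 else -1
```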

3.3 Structural twin SVM

Using prior information about the training data can help in designing more powerful classifiers, and the structural information of the data points is one such source of prior knowledge [44]. For this reason, some SVM-based methods have been presented that use the covariance matrices of the data. STSVM, or structural TWSVM, is one of these methods and was introduced in 2013 [21].

In this method, which is an extension of TWSVM, the derived decision hyperplanes tend to align with the direction of data dispersion. The hyperplanes have the form (5) and are obtained by solving the objective functions (6) and (7).

$$f_{ + } \left( x \right) = w_{ + }^{'} x + b_{ + } = 0 ,\quad f_{ - } \left( x \right) = w_{ - }^{'} x + b_{ - } = 0$$
(5)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{ + } ,b_{ + } ,\xi }} \frac{1}{2}\left\| {Aw_{ + } + e_{ + } b_{ + } } \right\|^{2} + c_{1} e_{ - }^{'} \xi + \frac{1}{2}c_{2} \left( {\left\| {w_{ + } } \right\|^{2} + b_{ + }^{2} } \right) + \frac{1}{2}c_{3} \left( {w_{ + }^{'} \varSigma_{ + } w_{ + } } \right) \\ & \quad {\text{s}}.{\text{t}}: - (Bw_{ + } + e_{ - } b_{ + } ) + \xi \ge e_{ - } ,\quad \xi \ge 0 \\ \end{aligned}$$
(6)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{ - } ,b_{ - } ,\xi }} \frac{1}{2}\left\| {Bw_{ - } + e_{ - } b_{ - } } \right\|^{2} + c_{4} e_{ + }^{'} \xi + \frac{1}{2}c_{5} \left( {\left\| {w_{ - } } \right\|^{2} + b_{ - }^{2} } \right) + \frac{1}{2}c_{6} \left( {w_{ - }^{'} \varSigma_{ - } w_{ - } } \right) \\ & \quad {\text{s}}.{\text{t}}: (Aw_{ - } + e_{ + } b_{ - } ) + \xi \ge e_{ + } ,\quad \xi \ge 0 \\ \end{aligned}$$
(7)

where \(c_{1} , \ldots ,c_{6}\) are the pre-specified penalty factors, \(e_{ + } ,e_{ - }\) are vectors of ones of appropriate dimensions, \(\xi\) is the vector of slack variables, \(\varSigma_{ + } = \varSigma_{p1} + \cdots + \varSigma_{pm} ,\varSigma_{ - } = \varSigma_{n1} + \cdots + \varSigma_{nk}\), and \(\varSigma_{pi}\) and \(\varSigma_{nj}\) are, respectively, the covariance matrices corresponding to the ith and jth clusters of the positive and negative classes, i = 1, …, m, j = 1, …, k.
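As a rough sketch of how the structural term can be built, the snippet below sums the within-cluster covariance matrices of one class. S-TWSVM obtains the clusters with Ward's linkage clustering [21]; here scikit-learn's Ward agglomerative clustering is used only as a simple stand-in, and the number of clusters is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def class_structure_covariance(X_class, n_clusters=2):
    """Sum of within-cluster covariance matrices of one class (the Sigma term)."""
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X_class)
    d = X_class.shape[1]
    sigma = np.zeros((d, d))
    for c in range(n_clusters):
        pts = X_class[labels == c]
        if len(pts) > 1:                      # covariance needs at least two points
            sigma += np.cov(pts, rowvar=False)
    return sigma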

3.4 Least square twin SVM

The least squares twin support vector machine (LSTSVM) is another extension of SVM that solves two modified primal problems of TWSVM, instead of the two dual problems usually solved [23]. This method requires just the solution of two systems of linear equations for both the linear and nonlinear cases, while TWSVM requires solving two quadratic programming problems (QPPs). The training time of LSTSVM is much shorter than that of TWSVM, and this advantage allows large datasets to be classified easily. The LSTSVM model can be formulated as:

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w_{1},b_{1}} \frac{1}{2} \left\| {\left( {Aw_{1} + eb_{1} } \right)} \right\|^{2} + \frac{c}{2}\xi^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: - (Bw_{1} + eb_{1} ) \, + \xi = e \end{aligned}$$
(8)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w_{2},b_{2}} \frac{1}{2} \left\| {\left( {Bw_{2} + eb_{2} } \right)} \right\|^{2} + \frac{c}{2}\xi^{'} \xi \\ & \quad {\text{s}}.{\text{t}}: (Aw_{2} + eb_{2} ) + \xi = e \end{aligned}$$
(9)
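Substituting the equality constraints into the objectives gives the hyperplanes in closed form; for example, for (8) this yields \(\left[ {w_{1} ;b_{1} } \right] = - \left( {G^{'} G + (1/c)H^{'} H} \right)^{ - 1} G^{'} e\) with \(H = \left[ {A\;e} \right]\) and \(G = \left[ {B\;e} \right]\). The following NumPy sketch (derived here from (8), not quoted from [23]) computes the first hyperplane.

```python
import numpy as np

def lstsvm_first_hyperplane(A, B, c=1.0):
    """Closed-form solution of problem (8): with H = [A e] and G = [B e],
    substituting the equality constraint and setting the gradient to zero
    gives [w1; b1] = -(G'G + (1/c) H'H)^{-1} G'e (a derived sketch)."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])
    G = np.hstack([B, np.ones((B.shape[0], 1))])
    e = np.ones((B.shape[0], 1))
    theta = -np.linalg.solve(G.T @ G + (1.0 / c) * (H.T @ H), G.T @ e)
    return theta[:-1].ravel(), float(theta[-1, 0])   # w1, b1
```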

3.5 Multi-label twin support vector machine

As mentioned in the previous section, an SVM-based method called MLTSVM is presented in [15]. This method obtains as many proximal hyperplanes as there are labels in the dataset. Each hyperplane has the form (10), where wk and bk are the normal vector and the bias term of the kth proximal hyperplane, respectively. These hyperplanes are obtained by minimizing the objective function (11), where \(c_{k}\) and \(\lambda_{k}\) are the pre-specified penalty factors and \(\xi\) is the vector of slack variables.

$$f_{k} \left( x \right) = w^{'}_{k} x + b_{k} = 0 ,\quad k \in \left\{ {1,2, \ldots ,L} \right\}$$
(10)
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{\prime}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2} \lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad {\text{s}}.{\text{t}}: - (w^{ '}_{k} x_{j} + b_{k} ) \ge 1 - \xi_{j} ,\quad \xi_{j} \ge 0 \,\forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(11)

Using the Lagrangian function and the Karush–Kuhn–Tucker (KKT) conditions, and by introducing the Lagrange multipliers α, the dual QPP of (11) can be represented as

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2} \mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I} \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} ,\quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(12)

After obtaining the solutions αj from the dual problem (12), the kth proximal hyperplane (k = 1, …, L) can be constructed according to (13).

$$\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I} \right)^{ - 1} \left( {\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j} } \right)$$
(13)

4 Proposed method

In this section, we first introduce a structural version of MLTSVM [15], called the structural twin SVM for multi-label learning (ML-STSVM), and then propose its least squares version, called the structural least squares twin support vector machine for multi-label learning (ML-SLSTSVM). Similar to MLTSVM, let us use \(I_{k}\) and \(I_{{\bar{k}}}\) to denote two complementary index sets: if the instance \(x_{i}\) is associated with the kth label, then \(i \in I_{k}\); otherwise, \(i \in I_{{\bar{k}}}\).

4.1 ML-STSVM

4.1.1 Linear ML-STSVM

For the linear case, the ML-STSVM determines a nonparallel hyperplane for each label

$$f_{k} \left( x \right) = W_{k}^{'} x + b_{k} = 0 ;\,k = 1, \ldots ,L$$
(14)

where L is the number of labels, the weight vector \(W_{k} \in {\mathbb{R}}^{d}\) determines the hyperplane’s orientation and the bias \(b_{k} \in {\mathbb{R}}\) determines the hyperplane’s offset relative to the coordinate system. Here, each hyperplane \(f_{k}\) is closer to the instances with label k and is as far as possible from the others. These nonparallel hyperplanes can be obtained by solving the following optimization problem for each k = 1, 2, …, L.

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) \\ & \quad {\text{s}}.{\text{t}}: - (w^{'}_{k} x_{j} + b_{k} ) \ge 1 - \xi_{j} , \xi_{j} \ge 0 \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(15)

where \(c_{k}\), \(\lambda_{k}\) and \(\eta_{k}\) are the pre-specified penalty factors, \(\xi_{j}\) is the slack variable of each training instance and \(\varSigma_{k} = \varSigma_{{k_{1} }} + \cdots + \varSigma_{{k_{nk} }}\), where \(\varSigma_{{k_{i} }}\) is the covariance matrix corresponding to the ith cluster in the training data associated with the kth label, i = 1, 2, …, nk.

The first term in the objective function is the sum of squared distances from the hyperplane to the points of the kth label. The second term penalizes the slack variables of the instances not associated with label k. The third term is an \(\ell_{2}\) regularization term (weighted by \(\lambda_{k}\)) that controls model complexity [45], and the last term embodies the structural information of each label.

The Lagrangian corresponding to the problem (15) is given by

$$L = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} \left( {w^{'}_{k} x_{j} + b_{k} + 1 - \xi_{j} } \right) - \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \beta_{j} \xi_{j}$$
(16)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and \(\beta = \beta_{1} , \ldots ,\beta_{{\left| {I_{{\bar{k}}} } \right|}}\) are the Lagrange multipliers. According to the KKT conditions (Karush–Kuhn–Tucker), we know that there exist Lagrange multipliers satisfying:

$$\frac{\partial L}{{\partial w_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) x_{i} + \lambda_{k} w_{k} + \eta_{k} \varSigma_{k} w_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} x_{j} \alpha_{j} = 0$$
(17)
$$\frac{\partial L}{{\partial b_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) + \lambda_{k} b_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} = 0$$
(18)
$$\frac{\partial L}{{\partial \xi_{j} }} = 0 \Rightarrow c_{k} - \alpha_{j} - \beta_{j} = 0 ; j \in I_{{\bar{k}}}$$
(19)
$$- \left( {w^{'}_{k} x_{j} + b_{k} } \right) \ge 1 - \xi_{j} ; \xi_{j} \ge 0 \quad \forall j \in I_{{\bar{k}}}$$
(20)
$$\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} \left( {w^{'}_{k} x_{j} + b_{k} + 1 - \xi_{j} } \right) = 0$$
(21)
$$\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \beta_{j} \xi_{j} = 0$$
(22)
$$\alpha_{j} \ge 0 , \beta_{j} \ge 0 ; \quad\forall j \in I_{{\bar{k}}}$$
(23)

According to (19) and (23):

$$0 \le \alpha_{j} \le c_{k}$$

Obviously, combining (17) and (18) leads to:

$$\mathop \sum \limits_{{i \in I_{k} }} \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{i}^{'} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \lambda_{k} \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \eta_{k} \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right]\alpha_{j} = 0$$
(24)

If we define \(z_{i} = \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\), \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\) and \(\theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right]\), then Eq. (24) can be rewritten as:

$$\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} \theta_{k} + \lambda_{k} \theta_{k} + \eta_{k} s_{k} \theta_{k} + \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j} = 0$$
(25)
$$\Rightarrow \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)\theta_{k} = - \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} \alpha_{j}$$
(26)
$$\Rightarrow \theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{\varvec{j} \in \varvec{I}_{{\bar{\varvec{k}}}} }} z_{j} \alpha_{j} } \right)$$
(27)

Substituting (27) and (19) into (16), we obtain the dual formulation of (15):

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2}\mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} ,\quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(28)

We can rewrite this problem in the following unified matrix form:

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{\alpha } e^{\prime}\alpha - \frac{1}{2}\alpha^{\prime}Q\alpha \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha \le c_{k} e \\ \end{aligned}$$
(29)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and Q is defined by

$$Q = G_{k} \left( {H^{'}_{k} H_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k}$$
(30)
$$G_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{{\bar{k}}} }} } & e \\ \end{array} } \right] , H_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{k} }} } & e \\ \end{array} } \right]$$
(31)

In Eq. (31), \(X_{{I_{k} }} = \{ X_{i} |i \in I_{k} \}\) (instances associated with the kth label) and \(X_{{I_{{\bar{k}}} }} = \{ X_{i} |i \in I_{{\bar{k}}} \}\) (the other instances).

After optimizing the dual problem (29) for each label k = 1, …, L and obtaining \(\alpha\), we can substitute it in Eq. (27) and construct the decision strategies (14).
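As an illustrative sketch, the routine below builds Q according to (30)-(31), solves the box-constrained dual (29) with a plain projected gradient ascent (any off-the-shelf QP solver could be substituted; the step size and iteration count are ad hoc assumptions), and then recovers \([w_{k} ;b_{k} ]\) from (27).

```python
import numpy as np

def ml_stsvm_hyperplane(X_k, X_rest, c_k, lam_k, eta_k, Sigma_k,
                        n_iter=5000, lr=1e-3):
    """One proximal hyperplane of linear ML-STSVM (a sketch of (27)-(31))."""
    H = np.hstack([X_k, np.ones((X_k.shape[0], 1))])        # rows z_i, i in I_k
    G = np.hstack([X_rest, np.ones((X_rest.shape[0], 1))])  # rows z_j, j in the complement
    m = H.shape[1]
    s_k = np.zeros((m, m)); s_k[:-1, :-1] = Sigma_k          # structural block of (25)
    M_inv = np.linalg.inv(H.T @ H + lam_k * np.eye(m) + eta_k * s_k)
    Q = G @ M_inv @ G.T                                      # eq. (30)
    alpha = np.zeros(G.shape[0])
    for _ in range(n_iter):                                  # projected gradient ascent on (29)
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, c_k)
    theta = -M_inv @ (G.T @ alpha)                           # eq. (27)
    return theta[:-1], theta[-1]                             # w_k, b_k
```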

Now, we can use these decision strategies to determine the labels of test data. For this purpose, we must determine a threshold Tk (k = 1, …, L) for each label and then assign the label k to a new instance x if the distance between x and fk(x) is less than or equal to Tk.

$$d_{k} \left( x \right) = {\text{distance}}\left( {x,f_{k} \left( x \right)} \right) = \left| {W_{k}^{ '} x + b_{k} } \right|/\left\| {W_{k} } \right\|$$
(32)

To determine the thresholds Tk, we use the idea proposed in [15]: we set Tk = T for all k = 1, …, L, where \(T = { \hbox{min} }\left( {1/\left\| {W_{k} } \right\|} \right)\), k = 1, …, L.
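A minimal sketch of this prediction rule, assuming the hyperplane parameters have already been trained:

```python
import numpy as np

def ml_stsvm_predict(x, W, b):
    """Predict the label set of x by rule (32): label k is assigned when the
    distance to the kth hyperplane is at most T = min_k 1/||W_k||.
    W is a list of weight vectors and b the corresponding list of biases."""
    norms = np.array([np.linalg.norm(w) for w in W])
    T = np.min(1.0 / norms)
    dists = np.array([abs(w @ x + bk) for w, bk in zip(W, b)]) / norms
    return [k for k, d in enumerate(dists) if d <= T]
```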

4.1.2 Nonlinear ML-STSVM

In this subsection, we extend the linear ML-STSVM to the nonlinear case by considering the following kernel generated surfaces:

$$f_{k} \left( x \right) = \varvec{U}_{\varvec{k}}^{ '} {\text{K}}\left( {\varvec{X},x} \right) + b_{k} = 0;\quad k = 1, \ldots ,L$$
(33)

where X denotes all the training instances, \(\varvec{U}_{\varvec{k}}\) is the weight vector in the kernel space and \(K\left( {.,.} \right)\) is a properly chosen kernel. The primal QPPs of the nonlinear ML-STSVM corresponding to the surfaces (33) are given in (34).

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{U_{k} ,b_{k} ,\xi_{j} }} \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {\varvec{U}^{\varvec{'}}_{k} K\left( {\varvec{X},x_{i} } \right) + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \xi_{j} + \frac{1}{2}\lambda_{k} \left( {\left\| {\varvec{U}_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {\varvec{U}^{ '}_{k} \varSigma_{k}^{\varPhi } \varvec{U}_{k} } \right) \\ & \quad {\text{s}}.{\text{t}}: - \left( {\varvec{U}^{ '}_{k} K(\varvec{X},x_{j} ) + b_{k} } \right) \ge 1 - \xi_{j} , \quad \xi_{j} \ge 0 \,\forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(34)

where \(\varSigma_{k}^{\varPhi } = \varSigma_{{k_{1} }}^{\varPhi } + \cdots + \varSigma_{{k_{nk} }}^{\varPhi }\), and \(\varSigma_{{k_{i} }}^{\varPhi }\) is the covariance matrix corresponding to the ith cluster in the training data associated with the kth label, i = 1, 2, …, nk, where the clusters are obtained by the kernel Ward's linkage clustering [21]. Similar to the previous subsection, we can use Lagrange multipliers to derive the dual formulation of the QPP (34) and its solution as follows.

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{{\alpha_{j} }} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \alpha_{j} - \frac{1}{2}\mathop \sum \limits_{{j_{1} \in I_{{\bar{k}}} }} \mathop \sum \limits_{{j_{2} \in I_{{\bar{k}}} }} \alpha_{j1} \alpha_{j2} z^{'}_{j1} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} z_{j2} \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha_{j} \le c_{k} , \quad j \in I_{{\bar{k}}} \\ \end{aligned}$$
(35)

where \(z_{i} = \left[ {\begin{array}{*{20}c} {{\rm K}(\varvec{X},x_{i} )} \\ 1 \\ \end{array} } \right]\) and \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k}^{\varPhi } } & 0 \\ 0 & 0 \\ \end{array} } \right]\).

$$\theta_{k} = \left[ {\begin{array}{*{20}c} {\varvec{U}_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{\varvec{j} \in \varvec{I}_{{\bar{k}}} }} \varvec{z}_{\varvec{j}}\varvec{\alpha}_{\varvec{j}} } \right)$$
(36)

The unified matrix form of (35) can be written as (37).

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{\alpha } e^{'} \alpha - \frac{1}{2}\alpha^{\prime}Q\alpha \\ & \quad {\text{s}}.{\text{t}}: 0 \le \alpha \le c_{k} e \\ \end{aligned}$$
(37)

where \(\alpha = \alpha_{1} , \ldots ,\alpha_{{\left| {I_{{\bar{k}}} } \right|}}\) and Q is defined by

$$Q = G_{k} \left( {H^{'}_{k} H_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k}$$
(38)
$$G_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{{\bar{k}}} }} } \right)} & e \\ \end{array} } \right] ,\quad H_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{k} }} } \right)} & e \\ \end{array} } \right]$$
(39)

After optimizing the dual problem (37) for each label k = 1, …, L and obtaining \(\alpha\), we can substitute it into Eq. (36) and construct the decision surfaces (33). Similar to the linear case, we can use these surfaces to determine the labels of test data based on the distance between the data and the surfaces.

4.2 ML-SLSTSVM

In this subsection, we introduce the least squares version of ML-STSVM, called the structural least squares twin support vector machine for multi-label learning (ML-SLSTSVM). Following the idea of PSVM [46], the decision functions of ML-SLSTSVM are obtained extremely quickly and simply by solving the primal problems directly.

4.2.1 Linear ML-SLSTSVM

Here we modify the primal problem (15) of the linear ML-STSVM in a least squares sense, as in (40), with the inequality constraints replaced by equality constraints.

$$\begin{aligned} \mathop {\hbox{min} }\limits_{{w_{k} ,b_{k} ,\xi_{j} }} F & = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {\xi_{j} } \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad + \frac{1}{2}\eta_{k} \left( {w^{'}_{k} \varSigma_{k} w_{k} } \right) \\ & \quad \quad {\text{s}}.{\text{t}}: - \left( {w^{ '}_{k} x_{j} + b_{k} } \right) = 1 - \xi_{j} \quad \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(40)

where the variables are similar to those defined for (15). On substituting the equality constraint of (40) into its objective function, we get the following unconstrained optimization problem.

$$L = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {w^{'}_{k} x_{i} + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {w^{'}_{k} x_{j} + b_{k} + 1} \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {w_{k} } \right\|^{2} + b_{k}^{2} } \right) + \frac{1}{2}\eta_{k} \left( {w^{ '}_{k} \varSigma_{k} w_{k} } \right)$$
(41)

Setting the gradient of (41) with respect to \(w_{k}\) and \(b_{k}\) to zero gives:

$$\frac{\partial L}{{\partial w_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) x_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left( {w^{ '}_{k} x_{j} + b_{k} + 1} \right) x_{j} + \lambda_{k} w_{k} + \eta_{k} \varSigma_{k} w_{k} = 0$$
(42)
$$\frac{\partial L}{{\partial b_{k} }} = 0 \Rightarrow \mathop \sum \limits_{{i \in I_{k} }} \left( {w^{'}_{k} x_{i} + b_{k} } \right) + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left( {w^{ '}_{k} x_{j} + b_{k} + 1} \right) + \lambda_{k} b_{k} = 0$$
(43)

Combining (42) and (43) leads to:

$$\mathop \sum \limits_{{i \in I_{k} }} \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{i} } & 1 \\ \end{array} } \right] \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {x_{j} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left[ {\begin{array}{*{20}c} {x_{j} } \\ 1 \\ \end{array} } \right] + \lambda_{k} \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] + \eta_{k} \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right] \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = 0$$
(44)

Defining \(z_{i} = \left[ {\begin{array}{*{20}c} {x_{i} } \\ 1 \\ \end{array} } \right]\), \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k} } & 0 \\ 0 & 0 \\ \end{array} } \right]\) and \(\theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right]\), then Eq. (44) can be rewritten as:

$$\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} \theta_{k} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} \theta_{k} + \lambda_{k} \theta_{k} + \eta_{k} s_{k} \theta_{k} = - \,c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j}$$
(45)
$$\Rightarrow \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} + \lambda_{k} I + \eta_{k} s_{k} } \right)\theta_{k} = - \,c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j}$$
(46)

Finally, the solution of (40) can be obtained by (47)

$$\Rightarrow \theta_{k} = \left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \,c_{k} \left( {\mathop \sum \limits_{{i \in I_{k} }} z_{i} z^{'}_{i} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} z^{'}_{j} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} \left( {\mathop \sum \limits_{{j \in I_{{\bar{k}}} }} z_{j} } \right)$$
(47)

Equation (47) can be rewritten in the following unified matrix form:

$$\left[ {\begin{array}{*{20}c} {w_{k} } \\ {b_{k} } \\ \end{array} } \right] = - \,c_{k} \left( {H^{'}_{k} H_{k} + c_{k} G^{\prime}_{k} G_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k} e$$
(48)

where \(G_{k}\) and \(H_{k}\) are:

$$G_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{{\bar{k}}} }} } & e \\ \end{array} } \right] , H_{k} = \left[ {\begin{array}{*{20}c} {X_{{I_{k} }} } & e \\ \end{array} } \right]$$
(49)

By solving Eq. (48), we obtain the decision strategies (14) and use them, together with the threshold rule of Sect. 4.1.1, to determine the labels of test data.
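Since (48) only involves one linear system per label, the whole linear training procedure can be sketched in a few lines of NumPy. The label matrix layout and the way the structural matrices \(\varSigma_{k}\) are supplied are assumptions made for illustration only.

```python
import numpy as np

def ml_slstsvm_train_linear(X, Y, c, lam, eta, Sigma):
    """Linear ML-SLSTSVM training sketch following (48)-(49).
    X: (n, d) data, Y: (n, L) binary label matrix, Sigma: list of L structural
    matrices (e.g. summed within-cluster covariances per label, d x d each)."""
    n, d = X.shape
    L = Y.shape[1]
    Z = np.hstack([X, np.ones((n, 1))])          # augmented instances z_i = [x_i; 1]
    W, B = [], []
    for k in range(L):
        H = Z[Y[:, k] == 1]                      # instances carrying label k (I_k)
        G = Z[Y[:, k] == 0]                      # the remaining instances (I_k-bar)
        s_k = np.zeros((d + 1, d + 1)); s_k[:d, :d] = Sigma[k]
        e = np.ones((G.shape[0], 1))
        A = H.T @ H + c * (G.T @ G) + lam * np.eye(d + 1) + eta * s_k
        theta = -c * np.linalg.solve(A, G.T @ e)     # eq. (48)
        W.append(theta[:d].ravel())
        B.append(float(theta[d, 0]))
    return W, B
```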

4.2.2 Nonlinear ML-SLSTSVM

In this subsection, we propose the nonlinear ML-SLSTSVM by introducing a kernel function \(K\left( {x,x^{'} } \right) = \left( {\varPhi \left( x \right) \cdot \varPhi \left( {x^{'} } \right)} \right)\) and the corresponding transformation \(\varvec{X} = \varPhi \left( x \right)\), where \(\varvec{X} \in \varvec{H}\) and H is a Hilbert space. In order to extend our algorithm to nonlinear cases, we consider the kernel-generated surfaces (33) instead of planes.

Then the optimization problem of nonlinear ML-SLSTSVM is constructed as follows:

$$\begin{aligned} \mathop {\hbox{min} }\limits_{{\varvec{U}_{k} ,b_{k} ,\xi_{j} }} F & = \frac{1}{2}\mathop \sum \limits_{{i \in I_{k} }} \left\| {\left( {\varvec{U}^{\varvec{'}}_{k} K\left( {\varvec{X}, x_{i} } \right) + b_{k} } \right)} \right\|^{2} + c_{k} \mathop \sum \limits_{{j \in I_{{\bar{k}}} }} \left\| {\xi_{j} } \right\|^{2} + \frac{1}{2}\lambda_{k} \left( {\left\| {\varvec{U}_{k} } \right\|^{2} + b_{k}^{2} } \right) \\ & \quad + \frac{1}{2}\eta_{k} \left( {\varvec{U}^{ '}_{k} \varSigma_{k}^{\varPhi } \varvec{U}_{k} } \right) \\ & \quad \quad {\text{s}}.{\text{t}}: - \left( {\varvec{U}^{ '}_{k} K\left( {\varvec{X},x_{j} } \right) + b_{k} } \right) = 1 - \xi_{j} \quad \forall j \in I_{{\bar{k}}} \\ \end{aligned}$$
(50)

Similar to the linear case, we can obtain the unknown variables of the decision surfaces (33) (\(\varvec{U}_{k}\) and \(b_{k}\)) by substituting the equality constraint of (50) into its objective function and then differentiating the resulting function with respect to \(\varvec{U}_{k}\) and \(b_{k} .\)

$$\left[ {\begin{array}{*{20}c} {\varvec{U}_{k} } \\ {b_{k} } \\ \end{array} } \right] = - c_{k} \left( {H^{'}_{k} H_{k} + c_{k} G^{'}_{k} G_{k} + \lambda_{k} I + \eta_{k} s_{k} } \right)^{ - 1} G^{'}_{k} e$$
(51)

where \(G_{k} = \left[ {\begin{array}{*{20}c} {K\left( {\varvec{X},X_{{I_{{\bar{k}}} }} } \right)} & e \\ \end{array} } \right]\), \(H_{k} = \left[ {\begin{array}{*{20}c} {K(\varvec{X},X_{{I_{k} }} ) } & e \\ \end{array} } \right]\) and \(s_{k} = \left[ {\begin{array}{*{20}c} {\varSigma_{k}^{\varPhi } } & 0 \\ 0 & 0 \\ \end{array} } \right]\).

After solving Eq. (51), we obtain the normal vector and the bias term of the kth proximal surface (33). Therefore, we can construct a prediction strategy similar to that of the linear case to predict the labels of unseen instances.
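A corresponding sketch for the kernel case, following (51), is given below; the Gaussian kernel of Sect. 5.3 is assumed, and the structural matrices \(\varSigma_{k}^{\varPhi }\) are supplied as precomputed (n × n) matrices. This is an illustrative reading of (51), not the exact implementation used in the experiments.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / gamma^2), the kernel used in Sect. 5.3."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma ** 2)

def ml_slstsvm_train_kernel(X, Y, c, lam, eta, Sigma_phi, gamma=1.0):
    """Nonlinear ML-SLSTSVM sketch following (51): K(X, .) replaces the raw
    features, so U_k carries one weight per training instance."""
    n = X.shape[0]
    L = Y.shape[1]
    K = gaussian_kernel(X, X, gamma)             # (n, n) kernel matrix
    Z = np.hstack([K, np.ones((n, 1))])          # rows [K(x_i, X); 1]
    U, B = [], []
    for k in range(L):
        H = Z[Y[:, k] == 1]
        G = Z[Y[:, k] == 0]
        s_k = np.zeros((n + 1, n + 1)); s_k[:n, :n] = Sigma_phi[k]
        e = np.ones((G.shape[0], 1))
        A = H.T @ H + c * (G.T @ G) + lam * np.eye(n + 1) + eta * s_k
        theta = -c * np.linalg.solve(A, G.T @ e)     # eq. (51)
        U.append(theta[:n].ravel())
        B.append(float(theta[n, 0]))
    return U, B
```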

5 Experiments and results

In this section, the datasets used are first introduced; then, the evaluation criteria for measuring the performance of MLL algorithms are described. Next, the parameter setting is described, and finally, the experimental results and a parameter sensitivity analysis are given. All experiments have been implemented in MATLAB R2010b on a personal computer (PC) with an Intel (R) Core-i5 processor (2.3 GHz) and 4 GB of random-access memory (RAM).

5.1 Benchmark datasets

To evaluate the proposed ML-SLSTSVM algorithm, we have used several synthetic and real-world datasets. All synthetic data are generated by Mldatagen according to some predefined parameters, such as the generation strategy (hyperspheres or hypercubes), the numbers of relevant, irrelevant and redundant features, the number of instances and the number of labels. We add 5% noise to the labels of each instance to make the learning tasks more challenging. The real datasets Emotion, Birds, Yeast, Flags and Medical are widely used for evaluating multi-label learning methods. These datasets represent a wide range of domains (including music, text, image and biology), sizes (from 194 to 2417), features (from 19 to 1449) and labels (from 19 to 45).

All real-world datasets (summarized in Table 3) were downloaded from the MULAN multi-label dataset repository [47]. Note that all samples are normalized such that the continuous features lie in the range [0, 1] before learning. Tables 2 and 3 show the synthetic and real datasets used in more detail.

Table 2 Synthetic dataset statistics
Table 3 Real dataset statistics

5.2 Evaluation criteria

There are different criteria for evaluating the performance of multi-label classifiers. In this paper, we use five standard criteria, which are explained in more detail below. In all criteria, n, L, \(y_{i}\) and \(\bar{y}_{i}\) denote, respectively, the number of instances, the number of labels, the set of labels relevant to the ith instance and the set of labels irrelevant to it. In addition, \(f_{y} \left( x \right)\) is a real-valued function (\(f:X \times {\mathcal{Y}} \to {\mathbb{R}}\)) that returns the confidence of y being a proper label of x, and \({\text{rank}}_{f} \left( {x,y} \right)\) returns the rank of y in \({\mathcal{Y}}\) according to the descending order induced by \(f_{.} \left( x \right)\).

5.2.1 Hamming loss

This criterion indicates the fraction of instance–label pairs that are incorrectly predicted.

$${\text{Hloss}} = \frac{1}{n \times L}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{L} \left( {h_{j} \left( {x_{i} } \right) \ne y_{i}^{j} } \right)$$
(52)

5.2.2 Ranking loss

This metric is used for ranking-based algorithms and measures the average fraction of label pairs that are ordered incorrectly, i.e., an irrelevant label is ranked higher than a relevant one.

$${\text{Rloss}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\frac{1}{{\left| {{\mathbf{y}}_{{\mathbf{i}}} } \right|\left| {{\bar{\mathbf{y}}}_{{\mathbf{i}}} } \right|}}\left| {\left\{ {\left( {y^{'} ,y^{''} } \right)|f_{{y^{'} }} \left( {x_{i} } \right) \le f_{{y^{''} }} \left( {x_{i} } \right),\left( {y^{'} ,y^{''} } \right) \in {\mathbf{y}}_{{\mathbf{i}}} \times {\bar{\mathbf{y}}}_{{\mathbf{i}}} } \right\}} \right|} \right)$$
(53)

5.2.3 Coverage

This measure evaluates how many steps are needed, on average, to move down the ranked label list so as to cover all the true labels of an instance.

$${\text{Coverage}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \mathop {\hbox{max} }\limits_{{y \in {\mathbf{y}}_{{\mathbf{i}}} }} {\text{rank}}_{f} \left( {x_{i} ,y} \right) - 1$$
(54)

5.2.4 One error

The one error evaluates the fraction of examples for which the top-ranked label computed by the classification algorithm is not in the relevant label set.

$$\begin{aligned} {\text{Oerror}} & = \left( {1/n} \right)\mathop \sum \limits_{i = 1}^{n} H\left( {x_{i} } \right) \\ H\left( {x_{i} } \right) & = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\, \mathop {\text{argmax}}\limits_{{y \in {\mathcal{Y}}}} f_{y} \left( {x_{i} } \right) \in {\mathbf{y}}_{{\mathbf{i}}} } \hfill \\ 1 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. \\ \end{aligned}$$
(55)

5.2.5 Average precision

Average precision evaluates the average fraction of relevant labels ranked higher than a particular label \(y \in {\mathbf{y}}_{{\mathbf{i}}} .\)

$${\text{A}}.{\text{precision}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\frac{1}{{\left| {{\mathbf{y}}_{{\mathbf{i}}} } \right|}}\mathop \sum \limits_{{y \in {\mathbf{y}}_{{\mathbf{i}}} }} \frac{{\left| {\left\{ {\left( {y^{'} \in {\mathbf{y}}_{{\mathbf{i}}} } \right)|{\text{rank}}_{f} \left( {x_{i} ,y^{'} } \right) \le {\text{rank}}_{f} \left( {x_{i} ,y} \right)} \right\}} \right|}}{{{\text{rank}}_{f} \left( {x_{i} ,y} \right)}}} \right)$$
(56)

For all the previous measures except average precision, smaller values indicate better performance; for average precision, larger values are better.
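For reference, the five criteria (52)-(56) can be implemented as short NumPy functions operating on a binary label matrix Y and a real-valued score matrix F (one row per instance, one column per label). These are straightforward sketches rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def hamming_loss(Y, Yhat):
    """Eq. (52): fraction of instance-label pairs predicted incorrectly."""
    return float(np.mean(Y != Yhat))

def ranking_loss(Y, F):
    """Eq. (53): average fraction of (relevant, irrelevant) pairs ordered wrongly."""
    losses = []
    for y, f in zip(Y, F):
        rel, irr = np.where(y == 1)[0], np.where(y == 0)[0]
        if len(rel) == 0 or len(irr) == 0:
            continue
        losses.append(np.mean([f[a] <= f[b] for a in rel for b in irr]))
    return float(np.mean(losses))

def coverage(Y, F):
    """Eq. (54): steps down the ranked list needed to cover all true labels."""
    ranks = (-F).argsort(1).argsort(1) + 1          # rank 1 = highest score
    covs = [ranks[i, Y[i] == 1].max() - 1 for i in range(len(Y)) if (Y[i] == 1).any()]
    return float(np.mean(covs))

def one_error(Y, F):
    """Eq. (55): fraction of instances whose top-ranked label is not relevant."""
    top = F.argmax(1)
    return float(np.mean([Y[i, top[i]] == 0 for i in range(len(Y))]))

def average_precision(Y, F):
    """Eq. (56): average fraction of relevant labels ranked above each relevant label."""
    ranks = (-F).argsort(1).argsort(1) + 1
    scores = []
    for i in range(len(Y)):
        rel = np.where(Y[i] == 1)[0]
        if len(rel) == 0:
            continue
        scores.append(np.mean([np.sum(ranks[i, rel] <= ranks[i, y]) / ranks[i, y] for y in rel]))
    return float(np.mean(scores))
```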

5.3 Parameters setting

It is clear that the performance of SVM-based classification methods depends on the choice of parameters [48]. In these simulations, cross-validation was used to find the best parameters for each dataset. Due to the better performance of the kernel version, in our experiments we only consider the Gaussian kernel function \(K\left( {x_{i} ,x_{j} } \right) = { \exp }\left( { - \left\| {x_{i} - x_{j} } \right\|^{2} /\gamma^{2} } \right)\), as it yields good generalization performance. To reduce the complexity of parameter selection, we set \(C_{i} = C ,\eta_{i} = \eta \, {\text{and}}\, \lambda_{i} = \lambda\) for all i = 1, 2, …, L. The values of the parameters (C, \(\eta\) and λ) are selected from the set \(\left\{ {2^{ - 10} ,2^{ - 9} , \ldots ,2^{9} ,2^{10} } \right\}\) by cross-validation, and \(\gamma\) is fixed to 1.
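A generic sketch of this selection procedure is shown below; `train_eval` stands for any cross-validated training/evaluation routine (a placeholder, not part of the paper) that returns a score to be minimized, e.g., ranking loss.

```python
import itertools
import numpy as np

def grid_search(train_eval, param_grid=None):
    """Pick (C, eta, lam) from the grid {2^-10, ..., 2^10} used in Sect. 5.3.
    `train_eval(C, eta, lam)` is an assumed callable returning a cross-validated
    score to minimize for the given parameter triple."""
    if param_grid is None:
        param_grid = [2.0 ** p for p in range(-10, 11)]
    best = (None, np.inf)
    for C, eta, lam in itertools.product(param_grid, repeat=3):
        score = train_eval(C, eta, lam)
        if score < best[1]:
            best = ((C, eta, lam), score)
    return best
```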

5.4 Experimental results

In order to evaluate the proposed ML-SLSTSVM, we investigate its efficiency using the above criteria on the presented datasets. In our experiments, we focus on the comparison between our ML-SLSTSVM and several state-of-the-art multi-label learning methods, including calibrated label ranking (CLR) [11], MLkNN [16], RankSVM [13] and MLTSVM [15]. Since we have used the same datasets and algorithms as in [15] to evaluate the performance of our proposed approach, we use the results reported therein.

Table 4 shows the mean and standard deviation of the tenfold cross-validation results of each classifier on the synthetic and real datasets, where the best result on each dataset is shown in boldface. To compare these methods statistically, we choose the nonlinear ML-SLSTSVM as a baseline and test whether our method is statistically better than each of the remaining four methods.

Table 4 Predictive performance of each comparing algorithm (mean + SD) on the real-world and synthetic datasets

We performed statistical significance tests using the Wilcoxon rank-sum test [49] with α = 0.05 to assess the statistical significance of the comparisons for each evaluation metric. The result of the test is presented at the end of each comparison. If the differences in the results are statistically significant, a “≪” symbol is shown in the tables. In these tables, the symbols ↑ and ↓ after the title of each evaluation metric define its expected behavior, with ↓ meaning that the lowest values are the best ones and ↑ meaning that the highest values are the best ones.

The results show that ML-SLSTSVM outperforms the other approaches in all five metrics. We can also see that the performance of the proposed algorithm is statistically significantly better than that of the other compared algorithms in terms of all evaluation metrics. For example, on the Flags dataset, our method achieves 2.8%, 4.3%, 7.7%, 2.4% and 6.3% relative improvements in terms of the five evaluation criteria over MLTSVM, which obtains the second-best results on this dataset.

Table 5 presents the average CPU time of our method and the other compared methods. We used the Wilcoxon signed-rank test, again with α = 0.05, to assess the statistical significance of the CPU time comparison. The results of the test are presented at the bottom of Table 5. From the table, we can see that, in terms of computational cost, the nonlinear ML-SLSTSVM method takes slightly more time than the MLkNN and MLTSVM methods; however, the difference is not significant.

Table 5 Average CPU time (mean + SD)

5.5 Sensitivity analysis

In the ML-SLSTSVM model, there are three main parameters. The first is C, which penalizes the classification error and controls the trade-off between allowing training errors and forcing rigid margins. The second is η, which regulates the relative importance of the structural information within the clusters. Finally, the third is λ, the regularization parameter.

In order to examine the sensitivity of these parameters with respect to the classification accuracy of ML-SLSTSVM, we conducted a set of experiments by varying one parameter and fixing the other two to the values selected by cross-validation. The results on ranking loss and average precision are illustrated in Fig. 3. (The x-axis of all charts represents the log2 of the parameter value.) ML-SLSTSVM behaved similarly on Hamming loss, one error and coverage; due to space limitations, those results are not presented here.

Fig. 3 Effect of parameters including C, λ and η on the performance of ML-SLSTSVM on the real datasets

According to the first row in Fig. 3, parameter C is the most sensitive of the three parameters in our study; hence, selecting a suitable value of C can significantly improve the performance of the algorithm. As shown in Fig. 3, as the value of C increases, performance (both ranking loss and average precision) first improves and then worsens. The best value for this parameter is \(2^{6}\) for the Flags dataset and close to \(2^{-1}\) for the other datasets.

As depicted in the second row of Fig. 3, large values of the parameter λ reduce the performance, i.e., ranking loss and average precision. More precisely, increasing the value of λ leads to a slight improvement in performance at first, which then diminishes as λ becomes larger. The decline is considerably larger for the Birds dataset than for the others. The best value for λ is \(2^{4}\) for Flags and close to \(2^{-2}\) for the other datasets. This means that the regularization term (minimization of the hyperplane parameters w and b) is important for obtaining the best classifier on the Flags dataset and is relatively ineffective for the others.

Finally, as shown in the third row of Fig. 3, different values of the parameter η do not have a significant effect on ML-SLSTSVM, except on the Birds and Flags datasets, where the average precision can increase by up to 7% with a proper selection of this parameter. This means that the structural information of these two datasets contains useful prior domain knowledge for training the classifier. In addition, since the ranking loss values are small, changing the value of η does not have much effect on them, and the ranking loss curves are fairly smooth (except for the Flags dataset). The best value of η is close to \(2^{-6}\) for the Emotion, Birds and Yeast datasets and \(2^{2}\) and \(2^{5}\) for the Medical and Flags datasets, respectively.

6 Conclusion

For the MLL problem, a new algorithm, termed ML-SLSTSVM, is proposed in this paper. This algorithm is a ranking-based SVM that extends the MLTSVM method [15] by considering the structural information of the training samples and by using the least squares idea. ML-SLSTSVM seeks a proximal hyperplane for each label, where the kth hyperplane is closer to the instances with label k and is as far as possible from the others. We only need to solve systems of linear equations for both the linear and nonlinear cases, rather than the QPPs of MLTSVM. Experiments on nine synthetic and real-world multi-label datasets show that, in terms of the evaluation metrics described in Sect. 5.2, ML-SLSTSVM outperforms several well-established multi-label learning algorithms.