Abstract
The multi-task learning support vector machines (SVMs) have recently attracted considerable attention, since conventional single task learning methods usually ignore the relatedness among multiple related tasks and train them separately. Different from single task learning, multi-task learning methods can capture the correlation among tasks and achieve improved performance by training all tasks simultaneously. In this paper, we make two assumptions on the relatedness among tasks. One is that the normal vectors of the related tasks share a certain common parameter value; the other is that the models of the related tasks are close enough and share a common model. Under these assumptions, we propose two multi-task learning methods, named MTL-aLS-SVM I and MTL-aLS-SVM II respectively, for binary classification, which take full advantage of multi-task learning and the asymmetric squared loss. MTL-aLS-SVM I seeks a trade-off between the maximal expectile distance for each task model and the closeness of each task model to the averaged model. MTL-aLS-SVM II can use different kernel functions for different tasks and is an extension of MTL-aLS-SVM I. Both can be easily implemented by solving a quadratic program. In addition, we develop their special cases, which include L2-SVM based multi-task learning methods (MTL-L2-SVM I and MTL-L2-SVM II) and least squares SVM (LS-SVM) based multi-task learning methods (MTL-LS-SVM I and MTL-LS-SVM II). Although MTL-L2-SVM II and MTL-LS-SVM II appear in the form of special cases, they are first proposed in this paper. The experimental results show that the proposed methods are very encouraging.
1 Introduction
Multi-task learning, an important and ongoing issue in machine learning, has attracted growing attention in many areas, such as multi-level analysis [1], semi-supervised learning [2], medical diagnosis [3], speech recognition [4], web search ranking [5], and cell biology [6]. The basic idea of multi-task learning is to obtain satisfactory performance on each task by simultaneously learning multiple tasks with underlying relatedness [7, 8]. Different from single task learning, multi-task learning shares useful knowledge among multiple tasks, which helps to improve generalization performance. Determining the relatedness among the multiple tasks is important for establishing the formulations of multi-task learning approaches [9,10,11]. Although single task learning methods have been applied successfully in many areas, they train each task independently and ignore the potential relatedness among tasks, which may reduce prediction accuracy. When there are correlations between tasks, it is more reasonable to learn all tasks simultaneously rather than separately [7].
The regularized multi-task learning methods proposed by Evgeniou and Pontil [12, 13] generalize kernel-based methods from single task learning to multi-task learning. Recently, the multi-task learning strategy has been applied in evolutionary algorithms [6], deep neural networks [14], pattern recognition [15], support vector machines, and so on. In particular, the multi-task SVM is a powerful machine learning tool, and a large body of literature reveals that SVM-based multi-task learning methods are effective when the related tasks are trained simultaneously [16,17,18,19,20,21,22,23,24]. Yang et al. [17] presented a one-class SVM-based multi-task learning method by constraining the solutions of multiple tasks to be close to each other, and the resulting formulation is a conic program [16]. He et al. [18] proposed an improved SVM-based multi-task learning method for one-class classification under the assumption that the parameter value of each task model is close to a mean value [12]. A general formulation with the ability to employ different kernels for different tasks was then proposed under the assumption that the models of different tasks are close enough [19]. Sun et al. established a multi-task multi-class SVM approach using constrained optimization instead of decomposition methods, which can learn both label-compatible and label-incompatible scenarios [20, 21]. Based on the LS-SVM [25], Xu et al. developed a multi-task LS-SVM that makes use of the advantages of LS-SVM and multi-task learning [22]. Li et al. proposed a multi-task proximal SVM with looser constraints to improve the training speed [23]. Song et al. proposed a novel formulation for multi-task learning by extending the relative margin machine (RMM) to the multi-task learning paradigm [24].
As an important part of machine learning, the SVM has been widely studied in many fields, such as multi-class classification [26], feature selection [27], multi-instance multi-label learning [28], and the nonparallel least square support vector machine (NLSSVM) [29]. A wide spectrum of successful applications shows that the SVM is an advanced classifier. As is well known, the loss function plays a key role in the SVM, and different support vector approaches can be established by using the corresponding loss functions [25, 30,31,32,33,34,35]. Typical loss functions include the hinge loss, the least squared loss, and the insensitive loss. All of these functions are convex and convenient for calculation and theoretical analysis. Recently, a novel asymmetric squared loss function and the corresponding asymmetric least squares SVM (aLS-SVM) were proposed by Huang et al. [31]. Compared with the LS-SVM, the aLS-SVM is more flexible since it introduces the expectile value in the asymmetric squared loss function. The aLS-SVM has the advantage of considerable robustness to noise around the decision boundary and stability under re-sampling.
In this paper, we propose two aLS-SVM based multi-task learning methods and their special cases by integrating the merits of multi-task learning and the asymmetric squared loss function. We first make the assumption as in [12, 18, 20, 22, 23] that the normal vector of the hyperplane corresponding to each task is expressed as the sum of a certain common vector and a private vector, and establish the new method MTL-aLS-SVM I. We prove that the new method strikes a balance between the maximal expectile distance for each task model and the closeness of each task model to the averaged model. Then, we relax the assumption and suppose that each task model is expressed as the sum of a common model and a private model, and establish the second multi-task learning method MTL-aLS-SVM II. Compared with MTL-aLS-SVM I, MTL-aLS-SVM II is more flexible as it can use different kernel functions for different tasks. These two new methods can be easily implemented by solving quadratic programming and simultaneously obtain the decision functions for all tasks. In addition, we also present their special cases: LS-SVM based multi-task learning methods (denoted correspondingly by MTL-LS-SVM I [22] and MTL-LS-SVM II) and L2-SVM based multi-task learning methods (denoted correspondingly by MTL-L2-SVM I and MTL-L2-SVM II). The special cases MTL-LS-SVM II and MTL-L2-SVM II are also our newly proposed methods. We compare these multi-task learning methods with several related effective single-task learning methods including aLS-SVM [31], LS-SVM, L2-SVM, and NLSSVM [29]. The experimental results verify the effectiveness of our proposed multi-task learning methods.
In summary, by incorporating the properties of the multi-task learning and the asymmetric squared loss function, the advantages of our proposed methods are:
- To have a good ability to process multi-task learning problems directly;
- To have the potential to capture the relatedness among multiple related tasks;
- To effectively exploit different kernel functions for different tasks;
- To be more flexible by using the asymmetric squared loss function;
- To be easily implemented by solving quadratic programming.
We organize the rest of this paper as follows. A brief introduction of the aLS-SVM is given in Section 2. Then we detail the MTL-aLS-SVM I and MTL-aLS-SVM II formulations in Section 3. Meanwhile, we give their corresponding special cases in this section. In Section 4, we evaluate the proposed methods by the numerical experiments. Finally, we conclude the paper in Section 5.
2 The aLS-SVM
The asymmetric least squares support vector machine (aLS-SVM) [31] is proposed based on the following asymmetric squared loss function:

$$ L_{\rho}(r) = \begin{cases} \rho\, r^{2}, & r \geq 0, \\ (1-\rho)\, r^{2}, & r < 0, \end{cases} \qquad (1) $$

where ρ (0 ≤ ρ ≤ 1) is the expectile value. Unlike the general SVMs, the aLS-SVM maximizes the expectile distance instead of the minimal distance between the two classes and solves the following optimization problem:
where ζ is the error variable vector; ϕ(⋅) is a nonlinear mapping from the input space \(\mathbb {R}^{d}\) into the feature space \(\mathbb {R}^{h}\); C is the regularization parameter. According to the asymmetric squared loss function (1), the optimization problem (2) can be equivalently written as
Compared with the usual SVMs, the aLS-SVM is robust to noise around the decision boundary and stable to re-sampling because of the maximization of the expectile distance. It is also an extension of L2-SVM and LS-SVM [25]. More details about the aLS-SVM can be seen in [31].
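For illustration, the asymmetric squared loss with expectile parameter ρ can be sketched as follows (a minimal Python sketch; the function name is ours, and we assume the standard expectile form of the loss from [31]):

```python
import numpy as np

def asym_squared_loss(r, rho):
    """Asymmetric squared loss: rho * r^2 if r >= 0, (1 - rho) * r^2 if r < 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r >= 0, rho * r ** 2, (1.0 - rho) * r ** 2)

# rho = 1 recovers the squared hinge loss (negative residuals cost nothing),
# and rho = 1/2 recovers the least squared loss r^2 / 2.
```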
3 The aLS-SVM based multi-task learning formulations
In this section, we propose two aLS-SVM based multi-task learning methods, MTL-aLS-SVM I and MTL-aLS-SVM II, according to different task relatedness assumptions. Meanwhile, we develop two types of special cases of these two multi-task learning methods. In the multi-task learning scenario, we are given N different but related tasks. For each task k, we have \(m_{k}\) training data \(\{(\boldsymbol{x}_{ki}, y_{ki})\}_{i=1}^{m_{k}}\), where \({{\boldsymbol {x}}_{ki}} \in {{\mathbb {R}}^{d}}\) and \(y_{ki} \in \{1, -1\}\). Thus, we have \(m={\sum }_{k = 1}^{N}m_{k}\) training data in total. Our aim is to learn the N decision functions (hyperplanes), one for each task, simultaneously.
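As a concrete illustration of this data layout, the N tasks can be stored as a list of per-task arrays (a hypothetical sketch; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 3 hypothetical tasks with task-specific sample sizes m_k
# and a shared input dimension d.
d = 5
tasks = []
for m_k in (40, 60, 50):
    X_k = rng.normal(size=(m_k, d))        # inputs x_ki in R^d
    y_k = rng.choice([-1, 1], size=m_k)    # binary labels y_ki in {1, -1}
    tasks.append((X_k, y_k))

# Total number of training data m = sum over k of m_k.
m = sum(X_k.shape[0] for X_k, _ in tasks)
```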
3.1 MTL-aLS-SVM I
In the light of the method presented in [12], when the related tasks share a common vector \(\boldsymbol{\omega}_{0}\), the normal vector \({\boldsymbol {\omega }_{k}}\in \mathbb {R}^{h}\) for the specific task k can be expressed as \(\boldsymbol{\omega}_{k} = \boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}\), where \(\boldsymbol{\upsilon}_{k}\) represents the private information of task k. Under this assumption, we formulate the primal optimization problem of MTL-aLS-SVM I as follows.
where \({\boldsymbol {Z}_{k}} = \left (y_{k1}\phi (\boldsymbol {x}_{k1}), y_{k2}\phi (\boldsymbol {x}_{k2}), \cdots , y_{k{m_{k}}}\phi (\boldsymbol {x}_{k{m_{k}}})\right ) \in \mathbb {R}^{h \times {m_{k}}}\) with ϕ(⋅) having the same meaning as in (3); \({\boldsymbol {y}_{k}} = {(y_{k1}, y_{k2}, \cdots , y_{k{m_{k}}})^{T}}\); \(\boldsymbol {\zeta }_{k} = (\zeta _{k1}, \zeta _{k2}, \cdots , \zeta _{k{m_{k}}})^{T} \in \mathbb {R}^{{m_{k}}}\) is the slack variable vector for task k; \(\boldsymbol {e}_{m_{k}}=(1, 1, \cdots , 1)^{T} \in \mathbb {R}^{m_{k}}\); \(C_{1}\) and \(C_{2}\) are positive regularization parameters. We introduce \(C_{1}\) to control the trade-off between the public classification information \(\boldsymbol{\omega}_{0}\) and the dissimilarity among all tasks. Specifically, a larger \(C_{1}\) forces MTL-aLS-SVM I to train a common model, while a smaller \(C_{1}\) makes MTL-aLS-SVM I learn each task model independently. It can be seen from (4) that the N different tasks are trained simultaneously through the shared public classification information.
The Lagrangian of the primal problem (4) is
where \(\boldsymbol {\alpha }_{k}=(\alpha _{k1}, \alpha _{k2}, \cdots , \alpha _{k{m_{k}}})^{T}\) and \(\boldsymbol{\beta}_{k} = (\beta_{k1}, \beta_{k2}, \cdots, \beta_{k{m_{k}}})^{T}\) are the nonnegative Lagrange multiplier vectors. By differentiating the Lagrangian with respect to \(\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, \boldsymbol{\zeta}_{k}, b_{k}\) based on the Karush-Kuhn-Tucker (KKT) conditions, we get the following equations:
which shows that \(\boldsymbol{\omega}_{0}\) is a linear combination of the \(\boldsymbol{\upsilon}_{k}\). Since \(\boldsymbol{\omega}_{k} = \boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}\), we further have
Expressing \(\boldsymbol{\omega}_{0}\) and \(\boldsymbol{\upsilon}_{k}\) in terms of \(\boldsymbol{\omega}_{k}\), we get the following equivalent form of the objective function of the primal problem (4) (for the proof of (12), see the Appendix).
where \(\bar {\boldsymbol {\omega }}=\frac {1}{N}{\sum }_{k = 1}^{N} {\boldsymbol {\omega }_{k}}\) is the mean vector of \(\boldsymbol{\omega}_{1}, \cdots, \boldsymbol{\omega}_{N}\), and \(\tau _{1}=\frac {C_{1}}{1+C_{1}N},\,\tau _{2}=\frac {{C^{2}_{1}} N}{1+C_{1} N}\). It is shown by (12) and the constraints of (4) that the newly proposed MTL-aLS-SVM I seeks a trade-off between the maximal expectile distance for each task model and the closeness of each task model to the averaged model.
Substituting (6)–(9) into the Lagrangian (5), we get the following dual form of the primal problem (4):
where \(\boldsymbol {\alpha }=({\boldsymbol {\alpha }^{T}_{1}}, {\boldsymbol {\alpha }^{T}_{2}}, \cdots , {\boldsymbol {\alpha }^{T}_{N}})^{T}\) and \(\boldsymbol {\beta }=({\boldsymbol {\beta }^{T}_{1}}, {\boldsymbol {\beta }^{T}_{2}}, \cdots, {\boldsymbol {\beta }^{T}_{N}})^{T}\). By setting \(\boldsymbol{\lambda}_{k} = \boldsymbol{\alpha}_{k} - \boldsymbol{\beta}_{k}\), we rewrite (13) as
where \({\boldsymbol {\lambda }_{k}}=(\lambda _{k1}, \lambda _{k2},\cdots , \lambda _{km_{k}})^{T} \). Furthermore, the objective function of (14) can be rewritten as
where
Denote \(\boldsymbol {\lambda }_{k}^{*}, k = 1,\cdots , N\) as the optimal solutions of the above optimization problem. Then the decision function for task k can be obtained as
where K(⋅,⋅) is a kernel function, and the optimal value \({b^{*}_{k}}\) can be obtained by the following equations:
where \({\boldsymbol {\lambda }} = \left ({{\boldsymbol {\lambda }^{T}_{1}}}, {{\boldsymbol {\lambda }^{T}_{2}}}, \cdots , {{\boldsymbol {\lambda }^{T}_{N}}}\right )^{T}\); \(\boldsymbol {Z} = ({\boldsymbol {Z}_{1}}, {\boldsymbol {Z}_{2}}, \cdots , {\boldsymbol {Z}_{N}}) \in \mathbb {R}^{h \times m}\), and \(\boldsymbol{Z}_{ki}\) is the i-th column of \(\boldsymbol{Z}_{k}\).
3.2 MTL-aLS-SVM II
Next, we present another formulation under the assumption that all tasks share a common model, and every task function \(f_{k}\) can be expressed as the sum of a common function \(h_{0}\) and a private function \(h_{k}\):
where \(\boldsymbol{\omega}_{0}\) and \(\phi_{0}\) are the normal vector and nonlinear feature mapping for the common model, respectively, and \(\boldsymbol{\upsilon}_{k}\) and \(\phi_{k}\) are those for the private model. We denote the offset \(b_{0} + b_{k}\) by \(b_{k}\) for simplicity. Obviously, \(\phi_{0}\) and the \(\phi_{k}\) for different tasks k can be different nonlinear mappings, so MTL-aLS-SVM II extends MTL-aLS-SVM I, in which only one nonlinear transformation is employed. If \(\phi_{0} = \phi_{k}\), then MTL-aLS-SVM II reduces to MTL-aLS-SVM I.
We establish MTL-aLS-SVM II by solving the following optimization problem:
where \(\tilde {{\boldsymbol {Z}}}_{k}=(y_{k1}\phi _{0}(\boldsymbol {x}_{k1}), y_{k2}\phi _{0} (\boldsymbol {x}_{k2}), \cdots , y_{k{m_{k}}}\phi _{0}(\boldsymbol {x}_{k{m_{k}}})) \in \mathbb {R}^{h \times {m_{k}}}\); \(\boldsymbol{A}_{k} = (y_{k1}\phi_{k}(\boldsymbol{x}_{k1}), y_{k2}\phi_{k}(\boldsymbol{x}_{k2}), \cdots, y_{k{m_{k}}}\phi _{k} (\boldsymbol {x}_{k{m_{k}}}))\in \mathbb {R}^{h \times {m_{k}}}\); and \(\boldsymbol{\zeta}_{k}\), \(\boldsymbol{y}_{k}\), \(\boldsymbol {e}_{m_{k}}\), \(C_{1}\), and \(C_{2}\) have the same meanings as in formula (4).
The Lagrangian function of the above optimization problem is
where \(\boldsymbol {\alpha }_{k}\,=\,(\alpha _{k1}, \alpha _{k2}, \cdots , \alpha _{k{m_{k}}})^{T} \) and β k = (β k1,β k2,⋯, \(\beta _{k{m_{k}}})^{T}\) are the nonnegative Lagrange multiplier vectors. According to the KKT condition, we get the following equations:
By substituting (22)–(25) into (21), we obtain the following dual program of (20):
where \(\boldsymbol {\alpha }=({\boldsymbol {\alpha }^{T}_{1}}, {\boldsymbol {\alpha }^{T}_{2}}, \cdots , {\boldsymbol {\alpha }^{T}_{N}})^{T}\) and \(\boldsymbol {\beta }=({\boldsymbol {\beta }^{T}_{1}}, {\boldsymbol {\beta }^{T}_{2}}\), \(\cdots , {\boldsymbol {\beta }^{T}_{N}})^{T}\).
Setting \(\boldsymbol{\lambda}_{k} = \boldsymbol{\alpha}_{k} - \boldsymbol{\beta}_{k}\), we get the equivalent form of (26):
where \(K_{0}(\cdot,\cdot)\) and \(K_{k}(\cdot,\cdot)\) \((k = 1,2,\cdots,N)\) are the kernel functions. Comparing program (27) with (14) (notice (15)) shows that MTL-aLS-SVM II and MTL-aLS-SVM I are equivalent if \(K_{0} = K_{k}\). Therefore, MTL-aLS-SVM II is an extension of MTL-aLS-SVM I.
Denote \(\boldsymbol {\lambda }_{k}^{*}, k = 1,\cdots ,N\) as the optimal solutions of the above optimization problem. Then the decision function for task k can be obtained as
where the optimal value \({b^{*}_{k}}\) can be obtained by the following equations:
3.3 The special cases
In this subsection, we develop two kinds of special cases of MTL-aLS-SVM I and MTL-aLS-SVM II for multi-task learning. Recall that the shape of the asymmetric squared loss function (1) is closely related to the value of ρ. When ρ = 1, the asymmetric squared loss (1) reduces to the squared hinge loss \(L(r) = \max(r, 0)^{2}\). Accordingly, MTL-aLS-SVM I and MTL-aLS-SVM II reduce to the L2-SVM based multi-task learning methods (denoted by MTL-L2-SVM I and MTL-L2-SVM II, respectively).
MTL-L2-SVM I:
where \(\boldsymbol{Z}_{k}\), \(\boldsymbol{\zeta}_{k}\), \(C_{1}\), \(C_{2}\) and \(\boldsymbol {e}_{m_{k}}\) have the same meanings as in formula (4). By the KKT conditions, the dual of the above optimization problem can be obtained as
MTL-L2-SVM II:
where \(\tilde {\boldsymbol {Z}}_{k}\), A k have the same meanings as in (20). The dual form of the above optimization problem is
On the other hand, when ρ = 1/2, the asymmetric squared loss (1) reduces to the least squared loss \( L_{\rho }(r) = \frac {1}{2}r^{2} \). MTL-aLS-SVM I and MTL-aLS-SVM II then become the least squares SVM based multi-task learning methods (denoted by MTL-LS-SVM I and MTL-LS-SVM II, respectively).
MTL-LS-SVM I [22]:
The optimization problem (34) can be solved by the following linear system:
where \(\boldsymbol{D} = \mathrm{blockdiag}(\boldsymbol{y}_{1},\boldsymbol{y}_{2},\cdots,\boldsymbol{y}_{N})\), the positive definite matrix \(\boldsymbol {H}=\boldsymbol {\Omega }+\frac {1}{C_{2}}\boldsymbol I_{m}+\frac {1}{C_{1}}\boldsymbol B \in \mathbb {R}^{m \times m}\), \(\boldsymbol {\Omega }=\boldsymbol {Z}^{T}\boldsymbol {Z}\in \mathbb {R}^{m \times m}\) with \(\boldsymbol{Z} = (\boldsymbol{Z}_{1},\boldsymbol{Z}_{2},\cdots,\boldsymbol{Z}_{N})\), and \(\boldsymbol{B} = \mathrm{blockdiag}(\boldsymbol{\Omega}_{1},\boldsymbol{\Omega}_{2},\cdots,\boldsymbol{\Omega}_{N})\in \mathbb {R}^{m \times m}\) with \(\boldsymbol {\Omega }_{k}={\boldsymbol {Z}_{k}}^{T}\boldsymbol {Z}_{k} \in \mathbb {R}^{m_{k} \times m_{k}}\).
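A kernelized sketch of this linear system is given below. The matrices H, Ω, B and D follow the definitions above; the block layout of the KKT system (a zero block, D and its transpose, and H, with right-hand side (0, e)) is our assumption based on the standard LS-SVM derivation, so this is an illustrative sketch rather than a definitive implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """K(x, z) = exp(-sigma * ||x - z||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * d2)

def mtl_ls_svm_train(tasks, C1, C2, sigma):
    """Sketch of MTL-LS-SVM I training via one linear solve.

    H = Omega + I/C2 + B/C1, with Omega = Z^T Z over all tasks and
    B = blockdiag(Omega_1, ..., Omega_N), as defined in the text.
    The KKT block system [[0, D^T], [D, H]] [b; lambda] = [0; e]
    is our assumption from the standard LS-SVM derivation.
    """
    X = np.vstack([X_k for X_k, _ in tasks])
    y = np.concatenate([y_k for _, y_k in tasks])
    sizes = [len(y_k) for _, y_k in tasks]
    m, N = len(y), len(tasks)

    Omega = np.outer(y, y) * gaussian_kernel(X, X, sigma)
    B = np.zeros_like(Omega)          # block-diagonal (per-task) part of Omega
    D = np.zeros((m, N))              # D = blockdiag(y_1, ..., y_N)
    start = 0
    for k, m_k in enumerate(sizes):
        sl = slice(start, start + m_k)
        B[sl, sl] = Omega[sl, sl]
        D[sl, k] = y[sl]
        start += m_k

    H = Omega + np.eye(m) / C2 + B / C1
    A = np.block([[np.zeros((N, N)), D.T], [D, H]])
    rhs = np.concatenate([np.zeros(N), np.ones(m)])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N:]           # biases b_k and multipliers lambda
```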
The efficiency of MTL-LS-SVM I has been verified by comparing it with several other multi-task learning methods [22]. More details can be found in [22].
MTL-LS-SVM II:
The optimization problem (36) can be solved by the following linear system:
where \(\boldsymbol{D} = \mathrm{blockdiag}(\boldsymbol{y}_{1},\boldsymbol{y}_{2},\cdots,\boldsymbol{y}_{N})\), the positive definite matrix \(\boldsymbol {H}=\tilde {\boldsymbol {\Omega }}+\frac {1}{C_{2}}\boldsymbol I_{m}+\frac {1}{C_{1}}\tilde {\boldsymbol B} \in \mathbb {R}^{m \times m}\), \(\tilde {\boldsymbol {\Omega }}=\tilde {\boldsymbol {Z}}^{T}\tilde {\boldsymbol {Z}}\in \mathbb {R}^{m \times m}\) with \(\tilde {\boldsymbol {Z} }= ({\tilde {\boldsymbol {Z}}_{1}}, {\tilde {\boldsymbol {Z}}_{2}}, \cdots , {\tilde {\boldsymbol {Z}}_{N}})\), and \(\tilde {\boldsymbol B}=\mathrm{blockdiag}(\boldsymbol {\Theta }_{1},\boldsymbol {\Theta }_{2},\cdots, \boldsymbol{\Theta}_{N})\in \mathbb {R}^{m \times m}\) with \(\boldsymbol {\Theta }_{k}={\boldsymbol A_{k}}^{T}{\boldsymbol A_{k}} \in \mathbb {R}^{m_{k} \times m_{k}}\).
4 Experiments
To verify the effectiveness of the newly proposed multi-task learning methods, we conduct experiments to compare the newly proposed multi-task learning methods with the strategy that all of the N tasks are learned independently by employing aLS-SVM [31], L2-SVM [32], LS-SVM [25], and nonparallel least square SVM (NLSSVM) [29]. And the corresponding single task learning methods are denoted as N-aLS-SVM, N-L2-SVM, N-LS-SVM, and N-NLSSVM respectively. All the experiments are carried out in MATLAB R2014a on a personal computer (PC) with an Intel(R) Core(TM) i7 processor (3.40 GHz) and 4GB random access memory (RAM).
We test these methods on a collection of three benchmark datasets, Isolet, Monk, and Dermatology, from the UCI Machine Learning Repository. The Isolet dataset, gathered from 150 subjects speaking the 26 English letters twice, consists of 7797 instances with 617 attributes (three instances were historically lost). The speakers are divided into five subsets of equal size, known as Isolet1 to Isolet5, and each subset is treated as one classification task. On one hand, the five tasks are closely related because they are gathered from the same utterances [11, 20]. On the other hand, the five tasks differ from each other because speakers in different groups vary in the way they pronounce the English letters. We classify three pairs of similar sounding letters, namely (B, D), (G, J), and (M, N), in our experiments. For the (B, D) and (G, J) pairs, there are 600 instances in total in the five tasks of each pair; for the (M, N) pair, there are 599 instances in total in the five tasks. We employ principal component analysis (PCA) on the chosen datasets to remove low variance noise. We reduce the number of attributes from 617 to 200, which captures 97.5% of the data variance.
The Monk dataset, with 432 instances, was the basis of a first international comparison of learning algorithms. It is divided into 3 subsets based on the characteristics of 6 attributes. The subsets are referred to as Monk1, Monk2, and Monk3, corresponding to the three tasks.
The Dermatology dataset is a collection of 366 differential diagnoses covering six kinds of dermatological diseases, based on 33 clinicopathological characteristics. As in [8, 22], the problem can be converted into six binary one-versus-rest classification problems, each regarded as a task. Therefore, we have six tasks in total.
In our experiments, the Gaussian kernel \(K(\boldsymbol{x}_{i},\boldsymbol{x}_{j}) = \exp(-\sigma\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2})\) is employed in the first multi-task learning method MTL-aLS-SVM I. For the second multi-task learning method MTL-aLS-SVM II, there are two basic kernel functions: \(K_{0}\) in the common model and \(K_{k}\) in the private model (27). We test our method with two different combinations. In the first, \(K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) = \langle\boldsymbol{x}_{ki}, \boldsymbol{x}_{jr}\rangle\) is a linear kernel and \(K_{k}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) = \exp(-\sigma\|\boldsymbol{x}_{ki}-\boldsymbol{x}_{jr}\|^{2})\) is a Gaussian kernel; in the second, \(K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) = \langle\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}\rangle\) is a linear kernel and \(K_{k}(\boldsymbol{x}_{ki}, \boldsymbol{x}_{jr}) = (\langle\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}\rangle + 1)^{d}\) is a polynomial kernel with d > 1.
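The three kernels used in the experiments can be written compactly as follows (a small sketch; the function names are ours):

```python
import numpy as np

def linear_kernel(X1, X2):
    """K0(x, z) = <x, z>."""
    return X1 @ X2.T

def gaussian_kernel(X1, X2, sigma):
    """K(x, z) = exp(-sigma * ||x - z||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * d2)

def polynomial_kernel(X1, X2, d):
    """K(x, z) = (<x, z> + 1)^d."""
    return (X1 @ X2.T + 1.0) ** d

# "L + G": K0 linear for the common model, K_k Gaussian for each private model;
# "L + P": K0 linear, K_k polynomial with degree d > 1.
```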
Generally speaking, the performance of the algorithms relies on the selection of parameters. There are four (five) tuning parameters in MTL-aLS-SVM I (MTL-aLS-SVM II): \(C_{1}\), \(C_{2}\), σ, d (in MTL-aLS-SVM II), and ρ. The first three (four) parameters are the same as those of MTL-L2-SVM I and MTL-LS-SVM I (MTL-L2-SVM II and MTL-LS-SVM II). The parameter ρ in MTL-aLS-SVM I and MTL-aLS-SVM II controls the shape of the loss function. In our experiments, as in [31], we set ρ = 0.99, 0.95, 0.83. The parameter ranges are \(C_{1} \in \{2^{-7},2^{-5},\cdots,2^{5}\}\), \(C_{2} \in \{2^{-6},2^{-4},\cdots,2^{8}\}\), \(\sigma \in \{2^{-7},2^{-5},\cdots,2^{5}\}\), and \(d \in \{2,3,\cdots,9\}\). For the single task learning methods N-aLS-SVM, N-L2-SVM, and N-LS-SVM, apart from the kernel parameters σ and d, the optimal tuning parameter C is chosen from \(\{2^{-6},2^{-4},\cdots,2^{8}\}\). For N-NLSSVM, apart from the kernel parameters σ and d, there are two tuning parameters \(c_{1}\) and \(c_{2}\) with the same range \(\{2^{-7},2^{-6},\cdots,2^{8}\}\). For each dataset, the attributes are scaled to [−1, 1]. About 55% of the instances are randomly chosen from the whole dataset to constitute the training set, and the rest form the testing set. Five-fold cross validation is used on the training set to find the optimal parameters, and then the classification accuracy on the testing set is obtained. This process is repeated ten times, and the “Accuracy” in the following tables is the mean of the ten testing results.
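The parameter selection protocol described above (grids of powers of two, five-fold cross validation) can be sketched as follows for a generic training routine; `train_fn` is a hypothetical placeholder for any of the methods, and the helper names are ours.

```python
import itertools
import numpy as np

# Parameter grids from the text (odd powers of two for C1 and sigma, even for C2).
C1_grid = [2.0 ** p for p in range(-7, 6, 2)]      # {2^-7, 2^-5, ..., 2^5}
C2_grid = [2.0 ** p for p in range(-6, 9, 2)]      # {2^-6, 2^-4, ..., 2^8}
sigma_grid = [2.0 ** p for p in range(-7, 6, 2)]   # {2^-7, 2^-5, ..., 2^5}

def five_fold_cv_accuracy(train_fn, X, y, params, n_folds=5, seed=0):
    """Average validation accuracy over n_folds folds for one parameter setting.

    train_fn(X_tr, y_tr, params) is a hypothetical training routine that
    returns a callable predict(X) producing labels in {1, -1}.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        predict = train_fn(X[tr], y[tr], params)
        accs.append(np.mean(predict(X[val]) == y[val]))
    return float(np.mean(accs))

def grid_search(train_fn, X, y):
    """Return (best accuracy, best parameters) over the grids above."""
    best_acc, best_params = -1.0, None
    for C1, C2, sigma in itertools.product(C1_grid, C2_grid, sigma_grid):
        params = {"C1": C1, "C2": C2, "sigma": sigma}
        acc = five_fold_cv_accuracy(train_fn, X, y, params)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_acc, best_params
```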
In Tables 1, 2, 3, 4 and 5, “Accuracy±S” denotes the averaged classification accuracy plus or minus the standard deviation. “L”, “G”, and “P” represent Linear kernel, Gaussian kernel, and Polynomial kernel, respectively. In the second kind of multi-task learning methods, “L + G” represents that K 0 is the linear kernel function and K k is the Gaussian kernel function; and “L + P” means that K 0 is the linear kernel function and K k is the Polynomial kernel function. The best result among all the methods for each dataset is highlighted.
As can be seen from Tables 1, 2 and 3, for the Isolet dataset, the accuracies of the multi-task learning methods are in general much higher than those of the single task learning methods. Specifically, for the (B, D) pair, the highest accuracy was achieved by MTL-aLS-SVM II using the combination of the linear kernel and the polynomial kernel. For the (G, J) pair, the multi-task learning method MTL-L2-SVM II with the linear and Gaussian kernel combination achieved the best accuracy. For the (M, N) pair, the best accuracy was obtained by the multi-task learning method MTL-L2-SVM II with the linear and polynomial kernel combination.
For the Monk dataset, Table 4 shows that MTL-aLS-SVM I and MTL-aLS-SVM II achieve better performance than the other multi-task learning methods and the single task learning methods, and MTL-aLS-SVM II obtains the best accuracy among all of the multi-task and single task learning methods. In addition, the performance of the single task learning methods depends heavily on the choice of kernel function, whereas the multi-task learning methods are less sensitive to it.
For the Dermatology dataset, it can be seen from Table 5 that the accuracies obtained by MTL-aLS-SVM I and MTL-aLS-SVM II are slightly lower than the highest accuracy, which is obtained by the single task learning method L2-SVM. The same phenomenon occurs in [8] for MTL-FEAT (RBF) and independent (RBF); Argyriou et al. conjecture, supported by their numerical experiments, that the relatedness among these tasks is weak [8]. As in [8, 22], the results in Table 5 indicate that the newly proposed multi-task learning methods can still achieve good performance even in such a case.
In addition, it has been shown by Tables 1, 2, 3, 4 and 5 that the results obtained by our proposed multi-task learning methods are better than those reported by the multi-task RMM algorithm in [24].
Further, we employ the non-parametric Friedman test with its corresponding Nemenyi post-hoc test [36] to perform a fairer comparison of all the involved algorithms on the employed UCI datasets. For simplicity, only the best accuracy of each algorithm is considered. Table 6 reports the ranks of “Accuracy” of all the involved algorithms on the employed UCI datasets, where each algorithm is represented by its abbreviation; for example, “MTL-aL I” denotes “MTL-aLS-SVM I”.
Let \(R_{i}\) denote the average rank of the i-th algorithm in Table 6. The Friedman statistic, which is distributed according to \(\mathcal {X}^{2}_{F}\) with (K − 1) degrees of freedom, and the statistic \(\mathcal {F}_{F}\), which is distributed according to the \(\mathcal {F}\)-distribution with (K − 1) and (K − 1)(N − 1) degrees of freedom, can be calculated as \(\mathcal {X}^{2}_{F}=\frac {12N}{K(K + 1)}\left [\sum \limits ^{K}_{i = 1}{R^{2}_{i}}-\frac {K(K + 1)^{2}}{4}\right ]= 21.1418\) and \(\mathcal {F}_{F} =\frac {(N-1)\mathcal {X}^{2}_{F}}{N(K-1)-\mathcal {X}^{2}_{F}}= 3.5446\), where N = 5 and K = 10. According to the table of critical values, \(\mathcal {F}_{\alpha = 0.1}(10,5)= 1.811<3.5446\), so we reject the null hypothesis. For further pairwise comparison, we resort to the Nemenyi post-hoc test. For α = 0.1, the critical difference is \(CD=\mathcal {F}_{\alpha = 0.1}(10,5)\sqrt {\frac {K(K + 1)}{6N}}= 3.4678\). The performance of two algorithms is significantly different if their average ranks differ by at least the critical difference. Based on Table 6, the differences between MTL-aL II and the other algorithms can be calculated as follows:
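The two statistics can be computed from the average ranks in Table 6 as follows (a sketch; the function name is ours):

```python
import numpy as np

def friedman_statistics(avg_ranks, n_datasets):
    """Friedman chi^2_F and its F-distributed refinement F_F [36]."""
    R = np.asarray(avg_ranks, dtype=float)
    K, N = len(R), n_datasets
    chi2_f = 12.0 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (K - 1) - chi2_f)
    return chi2_f, f_f

# With the paper's chi^2_F = 21.1418 for N = 5 datasets and K = 10 algorithms,
# the second formula gives F_F = 4 * 21.1418 / (45 - 21.1418), about 3.54.
```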
where d(a − b) denotes the difference between a and b. We thus obtain the following conclusion: on the employed UCI datasets, MTL-aLS-SVM II performs significantly better than all the single task learning methods (N-aLS-SVM, N-L2-SVM, N-LS-SVM, and N-NLSSVM) and the multi-task learning method MTL-LS-SVM I, and there are no significant differences between MTL-aLS-SVM II and MTL-aLS-SVM I, MTL-L2-SVM I, MTL-L2-SVM II, or MTL-LS-SVM II.
In the next part of the experiments, we demonstrate the influence of the parameter \(C_{1}\) in the multi-task learning methods MTL-aLS-SVM I and MTL-aLS-SVM II (formulations (4) and (20)), which trades off the public classification information and the dissimilarity between tasks. For this purpose, we compare MTL-aLS-SVM I, MTL-aLS-SVM II (both MTL-aLS-SVM II (L+G) and MTL-aLS-SVM II (L+P)), N-aLS-SVM, and 1-aLS-SVM (the method that employs one aLS-SVM for all tasks, with all tasks regarded as one big task). We take ρ = 0.95 as an example and conduct the experiments on the (B, D) pair of the Isolet dataset. The variations of the averaged accuracy of the multi-task learning methods on each task with the value of \(C_{1}\) are illustrated in Figs. 1, 2 and 3. For comparison, the averaged accuracies obtained by the single task learning methods N-aLS-SVM and 1-aLS-SVM with the linear kernel, polynomial kernel, and Gaussian kernel are also illustrated in Figs. 1, 2 and 3, respectively. Note that since the N-aLS-SVM and 1-aLS-SVM models do not contain the parameter \(C_{1}\), their averaged accuracies are not affected by its variation. The “Accuracy” in the three figures denotes the averaged accuracy.
The three figures show that when the value of \(C_{1}\) is small, the accuracies of MTL-aLS-SVM I and MTL-aLS-SVM II (L+P) are close to those of the conventional independent learning strategy N-aLS-SVM. When the value of \(C_{1}\) is large, the performance of MTL-aLS-SVM I and MTL-aLS-SVM II (L+P) is in line with that of 1-aLS-SVM. The accuracy of MTL-aLS-SVM II (L+G), in contrast, varies little, and it maintains good performance across the range of \(C_{1}\) values.
In addition, it is interesting to see that the averaged accuracies of N-aLS-SVM are always lower than those of 1-aLS-SVM. The reason is that the small number of training data per task provides less information for N-aLS-SVM. On the other hand, 1-aLS-SVM cannot deal with label-incompatible datasets (for example, the Monk and Dermatology datasets). MTL-aLS-SVM I and MTL-aLS-SVM II, however, can obtain good performance with proper values of \(C_{1}\), since they can learn the correlation between tasks and thus exploit more information.
5 Conclusion
In this paper, we have proposed the multi-task learning methods MTL-aLS-SVM I and MTL-aLS-SVM II, together with their special cases, for binary classification. MTL-aLS-SVM I combines the advantages of multi-task learning and the asymmetric least squares support vector machine. MTL-aLS-SVM II is an extension of MTL-aLS-SVM I which adopts the assumption that the models of related tasks share a common model. A regularization parameter \(C_{1}\) is introduced in MTL-aLS-SVM I and MTL-aLS-SVM II to seek a trade-off between the public information and the private information dedicated to each specific task. In addition, the special cases MTL-L2-SVM II and MTL-LS-SVM II are also newly proposed multi-task learning methods, and they exhibit good performance. We have conducted comprehensive experiments to test the performance of the newly proposed methods and the influence of the regularization parameter \(C_{1}\). The experimental results show that our methods are more effective than the corresponding single task learning methods. Additionally, our methods are flexible due to the introduction of the parameter \(C_{1}\). When there is relatedness among the tasks, a proper value of \(C_{1}\) can be selected so that the methods achieve good performance; on the other hand, if the tasks are independent, a small value of \(C_{1}\) makes the methods learn the tasks independently.
Multi-task learning is mainly designed to explore latent information by learning all tasks jointly. Another approach of renewed interest that exploits underlying information to improve traditional inductive learning is Learning Using Privileged Information (LUPI) [37, 38]. Our future work is to extend the proposed multi-task learning methods to the LUPI paradigm.
References
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Bi J, Xiong T, Yu S, Dundar M, Rao RB (2008) An improved multi-task learning approach with applications in medical diagnosis. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases–Part I, Antwerp, Belgium, pp 117–132
Birlutiu A, Groot P, Heskes T (2010) Multi-task preference learning with an application to hearing aid personalization. Neurocomputing 73:1177–1185
Chapelle O, Shivaswamy P, Vadrevu S, Weinberger K, Zhang Y, Tseng B (2010) Multi-task learning for boosting with application to web search ranking. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 1189–1198
Ren Y, Xu B, Zhu P (2016) A multi-cell visual tracking algorithm using multi-task particle swarm optimization for low-contrast image sequences. Appl Intell 45(4):1129–1147
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272
Ben-David S, Schuller R (2003) Exploiting task relatedness for multiple task learning. In: Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, Washington DC, pp 567–580
Ben-David S, Borbely RS (2008) A notion of task relatedness yielding provable multiple-task learning guarantees. Mach Learn 73(3):273–287
Parameswaran S, Weinberger KQ (2010) Large margin multi-task metric learning. Adv Neural Inf Process Syst 23:1867–1875
Evgeniou T, Pontil M (2004) Regularized multi-task learning. In: Tenth ACM SIGKDD International Conference on Knowledge discovery and data mining, Seattle, pp 109–117
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Li X, Zhao L, Wei L, Yang MH, Wu F, Zhuang Y, Ling H, Wang J (2016) DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE Trans Image Process 25(8):3919–3930
Yan Y, Ricci E, Subramanian R, Liu G, Lanz O, Sebe N (2016) A multi-task learning framework for head pose estimation under target motion. IEEE Trans Pattern Anal Mach Intell 38(6):1070–1083
Kato T, Kashima H, Sugiyama M, Asai K (2008) Multi-task learning via conic programming. In: Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, pp 737–744
Yang H, King I, Lyu MR (2010) Multi-task learning for one-class classification. In: Proceedings of the International Joint Conference on Neural Networks, Barcelona, pp 1–8
He X, Mourot G, Maquin D, Ragot J, Beauseroy P, Smolarz A, Grall-Maes E (2011) One-class SVM in multi-task learning. In: Advances in Safety, Reliability and Risk Management. ESREL 2011, Troyes, pp 486–494
He X, Mourot G, Maquin D, Ragot J, Beauseroy P, Smolarz A, Grall-Maes E (2014) Multi-task learning with one-class SVM. Neurocomputing 133:416–426
Ji Y, Sun S (2013) Multitask multiclass support vector machines: model and experiments. Pattern Recogn 46(3):914–924
Ji Y, Sun S, Lu Y (2012) Multitask multiclass privileged information support vector machines. In: Proceedings of the twenty-first international conference on pattern recognition, pp 2323–2326
Xu S, An X, Qiao X, Zhu L (2014) Multi-task least-squares support vector machines. Multimed Tools Appl 71(2):699–715
Li Y, Tian X, Song M, Song MG, Tao DC (2015) Multi-task proximal support vector machine. Pattern Recogn 48(10):3249–3257
Song YY, Zhu WX (2016) Multi-task support vector machine for data classification. Image Process Pattern Recogn 9(7):341–350
Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore
Maldonado S, López J (2017) Robust kernel-based multiclass support vector machines via second-order cone programming. Appl Intell 46:983–992
Le Thi HA, Pham Dinh T, Thiao M (2016) Efficient approaches for l2-l0 regularization and applications to feature selection in SVM. Appl Intell 45:549–565
Li C, Zhang Y, Lu L (2015) An MIMLSVM algorithm based on ECC. Appl Intell 42:537–543
Zhao J, Yang Z, Xu Y (2016) Nonparallel least square support vector machine for classification. Appl Intell 45:1119–1128
Huang X, Shi L, Suykens JAK (2014) Support vector machine classifier with pinball loss. IEEE Trans Pattern Anal Mach Intell 36(5):984–997
Huang X, Shi L, Suykens JAK (2014) Asymmetric least squares support vector machine classifiers. Comput Stat Data Anal 70: 395–405
Vapnik V (1995) The nature of statistical learning theory. Springer-Verlag, New York
Wang KN, Zhu WX, Zhong P (2015) Robust support vector regression with generalized loss function and applications. Neural Process Lett 41:89–106
Wang KN, Zhong P (2014) Robust non-convex least squares loss function for regression with outliers. Knowl-Based Syst 71:290–302
Zhong P (2012) Training robust support vector regression with smooth non-convex loss function. Optim Methods Softw 27(6):1039–1058
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Vapnik V, Vashist A (2009) A new learning paradigm: learning using privileged information. Neural Netw 22(5):544–557
Zhu WX, Zhong P (2014) A new one-class SVM based on hidden information. Knowl-Based Syst 60:35–43
Acknowledgments
The work is supported by the National Natural Science Foundation of China (No.11171346) and Chinese Universities Scientific Fund No. 2017LX003.
Appendix: The proof of (12)
Substituting (11) into the objective function of (4), we have
where \(\bar{\boldsymbol{\omega}}=\frac{1}{N}\sum_{t=1}^{N}\boldsymbol{\omega}_{t}\), \(\tau_{1}=\frac{C_{1}}{1+C_{1}N}\), and \(\tau_{2}=\frac{C_{1}^{2}N}{1+C_{1}N}\). Noticing that \(\tau_{1}+\tau_{2}=C_{1}\), \(\tau_{2}=\tau_{1}C_{1}N\), and \(\tau_{2}=(1+C_{1}N)\tau_{1}^{2}N\), the above equation can be calculated as follows.
Therefore, the proof of (12) is completed.
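The three identities used in the proof follow directly from the definitions of \(\tau_{1}\) and \(\tau_{2}\); as a quick independent sanity check, they can be verified numerically for arbitrary values of C_1 and N:

```python
# Numerical check of the identities used in the proof of (12):
#   tau1 + tau2 = C1,  tau2 = tau1 * C1 * N,  tau2 = (1 + C1*N) * tau1^2 * N
# where tau1 = C1 / (1 + C1*N) and tau2 = C1^2 * N / (1 + C1*N).
for C1 in (0.1, 1.0, 10.0):
    for N in (2, 5, 20):
        tau1 = C1 / (1 + C1 * N)
        tau2 = C1**2 * N / (1 + C1 * N)
        assert abs(tau1 + tau2 - C1) < 1e-9
        assert abs(tau2 - tau1 * C1 * N) < 1e-9
        assert abs(tau2 - (1 + C1 * N) * tau1**2 * N) < 1e-9
```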
Lu, L., Lin, Q., Pei, H. et al. The aLS-SVM based multi-task learning classifiers. Appl Intell 48, 2393–2407 (2018). https://doi.org/10.1007/s10489-017-1087-9