1 Introduction

Multi-task learning, an important and ongoing topic in machine learning, has attracted growing attention in many areas, such as multi-level analysis [1], semi-supervised learning [2], medical diagnosis [3], speech recognition [4], web search ranking [5], and cell biology [6]. The basic idea of multi-task learning is to obtain satisfactory performance for each task by simultaneously learning multiple tasks with underlying relatedness [7, 8]. Different from single task learning, multi-task learning shares useful knowledge among the tasks, which helps to improve generalization performance, and determining the relatedness among the tasks is central to formulating multi-task learning approaches [9,10,11]. Although single task learning methods have been applied successfully in many areas, they train each task independently and ignore the potential relatedness among tasks, which may reduce prediction accuracy. When the tasks are correlated, it is more reasonable to learn all tasks simultaneously rather than separately [7].

The regularized multi-task learning methods proposed by Evgeniou and Pontil [12, 13] generalize kernel-based methods from single task learning to multi-task learning. Recently, the multi-task learning strategy has been applied to evolutionary algorithms [6], deep neural networks [14], pattern recognition [15], support vector machines, and so on. Among these, the multi-task SVM is a powerful machine learning tool, and a large body of literature shows that SVM-based multi-task learning methods are effective when the related tasks are trained simultaneously [16,17,18,19,20,21,22,23,24]. Yang et al. [17] presented a one-class SVM-based multi-task learning method that constrains the solutions of multiple tasks to be close to each other, and the resulting formulation is a conic program [16]. He et al. [18] proposed an improved SVM-based multi-task learning method for one-class classification under the assumption that the parameter vector of each task model is close to a mean value [12]. A general formulation that can employ different kernels for different tasks was then proposed under the assumption that the models of different tasks are close enough [19]. Sun et al. established a multi-task multi-class SVM approach that uses a constrained optimization instead of decomposition methods and can handle both label-compatible and label-incompatible scenarios [20, 21]. Based on the LS-SVM [25], Xu et al. proposed a multi-task LS-SVM that combines the advantages of LS-SVM and multi-task learning [22]. Li et al. proposed a multi-task proximal SVM with looser constraints to improve the training speed [23]. Song et al. proposed a novel formulation for multi-task learning by extending the relative margin machine (RMM) to the multi-task learning paradigm [24].

As an important part of machine learning, SVM has been widely studied in many settings, such as multi-class classification [26], feature selection [27], multi-instance multi-label learning [28], and the nonparallel least squares support vector machine (NLSSVM) [29]. A wide spectrum of successful applications shows that SVM is an advanced classifier. As is well known, the loss function plays a key role in SVM, and different support vector approaches can be established by using the corresponding loss functions [25, 30,31,32,33,34,35]. Typical loss functions include the hinge loss, the least squares loss, and the insensitive loss. All of these functions are convex and convenient for computation and theoretical analysis. Recently, a novel asymmetric squared loss function and the corresponding asymmetric least squares SVM (aLS-SVM) were proposed by Huang et al. [31]. Compared with LS-SVM, the aLS-SVM is more flexible since it introduces an expectile value into the asymmetric squared loss function. The aLS-SVM is considerably robust to noise around the decision boundary and stable with respect to re-sampling.

In this paper, we propose two aLS-SVM based multi-task learning methods and their special cases by integrating the merits of multi-task learning and the asymmetric squared loss function. We first make the assumption, as in [12, 18, 20, 22, 23], that the normal vector of the hyperplane corresponding to each task is expressed as the sum of a common vector and a private vector, and establish the new method MTL-aLS-SVM I. We prove that the new method strikes a balance between the maximal expectile distance for each task model and the closeness of each task model to the averaged model. Then, we relax the assumption, suppose that each task model is expressed as the sum of a common model and a private model, and establish the second multi-task learning method MTL-aLS-SVM II. Compared with MTL-aLS-SVM I, MTL-aLS-SVM II is more flexible since it can use different kernel functions for different tasks. These two new methods can be easily implemented by solving quadratic programs and simultaneously yield the decision functions for all tasks. In addition, we also present their special cases: the LS-SVM based multi-task learning methods (denoted correspondingly by MTL-LS-SVM I [22] and MTL-LS-SVM II) and the L2-SVM based multi-task learning methods (denoted correspondingly by MTL-L2-SVM I and MTL-L2-SVM II). The special cases MTL-LS-SVM II and MTL-L2-SVM II are also newly proposed methods. We compare these multi-task learning methods with several related and effective single task learning methods, including aLS-SVM [31], LS-SVM, L2-SVM, and NLSSVM [29]. The experimental results verify the effectiveness of our proposed multi-task learning methods.

In summary, by incorporating the properties of the multi-task learning and the asymmetric squared loss function, the advantages of our proposed methods are:

  • To have a good ability to process multi-task learning problems directly;

  • To have the potential to capture the relatedness among multiple related tasks;

  • To effectively exploit different kernel functions for different tasks;

  • To be more flexible by using the asymmetric squared loss function;

  • To be easily implemented by solving quadratic programming.

We organize the rest of this paper as follows. A brief introduction of the aLS-SVM is given in Section 2. Then we detail the MTL-aLS-SVM I and MTL-aLS-SVM II formulations in Section 3. Meanwhile, we give their corresponding special cases in this section. In Section 4, we evaluate the proposed methods by the numerical experiments. Finally, we conclude the paper in Section 5.

2 The aLS-SVM

The asymmetric least squares support vector machine (aLS-SVM) [31] is proposed based on the following asymmetric squared loss function:

$$\begin{array}{@{}rcl@{}} L_{\rho}(r) = \left\{ {\begin{array}{*{20}{l}} \rho{r^{2}}, \qquad r \geq 0 \\ (1 - \rho){r^{2}}, \, r < 0 \end{array}} \right. \end{array} $$
(1)

where ρ (0 ≤ ρ ≤ 1) is the expectile value. Unlike the general SVMs, the aLS-SVM maximizes the expectile distance instead of the minimal distance between two classes and solves the following optimization problem:

$$\begin{array}{@{}rcl@{}} \displaystyle{\min_{\boldsymbol{\omega},b,\boldsymbol{\zeta}}} && \frac{1}{2}{\| \boldsymbol{\omega} \|^{2}} + \frac{C}{2}\sum\limits_{i = 1}^{m} L_{\rho} (\zeta_{i}) \\ \text{s.t.} && {\zeta_{i}} = 1 - {y_{i}}({\boldsymbol{\omega}^{T}}{\phi(\boldsymbol{x}_{i})} + b),\, i = 1,2, \cdots, m \end{array} $$
(2)

where ζ is the error variable vector; ϕ(⋅) is a nonlinear mapping from the input space \(\mathbb {R}^{d}\) into the feature space \(\mathbb {R}^{h}\); C is the regularization parameter. According to the asymmetric squared loss function (1), the optimization problem (2) can be equivalently written as

$$\begin{array}{@{}rcl@{}} \displaystyle\min_{\boldsymbol{\omega},b,\boldsymbol{\zeta}} && \frac{1}{2}{\| \boldsymbol{\omega} \|^{2}} + \frac{C}{2}\sum\limits_{i = 1}^{m} {{\zeta_{i}}^{2}} \\ \text{s.t.} &&{y_{i}}({\boldsymbol{\omega}^{T}}\phi({\boldsymbol{x}_{i}}) + b) \geq 1-\frac{1}{\rho}{\zeta_{i}},\, i = 1,2, \cdots, m\\ && {y_{i}}({\boldsymbol{\omega}^{T}}\phi({\boldsymbol{x}_{i}}) + b) \leq 1+\frac{1}{1-\rho}{\zeta_{i}},\, i = 1,2, \cdots, m\\ \end{array} $$
(3)

Compared with the usual SVMs, the aLS-SVM is robust to noise around the decision boundary and stable to re-sampling because it maximizes the expectile distance. It is also an extension of both L2-SVM and LS-SVM [25]. More details about the aLS-SVM can be found in [31].
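
As a small illustration (not part of the original paper), the following Python sketch evaluates the loss in (1) and checks the two limiting cases used later in Section 3.3: ρ = 1 recovers the squared hinge loss and ρ = 1/2 recovers the least squares loss (up to the factor 1/2).

```python
import numpy as np

def asymmetric_squared_loss(r, rho):
    """Asymmetric squared loss of Eq. (1): rho * r^2 if r >= 0, (1 - rho) * r^2 if r < 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r >= 0, rho * r**2, (1.0 - rho) * r**2)

r = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(asymmetric_squared_loss(r, rho=0.95))  # asymmetric: positive residuals weighted by 0.95
print(asymmetric_squared_loss(r, rho=1.0))   # squared hinge loss: r^2 for r >= 0, 0 otherwise
print(asymmetric_squared_loss(r, rho=0.5))   # least squares loss: 0.5 * r^2 everywhere
```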

3 The aLS-SVM based multi-task learning formulations

In this section, we propose two aLS-SVM based multi-task learning methods, MTL-aLS-SVM I and MTL-aLS-SVM II, according to different task relatedness assumptions. Meanwhile, we develop two types of special cases of these two multi-task learning methods. In the multi-task learning scenario, we are given N different but related tasks. For each task k, we have m_k training data \(\{(\boldsymbol {x}_{ki}, y_{ki})\}_{i = 1}^{m_{k}}\), where \({{\boldsymbol {x}}_{ki}} \in {{\mathbb {R}}^{d}}\) and y_{ki} ∈ {1, −1}. Thus, we have \(m={\sum }_{k = 1}^{N}m_{k}\) training data in total. Our aim is to learn the N decision functions (hyperplanes), one for each task, simultaneously.
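
For concreteness, the following is a minimal sketch of this multi-task data layout in Python with NumPy (the random toy data and the variable names are our own illustration, not data used in the paper): one feature matrix and one label vector per task, with possibly different task sizes m_k.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 5                                  # number of tasks and input dimension (toy values)
tasks = []                                   # one (X_k, y_k) pair per task
for k in range(N):
    m_k = int(rng.integers(20, 40))          # tasks may have different sizes m_k
    X_k = rng.normal(size=(m_k, d))          # m_k training points in R^d
    y_k = rng.choice([-1.0, 1.0], size=m_k)  # labels in {+1, -1}
    tasks.append((X_k, y_k))

m = sum(X_k.shape[0] for X_k, _ in tasks)    # total number of training points
```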

3.1 MTL-aLS-SVM I

In light of the method presented in [12], when the related tasks share a common vector ω_0, the normal vector \({\boldsymbol {\omega }_{k}}\in \mathbb {R}^{h}\) for the specific task k can be expressed as ω_k = ω_0 + υ_k, where υ_k carries the private information of task k. Under this assumption, the primal optimization problem of MTL-aLS-SVM I is formulated as follows.

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}} && \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} + \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} + \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2} \\ \text{s.t.} && \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + b_{k} \boldsymbol{y}_{k} \geq \boldsymbol{e}_{m_{k}} - \frac{1}{\rho}\boldsymbol{\zeta}_{k},\\ && k = 1, 2, \cdots, N \\ && \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + b_{k} \boldsymbol{y}_{k} \leq \boldsymbol{e}_{m_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\zeta}_{k}},\\ && k = 1, 2, \cdots, N \end{array} $$
(4)

where \({\boldsymbol {Z}_{k}} = \left (y_{k1}\phi (\boldsymbol {x}_{k1}), y_{k2}\phi (\boldsymbol {x}_{k2}), \cdots , y_{k{m_{k}}}\phi (\boldsymbol {x}_{k{m_{k}}})\right ) \in \mathbb {R}^{h \times {m_{k}}}\) with ϕ(⋅) having the same meaning as in (3); \({\boldsymbol {y}_{k}} = {(y_{k1}, y_{k2}, \cdots , y_{k{m_{k}}})^{T}}\); \(\boldsymbol {\zeta }_{k} = (\zeta _{k1}, \zeta _{k2}, \cdots , \zeta _{k{m_{k}}})^{T} \in \mathbb {R}^{{m_{k}}}\) is the slack variable vector for task k; \(\boldsymbol {e}_{m_{k}}=(1, 1, \cdots , 1)^{T} \in \mathbb {R}^{m_{k}}\); and C_1 and C_2 are positive regularization parameters. We introduce C_1 to control the trade-off between the public classification information ω_0 and the dissimilarity among the tasks: a larger C_1 forces MTL-aLS-SVM I to train a common model, while a smaller C_1 makes MTL-aLS-SVM I learn each task model nearly independently. It can be seen from (4) that the N different tasks are trained simultaneously because they are connected through the public classification information.

The Lagrangian of the primal problem (4) is

$$\begin{array}{@{}rcl@{}} &&\mathcal{L}(\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}, \boldsymbol{\alpha}_{k}, \boldsymbol{\beta}_{k})\\ && = \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} + \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \| \boldsymbol{\upsilon}_{k}\|^{2} + \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2}\\ &&\quad -\sum\limits_{k = 1}^{N} {{\boldsymbol{\alpha}_{k}^{T}}} \left( \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + {b_{k}}{\boldsymbol{y}_{k}} - {\boldsymbol{e}_{m_{k}}} + \frac{1}{\rho}{\boldsymbol{\zeta}_{k}}\right) \\ &&\quad + \sum\limits_{k = 1}^{N} {{\boldsymbol{\beta}_{k}^{T}}} \left( \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + {b_{k}}{\boldsymbol{y}_{k}} - {\boldsymbol{e}_{m_{k}}} - \frac{1}{1-\rho}{\boldsymbol{\zeta}_{k}}\right)\\ \end{array} $$
(5)

where \(\boldsymbol {\alpha }_{k}=(\alpha _{k1}, \alpha _{k2}, \cdots , \alpha _{k{m_{k}}})^{T} \) and \(\boldsymbol {\beta }_{k}=(\beta _{k1}, \beta _{k2}, \cdots , \beta _{k{m_{k}}})^{T}\) are the nonnegative Lagrange multiplier vectors. Differentiating the Lagrangian with respect to ω_0, υ_k, ζ_k, b_k and applying the Karush-Kuhn-Tucker (KKT) conditions, we get the following equations:

$$\begin{array}{@{}rcl@{}} &&\boldsymbol{\omega}_{0}=\sum\limits_{k = 1}^{N} \boldsymbol{Z}_{k}({\boldsymbol{\alpha}_{k}}-{\boldsymbol{\beta}_{k}}) \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} && \boldsymbol{\upsilon}_{k} = \frac{1}{C_{1}}{\boldsymbol{Z}_{k}}({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}}) \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} && {\boldsymbol{\zeta}_{k}} = \frac{1}{C_{2}}\left( \frac{1}{\rho}{\boldsymbol{\alpha}_{k}} + \frac{1}{1-\rho}\boldsymbol{\beta}_{k}\right) \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} &&(\boldsymbol{\alpha}_{k}-\boldsymbol{\beta}_{k})^{T}{\boldsymbol{y}}_{k} = 0 \end{array} $$
(9)

By (6) and (7), we have

$$ \boldsymbol{\omega}_{0} = C_{1}\sum\limits_{k = 1}^{N} {\boldsymbol{\upsilon}_{k}} $$
(10)

which shows that ω_0 is a linear combination of the υ_k. Since ω_k = ω_0 + υ_k, we further have

$$ {\boldsymbol{\omega}_{0}} = \frac{{C_{1}}}{{1 + C_{1} N}}\sum\limits_{k = 1}^{N} {{ \boldsymbol{\omega}_{k}}} $$
(11)

Expressing ω_0 and υ_k in terms of ω_k, we get the following equivalent form of the objective function of the primal problem (4) (for the proof of (12), see the Appendix).

$$ \frac{\tau_{1}}{2}\sum\limits_{k = 1}^{N} \| {\boldsymbol{\omega}_{k}} \|^{2} + \frac{\tau_{2}}{2}\sum\limits_{k = 1}^{N} \left\| {\boldsymbol{\omega}_{k}} -\bar{\boldsymbol{\omega}} \right\|^{2} + \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \| {\boldsymbol{\zeta}_{k}} \|^{2} $$
(12)

where \(\bar {\boldsymbol {\omega }}=\frac {1}{N}{\sum }_{k = 1}^{N} {\boldsymbol {\omega }_{k}}\) is the mean vector of ω_1, ⋯, ω_N, and \(\tau _{1}=\frac {C_{1}}{1+C_{1}N},\,\tau _{2}=\frac {{C^{2}_{1}} N}{1+C_{1} N}\). Equation (12), together with the constraints of (4), shows that the newly proposed MTL-aLS-SVM I seeks a trade-off between the maximal expectile distance for each task model and the closeness of each task model to the averaged model.

Substituting (6)–(9) into the Lagrangian (5), we get the following dual form of the primal problem (4):

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\alpha},\boldsymbol{\beta}} && - \frac{1}{2}\sum\limits_{k,j = 1}^{N} ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T} {\boldsymbol{Z}^{T}_{k}} {\boldsymbol{Z}_{j}}({\boldsymbol{\alpha}_{j}} - {\boldsymbol{\beta}_{j}})\\ &&- \frac{1}{2C_{1}}\sum\limits_{k = 1}^{N} ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T} {\boldsymbol{Z}^{T}_{k}} {\boldsymbol{Z}_{k}}({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}}) \\ && - \frac{1}{2C_{2}}\sum\limits_{k = 1}^{N}\left( \frac{1}{\rho}\boldsymbol{\alpha}_{k} \,+\, \frac{1}{1 - \rho}\boldsymbol{\beta}_{k}\right)^{T}\left( \frac{1}{\rho}{\boldsymbol{\alpha}_{k}} \,+\, \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)\\ &&+ \sum\limits_{k = 1}^{N} ({\boldsymbol{\alpha}_{k}} \,-\, {\boldsymbol{\beta}_{k}})^{T} \boldsymbol{e}_{{m_{k}}} \\ \text{s.t.} && ({\boldsymbol\alpha_{k}} - {\boldsymbol\beta_{k}})^{T}{\boldsymbol{y}_{k}} = 0,\, k = 1,2, \cdots, N \\ &&\boldsymbol{\alpha}_{k} \geq \textbf{0}, \,k = 1, 2, \cdots, N \\ &&\boldsymbol{\beta}_{k} \geq \textbf{0},\, k = 1,2, \cdots, N \end{array} $$
(13)

where \(\boldsymbol {\alpha }=({\boldsymbol {\alpha }^{T}_{1}}, {\boldsymbol {\alpha }^{T}_{2}}, \cdots , {\boldsymbol {\alpha }^{T}_{N}})^{T}\) and \(\boldsymbol {\beta }=({\boldsymbol {\beta }^{T}_{1}}, {\boldsymbol {\beta }^{T}_{2}}, \cdots , {\boldsymbol {\beta }^{T}_{N}})^{T}\). By setting λ_k = α_k − β_k, we rewrite (13) as

$$\begin{array}{@{}rcl@{}} \min_{{\boldsymbol{\lambda}_{k}},{\boldsymbol{\beta}_{k}}}&&\frac{1}{2}\sum\limits_{k, j = 1}^{N} {\boldsymbol{\lambda}_{k}}^{T}{\boldsymbol{Z}^{T}_{k}} {\boldsymbol{Z}_{j}}{\boldsymbol{\lambda}_{j}} +\frac{1}{2C_{1}}\sum\limits_{k = 1}^{N} {\boldsymbol{\lambda}_{k}}^{T}{\boldsymbol{Z}_{k}}^{T} {\boldsymbol{Z}_{k}}{\boldsymbol{\lambda}_{k}}\\ && + \frac{1}{2 \rho^{2} C_{2}}\sum\limits_{k = 1}^{N}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)^{T}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)\\ && - \sum\limits_{k = 1}^{N} {\boldsymbol{\lambda}_{k}}^{T} \boldsymbol{e}_{{m_{k}}} \end{array} $$
$$\begin{array}{@{}rcl@{}} \text{ s.t.}&&{\boldsymbol{\lambda}_{k}}^{T}{\boldsymbol{y}_{k}} = 0,\, k = 1,2, \cdots, N \\ &&{\boldsymbol{\lambda}_{k}} + {\boldsymbol{\beta}_{k}} \geq 0,\, k = 1,2, \cdots, N\\ &&{\boldsymbol{\beta}_{k}} \geq 0,\, k = 1,2, \cdots, N \end{array} $$
(14)

where \({\boldsymbol {\lambda }_{k}}=(\lambda _{k1}, \lambda _{k2},\cdots , \lambda _{km_{k}})^{T} \). Furthermore, the objective function of (14) can be rewritten as

$$\begin{array}{@{}rcl@{}} && \frac{1}{2}\sum\limits_{k, j = 1}^{N} \sum\limits_{i = 1}^{m_{k}} \sum\limits_{r = 1}^{m_{j}}{\lambda_{ki}}{\lambda_{jr}}{y_{ki}}{y_{jr}}\left( 1+\frac{\delta_{kj}}{C_{1}}\right){K(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})}\\ &&\qquad + \frac{1}{2 \rho^{2} C_{2}}\sum\limits_{k = 1}^{N}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)^{T}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)\\ &&\qquad - \sum\limits_{k = 1}^{N} {\boldsymbol{\lambda}_{k}}^{T} \boldsymbol{e}_{{m_{k}}} \end{array} $$
(15)

where

$$\begin{array}{@{}rcl@{}} \delta_{kj} = \left\{ {\begin{array}{*{20}{l}} 1, k=j \\ 0, k\neq j \end{array}} \right. \end{array} $$
(16)
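
As a concrete aid, here is a minimal NumPy sketch (the helper names are ours; the Gaussian kernel is the one used in the experiments of Section 4) of the quadratic-form matrix appearing in (15), whose entry indexed by (ki, jr) is y_ki y_jr (1 + δ_kj/C_1) K(x_ki, x_jr).

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """Gaussian kernel K(x, x') = exp(-sigma * ||x - x'||^2)."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sigma * sq)

def multi_task_gram(tasks, C1, sigma):
    """Matrix of the quadratic form in (15): y_ki * y_jr * (1 + delta_kj / C1) * K(x_ki, x_jr).

    `tasks` is a list of (X_k, y_k) pairs; rows are ordered task by task."""
    X = np.vstack([Xk for Xk, _ in tasks])
    y = np.concatenate([yk for _, yk in tasks])
    task_of = np.concatenate([np.full(len(yk), k) for k, (_, yk) in enumerate(tasks)])
    K = rbf_kernel(X, X, sigma)
    same_task = (task_of[:, None] == task_of[None, :]).astype(float)  # delta_kj
    Q = (y[:, None] * y[None, :]) * (1.0 + same_task / C1) * K
    return Q, y, task_of
```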

Denote \(\boldsymbol {\lambda }_{k}^{*}, k = 1,\cdots , N\) as the optimal solutions of the above optimization problem. Then the decision function for task k can be obtained as

$$\begin{array}{@{}rcl@{}} {f_{k}}(\boldsymbol{x})& =& \text{sign}\left( \phi {(\boldsymbol{x})^{T}}\left( \sum\limits_{j = 1}^{N}\boldsymbol{Z}_{j}\boldsymbol{\lambda}^{*}_{j} + \frac{1}{C_{1}}{\boldsymbol{Z}_{k}}\boldsymbol{\lambda}_{k}^{*}\right) + b_{k}^{*}\right) \\ &= &\text{sign}\left( \sum\limits_{j = 1}^{N} \sum\limits_{i = 1}^{m_{j}} \lambda^{*}_{ji} y_{ji} K(\boldsymbol{x}_{ji}, \boldsymbol{x}) + \frac{1}{C_{1}}\sum\limits_{i = 1}^{{m_{k}}} {\lambda^{*}_{ki}} {y_{ki}} K({\boldsymbol{x}_{ki}},\boldsymbol{x}) + b^{*}_{k}\right) \end{array} $$
(17)

where K(⋅,⋅) is a kernel function, and the optimal value \({b^{*}_{k}}\) can be obtained by the following equations:

$$ \boldsymbol{Z}_{ki}^{T} \boldsymbol{Z}\boldsymbol{\lambda} + \frac{1}{{C_{1}}}\boldsymbol{Z}_{ki}^{T}{\boldsymbol{Z}_{k}}{\boldsymbol{\lambda}_{k}} + {y_{ki}}{b_{k}} = 1 - \frac{1}{\rho}{\zeta_{ki}},\quad \forall ki: {\alpha_{ki}} > 0 $$
(18)
$$ \boldsymbol{Z}_{ki}^{T}\boldsymbol{Z}\boldsymbol{\lambda} + \frac{1}{{C_{1}}}\boldsymbol{Z}_{ki}^{T}{\boldsymbol{Z}_{k}}{\boldsymbol{\lambda}_{k}} + {y_{ki}}{b_{k}} = 1 + \frac{1}{{1 - \rho}}{\zeta_{ki}},\quad \forall ki: {\beta_{ki}} > 0 $$
(19)

where \({\boldsymbol {\lambda }} = \left ({{\boldsymbol {\lambda }^{T}_{1}}}, {{\boldsymbol {\lambda }^{T}_{2}}}, \cdots , {{\boldsymbol {\lambda }^{T}_{N}}}\right )^{T}\); \(\boldsymbol {Z} = ({\boldsymbol {Z}_{1}}, {\boldsymbol {Z}_{2}}, \cdots , {\boldsymbol {Z}_{N}}) \in \mathbb {R}^{h \times m}\); and Z_{ki} is the i-th column of Z_k.
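
To make the training step concrete, the following is a hedged sketch of solving the dual (14) as a standard quadratic program with the cvxopt solver. The function name, the small ridge added to keep the quadratic term numerically positive definite, and the omission of the bias recovery via (18) and (19) are our own choices; the matrix Q is the one assembled in the sketch after (16).

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_mtl_als_svm1_dual(Q, y, task_of, C2, rho, ridge=1e-10):
    """Solve (14) in cvxopt's form: min 1/2 u^T P u + q^T u, s.t. G u <= h, A u = b,
    with u = [lambda; beta].  Q, y, task_of are as returned by multi_task_gram."""
    m = len(y)
    I = np.eye(m)
    c = 1.0 / (rho**2 * C2)
    # Quadratic term obtained by expanding the third sum of (14) in lambda and beta.
    P = np.block([[Q + c * I,           c / (1.0 - rho) * I],
                  [c / (1.0 - rho) * I, c / (1.0 - rho)**2 * I]]) + ridge * np.eye(2 * m)
    q = np.concatenate([-np.ones(m), np.zeros(m)])
    # Inequalities lambda + beta >= 0 and beta >= 0, written as G u <= 0.
    G = np.block([[-I, -I], [np.zeros((m, m)), -I]])
    h = np.zeros(2 * m)
    # One equality lambda_k^T y_k = 0 per task.
    N = int(task_of.max()) + 1
    A = np.zeros((N, 2 * m))
    for k in range(N):
        A[k, :m][task_of == k] = y[task_of == k]
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(np.zeros(N)))
    u = np.array(sol['x']).ravel()
    lam, beta = u[:m], u[m:]              # lambda_k = alpha_k - beta_k, stacked over tasks
    # The biases b_k are then recovered from the active constraints (18)-(19); omitted here.
    return lam, beta
```

The decision function (17) for task k then only needs the recovered λ*_k together with b*_k.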

3.2 MTL-aLS-SVM II

Next, we present another formulation under the assumption that all tasks share a common model, where every task function f_k is expressed as the sum of a common function h_0 and a private function h_k:

$$\begin{array}{@{}rcl@{}} f_{k}&=&h_{0}+h_{k}\\ &=&\langle\boldsymbol\omega_{0},\phi_{0}({\boldsymbol{x}})\rangle+ \langle\boldsymbol{\upsilon}_{k},\phi_{k}({\boldsymbol{x}})\rangle+b_{k} \end{array} $$

where ω_0 and ϕ_0 are the normal vector and the nonlinear feature mapping of the common model, respectively, and υ_k and ϕ_k are those of the private model of task k. For simplicity, we denote the offset b_0 + b_k by b_k. Obviously, ϕ_0 and ϕ_k can be different nonlinear mappings for different tasks k, so MTL-aLS-SVM II is an extension of MTL-aLS-SVM I, in which only one nonlinear transformation is employed. If ϕ_0 = ϕ_k, then MTL-aLS-SVM II reduces to MTL-aLS-SVM I.

We establish MTL-aLS-SVM II by solving the following optimization problem:

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}}\!\! && \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} \,+\, \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} \,+\, \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2} \\ \text{s.t.} && \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} \,+\, \boldsymbol{A}_{k}^{T} {\boldsymbol\upsilon}_{k} \,+\, b_{k} \boldsymbol{y}_{k} \!\geq\! \boldsymbol{e}_{m_{k}} \,-\, \frac{1}{\rho}\boldsymbol{\zeta}_{k},\\ && k\,=\,1, 2, \cdots, N \\ && \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} \,+\, \boldsymbol{A}_{k}^{T}\boldsymbol{\upsilon}_{k} \,+\, b_{k} \boldsymbol{y}_{k} \!\leq\! \boldsymbol{e}_{m_{k}} \,+\, \frac{1}{1 \,-\, \rho}{\boldsymbol{\zeta}_{k}},\\ && k\,=\,1, 2, \cdots, N \end{array} $$
(20)

where \(\tilde {{\boldsymbol {Z}}_{k}}=(y_{k1}\phi _{0}(\boldsymbol {x}_{k1}), y_{k2}\phi _{0} (\boldsymbol {x}_{k2}), \cdots , y_{k{m_{k}}}\phi _{0} (\boldsymbol {x}_{k{m_{k}}})) \in \mathbb {R}^{h \times {m_{k}}}\); \(\boldsymbol {A}_{k} = (y_{k1}\phi _{k}(\boldsymbol {x}_{k1}), y_{k2}\phi _{k}(\boldsymbol {x}_{k2}), \cdots , y_{k{m_{k}}}\phi _{k} (\boldsymbol {x}_{k{m_{k}}}))\in \mathbb {R}^{h \times {m_{k}}}\); and ζ_k, y_k, e_{m_k}, C_1, and C_2 have the same meanings as in formula (4).

The Lagrangian function of the above optimization problem is

$$\begin{array}{@{}rcl@{}} && \mathcal{L}(\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}, \boldsymbol{\alpha}_{k}, \boldsymbol{\beta}_{k})\\ && = \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} + \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \| \boldsymbol{\upsilon}_{k}\|^{2} + \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2}\\ &&\quad-\sum\limits_{k = 1}^{N} {{\boldsymbol{\alpha}_{k}^{T}}} \left( \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} + \boldsymbol{A}_{k}^{T}\boldsymbol{\upsilon}_{k} + {b_{k}}{\boldsymbol{y}_{k}} - {\boldsymbol{e}_{m_{k}}} + \frac{1}{\rho}{\boldsymbol{\zeta}_{k}}\right)\\ &&\quad + \sum\limits_{k = 1}^{N} {{\boldsymbol{\beta}_{k}^{T}}}\left( \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} + \boldsymbol{A}_{k}^{T}\boldsymbol{\upsilon}_{k} + {b_{k}}{\boldsymbol{y}_{k}} - {\boldsymbol{e}_{m_{k}}} - \frac{1}{1-\rho}{\boldsymbol{\zeta}_{k}}\right)\\ \end{array} $$
(21)

where \(\boldsymbol {\alpha }_{k}=(\alpha _{k1}, \alpha _{k2}, \cdots , \alpha _{k{m_{k}}})^{T} \) and \(\boldsymbol {\beta }_{k}=(\beta _{k1}, \beta _{k2}, \cdots , \beta _{k{m_{k}}})^{T}\) are the nonnegative Lagrange multiplier vectors. According to the KKT conditions, we get the following equations:

$$\begin{array}{@{}rcl@{}} &&\boldsymbol\omega_{0}=\sum\limits_{k = 1}^{N} \tilde{\boldsymbol{Z}}_{k}({\boldsymbol{\alpha}_{k}}-{\boldsymbol{\beta}_{k}}) \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} && \boldsymbol\upsilon_{k} = \frac{1}{C_{1}}{\boldsymbol{A}_{k}}({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}}) \end{array} $$
(23)
$$\begin{array}{@{}rcl@{}} && {\boldsymbol\zeta_{k}} = \frac{1}{C_{2}}\left( \frac{1}{\rho}{\boldsymbol{\alpha}_{k}} + \frac{1}{1-\rho}\boldsymbol{\beta}_{k}\right) \end{array} $$
(24)
$$\begin{array}{@{}rcl@{}} &&(\boldsymbol{\alpha}_{k}-\boldsymbol{\beta}_{k})^{T}{\boldsymbol{y}}_{k} = 0 \end{array} $$
(25)

By substituting (22)–(25) into (21), we obtain the following dual program of (20):

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\alpha},\boldsymbol{\beta}} && - \frac{1}{2}\sum\limits_{k,j = 1}^{N} ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T} {\tilde{\boldsymbol{Z}}^{T}_{k}} {\tilde{\boldsymbol{Z}}_{j}}({\boldsymbol{\alpha}_{j}} - {\boldsymbol{\beta}_{j}})\\ &&- \frac{1}{2C_{1}}\sum\limits_{k = 1}^{N} ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T} {\boldsymbol{A}^{T}_{k}} {\boldsymbol{A}_{k}}({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}}) \\ && - \frac{1}{2C_{2}}\sum\limits_{k = 1}^{N}\!\left( \frac{1}{\rho}\boldsymbol{\alpha}_{k} \,+\, \frac{1}{1 \,-\, \rho}\boldsymbol{\beta}_{k}\right)^{T}\left( \frac{1}{\rho}{\boldsymbol{\alpha}_{k}} \,+\, \frac{1}{1 \,-\, \rho}{\boldsymbol{\beta}_{k}}\right) \\&&+ \sum\limits_{k = 1}^{N} ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T} \boldsymbol{e}_{{m_{k}}} \\ \text{s.t.} && ({\boldsymbol{\alpha}_{k}} - {\boldsymbol{\beta}_{k}})^{T}{\boldsymbol{y}_{k}} = 0,\, k = 1,2, \cdots, N \\ &&\boldsymbol{\alpha}_{k} \geq \textbf{0}, \,k = 1, 2, \cdots, N \\ &&\boldsymbol{\beta}_{k} \geq \textbf{0},\, k = 1,2, \cdots, N \end{array} $$
(26)

where \(\boldsymbol {\alpha }=({\boldsymbol {\alpha }^{T}_{1}}, {\boldsymbol {\alpha }^{T}_{2}}, \cdots , {\boldsymbol {\alpha }^{T}_{N}})^{T}\) and \(\boldsymbol {\beta }=({\boldsymbol {\beta }^{T}_{1}}, {\boldsymbol {\beta }^{T}_{2}}\), \(\cdots , {\boldsymbol {\beta }^{T}_{N}})^{T}\).

Setting λ_k = α_k − β_k, we get the equivalent form of (26):

$$\begin{array}{@{}rcl@{}} \min_{{\boldsymbol{\lambda}_{k}},{\boldsymbol{\beta}_{k}}}&&\frac{1}{2}\sum\limits_{k, j = 1}^{N} \sum\limits_{i = 1}^{m_{k}} \sum\limits_{r = 1}^{m_{j}}{\lambda_{ki}}{\lambda_{jr}}{y_{ki}}{y_{jr}}\\ && \times \left( {K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})} +\frac{\delta_{kj}}{C_{1}}{K_{k}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})}\right)\\ &&+ \frac{1}{2 \rho^{2} C_{2}}\sum\limits_{k = 1}^{N}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)^{T}\left( {\boldsymbol{\lambda}_{k}} + \frac{1}{1 - \rho}{\boldsymbol{\beta}_{k}}\right)\\ && - \sum\limits_{k = 1}^{N} {\boldsymbol{\lambda}_{k}}^{T} \boldsymbol{e}_{{m_{k}}} \\ \text{ s.t.}&&{\boldsymbol{\lambda}_{k}}^{T}{\boldsymbol{y}_{k}} = 0,\, k = 1,2, \cdots, N \\ &&{\boldsymbol{\lambda}_{k}} + {\boldsymbol{\beta}_{k}} \geq 0,\, k = 1,2, \cdots, N\\ &&{\boldsymbol{\beta}_{k}} \geq 0,\, k = 1,2, \cdots, N \end{array} $$
(27)

where K_0(⋅,⋅) and K_k(⋅,⋅) (k = 1, 2, ⋯, N) are kernel functions. Comparing program (27) with (14) (notice (15)) shows that MTL-aLS-SVM II and MTL-aLS-SVM I are equivalent if K_0 = K_k. Therefore, MTL-aLS-SVM II is an extension of MTL-aLS-SVM I.
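
For illustration, the sketch below assembles the corresponding quadratic-form matrix of (27) for the "L + G" combination used in Section 4: a linear common kernel K_0 plus a Gaussian private kernel K_k. Using one shared Gaussian kernel for all private models is our simplification, since (27) allows a different K_k for every task; the resulting matrix plugs into the same QP structure as (14).

```python
import numpy as np

def linear_kernel(X1, X2):
    return X1 @ X2.T

def rbf_kernel(X1, X2, sigma):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sigma * sq)

def multi_task_gram_two_kernels(tasks, C1, sigma):
    """Matrix with entries y_ki * y_jr * (K0(x_ki, x_jr) + delta_kj / C1 * K_k(x_ki, x_jr)),
    here with K0 linear (common model) and K_k Gaussian (private models)."""
    X = np.vstack([Xk for Xk, _ in tasks])
    y = np.concatenate([yk for _, yk in tasks])
    task_of = np.concatenate([np.full(len(yk), k) for k, (_, yk) in enumerate(tasks)])
    same_task = (task_of[:, None] == task_of[None, :]).astype(float)  # delta_kj
    Q = (y[:, None] * y[None, :]) * (linear_kernel(X, X)
                                     + same_task / C1 * rbf_kernel(X, X, sigma))
    return Q, y, task_of
```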

Denote \(\boldsymbol {\lambda }_{k}^{*}, k = 1,\cdots ,N\) as the optimal solutions of the above optimization problem. Then the decision function for task k can be obtained as

$$\begin{array}{@{}rcl@{}} {f_{k}}(\boldsymbol{x})&=&\text{sign}\left( \phi_{0} {(\boldsymbol{x})^{T}}\sum\limits_{j = 1}^{N}\tilde{\boldsymbol{Z}}_{j}\boldsymbol{\lambda}^{*}_{j} + \phi_{k} {(\boldsymbol{x})^{T}} \frac{1}{C_{1}}{\boldsymbol{A}_{k}}\boldsymbol{\lambda}_{k}^{*} + b_{k}^{*}\right) \\ &= &\text{sign} \left( \sum\limits_{j = 1}^{N} \sum\limits_{i = 1}^{m_{j}} \lambda^{*}_{ji} y_{ji} K_{0}(\boldsymbol{x}_{ji}, \boldsymbol{x}) + \frac{1}{C_{1}}\sum\limits_{i = 1}^{{m_{k}}} {\lambda^{*}_{ki}} {y_{ki}} K_{k}({\boldsymbol{x}_{ki}},\boldsymbol{x}) + b^{*}_{k} \right) \end{array} $$
(28)

where the optimal value \({b^{*}_{k}}\) can be obtained by the following equations:

$$\begin{array}{@{}rcl@{}} &&\sum\limits_{j = 1}^{N} \sum\limits_{r = 1}^{m_{j}}{\lambda_{jr}}{y_{ki}}{y_{jr}}\!\left( \!K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) \,+\,\frac{\delta_{kj}}{C_{1}}K_{k}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})\!\right)\\ &&\quad+ {y_{ki}}{b_{k}} = 1 - \frac{1}{\rho}{\zeta_{ki}},\,\forall ki: {\alpha_{ki}} > 0 \end{array} $$
$$\begin{array}{@{}rcl@{}} &&\sum\limits_{j = 1}^{N} \sum\limits_{r = 1}^{m_{j}}{\lambda_{jr}}{y_{ki}}{y_{jr}}\!\left( \!K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) \,+\,\frac{\delta_{kj}}{C_{1}}K_{k}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})\!\right)\\ &&\quad + {y_{ki}}{b_{k}} = 1 + \frac{1}{{1 - \rho}}{\zeta_{ki}},\,\forall ki: {\beta_{ki}} > 0 \end{array} $$

3.3 The special cases

In this subsection, we develop two kinds of special cases of MTL-aLS-SVM I and MTL-aLS-SVM II for multi-task learning. Recall that the shape of the asymmetric squared loss function (1) is closely related to the value of ρ. When ρ = 1, the asymmetric squared loss (1) reduces to the squared hinge loss:

$$\begin{array}{@{}rcl@{}} L_{\rho}(r) = \left\{ {\begin{array}{*{20}{l}} {r^{2}}, \,r \geq 0 \\ 0, \, r < 0 \end{array}} \right. \end{array} $$
(29)

Accordingly, MTL-aLS-SVM I and MTL-aLS-SVM II reduce to the L2-SVM based multi-task learning methods (denoted by MTL-L2-SVM I and MTL-L2-SVM II, respectively).

MTL-L2-SVM I:

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}} &&\frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} + \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} + \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2} \\ \text{s.t.} && \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + b_{k} \boldsymbol{y}_{k} \geq \boldsymbol{e}_{m_{k}} -\boldsymbol{\zeta}_{k},\\ && k = 1, 2, \cdots, N \end{array} $$
(30)

where Z_k, ζ_k, C_1, C_2, and e_{m_k} have the same meanings as in formula (4). By the KKT conditions, the dual problem of the above optimization problem can be obtained as

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\alpha}}&& -\frac{1}{2}\sum\limits_{k,j = 1}^{N} {{\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{Z}_{k}}^{T}} \boldsymbol{{Z}_{j}}{\boldsymbol{\alpha}_{j}}- \frac{1}{{2C_{1}}}\sum\limits_{k = 1}^{N} {\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{{Z}_{k}}^{T}} \boldsymbol{{Z}_{k}}{\boldsymbol{\alpha}_{k}}\\ &&-\frac{1}{{2C_{2}}}\sum\limits_{k = 1}^{N}{\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{\alpha}_{k}} + \sum\limits_{k = 1}^{N}{\boldsymbol{\alpha}_{k}}^{T}\boldsymbol{e}_{{m_{k}}}\\ \text{ s.t.} && {\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{{y}_{k}} = 0,\, k = 1,2, \cdots, N} \\ &&\boldsymbol{\alpha}_{k} \geq 0,\, k = 1,2, \cdots, N \end{array} $$
(31)

MTL-L2-SVM II:

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}} && \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} \,+\, \frac{C_{1}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} \,+\, \frac{C_{2}}{2}\sum\limits_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2} \\ \text{s.t.} && \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} + \boldsymbol{A}_{k}^{T}\boldsymbol{\upsilon}_{k} + b_{k} \boldsymbol{y}_{k} \geq \boldsymbol{e}_{m_{k}} -\boldsymbol{\zeta}_{k},\\ && k = 1, 2, \cdots, N \end{array} $$
(32)

where \(\tilde {\boldsymbol {Z}}_{k}\) and A_k have the same meanings as in (20). The dual form of the above optimization problem is

$$\begin{array}{@{}rcl@{}} \min_{{\boldsymbol{\alpha}_{k}}}&&\frac{1}{2}\sum\limits_{k, j = 1}^{N} \sum\limits_{i = 1}^{m_{k}} \sum\limits_{r = 1}^{m_{j}}{\alpha_{ki}}{\alpha_{jr}}{y_{ki}}{y_{jr}}\left( K_{0}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr}) +\frac{\delta_{kj}}{C_{1}}K_{k}(\boldsymbol{x}_{ki},\boldsymbol{x}_{jr})\right)\\ && + \frac{1}{2 C_{2}}\sum\limits_{k = 1}^{N}{\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{\alpha}_{k}} - \sum\limits_{k = 1}^{N} {\boldsymbol{\alpha}_{k}}^{T} \boldsymbol{e}_{{m_{k}}} \\ \text{ s.t.}&&{\boldsymbol{\alpha}_{k}}^{T}{\boldsymbol{y}_{k}} = 0,\, k = 1,2, \cdots, N \\ &&{\boldsymbol{\alpha}_{k}}\geq 0,\, k = 1,2, \cdots, N \end{array} $$
(33)

On the other hand, when ρ = 1/2, the asymmetric squared loss (1) reduces to the least squares loss \( L_{\rho }(r) = \frac {1}{2}r^{2} \). Then MTL-aLS-SVM I and MTL-aLS-SVM II accordingly become the least squares SVM based multi-task learning methods (denoted by MTL-LS-SVM I and MTL-LS-SVM II, respectively).

MTL-LS-SVM I [22]:

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}} && \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} \,+\, \frac{C_{1}}{2}{\sum}_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} \,+\, \frac{C_{2}}{2}{\sum}_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2}\\ \text{s.t.} && \boldsymbol{Z}_{k}^{T}(\boldsymbol{\omega}_{0} + \boldsymbol{\upsilon}_{k}) + b_{k} \boldsymbol{y}_{k} =\boldsymbol{e}_{m_{k}} -\boldsymbol{\zeta}_{k},\\ && k = 1, 2, \cdots, N \end{array} $$
(34)

The optimization problem (34) can be solved by the following linear system:

$$\begin{array}{@{}rcl@{}} \left[ {\begin{array}{*{20}{c}} {{\boldsymbol{0}_{N \times N}}}&{{\boldsymbol{D}^{T}}} \\ {\boldsymbol D}&{\boldsymbol H} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} \boldsymbol{b} \\ \boldsymbol{\alpha} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\boldsymbol{0}_{N}}} \\ {{\boldsymbol{e}_{m}}} \end{array}} \right] \end{array} $$
(35)

where D = blockdiag(y_1, y_2, ⋯, y_N), the positive definite matrix \(\boldsymbol {H}=\boldsymbol {\Omega }+\frac {1}{C_{2}}\boldsymbol I_{m}+\frac {1}{C_{1}}\boldsymbol B \in \mathbb {R}^{m \times m}\), \(\boldsymbol {\Omega }=\boldsymbol {Z}^{T}\boldsymbol {Z}\in \mathbb {R}^{m \times m}\) with Z = (Z_1, Z_2, ⋯, Z_N), and \(\boldsymbol {B}=\text{blockdiag}(\boldsymbol {\Omega }_{1},\boldsymbol {\Omega }_{2},\cdots ,\boldsymbol {\Omega }_{N})\in \mathbb {R}^{m \times m}\) with \(\boldsymbol {\Omega }_{k}={\boldsymbol {Z}_{k}}^{T}\boldsymbol {Z}_{k} \in \mathbb {R}^{m_{k} \times m_{k}}\).

The efficiency of MTL-LS-SVM I has been verified by comparing it with several other multi-task learning methods [22]. More details can be found in [22].
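
A minimal NumPy sketch of assembling and solving the linear system (35) (the helper names are ours, and a Gaussian kernel is assumed); the decision function for task k is then built from the recovered α and b_k exactly as in (17) with ρ = 1/2.

```python
import numpy as np

def solve_mtl_ls_svm1(tasks, C1, C2, sigma):
    """Solve [[0, D^T], [D, H]] [b; alpha] = [0; e] with H = Omega + I/C2 + B/C1, Eq. (35)."""
    X = np.vstack([Xk for Xk, _ in tasks])
    y = np.concatenate([yk for _, yk in tasks])
    task_of = np.concatenate([np.full(len(yk), k) for k, (_, yk) in enumerate(tasks)])
    m, N = len(y), len(tasks)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    K = np.exp(-sigma * sq)                            # Gaussian kernel matrix
    Omega = (y[:, None] * y[None, :]) * K              # Omega = Z^T Z
    same_task = (task_of[:, None] == task_of[None, :]).astype(float)
    B = Omega * same_task                              # blockdiag(Omega_1, ..., Omega_N)
    H = Omega + np.eye(m) / C2 + B / C1
    D = np.zeros((m, N))                               # D = blockdiag(y_1, ..., y_N)
    for k in range(N):
        D[task_of == k, k] = y[task_of == k]
    lhs = np.block([[np.zeros((N, N)), D.T], [D, H]])
    rhs = np.concatenate([np.zeros(N), np.ones(m)])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:N], sol[N:]                            # biases b and dual variables alpha
```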

MTL-LS-SVM II:

$$\begin{array}{@{}rcl@{}} \min_{\boldsymbol{\omega}_{0}, \boldsymbol{\upsilon}_{k}, b_{k}, \boldsymbol{\zeta}_{k}} &&\! \frac{1}{2}\|\boldsymbol{\omega}_{0}\|^{2} \,+\, \frac{C_{1}}{2}{\sum}_{k = 1}^{N} \|\boldsymbol{\upsilon}_{k}\|^{2} \,+\, \frac{C_{2}}{2}{\sum}_{k = 1}^{N} \|\boldsymbol{\zeta}_{k}\|^{2}\\ \text{s.t.} &&\! \tilde{\boldsymbol{Z}}_{k}^{T}\boldsymbol{\omega}_{0} + \boldsymbol{A}_{k}^{T}\boldsymbol{\upsilon}_{k} + b_{k} \boldsymbol{y}_{k} =\boldsymbol{e}_{m_{k}} -\boldsymbol{\zeta}_{k},\\ &&\! k = 1, 2, \cdots, N \end{array} $$
(36)

The optimization problem (36) can be solved by the following linear system:

$$\begin{array}{@{}rcl@{}} \left[ {\begin{array}{*{20}{c}} {{\boldsymbol{0}_{N \times N}}}&{{\boldsymbol{D}^{T}}} \\ {\boldsymbol D}&{\tilde{\boldsymbol H}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} \boldsymbol{b} \\ \boldsymbol{\alpha} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\boldsymbol{0}_{N}}} \\ {{\boldsymbol{e}_{m}}} \end{array}} \right] \end{array} $$
(37)

where D = blockdiag(y_1, y_2, ⋯, y_N), the positive definite matrix \(\tilde {\boldsymbol {H}}=\tilde {\boldsymbol {\Omega }}+\frac {1}{C_{2}}\boldsymbol I_{m}+\frac {1}{C_{1}}\tilde {\boldsymbol B} \in \mathbb {R}^{m \times m}\), \(\tilde {\boldsymbol {\Omega }}=\tilde {\boldsymbol {Z}}^{T}\tilde {\boldsymbol {Z}}\in \mathbb {R}^{m \times m}\) with \(\tilde {\boldsymbol {Z} }= ({\tilde {\boldsymbol {Z}}_{1}}, {\tilde {\boldsymbol {Z}}_{2}}, \cdots , {\tilde {\boldsymbol {Z}}_{N}})\), and \(\tilde {\boldsymbol B}=\text{blockdiag}(\boldsymbol {\Theta }_{1},\boldsymbol {\Theta }_{2},\cdots ,\boldsymbol {\Theta }_{N})\in \mathbb {R}^{m \times m}\) with \(\boldsymbol {\Theta }_{k}={\boldsymbol A_{k}}^{T}{\boldsymbol A_{k}} \in \mathbb {R}^{m_{k} \times m_{k}}\).

4 Experiments

To verify the effectiveness of the newly proposed multi-task learning methods, we compare them with the strategy in which all N tasks are learned independently by aLS-SVM [31], L2-SVM [32], LS-SVM [25], and the nonparallel least squares SVM (NLSSVM) [29]. The corresponding single task learning methods are denoted by N-aLS-SVM, N-L2-SVM, N-LS-SVM, and N-NLSSVM, respectively. All experiments are carried out in MATLAB R2014a on a personal computer with an Intel(R) Core(TM) i7 processor (3.40 GHz) and 4 GB of RAM.

We test these methods on three benchmark datasets, Isolet, Monk, and Dermatology, from the UCI Machine Learning Repository. The Isolet dataset, gathered from 150 subjects speaking each of the 26 English letters twice, consists of 7797 instances with 617 attributes (three instances are historically missing). The speakers are divided into five subsets of equal size, known as Isolet1 to Isolet5, and each subset is treated as one classification task. On one hand, the five tasks are closely related because they are gathered from the same utterances [11, 20]. On the other hand, they differ from each other because the speakers in different groups pronounce the English letters differently. In our experiments, we classify three pairs of similar-sounding letters: (B, D), (G, J), and (M, N). For the (B, D) and (G, J) pairs, there are 600 instances in total across the five tasks of each pair; for the (M, N) pair, there are 599 instances in total. We apply principal component analysis (PCA) to the chosen datasets to remove low-variance noise, reducing the number of attributes from 617 to 200 while capturing 97.5% of the data variance.

The Monk dataset, with 432 instances described by 6 attributes, was the basis of a first international comparison of learning algorithms. It is divided into three subsets, referred to as Monk1, Monk2, and Monk3, which correspond to the three tasks.

The Dermatology dataset is a collection of 366 differential diagnoses of six kinds of dermatological diseases, described by 33 clinicopathological characteristics. As in [8, 22], the problem is converted into six binary one-versus-rest classification problems, each of which is regarded as a task. Therefore, we have six tasks in total.

In our experiments, the Gaussian kernel K(x_i, x_j) = exp(−σ‖x_i − x_j‖²) is employed in the first multi-task learning method MTL-aLS-SVM I. The second multi-task learning method MTL-aLS-SVM II involves two basic kernel functions: K_0 in the common model and K_k in the private model (27). We test our method with two different combinations: in the first, K_0(x_{ki}, x_{jr}) = ⟨x_{ki}, x_{jr}⟩ is a linear kernel and K_k(x_{ki}, x_{jr}) = exp(−σ‖x_{ki} − x_{jr}‖²) is a Gaussian kernel; in the second, K_0(x_{ki}, x_{jr}) = ⟨x_{ki}, x_{jr}⟩ is a linear kernel and K_k(x_{ki}, x_{jr}) = (⟨x_{ki}, x_{jr}⟩ + 1)^d is a polynomial kernel with d > 1.

Generally speaking, the performance of the algorithms relies on the selection of parameters. There are four (five) tuning parameters in MTL-aLS-SVM I (MTL-aLS-SVM II): C_1, C_2, σ, d (in MTL-aLS-SVM II), and ρ. The first three (four) parameters are the same as those of MTL-L2-SVM I and MTL-LS-SVM I (MTL-L2-SVM II and MTL-LS-SVM II). The parameter ρ in MTL-aLS-SVM I and MTL-aLS-SVM II controls the shape of the loss function; as in [31], we set ρ = 0.99, 0.95, 0.83 in our experiments. The parameter ranges are C_1 ∈ {2^{−7}, 2^{−5}, ⋯, 2^5}, C_2 ∈ {2^{−6}, 2^{−4}, ⋯, 2^8}, σ ∈ {2^{−7}, 2^{−5}, ⋯, 2^5}, and d ∈ {2, 3, ⋯, 9}. For the single task learning methods N-aLS-SVM, N-L2-SVM, and N-LS-SVM, apart from the kernel parameters σ and d, the optimal tuning parameter C is chosen from {2^{−6}, 2^{−4}, ⋯, 2^8}. For N-NLSSVM, apart from the kernel parameters σ and d, there are two tuning parameters c_1 and c_2 with the same range {2^{−7}, 2^{−6}, ⋯, 2^8}. For each dataset, the attributes are scaled to [−1, 1]. About 55% of the instances are randomly chosen from the whole dataset to form the training set, and the rest constitute the testing set. Five-fold cross validation on the training set is used to find the optimal parameters, and the classification accuracy on the testing set is then recorded. This process is repeated ten times, and the "Accuracy" in the following tables is the mean of the ten testing results.
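
The parameter search described above can be set up along the following lines (a sketch only: the cross-validation loop, the polynomial-degree grid, and the per-method single-task grids are omitted, and the helper name is ours).

```python
import numpy as np
from itertools import product

# Grids from Section 4 for MTL-aLS-SVM I / II.
C1_grid    = 2.0 ** np.arange(-7, 6, 2)     # {2^-7, 2^-5, ..., 2^5}
C2_grid    = 2.0 ** np.arange(-6, 9, 2)     # {2^-6, 2^-4, ..., 2^8}
sigma_grid = 2.0 ** np.arange(-7, 6, 2)     # {2^-7, 2^-5, ..., 2^5}
rho_grid   = [0.99, 0.95, 0.83]             # candidate expectile values, as in [31]

def scale_to_unit_interval(X):
    """Scale every attribute linearly into [-1, 1], as done for each dataset."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / np.where(hi > lo, hi - lo, 1.0) - 1.0

# Candidate configurations to be scored by five-fold cross validation on the training set.
grid = list(product(C1_grid, C2_grid, sigma_grid, rho_grid))
```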

In Tables 1, 2, 3, 4, and 5, "Accuracy±S" denotes the averaged classification accuracy plus or minus the standard deviation. "L", "G", and "P" represent the linear kernel, Gaussian kernel, and polynomial kernel, respectively. For the second kind of multi-task learning methods, "L + G" indicates that K_0 is the linear kernel and K_k is the Gaussian kernel, and "L + P" indicates that K_0 is the linear kernel and K_k is the polynomial kernel. The best result among all the methods for each dataset is highlighted.

Table 1 Experimental results on (B, D) pair of Isolet dataset
Table 2 Experimental results on (G, J) pair of Isolet dataset
Table 3 Experimental results on (M,N) pair of Isolet dataset

As can be seen from Tables 1, 2, and 3, for the Isolet dataset the accuracies of the multi-task learning methods are in general much higher than those of the single task learning methods. Specifically, for the (B, D) pair, the highest accuracy was achieved by MTL-aLS-SVM II using the combination of the linear kernel and the polynomial kernel. For the (G, J) pair, the multi-task learning method MTL-L2-SVM II with the linear and Gaussian kernel combination achieved the best accuracy. For the (M, N) pair, the best accuracy was obtained by MTL-L2-SVM II with the linear and polynomial kernel combination.

For the Monk dataset, Table 4 shows that MTL-aLS-SVM I and MTL-aLS-SVM II achieve better performance than the other multi-task learning methods and the single task learning methods, and MTL-aLS-SVM II obtains the best accuracy among all of them. In addition, it can be seen that the performance of the single task learning methods depends heavily on the choice of kernel function, whereas the multi-task learning methods are less sensitive to it.

Table 4 Experimental results on Monk dataset

For the Dermatology dataset, it can be seen from Table 5 that the accuracies obtained by MTL-aLS-SVM I and MTL-aLS-SVM II are slightly lower than the highest accuracy, which is obtained by the single task learning method L2-SVM. The same phenomenon occurs in [8] for MTL-FEAT (RBF) and independent (RBF). Argyriou et al. [8] conjecture, supported by their numerical experiments, that the relatedness among these tasks is weak or absent. As in [8, 22], the results in Table 5 indicate that the newly proposed multi-task learning methods can still achieve good performance even in such a case.

In addition, Tables 1, 2, 3, 4, and 5 show that the results obtained by our proposed multi-task learning methods are better than those reported for the multi-task RMM algorithm in [24].

Table 5 Experimental results on Dermatology dataset

Further, we employ the non-parametric Friedman test with its corresponding Nemenyi post-hoc test [36] to perform a fairer comparison of all the involved algorithms on the employed UCI datasets. For simplicity, only the best accuracy of each algorithm is considered. Table 6 reports the ranks of "Accuracy" for all the involved algorithms on the employed UCI datasets, where each algorithm is represented by its abbreviation; for example, "MTL-aL I" denotes "MTL-aLS-SVM I".

Table 6 The ranks of the involved algorithms in the Friedman test on the employed UCI datasets

Let R_i denote the average rank of the i-th algorithm in Table 6. The Friedman statistic, which is distributed according to \(\chi ^{2}_{F}\) with (K − 1) degrees of freedom, and the statistic \(\mathcal {F}_{F} \), which follows the \(\mathcal {F}\)-distribution with (K − 1) and (K − 1)(N − 1) degrees of freedom, are calculated as \(\chi ^{2}_{F}=\frac {12N}{K(K + 1)}\left [\sum \limits ^{K}_{i = 1}{R^{2}_{i}}-\frac {K(K + 1)^{2}}{4}\right ]= 21.1418\) and \(\mathcal {F}_{F} =\frac {(N-1)\chi ^{2}_{F}}{N(K-1)-\chi ^{2}_{F}}= 3.5446\), where N = 5 and K = 10. According to the table of critical values, \(\mathcal {F}_{\alpha = 0.1}(10,5)= 1.811<3.5446\), so we reject the null hypothesis. For further pairwise comparison, we resort to the Nemenyi post-hoc test. For α = 0.1, the critical difference is \(CD=\mathcal {F}_{\alpha = 0.1}(10,5)*\sqrt {\frac {K(K + 1)}{6N}}= 3.4678\). The performance of two algorithms is significantly different if their average ranks differ by at least the critical difference. Based on Table 6, the differences between MTL-aL II and the other algorithms are calculated as follows:

$$\begin{array}{llll} d(\text{N-aL} )\,-\, d(\text{MTL-aL II} )\,=\,5.8\,-\,2.3\,=\,3.5\!>\!3.4678\\ d(\text{N-L2} )\,-\, d(\text{MTL-aL II} )\,=\,6.4\,-\,2.3\,=\,4.1\!>\!3.4678\\ d(\text{N-LS} )\!- \!d(\text{MTL-aL II} )\,=\,6.4\,-\,2.3\,=\,4.1\!>\!3.4678\\ d(\text{N-NL} )\,-\, d(\text{MTL-aL II} )\,=\,9.4\,-\,2.3\,=\,7.1\!>\!3.4678\\ d(\text{MTL-L2 I})\,-\, d(\text{MTL-aL II})\,=\,4.8\,-\,2.3\,=\,2.5\!<\!3.4678\\ d(\text{MTL-LS I})\,-\, d(\text{MTL-aL II} )\,=\,7.2\,-\,2.3\,=\,4.9\!>\!3.4678\\ d(\text{MTL-aL I} )\,-\,d(\text{MTL-aL II} )\,=\,3.2\,-\,2.3\,=\,0.9\!<\!3.4678\\ d(\text{MTL-L2 II} )\,-\,d{\text{(MTL-aL II} )}\,=\,3.8\,-\,2.3\,=\,1.5\!<\!3.4678\\ d(\text{MTL-LS II} )\,-\,d(\text{MTL-aL II} )\,=\,5.7\,-\,2.3\,=\,3.4\!<\!3.4678 \end{array} $$

where d(⋅) denotes the average rank of the corresponding algorithm in Table 6. We thus reach the following conclusion: on the employed UCI datasets, MTL-aLS-SVM II performs significantly better than all the single task learning methods (N-aLS-SVM, N-L2-SVM, N-LS-SVM, and N-NLSSVM) and the multi-task learning method MTL-LS-SVM I, and there are no significant differences between MTL-aLS-SVM II and MTL-aLS-SVM I, MTL-L2-SVM I, MTL-L2-SVM II, or MTL-LS-SVM II.
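
The Friedman and Nemenyi figures quoted above can be reproduced from the average ranks alone; here is a short verification sketch (the ranks are read off the rank differences listed above, and the critical difference 3.4678 is taken from the text rather than recomputed).

```python
import numpy as np

# Average ranks over the N = 5 datasets for the K = 10 algorithms, in the order
# N-aL, N-L2, N-LS, N-NL, MTL-L2 I, MTL-LS I, MTL-aL I, MTL-aL II, MTL-L2 II, MTL-LS II.
R = np.array([5.8, 6.4, 6.4, 9.4, 4.8, 7.2, 3.2, 2.3, 3.8, 5.7])
N, K = 5, 10

chi2_F = 12 * N / (K * (K + 1)) * (np.sum(R**2) - K * (K + 1)**2 / 4)  # approx. 21.1418
F_F = (N - 1) * chi2_F / (N * (K - 1) - chi2_F)                        # approx. 3.5446

CD = 3.4678                       # critical difference quoted for alpha = 0.1
diffs = R - R[7]                  # rank differences with respect to MTL-aLS-SVM II
print(round(chi2_F, 4), round(F_F, 4))
print(diffs > CD)                 # True exactly for N-aL, N-L2, N-LS, N-NL, and MTL-LS I
```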

In the next part of the experiments, we demonstrate the influence of the parameter C_1 in the multi-task learning methods MTL-aLS-SVM I and MTL-aLS-SVM II (formulations (4) and (20)), which trades off the public classification information and the dissimilarity between tasks. For this purpose, we compare MTL-aLS-SVM I, MTL-aLS-SVM II (both MTL-aLS-SVM II (L+G) and MTL-aLS-SVM II (L+P)), N-aLS-SVM, and 1-aLS-SVM (the method that employs one aLS-SVM for all tasks, regarding all tasks as one big task). We take ρ = 0.95 as an example and conduct the experiments on the (B, D) pair of the Isolet dataset. The variations of the averaged accuracy of the multi-task learning methods on each task with the value of C_1 are illustrated in Figs. 1, 2, and 3. For comparison, the averaged accuracies obtained by the single task learning methods N-aLS-SVM and 1-aLS-SVM with the linear kernel, polynomial kernel, and Gaussian kernel are also shown in Figs. 1, 2, and 3, respectively. Note that the N-aLS-SVM and 1-aLS-SVM models do not contain the parameter C_1, so their averaged accuracies are not affected by its variation. The "Accuracy" in the three figures denotes the averaged accuracy.

Fig. 1

Accuracy variations of MTL-aLS-SVM I and MTL-aLS-SVM II along with C_1; the comparison algorithms N-aLS-SVM and 1-aLS-SVM use the Linear kernel

Fig. 2

Accuracy variations of MTL-aLS-SVM I and MTL-aLS-SVM II along with C_1; the comparison algorithms N-aLS-SVM and 1-aLS-SVM use the Polynomial kernel

Fig. 3

Accuracy variations of MTL-aLS-SVM I and MTL-aLS-SVM II along with C_1; the comparison algorithms N-aLS-SVM and 1-aLS-SVM use the Gaussian kernel

The three figures show that when C_1 is small, the accuracies of MTL-aLS-SVM I and MTL-aLS-SVM II (L+P) are close to those of the conventional independent learning strategy N-aLS-SVM. When C_1 is large, the performance of MTL-aLS-SVM I and MTL-aLS-SVM II (L+P) is in line with that of 1-aLS-SVM. In contrast, the accuracy of MTL-aLS-SVM II (L+G) varies little and remains good over the whole range of C_1.

In addition, it is interesting to see that the averaged accuracies of N-aLS-SVM are always lower than those of 1-aLS-SVM. The reason is that the small number of training data per task provides less information for N-aLS-SVM. On the other hand, 1-aLS-SVM cannot deal with label-incompatible datasets (for example, the Monk and Dermatology datasets). MTL-aLS-SVM I and MTL-aLS-SVM II, however, can obtain good performance with proper values of C_1, since they can learn the correlation between tasks and thereby exploit more information.

5 Conclusion

In this paper, we have proposed the multi-task learning methods MTL-aLS-SVM I and MTL-aLS-SVM II, together with their special cases, for binary classification. MTL-aLS-SVM I combines the advantages of multi-task learning and the asymmetric least squares support vector machine. MTL-aLS-SVM II is an extension of MTL-aLS-SVM I that adopts the assumption that the models of related tasks share a common model. A regularization parameter C_1 is introduced in MTL-aLS-SVM I and MTL-aLS-SVM II to seek a trade-off between the public information and the private information dedicated to a specific task. In addition, the special cases MTL-L2-SVM II and MTL-LS-SVM II are also newly proposed multi-task learning methods, which exhibit good performance. We have conducted comprehensive experiments to test the performance of the newly proposed methods and the influence of the regularization parameter C_1. Experimental results have shown that our methods are more effective than the corresponding single task learning methods. Additionally, our methods are flexible due to the introduction of the parameter C_1. When there exists relatedness among the tasks, a proper value of C_1 can be selected to make the methods achieve good performance; on the other hand, if the tasks are independent, a small value of C_1 will make the methods learn the tasks independently.

Multi-task learning is mainly designed to explore latent information by learning all tasks jointly. Another approach of renewed interest for exploiting underlying information to improve traditional inductive learning is Learning Using Privileged Information (LUPI) [37, 38]. Our future work is to extend the proposed multi-task learning methods to the LUPI paradigm.