1 Introduction

As a milestone in the development of the support vector machine (SVM) [1], twin support vector machines (TWSVMs) have attracted much attention in recent years. The TWSVM was first introduced in [2]. After a decade of research, many variants have appeared, such as the least squares twin support vector machine (LS-TWSVM) [3], the twin bounded support vector machine (TBSVM) [4], the robust twin support vector machine (robust-TWSVM) [5] and the improved twin support vector machine (ITWSVM) [6]. A classical variant of TWSVM is \(\nu\)-TWSVM [7]. It is motivated by the classical \(\nu\)-SVM [8] and has been shown to be more effective and efficient than \(\nu\)-SVM. Experiments on both synthetic and real datasets also demonstrate the effectiveness and efficiency of TWSVMs compared to SVMs [9, 10]. TWSVM has also been applied to many machine learning areas, such as multi-view learning [11], domain adaptation [12] and clustering (TWSVC) [13]. Based on the PAC-Bayes theory, the generalization ability of TWSVM has been analyzed [14]. Novel safe screening rules have also been proposed to speed up TWSVM without performance degradation [15, 16]. More advances in TWSVM can be found in recent surveys [17, 18].

We should note that most machine learning algorithms, such as the support vector machine, linear discriminant analysis and decision trees, belong to single-task learning. Many variants of TWSVM also belong to single-task learning. In practice, we usually train multiple tasks independently, that is, one task at a time. However, this neglects the information shared among these tasks, which may be useful for improving their overall performance. Multi-task learning was thus proposed and has been studied extensively during the past two decades [19, 20]. It aims at improving the overall performance of several related tasks. Compared to single-task learning, it assumes that related tasks share underlying knowledge, which should be learned jointly so as to take full advantage of the information behind all tasks. Empirical work has demonstrated the effectiveness of multi-task learning and has also shed light on the mechanism of this learning paradigm [21].

A multi-task learning problem may be composed of several single-label learning problems, regardless of how these tasks are related. One prerequisite is that all the samples in these tasks share the same feature space, which is also termed homogeneous multi-task learning [22]. A special case of multi-task learning is multi-label learning, which studies the problem where each sample is associated with a set of labels simultaneously. The relation between these two machine learning paradigms has been clarified in [23]. If the prediction of each label is treated as a task, a multi-label learning problem can be transformed into a multi-task learning problem. By modeling the correlation of all tasks, the relations among multiple labels can be captured as well.
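To make the transformation concrete, the following minimal sketch (with hypothetical variable names; `Y` is an n-by-L binary label matrix) casts a multi-label problem into L binary tasks over a shared feature space:

```python
import numpy as np

def multilabel_to_tasks(X, Y):
    """Cast a multi-label problem (X, Y) into a list of binary tasks.

    X : (n, d) feature matrix shared by all tasks (homogeneous setting).
    Y : (n, L) binary label matrix; column j marks the presence of label j.
    Returns a list of (X, y_j) pairs, one per label, with y_j in {-1, +1}.
    """
    tasks = []
    for j in range(Y.shape[1]):
        y_j = np.where(Y[:, j] > 0, 1, -1)  # label j present -> positive class
        tasks.append((X, y_j))
    return tasks

# Example: 5 samples, 3 features, 2 labels -> two related binary tasks
X = np.random.randn(5, 3)
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]])
tasks = multilabel_to_tasks(X, Y)
```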

Early research on multi-task learning can be found in [21, 24]. It mainly focused on neural network-based multi-task learning methods and also discussed k-nearest neighbor (k-NN) and decision tree-based multi-task learning algorithms. The generalization bound of multi-task learning was also discussed in early work [25]. Since then, many multi-task learning methods have appeared, such as Bayesian multi-task learning [26] and the multi-task Gaussian process [27]. A recent survey on multi-task learning categorizes these methods into several types, including multi-task feature learning, multi-task relation learning, low-rank approaches, dirty-approach-based methods, task-clustering methods and others [22]. More surveys on multi-task learning can be found in [23, 28].

Recent progress on multi-task support vector machines is interesting as well. The first attempt is regularized multi-task learning (RMTL) [29,30,31], which assumes all tasks share a common separating hyperplane and belongs to mean-regularized multi-task learning. It has been used in human action recognition [32]. A generalized sequential minimal optimization (GSMO) algorithm has been proposed for SVM\(+\)MTL [33]. Several other multi-task SVMs have also been proposed recently, including the multi-task least squares support vector machine (MTLS-SVM) [34], the multi-task proximal support vector machine (MTPSVM) [35] and the multi-task asymmetric least squares support vector machine (MT-aLS-SVM) [36], all of which are based on a particular single-task learning method. Other variants, such as multi-task infinite latent support vector machines (MT-iLSVM) [37], the multi-task multi-class support vector machine (MTMCSVM) [38] and the multi-view multi-task support vector machine (MVMTSVM) [39], are also inspiring. Based on SVM, an online multi-task learning algorithm has been proposed for semantic concept detection in video [40]. A multi-task ranking SVM has been proposed for image co-segmentation [41]. A least squares support vector machine for semi-supervised multi-task learning has also been proposed recently [42].

In contrast to the extensive research on multi-task SVMs, little attention has been paid to multi-task TWSVMs. Recent works on multi-task TWSVMs include the direct multi-task twin support vector machine (DMTSVM) [43], the multi-task centroid twin support vector machine (MTCTSVM) [44] and the multi-task least squares twin support vector machine (MTLS-TWSVM) [45]. Compared to their single-task counterparts, these models show better generalization performance. They assume all tasks share two mean hyperplanes, one for the positive class and the other for the negative class. Inspired by multi-task learning, we propose two novel multi-task \(\nu\)-twin support vector machines (MT-\(\nu\)-TWSVMs) to take full advantage of regularized multi-task learning and \(\nu\)-TWSVM. Both models inherit the merits of \(\nu\)-TWSVM and multi-task learning and overcome the shortcomings of TWSVM, DMTSVM and MTCTSVM. Thus, our models can perform better than DMTSVM and MTCTSVM. The main contributions of this paper are as follows:

  1. We propose two novel multi-task \(\nu\)-twin support vector machines based on different assumptions. They are natural extensions of \(\nu\)-TWSVM to the multi-task setting.

  2. Both models inherit the merits of \(\nu\)-TWSVM. The fraction of support vectors is thus easier to control in our models than in other multi-task SVMs and TWSVMs.

  3. The task relation is easier to control in our models and more flexible than in other methods.

  4. Our models achieve better performance than other multi-task SVMs and TWSVMs.

The remainder of this paper is organized as follows. After a brief review of \(\nu\)-TWSVM and DMTSVM in Sect. 2, we give a detailed derivation of the proposed MT-\(\nu\)-TWSVMs in Sects. 3 and 4. Analysis of algorithms is shown in Sect. 5. The numerical experimental results are shown in Sect. 6. Finally, we show the conclusions and future work in Sect. 7.

2 Related work

Here we introduce the \(\nu\)-twin support vector machine (\(\nu\)-TWSVM) and the direct multi-task twin support vector machine (DMTSVM). We also clarify the original inspiration of these methods, since they lay the foundation for our proposed methods.

2.1 \(\nu\)-Twin support vector machine

A standard TWSVM aims at finding two nonparallel hyperplanes rather than the single hyperplane of SVM. It has also been shown to be approximately four times faster than SVM when training on large datasets. The \(\nu\)-TWSVM is similar to TWSVM. Suppose \(X_p\) represents all the positive samples and \(X_n\) the negative ones. For simplicity, denote \(A=[X_p\,e]\), \(B=[X_n\,e]\), \(u=[w_1, b_1]^\top\) and \(v=[w_2, b_2]^\top\). The model generates two nonparallel hyperplanes by solving the following problems,

$$\begin{aligned} \displaystyle {\min _{u,p,\rho _+}}\,\,&\quad \frac{1}{2}\Vert Au \Vert ^2-\nu _1\rho _++\frac{1}{l_-}e_2^\top p\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,Bu+p \ge \rho _+,\nonumber \\&\quad \rho _+,\,p\ge 0, \end{aligned}$$
(1)

and

$$\begin{aligned} \displaystyle {\min _{v,q,\rho _-}}\,\,&\quad \frac{1}{2}\Vert Bv \Vert ^2-\nu _2\rho _-+\frac{1}{l_+}e_1^\top q\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad Av+q \ge \rho _-,\nonumber \\&\quad \rho _-,\,q\ge 0. \end{aligned}$$
(2)

where \(\nu _1\) and \(\nu _2\) are positive parameters. \(l_{+}\) and \(l_{-}\) denote the numbers of positive samples and negative samples, respectively. Both \(e_{1}\) and \(e_{2}\) are vectors of ones of appropriate dimensions. Then, a new point \(x \in R^n\) is assigned to class \(i (i=+1,-1)\) by

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|x^\top w_{r}+b_{r}|. \end{aligned}$$
(3)
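As an illustration, decision rule (3) can be implemented in a few lines. This is a hedged sketch assuming \(w_1, b_1, w_2, b_2\) have already been obtained by solving (1) and (2):

```python
import numpy as np

def nu_twsvm_predict(X, w1, b1, w2, b2):
    """Assign each row of X to class +1 or -1 according to rule (3)."""
    d_pos = np.abs(X @ w1 + b1)   # |x^T w_1 + b_1|
    d_neg = np.abs(X @ w2 + b2)   # |x^T w_2 + b_2|
    return np.where(d_pos <= d_neg, 1, -1)
```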

This formulation follows the \(\nu\)-SVM. It can adjust the fraction of support vectors and has been shown to be more efficient and effective than traditional SVMs and TWSVMs. However, just like many other single-task learning models, it is not designed to deal with the commonality and individuality of multiple tasks.

2.2 Multi-task twin support vector machine

This model, termed the direct multi-task twin support vector machine (DMTSVM) [43], introduces TWSVM into the multi-task learning setting and is modeled after RMTL. Unlike multi-task support vector machines, which assume all tasks share one mean hyperplane, it assumes all tasks share two mean hyperplanes. Suppose the positive (negative) samples in the tth task are represented by \(X_{pt}\) (\(X_{nt}\)). Meanwhile, \(X_p\) represents all positive samples, while \(X_n\) stands for all negative ones. Now, we let

$$\begin{aligned} A_t=[X_{pt}\,e_t],\,B_t=[X_{nt}\,e_t],\,A=[X_p\,e],\,B=[X_n\,e], \end{aligned}$$

where \(e_t\) and e are vectors of ones of appropriate dimensions.

Suppose there are two mean hyperplanes \(u=[w_1, b_1]^\top\) and \(v=[w_2, b_2]^\top\) shared by all tasks; the two hyperplanes of the tth task are then \((u+u_t)=[w_{1t},b_{1t}]^\top\) and \((v+v_t)=[w_{2t},b_{2t}]^\top\), respectively. The bias between the hyperplanes of the tth task and the common hyperplanes u and v is captured by \(u_t\) and \(v_t\). The primal problems of DMTSVM are then as follows:

$$\begin{aligned} \displaystyle {\min _{u,u_t,p_t}}\,\,&\quad \frac{1}{2} \Vert Au \Vert _2^2+\frac{1}{2} \sum _{t=1}^{T} \rho _t \Vert A_tu_t \Vert _2^2 + c_1 \sum _{t=1}^{T} e_{2t}^\top p_t\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,B_t(u+u_t)+p_t \ge e_t, \quad p_t \ge 0, \end{aligned}$$
(4)

and

$$\begin{aligned} \displaystyle {\min _{v,v_t,q_t}}\,\,&\quad \frac{1}{2} \Vert Bv \Vert _2^2+\frac{1}{2}\sum _{t=1}^{T} \lambda _t \Vert B_tv_t \Vert _2^2 + c_2 \sum _{t=1}^{T} e_{1t}^\top q_t\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad A_t(v+v_t)+q_t \ge e_t, \quad q_t \ge 0, \end{aligned}$$
(5)

where \(t \in \{1,2,\ldots ,T\}\), and \(e_{1t}\) and \(e_{2t}\) are vectors of ones. \(c_1\) and \(c_2\) are nonnegative trade-off parameters. The relationships among all tasks can be adjusted by the parameters \(\rho _t\) and \(\lambda _t\). Both \(p_t\) and \(q_t\) are slack variables. All tasks are modeled as unrelated when \(\rho _t\rightarrow 0\) and \(\lambda _t\rightarrow 0\) simultaneously; on the contrary, all tasks are learned as identical when \(\rho _t\rightarrow \infty\) and \(\lambda _t\rightarrow \infty\). Finally, the label of a new point x in the tth task is determined by

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|x^\top w_{rt}+b_{rt}|. \end{aligned}$$
(6)

3 Multi-task \(\nu\)-twin support vector machine I

3.1 Linear case

In this section, based on regularized multi-task learning and \(\nu\)-TWSVM, we propose the following primal multi-task learning problems:

$$\begin{aligned} \displaystyle {\min _{u_0,u_t,\rho _+,p_t}}\,\,&\quad \frac{1}{2} \Vert Au_0 \Vert ^2+\frac{\mu _1}{2T} \sum _{t=1}^{T} \Vert A_tu_t \Vert ^2 -\nu _1\rho _+\nonumber \\&\quad +\,\frac{1}{l_-}\sum _{t=1}^{T} e_{2t}^\top p_t\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,B_t(u_0+u_t)+p_t \ge \rho _+,\nonumber \\&\quad \rho _+,\,p_t \ge 0, \end{aligned}$$
(7)

and

$$\begin{aligned} \displaystyle {\min _{v_0,v_t,\rho _-,q_t}}\,\,&\quad \frac{1}{2} \Vert Bv_0 \Vert ^2+\frac{\mu _2}{2T} \sum _{t=1}^{T} \Vert B_tv_t \Vert ^2 -\nu _2\rho _-\nonumber \\&\quad +\,\frac{1}{l_+}\sum _{t=1}^{T} e_{1t}^\top q_t \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad A_t(v_0+v_t)+q_t \ge \rho _-,\nonumber \\&\quad \rho _-,\,q_t \ge 0, \end{aligned}$$
(8)

where \(t \in \{1,2,\ldots ,T\}\), and \(l_+\) and \(l_-\) are the numbers of positive and negative samples, respectively. Note that \(w_{rt}\,(r\in \{+1,-1\})\) are the weight vectors of the hyperplanes of each task. Here, \(u_0\) (\(v_0\)) and \(u_t\) (\(v_t\)) capture the commonality and individuality of each task, respectively, and the difference among all tasks is controlled by \(\mu _1\) (\(\mu _2\)). Since our model takes its ideas from \(\nu\)-TWSVM, its merits differ from those of DMTSVM and MTCTSVM. Two additional variables \(\rho _\pm\) in (7) and (8) need to be optimized. Before analyzing the effect of \(\nu\), we derive the dual problems of (7) and (8). The Lagrangian function of problem (7) is given by

$$\begin{aligned} L_1& = \frac{1}{2} \Vert Au_0 \Vert ^2 +\frac{\mu _1}{2T} \sum _{t=1}^{T} \Vert A_tu_t \Vert ^2-\nu _1\rho _+ +\frac{1}{l_-} \sum _{t=1}^{T} e_{2t}^\top p_t\nonumber \\&\quad -\,\sum _{t=1}^{T}\alpha _t^\top (-B_t(u_0+u_t)+p_t-\rho _+)\nonumber \\&\quad -\,\sum _{t=1}^{T}\beta _t^\top p_t-\eta \rho _+, \end{aligned}$$
(9)

where \(\alpha _t\), \(\beta _t\) and \(\eta\) are the Lagrangian multipliers. The Karush–Kuhn–Tucker (KKT) conditions are given below

$$\begin{aligned}&\frac{\partial L}{\partial u_0}=A^\top Au_0+B^\top \alpha =0,\nonumber \\&\frac{\partial L}{\partial u_t}=\frac{\mu _1}{T}A_t^\top A_tu_t+B_t^\top \alpha _t=0,\nonumber \\&\frac{\partial L}{\partial \rho _+}=-\nu _1+e_2^\top \alpha -\eta =0 \Rightarrow e_2^\top \alpha \ge \nu _1,\nonumber \\&\frac{\partial L}{\partial p}=\frac{e_2}{l_-}-\alpha -\beta =0 \Rightarrow 0 \le \alpha \le \frac{1}{l_-}, \end{aligned}$$
(10)

where \(\alpha =[\alpha _1^\top ,\alpha _2^\top ,\ldots ,\alpha _T^\top ]^\top\) and \(p=[p_1^\top ,p_2^\top ,\ldots ,p_T^\top ]^\top\). Then, we have

$$\begin{aligned}&u_0=-(A^\top A)^{-1}B^\top \alpha ,\nonumber \\&u_t=-\frac{T}{\mu _1}(A_t^\top A_t)^{-1}B_t^\top \alpha _t. \end{aligned}$$
(11)

Then, substituting \(u_0\) and \(u_t\) into function (9) yields

$$\begin{aligned} L_1& = \frac{1}{2}u_0^\top A^\top Au_0+\frac{\mu _1}{2T}\sum _{t=1}^{T}u_t^\top A_t^\top A_tu_t\nonumber \\&\quad +\,\sum _{t=1}^{T}\alpha _t^\top B_t(u_0+u_t), \end{aligned}$$
(12)

and using the following definitions,

$$\begin{aligned}&Q=B(A^\top A)^{-1}B^\top ,\nonumber \\&P_t=B_t(A_t^\top A_t)^{-1}B_t^\top ,\nonumber \\&P={\mathrm{blkdiag}}(P_1,P_2,\ldots ,P_T), \end{aligned}$$
(13)

the dual problem of (7) can be simplified as

$$\begin{aligned} \displaystyle {\max _{\alpha }}\,\,&\quad -\frac{1}{2}\alpha ^\top \left( Q+\frac{T}{\mu _1}P\right) \alpha \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad e_2^\top \alpha \ge \nu _1,\nonumber \\&\quad 0 \le \alpha \le \frac{e_2}{l_-}. \end{aligned}$$
(14)
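To make the construction concrete, the sketch below assembles the matrices in (13) and solves the dual (14) with a generic QP solver (CVXOPT here). The small ridge term added before each matrix inversion is a practical numerical safeguard and an assumption, not part of the model:

```python
import numpy as np
from scipy.linalg import block_diag
from cvxopt import matrix, solvers

def solve_dual_I_positive(A_list, B_list, nu1, mu1, eps=1e-6):
    """Solve dual problem (14) of MT-nu-TWSVM I (positive hyperplanes).

    A_list[t] = [X_pt, e_t], B_list[t] = [X_nt, e_t] for task t (cf. Sect. 2.2).
    Returns the stacked dual vector alpha = [alpha_1; ...; alpha_T].
    """
    T = len(A_list)
    A = np.vstack(A_list)
    B = np.vstack(B_list)
    l_neg = B.shape[0]

    inv = lambda M: np.linalg.inv(M + eps * np.eye(M.shape[0]))  # ridge-regularized inverse
    Q = B @ inv(A.T @ A) @ B.T                                   # Q in (13)
    P = block_diag(*[Bt @ inv(At.T @ At) @ Bt.T
                     for At, Bt in zip(A_list, B_list)])         # P in (13)

    H = Q + (T / mu1) * P                                        # Hessian of (14)
    e2 = np.ones(l_neg)

    # max -1/2 a'Ha  s.t.  e'a >= nu1, 0 <= a <= e/l_neg,
    # rewritten as   min 1/2 a'Ha  s.t.  G a <= h
    G = np.vstack([-e2.reshape(1, -1), np.eye(l_neg), -np.eye(l_neg)])
    h = np.hstack([-nu1, np.full(l_neg, 1.0 / l_neg), np.zeros(l_neg)])
    sol = solvers.qp(matrix(H), matrix(np.zeros(l_neg)), matrix(G), matrix(h))
    return np.array(sol['x']).ravel()
```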

Similarly, we introduce the following definitions:

$$\begin{aligned}&R=A(B^\top B)^{-1}A^\top ,\nonumber \\&S_t=A_t(B_t^\top B_t)^{-1}A_t^\top ,\nonumber \\&S={\mathrm{blkdiag}}(S_1,S_2,\ldots ,S_T). \end{aligned}$$
(15)

The dual problem of (8) can be written as

$$\begin{aligned} \displaystyle {\max _{\gamma }}\,\,&\quad -\frac{1}{2}\gamma ^\top \left( R+\frac{T}{\mu _2}S\right) \gamma \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad e_1^\top \gamma \ge \nu _2,\nonumber \\&\quad 0 \le \gamma \le \frac{e_1}{l_+}. \end{aligned}$$
(16)

Finally, the label of a new sample x in the tth task can be determined by

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|x^\top w_{rt}+b_{rt}|. \end{aligned}$$
(17)
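Given the dual solutions of (14) and (16), the primal variables can be recovered from (11) and its counterpart, and a new sample of the tth task is then classified by (17). A hedged sketch, assuming samples are augmented with a trailing 1 so that \(x^\top u = w^\top x + b\):

```python
import numpy as np

def recover_primal_I(alpha, A_list, B_list, mu1, eps=1e-6):
    """Recover u0 and the task-specific u_t of MT-nu-TWSVM I from (11)."""
    T = len(A_list)
    A = np.vstack(A_list)
    B = np.vstack(B_list)
    inv = lambda M: np.linalg.inv(M + eps * np.eye(M.shape[0]))

    u0 = -inv(A.T @ A) @ B.T @ alpha                    # first line of (11)
    u_t, start = [], 0
    for At, Bt in zip(A_list, B_list):
        a_t = alpha[start:start + Bt.shape[0]]          # alpha_t block of task t
        u_t.append(-(T / mu1) * inv(At.T @ At) @ Bt.T @ a_t)
        start += Bt.shape[0]
    return u0, u_t

def predict_task(x, u0, u_t, v0, v_t, t):
    """Classify augmented sample x = [x_raw, 1] in task t according to (17)."""
    up = u0 + u_t[t]     # [w_1t, b_1t]
    vm = v0 + v_t[t]     # [w_2t, b_2t]
    return 1 if abs(x @ up) <= abs(x @ vm) else -1
```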

3.2 Nonlinear case

Since a linear classifier may not be appropriate for linearly nonseparable cases, the kernel trick can be used. Now, we introduce the kernel function \(K(\cdot ,\cdot )\) and define

$$\begin{aligned} E=[K(A,X^\top ),\,e],\,E_t=[K(A_t,X^\top ),\,e_t],\\ F=[K(B,X^\top ),\,e],\,F_t=[K(B_t,X^\top ),\,e_t]. \end{aligned}$$

where X represents the training samples from all tasks, i.e., \(X=[A_1^\top ,B_1^\top ,A_2^\top ,B_2^\top ,\ldots ,A_T^\top ,B_T^\top ]^\top\). By substituting A and B in (7) and (8) with E and F (and \(A_t\), \(B_t\) with \(E_t\), \(F_t\)), respectively, we obtain the kernel version of this model. The primal problems of the nonlinear model are

$$\begin{aligned} \displaystyle {\min _{u_0,u_t,\rho _+,p_t}}\,\,&\quad \frac{1}{2} \Vert Eu_0 \Vert ^2+\frac{\mu _1}{2T} \sum _{t=1}^{T} \Vert E_tu_t \Vert ^2 -\nu _1\rho _+\nonumber \\&\quad +\,\frac{1}{l_-} \sum _{t=1}^{T}e_{2t}^\top p_t \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,F_t(u_0+u_t)+p_t \ge \rho _+,\nonumber \\&\quad \rho _+,\,p_t \ge 0, \end{aligned}$$
(18)

and

$$\begin{aligned} \displaystyle {\min _{v_0,v_t,\rho _-,q_t}}\,\,&\quad \frac{1}{2} \Vert Fv_0 \Vert ^2+\frac{\mu _2}{2T} \sum _{t=1}^{T} \Vert F_tv_t \Vert ^2 -\nu _2\rho _-\nonumber \\&\quad +\,\frac{1}{l_+} \sum _{t=1}^{T}e_{1t}^\top q_t \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad E_t(v_0+v_t)+q_t \ge \rho _-,\nonumber \\&\quad \rho _-,\,q_t \ge 0. \end{aligned}$$
(19)

Then the corresponding decision function of the tth task is

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|K(x,X^\top )w_{rt}+b_{rt}|. \end{aligned}$$
(20)
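The augmented kernel matrices E, F, \(E_t\) and \(F_t\) can be formed directly from the task data. The sketch below uses a Gaussian (RBF) kernel as an illustrative choice; the kernel and its parameter `gamma` are assumptions, and the column ordering of X only needs to be kept consistent between training and prediction:

```python
import numpy as np

def rbf_kernel(U, V, gamma=0.5):
    """K(U, V^T) with K(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = (np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :]
          - 2.0 * U @ V.T)
    return np.exp(-gamma * d2)

def augmented_kernel_blocks(Xp_list, Xn_list, gamma=0.5):
    """Build E, F and the per-task E_t, F_t used in (18)-(19)."""
    X = np.vstack(Xp_list + Xn_list)   # all training samples (fix this order once)
    aug = lambda M: np.hstack([M, np.ones((M.shape[0], 1))])   # append ones column
    E = aug(rbf_kernel(np.vstack(Xp_list), X, gamma))
    F = aug(rbf_kernel(np.vstack(Xn_list), X, gamma))
    E_t = [aug(rbf_kernel(Xpt, X, gamma)) for Xpt in Xp_list]
    F_t = [aug(rbf_kernel(Xnt, X, gamma)) for Xnt in Xn_list]
    return E, F, E_t, F_t
```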

4 Multi-task \(\nu\)-twin support vector machine II

Although MT-\(\nu\)-TWSVM I is easy to understand, it may have a disadvantage: because the range of the parameter \(\mu _1\) (\(\mu _2\)) is \((0,+\,\infty )\), it is hard to adjust the relationship among multiple tasks. Thus, we propose another multi-task \(\nu\)-TWSVM in this section to address this problem.

4.1 Linear case

Suppose the hyperplane of the tth task can be expressed as a linear convex combination of the common vector \(u_0\) (\(v_0\)) and the task-specific vector \(u_t\) (\(v_t\)). We then propose the following problems:

$$\begin{aligned} \displaystyle {\min _{u_0,u_t,\rho _+,p_t}}\,\,&\quad \frac{\mu _1}{2} \Vert Au_0 \Vert ^2+\frac{1-\mu _1}{2T} \sum _{t=1}^{T} \Vert A_tu_t \Vert ^2 -\nu _1\rho _+ \nonumber \\&\quad +\,\frac{1}{l_-} \sum _{t=1}^{T} e_{2t}^\top p_t \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,B_t(\mu _1 u_0+(1-\mu _1)u_t)+p_t \ge \rho _+,\nonumber \\&\quad \rho _+,\,p_t \ge 0, \end{aligned}$$
(21)

and

$$\begin{aligned} \displaystyle {\min _{v_0,v_t,\rho _-,q_t}}\,\,&\quad \frac{\mu _2}{2} \Vert Bv_0 \Vert ^2+\frac{1-\mu _2}{2T} \sum _{t=1}^{T} \Vert B_tv_t \Vert ^2 -\nu _2\rho _- \nonumber \\&\quad +\,\frac{1}{l_+} \sum _{t=1}^{T}e_{1t}^\top q_t\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad A_t(\mu _2 v_0+(1-\mu _2)v_t)+q_t \ge \rho _-,\nonumber \\&\quad \rho _-,\,q_t \ge 0, \end{aligned}$$
(22)

where \(t \in \{1,2,\ldots ,T\}\).

Similar to DMTSVM, the differences among all tasks are controlled by the parameters \(\mu _1\) and \(\mu _2\). The difference from DMTSVM is that the task relation is captured by a linear convex combination of the common hyperplane \(u_0\) (\(v_0\)) and a task-specific vector \(u_t\) (\(v_t\)) for the positive (negative) class. If we set \(\mu _1=0\) and \(\mu _2=0\), then \(u_0\) and \(v_0\) have no effect on the tth task: T completely unrelated tasks will be learned, and the hyperplanes of each task may be far away from the common hyperplanes. When \(\mu _1=1\) and \(\mu _2=1\), our model reduces to an enlarged \(\nu\)-TWSVM, which means all tasks share the same two hyperplanes. Therefore, the difference among all tasks can be easily captured by the two parameters \(\mu _1\) and \(\mu _2\), which is more flexible than DMTSVM and MTCTSVM.

Since both models are based on \(\nu\)-TWSVM, their merits differ from those of DMTSVM and MTCTSVM. Two additional variables \(\rho _\pm\) in (21) and (22) need to be optimized. Before analyzing the effect of \(\nu\), we derive the dual problem of (21). The Lagrangian function of problem (21) is given by

$$\begin{aligned} L_1& = \frac{\mu _1}{2} \Vert Au_0 \Vert ^2 +\frac{1-\mu _1}{2T} \sum _{t=1}^{T} \Vert A_tu_t \Vert ^2 +\frac{1}{l_-}\sum _{t=1}^{T} e_{2t}^\top p_t\nonumber \\&\quad -\,\sum _{t=1}^{T}\alpha _t^\top (-B_t(\mu _1 u_0+(1-\mu _1)u_t)+p_t-\rho _+)\nonumber \\&\quad -\,\sum _{t=1}^{T}\beta _t^\top p_t-\eta \rho _+ -\nu _1\rho _+. \end{aligned}$$
(23)

Taking the partial derivatives of the Lagrangian function (23) with respect to (\(u_0\), \(u_t\), \(\rho _+\), p), we obtain the following KKT conditions:

$$\begin{aligned}&\frac{\partial L}{\partial u_0}=\mu _1(A^\top Au_0+B^\top \alpha )=0,\nonumber \\&\frac{\partial L}{\partial u_t}=(1-\mu _1)\left( \frac{1}{T}A_t^\top A_tu_t+B_t^\top \alpha _t\right) =0,\nonumber \\&\frac{\partial L}{\partial \rho _+}=-\nu _1+e_2^\top \alpha -\eta =0 \Rightarrow e_2^\top \alpha \ge \nu _1,\nonumber \\&\frac{\partial L}{\partial p}=\frac{e_2}{l_-}-\alpha -\beta =0 \Rightarrow 0 \le \alpha \le \frac{1}{l_-}, \end{aligned}$$
(24)

where \(\alpha =[\alpha _1^\top ,\alpha _2^\top ,\ldots ,\alpha _T^\top ]^\top\).

Then, we have the following equalities with respect to primal problem variables (\(u_0\), \(u_t\))

$$\begin{aligned}&u_0=-(A^\top A)^{-1}B^\top \alpha ,\nonumber \\&u_t=-T\cdot (A_t^\top A_t)^{-1}B_t^\top \alpha _t. \end{aligned}$$
(25)

Then, we substitute \(u_0\) and \(u_t\) into function (23)

$$\begin{aligned} L_1& = \frac{\mu _1}{2}u_0^\top A^\top Au_0+\frac{1-\mu _1}{2T}\sum _{t=1}^{T}u_t^\top A_t^\top A_tu_t\nonumber \\&\quad +\,\sum _{t=1}^{T}\alpha _t^\top B_t(\mu _1 u_0+(1-\mu _1)u_t). \end{aligned}$$
(26)

The dual problem of (21) can be simplified as

$$\begin{aligned} \displaystyle {\max _{\alpha }}\,\,&\quad -\,\frac{1}{2}\alpha ^\top (\mu _1 Q+(1-\mu _1)T\cdot P)\alpha \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad e_2^\top \alpha \ge \nu _1,\nonumber \\&\quad 0 \le \alpha \le \frac{e_2}{l_-}. \end{aligned}$$
(27)

Similarly, the dual problem of (22) can be written as

$$\begin{aligned} \displaystyle {\max _{\gamma }}\,\,&\quad -\,\frac{1}{2}\gamma ^\top (\mu _2R+(1-\mu _2)T\cdot S)\gamma \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad e_1^\top \gamma \ge \nu _2,\nonumber \\&\quad 0 \le \gamma \le \frac{e_1}{l_+}. \end{aligned}$$
(28)
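Since (27) and (28) share the constraints of (14) and (16) and differ only in the Hessian, the same QP routine can be reused with an interpolated matrix. A minimal sketch, reusing the hypothetical Q and P built earlier:

```python
# Hedged sketch: only the Hessian of the dual QP changes between the two models.
def hessian_model_I(Q, P, T, mu1):
    return Q + (T / mu1) * P                # Hessian of dual (14)

def hessian_model_II(Q, P, T, mu1):
    return mu1 * Q + (1.0 - mu1) * T * P    # Hessian of dual (27)

# With mu1 -> 1 the task-specific part vanishes (all tasks share one pair of
# hyperplanes); with mu1 -> 0 only the task-specific blocks remain.
```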

Finally, the label of a new sample x in the tth task can be determined by

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|x^\top w_{rt}+b_{rt}|. \end{aligned}$$
(29)

4.2 Nonlinear case

A linear classifier may not be suitable for training samples that are linearly inseparable; the kernel trick can be used to deal with such problems. Similarly, we introduce the kernel function \(K(\cdot ,\cdot )\) and define

$$\begin{aligned} E=[K(A,X^\top ),\,e],\,E_t=[K(A_t,X^\top ),\,e_t],\\ F=[K(B,X^\top ),\,e],\,F_t=[K(B_t,X^\top ),\,e_t], \end{aligned}$$

where X represents the training samples from all tasks, i.e., \(X=[A_1^\top ,B_1^\top ,A_2^\top ,B_2^\top ,\ldots ,A_T^\top ,B_T^\top ]^\top\). The primal problems of the nonlinear case are given as

$$\begin{aligned} \displaystyle {\min _{u_0,u_t,\rho _+,p_t}}\,\,&\frac{\mu _1}{2} \Vert Eu_0 \Vert ^2+\frac{1-\mu _1}{2T} \sum _{t=1}^{T} \Vert E_tu_t \Vert ^2 -\nu _1\rho _+\nonumber \\&\quad +\,\frac{1}{l_-} \sum _{t=1}^{T} e_{2t}^\top p_t \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad -\,F_t(\mu _1u_0+(1-\mu _1)u_t)+p_t \ge \rho _+,\nonumber \\&\quad \rho _+,\,p_t \ge 0, \end{aligned}$$
(30)

and

$$\begin{aligned} \displaystyle {\min _{v_0,v_t,\rho _-,q_t}}\,\,&\frac{\mu _2}{2} \Vert Fv_0 \Vert ^2+\frac{1-\mu _2}{2T} \sum _{t=1}^{T} \Vert F_tv_t \Vert ^2 -\nu _2\rho _-\nonumber \\&\quad +\,\frac{1}{l_+} \sum _{t=1}^{T} e_{1t}^\top q_t\nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad E_t(\mu _2v_0+(1-\mu _2)v_t)+q_t \ge \rho _-,\nonumber \\&\quad \rho _-,\,q_t \ge 0. \end{aligned}$$
(31)

Then the decision function of the tth task is

$$\begin{aligned} f(x)=\arg \min _{r=\pm 1}|K(x,X^\top )w_{rt}+b_{rt}|. \end{aligned}$$
(32)

5 Analysis of algorithms

5.1 Equivalent form of model

The dual problems of the MT-\(\nu\)-TWSVMs are similar to those of \(\nu\)-TWSVM; the difference lies only in the Hessian matrices, so these models share similar features with \(\nu\)-TWSVM. Similar to \(\nu\)-SVM and \(\nu\)-TWSVM, to compute \(\rho _\pm\), we select negative samples \(x_j\) (or positive samples \(x_i\)) with \(0< \alpha _j < \frac{1}{l_-}\) (or \(0< \gamma _i < \frac{1}{l_+}\)) from all tasks, which implies \(p_t=0\) (or \(q_t=0\)) and \(w_{1t}^\top x_j+b_{1t}=-\rho _+\) (or \(w_{2t}^\top x_i+b_{2t}=\rho _-\)). According to the KKT conditions, \(\rho _\pm\) can be calculated by

$$\begin{aligned}&\rho _+=-\frac{1}{\sum _{t=1}^{T}N_{tn}}\sum _{t=1}^{T}\sum _{j=1}^{N_{tn}}(w_{1t}^\top x_j+b_{1t}),\nonumber \\&\rho _-=\frac{1}{\sum _{t=1}^{T}N_{tp}}\sum _{t=1}^{T}\sum _{i=1}^{N_{tp}}(w_{2t}^\top x_i+b_{2t}), \end{aligned}$$
(33)

where \(N_{tn}\) and \(N_{tp}\) denote the numbers of negative and positive samples satisfying the above conditions in the tth task.
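A hedged sketch of the computation in (33): collect the dual components strictly inside the box constraints and average the corresponding hyperplane outputs (variable names are illustrative; `w1_tasks`, `b1_tasks` are the per-task \(w_{1t}\), \(b_{1t}\) recovered from the primal variables):

```python
import numpy as np

def compute_rho_plus(alpha, Xn_tasks, w1_tasks, b1_tasks, l_neg, tol=1e-8):
    """Estimate rho_+ as in (33) from negative samples with 0 < alpha_j < 1/l_neg."""
    vals, start = [], 0
    for Xnt, w1t, b1t in zip(Xn_tasks, w1_tasks, b1_tasks):
        a_t = alpha[start:start + Xnt.shape[0]]          # alpha block of this task
        mask = (a_t > tol) & (a_t < 1.0 / l_neg - tol)   # margin support vectors
        vals.extend((Xnt[mask] @ w1t + b1t).tolist())
        start += Xnt.shape[0]
    return -np.mean(vals) if vals else 0.0
```

The counterpart rho_- is obtained analogously from the positive samples with \(0< \gamma _i < \frac{1}{l_+}\), without the leading minus sign.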

Here we show an equivalent form of QPP (14). In practice, the optimal value of \(\rho _+\) (\(\rho _-\)) is strictly larger than zero. Based on this observation, we have the following Proposition 1.

Proposition 1

QPP (14) can be transformed into the following QPP.

$$\begin{aligned} \displaystyle {\max _{\alpha }}\,\,&\quad -\frac{1}{2}\alpha ^\top \left( Q+\frac{T}{\mu _1}P\right) \alpha \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad 0 \le \alpha \le \frac{e_2}{l_-},\nonumber \\&\quad e_2^\top \alpha = \nu _1. \end{aligned}$$
(34)

The difference between (14) and (34) lies in the second constraint: the inequality constraint \(e_2^\top \alpha \ge \nu _1\) can be transformed into an equality constraint \(e_2^\top \alpha = \nu _1\).

Proof

According to the KKT condition \(\eta \rho _+=0\) and the assumption \(\rho _+ > 0\), we have \(\eta =0\). Substituting \(\eta =0\) into (10) then yields the equality constraint \(e_2^\top \alpha = \nu _1\), which proves Proposition 1.

Similar to \(\nu\)-TWSVM, the dual problems (14) and (16) of MT-\(\nu\)-TWSVM I can be seen as minimizing a generalized Mahalanobis norm, defined as \(\Vert u\Vert _{GM}=\sqrt{u^\top Su}\). Here, we set \(S=Q+\frac{T}{\mu _1}P\), and problem (14) can be written as a standard generalized Mahalanobis norm minimization problem as follows,

$$\begin{aligned} \displaystyle {\min _{\alpha }}\,\,&\quad \frac{1}{2}\alpha ^\top S\alpha \nonumber \\ \,\,\,\,\text{ s.t. }\,\,&\quad 0 \le \alpha \le \alpha _m,\nonumber \\&\quad e_2^\top \alpha = 1, \end{aligned}$$
(35)

where \(\alpha _m=\frac{e_2}{\nu _1 l_-}\). Further analysis of this property can be found in [7]; the only difference lies in the Hessian matrix. Similar conclusions can be obtained for QPP (16) as well. MT-\(\nu\)-TWSVM II also has these features. \(\square\)
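For completeness, the equivalence between (34) and (35) follows from the change of variables \(\tilde{\alpha }=\alpha /\nu _1\):

$$\begin{aligned}&e_2^\top \alpha =\nu _1 \Longleftrightarrow e_2^\top \tilde{\alpha }=1,\quad 0 \le \alpha \le \frac{e_2}{l_-} \Longleftrightarrow 0 \le \tilde{\alpha } \le \frac{e_2}{\nu _1 l_-}=\alpha _m,\nonumber \\&\frac{1}{2}\alpha ^\top S\alpha =\frac{\nu _1^2}{2}\tilde{\alpha }^\top S\tilde{\alpha }, \end{aligned}$$

so minimizing over \(\alpha\) in (34) and over \(\tilde{\alpha }\) in (35) yields the same solution up to the positive scaling factor \(\nu _1\).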

5.2 Property of parameter \(\nu\)

As in \(\nu\)-TWSVM, the parameter \(\nu\) in our multi-task \(\nu\)-TWSVMs has similar properties, which are discussed in the following proposition.

Proposition 2

Suppose we run both MT-\(\nu\)-TWSVM I and II with n samples on dataset \(\mathcal {D}\), obtaining the result that \(\rho _\pm \ge 0\). Then

  1. \(\nu _2\) (or \(\nu _1\)) is an upper bound on the fraction of positive (or negative) margin errors of the common task.

  2. \(\nu _2\) (or \(\nu _1\)) is a lower bound on the fraction of positive (or negative) support vectors of the common task.

Proof

The proof of Proposition 2 is similar to that of Proposition 5 in [8]. These results can be extended to the nonlinear case by introducing the kernel function. \(\square\)

5.3 Complexity analysis

We analyze the training time complexity of the proposed algorithms. Both models require solving two smaller quadratic programming problems, which is the same as training the original single-task \(\nu\)-TWSVM on all the samples from all tasks. Although a training process involves \(2T+2\) matrix inversions, these can be reused by carefully organizing the grid-search procedure, so they do not affect the overall time complexity. Therefore, the training time complexity of our algorithms is the same as that of \(\nu\)-TWSVM. Since a quadratic programming problem with m variables can be solved in roughly \(O(m^3)\) time and each of the two dual problems has about \(\frac{l}{2}\) variables, where l is the total number of training samples in all tasks, the time complexity of our algorithms is \(O(\frac{l^3}{4})\).

According to the above analysis, training such a multi-task learning model requires additional computation compared to training a unified model on all the samples. On the one hand, however, the personality and commonality of the tasks can be modeled to improve the overall performance; on the other hand, the training tasks can help each other in a multi-task learning scenario. This is what a single-task learning method cannot achieve in practice.

6 Numerical experiments

In this section, we present experimental results for both single-task and multi-task learning algorithms. The single-task learning algorithms consist of SVM, PSVM, LSSVM, TWSVM, LSTWSVM and \(\nu\)-TWSVM, while the multi-task learning methods are MTPSVM, MTLS-SVM, MTL-aLS-SVM, DMTSVM, MTCTSVM and our proposed MT-\(\nu\)-TWSVM I and II. The numerical experiments are first conducted on three benchmark datasets. To further evaluate these methods, we also make comparisons on the popular Caltech 101 and Caltech 256 datasets.

For each algorithm, all parameters, such as \(\lambda\), \(\gamma\) and \(\rho\), are tuned by a grid-search strategy. Unless otherwise specified, all parameters are selected from the set \(\{2^i|i=-3,-2,\ldots ,8\}\). The parameter p in MTL-aLS-SVM is selected from the set \(\{0.82, 0.86,0.90,0.95\}\). The parameter \(\nu\) in \(\nu\)-TWSVM and the MT-\(\nu\)-TWSVMs is selected from the set \(\{0.1,0.2,\ldots ,1.0\}\). The parameter \(\mu\) in MT-\(\nu\)-TWSVM II is selected from the set \(\{0,0.1,\ldots ,0.9,1\}\). We use fivefold cross-validation to obtain the average performance. All experiments are conducted in MATLAB R2018b on Windows 8.1, running on a PC with an Intel(R) Core(TM) i3-6100 CPU (3.90 GHz) and 12.00 GB of RAM.
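For reproducibility, the parameter search can be organized as in the sketch below (Python, with scikit-learn used only for fold splitting; `train_mt_nu_twsvm` and `accuracy_of` are hypothetical placeholders for the training and evaluation routines described above, and the grids follow the values listed in this section):

```python
import numpy as np
from sklearn.model_selection import KFold

NU_GRID = [0.1 * i for i in range(1, 11)]   # nu in {0.1, ..., 1.0}
MU_GRID = [0.1 * i for i in range(0, 11)]   # mu in {0, 0.1, ..., 1.0}

def cv_accuracy(X, y, task_id, nu, mu, n_splits=5):
    """Fivefold cross-validated accuracy for one (nu, mu) pair."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for tr, te in kf.split(X):
        model = train_mt_nu_twsvm(X[tr], y[tr], task_id[tr], nu=nu, mu=mu)  # hypothetical trainer
        scores.append(accuracy_of(model, X[te], y[te], task_id[te]))        # hypothetical scorer
    return float(np.mean(scores))

def grid_search(X, y, task_id):
    """Return the (nu, mu) pair with the best cross-validated accuracy."""
    return max(((nu, mu) for nu in NU_GRID for mu in MU_GRID),
               key=lambda p: cv_accuracy(X, y, task_id, *p))
```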

We note one special consideration in handling multi-task learning problems in our simulations. Since training a group of unrelated tasks may have a negative impact on the performance of the proposed multi-task learning models, all the training tasks should be conceptually positively related. In our work, the training tasks satisfy this requirement to a certain extent, so the generalization ability of the proposed multi-task learning methods can be better exploited.

6.1 Benchmark datasets

In this subsection, we conduct experiments on three datasets. Their general information is shown in Table 1, and the details are as follows.

Table 1 The statistics of these three datasets

Monk This dataset comes from the first international comparison of learning algorithms and contains three Monk's problems corresponding to three tasks [46]. The domains of all tasks are the same; thus, these tasks can be seen as related. We select different numbers of samples to test these methods.

Emotions This is a multi-label dataset from the Mulan library [47] used to recognize different emotions. There are six labels in total, and each sample may have more than one label (emotion). Assuming that the recognition tasks of different emotions share similar features and can be seen as related, we cast the dataset into a multi-task classification problem in which each task is to recognize one type of emotion. We select 100 to 200 samples from this dataset to evaluate the multi-task learning algorithms.

Flags This is also a multi-label dataset from the Mulan library [47]. Each sample may have up to seven labels. Since the recognition tasks of the individual labels can be seen as related, we also treat it as a multi-task learning problem. We then select different numbers of samples from this dataset to compare the performance of the multi-task learning methods.

Finally, we use the Gaussian kernel function on the Monk dataset only. Considering the feature mapping of the Gaussian kernel, the data may become more linearly separable in the high-dimensional space, so that the classification performance of the models cannot be easily distinguished on limited testing samples. Therefore, a polynomial kernel function is also applied in our experiments, i.e.,

$$\begin{aligned} K(x_i,x_j)=(\langle x_i,x_j\rangle +c)^{d}. \end{aligned}$$
(36)

In our experiments, we set the kernel parameters \(c=1\) and \(d=2\). By the kernel trick, the input data are mapped into a high-dimensional feature space, in which a linear classifier is implemented that corresponds to a nonlinear separating surface in the input space.
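The polynomial kernel (36) with \(c=1\) and \(d=2\) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def poly_kernel(U, V, c=1.0, d=2):
    """K(x_i, x_j) = (<x_i, x_j> + c)^d, cf. (36); rows of U and V are samples."""
    return (U @ V.T + c) ** d
```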

Figures 1 and 2 show the performance comparison on the Monk dataset with the RBF kernel function. Our algorithms clearly outperform the single-task learning algorithms. The performance of each algorithm increases with the size of each task, and as the task size increases, the performance gap between the single-task learning methods and our methods decreases. This can be explained as follows. Since our models train all tasks simultaneously, they can take advantage of the information shared among all tasks when there are only a few samples in each task, whereas the single-task learning algorithms improve only as the number of samples increases. Thus, our multi-task learning methods are especially suitable for training small tasks. Meanwhile, Fig. 2 shows that our models are clearly better than the other multi-task learning algorithms when there are few samples in each task, and the performance of every algorithm improves with increasing task size. In addition, the average training time of the four multi-task TWSVMs is almost the same, while the training time of MTPSVM and MTLS-SVM is lower than that of the other algorithms.

Fig. 1 Performance comparison between our methods and six STL methods on Monk dataset (RBF kernel)

Fig. 2 Performance comparison between MTL methods on Monk dataset (RBF kernel)

The performance comparison on the Monk dataset with the polynomial kernel function is shown in Figs. 3 and 4. Our algorithms also outperform the single-task methods at varying task sizes. In addition, since our methods train all tasks simultaneously, their training time is naturally larger than that of the single-task learning algorithms. We also note that the training time of SVM, TWSVM and \(\nu\)-TWSVM is close to that of our methods when there are few training samples in each task, since these algorithms also need to solve one or two quadratic programming problems. Moreover, our algorithms perform better than the other multi-task learning methods at varying task sizes in terms of mean accuracy, while the average training time of the last four algorithms is almost the same. The training time of PSVM, LSSVM and their multi-task learning extensions MTPSVM and MTLS-SVM is the lowest among all algorithms. Overall, our algorithms perform better than both the single-task and the multi-task learning methods on the Monk dataset in terms of mean accuracy.

Fig. 3 Performance comparison between our methods and six STL methods on Monk dataset (polynomial kernel)

Fig. 4 Performance comparison between MTL methods on Monk dataset (polynomial kernel)

The experimental results of the multi-task learning algorithms on the Flags dataset with the polynomial kernel function are illustrated in Fig. 5. Our algorithms perform better than the other multi-task learning algorithms when the number of samples in each task is larger than 100. Since there are seven tasks in this dataset, we cannot assume that these tasks are strongly correlated, so the performance of the multi-task learning algorithms may not be as good here. Nevertheless, our algorithms still perform better than the three multi-task SVMs in terms of mean accuracy. In addition, the training time of MTPSVM and MTLS-SVM is clearly lower than that of the other algorithms. However, these two algorithms only solve one larger system of linear equations, while our algorithms need to solve two smaller quadratic programming problems and several small matrix inversions, so the computational cost of our methods is naturally higher.

Fig. 5 Performance comparison between MTL methods on Flags dataset (polynomial kernel)

The comparison of the multi-task learning algorithms on the Emotions dataset with the polynomial kernel function is shown in Fig. 6. In this group of experiments, our algorithms perform better than the other methods in terms of mean accuracy. MTPSVM and MTLS-SVM are faster than the other methods, and the average training time of the last five algorithms is almost the same.

Fig. 6 Performance comparison between MTL methods on Emotions dataset (polynomial kernel)

6.2 Image datasets

To further evaluate the effectiveness of the MT-\(\nu\)-TWSVMs, we conduct experiments on two image datasets. The images are selected from the Caltech 101 [48, 49] and Caltech 256 [50] datasets, which have been widely used in computer vision research. There are 102 categories in the Caltech 101 dataset, and each category has more than 50 samples. Each image has about \(300\times 200\) pixels [48]. We select about 50 samples from each category in our experiments. The Caltech 256 dataset has 256 categories of images in total, such as mammals, birds, insects and flowers. It also contains a clutter category, which can be seen as negative samples. The number of images in each category ranges from 80 to 827, and we select no more than 80 samples from each category. We then manually cluster these images into 15 main categories according to the category hierarchy, each containing three to ten classes of images. Some images are shown in Fig. 7. Note that the images in one column have similar features, while each row belongs to a different subclass. Therefore, the recognition tasks of different subclasses belonging to the same category can be regarded as a group of related tasks, and we train these tasks simultaneously to evaluate the multi-task learning methods.

Fig. 7 Samples selected from ten categories in Caltech datasets. Each column of samples belongs to the same main category, but the features of image differ in rows

As a classical image feature extractor, the scale-invariant feature transform (SIFT) algorithm [51] has been widely used in computer vision research [52,53,54]. Until a few years ago, hand-crafted features such as SIFT represented the state of the art for visual content analysis; in particular, SIFT is widely regarded as the gold standard in the context of local feature extraction [55]. In this paper, a fast and dense version of SIFT, called dense SIFT, is used together with the bag of visual words (BoVW) method to obtain the vector representation of the images. It is a fast algorithm for calculating a large number of SIFT descriptors of densely sampled features of the same scale and orientation. It not only runs faster than the original SIFT feature extractor, but also generates more feature descriptors and thus provides more information about an image, which is especially important when building the feature vector of an image with the BoVW method. The feature vector of a preprocessed image has 1000 dimensions in our experiments. Afterward, the dimension of the feature vectors is reduced with PCA so as to capture \(97\%\) of the variance and reduce the training complexity. The task is then to recognize the samples in each subclass. Finally, considering the high dimensionality of the samples, all experiments on these two datasets are conducted with a polynomial kernel function as described previously.
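A rough sketch of one possible dense-SIFT/BoVW/PCA pipeline with OpenCV and scikit-learn is given below. The exact dense-SIFT implementation used in the paper may differ, and the grid step, keypoint size and clustering settings here are illustrative assumptions; only the 1000-word vocabulary and the \(97\%\) variance ratio follow the description in the text:

```python
import cv2                     # requires an OpenCV build with SIFT (opencv-python >= 4.4)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def dense_sift(gray, step=8, size=8):
    """SIFT descriptors on a dense grid of keypoints of fixed scale/orientation.

    gray : uint8 grayscale image.
    """
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(0, gray.shape[0], step)
           for x in range(0, gray.shape[1], step)]
    _, desc = sift.compute(gray, kps)
    return desc                                      # (n_keypoints, 128)

def bovw_features(images, n_words=1000, var_ratio=0.97):
    """Images -> 1000-dim BoVW histograms -> PCA keeping 97% of the variance."""
    all_desc = [dense_sift(img) for img in images]
    vocab = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    vocab.fit(np.vstack(all_desc))                   # visual vocabulary
    hists = np.array([np.bincount(vocab.predict(d), minlength=n_words)
                      for d in all_desc], dtype=float)
    hists /= hists.sum(axis=1, keepdims=True)        # normalized word histograms
    return PCA(n_components=var_ratio).fit_transform(hists)
```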

Figures 8 and 9 illustrate the experimental results on the Caltech 101 and Caltech 256 datasets. Our methods perform better than MTPSVM, MTLS-SVM and MTL-aLS-SVM on four categories of the Caltech 101 dataset. In terms of training time, the multi-task TWSVMs are almost the same, but our methods outperform DMTSVM and MTCTSVM in accuracy. In comparison, MTL-aLS-SVM performs poorly on both metrics in most cases. In addition, our algorithms perform better than the other two multi-task TWSVMs on seven categories of the Caltech 256 dataset and better than MTLS-SVM on six categories. We notice that the training time of the last five algorithms is almost the same in this group of experiments, although they all need to solve one or two quadratic programming problems instead of one larger system of linear equations. MTPSVM and MTLS-SVM perform well in terms of average training time; the reason has been clarified in the previous section. Although the feature vectors have been reduced to a lower dimension, the dimensionality is still high compared to the number of samples in most cases; the number of features is about 300 to 600 in these two groups of experiments. In contrast to the previous results on the benchmark datasets, the ability of our algorithms to deal with such cases may not be as strong.

Fig. 8 Performance comparison between multi-task methods on Caltech 101 image dataset (polynomial kernel)

Fig. 9 Performance comparison between multi-task methods on Caltech 256 image dataset (polynomial kernel)

After presenting our experimental results, we give an overview of the accuracy levels other researchers have reported on the Caltech 101 and 256 datasets in Table 2. As we can see, in these works SVM is applied with a specific feature extractor to evaluate the performance. The recently proposed ResFeats-152 \(+\) PCA-SVM achieves the best accuracy on both datasets; it uses a deep neural network as an image feature extractor and then feeds the preprocessed feature vectors into an SVM. In contrast, the other methods use manually designed feature extractors; for example, Pyramid SIFT is a feature extractor based on SIFT. In short, the main difference between the above works on the Caltech datasets lies in the feature extraction method. Compared to these results, the accuracy of our methods is comparable to the other methods on the Caltech 101 dataset. Meanwhile, the accuracy of our methods is better than the above methods in most cases. This comparison shows the effectiveness of our methods on the Caltech datasets.

Table 2 The accuracy levels of recently proposed image recognition methods on the Caltech 101 and 256 datasets
Fig. 10 The effect of parameters \(\mu\) and \(\nu\) on the overall performance of MT-\(\nu\)-TWSVM II

Finally, to verify our hypothesis that images belonging to different subclasses of a common category can be recognized by jointly trained tasks, Fig. 10 shows the trend of mean accuracy with respect to the parameters \(\mu\) and \(\nu\) around the best parameters when the kernel parameters are fixed. The raw data come from the results of MT-\(\nu\)-TWSVM II on the Caltech 256 image dataset with an RBF kernel function. From this figure, we can directly see whether the performance is largely affected by the choice of \(\mu\) or \(\nu\). It indicates that the performance is strongly correlated with the value of \(\mu\) rather than \(\nu\). Our model achieves the highest accuracy at a larger value of \(\mu\); according to the previous analysis, this means all tasks nearly share two mean hyperplanes and are highly correlated. Thus, we should choose a larger parameter \(\mu\) in the range of [0, 1]. This result is consistent with our hypothesis, meaning the tasks selected from the Caltech 256 dataset are related and should be learned jointly rather than separately. This is not necessarily the case on all datasets, but the relationships among tasks can be inferred from such a figure, which provides a better way to choose the best parameters.

7 Conclusion and future work

In this paper, we propose two novel multi-task classifiers, termed MT-\(\nu\)-TWSVM I and II, which are natural extensions of \(\nu\)-TWSVM to multi-task learning. Both models inherit the merits of \(\nu\)-TWSVM and multi-task learning. Our analysis shows that both models share similar properties with \(\nu\)-TWSVM; the main difference lies in the two Hessian matrices, which model the personality and commonality of all tasks. Unlike the original \(\nu\)-TWSVM, it is the fraction of support vectors of the common task that can be bounded by the parameter \(\nu\), which overcomes a shortcoming of DMTSVM and MTCTSVM. In the second model, the multi-task relationship can be modeled from completely unrelated to fully related, which makes it more flexible. Experimental results on three benchmark datasets and two image datasets demonstrate the effectiveness and efficiency of our algorithms. Meanwhile, the accuracy levels other researchers obtained on the two image datasets are also discussed; this comparison further confirms that our proposed methods are powerful and competitive with other image classification approaches.

Finally, our future work will focus on speeding up the training process of multi-task SVM and TWSVMs on large datasets.