1 Introduction

Inspired by human learning activities where people often apply the knowledge learned from other tasks to help learn a related task, knowledge (memes) transfer has been investigated to enhance the optimization performance in many related optimization tasks, which can be real-world problems that have similarities in nature or different methods solving one complicated problem [1, 14, 17, 19, 49]. Based on this idea, evolutionary multitasking algorithms have been developed and verified on a range of optimization tasks [3, 8, 18, 48]. Similarly, in the community of machine learning, Multi-task Learning (MTL) solves multiple related tasks simultaneously to leverage knowledge contained in one task to help learn other tasks. MTL has shown to outperform single-task learning in many cases, from computer vision [28, 54], drug discovery [21, 22], to natural language processing [9], which validate the idea of utilizing knowledge from related tasks.

MTL handles multiple related tasks by jointly training them to improve generalization ability, either by shallow or deep models [37, 51]. There are different mechanisms to utilize information from similar tasks in MTL. In a majority of shallow model based approaches, the similarities are promoted through regularization in the global cost function, which is composed of all the tasks, such as feature based approaches [2, 6, 20, 31, 33] and task-relation based approaches [11, 16, 53]. In the deep model based approaches, one way is to let different tasks share the first several hidden layers, where common feature representations for multiple tasks are learned, and then have task-specific parameters in the subsequent layers [23, 32, 37, 54]. Inspired by regularization techniques for shallow MTL, the other way of parameter sharing in deep models is to regularize the distance between the parameters of different models to encourage the parameters to be similar [10, 44]. As a subproblem of MTL, multi-label learning deals with the problem that one instance is associated with multiple labels. Due to the special form of multi-label learning problems, a lot of methods have been proposed to improve the performance of multi-label learning by exploring the label correlations [46]. Classical methods such as classifier chains [36], calibrated label ranking [15], and random k-labelsets [43] transform the problem into a combination of classification problems. [24, 25] considered taking the symmetric label correlations as prior knowledge and incorporating it into the model training. Local label correlations are also exploited in [26, 55] by partitioning the dataset into groups and then learn local label correlations within the groups.

In most existing MTL models, the information transfer is symmetric between any two participating tasks based on the coupled cost functions, or a common feature representation is learned from all the tasks. The symmetric information sharing will deteriorate the performance of some of the participating learners since not always all tasks benefit from the joint learning [30], and the indistinguishable common feature sharing may result in performance degeneration if there are noisy and outlier tasks. Although some methods proposed for multi-label learning, such as classifier chains, can achieve asymmetric information transfer, they rely largely on the ordering or relation combination of the labels, which usually have high complexity with a large number of tasks.

Inspired by the merits of first-order gradient descent and taking into account the importance of relations among the tasks, a novel Multi-task Gradient Descent (MGD) algorithm is proposed in this paper to solve the MTL problem. In MGD, instead of modeling all the tasks into one coupled cost function, each task minimizes its individual cost function using the gradient descent algorithm. The similarities among the tasks are then facilitated through transferring model parameter values during the model learning process of each task. By implicitly changing the cost function during the learning process, MGD achieves MTL with proper transfer mechanisms. The convergence of MGD when the transfer mechanism and the step size of gradient descent satisfy certain easily achievable conditions is theoretically proven, which allows a variety of similarity measurements to be leveraged to promote information sharing. Compared with the existing approaches, the advantages of MGD are threefold:

  1. 1.

    MGD provides a novel easy-to-implement framework for MTL and more flexible way to conduct information transfer between related tasks.

  2. 2.

    It can achieve asymmetric transfer easily such that negative transfer is mitigated.

  3. 3.

    It can benefit from parallel computing with a small amount of information processed centrally when the number of tasks is large.

The rest of the paper is organized as follows. Section 2 briefly introduces related works, including MTL, multi-label learning, and transfer learning. Then, the proposed MGD is presented in Sect. 3, where the convergence of the proposed algorithm is theoretically proven, and the relation of MGD and the regularization based MTL is analyzed. Experiments on a synthetic problem, a set of real-world multi-label learning datasets, and MultiMNIST dataset using LeNet are conducted in Sect. 4 to evaluate the performance of the algorithm. Finally, Sect. 5 gives the conclusion and future work.

2 Related work

In the existing MTL works, information sharing is achieved by designing a common model concerning the related tasks, either is shallow or deep. In most of the deep model based MTL, the first several hidden layers are trained on all the tasks to learn common feature representations for multiple tasks, and the subsequent layers are left as task specific parameters which are only trained on a specific task [32, 47, 54]. The deep MTL models can learn powerful feature representations. However, the sharing hidden layers approach usually lacks of interpretability, and it is vulnerable to noisy and outlier tasks. The shallow model usually comes up with a global cost function, the relations among the tasks are promoted through a regularization term in the cost function that composed of all the tasks’ parameters, such as feature based approaches [2, 6, 20, 31, 33] and task relation based approaches [11, 16, 53]. Taking the classical feature selection method in [33] as an example, the cost function under the regularization framework is

$$\begin{aligned} \min \sum _{i=1}^T f_i(\varvec{w}_i)+\sigma \Vert W\Vert _{2,1}, \end{aligned}$$

where T is the number of tasks, \(\varvec{w}_i\) is the model parameters for the ith task, \(f_i\) denotes a function of \(\varvec{w}_i\) including the loss function and other regularization functions of \(\varvec{w}_i\), W is a matrix with the ith column being \(\varvec{w}_i\), and \(\sigma \) is a positive regularization parameter. The \(\ell _{2,1}\)-norm regularization on W equals the sum of \(\ell _2\) norm of rows in W, which enforces W to be row sparse and results in selecting important features shared across multiple tasks. In this kind of regularization based approaches, same shared features are used for all the participating tasks.

Treating a single-label learning problem as one task, the multi-label learning problem can be seen as a special case of MTL problem, where the feature vectors \({\varvec{x}}_j\) for \(j=1,\ldots ,n\) are the same for different tasks. The relations between the tasks can be learned from the relations between the labels. For first-order methods, the label correlations are ignored and the multi-label learning problem is handled in a label by label manner, such as BR [5] and LIFT [45]. Second-order methods consider pairwise relations between labels, such as LLSF [24] and JFSC [25]. High-order methods, where high-order relations among label subsets or all the labels are considered, such as RAkEL [43], ECC [36], LLSF-DL [24], and CAMEL [13]. Generally, the higher the order of correlations being considered, the stronger is the correlation-modeling capabilities, while on the other hand, the more computationally demanding and less scalable the approach becomes.

In contrast to the existing MTL approaches which incorporate correlation information into the modeling in the form of regularization or shared hidden layers, MGD serves as the first attempt to incorporate the correlations by transferring model parameter values during the model learning process of each task. Unlike the multi-label learning methods which rely on the ordering or relation combination of the labels to achieve asymmetric information transfer, MGD can achieve asymmetric transfer easily even with a large number of tasks. When considering only one task receive information from other tasks in an asymmetric transfer manner, it seems similar to transfer learning [7, 34], however, they have significant difference. The objective in transfer learning is to improve the performance of a target task with the help of source tasks which are already learned, while in MGD with only one task receives information, all the tasks are simultaneously solved. The asymmetric information transfer is utilized to mitigate negative transfer.

3 The MGD approach

In this section, we elaborate the proposed MGD algorithm for MTL. The mathematical notations used in the manuscript are first introduced. We then generically formulate the MTL problem and introduce the proposed MGD. At last, we perform the theoretical analysis of MGD, including convergence proof, analysis on relation with regularization based MTL, and computational complexity.

Throughout this paper, normal font small letters denote scalars, boldface small letters denote column vectors, and capital letters denote matrices. \(\varvec{0}\) denotes zero column vector with proper dimension, \(I_n\) denotes identity matrix of size \(n\times n\). \(A'\) denotes the transpose of matrix A and \(\otimes \) denotes the Kronecker product. \([\varvec{z}_i]_{\text {vec}}\) denotes a concatenated column vector formed by stacking \(\varvec{z}_i\) on top of each other, and \(\text {diag}\{z_i\}\) denotes a diagonal matrix with the ith diagonal element being \(z_i\). The norm \(\Vert \cdot \Vert \) without specifying the subscript represents the Euclidean norm by default.

3.1 Problem formulation

Suppose we have T tasks to be solved simultaneously. For simplicity, we assume that all the tasks share the common data space and the tasks are positively correlated. Each task \(i\in \{1,\ldots ,T\}\) aims to solve the following minimization problem,

$$\begin{aligned} \min _{\varvec{w}_i} \ f_i(\varvec{w}_i), \end{aligned}$$
(1)

where \(\varvec{w}_i\in {\mathbb {R}}^{d}\) is the model parameter and \(f_i: \mathbb R^{d}\rightarrow \mathbb R\) is the cost function of the ith task. In this paper, we do not restrict the specific form of the cost functions. In particular, the cost functions \(f_i(\varvec{w}_i)\) is assumed to be strongly convex, twice differentiable, and the gradient of \(f_i\) is Lipschitz continuous with constant \(L_{f_i}\), i.e.,

$$\begin{aligned} \Vert \nabla f_i(\varvec{u})-\nabla f_i(\varvec{v})\Vert \le L_{f_i} \Vert \varvec{u}-\varvec{v}\Vert ,\quad \forall \varvec{u}, \varvec{v}\in \mathbb R^{d}. \end{aligned}$$

Cost functions in machine learning problems such as mean squared error with \(\ell _2\) norm regularization and cross-entropy with \(\ell _2\) norm regularization apply. Non-differentiable cost functions where \(\ell _1\) norm regularization is used can also be approximately considered [40]. Since \(f_i(\varvec{w}_i)\) is strongly convex and twice differentiable, there exists positive constant \(\xi _i\) such that \(\nabla ^2 f_i(\varvec{u})\ge \xi _i I_d\). As a result, we have

$$\begin{aligned} \xi _iI_d\le \nabla ^2f_i(\varvec{u})\le L_{f_i}I_d, \ \forall \varvec{u}\in \mathbb R^d. \end{aligned}$$

3.2 The proposed framework

Using the gradient descent, problem (1) can be solved using the following iteration,

$$\begin{aligned} \varvec{w}_i^{t+1} = \varvec{w}_i^{t}-\alpha \nabla f_i(\varvec{w}_i^t), \end{aligned}$$
(2)

where t is the iteration index, \(\alpha \) is the step size, and \(\nabla f_i(\varvec{w}_i^t)\in \mathbb R^d\) is the gradient of \(f_i\) at \(\varvec{w}_i^t\). As there are relations among tasks, we are able to improve the learning performance by considering the correlation of parameters belonging to different tasks. Based on this idea, we propose a reformative gradient descent iteration, which allows the values of the model parameters to be transferred across tasks during each iteration. The MGD is designed as follows,

$$\begin{aligned} \varvec{w}_i^{t+1} = \sum _{j=1}^T m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t),\ i=1,\ldots ,T, \end{aligned}$$
(3)

where \(m_{ij}^t\) is the transfer coefficient describes the information flow from task j to task i, which satisfies the following conditions,

$$\begin{aligned} m_{ij}^t\ge 0, \end{aligned}$$
(4a)
$$\begin{aligned} \sum _{j=1}^T m_{ij}^t = 1. \end{aligned}$$
(4b)

From (3), we can see that under MGD, the learning of different tasks can be decoupled with only a small amount of information need to be transferred among the tasks, thus can benefit from parallel computing.

From (4b), we have \(m_{ii}^t = 1-\sum _{j\ne i}m_{ij}^t\). Rewriting iteration (3) as follows

$$\begin{aligned} \varvec{w}_i^{t+1}&= m_{ii}^t\varvec{w}_i^t+\sum _{j\ne i}m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\\&=\left( 1-\sum _{j\ne i}m_{ij}^t\right) \varvec{w}_i^t+\sum _{j\ne i}m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$

Rescale \(m_{ij}^t\) as

$$\begin{aligned} {\bar{m}}_{ij}^t= {\left\{ \begin{array}{ll} \frac{1}{\alpha \sigma }m_{ij}^t, &{} j\ne i,\\ 1-\frac{1}{\alpha \sigma }\sum _{j\ne i}m_{ij}^t, &{} j=i, \end{array}\right. } \end{aligned}$$
(5)

where \(\sigma \) is a positive constant and satisfies the condition

$$\begin{aligned} 1-\frac{1}{\alpha \sigma }\sum _{j\ne i}m_{ij}^t>0. \end{aligned}$$
(6)

Given (5), \(m_{ij}^t\) is parameterized by \(\sigma \). With the rescaling, the iteration in (3) can be alternatively expressed as

$$\begin{aligned} \varvec{w}_i^{t+1}\!&= \! \left( 1-\!\alpha \sigma \!\sum _{j\ne i}{\bar{m}}_{ij}^t\right) \varvec{w}_i^t+\alpha \sigma \sum _{j\ne i}{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&= \varvec{w}_i^t-\alpha \sigma (1-{\bar{m}}_{ii}^t)\varvec{w}_i^t+\alpha \sigma \sum _{j\ne i}{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&= (1-\alpha \sigma )\varvec{w}_i^t+\alpha \sigma \sum _{j=1}^T {\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$
(7)

3.3 Convergence analysis

In this section, we give the convergence property of the proposed MGD iteration based on the expression in (7).

Denote \(\varvec{w}_i^*\) as the underlying target model parameter for task i, \(\tilde{\varvec{w}_i}^t = \varvec{w}_i^t-\varvec{w}_i^*\), and \({\bar{L}}_{f_i} = \max _i\{L_{f_i}\}\). The following theorem gives the convergence property of the iteration (7) under certain conditions on the transfer parameter \({\bar{m}}_{ij}^t\) and step-size \(\alpha \).

Theorem 1

Under the iteration in (7) with the transfer coefficient \({\bar{m}}_{ij}^t\) satisfying

$$\begin{aligned}&\sum _{j=1}^T{\bar{m}}_{ij}^t=1,\ \forall i,\\&{\bar{m}}_{ij}^t\ge 0,\ \forall i,j, \end{aligned}$$

\(\varvec{w}_i^t\) is convergent if the step size \(\alpha \) is chosen to satisfy

$$\begin{aligned} 0<\alpha <\frac{2}{2\sigma + {\bar{L}}_{f_i}}. \end{aligned}$$
(8)

Specifically,

$$\begin{aligned} \lim _{t\rightarrow \infty } \max _i\Vert \tilde{\varvec{w}_i}^{t}\Vert \le \frac{\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \max _i\Vert \nabla f_i(\varvec{w}_i^*)\Vert }{1-({\bar{\gamma }}+\alpha \sigma )}, \end{aligned}$$
(9)

where \({\bar{\gamma }} = \max _i\{|1-\alpha \sigma -\alpha \xi _i|,|1-\alpha \sigma -\alpha L_{f_i}|\}\).

Proof

Let the ij-th element of \( {\bar{M}}^t\in \mathbb R^{T\times T}\) at iteration time t being \({\bar{m}}_{ij}^t\), denote \(\bar{\mathcal M}^t = {\bar{M}}^t\otimes I_d\in \mathbb R^{dT\times dT}\), \(\varvec{\mathrm {w}}= [\varvec{w}_1^{'}, \ldots , \varvec{w}_T^{'}]'\in \mathbb R^{dT}\), and \(\nabla f(\varvec{\mathrm {w}}^t) = [\nabla f_1(\varvec{w}_1^t)',\ldots ,\nabla f_T(\varvec{w}_T^t)']'\in \mathbb R^{dT}\). Note that we are using the typeface \(\varvec{\mathrm {w}}\) to distinguish this from the single vector-valued variable \(\varvec{w}_i\). Write (7) into a concatenated form gives

$$\begin{aligned} \varvec{\mathrm {w}}^{t+1} = (1-\alpha \sigma )\varvec{\mathrm {w}}^t+\alpha \sigma \bar{\mathcal M}^t\varvec{\mathrm {w}}^t-\alpha \nabla f(\varvec{\mathrm {w}}^t). \end{aligned}$$
(10)

Denote \(\varvec{\mathrm {w}}^* = [\varvec{w}_1^{*'},\ldots ,\varvec{w}_T^{*'}]'\) and \({\tilde{\varvec{\mathrm {w}}}}^t=\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*\). Subtracting \(\varvec{\mathrm {w}}^*\) from both sides of (10) gives

$$\begin{aligned} {\tilde{\varvec{\mathrm {w}}}}^{t+1}&= ((1-\alpha \sigma )I_{dT}+\alpha \sigma \bar{\mathcal M}^t)\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*-\alpha \nabla f(\varvec{\mathrm {w}}^t)\nonumber \\&= ((1-\alpha \sigma ) I_{dT}{+}\alpha \sigma \bar{\mathcal M}^t){\tilde{\varvec{\mathrm {w}}}}^t-\alpha (\nabla f(\varvec{\mathrm {w}}^t){-}\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&= ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t){\tilde{\varvec{\mathrm {w}}}}^t\nonumber \\&\quad -\,\alpha \int _0^1 \nabla ^2 f(\varvec{\mathrm {w}}^*+\mu (\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*))d\mu {\tilde{\varvec{\mathrm {w}}}}\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&= ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t){\tilde{\varvec{\mathrm {w}}}}^t\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)), \end{aligned}$$
(11)

where \(H^t = \int _0^1\nabla ^2 f(\varvec{\mathrm {w}}^*+\mu (\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*))d\mu \in \mathbb R^{dT\times dT}\). It can be verified that \(H^t\) is a block diagonal matrix and the block diagonal elements \(H_i^t = \int _0^1\nabla ^2 f_i(\varvec{w}_i^*+\mu (\varvec{w}_i^t-\varvec{w}_i^*))d\mu \in \mathbb R^{d\times d}\) for \(i=1,\ldots ,T\) are Hermitian.

We use the block maximum norm defined in [39] to show the convergence of the above iteration. The block maximum norm of a vector \({\varvec{x}} = [{\varvec{x}}_i]_{\text {vec}}\in \mathbb R^{dT}\) with \({\varvec{x}}_i\in \mathbb R^d\) is defined as [39]

$$\begin{aligned} \Vert {\varvec{x}}\Vert _{b,\infty } = \max _{i}\Vert {\varvec{x}}_i\Vert . \end{aligned}$$

The induced matrix block maximum norm is therefore defined as [39]

$$\begin{aligned} \Vert A\Vert _{b,\infty } = \max _{{\varvec{x}}\ne 0}\frac{\Vert A{\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}. \end{aligned}$$

From the iteration in (11) we have

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty } \le \Vert ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t){\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }\\&\quad \le \Vert (1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t\Vert _{b,\infty }\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }\\&\quad \le (\Vert (1-\alpha \sigma ) I_{dT}-\alpha H^t\Vert _{b,\infty }+\alpha \sigma \Vert \bar{\mathcal M}^t\Vert _{b,\infty })\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$

From Lemma D.3 in [39], we have

$$\begin{aligned} \Vert \bar{\mathcal M}^t\Vert _{b,\infty } = \Vert {\bar{M}}^t\Vert _{\infty } = 1, \end{aligned}$$

where the last equality comes from the fact that \({\bar{m}}_{ij}^t\ge 0\) and the row summation of \({\bar{M}}^t\) is one. Since \(\xi _iI_d\le \nabla ^2f_i(\varvec{w}_i)\le L_{f_i}I_d\), \(\xi _iI_d\le \int _0^1\nabla ^2 f_i(\varvec{w}_i^*+\mu (\varvec{w}_i-\varvec{w}_i^*)d\mu \le L_{f_i}I_d\). Thus, \(\Vert (1-\alpha \sigma ) I_d-\alpha H_i^t\Vert \le \gamma _i\) where \(\gamma _i = \max \{|1-\alpha \sigma -\alpha \xi _i|,|1-\alpha \sigma -\alpha L_{f_i}|\}\). By the definition of induced matrix block maximum norm, we have

$$\begin{aligned}&\Vert (1-\alpha \sigma )I_{dT}-\alpha H^t\Vert _{b,\infty } \\&\quad = \max _{{\varvec{x}}\ne 0}\frac{\Vert ((1-\alpha \sigma )I_{dT}-\alpha H^t){\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}\\&\quad \le \max _{{\varvec{x}}\ne 0}\frac{\max _i\Vert ((1-\alpha \sigma )I_{d}-\alpha H_i^t)\Vert \Vert {\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}\\&\quad = \max _i \Vert (1-\alpha \sigma )I_{d}-\alpha H_i^t\Vert \\&\quad \le {\bar{\gamma }}, \end{aligned}$$

where \({\bar{\gamma }} = \max \{\gamma _i\}\). Thus,

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }\le ({\bar{\gamma }}+\alpha \sigma )\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\alpha \sigma \Vert (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*\Vert _{b,\infty }\nonumber \\&\quad +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$
(12)

By choosing the step size \(\alpha \) to satisfy \({\bar{\gamma }}+\alpha \sigma <1\), the iteration asymptotically converges. To ensure \({\bar{\gamma }}+\alpha \sigma <1\), it is sufficient to ensure

$$\begin{aligned} |1-\alpha \sigma -\alpha \xi _i|+\alpha \sigma<1 \ \text {and} \\ |1-\alpha \sigma -\alpha L_{f_i}|+\alpha \sigma <1,\ \forall i, \end{aligned}$$

which leads to

$$\begin{aligned} 0<\alpha <\frac{2}{2\sigma + {\bar{L}}_{f_i}}. \end{aligned}$$

Since \({\bar{M}}^t\) is row-sum-to-one and the elements of \({\bar{M}}^t\) are all non-negative, the elements in \(\bar{\mathcal M}^t\varvec{\mathrm {w}}^*\) are convex combinations of \(\varvec{w}_i^*\). Thus, \(\Vert (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*\Vert _{b,\infty }\) is upper bounded by \(\max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert \). From the iteration in (12), we have

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }\le ({\bar{\gamma }}+\alpha \sigma )^{t+1}\Vert {\tilde{\varvec{\mathrm {w}}}}^{0}\Vert _{b,\infty }\\&+(\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty })\sum _{k=0}^t({\bar{\gamma }}+\alpha \sigma )^k. \end{aligned}$$

Under the condition that \({\bar{\gamma }}+\alpha \sigma <1\),

$$\begin{aligned} \lim _{t\rightarrow \infty } \Vert {\tilde{\varvec{\mathrm {w}}}}^{t}\Vert _{b,\infty } \le \frac{\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }}{1-({\bar{\gamma }}+\alpha \sigma )}. \end{aligned}$$

From the definition of block maximum norm, (9) is obtained. \(\square \)

In iteration (3), the transfer coefficient \(m_{ij}^t\) between task i and task j is a scalar. In the following, we consider the element-wise feature similarities between task i and task j. The transfer coefficient between task i and task j is assumed to be a diagonal matrix \(P_{ij}\in \mathbb R^{d\times d}\) with its k-th diagonal element \(P_{ij,k}\) being the transfer coefficient from the k-th element of \(\varvec{w}_j\) to the k-th element of \(\varvec{w}_i\). The MGD iteration in (3) then becomes

$$\begin{aligned} \varvec{w}_i^{t+1} = \sum _{j=1}^T P_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t), \end{aligned}$$
(13)

where

$$\begin{aligned}&\sum _{j=1}^TP_{ij}^t = I_d,\nonumber \\&P_{ij,k}^t\ge 0,\ \forall i,j=1,\ldots ,T, k=1,\ldots ,d. \end{aligned}$$
(14)

Following the same rescaling,

$$\begin{aligned} {\bar{P}}_{ij}^t= {\left\{ \begin{array}{ll} \frac{1}{\alpha \sigma }P_{ij}^t, &{} j\ne i,\\ I_d-\frac{1}{\alpha \sigma }\sum _{j\ne i}P_{ij}^t, &{} j=i, \end{array}\right. } \end{aligned}$$
(15)

(13) becomes

$$\begin{aligned} \varvec{w}_i^{t+1} = (1-\alpha \sigma )\varvec{w}_i^t + \alpha \sigma \sum _{j=1}^T{\bar{P}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$
(16)

Corollary 1

Under (16) with the transfer coefficient \({\bar{P}}_{ij}^t\) satisfies

$$\begin{aligned}&\sum _{j=1}^T{\bar{P}}_{ij}^t = I_d,\\&{\bar{P}}_{ij,k}^t\ge 0,\ \forall i,j=1,\ldots ,T, k=1,\ldots ,d, \end{aligned}$$

\(\varvec{w}_i^t\) is convergent if the following conditions are satisfied:

$$\begin{aligned} \sigma<\frac{{\bar{L}}_{f_i}}{T-1},\ for \ T>1,\\ 0<\alpha <\frac{2}{(T+1)\sigma +{\bar{L}}_{f_i}}. \end{aligned}$$

Proof

Let the ij-th block element of \(\bar{\mathcal P}^t\in \mathbb R^{dT\times dT}\) being \({\bar{P}}_{ij}^t\in \mathbb R^{d\times d}\). Following the similar procedure of the proof of Theorem 1, we obtain

$$\begin{aligned} \Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }&\le (\Vert (1-\alpha \sigma ) I_{dT}-\alpha H^t\Vert _{b,\infty }\nonumber \\&\quad +\alpha \sigma \Vert \bar{\mathcal P}^t\Vert _{b,\infty })\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\nonumber \\&\quad +\alpha \Vert \sigma (\bar{\mathcal P}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$
(17)

Let \({\varvec{x}}=[{\varvec{x}}_i]_{\text {vec}}\in \mathbb R^{dT}\) being a block column vector with \({\varvec{x}}_i\in \mathbb R^d\).

$$\begin{aligned} \Vert \bar{\mathcal P}^t {\varvec{x}}\Vert _{b,\infty }&=\max _{i} \Vert \sum _{j=1}^T{\bar{P}}_{ij}^t {\varvec{x}}_j\Vert \\&\le \max _i\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \Vert {\varvec{x}}_j\Vert \\&\le \left( \max _i\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \right) \max _j\Vert {\varvec{x}}_j\Vert . \end{aligned}$$

Recall that \({\bar{P}}_{ij}^t\) is a diagonal matrix and the elements therein are all no greater than 1, thus, \(\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \le T\). As a result

$$\begin{aligned} \Vert \bar{\mathcal P}^t {\varvec{x}}\Vert _{b,\infty }\le T\max _j\Vert {\varvec{x}}_j\Vert . \end{aligned}$$

By the definition of matrix block maximum norm, we have

$$\begin{aligned} \Vert \bar{\mathcal P}^t\Vert _{b,\infty }\le T. \end{aligned}$$

The condition to ensure convergence of the iteration in (17) becomes

$$\begin{aligned} {\bar{\gamma }}+\alpha \sigma T<1, \end{aligned}$$

which gives

$$\begin{aligned} \sigma<\frac{{\bar{L}}_{f_i}}{T-1},\ \text {for}\ T\ne 1,\\ 0<\alpha <\frac{2}{(T+1)\sigma +{\bar{L}}_{f_i}}. \end{aligned}$$

\(\square \)

3.4 Relation with regularization based multi-task learning

From the iteration in (7), we have

$$\begin{aligned} \varvec{w}_i^{t+1}&= (1-\alpha \sigma )\varvec{w}_i^t \nonumber \\&\quad +\, \alpha \sigma \sum _{j=1}^T{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&=\varvec{w}_i^t -\alpha \left( \sigma \sum _{j=1}^T{\bar{m}}_{ij}^t(\varvec{w}_i^t-\varvec{w}_j^t)+\nabla f_i(\varvec{w}_i^t)\right) . \end{aligned}$$
(18)

If fix \({\bar{m}}_{ij}^t = {\bar{m}}_{ij}\) for all t, then, the last term in the brackets can be seen as the gradient of the following function

$$\begin{aligned} {\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i}) = f_i(\varvec{w}_i)+\frac{1}{2}\sigma \sum _{j=1}^T {\bar{m}}_{ij}\Vert \varvec{w}_i-\varvec{w}_j\Vert ^2, \end{aligned}$$

where \(\varvec{w}_{-i}\) denotes the collection of other tasks’ variables, i.e., \(\varvec{w}_{-i}=[\varvec{w}_1',\ldots , \varvec{w}_{i-1}',\varvec{w}_{i+1}',\ldots ,\varvec{w}_T']'\). Thus, the iteration in (18) with fixed \({\bar{m}}_{ij}\) can be seen as the gradient descent algorithm which solves the following Nash equilibrium problem

$$\begin{aligned} \min _{\varvec{w}_i}\ {\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i}),\ i=1,\ldots ,T. \end{aligned}$$
(19)

In (19), each task’s cost function is influenced by other tasks’ decision variables. Since the cost function \({\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})\) is continuous in all its arguments, strongly convex with respect to \(\varvec{w}_i\) for fixed \(\varvec{w}_{-i}\), and satisfies \({\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})\rightarrow \infty \) as \(\Vert \varvec{w}_i\Vert \rightarrow \infty \) for fixed \(\varvec{w}_{-i}\), a Nash equilibrium exists [4]. Furthermore, as a result of strongly convexity, the gradient of \({\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})\) with respect to \(\varvec{w}_i\) for fixed \(\varvec{w}_{-i}\) is strongly monotone. Thus, the Nash equilibrium for (19) is unique [12]. Denote the Nash equilibrium of (19) as \(\varvec{w}_i^o\), \(i=\{1,\ldots ,T\}\). It is known that the Nash equilibrium satisfies the following condition [4]:

$$\begin{aligned} \varvec{w}_i^o = \text {argmin}_{\varvec{w}_i} {\bar{f}}_i(\varvec{w}_i, \varvec{w}^o_{-i}),\ i=1,\ldots ,T, \end{aligned}$$

which implies

$$\begin{aligned} \nabla f_i(\varvec{w}_i^o)+\sigma \sum _{j=1}^T{\bar{m}}_{ij}(\varvec{w}_i^o -\varvec{w}_j^o) = 0, \ i=1,\ldots ,T. \end{aligned}$$
(20)

Write the conditions in (20) in a concatenated form gives

$$\begin{aligned} \nabla f(\varvec{\mathrm {w}}^o)+\sigma (I_{T}-{\bar{M}})\otimes I_d\varvec{\mathrm {w}}^o = 0. \end{aligned}$$
(21)

It has been pointed out in [51] that the regularized MTL algorithms which learn with task relations can be expressed as

$$\begin{aligned} \min _{\varvec{w}_i,\varSigma } \sum _{i=1}^T L_i(\varvec{w}_i)+\frac{1}{2}\lambda \varvec{\mathrm {w}}^T(\varSigma ^{-1}\otimes I_d)\varvec{\mathrm {w}}+g(\varSigma ), \end{aligned}$$
(22)

where \(L_i\) is the training loss of task i, \(\lambda \) is a positive regularization parameter, \(\varSigma \in \mathbb R^{T\times T}\) models the task relations, and \(g(\varSigma )\) denotes constraints on \(\varSigma \). For comparison, we eliminate the constraints on \(\varSigma \), consider the case that \(\varSigma \) is fixed, and let \(f(\varvec{\mathrm {w}}) = \sum _{i=1}^TL_{i}(\varvec{w}_i)\). Denote the optimal solutions of problem (22) as \(\varvec{\mathrm {w}}^{g}\). The optimal solution satisfies the following condition,

$$\begin{aligned} \nabla f(\varvec{\mathrm {w}}^{g})+\frac{1}{2}\lambda (\varSigma ^{-1}+(\varSigma ^{-1})^T)\otimes I_d\varvec{\mathrm {w}}^{g}=0. \end{aligned}$$
(23)

Comparing the optimality conditions (21) and (23) for the Nash equilibrium problem (19) and the MTL problem (22), we find that if \({\bar{M}}\) can be set as

$$\begin{aligned} \sigma (I_{T}-{\bar{M}}) = \frac{1}{2}\lambda (\varSigma ^{-1}+(\varSigma ^{-1})^T), \end{aligned}$$
(24)

the optimal solution \(\varvec{\mathrm {w}}^o\) will be the same as \(\varvec{\mathrm {w}}^g\). The only limitation is that \({\bar{m}}_{ij}>0\), which can not cover the situation where there exists non-negative non-diagonal values in \(\varSigma ^{-1}\). Overall, the regularized multi-tasking learning problem with task relation learning can be solved by the MGD algorithm by setting the coefficients \({\bar{m}}_{ij}\) between task i and task j properly. In addition, using MGD, we can consider feature-feature relations between different tasks since we can use \({\bar{P}}_{ij}\in \mathbb R^{d\times d}\) as the transfer coefficient. Furthermore, in MGD, \({\bar{m}}_{ij}\) is not required to be equal to \({\bar{m}}_{ji}\). This relaxation allows asymmetric task relations in MTL [30], which is hard to achieve by most MTL methods since \(\varSigma ^{-1}+(\varSigma ^{-1})^T\) is always symmetric in (23).

Another category of regularized MTL method is learning with feature relations [51]. The cost function of this kind of method is

$$\begin{aligned} \min _{\varvec{w}_i,\varTheta } \sum _{i=1}^T L_i(\varvec{w}_i)+\frac{1}{2}\lambda \varvec{\mathrm {w}}^T(I_T\otimes \varTheta ^{-1})\varvec{\mathrm {w}}+g(\varTheta ), \end{aligned}$$
(25)

where \(\varTheta \in \mathbb R^{d\times d}\) models the covariance between the features. The term \(\varvec{\mathrm {w}}^T(I_T\otimes \varTheta ^{-1})\varvec{\mathrm {w}}\) can be decoupled as \(\sum _{i=1}^T \varvec{w}_i^T\varTheta ^{-1}\varvec{w}_i\), which can be incorporated into \(f_i(\varvec{w}_i)\) for task i.

3.5 Pseudocode of the MGD algorithm

MGD provides a novel framework of utilizing task similarities to improve the learning performance. The pseudocode of the proposed MGD is summarized in Algorithm 1.

figure a

As shown in Theorem 1, the transfer coefficients are only required to satisfy mild conditions to ensure the convergence of the proposed algorithm, which allows a variety of existing task relation learning methods to be used to design the transfer coefficients. A straightforward way to set the transfer coefficient is based on task similarities. The more similar two tasks are, the larger the corresponding transfer coefficient is expected to be. For a multi-label learning problem, the similarity between task i and task j can be modeled by the correlation between the label sets, which can be calculated by many different similarity measurements, such as cosine similarity and Euclidean distance between the labels. By assuming the task relations be known as a prior like in [11, 27], the transfer coefficient can be designed utilizing the relation between MGD and the regularization based MTL. In addition to set the transfer coefficients as a predefined value according to prior assumption or statistical methods, the value of \(m_{ij}^t\) can also be learned during the learning of the parameters. Methods placing a prior on the learning parameters like in [50, 52, 53] can also be utilized to design the transfer coefficients. Furthermore, the regularization based MTL methods usually result in symmetric task relations, while MGD can achieve asymmetric transfer easily for a specific problem such that negative transfer can be mitigated.

3.6 Complexity analysis

We analyze the complexity of the iteration MGD using predefined transfer coefficients. In each iteration, the gradient calculation leads to a complexity of \(\mathcal O(g(d)nT)\), where g(d) is the complexity of calculating the gradient w.r.t. the dimension d, which is determined by the cost function used, and the update of the model parameter according to (3) needs \(\mathcal O(dT^2)\). Therefore, the overall complexity of the MGD algorithms is of order \(\mathcal O(t(ng(d)T+dT^2))\), where t is the iteration time.

4 Experiments

In this section, we evaluate the MGD algorithm on different types of MTL problems, including regression problems, multi-label learning as multi-task learning problems, and a deep neural network model. Specifically, we first conduct a simple linear regression using synthetic datasets to demonstrate the effect of MGD compared to single-task gradient descent. Then, we validate the effectiveness of MGD for the multi-label learning problem on a series of real-world multi-label learning datasets, and compare it with both classical and state-of-the-art algorithms. Finally, we use MGD for digit classification on the MultiMNIST dataset, an MTL version of the MNIST dataset, based on LeNet. To show the effectiveness of the proposed framework, the transfer coefficients are set manually or through statistical methods, which are simple but already effective. More sophisticated transfer coefficients can be used to further improve the performance.

4.1 Toy problem

We first validate our approach by an experiment with synthetic datasets. We generate two synthetic datasets for regression that have same mean but different variances. Specifically, the noise level for the first task is set to be low while that for the second task is set to be high as follows,

$$\begin{aligned} T_1: y_{1j} =&w^*x_{1j}+\delta ,\ j=1,\ldots ,n,\\ T_2: y_{2j} =&w^*x_{2j}+10\delta ,\ j=1,\ldots ,n, \end{aligned}$$

where n is the number of data samples and \(\delta \sim \mathcal N(0,1)\). Taking the Mean Squared Error (MSE) as the cost function, denote \({\varvec{x}}_i\) and \(\varvec{y}_i\) as the data vector for task i, by minimizing the cost function, Single-task Gradient Descent (SGD) gives the following iteration

$$\begin{aligned} w_i^{t+1} =w_i^t-\frac{\alpha _i}{n} {\varvec{x}}_i^T(w_i^t{\varvec{x}}_i-\varvec{y}_i),\ i = 1,2. \end{aligned}$$

Under MGD, the iteration is

$$\begin{aligned} w_i^{t+1} = \sum _{j=1}^2 m_{ij}^tw_j^t-\frac{\alpha _i}{n} {\varvec{x}}_i^T(w_i^t{\varvec{x}}_i-\varvec{y}_i),\ i=1,2. \end{aligned}$$

Note that although we can directly obtain the analytic expression which minimizes the MSE for this simple problem, iteration method is more commonly used for most problems, and we use iteration here to showcase the difference between MGD and SGD.

We set \(w^* = 2\), \(\alpha _1=\alpha _2 = 0.2\). We generate 100 points for each task from the standard Gaussian distribution and use 5-folder split for training and test, and run the iteration 50 times. First, we use a fixed transfer coefficients \(m_{12}=m_{21}=0.05\) for symmetric transfer. Then, since we already know that task 2 has a higher noisy level than task 1, it is believed that transfer information from task 2 to task 1 will probably cause negative transfer, thus, we conduct asymmetric transfer with transfer coefficients \(m_{12}=0.0001\) and \(m_{21} = 0.05\).

The results are shown in Fig. 1.

Fig. 1
figure 1

Linear regression under SGD and MGD

Figure 1a shows the updates of the absolute error between \(w_i\) and \(w^*\), it can be seen that the parameter of task 2 under both MGD and MGD-asy converge to better solutions than under SGD, while the parameter of task 1 under MGD converges to a solution worse than under SGD, and that of MGD-asy converges to a solution similar with SGD. This indicates that a negative transfer from task 2 to task 1 exists when the transfer is symmetric, and it can be mitigated by asymmetric transfer. Comparing the results obtained by MGD and MGD-asy, we can see that the solution for task 1 improves more under MGD-asy than MGD. This is due to the better solution obtained for task 1 under asymmetric transfer. The update of the log of the total training loss is shown in Fig. 1b. As can be seen, all the iterations converge to steady states, while the steady states are different. SGD produces the lowest training loss since it directly minimizes the cost function. However, note that the cost functions are defined for each task separately, which ignore task similarities, the one which gives the lowest loss doesn’t guarantee the best performance. The different steady state values produced by MGD and MGD-asy showcase that the information transfer between the tasks implicitly changes the cost function, which may give better solution when the transfer is properly designed. The MSE reduction of MGD and MGD-asy over SGD on the test set is shown in Fig. 1c. The result shows that MGD has a higher MSE than SGD on task 1, while this negative transfer is suppressed by asymmetric transfer in MGD-asy. Nevertheless, looking at the total MSE, both MGD and MGD-asy have better performance than SGD.

4.2 Multi-label classification

Multi-label learning deals with the problem that one instance is associated with multiple labels. Given the multi-label training set \(\mathcal D=\{({\varvec{x}}_j, \varvec{y}_j)| 1\le j\le n\}\), where n is number of instances, \({\varvec{x}}_j\in \mathcal X\) is the feature vector for the j-th instance and \(\varvec{y}_j\in \{0,1\}^{T}\) is the set of labels associated with the j-th instance. The task of multi-label learning is to learn a function h from \(\mathcal D\) which can assign a set of proper labels to an instance. We decompose the multi-label learning problem into T binary classification tasks. For each of the classification tasks, we use the 2-norm regularized logistic loss as the cost function. Thus, for any task i, the following cost function is optimized by each task individually,

$$\begin{aligned} f_i(\varvec{w}_i) =&-\frac{1}{n}\sum _{j=1}^n(y^i_j\log h(z^i_j)+(1-y^i_j)\log (1-h(z^i_j)))\\&\quad +\frac{1}{2}\rho \Vert \varvec{w}_{i,-1}\Vert ^2, \end{aligned}$$

where \(h(z^i_j) = P(y^i_j=1|x_j) = \frac{1}{1+e^{-z^i_j}}\), \(z^i_j = [1\ {\varvec{x}}_j^T]\varvec{w}_i\), \(\varvec{w}_i\in \mathbb R^{p+1}\) is the model parameter, \(\varvec{w}_{i,-1}\in \mathbb R^p\) is the remaining elements in \(\varvec{w}_i\) except the first element, and \(\rho \) is the regularization parameter. Let

$$\begin{aligned} X = \begin{bmatrix} 1 &{} {\varvec{x}}_1^T \\ \vdots &{}\vdots \\ 1 &{} {\varvec{x}}_n^T \end{bmatrix}, \varvec{y}^i = \begin{bmatrix} y_1^i\\ \vdots \\ y_n^i \end{bmatrix}. \end{aligned}$$

The MGD iteration is

$$\begin{aligned} \varvec{w}_i^{t+1} =&\sum _{j}^T m_{ij}^t\varvec{w}_j^t-\frac{\alpha }{n} X^T(g(X\varvec{w}_i^t)-\varvec{y}^i)-\alpha \rho \begin{bmatrix} 0\\ \varvec{w}_{i,-1} \end{bmatrix}, \end{aligned}$$
(26)

where \(g(X\varvec{w}_i^t) = [\frac{1}{1+e^{-[1\ {\varvec{x}}_1^T]\varvec{w}_i^t}},\ldots ,\frac{1}{1+e^{-[1\ {\varvec{x}}_n^T]\varvec{w}_i^t}}]'\).

In multi-label learning problems, the similarity between task i and task j can be modeled by the correlation between labels \(\varvec{y}^i\) and \(\varvec{y}^j\). In this experiment, we use the cosine similarity to calculate the correlation matrix C, i.e., \(C_{ij} = \frac{<\varvec{y}^i,\varvec{y}^j>}{\Vert \varvec{y}^i\Vert \Vert \varvec{y}^j\Vert }\). Then, we normalize each row of the correlation matrix C to be row-sum-to-one and set \({\bar{m}}_{ij}=C_{ij}\). Finally, we rescale \({\bar{m}}_{ij}\) according to (5) to get the transfer coefficient \(m_{ij}\).

After learning the model parameter \(\varvec{w}_i^*\), we can predict the label \(y_t^i\) for a test instance \({\varvec{x}}_t\) by the corresponding prediction function. The ith label prediction for an instant \({\varvec{x}}_t\) is predicted 1 if \(h(z_t^i)\ge \eta \) and 0 otherwise, where \(\eta \) is the threshold. In the experiment, \(\eta \) is chosen from \(\{0.1,0.2,0.3\}\).

4.2.1 Experimental setup

We conduct the multi-label classification on six benchmark multi-label datasets, including regular-scale datasets: emotions, genbase, and cal500; and relatively large-scale datasets: enron, corel5k, and bibtex. The details of the datasets are summarized in Table 1, where |S|, \(\text {dim}(S)\), L(S), \(\text {Card}(S)\), and \(\text {Dom}(S)\) represent the number of examples, the number of features, the number of class labels, the average number of labels per example, and feature type of dataset S, respectively. The datasets are downloaded from the website of MulanFootnote 1 [42].

Table 1 Characteristics of the tested multi-label datasets

Five widely used evaluation metrics are employed to evaluate the performance, including Average precision, Macro-averaging F1, Micro-averaging F1, Coverage score, and Ranking loss. Concrete metric definitions can be found in [46]. Note that for the comparison purpose, the coverage score is normalized by the number of labels. For Average precision, Macro averaging F1, and Micro averaging F1, the larger the values the better the performance. For the other two metrics, the smaller the values the better the performance.

We compare our proposed method MGD with three classical algorithms including BR [5], RAkEL [43], ECC [36], and two state-of-the-art multi-label learning algorithms LIFT [45] and LLSF-DL [24].

In the experiments, we used the source codes provided by the authors for implementation. BR, ECC, and RAkEL are implemented under the Mulan multi-label learning package [42] using the logistic regression model as the base classifier. Parameters suggested in the corresponding literatures are used, i.e., RAkEL: ensemble size 2T with \(k=3\); ECC: ensemble size 30; LIFT: the ratio parameter r is tuned in {0.1,0.2,...,0.5}; LLSF-DL: \(\alpha \), \(\beta \), \(\gamma \) are searched in \(\{4^{-5},4^{-4},\ldots ,4^5\}\), and \(\rho \) is searched in \(\{0.1,1,10\}\). For the proposed approach MGD, \(\alpha \) is set to 0.02, \(\rho \) is chosen from \(\{0.1,0.2,\ldots ,1\}\), and \(\sigma \) is chosen from \(\{0,0.05,0.1,0.15,\ldots ,0.3\}\).

4.2.2 Experimental results

We run the algorithms 5 times on five sets of randomly partitioned training (80%) and testing (20%) data, the mean metric values with standard deviations are recorded in Tables 2 and 3. The best performance is shown in boldface, \(\uparrow \) indicates the larger the better, and \(\downarrow \) indicates the smaller the better. From the results, we can see that MGD outperforms other comparing algorithms in most cases. Specifically, MGD ranks first in 86.7% cases. Compared with the existing algorithms, MGD introduced a new approach to incorporate label correlations, which is easy to implement and has low complexity. The results demonstrate the effectiveness of the proposed approach in improving the learning performance.

Table 2 Prediction performance (mean ± std. deviation) on the regular-scale tested datasets
Table 3 Prediction performance (mean ± std. deviation) on the large-scale tested datasets

Compared to single gradient descent, the transfer in MGD also helps to accelerate the convergence. To see this, the iterations of the total loss, i.e., \(\sum _{i=1}^T f_i(\varvec{w}_i)\), are plotted in the first row of Fig. 2 for three datasets. It can be seen that the MGD converges faster than single gradient descent, especially at early iterations. The iterations of the average precision score are also plotted in the second row of Fig. 2. It can be seen that for limited iteration times, the score under MGD is much better than single gradient descent.

Fig. 2
figure 2

Convergence test

We investigate the sensitivity of MGD with respect to the two hyperparameters \(\rho \) and \(\sigma \), which control the norm 2 regularization strength in the logistic regression and the transfer strength. Due to space limit, we only report the results on the emotions dataset using the average precision score. Figure 3 shows how the average precision score varies with respect to \(\rho \) and \(\sigma \). Figure 3b, c are obtained by keeping the other parameter fixed at its best setting. It can be seen that both \(\rho \) and \(\sigma \) influence the performance. While, under a relatively wide range of parameters combinations, the score does not vary too much.

Fig. 3
figure 3

Sensitivity analysis on the emotions dataset

In the above experiment, we use the cosine similarity on the label set to calculate the correlation matrix C, which results in symmetric correlation. To see the effect of asymmetric transfer, we take the emotions dataset as an example and impose asymmetric correlation on the labels. There are six labels in the emotions datasets representing amazed-surprised (L1), happy-pleased (L2), relaxing-calm (L3), quiet-still (L4), sad-longly (L5), and angry-fearful (L6). Based on the ease of predictions, which ranked in the following descending order L4, L6, L5, L1, L3, L2, we added a vector [3/21, 1/21, 2/21, 6/21, 4/21, 5/21] on each row of the cosine similarity matrix, to make more information transferred from easier tasks to harder tasks, and less information transferred vice versa. We compare the evaluation metrics obtained by using the above asymmetric correlation matrix (MGD-asy) and cosine similarity (MGD) in Table 4.

Table 4 Prediction performance (mean± std. deviation) on the emotions dataset obtained by asymmetric MGD and MGD

As can be seen from Table 4, MGD with asymmetric transfer improves the performance in all the evaluation metrics. The classification accuracy for each label is further compared in Table 5. It can be seen that the classification accuracy for both easier and harder predicted labels are improved under asymmetric transfer.

Table 5 Classification accuracy for each label on the emotions dataset obtained by asymmetric MGD and MGD
Table 6 Performance on the MultiMNIST dataset
Fig. 4
figure 4

Architecture of the multiMNIST experiment

4.3 MultiMNIST

MultiMNIST is an MTL version of the MNIST dataset [38], where multiple images are overlaid together to convert digit classification into a multi-task problem. We use the construction from [41]. For each image, a different one is chosen uniformly in random. Then one of the images is put at the top-left and the other is at the bottom-right with partially overlapping. The resulting two tasks are classifying the digit on the top-left and classifying the digit on the bottom-right with the transformed images as the input. We use the dataset created by [41], which contains 60K examples.

The LeNet [29] is used as the base model for each task. During the optimization procedure, we transfer the parameters of the first two convolution layers, and leave the fully connected layers without transfer. The architecture is visualized in Fig. 4. For both tasks, the cross-entropy loss is used as the cost function. We adapt the SGD and modify it with the multi-task transfer as the optimizer. The transfer parameter between the two tasks is set based on the Euclidean distance (Euc) between the two label sets, which is \(1/(Euc+1)\), where the scaler 1 in the dominator is to ensure the transfer parameter less than 1. The rescale parameter \(\sigma \) is set as 1.

To show the effectiveness, each single task is solved solely by SGD to serve as the baseline. Furthermore, based on the code by the author, we run the MTL method proposed in Sener and Koltun 2018 [41], which trained the first two convolution layers and the first fully connected layers as a shared encoder and two independent fully-connected layers as task-specific function for the two tasks, as a comparison. For all the experiments, the learning rate is set as 0.001 and halved every 30 epoches, the momentum is set as 0.9. We use batch size of 256 and train for 100 epoches. The results averaged over 5 runs are listed in Table 6.

As can be seen from the results in Table 6, our method performs the best. Specifically, compared with single task baseline, MGD which superimpose transfer on single-task learning achieves a better performance, which showcases the effectiveness of utilizing information from related tasks. Compared with the model-based MTL method [38] which uses the first two convolutional layers and one fully connected layer as shared layers, our method also achieves better results on both tasks. The result validates the efficacy of our method which promotes relations between related tasks through transferring parameter values during the model learning process.

5 Conclusion and future work

In this paper, we propose the MGD algorithm for MTL. Different from the state-of-the-art, MGD treats multi-task learning as multiple learning tasks with independent cost functions, and transfers correlated model parameter values during the model learning process of the independent cost functions. By implicitly changing the cost function through the learning process, MGD achieves utilizing information from related tasks with proper transfer mechanisms. The convergence of the algorithm has been theoretically proven for any transfer mechanism satisfying easily achievable conditions, which provides flexibility in using different kinds of similarity measurements. Compared to existing MTL approaches, MGD is easy to implement, can achieve seamless asymmetric transformation such that negative transfer is mitigated, and can benefit from parallel computing when the number of tasks is large. The competitive experimental results validate the effectiveness of MGD. In our current work, we only require the transfer coefficients to satisfy easily achievable conditions and utilize simple similarity measurements such as cosine similarity to find task relations in the experiment. It is desirable to design more effective and systematic learning of transfer coefficients to improve the performance, including asymmetric transfer. Besides, it is also interesting to investigate element-wise feature-feature relations rather than only task-task relations in the future.