Multi-task gradient descent for multi-task learning

Bai, Lu; Ong, Yew-Soon; He, Tiantian; Gupta, Abhishek

doi:10.1007/s12293-020-00316-3

Multi-task gradient descent for multi-task learning

Regular Research Paper
Published: 19 October 2020

Volume 12, pages 355–369, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Memetic Computing Aims and scope Submit manuscript

Multi-task gradient descent for multi-task learning

Download PDF

Lu Bai ORCID: orcid.org/0000-0003-4882-6672¹,
Yew-Soon Ong¹,
Tiantian He¹ &
…
Abhishek Gupta²

1039 Accesses
14 Citations
Explore all metrics

Abstract

Multi-Task Learning (MTL) aims to simultaneously solve a group of related learning tasks by leveraging the salutary knowledge memes contained in the multiple tasks to improve the generalization performance. Many prevalent approaches focus on designing a sophisticated cost function, which integrates all the learning tasks and explores the task-task relationship in a predefined manner. Different from previous approaches, in this paper, we propose a novel Multi-task Gradient Descent (MGD) framework, which improves the generalization performance of multiple tasks through knowledge transfer. The uniqueness of MGD lies in assuming individual task-specific learning objectives at the start, but with the cost functions implicitly changing during the course of parameter optimization based on task-task relationships. Specifically, MGD optimizes the individual cost function of each task using a reformative gradient descent iteration, where relations to other tasks are facilitated through effectively transferring parameter values (serving as the computational representations of memes) from other tasks. Theoretical analysis shows that the proposed framework is convergent under any appropriate transfer mechanism. Compared with existing MTL approaches, MGD provides a novel easy-to-implement framework for MTL, which can mitigate negative transfer in the learning procedure by asymmetric transfer. The proposed MGD has been compared with both classical and state-of-the-art approaches on multiple MTL datasets. The competitive experimental results validate the effectiveness of the proposed algorithm.

Knowledge Distillation for Multi-task Learning

Sample-level weighting for multi-task learning with auxiliary tasks

Article 01 February 2024

Learning with Partially Shared Features for Multi-Task Learning

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Inspired by human learning activities where people often apply the knowledge learned from other tasks to help learn a related task, knowledge (memes) transfer has been investigated to enhance the optimization performance in many related optimization tasks, which can be real-world problems that have similarities in nature or different methods solving one complicated problem [1, 14, 17, 19, 49]. Based on this idea, evolutionary multitasking algorithms have been developed and verified on a range of optimization tasks [3, 8, 18, 48]. Similarly, in the community of machine learning, Multi-task Learning (MTL) solves multiple related tasks simultaneously to leverage knowledge contained in one task to help learn other tasks. MTL has shown to outperform single-task learning in many cases, from computer vision [28, 54], drug discovery [21, 22], to natural language processing [9], which validate the idea of utilizing knowledge from related tasks.

MTL handles multiple related tasks by jointly training them to improve generalization ability, either by shallow or deep models [37, 51]. There are different mechanisms to utilize information from similar tasks in MTL. In a majority of shallow model based approaches, the similarities are promoted through regularization in the global cost function, which is composed of all the tasks, such as feature based approaches [2, 6, 20, 31, 33] and task-relation based approaches [11, 16, 53]. In the deep model based approaches, one way is to let different tasks share the first several hidden layers, where common feature representations for multiple tasks are learned, and then have task-specific parameters in the subsequent layers [23, 32, 37, 54]. Inspired by regularization techniques for shallow MTL, the other way of parameter sharing in deep models is to regularize the distance between the parameters of different models to encourage the parameters to be similar [10, 44]. As a subproblem of MTL, multi-label learning deals with the problem that one instance is associated with multiple labels. Due to the special form of multi-label learning problems, a lot of methods have been proposed to improve the performance of multi-label learning by exploring the label correlations [46]. Classical methods such as classifier chains [36], calibrated label ranking [15], and random k-labelsets [43] transform the problem into a combination of classification problems. [24, 25] considered taking the symmetric label correlations as prior knowledge and incorporating it into the model training. Local label correlations are also exploited in [26, 55] by partitioning the dataset into groups and then learn local label correlations within the groups.

In most existing MTL models, the information transfer is symmetric between any two participating tasks based on the coupled cost functions, or a common feature representation is learned from all the tasks. The symmetric information sharing will deteriorate the performance of some of the participating learners since not always all tasks benefit from the joint learning [30], and the indistinguishable common feature sharing may result in performance degeneration if there are noisy and outlier tasks. Although some methods proposed for multi-label learning, such as classifier chains, can achieve asymmetric information transfer, they rely largely on the ordering or relation combination of the labels, which usually have high complexity with a large number of tasks.

Inspired by the merits of first-order gradient descent and taking into account the importance of relations among the tasks, a novel Multi-task Gradient Descent (MGD) algorithm is proposed in this paper to solve the MTL problem. In MGD, instead of modeling all the tasks into one coupled cost function, each task minimizes its individual cost function using the gradient descent algorithm. The similarities among the tasks are then facilitated through transferring model parameter values during the model learning process of each task. By implicitly changing the cost function during the learning process, MGD achieves MTL with proper transfer mechanisms. The convergence of MGD when the transfer mechanism and the step size of gradient descent satisfy certain easily achievable conditions is theoretically proven, which allows a variety of similarity measurements to be leveraged to promote information sharing. Compared with the existing approaches, the advantages of MGD are threefold:

1.
MGD provides a novel easy-to-implement framework for MTL and more flexible way to conduct information transfer between related tasks.
2.
It can achieve asymmetric transfer easily such that negative transfer is mitigated.
3.
It can benefit from parallel computing with a small amount of information processed centrally when the number of tasks is large.

The rest of the paper is organized as follows. Section 2 briefly introduces related works, including MTL, multi-label learning, and transfer learning. Then, the proposed MGD is presented in Sect. 3, where the convergence of the proposed algorithm is theoretically proven, and the relation of MGD and the regularization based MTL is analyzed. Experiments on a synthetic problem, a set of real-world multi-label learning datasets, and MultiMNIST dataset using LeNet are conducted in Sect. 4 to evaluate the performance of the algorithm. Finally, Sect. 5 gives the conclusion and future work.

2 Related work

In the existing MTL works, information sharing is achieved by designing a common model concerning the related tasks, either is shallow or deep. In most of the deep model based MTL, the first several hidden layers are trained on all the tasks to learn common feature representations for multiple tasks, and the subsequent layers are left as task specific parameters which are only trained on a specific task [32, 47, 54]. The deep MTL models can learn powerful feature representations. However, the sharing hidden layers approach usually lacks of interpretability, and it is vulnerable to noisy and outlier tasks. The shallow model usually comes up with a global cost function, the relations among the tasks are promoted through a regularization term in the cost function that composed of all the tasks’ parameters, such as feature based approaches [2, 6, 20, 31, 33] and task relation based approaches [11, 16, 53]. Taking the classical feature selection method in [33] as an example, the cost function under the regularization framework is

$$\begin{aligned} \min \sum _{i=1}^T f_i(\varvec{w}_i)+\sigma \Vert W\Vert _{2,1}, \end{aligned}$$

where T is the number of tasks, $\varvec{w}_i$ is the model parameters for the ith task, $f_i$ denotes a function of $\varvec{w}_i$ including the loss function and other regularization functions of $\varvec{w}_i$, W is a matrix with the ith column being $\varvec{w}_i$, and $\sigma $ is a positive regularization parameter. The $\ell _{2,1}$-norm regularization on W equals the sum of $\ell _2$ norm of rows in W, which enforces W to be row sparse and results in selecting important features shared across multiple tasks. In this kind of regularization based approaches, same shared features are used for all the participating tasks.

Treating a single-label learning problem as one task, the multi-label learning problem can be seen as a special case of MTL problem, where the feature vectors ${\varvec{x}}_j$ for $j=1,\ldots ,n$ are the same for different tasks. The relations between the tasks can be learned from the relations between the labels. For first-order methods, the label correlations are ignored and the multi-label learning problem is handled in a label by label manner, such as BR [5] and LIFT [45]. Second-order methods consider pairwise relations between labels, such as LLSF [24] and JFSC [25]. High-order methods, where high-order relations among label subsets or all the labels are considered, such as RAkEL [43], ECC [36], LLSF-DL [24], and CAMEL [13]. Generally, the higher the order of correlations being considered, the stronger is the correlation-modeling capabilities, while on the other hand, the more computationally demanding and less scalable the approach becomes.

In contrast to the existing MTL approaches which incorporate correlation information into the modeling in the form of regularization or shared hidden layers, MGD serves as the first attempt to incorporate the correlations by transferring model parameter values during the model learning process of each task. Unlike the multi-label learning methods which rely on the ordering or relation combination of the labels to achieve asymmetric information transfer, MGD can achieve asymmetric transfer easily even with a large number of tasks. When considering only one task receive information from other tasks in an asymmetric transfer manner, it seems similar to transfer learning [7, 34], however, they have significant difference. The objective in transfer learning is to improve the performance of a target task with the help of source tasks which are already learned, while in MGD with only one task receives information, all the tasks are simultaneously solved. The asymmetric information transfer is utilized to mitigate negative transfer.

3 The MGD approach

In this section, we elaborate the proposed MGD algorithm for MTL. The mathematical notations used in the manuscript are first introduced. We then generically formulate the MTL problem and introduce the proposed MGD. At last, we perform the theoretical analysis of MGD, including convergence proof, analysis on relation with regularization based MTL, and computational complexity.

Throughout this paper, normal font small letters denote scalars, boldface small letters denote column vectors, and capital letters denote matrices. $\varvec{0}$ denotes zero column vector with proper dimension, $I_n$ denotes identity matrix of size $n\times n$. $A'$ denotes the transpose of matrix A and $\otimes $ denotes the Kronecker product. $[\varvec{z}_i]_{\text {vec}}$ denotes a concatenated column vector formed by stacking $\varvec{z}_i$ on top of each other, and $\text {diag}\{z_i\}$ denotes a diagonal matrix with the ith diagonal element being $z_i$. The norm $\Vert \cdot \Vert $ without specifying the subscript represents the Euclidean norm by default.

3.1 Problem formulation

Suppose we have T tasks to be solved simultaneously. For simplicity, we assume that all the tasks share the common data space and the tasks are positively correlated. Each task $i\in \{1,\ldots ,T\}$ aims to solve the following minimization problem,

$$\begin{aligned} \min _{\varvec{w}_i} \ f_i(\varvec{w}_i), \end{aligned}$$

(1)

where $\varvec{w}_i\in {\mathbb {R}}^{d}$ is the model parameter and $f_i: \mathbb R^{d}\rightarrow \mathbb R$ is the cost function of the ith task. In this paper, we do not restrict the specific form of the cost functions. In particular, the cost functions $f_i(\varvec{w}_i)$ is assumed to be strongly convex, twice differentiable, and the gradient of $f_i$ is Lipschitz continuous with constant $L_{f_i}$, i.e.,

$$\begin{aligned} \Vert \nabla f_i(\varvec{u})-\nabla f_i(\varvec{v})\Vert \le L_{f_i} \Vert \varvec{u}-\varvec{v}\Vert ,\quad \forall \varvec{u}, \varvec{v}\in \mathbb R^{d}. \end{aligned}$$

Cost functions in machine learning problems such as mean squared error with $\ell _2$ norm regularization and cross-entropy with $\ell _2$ norm regularization apply. Non-differentiable cost functions where $\ell _1$ norm regularization is used can also be approximately considered [40]. Since $f_i(\varvec{w}_i)$ is strongly convex and twice differentiable, there exists positive constant $\xi _i$ such that $\nabla ^2 f_i(\varvec{u})\ge \xi _i I_d$. As a result, we have

$$\begin{aligned} \xi _iI_d\le \nabla ^2f_i(\varvec{u})\le L_{f_i}I_d, \ \forall \varvec{u}\in \mathbb R^d. \end{aligned}$$

3.2 The proposed framework

Using the gradient descent, problem (1) can be solved using the following iteration,

$$\begin{aligned} \varvec{w}_i^{t+1} = \varvec{w}_i^{t}-\alpha \nabla f_i(\varvec{w}_i^t), \end{aligned}$$

(2)

where t is the iteration index, $\alpha $ is the step size, and $\nabla f_i(\varvec{w}_i^t)\in \mathbb R^d$ is the gradient of $f_i$ at $\varvec{w}_i^t$. As there are relations among tasks, we are able to improve the learning performance by considering the correlation of parameters belonging to different tasks. Based on this idea, we propose a reformative gradient descent iteration, which allows the values of the model parameters to be transferred across tasks during each iteration. The MGD is designed as follows,

$$\begin{aligned} \varvec{w}_i^{t+1} = \sum _{j=1}^T m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t),\ i=1,\ldots ,T, \end{aligned}$$

(3)

where $m_{ij}^t$ is the transfer coefficient describes the information flow from task j to task i, which satisfies the following conditions,

$$\begin{aligned} m_{ij}^t\ge 0, \end{aligned}$$

(4a)

$$\begin{aligned} \sum _{j=1}^T m_{ij}^t = 1. \end{aligned}$$

(4b)

From (3), we can see that under MGD, the learning of different tasks can be decoupled with only a small amount of information need to be transferred among the tasks, thus can benefit from parallel computing.

From (4b), we have $m_{ii}^t = 1-\sum _{j\ne i}m_{ij}^t$. Rewriting iteration (3) as follows

$$\begin{aligned} \varvec{w}_i^{t+1}&= m_{ii}^t\varvec{w}_i^t+\sum _{j\ne i}m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\\&=\left( 1-\sum _{j\ne i}m_{ij}^t\right) \varvec{w}_i^t+\sum _{j\ne i}m_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$

Rescale $m_{ij}^t$ as

$$\begin{aligned} {\bar{m}}_{ij}^t= {\left\{ \begin{array}{ll} \frac{1}{\alpha \sigma }m_{ij}^t, &{} j\ne i,\\ 1-\frac{1}{\alpha \sigma }\sum _{j\ne i}m_{ij}^t, &{} j=i, \end{array}\right. } \end{aligned}$$

(5)

where $\sigma $ is a positive constant and satisfies the condition

$$\begin{aligned} 1-\frac{1}{\alpha \sigma }\sum _{j\ne i}m_{ij}^t>0. \end{aligned}$$

(6)

Given (5), $m_{ij}^t$ is parameterized by $\sigma $. With the rescaling, the iteration in (3) can be alternatively expressed as

$$\begin{aligned} \varvec{w}_i^{t+1}\!&= \! \left( 1-\!\alpha \sigma \!\sum _{j\ne i}{\bar{m}}_{ij}^t\right) \varvec{w}_i^t+\alpha \sigma \sum _{j\ne i}{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&= \varvec{w}_i^t-\alpha \sigma (1-{\bar{m}}_{ii}^t)\varvec{w}_i^t+\alpha \sigma \sum _{j\ne i}{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&= (1-\alpha \sigma )\varvec{w}_i^t+\alpha \sigma \sum _{j=1}^T {\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$

(7)

3.3 Convergence analysis

In this section, we give the convergence property of the proposed MGD iteration based on the expression in (7).

Denote $\varvec{w}_i^*$ as the underlying target model parameter for task i, $\tilde{\varvec{w}_i}^t = \varvec{w}_i^t-\varvec{w}_i^*$, and ${\bar{L}}_{f_i} = \max _i\{L_{f_i}\}$. The following theorem gives the convergence property of the iteration (7) under certain conditions on the transfer parameter ${\bar{m}}_{ij}^t$ and step-size $\alpha $.

Theorem 1

Under the iteration in (7) with the transfer coefficient ${\bar{m}}_{ij}^t$ satisfying

$$\begin{aligned}&\sum _{j=1}^T{\bar{m}}_{ij}^t=1,\ \forall i,\\&{\bar{m}}_{ij}^t\ge 0,\ \forall i,j, \end{aligned}$$

$\varvec{w}_i^t$ is convergent if the step size $\alpha $ is chosen to satisfy

$$\begin{aligned} 0<\alpha <\frac{2}{2\sigma + {\bar{L}}_{f_i}}. \end{aligned}$$

(8)

Specifically,

$$\begin{aligned} \lim _{t\rightarrow \infty } \max _i\Vert \tilde{\varvec{w}_i}^{t}\Vert \le \frac{\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \max _i\Vert \nabla f_i(\varvec{w}_i^*)\Vert }{1-({\bar{\gamma }}+\alpha \sigma )}, \end{aligned}$$

(9)

where ${\bar{\gamma }} = \max _i\{|1-\alpha \sigma -\alpha \xi _i|,|1-\alpha \sigma -\alpha L_{f_i}|\}$.

Proof

Let the i, j-th element of $ {\bar{M}}^t\in \mathbb R^{T\times T}$ at iteration time t being ${\bar{m}}_{ij}^t$, denote $\bar{\mathcal M}^t = {\bar{M}}^t\otimes I_d\in \mathbb R^{dT\times dT}$, $\varvec{\mathrm {w}}= [\varvec{w}_1^{'}, \ldots , \varvec{w}_T^{'}]'\in \mathbb R^{dT}$, and $\nabla f(\varvec{\mathrm {w}}^t) = [\nabla f_1(\varvec{w}_1^t)',\ldots ,\nabla f_T(\varvec{w}_T^t)']'\in \mathbb R^{dT}$. Note that we are using the typeface $\varvec{\mathrm {w}}$ to distinguish this from the single vector-valued variable $\varvec{w}_i$. Write (7) into a concatenated form gives

$$\begin{aligned} \varvec{\mathrm {w}}^{t+1} = (1-\alpha \sigma )\varvec{\mathrm {w}}^t+\alpha \sigma \bar{\mathcal M}^t\varvec{\mathrm {w}}^t-\alpha \nabla f(\varvec{\mathrm {w}}^t). \end{aligned}$$

(10)

Denote $\varvec{\mathrm {w}}^* = [\varvec{w}_1^{*'},\ldots ,\varvec{w}_T^{*'}]'$ and ${\tilde{\varvec{\mathrm {w}}}}^t=\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*$. Subtracting $\varvec{\mathrm {w}}^*$ from both sides of (10) gives

$$\begin{aligned} {\tilde{\varvec{\mathrm {w}}}}^{t+1}&= ((1-\alpha \sigma )I_{dT}+\alpha \sigma \bar{\mathcal M}^t)\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*-\alpha \nabla f(\varvec{\mathrm {w}}^t)\nonumber \\&= ((1-\alpha \sigma ) I_{dT}{+}\alpha \sigma \bar{\mathcal M}^t){\tilde{\varvec{\mathrm {w}}}}^t-\alpha (\nabla f(\varvec{\mathrm {w}}^t){-}\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&= ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t){\tilde{\varvec{\mathrm {w}}}}^t\nonumber \\&\quad -\,\alpha \int _0^1 \nabla ^2 f(\varvec{\mathrm {w}}^*+\mu (\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*))d\mu {\tilde{\varvec{\mathrm {w}}}}\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*))\nonumber \\&= ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t){\tilde{\varvec{\mathrm {w}}}}^t\nonumber \\&\quad +\,\alpha (\sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)), \end{aligned}$$

(11)

where $H^t = \int _0^1\nabla ^2 f(\varvec{\mathrm {w}}^*+\mu (\varvec{\mathrm {w}}^t-\varvec{\mathrm {w}}^*))d\mu \in \mathbb R^{dT\times dT}$. It can be verified that $H^t$ is a block diagonal matrix and the block diagonal elements $H_i^t = \int _0^1\nabla ^2 f_i(\varvec{w}_i^*+\mu (\varvec{w}_i^t-\varvec{w}_i^*))d\mu \in \mathbb R^{d\times d}$ for $i=1,\ldots ,T$ are Hermitian.

We use the block maximum norm defined in [39] to show the convergence of the above iteration. The block maximum norm of a vector ${\varvec{x}} = [{\varvec{x}}_i]_{\text {vec}}\in \mathbb R^{dT}$ with ${\varvec{x}}_i\in \mathbb R^d$ is defined as [39]

$$\begin{aligned} \Vert {\varvec{x}}\Vert _{b,\infty } = \max _{i}\Vert {\varvec{x}}_i\Vert . \end{aligned}$$

The induced matrix block maximum norm is therefore defined as [39]

$$\begin{aligned} \Vert A\Vert _{b,\infty } = \max _{{\varvec{x}}\ne 0}\frac{\Vert A{\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}. \end{aligned}$$

From the iteration in (11) we have

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty } \le \Vert ((1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t){\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }\\&\quad \le \Vert (1-\alpha \sigma ) I_{dT}+\alpha \sigma \bar{\mathcal M}^t-\alpha H^t\Vert _{b,\infty }\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }\\&\quad \le (\Vert (1-\alpha \sigma ) I_{dT}-\alpha H^t\Vert _{b,\infty }+\alpha \sigma \Vert \bar{\mathcal M}^t\Vert _{b,\infty })\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\\&\quad \quad +\,\alpha \Vert \sigma (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$

From Lemma D.3 in [39], we have

$$\begin{aligned} \Vert \bar{\mathcal M}^t\Vert _{b,\infty } = \Vert {\bar{M}}^t\Vert _{\infty } = 1, \end{aligned}$$

where the last equality comes from the fact that ${\bar{m}}_{ij}^t\ge 0$ and the row summation of ${\bar{M}}^t$ is one. Since $\xi _iI_d\le \nabla ^2f_i(\varvec{w}_i)\le L_{f_i}I_d$, $\xi _iI_d\le \int _0^1\nabla ^2 f_i(\varvec{w}_i^*+\mu (\varvec{w}_i-\varvec{w}_i^*)d\mu \le L_{f_i}I_d$. Thus, $\Vert (1-\alpha \sigma ) I_d-\alpha H_i^t\Vert \le \gamma _i$ where $\gamma _i = \max \{|1-\alpha \sigma -\alpha \xi _i|,|1-\alpha \sigma -\alpha L_{f_i}|\}$. By the definition of induced matrix block maximum norm, we have

$$\begin{aligned}&\Vert (1-\alpha \sigma )I_{dT}-\alpha H^t\Vert _{b,\infty } \\&\quad = \max _{{\varvec{x}}\ne 0}\frac{\Vert ((1-\alpha \sigma )I_{dT}-\alpha H^t){\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}\\&\quad \le \max _{{\varvec{x}}\ne 0}\frac{\max _i\Vert ((1-\alpha \sigma )I_{d}-\alpha H_i^t)\Vert \Vert {\varvec{x}}\Vert _{b,\infty }}{\Vert {\varvec{x}}\Vert _{b,\infty }}\\&\quad = \max _i \Vert (1-\alpha \sigma )I_{d}-\alpha H_i^t\Vert \\&\quad \le {\bar{\gamma }}, \end{aligned}$$

where ${\bar{\gamma }} = \max \{\gamma _i\}$. Thus,

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }\le ({\bar{\gamma }}+\alpha \sigma )\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\alpha \sigma \Vert (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*\Vert _{b,\infty }\nonumber \\&\quad +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$

(12)

By choosing the step size $\alpha $ to satisfy ${\bar{\gamma }}+\alpha \sigma <1$, the iteration asymptotically converges. To ensure ${\bar{\gamma }}+\alpha \sigma <1$, it is sufficient to ensure

$$\begin{aligned} |1-\alpha \sigma -\alpha \xi _i|+\alpha \sigma<1 \ \text {and} \\ |1-\alpha \sigma -\alpha L_{f_i}|+\alpha \sigma <1,\ \forall i, \end{aligned}$$

which leads to

$$\begin{aligned} 0<\alpha <\frac{2}{2\sigma + {\bar{L}}_{f_i}}. \end{aligned}$$

Since ${\bar{M}}^t$ is row-sum-to-one and the elements of ${\bar{M}}^t$ are all non-negative, the elements in $\bar{\mathcal M}^t\varvec{\mathrm {w}}^*$ are convex combinations of $\varvec{w}_i^*$. Thus, $\Vert (\bar{\mathcal M}^t-I_{dT})\varvec{\mathrm {w}}^*\Vert _{b,\infty }$ is upper bounded by $\max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert $. From the iteration in (12), we have

$$\begin{aligned}&\Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }\le ({\bar{\gamma }}+\alpha \sigma )^{t+1}\Vert {\tilde{\varvec{\mathrm {w}}}}^{0}\Vert _{b,\infty }\\&+(\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty })\sum _{k=0}^t({\bar{\gamma }}+\alpha \sigma )^k. \end{aligned}$$

Under the condition that ${\bar{\gamma }}+\alpha \sigma <1$,

$$\begin{aligned} \lim _{t\rightarrow \infty } \Vert {\tilde{\varvec{\mathrm {w}}}}^{t}\Vert _{b,\infty } \le \frac{\alpha \sigma \max _{i,j}\Vert \varvec{w}_i^*-\varvec{w}_j^*\Vert +\alpha \Vert \nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }}{1-({\bar{\gamma }}+\alpha \sigma )}. \end{aligned}$$

From the definition of block maximum norm, (9) is obtained. $\square $

In iteration (3), the transfer coefficient $m_{ij}^t$ between task i and task j is a scalar. In the following, we consider the element-wise feature similarities between task i and task j. The transfer coefficient between task i and task j is assumed to be a diagonal matrix $P_{ij}\in \mathbb R^{d\times d}$ with its k-th diagonal element $P_{ij,k}$ being the transfer coefficient from the k-th element of $\varvec{w}_j$ to the k-th element of $\varvec{w}_i$. The MGD iteration in (3) then becomes

$$\begin{aligned} \varvec{w}_i^{t+1} = \sum _{j=1}^T P_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t), \end{aligned}$$

(13)

where

$$\begin{aligned}&\sum _{j=1}^TP_{ij}^t = I_d,\nonumber \\&P_{ij,k}^t\ge 0,\ \forall i,j=1,\ldots ,T, k=1,\ldots ,d. \end{aligned}$$

(14)

Following the same rescaling,

$$\begin{aligned} {\bar{P}}_{ij}^t= {\left\{ \begin{array}{ll} \frac{1}{\alpha \sigma }P_{ij}^t, &{} j\ne i,\\ I_d-\frac{1}{\alpha \sigma }\sum _{j\ne i}P_{ij}^t, &{} j=i, \end{array}\right. } \end{aligned}$$

(15)

(13) becomes

$$\begin{aligned} \varvec{w}_i^{t+1} = (1-\alpha \sigma )\varvec{w}_i^t + \alpha \sigma \sum _{j=1}^T{\bar{P}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t). \end{aligned}$$

(16)

Corollary 1

Under (16) with the transfer coefficient ${\bar{P}}_{ij}^t$ satisfies

$$\begin{aligned}&\sum _{j=1}^T{\bar{P}}_{ij}^t = I_d,\\&{\bar{P}}_{ij,k}^t\ge 0,\ \forall i,j=1,\ldots ,T, k=1,\ldots ,d, \end{aligned}$$

$\varvec{w}_i^t$ is convergent if the following conditions are satisfied:

$$\begin{aligned} \sigma<\frac{{\bar{L}}_{f_i}}{T-1},\ for \ T>1,\\ 0<\alpha <\frac{2}{(T+1)\sigma +{\bar{L}}_{f_i}}. \end{aligned}$$

Proof

Let the i, j-th block element of $\bar{\mathcal P}^t\in \mathbb R^{dT\times dT}$ being ${\bar{P}}_{ij}^t\in \mathbb R^{d\times d}$. Following the similar procedure of the proof of Theorem 1, we obtain

$$\begin{aligned} \Vert {\tilde{\varvec{\mathrm {w}}}}^{t+1}\Vert _{b,\infty }&\le (\Vert (1-\alpha \sigma ) I_{dT}-\alpha H^t\Vert _{b,\infty }\nonumber \\&\quad +\alpha \sigma \Vert \bar{\mathcal P}^t\Vert _{b,\infty })\Vert {\tilde{\varvec{\mathrm {w}}}}^t\Vert _{b,\infty }\nonumber \\&\quad +\alpha \Vert \sigma (\bar{\mathcal P}^t-I_{dT})\varvec{\mathrm {w}}^*-\nabla f(\varvec{\mathrm {w}}^*)\Vert _{b,\infty }. \end{aligned}$$

(17)

Let ${\varvec{x}}=[{\varvec{x}}_i]_{\text {vec}}\in \mathbb R^{dT}$ being a block column vector with ${\varvec{x}}_i\in \mathbb R^d$.

$$\begin{aligned} \Vert \bar{\mathcal P}^t {\varvec{x}}\Vert _{b,\infty }&=\max _{i} \Vert \sum _{j=1}^T{\bar{P}}_{ij}^t {\varvec{x}}_j\Vert \\&\le \max _i\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \Vert {\varvec{x}}_j\Vert \\&\le \left( \max _i\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \right) \max _j\Vert {\varvec{x}}_j\Vert . \end{aligned}$$

Recall that ${\bar{P}}_{ij}^t$ is a diagonal matrix and the elements therein are all no greater than 1, thus, $\sum _{j=1}^T\Vert {\bar{P}}_{ij}^t\Vert \le T$. As a result

$$\begin{aligned} \Vert \bar{\mathcal P}^t {\varvec{x}}\Vert _{b,\infty }\le T\max _j\Vert {\varvec{x}}_j\Vert . \end{aligned}$$

By the definition of matrix block maximum norm, we have

$$\begin{aligned} \Vert \bar{\mathcal P}^t\Vert _{b,\infty }\le T. \end{aligned}$$

The condition to ensure convergence of the iteration in (17) becomes

$$\begin{aligned} {\bar{\gamma }}+\alpha \sigma T<1, \end{aligned}$$

which gives

$$\begin{aligned} \sigma<\frac{{\bar{L}}_{f_i}}{T-1},\ \text {for}\ T\ne 1,\\ 0<\alpha <\frac{2}{(T+1)\sigma +{\bar{L}}_{f_i}}. \end{aligned}$$

$\square $

3.4 Relation with regularization based multi-task learning

From the iteration in (7), we have

$$\begin{aligned} \varvec{w}_i^{t+1}&= (1-\alpha \sigma )\varvec{w}_i^t \nonumber \\&\quad +\, \alpha \sigma \sum _{j=1}^T{\bar{m}}_{ij}^t\varvec{w}_j^t-\alpha \nabla f_i(\varvec{w}_i^t)\nonumber \\&=\varvec{w}_i^t -\alpha \left( \sigma \sum _{j=1}^T{\bar{m}}_{ij}^t(\varvec{w}_i^t-\varvec{w}_j^t)+\nabla f_i(\varvec{w}_i^t)\right) . \end{aligned}$$

(18)

If fix ${\bar{m}}_{ij}^t = {\bar{m}}_{ij}$ for all t, then, the last term in the brackets can be seen as the gradient of the following function

$$\begin{aligned} {\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i}) = f_i(\varvec{w}_i)+\frac{1}{2}\sigma \sum _{j=1}^T {\bar{m}}_{ij}\Vert \varvec{w}_i-\varvec{w}_j\Vert ^2, \end{aligned}$$

where $\varvec{w}_{-i}$ denotes the collection of other tasks’ variables, i.e., $\varvec{w}_{-i}=[\varvec{w}_1',\ldots , \varvec{w}_{i-1}',\varvec{w}_{i+1}',\ldots ,\varvec{w}_T']'$. Thus, the iteration in (18) with fixed ${\bar{m}}_{ij}$ can be seen as the gradient descent algorithm which solves the following Nash equilibrium problem

$$\begin{aligned} \min _{\varvec{w}_i}\ {\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i}),\ i=1,\ldots ,T. \end{aligned}$$

(19)

In (19), each task’s cost function is influenced by other tasks’ decision variables. Since the cost function ${\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})$ is continuous in all its arguments, strongly convex with respect to $\varvec{w}_i$ for fixed $\varvec{w}_{-i}$, and satisfies ${\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})\rightarrow \infty $ as $\Vert \varvec{w}_i\Vert \rightarrow \infty $ for fixed $\varvec{w}_{-i}$, a Nash equilibrium exists [4]. Furthermore, as a result of strongly convexity, the gradient of ${\bar{f}}_i(\varvec{w}_i, \varvec{w}_{-i})$ with respect to $\varvec{w}_i$ for fixed $\varvec{w}_{-i}$ is strongly monotone. Thus, the Nash equilibrium for (19) is unique [12]. Denote the Nash equilibrium of (19) as $\varvec{w}_i^o$, $i=\{1,\ldots ,T\}$. It is known that the Nash equilibrium satisfies the following condition [4]:

$$\begin{aligned} \varvec{w}_i^o = \text {argmin}_{\varvec{w}_i} {\bar{f}}_i(\varvec{w}_i, \varvec{w}^o_{-i}),\ i=1,\ldots ,T, \end{aligned}$$

which implies

$$\begin{aligned} \nabla f_i(\varvec{w}_i^o)+\sigma \sum _{j=1}^T{\bar{m}}_{ij}(\varvec{w}_i^o -\varvec{w}_j^o) = 0, \ i=1,\ldots ,T. \end{aligned}$$

(20)

Write the conditions in (20) in a concatenated form gives

$$\begin{aligned} \nabla f(\varvec{\mathrm {w}}^o)+\sigma (I_{T}-{\bar{M}})\otimes I_d\varvec{\mathrm {w}}^o = 0. \end{aligned}$$

(21)

It has been pointed out in [51] that the regularized MTL algorithms which learn with task relations can be expressed as

$$\begin{aligned} \min _{\varvec{w}_i,\varSigma } \sum _{i=1}^T L_i(\varvec{w}_i)+\frac{1}{2}\lambda \varvec{\mathrm {w}}^T(\varSigma ^{-1}\otimes I_d)\varvec{\mathrm {w}}+g(\varSigma ), \end{aligned}$$

(22)

where $L_i$ is the training loss of task i, $\lambda $ is a positive regularization parameter, $\varSigma \in \mathbb R^{T\times T}$ models the task relations, and $g(\varSigma )$ denotes constraints on $\varSigma $. For comparison, we eliminate the constraints on $\varSigma $, consider the case that $\varSigma $ is fixed, and let $f(\varvec{\mathrm {w}}) = \sum _{i=1}^TL_{i}(\varvec{w}_i)$. Denote the optimal solutions of problem (22) as $\varvec{\mathrm {w}}^{g}$. The optimal solution satisfies the following condition,

$$\begin{aligned} \nabla f(\varvec{\mathrm {w}}^{g})+\frac{1}{2}\lambda (\varSigma ^{-1}+(\varSigma ^{-1})^T)\otimes I_d\varvec{\mathrm {w}}^{g}=0. \end{aligned}$$

(23)

Comparing the optimality conditions (21) and (23) for the Nash equilibrium problem (19) and the MTL problem (22), we find that if ${\bar{M}}$ can be set as

$$\begin{aligned} \sigma (I_{T}-{\bar{M}}) = \frac{1}{2}\lambda (\varSigma ^{-1}+(\varSigma ^{-1})^T), \end{aligned}$$

(24)

the optimal solution $\varvec{\mathrm {w}}^o$ will be the same as $\varvec{\mathrm {w}}^g$. The only limitation is that ${\bar{m}}_{ij}>0$, which can not cover the situation where there exists non-negative non-diagonal values in $\varSigma ^{-1}$. Overall, the regularized multi-tasking learning problem with task relation learning can be solved by the MGD algorithm by setting the coefficients ${\bar{m}}_{ij}$ between task i and task j properly. In addition, using MGD, we can consider feature-feature relations between different tasks since we can use ${\bar{P}}_{ij}\in \mathbb R^{d\times d}$ as the transfer coefficient. Furthermore, in MGD, ${\bar{m}}_{ij}$ is not required to be equal to ${\bar{m}}_{ji}$. This relaxation allows asymmetric task relations in MTL [30], which is hard to achieve by most MTL methods since $\varSigma ^{-1}+(\varSigma ^{-1})^T$ is always symmetric in (23).

Another category of regularized MTL method is learning with feature relations [51]. The cost function of this kind of method is

$$\begin{aligned} \min _{\varvec{w}_i,\varTheta } \sum _{i=1}^T L_i(\varvec{w}_i)+\frac{1}{2}\lambda \varvec{\mathrm {w}}^T(I_T\otimes \varTheta ^{-1})\varvec{\mathrm {w}}+g(\varTheta ), \end{aligned}$$

(25)

where $\varTheta \in \mathbb R^{d\times d}$ models the covariance between the features. The term $\varvec{\mathrm {w}}^T(I_T\otimes \varTheta ^{-1})\varvec{\mathrm {w}}$ can be decoupled as $\sum _{i=1}^T \varvec{w}_i^T\varTheta ^{-1}\varvec{w}_i$, which can be incorporated into $f_i(\varvec{w}_i)$ for task i.

3.5 Pseudocode of the MGD algorithm

MGD provides a novel framework of utilizing task similarities to improve the learning performance. The pseudocode of the proposed MGD is summarized in Algorithm 1.

As shown in Theorem 1, the transfer coefficients are only required to satisfy mild conditions to ensure the convergence of the proposed algorithm, which allows a variety of existing task relation learning methods to be used to design the transfer coefficients. A straightforward way to set the transfer coefficient is based on task similarities. The more similar two tasks are, the larger the corresponding transfer coefficient is expected to be. For a multi-label learning problem, the similarity between task i and task j can be modeled by the correlation between the label sets, which can be calculated by many different similarity measurements, such as cosine similarity and Euclidean distance between the labels. By assuming the task relations be known as a prior like in [11, 27], the transfer coefficient can be designed utilizing the relation between MGD and the regularization based MTL. In addition to set the transfer coefficients as a predefined value according to prior assumption or statistical methods, the value of $m_{ij}^t$ can also be learned during the learning of the parameters. Methods placing a prior on the learning parameters like in [50, 52, 53] can also be utilized to design the transfer coefficients. Furthermore, the regularization based MTL methods usually result in symmetric task relations, while MGD can achieve asymmetric transfer easily for a specific problem such that negative transfer can be mitigated.

3.6 Complexity analysis

We analyze the complexity of the iteration MGD using predefined transfer coefficients. In each iteration, the gradient calculation leads to a complexity of $\mathcal O(g(d)nT)$, where g(d) is the complexity of calculating the gradient w.r.t. the dimension d, which is determined by the cost function used, and the update of the model parameter according to (3) needs $\mathcal O(dT^2)$. Therefore, the overall complexity of the MGD algorithms is of order $\mathcal O(t(ng(d)T+dT^2))$, where t is the iteration time.

4 Experiments

In this section, we evaluate the MGD algorithm on different types of MTL problems, including regression problems, multi-label learning as multi-task learning problems, and a deep neural network model. Specifically, we first conduct a simple linear regression using synthetic datasets to demonstrate the effect of MGD compared to single-task gradient descent. Then, we validate the effectiveness of MGD for the multi-label learning problem on a series of real-world multi-label learning datasets, and compare it with both classical and state-of-the-art algorithms. Finally, we use MGD for digit classification on the MultiMNIST dataset, an MTL version of the MNIST dataset, based on LeNet. To show the effectiveness of the proposed framework, the transfer coefficients are set manually or through statistical methods, which are simple but already effective. More sophisticated transfer coefficients can be used to further improve the performance.

4.1 Toy problem

We first validate our approach by an experiment with synthetic datasets. We generate two synthetic datasets for regression that have same mean but different variances. Specifically, the noise level for the first task is set to be low while that for the second task is set to be high as follows,

$$\begin{aligned} T_1: y_{1j} =&w^*x_{1j}+\delta ,\ j=1,\ldots ,n,\\ T_2: y_{2j} =&w^*x_{2j}+10\delta ,\ j=1,\ldots ,n, \end{aligned}$$

where n is the number of data samples and $\delta \sim \mathcal N(0,1)$. Taking the Mean Squared Error (MSE) as the cost function, denote ${\varvec{x}}_i$ and $\varvec{y}_i$ as the data vector for task i, by minimizing the cost function, Single-task Gradient Descent (SGD) gives the following iteration

$$\begin{aligned} w_i^{t+1} =w_i^t-\frac{\alpha _i}{n} {\varvec{x}}_i^T(w_i^t{\varvec{x}}_i-\varvec{y}_i),\ i = 1,2. \end{aligned}$$

Under MGD, the iteration is

$$\begin{aligned} w_i^{t+1} = \sum _{j=1}^2 m_{ij}^tw_j^t-\frac{\alpha _i}{n} {\varvec{x}}_i^T(w_i^t{\varvec{x}}_i-\varvec{y}_i),\ i=1,2. \end{aligned}$$

Note that although we can directly obtain the analytic expression which minimizes the MSE for this simple problem, iteration method is more commonly used for most problems, and we use iteration here to showcase the difference between MGD and SGD.

We set $w^* = 2$, $\alpha _1=\alpha _2 = 0.2$. We generate 100 points for each task from the standard Gaussian distribution and use 5-folder split for training and test, and run the iteration 50 times. First, we use a fixed transfer coefficients $m_{12}=m_{21}=0.05$ for symmetric transfer. Then, since we already know that task 2 has a higher noisy level than task 1, it is believed that transfer information from task 2 to task 1 will probably cause negative transfer, thus, we conduct asymmetric transfer with transfer coefficients $m_{12}=0.0001$ and $m_{21} = 0.05$.

The results are shown in Fig. 1.

Figure 1a shows the updates of the absolute error between $w_i$ and $w^*$, it can be seen that the parameter of task 2 under both MGD and MGD-asy converge to better solutions than under SGD, while the parameter of task 1 under MGD converges to a solution worse than under SGD, and that of MGD-asy converges to a solution similar with SGD. This indicates that a negative transfer from task 2 to task 1 exists when the transfer is symmetric, and it can be mitigated by asymmetric transfer. Comparing the results obtained by MGD and MGD-asy, we can see that the solution for task 1 improves more under MGD-asy than MGD. This is due to the better solution obtained for task 1 under asymmetric transfer. The update of the log of the total training loss is shown in Fig. 1b. As can be seen, all the iterations converge to steady states, while the steady states are different. SGD produces the lowest training loss since it directly minimizes the cost function. However, note that the cost functions are defined for each task separately, which ignore task similarities, the one which gives the lowest loss doesn’t guarantee the best performance. The different steady state values produced by MGD and MGD-asy showcase that the information transfer between the tasks implicitly changes the cost function, which may give better solution when the transfer is properly designed. The MSE reduction of MGD and MGD-asy over SGD on the test set is shown in Fig. 1c. The result shows that MGD has a higher MSE than SGD on task 1, while this negative transfer is suppressed by asymmetric transfer in MGD-asy. Nevertheless, looking at the total MSE, both MGD and MGD-asy have better performance than SGD.

4.2 Multi-label classification

Multi-label learning deals with the problem that one instance is associated with multiple labels. Given the multi-label training set $\mathcal D=\{({\varvec{x}}_j, \varvec{y}_j)| 1\le j\le n\}$, where n is number of instances, ${\varvec{x}}_j\in \mathcal X$ is the feature vector for the j-th instance and $\varvec{y}_j\in \{0,1\}^{T}$ is the set of labels associated with the j-th instance. The task of multi-label learning is to learn a function h from $\mathcal D$ which can assign a set of proper labels to an instance. We decompose the multi-label learning problem into T binary classification tasks. For each of the classification tasks, we use the 2-norm regularized logistic loss as the cost function. Thus, for any task i, the following cost function is optimized by each task individually,

$$\begin{aligned} f_i(\varvec{w}_i) =&-\frac{1}{n}\sum _{j=1}^n(y^i_j\log h(z^i_j)+(1-y^i_j)\log (1-h(z^i_j)))\\&\quad +\frac{1}{2}\rho \Vert \varvec{w}_{i,-1}\Vert ^2, \end{aligned}$$

where $h(z^i_j) = P(y^i_j=1|x_j) = \frac{1}{1+e^{-z^i_j}}$, $z^i_j = [1\ {\varvec{x}}_j^T]\varvec{w}_i$, $\varvec{w}_i\in \mathbb R^{p+1}$ is the model parameter, $\varvec{w}_{i,-1}\in \mathbb R^p$ is the remaining elements in $\varvec{w}_i$ except the first element, and $\rho $ is the regularization parameter. Let

$$\begin{aligned} X = \begin{bmatrix} 1 &{} {\varvec{x}}_1^T \\ \vdots &{}\vdots \\ 1 &{} {\varvec{x}}_n^T \end{bmatrix}, \varvec{y}^i = \begin{bmatrix} y_1^i\\ \vdots \\ y_n^i \end{bmatrix}. \end{aligned}$$

The MGD iteration is

$$\begin{aligned} \varvec{w}_i^{t+1} =&\sum _{j}^T m_{ij}^t\varvec{w}_j^t-\frac{\alpha }{n} X^T(g(X\varvec{w}_i^t)-\varvec{y}^i)-\alpha \rho \begin{bmatrix} 0\\ \varvec{w}_{i,-1} \end{bmatrix}, \end{aligned}$$

(26)

where $g(X\varvec{w}_i^t) = [\frac{1}{1+e^{-[1\ {\varvec{x}}_1^T]\varvec{w}_i^t}},\ldots ,\frac{1}{1+e^{-[1\ {\varvec{x}}_n^T]\varvec{w}_i^t}}]'$.

In multi-label learning problems, the similarity between task i and task j can be modeled by the correlation between labels $\varvec{y}^i$ and $\varvec{y}^j$. In this experiment, we use the cosine similarity to calculate the correlation matrix C, i.e., $C_{ij} = \frac{<\varvec{y}^i,\varvec{y}^j>}{\Vert \varvec{y}^i\Vert \Vert \varvec{y}^j\Vert }$. Then, we normalize each row of the correlation matrix C to be row-sum-to-one and set ${\bar{m}}_{ij}=C_{ij}$. Finally, we rescale ${\bar{m}}_{ij}$ according to (5) to get the transfer coefficient $m_{ij}$.

After learning the model parameter $\varvec{w}_i^*$, we can predict the label $y_t^i$ for a test instance ${\varvec{x}}_t$ by the corresponding prediction function. The ith label prediction for an instant ${\varvec{x}}_t$ is predicted 1 if $h(z_t^i)\ge \eta $ and 0 otherwise, where $\eta $ is the threshold. In the experiment, $\eta $ is chosen from $\{0.1,0.2,0.3\}$.

4.2.1 Experimental setup

We conduct the multi-label classification on six benchmark multi-label datasets, including regular-scale datasets: emotions, genbase, and cal500; and relatively large-scale datasets: enron, corel5k, and bibtex. The details of the datasets are summarized in Table 1, where |S|, $\text {dim}(S)$, L(S), $\text {Card}(S)$, and $\text {Dom}(S)$ represent the number of examples, the number of features, the number of class labels, the average number of labels per example, and feature type of dataset S, respectively. The datasets are downloaded from the website of Mulan^{Footnote 1} [42].

Table 1 Characteristics of the tested multi-label datasets

Full size table

Five widely used evaluation metrics are employed to evaluate the performance, including Average precision, Macro-averaging F1, Micro-averaging F1, Coverage score, and Ranking loss. Concrete metric definitions can be found in [46]. Note that for the comparison purpose, the coverage score is normalized by the number of labels. For Average precision, Macro averaging F1, and Micro averaging F1, the larger the values the better the performance. For the other two metrics, the smaller the values the better the performance.

We compare our proposed method MGD with three classical algorithms including BR [5], RAkEL [43], ECC [36], and two state-of-the-art multi-label learning algorithms LIFT [45] and LLSF-DL [24].

In the experiments, we used the source codes provided by the authors for implementation. BR, ECC, and RAkEL are implemented under the Mulan multi-label learning package [42] using the logistic regression model as the base classifier. Parameters suggested in the corresponding literatures are used, i.e., RAkEL: ensemble size 2T with $k=3$; ECC: ensemble size 30; LIFT: the ratio parameter r is tuned in {0.1,0.2,...,0.5}; LLSF-DL: $\alpha $, $\beta $, $\gamma $ are searched in $\{4^{-5},4^{-4},\ldots ,4^5\}$, and $\rho $ is searched in $\{0.1,1,10\}$. For the proposed approach MGD, $\alpha $ is set to 0.02, $\rho $ is chosen from $\{0.1,0.2,\ldots ,1\}$, and $\sigma $ is chosen from $\{0,0.05,0.1,0.15,\ldots ,0.3\}$.

4.2.2 Experimental results

We run the algorithms 5 times on five sets of randomly partitioned training (80%) and testing (20%) data, the mean metric values with standard deviations are recorded in Tables 2 and 3. The best performance is shown in boldface, $\uparrow $ indicates the larger the better, and $\downarrow $ indicates the smaller the better. From the results, we can see that MGD outperforms other comparing algorithms in most cases. Specifically, MGD ranks first in 86.7% cases. Compared with the existing algorithms, MGD introduced a new approach to incorporate label correlations, which is easy to implement and has low complexity. The results demonstrate the effectiveness of the proposed approach in improving the learning performance.

Table 2 Prediction performance (mean ± std. deviation) on the regular-scale tested datasets

Full size table

Table 3 Prediction performance (mean ± std. deviation) on the large-scale tested datasets

Full size table

Compared to single gradient descent, the transfer in MGD also helps to accelerate the convergence. To see this, the iterations of the total loss, i.e., $\sum _{i=1}^T f_i(\varvec{w}_i)$, are plotted in the first row of Fig. 2 for three datasets. It can be seen that the MGD converges faster than single gradient descent, especially at early iterations. The iterations of the average precision score are also plotted in the second row of Fig. 2. It can be seen that for limited iteration times, the score under MGD is much better than single gradient descent.

We investigate the sensitivity of MGD with respect to the two hyperparameters $\rho $ and $\sigma $, which control the norm 2 regularization strength in the logistic regression and the transfer strength. Due to space limit, we only report the results on the emotions dataset using the average precision score. Figure 3 shows how the average precision score varies with respect to $\rho $ and $\sigma $. Figure 3b, c are obtained by keeping the other parameter fixed at its best setting. It can be seen that both $\rho $ and $\sigma $ influence the performance. While, under a relatively wide range of parameters combinations, the score does not vary too much.

In the above experiment, we use the cosine similarity on the label set to calculate the correlation matrix C, which results in symmetric correlation. To see the effect of asymmetric transfer, we take the emotions dataset as an example and impose asymmetric correlation on the labels. There are six labels in the emotions datasets representing amazed-surprised (L1), happy-pleased (L2), relaxing-calm (L3), quiet-still (L4), sad-longly (L5), and angry-fearful (L6). Based on the ease of predictions, which ranked in the following descending order L4, L6, L5, L1, L3, L2, we added a vector [3/21, 1/21, 2/21, 6/21, 4/21, 5/21] on each row of the cosine similarity matrix, to make more information transferred from easier tasks to harder tasks, and less information transferred vice versa. We compare the evaluation metrics obtained by using the above asymmetric correlation matrix (MGD-asy) and cosine similarity (MGD) in Table 4.

Table 4 Prediction performance (mean± std. deviation) on the emotions dataset obtained by asymmetric MGD and MGD

Full size table

As can be seen from Table 4, MGD with asymmetric transfer improves the performance in all the evaluation metrics. The classification accuracy for each label is further compared in Table 5. It can be seen that the classification accuracy for both easier and harder predicted labels are improved under asymmetric transfer.

Table 5 Classification accuracy for each label on the emotions dataset obtained by asymmetric MGD and MGD

Full size table

Table 6 Performance on the MultiMNIST dataset

Full size table

4.3 MultiMNIST

MultiMNIST is an MTL version of the MNIST dataset [38], where multiple images are overlaid together to convert digit classification into a multi-task problem. We use the construction from [41]. For each image, a different one is chosen uniformly in random. Then one of the images is put at the top-left and the other is at the bottom-right with partially overlapping. The resulting two tasks are classifying the digit on the top-left and classifying the digit on the bottom-right with the transformed images as the input. We use the dataset created by [41], which contains 60K examples.

The LeNet [29] is used as the base model for each task. During the optimization procedure, we transfer the parameters of the first two convolution layers, and leave the fully connected layers without transfer. The architecture is visualized in Fig. 4. For both tasks, the cross-entropy loss is used as the cost function. We adapt the SGD and modify it with the multi-task transfer as the optimizer. The transfer parameter between the two tasks is set based on the Euclidean distance (Euc) between the two label sets, which is $1/(Euc+1)$, where the scaler 1 in the dominator is to ensure the transfer parameter less than 1. The rescale parameter $\sigma $ is set as 1.

To show the effectiveness, each single task is solved solely by SGD to serve as the baseline. Furthermore, based on the code by the author, we run the MTL method proposed in Sener and Koltun 2018 [41], which trained the first two convolution layers and the first fully connected layers as a shared encoder and two independent fully-connected layers as task-specific function for the two tasks, as a comparison. For all the experiments, the learning rate is set as 0.001 and halved every 30 epoches, the momentum is set as 0.9. We use batch size of 256 and train for 100 epoches. The results averaged over 5 runs are listed in Table 6.

As can be seen from the results in Table 6, our method performs the best. Specifically, compared with single task baseline, MGD which superimpose transfer on single-task learning achieves a better performance, which showcases the effectiveness of utilizing information from related tasks. Compared with the model-based MTL method [38] which uses the first two convolutional layers and one fully connected layer as shared layers, our method also achieves better results on both tasks. The result validates the efficacy of our method which promotes relations between related tasks through transferring parameter values during the model learning process.

5 Conclusion and future work

In this paper, we propose the MGD algorithm for MTL. Different from the state-of-the-art, MGD treats multi-task learning as multiple learning tasks with independent cost functions, and transfers correlated model parameter values during the model learning process of the independent cost functions. By implicitly changing the cost function through the learning process, MGD achieves utilizing information from related tasks with proper transfer mechanisms. The convergence of the algorithm has been theoretically proven for any transfer mechanism satisfying easily achievable conditions, which provides flexibility in using different kinds of similarity measurements. Compared to existing MTL approaches, MGD is easy to implement, can achieve seamless asymmetric transformation such that negative transfer is mitigated, and can benefit from parallel computing when the number of tasks is large. The competitive experimental results validate the effectiveness of MGD. In our current work, we only require the transfer coefficients to satisfy easily achievable conditions and utilize simple similarity measurements such as cosine similarity to find task relations in the experiment. It is desirable to design more effective and systematic learning of transfer coefficients to improve the performance, including asymmetric transfer. Besides, it is also interesting to investigate element-wise feature-feature relations rather than only task-task relations in the future.

Notes

http://mulan.sourceforge.net/datasets-mlc.html

References

Amaya JE, Cotta C, Fernández-Leiva AJ, García-Sánchez P (2020) Deep memetic models for combinatorial optimization problems: application to the tool switching problem. Memet Comput 12(1):3–22
Article Google Scholar
Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. Adv Neural Inf Process Syst 20:41–48
Google Scholar
Bali KK, Ong Y-S, Gupta A, Tan PS (2019) Multifactorial evolutionary algorithm with online transfer parameter estimation: Mfea-ii. IEEE Trans Evol Comput 24(1):69–83
Article Google Scholar
Basar T, Olsder GJ (1999) Dynamic noncooperative game theory, vol 23. SIAM, Philadelphia
MATH Google Scholar
Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recogn 37(9):1757–1771
Article Google Scholar
Chen J, Tang L, Liu J, Ye J (2009) A convex formulation for learning shared structures from multiple tasks. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 137–144
Deng Z, Lu J, Wu D, Choi K-S, Sun S, Nojima Y (2019) Guest editorial: special issue on new advances in deep-transfer learning. IEEE Trans Emerg Top Comput Intell 3(5):357–359
Article Google Scholar
Dinh TP, Thanh BHT, Ba TT, Binh LN (2020) Multifactorial evolutionary algorithm for solving clustered tree problems: competition among cayley codes. Memet Comput 12(3):185–217
Article Google Scholar
Dong D, Wu H, He W, Yu D, Wang H (2015) Multi-task learning for multiple language translation. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, pp 1723–1732
Duong L, Cohn T, Bird S, Cook P (2015) Low resource dependency parsing: cross-lingual parameter sharing in a neural network parser. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers), pp 845–850
Evgeniou T, Pontil M (2004) Regularized multi–task learning. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 109–117
Facchinei F, Pang J-S (2007) Finite-dimensional variational inequalities and complementarity problems. Springer, Berlin
MATH Google Scholar
Feng L, An B, He S (2019) Collaboration based multi-label learning. In: Thirty-third AAAI conference on artificial intelligence
Feng L, Ong Y-S, Tan A-H, Tsang IW (2015) Memes as building blocks: a case study on evolutionary optimization + transfer learning for routing problems. Memet Comput 7(3):159–180
Article Google Scholar
Fürnkranz J, Hüllermeier E, Mencía EL, Brinker K (2008) Multilabel classification via calibrated label ranking. Mach Learn 73(2):133–153
Article Google Scholar
Görnitz N, Widmer C, Zeller G, Kahles A, Rätsch G, Sonnenburg S (2011) Hierarchical multitask structured output learning for large-scale sequence segmentation. Adv Neural Inf Process Syst 24:2690–2698
Google Scholar
Gupta A, Ong Y-S (2019) Memetic computation: the mainspring of knowledge transfer in a data-driven optimization era, vol 21. Springer
Gupta A, Ong Y-S, Feng L (2015) Multifactorial evolution: toward evolutionary multitasking. IEEE Trans Evol Comput 20(3):343–357
Article Google Scholar
Gupta A, Ong Y-S, Feng L (2017) Insights on transfer optimization: because experience is the best teacher. IEEE Trans Emerg Top Comput Intell 2(1):51–64
Article Google Scholar
Han L, Zhang Y, Song G, Xie K (2014) Encoding tree sparsity in multi-task learning: a probabilistic framework. In: Twenty-eighth AAAI conference on artificial intelligence
He T, Liu Y, Ko T-H, Chan K-C, Ong Y-S (2019) Contextual correlation preserving multiview featured graph clustering. IEEE trans cybern 50(10):4318–4331
Article Google Scholar
He T, Bai L, Ong Y-S (2019) Manifold regularized stochastic block model. In: International Conference on Tools with Artificial Intelligence, pp 800–807
Hou J-C, Wang S-S, Lai Y-H, Tsao Y, Chang H-W, Wang H-M (2018) Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Trans Emerg Top Comput Intell 2(2):117–128
Article Google Scholar
Huang J, Li G, Huang Q, Wu X (2016) Learning label-specific features and class-dependent labels for multi-label classification. IEEE Trans Knowl Data Eng 28(12):3309–3323
Article Google Scholar
Huang J, Li G, Huang Q, Wu X (2018) Joint feature selection and classification for multilabel learning. IEEE Trans Cybern 48(3):876–889
Article Google Scholar
Huang S-J, Zhou Z-H (2012) Multi-label learning by exploiting label correlations locally. In: Twenty-sixth AAAI conference on artificial intelligence
Kato T, Kashima H, Sugiyama M, Asai K (2008) Multi-task learning via conic programming. Adv Neural Inf Process Syst 21:737–744
Google Scholar
Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition 7482–7491
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Article Google Scholar
Lee G, Yang E, Hwang S (2016) Asymmetric multi-task learning based on task relatedness and loss. In: International conference on machine learning, pp 230–238
Liu H, Palatucci M, Zhang J (2009) Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 649–656
Liu W, Mei T, Zhang Y, Che C, Luo J (2015) Multi-task deep visual-semantic embedding for video thumbnail selection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3707–3715
Obozinski G, Taskar B, Jordan M (2006) Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2(2.2):2
Pan SJ, Yang Q (2009) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Article Google Scholar
Ramsundar B, Kearnes S, Riley P, Webster D, Konerding D, Pande V (2015) Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072
Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333
Article MathSciNet Google Scholar
Ruder S (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. Adv Neural Inf Process Syst 3856–3866
Sayed AH (2014) Diffusion adaptation over networks. In: Academic Press library in signal processing, vol 3. Elsevier, pp 323–453
Schmidt M, Fung G, Rosales R (2007) Fast optimization methods for l1 regularization: a comparative study and two new approaches. In: European conference on machine learning. Springer, pp 286–297
Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. Adv Neural Inf Process Syst 525–536
Tsoumakas G, Katakis I, Vlahavas I (2010) Mining multi-label data. In: Data mining and knowledge discovery handbook. Springer, pp 667–685
Tsoumakas G, Katakis I, Vlahavas I (2010) Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng 23(7):1079–1089
Article Google Scholar
Yang Y, Hospedales TM (2016) Trace norm regularised deep multi-task learning. arXiv preprint arXiv:1606.04038
Zhang M-L, Wu L (2014) Lift: multi-label learning with label-specific features. IEEE Trans Pattern Anal Mach Intell 37(1):107–120
Article Google Scholar
Zhang M-L, Zhou Z-H (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837
Article Google Scholar
Zhang W, Li R, Zeng T, Sun Q, Kumar S, Ye J, Ji S (2016) Deep model based transfer and multi-task learning for biological image analysis. IEEE Trans Big Data 6(2):322–333
Article Google Scholar
Zhang X, Yang Z, Cao F, Cao J-Z, Wang M, Cai N (2020) Conditioning optimization of extreme learning machine by multitask beetle antennae swarm algorithm. Memet Comput 12(2):151–164
Article Google Scholar
Zhang X, Zhuang Y, Wang W, Pedrycz W (2016) Transfer boosting with synthetic instances for class imbalanced object recognition. IEEE Trans Cybern 48(1):357–370
Article Google Scholar
Zhang Y, Yang Q (2017) Learning sparse task relations in multi-task learning. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, pp 2914–2920
Zhang Y, Yang Q (2017) A survey on multi-task learning. arXiv preprint arXiv:1707.08114
Zhang Y, Yeung D-Y (2013) Multilabel relationship learning. ACM Trans Knowl Discov Data (TKDD) 7(2):1–30
Article Google Scholar
Zhang Y, Yeung D-Y (2014) A regularization approach to learning task relationships in multitask learning. ACM Trans Knowl Discov Data (TKDD) 8(3):12
Google Scholar
Zhang Z, Luo P, Loy CC, Tang X (2014) Facial landmark detection by deep multi-task learning. In: European conference on computer vision. Springer, pp 94–108
Zhu Y, Kwok JT, Zhou Z-H (2018) Multi-label learning with global and local label correlation. IEEE Trans Knowl Data Eng 30(6):1081–1094
Article Google Scholar

Download references

Acknowledgements

This work were supported in part by the A*STAR Cyber-Physical Production System (CPPS)-Towards Contextual and Intelligent Response Research Program, under the RIE2020 IAF-PP Grant A19C1a0018, the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2018-004), and Data Science & Artificial Intelligence Research Centre, Nanyang Technological University. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of National Research Foundation, Singapore.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Lu Bai, Yew-Soon Ong & Tiantian He
Singapore Institute of Manufacturing Technology, Agency for Science, Technology and Research, Singapore, Singapore
Abhishek Gupta

Authors

Lu Bai
View author publications
You can also search for this author in PubMed Google Scholar
Yew-Soon Ong
View author publications
You can also search for this author in PubMed Google Scholar
Tiantian He
View author publications
You can also search for this author in PubMed Google Scholar
Abhishek Gupta
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lu Bai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bai, L., Ong, YS., He, T. et al. Multi-task gradient descent for multi-task learning. Memetic Comp. 12, 355–369 (2020). https://doi.org/10.1007/s12293-020-00316-3

Download citation

Received: 23 September 2020
Accepted: 01 October 2020
Published: 19 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s12293-020-00316-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Multi-task gradient descent for multi-task learning

Abstract

Similar content being viewed by others

Knowledge Distillation for Multi-task Learning

Sample-level weighting for multi-task learning with auxiliary tasks

Learning with Partially Shared Features for Multi-Task Learning

1 Introduction

2 Related work

3 The MGD approach

3.1 Problem formulation

3.2 The proposed framework

3.3 Convergence analysis

Theorem 1

Proof

Corollary 1

Proof

3.4 Relation with regularization based multi-task learning

3.5 Pseudocode of the MGD algorithm

3.6 Complexity analysis

4 Experiments

4.1 Toy problem

4.2 Multi-label classification

4.2.1 Experimental setup

4.2.2 Experimental results

4.3 MultiMNIST

5 Conclusion and future work

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Multi-task gradient descent for multi-task learning

Abstract

Similar content being viewed by others

Knowledge Distillation for Multi-task Learning

Sample-level weighting for multi-task learning with auxiliary tasks

Learning with Partially Shared Features for Multi-Task Learning

Explore related subjects

1 Introduction

2 Related work

3 The MGD approach

3.1 Problem formulation

3.2 The proposed framework

3.3 Convergence analysis

Theorem 1

Proof

Corollary 1

Proof

3.4 Relation with regularization based multi-task learning

3.5 Pseudocode of the MGD algorithm

3.6 Complexity analysis

4 Experiments

4.1 Toy problem

4.2 Multi-label classification

4.2.1 Experimental setup

4.2.2 Experimental results

4.3 MultiMNIST

5 Conclusion and future work

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation