1 Introduction

Multi-task learning (MTL) has received enormous interest over the years because it leverages common information between related tasks to improve generalization performance [71]. Although earlier efforts concentrated on the single-task learning setting [41, 48, 68, 72, 74, 75, 86], many studies, including but not limited to [5, 10, 25, 26, 51, 58, 60, 64, 71, 78, 81, 85, 94], have shown that MTL can provide robust improvements over single-task learning methods, with applications in areas such as computer vision [60, 62, 64, 92], bioinformatics [19, 49] and web search ranking [12, 77]. Besides, MTL is related to several sub-fields of machine learning, such as transfer learning [6, 27], multi-label learning [33, 34] and multi-class learning [17]. However, MTL differs from transfer learning mainly in that it learns many related tasks simultaneously to extract shared information.

The strategy adopted in MTL, as described in [11], is to assume that tasks are often related, although some may be unrelated, because pairwise relationships may exist only between certain tasks. For example, task A may be related to task B and task D, while task C is related only to task D. Consequently, the relationships between all or subsets of tasks can be learnt through existing MTL approaches, which can be categorized as follows: regularization-based methods [4, 5, 10, 12, 19, 25, 26, 40, 49, 51, 58, 64, 81, 85, 89, 94], low-rank methods [3, 29, 43, 59, 67], clustering methods [7, 36, 44, 76], task similarity learning methods [22, 62, 71, 90, 91, 92] and decomposition methods [37, 38, 39, 93, 96]. These learning methods are widely applied in the supervised learning paradigm using shallow algorithms such as the support vector machine (SVM) [10, 60, 71], single-layer artificial neural networks [5, 11] and Bayesian networks [90, 91].

The above implies that most studies on MTL after [11] focus on supervised learning, with minimal efforts made in unsupervised learning [18, 28] and reinforcement learning [65]. Besides, only a few attempts have been made in the deep MTL direction [46, 70]. This is not surprising, given that shallow models still generalize better on many real-world problems because of some limitations of deep learning. For example, whereas shallow models can perform well with a limited quantity of data, a typical deep learning model requires a huge amount of data to outperform shallow models. Furthermore, suitable theories that can assist researchers in selecting adequate deep learning tools are scarce. As a result, some researchers are still hesitant to embrace the deep learning paradigm. Thus, this paper focuses on supervised shallow-based multi-task learning (SSMTL) methods, in which a task with a labeled data set can be a regression or classification problem. Nevertheless, an overview of deep learning methods is also provided to compare shallow with non-shallow approaches.

The main contributions of the authors in this paper are summarized as follows:

  1. An up-to-date and simplified overview of MTL with illustrative examples.

  2. A presentation of the progress made in MTL research through a discussion of its existing approaches.

  3. A formulation of a typical SSMTL method using a general approach and an SVM approach to aid the analysis of existing SSMTL methods.

  4. An overview of the challenges and future research directions of SSMTL.

This paper is structured as follows. Section 2 presents an overview of MTL while answering several common questions about MTL. Next, in Sect. 3, an SSMTL problem is formulated, with slightly more emphasis on the SVM approach because of its generalization goal. Afterward, Sect. 4 reviews SSMTL methods. Then Sect. 5 provides an overview of deep learning methods in supervised MTL. Furthermore, Sect. 6 discusses the challenges and future directions of supervised MTL, while Sect. 7 presents the conclusion.

2 Why Multi-task Learning?

To improve learning accuracy for single-task models, techniques such as ensemble learning [41, 74] and transfer learning [6, 27] have proved handy over the years. Specifically, the ensemble technique creates multiple models whose outputs can be combined linearly to produce improved results. The transfer learning technique, on the other hand, stores the knowledge gained while learning one task and then uses it as a bias to learn another task sequentially. Although both approaches have been substantially demonstrated in the literature to be effective, they both have limitations. For instance, the ensemble technique requires each model to perform better than a random guess to yield a favorable output; otherwise, the result may be worse than that of a single-task model. The same can happen in transfer learning if there is a negative transfer from one task to another. MTL tackles these limitations by providing a way to learn multiple tasks simultaneously to improve performance (see Fig. 1). As such, the MTL process does not just focus on improving prediction accuracy; it also increases data efficiency while reducing training time [73]. For example, through a virtual input, a self-driving vehicle can use MTL to simultaneously learn the tasks of predicting object trajectories (avoiding collisions), detecting the location of pedestrians, responding to traffic signals and determining per-pixel depth, with a shorter training time than in transfer learning. The knowledge gained in one task is shared simultaneously to learn the other tasks, unlike the sequential process of transfer learning, which is more susceptible to negative transfer.

Fig. 1 An illustration of single-task learning versus multi-task learning

According to the work of [8], MTL is particularly advantageous for learning problems that belong to an environment of related problems. As an example, medical diagnostic challenges are discussed, in which a pathology test can be used to detect numerous diseases at the same time by identifying a common bias that will also aid in learning fresh cases. MTL is also useful in a variety of other complicated real-world situations, such as emotion recognition. In this case, multiple models can be trained at the same time to recognize a dress type and a weather condition that depends on the learned dress type. It is therefore easy to understand why reference [8] and most other preliminary works on MTL, including reference [11], are based on the notion that tasks often share a certain similarity. However, under what conditions can one expect different tasks to belong to an environment of related problems? To answer this question, Ben-David and Borbely [9] focus on the sample-generating distributions that underpin learning tasks, where task relatedness is defined as an explicit relationship between the distributions. Their notion appears to cover only a subset of the applications where MTL could be effective, excluding many other MTL scenarios. Specifically, the proposed methodology applies to circumstances in which the learner's prior knowledge includes knowledge of some family F of transformations. A typical example involves several sensors providing data for the same classification task, such as a system of cameras positioned at the entrance of a presidential palace to automatically detect intrusion from the photographs they capture. Let us assume that these cameras are placed at different heights, under different light conditions and at different angles. It should then be clear that each of these cameras has its own bias, which can be difficult to identify. Therefore, Ben-David and Borbely's framework may be used to model this situation as an MTL scenario by developing a collection of picture transformations F such that the data distributions of the images collected by all of these cameras are F-related.

The illustrations above show MTL problems that cannot be solved well using a single-task learning technique. Liu et al. [52] demonstrate this by assessing individual task performance in MTL and comparing it with single-task learning. Their findings reveal that individual task performance in the MTL context was superior to that of single-task learning, providing a compelling justification for MTL.

3 Problem Formulation

Suppose we have T classification tasks. A typical MTL problem formulation using any shallow learning algorithm, such as SVM, logistic regression or an artificial neural network, can be generalized as follows.

3.1 General Approach

$$\begin{aligned} \begin{array}{c} {\min \limits _{\mathbf{W} = w_1 + w_2 + \ldots + w_T}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \left( \mathbf{W} \right) , \end{array} \end{aligned}$$
(1)

where \(\varvec{S_t}\) is the training data for the t-th task, given as \({\{\varvec{x_{ti}},y_{ti}}\}_{i=1}^{N_t}\), in which \(\varvec{x_{ti}}\in {R^{d}}\) is the i-th training instance of the t-th task, labeled with \(y_{ti}\); \(w_t \in {R^{d}}\) is the weight vector of the t-th task; and d is the feature space dimension, assuming that each task's input matrix has the same feature dimension (homogeneous features, although the features can alternatively be heterogeneous, with d varying per task). The \(+\) sign indicates that \(w_1, w_2, \ldots, w_T\) are concatenated to form \(\varvec{W}\) (i.e., each row of \(\varvec{W}\) corresponds to a feature), which is learned under a specific regularization constraint denoted by \(\lambda (\varvec{W})\) that can be informed mainly by prior knowledge of the data [77]. To illustrate this, we revisit the medical diagnostic and self-driving vehicle problems from Sect. 2, where the input matrices \(X_t\) and \(X_u\) of two different tasks are the same, but the target outputs \(y_t\) and \(y_u\) are not. Here, the T task models can be trained concurrently to learn a common bias for all tasks by carefully selecting the value of \(\lambda \) (a regularization parameter) to avoid overfitting. It should be emphasized, however, that this does not necessarily indicate a strict MTL problem. As mentioned in reference [88], it may best be described as a multi-label learning or multi-output problem. By contrast, the camera system example also presented in Sect. 2 more accurately depicts a strict MTL problem, with different input samples for each task.
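To make the structure of Eq. (1) concrete, the following minimal sketch evaluates the objective for a given weight matrix. It is an illustration under stated assumptions rather than a method from the reviewed literature: a squared loss stands in for \({\mathcal {L}}\), the regularizer is written as a constant times a penalty function, and the names `squared_loss`, `mtl_objective` and `frobenius_sq` are hypothetical.

```python
import numpy as np

def squared_loss(X, y, w):
    """Mean squared error of a linear model on one task (assumed loss)."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))

def frobenius_sq(W):
    """Squared Frobenius norm, used here only as an example penalty."""
    return float(np.sum(W ** 2))

def mtl_objective(tasks, W, penalty, lam):
    """Sum of per-task losses plus lam * penalty(W).

    tasks: list of (X_t, y_t) pairs, one per task.
    W: d x T matrix whose t-th column is the weight vector w_t.
    """
    data_term = sum(squared_loss(X, y, W[:, t])
                    for t, (X, y) in enumerate(tasks))
    return data_term + lam * penalty(W)

# Synthetic data: T tasks with different inputs but the same feature dimension d.
rng = np.random.default_rng(0)
d, T = 5, 3
tasks = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(T)]
W = rng.normal(size=(d, T))
value = mtl_objective(tasks, W, frobenius_sq, lam=0.1)
```

Swapping `frobenius_sq` for another penalty function is what distinguishes the method families reviewed in Sect. 4.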

3.2 Standard SVM Approach

In this section, we employ the SVM MTL formulation from [23] as an example. The reason is that SVM is commonly employed to solve MTL problems [66], perhaps because of its strong generalization capability, which is ideal for MTL. Given training data \({\{{\{(\varvec{x_{i}},y_{i})}\}_{i=1}^{N}, \varvec{x_{i}}\in {R^{d}}, y_{i}\in {\{1,-1}\}}\} \), SVM finds the hyperplane with the maximum margin separating the points labeled − 1 from those labeled 1, as illustrated in Fig. 2. The standard soft margin SVM for a single-task problem is as follows

$$\begin{aligned} \begin{array}{c} {{\min \limits _{w,b}} \sum \limits _{i = 1}^N \xi _i + \lambda \parallel w {\parallel ^2}},\\ s.t.,{} {} y_i (w . x_i+b)\ge 1-\xi _i, \xi _i\ge 0, \end{array} \end{aligned}$$
(2)

where \(\xi _i\) is the slack variable introduced into the constraint to accommodate outliers that would not be allowed in the hard margin SVM. In other words, \(\xi _i\) measures how much a point violates the margin. \(\lambda \) is the regularization constant that regulates the tradeoff between complexity and generalization. It is crucial since any increase in a model's complexity can lead to overfitting, which occurs when a trained model fits the training examples well but fails to generalize to unseen ones. Therefore, given the same datasets as in Eq. (1), Eq. (16) of [23] provides an example extension of the single-task SVM problem of Eq. (2) to MTL SVM as follows

Fig. 2 The image on the left represents traditional binary classification, while the one on the right depicts typical SVM binary classification

$$\begin{aligned} \begin{array}{c} {{\min \limits _{w,B_t}} \sum \limits _{t = 1}^T\sum \limits _{i = 1}^{N_t} \xi _{ti} + \lambda \parallel w {\parallel ^2}},\\ s.t.,{} {} {y_{ti} w'B_t x_{ti} \ge 1-\xi _{ti}, \quad \xi _{ti}\ge 0, \forall t,\forall i}, \end{array} \end{aligned}$$
(3)

where the matrix \(B_t\) is assumed to have full rank d for each t to ensure that a solution w exists. Using the Lagrange multiplier approach, a typical solution to Eq. (3) can be found in several steps by first obtaining its dual form (the works of [54, 80] offer the mathematical deductions of SVM for easy understanding). This strategy is beneficial because it incorporates a sparsity effect into SVM by relying on Lagrange multipliers (which are non-zero only at the points on the margin, referred to as support vectors) rather than on the feature space, making it computationally efficient for high-dimensional data. Accordingly, the dual problem of Eq. (3), which also covers non-linear cases through the kernel trick in reproducing kernel Hilbert spaces (RKHSs) [57], is given in Eq. (18) of [23]. In the next section, we will use the general MTL formulation to review existing SSMTL methods.
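The role of the slack variables in Eq. (2) can be illustrated with a small sketch. For a fixed pair (w, b), the smallest slack that satisfies both constraints is \(\xi _i = \max (0, 1 - y_i(w \cdot x_i + b))\), so the objective can be evaluated directly. The function name `soft_margin_objective` and the toy data are assumptions made for illustration; the sketch evaluates the objective and does not solve the optimization problem.

```python
import numpy as np

def soft_margin_objective(X, y, w, b, lam):
    """Evaluate Eq. (2) for fixed (w, b), with each slack at its smallest feasible value."""
    margins = y * (X @ w + b)
    # xi_i = max(0, 1 - y_i (w . x_i + b)): zero for points beyond the margin.
    slack = np.maximum(0.0, 1.0 - margins)
    objective = float(slack.sum() + lam * np.dot(w, w))
    return objective, slack

# Four points in the plane; the last one lies on the wrong side of the hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

objective, slack = soft_margin_objective(X, y, w, b, lam=0.5)
print(slack)      # [0.  0.5 0.  1.5]
print(objective)  # 2.5
```

The second point is correctly classified but lies inside the margin (slack 0.5), while the fourth is misclassified (slack 1.5).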

4 SSMTL Methods

As shown in Fig. 3, existing SSMTL methods can be categorized into five groups: regularization-based methods, low-rank methods, clustering methods, task similarity learning methods, and decomposition methods. We go through each of these methods in detail in the subsections that follow.

Fig. 3 Supervised shallow-based multi-task learning methods

4.1 Regularization Based Methods

The concept of regularization has long been well known, primarily for its application in resolving overfitting and underfitting issues [63], reducing training and test errors and improving a model's generalization performance. Various studies [25, 27, 49] have shown the effectiveness of the regularization technique in MTL, where it is utilized to learn a shared representation across multiple related tasks. For example, if we have a collection of datasets with n correlated features for all tasks, we may use the regularization technique to simultaneously learn an uncorrelated subspace of the original feature space that is shared across all tasks. Particularly in the case of supervised MTL, regularization techniques such as the \(L_1\)-norm and \(L_{p,q}\)-norm impose a penalty on the weight matrix \(\varvec{W}\). This penalty shrinks the rows of the weight matrix toward zero so that only the non-zero rows are selected. In the next subsections, we review the existing regularization-based methods.

4.1.1 \(L_1\)-Norm or Lasso Sparsity

The \(L_1\)-norm, which is also referred to as the least absolute shrinkage and selection operator (Lasso) penalty, is an alternative to the \(L_2\)-norm. Although the \(L_2\)-norm can be used to minimize computing complexity while boosting performance accuracy by shrinking the rows of the weight matrix closer to zero, it cannot impose sparsity on the weight matrix. As a result, the \(L_2\)-norm cannot perform feature selection automatically. To illustrate the capability of the \(L_1\)-norm, we consider the \(L_1\)-norm version of Eq. (1), which is given as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \parallel \varvec{W} {\parallel _1}, \end{array} \end{aligned}$$
(4)

It is easy to see that the \(L_1\)-norm regularization in Eq. (4) is non-differentiable at zero. As such, a large value of the regularization constant \(\lambda \) will cause some entries, and possibly entire rows, of the weight matrix (whose columns are the T task-specific weight vectors) to be exactly zero. This characteristic of the \(L_1\)-norm encourages sparsity of the feature space, but it fails to perform group selection in cases where several correlated features are all important in determining the target variable. In such cases, the \(L_1\)-norm selects only a few features while shrinking the others to zero. In doing so, the \(L_1\)-norm fails to capture an absolute relationship between the T tasks. Due to this limitation, the \(L_1\)-norm is often applied in combination with other norms. For instance, several variants of the \(L_1\)-norm, such as the \(L_{1,2}\)-norm [25, 95], \(L_{1,\infty }\)-norm [15, 38] and \(L_{1,1}\)-norm [56], have been used to capture a sparse representation shared across tasks. Specifically, Gong et al. [25] propose a method based on the capped-\(L_1\), \(L_1\) norm as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} {\mathcal {L}} (\varvec{W}) + \lambda \sum \limits _{j = 1}^d \min (\parallel w_j {\parallel _1}, \theta ): \varvec{W} \in {R^{ d \times T}}. \end{array} \end{aligned}$$
(5)

First, in Eq. (5), an \(L_1\)-norm penalty is imposed on each row of the weight matrix \(\varvec{W}\) to obtain a sparse representation for all related tasks. Then, a capped-\(L_1\) norm, which was initially proposed in [87], is further imposed on the weight matrix. With this combination, the optimal \(\varvec{W}\) in Eq. (5) would have many zero rows. It can also be observed from Eq. (5) that \(\varvec{W}\) is thresholded by a parameter \(\theta \), where \(w_j \in R^{1 \times T}\) denotes the j-th row of \(\varvec{W}\). In other words, the threshold parameter \(\theta \) regulates the sparsity of \(\varvec{W}\) such that, as it becomes smaller, the rows of \(\varvec{W}\) get sparser, so that only a subset of the features is utilized. Moreover, the work of [47], which proposed GO-MTL (Grouping and Overlap in Multi-Task Learning), had previously used the \(L_1\)-norm to impose sparsity on a matrix \(\varvec{S} \in {R^{ k \times T}}\) containing the weights of a linear combination for each task. This approach enforces that each observed task is obtained from only a few of the k latent tasks. The weight matrix \(\varvec{W}\) can then be calculated as \(\varvec{W} = {\varvec{L}}{\varvec{S}}\), where \(\varvec{L}\) is a matrix of size \(d \times k\) with each column representing a latent task.
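The penalties of Eqs. (4) and (5) and the GO-MTL factorization can be illustrated numerically. The sketch below only evaluates the penalty terms on a small hand-made matrix; it assumes the capped term is the minimum of the row-wise \(L_1\)-norm and \(\theta \), and the function names and example values are hypothetical.

```python
import numpy as np

def l1_penalty(W):
    """Entrywise L1 norm used in Eq. (4)."""
    return float(np.abs(W).sum())

def capped_l1_penalty(W, theta):
    """Capped penalty of Eq. (5): sum over rows j of min(||w_j||_1, theta)."""
    row_l1 = np.abs(W).sum(axis=1)
    return float(np.minimum(row_l1, theta).sum())

# d = 3 features (rows), T = 2 tasks (columns); the second feature is unused.
W = np.array([[3.0, -1.0],
              [0.0,  0.0],
              [0.2, -0.1]])

plain = l1_penalty(W)                    # 4 + 0 + 0.3
capped = capped_l1_penalty(W, 1.0)       # min(4, 1) + 0 + min(0.3, 1)
uncapped = capped_l1_penalty(W, 10.0)    # theta above every row norm

# GO-MTL style factorization W = L S: each task (column of S) combines
# only a few of the k latent tasks (columns of L).
L = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])               # d x k
S = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])          # k x T, sparse
W_go = L @ S
```

Once a row norm exceeds \(\theta \), the capped penalty stops growing, so large rows are not shrunk further while small rows are still pushed toward zero.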

4.1.2 \(L_{2,1}\)-Norm for Group Sparsity

The \(L_{2,1}\)-norm is a variant of the \(L_2\)-norm that can be applied for group feature selection. This is because the \(L_{2,1}\)-norm can capture task relatedness through a shared representation of a similar set of features among related tasks. Accordingly, Eq. (4) can be extended for group sparsity based on the \(L_{2,1}\)-norm as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \parallel \varvec{W} {\parallel _{2,1}}, \end{array} \end{aligned}$$
(6)

MTL research works based on the \(L_{2,1}\)-norm include [4, 5, 19, 26, 49, 51, 55, 64]. In particular, Argyriou et al. [5] proposed a convex optimization problem based on the \(L_{2,1}\)-norm as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{A, U}}} \sum \limits _{t = 1}^T\sum \limits _{i = 1}^{m} {\mathcal {L}} (y_{ti},\langle a_t,\varvec{U^T} x_{ti}\rangle ) + \gamma \parallel \varvec{A} {\parallel _{2,1}^2} : \varvec{A} \in {R^{ d \times T}}, \end{array} \end{aligned}$$
(7)

Basically, Eq. (7) is formed under the assumption that all related tasks share a small set of N features, with \(N \le d\). This means that matrix \(\varvec{A}\) will have many zero rows, corresponding to the columns of matrix \(\varvec{U}\) (the irrelevant features) not required by any task. Therefore, to learn the N required features, the \(L_{2,1}\)-norm regularization is introduced to ensure that matrix \(\varvec{A}\) has a small number of non-zero rows. Besides, by removing matrix \(\varvec{U}\), it is clear that Eq. (7) is comparable to the method proposed in [64], which uses the \(L_{2,1}\)-norm to select a subset of features that is good for all tasks. However, in a single-task scenario, this method reduces to an \(L_1\)-norm approach. Furthermore, Li et al. [49] applied the \(L_{2,1}\)-norm to survival analysis in an MTL scenario by means of a single base kernel. This single-kernel approach was extended to multiple kernels in [19] to demonstrate that survival analysis in MTL can benefit more by capturing a shared representation through more gene data sources. Hence, an additional data source (pathways/gene datasets) is incorporated to identify survival-related molecular mechanisms, such that one kernel is used for the cancer survival benchmark dataset and another kernel for the pathways/gene dataset. This approach, however, does not utilize the \(L_{2,1}\)-norm as the regularization term. Besides, for \(p,q \ge 1\), Eq. (6) can be generalized to the \(L_{p,q}\)-norm as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \parallel \varvec{W} {\parallel _{p,q}}, \end{array} \end{aligned}$$
(8)

where the \(L_p\)-norm is applied on the rows, followed by the \(L_q\)-norm on the vector of row norms. The variants of the \(L_{p,q}\)-norm include the \(L_{p,1}\)-norm and the capped \(L_{p,1}\)-norm, where the \(L_{p,1}\)-norm is the same as the \(L_{2,1}\)-norm in Eq. (6) if \(p=2\). Thus, like Eq. (5), a capped \(L_{p,1}\)-norm can be obtained as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \sum \limits _{j = 1}^d \min (\parallel w_j {\parallel _p}, \theta ): \varvec{W} \in {R^{ d \times T}}. \end{array} \end{aligned}$$
(9)

In any case, when the threshold parameter \(\theta \) in Eq. (9) becomes very large, the capped \(L_{p,1}\)-norm reduces to the \(L_{p,1}\)-norm. To conclude this section, it is worth mentioning that, aside from the norms discussed above, the cluster norm [35] and the k-support norm [59] can also be utilized to learn a better similarity weight matrix. Moreover, to obtain a more accurate similarity between tasks, the multitask learning problem was previously formulated as a multiple kernel learning (MKL) [53] problem by Widmer et al. [82] using a q-norm MKL algorithm. This approach was shown to outperform similar baseline methods.
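The row-wise construction of the \(L_{p,q}\)-norm in Eq. (8) and its capped variant in Eq. (9) can be checked on a small example. The sketch below follows the convention stated above (the \(L_p\)-norm on each row, then the \(L_q\)-norm on the vector of row norms) and assumes the capped term is a minimum with \(\theta \); the function names are hypothetical.

```python
import numpy as np

def lpq_norm(W, p, q):
    """L_{p,q} norm: L_p norm of each row, then L_q norm of the row norms."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return float(np.linalg.norm(row_norms, ord=q))

def capped_lp1_norm(W, p, theta):
    """Capped L_{p,1} penalty of Eq. (9): sum_j min(||w_j||_p, theta)."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return float(np.minimum(row_norms, theta).sum())

# Rows are features, columns are tasks; the row L2 norms are 5, 0 and 1.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.6, 0.8]])

l21 = lpq_norm(W, p=2, q=1)                        # 5 + 0 + 1
l22 = lpq_norm(W, p=2, q=2)                        # equals the Frobenius norm
capped = capped_lp1_norm(W, p=2, theta=2.0)        # min(5, 2) + 0 + 1
large_theta = capped_lp1_norm(W, p=2, theta=1e6)   # recovers the L_{2,1} norm
```

The all-zero second row contributes nothing to any of these penalties, which is the sense in which group sparsity discards a feature for every task at once.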

4.2 Low-Rank

Theoretically, the weight matrix \(\varvec{W}\) can be assumed to be low-rank since tasks are usually related and have similar model parameters. To obtain a low-rank \(\varvec{W}\), one can solve the following nuclear norm-based optimization problem:

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} \sum \limits _{t = 1}^T {\mathcal {L}} \left( S_t, w_t \right) + \lambda \parallel \varvec{W} {\parallel _{*}}, \end{array} \end{aligned}$$
(10)

where \( \varvec{W} \in {R^{ d \times T}}\) denotes the weight matrix and \(\parallel . {\parallel _{*}}\) denotes the nuclear norm [1, 2]. The methods in [3, 14, 43, 67] use an approach similar to Eq. (10) to find a low-rank representation of \(\varvec{W}\). For example, reference [3] proposed a non-convex formulation to learn a low-dimensional subspace shared between multiple related tasks under the assumption that all related tasks have similar model parameters. As such, the weight vector of the t-th task can be obtained as follows

$$\begin{aligned} \begin{array}{c} w_t =u_t+ \Theta ^T v_t, \end{array} \end{aligned}$$
(11)

where \(u_t\) is the learned weight vector of the t-th task, \(\varvec{\varTheta }\) is the low-rank representation of \(\varvec{W}\) and \(v_t\) is the bias for the t-th task. Since the formulation in [3] is non-convex, it is difficult to solve, especially when the feature space is highly correlated. Hence, to relax the non-convex approach, Chen et al. [13] proposed a convex formulation that is much easier to solve. It uses the \(L_{2,2}\)-norm (the special case \(p = q = 2\) of the \(L_{p,q}\)-norm in Eq. (8)), also known as the Frobenius norm or the Hilbert–Schmidt norm, to penalize eigenvalues. This approach, however, uses complex constraints, so it does not scale to large data sets. Furthermore, to learn a better low-rank matrix, [29] extended the idea in [3, 67] by introducing a capped trace norm regularization as follows

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{W}}} {\mathcal {L}} \left( \varvec{W} \right) + \lambda \sum \limits _{i = 1}^R \min (\sigma _i(\varvec{W}), \tau ): \varvec{W} \in {R^{ d \times T}}, \end{array} \end{aligned}$$
(12)

where \(\sigma _i (\varvec{W})\), \(i = 1, \ldots , R\), denote the singular values of \(\varvec{W}\) in non-increasing order. Noticeably, Eq. (12) is like the capped \(L_{p,1}\)-norm in Eq. (9) because it reduces to its uncapped counterpart, Eq. (10), when the threshold parameter \(\tau \) becomes very large, i.e., \(\tau \rightarrow \infty \). However, these approaches cannot guarantee robust classification results when the data originate from nonlinear subspaces. Therefore, several kernel-based approaches have been proposed over the years to tackle this issue. For example, to handle multiple features from the variational mode decomposition (VMD) domain, He et al. [31] proposed kernel low-rank multitask learning (KL-MTL). KL-MTL uses the low-rank representation (LRR) [50] nuclear norm strategy to capture the global structure of multiple tasks; this approach is then extended to nonlinear low-rank multitask learning using the kernel trick. Besides, the KL-MTL approach was further expanded in [32] to handle 2-D variational mode decomposition (2-D-VMD). Subsequently, Tian et al. [78] proposed a nonparametric multitask learning method, which measures task relatedness in a reproducing kernel Hilbert space (RKHS). Specifically, the multitask learning problem is formulated as a linear combination of common eigenfunctions shared by different tasks and each individual task's unique eigenfunctions. In this way, each task's eigenfunctions can provide additional information to the other tasks so as to improve generalization performance.
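The nuclear norm of Eq. (10) and the capped trace norm of Eq. (12) are both functions of the singular values of \(\varvec{W}\), which the following sketch computes with a singular value decomposition. The function names and example matrices are assumptions made for illustration only.

```python
import numpy as np

def nuclear_norm(W):
    """Nuclear (trace) norm of Eq. (10): the sum of the singular values."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

def capped_trace_norm(W, tau):
    """Capped trace norm of Eq. (12): sum_i min(sigma_i(W), tau)."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return float(np.minimum(sigma, tau).sum())

# A d x T matrix with singular values 3 and 1.
W = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

full = nuclear_norm(W)                   # 3 + 1
capped = capped_trace_norm(W, 2.0)       # min(3, 2) + min(1, 2)
large_tau = capped_trace_norm(W, 1e6)    # recovers the nuclear norm

# A rank-one weight matrix: all task weight vectors are multiples of u.
u = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 2.0, 2.0])
W_rank1 = np.outer(u, v)
rank1_norm = nuclear_norm(W_rank1)       # ||u||_2 * ||v||_2
```

For the rank-one matrix there is a single non-zero singular value, so the nuclear norm equals the product of the two vector norms.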

Fig. 4 Task clustering method

4.3 Clustering

As we saw in Sect. 1, pairwise relationships can exist among tasks; for example, Task A may be related only to Task B, and Task C only to Task D. Thus, the clustering method can be used to learn model parameters by placing all related but separate tasks in the same cluster, where they are co-learned. Accordingly, the works of [7, 76] proposed methods that obtain the model parameters by clustering tasks (see Fig. 4 for an illustration) into groups of related tasks based on prior knowledge obtained in the single-task setting. However, the downside is that this two-stage approach may learn suboptimal model parameters, resulting in poor generalization performance for all tasks. To address this weakness, Kang et al. [44] proposed a method that can determine the pairwise relationships existing between tasks while obtaining their parameters. This is achieved by solving the single optimization problem below.

$$\begin{aligned} \begin{array}{c} \varvec{W^*}={\min \limits _{}} \sum \limits _{t} {\mathcal {L}} \left( D_t, w_t \right) + \gamma \sum \limits _{g}\parallel \varvec{W_g} {\parallel _{*}^2}, \end{array} \end{aligned}$$
(13)

where g indexes the G clusters available for all tasks and \(\varvec{W_g}\) denotes the weight matrix of the tasks in the g-th cluster. With this formulation, the tasks in the same cluster are co-learned, separately from the tasks in other clusters. Writing \(\varvec{W_g} = \varvec{WQ_g}\), its trace norm is obtained as follows

$$\begin{aligned} \begin{array}{c} \parallel \varvec{WQ_g} {\parallel _{*}}=\mathrm {Trace}\left( \left[ \varvec{WQ_g}(\varvec{WQ_g})^T\right] ^{\frac{1}{2}}\right) , \end{array} \end{aligned}$$
(14)

where \(\varvec{Q}\) is the group assignment matrix composed of \( q_{gt} \in {\{0,1}\} \). That is, \(q_{gt} = 1\) if the t-th task is assigned to the g-th cluster and \(q_{gt} = 0\) otherwise. Then \(\varvec{Q_g} \in {R^{ T \times T}}\) is a diagonal matrix with the \(q_{gt}\) as its diagonal elements. This method is very effective because the fact that tasks are related does not automatically imply that successful sharing will occur between them. Furthermore, Jacob et al. [36] introduced a new spectral norm that encodes the a priori assumption that tasks within a group have similar weight vectors, without prior knowledge of the task grouping. This approach was shown to outperform similar state-of-the-art methods. Subsequently, reference [16] proposed a method for learning a small pool of shared hypotheses in the context where many related tasks exist with few examples. Each task is then mapped to a single hypothesis in the learned pool (associating it with other related tasks), thus avoiding the inherent error that may occur when learning all the tasks together using a single hypothesis.
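The regularizer of Eq. (13) can be evaluated for a fixed group assignment, as in the following sketch. It assumes that \(\varvec{W_g} = \varvec{WQ_g}\), with \(\varvec{Q_g}\) built from the g-th row of a binary assignment matrix, and it evaluates the penalty only (the assignment itself is not learned here). The function names and the toy assignment are hypothetical.

```python
import numpy as np

def trace_norm_via_sqrt(M):
    """Trace of the matrix square root of M M^T, as written in Eq. (14)."""
    eigvals = np.linalg.eigvalsh(M @ M.T)
    return float(np.sqrt(np.clip(eigvals, 0.0, None)).sum())

def cluster_trace_penalty(W, Q):
    """Sum over clusters g of ||W Q_g||_*^2 for a binary G x T assignment Q."""
    total = 0.0
    for q_g in Q:
        W_g = W @ np.diag(q_g)   # zero the columns of tasks outside cluster g
        total += np.linalg.svd(W_g, compute_uv=False).sum() ** 2
    return float(total)

# d = 3 features, T = 4 tasks; tasks 1-2 form cluster 1, tasks 3-4 cluster 2.
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
Q = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

penalty = cluster_trace_penalty(W, Q)

# Check Eq. (14) on the first cluster: both routes give the trace norm.
W_1 = W @ np.diag(Q[0])
via_svd = float(np.linalg.svd(W_1, compute_uv=False).sum())
via_sqrt = trace_norm_via_sqrt(W_1)
```

In this example the first cluster contains two identical weight vectors (rank one, trace norm \(\sqrt{2}\)), while the second contains two orthogonal ones (trace norm 5), so the penalty is 2 + 25 = 27.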

4.4 Decomposition

The decomposition method divides the weight matrix \(\varvec{W}\) into two or more component matrices (e.g., \(\varvec{W} = \varvec{D}+\varvec{B}\)), each of which can be penalized independently. This method exists in two main variants: the dirty and the multilevel methods. While the dirty method decomposes the weight matrix into exactly two component matrices, as shown in Fig. 5, the multilevel method decomposes the weight matrix into two or more component matrices. In this way, each component matrix can capture a different aspect of the task relationship. Many MTL studies, such as [37, 38, 39, 93, 96], utilize this technique. Illustratively, the least squares convex optimization problem proposed in [38] is given as

Fig. 5 Dirty decomposition method

$$\begin{aligned} \begin{array}{c} {\min \limits _{\varvec{S,B}}} \frac{1}{2n}\sum \limits _{k = 1}^r \parallel y_k-X_k\left( S_k+B_k \right) {\parallel _{2}^2} + \lambda _s\parallel S {\parallel _{1,1}} +\lambda _B\parallel \varvec{B} {\parallel _{1,\infty }}, \end{array} \end{aligned}$$
(15)

where matrix \(\varvec{\varTheta } \in {R^{ p \times r}} = \varvec{B}+\varvec{S}\) based on the assumption that a certain number of rows in \(\varvec{\varTheta }\) matrix will contain large non-zero entries, which correspond to the feature shared across various tasks. Accordingly, some rows in \(\varvec{\varTheta }\) matrix will also contain all-zero entries, which correspond to irrelevant features not needed by any task, while some rows will have elementwise sparseness corresponding to those features that are only relevant to some tasks but not all. Thus, matrices \(\varvec{B}\) and \(\varvec{S}\) capture a different aspect of relationship such that \(\varvec{S}\) captures elementwise sparsity whereas \(\varvec{B}\) captures row-wise sparsity with different regularization on both. Jalali et al. [37] then extended [38] by proposing a new forward–backward greedy procedure for the dirty model. The suggested technique identifies the best single variable and best row variable in each forward step that gives the largest incremental drop in the loss function. In contrast, it looks for the variable whose removal leads to the smallest incremental loss function rise in each backward step. Besides, a new adaptive method for multiple sparse linear regression was presented by Jalali et al. [39]. This approach was conceived by examining the multiple sparse linear regression problem, which entails recovering several related sparse vectors simultaneously. Thus, when there is support and parameter overlap, the proposed method takes advantage of it but does not pay the penalty when there isn’t.

4.5 Tasks Similarity Learning

In this context, the pairwise relationships between tasks are learned directly from the data through a common model. As an example, relying on the formulation in Eq. (1), the approach proposed in [22] is as follows

$$\begin{aligned} \begin{array}{c} {{\min \limits _{w_0,v_t}} \sum \limits _{t = 1}^T\sum \limits _{i = 1}^{N_t} \xi _{ti} + \frac{\lambda _1}{T}\sum \limits _{t = 1}^T\parallel v_t {\parallel ^2} + \lambda _2\parallel w_0 {\parallel ^2}},\\ s.t.,{} {} {y_{ti}( w_0 + v_t)\cdot x_{ti} \ge 1-\xi _{ti}, \xi _{ti}\ge 0, \forall t,\forall i}, \end{array} \end{aligned}$$
(16)

where \(w_t\) denotes \(w_0+v_t\), \(v_t\) is the task-specific weight vector of task t, and \(w_0\) is the common model shared by the tasks. Regularization is imposed on \(w_0\) (which captures the similarity between tasks), while the size of each \(v_t\) is controlled simultaneously, which constrains how much the \(w_t\) vary from one another and keeps each \(w_t\) close to the mean function \(w_0\). Essentially, \(v_t\) is small when tasks are related, whereas when \(w_0 \rightarrow 0\), Eq. (16) reduces to an independent-task problem where \(w_t =v_t\). To further improve learning accuracy, Ji and Sun [42] extended the idea of [22] to non-linear MTL with a different task-specific base kernel. Because most previous multitask multiclass learning approaches decompose a multitask multiclass problem into multiple multitask binary problems, they do not completely capture the inherent correlations between classes. Therefore, a method was presented that learns multitask multiclass problems directly and efficiently by using a quadratic objective function to cast them as a constrained optimization problem. Meanwhile, to capture negative task correlation and identify outlier tasks, Zhang et al. [92] proposed a method that captures task relationships through a prior task covariance matrix, obtained via a trace regularizer on the weight matrix \(\varvec{W}\) as follows

$$\begin{aligned} \begin{array}{c} { tr(\varvec{W}\Omega ^{-1}\varvec{W}^T)}, \end{array} \end{aligned}$$
(17)

where \(tr(\cdot )\) denotes the trace of a square matrix and \(\Omega \) denotes a positive semidefinite (PSD) task covariance matrix. Eq. (17) is the same as the matrix fractional function given as:

$$\begin{aligned} \begin{array}{c} \sum \limits _{t} \varvec{W}(t,:)\Omega ^{-1}\varvec{W}(t,:)^T, \end{array} \end{aligned}$$
(18)

where \(\varvec{W}(t,:)\) denotes the t-th row of the matrix \(\varvec{W}\). By obtaining the Hessian matrix of \( \varvec{W}(t,:)\Omega ^{-1}\varvec{W}(t,:)^T\), Eq. (18) can be proved to be jointly convex w.r.t. \( \varvec{W}\) and \(\Omega \). Murugesan and Carbonell [62] extended the single kernel-based approach in [92] with task-specific multiple base kernels and proposed a method named Multitask Multiple Kernel Relationship Learning (MK-MTRL). MK-MTRL’s main idea is to automatically infer task relationships in the RKHS, similar to the approach proposed in [32]. However, unlike [32], the MK-MTRL formulation allows prior knowledge to be incorporated to aid the simultaneous learning of several related tasks. Besides, Ruiz et al. [71] proposed a convex approach that captures task relationships by imposing a convex penalty on both the task-specific weight \(\parallel v_t {\parallel _{}^2} \) and the common part \(\parallel u {\parallel _{}^2} \).
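As a rough numerical illustration of the two formulations above, the sketch below first evaluates the objective of Eq. (16) at a given \(w_0\) and \(v_t\), taking each slack at its smallest feasible value \(\xi _{ti} = \max (0, 1 - y_{ti}(w_0+v_t)\cdot x_{ti})\), and then checks on a toy example that the trace form of Eq. (17) agrees with the row-wise sum of Eq. (18). The function names, the toy data, the choice of \(\Omega \) and the values of \(\lambda _1\) and \(\lambda _2\) are all assumptions made for illustration; no optimization is performed.

```python
# Sketch only: helper names and toy numbers are illustrative assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def eq16_objective(w0, V, tasks, lambda1, lambda2):
    """Objective of Eq. (16) with each slack at its smallest feasible value,
    xi_ti = max(0, 1 - y_ti * (w0 + v_t) . x_ti)."""
    T = len(tasks)
    slack_total = 0.0
    for v_t, samples in zip(V, tasks):
        w_t = [a + b for a, b in zip(w0, v_t)]  # w_t = w0 + v_t
        for x, y in samples:
            slack_total += max(0.0, 1.0 - y * dot(w_t, x))
    return (slack_total
            + (lambda1 / T) * sum(dot(v, v) for v in V)
            + lambda2 * dot(w0, w0))

def eq17_trace(W, omega_inv):
    """tr(W Omega^{-1} W^T): form W Omega^{-1}, then sum the diagonal
    entries of its product with W^T."""
    WO = [[dot(row, col) for col in zip(*omega_inv)] for row in W]
    return sum(dot(WO[t], W[t]) for t in range(len(W)))

def eq18_sum(W, omega_inv):
    """Sum over rows t of W(t,:) Omega^{-1} W(t,:)^T."""
    n = len(omega_inv)
    return sum(row[i] * omega_inv[i][j] * row[j]
               for row in W for i in range(n) for j in range(n))

# Eq. (16): two toy binary tasks with 2-dimensional inputs, labels in {-1, +1}.
tasks = [
    [([1.0, 0.0], 1), ([-1.0, 0.0], -1)],
    [([1.0, 0.5], 1), ([-0.5, -0.5], -1)],
]
w0 = [1.0, 0.0]                # common model
V = [[0.0, 0.0], [0.0, 0.5]]   # task-specific offsets v_t
objective = eq16_objective(w0, V, tasks, lambda1=1.0, lambda2=0.5)

# Eqs. (17)-(18): Omega = [[2, 1], [1, 2]], so its inverse is (1/3) [[2, -1], [-1, 2]].
omega_inv = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
trace_value = eq17_trace(W, omega_inv)
row_sum_value = eq18_sum(W, omega_inv)

print("Eq. (16) objective:", objective)
print("Eq. (17):", trace_value, "Eq. (18):", row_sum_value)
```

Setting every entry of `V` to zero forces all tasks onto the common model `w0`, whereas setting `w0` to zero leaves each task with only its own `v_t`, which corresponds to the independent-task case discussed above.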

Also, Williams et al. [83] employed Gaussian processes to learn a task-similarity matrix with a block-diagonal structure that captures inter-task correlations by assuming tasks are ordered with regard to clusters. In addition, the work of [20] proposed a kernel-based method for automatically revealing structural inter-task relationships, which extends the low-rank output kernels strategy initially introduced in [21] to a multi-task environment. This approach uses a properly weighted loss, allowing several datasets with different input sampling patterns to be used. Alternatively, some efforts were made in [23, 45] to capture the similarity between tasks using the graph Laplacian strategy, thereby guaranteeing that all tasks in the same cluster will have identical model parameters. Meanwhile, other efforts, such as [14, 30, 78], combined the ideas of numerous SSMTL techniques to increase generalization performance across all tasks.
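One common way to write a graph Laplacian penalty, given here only as a hedged sketch and not necessarily the exact formulation of [23, 45], is to weight the squared distance between the parameter vectors of every pair of tasks by an adjacency matrix \(A\). With \(L = D - A\), where \(D\) is the diagonal degree matrix, this pairwise sum equals \(\sum _{s,t} L_{st}\, w_s \cdot w_t\). The adjacency matrix, the task weights and the function names below are assumptions for illustration.

```python
# Sketch of a graph Laplacian penalty over task weight vectors.
# A is an assumed symmetric task adjacency matrix with a zero diagonal.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def laplacian(A):
    """L = D - A, where D holds the row sums of A on its diagonal."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def pairwise_penalty(weights, A):
    """Sum over task pairs s < t of A[s][t] * ||w_s - w_t||^2."""
    total = 0.0
    for s in range(len(weights)):
        for t in range(s + 1, len(weights)):
            diff = [a - b for a, b in zip(weights[s], weights[t])]
            total += A[s][t] * dot(diff, diff)
    return total

def laplacian_penalty(weights, A):
    """Sum over all s, t of L[s][t] * (w_s . w_t)."""
    L = laplacian(A)
    n = len(weights)
    return sum(L[s][t] * dot(weights[s], weights[t])
               for s in range(n) for t in range(n))

# Tasks 0 and 1 are linked (same cluster); task 2 is not linked to any task.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
weights = [[1.0, 0.0], [1.0, 0.5], [-2.0, 3.0]]

print(pairwise_penalty(weights, A), laplacian_penalty(weights, A))
```

Under these assumptions the penalty vanishes only when linked tasks share identical parameters, while an unlinked task such as task 2 contributes nothing regardless of its weights.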

Table 1 A brief performance comparison of clustering, decomposition and tasks coupling methods using office-Caltech and MHC-I datasets with respect to classification error evaluation metric

Table 1 gives a brief performance comparison of clustering, decomposition and tasks coupling methods on the Office-Caltech [38] and MHC-I [36] datasets.

5 Non-shallow Approach to SMTL

So far, we have focused on the supervised shallow approach to MTL, in which features are handcrafted according to the target problem. In a supervised deep learning paradigm, however, the best feature representation can be derived directly from the data using deep learning algorithms such as convolutional neural networks.

Fig. 6 Hard parameter-based sharing of the hidden layers

Ruder [70] classified the deep efforts in MTL into hard and soft parameter sharing of hidden layers. The hard parameter sharing method shown in Fig. 6 shares the hidden layers across several related tasks while keeping task-specific output layers. In contrast, the soft parameter sharing method assigns each task its own model with its own parameters. As a result, one can liken the soft parameter sharing approach to the shallow-based approach; to capture relationships across multiple related tasks, however, it measures the distance between the parameters of the different but related tasks, and this distance is then regularized to encourage similarity.
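The difference between the two sharing schemes can be sketched in a few lines. In the hypothetical example below, hard sharing is a single hidden layer reused by every task, followed by one output layer per task. Soft sharing is represented only by its regularizer, taken here to be a squared Euclidean distance between the hidden-layer weights of each pair of tasks. The layer sizes, the ReLU activation, the distance measure, the `strength` parameter and all numbers are assumptions for illustration, not details taken from [70].

```python
# Sketch of hard versus soft parameter sharing with plain Python lists.
# Layer sizes, activation, distance measure and numbers are assumptions.

def linear(W, x):
    """Matrix-vector product: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def hard_sharing_forward(shared_W, heads, x):
    """One shared hidden layer feeds a separate output layer per task."""
    hidden = relu(linear(shared_W, x))
    return [linear(head, hidden) for head in heads]

def soft_sharing_penalty(task_hidden_weights, strength):
    """Squared distance between the hidden-layer weights of each task pair,
    scaled by a regularization strength."""
    total = 0.0
    n = len(task_hidden_weights)
    for a in range(n):
        for b in range(a + 1, n):
            for row_a, row_b in zip(task_hidden_weights[a], task_hidden_weights[b]):
                total += sum((p - q) ** 2 for p, q in zip(row_a, row_b))
    return strength * total

# Hard sharing: both tasks reuse shared_W and differ only in their heads.
shared_W = [[1.0, -1.0], [0.5, 0.5]]
heads = [[[1.0, 0.0]], [[0.0, 2.0]]]  # one single-output head per task
outputs = hard_sharing_forward(shared_W, heads, x=[2.0, 1.0])

# Soft sharing: each task keeps its own hidden weights, pulled together by a penalty.
task_a_W = [[1.0, -1.0], [0.5, 0.5]]
task_b_W = [[1.0, -1.0], [0.5, 1.5]]
penalty = soft_sharing_penalty([task_a_W, task_b_W], strength=0.1)

print(outputs, penalty)
```

In a training loop this penalty would be added to the sum of the per-task losses, so that a larger `strength` pushes the task models closer together, while a value of zero leaves the tasks fully independent.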

Fig. 7 Shallow versus deep learning pipeline

MTL studies based on the deep approach include [24, 61, 69, 79, 84], with application areas such as computer vision [24, 69], speech synthesis [84] and bioinformatics-neuroanatomy [61, 79]. In addition, [70, 77] gave extensive overviews of deep MTL methods. Fig. 7 shows a graphical comparison of shallow and deep learning methods, while Table 2 gives a brief comparative analysis of both.

Table 2 A brief comparative analysis of shallow and deep learning approaches

6 Challenges and Future Research Direction of SMTL

MTL aims to improve generalization performance by leveraging common information shared between related tasks. This way, the loss function is minimized on all similar tasks to obtain a unified model that generalizes to new tasks. At present, many studies have shown that MTL can provide robust improvement over single-task learning. Nonetheless, the generalization performance can degrade if a new task is unrelated to, or is an outlier with respect to, the tasks the model was trained on. Besides, many existing MTL methods cannot guarantee that a trained unified MTL model will outperform the single-task model on all tasks, because outlier tasks can contribute negatively to learning the common information between related tasks. Although Zhang and Yang [88] suggested an approach to tackle the first issue, it is not realistic in most real-world scenarios. For instance, while detecting when a new task is not well matched with the trained MTL model may be feasible, training another model that matches the outlier task(s) then presents a new challenge. To address these concerns holistically, there is a need to explore the combination of MTL and ensemble learning to learn common information shared between related tasks. This approach will improve generalization performance and further reduce the complexity of training a strong task-specific model. Besides, task embeddings for MTL will be a fascinating area of future research. Here, task consistency can be addressed so as to preserve the geometric structure and information in each task to the greatest extent possible, which helps in learning a robust model that generalizes to newer tasks.

7 Conclusion

Most research work done on MTL has focused on supervised learning, with several experimental results showing that MTL is effective. Nevertheless, MTL based on unsupervised and reinforcement learning has recently gained more attention. Besides, few attempts exist to extend MTL to the semi-supervised learning paradigm so that MTL can benefit from incomplete data. In this paper, existing supervised shallow-based MTL methods are reviewed, with specific attempts to present these methods without sophisticated mathematical deductions. Moreover, efforts were made to explain the concept of MTL with basic examples to avoid ambiguity for readers.