
1 Introduction

Substantial advances have been achieved in visual representation learning, such as those based on curated large-scale image datasets with supervised [30, 59], weakly-supervised [29, 41], semi-supervised [65, 66], and self-supervised [7, 11, 12, 21, 25] pre-training. These visual representations have shown promise in improving performance on downstream tasks.

Among these pre-training techniques, supervised pre-training is widely adopted for its clear objective and stable training process. Nevertheless, existing works in this direction only consider an individual upstream task (e.g., classification or detection), and most of them solely utilize a single data source (e.g., ImageNet [13] or COCO [39]). We argue this single-source single-task (SSST, Fig. 1 (a)) paradigm has several drawbacks: 1) The learned representation in SSST is specialized for one given task and is likely to have inferior performance on other tasks [19, 26, 44, 55, 56]. 2) It misses the potential of a more robust representation obtained by integrating characteristic semantic information from different tasks. Intuitively, we can opt for a simple hard-sharing method, i.e., the single-source multi-task (SSMT) paradigm described in Fig. 1 (b), by building many heads, each of which is specific to one task [24, 55]. However, this over-simplified algorithm usually encounters task interference [43, 73], especially for heterogeneous tasks, leading to a significant drop in performance. Besides, it requires the same image to carry a variety of labels [71, 72], which does not scale easily due to the high annotation cost. A recent self-training work [19] attempts to create a pseudo multi-task dataset to alleviate the data-scarcity issue of multi-task learning, which follows a similar spirit to other SSMT works.

In light of the issues with previous settings, we focus on utilizing numerous data sources across multiple tasks to learn a universal visual representation that transfers well to various downstream tasks such as classification, object detection and semantic segmentation. To leverage cross-source, cross-task information and mitigate undesired task interference, we propose a new pre-training paradigm, X-Learner, as shown in Fig. 1(c). X-Learner contains two dedicated stages: 1) Expansion Stage: It first trains a set of sub-backbones, each of which specifically exploits one task enriched with multiple sources. It then joins these sub-backbones together and combines their representational knowledge via our proposed reconciliation layer, forming an expanded backbone with enhanced modeling capacity. 2) Squeeze Stage: Given the expanded backbone, this stage reduces the model complexity back to the sub-backbone level and produces a unified and compact multi-task-aware representation. This new paradigm has two main advantages: 1) It can effectively consolidate diverse knowledge from our new multi-source multi-task learning and avoid task conflicts. The resulting representation generalizes well to different types of tasks simultaneously. 2) Compared to traditional multi-task methods, it is highly extensible with new tasks and sources, since we only require data sources annotated with single-task labels.

Fig. 1. a) Single-Source Single-Task; b) Single-Source Multi-Task; c) X-Learner: Multi-Source Multi-Task; d) Our proposed X-Learner achieves the best performance in Classification (average linear-probe results across 10 classification datasets), Detection (PASCAL VOC Detection [15]) and Segmentation (PASCAL VOC Semantic Segmentation [15]).

Our contributions are summarized as follows:

  • We propose a new multi-source multi-task learning setting that only requires a single-task label per datum, and is highly scalable with more tasks and sources without requiring any extra annotation effort.

  • We present X-Learner, a general framework for learning a universal representation from supervised multi-source multi-task learning, with Expansion Stage and Squeeze Stage. Task interference can be well mitigated by Expansion Stage, while a compact and generalizable model is produced by Squeeze Stage. With X-Learner, heterogeneous tasks can be jointly learned, and the resulting single model renders a universal visual representation suitable for various tasks.

  • We show the strong transfer ability of feature representations learned by our X-Learner. In terms of transfer learning performance, multi-source multi-task learning with our two-stage design outperforms traditional supervised single/multi-task training, self-supervised learning and self-training methods. As illustrated in Fig. 1(d), a model pre-trained with X-Learner exhibits significant gains (3.0%, 3.3% and 1.8%) over the ImageNet supervised counterpart on downstream image classification, object detection and semantic segmentation.

  • We offer several new insights into representation learning and the framework design for multi-task and multi-source learning through extensive experiments.

2 Related Work

Visual Representation Learning. Significant progress has been made in the field of visual representation learning, including unsupervised methods [10, 11, 14, 25, 47, 49], supervised training [30, 59], weakly-supervised learning [29, 41], and semi-supervised learning [65, 66]. A large number of prior works use supervised datasets, including ImageNet-1K [31], ImageNet-21K [52], IG-3.5B-17k [41] and JFT [30], for learning visual representations. In supervised pre-training, labeled training data provide a significant improvement in transfer performance on the same task as the one for which the data are annotated. However, the resulting representations transfer less well across different task types [57]. In unsupervised learning, [49] focuses on multi-modal vision-language pre-training and achieves strong performance in classification, but does not do well on other visual tasks such as detection [22]. To obtain uniformly high transfer performance on diverse task types, it is important to improve the task diversity of the training data, justifying the necessity of multi-task pre-training.

Fig. 2. Structure of X-Learner. a) illustrates how reconciliation layers make the features from different tasks interact with each other. We use \(\gamma \) to represent the reconciliation layer. We present two typical kinds of connections formed by reconciliation layers: across different tasks and across multiple layers; b) Features for different tasks are learned in the Expansion Stage and unified in the Squeeze Stage. After the two stages, X-Learner obtains a general representation for transferring to downstream tasks.

Multi-task Learning. There has been substantial interest in multi-task learning [4, 8, 23, 40, 50, 62, 72, 74, 77] in the community. A common practice for multi-task learning is to share the hidden layers of a backbone model across different tasks, which is called “hard-sharing” in the literature. However, such sharing is not always beneficial and in many cases hurts performance [23, 63, 69, 70]. To alleviate this, several lines of works address the problem in different ways. One of them uses a split architecture with parallel backbones for different tasks [18, 40, 45]. [45] proposes a cross-stitch module, which intelligently combines task-specific networks, avoiding the need to brute-force search through numerous architectures. Another line of works improves optimization during learning [35, 63, 69, 70]. For example, [70] mitigates gradient interference by altering the gradients directly, i.e., performing “gradient surgery”. [63] addresses interference by de-conflicting gradients via projection. [35, 36] use distillation to avoid interference, but they are limited to a restricted setting, either single-task multi-source or single-source multi-task. Other works attempt to develop systematic techniques for determining which tasks should be trained together in a multi-task neural network to avoid harmful conflicts between non-affinitive tasks [1,2,3, 17, 34]. These methods perform multi-task learning to improve the performance of the tasks involved, but they are not concerned with transfer performance on downstream tasks. [37] applies vision transformers to multiple modalities and achieves impressive performance. For the image modality, however, it deals with the classification task only and learns in a simple hard-sharing way, so the core problem of multi-task learning remains. A recent work [19] turns to semi-supervised learning and constructs cross-task pseudo labels with task-specific teachers, creating a complete multi-task dataset for pre-training. Yet it only considers the single-source setting, and its student training still follows a hard-sharing regime.

3 X-Learner

In this section, we introduce X-Learner, which leverages multiple vision tasks and various data sources to learn a unified representation that transfers well to a wide range of downstream tasks. It combines the superior modelling capacity of a split architecture design with the simplicity of hard parameter sharing. The whole two-stage framework is shown in Fig. 2. In Expansion Stage, we learn individual sub-backbones for different tasks with multi-source data in parallel. We further interconnect them to an expanded backbone that effectively alleviates interference among tasks. We then condense the expanded backbone to a normal-sized one in Squeeze Stage, producing the final general representation for downstream transfer.

3.1 Multi-Task and Multi-Source Learning

As illustrated in Fig. 1(a), the most common supervised learning setting involves only one task with a single source, i.e., a datum from the source has one label or annotation corresponding to the only task (SSST). There is no task interference during optimization, yet the generated representation is weak in terms of transferability to other tasks.

Traditional multi-task approaches in previous works concurrently learn multiple tasks within a single data source (SSMT), which is shown in Fig. 1(b). The single data source should have multiple sets of labels, each for one task. Such a data source is hardly scalable due to the high annotation cost.

To fix the drawbacks of previous setups, we propose our multi-source multi-task setting (MSMT), which is displayed in Fig. 1(c). More concretely, let T be the number of tasks, then for each task \(t\in \{1,2,...,T\}\), there are \(N_t\) data sources \(\mathcal {S}^t=\{(X_n^t,Y_n^t)\}_{n=1}^{N_t}\) with labels of the task. In this way, we only require \(N=\sum _{t=1}^TN_t\) single-task data sources which are easily attainable, avoiding the difficulty of multi-task annotation. Our setting is also highly extensible since adding new tasks or data sources becomes an effortless process. During training, the optimization objective of our multi-task and multi-source paradigm is to simply minimize the average loss over all the N data sources consisting of T different tasks:

$$\begin{aligned} \min _\theta L(\theta ,\{\mathcal {S}^t\}_{t=1}^T)=\frac{1}{N}\sum _{t=1}^{T}\sum _{n=1}^{N_t}\ell _t(\theta ,(X_n^t,Y_n^t)) \end{aligned}$$
(1)

where \(\theta \) denotes model parameters, and \(\ell _t\) refers to the loss function for task t.
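To make the objective concrete, the following is a minimal PyTorch-style sketch of Eq. (1). All names (model, heads, task_losses, source_batches) are illustrative placeholders rather than the paper's actual implementation.

```python
def msmt_objective(model, heads, task_losses, source_batches):
    """Eq. (1): average per-source task loss over all N single-task sources.

    source_batches: list of (task_id, images, labels), one entry per data source.
    heads[t] maps shared features to task t's prediction space; task_losses[t] is l_t.
    """
    total = 0.0
    for t, images, labels in source_batches:
        features = model(images)                       # shared parameters theta
        total = total + task_losses[t](heads[t](features), labels)
    return total / len(source_batches)                 # 1/N * sum_t sum_n l_t(theta, (X, Y))
```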

3.2 Expansion Stage

Algorithm 1. Training procedure of the Expansion Stage.

We aim to learn a general representation from heterogeneous tasks while being least affected by harmful interference among tasks. This motivates the design of the Expansion Stage, which learns a split architecture combining multiple single-task networks. We first train T sub-backbones individually for the T tasks, leveraging their own data sources. We then join all T sub-backbones into one holistic architecture, integrating information learned from all tasks to form a general representation. Specifically, we introduce an expanded backbone composed of multiple sub-backbones corresponding to the T tasks, along with several reconciliation layers connecting them, which we describe in detail below. The expanded backbone learned in this pipeline largely 1) preserves the high precision of single-task training, and 2) combines the advantages of all tasks to achieve better generalizability on downstream tasks. The full training process is summarized in Algorithm 1.
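Since the pseudo-code figure is not reproduced here, the following is a hedged, high-level sketch of the two-phase procedure described above; it is not the paper's exact Algorithm 1, and every identifier (loaders, build_expanded, joint_loader, etc.) is an illustrative assumption.

```python
def expansion_stage(sub_backbones, heads, task_losses, optimizers,
                    loaders, build_expanded, joint_optimizer, joint_loader):
    """Sketch of the Expansion Stage.

    loaders[t] yields (images, labels) batches from all of task t's sources;
    joint_loader yields (t, images, labels) batches mixed over all tasks.
    """
    # Phase 1: train each sub-backbone on its own task and sources.
    for t, backbone in enumerate(sub_backbones):
        for images, labels in loaders[t]:
            loss = task_losses[t](heads[t](backbone(images)), labels)
            optimizers[t].zero_grad()
            loss.backward()
            optimizers[t].step()

    # Phase 2: join the sub-backbones via reconciliation layers and train the
    # expanded backbone jointly on all tasks, minimizing the objective of Eq. (1).
    expanded = build_expanded(sub_backbones)
    for t, images, labels in joint_loader:
        loss = task_losses[t](heads[t](expanded(images, task=t)), labels)
        joint_optimizer.zero_grad()
        loss.backward()
        joint_optimizer.step()
    return expanded
```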

Reconciliation Layer. As shown in Fig. 2(a), each reconciliation layer is a link between two sub-backbones of two tasks. It obtains features from one task, transforms them with a few operations, and then fuses them into the features of another task at the same or a deeper layer.

Suppose each sub-backbone has D output layers, and we denote the original output of layer \(i\in \{1,2,...,D\}\) from the sub-backbone for task \(t\in \{1,2,...,T\}\) by \(\mathcal {E}^t_i\). Let \(\gamma _{j \rightarrow i}^{k \rightarrow t}\) (\(j\le i\), \(k\ne t\)) refer to the reconciliation layer taking \(\mathcal {E}_j^k\) as input and providing its output to the \(i^\text {th}\) layer of another task t. According to Fig. 2(a), \(\gamma _{j \rightarrow i}^{k \rightarrow t}\) can be expressed as the composition of one \(\gamma _b\) and \(i-j\) times of \(\gamma _a\). Receiving all cross-task and cross-layer features, we take a summation to compute the final fused output \(F^t_i\) at layer i of the sub-backbone for task t:

$$\begin{aligned} F^t_i = \mathcal {E}^t_i+\sum _{\begin{array}{c} k=1\\ k\ne t \end{array}}^T\sum _{j=1}^i\gamma _{j \rightarrow i}^{k \rightarrow t}\left( \mathcal {E}^k_j\right) . \end{aligned}$$
(2)

Adding reconciliation layers directly facilitates interaction among information from different tasks, closely unifying all sub-backbones into one expanded backbone that expresses an integrated and general representation. In practice, to avoid task interference introduced by such cross-task communication, we detach the inputs to all reconciliation layers from the computational graph to cut off further gradient propagation.
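As a concrete illustration, below is a minimal sketch of a reconciliation layer and the fusion of Eq. (2). The paper only specifies that \(\gamma _{j \rightarrow i}^{k \rightarrow t}\) composes one \(\gamma _b\) with \(i-j\) copies of \(\gamma _a\) and that its input is detached; the concrete choice of a \(1\times 1\) projection for \(\gamma _b\) and a stride-2 convolution for \(\gamma _a\) is an assumption made for illustration only.

```python
import torch.nn as nn

class Reconciliation(nn.Module):
    """Hypothetical gamma_{j->i}^{k->t}: one gamma_b followed by (i - j) gamma_a blocks."""
    def __init__(self, in_channels, out_channels, num_gamma_a):
        super().__init__()
        blocks = [nn.Conv2d(in_channels, out_channels, kernel_size=1)]      # gamma_b (assumed)
        for _ in range(num_gamma_a):                                        # gamma_a (assumed)
            blocks.append(nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1))
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        # Detach so gradients do not flow back into the source task's sub-backbone.
        return self.blocks(x.detach())

def fuse(task_feats, recon, t, i):
    """Eq. (2): fused feature F^t_i. task_feats[k][j] is E^k_j,
    and recon[(k, j, t, i)] is gamma_{j->i}^{k->t}."""
    out = task_feats[t][i]
    for k in range(len(task_feats)):
        if k == t:
            continue
        for j in range(i + 1):
            out = out + recon[(k, j, t, i)](task_feats[k][j])
    return out
```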

Fig. 3. Variants of X-Learner. (a) is the default form of X-Learner. (b) The Expansion Stage of X-Learner++ is supervised by extra hints from single-task single-source pre-trained models. (c) X-Learner\(_r\) is a Squeeze-Expansion version. (d) X-Learner\(_p\) replaces the distillation with pruning in the Squeeze Stage. (e) We switch to a new reconciliation layer in X-Learner\(_t\). Differences between the variants and the default X-Learner are highlighted in red.

3.3 Squeeze Stage

The preceding Expansion Stage yields a concerted representation from the expanded backbone uniting all T sub-backbones of the T tasks. However, it also introduces an undesirable T-fold increase in the number of model parameters and in computational complexity. To maintain performance while shedding the expanded parameters, we present the Squeeze Stage. The final squeezed model remains highly generalizable for downstream transfer while sharing the same number of parameters as a single-task sub-backbone.

In the Squeeze Stage, given an expanded backbone, we adopt distillation to consolidate the model. We employ the FitNets [53] approach, but with multiple targets (hints) from the expanded backbone as the student’s supervision. Formally, given multiple outputs from the expanded teacher indexed by \(t\in \{1, 2, ..., T\}\), we refer to \(F^{t}\) as the output feature of task t, and \(\hat{F}\) as the feature of the student network. We perform distillation between the student model and the set of teacher outputs. Specifically, we project the single student feature \(\hat{F}\) through a task-specific guidance layer \(\mathcal {G}^{t}\), and expect the outcome to match the teacher’s version \(F^{t}\). Therefore, our distillation loss \(L_\text {squeeze}\) is simply the sum over the squared \(L_2\) losses of all teacher-student pairs:

$$\begin{aligned} L_\text {squeeze} = \sum _{t=1}^T{ ||{F^{t} - \mathcal {G}^{t}(\hat{F})}||^2_2 }. \end{aligned}$$
(3)

The guidance layer \( \mathcal {G}^{t}\) is composed of a convolutional layer and a normalization layer:

$$\begin{aligned} \mathcal {G}^{t}(x) = \text {Norm}(\text {Conv}(x)). \end{aligned}$$
(4)

We adopt a \(1\times 1\) convolution which transforms the student’s feature to have the same number of channels as the teacher’s output. For the normalization function, we simply choose Batch Normalization [28] as in [53].
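A minimal sketch of the guidance layers and the distillation loss of Eqs. (3)-(4) is given below. Whether the squared \(L_2\) term is summed or averaged over feature elements is an implementation choice left open here.

```python
import torch.nn as nn

class GuidanceLayer(nn.Module):
    """Eq. (4): 1x1 convolution + BatchNorm projecting the student feature
    to the channel count of the teacher feature for task t."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(teacher_channels)

    def forward(self, x):
        return self.norm(self.conv(x))

def squeeze_loss(student_feat, teacher_feats, guidance_layers):
    """Eq. (3): sum of squared L2 distances between each teacher feature F^t
    and the projected student feature G^t(F_hat)."""
    loss = 0.0
    for guide, teacher_feat in zip(guidance_layers, teacher_feats):
        loss = loss + (teacher_feat - guide(student_feat)).pow(2).sum()
    return loss
```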

3.4 Variants of X-Learner

X-Learner is a highly flexible multi-task pre-training framework, and many variants can be designed from the default setting. In this section, we describe several possibilities, which are illustrated in Fig. 3. More detailed differences among those variants are listed in Fig. 4.

X-Learner\(_{\boldsymbol{r}}\). We notice that the number of parameters in each individual model is first rising and then declining in our default X-Learner. It is natural to also study the reversed order, i.e., Squeeze-Expansion. In the new squeeze stage, we use T task-specific teachers trained with multiple sources to distill T more light-weight sub-backbones. They are then combined into one network with normal computational complexity via reconciliation layers in the following expansion stage.

X-Learner\(_{\boldsymbol{t}}\). We modify the reconciliation layers and let them take features from deeper layers of other sub-backbones as input and fuse them into the low-level features of a task. We also replace \(\gamma _a\) in cross-layer reconciliation layers with \(\gamma _c\), which is composed of an up-sampling layer and a convolutional layer.

X-Learner\(_{\boldsymbol{p}}\). We replace the distillation operation with unstructured pruning in the Squeeze Stage. Pruning is another way to reduce computational cost while maintaining the performance of a network. We adopt a simple unstructured pruning method following [78].
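For reference, a generic unstructured magnitude-pruning sketch using torch.nn.utils.prune is shown below. It is not necessarily the method of [78], and the 50% sparsity level is an arbitrary illustrative value.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def unstructured_prune(expanded_backbone, amount=0.5):
    """Globally prune the smallest-magnitude weights of all conv layers."""
    targets = [(m, "weight") for m in expanded_backbone.modules()
               if isinstance(m, nn.Conv2d)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured,
                              amount=amount)
    for module, name in targets:          # make the pruning masks permanent
        prune.remove(module, name)
```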

X-Learner++. Inspired by [36], in the Expansion Stage we add extra supervision from single-task single-source pre-trained models in the form of hints, in addition to the original supervision from the labels of multiple data sources. This can be viewed as adding a pre-distillation process with multiple SSST teachers prior to training the expanded backbone.

Table 1. Datasets used for X-Learner pre-training. We group them into manually defined image domains according to [44].
Table 2. Comparison with supervised and self-supervised methods on classification, detection and segmentation. A mark in the table indicates that the model is not pre-trained with semantic segmentation. We compare X-Learner to supervised pre-training, self-supervised learning, and a simple hard-sharing multi-task learning baseline. Relative gains are computed with respect to the ImageNet supervised baseline.

4 Experiments

4.1 Pre-training Settings

Pre-Training Sources (Datasets). Table 1 summarizes the sources we use for experiments. Most of our experiments are conducted in a base setting, where we pre-train models with 2 tasks: classification and object detection. We use 3 sources for image classification: ImageNet [54], iNat2021 [61] and Places365 [75] (Challenge version), and 2 sources for object detection: COCO [39] and Objects365 [56]. We also consider two extended settings: 1) to investigate the effect of more sources on X-Learner, we add CompCars [67] as well as Tsinghua Dogs [79] as two extra classification sources, and select WIDER FACE [68] as a new object detection source; 2) we study the impact of adding a new task, which is semantic segmentation, with ADE20K [76] and COCO-Stuff [6] as its sources.

Implementation Details. We implement X-Learner and the variants described in Sect. 3.4 using ResNet-50 [27] as the basic backbone throughout our experiments unless otherwise specified. The weights of reconciliation layers are initialized following [20]. We use the SGD optimizer with a momentum of 0.9 [60], a weight decay of \(10^{-4}\) and a base learning rate of 0.2. We decay the learning rate three times with a multi-step schedule, using factors 0.5, 0.2 and 0.1 at 50%, 70% and 90% of the total iterations respectively.
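A sketch of the optimizer and learning-rate schedule described above, assuming the factors 0.5, 0.2 and 0.1 multiply the base learning rate of 0.2 (rather than being applied cumulatively); model and total_iters are placeholders.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)           # placeholder for the actual backbone
total_iters = 90_000                         # placeholder; set to the real schedule length

optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)

def lr_factor(step):
    """Multiplier applied to the base lr at 50%, 70% and 90% of training."""
    progress = step / total_iters
    if progress < 0.5:
        return 1.0
    if progress < 0.7:
        return 0.5
    if progress < 0.9:
        return 0.2
    return 0.1

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# Call scheduler.step() once per iteration after optimizer.step().
```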

Fig. 4. Differences among X-Learner variants. We conduct different ablation studies of X-Learner. Pre-distillation refers to applying extra supervision from single-task single-source pre-trained models, as introduced in X-Learner++. In the Squeeze column, we denote distillation by D and pruning by P when a squeeze stage is present in the pipeline. The change in the number of parameters is shown in the figure on the right.

4.2 Downstream Task Settings

Classification. We select 10 datasets from the well-studied evaluation suite introduced by [31], including general object classification (CIFAR-10 [33], CIFAR-100 [33]); fine-grained object classification (Food-101 [5], Stanford Cars [32], FGVC-Aircraft [42], Oxford-IIIT Pets [48], Oxford 102 Flower [46], Caltech-101 [16]); and scene classification (SUN397 [64]). We follow the linear probe evaluation setting used in [49]. We use the average accuracy over the 10 classification datasets (AVG Cls) to represent overall performance on the classification task. We train a logistic regression classifier using the L-BFGS optimizer, with a maximum of 1,000 iterations. We search the value of the L2 regularization strength \(\lambda \) over a set of values spread evenly between \(10^{-5}\) and \(10^{-1}\). We use images of resolution \(224 \times 224\) for both training and evaluation.
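The linear-probe protocol can be sketched with scikit-learn as below, assuming frozen backbone features have already been extracted; the exact grid of regularization values and the use of LogisticRegression's C = 1/\(\lambda \) parameterization are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, val_feats, val_labels):
    """Fit an L2-regularized logistic regression on frozen features and
    select the regularization strength lambda on a validation split."""
    best_acc, best_clf = 0.0, None
    for lam in np.linspace(1e-5, 1e-1, num=10):        # evenly spread candidate lambdas
        clf = LogisticRegression(C=1.0 / lam, solver="lbfgs", max_iter=1000)
        clf.fit(train_feats, train_labels)
        acc = clf.score(val_feats, val_labels)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf, best_acc
```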

Detection. We fine-tune our pre-trained model on PASCAL VOC07+12 (PASCAL Det) [15] for the detection task. We use the Faster R-CNN [51] architecture in our experiments and run 24,000 iterations with a batch size of 16. We use SGD as the optimizer and search for the best learning rate between 0.001 and 0.05. Weight decay is set to \(10^{-4}\), and momentum is set to 0.9. Evaluation is performed on the PASCAL VOC 2007 test set, with the shorter edges of images scaled to 800 pixels.
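A hedged sketch of how such a fine-tuning setup could be assembled with a torchvision-style Faster R-CNN; the checkpoint name, the particular learning rate picked from the search range, and the strict=False backbone loading are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone; 20 VOC classes + background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=21)

# Initialize the backbone from the X-Learner pre-trained ResNet-50 weights
# (checkpoint path is a placeholder).
state = torch.load("xlearner_resnet50.pth", map_location="cpu")
model.backbone.body.load_state_dict(state, strict=False)

optimizer = torch.optim.SGD(model.parameters(), lr=0.02,   # searched in [0.001, 0.05]
                            momentum=0.9, weight_decay=1e-4)
```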

Semantic Segmentation. We evaluate models on PASCAL VOC 2012 (PASCAL Seg) [15]. We run 33,000 iterations with a batch size of 16. The architecture is based on Deeplab v3 [9]. We use SGD as the optimizer with a learning rate between 0.001 and 0.07. Weight decay is set to \(10^{-4}\), and momentum is set to 0.9. Images are scaled to \(513\times 513\).

Table 3. Comparison on extended settings with extra pre-training sources. By adding sources in different tasks (marked in bold italic), Hard-sharing suffers performance drops on both upstream and downstream tasks, while our X-Learner is stable across different settings, benefiting from the proposed Expansion Stage.

4.3 Main Results

Table 4. Comparison with self-training. PASCAL Seg is an unseen task for X-Learner\(++\), and NYU-Depth V2 is an unseen task for X-Learner\(_{R152}\); both are marked accordingly in the table.
Table 5. The effect of applying reconciliation layers in the Expansion Stage. Reconciliation layers significantly improve performance in multi-task learning.

Pre-Training Paradigm Comparison. Table 2 compares our pre-training scheme X-Learner with supervised training and self-supervised learning (SimCLR [10]) on ImageNet [54], as well as a simple hard-parameter-sharing baseline (referred to as “Hard-sharing”) on our multi-task and multi-source setting. We report performance on all three types of downstream tasks. Under the base setting, X-Learner uniformly outperforms all compared methods on all evaluated metrics, especially AVG Cls. We also observe that the Hard-sharing model performs better than the ImageNet-supervised model on PASCAL Det, but suffers a performance drop of 1.2% in AVG Cls. This suggests that the hard-sharing model benefits from multi-task pre-training with object detection sources included, but is harmed by task interference. In contrast, our X-Learner clearly overcomes this shortcoming and alleviates undesirable interference, leading to performance boosts on all considered tasks. Moreover, compared with training solely on ImageNet, which is already specialized for classification, our approach still enjoys a 2.5% increase in AVG Cls. This result demonstrates that our setting of learning with multiple tasks simultaneously is beneficial for all involved pre-training tasks, such as classification here.

In addition, our X-Learner++ mentioned in Sect. 3.4 further enhances performance by means of its extra distillation process during sub-backbone training in the Expansion Stage, and achieves the best performance on all three downstream tasks.

We also compare our X-Learner++ with the multi-task self-training method MuST [19] in Table 4. For a fair comparison, we fine-tune on the CIFAR-100 dataset instead of applying our default linear probe setting, evaluate PASCAL Det with a pre-trained FPN [38], and set the output stride to 8 in segmentation.

Our model surpasses MuST on the classification and detection tasks despite using ResNet-50 instead of the more advanced ResNet-152 applied by MuST. To better show the effectiveness of our setting, we also conduct an experiment with the ResNet-152 backbone. Table 4 shows the performance of X-Learner\(_\text {R152}\) as well as MuST on four different tasks. We observe that our framework outperforms the self-training method by significant margins on all evaluated downstream tasks. Moreover, it is worth mentioning that on NYU-Depth V2, our X-Learner, without any depth estimation pre-training, surpasses MuST, which is trained with MiDaS, a mixture of 10 depth datasets. This zero-shot result further demonstrates the strong generalization capability of X-Learner.

We also compare our X-Learner\(_{R152}\) with a stronger version of MuST pre-trained on JFT-300M, which is much larger than our datasets. Our X-Learner achieves 89.7 and 88.6 on the downstream classification and detection tasks, respectively. This comparison shows that dataset size is not the decisive factor and demonstrates the superiority of our design.

Cross-task Generalization and Scalability. In Table 2, among methods that are not pre-trained on semantic segmentation, our X-Learner++ has the highest result on PASCAL Seg. This validates that our models produce more generalizable representations in terms of unseen tasks.

In addition to generalizability, our framework is also highly scalable and can incorporate extra tasks or sources effortlessly. As a demonstration, we add a semantic segmentation task according to the extended setting with ADE20K and COCO-Stuff. Results of “X-Learner w/seg” in Table 2 show an improvement of 0.5 mIoU on PASCAL Seg compared to the basic X-Learner. Classification performance also benefits from the newly introduced task, demonstrating the effectiveness of our multi-task learning approach.

Necessity of Reconciliation Layers. As shown in Table 5, we train an X-Learner without reconciliation layers to study the importance of this component. Compared to the default setting, removing reconciliation layers leads to significant performance drops in downstream transfer learning, especially on fine-grained datasets. We find that the features from the detection sub-backbone contain more detail and can be enhanced into a universal feature by the reconciliation layers. This phenomenon also verifies that reconciliation layers play a crucial role in coordinating multiple tasks towards the common goal of general representation learning.

Table 6. Comparison of X-Learner variants. Performance on pre-training tasks and downstream tasks is evaluated for each X-Learner variant. Our framework always performs better than Hard-sharing.

4.4 In-Depth Studies

Multi-task and Multi-source Pre-training

Observation 1: Proper Multi-Task Learning Promotes Collaboration Instead of Bringing Interference. As discussed in Sect. 4.3, X-Learner not only resolves the task interference issue encountered by the hard-sharing model, but also surpasses single-task pre-trained models such as the ImageNet baseline in terms of downstream results. This shows that with an appropriately designed learning scheme, multi-task training is able to collaboratively enhance performance on all pre-training tasks. This conclusion is further corroborated by the results of X-Learner++ in Table 2: with a more elaborate design, performance on all tasks is again consistently boosted.

Observation 2: Additional Sources Further Improve Multi-Task and Multi-Source Representation Learning If Task Conflicts are Well-Mitigated. We experiment on the extended setting with extra classification and detection sources. The added sources, such as CompCars [67] and WIDER FACE [68], contain data from domains very different from the existing sources. Ideally, including sources of a complementary nature should help the overall multi-task and multi-source learning, since the information available for pre-training is enriched and is more likely to cover downstream domains. However, this may also increase conflicts among tasks if not handled properly. In Table 3, we can see that the over-simplified hard-sharing baseline obtains considerably inferior results at both upstream and downstream when more sources are added. In the pre-training stage, there is a slight decrease after adding classification sources, due to the increase in task conflict when introducing new data domains. Nonetheless, additional sources become beneficial to transfer learning tasks for both Hard-sharing and X-Learner. Compared to Hard-sharing, X-Learner mitigates such detrimental conflict to a certain extent with the aid of our two-stage design. This suggests that when task interference is properly alleviated, new data sources can be fully utilized by the model to learn more diverse knowledge and enhance the final representation.

Design of X-Learner Framework

Observation 3: Expansion-Squeeze Is Better Than Squeeze-Expansion. In Sect. 3.4, we have described the X-Learner\(_{r}\) variant in which the order of the two stages within X-Learner is reversed. Performing squeezing first would result in smaller single-task sub-backbones with 1/T of the original size. Since \(T=2\) in our base setting, we obtain two halved ResNet-50 models, corresponding to HalfResNet-50 in Fig. 4, which are then joined in the subsequent expansion process. HalfResNet-50 is a sub-backbone with only \(1/\sqrt{2}\) of the original ResNet-50 channels. As shown in Table 6, X-Learner\(_{r}\) has lower performance on most pre-training tasks and all downstream tasks than the default X-Learner. This finding is reasonable since, intuitively, shrinking sub-backbones first is likely to cause unrecoverable information loss. It also validates our choice of Expansion-Squeeze for the default setup. Note that X-Learner\(_{r}\) is still better than the hard-sharing model, which again highlights the importance of a two-stage paradigm to mitigate task interference.

Observation 4: Reconciliation Layers Should Receive Information from Lower Levels. We also evaluate the alternative design of X-Learner\(_t\), where reconciliation layers take features from deeper layers instead of shallower ones. Experiments in Table 6 show that the modified and original setups are both competitive at upstream pre-training. However, X-Learner\(_t\) is not as good as X-Learner in terms of downstream tasks. In conclusion, low-level features are more suitable to serve as complementary information among heterogeneous tasks.

Observation 5: Pruning May Replace Distillation in Squeeze Stage. In Table 6, X-Learner\(_{p}\) achieves results similar to those of X-Learner. This shows that pruning is also a valid choice for squeezing the expanded backbone, and thus is able to substitute distillation in Squeeze Stage.

5 Discussion and Conclusion

In this paper, we propose a flexible multi-task and multi-source pre-training paradigm called X-Learner, a general framework for representation learning through supervised multi-task learning. Heterogeneous tasks and diverse sources can be jointly learned with the help of the Expansion Stage and Squeeze Stage. We validate that X-Learner mitigates the well-known task interference problem and learns a unified general representation that generalizes well to multiple seen and unseen tasks. We also show that X-Learner is superior to traditional supervised and self-supervised learning methods, as well as self-training approaches. In addition, we demonstrate that our framework is highly flexible: additional tasks or sources can be integrated in a “plug-and-play” manner. Moreover, we offer several insightful observations through our experiments. One possible limitation is that the representational capability of our current pre-training is confined by the scale of publicly available datasets; studying larger sources and more tasks within our framework remains an open direction. We hope this work will encourage further research towards creating general representations by performing multi-task and multi-source learning at scale.