1 Introduction

A long-standing goal of AI is the ability to adapt an initial, pre-trained model to novel, unseen scenarios. This is crucial for increasing the knowledge of an intelligent system and developing effective life-long learning algorithms [38, 41, 42]. While appealing, achieving this goal requires facing multiple challenges. First, learning a new task should not negatively affect the performance on old tasks, avoiding the catastrophic forgetting phenomenon [6, 8]. Second, adding many parameters to the model for each new task should be avoided, as it would lead to poor scalability of the framework [31]. In this context, while deep learning algorithms have achieved impressive results on many computer vision benchmarks [7, 11, 17, 22], mainstream approaches for adapting deep models to novel tasks tend to suffer from both of the problems mentioned above.

Different works addressed these problems by either considering regularization techniques [14, 21] or introducing task-specific network parameters [24, 25, 31, 34, 36]. Interestingly, in [25] the authors effectively addressed sequential multi-task learning by creating a binary mask for each task. This mask is then multiplied element-wise with the main network weights, determining which of them are useful for addressing the new task, and requires just one bit per parameter per task.

Our paper takes inspiration from this last work. We formulate sequential multi-task learning as the problem of learning a perturbation of a baseline, pre-trained network that maximizes the performance on a new task. As opposed to [25], we apply an affine transformation to each convolutional weight of the baseline network, involving both a learned binary mask and a few additional parameters. Our solution allows us to: (1) boost the performance of each task-specific network, by leveraging the higher degree of freedom in perturbing the baseline network; (2) keep a low per-task overhead in terms of additional parameters (slightly more than 1 bit per parameter per task). We assess the validity of our method on standard benchmarks, achieving performance comparable to fine-tuning a separate network for each task.

2 Related Works

The keen interest in incremental and life-long learning methods dates back to the pre-convnet era, with shallow learning approaches ranging from large margin classifiers [18, 19] to non-parametric methods [27, 33].

Recently, various works have addressed these problems within the framework of deep architectures [1, 10, 31]. A major risk when training a neural network on a novel task is to degrade its performance on old tasks, discarding previous knowledge, a phenomenon called catastrophic forgetting [6, 8, 26]. To alleviate this issue, various works designed constrained optimization procedures that take into account the initial network weights trained on previous tasks. In [21], the authors exploit knowledge distillation [13] to obtain target objectives for previous tasks while training for novel ones. In [14], the authors design an update rule for the network parameters based on their importance for previously seen tasks.

Recent methods achieve higher performance at the cost of adding task-specific parameters for each newly learned task, keeping the initial network parameters untouched. The extreme case is [36], where a parallel network is added each time a new task is presented. In [31, 32], task-specific residual components are added to standard residual blocks. In [34] the authors use controller modules in which the parameters of the base architecture are recombined channel-wise. In [24] a different subset of network parameters is considered for each task. A more compact and effective solution is [25], where separate binary masks are learned for each novel task and multiplied with the original network weights. The binary masks determine which parameters are useful for the new task and which are not. We take inspiration from this last work, but we use the binary masks to design task-specific affine transformations of the original weights. This allows us to use a comparable number of parameters per task with increased flexibility, further reducing the gap with individual end-to-end trained architectures.

3 Method

We address the problem of sequential multi-task learning, as in [25]: we modify a baseline network (e.g. a ResNet-50 pretrained on the ImageNet classification task) so as to maximize its performance on a new task, while limiting the amount of additional parameters needed. The solution we propose exploits the key idea from Piggyback [25] of learning task-specific masks, but instead of pursuing the simple multiplicative transformation of the parameters of the baseline network, we define a parametrized, affine transformation mixing a binary mask and real parameters. This choice keeps a low per-task overhead while significantly increasing the expressiveness of the approach, leading to a rich and nuanced ability to adapt the old parameters to the needs of the new tasks.

3.1 Overview

Let us assume we are given a pre-trained baseline network \(f_0(\cdot ; \varTheta , \varOmega _0):\mathcal {X}\rightarrow \mathcal {Y}_0\) assigning a class label in \(\mathcal {Y}_0\) to elements of an input space \(\mathcal {X}\) (e.g. images). The parameters of the baseline network are partitioned into two sets: \(\varTheta \) comprises the parameters that will be shared with other tasks, whereas \(\varOmega _0\) contains the rest of the parameters (e.g. the classifier). Our goal is to learn, for each task \(i\in \{1,\ldots ,\mathsf {m}\}\) with a possibly different output space \(\mathcal {Y}_i\), a classifier \(f_i(\cdot ;\varTheta ,\varOmega _i):\mathcal {X}\rightarrow \mathcal {Y}_i\). Here, \(\varOmega _i\) contains the parameters specific to the ith task, while \(\varTheta \) holds the shareable parameters of the baseline network mentioned above. Before delving into the details of our method, we review the Piggyback solution presented in [25].
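To make this notation concrete, the following is a minimal, purely illustrative sketch (assuming PyTorch and torchvision; the variable names theta and heads are ours, not taken from any released code): the frozen backbone weights play the role of \(\varTheta \), while each \(\varOmega _i\) holds, at minimum, a task-specific classification head.

```python
import torch.nn as nn
import torchvision.models as models

# Shared backbone: its weights (excluding the classifier) play the role of Theta.
backbone = models.resnet50(pretrained=True)
theta = {n: p for n, p in backbone.named_parameters() if not n.startswith("fc")}
for p in theta.values():
    p.requires_grad = False  # Theta stays fixed across all tasks

# Omega_i: one classification head per task; masks and other task-specific
# parameters are added below (Sect. 3.1 and 3.2).
num_classes_per_task = [100, 196, 102]  # hypothetical output spaces Y_1, Y_2, Y_3
heads = [nn.Linear(backbone.fc.in_features, c) for c in num_classes_per_task]
```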

Each task-specific network \(f_i\) shares the same structure as the baseline network \(f_0\), except for a possibly differently sized classification layer. All parameters of \(f_0\), except for the classifier, are shared across all the tasks. For each convolutional layer of \(f_0\) with parameters \(\mathtt {W}\), the task-specific network \(f_i\) holds a binary mask \(\mathtt {M}\) that is used to mask \(\mathtt {W}\), obtaining

$$\begin{aligned} \hat{\mathtt {W}}=\mathtt {W}\circ \mathtt {M}, \end{aligned}$$
(1)

where \(\circ \) is the Hadamard product. The transformed parameters \(\hat{\mathtt {W}}\) are then used in the convolutional layer of \(f_i\). By doing so, the task-specific parameters that are stored in \(\varOmega _i\) amount to just a single bit per parameter in each convolutional layer, yielding a low overhead per additional task, while retaining a sufficient degree of freedom to build new convolutional weights.
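A minimal sketch of this masking, assuming PyTorch (the class name and the storage of the mask as a real-valued tensor are our own illustration; how the threshold is handled during backpropagation is discussed in Sect. 3.2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PiggybackConv2d(nn.Module):
    """Convolution with a frozen pretrained kernel W masked by a task-specific binary mask M."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.register_buffer("weight", pretrained_conv.weight.data.clone())  # shared W, part of Theta
        self.stride, self.padding = pretrained_conv.stride, pretrained_conv.padding
        # Real-valued scores R, thresholded into the binary mask M (stored in Omega_i).
        self.mask_real = nn.Parameter(torch.full_like(self.weight, 1e-4))

    def forward(self, x):
        mask = (self.mask_real >= 0).float()  # M = 1_{R >= 0}; gradient handled via a surrogate (Sect. 3.2)
        w_hat = self.weight * mask            # Eq. (1): W_hat = W o M
        return F.conv2d(x, w_hat, stride=self.stride, padding=self.padding)
```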

Proposed. Similarly to [25], we consider task-specific networks \(f_i\) that are shaped as the baseline network \(f_0\) and we store in \(\varOmega _i\) a binary mask \(\mathtt {M}\) for each convolutional kernel \(\mathtt {W}\) in the shared set \(\varTheta \). However, we depart from the simple multiplicative transformation of \(\mathtt {W}\) used in (1), and consider instead an affine transformation of the base convolutional kernel \(\mathtt {W}\) that depends on a binary mask \(\mathtt {M}\) as well as additional parameters. Specifically, we transform \(\mathtt {W}\) into

$$\begin{aligned} \check{\mathtt {W}}=k_0\mathtt {W}+k_1\mathtt {1}+k_2\mathtt {M}, \end{aligned}$$
(2)

where \(k_j\in \mathbb R\) are additional task-specific parameters in \(\varOmega _i\) that we learn along with the binary mask \(\mathtt {M}\), and \(\mathtt {1}\) is an appropriately sized tensor of ones (Fig. 1). We can consider either a single scale (\(k_2\)) and bias (\(k_1\)) parameter per convolutional kernel, or distinct values for each feature channel.

Besides learning the binary masks and the parameters \(k_j\), we also learn task-specific batch-normalization (BN) parameters (i.e. mean, variance, scale and bias), which are part of \(\varOmega _i\) and thus optimized for each task, rather than being fixed in \(\varTheta \). When a convolutional layer is followed by BN, we keep the corresponding parameter \(k_0\) fixed to 1, because the output of batch normalization is invariant to the scale of the convolutional weights.

Fig. 1. Proposed model. An affine transformation scales and translates the binary mask through the parameters \(k_2\) and \(k_1\), respectively. The resulting mask is added to the pretrained kernel to obtain the final task-specific weights.
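The sketch below illustrates how the task-specific weights of Eq. (2) and Fig. 1 can be built in PyTorch; it is a hedged illustration rather than the actual implementation, and the class name, the initialization of \(k_1\) and \(k_2\), and the per-output-channel option reflect our reading of the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineMaskedConv2d(nn.Module):
    """Builds task-specific weights via Eq. (2): W_check = k0*W + k1*1 + k2*M."""
    def __init__(self, pretrained_conv: nn.Conv2d, per_channel: bool = True,
                 followed_by_bn: bool = True):
        super().__init__()
        w = pretrained_conv.weight.data
        self.register_buffer("weight", w.clone())               # shared, frozen W (Theta)
        self.stride, self.padding = pretrained_conv.stride, pretrained_conv.padding
        # Task-specific parameters (Omega_i).
        self.mask_real = nn.Parameter(torch.empty_like(w).uniform_(1e-4, 2e-4))
        shape = (w.size(0), 1, 1, 1) if per_channel else (1,)   # per output channel or per kernel
        self.k1 = nn.Parameter(torch.zeros(shape))              # bias acting as k1 * (tensor of ones)
        self.k2 = nn.Parameter(torch.ones(shape))               # scale of the binary mask
        # k0 stays fixed to 1 when BN follows, since BN is invariant to the kernel scale.
        self.k0 = nn.Parameter(torch.ones(shape), requires_grad=not followed_by_bn)

    def forward(self, x):
        m_hard = (self.mask_real >= 0).float()
        m = (m_hard - self.mask_real).detach() + self.mask_real  # identity surrogate gradient (Sect. 3.2)
        w_check = self.k0 * self.weight + self.k1 + self.k2 * m  # Eq. (2)
        return F.conv2d(x, w_check, stride=self.stride, padding=self.padding)
```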

The additional parameters introduced with our method bring a negligible per-task overhead compared to Piggyback, which is amply repaid by a significant boost in the performance of the task-specific classifiers.
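As a rough back-of-the-envelope check of this claim (assuming the baseline weights are stored as 32-bit floats and five tasks are added on top of ImageNet, as in Sect. 4):

```python
mask_overhead = 1 / 32      # one bit of M per 32-bit convolutional weight, ~3.1% per task
extra_tasks = 5             # CUBS, Cars, Flowers, WikiArt, Sketch
total = 1 + extra_tasks * mask_overhead
print(f"~{total:.2f}x the baseline size before BN and k_j terms")  # ~1.16x
# The task-specific BN parameters and the few k_j scalars per kernel add a small
# extra fraction, consistent with the ~1.17x total reported in Table 1.
```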

3.2 Learning Binary Masks

We learn the parameters \(\varOmega _i\) of each task-specific network \(f_i\) by minimizing the classification log-loss on a training set, using standard stochastic optimization methods. However, special care must be taken in optimizing the binary masks. Instead of optimizing the binary masks directly, which would turn the learning into a combinatorial problem, we adopt the solution of [25], i.e. we replace each binary mask \(\mathtt {M}\) with a thresholded real matrix \(\mathtt {R}\). By doing so, we shift from optimizing the discrete variables in \(\mathtt {M}\) to the continuous ones in \(\mathtt {R}\). However, the gradient of the hard threshold function \(h(r)=1_{r\ge 0}\) is zero almost everywhere, which makes this solution apparently incompatible with gradient-based optimization. To sidestep this issue we consider a strictly increasing surrogate function \(\tilde{h}\) that is used in place of h only for the gradient computation, i.e. if \(h'\) denotes the derivative of h with respect to its argument, we use \(h'(r)\approx \tilde{h}'(r)\). Since \(\tilde{h}\) is strictly increasing, the gradient obtained via the surrogate always points in a downhill direction on the error surface.

By taking \(\tilde{h}(x)=x\), i.e. the identity function, we recover the workaround suggested in [12], employed also in [25]. By taking \(\tilde{h}(x)=(1+e^{-x})^{-1}\), i.e. the sigmoid function, we obtain a better approximation, as suggested in [2, 9].
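A minimal sketch of the thresholding with a selectable surrogate gradient, assuming PyTorch autograd (the class and argument names are ours); the final lines show the mask initialization used in Sect. 4.

```python
import torch

class BinarizeWithSurrogate(torch.autograd.Function):
    @staticmethod
    def forward(ctx, r, surrogate="identity"):
        ctx.save_for_backward(r)
        ctx.surrogate = surrogate
        return (r >= 0).float()                 # hard threshold h(r) = 1_{r >= 0}

    @staticmethod
    def backward(ctx, grad_output):
        (r,) = ctx.saved_tensors
        if ctx.surrogate == "identity":
            grad_r = grad_output                # h'(r) approximated by 1, as in [12, 25]
        else:
            s = torch.sigmoid(r)
            grad_r = grad_output * s * (1 - s)  # h'(r) approximated by sigmoid'(r), as in [2, 9]
        return grad_r, None                     # no gradient for the 'surrogate' flag

# Real-valued mask initialized uniformly in [0.0001, 0.0002] (Sect. 4).
mask_real = torch.empty(64, 64, 3, 3).uniform_(1e-4, 2e-4).requires_grad_()
binary_mask = BinarizeWithSurrogate.apply(mask_real, "identity")
```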

4 Experiments

Datasets. In the following we test our method on two different benchmarks. For the first benchmark we follow [25], using 6 datasets: ImageNet [35], VGG-Flowers [30], Stanford Cars [15], Caltech-UCSD Birds (CUBS) [43], Sketches [5] and WikiArt [37]. These datasets exhibit large variations both in the categories addressed (e.g. cars [15] vs. birds [43]) and in the appearance of their instances (e.g. from natural images [35] to art paintings [37] and sketches [5]).

The second benchmark is the Visual Decathlon Challenge [31]. The goal of this challenge is to use a single algorithm to tackle 10 different classification tasks: ImageNet [35], CIFAR-100 [16], Aircraft [23], Daimler pedestrian (DPed) [28], Describable textures (DTD) [4], German traffic signs (GTSR) [40], Omniglot [20], SVHN [29], UCF101 Dynamic Images [3, 39] and VGG-Flowers [30]. A more detailed description of the challenge can be found in [31]. For this challenge, an independent scoring function is defined: the S-score [31]. This score takes into account the performance of a model on all 10 tasks, rewarding models that perform well on all tasks over models with peaked performance on just a few of them.

Networks and Training Protocols. For the first benchmark, we use a ResNet-50, comparing our model with Piggyback [25], PackNet [24] and two baselines: the network used only as a feature extractor (training only the task-specific classifier) and individual networks separately fine-tuned on each task. Since [24] depends on the order of the tasks, we report the performance for two different orderings [25]: starting from the model pre-trained on ImageNet, the first (\(\rightarrow \)) is CUBS-Cars-Flowers-WikiArt-Sketch, while the second (\(\leftarrow \)) is the reverse. For training, we follow the preprocessing, hyper-parameters and schedule of [25].

For the Visual Decathlon we employ the Wide ResNet-28 [44] adopted by previous methods [25, 31, 34], using the same data preprocessing. For training we use the same hyper-parameters as [25], keeping the same values for all the tasks except for the ImageNet pretraining, for which we follow [31]. For both benchmarks we employ \(\tilde{h}(x)=x\) as the surrogate, initializing the real-valued masks with uniform random values drawn between 0.0001 and 0.0002.

4.1 Results

ImageNet-to-Sketch. In the following we discuss the results obtained by our model in the ImageNet-to-Sketch scenario. For fairness, since our model includes task-specific BN layers, we also report the results of [25] with separate BN layers.

Results are shown in Table 1. Our model fills the gap between the classifier-only baseline and the individually fine-tuned architectures almost entirely in all settings. For larger and more diverse datasets such as Sketch and WikiArt, the gap is not completely closed, but the distance between our model and the individual architectures is always less than 1%. These results are remarkable given the simplicity of our method, which does not involve any assumption about the optimal weights per task [21, 24], and the small overhead in terms of parameters reported in the row “# Params” (i.e. 1.17), which is the total number of parameters (counting all tasks and excluding the classifiers) relative to those of the baseline network. Compared with the other algorithms, our model consistently outperforms both the basic version of Piggyback and PackNet in all settings. Introducing task-specific BN also for Piggyback reduces the performance gap, which nevertheless remains large in some settings (e.g. Flowers, Cars): this shows that the advantages of our model are not only due to the additional BN parameters, but also to the more flexible affine transformation introduced.

Both Piggyback and our model outperform PackNet and, unlike the latter, do not suffer from a heavy dependence on the ordering of the tasks. This advantage stems from a learning strategy that is task independent, with the base network unaffected by the new tasks being learned.

Table 1. Accuracy of ResNet-50 architectures in the ImageNet-to-Sketch setting.

Visual Decathlon Challenge. In this section we report the results for the Visual Decathlon Challenge. We compare our model with other sequential multi-task learning methods: Piggyback [25] (PB), the improved version of the winning entry of the 2017 edition of the challenge [34] (DAN), and the networks with task-specific residual [31] (RA) and parallel [32] (PA) adapters. We additionally report the baselines of [31]: the pre-trained network used as a feature extractor (Feature) and 10 different models fine-tuned on each task (Finetune). Moreover, we add the results of our implementation of [25] with the same pre-trained model and training schedule adopted for our method (PB ours).

The results are reported in Table 2. Our simple model achieves close to state-of-the-art performance in this competition. The only model outperforming ours is [32]; however, we employ a much lower parameter overhead and a single training schedule for all ten tasks. This produces a gain of more than 800 points over [32] in the ratio between the S-Score and the number of parameters used. Remarkably, we obtain a gain of more than 400 points over the previous winning entry [34] and over Piggyback.

Table 2. Results in terms of accuracy and S-Score, for the Visual Decathlon Challenge. Best model in bold, second best underlined.

Looking at the per-task results, excluding the ImageNet baseline, our model achieves the top-1 or top-2 score in 4 out of 9 tasks, with comparable performance in the others. The only exceptions are UCF-101 and Aircraft, where our model suffers a noticeable drop in accuracy. Tuning the hyper-parameters could close this gap, but this is beyond the scope of this work. Interestingly, while our model achieves average accuracy comparable to other approaches (e.g. PB, DAN), it obtains a much higher decathlon score. This highlights its ability to tackle all 10 tasks with good results, without peaked accuracy on just a few of them.

5 Conclusions

We presented a simple yet powerful method for incrementally learning new tasks given a fixed, pre-trained deep architecture. We build on the intuition of [25], generalizing the idea of masking the original weights of the network with learned binary masks. By introducing an affine transformation that acts on such weights, we allow for a richer set of possible modifications of the original network, better capturing the characteristics of the new tasks. Experiments on two public benchmarks confirm the effectiveness of our approach.