1 Introduction

A fundamental characteristic of natural intelligence is its ability to continually learn new concepts while updating information about the old ones. Realizing that very objective in machines is precisely the motivation behind continual learning (CL). While current machine learning (ML) algorithms can achieve excellent performance on any single task, learning new (or even related) tasks continually is extremely difficult for them because, in such scenarios, they are prone to catastrophic forgetting [1, 2]. Significant attention has been paid to this problem recently [3,4,5,6,7,8] and a diverse set of approaches has been proposed in the literature (refer to [9] for an overview). However, these approaches impose different sets of simplifying constraints on the CL problem and propose algorithms tailored to those constraints. Sometimes the constraints are so rigid that they break the very notion of learning continually; one example is knowing a priori the subset of labels a given input might take. In addition, these approaches are rarely tested exhaustively on practically useful scenarios. Keeping this observation in mind, we argue that it is of paramount importance to understand the caveats of these simplifying assumptions, understand why the resulting simplified forms are of little practical use, and shift our focus to a more general and practically useful formulation of continual learning to help the field progress.

To this end, we first provide a general formulation of CL for classification. Then, we investigate popular variants of existing CL algorithms and categorize them based on the simplifying assumptions they impose over the said general formulation. We discuss how each of them imposes constraints over the growing nature of the label space, the size of the label space, or the resources available. One of the primary drawbacks of these restricted settings is that algorithms tailored to them fail miserably when exposed to even a slightly different variant of CL, making them extremely specific to a particular situation. We would also like to emphasize that there is no explicit consensus among researchers regarding which formulation of CL is the most appropriate, leading to diverse experimental scenarios, none of which actually mimics the general form of the CL problem one would face in the real world.

Then, we take a step back and design an extremely simple algorithm with almost no simplifying assumptions compared to the recent approaches. We call this approach GDumb (Greedy Sampler and Dumb Learner). As the name suggests, the two core components of our approach are a greedy sampler and a dumb learner. Given a memory budget, the sampler greedily stores samples from a data-stream while making sure that the classes are balanced, and, at inference, the learner (a neural network) is trained from scratch (hence dumb) using all the samples stored in the memory. When tested on a variety of scenarios for which various recent works have proposed highly tuned algorithms, GDumb surprisingly provides state-of-the-art results, by large margins in almost all cases.

The fact that GDumb, even though not designed to handle the intricacies of these challenging CL problems, outperforms recently proposed algorithms in their own experimental set-ups is alarming. It raises concerns about the popular and widely used assumptions and evaluation metrics, and questions the efficacy of various recently proposed algorithms for continual learning.

2 Problem Formulation, Assumptions, and Trends

To provide a general and practically useful view of the CL problem, we begin with the following example. Imagine a robot walking in a living room, solving a task that requires it to identify all the objects it encounters. In this setting, the robot will be identifying known objects that it has learned in the past, will be learning about a few unknown objects by asking an oracle to provide labels for them, and, at the same time, will be updating its information about the known objects whenever new instances of them provide extra cues useful for the task. In a nutshell, the robot begins with some partial information about the world and keeps improving its knowledge of it as it explores new parts of the world.

Inspired by this example, a realistic formulation of continual learning for classification would be one where a learner has access to a stream of training samples, each sample comprising a two-tuple (\(\mathbf{x}_t\), \(\mathbf{y}_t\)), where t represents the timestamp or the sample index. Let \(\mathcal {Y}_t = \cup _{i=1}^t \mathbf{y}_i\) be the set of labels seen until time t; then it is trivial to note that \(\mathcal {Y}_{t-1} \subseteq \mathcal {Y}_t\). This formulation implies that the stream might give us a sample that belongs either to a new class or to the old ones. Under this setting, at any given t, the objective is to provide a mapping \(f_{\theta _{t}}: \mathbf{x}\rightarrow \mathbf{y}\) that can accurately map a sample \(\mathbf{x}\) to a label \(\mathbf{y}\in \mathcal {Y}_t \cup \bar{\mathbf{y}}\), where \(\bar{\mathbf{y}}\) indicates that the sample does not belong to any of the learned classes. Notice that the addition of this extra label \(\bar{\mathbf{y}}\) assumes that, during training, there is incomplete knowledge about the world, and a test sample might come from outside the training distribution. Interestingly, this connects an instance of CL very well with the well-known open-set classification problem [10]. However, in CL, the learner, with the help of an oracle (e.g., active learning), could improve its knowledge about the world by learning the semantics of samples inferred as \(\bar{\mathbf{y}}\).
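To make this protocol concrete, a minimal sketch of the loop in code is given below; the names (stream, learner) are our own illustration, not tied to any particular implementation:

```python
def continual_learning(stream, learner):
    """Sketch of the general CL protocol: a single pass over an unbounded stream."""
    seen_labels = set()            # Y_t: the labels seen until time t
    for t, (x, y) in enumerate(stream):
        seen_labels.add(y)         # Y_{t-1} is a subset of Y_t by construction
        learner.update(x, y)       # refine the mapping f_theta_t with the sample
        # At any t, the learner must map inputs to Y_t plus the extra label
        # y_bar ("none of the learned classes"); samples it infers as y_bar
        # could be sent to an oracle for labels, as in active learning.
```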

2.1 Simplifying Assumptions in Continual Learning

The formulation discussed above is general in the sense that it does not put any constraints whatsoever on the growing nature of the label space, the nature of the test samples, or the size of the output space \(\vert \lim _{t\rightarrow \infty } \mathcal {Y}_t \vert \). It does not put any restrictions on the resources (compute and storage) one might use to obtain a reliable mapping \(f_{\theta _{t}}(.)\) either; however, the lack of information about the nature and size of the output space makes the problem extremely hard. This has compelled almost all work in this direction to impose additional simplifying constraints or assumptions. These assumptions are so varied that it is difficult to compare one CL algorithm with another, as a slight variation in the assumptions might change the complexity of the problem dramatically. For better understanding, below we discuss all the popular assumptions, highlight their drawbacks, and categorize various recently proposed CL algorithms according to the simplifying assumptions they make. One assumption common to all is that the test samples always belong to the training distribution.

Disjoint Task Formulation: This formulation is used in almost all recent works [4,5,6,7,8]. The assumption made is that, during a particular period of time, the data-stream provides samples specific to one task, in a pre-defined order of tasks, and the aim is to learn the mapping by learning one task at a time, sequentially. In particular, let \(\mathcal {Y}= \lim _{t\rightarrow \infty } \mathcal {Y}_t\) be the set of labels that the stream might present until it runs out of samples. Recall that in the general CL formulation the size of \(\mathcal {Y}\) is unknown and the samples can be presented in any order. This label space \(\mathcal {Y}\) is then divided into disjoint subsets (via a random or an informed split), where each label subset \(\mathcal {Y}_i\) represents a task, and the sharp transitions between these sets are called task boundaries. Let there be m splits (typically balanced, with a nearly equal number of classes per split); then \(\mathcal {Y}= \cup _{i=1}^m \mathcal {Y}_i\) and \(\mathcal {Y}_i \cap \mathcal {Y}_j = \emptyset , \forall i \ne j\). An easy and widely used example is to divide the ten digits of MNIST into 5 disjoint tasks, where each task comprises the samples of two consecutive digits, and the stream is controlled to provide samples for each task in a pre-defined order, say \(\{0,1\}, \cdots , \{8,9\}\); the code sketch below makes this construction concrete. This formulation simplifies the general CL problem to a great extent, as the unknown growing nature of the label space is now restricted and known. It provides a very strong prior to the learner and helps in deciding both the space budget and the family of functions \(f_{\theta }(.)\) to learn.
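A minimal sketch of the split construction (the helper name is ours, not from any specific benchmark code):

```python
import numpy as np

def disjoint_tasks(label_space, num_tasks):
    """Split a label space into `num_tasks` disjoint, nearly balanced subsets."""
    labels = np.array(sorted(label_space))
    return [set(chunk.tolist()) for chunk in np.array_split(labels, num_tasks)]

# The widely used Split-MNIST example: 10 digits -> 5 two-class tasks,
# presented by the controlled stream in a fixed order.
print(disjoint_tasks(range(10), 5))   # [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}]
```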

Task-Incremental v/s Class-Incremental: To make training and inference even easier, a popular choice of CL formulation is task-incremental continual learning (TI-CL) [7], where, along with the disjoint task assumption, the task information (or id) is also passed by an oracle during training and inference. Thus, instead of a two-tuple, a three-tuple \((\mathbf{x}, \mathbf{y}, \alpha )\) is given, where \(\alpha \in \mathbb {N}\) represents the task identifier. This formulation is also known as multi-head and is an extremely simplified form of the continual learning problem [8]. For instance, in the above-mentioned MNIST example, if the input at inference is \((\mathbf{x}, \alpha =3)\), the sample must belong to either class 4 or class 5. Knowing this subset of labels a priori dramatically reduces the label space during training and inference, but such knowledge is impractical to obtain in real-world scenarios. In contrast, in the class-incremental formulation (CI-CL) [4, 8], also known as single-head, we do not have any such information about the task id.

Online CL v/s Offline CL: Note that the disjoint task formulation places a restriction on the growing nature of the label space and inherently restricts its size; however, it does not put any constraints on the learner itself. Therefore, the learner may store task-specific samples coming from the stream, depending on the space budget, and then use them to update the parameters. Under this setting, in the online CL formulation, even though the learner is allowed to store samples as they come, it is not allowed to use a sample more than once for a parameter update. Thus, the learner cannot use the same sample (unless it is in the memory) at different iterations of the learning process. In contrast, offline CL allows unrestricted access to the entire dataset corresponding to the current task (though not to the previous ones), and one can use this dataset to learn the mapping by revisiting the samples again and again over multiple passes [4]. The sketch below contrasts the two regimes.
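A schematic contrast, with hypothetical learner/memory objects of our own naming:

```python
def online_cl(stream, learner, memory):
    # Online: storing is allowed, but each fresh sample from the stream
    # contributes to at most one parameter update (unless later replayed
    # from memory).
    for x, y in stream:
        memory.maybe_store(x, y)
        learner.step([(x, y)] + memory.sample(n=16))

def offline_cl(task_datasets, learner, epochs=10):
    # Offline: unrestricted access to the current task's dataset (but not
    # the previous ones); samples may be revisited over multiple passes.
    for dataset in task_datasets:
        for _ in range(epochs):
            for x, y in dataset:
                learner.step([(x, y)])
```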

Table 1. Here we categorize various recently proposed CL approaches depending on the underlying simplifying assumptions they impose.

Memory Based CL: As mentioned earlier, we only have access to all (or a subset) of the samples corresponding to the current task. This restriction makes it extremely hard for the model to perform well, in particular in the CI-CL setting, as the absence of samples from the previous tasks makes it difficult to learn to distinguish samples of the current task from those of the previous ones due to catastrophic forgetting. Very little forgetting is normally observed in TI-CL, as the given task identifier works as an indicator of the task boundary; thus, the model does not have to learn to differentiate labels across tasks. To reduce forgetting, a common practice, inspired by the complementary learning systems theory [38, 39], is to store a subset of samples from each task and use them while training on the current task. There are primarily two components under this setting: a learner and a memorizer (or sampler). The learner has the goal of obtaining representations that generalize beyond the current task. The memorizer, on the other hand, deals with remembering (storing) a collection of episode-like memories from the previous tasks. In recent approaches [4, 7, 8], the learner is modeled by a neural network and the memorizer by memory slots that store previously encountered samples.
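A minimal sketch of the memorizer side is shown below, using reservoir sampling as one common storage policy (the class name is ours; GDumb's class-balanced sampler in Sect. 3 is an alternative). The interface matches the `maybe_store`/`sample` calls sketched earlier:

```python
import random

class EpisodicMemory:
    """Fixed-capacity memory filled by reservoir sampling over the stream."""
    def __init__(self, capacity):
        self.capacity, self.seen, self.slots = capacity, 0, []

    def maybe_store(self, x, y):
        self.seen += 1
        if len(self.slots) < self.capacity:
            self.slots.append((x, y))
        else:
            j = random.randrange(self.seen)   # keep sample w.p. capacity/seen
            if j < self.capacity:
                self.slots[j] = (x, y)

    def sample(self, n):
        """Draw a replay minibatch to mix into training on the current task."""
        return random.sample(self.slots, min(n, len(self.slots)))
```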

2.2 Recent Trends in Continual Learning

Typically, continual learning approaches are categorized by the way they tackle forgetting: (1) regularization-based, (2) replay (or memory)-based, (3) distillation-based, and (4) parameter-isolation-based (for details, refer to [9]). However, they also vary in terms of the simplifying assumptions they encode, and we argue that keeping track of these assumptions is extremely important for fair comparisons and for understanding the limitations of each of them. Since all these algorithms in some sense use combinations of the simplifying assumptions discussed above, to give a bird's eye view over the recently proposed approaches, we categorize them in Table 1 according to the assumptions they make. For example, Table 1 indicates that RWalk [8] is an approach designed for a CL formulation that is offline, class-incremental, and assumes sharp task boundaries. Algorithmically, it is regularization-based and uses memory. Note that one can potentially modify these approaches to apply to other settings as well; for example, the same RWalk can be used without memory, or can be applied to the task-incremental offline formulation. However, we focus on the formulation these methods were originally designed for. We now discuss some high-level problems associated with the simplifying assumptions these approaches make.

Most models, metrics, classifiers, and samplers for CL inherently encode the disjoint task (or sharp task boundary) assumption into their design, and hence fail to generalize even with slight deviations from this formulation. Similarly, popular metrics like forgetting and intransigence [7, 8] are designed with this specific formulation encoded in their formal definition, and break with simple modifications like blurry task boundaries (class-based, instead of sample-based, definitions of forgetting fall apart as classes mix across the blurred boundaries). Moving to TI-CL v/s CI-CL, these are two extreme cases: CI-CL (single-head) faces scaling issues as there is no restriction on the size of \(\vert \lim _{t\rightarrow \infty } \mathcal {Y}_t \vert \), while TI-CL (multi-head) imposes a fixed, coherent two-level hierarchy among classes with oracle labels. The latter formulation is unrealistic in the sense that it does not allow dynamic contexts [40].

Lastly, Offline CL v/s Online CL is normally defined by whether or not an algorithm is allowed to revisit a sample repeatedly (unless it is in the memory) during the training process. The intention is to make the continual learning algorithm fast enough that it can learn quickly from a single (or few) samples without needing to revisit them. This distinction makes sense if we imagine a data stream emitting samples very fast: the learner then has to adapt itself very quickly. Therefore, the algorithm must provide an acceptable trade-off between the space (number of samples to store) and time (training complexity) budgets. However, because of the lack of proper definitions and evaluation schemes, there are algorithms doing very well on one end (using a sample only once) while performing very poorly on the other (a very expensive learning process). For example, GEM [7], a widely known online CL algorithm, uses a sample only once but solves a quadratic program for parameter updates, which is very time-consuming. Therefore, without proper metrics or procedures to quantify how well various CL algorithms balance both space and time complexities, categorizing them into offline vs online might lead to wrong conclusions.

Algorithm 1. The greedy balancing sampler (presented as pseudocode in the original; a code sketch is given in Sect. 3).

Fig. 1. Our approach (GDumb): The sampler greedily stores samples while balancing the classes. When asked, the learner trains a network from scratch on the memory \(\mathcal {D}_t\) provided by the sampler. If a mask \(\mathbf{m}\) is given at inference, GDumb classifies on the subset of labels provided by the mask. Depending on the mask, GDumb's inference can vary between two extremes: CI (class-incremental) and TI (task-incremental) formulations.

3 Greedy Sampler and Dumb Learner (GDumb)

We now propose a simple approach that does not put any of the restrictions discussed above on the growing nature of the label space, task boundaries, online vs offline, or the order in which the data-stream provides the samples. Thus, it can easily be applied to all the CL formulations discussed in Table 1. The only requirement is to be allowed to store some episodic memories. We emphasize that we do not claim our approach solves the general CL problem. Rather, we experimentally show that this simple approach, which encodes nothing specific to the challenging CL problem at hand, is surprisingly effective compared to other approaches over all the formulations discussed previously, and also exposes important shortcomings of recent formulations and algorithms.

As illustrated in Fig. 1, our approach comprises two key components: a greedy balancing sampler and a learner. Given a memory budget, say k samples, the sampler greedily stores samples from the data-stream (at most k samples) with the constraint of asymptotically balancing the class distribution (Algorithm 1; see the sketch below). It is greedy in the sense that whenever it encounters a new class, the sampler simply creates a new bucket for that class and starts removing samples from the old ones, in particular from the one with the maximum number of samples. Any tie is broken randomly, and a sample is also removed randomly, assuming that each sample is equally important. Note that this sampler relies neither on task boundaries nor on any information about the number of samples in each class.
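The following is a sketch of the sampler reconstructed from this description (Algorithm 1); the names and minor details are ours and may differ from the released implementation:

```python
import random
from collections import defaultdict

class GreedyBalancingSampler:
    def __init__(self, k):
        self.k = k                            # total memory budget
        self.buckets = defaultdict(list)      # one bucket of samples per class

    def num_stored(self):
        return sum(len(b) for b in self.buckets.values())

    def store(self, x, y):
        num_classes = len(set(self.buckets) | {y})
        if y in self.buckets and len(self.buckets[y]) >= self.k / num_classes:
            return                            # class y already has its fair share
        if self.num_stored() >= self.k:       # memory full: evict before adding
            largest = max(len(b) for b in self.buckets.values())
            ties = [c for c, b in self.buckets.items() if len(b) == largest]
            bucket = self.buckets[random.choice(ties)]   # break ties randomly
            bucket.pop(random.randrange(len(bucket)))    # evict a random sample
        self.buckets[y].append((x, y))

    def dataset(self):
        """D_t: everything in memory, on which the learner trains from scratch."""
        return [s for b in self.buckets.values() for s in b]
```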

Let the set of samples greedily stored by the sampler in the memory at any instant in time be \(\mathcal {D}_t\) (a dataset with \(\le k\) samples). Then, the objective of the learner, a deep neural network in our experiments, is to learn a mapping \(f_{{\theta }_t}: \mathbf{x}\rightarrow \mathbf{y}\), where \((\mathbf{x}, \mathbf{y}) \in \mathcal {D}_t\). This way, using the small dataset that the sampler has stored, the learner learns to classify all the labels seen until time t. Let \(\mathcal {Y}_t\) represent the set of labels in \(\mathcal {D}_t\). Then, at inference, given a sample \(\mathbf{x}\), the prediction is made as

$$\begin{aligned} \hat{y} = \mathop {\mathrm {arg\,max}}\, {\mathbf{p}\odot \mathbf{m}}, \end{aligned}$$
(1)

where \(\mathbf{p}\) is the vector of softmax probabilities over all the classes in \(\mathcal {Y}_t\), \(\mathbf{m}\in \{0,1\}^{\vert \mathcal {Y}_t \vert }\) is a user-defined mask, and \(\odot \) denotes the Hadamard product. Note that our prediction procedure allows us to mask any combination of labels at inference. When \(\mathbf{m}\) consists of all ones, the inference is exactly the same as single-head or class-incremental inference, and when the masking is done according to the subset of classes in a particular task, it is exactly the same as multi-head or task-incremental inference. Since our sampler does not put any restrictions on the flow of samples from the data-stream, and our learner does not require any task boundaries, our overall approach puts minimal restrictions on the general continual learning problem. We would also like to emphasize that we do not use the class \(\bar{\mathbf{y}}\) discussed in our general formulation in Sect. 2; we leave that for future work. However, our objective does encapsulate all the recently proposed CL formulations with the fewest possible assumptions, allowing us to provide a fair comparison.
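In code, Eq. (1) is a one-liner; the sketch below illustrates the two extreme masks on the Split-MNIST example (the function name is ours):

```python
import torch

def predict(logits, mask):
    """Eq. (1): arg max over the Hadamard product of probabilities and mask."""
    p = torch.softmax(logits, dim=-1)       # p: probabilities over Y_t
    return torch.argmax(p * mask, dim=-1)

logits = torch.randn(10)                    # |Y_t| = 10 classes seen so far
ci_mask = torch.ones(10)                    # all ones -> class-incremental (CI)
ti_mask = torch.zeros(10)
ti_mask[4:6] = 1.0                          # task id 3 -> classes {4, 5} (TI)
print(predict(logits, ci_mask), predict(logits, ti_mask))
```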

Table 2. Various CL formulations we considered in this work to evaluate GDumb. These formulations differ in terms of simplifying assumptions (refer to Table 1) and also in terms of the resources used. We ensure that the selected benchmarks are diverse, covering all popular categorizations. Note that in B3 and D the memory is not constant: it increases uniformly over tasks by (+size), (xtasks) times.

4 Experiments

We now compare GDumb with various existing algorithms on several recently proposed CL formulations. As shown in Table 1, there are broadly five such formulations \(\{\textit{A}, \textit{B}, \cdots , \textit{E}\}\). Since even within a formulation there can be sub-categories depending on the resources used, we further enumerate them and present a more detailed categorization, keeping fair comparisons in mind, in Table 2. For example, B1 and B2 belong to the same formulation B; however, they differ in terms of the architectures, datasets, and memory sizes used in their respective papers. Therefore, in total, we pick 10 different formulations, most of them having multiple architectures and datasets (refer to Appendix B for details).

Implementation details: GDumb uses the same fixed training settings, with no hyperparameter tuning whatsoever, in all the CL formulations. This is possible because GDumb does not impose any simplifying assumptions. All results measure accuracy (the fraction of correct classifications) evaluated on the held-out test set. For all the formulations, GDumb uses an SGD optimizer, a fixed batch size of 16, learning rates in [0.05, 0.0005], an SGDR [45] schedule with \(T_0\)= 1, \(T_{mult}\)= 2, and a warm start of 1 epoch. Early stopping with a patience of 1 SGDR cycle, along with standard data augmentation, is used. GDumb uses cutmix [46] with p = 0.5 and \(\alpha \) = 1.0 for regularization on all datasets except MNIST. The training set-up comprises an Intel i7 4790, 32 GB RAM, and a single GTX 1070 GPU. All results are averaged over 3 random seeds, each with a different class-to-task assignment. In formulations B2 and B3, we strictly follow the class order specified in iCARL [4] and PODNet [21]. Our pytorch implementation is publicly available at: https://github.com/drimpossible/GDumb.
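In PyTorch, the optimizer and schedule above can be set up as follows (a sketch: the stand-in model and dummy batch are ours, and the full recipe additionally includes the warm start, early stopping, augmentation, and cutmix described above):

```python
import torch

net = torch.nn.Linear(784, 10)             # stand-in for the prescribed network
opt = torch.optim.SGD(net.parameters(), lr=0.05)
# SGDR: cosine annealing with warm restarts, T_0 = 1, T_mult = 2; the learning
# rate anneals from 0.05 down to 0.0005 within each cycle.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt, T_0=1, T_mult=2, eta_min=0.0005)
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(16, 784)                   # dummy batch of size 16
y = torch.randint(0, 10, (16,))
for epoch in range(7):                     # restart cycles of 1, 2, 4 epochs
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()
    sched.step()
```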

Table 3. (CI-Online-Disjoint) Performance on formulation A1.

4.1 Results and Discussions

Class Incremental Online CL with Disjoint Tasks (Form. A): The first sub-category under this formulation is A1, which follows exactly the same setting, on Split-MNIST and Split-CIFAR10, as presented in MIR [11]. Results are shown in Table 3. We observe that on both MNIST and CIFAR10, for all choices of k, GDumb outperforms all the approaches by a large margin. For example, in the case of MNIST with memory size \(k=500\), GDumb outperforms ER-MIR [11] by around 4.3%. Similarly, on CIFAR10 with memory sizes of 200, 500, and 1K, our approach outperforms current approaches by nearly 5%, 6%, and 11%, respectively, convincingly achieving state-of-the-art results. Note that increasing the memory size from 200 to 1K on CIFAR10 increases the performance of GDumb by \(26.3\%\) (expected, as GDumb is trained only on the memory), whereas this increase is only \(18\%\) in the case of ER-MIR [11]. Similar or even smaller gains are observed for other recent approaches, suggesting they might not be utilizing the memory samples efficiently.

We now benchmark our approach on Split-MNIST and Split-CIFAR10 as detailed in the parallel works GMED [12] (sub-category A2) and ARM [41] (sub-category A3). We present results in Table 4 and show that GDumb outperforms parallel works like HAL [35], GMED [12], and ARM [41], in addition to outperforming the recent works GSS [37], MIR [11], and ADI [47], consistently across datasets. It outperforms the best alternatives in GMED [12] by over 4% and 10% on MNIST and CIFAR10, respectively, and in ARM [41] by over 5% and 13% on MNIST and CIFAR10, respectively. Results from ARM [41] indicate that (i) GDumb consistently outperforms other experience replay approaches and (ii) experience replay methods obtain much better performance than generative replay with a much smaller memory footprint.

Table 4. (CI-Online-Disjoint) Performance on formulations A2 (left) and A3 (right).

Class Incremental Offline CL with Disjoint Tasks (Form. B): We proceed next to offline CI-CL formulations. We first compare our proposed approach with 12 popular methods on sub-category B1. Results are presented in Table 5 (left). Our approach outperforms all memory-based methods like GEM, RtF, and DGR by over 5% on MNIST. We outperform the recent RPS-Net [23] and OvA-INN [49] by over 1%, and are as good as iTAML [24], on MNIST. On SVHN, we outperform recently proposed methods like RPS-Net by 4.5%, far exceeding methods like GEM. Note that we achieve the same accuracy as the best offline CL method, iTAML [24], despite using an extremely simple approach in an online fashion.

We now discuss two very interesting sub-categories: B2 (as in iCARL [4]) and B3 (from the very recent work PODNet [21]). The primary difference between B2 and B3 merely lies in the number of classes per task. However, as will be seen, this minor difference changes the complexity of the problem dramatically. In the case of B2, CIFAR100 is divided into 20 tasks, whereas B3 starts with a network trained on 50 classes and then learns one class per task incrementally (leading to 50 new tasks). The performance of GDumb on the B2 and B3 formulations is shown in Table 5 (center) and Table 5 (right), respectively. An interesting observation here is that GDumb, which performed nearly 20% worse than BiC and iCARL on B2, performs 10–15% better than BiC, UCIR, and iCARL on B3. This drastic shift relative to the previous results might suggest that a higher number of classes per task and fewer tasks give an added advantage to scaling/bias-correction type approaches, which otherwise quickly deteriorate over longer sequences of tasks. Furthermore, we note that our simple baseline narrowly outperforms PODNet (CNN) on both the CIFAR100 and ImageNet100 datasets.

Table 5. (CI-Offline-Disjoint) Performance on B1, B2, and B3.

Task Incremental Offline CL with Disjoint Tasks (Form. C): We now compare the performance of GDumb in the task-incremental formulation. Recall that GDumb does not put any restrictions such as task vs class incremental, or online vs offline. However, to mimic the TI-CL formulation, we apply a mask (the subset of labels in a task) over the softmax probabilities at test time.

Table 6 (left) shows the results on C1, the most widely used offline TI-CL (or multi-head) formulation for CL, on Split-MNIST. We then move to C2 on Split-TinyImagenet (the offline TI-CL formulation as in [9]). Note that in this particular formulation we used a different architecture, DenseNet-100-BC [54]. Results are presented in Table 6 (middle). We observe that for \(k=9000\), we outperform all 10 approaches, including GEM and iCARL, by margins of at least 7%. When the memory is halved to \(k=4500\), we perform slightly better than GEM and nearly 3% worse than iCARL. Since we used a different architecture, we do not claim that we would see similar improvements had we trained GDumb using the networks used in the respective papers. However, these results are still encouraging, as the approaches we compare against are trained in a TI-CL manner while GDumb is always trained in the (much more difficult) CI-CL manner.

Task Incremental Online CL with Disjoint Tasks (Form. D): We now compare GDumb with TI-CL-tuned online approaches using a small memory (not a favourable setting for GDumb, as it relies entirely on the samples in the memory) and detail the results in Table 6 (right). We observe that GDumb outperforms 8 out of 11 approaches even though it is trained in a CI-CL manner.

Table 6. (TI-Offline-Disjoint) Performance on C1 (left) and C2 (middle). (TI-Online-Disjoint) Performance on D (right).

Class Incremental Online CL with Joint Tasks (Form. E): We now measure the impact of an imbalanced data stream with blurry task boundaries [37]. Results are presented in Table 7 (left). We outperform competing models by over 10% and 16%, overwhelmingly surpassing complicated methods attuned to this benchmark. This demonstrates that GDumb works well even when almost all simplifying assumptions are removed.

Table 7. (CI-Online-Joint) Performance on E (left). Note that this is particularly challenging, as the tasks here are non-disjoint (blurry task boundaries) with class imbalance. On the (right), we benchmark resource consumption in terms of training time and memory usage. Memory cost is given in terms of the total parameters P, the size of the minibatch B, the total size of the network hidden state H (assuming all methods use the same architecture), and the size of the episodic memory M per task. GDumb, at the very least, is 7.5\(\times \) faster than the existing efficient CL algorithms.

4.2 Resources Needed

It is important that our approach stays within the ballpark of online continual learning constraints on memory and compute while achieving its performance. We benchmark our resource consumption against efficient CL algorithms in Table 7 (right), benchmarked with a V100 GPU on formulation E [36]. We observe that we require only 60s on a slower GTX 1070 GPU (and 350s on a 4790 i7 CPU), performing several times more efficiently than various recently proposed algorithms. Note that the sampling time is negligible, while testing time is not included in the above.

4.3 Potential Future Extensions

Active Sampling: Given an importance value \(v_t \in \mathbb {R}^+\) (from an active learner) along with the sample (\(x_t, y_t\)) at time t, we can extend our sampler to store the most important samples (maximizing \(\sum _{i=1}^{|\mathcal {D}_c|} v_i\)) for any given class c in its storage of size k. This would allow the algorithm to reject less important samples; see the sketch below. Of course, it is not clear how to learn to quantify the importance of a sample.
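One hypothetical way to realize this per-class storage (all names are ours):

```python
import heapq, itertools

class ActiveBucket:
    """Per-class storage keeping the `cap` most important samples seen so far."""
    def __init__(self, cap):
        self.cap, self.heap = cap, []       # min-heap keyed on importance v
        self._tie = itertools.count()       # tie-breaker: avoids comparing samples

    def store(self, v, x, y):
        item = (v, next(self._tie), x, y)
        if len(self.heap) < self.cap:
            heapq.heappush(self.heap, item)
        elif v > self.heap[0][0]:           # reject/evict less important samples,
            heapq.heapreplace(self.heap, item)  # maximizing the stored sum of v_i
```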

Dynamic Probabilistic Masking: It is possible to extend the masking in GDumb beyond CI-CL and TI-CL to dynamic task hierarchies across video/scene types, useful in recently proposed settings [40]. Since GDumb applies a mask (given a context) only at inference, it can dynamically adapt to the context. Similarly, we can extend GDumb beyond deterministic oracles (\(m_i \in \{0,1\}\)) to probabilistic ones (\(m_i \in [0,1]\)). This delivers a lot of flexibility to handle diverse extensions like cost-sensitive classification and class imbalance, among others.
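Since the inference rule of Eq. (1) is unchanged, a soft mask drops in directly (an illustration with mask values of our own choosing, not a prescribed recipe):

```python
import torch

p = torch.softmax(torch.randn(4), dim=-1)   # probabilities over 4 classes
hard = torch.tensor([0., 0., 1., 1.])       # deterministic oracle (TI-CL style)
soft = torch.tensor([1.0, 0.2, 1.0, 0.7])   # probabilistic oracle: e.g.,
                                            # down-weight an over-represented
                                            # or high-cost class
print(torch.argmax(p * hard), torch.argmax(p * soft))
```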

5 Conclusion

In this work, we provided a general view of the continual image classification problem. We then proposed a simple and general approach with minimal restrictions and empirically showed that it outperforms almost all of the complicated state-of-the-art approaches on the very formulations they were specifically designed for. We hope that our approach serves as a strong baseline against which to benchmark the effectiveness of any newly proposed CL algorithm. Our results also raise several concerns to be investigated: (1) even though there are plenty of research articles focused on specific scenarios of the CL problem, are we really progressing in the right direction? (2) Which formulation should we focus on? (3) Do we need experimental formulations, more complex than the current ones, in which the effectiveness of recent CL models, if any, would be pronounced?