
1 Introduction

Deep learning (DL) [1] is one of the most widely used learning methods in the field of machine learning, and it has made significant contributions to the development of pattern recognition technologies. In particular, deep neural networks (DNNs) [2] are nowadays prevalent in both academic studies and industrial applications, and many image classifier models, such as VGG [3], ResNet [4], and AlexNet [5], have been used across a wide range of fields. In many cases, the performance of a classification model is evaluated after an iterative training performed on the entire set of training patterns. In practice, however, it is often the case that the training patterns are not available for all the classes at once but are given only for some classes at a time, with those of the remaining classes arriving later. It is therefore desirable to perform continual learning (CL) [6], in which the network training continues with the incoming new patterns. Within the CL principle, a number of class incremental learning (CIL) [7] approaches have been proposed to date [8,9,10]. In CIL, a pattern classifier is augmented with new classes, using only an additional set of patterns for those classes. However, it has been increasingly acknowledged, as pointed out in [11,12,13], that catastrophic forgetting of previously learned classes is a major obstacle to performing CIL/CL with DL models.

According to the taxonomy proposed in [14], the CIL approaches for DL proposed to date can be classified into three categories: i) data-centric, ii) model-centric, and iii) algorithm-centric. For i), the CIL method in [8] introduces maximum entropy regularization (MER) into the loss function in order to prevent overfitting to uncertain knowledge, penalizing overconfident fitting. In contrast, the method in [9] utilizes a dynamically expandable representation (DER) and can be categorized as a type ii) approach, based upon the classification in [14]. Within that method, the representation previously learned by a DNN is frozen and new feature extractors are added and learned for the additional tasks. For iii), the method in [10], i.e. weight aligning (WA), maintains fairness between the previously learned and new classes. Another model-centric CIL method is elastic weight consolidation (EWC) [15], where it is claimed that old tasks can be memorized by slowing down the learning pace of selected network parameters, although the work in [14] reports simulation results that run counter to this claim. In addition, replay methods have received much attention in CIL for DNNs, among which iCaRL [11] is particularly widely used. iCaRL attempts to overcome catastrophic forgetting by adding some previously used training data to the newly incoming data. Although several methods have been proposed to date, the catastrophic forgetting inherent to DNNs has still not been fully overcome.

In contrast to the DNN approaches described above, the probabilistic neural network (PNN) [16] is another artificial neural network model, originally proposed for pattern classification problems. The original PNN model has only a single hyperparameter to be given a priori, and its training is completed straightforwardly in a one-shot manner: each training pattern is accommodated as the attribute vector of a second-layer unit. Moreover, a recent study [17] shows that a parallel implementation of a PNN using k-means clustering enables its reference (testing) mode to be performed with efficiency superior to that of a DNN, by exploiting the independence of the arithmetic operations across the hidden layer units, as well as across the output layer units, each representing a single class.

In this paper, we propose a novel class incremental learning algorithm for a PNN with a reduced number of hidden layer units (a.k.a. compact-sized PNN; CS-PNN) and empirically verify its effectiveness by comparison with a DNN and with a PNN trained by the original scheme. The organization of the paper is as follows: Sect. 2 describes what CIL is, followed by a brief introduction of the PNN and a comparison between CIL on a PNN and on a DNN. Section 3 proposes a new CIL algorithm for the PNN. Section 4 is devoted to the simulation study, i) comparing a DNN and the original PNN under the CIL situations, each yielding a baseline, and then ii) evaluating the classification performance of the CS-PNN. Section 5 provides a summary of the present work and suggests some future directions.

2 Preliminary Studies

2.1 Class Incremental Learning

In [11], CIL is defined as the task of incrementally learning some additional classes for an already trained model, so that the newly added classes can also be identified by the model, besides the classes already learned. For instance, provided that a model has been trained for the classification of handwritten digits but covers the task only partially, e.g. the digits from zero to eight, an additional learning step is performed on the model so that the new digit nine can also be classified, in addition to the nine digits already learned, as illustrated in Fig. 1.

Fig. 1. The flow of the class incremental learning.

Fig. 2. (a) The structure of a PNN with a total number of \(N_{tr}\) training patterns and \(N_c\) classes, and (b) the PNN represented as a composite structure consisting of \(N_{c}\) class-independent sub-networks.

2.2 Probabilistic Neural Network

A PNN is a feed-forward neural network composed of an input layer, an output layer with linear-sum activation units, and a single hidden layer with nonlinear units. Figure 2 (a) illustrates the structure of a PNN. In Fig. 2, each of the hidden layer units has a radial basis function (RBF), in the form of a Gaussian response function, as its activation function:

$$\begin{aligned} h_{j}( \boldsymbol{x} )=\exp \left( - \frac{\Vert \boldsymbol{x}- \boldsymbol{c_{j}} \Vert ^{2}_{2}}{\sigma ^{2}} \right) \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _{2}\) denotes the L2-norm, \(h_j\), \(\boldsymbol{x}\), and \(\boldsymbol{c_j}\) are respectively the output value of the j-th hidden layer unit, the input vector, and the attribute vector representing one of the training patterns, and \(\sigma \) is the radius common to all the RBFs. Each hidden layer unit is connected to the output layer unit corresponding to the class to which the training pattern belongs. In a PNN, the weight between a hidden layer unit and an output layer unit is set as follows:

$$\begin{aligned} w_{j, k} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\, \boldsymbol{c_j}\, \text {belongs to class}\, k, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)

In the output layer, each unit yields the activation value:

$$\begin{aligned} o_{k} = \frac{1}{U_k} \sum ^{U_{k}}_{j=1} w_{k(j), k}h_{k(j)}( \boldsymbol{x} ) \end{aligned}$$
(3)
$$\begin{aligned} N_{tr} = \sum ^{N_{c}}_{k=1} U_{k} \end{aligned}$$
(4)

where \(w_{k(j), k}\) is the weight between the j-th RBF belonging to class k and the k-th output layer unit, \(U_k\) is the number of training vectors for class k, \(N_{tr}\) is the total number of training vectors over all the classes, and k(j) is the index of the j-th RBF belonging to class k (i.e. the RBFs connected to \(o_k\)). In a PNN, since both the hidden and output layer units are completely separated from class to class, the network can be regarded as a composite network consisting of class-independent subnetworks, as shown in Fig. 2 (b). Moreover, the training of the original PNN is completed in only two steps: i) storing each training vector as the attribute of a hidden layer unit and ii) determining \(\sigma \) in (1) appropriately.
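As a concrete illustration, the following is a minimal NumPy sketch of the original PNN defined by Eqs. (1)-(4); the class and variable names are illustrative and not taken from any reference implementation, and the shared radius sigma is assumed to be given a priori.

```python
import numpy as np

class PNN:
    """Minimal probabilistic neural network following Eqs. (1)-(4)."""

    def __init__(self, sigma):
        self.sigma = sigma      # radius shared by all RBFs, Eq. (1)
        self.centers = {}       # class k -> array of attribute vectors c_j

    def fit(self, X_train, y_train):
        # One-shot training: every training vector is stored as the
        # attribute vector of one hidden-layer (RBF) unit of its class.
        for k in np.unique(y_train):
            self.centers[k] = X_train[y_train == k].copy()
        return self

    def predict(self, x):
        # Eq. (3): each output unit averages the Gaussian responses, Eq. (1),
        # of the RBF units belonging to its own class only.
        scores = {}
        for k, C in self.centers.items():
            d2 = np.sum((C - x) ** 2, axis=1)             # squared L2 distances
            scores[k] = np.mean(np.exp(-d2 / self.sigma ** 2))
        return max(scores, key=scores.get)                # winning class
```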

2.3 Class Incremental Learning on DNN and PNN

For training a DNN, DL aims to minimize the output error with respect to \(\boldsymbol{w}\), i.e. the set of network parameters, given the input vector \(\boldsymbol{x}\):

$$\begin{aligned} \boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \frac{\partial E(\boldsymbol{x}, \boldsymbol{w})}{\partial \boldsymbol{w}} \end{aligned}$$
(5)

where E is the error function, for which categorical cross entropy is often used in classification problems, and \(\eta \) is the learning rate. When the network parameters are subsequently optimized for a new set of training patterns, the knowledge acquired from the new patterns can easily override the weight and bias parameters optimized using the previous patterns. This is the main cause of the catastrophic forgetting occurring in a DNN. Therefore, when CIL is performed on a DNN, a problem arises in pattern classification tasks: the classification accuracy for the previously learned classes is apt to decline dramatically, since the model becomes optimized only for the particular set of training patterns of the new class.

In contrast to DNNs, a PNN does not suffer from such forgetting, since network training in a PNN essentially proceeds by only adding hidden layer units, each storing a training vector as the attribute vector of its RBF, as described earlier. Moreover, when the given dataset is well-balanced, the feature space can be reasonably covered by the set of local subspaces spanned by the respective attribute vectors, each stored in a hidden layer unit, and thereby an additional learning step such as CIL can be carried out effectively and straightforwardly.

3 Class Incremental Learning Algorithm for a Compact-Sized Probabilistic Neural Network

Here, we propose a new training method for the PNN, i.e. a compact-sized probabilistic neural network (CS-PNN) for CIL, which reduces the number of units in the hidden layer. In the CS-PNN, each unit in the hidden layer memorizes the location of a selected training vector in the feature space. The number of hidden layer units is suppressed by updating the location of an existing attribute vector whenever a new training vector is located nearby, instead of adding the new vector as a new attribute vector. The training algorithm for the CS-PNN is summarized as follows.

  • step 1. For the first training pattern belonging to a new class k (\(k=1,2,\dots ,N_c\)), add \(h_{k(i=1)}\) into the hidden layer with \(\boldsymbol{c_{k(i=1)}} = \boldsymbol{x_{tr(k,i=1)}}\), and \(o_k\) into the output layer.

  • step 2. For \(i=2,3,\dots ,U_k\), perform the following for all the remaining training patterns:

    • If \(\exists h_k; h_k (\boldsymbol{x_{tr(k,i)}} )>\theta _k\), then update the attribute vector \(\boldsymbol{c_j}\) of the unit \(h_j\) that yields the maximum output value among the existing units belonging to class k, setting it to the average of \(\boldsymbol{x_{tr(k,i)}}\) and \(\boldsymbol{c_j}\), as follows:

      $$\begin{aligned} \boldsymbol{c_{j}} \leftarrow \frac{ \boldsymbol{x_{tr(k, i)} } + \boldsymbol{ c_{j} } }{2} \end{aligned}$$
      (6)
    • Otherwise, add \(h_{k(i)}\) into the hidden layer with \(\boldsymbol{c_{k(i)}} = \boldsymbol{x_{tr(k,i)}}\), add \(o_k\) to the output layer if it does not exist yet, and connect \(h_{k(i)}\) to \(o_k\) (i.e. with a weight of unity in between).

Note that applying the CS-PNN algorithm above automatically generates the units in both the hidden and output layers, where necessary. Also, since the parameters \(\boldsymbol{c_j}\) (\(j=1,2,\dots ,U_k\); \(U_k\) is the number of training patterns belonging to class k) are updated only within the subnet representing a single class, the CIL is performed without affecting the attribute vectors stored in the other subnets (i.e. those representing the other classes). Moreover, during the operation of the above CIL algorithm, only the attribute vector of the unit nearest to the training vector is updated, while the vectors stored in the other units remain intact, so that only a small and necessary change occurs in the spanned feature space. In addition, since only the attribute vectors of the units with activations larger than the threshold are updated, a smaller value of \(\theta \) can lead to a further reduction in the number of hidden layer units, depending upon the situation. Note, however, that there is a trade-off with performance, since, with a small setting of \(\theta \), the chance increases that an attribute vector located far from the primary feature subspace of a particular class is selected for the update. For this reason, care should be taken in the choice of the value of \(\theta \), besides the unique radius \(\sigma \) in (1).
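The following is a minimal NumPy sketch of the CS-PNN training rule above; no reference implementation is reproduced here, and the class and method names, as well as the handling of the threshold via a single scalar theta, are our own illustrative choices.

```python
import numpy as np

class CSPNN:
    """Sketch of the CS-PNN class-incremental training rule (Sect. 3)."""

    def __init__(self, sigma, theta):
        self.sigma = sigma      # shared RBF radius, Eq. (1)
        self.theta = theta      # activation threshold for merging
        self.centers = {}       # class k -> list of attribute vectors c_j

    def _activations(self, k, x):
        C = np.asarray(self.centers[k])
        return np.exp(-np.sum((C - x) ** 2, axis=1) / self.sigma ** 2)

    def partial_fit_class(self, k, X_k):
        """Incrementally learn class k from its training vectors X_k."""
        for x in X_k:
            if k not in self.centers:
                # Step 1: the first pattern of a new class creates the subnet.
                self.centers[k] = [x.copy()]
                continue
            h = self._activations(k, x)
            j = int(np.argmax(h))
            if h[j] > self.theta:
                # Step 2 (update): move the nearest unit of class k to the
                # midpoint between its attribute vector and x, Eq. (6).
                self.centers[k][j] = (self.centers[k][j] + x) / 2.0
            else:
                # Step 2 (add): otherwise store x as a new RBF unit.
                self.centers[k].append(x.copy())

    def predict(self, x):
        # Same class-wise averaged Gaussian readout as the original PNN, Eq. (3).
        scores = {k: float(np.mean(self._activations(k, x)))
                  for k in self.centers}
        return max(scores, key=scores.get)
```

Since only the subnet of the class currently being learned is touched, calling partial_fit_class once per incoming class realizes the CIL behaviour described above.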

4 Simulation Study

We conducted a series of simulations aimed at comparing the CIL capabilities of the DNN, the original PNN, and the proposed CS-PNN.

Table 1. Summary of the seven datasets used for the simulation study.

In the simulation study, we used seven publicly available datasets: abalone, isolet, letter-recognition, MNIST, optdigits, pendigits, and wdbc; one obtained from MNIST [18] and the remaining six from the UCI machine learning repository [19]. Each pattern vector of a dataset was then normalized to the range \([-1,1]\):

$$\begin{aligned} x_{i} \leftarrow 2 \left( \frac{x_i - x_{i \textrm{MIN}}}{x_{i \textrm{MAX}}-x_{i \textrm{MIN}}} - 0.5 \right) \end{aligned}$$
(7)

where i is the index of an attribute (element) within a pattern vector. The properties of the seven datasets used for the simulations are summarized in Table 1. The unique hyper-parameter \(\sigma \) of a PNN was set by the following equation; this setting was used throughout all the simulations and yielded a relatively reasonable performance for each dataset:

$$\begin{aligned} \sigma = \frac{d_{\textrm{MAX}}}{N_{c}} \end{aligned}$$
(8)

where \(d_{\textrm{MAX}}\) is the maximum distance computed using all the pairs of the training patterns across all the classes used for training, and \(N_c\) is the number of classes.
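As an illustration, the preprocessing in (7) and the setting of \(\sigma \) in (8) can be sketched as follows; the helper names are ours, and the pairwise-distance computation is written naively, so it would need chunking for large datasets such as MNIST.

```python
import numpy as np

def normalize_minmax(X):
    # Eq. (7): scale each attribute (column) of the pattern matrix to [-1, 1].
    # Assumes every attribute has a non-zero range over the dataset.
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return 2.0 * ((X - x_min) / (x_max - x_min) - 0.5)

def pnn_sigma(X_train, n_classes):
    # Eq. (8): sigma = d_MAX / N_c, with d_MAX the largest pairwise distance
    # over all training patterns (O(N^2) memory; chunk for large datasets).
    diff = X_train[:, None, :] - X_train[None, :, :]
    d_max = np.sqrt((diff ** 2).sum(axis=-1)).max()
    return d_max / n_classes
```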

For training a DNN, the entire set of training vectors was divided into subsets by class, and each subset was presented to the network as one batch at a time. Once an iterative training session using the batch of a certain class was completed, the batch of another class was used for the next iterative training session, and these subsequent training sessions continued until the last class.
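A hedged sketch of this class-by-class presentation protocol for the DNN baseline is given below, assuming a PyTorch model, optimizer, and loss function have already been constructed and that the per-class batches are available as tensors; the actual architecture and data handling used in the simulations are not reproduced here.

```python
import torch

def train_class_incrementally(model, optimizer, loss_fn, batches_by_class,
                              epochs_first=20, epochs_rest=20):
    """Present the training vectors of one class at a time, as in Sect. 4."""
    model.train()
    for task_id, (X_k, y_k) in enumerate(batches_by_class):
        n_epochs = epochs_first if task_id == 0 else epochs_rest
        for _ in range(n_epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(X_k), y_k)   # only the current class is seen
            loss.backward()
            optimizer.step()                  # parameter update in the spirit of Eq. (5)
```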

4.1 Comparison Between the DNN and Original PNN Under the CIL Situation

In this simulation, we first compared the DNN and PNN on the baseline CIL tasks. For these tasks, the batch for each class, up to the last one, was presented in turn to the DNN/PNN, and the CIL was performed at each presentation. The classification accuracy after completing the CIL was then evaluated for each of the two models, using the testing vectors. In these tasks, therefore, each training pattern vector was stored, as described earlier, as the attribute vector of a unit in the hidden layer of the PNN, while the parameter setting, as well as the training, of the DNN was done in the following manner:

  • Number of the Hidden Layers: varied from 1 to 3.

  • Training Algorithm: Adam [20] (\(\mathrm {learning \, rate}=0.01\), \(\beta _1=0.9\), \(\beta _2=0.999\)).

  • Number of Epochs in an Iterative Training Session:

    • Patterns for the first class (i.e. the first CIL task): 20 epochs.

    • Patterns for the other classes (i.e. the remaining tasks): one epoch / 20 epochs.

In the DNN setting above, the three hyper-parameters of Adam, which is considered one of the state-of-the-art DL algorithms, were set to the default values used in PyTorch [21], except for the learning rate, whereas the numbers of epochs were chosen based upon preliminary simulations; it was empirically confirmed that 20 epochs were sufficient to reach a convergent state for each run. The simulation results are summarized in Table 2. In this table, we clearly see that a DNN is not able to perform the CIL tasks properly at all, where the number of classes was increased one by one. On the other hand, the classification accuracies obtained using a PNN were always higher than those obtained using the DNNs for all the seven datasets used in the simulation study. As shown in Table 2, the classification accuracy of the PNN was around 2.5 times as high as those of the DNNs for the wdbc case, and 26.5 times higher for the letter-recognition case. For the CIL tasks using MNIST, we then analyzed the per-class classification accuracies obtained by a DNN with three hidden layers, with iterative training of 20 epochs in each additional training task. In this analysis, we confirmed that the accuracies for the previously learned classes all dropped to zero, except the 100% accuracy for the new class just learned. It can therefore be said that catastrophic forgetting occurred in an early stage of learning in each of the DNN cases, since there was no significant difference between the classification accuracies for the one-epoch and twenty-epoch cases, as shown in Table 2. From these observations, it can be concluded that a PNN is capable of performing the CIL without any serious forgetting, as reported in [22], while the DNNs are not.

Table 2. Simulation results comparing the classification accuracies obtained after completing the training for all the classes, showing the baseline performance of the DNN and PNN under the CIL situation.
Table 3. Simulation results comparing the classification accuracy and the number of the RBFs using the original PNN and CS-PNN for each of the seven datasets.
Table 4. Comparison of relative difference in classification accuracy between the original PNN and CS-PNN, and reduction rate of hidden layer units (RBFs).

4.2 Evaluation of Classification Accuracy and Reduction Rate of Hidden Layer Units for the Proposed CS-PNN

We then conducted another set of simulations in order to validate the classification performance of the CS-PNN. The simulation results are shown in Tables 3 and 4 and Figs. 3 and 4.

Table 3 summarizes both the classification accuracies and the number of hidden layer units after incrementally training a PNN by applying the proposed algorithm (i.e. the CS-PNN) in Sect. 3.

Table 4 shows a comparison of both the relative difference in classification accuracy and the reduction rate of RBFs between the original PNN and the CS-PNN, calculated using the values in Table 3. The relative difference and the reduction rate were calculated, respectively, as follows:

$$\begin{aligned} \text {Relative Difference} = \text {Acc. of original PNN} - \text {Acc. of CS-PNN} \end{aligned}$$
(9)
$$\begin{aligned} \text {Reduction Rate} = 1 - \frac{\text {Num. RBFs in each CS-PNN}}{\text {Num. RBFs in original PNN}} \end{aligned}$$
(10)
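For clarity, the two metrics can be computed as follows (a trivial helper; the argument names are ours):

```python
def relative_difference(acc_original_pnn, acc_cspnn):
    # Eq. (9): difference in classification accuracy (percentage points).
    return acc_original_pnn - acc_cspnn

def reduction_rate(num_rbfs_cspnn, num_rbfs_original_pnn):
    # Eq. (10): fraction of hidden-layer units (RBFs) saved by the CS-PNN.
    return 1.0 - num_rbfs_cspnn / num_rbfs_original_pnn
```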

Here, the relative difference in (9) is introduced for a straightforward comparison between the classification accuracy of the original PNN and that of the CS-PNN; a minus value indicates that the accuracy of the CS-PNN is inferior to that of the original PNN. As shown in the bottom two rows of Table 4, the average relative difference for each value of \(\theta \) in the CS-PNN was \(-7.06\%\), \(-4.86\%\), \(-4.19\%\), and \(-1.77\%\), respectively, whereas the corresponding medians were \(-5.63\%\), \(-1.22\%\), \(-0.83\%\), and \(-0.06\%\). Therefore, overall, a dramatic performance degradation as in the DNNs was not observed for the CS-PNN, compared to the original PNN. In Table 4, however, it is observed that the decrease in classification accuracy for both the abalone and wdbc datasets is relatively larger than that for the other datasets. A possible reason for this deterioration is the relatively higher reduction rate of the hidden layer units, as shown in Table 4, and/or the removal of units having a significant impact on the classification performance. In contrast, the accuracies obtained using the CS-PNN for the other five datasets, i.e. isolet, letter-recognition, MNIST, optdigits, and pendigits, all remain almost intact, each with a relatively high reduction rate of the RBFs (except the isolet case with \(\theta =0.6\)), as shown in Table 4.

Fig. 3. Classification accuracies obtained using the original PNN, the CS-PNN (\(\theta \)=0.6, 0.7, 0.8, 0.9), and iCaRL on a DNN (\(K = 0.2 \times \text {the number of all training data}\)), with the number of classes varied, for the abalone, isolet, letter-recognition, and MNIST datasets.

Fig. 4. Classification accuracies obtained using the original PNN, the CS-PNN (\(\theta \)=0.6, 0.7, 0.8, 0.9), and iCaRL on a DNN (\(K = 0.2 \times \text {the number of all training data}\)), with the number of classes varied, for the optdigits, pendigits, and wdbc datasets.

On the other hand, the reduction rates for both the abalone and wdbc (except for \(\theta \)=0.9) were over 90%, while a significant decrease of over 10% in classification accuracy was also observed. In contrast, the relative difference in classification accuracy for the five cases of isolet, letter-recognition, MNIST, optdigits, and pendigits almost always remained smaller in magnitude than the corresponding average, while the reduction rates of the RBFs varied greatly with the setting of \(\theta \), as shown at the bottom of Table 4. This indicates that an appropriate setting of \(\theta \), taking into account the overall distribution of the distances between the input and attribute vectors, is necessary in order to effectively reduce the number of RBFs. Therefore, as a rule of thumb in practice, it is considered that the value of the unique radius \(\sigma \) should first be tuned to yield a higher classification accuracy, and then \(\theta \) varied for an effective reduction in the number of RBFs.

In sum, from these observations, it can be said that the CS-PNN effectively selects the training pattern vectors to be accommodated within the hidden layer units, while maintaining relatively well-separated class boundaries, compared to the original PNN approach.

Figures 3 and 4 show the changes in classification accuracy during the CIL with the original PNN, the CS-PNN, and the DNN with iCaRL. In this simulation, the memory size K for iCaRL, i.e. the number of stored data for the subsequent additional training tasks, was set to 0.2 times the number of all training data in each dataset. In Figs. 3 and 4, we observe that for all the datasets except abalone the classification accuracy of the CS-PNN almost always stays above 70–80%. Moreover, as shown in Figs. 3 and 4, the accuracy for the four datasets other than abalone, isolet, and wdbc improved steadily, albeit sometimes exhibiting a sudden drop at an earlier CIL task, and remained relatively high afterwards, unlike the DNN without iCaRL. In addition, it was confirmed that the accuracy of iCaRL consistently decreased with each additional training task for the letter-recognition, optdigits, and pendigits cases, compared to the CS-PNN. In the isolet case, however, the accuracy of the CS-PNN was lower than that of iCaRL in the early stages, though it approached that of iCaRL as the additional training tasks proceeded. A similar trend was observed for the MNIST case. In contrast, for abalone and wdbc, the CS-PNN and iCaRL exhibited no significant difference in accuracy. Therefore, it can be said that the CS-PNN performed effectively for all the seven datasets used in the simulation study, compared to the CIL of the DNN with iCaRL.

5 Conclusion

In this work, we have first shown that a PNN is capable of performing the CIL, through a simulation study using seven publicly available datasets, in comparison with DNNs. We have then proposed the CS-PNN, for the purpose of effectively reducing the number of hidden layer units in a PNN while maintaining a reasonably high classification performance. It has also been observed that performing CIL is virtually not possible with a bare DNN approach using the Adam algorithm in any of the cases, due to the catastrophic forgetting that occurred during the simulation, while the CS-PNN copes moderately well with the CIL. Compared to iCaRL, the CS-PNN also performs robustly in the CIL tasks. Moreover, it is worth mentioning that, unlike DNNs, the training of a PNN is fast, as it does not require any iterative tuning of the network parameters but is simply done by assigning selected training data to the attribute vectors of the hidden layer units, as described in this work. It is also notable that, besides the hidden layer units, the network obtained via the CS-PNN has a varying number of output units during a CIL task, unlike conventional, fixed-sized DNN models. In addition, it is also reported in [23] that a PNN exhibits high robustness against adversarial attacks.

Future work will be directed towards the investigation of an effective choice of the unique radius \(\sigma \) of a PNN, as well as one applicable to the CS-PNN.