
1 Introduction

Machine learning (ML) algorithms aim to train a computer to make decisions [1]. These algorithms are used in various fields such as image processing [2, 3], object recognition [4], handwriting recognition [5], natural language processing [6] and even quantum computing [7]. There are two major techniques for decision making used by ML algorithms—(i) supervised learning where the algorithm is initially trained with a set of labeled data [8], and (ii) unsupervised learning where the algorithm looks for patterns and similarities and tries to group similar data in the same cluster without any prior learning phase [9]. Although these two are the major types of learning algorithms, other forms of learning algorithms such as semi-supervised learning [10] and reinforcement learning [11] are also widely studied.

Artificial neural network (ANN) algorithms are supervised algorithms which mimic the working principles of neurons in the human brain. The most basic ANN algorithm is the single-layer perceptron. More sophisticated algorithms use multiple layers, more complicated decision functions, and take the error of the output into account when updating the weights. Although these algorithms assume a single initial training phase, in the real world, data often arrives in batches, which requires the training to be performed in multiple stages. The model is first trained with the initial set of training data, and when more training data becomes available, the model is trained further. Such a model of training is called incremental learning [12].

In this paper, we study incremental learning in ANNs where the training dataset may be noisy, i.e., some of the training samples have incorrect labels. Clearly, if the training dataset is erroneous, the training will be less effective and the ANN will be less accurate in its predictions. We show, however, that not only the amount of erroneous data but also the location of the errors in incremental learning plays a vital role in the performance of the algorithm. The accuracy of an algorithm varies even when the percentage of error is the same but the errors are concentrated in different locations of the training set.

In this paper, we show numerically by using three ANN algorithms (perceptron [13], feedforward neural network (FFN) [14, 15] and radial basis function neural network (RBF) [15]) that (i) for the same percentage of erroneous data, the location of error clusters can significantly alter the performance of the algorithm, (ii) the performance degradation is independent of the number of features per data and (iii) although more sophisticated algorithms are more robust to errors, the degradation in performance due to concentrated error has a similar nature for all algorithms.

The rest of the paper is organized as follows—in Sect. 2, we give a brief description of the three ANN algorithms used. In Sects. 3 and 4, we show the performance of the algorithms for two-step and three-step incremental learning. We conclude in Sect. 5.

2 Brief Review of the ANN Algorithms Used

An m-class classification problem [16], where C1, C2, …, Cm are the object classes, is associated with k training samples and n testing samples. Each training sample is a pair (si, Csi), where Csi is the designated class of the sample si. After training the algorithm with this set of training data, for each testing sample ti belonging to class Cj, the algorithm is expected to produce

$$\text{Prob}(t_{i} \in C_{j}) > \text{Prob}(t_{i} \in C_{l}) \quad \forall\, l \ne j.$$
(1)

The algorithm is said to have made an error in its prediction if, for some testing sample tp ∈ Cp, it produces Prob(tp ∈ Cp) < Prob(tp ∈ Cq) for some \(q \ne p\). The objective of learning is to minimize the number of such errors. In the remainder of this section, we briefly discuss the working principles of the three ANN algorithms (perceptron, FFN and RBF) used in this paper. Our motivation behind using these three algorithms is to show that although the sophistication of an algorithm enhances its robustness to training errors, the performance loss due to the location of error concentration remains invariant under the type of algorithm used.
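As a small illustration of this error criterion, the following sketch (our own, not taken from the paper) turns the per-class probabilities produced by a trained classifier into a predicted class and flags a misclassification exactly when the condition in Eq. (1) fails for the true class.

import numpy as np

# Illustrative sketch: the prediction for a test sample is the class with the
# highest probability; an error is recorded when it differs from the true class.
def is_misclassified(class_probs, true_class):
    predicted_class = int(np.argmax(class_probs))  # class with the largest Prob(t in C_j)
    return predicted_class != true_class

# Example: three classes, true class index 1, but class 2 receives the largest probability.
print(is_misclassified(np.array([0.2, 0.3, 0.5]), true_class=1))  # prints True (an error)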

Perceptron Learning Algorithm. The perceptron is one of the simplest ANN algorithms. In this algorithm, an input is a vector (x1, x2, …, xn), where each xi is called a feature, and each feature is associated with a weight wi. For a 2-class classification problem, the output is

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_{i} x_{i} \ge \theta \\ 0 & \text{otherwise} \end{cases}$$
(2)

where θ is a threshold value. The weights wi are initialized randomly and are modified during the training phase so that the output matches the class label of each training sample. This algorithm can easily be extended to multi-class classification.
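A minimal sketch of this training rule is given below; the learning rate, the number of epochs and the threshold update are our own illustrative choices and are not taken from the paper.

import numpy as np

# Minimal perceptron sketch for a 2-class problem implementing Eq. (2).
# Learning rate and epoch count are illustrative assumptions.
def train_perceptron(X, y, epochs=50, eta=0.1):
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1])   # weights w_i, initialized randomly
    theta = 0.0                           # threshold
    for _ in range(epochs):
        for x_i, label in zip(X, y):
            output = 1 if np.dot(w, x_i) >= theta else 0
            # weights change only when the prediction disagrees with the label
            w += eta * (label - output) * x_i
            theta -= eta * (label - output)
    return w, theta

def predict_perceptron(X, w, theta):
    return (X @ w >= theta).astype(int)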

Feedforward Neural Network Algorithm. In an FFN, the perceptrons are arranged in layers, with the first layer taking in the inputs and the last layer producing the outputs. The middle layers are called hidden layers. Each perceptron in one layer is connected to every perceptron in the next layer, but there are no interconnections among the perceptrons of the same layer. Information is always fed forward from one layer to the next. A single perceptron can only classify points into two linearly separable regions, whereas by varying the number of layers and the number of input, output and hidden nodes, one can classify points in arbitrary dimensions into an arbitrary number of groups.
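The sketch below illustrates this layered, forward-only flow of information for one hidden layer; the layer sizes and the sigmoid activation are illustrative assumptions rather than the configuration used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass of a feedforward network with one hidden layer: every unit in a
# layer feeds every unit in the next layer, and no unit feeds back or sideways.
def ffn_forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)        # input layer -> hidden layer
    output = sigmoid(W2 @ hidden + b2)   # hidden layer -> output layer (one unit per class)
    return output

# Example with 4 input features, 5 hidden units and 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)
print(ffn_forward(rng.standard_normal(4), W1, b1, W2, b2))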

Radial Basis Function Algorithm. A drawback of the perceptron is that its decision boundary is linear, so it fails to classify data that are not linearly separable. The RBF algorithm uses radial basis functions as activation functions. A radial basis function is a real-valued function \(\varphi\) whose value depends only on the distance from the origin, i.e., \(\varphi(x) = \varphi(\|x\|)\). Its characteristic feature is that the response decreases or increases monotonically with the distance from a central point. A typical radial basis function is the Gaussian function. The RBF layer transforms the input signal into a different representation, which can then be fed to the ANN to obtain linear separability. An RBF network has an input layer, a single hidden layer and an output layer. The sophistication of the activation function usually leads to better classification accuracy than the perceptron or FFN.
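The following sketch shows this hidden-layer transformation for Gaussian radial basis functions; the centres and the width parameter are illustrative assumptions (in practice they are often obtained by clustering the training data), and the transformed features would then be passed to a linear output layer.

import numpy as np

# Gaussian RBF hidden layer: each hidden unit responds according to the distance
# ||x - c_j|| between the input and its centre c_j, so the response depends only
# on that distance and decreases monotonically with it.
def rbf_features(X, centers, gamma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return np.exp(-gamma * d2)  # phi(||x - c||) for every sample/centre pair

# The transformed features are far more often linearly separable than the raw
# inputs and can be classified by a simple linear output layer (e.g. a perceptron).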

3 Performance of ANN in Noisy Two-Step Incremental Learning

We have performed our study on the standard benchmark dataset IRIS [17], which contains 120 data samples, where each sample is a vector containing four features. Forty samples have been used to train each ANN algorithm, while the other 80 have been used to test their performance.

An error in the training data means that for a particular training sample si ∈ Ci, the class is incorrectly labeled as Cj, j ≠ i. For this dataset, containing 2n training samples (here n = 20), each ANN algorithm is trained twice sequentially, first with the first n and then with the last n training samples. We have varied the error in the training set from 0 to 50% in steps of 10%. Three scenarios are considered, where the erroneous data is (i) uniformly distributed over the entire training set, (ii) uniformly distributed in the first half of the training set and (iii) uniformly distributed in the second half of the training set.
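A hedged sketch of this protocol is given below. It flips the labels of a given fraction of the training samples, either across the whole set or only within one half, and then trains incrementally on the two halves. The use of scikit-learn's Perceptron with partial_fit is our illustrative choice and not necessarily the authors' implementation.

import numpy as np
from sklearn.linear_model import Perceptron  # stand-in model; any incrementally trainable ANN works here

# Flip the labels of a fraction of the training samples; `region` is the slice of
# the training set in which the errors are concentrated (the whole set for the
# uniform case). The error rate is taken with respect to the full training set.
# Assumes integer class labels 0, 1, ..., n_classes-1.
def flip_labels(y, error_rate, region, n_classes, rng):
    y_noisy = y.copy()
    candidates = np.arange(len(y))[region]
    n_flip = min(int(error_rate * len(y)), len(candidates))
    flipped = rng.choice(candidates, size=n_flip, replace=False)
    # shift each flipped label by 1..n_classes-1 so it always becomes a wrong class
    y_noisy[flipped] = (y_noisy[flipped] + rng.integers(1, n_classes, n_flip)) % n_classes
    return y_noisy

def two_step_accuracy(X_train, y_train, X_test, y_test, error_rate, region, rng):
    classes = np.unique(y_train)
    y_noisy = flip_labels(y_train, error_rate, region, len(classes), rng)
    n = len(X_train) // 2
    model = Perceptron()
    model.partial_fit(X_train[:n], y_noisy[:n], classes=classes)  # first training step
    model.partial_fit(X_train[n:], y_noisy[n:])                   # second training step
    return model.score(X_test, y_test)

For example, region=slice(0, n) places all the errors in the first half, region=slice(n, None) places them in the second half, and region=slice(None) corresponds to the uniform case.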

In Table 1, we show the accuracy obtained by the perceptron, FFN and RBF algorithms as the error in the training set is varied as discussed above.

Table 1 Accuracy of perceptron, FFN and RBF in the presence of clustered noise

4 Performance of ANN in Noisy Three-Step Incremental Learning

We have performed this study on the WINE dataset [17], which contains 150 data samples, where each sample is a vector containing 13 features. The IRIS dataset used in the two-step experiment does not contain enough samples to be divided effectively into three training sets, so we chose the WINE dataset for this experiment. For this dataset, the ANN algorithms have been trained using 60 samples, and the other 90 have been used to test their performance. The motivation for using two different datasets (IRIS and WINE) is to show that the performance degradation due to error clusters is independent of the number of features per sample. For the WINE dataset containing 3n training samples (here n = 20), each ANN algorithm is trained thrice sequentially, with the first n, the second n and the last n training samples. We have varied the error in the training set from 0 to 40% in steps of 10%. Four scenarios are considered, where the erroneous data is (i) uniformly distributed over the entire training set, (ii) uniformly distributed in the first 20 entries, (iii) uniformly distributed in the second 20 entries and (iv) uniformly distributed in the last 20 entries of the training set.
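The two-step sketch above extends directly to this setting by splitting the training set into three equal parts and performing one incremental training step per part; the snippet below (again an illustration, not the authors' code) makes that generalization explicit for an arbitrary number of steps.

import numpy as np
from sklearn.linear_model import Perceptron

# Incremental training in k sequential steps: split the (possibly noisy) training
# set into k equal parts and call partial_fit once per part, in order.
def k_step_accuracy(X_train, y_noisy, X_test, y_test, k=3):
    classes = np.unique(y_noisy)
    model = Perceptron()
    for X_part, y_part in zip(np.array_split(X_train, k), np.array_split(y_noisy, k)):
        model.partial_fit(X_part, y_part, classes=classes)
    return model.score(X_test, y_test)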

In Fig. 1, we show the accuracy of the three algorithms for incremental training. The first row shows the performance for two-step training, and the second row shows the same for three-step training. The graphs readily show that the degradation in performance depends heavily on the location of the errors. The performance of all the algorithms is almost the same when the errors are distributed uniformly or are clustered in the first training set. However, as the errors move to later training sets, the performance of the algorithms degrades significantly. Moreover, although the performance of RBF is better than that of FFN, which in turn is better than that of the perceptron, the nature of the degradation remains similar for all the algorithms, irrespective of their sophistication.

Fig. 1 Accuracy of perceptron, FFN and RBF for two- and three-step incremental learning

For the three-step learning, we have also studied the accuracy of the said algorithms when the error is uniformly distributed over two of the three steps. The accuracy in this scenario is shown in Fig. 2. The performance of the algorithms with errors in two of the three parts shows a similar nature.

Fig. 2 Accuracy of ANN algorithms when errors are uniformly distributed in two of the three steps

5 Conclusion

In this paper, we have numerically studied the accuracy of the perceptron, FFN and RBF for two-step and three-step incremental learning in the presence of a noisy dataset. We show that the accuracy of the algorithms depends not only on the percentage of error in the training set but also on the location of the error concentration. In fact, for the same percentage of error, the location plays a significant role in the accuracy of the algorithms. Moreover, we also show that the nature of the degradation due to concentrated error is independent of the number of features in the data. The accuracies obtained from the most basic ANN (perceptron) and the more sophisticated ANNs (FFN, RBF) show that although sophistication makes an algorithm more robust to errors, the nature of the performance degradation due to the location of error is similar for all the algorithms. Therefore, the concentration of error is a more acute threat to incremental learning with noisy training data than the percentage of error.