Introduction

Recently, the design and implementation of cognitive systems that can perceive, learn, memorize, decide, act, and communicate have attracted a lot of attention in both academia and industry [1, 2]. There are several important input modalities for natural interaction with intelligent systems. They include context-sensitive keyword [3], gaze [4], human speech and handwriting [5], and sensors [6]. Activity recognition in different kinds of cognitive systems [7, 8] plays a key role in diverse practical applications such as metabolic energy measurement and fall detection. Automatically recognizing various physical activities (e.g., standing still, walking, running, going upstairs or downstairs) has many applications in healthcare [9], monitoring of elderly people [1012], energy expenditure [13], and so on [14, 15]. Because of their inherent properties, accelerometers are often used to measure movement for human activity recognition. In particular, smart phone-based activity recognition is a popular academic and industrial research topic that uses an embedded accelerometer sensor and has the obvious advantage of being unobtrusive to peoples’ daily lives.

Most existing activity recognition methods collect and label samples of some predefined sort of activity to train a fixed activity recognition model. In [16, 17], the authors used multiple accelerometers to classify different activities. Chen [18] used a smart phone to detect six activities and find the change point between different states. These models achieved high recognition accuracies, because their test and training samples were from the same data and followed the same distribution. These literatures did not consider the incremental activity recognition problem.

The fixed activity recognition model is not appropriate for meeting the diverse demands of different people. It is because that human beings learn motion activities as an incremental process. As a baby, we learn to sit, stand, and toddle. As a child, we learn to walk along the road, run back and forth, ride a bicycle, and so on. When we mature into adults, we may learn to how to drive a car. From the perspective of sports medicine, an individual’s activities can reflect their physical condition. To address this problem, some existing methods retrain a new model by combining the original known activities with new activities [19, 20]. There are also some incremental learning methods. GAP-RBF [21] used feedforward networks with radial basis function (RBF) nodes. It pruned insignificant nodes from the networks. GAP-RBF attempted to simplify the sequential learning algorithms and increase learning speed. However, it requires information about the input sampling distribution or input sampling range, and the learning speed may still be slow for large applications. Fuzzy ARTMAP (FAM) [22] is a variant of an adaptive resonance theory map (ARTMAP) network, which is an incremental learning network based on adaptive resonance theory. It incorporates fuzzy set theory to govern the dynamics of ARTMAP. The major drawbacks of FAM are that it is sensitive to noise in the training dataset and often suffers from overfitting. This decreases its classification accuracy and generalization ability. The resource allocation network (RAN) [23] and its extensions are sequential learning algorithms that have also become popular feedforward networks. However, these sequential learning algorithms can only process one sample at a time, and not in larger sections. In [24], an online sequential extreme learning machine (OSELM) was introduced. OSELM can handle the training data in a sequential manner. At any time, only the newly arrived sample or a set of samples (instead of the entire past dataset) are used to train the incremental model. After the learning procedure, the current section of data is immediately discarded. However, RAN and OSELM cannot train using new classes of incremental data and integrate the results into an existing model.

It is not feasible to retrain a model on resource limited devices; they cannot store large amounts of training data and have low computing power. In this paper, we propose a class incremental extreme learning machine (CIELM). First, we train an ELM classifier using labeled samples of some predefined kinds of activities. Then, labeled samples of new activities are selected, and an incremental learning method is used to update the original ELM classifier.

Fig. 1
figure 1

CIELM framework

The remainder of this paper is organized as follows. In “The Class Incremental Extreme Learning Machine for Activity Recognition” section, we present details of the CIELM. In “Experiments” section, and we give the results of our experiments. Section “Conclusion and Future Work” concludes the paper.

The Class Incremental Extreme Learning Machine for Activity Recognition

The proposed CIELM can be incrementally built in a similar manner to a human’s cognitive processes. As illustrated in Fig. 1, CIELM has three main steps:

  1. 1.

    some activity samples are collected and labeled;

  2. 2.

    an ELM activity classifier is trained on these labeled samples (called ELM_1);

  3. 3.

    ELM_1 is increasingly adapted to a new classifier (ELM_2) using an incremental learning algorithm and labeled data of the known and new classes.

The ELM structure is changed after the adaption phase. Output neutrons that correspond to the new classes are added. They are represented by the dotted circles in Fig. 1.

Brief Description of ELM

ELM is a recently developed neural network algorithm. It is known to perform well for complex problems and reduce computation time when compared with other machine learning algorithms. The ELM algorithm does not train the input weights or the biases of neurons. It acquires the output weights by using the least-squares solution and Moore–Penrose inverse of a general linear system. The final result is derived by selecting the maximum output value and its corresponding index.

Fig. 2
figure 2

ELM network structure

Figure 2 shows the ELM network structure with a single hidden layer. The learning phase of the ELM algorithm can be summarized as in Algorithm 1.

figure a

In the offline phase, an ELM classifier can be trained using the labeled activity samples. After the classifier is deployed on a smart phone, it can be used to classify the users’ activities in the online phase.

OSELM Algorithm

The ELM described in “Brief Description of ELM” section is a batch learning algorithm. It assumes that all the data are available for training. However, in real applications, the training data may arrive in sections or as individual samples. Therefore, the batch ELM algorithm must be modified so that it can deal with the online sequential problem [24].

  1. Step 1

    Given a chunk of the initial training set: \(\aleph _0=\{(x_i,t_i)\}_{i=1}^{N_0}\), where \(x_i=\langle f_i^1,f_i^2,\ldots ,f_i^n \rangle\), \(n\) is the feature number of the input data and also the input node number of the ELM. We select the activated function \(G(a,b,x)\) and hidden node number \(\tilde{N}\) and assign random input weights \(a=\{a_i|a_i\in R^n\}_{i=1}^{\tilde{N}}\) and biases \(b=\{b_i|b_i\in R\}_{i=1}^{\tilde{N}}\). The hidden layer output matrix, \(H_0\), can be calculated using

    $$\begin{aligned} H_0=\left[ \begin{array}{ccc} G(a_1,b_1,x_1) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_1) \\ \vdots &{} \ddots &{} \vdots \\ G(a_1,b_1,x_{N_0}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_{N_0}) \\ \end{array} \right] _{{N_0}\times \tilde{N}}. \end{aligned}$$
    (1)

    Then, the output weight \(\beta _0\) can be calculated using

    $$\begin{aligned} \beta _0=K_0^{-1}H_0^{\mathrm{{T}}}T_0, \end{aligned}$$
    (2)

    where \(K_0=H_0^{\mathrm{{T}}}H_0\), and \(T_0=[t_1,t_2,\ldots ,t_{N_0}]^{\mathrm{T}}\).

  2. Step 2

    Suppose that we are given another chunk of data \(\aleph _1=\{(x_i,t_i)\}_{i=N_0+1}^{N_0+N_1}\). Then, using the parameters \(a, b, G, \hbox {and} \tilde{N}\) in Step 1, \(H_1\) can be calculated using

    $$\begin{aligned} H_1=\left[ \begin{array}{ccc} G(a_1,b_1,x_{N_0+1}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_{N_0+1}) \\ \vdots &{} \ddots &{} \vdots \\ G(a_1,b_1,x_{N_0+N_1}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_{N_0+N_1}) \\ \end{array} \right] _{{N_1}\times \tilde{N}}. \end{aligned}$$
    (3)

    results:

    $$\begin{aligned} K_1=\left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] ^{ \mathrm{T}}\left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] =\left[ \begin{array}{cc} H_0^{\mathrm{T}} &{} H_1^{\mathrm{T}} \\ \end{array} \right] \left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] =K_0+H_1^TH_1. \end{aligned}$$
    (4)

    Therefore, if we combine the datasets \(\aleph _0\) and \(\aleph _1\) to train the ELM, \(\beta _1\) can be calculated using

    $$\begin{aligned} \beta _1 = K_1^{-1}\left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] ^{\mathrm{T}}\left[ \begin{array}{c} T_0 \\ T_1 \\ \end{array} \right] = \beta _0+K_1^{-1}H_1^{\mathrm{T}}(T_1-H_1\beta _0). \end{aligned}$$
    (5)

    As seen in Eq. (5), \(\beta _1\) is a function of \(\beta _0, K_1,\) and \(H_1\), and not a function of the dataset \(\aleph _0\). Hence, we do not need to store the current data chunk after \(\beta\) has been calculated.

Class Incremental Extreme Learning Machine

As mentioned in “OSELM Algorithm” section, the OSELM model can be incrementally updated using individual samples or data chunks. The labels of the incremental samples are the same as those in the training dataset. However, in reality, the user may form some new activities that the model needs to recognize. This case cannot be solved by OSELM. To address this problem, we extended OSELM so that it can be incrementally trained.

Suppose that we have a given labeled dataset \(S=\{(x_s^{(i)}),(y_s^{(i)})|i=1,2,\ldots ,N_0\}\) and a new dataset \(D=\{(x_d^{(i)}),(y_d^{(i)})|i=1,2,\ldots ,N_1\}\). Without loss of generality, we can simplify the description by assuming that there is only one new class label in \(D\), and it is ‘\(1+\hbox {max}\{(y_s^{(i)})|i=1,2,\ldots ,N_0\}\)’.

According to the ELM algorithm, after the training phase on the dataset of S is completed, the output weight can be calculated using

$$\begin{aligned} \beta _0=K_0^{-1}H_0^{\mathrm{{T}}}T_0, \end{aligned}$$
(6)

where

$$\begin{aligned} H_0&= \left[ \begin{array}{ccc} G(a_1,b_1,x_s^{(1)}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_s^{(1)}) \\ \vdots &{} \ddots &{} \vdots \\ G(a_1,b_1,x_s^{(N_0)}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_s^{(N_0)}) \\ \end{array} \right] _{{N_0}\times \tilde{N}}, \end{aligned}$$
(7)
$$\begin{aligned} T_0&= \left[ \begin{array}{c} t_1^{\mathrm{T}} \\ \vdots \\ t_{N_0}^{\mathrm{T}} \\ \end{array} \right] _{{N_0}\times m}, \end{aligned}$$
(8)

\(m\) is the number of sample classes in dataset \(S\), and

$$\begin{aligned} K_0=H_0^TH_0. \end{aligned}$$
(9)

From the dataset of \(D\) and the function of the ELM algorithm, \(H_1\) can be calculated using

$$\begin{aligned} H_1=\left[ \begin{array}{ccc} G(a_1,b_1,x_d^{(1)}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_d^{(1)}) \\ \vdots &{} \ddots &{} \vdots \\ G(a_1,b_1,x_d^{(N_1)}) &{} \cdots &{} G(a_{\tilde{N}},b_{\tilde{N}},x_d^{(N_1)}) \\ \end{array} \right] _{{N_1}\times \tilde{N}}. \end{aligned}$$
(10)

The labels of the samples in \(D\) can be represented by a multidimensional vector of dimension \(m+1\), where \(m\) is the column number of \(T_0\).

$$\begin{aligned} T_1=\left[ \begin{array}{cccc} 0 &{} \cdots &{} 0 &{} 1 \\ \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} 1 \\ \end{array} \right] _{{N_1}\times (m+1)}. \end{aligned}$$
(11)

If we combine datasets \(S\) and \(D\) to train ELM, \(\beta _1\) can be calculated using

$$\begin{aligned} \beta _1=K_1^{-1}\left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] ^{\mathrm{{T}}} \left[ \begin{array}{c} T_0\cdot M \\ T_1 \\ \end{array} \right] , \end{aligned}$$
(12)

where

$$\begin{aligned} K_1&= \left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right]^{\mathrm{T}} \left[ \begin{array}{c} H_0 \\ H_1 \\ \end{array} \right] =K_{0}+H_{1}^{\mathrm{{T}}} H_{1};\\ M&= \left[ \begin{array}{cccc} 1 &{} \cdots &{} 0 &{} 0 \\ \vdots &{} \ddots &{} \vdots &{} 0 \\ 0 &{} \cdots &{} 1 &{} 0 \\ \end{array} \right] _{m\times (m+1)}. \end{aligned}$$

\(M\) is a transform matrix that adds one column of zeroes on the right of matrix \(T_0\). Therefore, \(T_0\cdot M\) and \(T_1\) have the same number of columns.

Using Eq. (12) we can derive another form of \(\beta _1\):

$$\begin{aligned} \beta _1&= \beta _0 M+K_1^{-1}H_{1}^{\mathrm{{T}}}(T_1-H_1\beta _0 M)\nonumber \\&= \beta _0 M+(K_0+H_1^TH_1)^{-1}H_{1}^{\mathrm{{T}}}(T_1-H_1\beta _0 M). \end{aligned}$$
(13)

It can be seen from Eq. (13) that \(\beta _1\) can be calculated without the dataset \(S\) and can be achieved through incremental training. By introducing matrix \(M\), we have changed the ELM structure. This means that we have added one or more output neurons to the initial ELM structure.

Experiments

Class incremental learning has many applications in the activity recognition field. We incrementally learn to perform motion activities as we mature. Thus, we have used a class incremental extreme learning machine to enable the existing model to distinguish new activities using the small amount of storage and computing power available on a smart phone.

Sample Preparation

We have considered three benchmark problems for these classification experiments: activity data from our research group and two datasets (image segments and satellite images) from UCI.Footnote 1 We used the activity dataset to show that CIELM performs well for the activity recognition problem. The experiments on the UCI datasets demonstrated that CIELM can be used in other applications.

Activity Data Preparation

In our experiments, Nokia N95 8GB mobile phones were used to collect accelerometer data. An activity database was built from the collected data. There were 12 participants and 5 activities (standing still, walking, running, and going upstairs or downstairs). The data collection procedure was designed as follows:

  1. 1.

    Standing still for at least 1 min at the first floor of a 14-story building.

  2. 2.

    Climbing up the stairs for at least eight stories.

  3. 3.

    Standing still for at least 1 min.

  4. 4.

    Climbing downstairs to the first floor.

  5. 5.

    Standing still for at least 1 min.

  6. 6.

    Walking outside of the building to a playground.

  7. 7.

    Standing still for at least 1 min.

  8. 8.

    Running for at least 3 min.

  9. 9.

    Standing still for at least 1 min.

The N95 phone collected the data while in the subject’s right trouser pocket. A sliding window with a 50 % overlap was used to extract the features. The sampling frequency of the N95 accelerometer sensor was set to approximately 32 Hz using the Nokia accelerometers plug-in API. Our chosen window size was two seconds, and the overlap time was one second. Thus, a complete activity could be included in the window. Previous work has demonstrated that feature extraction using windows with 50 % overlaps was successful [25].

An accelerometer detects changes in capacitance and transforms them into an analog output voltage that is proportional to acceleration. For a triaxial accelerometer, the output voltages can be mapped into acceleration along three axes, \(a_x,a_y,a_z\). As \(a_x,a_y,a_z\) are the orthogonal decompositions of real acceleration, the magnitude of the synthesized acceleration is \(a=\sqrt{a_x^2+a_y^2+a_z^2}\) [19], where \(a\) is the magnitude of the real acceleration with no directional information. Thus, the acceleration magnitude-based activity recognition model is orientation-independent.

Using a data series of acceleration magnitude, 16 statistical features [26] were extracted from a sliding window of 64 samples with a 50 % overlap between consecutive windows. These features were maximum, minimum, mean, standard deviation, mode, mean-crossing rate, range, signal magnitude area, four amplitude statistical features, and four shape statistical features of the power spectral density (PSD), as in [19, 27]. Additionally, using an FFT transformation of the series of acceleration magnitudes, all frequency components from 1Hz to 64Hz were extracted and added to the feature vector (a total of 80 features). To eliminate scaling effects, all the features were normalized using the z-score normalization algorithm.

Table 1 Activity samples

The numbers of samples for each activity are listed in Table 1.

UCI Dataset Preparation

The image segmentation problem consists of a database of images drawn randomly from seven outdoor images. It consists of 2,310 regions of \(3\times 3\) pixels. The goal is classify each region as belonging to one of the seven classes using 19 extracted attributes.

Table 2 UCI datasets

The satellite image problem consists of a database of images generated from a landsat multispectral scanner. One frame of landsat multispectral scanner imagery consists of four digital images of the same scene in four different spectral bands. The database is a (tiny) subarea of a scene, consisting of \(82\times 100\) pixels. Each image in the database corresponds to a region of \(3\times 3\) pixels. The aim is to classify the central pixel in a region into one of six categories (red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil) using 36 spectral values for each region.

The datasets used in our experiments are described in Table 2. Before being used, they were normalized using the \(z\)-score method.

Experimental Results

We evaluated the CIELM’s performance using the datasets described in Tables 1 and 2. We used four evaluations: model selection, CIELM’s performance according to the incremental data chunk size, stability test for activity recognition, and a performance comparison between batch learning and the CIELM. All the simulations were implemented in MATLAB2009a running on an ordinary PC with 2.6 GHz CPU. The ELM source code was downloaded from Professor Huang’s homepage.Footnote 2 We used the sigmoid activation function.

Model Selection

Model selection refers to the estimations of optimal architecture of the network and optimal parameters of the learning algorithm. It is problem-specific, and must be predetermined. For the CIELM, we only need to determine the optimal number of hidden nodes.

We used the training and validation methods to select the optimal number of hidden nodes. The total dataset was divided into two equal subsets such that the samples of each activity were divided into two equal subsets. We varied the number of hidden nodes between 1 and 300. The performance corresponding to each number is illustrated in Fig. 3.

Fig. 3
figure 3

Model selection

As shown in Fig. 3, the recognition accuracy improved as the number of hidden nodes increased from 1 to 100. For the activity dataset, when the number of hidden nodes was larger than 100, the recognition accuracy remained stable with no significant changes. The same effect can be seen for the image segment and satellite image datasets when the number was larger than 50. Using more hidden nodes may achieve a high recognition accuracy, but there is an increased risk of overfitting. To achieve a trade-off between recognition accuracy and overfitting, we have used 100 hidden nodes for the activity recognition problem and 50 for the two UCI problems.

Change in CIELM’s Performance According to Incremental Data Chunk Size

In this experiment, we evaluated CIELM’s performance according to the incremental data chunk size. As our activity data were collected from 12 objects, we repeated our experiment for 12 times (one for each object). Each trial had two steps. First, the activity data for standing still, walking, running, and going upstairs were used to train the model M1. Then, the going downstairs data were divided into two sections (part_a and part_b). part_a was divided into equal size chunks and used to incrementally update M1. part_b was used to evaluate the new model after each dataset was added to the model. To observe the relationship between model performance and chunk size, we varied the size between 1 and 50 (”ChkSize” in Fig. 4). We took the average values of the 12 trials.

Fig. 4
figure 4

CIELM’s performance using the activity data

Three conclusions can be made from Fig. 4. First, the activity recognition accuracy increased with the number of model updates, and this situation did not change with varying chunk size. Second, a larger chunk size resulted in a better recognition result. Third, more updates resulted in a better recognition result.

We can conclude that when the chunk size was 50, the model accuracy gradually increased after four update procedures.

We repeated this experiment using the satellite image and image segmentation datasets. The experiment results are illustrated in Figs. 5 and 6.

Fig. 5
figure 5

CIELM’s performance using the satellite image data

Fig. 6
figure 6

CIELM’s performance using the image segmentation data

For the satellite image data, when the chunk size was 60, 80 % accuracy was achieved after 5 update procedures. For the image segmentation data, when the chunk size was 6, 90 % accuracy was achieved after 4 update procedures.

Based on the results of these three experiments, we can conclude that the CIELM algorithm’s performance is problem-related. To optimize the model, we should predetermine the chunk size and number of updates.

Stability Test

In this section, we evaluated the stability of the CIELM. We wished to determine whether it is new class-independent. We wish to make each of the five kinds of activities (A, B, C, D, and E) a new class. Without loss of generality, we can make A the new class. Samples of activities B, C, D, and E were used to train the model. Samples of activity A were divided into two parts: One part was used to incrementally train the new model, and the other was used to test it. For each subject, we used five trials. The process was repeated for twelve subjects. The average results are shown in Fig. 7.

Fig. 7
figure 7

Stability test

It can be observed from Fig. 7 that the recognition accuracy for the new class samples was more than 90 %, regardless of the activity that was considered to be the new class. Thus, we have demonstrated that CIELM is a persuasive method for activity recognition and is not sensitive to the specific new class. When deployed on a smart mobile terminal that has limited memory, computation, and power resources, a personalized activity recognition model can be achieved through incremental learning.

Batch Learning Compared with CIELM

In the following experiment, we compared the performances of the batch learning and CIELM methods.

Fig. 8
figure 8

Batch learning compared with CIELM

We can list the permutations of an activity set Act = {Still, Walking, Running, Upstairs, Downstairs}. For each subject, every activity permutation can be denoted Per = {A,B,C,D,E}. Samples of the \(j\mathrm{{th}}(\hbox {j}=1,2,\ldots ,5)\) activity were divided into two equal parts. To simplify the description, we represent them using \(TR_j\) and \(TE_j\).

  1. 1.

    Batch learning: For every permutation, the batch learning validation process consists of four steps. At the first step, an ELM classifier is trained using the \(TR_1 \bigcup TR_2\) dataset and tested using \(TE_1 \bigcup TE_2\). At the \(k^{\mathrm{th}}(k=2,3,4)\) step, an ELM classifier is trained using \(TR_1 \bigcup TR_2 \bigcup \cdots \bigcup TR_{k+1}\) and tested using \(TE_1 \bigcup TE_2 \bigcup \cdots \bigcup TE_{k+1}\). The above process is repeated for all subjects and all permutations, and the average is taken as the result. The results using the batch learning method are shown in Fig. 8.

  2. 2.

    CIELM: As with batch learning, the incremental learning validation process consists of four steps. At the first step, an ELM classifier is trained using the \(TR_1 \bigcup TR_2\) dataset and tested using \(TE_1 \bigcup TE_2\). At the \(k^{th}(k=2,3,4)\) step, the model trained from the \((k-1)\)th step is adapted by our proposed incremental learning method using samples from \(TR_{k+1}\) and tested using \(TE_1 \bigcup TE_2 \bigcup \cdots \bigcup TE_{k+1}\). The above process is repeated for all subjects and all permutations, and the average is taken as the result. The results of the incremental learning method are shown in Fig. 8.

It can be seen from Fig. 8 that the batch learning and CIELM methods performed similarly when new classes were added to the model. When samples of the third activity class were used, the accuracy decreased from more than 96 % to less than 94 %. However, when the fourth and fifth activity classes were added, the recognition performance decreased very slowly.

We can also see that the CIELM performs slightly worse than the batch learning method. This is because batch learning can learn more because it has a larger training set than CIELM.

To evaluate the computational complexity of these two algorithms, we considered the peak memory and total execution time. We used a profile viewer to investigate the two methods. The related peak memory and total execution times are listed in Table 3.

Table 3 Computational complexity comparisons

As seen in Table 3, the CIELM algorithm uses less memory and consumes less CPU time than the batch learning algorithm.

Conclusion and Future Work

This paper proposed the CIELM algorithm to address the class-related incremental learning problem. Our empirical evaluations have shown the following. First, the CIELM model can be used to learn new classes, changing the structure of an existing ELM model by adding output nodes to the network. Second, the number of added samples can affect the performance of the classifier. Third, CIELM’s performance is slightly worse than the batch learning method, which indicates that there is a trade-off between the optimal model and restricted resources.