
1 Introduction

Human Activity Recognition (HAR) has been widely applied in scenarios such as industry and medicine [1]. For instance, it can give early warning when factory operators behave dangerously, and it can indicate how accurately patients with mobility impairments perform physical therapy actions. The basic pattern of HAR is to collect real data for specific activities and then recognize them through comparison and classification. In general, feature extraction over a sliding window is an effective activity recognition paradigm: design a sliding window that contains the sensed data of the activity to be recognized, extract features from the window data to represent the activity, and design an effective classifier. However, most of these special scenarios require dedicated sensing systems, such as IoT devices and sensor networks [2,3,4], resulting in high cost and poor universality.

The development of wireless communication technology [5] and the miniaturization of sensors [6] have given smart devices [7] with many embedded MEMS sensors (such as smartphones and tablets) good data collection capabilities. Meanwhile, in-depth research in deep learning has further improved the accuracy of activity recognition [8]. However, the popularization of mobile devices places higher requirements on activity recognition in complex scenarios. In particular, mobile devices must perceive activity changes quickly and accurately, and recognition methods must adapt to the special behavior habits of different users. The main challenges that activity recognition still faces are as follows: 1) The raw data collected by the accelerometer contains much noise. 2) Historical data influences the current activity data. 3) Special behavior habits are not easy to recognize accurately.

To address the above challenges, we propose a fine-grained activity recognition method and design a corresponding system, farer. The system preprocesses the source data by smoothing and filtering the sampled data and segmenting activities into individual actions. It then deeply mines the features of actions through fine-grained subsegment division, feature extraction and dimensionality reduction. Finally, an incremental activity recognition model based on broad learning (BL) is constructed to realize accurate activity recognition and support incremental model updates. The main contributions of this paper are summarized as follows:

  1) Realize accurate segmentation of activities. By smoothing and filtering the source data, a neighborhood extreme value method is proposed to avoid the interference of peaks and valleys.

  2) Deeply mine the features of action subsegments. The action data is partitioned into fine-grained subsegments according to changes of acceleration, and a feature extraction technique oriented to adjacent difference is designed.

  3) The farer system based on BL is designed with stable performance and good practicability. After a series of experimental verifications, the system has a high recognition rate for different activities under the condition of stable activities, with an overall recognition rate of 97.91%. Under the condition of changed activities, the recognition accuracy is 90.14%, far exceeding other methods.

2 Related Work

Recognizing human activities based on sensor data is essentially a pattern classification problem. When using a sliding window for activity recognition, it is required to process the noise generated during the sampling process, select an appropriate sliding window, and design an effective activity classifier. These are all key factors that determine the recognition performance. This section mainly introduces the current research status of three aspects in the case of activity recognition: the processing of data noise, the setting of sliding window and the construction of classifier model.

Since the data collected by the sensor often contains high-frequency noise, it needs to be processed; otherwise it degrades the accuracy of activity recognition. Yang et al. [9] used a Gaussian filtering algorithm to eliminate the influence of noise. Garcia-Ceja et al. [10] used the average smoothing method, replacing each original data point with the average of two adjacent data points. Khan et al. [11] used a third-order moving average filtering algorithm to remove noise. However, these methods largely fail to eliminate interfering extreme points.

The collection of human activity data often takes a long time. In the activity recognition stage, the sensor data stream needs to be segmented by windows. Because fixed-size sliding window segmentation is simple to operate, it is adopted by most studies. In terms of sliding window setting, Fida et al. [12] tested windows of 0.5 s to 3 s and achieved the best result at 1.5 s. Elsts [13] adopted a 2.56 s sliding window to design an energy-saving activity recognition framework. Shuvo [14] and Xia [15] adopted a sliding window of 2.56 s with a step length of 1.28 s, achieving a recognition accuracy of more than 95%. Cha et al. [16] adopted window lengths of 1 to 4 s and found that 4 s achieved the best accuracy of 96.1%. Pienaar et al. [17] adopted a large window of 10 s with a step length of 1 s to segment data and achieved a recognition accuracy of 94%. These results show that the selection of the action window greatly affects recognition performance. The traditional fixed-size sliding window is oriented to the recognition of a single stable activity and ignores changes of activity.

In terms of classifier construction, researchers manually extract features from activity sensor data and employ them in various traditional machine learning algorithms. Since these data fragments cannot adequately represent complex human activities, they have become the performance bottleneck of the classifier [18]. To further improve the accuracy of activity recognition, methods based on deep learning are increasingly employed, such as Convolutional Neural Network (CNN) [19], Long Short-Term Memory (LSTM) [20] and their joint improved models CNN-LSTM [21] and ConvLSTM [22]. Although these models show good performance, their designs are more complex, their computational cost is large and their hardware requirements are high. More importantly, these models are constructed from training data, so they perceive activity changes poorly and lack robustness [23].

To solve the problems of accurate recognition and flexibility of activities, we propose effective data preprocessing and employ fine-grained features of action subsegments, which improve the accuracy of activity recognition, especially for changed activities. Incremental model construction based on broad learning is also proposed to realize the incremental update of the special activity behavior, without using the source data to retrain the entire model.

3 Proposed Method

For complex activity situations, we propose a fine-grained activity recognition method based on features of action subsegments and incremental broad learning. The corresponding recognition system named farer is also designed. The system contains three sub-modules, namely data preprocessing, feature extraction based on fine-grained subsegment, and incremental recognition model based on broad learning. First, smooth and filter the sampled data, and design a peak and valley recognition method to accurately segment activities into individual actions. Then deeply mine the features of action subsegments through the fine-grained segmentation of the action data, targeted feature extraction and dimensionality reduction. Finally, build an incremental activity recognition model based on broad learning to realize the accurate recognition of activities and satisfy the incremental model update. The framework of farer is shown in Fig. 1.

Fig. 1. farer framework.

3.1 Data Preprocessing

Data Smoothing and Filtering.

According to the travel characteristics of pedestrians, the sensor data changes smoothly in a relatively short period. Since people cannot maintain a fixed posture when traveling, the sensor perceives irregular changes during the collection process, resulting in abnormal points. Without changing the trend of data changes, we employ the neighborhood smoothing method to filter the source data.

Assume that the sampled data at time \(t\) is \({x}_{t}\), its neighborhood interval is \([t-\varphi ,t+\varphi ]\) and the number of points in the interval is \(\mu =2\varphi +1\). Construct a k-order polynomial to fit the points in the interval. Denote \({s}_{0},{s}_{1},\cdots ,{s}_{k}\) as the polynomial coefficients; then the data \({x}_{t}\) can be expressed as

$${x}_{t}={s}_{0}+{s}_{1}t+{s}_{2}{t}^{2}+\cdots +{s}_{k}{t}^{k}$$
(1)

The least square fitting of the neighborhood interval \(X={\left({x}_{t-\varphi }, \cdots ,{x}_{t}, \cdots ,{x}_{t+\varphi }\right)}^{T}\) is calculated as

$$\widehat{X}=P{\left({P}^{T}P\right)}^{-1}{P}^{T}X$$
(2)

where,

$$P=\left(\begin{array}{ccccc}1& t-\varphi & {(t-\varphi )}^{2}& \cdots & {(t-\varphi )}^{k}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1& t-1& {(t-1)}^{2}& \cdots & {(t-1)}^{k}\\ 1& t& {t}^{2}& \cdots & {t}^{k}\\ 1& t+1& {(t+1)}^{2}& \cdots & {(t+1)}^{k}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1& t+\varphi & {(t+\varphi )}^{2}& \cdots & {(t+\varphi )}^{k}\end{array}\right)$$

Smoothing and filtering the source data in this way preserves the trend of the signal and suppresses outliers, which makes the data curve smoother. After filtering, the value of \({x}_{t}\) is \(\widehat{X}\left(\varphi +1\right)\), i.e. the central element of the fitted vector.
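The fit in Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `neighborhood_smooth` is hypothetical, the time axis is re-centered to offsets \(-\varphi ,\dots ,\varphi \) (which gives the same fitted values as using \(t-\varphi ,\dots ,t+\varphi \) but is better conditioned), and the first and last \(\varphi \) samples are left unchanged because the text does not say how edges are handled.

```python
import numpy as np

def neighborhood_smooth(x, phi=3, k=2):
    """Replace each interior x[t] by the centre value of a k-order
    least-squares polynomial fitted over [t - phi, t + phi]."""
    x = np.asarray(x, dtype=float)
    offsets = np.arange(-phi, phi + 1)              # re-centred time axis (assumption)
    P = np.vander(offsets, k + 1, increasing=True)  # (2*phi + 1) x (k + 1)
    hat = P @ np.linalg.pinv(P)                     # equals P (P^T P)^{-1} P^T for full-rank P
    centre_weights = hat[phi]                       # row that yields X_hat(phi + 1)
    y = x.copy()                                    # edge samples stay unfiltered (assumption)
    for t in range(phi, len(x) - phi):
        y[t] = centre_weights @ x[t - phi:t + phi + 1]
    return y

if __name__ == "__main__":
    spike = np.zeros(21)
    spike[10] = 1.0                                 # isolated outlier
    print(neighborhood_smooth(spike)[10])           # outlier is attenuated
```

Because the projection matrix depends only on \(\varphi \) and \(k\), it is computed once and reused for every sample.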

Activity Segmentation.

The periodicity of activities makes the acceleration data in the vertical direction sampled by the sensor show regular changes in peaks and valleys, so dividing activities by identifying peaks and valleys is the basic way of activity segmentation. The activity data is filtered to eliminate abnormal signal points, making the entire data stream smoother. However, there are still multiple extreme interference points at the peak and valley of activities, which seriously affects the accurate segmentation of activities.

We design a neighboring extremum value method (NEV) to more accurately identify the real peaks and valleys and avoid the interference of extreme points. Denote \(X=({x}_{1},{x}_{2},\dots ,{x}_{n})\) as the acceleration sampling data in the vertical direction. Then,

  1. Calculate extreme points. Obtain all maximum points \({X}_{max}=\left({x}_{max}^{1},{x}_{max}^{2},\dots \right)\) and minimum points \({X}_{min}=\left({x}_{min}^{1},{x}_{min}^{2},\dots \right)\).

  2. Filter extreme interference points. We employ the altitude, prominence and isolation of the peaks and valleys to filter the noise at the extreme points, where the altitude refers to the height of the peak or the depth of the valley, the prominence refers to the convexity of the peak or the concavity of the valley, and the isolation is the horizontal distance between two peaks or two valleys. We set the altitude threshold \({\varGamma }_{A}\), the prominence threshold \({\varGamma }_{P}\) and the isolation threshold \({\varGamma }_{I}\) to eliminate extreme points that do not meet the thresholds.

  3. Segment activities into individual actions. When the vertical acceleration direction is upward and gradually increases from 0, it is defined as the starting point of the action. Then the peak and valley are reached. After reaching the valley, when the vertical acceleration direction is downward and decreases to 0, it is defined as the ending point of the action. Thus, the activity data is segmented into individual action data.
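The filtering in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, prominence is computed with the common topographic definition (peak height minus the higher of the two lowest points separating it from taller samples), and isolation is enforced greedily by keeping taller peaks first. Valleys are handled by applying the same routine to the negated signal.

```python
import numpy as np

def local_maxima(x):
    """Indices of strict interior local maxima."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]

def prominence(x, i):
    """Peak height above the higher of the two separating minima (assumed definition)."""
    left = i
    while left > 0 and x[left - 1] <= x[i]:
        left -= 1
    right = i
    while right < len(x) - 1 and x[right + 1] <= x[i]:
        right += 1
    return x[i] - max(np.min(x[left:i + 1]), np.min(x[i:right + 1]))

def nev_peaks(x, gamma_a, gamma_p, gamma_i):
    """Keep peaks meeting the altitude, prominence and isolation thresholds."""
    x = np.asarray(x, dtype=float)
    candidates = [i for i in local_maxima(x)
                  if x[i] >= gamma_a and prominence(x, i) >= gamma_p]
    kept = []
    for i in sorted(candidates, key=lambda j: -x[j]):   # tallest first
        if all(abs(i - j) >= gamma_i for j in kept):
            kept.append(i)
    return sorted(kept)

if __name__ == "__main__":
    signal = np.array([0, 2, 1.8, 1.9, 0, -2, 0, 2.1, 0])
    print(nev_peaks(signal, 1.0, 0.5, 3))    # real peaks
    print(nev_peaks(-signal, 1.0, 0.5, 3))   # valleys via the negated signal
```

In the demo signal, the small bump at index 3 sits next to a real peak; it passes the altitude test but is rejected by the prominence threshold.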

Different from the traditional activity recognition, we set the size of the sliding window to the size of the complete action segment, so each window has a different size. We call the sliding window here the action window. The action window only contains the data of the current action, without historical data, and it slides a complete action window every time. Our design avoids the influence of historical data on current data.

3.2 Feature Extraction Based on Fine-Grained Subsegments

Fine-grained Subsegment Feature Extraction. To extract fine-grained features of segmented actions, we design an activity recognition method based on fine-grained subsegments. We perform fine-grained subsegment division of actions to realize fine-grained cognition of actions, that is, the action window is evenly divided into several subsegments. The fine-grained division of the action window shows the change of the behavior state.

To fully mine the action characteristics and realize fine-grained cognition of the action data, we design a feature extraction technique based on adjacent difference (FETAD) for three-axis acceleration. Because actions are continuous, the measured data changes relatively steadily over a short period. Therefore, when the action is divided into fine-grained subsegments, the difference between adjacent subsegments changes most smoothly. According to this principle, the steps of FETAD are as follows:

  1) The action window is evenly divided into \({k}_{f}\) subsegments and the length is \({l}_{m}.\) The three-axis data vectors are \({G}_{{l}_{i}}^{x}=\left[{g}_{\left(i,1\right)}^{x},{g}_{\left(i,2\right)}^{x},\cdots ,{g}_{\left(i,{l}_{i}\right)}^{x}\right]\), \({G}_{{l}_{i}}^{y}=\left[{g}_{\left(i,1\right)}^{y},{g}_{\left(i,2\right)}^{y},\cdots ,{g}_{\left(i,{l}_{i}\right)}^{y}\right]\), \({G}_{{l}_{i}}^{z}=\left[{g}_{\left(i,1\right)}^{z},{g}_{\left(i,2\right)}^{z},\cdots ,{g}_{\left(i,{l}_{i}\right)}^{z}\right]\), where \(1\le i\le {k}_{f}\).

  2) Adopt the difference between the data in adjacent subsegments as the data of the previous subsegment, i.e. \({G}_{{l}_{i}}^{x}={G}_{{l}_{i+1}}^{x}-{G}_{{l}_{i}}^{x}\), \({G}_{{l}_{i}}^{y}={G}_{{l}_{i+1}}^{y}-{G}_{{l}_{i}}^{y}\), \({G}_{{l}_{i}}^{z}={G}_{{l}_{i+1}}^{z}-{G}_{{l}_{i}}^{z}\), where \(1\le i<{k}_{f}\). Each axis gets \({k}_{f}-1\) difference vectors.

  3) Extract features for each difference vector of each coordinate axis. Denote \({n}_{f}\) as the number of features to be extracted. The feature vector of the difference vector is \({D}_{{l}_{i}}=\left[{d}_{i}\left(1\right),{d}_{i}\left(2\right),\cdots ,{d}_{i}\left({n}_{f}\right)\right]\).

  4) Combine the features of \({k}_{f}-1\) difference vectors on each coordinate axis into a new feature vector. The feature vectors of the three coordinate axes are expressed as \({D}^{x}={\left[{D}_{{l}_{1}}^{x},{D}_{{l}_{2}}^{x},\cdots ,{D}_{{l}_{{k}_{f}-1}}^{x}\right]}^{T}\), \({D}^{y}={\left[{D}_{{l}_{1}}^{y},{D}_{{l}_{2}}^{y},\cdots ,{D}_{{l}_{{k}_{f}-1}}^{y}\right]}^{T}\) and \({D}^{z}={\left[{D}_{{l}_{1}}^{z},{D}_{{l}_{2}}^{z},\cdots ,{D}_{{l}_{{k}_{f}-1}}^{z}\right]}^{T}.\)

  5) Finally, the feature vectors of the three coordinate axes are combined into a two-dimensional matrix as \({D}^{xyz}={[{D}^{x},{D}^{y},{D}^{z}]}^{T}\). The size is \(3\times [\left({k}_{f}-1\right)\times {n}_{f}]\).
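The five FETAD steps can be sketched as below. The names `fetad` and `basic_features` are hypothetical, the four features are only an illustrative subset (so \({n}_{f}=4\) here), and samples left over when the window length is not a multiple of \({k}_{f}\) are simply dropped, which is an assumption the text does not confirm.

```python
import numpy as np

def basic_features(v):
    """Illustrative subset of features for one difference vector (n_f = 4)."""
    return np.array([v.max(), v.min(), v.mean(), v.std()])

def fetad(window, k_f=6, feature_fn=basic_features):
    """window: (L, 3) array of x, y, z acceleration for one action.
    Returns D^{xyz} with shape 3 x ((k_f - 1) * n_f)."""
    window = np.asarray(window, dtype=float)
    seg_len = window.shape[0] // k_f                 # equal subsegment length
    rows = []
    for axis in range(3):
        # Step 1: split the axis into k_f subsegments (trailing remainder dropped).
        segs = window[:seg_len * k_f, axis].reshape(k_f, seg_len)
        # Step 2: adjacent differences G_{i+1} - G_i, giving k_f - 1 vectors.
        diffs = segs[1:] - segs[:-1]
        # Steps 3-4: features per difference vector, concatenated per axis.
        rows.append(np.concatenate([feature_fn(d) for d in diffs]))
    # Step 5: stack the three axes into one matrix.
    return np.vstack(rows)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    action = rng.standard_normal((120, 3))
    print(fetad(action).shape)                       # 3 x ((6 - 1) * 4)
```

Passing a different `feature_fn` changes \({n}_{f}\) without touching the subsegment logic.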

Feature Matrix Dimensionality Reduction.

To improve the speed of activity recognition, we further extract the effective information of the three-axis features. First, perform feature extraction on the combined matrix \({D}^{xyz}\) of the \(x\), \(y\) and \(z\) three axes. Aiming at the obtained two-dimensional feature matrix, we adopt complete two-dimensional principal component analysis (C2DPCA). C2DPCA reduces the dimensionality of the matrix from two aspects: row projection and column projection. Then, flatten the principal component matrix of the three axes to obtain a one-dimensional vector to meet the input requirements of BLS.

Denote \(D\) as the feature matrix set of \(N\) actions. The feature matrix of the i-th action is \({D}_{i}\in {R}^{p\times q}\), \(i\in \left[1,N\right]\). To realize the complete two-dimensional principal component analysis of \({D}_{i}\), it needs to be projected in both the row and the column direction. The column divergence matrix and row divergence matrix of \(D\) are formulated as \({G}_{P}=\sum_{i=1}^{N}\left({D}_{i}-\overline{D }\right){\left({D}_{i}-\overline{D }\right)}^{T}\) and \({G}_{Q}=\sum_{i=1}^{N}{\left({D}_{i}-\overline{D }\right)}^{T}\left({D}_{i}-\overline{D }\right)\), where \(\overline{D }=\frac{1}{N}\sum_{i=1}^{N}{D}_{i}\) is the average value of the feature matrix set \(D\).

By choosing proper eigenvectors of the matrices \({G}_{P}\) and \({G}_{Q}\), the projection of \({D}_{i}\) is made as dispersed as possible. Let \({n}_{P}\) and \({n}_{Q}\) be the numbers of eigenvalues of \({G}_{P}\) and \({G}_{Q}\), and sort the eigenvalues of each matrix in descending order. The eigenvalue set of \({G}_{P}\) is \(\alpha =\left[{\alpha }_{1}, {\alpha }_{2}, \cdots , {\alpha }_{{n}_{P}}\right]\), and the corresponding eigenvector set is \(\left[{\mu }_{1}, {\mu }_{2}, \cdots , {\mu }_{{n}_{P}}\right]\). The eigenvalue set of \({G}_{Q}\) is \(\beta =\left[{\beta }_{1}, {\beta }_{2}, \cdots , {\beta }_{{n}_{Q}}\right]\), and the corresponding eigenvector set is \(\left[{\nu }_{1}, {\nu }_{2}, \cdots , {\nu }_{{n}_{Q}}\right]\). Since \({G}_{P}\) and \({G}_{Q}\) are both non-negative definite matrices, their eigenvalues are not less than zero.

Let \({\partial }_{P}\) be the column hash threshold. Choose \({n}_{1}\) such that the first \({n}_{1}\) eigenvalues of the eigenvalue sequence \(\alpha \) of \({G}_{P}\) satisfy \(\sum_{i=1}^{{n}_{1}}{\alpha }_{i}\ge {\partial }_{P}\cdot \sum_{i=1}^{{n}_{P}}{\alpha }_{i}\), and select the eigenvectors corresponding to these \({n}_{1}\) eigenvalues to form the column projection of \({D}_{i}\), i.e. \({\rm P}={\left[{\mu }_{1}, {\mu }_{2}, \cdots , {\mu }_{{n}_{1}}\right]}^{T}\). The size of \({\rm P}\) is \({n}_{1}\times p\). Similarly, let \({\partial }_{Q}\) be the row hash threshold. Choose \({n}_{2}\) such that the first \({n}_{2}\) eigenvalues of the eigenvalue sequence \(\beta \) of \({G}_{Q}\) satisfy \(\sum_{j=1}^{{n}_{2}}{\beta }_{j}\ge {\partial }_{Q}\cdot \sum_{j=1}^{{n}_{Q}}{\beta }_{j}\), and select the eigenvectors corresponding to these \({n}_{2}\) eigenvalues to form the row projection of \({D}_{i}\), i.e. \(Q=\left[{\nu }_{1}, {\nu }_{2}, \cdots , {\nu }_{{n}_{2}}\right]\). The size of \(Q\) is \(q\times {n}_{2}\). Projecting \({D}_{i}\) with \({\rm P}\) and \(Q\) gives the principal component matrix \(H={\rm P}\cdot {D}_{i}\cdot Q\). The size of \(H\) is \({n}_{1}\times {n}_{2}\).
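A compact NumPy sketch of this two-sided projection is given below. The function names are hypothetical, and picking the smallest \({n}_{1}\) and \({n}_{2}\) that reach the thresholds is an assumption; the text only requires that the cumulative eigenvalue condition holds.

```python
import numpy as np

def top_eigvecs(G, threshold):
    """Leading eigenvectors of symmetric G whose eigenvalues reach the
    given share of the total (smallest such count; an assumption)."""
    vals, vecs = np.linalg.eigh(G)               # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]       # descending order
    vals = np.clip(vals, 0.0, None)              # guard tiny negative round-off
    share = np.cumsum(vals) / vals.sum()
    n = min(int(np.searchsorted(share, threshold)) + 1, len(vals))
    return vecs[:, :n]

def c2dpca_fit(D, thr_p=0.9995, thr_q=0.9995):
    """D: (N, p, q) stack of feature matrices. Returns P (n1 x p), Q (q x n2)."""
    D = np.asarray(D, dtype=float)
    C = D - D.mean(axis=0)
    G_P = np.einsum("nij,nkj->ik", C, C)         # sum of C_i C_i^T, p x p
    G_Q = np.einsum("nji,njk->ik", C, C)         # sum of C_i^T C_i, q x q
    return top_eigvecs(G_P, thr_p).T, top_eigvecs(G_Q, thr_q)

def c2dpca_transform(D_i, P, Q):
    """H = P . D_i . Q, of size n1 x n2."""
    return P @ D_i @ Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((50, 3, 65))
    P, Q = c2dpca_fit(D, 0.9, 0.9)
    H = c2dpca_transform(D[0], P, Q)
    print(P.shape, Q.shape, H.shape)
    print(H.ravel().shape)                       # flattened vector for the classifier
```

The last line shows the flattening step that turns \(H\) into the one-dimensional input mentioned in the text.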

Fig. 2. Broad learning system model.

3.3 Recognition Model Construction and Incremental Update Based on BL

The broad learning system (BLS) was proposed by Chen [24] in 2018. It is a single-hidden-layer structure consisting of an input layer, a hidden layer and an output layer. The hidden layer includes a feature layer and an enhancement layer. First, the input layer receives the activity features extracted from the windows. Then, the feature layer linearly maps the activity features to construct feature nodes, and the enhancement layer applies a non-linear activation function to the feature nodes to obtain enhancement nodes. The feature nodes and the enhancement nodes are combined to form the hidden layer matrix. Finally, the output layer obtains the output coefficients by the pseudo-inverse method and gives the learning results. The model architecture is shown in Fig. 2.

Recognition Model Construction Based on BL.

We adopt broad learning to build an activity recognition model. Suppose the training data sample set is \(X={\left[{x}_{1}, {x}_{2}, \cdots , {x}_{n}\right]}^{T}\) and each sample has \(m\) dimensions. The sample category label set is \(C={\left[{c}_{1}, {c}_{2}, \cdots , {c}_{n}\right]}^{T}\). The construction process of the recognition model is as follows:

  1. Input layer. Reduce the dimensionality of the action feature matrix and flatten it to obtain targeted one-dimensional data, which is employed as the training sample of the input layer.

  2. Feature layer. Suppose there are \({m}_{g}\) groups of feature mapping and each group has \({n}_{g}\) feature nodes. The i-th group of feature mapping is calculated as follows:

    $${F}_{i}={\psi }_{i}(X{W}_{i}+{\beta }_{i})$$
    (3)

    where \(1\le i\le {m}_{g}\), \({\psi }_{i}\) is a linear transformation function, \({W}_{i}\) is a randomly generated matrix and \({\beta }_{i}\) is a randomly generated vector. The integrated feature node matrix is \(F=[{F}_{1}\left|{F}_{2}\right| \cdots |{F}_{{m}_{g}}]\).

  3. Enhancement layer. The main purpose of the enhancement layer is to add non-linearity to the network. Since the feature nodes are all obtained in a linear manner, the BL recognition model introduces enhancement nodes to supplement them. Suppose there are \({m}_{e}\) groups of enhancement nodes and each group has \({n}_{e}\) enhancement nodes. The j-th group of enhancement nodes is calculated as follows:

    $${E}_{j}={\xi }_{j}(F{W}_{{o}_{j}}+{\beta }_{j})$$
    (4)

    where \(1\le j\le {m}_{e}\), \({\xi }_{j}\) is a nonlinear transformation function, \({W}_{{o}_{j}}\) is a randomly generated orthogonal matrix and \({\beta }_{j}\) is a randomly generated vector. The integrated enhancement node matrix is \(E=[{E}_{1}\left|{E}_{2}\right| \cdots |{E}_{{m}_{e}}]\).

    Finally, the feature node matrix \(F\) and the enhancement node matrix \(E\) are integrated to generate the hidden layer input matrix \(\varLambda =[F|E]\).

  4. Output layer. The output layer maps the input matrix of the hidden layer to the label matrix. Given the category label matrix \(C\) and the input matrix \(\varLambda \), if the mapping matrix \(\Omega \) satisfies

    $$\Lambda \cdot\Omega =C$$
    (5)

    then \(\Omega \) can be obtained as \(\Omega ={\Lambda }^{-1}C\). Since \(\Lambda \) is generally not a square matrix, \({\Lambda }^{-1}\) here denotes its pseudo-inverse, which is computed as

    $${\Lambda }^{-1}={({\Lambda }^{T}\Lambda +\delta I)}^{-1}{\Lambda }^{T}$$
    (6)

    where \(I\) is the identity matrix, \(\delta \) is the regularization coefficient and the value of \(\delta \) is close to zero.
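The four layers above can be sketched in NumPy as follows. This is an illustration under assumptions rather than the farer implementation: the function names and default sizes are hypothetical, \({\psi }_{i}\) is taken as the identity, \({\xi }_{j}\) as `tanh`, the random weights are standard normal, the orthogonal matrices come from a QR factorization (which needs \({n}_{e}\le {m}_{g}{n}_{g}\)), and labels are one-hot encoded.

```python
import numpy as np

def bls_hidden(X, feat, enh):
    """Build Lambda = [F | E] from stored random weights."""
    F = np.hstack([X @ W + b for W, b in feat])            # Eq. (3), identity psi
    E = np.hstack([np.tanh(F @ W + b) for W, b in enh])    # Eq. (4), tanh xi
    return np.hstack([F, E])

def bls_fit(X, C, m_g=4, n_g=8, m_e=2, n_e=16, delta=1e-3, seed=0):
    """X: (n, m) samples, C: (n, classes) one-hot labels."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    feat = [(rng.standard_normal((m, n_g)), rng.standard_normal(n_g))
            for _ in range(m_g)]
    enh = []
    for _ in range(m_e):
        # Orthonormal columns via QR of a random matrix (assumed construction).
        W_o, _ = np.linalg.qr(rng.standard_normal((m_g * n_g, n_e)))
        enh.append((W_o, rng.standard_normal(n_e)))
    Lam = bls_hidden(X, feat, enh)
    # Eq. (6): regularized pseudo-inverse, solved without forming an explicit inverse.
    Lam_pinv = np.linalg.solve(Lam.T @ Lam + delta * np.eye(Lam.shape[1]), Lam.T)
    Omega = Lam_pinv @ C                                   # Eq. (5)
    return feat, enh, Omega

def bls_predict(X, feat, enh, Omega):
    """Predicted class index for each row of X."""
    return np.argmax(bls_hidden(X, feat, enh) @ Omega, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
    y = np.repeat([0, 1], 50)
    feat, enh, Omega = bls_fit(X, np.eye(2)[y])
    print(np.mean(bls_predict(X, feat, enh, Omega) == y))  # training accuracy
```

Training needs only one linear solve, which is why the model is cheap to build compared with gradient-trained networks.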

Incremental Update.

When the users to be recognized have special behavior habits, such as limping or large swings during the activity, the false alarm rate of the model increases, resulting in a poor user experience. Simply matching the activity characteristics of a special user against the existing recognition model makes it difficult to guarantee the recognition rate of the special activity, and it also affects the accurate recognition of the trained actions. On the other hand, if the special user’s activity data is added to the source data and the model is retrained, a lot of model construction time is spent. To achieve targeted activity recognition, the recognition model needs to be updated incrementally.

By incrementally fusing the characteristics of user’s behavior, we realize effective recognition of personalized activities. Our system farer has incremental learning capabilities and can be updated on the trained model without retraining historical data. This method satisfies activity recognition scenarios in more complex situations.

Denote \(\dot{X}\) as the special activity data set and \(\dot{C}\) as special activity label. The feature node matrix \({F}_{\dot{X}}\) and the enhancement node matrix \({E}_{\dot{X}}\) of \(\dot{X}\) are given by the random matrix of farer. The hidden layer matrix is \({\Lambda }_{\dot{X}}=\left[{F}_{\dot{X}}|{E}_{\dot{X}}\right]\), and its pseudo-inverse is

$${{\Lambda }_{\dot{X}}}^{-1}=\left[\left({\Lambda }^{-1}-\phi \cdot \sigma \right)|\phi \right]$$
(7)

where \(\phi =\left\{\begin{array}{cc}{\omega }^{-1}& \omega \ne 0\\ {\left({\Lambda }^{-1}\right)\cdot \sigma \cdot \left(I+{\sigma }^{T}\cdot \sigma \right)}^{-1}& \omega =0\end{array}\right.\), \(\sigma ={\Lambda }_{\dot{X}}\cdot {\Lambda }^{-1}\), \(\omega ={\Lambda }_{\dot{X}}-{\sigma }^{T}\cdot\Lambda \), \(I\) is the identity matrix.

After getting the incremental pseudo-inverse, the output of farer is calculated as

$$\dot{\Omega }={{\Lambda }_{\dot{X}}}^{-1}\left(\genfrac{}{}{0pt}{}{C}{\dot{C}}\right)=\Omega +\phi \cdot \left(\dot{C}-{\Lambda }_{\dot{X}}\cdot\Omega \right)$$
(8)

Therefore, the system in this paper can achieve rapid incremental update of the original model through matrix operations.
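Equations (7) and (8) can be sketched with NumPy as below. This is a reading of the formulas under assumptions: samples are stacked as rows, the transposes are placed so that the matrix shapes agree (so they may differ from the printed ones), and the function name is hypothetical. The \(\omega =0\) branch reproduces the exact pseudo-inverse when \(\Lambda \) has full column rank, and the \(\omega \ne 0\) branch is exact when one row is appended at a time; for several rows at once the second branch may need extra rank conditions.

```python
import numpy as np

def incremental_update(Lam, Lam_pinv, Omega, Lam_new, C_new, tol=1e-10):
    """Update the pseudo-inverse and output weights after appending rows.
    Lam: (N, n) old hidden matrix, Lam_pinv: (n, N), Omega: (n, classes),
    Lam_new: (k, n) hidden matrix of the new samples, C_new: (k, classes)."""
    sigma = Lam_new @ Lam_pinv                     # k x N
    omega = Lam_new - sigma @ Lam                  # part of the new rows outside the old row space
    if np.linalg.norm(omega) > tol * max(1.0, np.linalg.norm(Lam_new)):
        phi = np.linalg.pinv(omega)                # n x k
    else:
        k = Lam_new.shape[0]
        phi = Lam_pinv @ sigma.T @ np.linalg.inv(np.eye(k) + sigma @ sigma.T)
    new_pinv = np.hstack([Lam_pinv - phi @ sigma, phi])        # Eq. (7)
    new_Omega = Omega + phi @ (C_new - Lam_new @ Omega)        # Eq. (8)
    return new_pinv, new_Omega

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, A_x = rng.standard_normal((40, 6)), rng.standard_normal((5, 6))
    C, C_x = rng.standard_normal((40, 2)), rng.standard_normal((5, 2))
    A_pinv = np.linalg.pinv(A)
    new_pinv, new_Omega = incremental_update(A, A_pinv, A_pinv @ C, A_x, C_x)
    batch = np.linalg.pinv(np.vstack([A, A_x]))
    print(np.allclose(new_pinv, batch))            # compare with full recomputation
```

The output weights are corrected by the prediction error on the new samples only, so the old labels \(C\) are not needed for the update.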

4 Experiment and Analysis

4.1 Experimental Settings

To verify the recognition advantages of farer, we compare it not only with the traditional machine learning method SVM [25], but also with the deep learning models CNN and LSTM and their joint models CNN-LSTM and ConvLSTM.

We collect a large amount of three-axis accelerometer data of activities. There are 1,556,436 samples in total, and the sampling frequency is set to 180 Hz. Seven activity states are collected in the experiment, namely trickling, walking, brisk walking, jogging, upstairs, downstairs and jumping.

The action window is divided into 6 subsegments, employing typical features in the activity recognition research, which are the maximum, minimum, average, median, standard deviation, variance, interquartile range, skewness, kurtosis, root mean square, sum, range and entropy. The feature vector size of each coordinate axis is 1 × 65 (Table 1).
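For reference, the 13 listed statistics can be computed for one vector as sketched below. The exact definitions used in the experiments are not given, so the following are assumptions: population (biased) standard deviation and variance, moment-based skewness and non-excess kurtosis (set to 0 for constant input), and Shannon entropy over a 10-bin histogram. The function name is hypothetical.

```python
import numpy as np

def window_features(v, bins=10):
    """Return the 13 statistics in the order listed in the text."""
    v = np.asarray(v, dtype=float)
    mean, std = v.mean(), v.std()
    centred = v - mean
    skew = np.mean(centred**3) / std**3 if std > 0 else 0.0
    kurt = np.mean(centred**4) / std**4 if std > 0 else 0.0
    q75, q25 = np.percentile(v, [75, 25])
    counts, _ = np.histogram(v, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy = float(-np.sum(p * np.log2(p)))       # histogram-based (assumption)
    return np.array([
        v.max(), v.min(), mean, np.median(v), std, v.var(),
        q75 - q25, skew, kurt, np.sqrt(np.mean(v**2)),
        v.sum(), v.max() - v.min(), entropy,
    ])

if __name__ == "__main__":
    f = window_features([1, 2, 3, 4, 5])
    print(len(f))                                   # 13 features per vector
    print((6 - 1) * len(f))                         # per-axis length with 6 subsegments
```

With 6 subsegments there are 5 difference vectors per axis, so 5 × 13 gives the 1 × 65 per-axis feature vector stated above.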

4.2 Performance Evaluation

System Recognition Accuracy. Table 2 compares the activity recognition results of farer and the other methods. In the table, SA represents single activities and CA represents changed activities. According to the experimental results, farer performs best, with an overall recognition accuracy of 94.03%, which surpasses the other recognition methods. For single activity recognition, the accuracy of farer is 97.91%. More importantly, the recognition performance of farer is stable: it has high recognition accuracy for all activities, with no bias toward particular ones. In contrast, the other methods either have a lower recognition rate than farer or have a higher false alarm rate for certain activities. For example, for downstairs, LSTM achieves 100% recognition accuracy, exceeding the 98.99% of farer. However, for two more common activities, walking and brisk walking, the recognition rates of LSTM are only 73.98% and 78.20%, while those of farer are 96.64% and 97.78%.

For changed activities, the overall recognition rate of farer is 90.14%, which is much higher than that of the other recognition methods. This is because the other methods are affected by historical data, which greatly reduces their recognition accuracy. The sliding window of farer represents only the current action, so it effectively avoids the interference of historical data. For example, the recognition accuracy of ConvLSTM, which recognizes single activities more accurately, drops to 77.78%. LSTM’s recognition of downstairs reaches 100%, but its recognition of changed activities is only 67.78%.

Table 1. Experimental parameter settings.
Table 2. Comparison of the activity recognition effect.
Table 3. Recognition accuracy of different sliding windows for activity scenes.

Performance Comparison of Different Windows.

The experiment in this section compares three scenarios: stable activities, slowly changing activities (S) and quickly changing activities (Q), as shown in Table 3. To verify the improvement in recognition rate, we employ fixed-size windows of different durations, namely 1.28 s, 2.56 s, 4 s, 6 s, 8 s, and 10 s, each with a sliding step of half the window length.
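The fixed-size baseline windows can be generated as sketched below. The function name is hypothetical, rounding the window length to the nearest sample is an assumption, and the default rate is the 180 Hz sampling frequency stated in the experimental settings.

```python
def fixed_windows(n_samples, duration_s, fs_hz=180.0, step_ratio=0.5):
    """Return (start, end) sample indices of fixed-size sliding windows.
    The step is step_ratio times the window length (half by default)."""
    win = int(round(duration_s * fs_hz))        # window length in samples
    step = max(1, int(win * step_ratio))        # sliding step in samples
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]

if __name__ == "__main__":
    for d in (1.28, 2.56, 4, 6, 8, 10):
        spans = fixed_windows(180 * 60, d)      # one minute of data at 180 Hz
        print(d, len(spans))
```

Every window except the first shares half of its samples with the previous one, which is how data from a preceding activity can leak into the window being classified.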

According to the experiment results, when the activity is stable, the recognition ability of action window is slightly higher than that of the fixed-size sliding window. However, when the activity changes, the advantage of the former increases significantly. As the window duration increases, the historical data has an increased influence on the feature extraction of the activity to be recognized. This results in a sharp drop in the recognition rate for fixed-size sliding windows.

When the activity changes slowly, such as from trickling to walking, the long switching time puts more interference data in the window, resulting in a lower recognition rate. Moreover, apart from personal behavior habits, activities mostly change slowly because the preceding and following activities are similar, which further increases the difficulty of recognition. When the activity changes quickly, such as from walking to upstairs, the difference between the preceding and following activities is generally large. Because the change is quick, there is less interference data in the window, and because the old activity is quite different from the new one, the recognition rate is higher.

Performance Comparison of Feature Matrix Dimensions.

We adopt FETAD to extract features of the action data, and adopt C2DPCA to reduce the dimensionality of the three-axis feature matrix. According to the above parameter settings, the number of fine-grained subsegments is 6 and the number of features is 13. Thus, the number of features of each axis is \(\left(6-1\right)\times 13=65\). The size of action feature matrix is \(3\times 65\). We compare different dimensionality reduction results, as shown in Fig. 3.

Fig. 3. Dimensionality reduction effect of farer.

Table 4. The maximum and minimum accuracy in the dimensionality reduction process.

Figure 3 shows the recognition accuracy of the three dimensionality reduction curves. The dimensionality of the feature matrix is reduced from \(3\times 65\) to \(3\times n\), \(2\times n\), and \(1\times n\), where \(2\le n\le 65\). Experimental results show that the recognition rate of \(3\times n\) is higher than that of the other two categories. The experiment in this section further compares two ways of processing the matrix when the row dimension is 3 after dimensionality reduction: flattening and the square root of the sum of squares (srss).

Table 4 compares the highest recognition rate and the lowest recognition rate when the feature dimension drops to different sizes, as well as the corresponding row and column values. It can be seen from the table that the row dimension of 3 has the best effect. Especially the lowest recognition rate is much higher than other cases. Considering the recognition accuracy and speed of farer, the column hash threshold and row hash threshold are both set to 99.95%. The size of the feature matrix after dimensionality reduction is \(3\times 21\). The recognition accuracy of the system is 97.91%.

Performance Comparison of Incremental Update.

Some users’ behavior habits are different from most people’s common activities. For traditional neural network, a lot of special behavior data need to be sampled as new training samples. These new training samples are appended to the original training samples. Then the model is retrained, which takes a long time. Even worse, it requires users to perform special activities for a long time to obtain adequate behavior samples. Obviously, this update method brings a lot of trouble to users and is impractical.

To recognize the special activities of different users effectively and quickly, farer incrementally updates the recognition model. Based on the original model, it directly updates the model parameters according to the new activity data. The incremental update of farer greatly reduces users’ activity sampling time and the model training time while maintaining recognition accuracy, so farer achieves a balance of the three.

We compare farer with traditional machine learning and deep learning methods in terms of recognition accuracy and model training time under the same sampling time. In the experiment, volunteers perform special behaviors continuously for 30 min. During sampling, each learning model uses the activity data collected so far to update itself at regular intervals. farer employs the incremental update method, while the other systems mix the sampled data with the original training data and retrain their models.

Fig. 4. Recognition accuracy of special activities with different durations.

Figure 4 shows the recognition accuracy of special activities for different sampling durations, at intervals of 5 min. It can be seen from the figure that farer has the highest recognition accuracy at every sampling time, and its accuracy grows faster than that of the other methods as the sampling time increases.

5 Conclusion

By studying the problem of poor accuracy of activity recognition for sliding windows, we propose a fine-grained activity recognition method, which employs fine-grained subsegments and incremental broad learning. We also design the corresponding activity recognition system farer. The system can effectively process the activity data, realize the accurate segmentation of activities and ensure the effectiveness of the action feature extraction. Furthermore, the fine-grained subsegment division is used to dig deeply into the action features and reduce the dimensionality to ensure the recognition rate of the activity. In the process of activity recognition, the incremental recognition model based on broad learning is adopted to learn the activity behavior of special users, which improves the user experience. After a large number of experiments, the performance of farer is better than that of other recognition methods. The recognition rates for stable activities and changed activities are 97.91% and 90.14%.