
1 Introduction

Human motion recognition has a large variety of applications, for example in safety and surveillance, such as access control, congestion analysis and abnormal behavior detection [12], and in behavioral biometrics, including gesture and posture recognition for human-computer interaction (HCI) [11]. These applications employ different representations and recognition techniques; however, the representation and recognition methods are usually mutually dependent. As suggested in [7], human motion representations can be categorized into two models: the humanoid body model and the humanoid image model. The humanoid body model uses structural representations built from joint positions of 2D or 3D points in space, simulating a point-light display. Such a point-light display can be seen as a stick-figure model that is used to estimate human body parts in the humanoid body model. The earliest application of point-light displays of motion patterns was the study of biological motion recognition and human visual perception mechanisms [5]. That experiment revealed that the movement of 10–12 bright spots attached to human body parts is sufficient for humans to distinguish actions. Point-light displays of human motion have since been widely used in studies of human behavior in psychology and cognitive science, because human motion also conveys information about emotions or mental states, personality traits and biological attributes [8, 9, 15].

Our objective in this study is to demonstrate the possibility of applying biologically inspired models for both feature representation and recognition of fuzzy human motion. This study was conducted using motion capture (MoCap) data from the CMU MoCap database [1]. The 3D data were projected onto a 2D plane and transformed to screen coordinates, simulating a 2D point-light video. Afterwards, the 2D coordinate space of each video frame was enlarged or shrunk to fit inside a grid; this idea was inspired by the grid cells that support the formation of environment maps in the hippocampus. Receptive fields (RFs) resembling the wavelet-like responses found in the retina and primary visual cortex (V1) are generated inside the grid. With these techniques, we combine a trajectory-based approach, derived from the kinematic structure of 2D point-lights, with a pattern-based approach. The proposed feature extraction technique was tested on new subjects viewed from unseen camera angles.

2 Temporal Pattern Classification Using an ESN

An ESN [4] is a type of recurrent neural network (RNN) whose internal weights are left untrained. Only the output weights are trained towards the desired target at the readout connection, where no cyclic dependencies are created. The work of [10] presents the ESN as a framework for neurodynamical models of working memory and illustrates ESN properties for storing, maintaining, retrieving and removing data that resemble functions of the brain. A general ESN architecture is shown in Fig. 1.

Fig. 1.

Architecture of an ESN. The dashed lines denote the connections which are not compulsory.

Consider a discrete time neural network with input dimensionality \(N_u\), reservoir size \(N_x\), and output dimensionality \(N_y\). Let \(\varvec{u}(t)\), \(\varvec{x}(t)\) and \(\varvec{y}(t)\) denote the vectors of input activities, internal state and output unit activity at time t, respectively. Further, let \(W_{in}\), \(W\) and \(W_{out}\) denote the weight matrices for input connections, internal connections, and output connections as seen in Fig. 1. In addition, the output might be back-coupled to the reservoir via feedback weights \(W_{fb}\). The internal unit activities \(\varvec{x}\) in Fig. 1 are updated from time step \(t-1\) to time t, where \(t=1,...,T\), by

$$\begin{aligned} \varvec{x}(t) = f( W_{in}\varvec{u}(t) + W\varvec{x}(t-1) + W_{fb}\varvec{y}(t-1)) \end{aligned}$$
(1)

\(f(\cdot \,)\) is an activation function of the neurons; a common choice is \(tanh(\cdot )\), applied element-wise. The leaking rate \(\alpha \in (0,1] \) determines the speed of the reservoir update dynamics. With leaky integration, the update rule for the internal units is extended to

$$\begin{aligned} \varvec{x}_{leaky}(t) = (1-\alpha )\varvec{x}(t-1) + \alpha \varvec{x}(t). \end{aligned}$$
(2)
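As a minimal sketch of the state update in Eqs. (1)–(2), a single reservoir step could be written as follows; the NumPy notation and the optional output-feedback argument are our assumptions, not the authors' implementation:

```python
import numpy as np

def reservoir_update(x_prev, u_t, W_in, W, alpha, y_prev=None, W_fb=None):
    """One leaky-integrated reservoir step following Eqs. (1)-(2)."""
    pre = W_in @ u_t + W @ x_prev                   # input and recurrent drive
    if W_fb is not None and y_prev is not None:     # optional output feedback
        pre += W_fb @ y_prev
    x_new = np.tanh(pre)                            # Eq. (1), f = tanh
    return (1.0 - alpha) * x_prev + alpha * x_new   # Eq. (2), leaky integration
```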

If there are direct connections from the input \(\varvec{u}(t)\) to the output layer, the output can be computed according to

$$\begin{aligned} \varvec{y}(t) = f_{out}\left( W_{out} [ \varvec{u}(t); \varvec{x}(t)]\right) , \end{aligned}$$
(3)

where \([\cdot ;\cdot ]\) denotes vector concatenation and \(f_{out}\) is a nonlinear output function. Accordingly, \(W_{out}\) now has dimension \(N_y \times (N_u + N_x)\). Typically, a simple linear regression is applied at the readout layer. Hence, Eq. (3) can be simplified to

$$\begin{aligned} \varvec{y}(t) = W_{out} [ \varvec{u}(t); \varvec{x}(t)]. \end{aligned}$$
(4)

The output class for testing the input sequences \(\varvec{u}(t)\) is then computed by

$$\begin{aligned} \text {class} (\varvec{u}(t)) = \mathop {\text {argmax}}\limits _{k} \left\{ \frac{1}{|\tau |} \sum _{t \in \tau } \varvec{y}_{k}(t) \right\} \end{aligned}$$
(5)

where \(\varvec{y}_{k}(t)\) is the output corresponding to class k, and \(\tau \) is the set of time steps of the input sequence \(\varvec{u}(t)\).
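The linear readout of Eq. (4) and the decision rule of Eq. (5) can be sketched as below; the paper only states that simple linear regression is applied at the readout, so the regularized least-squares solver shown here is an assumption:

```python
import numpy as np

def train_readout(Z, Y_target, ridge=1e-6):
    """Solve y(t) = W_out [u(t); x(t)] by (regularized) least squares.
    Z: (N_u + N_x, T) collected concatenated states, Y_target: (N_y, T) one-hot targets."""
    return Y_target @ Z.T @ np.linalg.inv(Z @ Z.T + ridge * np.eye(Z.shape[0]))

def classify_sequence(W_out, Z):
    """Eq. (5): average the readout over all time steps and take the argmax."""
    Y = W_out @ Z                     # outputs y(t) for every frame, Eq. (4)
    return int(np.argmax(Y.mean(axis=1)))
```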

3 Experimental Setup and Feature Representation

3.1 Dataset

Nine actions (\(N_y=9\)) from the CMU MoCap database were chosen for the experiment: bending (subjects bend to pick up objects from the ground and sometimes lift them over their heads), boxing, golf swing (golfing for short), jumping forward, marching, running, the standing cross-crunch exercise (crunching), the standing side-twist exercise (twisting), and walking. The MoCap markers were reduced to 15, representing the joints of a skeleton. Each training and test set consists of five videos of different subjects. For some actions, such as twisting and crunching, there are only a few subjects, but the videos are long; we therefore cut these long videos to obtain ten short videos. It is important to note that subjects in the training set are excluded from the test set. For the training set, we apply five camera angles {−90, −45, 0, 45, 90} to each video; three samples of these videos are shown in Fig. 2. For the test set, we use twenty-one angles {\(-100, -90,..., 90, 100\)}. Therefore, we have \(9\,\times \,5\,\times \,5\) videos as training data and \(9\,\times \,5\,\times \,21\) videos for testing the recognition of new subjects from unseen camera angles.
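For illustration, one way a camera angle can be simulated is to rotate the 3D joint positions about the vertical axis and keep only the screen coordinates; the orthographic projection and axis convention in this sketch are assumptions, as the paper does not detail the projection used:

```python
import numpy as np

def project_to_2d(joints_3d, angle_deg):
    """Rotate 3D joints about the vertical (y) axis by angle_deg and keep (x, y).

    joints_3d: array of shape (n_frames, n_joints, 3)
    returns:   array of shape (n_frames, n_joints, 2)
    """
    a = np.deg2rad(angle_deg)
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    rotated = joints_3d @ rot_y.T       # rotate every joint in every frame
    return rotated[..., :2]             # drop depth, keep screen coordinates

# e.g. the five training views:
# views = [project_to_2d(joints_3d, a) for a in (-90, -45, 0, 45, 90)]
```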

Fig. 2.

Top: Three actions for 1.5 s (180 frames) at \(-45^{\circ }\): walking, running and marching are shown in (a), (b) and (c), respectively. Bottom: Arbitrary views of the corresponding trajectories of (a), (b) and (c), extended along the time axis in three-dimensional space, are shown in (d), (e) and (f), respectively.

Fig. 3.

Point-light figures with a diameter of 15 pixels at each joint. (a) The point-light figure is stretched to fill the grid of \({200\times 200}\) pixels. (b) The grid is mapped onto \({N_{RF}=10\times 10}\) RFs with a Marr wavelet of \({\sigma =10}\). (c) \({N_{RF}=20\times 20}\), \({\sigma =10}\), and (d) \({N_{RF}=10\times 10}\), \({\sigma =20}\), where the RFs overlap one another; this setting follows the preferred bio-inspired model.

Fig. 4.

The feature vectors \(\varvec{u}(t)\) of six videos of different classes at \(0^{\circ }\) are shown from left to right: golfing, bending, crunching, walking, marching and running. The x-axis indicates the varying frame numbers of the video, whereas the y-axis has the fixed number of RFs.

3.2 Feature Representation

The 2D coordinates of each video frame were stretched to fit inside a \(200 \times 200\) pixel grid, as shown in Fig. 3(a). The grid has a fixed number of RFs producing an input feature vector \(\varvec{u}(t)\) of length \(N_{RF}\) for each frame, where \(N_{RF} = N_{RF_x}\,\times \,N_{RF_y}\) is the total number of RFs in a rectangular grid and \(N_{F}\) is the number of frames in a video. In our experiment, we chose \(N_{RF}=10 \times 10\) and adjusted the \(\sigma \) of the Marr wavelet so that the RFs overlapped one another, as shown in Fig. 3(b)–(d). Examples of feature vectors of six videos representing different actions are displayed in Fig. 4. The two leftmost figures show golfing and bending, where the action pattern is not repeated. Next to them is the pattern of one and a half cycles of crunching. The last three images show walking, marching and running with periodic patterns of about 2–3 cycles. The first 100 frames of the golfing and bending videos show very smooth patterns, indicating no significant movement of the agents in these two videos; this is typical for actions such as golfing, bending and jumping forward. By contrast, actions such as running, marching and walking exhibit a very short onset and can complete one cycle in a very short time. In comparison, running has the shortest videos with about 140 frames, while the other actions average between 160 and 550 frames.
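The following sketch illustrates how such RF features could be computed for one frame: the point-lights are rendered into the \(200 \times 200\) grid and each Marr ("Mexican hat") wavelet RF, placed on a regular \(10 \times 10\) lattice, is correlated with the frame. The wavelet normalization, dot rendering and lattice placement are our assumptions:

```python
import numpy as np

def rf_features(joints_2d, n_rf=10, sigma=10, radius=7, grid=200):
    """Feature vector u(t) for one frame: one Marr-wavelet RF response per lattice cell.

    joints_2d: (n_joints, 2) coordinates already stretched to [0, grid).
    Returns a vector of length n_rf * n_rf.
    """
    yy, xx = np.mgrid[0:grid, 0:grid]
    # render the point-light frame (filled dots of roughly 15 px diameter)
    frame = np.zeros((grid, grid))
    for jx, jy in joints_2d:
        frame[(xx - jx) ** 2 + (yy - jy) ** 2 <= radius ** 2] = 1.0

    # one RF per cell of a regular n_rf x n_rf lattice; sigma controls the overlap
    cell = grid / n_rf
    feats = np.zeros(n_rf * n_rf)
    for i in range(n_rf):
        for j in range(n_rf):
            cy, cx = (i + 0.5) * cell, (j + 0.5) * cell
            r2 = ((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2)
            rf = (1.0 - r2) * np.exp(-r2)          # Marr ("Mexican hat") wavelet
            feats[i * n_rf + j] = np.sum(frame * rf)
    return feats
```

Stacking these per-frame vectors over a whole video yields the \(N_{RF} \times N_F\) patterns shown in Fig. 4.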

3.3 ESN Configurations

We set up a moderate reservoir size of \(N_x=500\) with sparsely connected neurons at \(10\%\) connectivity, similar to [13]. The weight matrices W and \(W_{in}\) contain random values uniformly distributed in the range \([-1,1]\). The spectral radius \(\rho (W)\) can be considered the scaling factor of the weight matrix W: the desired spectral radius is obtained simply by scaling W by the ratio of the desired value to the maximum absolute eigenvalue of the weight matrix. For long short-term memory tasks, [3] shows that peak performance in that setup is obtained with the spectral radius set to one. The only parameter varied in our experiment is the leaking rate (\(\alpha \)), which can be regarded as a time warping of the input signal. All reported results are averages over 4 runs of randomly initialized ESNs with the same configuration.
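The spectral-radius scaling described above can be sketched as follows; realizing the \(10\%\) connectivity by random masking is an assumption about the implementation:

```python
import numpy as np

def init_reservoir(n_x=500, n_u=100, connectivity=0.10, rho_desired=1.0, seed=0):
    """Random sparse reservoir W and input weights W_in, rescaled to a desired
    spectral radius as described in Sect. 3.3."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (n_x, n_x))
    W *= rng.random((n_x, n_x)) < connectivity     # keep roughly 10% of the connections
    rho = np.max(np.abs(np.linalg.eigvals(W)))     # current spectral radius
    W *= rho_desired / rho                         # rescale to the desired value
    W_in = rng.uniform(-1.0, 1.0, (n_x, n_u))
    return W, W_in
```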

4 Experimental Results

4.1 Data Sequence Loss and Redundancy as Variations in Speed

We subsampled the original test data using subsampling factors of 1, 2, 4, 6, 8, and 10, while the training data remained unchanged. A subsampling factor of 2 means that every \(2^{nd}\) frame of the data is taken instead of every single frame (factor of 1). We evaluated the ESN with three leaking rates, \(\alpha =0.1, 0.5\) and 0.9, against two baseline methods, 1-Nearest Neighbor (1-NN) and Random Forest (RdF), which use the 15 joint positions obtained directly from MoCap without RFs. We used both a naive approach and a dimensionality reduction method to extract feature vectors for these two classifiers. For the naive approach, the feature vectors are obtained by simply stacking all video frames on top of each other for training the classifiers; the majority vote over the frames of a target video is used for classification. For the dimensionality reduction method, the feature vectors are obtained from PCA using three principal components in combination with 1-NN and RdF. Figure 5-Left shows the recognition rates on test data from twenty-one untrained angles with respect to data sequence loss, using a training factor of 1. Figure 5-Right shows the recognition rates with a training factor of 5 as a test of data redundancy. Both figures reveal that the ESN with \(\alpha =0.9\) gives the best performance, robust against data sequence loss and redundancy, yielding a recognition rate of 95% even with large training and test subsampling factors. The good performance of the ESN with \(\alpha =0.9\) in handling variations in speed might be explained by the long short-term memory behavior of ESNs demonstrated in [3]. Classifying data in the original space with the naive 1-NN approach also gives a stable outcome of about 80%. In contrast, classifying data in PCA subspaces is sensitive to data sequence loss and only performs well when the frame rates of the training and test sequences are about the same.
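For clarity, subsampling simply keeps every k-th frame, and the naive 1-NN baseline votes over per-frame nearest neighbors; the Euclidean distance and tie handling in this sketch are assumptions:

```python
import numpy as np
from collections import Counter

def subsample(frames, factor):
    """Keep every `factor`-th frame (a factor of 1 keeps the full sequence)."""
    return frames[::factor]

def nn_vote(train_frames, train_labels, test_frames):
    """Naive 1-NN baseline: classify every frame by its nearest training frame
    (flattened joint positions), then take the majority vote over the video."""
    votes = []
    for f in test_frames:
        d = np.linalg.norm(train_frames - f, axis=1)
        votes.append(train_labels[int(np.argmin(d))])
    return Counter(votes).most_common(1)[0][0]
```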

Fig. 5.

Recognition rates of the various classification approaches (y-axis) versus the subsampling factor of the test data (x-axis). Left: training factor of 1. Right: training factor of 5.

4.2 Removing Key Points and Drunkard’s Walks

We extended the experiment by removing key points from all frames of the test videos, simulating occlusion. Removing a wrist from the skeleton does not affect the recognition rate, whereas removing an ankle causes running, crunching and walking to be mistaken for marching, which drops the recognition rate to 61%. Furthermore, we extended the test by having three new persons perform four trials each of a simulated drunkard's walk, recorded from twenty-two untrained angles, producing 92 test samples for walking, while the other actions remain the same. Two samples of the drunkard's walk are shown in Fig. 6-Left. The confusion matrix in Fig. 6-Right reveals that walking is misclassified as marching in 20.7% of the cases. The closeness of the trajectories of these two actions can be inspected in Fig. 2(c) and Fig. 6-Left.
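Occlusion can be simulated by deleting one joint from every frame before the RF features are computed, so that its point-light is never rendered; the joint naming and indexing in this sketch are hypothetical:

```python
import numpy as np

# hypothetical indices for two markers of the 15-joint skeleton (an assumption)
JOINT_INDEX = {"left_wrist": 7, "right_ankle": 14}

def remove_joint(joints_2d_sequence, joint_name):
    """Simulate occlusion: delete one joint from every frame so that its
    point-light is never rendered into the RF grid."""
    idx = JOINT_INDEX[joint_name]
    return [np.delete(frame, idx, axis=0) for frame in joints_2d_sequence]
```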

Fig. 6.

Left: Two subjects performing the drunkard's walk. Right: Confusion matrix of the test samples when the drunkard's walk is substituted for walking.

4.3 Discussion of Related Work

One of the earliest promising methods for view-independent recognition of 3D MoCap was introduced by [6]. It applies a non-thresholded recurrence plot, computing a similarity matrix of each joint as a sum of squared differences. The benefit of this method is that the descriptors are stable across view changes; recognition relies on a Bag-of-Features obtained from Histograms of Oriented Gradients. However, the disadvantage of this approach is that all motion sequences in the experiment must have equal length in order to obtain a fixed window size for recognition. Another study on view-independent recognition of MoCap is [14]. It proposed a feature extraction technique that transforms either 2D or 3D data into subspaces to form an action descriptor. The major advantages of this approach are a very small, fixed data size regardless of video length and very fast computation. Its test on motion projected to 2D achieves a recognition rate of 96.5% over 21 untrained angles for 10 actions and is also very stable under data sequence loss. Other interesting skeleton-based action recognition approaches for 3D MoCap are, for instance, [2, 16]. They proposed and compared several deep recurrent neural network architectures with Long Short-Term Memory (LSTM) for classification. Their tests were carried out on 65 classes of the HDM05 MoCap dataset, yielding recognition rates of up to 96.92% and 97.25%, respectively. Nonetheless, these tests were only performed for one default view.

5 Conclusion

We have introduced a feature extraction scheme based on a biologically inspired model, applying the concept of receptive fields to point-light patterns of human motion. Our proposed scheme, in combination with an ESN acting as a good approximator, yields good performance and robustness against variations in speed, even when the motion trajectories are fuzzy. This representation could also be deployed for human motion classification based on optical flow obtained from standard videos, where human pose estimation is infeasible. The designed ESN is generic in the sense that it is not specialized to human motion, and it shows good prediction on unseen data; hence, application to other domains of articulated objects in motion is possible. Furthermore, new technologies such as the IBM TrueNorth chip have introduced dedicated neuro-inspired hardware that allows modeling hundreds of thousands up to a million neurons with very low energy consumption. ESNs, which offer very simple learning mechanisms, can be optimized by local learning rules that scale well even to very large systems. Therefore, the ESN is a potential candidate for low-energy systems that can become an integral part of future sensor technology.