Abstract
We introduce a feature extraction scheme based on a biologically inspired model that applies receptive fields (RFs) to point-light human motion patterns to form an action descriptor. The Echo State Network (ESN), which also has biological plausibility, is chosen for classification. We demonstrate the efficiency and robustness of the proposed feature extraction technique combined with an ESN by constraining the test data to arbitrary untrained viewpoints and unseen subjects under the following conditions: (i) lower sub-sampling frame rates to simulate data sequence loss, (ii) removal of key points to simulate occlusion, and (iii) inclusion of untrained movements such as a drunkard's walk.
1 Introduction
Human motion recognition has a large variety of applications, for example, in safety and surveillance such as access control and congestion analysis, abnormal behavior detection [12], and in behavioral biometrics, including gesture and posture recognition for human-computer interaction (HCI) [11]. These applications employ different representations and recognition techniques; however, the representation and recognition methods are usually mutually dependent. As suggested in [7], human motion representation can be categorized into two models: the humanoid body model and the humanoid image model. The humanoid body model uses structural representations of joint positions as 2D or 3D points in space, simulating a point-light display. This point-light display can be seen as a stick-figure model, which can be used to estimate human body parts in the humanoid body model. The earliest application of such point-light displays of motion patterns was introduced for studying biological motion recognition in human visual perception [5]. The experiments revealed that the movement of 10–12 bright spots attached to human body parts is sufficient for humans to distinguish the actions. The point-light display of human motion was later widely used in subsequent studies of human behavior in psychology and cognitive science, because human motion also conveys information about emotions or mental states, personality traits and biological attributes [8, 9, 15].
Our objective in this study is to demonstrate the feasibility of applying biologically inspired models to both feature representation and recognition of fuzzy human motion. This study was conducted using motion capture (MoCap) data from the CMU MoCap database [1]. The 3D data was projected onto a 2D plane and transformed to screen coordinates, simulating a 2D point-light video. Afterwards, the 2D coordinate space of each video frame was enlarged or shrunk to fit inside a grid. This idea was inspired by the grid cells that form environment maps in the hippocampus. Receptive fields (RFs) resembling wavelets in the retina and primary visual cortex (V1) are generated inside the grid. With these techniques, we combined a trajectory-based approach derived from the kinematic structure of 2D point lights with a pattern-based approach. The proposed feature extraction technique was tested on unseen subjects viewed from untrained angles.
2 Temporal Pattern Classification Using an ESN
An ESN [4] is a type of recurrent neural network (RNN) whose internal weights are left untrained. Only the output weights are trained for the desired target at the readout connection, where no cyclic dependencies are created. The work of [10] presents an ESN as a framework for neurodynamical models of working memory, illustrating ESN properties for storing, maintaining, retrieving and removing data that resemble functions of the brain. A general ESN architecture is shown in Fig. 1.
Consider a discrete-time neural network with input dimensionality \(N_u\), \(N_x\) neurons in the reservoir, and output dimensionality \(N_y\). Let \(\varvec{u}(t)\), \(\varvec{x}(t)\) and \(\varvec{y}(t)\) denote the vectors of input activities, internal state and output unit activity at time t, respectively. Further, let \(W_{in}\), \(W\) and \(W_{out}\) denote the weight matrices for input connections, internal connections, and output connections as seen in Fig. 1. In addition, the output might be back-coupled to the reservoir via weights \(W_{fb}\). The internal unit activities \(\varvec{x}\) in Fig. 1 are updated from time step \(t-1\) to time t, where \(t=1,...,T\), by

$$\begin{aligned} \varvec{x}(t) = f\left( W_{in}\varvec{u}(t) + W\varvec{x}(t-1) + W_{fb}\varvec{y}(t-1)\right) \end{aligned}$$
(1)
where \(f(\cdot )\) is an activation function of the neurons; a common choice is \(tanh(\cdot )\), applied element-wise. The leakage rate \(\alpha \in (0,1]\) determines the speed of the reservoir update dynamics. With leaky integration, the update rule for the internal units is extended to

$$\begin{aligned} \varvec{x}(t) = (1-\alpha )\,\varvec{x}(t-1) + \alpha \, f\left( W_{in}\varvec{u}(t) + W\varvec{x}(t-1) + W_{fb}\varvec{y}(t-1)\right) \end{aligned}$$
(2)
If there are direct connections from the input \(\varvec{u}(t)\) to the output layer, the output can be computed according to

$$\begin{aligned} \varvec{y}(t) = f_{out}\left( W_{out}\,[\varvec{x}(t);\varvec{u}(t)]\right) \end{aligned}$$
(3)
where \([\cdot \,;\cdot ]\) denotes vector concatenation and \(f_{out}\) is a nonlinear output function. Accordingly, \(W_{out}\) now becomes an \(N_y \times (N_x+N_u)\) matrix. Typically, a simple linear regression is applied at the readout layer. Hence, Eq. (3) can be simplified to

$$\begin{aligned} \varvec{y}(t) = W_{out}\,[\varvec{x}(t);\varvec{u}(t)] \end{aligned}$$
(4)
The output class for a test input sequence \(\varvec{u}(t)\) is then computed by

$$\begin{aligned} \mathrm {class}(\varvec{u}) = \mathop {\mathrm {arg\,max}}\limits _{k}\;\frac{1}{\tau }\sum _{t=1}^{\tau } y_{k}(t) \end{aligned}$$
(5)
where \(y_{k}(t)\) is the output corresponding to class k, and \(\tau \) is the length of the input time series \(\varvec{u}(t)\).
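The update and readout steps above can be condensed into a short forward pass. The following NumPy sketch assumes a leaky-integrator reservoir without output feedback (\(W_{fb}=0\)) and an already-trained linear readout; the function name and matrix shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def esn_classify(u_seq, W_in, W, W_out, alpha=0.9):
    """Run a leaky-integrator ESN over an input sequence and classify it.

    u_seq : (T, N_u) input time series
    W_in  : (N_x, N_u) input weights, W : (N_x, N_x) reservoir weights
    W_out : (N_y, N_x + N_u) trained linear readout (no feedback assumed)
    Returns the index of the winning class.
    """
    x = np.zeros(W.shape[0])
    y_sum = np.zeros(W_out.shape[0])
    for u in u_seq:
        # candidate state, then leaky integration with rate alpha
        x_tilde = np.tanh(W_in @ u + W @ x)
        x = (1 - alpha) * x + alpha * x_tilde
        # linear readout on the concatenation [x; u]
        y_sum += W_out @ np.concatenate([x, u])
    # class with the largest time-averaged output
    return int(np.argmax(y_sum / len(u_seq)))
```

In practice the readout \(W_{out}\) would be obtained beforehand by linear regression on collected reservoir states; here it is assumed given.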
3 Experimental Setup and Feature Representation
3.1 Dataset
Nine actions (\(N_y=9\)) from the CMU MoCap database were chosen for the experiment: bending (i.e. subjects bend to pick up objects from the ground and sometimes put them over their heads), boxing, golf swinging, jumping forward, marching, running, standing cross-crunch exercise (shortened to crunching), standing side-twist exercise (or twisting), and walking. The markers of the MoCap data were reduced to 15, representing the joints of a skeleton. Each training and test set consists of five videos of different subjects. For some actions, such as twisting and crunching, there are only a few subjects, but the videos are long; therefore, we cut these long videos to obtain ten short videos. It is important to note that subjects in the training set are excluded from the test set. For the training set, we apply five camera angles {−90, −45, 0, 45, 90} to each video. Three samples of these videos are shown in Fig. 2. For the test set, we use twenty-one angles in {\(-100, -90,..., 90, 100\)}. Therefore, we have \(9\times 5\times 5\) videos as training data, and \(9\times 5\times 21\) videos for testing the recognition of new subjects from unseen camera angles.
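Generating a viewpoint amounts to rotating the 3D skeleton to the chosen camera angle and projecting it onto a 2D plane. The paper does not specify its exact camera model, so the orthographic projection after a rotation about the vertical axis below is an assumption for illustration:

```python
import numpy as np

def project_to_2d(joints_3d, angle_deg):
    """Orthographic projection of 3D joint positions onto a 2D plane
    after rotating the skeleton about the vertical (y) axis.

    joints_3d : (N, 3) array of (x, y, z) positions
    Returns an (N, 2) array of screen coordinates (depth dropped).
    """
    theta = np.radians(angle_deg)
    # rotation about the vertical (y) axis
    R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    rotated = joints_3d @ R.T
    return rotated[:, :2]  # keep (x, y); discard depth z
```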
3.2 Feature Representation
The 2D coordinates of each video frame were stretched to fit inside a \(200 \times 200\) pixel grid as shown in Fig. 3(a). The grid has a fixed number of RFs, producing an input feature vector of dimensionality \(N_{RF}\) for each frame, where \(N_{RF} = N_{RF_x}\times N_{RF_y}\) is the total number of RFs in a rectangular grid and \(N_{F}\) is the number of frames in a video. In our experiment, we chose \(N_{RF}=10 \times 10\) and adjusted the \(\sigma \) of the Marr wavelet to design the RFs such that they overlapped one another, as shown in Fig. 3(b)–(d). Examples of feature vectors of six videos representing different actions are displayed in Fig. 4. The two leftmost figures show golfing and bending, where there is no repetition of the action pattern. Next to them is the pattern of one and a half cycles of crunching. The last three images show walking, marching and running with periodic patterns over about 2–3 cycles. The first 100 frames of the golfing and bending videos reveal very smooth patterns, indicating no significant movement of the agents in these two videos. This is typical for actions such as golfing, bending and jumping forward. By contrast, actions such as running, marching and walking exhibit a very short onset of action and complete one cycle in a very short time. In comparison, running is the shortest video with about 140 frames, while the other actions average in the range of 160–550 frames.
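A minimal sketch of the RF response computation, assuming the Marr wavelet takes its common "Mexican hat" form and that each feature value is the summed response of one RF to all point lights in a frame; the grid layout, \(\sigma \), and normalization here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def rf_features(points, grid_size=200, n_rf=10, sigma=12.0):
    """Summed Marr-wavelet RF responses to 2D point lights.

    points : (N, 2) point-light coordinates inside a grid_size x grid_size grid
    Returns a feature vector of length n_rf * n_rf (one value per RF).
    """
    # RF centers on a regular n_rf x n_rf grid
    centers = np.linspace(grid_size / (2 * n_rf),
                          grid_size - grid_size / (2 * n_rf), n_rf)
    cx, cy = np.meshgrid(centers, centers)
    cxy = np.stack([cx.ravel(), cy.ravel()], axis=1)       # (n_rf^2, 2)
    # squared distance of every point to every RF center
    d2 = ((points[:, None, :] - cxy[None, :, :]) ** 2).sum(-1)
    s2 = sigma ** 2
    # Marr wavelet ("Mexican hat"): (1 - d2/(2*s2)) * exp(-d2/(2*s2))
    resp = (1.0 - d2 / (2 * s2)) * np.exp(-d2 / (2 * s2))
    return resp.sum(axis=0)                                # (n_rf^2,)
```

Stacking these per-frame vectors over a video yields the \(N_{RF}\times N_F\) patterns visualized in Fig. 4.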
3.3 ESN Configurations
We set up a moderate reservoir size of \(N_x=500\) with sparsely connected neurons at \(10\%\) connectivity, similar to [13]. The weight matrices, W and \(W_{in}\), hold random values uniformly distributed in the range \([-1,1]\). The spectral radius \(\rho (W)\) can be considered the scaling factor of the weight matrix W. The desired spectral radius can be obtained simply by scaling the weight matrix by the ratio of the desired value to the maximum of the absolute eigenvalues of the matrix. For a long short-term memory task, [3] shows that the setup reaches peak performance with the spectral radius set to one. The only parameter varied in our experiment is the leakage rate (\(\alpha \)), which can be regarded as a time warping of the input signal. All results in our experiment are averages over 4 runs of randomly initialized ESN networks with the same configuration.
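The initialization described above — sparse uniform weights rescaled to a desired spectral radius — can be sketched as follows (the function names are ours, not from the paper):

```python
import numpy as np

def scale_spectral_radius(W, rho_desired=1.0):
    """Rescale W so that its spectral radius (the largest absolute
    eigenvalue) equals rho_desired."""
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho_desired / rho)

def init_reservoir(n_x=500, connectivity=0.1, rho=1.0, seed=0):
    """Sparse reservoir matrix: weights uniform in [-1, 1], only a
    `connectivity` fraction of connections kept, then rescaled to the
    desired spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_x, n_x))
    mask = rng.random((n_x, n_x)) < connectivity  # keep ~10% of weights
    return scale_spectral_radius(W * mask, rho)
```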
4 Experimental Results
4.1 Data Sequence Loss and Redundancy as Variations in Speed
We subsampled the original test data using subsampling factors of 1, 2, 4, 6, 8, and 10, whereas the training data remained the same. A subsampling factor of 2 means that every \(2^{nd}\) frame of the data is taken instead of every single frame (factor of 1). We evaluated our results using an ESN with three leakage rates, \(\alpha =0.1, 0.5\) and 0.9, comparing against two methods, 1-Nearest Neighbor (1-NN) and Random Forest (RdF), which use the 15 joint positions in videos obtained directly from MoCap without RFs. We used both a naive approach and a dimensionality reduction method to extract feature vectors for these two classifiers. The feature vectors for the naive approach are obtained by simply stacking all video frames on top of each other for training the classifiers; the majority vote over the frames of a target video determines the classification. For the dimensionality reduction method, the feature vectors are acquired from PCA employing three principal components in combination with 1-NN and RdF. Figure 5-Left shows the recognition rates using test data from twenty-one untrained angles with respect to data sequence loss, using a training factor of 1. Figure 5-Right shows the recognition rates using a training factor of 5 as a test of data redundancy. Both figures reveal that the ESN with \(\alpha =0.9\) gives the best performance, with robustness against data sequence loss and redundancy, yielding a recognition rate of 95% even with large training and test subsampling factors. The good performance of the ESN with \(\alpha =0.9\), which can handle the variations in speed, might be explained by the long short-term memory behavior of ESNs demonstrated in [3]. Classifying data in the original space using 1-NN as the naive approach also attains a stable outcome of about 80%.
In contrast, classifying data in subspaces using PCA is sensitive to data sequence loss and only gains good outcomes when the frequencies of training and test data sequences are about the same.
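The subsampling scheme above amounts to simple frame striding; a minimal sketch:

```python
def subsample(frames, factor):
    """Keep every `factor`-th frame to simulate data sequence loss.
    factor=1 returns the sequence unchanged; factor=2 keeps every
    2nd frame, and so on. Works on lists and NumPy arrays alike."""
    return frames[::factor]
```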
4.2 Removing Key Points and Drunkard’s Walks
We furthered the experiment by removing key points from all frames of the test videos to simulate occlusion. Removing a wrist from the skeleton does not affect the recognition rate, while removing an ankle causes running, crunching and walking to all be mistaken for marching, which drops the recognition rate to 61%. Furthermore, we extended the test by having three new persons perform four trials simulating a drunkard's walk from twenty-two untrained angles, producing 92 test samples for walking, while the other actions remain the same. Two samples of the drunkard's walk are shown in Fig. 6-Left. The confusion matrix in Fig. 6-Right reveals that walking is misclassified as marching in 20.7% of cases. The closely similar trajectories of these two actions can be inspected in Fig. 2(c) and Fig. 6-Left.
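Simulating the occlusion of a key point amounts to deleting one joint's coordinates from every frame. A sketch, assuming the video is stored as an \(N_F \times 15 \times 2\) array of joint positions (the helper name is ours):

```python
import numpy as np

def remove_joint(frames, joint_idx):
    """Simulate occlusion by deleting one joint's (x, y) coordinates
    from every frame.

    frames : (N_F, 15, 2) array of per-frame joint positions
    Returns an (N_F, 14, 2) array with the chosen joint removed.
    """
    return np.delete(frames, joint_idx, axis=1)
```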
4.3 Discussion of Related Work
One of the earliest promising methods for view-independent recognition of 3D MoCap was introduced by [6]. It applied a non-thresholded recurrence plot, computing a similarity matrix of each joint as a sum of squared differences. The benefit of this method is that the descriptors are stable across view changes. The recognition relies on a Bag-of-Features obtained from Histograms of Oriented Gradients. However, the disadvantage of this approach is that the sequences of all motions in the experiment must have equal length in order to obtain a fixed window size for recognition. Another study on view-independent recognition of MoCap is [14]. It proposed a feature extraction technique that transforms either 2D or 3D data into subspaces to form an action descriptor. The major advantages of this approach are that it yields a very small fixed data size regardless of video length, as well as very fast computation. The test on motion projected to 2D achieves a recognition rate of 96.5% over 21 untrained angles for 10 actions, and the method is also very stable in the case of data sequence loss. Other interesting skeleton-based action recognition approaches for 3D MoCap are, for instance, [2, 16]. They proposed and compared several deep recurrent neural network architectures with Long Short-Term Memory (LSTM) for classification. The tests were carried out using 65 classes of the HDM05 MoCap dataset, yielding up to 96.92% and 97.25% recognition rates, respectively. Nonetheless, the tests were only performed for one default view.
5 Conclusion
We have introduced a feature extraction scheme from a biologically inspired model by applying the concept of receptive fields to point-light patterns of human motion. Our proposed scheme, in combination with an ESN, which presents itself as a good approximator, yields good performance and robustness against variations in speed even when the trajectories of the motions are fuzzy. This representation could also be deployed for human motion classification based on optical flow obtained from standard videos, where human pose estimation is infeasible. The designed ESN is generic in the sense that it is not specialized to human motion, and it shows good prediction on unseen data. Hence, application to other domains of articulated objects in motion is possible. Furthermore, new technologies such as the IBM TrueNorth chip have introduced dedicated neuro-inspired hardware that allows modeling hundreds of thousands up to a million neurons at very low energy. ESNs, which offer very simple learning mechanisms, can be optimized by local learning rules that scale well even for very large systems. Therefore, the ESN is a potential candidate for low-energy systems that can become an integral part of future sensor technology.
References
CMU Graphics Lab: CMU Motion Capture Database. http://mocap.cs.cmu.edu/
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015
Jaeger, H.: Long Short-Term Memory in Echo State Networks: Details of a Simulation Study. Technical Report, Jacobs University, Bremen, Germany, February 2012
Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication. Science 304(5667), 78–80 (2004)
Johansson, G.: Visual perception of biological motion and a model for its analysis. Percept. Psychophys. 14(2), 201–211 (1973)
Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 172–185 (2011)
Kale, G.V., Patil, V.H.: A study of vision based human motion recognition and analysis. IJACI 7(2), 75–92 (2016)
Livne, M., Sigal, L., Troje, N.F., Fleet, D.J.: Human attributes from 3D pose tracking. Comput. Vision Image Underst. (CVIU) (2012)
Miller, L.E., Saygin, A.P.: Individual differences in the perception of biological motion: Links to social cognition and motor imagery. Cognition 128(2), 140–148 (2013)
Pascanu, R., Jaeger, H.: A neurodynamical model for working memory. Neural Netw. 24(2), 199–207 (2011)
Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 71 (2015)
Tanisaro, P., Schöning, J., Kurzhals, K., Heidemann, G., Weiskopf, D.: Visual analytics for video applications. Inf. Technol. 57, 30–36 (2015)
Tanisaro, P., Heidemann, G.: Time series classification using time warping invariant echo state networks. In: 15th IEEE International Conference on Machine Learning and Applications (ICMLA) (2016)
Tanisaro, P., Mahner, F., Heidemann, G.: Quasi view-independent human motion recognition in subspaces. In: Proceedings of 9th International Conference on Machine Learning and Computing (ICMLC) (2017)
Troje, N.F., Sadr, J., Geyer, H., Nakayama, K.: Adaptation aftereffects in the perception of gender from biological motion. J. Vision 6(8), 7 (2006)
Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., Xie, X.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3697–3703 (2016)
© 2017 Springer International Publishing AG
Tanisaro, P., Lehman, C., Sütfeld, L., Pipa, G., Heidemann, G. (2017). Classifying Bio-Inspired Model of Point-Light Human Motion Using Echo State Networks. In: Lintas, A., Rovetta, S., Verschure, P., Villa, A. (eds) Artificial Neural Networks and Machine Learning – ICANN 2017. ICANN 2017. Lecture Notes in Computer Science(), vol 10613. Springer, Cham. https://doi.org/10.1007/978-3-319-68600-4_11
Print ISBN: 978-3-319-68599-1
Online ISBN: 978-3-319-68600-4