
1 Introduction

Touch and tactile based systems in cars cause visual distraction, which affects attention while driving [1]. The work in [2] shows that simple and natural interactions with multimedia devices in cars improve driver safety. The authors of [3] compared various in-vehicle interaction systems and reported that gesture based interaction requires the least eye contact. The work in [4] also shows that driver performance can degrade sharply with a small increase in the shift of attention. It can thus be argued that a robust, touch-sensor-free, gesture based interaction improves driver safety.

Vision based Hand Gesture Recognition (HGR) techniques can be divided into two broad classes, static and dynamic. The first [5, 6] recognizes only a static pose of the hand, while the second additionally uses the changing hand pose and hand motion over frames. The latter scheme supports a potentially larger and more natural set of gestures. The primary challenge for an HGR system is the rapidly changing global illumination. Further, defining an optimal camera location that minimises palm occlusion is a difficult task. It has been observed that an overhead location is best suited for such problems [7] because it minimises occlusion due to objects inside the car; however, the self-occlusion of the hand remains significant, especially when a gesture is performed with the palm pointing vertically downward. It is also desirable to have a flexible system which can be modified to identify gestures that were not originally built into it. The problem of illumination is suppressed by the choice of sensor; the problems of occlusion and flexibility, on the other hand, require algorithmic solutions.

Early solutions for HGR used Finite State Machines (FSMs) [8]: a gesture was divided into phases, and a set of twelve gestures was classified. Inspired by results on handwriting recognition [9] and speech analysis, various adaptations of the Hidden Markov Model (HMM) have been used [10]. Another branch of solutions includes neural networks and Recurrent Neural Networks (RNNs) [11]. Most often, both the FSM and RNN strategies use the information of the instantaneous hand-pose for identifying gesture sequences.

The Long Short-Term Memory (LSTM) network [12, 13] is a variation of the traditional RNN and has been shown to outperform it. It has recently been used extensively for handwriting and speech recognition tasks [14]. In contrast to the HMM, where some prior experiments are required to identify the number of states, an LSTM model is easier to construct. The authors of [15] used an LSTM for gesture identification and demonstrated that it performs better than an HMM and an SVM.

Location, orientation and velocity of the palm have been used as features for gesture recognition problems [16]. This work reaffirms that features like palm and finger positions, along with their velocities, are useful for gesture classification. A two-phase classification scheme using three LSTMs is introduced. It is demonstrated that distributing the learning, such that one phase learns from the hand pose and the other from the direction of motion, simplifies the learning tasks.

An early-detection system, capable of predicting the gesture class while the gesture is still being completed, is introduced. This is an important requirement for an interaction system. To achieve this, a one-to-one labelling scheme between gesture frames and the gesture class is used. Some sequences are sub-sampled for learning fast sequences. The proposed cumulative probability addition scheme for prediction also helps stabilise the system response during discontinuous hand movements. The HGR system with this LSTM architecture demonstrates an overall frame-wise accuracy of over 90.5%. We observed that, with equally sized data, the proposed two-phase early hand gesture recognition system outperforms a single all-class, but larger, LSTM based system, which provides an accuracy of 86%.

Section 2 introduces the gestures used for the experiments and describes the data collection and feature extraction process, along with the data augmentation method and the data distribution scheme for cross-validation. The overall system architecture, the prediction scheme and the cumulative probability method are described in Sect. 3. The analysis of the training process, the test results and comparisons with single all-class network prediction are presented in Sect. 4. A discussion of the results and possible future directions is presented in Sect. 5.

2 Gesture Data and Features

2.1 Gesture Definition

A hand-gesture is a sequence of frames of a moving palm. It can involve motion of the palm without a change in the hand-pose, or it can be defined as a sequence of hand-poses in which the different hand-poses occur in a predictable, predetermined order. For this work the recorded hand-gestures are ‘Clicking’, ‘Swiping’ left and right, ‘Swiping’ up and down, ‘Accepting’, ‘Declining’, ‘Drop’ and ‘Grabbing’. ‘Clicking’ involves a forward horizontal motion of the pointing finger. Hand motion in the horizontal left-right direction is denoted as ‘Swiping’ left or right; the swiping motion may be repeated more than once. Similarly, vertical palm motion is denoted as vertical swiping. ‘Accepting’ is a motion of the hand outwards from the screen (relative to the camera). ‘Declining’ is the motion of the hand into the screen. ‘Grabbing’ involves a transition from a spread hand, with the palm facing vertically downwards, to a position of joined fingers, accompanied by some vertical motion. ‘Drop’ begins with joined fingers and ends in a spread hand with a short downward motion.

2.2 Data Collection and Properties

The output frames from the camera have two channels, depth and amplitude. The amplitude values of the pixels are proportional to the reflectance of the surface and inversely proportional to the square of the distance values. The data is recorded at a frame rate of 25 frames per second.

We use a Photonic Mixer Device (PMD) Nano sensor with a resolution of 120 \(\times \) 165 pixels for recording data. This ToF based 3-D camera is attached to the rear-view mirror holder protection, as shown in Fig. 1. The dataset thus consists of 3-D top views of hand gestures; it is also used for a hand pose recognition problem in [17]. The experiments for hand gesture recognition inside the car are conducted with seventeen participants. The data is recorded inside the car, and each participant performs nine gestures around the sat-nav screen of the car. Every participant repeats each gesture six to twelve times. Each frame of a sequence is marked with two labels. ‘Accepting’, ‘Declining’, ‘Drop’, ‘Grabbing’, ‘Clicking’, ‘Horizontal’, and ‘Vertical’ are used as the primary labels. Sequences marked as ‘Horizontal’ receive a secondary label ‘Left’ or ‘Right’, and those marked as ‘Vertical’ receive a secondary label ‘Up’ or ‘Down’.
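For illustration, this two-level labelling can be written as a small lookup structure. The following Python sketch is our own representation (the label strings follow the list above, but the container and function names are assumptions, not part of the recorded dataset); it maps a primary label, plus an optional secondary label, to one of the nine gesture classes.

```python
from typing import Optional

# Primary labels (Sect. 2.2); 'Horizontal' and 'Vertical' sequences carry a
# secondary label that resolves the intended direction of the swipe.
PRIMARY_LABELS = ['Accepting', 'Declining', 'Drop', 'Grabbing',
                  'Clicking', 'Horizontal', 'Vertical']
SECONDARY_LABELS = {'Horizontal': ('Left', 'Right'),
                    'Vertical': ('Up', 'Down')}

def gesture_class(primary: str, secondary: Optional[str] = None) -> str:
    """Resolve a (primary, secondary) label pair to one of the nine classes."""
    if primary in SECONDARY_LABELS:
        if secondary not in SECONDARY_LABELS[primary]:
            raise ValueError(f'{primary} sequences need a direction label')
        return secondary            # e.g. 'Left', 'Up'
    return primary                  # e.g. 'Clicking', 'Drop'
```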

As the data is recorded inside a car, the depth information can be combined with information about the car environment. This information is utilised to extract features for hand shape, location and motion. Note that these features are explicitly utilised in the proposed approach, and thus no comparison on a different dataset is shown.

Fig. 1. The camera setup.

2.3 Training and Testing Strategy

Before training the network, the data was shuffled such that the frames of a complete gesture sequence stay together, while the gesture sequences themselves were placed randomly. This shuffling was essential because each participant continuously repeated the same gesture multiple times during recording. Each frame of a gesture sequence is marked with the label of the entire sequence. This allows us to train the network in such a way that it attempts to predict the sequence label from the start of the gesture.

The total number of sequences available for training the model is increased by sub-sampling approximately one fifth of the sequences in time. An equal proportion of sequences from each gesture class is reduced to half its duration. Such down-sampling effectively creates samples in which the duration for completing a gesture is shorter than that of the average gesture sequence. The start and end of each sequence, including the sub-sampled ones, are marked. Both the training and testing phases of the algorithm use these sequence markers. Table 1 gives the distribution of the data samples over the classes and the number of sub-sampled sequences created for each class.
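A minimal sketch of this augmentation step is given below, assuming each sequence is stored as a dictionary with a frame array and a class label; the field names and the rule of keeping every second frame to halve the duration are our assumptions about the implementation detail.

```python
import random
from collections import defaultdict

def add_subsampled_sequences(sequences, fraction=0.2, seed=0):
    """Create extra 'fast' samples: halve the duration of roughly one fifth
    of the sequences, drawn in equal proportion from every gesture class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seq in sequences:
        by_class[seq['label']].append(seq)

    extra = []
    for label, seqs in by_class.items():
        for seq in rng.sample(seqs, max(1, round(fraction * len(seqs)))):
            extra.append({'frames': seq['frames'][::2],   # keep every 2nd frame
                          'label': label,
                          'subsampled': True})            # kept with the markers
    return sequences + extra
```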

For testing, a leave-two-participants-out cross-validation was performed on the data. The data from the seventeen participants was distributed into eight sets of two participants and one set of one. Owing to the otherwise small test dataset, a 9-fold cross-validation is done to report the average accuracy of the model. The sub-sampled sequences are separately divided into 9 groups and then used in training and testing accordingly.
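The participant-wise split can be sketched as follows; the grouping into eight pairs and one singleton is as described above, while the participant identifiers and the loop body are illustrative placeholders.

```python
def participant_folds(participant_ids):
    """Eight groups of two participants and one group of one (17 in total)."""
    assert len(participant_ids) == 17
    return [participant_ids[i:i + 2] for i in range(0, 17, 2)]

participants = list(range(1, 18))            # hypothetical participant ids
for test_group in participant_folds(participants):
    train_group = [p for p in participants if p not in test_group]
    # train the three networks on train_group, evaluate on test_group,
    # and average the resulting confusion matrices over the nine folds
```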

Table 1. Number of gesture class samples in dataset

2.4 Segmentation and Feature Extraction

The palm region is segmented by defining a virtual cuboidal space in the region where we wish to observe the hand gesture. The background was generated by recording a video in the car and keeping the consistent pixels. This 3-D background image was then subtracted from each incoming image of the video sequences. Furthermore, the palm pixel closest to the screen is tracked. The hand region is segmented by assuming a real hand length of 18 cm, another threshold separates the hand and the fingers, and a Mahalanobis distance based K-means clustering refines the palm-finger segmentation. The palm centroid and finger-tip are estimated and tracked using a Kalman filter. The features are further described in Table 2. The features were centred and normalised such that the mean of each feature element over the training data was zero and the variance was unity.
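The normalisation step can be sketched as below, assuming the extracted features of the training folds are stacked into a single array of shape (frames, feature dimension); the small epsilon guard is our addition.

```python
import numpy as np

def fit_normaliser(train_features):
    """Per-feature mean and standard deviation estimated on the training data."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0) + 1e-8   # guard against constant features
    return mu, sigma

def normalise(features, mu, sigma):
    """Centre and scale so that training features have zero mean, unit variance."""
    return (features - mu) / sigma
```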

Table 2. Description of features used for the experiments

3 System Architecture and LSTM Networks

A two-phase strategy is employed for classification. To this end, three neural network based systems are combined. The first network classifies the seven primary classes describing the nature of the motion. The other two networks are trained to classify the direction of the motion, i.e. Up vs. Down and Left vs. Right; these networks are used in series with the first network. Various neural network architectures were trained and tested for the three classifiers, and the architectures which separately provided the best cross-validation results were used in the classification system. These networks are described in detail below.

Fig. 2. LSTM network with the output decision unit.

Each network has one LSTM layer and several fully connected dot-product layers. The input layer is connected to a dot-product layer. Non-linearity is added to the network by using a tanh activation function with each fully connected layer. The network for the primary classifier has five layers apart from the input layer and the output softmax layer, with the LSTM layer placed as the fourth layer from the input. The output layer has seven nodes, each representing one of the seven primary classes, see Fig. 2.
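The layer ordering described above can be sketched as follows; we use PyTorch purely for illustration, and the hidden-layer width (64) is an assumption, since the exact layer sizes are not stated.

```python
import torch
import torch.nn as nn

class PrimaryGestureNet(nn.Module):
    """Three tanh dot-product layers, an LSTM as the fourth hidden layer,
    one further tanh layer and a seven-node softmax output (cf. Fig. 2)."""

    def __init__(self, n_features, hidden=64, n_classes=7):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_features, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # fourth layer
        self.back = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x, state=None):
        # x: (batch, time, n_features) -> per-frame class probabilities
        h = self.front(x)
        h, state = self.lstm(h, state)
        return torch.softmax(self.back(h), dim=-1), state
```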

The binary classifiers identify the intended direction of the motion when the palm moves in the horizontal or vertical direction. Since the swiping motion may be repeated more than once while completing the gesture, identifying the intended gesture is a more difficult problem than merely identifying the direction of motion. Each binary classifier LSTM network has three hidden layers along with one LSTM layer. The output layer has two nodes and a softmax activation function. The connection weights and biases are independent across all networks. The three networks are trained independently, using the samples belonging to the corresponding classes from the same training dataset. The training uses the RPROP algorithm for the optimisation process [18].
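A sketch of a per-sequence training step with resilient backpropagation is shown below, reusing the PrimaryGestureNet sketch above; the dummy tensors, feature dimension and loss formulation are assumptions made for illustration, with RPROP applied through torch.optim.Rprop.

```python
import torch

# Dummy data standing in for one labelled gesture sequence:
# 40 frames, 10-dimensional features, one primary label repeated per frame.
frames = torch.randn(1, 40, 10)
labels = torch.randint(0, 7, (1, 40))

model = PrimaryGestureNet(n_features=10)
optimiser = torch.optim.Rprop(model.parameters())   # resilient backpropagation
criterion = torch.nn.NLLLoss()                      # expects log-probabilities

for epoch in range(600):                            # 600 epochs, cf. Sect. 4
    optimiser.zero_grad()
    probs, _ = model(frames)
    loss = criterion(torch.log(probs + 1e-12).flatten(0, 1), labels.flatten())
    loss.backward()
    optimiser.step()
```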

Fig. 3. The system architecture and the cumulative probability addition scheme.

3.1 Prediction

The system is shown in Fig. 3. It can be broadly separated into a frame classification part, Fig. 3a, which produces a nine dimensional probability vector \( \overrightarrow{p(t)} \) at time t, and an output probability combination part, Fig. 3b, which results in another nine dimensional probability vector \(\overrightarrow{P(t)}\).

In the classification part of the system, the primary classifier is connected with the two motion-direction classifiers, see Fig. 3. It provides a seven dimensional probability vector \(\overrightarrow{p^1(t)}\), which contains five gesture probabilities \(\overrightarrow{p(t)^g} = p(t)^{1-5}\) and the probabilities for horizontal and vertical motion, \(p(t)^h\) and \(p(t)^v\). On identifying vertical or horizontal swiping of the hand, the vertical or horizontal motion classifier is activated with the binary activation signals \(A_{v}\) and \(A_{h}\). The activated binary classifiers then detect the intended direction of the swiping gestures, resulting in the two dimensional probability vectors \(\overrightarrow{p(t)^v}\) and \(\overrightarrow{p(t)^h}\) for the vertical and horizontal direction, respectively. The outputs from these classifiers replace the motion probabilities in the primary classifier output, (1). The output probabilities from the LSTM units are combined, (2), into a nine dimensional vector \(\overrightarrow{p(t)}\), weighted by their values in the primary probability vector; the resulting vector is re-normalised to form a probability vector \(\overrightarrow{\hat{p(t)}}\), (3).

$$\begin{aligned} \overrightarrow{p(t)^{'k}} = {\left\{ \begin{array}{ll} [\frac{p(t)^k}{2},\frac{p(t)^k}{2}] &{} \text {if } A_{k} = 0.\\ \overrightarrow{p(t)^k} &{} \text {if } A_{k} = 1 \text { where } k \in [v,h] . \end{array}\right. } \end{aligned}$$
(1)
$$\begin{aligned} \overrightarrow{p(t)} = [\frac{\sum _{j=1}^{5}p(t)^{j}}{5} . \overrightarrow{p(t)^g}; \; p(t)^v.\overrightarrow{p(t)^{'v}}; \; p(t)^h . \overrightarrow{p(t)^{'h}}]. \end{aligned}$$
(2)
$$\begin{aligned} \overrightarrow{\hat{p(t)}} = \overrightarrow{p(t)} / \vert \overrightarrow{p(t)} \vert . \end{aligned}$$
(3)
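A minimal numerical sketch of Eqs. (1)-(3) follows; the index convention (the first five entries of the primary output are the gesture classes, followed by the vertical and the horizontal motion probability) and the function name are our assumptions.

```python
import numpy as np

def combine_outputs(p1, p_v, p_h, A_v, A_h):
    """Fuse the 7-class primary output p1 with the binary direction outputs
    p_v and p_h into a nine-class probability vector, following Eqs. (1)-(3)."""
    p1 = np.asarray(p1, dtype=float)

    # Eq. (1): if a direction classifier was not activated, split its
    # primary-class probability evenly between the two directions.
    p_v = np.asarray(p_v) if A_v else np.array([p1[5] / 2, p1[5] / 2])
    p_h = np.asarray(p_h) if A_h else np.array([p1[6] / 2, p1[6] / 2])

    # Eq. (2): weight each block by the corresponding primary probabilities.
    p = np.concatenate([p1[:5].mean() * p1[:5],   # five gesture classes
                        p1[5] * p_v,              # Up / Down
                        p1[6] * p_h])             # Left / Right

    return p / p.sum()                            # Eq. (3): re-normalise
```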

The early predictions of the LSTM based system are stabilised using the cumulative probability addition scheme of Fig. 3b. The cumulative addition of the probabilities regularises the estimates while making an early prediction. This adds robustness against jerks, stops and changes in hand direction during the completion of the hand-gesture sequence. A strategy based on a maximum-probability or majority decision approach predicts the gesture only at the end of the sequence; the described method, in contrast, makes a probability estimate for the gesture at every frame.

The system output probability is given as \(\overrightarrow{P(t)}\), (4). The sum is reset to zero whenever an end-of-sequence impulse is seen. \(I_n\) is the impulse corresponding to the \(n^{th}\) sequence; the impulse has a value of 1 and the impulse time is given by \(t_{I_n}\). The probability addition is initiated again with a sequence-begin impulse. The \(n^{th}\) prediction \(G_{n}\) corresponds to the index i of the maximum value in the probability vector \(\overrightarrow{P(t)}\), (5). Since the initial frames of a sequence have little or no temporal context, the predictions made during the first \(t_{d}\) frames of the input stream are not reliable and are therefore not read at the output. This scheme allows continuous predictions, unlike majority-vote decisions where the prediction is made only after viewing the entire sequence.

$$\begin{aligned} \overrightarrow{P(t)} = \overrightarrow{\hat{p(t)}} + (1-I) \times (\overrightarrow{P(t-1)}) \end{aligned}$$
(4)
$$\begin{aligned} G_{n} = arg\max _{i} (\overrightarrow{P(t)}) : t-t_{I_{n-1}} > t_{d} \end{aligned}$$
(5)
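The read-out of Eqs. (4) and (5) can be sketched as a small stateful accumulator; the class name and the reset-on-begin convention are implementation assumptions, and t_d = 8 frames follows the evaluation setting in Sect. 4.

```python
import numpy as np

class CumulativePredictor:
    """Accumulate per-frame probability vectors (Eq. 4) and read out the
    arg-max class only after t_d frames of temporal context (Eq. 5)."""

    def __init__(self, n_classes=9, t_d=8):
        self.P = np.zeros(n_classes)
        self.frames_since_begin = 0
        self.t_d = t_d                       # 8 frames ~ 0.3 s at 25 fps

    def update(self, p_hat, begin_impulse=False):
        if begin_impulse:                    # sequence boundary: reset the sum
            self.P[:] = 0.0
            self.frames_since_begin = 0
        self.P += p_hat                      # Eq. (4)
        self.frames_since_begin += 1
        if self.frames_since_begin > self.t_d:
            return int(np.argmax(self.P))    # Eq. (5): current prediction G_n
        return None                          # too little context, no read-out
```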

4 Results and Comparisons

This section describes the training progression of the three models and presents the performance of the entire system. As mentioned in the previous sections, the first few predictions made by the system are not considered for output; we also skip these frames in the evaluation. The output probabilities beyond the eighth frame of a gesture, which corresponds to a time period of 0.3 s, are considered for the analysis.

Fig. 4. Train-test error progression.

Table 3. Confusion matrix of the proposed system.
Table 4. Confusion matrix of the single all-class LSTM.
Fig. 5. Test error with early start location.

The train-test error progression over learning epochs during a sample cross-validation, for the primary classification phase and the left-right binary classifier, is depicted in Fig. 4. The primary network was trained for 600 epochs and evaluated every second epoch. The average misclassification rate for this training run was 5%, and the misclassification rate on the test data at the end of training was 7%, see Fig. 4a. Both the up-down and left-right classifiers were trained as binary classifiers for 400 epochs. The misclassification rate on the training data after the completion of training is 6% for the up-down motion classifier and 1.5% for the left-right classifier; the misclassification rates on the test data are 8% and 3%, respectively. Figure 4b shows the error progression of the left-right classifier. On combining the three networks into the described system, the observed misclassification rate for the full system is 9.25%. Table 3 shows the confusion matrix for the classification of the nine gesture classes with the architecture following the two-level classification strategy.

Compared with a larger single all-class LSTM, chosen after experiments with multiple LSTM models, the performance of the proposed system is considerably better. The improvement is large for the gestures where direction is important. For the other gestures, the performance improves in all classes apart from ‘Drop’, where the accuracy remains the same, and ‘Grab’, which shows a small decrease. The performance of the compared LSTM model is shown in Table 4. Confusion matrices are calculated at each step of the 9-fold cross-validation and the mean confusion matrix is reported.

5 Discussion and Conclusion

The performance of the system improves when decisions are taken after a longer delay from the beginning of the sequence. Figure 5 shows how the accuracy changes when the latency period for the frame-wise prediction is varied. A decision taken after a longer latency benefits from a larger temporal context and is usually more accurate. Some gestures with similar hand shapes and short motions were misclassified, which is reflected in the occasional misclassification of ‘Accept’ and ‘Decline’ as ‘Up’ and ‘Down’, respectively. This explains the lower accuracy of the up-down gestures in the combined system even though the binary classification accuracy is high. The accumulated regularisation of the system output also resulted in missing very fast, short gestures.

5.1 Conclusion

This work presented a one-to-one gesture-sequence to label-sequence training procedure to make an immediate decision on the gesture label as soon as the gesture sequence begins. A performance improvement from the two-phase classification, in which one phase classifies gestures by the change in hand shape and the other by the direction of motion, is demonstrated. The modified system architecture achieves an average cross-validation accuracy of 90.75% on the dataset.

This work also introduced an accumulated probability based solution for predicting the gesture at every frame; this eliminates the requirement of delaying the classification until the end of the sequence and also stabilises the prediction against hand jerks and motion discontinuities.

As future work we plan to address the misclassification of gestures with similar hand shapes, for which we plan to develop more robust shape descriptors. Moreover, spotting short, intended motions of the palm might help in identifying the beginning of sequences. The prediction performance for short gesture sequences is comparatively worse; using Bayesian filtering approaches together with pose identification solutions may help improve this performance.