
1 Introduction

Increasing the level of automation and robotization in all spheres of human activity is one of the key tasks of the modern information society. In this connection, scientists and governments of developed as well as developing countries, in cooperation with world scientific centers and companies, pay attention to technologies for effective, natural, and universal interaction between humans and computers or robots.

Currently, interactive information systems are used in social services, medicine, education, robotics, the military industry, and community service centers to interact with people in various situations. In addition, robotic assistants that are simple and intuitive to use are becoming more and more widespread. Compared to industrial robots, which are only able to repeat predetermined tasks, robot assistants are aimed at interacting with people in the performance of their tasks. In this case, many classical interfaces are not sufficient; instead, more intuitive and natural human interfaces are needed (speech [2], gestural [3], multimodal [1, 4,5,6], etc.). For example, gestures can transmit simple commands with unambiguous meaning to a robot and remain effective at some distance from the robot and in noisy conditions where speech is ineffective.

It is also known that deaf people have limited capabilities when communicating with the hearing. Therefore, there is a need to develop sign language recognition technologies for deaf people. In addition to large international companies, national research centers are also working in this direction. Scientists from the Robotics Institute at Carnegie Mellon University are working on a system that can analyze body language and gestures down to the level of the fingers [7]. A number of researchers rightly point out that serious differences in the semantic-syntactic structure of written and sign languages do not yet allow an unambiguous translation between them. Therefore, there are currently no fully automatic sign language translation systems. To create a complete model, it is necessary to perform a semantic analysis of written phrases, and this is still possible only at a superficial level because of imperfections in text analysis algorithms and knowledge bases.

At present, Microsoft provides the Kinect depth sensor as a tool for the development of systems capable of recognizing sign language [8, 9]; it delivers a three-dimensional video stream in the form of a depth map or a three-dimensional point cloud. MS Kinect 2.0 provides simultaneous detection and automatic tracking of up to 6 people at a distance of 1.2–3.5 meters from the sensor. In the software, a virtual model of the human body is represented as a 3D skeleton of 25 joints.

The paper is organized as follows: in Sect. 2 we introduce the dataset used; in Sect. 3 we present the processing methods and discuss software implementation details; in Sect. 4 we describe the experiment and show the obtained results; and finally, in Sect. 5 we draw conclusions and outline our future research.

2 Dataset

In this paper, we use our own dataset of numeral hand gestures. We recorded 10 hand gestures performing the numbers from American Sign Language. These gestures are, to some extent, universal, and many other sign languages use them. We recorded 18 people performing the gestures with 5 repetitions using a commercial depth sensor, Kinect v2. For the purpose of this research, we use only the depth data stream. Each repetition of a gesture consists of a movement of the hand into the performing space, where the hand stops and a static gesture representing a number from zero to nine is shown. To obtain only the frames with the static gesturing hand, we implemented our own semi-automatic labeling algorithm. Since Kinect provides us with a skeletal model of a human, it is easy to follow the movement of the hand by tracking the joint representing the palm. Some time synchronization is needed, but the position of the joints changes linearly between consecutive frames, and thus the proper position of the palm joint at the time of depth map acquisition is easily interpolated. The palm joint location is considered the center of a 3D box containing the hand. Since Kinect uses an orthographic projection along the depth axis, the depth of the 3D box is always constant and was chosen to be 200 mm. However, the xy-axes use a projective transformation, and thus the size of the 3D box in the image plane has to be adapted according to the depth of the palm joint. We use the same box size in both the x and y axes, computed using the formula:

$$\begin{aligned} M = \frac{\alpha \cdot \mathrm {depth_{max}}}{\mathrm {depth}}, \end{aligned}$$
(1)

where M is the size of the box in pixels, \(\mathrm {depth_{max}}\) is the maximal depth of the capturing device (in our case 8000), depth is the measured depth at the palm joint location, and \(\alpha \) is a scale coefficient, which we experimentally chose to be 15. All the 3D boxes are resized to \(96\,\times \,96\) pixels, and the depth within each box is normalized from 0 to 1. The resulting hand depth images are manually labeled either as one of the numeral gestures or as a non-informative gesture simply named background. Furthermore, if the performer used his/her left hand for gesturing, the resulting hand depth image was flipped.
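A minimal sketch of this cropping and normalization step, assuming the palm joint has already been mapped to depth-image coordinates (px, py) with depth palm_depth in millimeters, could look as follows; the suppression of pixels outside the 200 mm depth extent and the normalization by the sensor range are illustrative assumptions, as the exact implementation is not detailed above:

```python
# Minimal sketch of the hand-cropping step (assumptions noted in the text above).
import numpy as np
import cv2

ALPHA = 15          # experimentally chosen scale coefficient
DEPTH_MAX = 8000    # maximal depth reported by the sensor
BOX_DEPTH = 200     # depth extent of the 3D box in mm
OUT_SIZE = 96       # side of the resulting hand depth image in pixels

def crop_hand(depth_map, px, py, palm_depth):
    """Crop and normalize the hand region around the palm joint."""
    # Eq. (1): the box size in pixels shrinks with increasing palm depth.
    m = int(round(ALPHA * DEPTH_MAX / palm_depth))
    half = m // 2

    # Clip the box to the image boundaries.
    y0, y1 = max(py - half, 0), min(py + half, depth_map.shape[0])
    x0, x1 = max(px - half, 0), min(px + half, depth_map.shape[1])
    box = depth_map[y0:y1, x0:x1].astype(np.float32)

    # Suppress depths outside +-BOX_DEPTH/2 of the palm (assumption).
    box[np.abs(box - palm_depth) > BOX_DEPTH / 2] = 0

    # Resize to the fixed resolution and normalize depth to [0, 1]
    # (assumption: normalization by the sensor range).
    box = cv2.resize(box, (OUT_SIZE, OUT_SIZE))
    return box / DEPTH_MAX
```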

Fig. 1. Example of the dataset. From top left to bottom right: gesture for no. 5, background, no. 4, no. 0, no. 2, and background again.

Next, the hand depth images were augmented to aid the training of the neural network. We used random translation and planar rotation to obtain the final dataset. Each hand image was translated four times by a randomly selected 2D vector representing the planar translation, with components drawn from a uniform distribution over the interval [−12, 12] px. The rotation was performed three times by a randomly selected angle from the interval \(\pm 20^\circ \). In total, the dataset consists of 130,843 depth images of hands. Some examples from the dataset are shown in Fig. 1.
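A minimal sketch of this augmentation step could look as follows; details beyond those stated above (e.g., the border handling of the warps) are illustrative assumptions:

```python
# Minimal sketch of the augmentation step: 4 random translations and
# 3 random planar rotations per hand depth image.
import numpy as np
import cv2

def augment(hand_img, n_translations=4, n_rotations=3):
    """Return randomly translated and rotated copies of a 96x96 hand image."""
    h, w = hand_img.shape
    augmented = []

    # Random planar translations drawn uniformly from [-12, 12] px.
    for _ in range(n_translations):
        tx, ty = np.random.uniform(-12, 12, size=2)
        m = np.float32([[1, 0, tx], [0, 1, ty]])
        augmented.append(cv2.warpAffine(hand_img, m, (w, h)))

    # Random planar rotations drawn uniformly from [-20, 20] degrees.
    for _ in range(n_rotations):
        angle = np.random.uniform(-20, 20)
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        augmented.append(cv2.warpAffine(hand_img, m, (w, h)))

    return augmented
```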

3 Methods

Owing to the improvements in neural networks since 2012 [10], most hand-crafted feature descriptors in image classification have become inferior to machine-learned ones, provided enough data is available. In this paper, we tested two approaches on the task of numeral gesture classification.

First, we calculated Histograms of Oriented Gradients (HoGs) [11] for all the data. Each HoG cell had 16\(\,\times \,\)16 pixels and each block had 3\(\,\times \,\)3 cells. With these settings, we obtained a feature vector of dimension 1152 for each image.

These HoGs were used to train a standard Support Vector Machine (SVM) [12] classifier with an RBF kernel. This setup serves as our baseline method.
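A minimal sketch of this baseline, using scikit-image and scikit-learn, could look as follows; the number of orientation bins is not stated above, so 8 bins are assumed here (with which the descriptor length matches the reported 1152):

```python
# Minimal sketch of the HoG+SVM baseline (orientation bins are an assumption).
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(hand_img):
    """96x96 hand depth image -> 1152-dimensional HoG descriptor."""
    return hog(hand_img,
               orientations=8,            # assumption, see lead-in above
               pixels_per_cell=(16, 16),  # 16x16 pixels per cell
               cells_per_block=(3, 3))    # 3x3 cells per block

def train_baseline(train_images, train_labels):
    """Train the RBF-kernel SVM on HoG descriptors of the training images."""
    features = np.array([extract_hog(img) for img in train_images])
    clf = SVC(kernel='rbf')
    clf.fit(features, train_labels)
    return clf
```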

Second, we trained a convolutional neural network with a modified VGG16 architecture [13]. This architecture is a gold standard among neural network architectures used for image classification, especially for tasks with a smaller amount of training data. The exact network configuration we used is shown in Table 1.

Table 1. Modified VGG16 architecture.

4 Experiments and Results

In our experiment, we evaluate the performance of both methods on the classification task of numeral gestures, i.e., we want to classify each input image into one of 11 classes (10 numerals and background).

Due to the amount of data, we use cross-validation with 10 different cross-validation settings. For each of them, our dataset was split into two subsets: a test set containing the data of 4 speakers and a train set containing the data of the remaining speakers.
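A sketch of such a speaker-independent split could look as follows; the concrete assignment of speakers to the 10 settings is not specified above, so it is drawn randomly here:

```python
# Sketch of speaker-independent cross-validation splits (random assignment of
# speakers to splits is an assumption).
import numpy as np

def make_cv_splits(speaker_ids, n_splits=10, n_test_speakers=4, seed=0):
    """Yield (train_idx, test_idx) arrays; test speakers never appear in train."""
    rng = np.random.default_rng(seed)
    speakers = np.unique(speaker_ids)
    for _ in range(n_splits):
        test_speakers = rng.choice(speakers, size=n_test_speakers, replace=False)
        test_mask = np.isin(speaker_ids, test_speakers)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]
```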

As a benchmark method, the SVM classifier trained on the 1152-dimensional HoGs was used. The average recognition accuracy across all cross-validation settings was 52.31% ± 3.51% on the test data.

Table 2. Comparison of the recognition accuracy results from individual cross-validations (CVs).

For the neural network architecture, we start from the VGG16 architecture; however, we cut one of the fully-connected layers entirely and resized the second one from 4096 to 1024 units, i.e., this layer provides a feature vector of size 1024, which is comparable with the dimension of the used HoG descriptor.
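A minimal Keras sketch of such a modified VGG16 could look as follows; the single-channel input shape, activation choices, and layer arrangement beyond what is stated above are illustrative assumptions:

```python
# Sketch of the modified VGG16: convolutional base of VGG16 trained from
# scratch, one fully-connected layer removed, the other resized to 1024 units.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_modified_vgg16(input_shape=(96, 96, 1), n_classes=11):
    base = VGG16(include_top=False, weights=None, input_shape=input_shape)
    x = layers.Flatten()(base.output)
    # The remaining fully-connected layer yields a 1024-dimensional feature
    # vector, comparable with the dimension of the HoG descriptor.
    x = layers.Dense(1024, activation='relu')(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(base.input, outputs)
```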

The neural network was trained for 20 epochs with a mini-batch size of 64 and an initial learning rate of \(10^{-3}\). The learning rate was decreased to \(10^{-4}\) after 10 epochs. For updating the network parameters, standard SGD optimization with momentum = 0.9 and weight decay = \(5\,\times \,10^{-4}\) was used. As a loss function, the standard softmax loss was used. The neural network was implemented in Python using the Keras deep learning library [14]. The average recognition accuracy across all cross-validation settings was 86.45% ± 2.93%, which is more than 34 percentage points better than the baseline method. The results of the individual cross-validations can be found in Table 2.
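For reference, this training setup might be sketched in Keras as follows, assuming the hypothetical build_modified_vgg16 from the previous listing and one-hot encoded labels; the weight decay of \(5\,\times \,10^{-4}\) would typically be realized via L2 kernel regularizers on the layers and is omitted here for brevity:

```python
# Sketch of the training setup: SGD with momentum, step learning-rate schedule,
# softmax (categorical cross-entropy) loss, 20 epochs, mini-batch size 64.
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import SGD

def lr_schedule(epoch):
    # Initial learning rate 1e-3, decreased to 1e-4 after 10 epochs.
    return 1e-3 if epoch < 10 else 1e-4

def train(model, x_train, y_train, x_val, y_val):
    """Train the model on one cross-validation split (labels one-hot encoded)."""
    model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9),
                  loss='categorical_crossentropy',   # softmax loss
                  metrics=['accuracy'])
    model.fit(x_train, y_train,
              epochs=20,
              batch_size=64,
              validation_data=(x_val, y_val),
              callbacks=[LearningRateScheduler(lr_schedule)])
    return model
```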

The results show that not every cross-validation split is equally difficult. This phenomenon is probably caused by the varying ability of individual speakers to perform the numeral gestures properly. Furthermore, it can be caused by labeling inconsistencies among our annotators. Some examples of misclassification are shown in Fig. 2.

Fig. 2. Examples of misclassification. Top row, left to right: classified as 3 instead of 2, classified as 7 instead of 2, classified as background instead of 3. Bottom row: classified as background instead of 5, classified as background instead of 6, classified as background instead of 7. The last two are examples of wrong labels in our dataset.

We also tested several other neural network architectures during our initial experiments. All of them were tested only on cross-validation split number 1 with the same training settings as our modified VGG16. For a comprehensive comparison, see Table 3. CNN3\(\,\times \,\)32 is a simple architecture with three convolutional layers, each of which has 32 filters with kernel size 3\(\,\times \,\)3, and two fully-connected layers (one of size 1024 and the last one of size 11 as a classification layer). CNN3\(\,\times \,\)32b is almost the same architecture; however, the number of filters in the second convolution is doubled and in the third one quadrupled. CNN3 + 5 + 7 again has three convolutions and two fully-connected layers, but each convolution has a different kernel size (3, 5, and 7, respectively). All of the convolutional layers again have 32 kernels. The last tested architecture, CNN3 + 5 + 7b, follows the same approach as CNN3\(\,\times \,\)32b, i.e., the number of kernels in the convolutions is increased accordingly.
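As an example, the simple CNN3\(\,\times \,\)32 architecture could be sketched as follows; the activation functions and pooling placement are not specified above and are therefore assumptions:

```python
# Sketch of CNN3x32: three 3x3 convolutions with 32 filters each and two
# fully-connected layers (1024 units and the 11-way classification layer).
from tensorflow.keras import layers, models

def build_cnn3x32(input_shape=(96, 96, 1), n_classes=11):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(1024, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
```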

Table 3. Comparison of baseline method, modified VGG16 and other tested architectures in terms of recognition accuracy.

Overall, the experiment shows the superiority of the approach utilizing a neural network and machine-learned features over the classic HoG+SVM approach. Moreover, we reached very promising results, which show the great potential of neural networks for gesture and sign language recognition.

5 Conclusion and Future Work

Sign language recognition and gesture recognition are highly demanded tasks in the modern world. We believe they are essential for the next generation of robotic assistants, as well as for assistive tools for deaf people. In this paper, we show the great potential of neural networks for this task. Moreover, we reach very promising recognition results on our own dataset of sign language numeral gestures. We believe that with minor modifications of our neural network architecture, more augmentation, and a bigger training set, we can reach flawless results.

In our future research, we would like to extend our dataset with recordings from more speakers. Additionally, we would like to add some other important sign language gestures.