Abstract
Hand gesture recognition plays an important role in developing effective human–machine interfaces (HMIs) that enable direct communication between humans and machines. However, in real-time scenarios it is difficult to identify the correct hand gesture for controlling an application while the hands are moving. To address this issue, this work presents a low-cost, hand-gesture-recognition-based human–computer interface (HCI) for real-time scenarios. The system consists of six stages: (1) hand detection, (2) gesture segmentation, (3) feature extraction and gesture classification using five pre-trained convolutional neural network (CNN) models and a vision transformer (ViT), (4) building an interactive human–machine interface (HMI), (5) development of a gesture-controlled virtual mouse, and (6) smoothing of the virtual mouse pointer using a Kalman filter. Five pre-trained CNN models (VGG16, VGG19, ResNet50, ResNet101, and Inception-V1) and a ViT have been employed to classify hand gesture images, and two multi-class datasets (one public and one custom) have been used to validate the models. Comparing the models' performances, Inception-V1 shows significantly better classification performance than the other four CNN models and the ViT in terms of accuracy, precision, recall, and F-score. We have also extended the system to control several multimedia applications (such as a VLC player, an audio player, and the 2D Super Mario Bros game) with different customized gesture commands in real-time scenarios. The average speed of the system reaches 25 fps (frames per second), which meets the requirements of real-time scenarios, and the average response time of each control is in the millisecond range, making the system suitable for real-time use. This prototype will benefit physically disabled people interacting with desktops.
Data Availability
We confirm that the dataset will be made available on reasonable request.
References
Berezhnoy V, Popov D, Afanasyev I, Mavridis N (2018) The hand-gesture-based control interface with wearable glove system. In: ICINCO (2), pp 458–465
Abhishek KS, Qubeley LCF, Ho D (2016) Glove-based hand gesture recognition sign language translator using capacitive touch sensor. In: 2016 IEEE international conference on electron devices and solid-state circuits (EDSSC), IEEE, pp 334–337
Liao C-J, Su S-F, Chen M-C (2015) Vision-based hand gesture recognition system for a dynamic and complicated environment. In: 2015 IEEE international conference on systems, man, and cybernetics, pp 2891–2895. https://doi.org/10.1109/SMC.2015.503
Al Farid F, Hashim N, Abdullah J, Bhuiyan MR, Shahida Mohd Isa WN, Uddin J, Haque MA, Husen MN (2022) A structured and methodological review on vision-based hand gesture recognition system. J Imaging 8(6):153
Mantecón T, del Blanco CR, Jaureguizar F, García N (2016) Hand gesture recognition using infrared imagery provided by leap motion controller. In: International conference on advanced concepts for intelligent vision systems, Springer, pp 47–57
Huang D-Y, Hu W-C, Chang S-H (2011) Gabor filter-based hand-pose angle estimation for hand gesture recognition under varying illumination. Expert Syst Appl 38(5):6031–6042
Singha J, Roy A, Laskar RH (2018) Dynamic hand gesture recognition using vision-based approach for human-computer interaction. Neural Comput Appl 29(4):1129–1141
Yang Z, Li Y, Chen W, Zheng Y (2012) Dynamic hand gesture recognition using hidden markov models. In: 2012 7th international conference on computer science & education (ICCSE), IEEE, pp 360–365
Yingxin X, Jinghua L, Lichun W, Dehui K (2016) A robust hand gesture recognition method via convolutional neural network. In: 2016 6th international conference on digital home (ICDH), IEEE, pp 64–67
Oyedotun OK, Khashman A (2017) Deep learning in vision-based static hand gesture recognition. Neural Comput Appl 28(12):3941–3951
Fang W, Ding Y, Zhang F, Sheng J (2019) Gesture recognition based on CNN and DCGAN for calculation and text output. IEEE Access 7:28230–28237
Adithya V, Rajesh R (2020) A deep convolutional neural network approach for static hand gesture recognition. Proc Comput Sci 171:2353–2361
Neethu P, Suguna R, Sathish D (2020) An efficient method for human hand gesture detection and recognition using deep learning convolutional neural networks. Soft Comput 24:15239–15248
Sen A, Mishra TK, Dash R (2022) A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network. Multimed Tools Appl 81(28):40043–40066
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929
Godoy RV, Lahr GJ, Dwivedi A, Reis TJ, Polegato PH, Becker M, Caurin GA, Liarokapis M (2022) Electromyography-based, robust hand motion classification employing temporal multi-channel vision transformers. IEEE Robot Autom Lett 7(4):10200–10207
Montazerin M, Zabihi S, Rahimian E, Mohammadi A, Naderkhani F (2022) Vit-hgr: Vision transformer-based hand gesture recognition from high density surface EMG signals, arXiv preprint arXiv:2201.10060
Rautaray SS, Agrawal A (2010) A novel human computer interface based on hand gesture recognition using computer vision techniques. In: Proceedings of the first international conference on intelligent interactive technologies and multimedia, pp 292–296
Kim K-S, Jang D-S, Choi H-I (2007) Real time face tracking with pyramidal lucas-kanade feature tracker. In: Computational science and its applications–ICCSA 2007: international conference, Kuala Lumpur, Malaysia, August 26-29, 2007. Proceedings, Part I 7, Springer, pp 1074–1082
Paliwal M, Sharma G, Nath D, Rathore A, Mishra H, Mondal S (2013) A dynamic hand gesture recognition system for controlling vlc media player. In: 2013 international conference on advances in technology and engineering (ICATE), IEEE, pp 1–4
Shibly KH, Dey SK, Islam MA, Showrav SI (2019) Design and development of hand gesture based virtual mouse. In: 2019 1st international conference on advances in science, engineering and robotics technology (ICASERT), IEEE, pp 1–5
Tsai T-H, Huang C-C, Zhang K-L (2020) Design of hand gesture recognition system for human-computer interaction. Multimed Tools Appl 79(9):5989–6007
Xu P (2017) A real-time hand gesture recognition and human-computer interaction system, arXiv preprint arXiv:1704.07296
Kim Y, Bang H (2018) Introduction to kalman filter and its applications. In: F. Govaers (Ed.), Introduction and Implementations of the Kalman Filter, IntechOpen, Rijeka, Ch. 2. https://doi.org/10.5772/intechopen.80600
Chen Z-h, Kim J-T, Liang J, Zhang J, Yuan Y-B (2014) Real-time hand gesture recognition using finger segmentation. Sci World J. https://doi.org/10.1155/2014/267872
Jamil N, Sembok TMT, Bakar ZA (2008) Noise removal and enhancement of binary images using morphological operations. In: 2008 international symposium on information technology, Vol. 4, IEEE, pp 1–6
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Vlc-ctrl. https://pypi.org/project/vlc-ctrl/
Audioplayer. https://pypi.org/project/audioplayer/
Kauten C (2018) Super Mario Bros for OpenAI Gym, GitHub
Asaari MSM, Suandi SA (2010) Hand gesture tracking system using adaptive Kalman filter. In: 2010 10th international conference on intelligent systems design and applications, IEEE, pp 166–171
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. Adv Neural Inf Proc Syst 28
Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10781–10790
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
Bazi Y, Bashmal L, Rahhal MMA, Dayil RA, Ajlan NA (2021) Vision transformers for remote sensing image classification. Remote Sens 13(3):516
Appendices
A Appendix
A.1 VGG16
The VGG16 architecture is a variant of the VGGNet [27] model, consisting of 13 convolutional layers with \((3 \times 3)\) kernels and ReLU activation functions. Each convolutional layer is followed by a max-pooling layer with a \((2 \times 2)\) filter. Finally, an FC layer with a softmax activation function produces the final output class label. In this architecture, the depth of the network is increased by adding more convolutional and max-pooling layers. The network is trained on the large-scale ImageNet [36] dataset, which consists of millions of images spanning more than 20,000 class labels and was developed for the large-scale visual recognition challenge. VGG16 reported a top-5 test accuracy of 92.7% in the ILSVRC-2014 challenge.
A.2 VGG19
VGG19 is also a variant of the VGGNet network [27]. It comprises 16 convolutional layers and three dense layers with \((3 \times 3)\) kernels/filters, with max-pooling layers of filter size \((2 \times 2)\) in between. The architecture ends with a final FC layer with a softmax function that delivers the predicted class label. The model achieved second place in the ILSVRC-2014 challenge after being trained on the ImageNet [36] dataset. Its input size is \((224 \times 224)\).
A.3 Inception-V1
Inception-V1, or GoogLeNet [29], is a powerful CNN architecture with 22 layers built on the inception module. Each module contains three parallel convolutional filters of sizes \((1 \times 1)\), \((3 \times 3)\), and \((5 \times 5)\); a \((1 \times 1)\) filter is applied before the \((3 \times 3)\) and \((5 \times 5)\) convolutions for dimensionality reduction. The module also includes a max-pooling layer with pool size \((3 \times 3)\). The outputs of the \((1 \times 1)\), \((3 \times 3)\), and \((5 \times 5)\) convolutional branches are concatenated and form the input for the next layer. The last part is an FC layer with a softmax function that produces the final predicted output. The input size of this model is \((224 \times 224)\). The architecture was trained on the ImageNet [36] dataset and reported a top-5 error of 6.67% in the ILSVRC-2014 challenge.
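The branch concatenation at the heart of the inception module can be sketched in a few lines of NumPy. This is a toy illustration only; the spatial size and per-branch channel counts below are assumptions for the example, not GoogLeNet's actual values:

```python
import numpy as np

# toy output feature maps of the four parallel inception branches (H, W, C)
b1 = np.zeros((28, 28, 64))   # (1 x 1) conv branch
b3 = np.zeros((28, 28, 128))  # (1 x 1) -> (3 x 3) conv branch
b5 = np.zeros((28, 28, 32))   # (1 x 1) -> (5 x 5) conv branch
bp = np.zeros((28, 28, 32))   # (3 x 3) max-pool branch

# all branches preserve the spatial size, so their outputs
# concatenate along the channel axis to form the next layer's input
out = np.concatenate([b1, b3, b5, bp], axis=-1)
print(out.shape)  # (28, 28, 256)
```

Because every branch keeps the same \((H \times W)\) resolution, stacking modules only grows the channel dimension, which is why the \((1 \times 1)\) reductions are needed to keep the computation tractable.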
A.4 ResNet50
The residual neural network [28] was developed by Microsoft Research. ResNet50 consists of 50 deep layers, comprising 48 convolutional layers, one max-pooling layer, and one average-pooling layer. A global average-pooling layer is connected on top of the final residual block, followed by a dense layer with softmax activation that generates the final output class. The network's input size is \((224 \times 224)\). The backbone of this architecture is the residual block: the output of one layer is added to a deeper layer in the block through so-called skip connections (shortcuts). This design also mitigates the vanishing- and exploding-gradient problems during training. The ResNet architecture was trained on the ImageNet dataset [36] and achieved a top-5 error of 3.57% in the ILSVRC-2015 challenge.
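The skip connection can be illustrated with a minimal NumPy sketch. This is a toy fully-connected residual block, not ResNet50's convolutional one, and the weights are placeholders:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: the branch F(x) = W2 @ relu(W1 @ x)
    is added back to the input x before the final activation."""
    fx = w2 @ relu(w1 @ x)
    return relu(fx + x)  # skip connection: '+ x' is the shortcut

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8))
w2 = rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)

# with zero weights the branch contributes nothing and the shortcut
# alone carries the signal through: output == relu(x)
print(np.allclose(residual_block(x, np.zeros((8, 8)), np.zeros((8, 8))),
                  relu(x)))  # True
```

The shortcut means the block only has to learn the residual \(F(x)\) on top of the identity, which is what keeps gradients flowing through very deep stacks.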
A.5 ResNet101
The ResNet101 model consists of 101 deep layers. Like ResNet50, it is based on the residual building block. In our experiment, we loaded the pre-trained version of this architecture, trained on the ImageNet dataset [36], which comprises millions of images. The model's default input image size is \((224 \times 224)\).
A.6 Vision Transformer
A standard transformer architecture consists of two components: (1) an encoder stack and (2) a decoder stack. ViT, however, does not require the decoder and contains only the encoder. In a Vision Transformer, the image is first split into fixed-size patches, and each patch goes through the patch-embedding phase, in which it is flattened into a one-dimensional vector. After patch embedding, positional embeddings are added to the patches to retain the positional information of the image patches in the sequence. The embedded patches are then passed to the transformer encoder. The transformer encoder [37] comprises two components: (1) a multi-head self-attention (MHSA) block and (2) a multilayer perceptron (MLP). The MHSA block splits the input into several heads so that each head can learn a different level of self-attention. The outputs of the attention heads are then concatenated and delivered to the MLP, which performs the classification task.
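The patch-splitting and flattening step can be sketched in NumPy, assuming the common ViT configuration of a \((224 \times 224)\) RGB input and \((16 \times 16)\) patches (an illustrative assumption, as the paper does not state the patch size here):

```python
import numpy as np

patch = 16
img = np.random.rand(224, 224, 3)   # (H, W, C) input image
grid = 224 // patch                 # 14 patches per side

# carve the image into a (14 x 14) grid of non-overlapping 16x16 patches,
# then flatten each patch into a 1-D vector of length 16*16*3 = 768
blocks = img.reshape(grid, patch, grid, patch, 3)
patches = blocks.transpose(0, 2, 1, 3, 4).reshape(grid * grid, -1)
print(patches.shape)  # (196, 768): the token sequence fed to the encoder
```

Each of the 196 rows is then linearly projected and summed with its positional embedding before entering the MHSA block.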
B Appendix
B.1 Statistical Hypothesis Testing
We have also performed a statistical analysis to check the statistical significance of our model. For Experiment-1 (Sect. 4.4), we conducted a one-sample t-test using the IBM SPSS statistical analysis tool. The null hypothesis assumes that our model is not statistically significant.
To obtain the value of t, the following formula is used:
\( t = \frac{\overline{X} - \mu }{SD / \sqrt{k}} \)
where \(\overline{X}\) is the sample mean, \(\mu \) is the test value, SD is the sample standard deviation, and k is the sample size.
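Plugging in the values reported in Table 13 (sample mean 99.83, test value 99, SD 0.2907, k = 10), the t-statistic can be reproduced in a few lines of Python:

```python
import math

# values from Table 13
x_bar, mu, sd, k = 99.83, 99.0, 0.2907, 10

t = (x_bar - mu) / (sd / math.sqrt(k))
print(round(t, 2))  # 9.03
```

With 9 degrees of freedom, a t-value this large corresponds to a two-tailed p-value well below 0.001, consistent with the SPSS output in Table 14.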
To obtain \(\overline{X}\), we first applied a ten-fold cross-validation strategy on Dataset-1 with the Inception-V1 model (the best-performing model in Experiment-1, Sect. 4.4), calculated the fold-wise accuracy values, and then computed their average, which is taken as the sample mean.
Table 13 shows that the sample mean (\(\overline{X}\)), sample size (k), test value (\(\mu \)), and standard deviation (SD) are 99.83, 10, 99, and 0.2907, respectively; the full one-sample t-test analysis is presented in Table 14.
The results in Table 14 show that the p-value < 0.001. The p-value is used in hypothesis testing to determine whether there is evidence to reject the null hypothesis:
if p < \(\alpha \), where \(\alpha \) (significance level) = 0.05, the null hypothesis is rejected.
Since the p-value in Table 14 is much smaller than \(\alpha \), the null hypothesis is rejected, and we can conclude that the difference in the mean accuracy values is statistically significant.
About this article
Cite this article
Sen, A., Mishra, T.K. & Dash, R. Deep Learning-Based Hand Gesture Recognition System and Design of a Human–Machine Interface. Neural Process Lett 55, 12569–12596 (2023). https://doi.org/10.1007/s11063-023-11433-8