Introduction

In computer science and language technology, gesture recognition [1] is receiving increasing attention. Gestures originate from the motion of the hands, face, and other parts of the body, and they make it possible to control electronic devices without physically touching them. Using cameras and computer vision algorithms [2, 3], sign language [4] can be interpreted, and different signs can be mapped to different device functions.

Detection, tracking, and recognition are the essential stages of any gesture recognition system, and hand gesture recognition is no exception. Hand gestures fall into two important types: static and dynamic. Static hand gestures are created with a fixed hand sign [5], whereas dynamic hand gestures [6] are created by recognizing the movement of the hand together with its sign, e.g., grabbing or swiping [7].

Recently, Microsoft introduced its depth camera, the Kinect [8]. Influenced by the Kinect, many methods based on depth information have evolved. For example, Memo et al. [9] and Keskin et al. [10] proposed frameworks that use effective machine learning algorithms such as the random forest [11] to train architectures that capture the skeletal structure of the hand. However, compared with a typical web camera, such depth cameras are expensive, and the environment can degrade their results, which are notable limitations. In recent years, several human-device interaction frameworks have been developed based on sensor technology [12,13,14,15], computer vision [16,17,18,19], deep learning [20], smartphones [21, 22], and the Internet of Things (IoT) [23,24,25] for different purposes.

The pipeline of the proposed system is as follows. Preprocessing techniques such as background subtraction, skin color segmentation, and noise processing are applied to detect the hand, and a cascade classifier finds the hand in the skin-segmented frame. Initially, background subtraction removes the unnecessary region from the captured frame; a dilation effect and skin segmentation then prepare the frame for the hand cascade classifier, which yields the desired hand region, the ROI of this work. The tracking algorithm is initialized after the palm is detected by a Haar cascade palm detector in the frame captured by the webcam. The ROI is then resized and passed into the CNN to recognize the gesture. Tracking continues on the ROI, without re-detecting the hand, as long as the ROI remains in the frame, and the CNN keeps recognizing the ROI. In the recognition phase, five classes of hand gestures are recognized by a 2D CNN with the ReLU activation function, a softmax activation function in the last layer, and the Adam optimizer, with the loss set to categorical cross-entropy. Distinguishing between the ROI and background skin colors is difficult in a dynamic environment, and further problems arise from changes in lighting intensity across environments and from variations in human skin color. Because of these disturbances, selecting the background for the running system is a very sensitive task.

The remaining parts of the paper are organized as follows. The "Materials and Methods" section describes the hand detection and tracking methods and designs the CNN architecture for hand gesture recognition. The experimental results are presented in the "Experimental Results Analysis" section. Finally, conclusions and potential directions for future research are discussed in the "Conclusions" section.

Materials and Methods

To improve recognition performance, some essential preprocessing is performed on the initial images: background subtraction [26], noise processing, and skin segmentation [27] using the YCrCb [28] skin range (0, 133, 77) to (235, 173, 127). The ROI is detected using a cascade classifier [29]. Hand tracking is then performed with the KCF and median-flow algorithms. Finally, the processed images are resized to 54 × 54, converted to binary, and fed into the CNN for gesture recognition. The overall process of hand gesture recognition is shown in Fig. 1.

Fig. 1

Overview of the overall processes of the hand gesture recognition system

Hand Detection Method

The initial step of hand gesture recognition is hand detection, which is performed on RGB images captured by the webcam. Several preprocessing techniques serve this purpose: background subtraction, noise processing, skin color segmentation, and a Haar cascade classifier. The hand detection process of the proposed system is shown in Fig. 2 and described as follows.

Fig. 2

The hand detection process of the proposed system

Background Subtraction

Background subtraction is the most basic preprocessing technique in computer vision. First, background subtraction is applied to the webcam-captured image based on the fifth previous frame, and the result is then subtracted again from the initially captured base frame. A Gaussian blur function [30] and a thresholding technique accompany the absolute difference between the two image frames. The steps of background subtraction are as follows (a minimal OpenCV sketch follows the list).

  • Step 1: Apply a Gaussian blur function to reduce noise

  • Step 2: Compute the absolute difference between two image frames

  • Step 3: Convert the color image from RGB to grayscale

  • Step 4: Apply an image thresholding technique to convert the grayscale image to a binary image

  • Step 5: Apply a morphology function

  • Step 6: Apply an image dilation effect
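The sketch below walks through these six steps with OpenCV; the kernel sizes and the threshold value are assumptions, since the text does not specify them.

    import cv2
    import numpy as np

    def subtract_background(frame, background):
        # Step 1: Gaussian blur to reduce sensor noise
        frame_blur = cv2.GaussianBlur(frame, (5, 5), 0)
        bg_blur = cv2.GaussianBlur(background, (5, 5), 0)
        # Step 2: absolute difference between the two image frames
        diff = cv2.absdiff(frame_blur, bg_blur)
        # Step 3: convert from color to grayscale
        gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
        # Step 4: threshold the grayscale image into a binary image
        _, binary = cv2.threshold(gray, 25, 255, cv2.THRESH_BINARY)
        # Step 5: morphological opening to remove small speckles
        kernel = np.ones((3, 3), np.uint8)
        opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
        # Step 6: dilation to fill holes in the hand region
        return cv2.dilate(opened, kernel, iterations=2)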

Noise Reduction

To obtain a good result, the CNN requires noise-free and consistent 2D image frames. To reduce noise, Gaussian blur, median blur, and a dilation effect are used in different parts of this work. Figure 3 shows all the frames together.

Fig. 3

a Very basic initial background image, b initial background used for repeated subtraction, c image containing the first hand region, d background-subtracted image, e binary image obtained from d, f skin-segmented image, g detected hand region on the binary image frame, h detected hand region on the skin-segmented image frame

Skin Segmentation

Skin color segmentation is then applied to the frame using the YCrCb color space. In addition, a binary frame is generated by converting the RGB image to grayscale and applying an image dilation effect. Finally, a Haar cascade is used to detect the hand in the preprocessed image frame.
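A minimal sketch of this skin segmentation step, using the YCrCb range given in "Materials and Methods", might look as follows (the dilation kernel size is an assumption):

    import cv2
    import numpy as np

    def segment_skin(frame_bgr):
        ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
        lower = np.array([0, 133, 77], dtype=np.uint8)
        upper = np.array([235, 173, 127], dtype=np.uint8)
        mask = cv2.inRange(ycrcb, lower, upper)             # binary skin mask
        mask = cv2.dilate(mask, np.ones((3, 3), np.uint8))  # dilation effect
        return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)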

Haar Cascade Classifier

The Haar cascade classifier is an effective feature-based object detection method [7] that works for face detection, hand detection, and other object detection tasks. To train the classifier, a number of positive images (images of hands) and negative images (images without hands) are fed into the algorithm. The trained classifier is then ready to extract features and detect hands.
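The sketch below shows how a pre-trained cascade could be applied with OpenCV; the cascade file name palm.xml and the detectMultiScale parameters are placeholders, not the exact values used in this work:

    import cv2

    # "palm.xml" is a placeholder for the pre-trained palm cascade file
    palm_cascade = cv2.CascadeClassifier("palm.xml")

    def detect_hand(preprocessed_gray):
        hands = palm_cascade.detectMultiScale(
            preprocessed_gray, scaleFactor=1.1, minNeighbors=5, minSize=(60, 60))
        return hands[0] if len(hands) else None  # (x, y, w, h) of the first hand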

Figure 3 shows all the frames of this work. Figure 3a and b are the initial images used for background subtraction: Fig. 3a is the very basic background image and Fig. 3b is used for repeated subtraction. Figure 3c contains the first hand, from which the ROI is detected; further processing proceeds from this image frame. Figure 3d shows the background-subtracted frame; subtraction is actually performed twice, first against the frame of Fig. 3b and then against the initial frame. Figure 3e and f are the binary image generated by thresholding and the skin-color-segmented image, respectively. After that, KCF is used to track the hand region (ROI), as described in the section below.

Hand Tracking Method

After the hand region is detected as the ROI, the second phase is hand tracking, which determines the movement of the ROI. Tracking means locating an object in successive frames of a video. Many approaches exist, such as dense optical flow, sparse optical flow, Kalman filtering, meanshift and camshift, single-object trackers, and multiple-object tracking algorithms. In the proposed system, we consider single-object trackers: the KCF and median-flow trackers provided by OpenCV as built-in functionality. KCF is used to avoid the interception of moving skin-colored objects, and median-flow is used for zooming. These two single-object trackers are described briefly below.

KCF Algorithm

The KCF algorithm [31] can be described in two stages, as shown in Fig. 4. The first is the training stage, indicated in Fig. 4 (top). In this stage, the initial ROI frame (detected by the Haar cascade palm classifier from the background-subtracted or skin-segmented frame) is used as the positive sample to train the tracker. First, multiple training samples (negative samples) are generated from this initial ROI frame. Each positive and negative sample is then used for training; based on a Gaussian probability density function (PDF) model, a sample closer to the positive sample obtains a higher PDF value.

Fig. 4

Flowchart of the training stage of KCF tracking (top) and tracking stage of KCF tracking (bottom)

The tracking phase is the second stage of the KCF algorithm, indicated in Fig. 4 (bottom). Whenever a hand appears in the captured frame, the system takes the ROI selected in the previous frame and generates multiple displaced samples. The samples and the new frame are passed to the trained KCF model of Fig. 4 (top), the correlation response is calculated for each sample, and the ROI is moved to the position of maximum response. Whenever a new ROI position is found, the target image is captured again and the model is retrained and updated as in Fig. 4 (top). When the hand moves out of the camera range, the system stops its execution.
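The following sketch illustrates this detect-once-then-track loop with OpenCV's built-in KCF tracker; note that in some recent OpenCV builds the tracker factory functions live under cv2.legacy, so the exact names may vary:

    import cv2

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()

    roi = (100, 100, 120, 120)  # placeholder: the box returned by the Haar detector
    tracker = cv2.TrackerKCF_create()
    tracker.init(frame, roi)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, roi = tracker.update(frame)  # ROI moves to the maximum-response position
        if not found:
            break  # hand left the frame: stop and wait for re-detection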

Median-Flow Algorithm

This tracker is used to zoom in and out of documents, photographs, and presentation slides. Using it, we measure whether the ROI moves forward or backward [32]: if the movement is forward, a zoom-in operation is performed, otherwise a zoom-out. Internally, the tracker follows the object both forward and backward in time and measures the inconsistency between the two trajectories. We found that this tracker works best when the motion is small and predictable.
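The text does not give the exact zoom rule, but one plausible interpretation (an assumption, not the authors' exact method) is to treat growth of the tracked ROI as forward motion toward the camera and shrinkage as backward motion:

    def zoom_action(prev_roi, curr_roi, tolerance=1.05):
        prev_area = prev_roi[2] * prev_roi[3]  # w * h of the previous ROI
        curr_area = curr_roi[2] * curr_roi[3]
        if curr_area > prev_area * tolerance:
            return "zoom_in"   # ROI grew: hand moved toward the camera
        if curr_area * tolerance < prev_area:
            return "zoom_out"  # ROI shrank: hand moved away
        return "no_change"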

CNN Architecture for Gesture Recognition

This study proposes a convolutional neural network containing three convolutional and max pooling layers. After a tensor passes through the convolutional layers, it is flattened into a vector and passed through the dense layers [33]. An overview of the CNN architecture is given in Table 1, and the overall architecture is shown in Fig. 5: 3 convolutional layers, 3 max-pooling layers, 2 fully connected layers, and a final output layer connected to 5 classes, recognizing gestures of 5 categories. Each part of the CNN architecture is described as follows.

Table 1 Overview of the CNN architecture
Fig. 5

Architecture of the 2D CNN network

Convolutional Layer

The convolution layer is the basic building block of a CNN [34, 35]. In the proposed system, we use three convolution layers with 32, 32, and 64 kernels, respectively, each with a kernel size of 3 × 3. The first layer receives a 54 × 54 input image; after the intervening pooling layers, the second convolution layer receives a 26 × 26 image and the last layer a 12 × 12 input image. The ReLU activation function is applied in every convolution layer.

Pooling Layer

A pooling layer follows every convolution layer to perform downsampling, so the feature map produced by the convolution layer is reduced to one-fourth of its original size. Max pooling with a 2 × 2 kernel is used, which takes the maximum element within each 2 × 2 mapped area. Pooling is applied three times in the architecture, so the fully connected layer receives a 5 × 5 sized sample.

Fully Connected Layer

The output of the last pooling layer enters a fully connected layer. In this architecture, two fully connected layers are used to classify the five common hand gestures, with 64 neurons as input parameters. To overcome overfitting, we add dropout (0.5) before the fully connected layers, and the softmax activation function is used in the output layer, where all the extracted features are finally combined.
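A Keras sketch consistent with the layer sizes described above (54 → 52 → 26 → 24 → 12 → 10 → 5 with unpadded 3 × 3 convolutions) could be written as follows:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(54, 54, 1)),
        MaxPooling2D((2, 2)),                # 52x52 -> 26x26
        Conv2D(32, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),                # 24x24 -> 12x12
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),                # 10x10 -> 5x5
        Flatten(),                           # 5 * 5 * 64 = 1,600 features
        Dropout(0.5),                        # dropout before the dense layers
        Dense(64, activation="relu"),
        Dense(5, activation="softmax"),      # five gesture classes
    ])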

Tuning Parameters

To initialize the network, the weight parameters of every layer are set randomly. The batch size is 16, and categorical cross-entropy ("categorical_crossentropy") is used as the loss function, with the Adam optimizer and accuracy as the evaluation metric. After the result passes through the softmax function, the categorical cross-entropy loss measures the error between the true label and the predicted result. The index of the maximum value in the predicted probability array over the five categories selects the gesture from the five classes shown in Fig. 6; a compilation and prediction sketch is given after the figure.

Fig. 6

Five hand gestures with the corresponding gesture label
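Continuing the model sketch above, compilation and gesture selection might look as follows; x_train, y_train, x_val, y_val, and roi_batch are placeholders for the prepared data, and the epoch count is an assumption:

    import numpy as np

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=16, epochs=20,
              validation_data=(x_val, y_val))

    probs = model.predict(roi_batch)     # shape (n, 5): one probability per class
    gestures = np.argmax(probs, axis=1)  # index of the maximum value picks the gesture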

Experimental Results Analysis

This section analyzes the results of our proposed system. We fixed the input images to a 1:1 aspect ratio, resized them to 54 × 54, and converted them to a binary channel (black and white) to fix the number of neurons in the fully connected layer. To recognize the gestures, we fed the converted images into the CNN architecture. We subtracted most of the background information, detected skin by filtering YCrCb values, and applied noise removal filters (e.g., blurring) to remove unnecessary information, which reduces the complexity of the system and lets the network be trained efficiently and quickly.

Prepared Data

A number of images were collected for the different gestures, and a total of 3,000 images were created after preprocessing (background subtraction, skin filtering, and other noise removal filters), 600 images for each gesture (5 × 600 = 3,000 images). For training, we used 2,500 images, 500 for each gesture (5 × 500 = 2,500 images). These images were collected at various angles and against different backgrounds, then preprocessed, resized to 54 × 54 (a 1:1 ratio), and converted to binary before being fed into the network. Finally, the remaining 500 images were used to validate the CNN architecture. Figure 7 shows some example images after preprocessing, and a data-preparation sketch follows the figure.

Fig. 7

Examples of images after preprocessing at different angles
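As a sketch of how the preprocessed images could be loaded and labeled for the network; the helper and its file-path argument are hypothetical, since the data layout is not described:

    import cv2
    import numpy as np
    from tensorflow.keras.utils import to_categorical

    def load_gesture(image_paths, label):
        images, labels = [], []
        for path in image_paths:
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # preprocessed binary image
            img = cv2.resize(img, (54, 54))                # enforce the 1:1, 54x54 size
            images.append(img.reshape(54, 54, 1) / 255.0)  # scale pixels to [0, 1]
            labels.append(label)
        return np.array(images), to_categorical(labels, num_classes=5)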

Performances of the System

The network parameter settings for the CNN architecture are shown in Table 2. We subtracted the background information and segmented the skin using the YCrCb range (0, 133, 77) to (235, 173, 127). After resizing the images and converting them to binary, we fed them into the network.

Table 2 Network parameters for the CNN architecture

Several performance measures can be used to evaluate a system, such as accuracy [36], error rate [37], precision [38], sensitivity [39], specificity [40], F1-score [41, 42], MCC [43], and AUC [44]. In this system, we considered only accuracy and loss as evaluation metrics. The recognition results on the training and validation sets are shown in Table 3. The system achieved a training accuracy of 93.25% with a loss of 0.19 and a validation accuracy of 98.44% with a loss of 0.04 for gesture recognition. Figures 8 and 9 show the accuracy and loss curves of the CNN architecture, respectively, in which the validation accuracy is higher than the training accuracy and the validation loss is lower than the training loss.

Table 3 Recognition results of the CNN architecture
Fig. 8

Accuracy of the proposed system both for training and validation

Fig. 9

Loss of the proposed system both for training and validation

Using this CNN architecture, we developed an interface for human-computer interaction that performs mouse and keyboard operations (e.g., mouse movement, clicking, scrolling, drag and drop, left key press, right key press). The user interface with all of its features is depicted in Fig. 10, and an illustrative action mapping is sketched after the figure.

Fig. 10

User interface of the proposed human–computer interaction system including all features
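As an illustration, a mapping from recognized gesture indices to desktop actions could be implemented with a library such as pyautogui; the specific pairing of gestures to actions below is an assumption, not the exact mapping of the developed interface:

    import pyautogui

    # Hypothetical gesture-to-action table (gesture 4 could drive, e.g., drag and drop)
    ACTIONS = {
        0: lambda: pyautogui.click(button="left"),   # gesture 0: left click
        1: lambda: pyautogui.click(button="right"),  # gesture 1: right click
        2: lambda: pyautogui.scroll(5),              # gesture 2: scroll up
        3: lambda: pyautogui.scroll(-5),             # gesture 3: scroll down
    }

    def perform(gesture, roi):
        x, y, w, h = roi
        pyautogui.moveTo(x + w // 2, y + h // 2)  # cursor follows the tracked hand
        if gesture in ACTIONS:
            ACTIONS[gesture]()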

Conclusions

The main focus of our study is effective human-computer interaction through hand gesture recognition. Hand detection, tracking, and gesture recognition are the three main components of our proposed system. To improve recognition performance, essential preprocessing steps, namely background subtraction, noise processing, and skin segmentation using the YCrCb skin detection range, were applied to the initial images. The ROI is detected using the Haar cascade classifier, and hand tracking is performed using the KCF and median-flow algorithms. Finally, the processed images are resized to 54 × 54, converted to binary (black and white), and fed into the 2D convolutional neural network to recognize gestures from the five categories. Our proposed system achieved a high recognition performance, with a validation accuracy of 98.44%.

For hand detection, background subtraction, skin segmentation, and noise processing are applied. However, in complex or dynamic environments the developed system can sometimes behave unexpectedly, because under certain conditions skin segmentation picks up other objects whose color matches human skin. As a result, hand tracking is hampered and the recognition step cannot proceed. Mitigating these errors caused by the environment, lighting conditions, or color variation remains a scope for future research.