
1 Introduction

Research on hand motion recognition has been a popular topic for more than two decades (Hirota & Tagawa, 2016; Kaur & Rani, 2016). Nevertheless, it remains relevant because of the abundance of data derived from measuring human body motion with various equipment such as cameras and smart sensors (Chan et al., 2015; Ciotti et al., 2016; Lu et al., 2014; Zhou et al., 2014; Jeong & Cho, 2016). Researchers are focusing on developing controllers that use hand motions to control electronic devices. One area where hand motion recognition is much needed is sign language recognition. Like anyone else, deaf people need to communicate and interact with hearing people in order to live independently. Therefore, an efficient gesture recognition system can significantly improve their quality of life.

To help with deafness and other communication disorders, different types of off-the-shelf hearing aids are available, including behind-the-ear, in-the-ear and canal aids. Although hearing aids are beneficial, users may face issues such as discomfort and amplified background noise when using them.

As a result, researchers have been working on a variety of methods for translating sign language motions. In general, two common approaches exist: vision-based systems and wearable devices. To identify hand and finger movements, vision-based systems use image processing techniques such as feature extraction.

Several studies have been conducted on sign language translation using vision-based systems. Vision-based systems have the advantage of not requiring users to be tethered to sensory devices, which can be messy and inconvenient. However, vision-based systems are challenging to design, since the algorithms for feature extraction and movement recognition require complex and computationally expensive processing.

2 Findings on Hand Gesture and Sign Language Studies

This paper compiles selected works on hand gesture and sign language recognition published between 2016 and 2021. It covers the type or mode of operation, the algorithms applied, the methodology and the hardware used. The review is organized by type of study: vision-based approaches, sensor-based approaches and other (hybrid) approaches.

2.1 Vision-Based Studies

The large number of visual sequences in sign language poses problems for the accuracy and stability of hand localization in sign language recognition. According to He (2019), this is due to the effects of lighting and complex backgrounds. In that study, the author applied Faster R-CNN and 3D CNN to address these problems.

Faster R-CNN is used to detect the hand region in sign language videos or images. It integrates a Region Proposal Network (RPN) module to locate gestures effectively even under interference from skin-colored backgrounds, motion blur and hand occlusion. A 3D CNN is used because of its ability to improve feature extraction and the accuracy of sign language classification.
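As a rough illustration of the detection stage only (not He (2019)'s trained model), a generic pre-trained Faster R-CNN detector from torchvision can be run on a frame to obtain candidate bounding boxes; the frame source, score threshold and the idea of treating detections as hand candidates are assumptions, and a real system would fine-tune the detector on a hand dataset.

```python
# Sketch: obtaining candidate regions with a generic Faster R-CNN detector.
# A real sign-language system would fine-tune this model on hand images.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("frame.jpg").convert("RGB")    # hypothetical video frame
with torch.no_grad():
    prediction = model([to_tensor(frame)])[0]     # dict of boxes, labels, scores

# Keep only confident detections; the 0.8 threshold is an assumption.
keep = prediction["scores"] > 0.8
boxes = prediction["boxes"][keep]
print(boxes)   # candidate regions that would be passed to the 3D CNN classifier
```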

Ahmed et al. (2016) developed a system that functions not only as a sign-to-speech engine but also as a speech-to-sign engine, i.e., a sign language to speech conversion system. It is a software-based solution that uses Microsoft Kinect 2.0's Continuous Gesture Builder to record and train the hand gestures. The input from the Kinect is compared with the pre-recorded dataset, and sentences are formed from the matched gestures before being converted into speech. The Visual Gesture Builder from the Kinect SDK, which uses the AdaBoostTrigger and RFRProgress detection technologies, was used to train the gesture detection databases (Gonçalves et al., 2015).

During the gesture recording session, information such as 3D coordinate points, depth, body heat and infrared scans was gathered for tagging purposes. Each gesture is stored in a gesture database that serves as a data dictionary against which incoming gestures are compared. For any incoming gesture that matches a gesture in the database, the keyword of that gesture is added to the sentence before being spoken out by the computer.
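The matching step can be pictured as a nearest-neighbour lookup against a gesture dictionary. The minimal sketch below is an assumption-laden stand-in for the Kinect/Visual Gesture Builder pipeline: the feature vectors, distance threshold and keywords are illustrative, and the spoken output is only indicated by a comment.

```python
# Sketch: matching incoming gesture feature vectors against a gesture
# dictionary and appending matched keywords to a sentence.
import numpy as np

gesture_db = {                      # keyword -> recorded reference features (toy)
    "HELLO": np.array([0.1, 0.9, 0.3]),
    "THANK_YOU": np.array([0.7, 0.2, 0.8]),
}

def match_gesture(features, threshold=0.25):
    """Return the keyword of the closest stored gesture, or None if too far."""
    best_word, best_dist = None, float("inf")
    for word, ref in gesture_db.items():
        dist = np.linalg.norm(features - ref)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word if best_dist < threshold else None

sentence = []
for incoming in [np.array([0.12, 0.88, 0.31]), np.array([0.69, 0.22, 0.79])]:
    word = match_gesture(incoming)
    if word:
        sentence.append(word)

print(" ".join(sentence))   # a TTS engine (e.g. pyttsx3) would speak this text
```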

Tamiru et al. (2018) proposed a technique for controlling a vision-based wireless mobile robot hand using hand gestures. A set of 100 hand gesture images was recorded and compared with the input in real time. Noise was removed by applying median filtering together with histogram equalization to match the output image with the database images. Even after the segmentation process, the results were found to be unsatisfactory, so a morphological filtering technique was used to obtain a clearer contour of the hand image before the gesture could be recognized; this step improved efficiency. The final step in recognizing the hand sign is to apply the 2D cross-correlation coefficient technique, after which the robot is able to identify the direction.
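A minimal OpenCV sketch of that preprocessing chain (median filtering, histogram equalization and morphological filtering) is shown below. The kernel sizes, thresholding method and file names are assumptions, not values reported by Tamiru et al. (2018).

```python
# Sketch: cleaning a grayscale hand image before correlation-based matching.
import cv2

img = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

denoised = cv2.medianBlur(img, 5)                    # remove salt-and-pepper noise
equalized = cv2.equalizeHist(denoised)               # histogram equalization

# Segment the hand (Otsu threshold as a simple stand-in for the paper's method)
_, mask = cv2.threshold(equalized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological opening and closing to obtain a clearer hand contour
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

cv2.imwrite("hand_clean.png", mask)   # ready for 2D cross-correlation matching
```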

Chung et al. (2019) used a webcam to track the hand region of interest (ROI) in real time in order to identify hand gestures. The hand detection block applies skin color detection and morphological operations to remove unnecessary background information from the hand image. By applying background subtraction, the authors are able to detect the ROI of the hand.

The next process is to track the detected ROI, applying Kernelized Correlation Filters (KCF) (Oh et al., 2017) so that background artifacts or noise do not affect the ROI. The final process applies deep Convolutional Neural Networks (CNN) (Ji et al., 2017), mainly AlexNet (Alom et al., 2019) and VGGNet (ElBadawy et al., 2017), to classify the hand images. Tracking and recognition are carried out continuously to achieve a real-time effect until the hand leaves the camera range.
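A simplified sketch of this hand-ROI pipeline follows: skin-color segmentation with morphology to find the ROI, then a KCF tracker to follow it across frames, with the AlexNet/VGGNet classification step only indicated by a comment. The HSV skin-color bounds and camera index are rough assumptions, and cv2.TrackerKCF_create is provided by the opencv-contrib build.

```python
# Sketch: detect a skin-colored hand ROI, then track it with a KCF tracker.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, frame = cap.read()

# --- detection: skin-color mask + morphology -> bounding box of largest blob
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))     # assumed skin range
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

# --- tracking: follow the detected ROI with a Kernelized Correlation Filter
tracker = cv2.TrackerKCF_create()          # requires opencv-contrib-python
tracker.init(frame, (x, y, w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    ok, box = tracker.update(frame)
    if ok:
        pass   # crop the box and pass it to a CNN (e.g. AlexNet/VGG) classifier
    else:
        break  # hand left the camera range; re-run detection

cap.release()
```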

In 2016, Naglot and Kulkarni (Naglot & Kulkarni, 2016) proposed a system for ASL recognition using the Leap Motion Controller (LMC). The normalized training images are passed through feature extraction, where only the finger and palm datasets are used for the classification process (Kajan et al., 2015). The authors used a multi-layer perceptron (MLP) for gesture recognition, with backpropagation chosen for training. The proposed method achieved a recognition rate of 96.15%.
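As a hedged sketch of that classification stage (not the authors' exact network), scikit-learn's MLPClassifier can be trained with backpropagation on finger/palm feature vectors; the feature dimensionality, hidden-layer sizes and synthetic data below are assumptions.

```python
# Sketch: multi-layer perceptron trained with backpropagation on LMC features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(520, 20))            # 20 finger/palm features per sample
y = rng.integers(0, 26, size=520)         # 26 ASL letter classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)                       # gradients computed via backpropagation
print("accuracy:", accuracy_score(y_te, mlp.predict(X_te)))
```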

Fasihuddin et al. (2018) reported on the limited tools and resources for Arabic Sign Language (ArSL) assistive learning aimed at people with hearing impairment. The authors also used the LMC to capture hand gestures and applied the k-nearest neighbor (KNN) algorithm, which provided the highest sign recognition accuracy compared with other algorithms.
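A corresponding k-nearest-neighbour sketch is given below; the values of k, the number of classes and the synthetic feature vectors are assumptions rather than details from Fasihuddin et al. (2018).

```python
# Sketch: KNN classification of LMC hand-gesture feature vectors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))            # synthetic ArSL feature vectors
y = rng.integers(0, 28, size=300)         # 28 letter classes (assumed)

for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```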

Kolivand et al. (2021) presented a method that improved the accuracy of the SLR system and resulted in faster sign language recognition with both plain and cluttered backgrounds. The authors used a three-dimensional depth-based sensor camera as the input, since it is capable of capturing a depth image and scanning the hand pose at the same time. The depth image carries rich information, which has led to the derivation of many methods.

The authors also applied K-fold cross-validation during the training phase in order to produce highly accurate results over a large dataset. In this procedure, each fold of the data is used for validation once while the remaining folds are used for training.
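The K-fold procedure can be sketched as follows: the data are split into k folds, and each fold serves once as the validation set while the others are used for training. The choice of model, k = 5 and the synthetic data are assumptions, not details from Kolivand et al. (2021).

```python
# Sketch: K-fold cross-validation; every fold is used once for validation
# and (k-1) times for training. Data and model are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 30))            # synthetic depth-image features
y = rng.integers(0, 10, size=500)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = SVC().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("mean CV accuracy:", np.mean(scores))
```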

Bantupalli and Xie (2018) outlined the use of a smartphone to record multiple different hand gestures as the dataset. The authors applied two different approaches to classification: the prediction from the Softmax layer and the output of the global max-pooling layer. Tripathi et al. (2015) split the dataset into segments, extracted features and classified them using Euclidean distance and k-nearest neighbors; this was done to overcome issues when applying a neural network to segment the video.

The smartphone's camera is again used for the gesture detection process, and the Inception model is applied to extract spatial features from the video stream for SLR. The proposed method achieved 99% accuracy when applying the Inception approach. The only drawback of the design is that the system cannot handle facial features and skin tones it was not trained on without retraining.
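A hedged sketch of that feature-extraction idea: a pre-trained Inception backbone with its classification head removed and global max pooling applied yields one spatial feature vector per frame. The frame source, input size and ImageNet weights are assumptions, not the authors' exact configuration.

```python
# Sketch: per-frame spatial feature extraction with an Inception backbone and
# a global max-pooling output, standing in for the paper's setup.
import numpy as np
import tensorflow as tf

backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="max"   # global max pooling
)

frames = np.random.rand(16, 299, 299, 3).astype("float32")   # 16 video frames
frames = tf.keras.applications.inception_v3.preprocess_input(frames * 255.0)

features = backbone.predict(frames)   # shape (16, 2048): one vector per frame
print(features.shape)                 # these vectors feed the classifier stage
```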

2.2 Sensor-Based Studies

O’Connor et al. (2017) presented a gesture-tracking glove fitted with strain sensors, a six-degree-of-freedom micro-electro-mechanical (MEMS) motion sensor and a microcontroller. The system is not only able to translate the American Sign Language alphabet into text on a computer or smartphone, it is also capable of driving a virtual hand. The low-cost piezoresistive sensors measure knuckle flexion, which is converted into voltages. The microcontroller processes these voltages and maps them into a series of binary keys, each representing a specific ASL letter. The translated ASL is transmitted to the smartphone via Bluetooth Low Energy (BLE).
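The voltage-to-binary mapping can be pictured as thresholding each flex/strain channel into one bit and looking the resulting key up in a table. The threshold, bit order and letter table below are purely illustrative assumptions, not the authors' calibration.

```python
# Sketch: mapping five knuckle-sensor voltages to a binary key and then to a
# letter. Threshold and lookup table are toy values for illustration only.
FLEX_THRESHOLD_V = 1.6          # above this a finger counts as "bent" (assumed)

letter_table = {                # binary key (thumb..pinky, 1 = bent) -> letter
    0b01111: "A",
    0b00000: "B",
}

def voltages_to_key(voltages):
    """Pack one bit per finger: 1 = bent, 0 = straight."""
    key = 0
    for v in voltages:
        key = (key << 1) | (1 if v > FLEX_THRESHOLD_V else 0)
    return key

sample = [1.2, 2.1, 2.0, 1.9, 2.2]      # thumb straight, other fingers bent
key = voltages_to_key(sample)
print(letter_table.get(key, "?"))        # "A" here; "?" if pattern not mapped
```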

Wu et al. (2016) implemented an inertial measurement unit (IMU) and sEMG sensors in their multi-modal wearable ASL recognition system. A large set of well-established features was extracted from the sensors during the training phase. The associated measurements were combined into one feature vector before being fed into an information gain procedure to produce the optimum feature subset. The study tested Decision Tree, Support Vector Machine (LibSVM), Naïve Bayes and Nearest Neighbor classifiers on the selected subset. LibSVM scored the highest accuracy at 96.16%, consistent with the result of the prior study (95.16%).
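A minimal sketch of that feature-selection-plus-classification step is shown below, assuming mutual information as the information-gain measure and scikit-learn's SVC (which wraps LibSVM). The feature counts, class count and data are placeholders.

```python
# Sketch: information-gain-style feature selection followed by an SVM,
# standing in for the IMU/sEMG feature pipeline. Data are synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC                  # scikit-learn's SVC wraps LibSVM
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 120))              # 120 candidate features per gesture
y = rng.integers(0, 40, size=400)            # 40 ASL signs (assumed)

clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=30),  # keep the 30 most informative features
    SVC(kernel="rbf"),
)
print("mean accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```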

The model of Lee et al. (2020) achieved an average recognition rate of 99.81% for dynamic hand gestures. The smart wearable system consists of six IMUs attached to the fingers and the back of the palm to capture movement during sign language gestures. The information from the IMUs is passed through a preprocessing stage for noise removal, feature extraction and normalization. The authors then adopted a recurrent neural network (RNN) with an LSTM layer to classify the gestures.
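A hedged Keras sketch of an RNN with an LSTM layer over preprocessed IMU sequences follows; the sequence length, feature count, layer sizes, class count and random data are assumptions rather than the authors' values.

```python
# Sketch: LSTM-based classifier over normalized IMU sequences.
import numpy as np
import tensorflow as tf

T, F, N_CLASSES = 50, 36, 20      # 50 time steps, 6 IMUs x 6 channels, 20 signs

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(T, F)),
    tf.keras.layers.LSTM(64),                       # recurrent feature extractor
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

X = np.random.rand(256, T, F).astype("float32")     # preprocessed sequences
y = np.random.randint(0, N_CLASSES, size=256)
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```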

In 2017, Patil (2017) extended the application of biomedical signals to a hand gesture recognition and control system. Surface electromyography (sEMG) sensors attached to both hands capture nerve signals, which are processed by an ARM microcontroller and mapped against preset gesture values. The captured muscle activity is channeled through a conditioning circuit (Haroon & Malik, 2016) to smooth the output; coupled with the ARM microprocessor to recognize the input and Visual Basic to display the output, the author was able to produce the desired result.
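As a rough sketch of such signal conditioning, the snippet below rectifies a raw sEMG channel, smooths it with a moving-average envelope and compares it against a preset activation threshold. The sample rate, window size, threshold and synthetic signal are assumptions.

```python
# Sketch: smoothing a raw sEMG channel (rectification + moving average) and
# comparing the envelope against a preset activation threshold.
import numpy as np

fs = 1000                                   # sample rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)
raw = np.random.normal(0, 0.05, t.size) + 0.4 * (t > 0.5) * np.sin(2 * np.pi * 80 * t)

rectified = np.abs(raw)                     # full-wave rectification
window = 100                                # 100 ms moving-average window
envelope = np.convolve(rectified, np.ones(window) / window, mode="same")

THRESHOLD = 0.1                             # preset "muscle active" level (assumed)
active = envelope > THRESHOLD
print("activation detected near t =", t[active.argmax()], "s")
```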

2.3 Hybrid Approaches

Hafit et al. (2019) conducted a study on developing a mobile app that translates Malaysian Sign Language (MSL) into text. The app was developed using a native camera plugin, the Ionic Framework, Angular, Firebase and the Google Cloud Vision API. When an image is captured by the smartphone camera, it is uploaded to Firebase storage; the API then detects the contents of the image, maps it to the matching text and displays the result on the smartphone. The authors had to upload samples of MSL to the Firebase database to be mapped against the uploaded images. This app was specifically designed to help hearing people communicate with deaf individuals.

Haron et al. (2019) noted that hearing-impaired individuals deserve the same right as hearing people to use mobile e-learning apps. The team therefore developed a mobile app specifically designed to help deaf users learn MSL.

Paragon et al. made use of a publicly available ASL dataset of more than 2000 images from Barczak et al. (2011), which was fed directly to a machine learning algorithm for training. The images in the dataset consist of hand cut-outs on a black background. The work proceeds by applying a Convolutional Neural Network. The design incorporates four groups of two convolutional layers, each group followed by a max-pooling layer and a dropout layer, as well as two fully connected layers followed by a dropout layer and one final output layer.
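A hedged Keras sketch of the described architecture is shown below: four blocks of two convolutions plus max pooling and dropout, then two fully connected layers with dropout and a final output layer. The filter counts, kernel sizes, input resolution and dropout rates are assumptions, not the authors' hyperparameters.

```python
# Sketch of the described CNN: 4 x (Conv-Conv-MaxPool-Dropout), then two dense
# layers, dropout and an output layer. Hyperparameters are illustrative.
import tensorflow as tf

def build_model(input_shape=(64, 64, 1), n_classes=26):
    layers = [tf.keras.layers.Input(shape=input_shape)]
    for filters in (32, 64, 128, 256):               # four convolutional groups
        layers += [
            tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu"),
            tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Dropout(0.25),
        ]
    layers += [
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ]
    return tf.keras.Sequential(layers)

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```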

3 Discussion

This section discusses several important aspects of the reviewed works that we believe merit further attention.

Figure 1 illustrates how the training procedure of the hand gesture recognition process is conducted. It starts from input signals consisting of hand gestures for letter spelling, number spelling and vocabulary. The next step is selecting the hardware for capturing hand gesture movement, followed by choosing the appropriate machine learning algorithm for the study. This stage requires a large training dataset, because insufficient training data leads to poor approximation. In order to achieve good performance, the system also requires proper tuning.

Fig. 1 Block diagram of hand gesture and sign language process

3.1 Image Acquisition Devices

Table 1 summarizes previous research on hand gesture and sign language recognition. The studies are divided into three domains of image capturing methods, namely vision, sensors and hybrid. Camera-based image processing approaches were applied to recognize hand gestures and sign language in (Ahmed et al., 2016; Bantupalli & Xie, 2018; Chung et al., 2019; Fasihuddin et al., 2018; He, 2019; Kolivand et al., 2021; Naglot & Kulkarni, 2016; Tamiru et al., 2018). These researchers implemented a variety of image capturing devices, such as still cameras, video cameras, webcams, the Kinect and the Leap Motion Controller, to perform image acquisition of hand gesture activity. A special laboratory setup needs to be considered, because most of these cameras are sensitive to the surrounding environment.

Table 1 Comparison of this study with the related publications

Hand gestures and sign language can also be captured through wearable sensors, including flex sensors, inertial measurement units (IMU) and surface electromyography (sEMG) (Lee et al., 2020; O’Connor et al., 2017; Wu et al., 2016). The sensors must be connected to a microcontroller. The flex sensors record each finger's adduction and abduction data, the IMU records the palm's position and orientation, and the sEMG reads the electrical activity of the wrist muscles. Normally, these sensors and the microcontroller are wired together into a sensory glove, or smart glove, to make the system portable and easy to use.

Hybrid methods differ from vision-based and sensor-based systems in terms of capturing devices, ANN engine and image dataset. The hybrid domain covers methods that use either a smartphone (Das et al., 2020; Haron et al., 2019) or cloud-based services such as Firebase or Google Colab (Barczak et al., 2011; Das et al., 2020). This configuration has the lowest cost compared with vision and sensor systems.

3.2 Performance Metrics

Overall, both sensor- and vision-based studies in this review achieved efficient hand gesture recognition performance (accuracies greater than 90%). Several factors contribute to this achievement and should be taken into account when analyzing gesture recognition algorithms, including the number of classes the algorithm can recognize, as well as its tolerance to noise and complex situations. The performance of hardware and software components such as CPUs, graphics cards and compilers also plays a vital role in ML (Pisharady & Saerbeck, 2014).

A series of formulas and equations was applied by the researchers to ensure that they achieve their research objectives and produce high-accuracy results. Some researchers applied statistical formulas during data preprocessing to filter unnecessary information out of the raw input data. Lee et al. (2020), for instance, applied the mean (µ) and standard deviation (σ) to the raw input to filter out specific data patterns and find the average distribution over a specific time frame. Both equations are illustrated in Eqs. (1) and (2), where N is the total number of data points and x is the sensor data:

$$ \sigma = \sqrt{\frac{1}{N}\sum_{i = 1}^{N} \left( x_{i} - \mu \right)^{2}} $$
(1)
$$ \mu = \frac{1}{N}\sum_{i = 1}^{N} x_{i} . $$
(2)
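As a small illustration of how Eqs. (1) and (2) can be used in preprocessing, the snippet below computes µ and σ over a window of sensor data and discards samples outside µ ± 2σ; the window contents and the 2σ cut-off are assumptions, not values from Lee et al. (2020).

```python
# Sketch: using the mean and standard deviation of a sensor window to filter
# outliers. The 2-sigma cut-off is an illustrative assumption.
import numpy as np

x = np.array([0.42, 0.40, 0.43, 0.41, 1.80, 0.39, 0.44])   # raw sensor window

mu = x.mean()       # Eq. (2): mu = (1/N) * sum(x_i)
sigma = x.std()     # Eq. (1): sigma = sqrt((1/N) * sum((x_i - mu)^2))

filtered = x[np.abs(x - mu) <= 2 * sigma]
print(mu, sigma, filtered)          # the 1.80 spike is removed
```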

A vision-based study conducted by Tamiru et al. (2018) adopted the cross-correlation coefficient (γ), where Eq. (3) was used to verify the hand gesture recognition. The coefficient measures the similarity between the captured and trained images, with values ranging between −1.0 and +1.0; a value close to +1 indicates a near-identical match between the captured and trained gesture images.

$$ \gamma \left( x,y \right) = \frac{\sum_{s} \sum_{t} \delta_{I}\left( x + s,y + t \right)\,\delta_{T}\left( s,t \right)}{\sqrt{\sum_{s} \sum_{t} \delta_{I}^{2}\left( x + s,y + t \right)\sum_{s} \sum_{t} \delta_{T}^{2}\left( s,t \right)}} $$
(3)

where \( \delta_{I}\left( x + s,y + t \right) = I\left( x + s,y + t \right) - I^{\prime}\left( x,y \right) \) and \( \delta_{T}\left( s,t \right) = T\left( s,t \right) - T^{\prime} \), with \( s \in \{1,2,3,\ldots,p\} \), \( t \in \{1,2,3,\ldots,q\} \), \( x \in \{1,2,3,\ldots,m-p+1\} \), \( y \in \{1,2,3,\ldots,n-q+1\} \), and

$$ I^{\prime}\left( x,y \right) = \frac{1}{pq}\sum_{s} \sum_{t} I\left( x + s,y + t \right), \qquad T^{\prime} = \frac{1}{pq}\sum_{s} \sum_{t} T\left( s,t \right) $$
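To make Eq. (3) concrete, a direct numpy transcription for a single offset (x, y) is sketched below; the image and template are synthetic, and the implementation assumes the standard normalized form of the coefficient.

```python
# Sketch: normalized 2D cross-correlation coefficient (Eq. 3) at one offset.
# I is the captured image, T the p x q trained template; arrays are synthetic.
import numpy as np

def gamma(I, T, x, y):
    p, q = T.shape
    patch = I[x:x + p, y:y + q]                   # I(x+s, y+t) window
    dI = patch - patch.mean()                     # delta_I
    dT = T - T.mean()                             # delta_T
    denom = np.sqrt((dI ** 2).sum() * (dT ** 2).sum())
    return (dI * dT).sum() / denom if denom else 0.0

rng = np.random.default_rng(4)
image = rng.random((64, 64))
template = image[20:36, 10:26].copy()             # a patch of the image itself

print(gamma(image, template, 20, 10))             # ~1.0: near-identical match
print(gamma(image, template, 0, 0))               # lower: different region
```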

3.3 Sign Language Dataset

Like spoken languages, sign languages are unique because each was developed by a specific community to allow deaf people to interact with one another. Even countries that share the same spoken language do not necessarily share the same sign language: English alone has three different sign languages, namely American Sign Language, British and New Zealand Sign Language, and Australian Sign Language (Sign Language Alphabets from Around the World, https://www.ai-media.tv/sign-language-alphabets-from-around-the-world). Despite these differences, the aim of sign language is to ensure that everybody can communicate with others. Comparing the patterns or shapes more closely, we can see that each letter carries a different finger stroke. The differences in sign language patterns and vocabularies are shown in Fig. 2, while Table 2 details the specific area of each researcher's work.

Fig. 2 Samples of sign languages from different countries

Table 2 List of sign languages referred by the authors in this study

Figure 3 shows that American Sign Language is the most prominent, followed by Malaysian and Arabic Sign Languages; only one article in this review did not mention a specific language. ASL gained its popularity due to the huge amount of dataset available compared with other languages.

Fig. 3 Number of articles published based on sign language (Ahmed et al., 2016; Bantupalli & Xie, 2018; Chung et al., 2019; Das et al., 2020; Fasihuddin et al., 2018; Hafit et al., 2019; Haron et al., 2019; He, 2019; Kolivand et al., 2021; Lee et al., 2020; Naglot & Kulkarni, 2016; O’Connor et al., 2017; Patil & Patil, 2017; Tamiru et al., 2018; Tripathi et al., 2015; Wu et al., 2016)

According to Baker (2010), fingerspelling refers to spelling out the alphabet, while signing reflects vocabulary. She also mentioned that deaf people have to learn fingerspelling first, followed by vocabulary. For example, when teaching the word car, one can either fingerspell C-A-R or sign CAR.

4 Hand Gesture and Sign Language Prospect in Smart Health Aspect

Industry Revolution 4.0 (Zaidi & Belal, 2019) has tremendously influenced the perspective of hand gesture recognition systems. More electronic and electrical devices, for instance smart televisions, have implemented cameras, radar and radio frequency as their human-computer interface units (Ahmed et al., 2021; Yasen & Jusoh, 2019). Just by waving a hand, users can change the channel, switch screens and control the volume.

From a healthcare perspective, quite a number of medical devices make full use of the fourth industrial revolution framework. The use of wearable sensors (Brezulianu et al., 2019; Jones et al., 2020; Le et al., 2019) demonstrates the ability to control devices from a different geographical location (Fig. 4). Medical doctors can diagnose and perform surgery from their office in one part of the world while the patient is on the other side of the world (Lee et al., 1999). This saves cost, time and energy, and protects both the surgeon's and the patient's safety. Innovation in health telemetry systems (Care & Today, https://www.link-labs.com/blog/iot-healthcare/) allows patient data to be gathered remotely: by staying at home and simply wearing a sensory glove, a patient can have their vital signs monitored by doctors.

Fig. 4 Example of a telemedicine operation: the remote arm mimics the operator's hand holding a screwdriver

5 Conclusion

The findings of this study will help inform theories about translation, identity and well-being, as well as test a novel methodology for conducting visual language research. Parents of deaf children, sign language interpreters and hearing people who work with Deaf sign language users, as well as deaf people themselves, will benefit from the findings. The ability to communicate with hearing people using the proposed devices gives deaf people far greater opportunities to seek employment and improve their economic situation.

The industrial revolution era has changed the outlook of industry itself, focusing on the internetworking of automation, machine learning and real-time big data (What is Industry 4.0—the Industrial Internet of Things (IIoT), https://www.epicor.com/en-my/resource-center/articles/what-is-industry-4-0/). The healthcare industry has also benefited greatly from this revolution: studies have been carried out on telemedicine applications (Pasquale et al., 2018; Sima et al., 2020), examining the efficiency and the ability of the health system to deliver such a vital task.

The results of this paper can be summarized as follows: surface electromyography sensors were the most commonly used acquisition tools in the works studied, and the Artificial Neural Network deep learning approach was the most popular classifier. The vision-based approach remains more popular than sensor-based and hybrid solutions.

In future work, we plan to use a sensory glove as the gesture input and to apply different optimization techniques to speed up the model runtime and achieve high training and validation accuracy.