1 Introduction

Hand gestures are a way for people to express thoughts and feelings; they serve to reinforce the information delivered in daily conversation. Sign language is a structured form of hand gestures involving visual motions and signs, used as a communication system. For the deaf and speech-impaired community, sign language serves as a useful tool for daily interaction. Sign language involves the use of different parts of the body, namely the fingers, hands, arms, head, torso, and facial expressions, to deliver information. However, sign language is not common among the hearing community, and few are able to understand it. This poses a genuine communication barrier between the deaf community and the rest of society, a problem that remains unsolved to this day.

The majority of sign languages involve only the upper part of the body, from the waist upwards [46]. Moreover, the same sign can change considerably in shape depending on its position within a sentence [44]. Hand gestures can be categorized into several types, such as conversational gestures, controlling gestures, manipulative gestures, and communicative gestures [62]. Sign language is a type of communicative gesture. Since sign language is highly structured, it is a suitable test-bed for computer vision algorithms [61].

The focus of this paper is on sign language recognition. However, research in sign language recognition is highly influenced by hand gesture recognition research, as sign language is a form of communicative gesture. Therefore, when reviewing the literature on sign language recognition, it is also pertinent to study the literature on gesture recognition.

Gesture and sign language recognition encompasses the whole process of tracking and identifying the signs performed and converting them into semantically meaningful words and expressions. Some of the earliest efforts in gesture recognition date back to 1993, when gesture recognition techniques were adapted from speech and handwriting recognition. Darrell and Pentland [52] adapted Dynamic Time Warping (DTW), which had been successfully applied in speech recognition, to recognize dynamic gestures.

Later, Starner et al. [1] proposed using Hidden Markov Models (HMMs) to classify the orientation, trajectory information, and resultant shape of the sign. HMMs were adapted from speech recognition, and their intrinsic properties make them suitable for gesture recognition. In [48], a total of 262 signs were collected from two different signers, and the average accuracy of the HMM classifier reached 94%. It was found that accuracy drops sharply when a model trained on the signs of one person is tested on the signs of another, to as low as 47.6%. Training on both signers improves accuracy to 91.3% [48].

Vogler and Metaxas [68] stated that the use of HMMs alone has several limitations, especially in training context-dependent models. In [71], the authors employed Ascension Technologies Flock of Birds devices to collect the three-dimensional translation and rotation data of the sign. Using a bigram and epenthesis modeling, the average accuracy achieved is 95.83%. Research in [68] used a similar experimental setup; using context-dependent HMMs and a method of coupling three-dimensional techniques, the system classified 53 ASL signs and attained a highest accuracy of 89.91%.

From the literature review, the most common sign language recognition research is based on American Sign Language (ASL), Indian Sign Language (ISL), and Arabic Sign Language (ArSL). Other sign languages reviewed in this paper include Tamil sign language (TSL), Dutch sign language (DSL), Korean sign language (KSL), Malaysian sign language (MSL), Persian sign language (PSL), English sign language (ESL), New Zealand sign language (NZSL), Chinese sign language (CSL), Japanese sign language (JPL), Vietnamese sign language (VSL), Brazilian sign language (Libras), Bangla sign language, and Indonesian sign language.

This paper focuses on reviewing state-of-the-art methods. Facial expression is part of sign language but is not discussed in this paper. The rest of the paper is organized as follows: Sect. 1 discusses the challenges, types of approaches, and application domains of gesture recognition. Section 2 discusses the state-of-the-art techniques used in vision-based gesture and sign language recognition; techniques used for pre-processing, segmentation, feature extraction, and classification are discussed separately. Section 3 discusses the techniques and technologies used in sensor-based gesture recognition. In Sect. 4, the techniques and findings of previous works are discussed and summarized. Lastly, thoughts on future work and the conclusion are given in Sect. 5.

1.1 Challenges in gesture recognition

Gesture recognition involves complex processes such as motion modeling, motion analysis, pattern recognition, and machine learning [61]. It comprises methods with manual and non-manual parameters [48]. Environmental factors such as background, illumination, and speed of movement affect predictive ability. Differences in viewpoint cause the same gesture to appear different in 2D space. In some research, signers wear wrist bands or colored gloves to aid the hand segmentation process, as in [3, 30, 48]; the use of colored gloves reduces the complexity of the segmentation process. Anticipated problems in dynamic gesture recognition include temporal variance, spatial complexity, movement epenthesis, repeatability, and connectivity, as well as multiple attributes such as changes in orientation and in the region where the gesture is carried out [53]. There are several evaluation criteria for measuring how well a gesture recognition system overcomes these challenges: scalability, robustness, real-time performance, and user independence [57].

1.2 Type of approaches

Recognition of hand gestures can be achieved using either vision-based or sensor-based approaches.

1.2.1 Vision-based

Vision-based approaches require the acquisition of images or video of the hand gestures through a camera.

  1. Single camera—Webcam, video camera, or smartphone camera.

  2. Stereo camera—Multiple monocular cameras used together to provide depth information.

  3. Active techniques—Use the projection of structured light. Such devices include the Kinect and the Leap Motion Controller (LMC).

  4. Invasive techniques—Body markers such as colored gloves, wrist bands, and LED lights.

1.2.2 Sensor-based

This approach requires the use of sensors or instruments to capture the motion, position, and velocity of the hand.

  1. Inertial measurement unit (IMU)—Measures the acceleration, position, and degrees of freedom of the fingers. This includes the use of gyroscopes and accelerometers.

  2. Electromyography (EMG)—Measures the electrical pulses of human muscles and harnesses the bio-signal to detect finger movements.

  3. WiFi and radar—Use radio waves, broad-beam radar, or spectrograms to detect in-air signal strength changes.

  4. Others—Utilize flex sensors, ultrasonic, mechanical, electromagnetic, and haptic technologies.

1.3 Hand gesture representation

Hand gestures can be represented in two ways, namely model-based and appearance-based.

  1. Model-based—Describes the shape of the hand gesture in 2D or 3D space. It can be categorized into volumetric models and skeletal models. Volumetric models represent the hand gesture with high accuracy, whereas skeletal models reduce the hand gesture to a set of equivalent joint-angle parameters with segment lengths.

  2. Appearance-based—Features are derived directly from the images or video using a template database. Image sequences are used as gesture templates for hand tracking or simple gesture classification.

1.4 Hand gesture recognition application domain

The ability of a computer or machine to understand hand gestures is the key to unlocking numerous potential applications. Potential application domains of gesture recognition systems are as follows:

  1. Sign language recognition—A communication medium for the deaf. It consists of several categories, namely fingerspelling, isolated words, lexicons of words, and continuous signs.

  2. Robotics and tele-robotics—Actuators and motions of robotic arms, legs, and other parts can be driven by mimicking a human's actions.

  3. Games and virtual reality—Virtual reality enables realistic interaction between the user and the virtual environment; it captures the user's movements and translates them into the 3D world.

  4. Human–computer interaction (HCI)—Includes applications of gesture control in the military, the medical field, graphics manipulation, design tools, and annotating or editing documents.

2 Literature review on vision-based gesture recognition

The process of gesture recognition can generally be divided into several stages, namely data acquisition, pre-processing, segmentation, feature extraction, and classification, as shown in Fig. 1. The input of static gesture recognition is a single image frame, while dynamic sign language recognition takes video, i.e., continuous frames of images, as input. Vision-based approaches differ from sensor-based approaches mainly in the data-acquisition method. The focus of this section is the methodologies and techniques used in vision-based gesture recognition research.

Fig. 1

Vision-based gesture recognition stages

2.1 Data acquisition

In vision-based gesture recognition, the data acquired are frames of images. The input of such systems is collected using image-capturing devices such as a standard video camera, webcam, stereo camera, thermal camera, or more advanced active devices such as the Kinect and LMC. Stereo cameras, the Kinect, and the LMC are 3D cameras which can collect depth information. In this paper, sensor-based recognition covers all data-acquisition techniques that do not use cameras.

2.2 Image pre-processing

The image pre-processing stage modifies the image or video input to improve the overall performance of the system. Median and Gaussian filters are among the most commonly used techniques to reduce noise in the acquired images or video. In [79, 112], only median filtering is applied in this stage. Morphological operations are also widely used to remove unwanted information. For instance, Pansare et al. [19] first thresholded the input image into a binary image, then applied median and Gaussian filters to remove noise, followed by morphological operations, as the pre-processing stage. In some research, the captured images are downsized to a smaller resolution prior to the subsequent stages. This technique, used in [12, 18, 19, 26, 66], has been shown to improve computational efficiency. Research in [120] tabulated the processing time associated with different downsizing factors of image resolution; division by 64 was the optimum scale, as it reduced processing time by 43.8% without affecting the overall accuracy. Histogram equalization is used in [91] to enhance the contrast of input images taken under different environments and to normalize the brightness and illumination of the images.
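As a concrete illustration of this stage, the sketch below chains the operations described above with OpenCV; the filter sizes, downscale factor, and use of luminance-channel equalization are illustrative choices rather than settings taken from the cited works.

```python
import cv2

def preprocess(frame_bgr):
    # Median then Gaussian filtering to suppress salt-and-pepper and sensor noise.
    denoised = cv2.medianBlur(frame_bgr, 5)
    denoised = cv2.GaussianBlur(denoised, (5, 5), 0)

    # Downsize to reduce the computational load of the later stages.
    small = cv2.resize(denoised, None, fx=0.25, fy=0.25,
                       interpolation=cv2.INTER_AREA)

    # Equalize brightness on the luminance channel only, leaving color untouched.
    ycrcb = cv2.cvtColor(small, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```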

2.3 Segmentation

Segmentation is the process of partitioning an image into multiple distinct parts. It is the stage in which the Region of Interest (ROI) is separated from the remainder of the image. Segmentation methods can be contextual or non-contextual. Contextual segmentation takes the spatial relationships between features into account, as in edge detection techniques, whereas non-contextual segmentation does not consider spatial relationships but groups pixels based on global attributes.

2.3.1 Skin color segmentation

Skin color segmentation is mostly performed in the RGB, YCbCr, HSV, and HSI color spaces [6]. Challenges to achieving robust skin color segmentation include sensitivity to illumination, camera characteristics, and skin color variation [136]. The HSV color space is popular because the hue of the palm and the arm differ greatly, so the palm can easily be segmented from the arm [25]. Research in [15] segments the face and hand in the HSV color space. Chen et al. [33] performed skin color segmentation in the RGB color space, using the rule R > G > B and matching against pre-stored skin color samples. Research in [115] found that YCbCr is more robust than HSV for skin color segmentation under different illumination conditions. Research in [116, 119] found that the CIE Lab color space is more robust than YCbCr under different lighting variations. A normalized RG space was introduced in [117] to overcome the non-uniformity that RGB suffers from. Research in [118] proposed using the K-means clustering method on the chrominance channels of the YCbCr color space to separate the foreground, i.e., the skin pixels, from the rest of the background.
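A minimal sketch of fixed-threshold skin segmentation in YCbCr with OpenCV is shown below; the Cb/Cr bounds are commonly quoted ranges and not the exact thresholds used by the works cited above.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # OpenCV orders the channels as Y, Cr, Cb.
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening and closing remove speckle and fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```

Explicit thresholds like these are simple but brittle under changing illumination, which motivates the adaptive skin-color models discussed next.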

Skin color distribution and skin-color model classification can overcome the shortcomings of applying a constant skin-color threshold. Elmezain et al. [47] performed skin color segmentation in the YCbCr color space. In [51], a single Gaussian model based on YCbCr is used, and the classifier detects skin pixels against the background effectively [48]. Yang et al. [44] implemented the methodology of [48]; however, a Gaussian model is built instead of a histogram model. The authors in [120] proposed a dynamic skin color modeling method, introducing weighting factors for a locally trained skin model and a globally trained skin model to obtain an adaptive skin color model.

2.3.2 Other segmentation methods

Zhang et al. [9] introduced a segmentation method based on background image differencing in the presence of a complex background. Otsu thresholding is first applied to the images, and the proposed maximal between-class variance ‘3 s—principal’ method is then used. Ghotkar and Kharate introduced a Hand Tracking and Segmentation (HTS) framework in [17]. The method applies Continuously Adaptive Mean-Shift (CAMShift) in the HSV color space to create a histogram of skin pixels and find a suitable segmentation threshold. Canny edge detection is then applied, followed by dilation and erosion. Finally, an edge traversal algorithm is used to segment the hand gesture from the background.

Lionnie et al. [18] compared the performance of ten variants, including Sobel edge detection, low-pass filtering, histogram equalization, skin color segmentation in the HSI color space, and desaturation, and found that desaturation provides the highest accuracy. The desaturation process converts the image to grayscale by removing the chromatic channels while preserving only the intensity channel of the HSI color space.

Entropy can be measured by subtracting adjacent image frames to obtain hand motion information. Lee et al. [4] subtract each image from its successive image; the process includes measuring the entropy, separating the hand region from the images, tracking the hand region, and recognizing hand gestures. A method combining entropy and skin color information, named Entropy Analysis and PIM, is proposed in [6] to segment hand gestures against a static but complex background.

2.3.3 Tracking

Tracking is considered part of segmentation in this paper, as both tracking and segmentation aim to extract the hand from the background. Tracking hands is usually difficult, as hand movement can be very fast and the appearance can change vastly within a few frames. The CAMShift method is used in several studies, such as [17, 27], to detect and track hand gestures; it locates the hand by continuously adjusting the search window size.
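The sketch below shows a typical CamShift loop with OpenCV: a hue histogram of an initial hand window is back-projected onto each new frame and the window is re-fitted at every iteration. The camera index, initial window, and key handling are placeholders, and the hand is assumed to have been located once already (e.g. by skin segmentation).

```python
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
track_window = (200, 150, 100, 100)        # x, y, w, h of the detected hand (placeholder)
x, y, w, h = track_window
roi = frame[y:y + h, x:x + w]

# Hue histogram of the hand region serves as the tracking model.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift adapts the window size and orientation each frame.
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term)
    pts = cv2.boxPoints(rot_rect).astype("int32")
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) == 27:              # Esc to quit
        break
```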

Adaboost forms a strong classifier from a linear combination of several weak classifiers; the output is the sign of the weighted combination of the weak classifiers. The authors in [29] detect hand movement using Adaboost with Histogram of Oriented Gradients (HOG) features.

Particle filtering is normally used together with other techniques for gesture tracking. In [134, 139], the combination of particle filtering and the mean-shift algorithm is shown to locate the hand accurately. In [133, 135, 138], tracking is performed using color features, and particle filtering is shown to track the movement of the gesture accurately. Research in [137] introduced the Kalman Particle Filter (KPF) as an improvement over plain particle filtering for gesture tracking.

2.4 Feature extraction

Feature extraction is the transformation of interesting parts of the input data into compact feature vectors [83]. In the gesture recognition context, the extracted features should contain the relevant information from the hand gesture input, represented in a compact form that serves as an identity distinguishing the gesture from all others.

2.4.1 Scale-invariant feature transform (SIFT)

SIFT is a scale- and rotation-invariant feature extraction technique introduced by Lowe [40]. SIFT describes an image by its interest points, whose detection requires a multi-scale approach. At each level of the pyramid, the image is rescaled and smoothed by a Gaussian function. The scale space is defined by the function \(L\left( {x,y,\sigma } \right)\) in Eq. 1.

$$L\left( {x,y,\sigma } \right)=G\left( {x,~y,\sigma } \right)*I\left( {x,y} \right)$$
(1)

The key-points extracted are the maxima and minima of the difference-of-Gaussian (DoG) function \(D\left( {x,y,\sigma } \right)\), computed by subtracting two adjacent scales separated by a constant scale factor \(k\), with \(k=\sqrt 2\) as the optimum value, as in Eq. 2.

$$\begin{aligned} D\left( {x,y,\sigma } \right) & =\left( {G\left( {x,~y,k\sigma } \right) - ~G\left( {x,~y,\sigma } \right)} \right)*I(x,y) \\ ~ & =L\left( {x,y,k\sigma } \right) - L(x,y,\sigma ) \\ \end{aligned}$$
(2)

At each point, \(D\left( {x,y,\sigma } \right)\) is compared with its eight neighbors at the same scale and the nine neighbors one scale up and one scale down. If the \(D\left( {x,y,\sigma } \right)\) value is the maximum or minimum among these points, then it is an extremum. In the key-point localization stage, key-points with low contrast or poor localization are removed. The location of the extremum, \(\hat x\), is given in Eq. 3.

$$\hat x= - \frac{{{\partial ^2}{D^{ - 1}}}}{{\partial {x^2}}}\frac{{\partial D}}{{\partial x}}$$
(3)

In orientation assignment, each key-point is assigned a consistent orientation based on local image properties. Finally, the SIFT descriptor is created by first aligning the key-point patch according to its assigned orientation. Matching of SIFT descriptors can then be performed by finding the nearest neighbor and computing the ratio of the closest to the second-closest distance. SIFT is invariant to a certain range of affine transformations, illumination variation, and changes in 3D viewpoint. In several gesture classification applications, such as [15], the SIFT features extracted from the images are quantized using K-means clustering before being mapped into a Bag-of-Features (BoF) representation. These steps address the issue that the number of SIFT features extracted varies from image to image, whereas most classification techniques require inputs of equal dimensionality [11]. Using a similar method to [15], recognition of four gestures is performed with an average accuracy of 90% [66]. The authors claimed that although SURF has a faster processing speed, it is not as rotation invariant as SIFT [66]. Principal Component Analysis (PCA)-SIFT, on the other hand, has better illumination invariance but is not scale invariant. SIFT features are extracted from ArSL in [23], and the authors showed the system to be robust against occlusion and rotation.
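The pipeline just described—SIFT descriptors, K-means quantization into visual words, and a fixed-length BoF histogram per image—can be sketched as below, assuming OpenCV (with SIFT available) and scikit-learn; the vocabulary size and classifier are illustrative choices, not the settings of [11, 15, 66].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(gray_images):
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return per_image

def bof_histograms(per_image_desc, k=100):
    # Quantize all descriptors into k visual words, then histogram each image.
    all_desc = np.vstack(per_image_desc)
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)
    hists = []
    for desc in per_image_desc:
        words = codebook.predict(desc) if len(desc) else np.array([], dtype=int)
        hist, _ = np.histogram(words, bins=np.arange(k + 1))
        hists.append(hist / max(hist.sum(), 1))    # normalize to unit sum
    return np.array(hists), codebook

# train_imgs, train_labels are assumed to be pre-segmented grayscale gestures.
# descs = sift_descriptors(train_imgs)
# X, codebook = bof_histograms(descs)
# clf = SVC(kernel="linear").fit(X, train_labels)
```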

2.4.2 Speeded up robust feature (SURF)

SURF was developed based on SIFT. SIFT constructs a scale pyramid, convolving the upper and lower scales of the image with the DoG operator and searching for local extrema in scale space. SURF instead scales the filter up rather than iteratively reducing the image size. In SIFT, the Laplacian of Gaussian (LoG) is approximated with the DoG when building the scale space; SURF approximates the LoG with a box filter. The convolution of a box filter can be calculated easily using integral images, a fast and effective method of computing sums of pixel values.

For key-point detection, SURF uses an integer approximation of the determinant-of-Hessian blob detector. The integral image at \((x,y)\) is the sum of the intensity values of all points with coordinates less than or equal to \((x,y)\), as shown in Eq. 4.

$$S(x,y)=~\mathop \sum \limits_{i=1}^x \mathop \sum \limits_{j=1}^y I(i,j)$$
(4)

SURF employs the Hessian blob detector to obtain interest points. The determinant of the Hessian matrix describes the strength of the response. The Hessian matrix at point \(x\) and scale \(\sigma\) is defined as in Eq. 5.

$$H\left( {x,\sigma } \right)=~\left[ {\begin{array}{*{20}{c}} {{L_{xx}}(x,\sigma )}&{{L_{xy}}(x,\sigma )} \\ {{L_{xy}}(x,\sigma )}&{{L_{yy}}(x,\sigma )} \end{array}} \right]$$
(5)

where \({L_{xx}}(x,\sigma )\) is the convolution of the image with the second-order derivative of the Gaussian, as described by Bay et al. [7]. To make the system scale invariant, the scale space is realized as an image pyramid; with the use of the integral image and box filters, it can be built by up-scaling the filter. Finally, non-maximum suppression is applied in a \(3 \times 3 \times 3\) neighborhood to localize interest points in the image. Key-points between two images are matched using nearest neighbors.
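The integral image of Eq. 4 is what allows SURF to evaluate its box filters in constant time regardless of filter size; a small NumPy sketch of the idea follows, written purely for illustration.

```python
import numpy as np

def integral_image(img):
    # S(x, y) = sum of all pixels with coordinates <= (x, y), as in Eq. 4.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(S, r0, c0, r1, c1):
    # Sum of img[r0:r1+1, c0:c1+1] using four lookups in the integral image.
    total = S[r1, c1]
    if r0 > 0:
        total -= S[r0 - 1, c1]
    if c0 > 0:
        total -= S[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += S[r0 - 1, c0 - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
S = integral_image(img)
assert box_sum(S, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```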

In research [144], using 500 test images, a Support Vector Machine (SVM) classifier is built to classify both SIFT and SURF features, achieving accuracies of 81.2% and 82.8% respectively. In [145], the authors extracted SURF features from 12 images of each of 24 sign language classes; the overall accuracy is 63%. The authors stated that SURF features remain rotation invariant only for rotations within 15°. In research [34], the authors extract SURF features to obtain the dominant movement direction of matched SURF feature points in adjacent frames, achieving an accuracy of 84.6%.

2.4.3 Principal component analysis (PCA)

PCA is a mathematical operation which uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components [83]. Given a training set of \(M\) images, each represented as an S-dimensional vector, PCA finds a t-dimensional subspace whose basis vectors correspond to the directions of maximum variance in the original image space. The dimension of the new subspace is usually lower, with \(t \ll S\). The mean \(\mu\) of all images in the training set is given in Eq. 6, with \({x_i}\) being the ith image with its columns concatenated into a vector.

$$\mu =\frac{1}{M}\mathop \sum \limits_{i=1}^M {x_i}$$
(6)

The PCA basis vectors are defined as the eigenvectors of the scatter matrix \({S_T}\), computed as in Eq. 7.

$${S_T}=\mathop \sum \limits_{i=1}^M \left( {{x_i} - \mu } \right) \cdot {\left( {{x_i} - \mu } \right)^T}$$
(7)

The eigenvectors and corresponding eigenvalues are calculated, and the eigenvectors are sorted by decreasing eigenvalue. Eigenvectors with lower eigenvalues contain less information about the distribution of the data, and these are discarded to reduce the dimensionality of the data.

PCA has been widely used as a dimensionality reduction technique. PCA transforms possibly correlated variables into a smaller number of principal components, which are uncorrelated variables [70]. PCA is used in [27] to extract features of 24 MSL signs. Locality Preserving Projection (LPP) is a modification of PCA which uses known similarities between input features to adjust feature vector distances. The performance of PCA is compared with LPP, whereby the former achieved 92.8% and the latter 96.5% accuracy. PCA features are also used in [70] as measures of hand configuration and orientation; the authors combined PCA with kurtosis position and chain code to improve the overall accuracy. PCA is used for dimensionality reduction in [77]: by examining the eigenvalues, the authors omitted the components after the 12th, reducing the computational complexity. In research [141], classifying PCA features from 25 classes of VSL using the Mahalanobis distance achieves an accuracy of 91.5%.
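A minimal dimensionality-reduction sketch with scikit-learn is given below; each gesture image is treated as a flattened vector and 12 components are kept, echoing the kind of truncation described for [77], though the data and numbers here are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 64 * 64))          # 200 gesture images, flattened 64x64 (synthetic)

pca = PCA(n_components=12)              # keep the 12 leading eigenvectors
X_reduced = pca.fit_transform(X)        # shape (200, 12)
print(pca.explained_variance_ratio_.sum())
```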

2.4.4 Linear discriminant analysis (LDA)

Both the LDA and PCA approaches find linear combinations of features that best describe the data. For all samples of all classes, the between-class scatter matrix \({S_B}\) and within-class scatter matrix \({S_W}\) are given in Eq. 8.

$$\begin{gathered} {S_B}=\mathop \sum \limits_{i=1}^c {M_i}\left( {{\mu _i} - \mu } \right) \cdot {\left( {{\mu _i} - \mu } \right)^T} \hfill \\ {S_W}=\mathop \sum \limits_{i=1}^c \mathop \sum \limits_{{x_k} \in {X_i}} \left( {{x_k} - {\mu _i}} \right) \cdot {\left( {{x_k} - {\mu _i}} \right)^T} \hfill \\ \end{gathered}$$
(8)

Here \({M_i}\) is the number of training samples in class \(i\), \(c\) is the number of distinct classes, \({\mu _i}\) is the mean vector of the samples of class \(i\), and \({x_k}\) is the \(k\)th image of that class. The aim of LDA is to determine a transformation \(W\) that maximizes \({S_B}\) while minimizing \({S_W}\), i.e., \(W=\arg \max \frac{{{S_B}}}{{{S_W}}}\). The transformation matrix \(W\), which projects the samples into the reduced-dimension space, is given in Eq. 9.

$$W=~{W^T}_{{LDA}}{W^T}_{{PCA}}$$
(9)

LDA maximizes class separability by finding the linear combinations of features which best discriminate among the classes of objects [43]. PCA finds only the directions of maximal variance among the features and does not consider class differences [132]. LDA can be applied as a linear classifier and as a dimensionality reduction method. The authors in [131] extracted PCA and LDA features from five gesture classes; the accuracy of PCA was merely 26% while that of LDA was 100%, and the poor performance of PCA could be due to overfitting. Research in [132] also compared the accuracy of PCA and LDA using five classes with 50 input images; the accuracies achieved were 60% and 62% respectively. It is stated that the noise rate can be reduced by lowering the dimensionality using both PCA and LDA. LDA is used by Tharwat et al. [23] to perform recognition of Arabic sign language using a similar method to [11, 15, 16]: SIFT features are first extracted from the images, and LDA is then applied to widen the separation between the sign language classes.
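In the spirit of the PCA-versus-LDA comparisons in [131, 132], the sketch below runs both as dimensionality reducers ahead of a simple classifier using scikit-learn; the synthetic data and resulting scores are illustrative only and not the figures reported in those papers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((250, 400))                       # 250 samples, 400-dim features (synthetic)
y = rng.integers(0, 5, size=250)                 # 5 gesture classes

for name, reducer in [("PCA", PCA(n_components=4)),
                      ("LDA", LinearDiscriminantAnalysis(n_components=4))]:
    # Reduce to 4 dimensions, then classify with k-NN; report cross-validated accuracy.
    pipe = make_pipeline(reducer, KNeighborsClassifier(n_neighbors=3))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(score, 3))
```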

2.4.5 Convexity defects and K-curvature

The convexity defects and K-curvature method involves finding the convex hull, the convexity defects, the center of the palm, and the angles between the fingertips and the palm center. This method was used in [101, 102, 104, 109–114]. Several studies use global features together with convexity defects to identify the gestures; research in [104], for instance, uses solidity to identify the fingers.

Shukla and Dwivedi utilized convexity defects and contour area as features and were able to recognize five gestures with 100% accuracy [101]. Maisto et al. [102] use the Douglas–Peucker method to approximate the segmented hand gesture with a simpler contour. Research in [103] uses K-curvature in addition to convexity defects to improve the accuracy of fingertip recognition; K-curvature is useful for finding the maximum and minimum points of the hand edges to identify the fingertips. Research in [105] classifies the fingers using the angle between fingertips and palm center, and infers the likely positions of fingers that cannot be detected. Research in [106] uses a Randomized Decision Forest (RDF) and estimation of joint positions to classify gestures based on fingertips. The authors in [107] further improved the accuracy of convexity defects by refining the rule-based method suggested in [108] to identify whether fingers are upright, bent, looped, joined, or separated.
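As an illustration of this family of methods, the sketch below counts extended fingers from convexity defects with OpenCV, assuming a binary hand mask is already available from segmentation; the depth and angle thresholds are common heuristics rather than values from the cited papers.

```python
import cv2
import numpy as np

def count_fingers(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)          # largest blob is taken as the hand
    hull_idx = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull_idx)
    if defects is None:
        return 0
    fingers = 0
    for s, e, f, depth in defects[:, 0]:
        start, end, far = hand[s][0], hand[e][0], hand[f][0]
        a = np.linalg.norm(end - start)
        b = np.linalg.norm(far - start)
        c = np.linalg.norm(end - far)
        angle = np.arccos((b ** 2 + c ** 2 - a ** 2) / (2 * b * c + 1e-9))
        # A deep defect with an acute angle is taken as the gap between two fingers.
        if depth > 10000 and angle < np.pi / 2:
            fingers += 1
    return fingers + 1 if fingers else 0
```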

2.4.6 Feature extraction in the frequency domain

Feature extraction in the frequency domain involves transforming the input data into the frequency domain; this includes the Cosine Transform, Fourier Transform, and Wavelet Transform. In research [33], the authors note that the advantage of Fourier Descriptors (FD) is their size invariance. FDs are also rotation invariant, as rotating a hand gesture causes only a phase change. Moreover, noise can be reduced by removing high-frequency components, since noise and quantization errors cause only local, high-frequency variation.

The authors in [27] claimed that contour-based features, including FD, Wavelet Descriptors (WD), and B-splines, are prone to poor performance when the fingers are curled inward and the contour properties are lost. Region-based features such as the Principal Curvature Based Region detector (PCBR) utilize semi-local structural information, for instance curvilinear shapes and edges, which are robust to intensity, color, and shape variation. 2-D Wavelet Packet Decomposition (WPD-2) uses the Haar basis function up to level two, exploiting the high-frequency channels that carry significant information. A hybrid feature extraction method combining PCBR, WPD-2, and convexity defects is used in [51] to recognize 23 static ISL signs; the hybrid of all three features is shown to outperform any hybrid of two of them using a k-NN classifier, and similar findings are obtained with SVM. Discrete Wavelet Transform (DWT) features are extracted for the classification of 23 static PSL signs in [69]. The DWT can be realized by iterating filtering and rescaling operations, and the resolution of the signal is determined by the filtering operations [69].
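A minimal Fourier Descriptor sketch is shown below: the contour is treated as a complex signal, transformed with the FFT, and normalized so the descriptor is invariant to translation, scale, and (using magnitudes) rotation. It is purely illustrative and not the exact formulation used in [27, 33].

```python
import numpy as np

def fourier_descriptor(contour_xy, n_coeffs=16):
    # contour_xy: (N, 2) array of ordered boundary points of the hand.
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    coeffs = np.fft.fft(z)
    coeffs[0] = 0                          # drop DC term -> translation invariance
    mags = np.abs(coeffs)                  # magnitudes -> rotation invariance (phase change only)
    mags = mags / (mags[1] + 1e-9)         # scale by first harmonic -> scale invariance
    return mags[1:n_coeffs + 1]            # keep low frequencies, discard noisy high ones
```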

2.4.7 Other feature extraction methods

Some features have advantages over others yet suffer from their own drawbacks. For instance, SURF is much more computationally efficient than SIFT [7], but it is not as rotation and illumination invariant [26]. Hybrid feature extraction has been used in several studies to overcome the limitations of single features. Hu invariant moment geometric features are extracted from hand gestures and combined with SURF in [26]; using a hybrid of SVM and k-NN as the classifier, the authors compared their proposed method with SIFT, SURF, and Hu moments alone and showed that the hybrid of SURF with Hu moments gives the highest accuracy.

Liu et al. [25] proposed a hybrid feature fusion of Hu invariant moments, finger angle count, skin color angle, and non-skin color angle; an accuracy of 90% is achieved in matching ten gestures. Local Binary Pattern (LBP) is a computationally efficient texture operator which labels the pixels of an image by thresholding the neighborhood of each pixel and treating the result as a binary number. LBP is used for feature extraction in [142] on both Chinese and Bangladeshi numeral gesture datasets, achieving accuracies of 87.13% and 85.10% respectively. In [154], the authors extract both HOG and Zernike Invariant Moment (ZIM) shape descriptors to classify 40 classes of Libras; the magnitude of the ZIM is rotation invariant and is therefore used as the feature, and the overall accuracy achieved is 96.77%. Chakraborty et al. [8] compared four gesture recognition techniques, namely subtraction, gradient, PCA, and rotation invariant; the rotation-invariant method, which is based on LBP, provides the highest accuracy. Pansare [20] compared the performance of different feature extraction methods, namely the Discrete Cosine Transform (DCT), edge-oriented histogram, centroid, and Fourier Transform, and showed that DCT gives the best result. In research [149], a combination of SIFT, Hu moments, and FD features is extracted from the input images; PCA and LDA are applied to these features to reduce the dimensionality, and classification of 26 classes of CSL using SVM achieves an accuracy of 99.8%.
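A short sketch of the LBP operator just described, assuming scikit-image is available; the radius, number of neighbors, and histogram binning are illustrative defaults rather than the settings of [142].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_img, p=8, r=1):
    # Label each pixel by thresholding its p neighbors at radius r ("uniform" patterns).
    lbp = local_binary_pattern(gray_img, P=p, R=r, method="uniform")
    n_bins = p + 2                                   # uniform patterns plus an "other" bin
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)                 # normalized texture histogram
```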

2.5 Classification

Classification can be categorized into supervised and unsupervised machine learning techniques. Supervised machine learning teaches the system to recognize certain patterns in the input data, which are then used to predict future data; it takes a set of labeled training data and infers a function from it. Unsupervised machine learning draws inferences from datasets with no labeled response. Since no labels are fed into the classifier, there is no reward or penalty indicating to which class the data are supposed to belong.

2.5.1 Static gesture classification

Static gestures are single images involving no temporal dimension. Static gesture recognition is mostly used to recognize finger-spelled signs.

2.5.1.1 Support vector machine (SVM)

SVM is a supervised machine learning technique. It finds the optimal hyperplane separating the data points, maximizing the margin around the separating hyperplane; optimization techniques are employed to find this hyperplane, and two parallel hyperplanes that best bound the classes are determined. For training data \(\left( {{{\vec x}_1},{y_1}} \right), \ldots ,\left( {{{\vec x}_n},{y_n}} \right)\), where \({y_i}\) is either 1 or −1, indicating the class to which \({\vec x_i}\) belongs, \(\vec w\) is the weight vector. The weight vector determines the orientation of the decision boundary, whereas the bias \(b\) determines its location. The hyperplane can be represented by Eq. 10.

$$\vec w \cdot {\vec x_i}+b=0$$
(10)

Points above the hyperplane have positive \({y_i}\), and points below have negative \({y_i}\). The distance between a support vector and the hyperplane is \(\frac{1}{{\left\| {\vec w} \right\|}}\). The margin \(M\) is twice the distance to the support vectors, hence \(M=\frac{2}{{\left\| {\vec w} \right\|}}\). To maximize the margin \(M\), \(\left\| {\vec w} \right\|\) must be minimized, as in Eq. 11.

$$\min L=\frac{1}{2}{\left\| {\vec w} \right\|^2}\quad {\text{subject to}}\quad {y_i}(\vec w \cdot {\vec x_i}+b) \geqslant 1$$
(11)
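The sketch below shows the corresponding usage with scikit-learn's SVC: a linear-kernel SVM fitted to feature vectors such as those produced by the extraction stages above. The synthetic data only illustrate the API and do not correspond to any reported result.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 128))                 # 300 gesture samples, 128-dim features (synthetic)
y = rng.integers(0, 12, size=300)          # 12 gesture classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)   # margin maximization as in Eq. 11
print("accuracy:", clf.score(X_te, y_te))
```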

The performance of SVM has been compared with k-NN [23] and Naive Bayes [16], and SVM has been shown to outperform both. SVM with a linear kernel performs better than SVM with a non-linear Gaussian kernel in [76]; the authors experimented with two sizes of gesture database, and the classification accuracy using linear SVM dropped from 99.2% to 82.3% when the number of ESL gestures increased from 12 to 25. The method of extracting SIFT features from images, quantizing them with K-means clustering, mapping them into a BoF representation, and classifying with SVM has shown promising results in [11, 15, 16, 66]. Proximal SVM (PSVM) employs an equality constraint instead of the inequality constraint of standard SVM. PSVM is used in [21], where seven features are extracted and grouped into a matrix with each row representing a single feature vector; PSVM handles multiple classes more efficiently, and classification of 20 TSL signs achieved 91% accuracy. Multi-class classification using non-linear SVM has higher accuracy than linear SVM in [16]. In [23], SIFT features are extracted from 30 ArSL signs; with seven training images each, an accuracy of 99% is obtained.

2.5.1.2 Artificial neural network (ANN)

ANN is an information-processing system with several performance characteristics in common with biological neural networks [69]. An ANN is generally defined by three parameters, namely the interconnection pattern between the different layers of neurons, the weights of the interconnections, and the activation function. A neuron has inputs \({x_1},~{x_2} \ldots {x_n}\), each associated with a weight \({w_1},~{w_2} \ldots {w_n}\) that measures the strength of the connection. The neuron function can be represented as a nonlinear function of the weighted sum, as in Eq. 12, where \(K\) is the activation function.

$$y=K\left( {\mathop \sum \limits_{i=1}^n {w_i}{x_i}} \right)$$
(12)
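Equation 12 amounts to a weighted sum passed through an activation function; a one-function NumPy realization is shown below, with a sigmoid chosen as an example activation.

```python
import numpy as np

def neuron(x, w, K=lambda s: 1.0 / (1.0 + np.exp(-s))):
    # Weighted sum of the inputs followed by the activation function K (Eq. 12).
    return K(np.dot(w, x))

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3])))
```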

Akmeliawati et al. [27] applied an ANN with 7392 gesture signals to train a system to recognize 13 gestures; using a single ANN with 45 inputs, 14 outputs, and two hidden layers, an average accuracy of 96.02% is achieved. The Gesture Recognition Fuzzy Neural Network (GRFNN) was introduced in [5] to adapt fuzzy control for learning the parameters; eliminating the need to preselect training patterns improves the accuracy, and in recognition of 36 ASL signs, GRFNN achieved 92.19% accuracy [5]. Time Delay NNs (TDNNs) focus on working with continuous data. The Multi-Layer Perceptron NN (MLPNN) is a feedforward neural network with one or more layers between the input and output layers; it extends the linear perceptron to distinguish data that are not linearly separable.

Karami et al. [69] employed an MLPNN to classify 32 classes of PSL; using an MLPNN with 92 input nodes, one hidden layer with 21 neurons, and five linear output neurons, the accuracy achieved is 94.06%. A recurrent NN (RNN) is one in which the connections between neurons form a directed cycle. The Elman RNN is a partially recurrent network in which the feedforward connections are trainable while the recurrent connections are fixed; the feedback connections allow the network to remember cues from the recent past, while the rest behaves as a feedforward network. Using back-propagation with a Simulated Annealing training method, promising results were obtained for training on dynamic sequences in both [79, 82].

2.5.1.3 K-nearest neighbor (k-NN)

K-NN is a non-parametric statistical method whereby input data are classified by a majority vote of their neighbors: each sample is assigned to the class most common among its \(k\) nearest neighbors. The Euclidean distance in Eq. 13 is a commonly used similarity measure.

$$distance=\sqrt {\mathop \sum \limits_{i=1}^N {{({a_i} - {b_i})}^2}}$$
(13)
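A compact NumPy sketch of the majority vote over the Euclidean distances of Eq. 13 follows; \(k\) and the toy data are illustrative only.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance to every training sample
    nearest = np.argsort(d)[:k]                    # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority vote

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 4])))   # -> 1
```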

The Euclidean distance between each testing data point and the training data points is calculated. The testing data are then labelled according to the majority class among the \(k\) nearest training samples. K-NN is compared with the parametric Bayes classifier in [35], and the former is shown to have better performance. In research [146], k-NN is used to classify 30 test images from each of 26 gestures, and the highest overall accuracy achieved is 90%. However, several studies comparing k-NN against SVM with equal test and training data sizes have shown that the overall accuracy of k-NN is comparatively lower [23, 51, 76, 89]. Nevertheless, k-NN has the advantage of being computationally efficient and easy to implement.

2.5.1.4 Unsupervised static classification methods

Unsupervised classification is often referred to as clustering. It differs from supervised classification in that the input data are not labelled. K-means clustering is one of the most commonly used unsupervised classification methods in gesture recognition. It is a vector quantization technique which partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean. In [16, 66], K-means clustering is used to cluster the training feature vectors into classes of sign language; the centroids are then used as inputs to a BoF model to simplify the classification problem. In [81], the authors employed K-means clustering to calculate the code vector coordinates in four dimensions.

Self-organizing maps (SOM) are a variant of ANN trained with unsupervised learning. SOM differs from supervised ANN methods in that it uses competitive learning, in contrast to error-correction learning such as backpropagation with gradient descent. The Self-Growing and Self-Organized Neural Gas network (SGONG) proposed in [84] combines the advantages of Growing Neural Gas (GNG) with a reduced parameter set and a more biologically plausible design; it retains the ability to insert neurons where needed [85]. In [84], constructing the SGONG over the hand gesture input allows the position of each finger to be identified, and classification of 31 static gestures achieved an average accuracy of 90.45%. The Euclidean distance is the real distance between two points in m-dimensional space [25]. In some studies, classification is performed through template matching by calculating the Euclidean distance between the feature vector of the input gesture and a template; the nearest template is the matching result. Examples of gesture classification by Euclidean distance can be found in [8, 19, 25].

2.5.2 Dynamic gesture classification

In dynamic gesture recognition, two signs cannot be compared directly in Euclidean space due to misalignment in time. DTW and HMM are widely applied because of their ability to align frames of signs and compute the likelihood of similarity [49]. Other notable classification techniques in the dynamic setting include the Finite State Machine (FSM), Kalman filtering, advanced particle filtering, and the condensation algorithm.

2.5.2.1 Dynamic time warping (DTW)

DTW is useful for measuring the similarity between two temporal sequences which may differ in length and speed. DTW finds the best mapping with the minimum distance using ‘time warping’, which allows compression or expansion in time to obtain the best match. The goal of DTW is to find the warping path \(p=({p_1}, \ldots ,~{p_L})\) with \({p_l}=\left( {{n_l},{m_l}} \right) \in \left[ {1:N} \right] \times [1:M]\) for \(l \in [1:L]\) satisfying the following constraints:

  1. Boundary condition: \({p_1}=(1,1)\) and \({p_L}=(N,M)\).

  2. Step size condition: \({p_{l+1}} - {p_l} \in \left\{ {\left( {1,0} \right),\left( {0,1} \right),\left( {1,1} \right)} \right\}\) for \(l \in \left[ {1:L - 1} \right]\).

  3. Monotonicity condition: \({n_1} \leq {n_2} \leq \cdots \leq {n_L}\) and \({m_1} \leq {m_2} \leq \cdots \leq {m_L}\).

Given the sequence elements \({x_{{n_l}}}\) and \({y_{{m_l}}}\), a local distance can be computed. The total cost \({c_p}\left( {X,Y} \right)\) of a warping path \(p\) between \(X\) and \(Y\) with respect to the local cost measure \(c\) is given in Eq. 14.

$${c_p}\left( {X,Y} \right)=\mathop \sum \limits_{l=1}^L c({x_{{n_l}}}~,{y_{{m_l}}})$$
(14)
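A minimal dynamic-programming sketch of DTW following the boundary, step-size, and monotonicity conditions above is given below; the local cost \(c\) is taken to be the Euclidean distance between per-frame feature vectors, a choice made here purely for illustration.

```python
import numpy as np

def dtw_cost(X, Y):
    # X: (N, d) and Y: (M, d) sequences of per-frame feature vectors.
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            c = np.linalg.norm(X[n - 1] - Y[m - 1])
            # Step-size condition: allowed predecessors are (1,0), (0,1), (1,1).
            D[n, m] = c + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    return D[N, M]       # total cost c_p(X, Y) of the optimal warping path
```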

DTW was implemented in [52], reaching an accuracy of 96% in recognizing the dynamic sign ‘hello’ within a continuous sentence containing four other signs. In [50], the author introduced Statistical DTW (SDTW), which uses DTW to train a statistical model and was shown to outperform HMMs in handwriting recognition. Lichtenauer et al. [49] introduced a hybrid approach that uses SDTW only for time warping and a separate classifier on the warped features. Two statistical classifiers for the warped features are proposed, namely the Combined Discriminative Feature Detectors (CDFDs) and Quadratic Classification on DF Fisher Mapping (Q-DFFM). Both SDTW with CDFDs and SDTW with Q-DFFM are shown to have better accuracy than SDTW alone and than HMMs; both methods use selected discriminative features (DFs), which reduce dimensionality and noise by removing non-DFs. DTW has also been successfully applied to the classification of dynamic gestures in [51] on feature vectors of PCBR, WPD-2, and convexity defects.

2.5.2.2 Hidden Markov models (HMMs)

HMMs are a stochastic method for analyzing time-varying data with spatio-temporal variability [63]. A first-order HMM makes two assumptions, namely that the probability of a state depends only on the previous state, and that the probability of an output observation depends only on the state \({q_i}\) that produced it and not on any other observations. An HMM involves three fundamental problems, namely computing the likelihood of an observation sequence, decoding the best hidden state sequence, and learning the HMM parameters.

The likelihood computation can be performed with the Forward algorithm. The Viterbi algorithm is used to decode the state sequence that best explains the observation sequence. Parameter learning, or training, can be achieved using the Baum–Welch algorithm, which relies on the Forward–Backward procedure.
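A typical recipe is to train one HMM per gesture class and to classify a new sequence by the model with the highest likelihood. The sketch below follows this recipe, assuming the hmmlearn package; the number of states and the toy data are illustrative only.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def train_models(sequences_per_class, n_states=4):
    models = {}
    for label, seqs in sequences_per_class.items():
        X = np.vstack(seqs)                      # stack all observation frames
        lengths = [len(s) for s in seqs]         # sequence boundaries for Baum-Welch training
        models[label] = GaussianHMM(n_components=n_states, n_iter=50).fit(X, lengths)
    return models

def classify(models, seq):
    # Forward-algorithm log-likelihood under each class model; pick the best.
    return max(models, key=lambda lbl: models[lbl].score(seq))

# Toy data: two "gesture classes" with different feature statistics.
data = {c: [rng.normal(loc=c, size=(30, 3)) for _ in range(5)] for c in (0, 3)}
models = train_models(data)
print(classify(models, rng.normal(loc=3, size=(30, 3))))   # expected: 3
```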

Nianjun and Lovell [81] experimented with HMMs of different model structures, namely left–right and fully connected topologies, and found that the topology has no significant effect on the accuracy. HMMs are used in [33] to classify 20 gestures with 1200 training and test sequences, achieving an accuracy of 98.5%. Another application of HMMs, classifying ten dynamic gestures using 200 training sequences and 98 test sequences, achieved 94.29% [47]. Elmezain et al. [10] applied a Gaussian Mixture Model (GMM) for segmentation and the Baum–Welch and Forward algorithms in the gesture classification stage.

Parametric HMMs (PHMMs) were introduced in [31] to address the parameter-dependent nature of standard HMMs; PHMMs parameterize the underlying output probabilities of the states. Several studies have sought to improve the scalability of HMMs. The performance of HMMs, Linked HMMs (LHMMs), and Coupled HMMs (CHMMs) was compared for three gestures, and the accuracy of CHMMs was found to be least sensitive to the initial values of the parameters [64]. Parallel HMMs (PaHMMs) were proposed in [28] as an improvement over Factorial HMMs (FHMMs) [65] and CHMMs [64]; both FHMMs and CHMMs require the interactions of the processes to be modelled, and hence every combination of actions must be trained [28]. The authors used PaHMMs to classify 22 ASL signs with 400 training sentences and 99 test sentences, achieving average sign and sentence accuracies of 94.23% and 84.85% respectively [32]. Other applications of HMMs can be found in [2, 45].

2.5.2.3 Other dynamic classification methods

Several other supervised classification techniques have been used for gesture classification. Wong and Cipolla [77] employed a Sparse Bayesian Classifier, the Relevance Vector Machine (RVM), in the classification of ten gestures. The authors stated that the probabilistic nature of the Bayesian classifier enables the system to be applied to complex motion analysis that must maintain multiple hypotheses [77]. They favored the RVM over the SVM classifier because the output of the RVM is a probabilistic value instead of a binary true-or-false value; in addition, the sparsity of the model stored by the RVM makes it computationally lighter.

Hong et al. [87] used FSMs for the classification of dynamic gestures. The advantage of FSMs over the commonly used HMMs is that in HMMs the states and structure must be predefined, whereas in FSMs the alignment of training data can be done simultaneously with the construction of the gesture model [87].

2.6 Active techniques

The LMC is a portable USB peripheral with two monochromatic cameras and three infrared Light-Emitting Diodes (LEDs). It models the 3D position of both hands and fingers and provides 28 features, including fingertip positions, palm center, hand orientation, and so on. The LMC has been used to aid the recognition of sign languages in [97], where classification with a Naive Bayes classifier and an MLP-NN achieved average accuracies of 98.3% and 99.1% respectively. Chuan et al. [96] utilized seven features obtained from the LMC and used SVM to classify 26 ASL signs with 79.83% accuracy.

The Kinect is a device with a color sensor, an infrared emitter, and a depth camera, which collects color and depth information. Chai et al. [22] utilized the Kinect to obtain color and depth information used to create a 3D motion trajectory database; with a database of 239 Chinese Sign Language words and four samples per word, the recognition rate achieved is 96.32%.

Marin et al. [100] combined the fingertip positions, palm center, and hand orientation obtained from the LMC with the color and depth information from the Kinect to form a histogram of features. A multi-class SVM with a Gaussian Radial Basis Function (RBF) kernel is then used to classify ten different signs with 140 samples each, showing an average accuracy of 91.3% in real-time recognition [33].

The LMC, however, is unable to detect fingers when they are touching each other or when fingers are occluded [99]. It is also limited when the hand is not perpendicular to the camera or when the signer is wearing a bracelet or long sleeves [100]. The tracking ability of the LMC is tested in [74] using 1500 samples, by comparing the known gestures performed against the actual tracking outcome; the average accuracy obtained is 96.34%.

3 Literature review on sensor-based gesture recognition

This section discusses the techniques used in sensor-based gesture recognition research. Sensor-based approaches generally rely on sensors physically attached to the user to collect position, motion, and trajectory data of the fingers and hand. These approaches reduce the need for the pre-processing and segmentation stages that are essential to vision-based gesture recognition. Features such as finger flex angles, orientation, and the absolute position of the hand are often captured in 3D space and hence contain depth information, which is useful for determining the distance of the gesture from the sensor. Sensor-based approaches often require users to wear a sensor glove or have probes attached to the arm. These instruments must be set up prior to recognition, which often limits such approaches to a laboratory setting.

3.1 Data glove

Data gloves used in gesture and sign language recognition utilize IMU sensors such as gyroscopes and accelerometers to obtain orientation, angular velocity, and acceleration information. Flex sensors are present in some data gloves to obtain finger bending information. The VPL Data Glove is a pair of flex-sensor gloves with fiber optic transducers, which measure the flex angles, position, and orientation data. Kim et al. [41] used 16 raw data channels generated from the VPL Data Glove and categorized the motion of both hands into ten basic motions, which are used as input to a Fuzzy Min–Max Neural Network (FMNN); with 25 KSL words, the authors achieved an accuracy of 85%. The work in [42] recognizes 250 Taiwanese Sign Language words. The features extracted from the data glove include finger flexion, position, angles, and motion trajectory data. The features are used as input to an HMM to recognize 51 types of posture, six types of orientation, and eight types of motion, achieving 100% accuracy for all three categories. The authors also tested isolated gestures, short sentences, and long sentences with a 250-word vocabulary, achieving 89.5%, 70.4%, and 81.6% respectively. In [53], ten flex angles and the 3D absolute position generated by the VPL Data Glove are used; HMMs are applied to recognize ten dynamic gestures, achieving an accuracy of 99%.

3.2 Electromyography (EMG)

Electromyography is the recording of the electrical activity of muscle tissue using electrodes attached to the skin or inserted into the muscles. Zhang et al. [73] use a fusion of 3-axis accelerometer input and 5-channel EMG signals from sensors attached to the user's hand; using Fuzzy K-means clustering as the classifier, 72 dynamic CSL signs are recognized with 93.1% accuracy. Kim et al. [35] used EMG sensors attached to the arms of users to capture finger movement; using a linear combination of k-NN and Bayes classifiers to classify 20 gesture classes, the approach achieved 94% accuracy. Ahsan et al. [24] extracted EMG pattern signatures from the signals for each movement and then used an ANN to classify the EMG signals based on these features. The Myo armband is an arm-worn device with both IMU and EMG sensors. Research in [144] uses the Myo armband to recognize 20 classes of Libras; using an SVM classifier, the average accuracy is 98.6%.

A hybrid method combining vision input from the LMC with surface EMG (SEMG) is presented in [74]. Using SEMG alone, an accuracy of 86% is achieved; together with the LMC depth camera input, the accuracy increases to 95%. Research in [140] utilizes both SEMG and a Cyberglove to classify the flexion and extension of all five fingers. PCA is applied before Independent Component Analysis (ICA), as PCA reduces the computational complexity; classification using LDA reaches an accuracy of 90%.

3.3 WiFi and Radar

Another technology used for gesture recognition is WiFi-oriented gesture control [75]. The authors claimed that this method is much simpler to deploy than Kinect technology. It uses the WiSee technology, which consists of multiple antennas focused on one user to detect that user's gestures. WiFi signals do not require line of sight and can traverse walls. The approach exploits the Doppler shift, the change in frequency of a wave as its source moves relative to the observer. Similar research is reported in [67].

Abdelnasser et al. [92] proposed a WiFi-based gesture recognition system named WiGest, which leverages changes in WiFi signal strength to detect in-air hand gestures near the user's mobile device. Using a single access point (AP), the recognition rate is 87.5%; the accuracy increases to 96% when three overheard APs are used. In research [93], the authors used a smart radar sensor operating in the 2.4 GHz Industrial, Scientific and Medical (ISM) band; features are extracted from the magnitude differences and Doppler shifts of the gesture performed, and k-NN classification of four gestures achieved an accuracy of 98%. Unlike vision-based gesture recognition, WiFi and radar offer flexibility in position and orientation, as the user does not have to face a camera.

4 Discussion

This section provides an overview of previous surveys of gesture and sign language recognition as well as the techniques applied across different studies.

4.1 Previous survey on gesture and sign language recognition works

Reviews and surveys have been conducted on gesture and sign language recognition research; these papers provide a comprehensive overview of the methods used in gesture recognition. Table 1 lists previous analyses of hand gesture recognition and their focus.

Table 1 List of gesture recognition reviews

4.2 Summary of techniques and algorithm reviewed

Information including the techniques applied, database size, performance, and scope of previous work is presented and tabulated in this section. Tables 2 and 3 include the techniques used and a summary of the vision-based gesture and sign language recognition research reviewed: Table 2 lists research on static gesture recognition, whereas Table 3 lists research on dynamic gesture recognition. Table 4 highlights the technologies used in vision-based active techniques and sensor-based gesture recognition. The techniques used are categorized by classification, feature extraction, and segmentation method. Pre-processing methods are not included, however, as many papers lack detailed information on this stage.

Table 2 Vision-based static gesture recognition summary
Table 3 Vision-based dynamic gesture recognition summary
Table 4 Active techniques and sensor-based gesture recognition summary

The accuracy/sample size column states the highest accuracy achieved by the proposed method as well as the sample size of the dataset. The sample size is the total number of samples used, including both training and test samples. A sample size of 15 × 80, for instance, translates to 15 gesture classes with 80 samples each.

Table 3 includes the numbers of sentences or signs used for training and testing of dynamic gestures where these are stated explicitly by the authors. Most literature reviewed in this paper focuses on recognition of only one hand; research involving recognition of more than a single hand is stated explicitly in the Scope column. Most vision-based research reviewed uses a standard camera or a webcam; for research involving a stereo camera or invasive techniques, this is indicated in the Scope column. Some research compared the performance of different techniques: the technique with the most prominent result is presented first, and the techniques compared against are stated after “(Comp)”. Research using a hybrid of techniques is indicated by “+”. Where information is not explicitly stated or was found to be vague by the authors of this paper, the entry is left blank.

Pre-processing methods are carried out to improve accuracy and processing time. The most commonly applied pre-processing techniques include median and Gaussian filters to remove noise. Downsizing the input image is often performed prior to segmentation and the subsequent stages to reduce the computational load. Tracking of hand movement is often carried out using particle filtering, the CAMShift method, and the AdaBoost tracking algorithm.
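
The snippet below is a minimal OpenCV sketch of this typical pre-processing chain. The file name frame.png, the filter kernel sizes, and the resize factor are illustrative assumptions rather than values taken from any reviewed paper.

```python
# Sketch of common pre-processing: noise removal and downsizing before segmentation.
import cv2

frame = cv2.imread("frame.png")                   # hypothetical BGR input frame
frame = cv2.medianBlur(frame, 5)                  # suppress salt-and-pepper noise
frame = cv2.GaussianBlur(frame, (5, 5), 0)        # smooth remaining noise
small = cv2.resize(frame, None, fx=0.5, fy=0.5,   # downsize to cut computational load
                   interpolation=cv2.INTER_AREA)
```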

Skin color segmentation is a popular choice of segmentation method. The most commonly used color spaces are HSV, YCbCr, and CIE Lab, as these color spaces readily separate skin color from the background. The reviewed research shows that combining skin color segmentation with other cues such as edge detection and thresholding improves the segmentation result. Skin color modelling approaches and adaptive skin models are more robust to dynamically changing backgrounds than explicitly selected thresholds in a color space.
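
As an illustration only, the following sketch thresholds skin color in the YCbCr space (YCrCb ordering in OpenCV). The threshold values are commonly quoted approximate skin ranges, not figures reported by the studies reviewed here, and frame.png is a hypothetical input.

```python
# Sketch of fixed-threshold skin color segmentation in the YCrCb color space.
import cv2
import numpy as np

frame = cv2.imread("frame.png")                        # hypothetical input image
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)       # convert BGR to YCrCb
lower = np.array([0, 133, 77], dtype=np.uint8)         # approximate lower skin bounds
upper = np.array([255, 173, 127], dtype=np.uint8)      # approximate upper skin bounds
mask = cv2.inRange(ycrcb, lower, upper)                # binary skin mask
segmented = cv2.bitwise_and(frame, frame, mask=mask)   # keep only skin-colored pixels
```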

In the feature extraction stage, appearance-based and model-based recognition use different approaches. Appearance-based methods, in both the time and frequency domains, extract useful information from the pixels of the input image. Model-based methods include volumetric and skeletal modelling in either a 2D or 3D environment, including convexity defect and K-curvature techniques. SURF is more computationally efficient than SIFT; however, SURF is not as invariant to image transformations as SIFT. PCA is mostly used in combination with other features to improve overall accuracy. PCA and LDA are also useful for dimensionality reduction, which serves to reduce the computational load. Hybrid feature extraction methods have been used widely in recent gesture recognition research.
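
A minimal sketch of PCA used for dimensionality reduction before classification is shown below; the flattened-image features, their dimensions, and the number of retained components are hypothetical choices, not parameters reported in the reviewed work.

```python
# Sketch of PCA for dimensionality reduction of appearance-based features.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(300, 64 * 64)     # 300 gesture images flattened to 4096-D vectors
pca = PCA(n_components=50)           # keep the 50 most informative components
X_reduced = pca.fit_transform(X)     # 300 x 50 feature matrix for the classifier
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```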

In dynamic gesture classification, notable methods are DTW and the HMM. Several HMM variants, such as the PaHMM, CHMM, and LHMM, have been proposed to address scalability issues, while the PHMM has been proposed to reduce the parameter dependence of a standard HMM. In static gesture classification, the most commonly used techniques are SVM and ANN. Many studies that compared classification methods have shown that SVM performs better overall.
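
For illustration, the sketch below trains an SVM and a small ANN (multi-layer perceptron) on the same static-gesture feature vectors, mirroring the kind of comparison reported in the literature. The data are random placeholders, so the printed scores carry no meaning beyond demonstrating the workflow.

```python
# Sketch comparing an SVM and a small ANN on identical static-gesture features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X = np.random.rand(400, 50)          # e.g. PCA-reduced feature vectors (hypothetical)
y = np.random.randint(0, 10, 400)    # 10 hypothetical static gesture classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("ANN", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))]:
    clf.fit(X_tr, y_tr)                      # train on the same split
    print(name, clf.score(X_te, y_te))       # test-set accuracy
```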

In the context of sign language recognition specifically, the vocabulary of a sign language is vast, whereas the vocabulary used in most research to date is small by comparison. Scalability is therefore a challenge particular to sign language recognition. Although much research on gesture and sign language recognition has been carried out, none of it has been deployed on a large scale to date [95]. Even though most studies report promising results, a practical implementation of such a system remains far from reality because of the underlying assumptions made in most of them. Approaches that work in a controlled laboratory setting often do not generalize to arbitrary settings [37]. One common assumption is a high-contrast, stationary background under constant ambient lighting conditions.

4.3 Benchmark databases

In sign language recognition research, benchmark databases are available as a standard reference for future work. Benchmark databases allow model-free and person-independent comparison of approaches [122]. These include Purdue RVL-SLLL [124], RWTH-PHOENIX-Weather [125], the ATIS Sign Language Corpus [127], the SIGNUM Corpus [78], RWTH-BOSTON-50, RWTH-BOSTON-104, and RWTH-BOSTON-400 [128]. Standard accuracy measures are used so that performance can be compared: RWTH-BOSTON-50 uses error rate (ER), while RWTH-BOSTON-104 and ATIS use tracking error rate (TER), word error rate (WER), and position-independent word error rate (PER). Several studies have been conducted using these benchmark databases, and their results are shown in Table 5. Nevertheless, these databases are not widely referenced in sign language recognition research; the recognition results presented in most papers reviewed are based on each author’s own collection of data.
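
As a brief illustration of how WER is computed, the following sketch derives it from the Levenshtein edit distance between a recognized sign sequence and its reference transcription; the example sentences are hypothetical and not drawn from any benchmark corpus.

```python
# Sketch: word error rate (WER) via word-level Levenshtein edit distance.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("JOHN GIVE BOOK", "JOHN GIVE THE BOOK"))  # 1 insertion / 3 words = 0.33
```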

Table 5 List of Research using Benchmark Database

5 Conclusion and future work

Gesture recognition is an ongoing research area driven by its wide potential for applications such as sign language recognition, remote control of robots, and human–computer interaction in virtual reality. Nevertheless, the barriers to achieving an accurate and robust system persist, namely hand occlusion, affine transformations, database scalability, varying background illumination, and high computational cost.

A growing number of emerging technologies, such as EMG, the LMC, and the Kinect, capture gesture information more readily. The common pre-processing methods are median and Gaussian filtering as well as downsizing of images prior to subsequent stages. Skin color segmentation is one of the most commonly used segmentation methods. Color spaces that are generally more robust to illumination conditions are CIE Lab, YCbCr, and HSV. More recent research combines several other spatial features and modelling approaches to improve segmentation performance.

Common appearance-based feature extraction approaches include SIFT, SURF, PCA, LDA, and DWT. Model-based approaches include volumetric and skeletal modelling as well as convexity defect techniques. Hybrids of feature extraction methods have been widely used to provide more robust features for recognition.

From previous works, HMMs appear to be a promising approach to dynamic gesture recognition, having been successfully implemented in many studies. In static gesture recognition, SVM is the most popular method, as it has been shown to perform better in several comparisons. Several variants of existing methods have been proposed, and hybrids of methods are becoming more widely used because they can overcome the shortcomings of a single method. Significant gaps remain before gesture recognition can be put into practical use. The number of studies using benchmark databases is far smaller than the number using self-collected data; future work is advised to use benchmark databases to allow direct comparison between algorithms.