1 Introduction

Sign language is a form of visual language that uses grammatically structured manual and non-manual gestures for communication [45]. Manual gestures include hand shape, palm orientation, location, and movement, whereas non-manual gestures are represented by various facial expressions such as head tilting, lip patterns, and mouthing [5, 38]. The language forms one of the natural means of communication among hearing-impaired people. The goal of a Sign Language Recognition (SLR) system is to translate sign gestures into meaningful text so that persons without any speech or hearing disability can understand sign language [24]; hence, it provides a natural interface for communication between humans and machines. A simple way to implement an SLR system is to track the position of the hand and identify relevant features to classify a given sign. Detecting and tracking hand movements in this manner is relatively easy when compared with handling articulated or self-occluded gestures and hand movements [18]. In the literature, a number of SLR systems have been proposed by various researchers for multiple sign languages, including American [45], Australian [35], Indian [18], Spanish [15], and Greek [33]. However, most of the existing SLR systems require a signer to perform sign gestures directly in front of the capturing device, i.e., a camera or sensor. These systems fail to recognize sign gestures correctly (i) when there is a change in the signer's relative position with respect to the camera or (ii) when the signer performs the gestures in a different plane, leading to a change in orientation about the Y-axis. Such a scenario is depicted in Fig. 1, where a signer performs a sign gesture with a rotation about the Y-axis of the camera coordinate system, which results in self-occlusion and a distorted view of the acquired gestures. Therefore, a pose and position invariant hand gesture tracking and recognition system can substantially improve the overall performance of SLR systems and make them usable in real-life scenarios, including real-time gesture recognition involving multiple signers, sign word spotting, etc.

Fig. 1
figure 1

Body occlusion occurs when a signer performs a sign gesture with rotation along the Y-axis: (a) the sign gesture performed with rotation along the Y-axis, (b) the resulting distorted view of the human torso for the performed sign

With the advancement of low-cost depth sensing technology and the emergence of sensors such as the Leap Motion and Microsoft Kinect, new possibilities in Human-Computer Interaction (HCI) are evolving. These devices are designed to provide a 3D point cloud of the observed scene. Kinect provides a 3D skeleton view of the human body through its rich Software Development Kit (SDK) [39]. 3D skeleton tracking can successfully address the body-part segmentation problem and is therefore considered highly useful in hand gesture recognition. Illumination-variation problems that are usually encountered in images captured using traditional 2D cameras can also be avoided with such systems. Kinect has been successfully used in various applications including 3D interactive gaming [4], robotics [28], rehabilitation [13], and hand gesture recognition [18, 25, 30]. Zafrulla et al. [45] have developed an automatic SLR system using Kinect-based skeleton tracking. The authors utilized the 3D points of the upper-body skeletal joints and fed these features to a Hidden Markov Model (HMM) for recognition. However, their system suffers from tracking errors when the users remain seated. In such cases, the main challenge is to extract self-occluded, articulated, or noisy sign gestures that can still be used for recognition. In this paper, we propose a new framework for SLR using the 3D points of the body skeleton that can recognize gestures irrespective of the signer's position or rotation with respect to the sensor. The main contributions of the paper are as follows:

  1. (i)

    Firstly, we present a position and rotation invariant sign gesture tracking and recognition framework that can be used for designing SLR systems. Our system handles all gestures independently of viewpoint by transforming the 3D skeleton feature points with respect to one of the coordinate axes.

  2. (ii)

    Secondly, we demonstrate the robustness of the proposed framework for recognition of self-occluded gestures using HMM. A comparative gesture recognition performance has also been presented using the HMM and SVM (Support Vector Machine) classifiers.

The rest of the paper is organized as follows. Section 2 presents a chronological review of recent works in this field of study. The system setup, along with preprocessing and feature extraction, is presented in Section 3. Experimental results are discussed in Section 4. Finally, we conclude and outline future possibilities of the work in Section 5.

2 Related work

Hand gesture recognition is one of the basic steps of SLR systems. A considerable amount of research has been carried out on locating and extracting hand trajectories. These works vary from vision-based skin color segmentation to depth-based analysis. To overcome the self-occlusion problem in hand gesture recognition, researchers have used multiple cameras to estimate 3D hand pose. Athitsos et al. [2] have proposed estimating 3D hand poses by finding the closest match between the input image and a large image database. The authors used a database indexed with the help of Chamfer distance and probabilistic line matching algorithms, embedding binary edge images into a high-dimensional Euclidean space. In [6], the authors have proposed a Relevance Vector Machine (RVM) based 3D hand pose estimation method to overcome the problem of self-occlusion using multiple cameras. They extracted multiple-view descriptors for each camera image using shape contexts to form a high-dimensional feature vector. Mapping between the descriptors and the 3D hand pose was performed using regression analysis on the RVM.

Recent developments in depth sensor technology allow users to acquire images with depth information. Depth cameras such as time-of-flight cameras and Kinect have been successfully used by researchers for 3D hand and body estimation. Liu et al. [27] have proposed hand gesture recognition using a time-of-flight camera to acquire color and depth images simultaneously. The authors extracted shape, location, trajectory, orientation, and speed as features from the acquired 3D trajectory. Chamfer distance has been used to find the similarity between two hand shapes. In [34], the authors have proposed a gesture recognition system using Kinect. Three basic gestures have been considered in the study using skeleton joint positions as features. Recognition has been performed using multiple classifiers, namely SVM, Backpropagation Neural Network (NN), Decision Tree, and Naive Bayes, where an average recognition rate of 93.72% was recorded. Monir et al. [32] have proposed a human posture recognition system using Kinect 3D skeleton data points. Angular features of the skeleton data are used to represent the body posture. Three different matching matrices have been applied for recognition of postures, where a recognition rate of 96.9% has been observed with priority-based matching. Another study of hand gesture recognition using skeleton tracking has been proposed in [31] using a torso-based coordinate system. The authors used an angular representation of the skeleton joints and an SVM classifier to learn key poses, whereas a decision forest classifier has been used to recognize 18 gestures. Almeida et al. [1] have developed an SLR system for Brazilian Sign Language (BSL) using the Kinect sensor. The authors extracted seven vision-based features related to the shape, movement, and position of the hands. An average accuracy of 80% has been recorded on 34 BSL signs with the help of an SVM classifier. In [26], the authors have proposed a covariance matrix based serial particle filter to track hand movements in isolated sign gesture videos. Their methodology has been applied to 50 isolated ASL gestures with an accuracy of 87.33%.

Uebersax et al. [41] have proposed an SLR system for ASL using a time-of-flight (TOF) camera. The authors utilized depth information for hand segmentation and orientation estimation. Recognition of letters is based on average neighborhood margin maximization (ANMM), depth difference (DD), and hand rotation (ROT). Confidences of the letters are then combined to compute a word score. In [37], the authors have utilized the Kinect sensor to develop gesture-based arithmetic computation and a rock-paper-scissors game. They used depth maps as well as color images to detect the hand shapes. Recognition of gestures has been carried out using a distance metric, known as the Finger-Earth Mover's Distance (FEMD), to measure the dissimilarity between different hand shapes. A Discriminative Exemplar Coding (DEC) based SLR system is proposed in [40] using Kinect. The authors used background modeling to extract the human body and segment the hands. Next, multiple instance learning (MIL) has been applied to learn similarities between the frames using SVM, and the AdaBoost technique has been used to select the most discriminative features. An accuracy of 85.5% was recorded on 73 sign gestures of ASL. Keskin et al. [17] have proposed a real-time hand pose estimation system by creating a 3D hand model using Kinect. The authors used a Random Decision Forest (RDF) to perform per-pixel classification, and the results are then fed to a local mode finding algorithm to estimate the joint locations of the hand skeleton. The methodology has been applied to recognize 10 ASL digits, where an accuracy of 99.9% has been recorded using SVM. A Multi-Layered Random Forest (MLRF) has been used to recognize 24 static signs of ASL [23]. The authors used the Ensemble of Shape Function (ESF) descriptor, which consists of a set of histograms, to make the system translation, rotation, and scale invariant. An accuracy of 85% has been recorded when tested on gestures of 4 subjects.

Chai et al. [5] have proposed an SLR and translation framework using Kinect. Recognition of gestures has been performed using a matching score computed with the help of Euclidean distance. The methodology has been tested on 239 Chinese Sign Language (CSL) words, where an accuracy of 83.51% has been recorded with the top-1 choice. A hand contour model based gesture recognition system has been proposed in [44]. Their model simplifies the gesture matching process to reduce the complexity of gesture recognition. The authors used the pixel normal and the mean curvature to compute the feature vector for hand segmentation. The methodology has been applied to recognize 10 sign gestures with an accuracy of 96.1%. A 3D Convolutional Neural Network (CNN) has been utilized in [14] to develop an SLR system. The model extracts both spatial and temporal features by performing 3D convolutions on the raw video stream. The authors used multiple channels of video streams, i.e., color information, depth data, and body joint positions, as input to the 3D CNN. A Multilayer Perceptron (MLP) classifier has been used to classify 25 sign gestures performed by 9 signers with an accuracy of 94.2%. In [43], the authors have used a hierarchical Conditional Random Field (CRF) to detect candidate segments of signs using hand motions. A BoostMap embedding approach has been used to verify the hand shapes and segmented signs. However, their methodology requires the signer to wear a wrist-band during data collection. In [8], the authors have proposed an SLR system for recognizing 24 ASL alphabet signs using Kinect. A per-pixel classification algorithm has been used to segment the human hand into parts. The joint positions are obtained using a mean-shift mode-seeking algorithm, and 13 angles of the hand skeleton have been used as features to make the system invariant to hand size and rotational direction. A Random Forest based classifier has been used for recognition of the ASL gestures with an accuracy of 92%. In this work, we use an affine transformation based methodology on the 3D human skeleton to make the proposed SLR system position and rotation invariant.

To the best of our knowledge, all existing gesture recognition systems require the users to perform gestures in front of the Kinect sensor. Therefore, such systems suffer when gestures are recorded from side views, which creates occlusion, especially at the joints of the 3D skeleton. In the proposed framework, we present a solution for self-occluded sign gestures that is independent of the signer's position and rotation within the sensor's viewing field. A summary of the related work in comparison to the proposed methodology is presented in Table 1.

Table 1 Summary of the related work in comparison to the proposed methodology

3 System setup

In this section, we present a detailed description of the proposed SLR framework, which offers position and rotation independent sign gesture tracking and recognition. A displacement of the signer's position with respect to the Kinect changes the origin of the coordinate system in the XZ-plane, which may cause difficulty in recognition. Similarly, when the signer performs a sign and only a side view of the gesture is captured by the sensor, self-occlusion may occur. A block diagram of the proposed framework is shown in Fig. 2, where the acquired 3D skeleton represents gesture sequences that undergo affine transformation. After transformation, the hands are segmented from the skeleton to extract the gesture sequence, followed by feature extraction and recognition.

Fig. 2
figure 2

Block diagram of the proposed framework for SLR

3.1 Affine transformation

After capturing the signer's skeleton information through Kinect, the skeleton data are processed through an affine transformation. The affine transformation is used to cancel out the effect of the signer's rotation and position while performing the gestures. Two different 3D transformations, namely rotation and translation as given in (1) and (2), have been applied,

$$ R_{y}^{\theta}=\left[\begin{array}{cccc} cos(\theta)& 0& sin(\theta) & 0 \\ 0 & 1 &0 & 0\\ -sin(\theta) & 0 & cos(\theta) & 0 \\ 0 & 0 & 0& 1 \end{array}\right] $$
(1)
$$ T(t_{x},t_{y},t_{z})=\left[\begin{array}{cccc} 1& 0& 0 & t_{x} \\ 0 & 1 &0 & t_{y}\\ 0 & 0 & 1 & t_{z}\\ 0 & 0 & 0& 1 \end{array}\right] $$
(2)

where \(R_{y}^{\theta}\) is the rotation matrix for rotating a 3D vector by an angle 𝜃 about the Y-axis, and \(T(t_{x}, t_{y}, t_{z})\) is the translation matrix for translating points in 3D.
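For illustration, a minimal NumPy sketch of the two homogeneous transforms in (1) and (2) is given below. The function names and the (N, 3) joint-array layout are our own assumptions and are not part of the original system.

```python
import numpy as np

def rotation_y(theta):
    """Homogeneous rotation matrix about the Y-axis, as in (1)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[  c, 0.0,   s, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [ -s, 0.0,   c, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def translation(tx, ty, tz):
    """Homogeneous translation matrix, as in (2)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def transform(joints, M):
    """Apply a 4x4 homogeneous transform M to an (N, 3) array of skeleton joints."""
    homogeneous = np.hstack([joints, np.ones((len(joints), 1))])
    return (homogeneous @ M.T)[:, :3]
```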

3.1.1 Rotation invariant

If the signer does not perform the sign parallel to the Kinect sensor, the torso makes an angle (\(\theta_{z}\)) with the Z-axis of the sensor's coordinate system. To calculate \(\theta_{z}\), we use three specific 3D points of the skeleton, i.e., the left shoulder (L), the right shoulder (R), and the spine center (C), which constitute the torso plane (TP) as shown in Fig. 3.

Fig. 3
figure 3

Computation of the plane TP using three 3D points on skeleton as representative (a) with front view (b) side view-1 (45 degree approx. with Z-axis) (c) side view-2 (90 degree approx. with Z-axis)

TP has been used as the representative of the 3D skeleton. Next, two vectors \(\overrightarrow {CL}\) and \(\overrightarrow {CR}\) are computed on TP as shown in Fig. 4a.

Fig. 4
figure 4

Computation to cancel out the rotation effect: (a) computation of \(\theta_{z}\) before rotation, (b) after rotation by \(\theta_{z}\) about the Y-axis

Finally, a normal vector (\(\hat{n}\)) to TP is estimated with the help of \(\overrightarrow{CL}\) and \(\overrightarrow{CR}\), and it can be computed using (3). In our study, the angle between the Z-axis and \(\hat{n}\) is normalized to zero degrees (0°) for all gestures. Thus, during testing, \(\theta_{z}\) is calculated using (4) by taking the projection of \(\hat{n}\) onto the XZ-plane,

$$ \hat{n}= \frac{\overrightarrow{CL} \times \overrightarrow{CR}} {|\overrightarrow{CL}| |\overrightarrow{CR}|} $$
(3)
$$ cos(\theta_{z})=\frac{\hat{n}.\hat{k}}{|\hat{n}| |\hat{k}|} $$
(4)

where \(\hat{k}\) is the unit vector \(\langle 0, 0, 1 \rangle\) along the Z-axis. After estimating \(\theta_{z}\), the torso is rotated about the Y-axis using (1) to cancel the effect of the signer's rotation, as depicted in Fig. 4b.
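A minimal sketch of this rotation-invariance step is shown below, assuming the three torso joints are given as NumPy arrays; the sign convention for the rotation direction is a simplifying assumption not specified in the text, and rotation_y() and transform() refer to the hypothetical helpers sketched after (2).

```python
import numpy as np

def torso_rotation_angle(L, R, C):
    """Estimate theta_z from the left shoulder L, right shoulder R and spine
    center C (length-3 arrays), following (3) and (4)."""
    n = np.cross(L - C, R - C)              # normal to the torso plane TP
    n_xz = np.array([n[0], 0.0, n[2]])      # projection of n onto the XZ-plane
    k = np.array([0.0, 0.0, 1.0])           # unit vector along the Z-axis
    cos_tz = np.dot(n_xz, k) / (np.linalg.norm(n_xz) + 1e-9)
    theta_z = np.arccos(np.clip(cos_tz, -1.0, 1.0))
    # the rotation direction (sign) is an assumption, inferred from the normal
    return np.copysign(theta_z, n[0]) if n[0] != 0 else theta_z

# Cancel the signer's rotation by rotating every joint about the Y-axis:
# joints_aligned = transform(joints, rotation_y(-torso_rotation_angle(L, R, C)))
```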

3.1.2 Position invariant

After canceling out the rotational effect, we use another heuristic to make the gesture recognition system independent of the position at which the gestures are performed in the XZ-plane. To accomplish this, the coordinates of the torso are transformed from the sensor's frame of reference to a new frame of reference with respect to the signer. This is performed by translating the 3D spine point (C) of the skeleton to the origin and shifting the rest of the data points with respect to this new origin using (2), as illustrated in Fig. 5.
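A one-line sketch of this position-invariance step, under the same array-layout assumption as the earlier snippets, is:

```python
import numpy as np

def center_on_spine(joints, C):
    """Translate all skeleton joints so that the spine point C becomes the origin,
    i.e. apply the translation T(-C) from (2)."""
    return joints - C
```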

Fig. 5
figure 5

Computation of position invariant by translating the 3D spine point C to center

Aligning all gestures in this way improves recognition performance and makes the system position invariant. The rotation and position invariance steps have been applied to a typical gesture performed at three different angles, as shown in Fig. 6, where the first column shows the raw gesture captured using Kinect, the second column shows the outcome of the rotation invariance step, and the third column shows the translation of the torso to the center to achieve position invariance.

Fig. 6
figure 6

Applying rotation and position invariance steps to a gesture captured at three different viewing angles: figures in the first column show a gesture performed in the front view, side view 1, and side view 2; figures in the second column show the gestures after applying the rotation invariance step that rotates the torso about the Y-axis; figures in the third column show the outcome of the position invariance step that translates the torso to the origin with respect to the spine

3.1.3 Hand segmentation

Since gestures are performed using either a single hand or both hands, both the left (\(H_{L}\)) and right (\(H_{R}\)) hands are segmented from the 3D torso. For each hand, two 3D points, namely the wrist and the hand-tip, are segmented from the torso and obtained using (5) and (6). Concatenating these two 3D points makes \(H_{L}\) and \(H_{R}\) 6-dimensional each,

$$ H_{L}= [W_{L}~|~T_{L}] $$
(5)
$$ H_{R}= [W_{R}~|~T_{R}] $$
(6)

where \([W_{L}~|~T_{L}]\) and \([W_{R}~|~T_{R}]\) are the wrist and hand-tip points of the left and right hands, respectively.
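The hand descriptors of (5) and (6) can be assembled per frame as in the sketch below; the dictionary of joint names is our own assumed layout for the Kinect skeleton, not an API defined by the paper.

```python
import numpy as np

def segment_hands(skeleton):
    """Build the 6-D hand descriptors H_L and H_R of (5) and (6) from one frame,
    given as a dict mapping joint name -> length-3 array (assumed layout)."""
    H_L = np.concatenate([skeleton["wrist_left"],  skeleton["handtip_left"]])
    H_R = np.concatenate([skeleton["wrist_right"], skeleton["handtip_right"]])
    return H_L, H_R
```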

3.2 Feature extraction

Three different features have been extracted from the 3D segmented hands \(H_{L}\) and \(H_{R}\), namely angular features (\(A_{L}\) and \(A_{R}\)), velocity (\(V_{L}\) and \(V_{R}\)), and curvature features (\(C_{L}\) and \(C_{R}\)). The details are as follows.

3.2.1 Angular direction

Angular features have been considered by various researchers in gesture recognition problems [7, 21, 29]. The angular direction corresponding to a 3D gesture sequence point \(M(x, y, z)\) is computed with the help of two neighboring points, i.e., \(L(x_{1}, y_{1}, z_{1})\) and \(N(x_{2}, y_{2}, z_{2})\), as depicted in Fig. 7.

Fig. 7
figure 7

Computation of the angular features of a 3D gesture sequence

The neighboring points are selected in such a way that all points are non-collinear. In this work, N and L are the third neighboring points that lie on either side of M; by doing this, we ensure that the three points are non-collinear. A gesture sequence is shown in Fig. 7 that forms a vector \(\overrightarrow{LN}\) making angles α, β, and γ with the coordinate axes. These angles can be calculated using (7) and (8) and are taken as the three angular direction features of the feature set. Both \(H_{L}\) and \(H_{R}\) consist of two 3D sequences. Therefore, 6-dimensional angular features \(A_{L}\) and \(A_{R}\) have been computed corresponding to \(H_{L}\) and \(H_{R}\), respectively.

$$ \overrightarrow{v}=\overrightarrow{OR}=<v_{x},v_{y},v_{z}>, |\overrightarrow{v}|=\sqrt{{v_{x}^{2}}+{v_{y}^{2}}+{v_{z}^{2}}} $$
(7)
$$ cos(\alpha)=\frac{v_{x}}{|\overrightarrow{v}|},cos(\beta)=\frac{v_{y}}{|\overrightarrow{v}|},cos(\gamma)=\frac{v_{z}}{|\overrightarrow{v}|} $$
(8)
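The direction-cosine computation of (7) and (8) can be sketched as follows; the choice of the third neighbor on either side mirrors the text, and the boundary clamping at the ends of the sequence is our own assumption.

```python
import numpy as np

def angular_direction(seq, step=3):
    """cos(alpha), cos(beta), cos(gamma) of (7)-(8) for each point of a 3D
    trajectory seq of shape (T, 3), using the step-th neighbours on either side."""
    T = len(seq)
    feats = np.zeros((T, 3))
    for t in range(T):
        L = seq[max(t - step, 0)]
        N = seq[min(t + step, T - 1)]
        v = N - L
        feats[t] = v / (np.linalg.norm(v) + 1e-9)  # direction cosines of LN
    return feats
```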

3.2.2 Velocity

Velocity features are based on the fact that each gesture is performed at a different speed [46]. For example, certain gestures may involve simple hand movements and thus have a uniform speed, whereas complex gestures may have varying speeds. Velocity (V) is obtained by measuring the displacement between two successive points of the 3D gesture sequence, say \((x_{t}, y_{t}, z_{t})\) and \((x_{t+1}, y_{t+1}, z_{t+1})\), and it can be computed using (9).

$$ V(x,y,z)= (x_{t+1},y_{t+1},z_{t+1}) - (x_{t},y_{t},z_{t}) $$
(9)

In this study, the velocity has been computed for both hands, resulting in 6-dimensional feature vectors \(V_{L}\) and \(V_{R}\), each corresponding to one hand.
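A minimal sketch of (9) applied to one 3D trajectory is given below; padding the first frame with zeros so that the feature length matches the gesture length is our own assumption.

```python
import numpy as np

def velocity(seq):
    """First difference of a 3D trajectory seq of shape (T, 3), as in (9)."""
    v = np.diff(seq, axis=0)
    return np.vstack([np.zeros((1, 3)), v])  # pad the first frame (assumed convention)
```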

3.2.3 Curvature

Curvature features represent the shape of a curve and are used to reflect structural properties such as concavity and convexity. They have been successfully utilized in various gesture recognition tasks [7]. The curvature of a 3D point \(B(x, y, z)\) on the gesture sequence is estimated using its two neighboring points \(A(x_{1}, y_{1}, z_{1})\) and \(C(x_{2}, y_{2}, z_{2})\); a circle is drawn through them if the points are non-collinear. This is accomplished with the help of two perpendicular bisectors \(\overrightarrow{OM}\) and \(\overrightarrow{ON}\), as depicted in Fig. 8, where the gesture sequence is marked with a dashed line.

Fig. 8
figure 8

Computation of the curvature features of a 3D gesture sequence

The center of the circle is calculated using (10). We extract five curvature-related features, namely the center (O) and radius (r) of the circle, \(\angle{AOC}\) (marked as 𝜃 in the figure), and the two normal vectors \(\overrightarrow{OM}\) and \(\overrightarrow{ON}\), which together comprise 11 dimensions.

$$ \overrightarrow{O}=\frac{sin 2A \ \overrightarrow{A}+sin 2B \ \overrightarrow{B} + sin 2C \ \overrightarrow{C}}{sin 2A+sin2B+sin2C} $$
(10)

In this study, curvature has been computed for both hands, resulting in 22-dimensional feature vectors \(C_{L}\) and \(C_{R}\), one for each hand. Thus, by combining all three features, a new multi-dimensional feature vector (\(F_{T}\)) of 68 dimensions is constructed as given in (11).

$$ F_{T} = [H_{L}~|~H_{R}] = [A_{L}~|~A_{R}~|~V_{L}~|~V_{R}~|~C_{L}~|~C_{R}] $$
(11)
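A sketch of the curvature descriptor for one point and its two neighbours is given below, assuming the barycentric circumcenter formula of (10); the interior angles are recovered via the law of cosines, and treating \(\overrightarrow{OM}\) and \(\overrightarrow{ON}\) as the vectors from the circumcenter to the chord midpoints is our interpretation of the figure.

```python
import numpy as np

def curvature_features(A, B, C):
    """11-D curvature descriptor for a trajectory point B and its non-collinear
    neighbours A and C: center O and radius r of the circumcircle, angle AOC,
    and the two bisector vectors OM and ON (3 + 1 + 1 + 3 + 3 = 11 values)."""
    a, b, c = (np.linalg.norm(C - B), np.linalg.norm(A - C), np.linalg.norm(A - B))
    # interior angles of triangle ABC via the law of cosines
    angA = np.arccos(np.clip((b*b + c*c - a*a) / (2*b*c), -1.0, 1.0))
    angB = np.arccos(np.clip((a*a + c*c - b*b) / (2*a*c), -1.0, 1.0))
    angC = np.arccos(np.clip((a*a + b*b - c*c) / (2*a*b), -1.0, 1.0))
    w = np.array([np.sin(2*angA), np.sin(2*angB), np.sin(2*angC)])
    O = (w[0]*A + w[1]*B + w[2]*C) / w.sum()   # circumcenter, eq. (10)
    r = np.linalg.norm(A - O)                  # circumradius
    theta = np.arccos(np.clip(np.dot(A - O, C - O) / (r*r + 1e-9), -1.0, 1.0))
    OM = (A + B) / 2.0 - O                     # towards the midpoint of chord AB
    ON = (B + C) / 2.0 - O                     # towards the midpoint of chord BC
    return np.concatenate([O, [r, theta], OM, ON])
```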

3.3 HMM guided gesture recognition

HMM is used for modeling temporal sequences and has been employed by researchers in sign gesture and handwriting recognition systems [20, 21, 45]. An HMM can be defined by {π, A, B}, where π is the initial probability distribution, \(A = [a_{ij}]\), \(i, j = 1, 2, \ldots, N\), is the state transition matrix containing the transition probability from state i to state j, and B defines the observation probabilities, with \(b_{j}(O_{k})\) being the density of observing \(O_{k}\) in state j [19, 36]. For each state of the model, a Gaussian Mixture Model (GMM) is defined. The output probability density of state j can be computed using (12),

$$ b_{j}(x) = \sum\limits_{k=1}^{M_{j}}c_{jk} \aleph(x, \mu_{jk}, {\Sigma}_{jk}) $$
(12)

where \(M_{j}\) is the number of Gaussian components assigned to state j, \(\aleph(x, \mu_{jk}, \Sigma_{jk})\) denotes the Gaussian with mean \(\mu_{jk}\) and covariance matrix \(\Sigma_{jk}\), and \(c_{jk}\) is the weight coefficient of Gaussian component k of state j. The observation sequence \(O = (O_{1}, O_{2}, \ldots, O_{T})\) is assumed to be generated by a state sequence \(Q = Q_{1}, Q_{2}, \ldots, Q_{T}\) of length T. Its probability can be computed using (13), where \(\pi_{q_{1}}\) denotes the initial probability of the start state.

$$ P(O \mid \lambda)= \sum\limits_{Q} \pi_{q_{1}}b_{q_{1}}(O_{1}) \prod\limits_{t=2}^{T} a_{q_{t-1}q_{t}} b_{q_{t}}(O_{t}) $$
(13)

HMMs have been used for the recognition of sign gestures, and a set of HMMs is trained using the feature vector \(F_{T}\) defined in (11).
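As a rough illustration of this recognition scheme, the sketch below trains one GMM-HMM per sign word and classifies a test sequence by maximum log-likelihood. It relies on the third-party hmmlearn package, which is not mentioned in the paper, and its default (fully connected) topology may differ from the models actually used.

```python
import numpy as np
from hmmlearn import hmm  # third-party package, assumed here for illustration

def train_sign_models(gestures_by_class, n_states=4, n_mix=8):
    """Train one GMM-HMM per sign word.
    gestures_by_class maps a sign label to a list of (T_i, D) feature matrices F_T."""
    models = {}
    for label, sequences in gestures_by_class.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, sequence):
    """Return the sign whose model assigns the highest log-likelihood to the sequence."""
    return max(models, key=lambda label: models[label].score(sequence))
```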

3.3.1 Dynamic context-independent feature

To boost the performance of the sign gesture recognition system, we include contextual information from neighboring windows by adding time derivatives to every feature vector. Such contextual and dynamic information around the current window helps to enhance the recognition performance [3]. The first- and second-order dynamic features are known as delta and acceleration coefficients, respectively. The delta coefficients are computed from a first-order regression of the feature vector using (14),

$$ d_{t} = \frac{{\sum}_{\theta = 1}^{\Theta} \theta(c_{t+\theta}-c_{t-\theta})}{2{\sum}_{\theta = 1}^{\Theta} \theta^{2}} $$
(14)

where \(d_{t}\) is the delta coefficient at time t, computed in terms of the static coefficients \(c_{t-\Theta}\) to \(c_{t+\Theta}\). The value of Θ is set according to the window size. Likewise, the acceleration coefficients are obtained using a second-order regression. The temporal information captured by these derivative features at each frame represents the dynamics of the features around the current window. In this study, the 68-dimensional feature vector (\(F_{T}\)) has been used along with the dynamic features discussed above to create a 204-dimensional feature vector for classification.
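The following sketch appends delta and acceleration coefficients, eq. (14), to a matrix of static features; replicating the boundary frames for the regression window is our own assumption.

```python
import numpy as np

def add_dynamic_features(static, window=2):
    """Append delta and acceleration coefficients to a (T, D) static feature
    matrix, giving a (T, 3*D) matrix (68 -> 204 dimensions in this setup)."""
    def deltas(c):
        T = len(c)
        padded = np.vstack([c[:1].repeat(window, axis=0), c,
                            c[-1:].repeat(window, axis=0)])
        num = sum(th * (padded[window + th:window + th + T] -
                        padded[window - th:window - th + T])
                  for th in range(1, window + 1))
        return num / (2 * sum(th * th for th in range(1, window + 1)))
    d = deltas(static)       # first-order regression, eq. (14)
    a = deltas(d)            # second-order (acceleration) coefficients
    return np.hstack([static, d, a])
```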

4 Results

We first present the dataset that has been prepared to test our proposed system. Next, we present gesture recognition results. We have carried out experiments in such a way that the training and test sets include gestures of different users.

4.1 Dataset description

A dataset of 30 isolated sign gestures of Indian Sign Language (ISL) has been prepared. The sign gestures have been performed by 10 different signers, where each sign has been performed 9 times by every signer. Hence, in total 2700 (i.e., 30 × 9 × 10) gestures have been collected. Out of these 30 sign words, 16 words have been performed using a single hand (the right hand only), whereas the remaining 14 words have been performed using both hands. A few examples of single- and double-handed gestures are shown in Fig. 9.

Fig. 9
figure 9

Pictorial representation of sign gestures: (a) single-handed (b) double-handed. Note: Two instances of each gesture have been shown where one depicts the starting pose and the other is towards ending of gesture

In order to show the robustness of the proposed framework, all sign gestures have been performed at three different rotational angles, as shown in Fig. 10, where a signer performs sign gestures in three different directions making angles of approximately 0°, 45°, and 90° between the torso plane (TP) and the Z-axis, respectively.

Fig. 10
figure 10

Figure shows a gesture made by a signer at different view angles: (a) front view at approximately 0°, (b) side view 1 at approximately 45°, (c) side view 2 at approximately 90°

All gestures from these different view angles were included in our dataset. Similarly, signers also changed their positions in the XZ-plane of the sensor's field of view when performing different gestures. The 3D visualization of a gesture shows the variations when it is performed by different signers. A 3D plot of the hand-tip points for the single-handed sign gesture 'bye' is shown in Fig. 11 (first row), where the gesture has been performed at different angles.

Fig. 11
figure 11

Figure shows the variation of the same gesture in our dataset. First row: 3D plot for the sign word ‘bye’ (1-handed) performed at different viewing angles (column 1: front view, column 2: side view 1, column 3: side view-2); Second row: 3D plot for sign word ‘dance’ (2-handed) performed at different viewing angles by different signers

Different colors distinguish the various signers. Similarly, a 3D plot for the sign word 'dance' is shown in Fig. 11 (second row), where large variations in the input sequences can be seen. We have made the dataset publicly available for the research community (Footnote 1).

4.2 Experimental protocol

Experiments have been carried out using signer-independent training of the gesture sequences; our proposed methodology does not require a signer to enroll any gestures before using the system, and the HMM models are trained such that they do not depend on particular users. The results have been recorded using a Leave-One-Out Cross-Validation (LOOCV) scheme applied at the signer level: the number of folds equals the number of signers, and the learning algorithm is applied once for each signer, using the gestures of all other signers as the training set. In our experiments, the gestures of 9 signers are used for training and the gestures of the 10th signer are used for testing; the process is repeated for every signer, and the average of the results is reported. Recognition of gestures has been carried out in three modes, namely single-handed, double-handed, and a combination of both.
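A sketch of this evaluation protocol is given below; train_fn and test_fn are hypothetical callables standing in for the HMM training and scoring routines described above.

```python
import numpy as np

def leave_one_signer_out(signers, train_fn, test_fn):
    """Signer-independent evaluation: for each signer, train on all other signers
    and test on the held-out one, then average the fold accuracies.
    signers maps a signer id to that signer's list of (label, feature_sequence) pairs."""
    accuracies = []
    for held_out in signers:
        train_data = [g for s, gestures in signers.items() if s != held_out
                      for g in gestures]
        model = train_fn(train_data)
        accuracies.append(test_fn(model, signers[held_out]))
    return float(np.mean(accuracies))
```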

4.3 HMM based gesture recognition

Gesture recognition has been performed both with and without dynamic features. The experiments have been carried out by varying the number of HMM states, \(S_{t} \in \{3, 4, 5, 6\}\), and by varying the number of Gaussian mixture components per state from 1 to 256 in powers of 2. Results obtained by varying the GMM components and HMM states are shown in Figs. 12 and 13, respectively, for single-handed, double-handed, and combined gestures.

Fig. 12
figure 12

Gesture recognition rate by varying Gaussian mixture components

Fig. 13
figure 13

Gesture recognition rate by varying HMM states

An accuracy of 81.29% has been recorded for single-handed gestures with 64 Gaussian components and 3 HMM states, whereas accuracies of 84.81% and 83.77% have been recorded on double-handed and combined gestures with 4 and 5 HMM states, respectively, and 128 Gaussian components, when tested with dynamic features. Similarly, recognition has also been tested without the dynamic features, where recognition rates of 75.29%, 78.56%, and 73.54% have been recorded for the single-handed, double-handed, and combined modes, respectively. A comparison of the gesture recognition rates with and without dynamic features is shown in Table 2.

Table 2 Gesture recognition rate on HMM with and without dynamic features

It can be observed that the dynamic feature based gesture recognition outperforms the non-dynamic feature set. The confusion matrix obtained using dynamic features is shown as a heat map in Fig. 14.

Fig. 14
figure 14

Gesture recognition performance in the form of confusion matrix

4.4 Rotation-wise results

In this section, we present the results obtained for the different rotations shown in Fig. 10. The HMM classifier has been trained with the gestures performed in the front view, whereas the gestures from the side views are kept for testing. Recognition has been carried out jointly on single- as well as double-handed gestures. The performance has been compared with that obtained on raw data, and the results are presented in Fig. 15.

Fig. 15
figure 15

Rotation wise gesture recognition in comparison to raw data when trained with only front view gestures

An accuracy of 86.67% has been obtained for the front-view setup, and accuracies of 78.45% and 64.39% have been obtained on the two side views, respectively. In all views, the proposed features outperform recognition using raw data.

In addition, the rotation-wise performance has also been computed by training the system with the complete set of preprocessed gestures, including the front and side views. Recognition results are depicted in Fig. 16, where the performance of the system improves in comparison to the accuracies obtained with front-view-only training.

Fig. 16
figure 16

Rotation wise gesture recognition when trained with all gestures (including front, side view 1 and side view 2)

4.5 Scalability test

A scalability test has also been performed by varying the training data, i.e., by varying the number of training signers (2, 4, 6, 8) while keeping the test data fixed. These experiments test user-independence on the combined single- and double-handed gestures, with the gestures of two signers used for testing while the training data are varied. The recognition results are shown in Fig. 17, where an accuracy of 83% was recorded when 8 signers participated in training the HMM classifier.

Fig. 17
figure 17

Gesture recognition performance by varying training signers

4.6 Comparative analysis

A comparative analysis of the proposed framework has been performed using an SVM guided sign gesture recognition system. For this purpose, two different features, i.e., the mean and standard deviation, have been extracted from the feature vector \(F_{T}\). The SVM classifier directly uses a hypothesis space for estimating the decision surface instead of modeling the probability distribution of the training samples [16, 22]. The basic idea is to search for an optimal hyperplane that maximizes the margins of the decision boundaries so that the worst-case generalization error is minimized. For a set of M labeled training samples \((x_{i}, y_{i})\), where \(x_{i} \in R^{d}\) and \(y_{i} \in \{+1, -1\}\), the SVM classifier maps them into a higher dimensional feature space using a non-linear operator ϕ(x). The optimal hyperplane is computed by the decision surface defined in (15),

$$ f(x)= sign\left( \sum\limits_{i} \ y_{i} \alpha_{i} K(x_{i},x)+b\right) $$
(15)

where \(K(x_{i}, x)\) is the kernel function. In this study, a Radial Basis Function (RBF) kernel has been used to train the SVM model. Performance has been evaluated using the complete setup, i.e., preprocessed gestures in training and the user-independent test mode, and the average results are reported. The value of γ is kept fixed at 0.0049, whereas the regularization parameter C has been varied from 1 to 99, as shown in Fig. 18.
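A brief sketch of this comparison pipeline is shown below; the use of scikit-learn, the descriptor length (twice the 68 frame-level dimensions), and the helper names are our own assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def gesture_descriptor(F_T):
    """Fixed-length descriptor for one gesture: per-dimension mean and standard
    deviation of the frame-level feature matrix F_T of shape (T, 68)."""
    return np.concatenate([F_T.mean(axis=0), F_T.std(axis=0)])

def train_rbf_svm(train_gestures, train_labels, C=87, gamma=0.0049):
    """RBF-kernel SVM used in the comparison; C is swept from 1 to 99 in the paper."""
    X = np.array([gesture_descriptor(f) for f in train_gestures])
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    clf.fit(X, train_labels)
    return clf
```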

Fig. 18
figure 18

Gesture recognition performance using SVM by varying regularization parameter

Accuracies of 71.75%, 77.77% and 70.91% have been recorded on three different values of C, i.e., 98, 87, and 91 for single-handed, double-handed and combined gestures, respectively.

Rotation-wise results have been computed by training the system with front gestures as well as with complete dataset. Recognition results are depicted in Fig. 19.

Fig. 19
figure 19

Rotation-wise results by training with front view and all gestures (including front and side views) using SVM

In addition, the accuracies for all views have also been computed with user-dependent training using a 9-fold cross-validation scheme. The dataset has been divided into 9 equal parts, of which 8 parts are used for training and the remaining part for testing; each part is tested in turn and the average results are computed. Recognition accuracies for all views are shown in Fig. 20, where an average performance of 90.26% is recorded.

Fig. 20
figure 20

Recognition results of all views in user-dependent training

To the best of our knowledge, no other method exists with which we can compare our method directly. However, viable comparisons of the front-view gestures have been performed with two publicly available datasets, namely GSL20 [33] and CHALEARN [10]. The GSL20 dataset consists of 20 sign gestures of Greek Sign Language (GSL), whereas the CHALEARN dataset consists of 20 Italian sign gestures recorded with the Kinect sensor. For the CHALEARN dataset, the accuracy is reported on the validation set due to the non-availability of labels in the test data [12]. The authors in [11] have considered 7 joints of the 3D skeleton, namely shoulder center (SC), elbow right (ER), elbow left (EL), hand right (HR), hand left (HL), wrist right (WR), and wrist left (WL), whereas our methodology considers only 4 joints, i.e., hand right (HR), hand left (HL), wrist right (WR), and wrist left (WL). Therefore, our method achieves a lower accuracy in comparison to [11]. The comparative performance is presented in Table 3.

Table 3 Comparative analysis of proposed SLR system with existing methodologies

4.7 Error analysis

This section presents an analysis of the failure cases, for which the confusion matrix is shown in Fig. 14. Some gestures have not been recognized because of distortions that remain in the data even after the affine transformation. Moreover, some gestures share similar movements, shapes, and hand orientations, which creates confusion within the set; hence, they have been falsely recognized. For example, the single-handed sign gestures representing 'name' and 'no' have similar hand movements, differing only in the speed and position of the hand. Similarly, the two-handed sign gestures for the words 'wind' and 'go' also share similar characteristics in terms of the movements and positions of the hands.

5 Conclusion and future work

In this paper, we have proposed a rotation and position invariant framework for SLR that provides an effective solution for recognizing self-occluded gestures. Our system does not require a signer to perform sign gestures directly in front of the sensor and hence provides a natural way of interaction. The proposed framework has been tested on a large dataset of 2700 sign words of ISL, collected with varying rotations and positions of the signer in the field of view of the sensor. Recognition has been carried out using the HMM classifier in three modes: single-handed, double-handed, and combined. Results show the robustness of the proposed framework, with an overall accuracy of 83.77% in the combined setup. In the future, the work can be extended to the recognition of interactions between multiple persons. Vision-based approaches in combination with depth sequences and the 3D skeleton could also help in boosting recognition performance. Additionally, more robust features and classifiers such as Recurrent Neural Networks (RNN) could be explored to improve the performance further.