1 Introduction

Sign language uses hand gestures and body movement to depict the words of spoken language, allowing hearing-impaired people to communicate with the world [1]. Sign language structure can be decomposed into primary and secondary components that are combined either sequentially or simultaneously. The primary components consist of the different hand shapes used to make a sign, the location of the hands/body, and the movement of the hands/fingers/body. In addition to these, many signs include secondary components such as facial expressions and body movements. Among these components, the principal one is the hand, as most gestures can be represented by it. A gesture is a movement of the hands that produces the alphabets, numerals, words, and sentences of the local spoken language. Gestures are categorized as static or dynamic. Static gestures consist of fixed hand positions, i.e., no change in the position of the hands or fingers with respect to time, whereas dynamic gestures comprise variable movements of the hands [2]. Sign language syntax differs from region to region, and the language of the Indian hearing-impaired community is known as Indian Sign Language (ISL).

The use of sign language is limited to hearing-impaired people, as most non-signers cannot understand it [3]. A sign language recognition system (SLRS) is a useful tool for communicating information between the signer and non-signer communities, as it automatically identifies gestures and translates them into speech or spoken-language text [4]. In recent years, SLRS has created a new way for an interpreter to convert signs into text/speech. There are numerous successful applications in this field, such as language translation, sign language tutoring, and special education, which can aid deaf persons in communicating effectively with others. However, due to the intricacy of extracting information from signing components, sign language recognition remains a difficult process. Learning and constructing a feature vector to represent the information of a hand gesture is challenging, as it involves tasks such as tracking hand regions, segmenting hands from the background, and discarding irrelevant information. The approaches presented by different researchers mainly cover static hand gesture recognition. However, an SLRS using merely static gestures cannot handle the large vocabulary and complexity of sign language, so research into recognizing dynamic gestures is also required. Detecting and tracking complex finger-motion activities against a wide-scale body background creates a barrier for dynamic sign language recognition. Another hurdle is the extraction of the most discriminative features from the multiple frames of a video.

The aim of the proposed work is to recognize dynamic words of ISL using a vision-based, signer-independent system. The proposed method can remember a long-term sequence of two-hand dynamic ISL gestures. The following are the paper's significant contributions:

  • An efficient hybrid model combining a CNN and S2B-LSTM has been designed for dynamic gesture recognition.

  • For this, a dataset of 360 videos of dynamic words used by the Indian hearing-impaired community has been created against a uniform background using a camera. Each captured video is converted into a sequence of keyframes only, to reduce the computational complexity.

  • Experimental findings show a promising recognition accuracy of 97.6%.

The remaining paper is organized as follows: Sect. 2 contains a detailed report of the existing SLRS. The dataset description is given in Sect. 3, and the proposed method is detailed in Sect. 4. The experimental evaluation of the work is given in Sect. 5, and finally, Sect. 6 concludes this paper.

2 Related Work

This section briefly discusses the available literature on sign language recognition. Existing SLRS can be broadly classified into sensor-based [5] and image-based methods [6]. Sensor-based methods make use of external hardware components, consisting of different sensors, to capture the signer's details. The sensors are generally embedded in gloves, a helmet, or a body-suit, which the signer needs to wear before performing any action. Such systems are efficient in capturing signing details, but the electronic circuitry restricts the signer's movement. On the other hand, in vision-based methods, the signing information is collected using a 2D or 3D camera, which allows natural movement of the hands and body [7]. In the literature, both types of systems have proved helpful for different sign languages.

Rekha et al. presented a recognition system for ISL signs [8]. The authors collected 23 static and 3 dynamic gestures in this work. Static signs are classified using SVM classifiers, and dynamic time warping is used for dynamic signs. Kishore et al. gave a method for recognizing ISL sentences [9]. The collected ISL dataset consisted of a total of 580 sentences. A neural network is used to classify the signs into various words, and the performance is computed using a word matching score. Another vision-based approach for ISL recognition is given in [10]. This method achieved an accuracy of 90% for classifying 24 isolated signs. The use of a leap motion controller for ISL dataset collection was proposed by Naglot and Kulkarni [11]. An ANN is used in this work to classify 26 alphabets and 10 numerals of ISL. Kumar et al. proposed an ASL recognition method using real-time video [12]. The HSV color space model is used for skin segmentation; the features of static and dynamic gestures are extracted using Zernike moments and curve features, respectively. This work achieved an accuracy of 93% for static gestures and 100% for dynamic gestures using an SVM classifier. Ibrahim et al. presented an automatic recognition method for Arabic sign language [13]. First, hands are segmented from a dataset of 30 isolated words using a skin-detection technique, and then features such as the centre of gravity of the hand and the motion velocity are used to form the feature vector. Euclidean distance is then used for classification, achieving a classification accuracy of 97%. A convolutional neural network based ASL recognition method is presented by Kim et al. [14]. An impulse radio sensor is used in this work, and the CNN classifier obtained a classification accuracy of 90%.

In [15], a vision-based recognition method is presented for ISL static and dynamic gestures. Skin-based segmentation is used to extract the signs from real-time video, and then Zernike moments are used to extract keyframes from the captured signs. The classification of these signs is done using an SVM classifier, obtaining a recognition rate of 91%. A multimodal ASL recognition system is presented by Ferreira et al. [16]. The color, depth, and leap motion data of 1400 ASL static signs are collected using Kinect and leap sensors. The best recognition accuracy of 97% was obtained using a CNN classifier. A machine-based interpreter was designed by Darwish for ArSL [17]. A total of 6000 static samples is collected and classified using a fuzzy HMM.

Another multimodal framework for SLRS is presented by Kumar et al. [18]. This method incorporates facial expressions along with the gesture dataset of two sensors. The classification of 51 dynamic gestures is done using an HMM. A Vietnamese SLRS is presented in [19]. A Microsoft Kinect camera is used to collect sequences of depth images for 30 dynamic gestures. The performance of the SVM and HMM classifiers is compared in this work, achieving an average accuracy of 95%. In [20], a multimodal dynamic SLRS is presented. The feature extraction and classification of the dynamic gestures are done using a 3D convnet and a bidirectional LSTM network. This method achieved a maximum of 89.8% recognition accuracy for Chinese sign language. A spotting-recognition architecture for the recognition of continuous gestures is given in [21].

From the literature reviewed in this section, it can be seen that existing work focuses mainly on static and finger-spelled gestures. Designing an SLRS for dynamic gestures is challenging as it is based on multiple frames. Another observation is that a standardized ISL dataset is unavailable. To address these issues, a vision-based dynamic gesture recognition method is presented in this work. The next section gives the details of the collected ISL dataset.

3 Dataset Creation

Dataset creation is a crucial part of this work. From the literature in the previous section, it is evident that the researchers of every SLRS for ISL have created their own dataset, as there is no publicly available ISL dataset. In this work, a new dataset of dynamic gestures has been collected. The dynamic gestures used are taken from the everyday language of Indian signers. The dataset is collected using a camera and consists of 18 different dynamic gestures; the list of ISL words used is given in Table 1. Around 4–5 videos have been taken from multiple signers for each dynamic gesture, resulting in a total of 360 dynamic videos. To make the system versatile, signers of different genders under different lighting conditions are considered. These videos are converted into sequences of frames, and keyframes are then extracted by removing frames with negligible hand movement. Pre-processing is done to extract the hand region, and the frames are resized to 256 × 256 before being fed to the proposed model. The reduced image resolution lowers the computational complexity and speeds up the convergence of the classifier. To the best of the authors' knowledge, this is the first time such a large dynamic ISL dataset has been collected. Samples of this dataset are shown in Fig. 1.

Table 1 List of ISL dynamic gestures used
Fig. 1
figure 1

Sample of extracted keyframes for gesture "Marry"
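The keyframe extraction and resizing described above could be sketched roughly as follows. This is a minimal illustration assuming an OpenCV pipeline in which frames whose mean frame-to-frame difference falls below a threshold are treated as negligible movement; the threshold value and the difference criterion are assumptions rather than the paper's exact procedure, and the hand-region segmentation step is omitted.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, diff_threshold=12.0, size=(256, 256)):
    """Read a gesture video, drop frames with negligible hand movement
    (mean absolute frame difference below a threshold), and resize the rest.
    The threshold-based keyframe criterion is an illustrative assumption."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(cv2.resize(frame, size))
        prev_gray = gray
    cap.release()
    return np.array(keyframes)   # shape: (num_keyframes, 256, 256, 3)
```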

4 Proposed Framework

The methodology for the recognition of dynamic gestures and its primary components are discussed in detail in this section. The ISL recognition system is divided into two stages: feature extraction from the keyframes of the input video using a 2D-CNN, followed by feeding the resulting feature vectors to a stacked parallel BLSTM for temporal feature extraction. The framework of the proposed work is shown in Fig. 2.

Fig. 2
figure 2

Framework of the proposed S2B-LSTM

4.1 Feature Extraction Using Convolutional Neural Network

CNN is an efficient deep learning method composed of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer [22]. Convolutional networks have a wide range of applications. The convolutional layer of this architecture is used as a feature extractor to extract features automatically from the input feed; its mathematical formulation is given in Eq. 1. It performs a 2D convolution of the input image with a pre-defined filter.

Several filters with diverse functionalities are utilized to enable the network to extract complementary information and learn the input characteristics. K filtered images are produced for the K filters used in the convolutional operation. These filtered images are then passed to the pooling layer, which reduces the size of the extracted features by downsampling.

$${f}_{xy}= \sum_{i,j}{w}_{ij}{v}_{(x+i)(y+j)}+b$$
(1)

where \({f}_{xy}\) is the feature map value at position (x, y), w is the kernel weight, \(v\) is the input unit, and b is the bias added to the feature map.
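For illustration, a minimal NumPy sketch of the single-filter convolution in Eq. 1 is given below; the input values and kernel are placeholders used only to show the computation.

```python
import numpy as np

def conv2d_single(v, w, b=0.0):
    """Naive 2D convolution of input v with kernel w plus bias b (Eq. 1)."""
    kh, kw = w.shape
    out_h = v.shape[0] - kh + 1
    out_w = v.shape[1] - kw + 1
    f = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            # f_xy = sum_{i,j} w_ij * v_(x+i)(y+j) + b
            f[x, y] = np.sum(w * v[x:x + kh, y:y + kw]) + b
    return f

# Example: a 5 x 5 input patch and a 3 x 3 edge-like kernel (illustrative values)
v = np.random.rand(5, 5)
w = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(conv2d_single(v, w).shape)  # (3, 3)
```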

The pooling operation can be performed using max-pooling or average pooling. The max-pooling layer splits the input image into a series of non-overlapping regions of a size equivalent to the pooling filter and then selects the maximum value from each region, whereas average pooling selects the average value of each region. In this way, the most dominant features of each sub-region are retained while the spatial size is reduced. The pooling operation works independently on every filtered image and resizes it. CNN models have proved to be an efficient approach in many practical applications [23], as an image with millions of pixels can be scaled down to dozens of values containing significant characteristics such as edges, lines, and intensity. This requires the storage of fewer parameters, which further reduces the model's memory requirement and improves its statistical efficiency.
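A small NumPy sketch of non-overlapping max-pooling (here with a 2 × 2 region, an illustrative choice) shows how each sub-region is reduced to its maximum value:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the maximum of each size x size region."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 2],
              [7, 2, 9, 5],
              [3, 1, 4, 8]], dtype=float)
print(max_pool2d(x))
# [[6. 2.]
#  [7. 9.]]
```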

A 2D-CNN model has been built to extract features from the input videos of dynamic ISL gestures. A video generally consists of 30 to N frames per second, with many redundant frames. In the collected dataset, each video has 110–150 frames, and processing every frame is computationally expensive. Thus, only the keyframes (generally 13–20 frames per video) are passed to the CNN model for feature extraction; the results confirm that using 13–20 frames per video does not affect the sign's sequence. The 2D-CNN in the proposed model consists of three convolutional layers and two pooling layers. This combination of layers efficiently extracts global features from the collected dataset while keeping the architecture for spatial feature extraction relatively simple. After each layer, the non-linear ReLU activation function is added to introduce non-linearity into the model. The experimental analysis of the proposed work shows that this 2D-CNN successfully extracts the hidden details from each frame of a gesture. However, the CNN model can only extract spatial information from the input frames; to learn the relation between corresponding frames, temporal information is required. Thus, the extracted features are further fed to the S2B-LSTM to learn the change in the sequence of the gestures.
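A possible Keras realization of this feature extractor is sketched below. The three convolutional layers, two pooling layers, ReLU activations, and the 256 × 256 input follow the description above; the filter counts, kernel sizes, and the final 128-dimensional projection are illustrative assumptions, since the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_extractor(input_shape=(256, 256, 3)):
    """Per-frame spatial feature extractor: three convolutional layers and two
    pooling layers, as described above. Filter counts, kernel sizes, and the
    128-dimensional output projection are illustrative assumptions."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),   # compact feature vector per keyframe
    ], name='cnn_feature_extractor')

cnn = build_cnn_extractor()
cnn.summary()   # one 128-dim feature vector per 256 x 256 keyframe
```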

4.2 Stacked Bidirectional Long Short-Term Memory (S2B-LSTM)

A recurrent neural network (RNN) is a neural network that uses its internal memory to deal with sequences of data. Its ability to handle time-series problems has promoted its use in computer vision applications, but a regular RNN suffers from the vanishing gradient problem during backpropagation. Hence, RNN models are not capable of learning long-term sequences. The LSTM network is the solution to this problem. LSTMs are a special kind of RNN proposed by Hochreiter & Schmidhuber [23] to analyse temporal and sequential data [24]. The LSTM outperforms the standard RNN for time-series problems as it can deal with both short-term and long-term memory requirements. The internal structure of the LSTM is shown in Fig. 3.

Fig. 3
figure 3

Internal structure of LSTM [25]

The LSTM approach preserves and saves the contextual semantics of information to construct long-term data associations. Its special structure is made up of four blocks: the cell state, the input gate, the output gate, and the forget gate. The LSTM uses these blocks to learn long-term and short-term sequences. The mathematical operation of the LSTM is given by Eqs. 2 to 7 [26].

$${i}_{t}=\sigma (\left({x}_{t}+{h}_{t-1}\right){W}^{i}+{b}_{i})$$
(2)
$${f}_{t}=\sigma (\left({x}_{t}+{h}_{t-1}\right){W}^{f}+{b}_{f})$$
(3)
$${O}_{t}=\sigma \left(\left({x}_{t}+{h}_{t-1}\right){W}^{o}+{b}_{o}\right)$$
(4)
$$g=\mathrm{tanh}\left(\left({x}_{t}+{h}_{t-1}\right){W}^{g}+{b}_{g}\right)$$
(5)
$${c}_{t}={c}_{t-1}.{f}_{t}+g.{i}_{t}$$
(6)
$${h}_{t}=\mathrm{tanh}\left({c}_{t}\right).{o}_{t}$$
(7)

In these equations, \({x}_{t}\) is the input at time t, \({f}_{t}\) is the forget gate that controls how much of the previous state is remembered, \({o}_{t}\) is the output gate that controls the information passed on from the current frame, g is the recurrent unit, \({W}^{i}, {W}^{f}, {W}^{o}, {W}^{g}\) are the weight matrices, and \(\sigma\) is the gate activation function. Specifically, the input gate calculates how much of the current information \({x}_{t}\) should be permitted to pass through, while the forget gate discards the useless information of the LSTM's past state. The values of the forget gate and input gate are adjusted using the sigmoid activation unit during the training phase. The output gate \({o}_{t}\), together with the tanh activation of the cell state, determines the data passed to the next step (Eq. 7). The cell state at time t (\({c}_{t}\)) is evaluated using the cell state of the previous time step (\({c}_{t-1}\)), the forget gate value, the input gate value of the current time step, and the recurrent unit g. \(\sigma\) refers to the sigmoid function, which gives an output in [0,1]; tanh is the hyperbolic tangent function, which gives an output in [-1,1]. At each time step t, the LSTM cell takes the layer input \({x}_{t}\) and produces the layer output \({h}_{t}\); it also uses the previous cell state \({c}_{t-1}\) and updates the current cell state \({c}_{t}\) while training and updating the parameters. Due to this gated structure, the LSTM handles long-term dependencies, allowing important information to pass through the network, and becomes a useful and scalable model for various sequential data learning applications. It is an effective sequence predictor, particularly for temporal sequence data.
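For concreteness, a minimal NumPy sketch of a single LSTM time step following Eqs. 2–7 is given below. It uses the additive combination \((x_t + h_{t-1})\) exactly as written in the equations; the 24-unit dimension matches the proposed model, while the weight values are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. 2-7. W and b hold the weights/biases
    for the i, f, o gates and the recurrent unit g. Here x_t and h_prev share
    the same dimension, matching the additive combination (x_t + h_{t-1})."""
    z = x_t + h_prev
    i_t = sigmoid(z @ W['i'] + b['i'])          # input gate, Eq. 2
    f_t = sigmoid(z @ W['f'] + b['f'])          # forget gate, Eq. 3
    o_t = sigmoid(z @ W['o'] + b['o'])          # output gate, Eq. 4
    g   = np.tanh(z @ W['g'] + b['g'])          # recurrent unit, Eq. 5
    c_t = c_prev * f_t + g * i_t                # cell state update, Eq. 6
    h_t = np.tanh(c_t) * o_t                    # layer output, Eq. 7
    return h_t, c_t

# Illustrative dimensions: 24 units, as in the proposed S2B-LSTM
d = 24
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d, d)) * 0.1 for k in 'ifog'}
b = {k: np.zeros(d) for k in 'ifog'}
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.standard_normal(d), h, c, W, b)
```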

A unidirectional LSTM only retains past information, as it reads the input sequence through its hidden states solely in the forward direction. In a bidirectional LSTM (BLSTM), the information is processed in two ways, in the forward direction (from past to future) and in the backward direction (from future to past), with two separate hidden layers. The final result of the BLSTM is created by combining the outputs of the two LSTMs. For the same input sequence, BLSTMs yield better results than LSTMs, owing to their ability to read in both directions, in many fields such as speech recognition and phoneme classification. Based on the SLRS literature, the BLSTM has not previously been used for ISL recognition.

The structure of the unfolded BLSTM model, with three consecutive units consisting of a forward layer and a backward layer, is shown in Fig. 4. The outputs of both the forward and backward layers are computed using the standard LSTM equations. The output vector \({{\varvec{y}}}_{{\varvec{t}}}\) at time t is computed using Eq. 8.

Fig. 4
figure 4

Bidirectional LSTM with three consecutive units

$${{\varvec{y}}}_{{\varvec{t}}} = \sigma (\overrightarrow{{h}_{t}}, \overleftarrow{{h}_{t}})$$
(8)

where the function \(\sigma\) is used to combine the outputs of the two layers; this can be a concatenation, summation, multiplication, or averaging function. In the proposed work, a summation function has been used to combine the two output sequences. The final output of the BLSTM (shown in Fig. 4) can be represented by an output vector \({{\varvec{Y}}}_{{\varvec{T}}}=[{{\varvec{y}}}_{{\varvec{t}}-1},{{\varvec{y}}}_{{\varvec{t}}},{{\varvec{y}}}_{{\varvec{t}}+1}]\).
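In Keras terms, the summation variant of Eq. 8 corresponds to a Bidirectional wrapper with merge_mode='sum'. A minimal sketch is shown below, where the 24 LSTM units match the proposed model and the input shape (16 keyframes with 128-dimensional frame features) is an assumed example.

```python
import tensorflow as tf
from tensorflow.keras import layers

# y_t = forward h_t + backward h_t (Eq. 8, summation variant)
blstm = layers.Bidirectional(
    layers.LSTM(24, return_sequences=True),  # 24 units, as in the proposed model
    merge_mode='sum',                        # element-wise sum of the two directions
)

# Example: a batch of 4 sequences, 16 keyframes, 128-dim frame features (assumed shape)
x = tf.random.normal((4, 16, 128))
y = blstm(x)
print(y.shape)  # (4, 16, 24)
```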

In this method, a deep architecture named stacked 2 bidirectional LSTM (S2B-LSTM) neural network, consisting of two BLSTM layers, has been proposed to learn the long-term dependencies of ISL gesture videos. Here, the output at time t depends on the previous frames as well as on the upcoming frames. In each BLSTM unit, two LSTMs with 24 units are placed in parallel to simultaneously process the input sequence in the forward and backward directions. For the forward pass, the input data is fed to the model sequentially, as in a unidirectional LSTM, whereas the backward component takes the input in reverse order, i.e., from time step \({T}_{x}\) to 1. The proposed model consists of several forward and backward layers to learn the temporal relations between the keyframes of dynamic ISL gestures. The final output combines the outputs from the hidden layers of both LSTMs and is then fed to the softmax classifier for classification. Due to this bidirectional structure, the output layer receives information from both the forward (past) and backward (future) states simultaneously, which effectively improves the context available to the model for learning the relation between frames. The complete architecture of the proposed model is shown in Fig. 5. The time-series data is fed to the 2D-CNN model for spatial feature extraction, and these feature vectors are then passed to the S2B-LSTM model to learn the temporal relations of the gestures and predict the final class.

Fig. 5
figure 5

Architecture of 2D-CNN and SBLSTM model for ISL recognition
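Combining the two stages, the overall 2D-CNN + S2B-LSTM pipeline of Fig. 5 could be sketched in Keras roughly as follows. The 18 output classes, 256 × 256 keyframes, two stacked BLSTM layers of 24 units, and the softmax classifier follow the paper; the number of keyframes per video (16), the compact stand-in CNN, and the summation merge are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 16          # keyframes per video (the paper reports 13-20)
NUM_CLASSES = 18         # dynamic ISL gesture words (Table 1)

def build_s2b_lstm(num_frames=NUM_FRAMES, num_classes=NUM_CLASSES):
    """2D-CNN + stacked bidirectional LSTM (S2B-LSTM) sketch."""
    # Compact stand-in for the per-frame CNN extractor of Sect. 4.1
    # (see the earlier sketch for the full three-conv/two-pool version).
    cnn = models.Sequential([
        layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((4, 4)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.GlobalAveragePooling2D(),
    ])
    return models.Sequential([
        layers.Input(shape=(num_frames, 256, 256, 3)),
        layers.TimeDistributed(cnn),                                # spatial features per keyframe
        layers.Bidirectional(layers.LSTM(24, return_sequences=True),
                             merge_mode='sum'),                     # first BLSTM layer
        layers.Bidirectional(layers.LSTM(24), merge_mode='sum'),    # second BLSTM layer
        layers.Dense(num_classes, activation='softmax'),            # gesture class probabilities
    ], name='s2b_lstm')

model = build_s2b_lstm()
model.summary()
```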

5 Experimental Evaluation

The details of the experimental evaluation and the results obtained for ISL dynamic gesture recognition are discussed in this section. The details of the dataset used in this paper are given in Sect. 3. 80% of this dataset is used for training, and the remaining 20% is used for validation of the method. The training of the model has been carried out on Google Colab (with a GPU runtime). In the training phase, data is fed into the model with a batch size of 128 and a learning rate of 0.001. The other hyper-parameters are also empirically fine-tuned.
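A compile-and-fit sketch matching the stated training settings (80/20 split, batch size 128, learning rate 0.001) is given below. The Adam optimizer and the epoch count are assumptions, `model` refers to the S2B-LSTM sketch above, and `X`/`y` stand in for the pre-processed keyframe tensors and one-hot gesture labels, which are not reproduced here.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# X: (360, num_keyframes, 256, 256, 3) keyframe tensors; y: one-hot labels for 18 classes.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate from the paper
    loss='categorical_crossentropy',                          # Eq. 10
    metrics=['accuracy'],                                     # Eq. 9
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=128,   # batch size from the paper
    epochs=60,        # assumed; accuracy stabilizes around the 40th iteration (Fig. 6)
)
```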

5.1 Sign Recognition Results

The experimental analysis of the proposed model is discussed in this section. One of the most commonly used evaluation metrics is accuracy, defined as the ratio of correctly classified samples to the total number of samples (as given in Eq. 9). It is widely used because it is simple to calculate and interpret, and it summarises the model's capability in a single number.

$$Accuracy\,(in\,\%)=\frac{Correctly\;classified\;samples}{Total\;number\;of\;samples}\times 100$$
(9)

For the ISL dataset used, the obtained accuracy plot is shown in Fig. 6. It can be seen that the recognition accuracy increases continuously over the initial iterations and stabilizes at around the 40th iteration. The method achieves a maximum recognition accuracy of 97.6%. Since the model achieves a high accuracy, it can be used in real-world applications to help the deaf community with sign language translation.

Fig. 6
figure 6

Accuracy plot for dynamic gestures

The performance of the proposed model is also assessed by measuring the value of the loss function, which is a way of determining how well a given algorithm models the data. If the predictions are far from the actual values, the loss function returns a large number; conversely, it returns a small value when the difference between the predicted and actual results is small. With the help of an optimization function, the model learns to reduce the prediction error over time. A categorical cross-entropy loss function is used to measure the loss while classifying the multiple dynamic gestures of ISL. Mathematically, this function is expressed by Eq. 10.

$$Loss=-\sum_{i=1}^{n}{Y}_{i}\,\mathrm{log}\,\widehat{{Y}_{i}}$$
(10)

where \(\widehat{{Y}_{i}}\) is the i-th model output value, \({Y}_{i}\) is the corresponding target value, and n is the total number of outputs. The plot of the loss values obtained for this model is shown in Fig. 7. From this figure, it can be seen that the loss drops continuously as the number of iterations increases. For the dataset of dynamic gestures, the average loss of the S2B-LSTM converges to 0.0683.

Fig. 7
figure 7

Loss plot for dynamic gestures
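As a small worked check of Eq. 10, the categorical cross-entropy of a single prediction can be computed directly; the probability vectors below are made-up values for a 4-class example.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum_i Y_i * log(Y_hat_i), Eq. 10 (eps avoids log(0))."""
    return -np.sum(y_true * np.log(y_pred + eps))

# Illustrative 4-class example: the true class is index 2
y_true = np.array([0.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.05, 0.05, 0.85, 0.05])     # confident, correct prediction
print(categorical_cross_entropy(y_true, y_pred))       # ~0.163

y_pred_bad = np.array([0.4, 0.3, 0.1, 0.2])     # poor prediction -> larger loss
print(categorical_cross_entropy(y_true, y_pred_bad))   # ~2.303
```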

5.2 Comparison with Other Methods

In this section, the proposed S2B-LSTM model is compared against state-of-the-art methods for sign language recognition. Table 2 shows the recognition accuracy of various classifiers on dynamic gestures. From the SLRS literature, it is clear that most of the reported work focuses on static gesture recognition, and few research articles are available on ISL dynamic sign recognition. Athira et al. [15] achieved an accuracy of 89% for 11 dynamic gestures. Rekha et al. [8] tested their model on only three dynamic gestures and achieved an accuracy of 77.2%. Bhuyan et al. [27] tested their model on ten dynamic gestures and obtained an accuracy of 95.8%. Ahmed et al. [10] used the DTW method for classifying ISL dynamic gestures and achieved an accuracy of 90% for 24 gestures. Another observation from this comparison is that none of these works used a temporal sequence learning mechanism for ISL recognition, and the efficiency of such mechanisms remains largely unexplored in the field of sign language recognition. The proposed model, consisting of a 2D-CNN and S2B-LSTM, obtains a maximum accuracy of 97.6%.

Table 2 Comparison of S2B-LSTM with other methods for ISL dynamic dataset

6 Conclusion and Future Work

This paper proposed a vision-based approach for the recognition of dynamic gestures of ISL. For this, a hybrid method consisting of a CNN and S2B-LSTM is presented. The CNN model, composed of convolutional and pooling layers, is used as a feature extractor to obtain the spatial features from the input keyframes. These feature vectors are then passed to the S2B-LSTM to extract the temporal information of the video frames. The efficiency of this method is tested on a self-collected dataset consisting of 360 videos of ISL dynamic gestures. The experimental findings confirm the efficiency of the proposed work, yielding a recognition accuracy of 97.6%. Thus, the proposed model can efficiently distinguish diverse hand motions by extracting spatiotemporal features from video. There are further areas of improvement in the proposed work, such as expanding the size of the dataset, testing the performance on thousands of sign gestures, and increasing the proposed model's efficiency under unfavourable environmental conditions. By incorporating such improvements, a vision-based system can also be built to support communication for the hearing-impaired community.