1 Introduction

India has the largest deaf population in the world: “one of every five people who are deaf in the world, lives in India”. Sign language (SL) [44] is the principal medium through which deaf and speech-impaired persons share information. It is composed of gestures defined by hand shapes, movements, locations, and orientations, together with facial expressions. There are two types of SL symbols, single-handed and double-handed, and both types are either static or dynamic. The types of SL with corresponding abbreviations are shown in Fig. 1. The single-handed symbols are represented by the dominant hand only. The dynamic double-handed symbols are grouped into type 0 signs, in which both hands are active, and type 1 signs, in which the dominant hand is more active than the other.

The SL symbols are expressed either by the hands alone, known as manual signs, or by body postures, mouth gestures, and facial expressions, known as non-manual signs. However, most hearing people have no knowledge of such a language. As a result, there is a huge communication barrier between hearing persons and deaf and speech-impaired persons. One solution is to rely on an SL interpreter, but interpreters are not easily available and are very costly over a long period. Therefore, research continues on designing and implementing a system that can recognize SL gestures almost automatically to reduce the communication gap between the two groups of people in society.

Fig. 1

Variation of different types of signs available in Indian sign language

Recently, several research works have aimed to develop recognition systems for different SLs, not only for the deaf population but also for applications in robot control, video games, and virtual reality environments [36, 38]. A lot of research has been done to develop SL recognizers (SLRs) for other countries, and those systems are at an advanced stage. In contrast, Indian SL (ISL) recognition methodologies are still at a growing stage and there is no approach to recognize signs in real time. Therefore, it is necessary to continue the research toward the development of a complete ISL recognition system.

The ISL has different types of posture variability, which makes recognition a complex problem; therefore, recordings of accurate postures are required for framing the dataset. It is also necessary to create an ISL database since no standard ISL dataset is available yet. The research works on sign language recognition reported in past years were based on traditional machine learning, which offers poor accuracy in the absence of automatic feature extraction. In recent times, deep neural network-based approaches have become popular for solving problems in many fields [11] and outperform the traditional techniques in natural language processing [6, 12], computer vision [21, 30, 46, 47], robotics [46], signal processing [34, 59], image processing [8, 62], and various other fields of artificial intelligence. Deep learning based SLRs, in which automatic feature extraction is possible, have also been designed. However, the majority of works in deep learning based SLR are done on sign languages other than ISL.

Based on the necessity stated above, we propose a fingerspelling recognition system for static signs of the ISL alphabet using a convolutional neural network (CNN). The network is built from six convolutional (Conv.) layers with stochastic pooling, batch normalization, and dropout, followed by two fully connected (FC) layers, and is trained with the diffGrad optimization method. Data augmentation is also applied to populate the dataset and achieve better performance. The training and validation accuracies and losses of the presented approach have been obtained for four distinct optimizers and three types of pooling methods. Other performance measures such as precision, recall (or sensitivity), and F1-score have also been presented. The proposed approach offers better results than the remaining examined systems.

The paper is organized as follows. Section 2 describes the related works on ISL and other SLs, while Section 3 explains a basic CNN architecture as an SL symbol classifier. The proposed system architecture is described in Section 4. Section 5 presents the results of our experiments. Finally, the conclusion is drawn with possible future scope in Section 6.

2 Related works

There is a huge variety of SLs worldwide. Some of them are American SL (ASL), Arabic SL (ArSL), Chinese SL (CSL), Indian SL (ISL), Persian SL (PSL), Irish SL (IrSL), and so on. ASL is distributed across the states of America, part of Canada, and Africa. Sun et al. [49] introduced an approach to recognize ASL that consists of 73 signs where mi-SVM and Adaboost were used to train the model and classify data, respectively. Sun et al. [50] also proposed a Latent SVM-based SLR for ASL that yields 82.9% and 86% accuracy for 63 sentences and 73 signs, respectively, prepared using the Kinect sensor. Moreover, the accuracy reached 96.67% when the data glove and camera were fused.

Kim et al. [18] presented an ASL recognizer based on an impulse radio sensor. A CNN architecture was applied to classify signs and achieved more than 90% accuracy. Oyedotun and Khashman [39] introduced an ASL recognition technique for 2040 alphabet signs. The segmentation of these signs was done by applying a median filter. The system achieved 91.33% accuracy with a ConvNet.

ArSL is used in the Middle East and North Africa regions. Al-Rousan et al. [3] designed an SLR for 30 THD type signs of ArSL that extracts features using Discrete Cosine Transform (DCT) and zonal coding and uses a Hidden Markov Model (HMM) for classification. The system offered 93.8% and 90.6% accuracy in signer-dependent and signer-independent modes, respectively. Shanableh and Assaleh [45] proposed an ArSL recognizer for 3450 signs that applied K-nearest neighbor (KNN) to achieve 87% accuracy. Mohandes et al. [35] proposed a signer-independent ArSL recognition method for THS type signs by using a region growing scheme. Dahmani and Larabi [7] presented a framework for recognizing ArSL based on shape descriptors, with classification done using KNN and Support Vector Machine (SVM). A user-dependent ArSL recognizer was presented by Tubaiz et al. [53] for dynamic continuous signs where a modified KNN was applied for classification.

CSL is used across China, Malaysia, and Taiwan. Yang et al. [57] introduced a continuous CSL recognizer for a few sentences where the model used LB-HMM (Level Building HMM) and its variant LB-Fast-HMM for classification to decrease computational complexity. Guo et al. [13] presented a CSL recognizer in which Histograms of Oriented Gradients (HOG) and Principal Component Analysis (PCA) were used for feature extraction; it offered 67.34% accuracy. Jiang et al. [15] proposed a CNN-based CSL recognizer with stochastic pooling, batch normalization, and dropout that yields a maximum accuracy of 90.91%.

PSL is used by deaf people in Iran. Karami et al. [16] presented a discrete wavelet transform-based PSL recognizer for SHS type alphabets. IrSL is used in part of Ireland. Kelly et al. [17] introduced a person-independent recognition system for IrSL. ISL is distributed across South Asia with approximately 2,700,000 users [31]. However, ISL recognition systems are still at a growing stage. Tripathi et al. [51] proposed a continuous ISL gesture recognizer for two-hand gestures using a gradient-based key frame extraction method. Mehrotra et al. [33] introduced a recognition method for 37 two-hand signs of ISL captured by Microsoft Kinect using a multi-class SVM that obtained approximately 86% accuracy. Tripathi et al. [52] presented an approach for recognizing ISL sentences where an HMM was applied for the classification of signs. It achieved 91% accuracy. Kishore et al. [20] introduced an approach for sentence recognition of ISL that offered approximately 90% accuracy.

Naglot et al. [37] applied an ANN to classify OHD type signs of ISL using Leap Motion Controller (LMC). Kumar et al. [22] proposed a continuous ISL recognizer where a mobile front camera was used to collect signs. The system reached 90% accuracy. Ahmed et al. [2] presented an SLR for 24 double-handed dynamic signs where 90% accuracy was achieved. Kumar et al. [23] proposed a sensor-dependent ISL recognizer using leap motion equipment for 50 sign words. The model applied both HMM (95.60% accuracy) and Bidirectional Long Short Term Memory Neural Network (BLSTM-NN) (84.57% accuracy) for classification. Kumar et al. [24] developed an ISL recognition method depending on HMM for 25 OHD type signs applying Kinect with leap motion. Rao and Kishore [43] presented an SLR for selfie video of ISL where DCT was utilized to extract features and offered 90% average accuracy.

Kumar et al. [28] developed an ISL recognition model for 2240 OHS type signs applying leap motion that had 63.57% accuracy for SVM and BLSTM-NN. Kumar et al. [25] introduced an approach for ISL recognition applying Kinect for static sign words. The model achieved approximately 83.77% accuracy. Wadhawan and Kumar [55] presented a systematic literature review of SL recognizers over the last decades. Wadhawan and Kumar [56] presented an ISL recognizer for 100 static signs using a deep learning-based CNN that achieved above 99% accuracy on both colored and grayscale images. Kumar et al. [26] developed a CNN based ISL recognizer for 50,000 sign videos that achieved 92.14% accuracy. Raghuveera et al. [40] introduced an algorithm to translate ISL into English text and speech for a dataset of 4600 images captured by Microsoft Kinect. Mali et al. [32] proposed a system with a computer human interface for ISL using an SVM classifier. Soft computing techniques were also applied in ISL recognition by Sahoo et al. [4].

3 CNN architecture for ISL alphabet recognition system

As an example, a common CNN architecture is depicted in Fig. 2, where Conv. layers are applied for feature extraction and two FC layers are kept for classification. Methodologies of CNN architecture for ISL recognition system are explained in the following sub-sections.

Fig. 2

An example of CNN architecture for ISL recognition system

3.1 Convolutional layer

The CNN model is specially designed for 2D images, although it may also be applied to 1D and 3D data. Central to this network is the Conv. layer, which performs the convolution operation on the data. In a CNN, this linear operation involves the multiplication of a set of weights with the input. For a 2D image, the multiplication is done between the input image and a 2D array of weights, termed a filter. Applying the same filter systematically across the entire input, left to right and top to bottom, is an effective way to find a specific feature anywhere in the input. The output of the convolution operation is a 2D array termed a ‘feature map’. Strictly speaking, the convolution operation as used in the context of CNNs is a ‘cross-correlation’ operation. The Conv. layer output size for an input image of size n and a filter of size f is given by Eq. (1), where s and p represent stride and padding, respectively.

$$\begin{aligned} {Output\_size} = \lfloor \frac{(n + 2p-f)}{s} +1 \rfloor \end{aligned}$$
(1)

Figure 3 demonstrates an example of convolution operation for stride \(s = 1\) and padding \(p = 0\), where a filter with size 3 is applied to a \(6\times 6\) 2D input image to produce a feature map. The size of the feature map is \(\lfloor \frac{(n + 2p-f)}{s} +1 \rfloor = \lfloor \frac{(6-3)}{1} +1 \rfloor = 4\).

Fig. 3

Example of a filter applied to a 2D image to create a feature map
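
Eq. (1) and the sliding-window operation described above can be illustrated with a minimal sketch in plain Python. The input image and the vertical-edge filter below are hypothetical values chosen only for illustration; they are not taken from Fig. 3.

```python
def conv_output_size(n, f, s=1, p=0):
    """Eq. (1): output size for input size n, filter size f, stride s, padding p."""
    return (n + 2 * p - f) // s + 1


def cross_correlate2d(image, kernel, s=1):
    """Slide a square kernel over a square image (no padding),
    left to right and top to bottom, summing element-wise products."""
    n, f = len(image), len(kernel)
    out = conv_output_size(n, f, s)
    feature_map = []
    for r in range(out):
        row = []
        for c in range(out):
            total = 0
            for i in range(f):
                for j in range(f):
                    total += image[r * s + i][c * s + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 6x6 input: bright left half (1), dark right half (0).
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
# Hypothetical 3x3 vertical-edge filter.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = cross_correlate2d(image, kernel)
print(conv_output_size(6, 3))  # 4
print(feature_map[0])          # [0, 3, 3, 0]
```

The non-zero entries of the feature map mark the columns where the window straddles the bright-to-dark edge, which is the sense in which one filter finds a specific feature anywhere in the input.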

3.2 Pooling layer

The pooling layer is placed after the Conv. layer within a CNN and processes each feature map individually to produce the same number of pooled feature maps. Pooling applies an operation over a window (similar to a filter) whose dimension is smaller than that of the feature map. A pooling operation with a \(2 \times 2\) window and a stride of 2 pixels halves each dimension of the feature map. There are two common pooling schemes: max pooling and average pooling. The former keeps the highest value of each patch and the latter keeps the average of each patch. For the set of activations Z in a pooling region R defined in Eq. (2), written \(Z_R\) in the following equations, max and average pooling are obtained by Eq. (3) and Eq. (4), respectively, where \(|Z_R|\) is the number of activations in the region.

$$\begin{aligned} {Z} = [z_i |i \in R] \end{aligned}$$
(2)
$$\begin{aligned} {Max ~ pooling (P_M)} = Max (Z_R) \end{aligned}$$
(3)
$$\begin{aligned} {Average ~ pooling (P_A)} = \frac{\sum Z_R}{|Z_R|} \end{aligned}$$
(4)

An example of both pooling operations is illustrated for \(2 \times 2\) filter and stride of 2 in Fig. 4. For input with multiple channels, pooling reduces the dimension but keeps the number of channels unchanged.

Fig. 4

Example of Max Pooling and Average Pooling

The pooled feature maps summarize the features found in the input image, so a minor shift in the position of a feature in the input still yields the same value at the same position in the pooled feature map. This capability is termed the “model’s invariance to local translation”.
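
Eqs. (3) and (4) can be sketched in a few lines of plain Python. The \(4 \times 4\) feature map below is a hypothetical example, not the one shown in Fig. 4, and the function name `pool2d` is our own.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to one 2D feature map."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Z_R: the activations inside the current pooling region R.
            region = [feature_map[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(region))                 # Eq. (3)
            else:
                out_row.append(sum(region) / len(region))   # Eq. (4)
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 feature map.
fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [4, 2, 3, 1],
      [8, 6, 7, 5]]

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="average")
print(max_pooled)  # [[7, 8], [8, 7]]
print(avg_pooled)  # [[4.0, 5.0], [5.0, 4.0]]
```

With a \(2 \times 2\) window and a stride of 2, the \(4 \times 4\) map becomes \(2 \times 2\), i.e., each dimension is halved.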

3.3 Batch normalization

Training deep networks is very challenging because the distribution of a layer’s inputs changes whenever the parameters of the previous layers change. This may slow down the training process. Batch Normalization (BN) [14] is applied in training deep networks to standardize the inputs to a layer. The technique dramatically reduces the number of epochs needed to train the network. The BN layer is placed after the Conv. layer or an FC layer (see Fig. 5). For a mini-batch of n inputs [\(a_i\)] and outputs [\(b_i\)], with learnable parameters \(\gamma\) and \(\beta\) and a small constant \(\epsilon\) for numerical stability, the mini-batch mean, mini-batch variance, normalized value, and BN output are defined in Eqs. (5), (6), (7), and (8), respectively.

Fig. 5

Illustration of batch normalization

$$\begin{aligned} {Mini{-}batch ~ mean (\mu )} = \frac{1}{n}\sum \limits _{i=1}^{n} a_i \end{aligned}$$
(5)
$$\begin{aligned} {Mini{-}batch ~ variance ({\sigma }^2)} = \frac{1}{n}\sum \limits _{i=1}^{n} (a_i - \mu )^2 \end{aligned}$$
(6)
$$\begin{aligned} {Normalized ~ value (\hat{a_i})} = \frac{(a_i - \mu )}{\sqrt{{\sigma }^2 + \epsilon }} \end{aligned}$$
(7)
$$\begin{aligned} {b_i} = \gamma \hat{a_i} + \beta \equiv BN_{\gamma ,\beta }(a_i) \end{aligned}$$
(8)
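
As a minimal sketch, Eqs. (5) to (8) can be applied to a single feature over one mini-batch as follows. The input values and the default \(\epsilon = 10^{-5}\) are assumptions made for illustration; in a real network \(\gamma\) and \(\beta\) are learned during training.

```python
from math import sqrt


def batch_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize a mini-batch `a` of scalar activations."""
    n = len(a)
    mu = sum(a) / n                                   # Eq. (5): mini-batch mean
    var = sum((x - mu) ** 2 for x in a) / n           # Eq. (6): mini-batch variance
    a_hat = [(x - mu) / sqrt(var + eps) for x in a]   # Eq. (7): normalized value
    return [gamma * x + beta for x in a_hat]          # Eq. (8): scale and shift


# Hypothetical mini-batch of four activations.
b = batch_norm([2.0, 4.0, 6.0, 8.0])
mean_b = sum(b) / len(b)
var_b = sum((x - mean_b) ** 2 for x in b) / len(b)
print(mean_b, var_b)  # close to 0 and 1, respectively
```

With \(\gamma = 1\) and \(\beta = 0\) the output has approximately zero mean and unit variance; other values of \(\gamma\) and \(\beta\) rescale and shift that standardized distribution.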

3.4 Dropout

Deep neural nets may quickly overfit a training dataset, which is a serious problem. The dropout technique [48] can deal with this problem. During training, it randomly and temporarily removes units from the neural network along with their incoming and outgoing connections, which produces a thinned network containing only the units that survived dropout. It not only reduces overfitting but also offers significant improvements over other regularization schemes. Fig. 6(a) and (b) show an example of a standard neural network and its corresponding thinned neural network, respectively. Some neural units are discarded from each layer of the thinned neural network at a specific dropout rate.

Fig. 6

Example of dropout Neural Network: (a) A standard Neural Network; (b) A thinned Neural Network produced after applying dropout
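
The thinning of a layer can be sketched as below. The rescaling of surviving units by 1/(1 − rate), often called inverted dropout, is a common implementation convention that we assume here; it is not prescribed by the text above. The layer values and the seed are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each unit with probability `rate` during training.

    Surviving units are rescaled by 1 / (1 - rate) (inverted dropout,
    an assumed convention) so the expected activation is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in activations]


rng = random.Random(0)          # fixed seed only for reproducibility
layer = [1.0] * 10              # hypothetical layer activations
thinned = dropout(layer, rate=0.3, rng=rng)
print(thinned)                  # dropped units appear as 0.0
```

At test time no units are dropped, which corresponds to calling the function with `rate=0.0`.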

3.5 Optimizer

Optimizers are applied in a neural network to tune attributes such as the weights and learning rate in order to reduce the loss and produce results that are as accurate as possible. Stochastic Gradient Descent (SGD) [5] is an efficient and popular optimization method. The gradient gives the direction in which a function changes most steeply, and the model parameters are updated after the loss is calculated on each training example instead of once over the whole training set as in Gradient Descent. However, the major limitation of SGD is that it uses the same learning rate for all parameters. It also has a high variance in the parameter updates and may overshoot even after reaching the global minimum. SGD can be improved further by adapting the step size for each parameter. Many optimization methods such as AdaGrad [10], AdaDelta [60], and adaptive moment estimation (Adam) [19] were proposed for this purpose.
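
The difference between a single shared step size and a per-parameter adaptive step size can be sketched with one update of SGD and one update of AdaGrad. The parameter values, gradients, and learning rate below are hypothetical, and the AdaGrad step follows its standard textbook form.

```python
from math import sqrt


def sgd_step(params, grads, lr=0.1):
    """Plain SGD: every parameter uses the same learning rate."""
    return [w - lr * g for w, g in zip(params, grads)]


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """AdaGrad: each parameter's step is scaled by its own gradient history."""
    accum = [a + g * g for a, g in zip(accum, grads)]
    params = [w - lr * g / (sqrt(a) + eps)
              for w, g, a in zip(params, grads, accum)]
    return params, accum


# Hypothetical parameters with a shallow and a steep gradient direction.
params = [1.0, 1.0]
grads = [2.0, 20.0]

sgd_params = sgd_step(params, grads)
ada_params, accum = adagrad_step(params, grads, accum=[0.0, 0.0])
print(sgd_params)  # about [0.8, -1.0]: the steep direction overshoots zero
print(ada_params)  # about [0.9, 0.9]: both directions take a similar step
```

In this toy case the shared learning rate makes the second parameter jump past zero, whereas the adaptive rule normalizes each step by the magnitude of that parameter's own gradients.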

4 Proposed ISL alphabet recognition system

Most of the existing recognition systems use either max pooling or average pooling as discussed in Section 3.2. However, both pooling methods have limitations: max pooling retains more texture information, whereas average pooling retains more background information of the image. The proposed system applies stochastic pooling [61] to reduce these limitations. The method picks the activation within each pooling region randomly according to a multinomial distribution and is free of hyperparameters. Stochastic pooling first computes the probability \(p_i\) of each activation \(a_i\) in a pooling region \(R_j\) by Eq. (9), where it is written \(P_i\). It then picks a location l within the region randomly based on these probabilities, \(l\sim P(p_1, \ldots , p_{|R_j|})\). The pooled activation is then simply \(a_l\), as given in Eq. (10). An example of the stochastic pooling procedure is illustrated in Fig. 7. In this example, the chosen element is not the largest one in the pooling region.

$$\begin{aligned} {P_i} = \frac{a_i}{\sum \limits _{k\in R_j}a_k} \end{aligned}$$
(9)
$$\begin{aligned} {S_j} = a_l \end{aligned}$$
(10)
Fig. 7

Example of Stochastic Pooling
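
Eqs. (9) and (10) can be sketched for a single pooling region as follows. The activations are hypothetical and assumed to be non-negative (for example, after ReLU), and the function name is our own; the sampling relies on `random.choices` from the Python standard library.

```python
import random


def stochastic_pool_region(activations, rng):
    """Pool one region by sampling an activation with probability
    proportional to its value (assumes non-negative activations)."""
    total = sum(activations)
    if total == 0:
        # All activations are zero, so the pooled value is zero.
        return 0.0, [0.0] * len(activations)
    probs = [a / total for a in activations]               # Eq. (9)
    pooled = rng.choices(activations, weights=probs)[0]    # Eq. (10): S_j = a_l
    return pooled, probs


rng = random.Random(0)                 # fixed seed only for reproducibility
region = [1.6, 0.0, 0.0, 2.4]          # hypothetical 2x2 region, flattened
pooled, probs = stochastic_pool_region(region, rng)
print(probs)    # about [0.4, 0.0, 0.0, 0.6]
print(pooled)   # either 1.6 or 2.4, depending on the random draw
```

Larger activations are more likely to be picked, but smaller non-zero activations can also be selected, which is what distinguishes this scheme from max pooling.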

Again, the optimizers mentioned in Section 3.5 cannot take advantage of local changes in the gradients since they depend only on past gradients. Therefore, the diffGrad optimization method [9], which depends on the difference between the present gradient and the past gradient, is chosen for our work. The optimizer takes a larger step for parameters whose gradient changes quickly and a smaller step for parameters whose gradient changes slowly.
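
A minimal sketch of one diffGrad update for a single scalar parameter is given below. It follows our understanding of the published formulation in [9]: Adam-style first and second moments with bias correction, multiplied by a friction coefficient computed as the sigmoid of the absolute gradient difference. The hyperparameter defaults, the `state` dictionary, and the toy loss are assumptions made for illustration and are not the settings used in our experiments.

```python
from math import exp, sqrt


def diffgrad_step(theta, grad, state, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update for a scalar parameter.

    Returns the new parameter value and the friction coefficient xi.
    """
    state["t"] += 1
    t = state["t"]
    m = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    v = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Friction coefficient: near 0.5 when the gradient barely changes,
    # approaching 1 when it changes quickly.
    xi = 1.0 / (1.0 + exp(-abs(state["prev_grad"] - grad)))
    state.update(m=m, v=v, prev_grad=grad)
    return theta - lr * xi * m_hat / (sqrt(v_hat) + eps), xi


def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0, "prev_grad": 0.0}


# Toy loss theta**2 with gradient 2 * theta (hypothetical example).
theta, state = 1.0, new_state()
for _ in range(5):
    theta, xi = diffgrad_step(theta, 2.0 * theta, state, lr=0.1)
print(theta, xi)
```

When two consecutive gradients are identical the coefficient equals 0.5, so the step is half of the corresponding Adam step; a large change in the gradient pushes the coefficient toward 1.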

The proposed ISL alphabet recognizer is designed to recognize 26 static signs of the ISL alphabet. It has five main phases: data collection, resizing and normalization, training data augmentation, classifier training, and testing (see Fig. 8). These are explained as follows.

Fig. 8

The schematic diagram of the proposed ISL alphabet recognition system

4.1 Data collection

The dataset is prepared using a camera under various lighting conditions and backgrounds. It includes 26 static signs of the ISL alphabet, each having 2400 images. Therefore, there are 2400 \(\times\) 26 = 62,400 images in total.

4.2 Data resizing and normalization

Image resizing and normalization are applied for preprocessing. Since the neural network model takes inputs of a fixed size, all collected images need to be resized to that size before they are fed to the network. The larger the target image size, the less shrinking is needed, which means less deformation of the patterns and features in the data. Normalizing the pixel values of the images so that each pixel has a value between 0 and 1 is also good practice. It helps to speed up the learning process and leads to faster convergence. Therefore, the collected images are resized to \(256 \times 256\) and the pixel values are then normalized by dividing by 255, the highest possible pixel value.
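
The two preprocessing steps can be sketched on a tiny grayscale image as follows. Nearest-neighbour interpolation is an assumption made only to keep the sketch short, since the interpolation method is not specified above, and a \(4 \times 4\) target stands in for the \(256 \times 256\) size.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2D image with nearest-neighbour sampling (assumed method)."""
    h, w = len(image), len(image[0])
    return [[image[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]


def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0, 1]."""
    return [[px / 255 for px in row] for row in image]


raw = [[0, 255],
       [128, 64]]                      # hypothetical 2x2 grayscale image
resized = resize_nearest(raw, 4, 4)    # stand-in for the 256x256 target
scaled = normalize(resized)
print(resized[0])   # [0, 0, 255, 255]
print(scaled[0])    # [0.0, 0.0, 1.0, 1.0]
```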

4.3 Training data augmentation

Data augmentation artificially generates new variations of images from the existing training images. It significantly increases the amount and diversity of images for training. In this way, the dataset becomes rich and sufficient, and the model offers better performance. Augmentation techniques are applied to our dataset for this reason. The model is trained using 49,920 images (80\(\%\) of the total 62,400 images) and tested using the remaining 20\(\%\), i.e., 12,480 images. Training images are augmented using zoom (scaling images by a factor), rotation (rotating images by an angle), shear (shearing by an angle in the counter-clockwise direction, in degrees), width and height shift, and ZCA whitening (decorrelating the pixel values as in PCA whitening while preserving the spatial arrangement of pixels, which is important for a CNN). As a result, each image generates 90 new images, so that together with the originals there are 49,920 \(\times\) 91 = 4,542,720 images in the augmented training data.
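
The dataset sizes quoted above follow from simple arithmetic, which the short sketch below reproduces; the variable names are our own.

```python
classes = 26
images_per_class = 2400
total = classes * images_per_class            # 62,400 images

train = total * 80 // 100                     # 80% for training: 49,920
test = total - train                          # remaining 20%: 12,480

new_images_per_original = 90
# Each original image is kept alongside its 90 augmented variants.
augmented_train = train * (new_images_per_original + 1)

print(total, train, test, augmented_train)    # 62400 49920 12480 4542720
```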

4.4 Model training

The system is trained as a CNN classifier using images of 26 signs, one for each of the 26 letters of the English alphabet represented in ISL. The training process tunes the parameters of the network until the accuracy is satisfactory. The dataset is shuffled to add randomness to the training of the CNN and to prevent bias in the parameters. The weights are initialized randomly close to zero, which breaks symmetry so that the neurons no longer all perform the same computation, and the model offers better performance.

4.5 Testing

The model is tested using 20\(\%\), i.e., 12,480 images. The trained model is used to predict the images of the test dataset and the results in terms of classification performance are measured using metrics like validation accuracy, validation loss, precision, recall, and F1-score.

5 Experimental results

The ISL dataset for our experiment has been created by different volunteers using a camera. A sample of the ISL alphabet dataset used for our experiment is shown in Fig. 9. The dataset has 26 classes, each having 2400 color images of size \(256\times 256\).

Fig. 9

Sample images of our ISL alphabet dataset

5.1 Data augmentation results

Augmentation methods such as zooming, rotation, shear, and width and height shift are applied to the dataset to expand it significantly, which is beneficial for deep learning. Augmentation also improves accuracy by reducing overfitting. Sample augmented images of our ISL alphabet dataset are depicted in Fig. 10.

Fig. 10

Sample augmented images of our ISL alphabet dataset

5.2 Structure of the proposed CNN

The CNN architecture has been fine-tuned and finalized as an eight-layer CNN with six Conv. layers followed by two FC layers (FC_1 and FC_2) as given in Table 1, where Block_2 (64 \(5\times 5\times 32\) Conv, BN, ReLU, SP) represents a Conv. layer with 64 filters of size \(5\times 5\) over 32 input channels, followed by batch normalization (BN), ReLU activation, and stochastic pooling (SP). A 30% dropout is applied between the sixth Conv. layer and the first FC layer.

Table 1 The details of the proposed CNN architecture

5.3 Classification performance

After tuning the parameters of the model, the training accuracy, validation accuracy, training loss, and validation loss of the model are obtained using the diffGrad optimizer for 10 runs and reported in Table 2. The highest and lowest training accuracies are 99.76% and 98.96%, respectively, and the highest and lowest validation accuracies are 99.64% and 98.98%, respectively. The average training and validation accuracies over the 10 runs are 99.52 ± 0.2334% and 99.36 ± 0.2595%, respectively. The accuracy and loss curves for the first run are illustrated in Figs. 11(a) and 11(b), respectively. The model is trained only up to the 30th epoch because the accuracy stops improving beyond it.

The precision (Pr), recall (Rc), and F1-score (Fs) defined in Eqs. (11), (12), and (13), respectively, have also been evaluated for the first run using the numbers of true positives (TP), false positives (FP), and false negatives (FN). The Pr, Rc, and Fs of all 26 signs are given in Table 3.

$$\begin{aligned} {Pr} = \frac{TP}{(TP+FP)} \end{aligned}$$
(11)
$$\begin{aligned} {Rc} = \frac{TP}{(TP+FN)} \end{aligned}$$
(12)
$$\begin{aligned} {Fs} = \frac{2*Pr*Rc}{(Pr+Rc)} \end{aligned}$$
(13)
Table 2 The training and validation accuracies of the proposed model
Fig. 11

Accuracy and loss curves for (a) training and (b) validation

Table 3 Classification performance
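
Eqs. (11) to (13) can be computed per class as in the following sketch. The counts used in the example are hypothetical and are not taken from Table 3; returning 0.0 when a denominator is zero is a convention we assume here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Pr, Rc, and Fs from per-class counts.

    A ratio with a zero denominator is reported as 0.0 (assumed convention).
    """
    pr = tp / (tp + fp) if tp + fp else 0.0          # Eq. (11)
    rc = tp / (tp + fn) if tp + fn else 0.0          # Eq. (12)
    fs = 2 * pr * rc / (pr + rc) if pr + rc else 0.0 # Eq. (13)
    return pr, rc, fs


# Hypothetical counts for one sign class.
pr, rc, fs = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(pr, 4), round(rc, 4), round(fs, 4))  # 0.9 0.75 0.8182
```

The F1-score is the harmonic mean of precision and recall, so it lies between the two and is pulled toward the smaller value.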

5.4 Pooling method comparison

In this study, stochastic pooling has been compared with average and max pooling as shown in Table 4. Stochastic pooling achieves better average training and validation accuracies, 99.52 ± 0.2334\(\%\) and 99.36 ± 0.2595\(\%\), respectively, than the other two methods. However, max pooling offers the highest single-run training and validation accuracies among the three pooling methods, 99.84\(\%\) and 99.73\(\%\), respectively, in the fourth run.

Table 4 Pooling method comparison

5.5 Dropout rate

The average training and validation accuracies for varying dropout rates (10 runs for each) are obtained and plotted in Fig. 12(a) and (b), respectively. Both accuracies increase as the dropout rate increases from 0\(\%\), and the maximum training and validation accuracies of 99.52 ± 0.2334\(\%\) and 99.36 ± 0.2595\(\%\), respectively, are achieved for a 30\(\%\) dropout rate. A further increase of the dropout rate reduces the accuracy gradually. Hence, the dropout rate is chosen as 30\(\%\).

Fig. 12

(a) Error bar of average training accuracy against dropout rate and (b) Error bar of average validation accuracy against dropout rate

5.6 Experimental results with respect to optimizer

The presented model has been tested using four optimization algorithms for 10 runs each. The average accuracy and loss for these optimizers are reported in Table 5. The validation accuracy obtained with diffGrad is higher than that obtained with the RMSProp, Adam, and SGD optimizers. The proposed model offers average training and validation accuracies of 99.52% and 99.36%, respectively, using the diffGrad optimizer.

Table 5 Comparison of different optimizers

5.7 Comparison to the state-of-the-art approaches

The recognition rate of the presented ISL recognition approach is compared with that of other approaches in Table 6. The presented approach outperforms the other listed models since it uses a deep learning-based CNN with multiple techniques: data augmentation to expand the dataset, batch normalization to standardize the inputs to a layer, dropout to handle overfitting, stochastic pooling to resolve the overfitting and down-weight issue, and the diffGrad optimizer.

Table 6 Comparison of the proposed system with other classifiers for static single/double handed signs

5.8 Performance comparison of the proposed model architecture with standard architectures

The performance of the proposed model architecture is also compared with Inception V3, ResNet18, and ResNet50 standard architectures in Table 7. It is noticed that the average training and validation accuracies of the proposed model architecture are very close to these standard architectures.

Table 7 Performance comparison of the proposed model architecture with standard architecture

5.9 Prediction time of test set

The network takes only 0.728 seconds to predict the 12,480 images of the test set and is therefore well suited to real-time output of SL interpretation.

6 Conclusion and future scope

This paper has proposed an ISL alphabet recognition system using an optimized CNN with data augmentation, stochastic pooling, batch normalization, dropout, and the diffGrad optimizer. The model offers maximum training and validation accuracies of 99.76\(\%\) and 99.64\(\%\), respectively. The accuracies of the model with stochastic, average, and max pooling have been compared, and stochastic pooling reduces overfitting significantly. We have also tested the system using four optimizers and noticed that diffGrad outperformed the RMSProp, Adam, and SGD optimizers. The proposed system can also eventually be extended to learn all static signs of ISL.

However, the proposed system still fails to recognize dynamic signs and real-time sign words. These can be regarded as open problems for the time being. In the future, the work should be continued to design a system that recognizes dynamic ISL signs and ISL signs in real time.