1 Introduction

According to a survey, there are more than 360 million hearing-impaired people in the world, and nearly 27.9 million deaf people in China alone, which is a very large group [15]. Sign language is the main way for hearing-impaired people to communicate directly, but most hearing people do not understand or are not familiar with it, which creates a large communication gap between the two groups. As a result, hearing-impaired people face great challenges in employment, education, and daily services such as medical treatment and legal counseling. Communication barriers also lead to a loss of social resources, such as the labor and talent of this group. Real-time translation between sign language and spoken language by human interpreters is one solution, but it requires advance scheduling, is costly, and is often impractical. Therefore, researchers have turned to artificial intelligence and machine learning to develop automatic sign language recognition and translation. Research on sign language recognition is significant: it can not only reduce the communication barriers between hearing-impaired and hearing people and promote the integration of hearing- and speech-impaired people into society, but also advance the development of friendlier and more intelligent human-computer interaction interfaces.

Sign Language (SL) is a complete communication system consisting of elements such as hand shape, movement, facial expression and posture. Chinese Sign Language (CSL) can be divided into two categories: finger sign language and gesture sign language (see Fig. 1). Thirty finger letters (the 26 single letters a-z and the 4 double letters zh, ch, sh, ng) and some numbers constitute the basic units of finger sign language, and each Chinese pinyin letter is represented by a finger shape. Finger sign language is easy to learn and has a small number of gestures, can readily express professional terminology and abstract concepts, and occupies an important position in sign language recognition [3]. Gesture sign language conveys meaning through the shape and movement of the gesture, and is relatively difficult to use and recognize.

Fig. 1 Chinese sign language types

Sign Language Recognition (SLR) refers to the use of computer technology to translate or convert sign language into text, speech and other information, so that others can understand it and communicate [1]. Early research on sign language recognition focused on designing dedicated hardware devices for data input, and then on marker-based gestures and the human palm. After that, the recognition of natural, unmarked hands became a popular trend. At present, in terms of data input, sign language recognition technology can be divided into contact and non-contact approaches. Commonly used sensors include data gloves, EMG signal armbands and depth cameras. Data glove-based sign language recognition is highly accurate, but data gloves are expensive and inconvenient to carry, which makes them difficult to popularize. Sign language recognition using an EMG signal armband shares the portability problem and has a further weakness: when the EMG signal is weak or the signs change quickly, the arm muscle movement cannot be captured clearly and the sign cannot be recognized accurately. In addition, neither of these methods can effectively distinguish the same gesture pointing to different locations. This problem does not arise with depth camera-based sign language data. The computer vision-based solution obtains information from video images and completes the recognition by means of image processing technology [8]. It is free from the constraints of hardware devices, flexible and convenient to operate, and does not interfere with the user, so it is generally welcomed by the market. This field has therefore attracted a large number of researchers.

In order to improve recognition accuracy and practical performance, a number of classical and effective image processing [32,33,34] and recognition algorithms [7, 10, 26, 29, 37, 39] are widely used, such as the hidden Markov model (HMM), support vector machine (SVM), k-nearest neighbor (k-NN), artificial neural network (ANN), dynamic time warping (DTW), long short-term memory network (LSTM), skin color modeling, random forest, extreme learning machine (ELM), recurrent neural network (RNN) and convolutional neural network (CNN), together with their variants. Kumar, et al. [11] employed HMM to develop a position- and rotation-invariant framework for sign language recognition. Lee, et al. [13] combined SVM and HMM to recognize Taiwanese sign language. Yang and Lee [38] proposed a new method called hierarchical conditional random fields (HCRF). By combining dynamic time warping with secondary classification, Lichtenauer, et al. [17] obtained an average recognition rate of 92.3%. Li, et al. [16] combined HMM, K-means and the ant colony algorithm for Taiwanese sign language recognition, reaching an average recognition rate of 91.3%. Pariwat and Seresangtakul [24] presented a finger-spelling sign language system based on SVM kernels with an average accuracy of 91.2%. An ANN classifier trained by P. V. V. Kishore achieved an average word matching score of over 90% [27].

These technologies have achieved favorable results, but each has its own disadvantages. For example, HMM is a statistical model that depends on successful detection, which makes it difficult to use for real-time recognition. DTW requires templates to be created in advance, which involves a large amount of work. ELM often lacks good generalization performance and robustness in gesture recognition. Classification based on SVM and k-NN places higher demands on feature extraction and consumes a large amount of system resources during classification. Some combined algorithms have relatively high recognition rates, but their data sets are insufficient. Neural network technology offers a new solution: it has strong self-learning and self-organizing abilities, a distributed structure, and good resistance to noise. Deep learning is derived from artificial neural networks and combines lower-layer features into more abstract high-level representations in order to discover the distribution characteristics of the data. As the most classical deep neural network, the convolutional neural network (CNN) is very suitable for image classification and recognition [12, 21]. In particular, it can be trained directly on multi-dimensional image samples, avoiding the complex manual feature extraction required by traditional recognition algorithms [4].

In this paper, based on image processing techniques [25, 31, 35, 36], we propose an eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for Chinese finger sign language recognition. The hyperparameters of each layer of this CNN were tuned. Stochastic pooling and data augmentation were introduced to improve performance. In the experiments, we compared stochastic pooling against average pooling and maximum pooling. Finally, our method was found to outperform the state-of-the-art approaches it was compared with.

The contributions of this paper are as follows: (i) we used several advanced techniques to overcome common issues in traditional CNNs: stochastic pooling and dropout were employed to avoid overfitting, batch normalization was applied to speed up learning convergence, and data augmentation was adopted to enlarge the training set; (ii) our study offers an opportunity to reduce the communication barriers between hearing-impaired and hearing people and to promote the integration of hearing-impaired people into society; (iii) vision-based sign language recognition is free from hardware constraints and does not interfere with the user, which makes it flexible and convenient.

2 Dataset

2.1 Data collection

According to the Chinese finger sign language standard, there are 30 categories, including 26 basic monosyllabic letters and 4 double-syllable letters. More than 44 volunteers were selected from different departments to help create the self-built sign language image database using a camera. Figure 2 shows some samples of the main hand shapes in the 30 categories, cropped from the sample images. A total of 1320 images of Chinese finger sign language were acquired and normalized to 256 × 256 background-optimized samples. Our experiment was carried out on these 1320 pre-processed Chinese finger sign language samples.

Fig. 2 Part of samples of Chinese finger sign language

2.2 Data augmentation

The hold-out validation method was used. Of the 1320 images, 80% (1056 images) were used for training, and the remaining 264 images were used for testing. Data augmentation was applied to the 1056 training images.

(i) Scaling. Images were scaled with a scaling factor s from 0.7 to 1.3 in steps of 0.02.

(ii) Noise injection. Zero-mean Gaussian noise with variance 0.01 was added to each sign language image in the original dataset to generate 30 new noisy images.

(iii) Random translation. Each hand gesture image was randomly shifted 30 times. The random shift in the horizontal and vertical directions, t = [tx, ty], lies in the range [−15, 15] pixels and follows a uniform distribution.

(iv) Gamma correction. The gamma factor R varied over the range [0.4, 1.6] in steps of 0.04.

(v) Affine transform. It deformed the images while preserving straight lines.

(vi) PCA color augmentation. It shifted the color values that were most present in the original images.

Thus, each original image generates 180 new images (30 from each of the six methods). The augmented training set therefore contains 1056 × 181 = 191,136 images, as shown in Table 1. The experiment was repeated ten times, and the data split was re-drawn randomly each time.
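The parameter grids and the image count above can be checked with a short sketch. The NumPy code below is only an illustration under stated assumptions, not the implementation used in this study: the function names, the [0, 1] intensity range, the wrap-around shift via `np.roll`, and the exclusion of the identity value 1.0 from the scaling and gamma grids are all assumptions made for this example.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma correction for an image with values in [0, 1] (assumed range)."""
    return np.clip(img, 0.0, 1.0) ** gamma

def add_gaussian_noise(img, rng, variance=0.01):
    """Add zero-mean Gaussian noise with the given variance, then clip to [0, 1]."""
    noise = rng.normal(0.0, np.sqrt(variance), size=img.shape)
    return np.clip(img + noise, 0.0, 1.0)

def random_translate(img, rng, max_shift=15):
    """Shift by (tx, ty) drawn uniformly from [-max_shift, max_shift] pixels.
    np.roll wraps pixels around the border (an assumption; zero padding is another option)."""
    tx, ty = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, shift=(int(ty), int(tx)), axis=(0, 1))

# Grids from the text: 0.7..1.3 in steps of 0.02 and 0.4..1.6 in steps of 0.04.
scales = np.linspace(0.7, 1.3, 31)
gammas = np.linspace(0.4, 1.6, 31)
# Assumption: the identity value 1.0 is skipped, leaving 30 new images per method.
new_scales = scales[~np.isclose(scales, 1.0)]
new_gammas = gammas[~np.isclose(gammas, 1.0)]

rng = np.random.default_rng(0)
img = rng.random((256, 256))  # stand-in for one 256 x 256 training image

augmented = [gamma_correct(img, g) for g in new_gammas]
augmented += [add_gaussian_noise(img, rng) for _ in range(30)]
augmented += [random_translate(img, rng) for _ in range(30)]

new_per_image = 6 * 30                       # six methods, 30 new images each
total_training = 1056 * (1 + new_per_image)  # originals plus augmented copies
print(len(new_scales), len(new_gammas), len(augmented), total_training)
```

Scaling, affine transform and PCA color augmentation are omitted from the sketch because they need an image resampling routine or color images.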

Table 1 Partition of dataset

3 Methodology

3.1 Convolutional layer

The convolutional neural network is a typical feedforward neural network with a deep, multi-layer structure that combines feature extraction and classification. It generally contains an input, convolutional layers, pooling layers, fully connected layers, and an output. The convolutional layer extracts features from the input data by the convolution operation, and the pooling layer reduces the data dimensionality, which limits the computational burden and helps to prevent over-fitting. The fully connected layer mainly performs classification. As needed, auxiliary components such as batch normalization and dropout can be added between these layers.

The convolutional layer is composed of a number of learnable convolutional units. Each subsequent convolutional layer extracts more complex features from the low-level features of the previous layer, so that stacking more convolutional layers finally achieves feature extraction of the target object [28]. Figure 3 shows the whole process of a convolutional layer: the input passes through a series of filters and is output as feature maps (see Table 2 for the sizes).

Fig. 3 Illustration of convolution operation

Table 2 Size of Input, Filter and Output

The two-dimensional convolution of the convolutional layer is performed between the three-dimensional input and the learned filters along the width and height directions [20]. Suppose that the sizes of the input, filter and output are as given in Table 2. The feature map size is then calculated as follows:

$$ {W}_3=1+\frac{{W}_1-{W}_2+2P}{S} $$
(1)
$$ {H}_3=1+\frac{{H}_1-{H}_2+2P}{S} $$
(2)

Here, the input size is W1 × H1 × C, each filter has size W2 × H2 × C, and the output size is W3 × H3 × M. Among the hyperparameters, M is the number of filters, S is the stride, and P is the padding size.
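As a quick check of Eqs. (1) and (2), the following sketch computes the spatial output size of a convolutional layer. The function name and the use of floor division when the stride does not divide evenly are assumptions for illustration; the padding value of 1 in the examples is also assumed, since the paper does not state it here.

```python
def conv_output_size(w1, w2, padding, stride):
    """Output width (or height) from Eq. (1)/(2): 1 + (W1 - W2 + 2P) / S.
    Floor division is assumed when the stride does not divide evenly."""
    return 1 + (w1 - w2 + 2 * padding) // stride

# A 256-pixel-wide input with a 3-wide filter and assumed padding of 1:
print(conv_output_size(256, 3, padding=1, stride=1))  # 256
print(conv_output_size(256, 3, padding=1, stride=2))  # 128
```

The same function applies to the height by passing H1 and H2 instead of W1 and W2.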

3.2 Pooling layer

Pooling is also called subsampling. To avoid overfitting and reduce the computational burden, the pooling layer achieves dimensionality reduction by representing each local region with a single value, until all regions are covered [22, 23, 30]. This compresses the output of the convolutional layer. Moreover, pooling helps to maintain translation invariance. There are two common pooling methods: max pooling and average pooling.

Max pooling is achieved by selecting the maximum value of the pooling region while the average pooling obtains a condensed feature map by calculating the average of the elements in each pooling region. An example is shown in Fig. 4, where the filter size equals 2 and stride is 2.

Fig. 4 Example of Max Pooling and Average Pooling

Suppose the pooling region is R. We define the set of activations XR contained in R as

$$ {X}_R=\left\{{x}_i|i\in R\right\} $$
(3)

Max pooling PM is expressed as:

$$ {P}_M=\max \left({X}_R\right) $$
(4)

while average pooling PA is defined as

$$ {P}_A=\frac{\sum {X}_R}{\left|{X}_R\right|} $$
(5)

where |XR| is the number of elements in the set XR.
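Eqs. (4) and (5) can be illustrated with a small NumPy sketch that uses a pooling size of 2 and a stride of 2. The function name `pool2d` and the 4 × 4 input values are hypothetical and chosen only for this example; they are not taken from Fig. 4.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2-D activation map (no padding)."""
    h, w = x.shape
    out_h = 1 + (h - size) // stride
    out_w = 1 + (w - size) // stride
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # The pooling region R for output position (i, j)
            region = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 1, 7, 8]], dtype=float)

print(pool2d(x, mode="max"))  # [[4. 2.] [2. 8.]]
print(pool2d(x, mode="avg"))  # [[2.5 1. ] [1.  6.5]]
```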

Although both methods are popular, each has its own shortcomings. In general, average pooling can only reduce the error of the estimated variance caused by the limited size of the neighborhood, and it retains more background information of the image. Max pooling can only reduce the offset of the estimated mean caused by convolutional layer parameter errors, and it retains more texture information. In addition, max pooling tends to overfit the training data [5].

In order to bridge these gaps, researchers turned to probabilistic pooling methods. Stochastic pooling (SP), which lies between the two, was proposed. It assigns each activation a probability according to its value and then subsamples according to these probabilities. On average it behaves like average pooling, while locally it follows the max pooling principle, since larger activations are more likely to be selected. This process can be expressed in the following two steps:

(1) Calculate the probability map pi from the original activation map xi:

$$ {p}_i=\frac{x_i}{\sum_{j\in R}{x}_j} $$
(6)

(2) Pick a location k within the pooling region R according to the probabilities p. Stochastic pooling PS is therefore defined as follows:

$$ {P}_S={x}_k,\quad \mathrm{where}\ k\sim P\left({p}_1,\dots, {p}_i,\dots \right) $$
(7)

Figure 5 presents a stochastic pooling example. It first generates the probability map and then randomly chooses the location k = 4, which corresponds to position (2, 1) and has a probability of 0.4. The output PS is therefore the value 4 at that position in the original activation map.

Fig. 5 Example of Stochastic Pooling
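The two steps in Eqs. (6) and (7) can be sketched for a single pooling region as follows. This is a minimal illustration that assumes non-negative activations (for example after ReLU); the function name, the handling of an all-zero region, and the sample values are assumptions for this example and are not taken from Fig. 5.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Stochastic pooling of one region: sample an activation with
    probability proportional to its value (Eqs. 6 and 7)."""
    x = np.asarray(region, dtype=float).ravel()
    total = x.sum()
    if total <= 0:
        return 0.0  # assumption: an all-zero region pools to zero
    p = x / total                # Eq. (6): probability map
    k = rng.choice(x.size, p=p)  # Eq. (7): draw location k according to p
    return float(x[k])

rng = np.random.default_rng(0)
region = np.array([[1.0, 2.0],
                   [4.0, 3.0]])  # hypothetical activations, sum = 10
samples = [stochastic_pool_region(region, rng) for _ in range(5)]
print(samples)  # each entry is one of the four activations
```

Because larger activations receive larger probabilities, the value 4.0 in this region is drawn most often (with probability 0.4), while smaller activations can still be selected.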

3.3 Batch normalization

In deep network training, changes in the parameters of earlier layers often alter the distribution of the data fed to later layers, which also affects the training speed [14]. The batch normalization (BN) algorithm addresses this problem well. By inserting a normalization layer after each layer, BN normalizes the inputs of the neurons in each layer of the network to zero mean and unit variance. This alleviates the vanishing gradient problem and accelerates learning convergence [6].

The forward pass of a BN layer is computed as follows:

$$ \mu =\frac{1}{n}\sum \limits_{i=1}^n{z}_i $$
(8)
$$ {\sigma}^2=\frac{1}{n}\sum \limits_{i=1}^n{\left({z}_i-\mu \right)}^2 $$
(9)
$$ {z}_i^{\prime }=\frac{z_i-\mu }{\sqrt{\sigma^2+\epsilon }} $$
(10)
$$ {o}_i=\gamma {z}_i^{\prime }+\beta \equiv {\mathrm{BN}}_{\gamma, \beta}\left({z}_i\right) $$
(11)

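where Z = [z1, …, zn] is the mini-batch of inputs, [oi] is the corresponding mini-batch of outputs, ϵ is a small constant added for numerical stability, and γ and β are learnable scale and shift parameters. In this study, the mini-batch size n is 256.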
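Eqs. (8) to (11) translate directly into a few lines of NumPy. The sketch below covers only the training-time forward pass over one mini-batch; the function name, the value of ϵ, the feature dimension, and the random input are assumptions for illustration.

```python
import numpy as np

def batch_norm_forward(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization forward pass over axis 0 (the mini-batch axis)."""
    mu = z.mean(axis=0)                    # Eq. (8): mini-batch mean
    var = z.var(axis=0)                    # Eq. (9): mini-batch variance (1/n)
    z_hat = (z - mu) / np.sqrt(var + eps)  # Eq. (10): normalize
    return gamma * z_hat + beta            # Eq. (11): scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(256, 8))  # mini-batch of 256, 8 features (assumed)
o = batch_norm_forward(z)
print(o.mean(axis=0).round(6), o.std(axis=0).round(3))
```

With γ = 1 and β = 0, each output feature has approximately zero mean and unit standard deviation, whatever the mean and scale of the input.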

As shown in Fig. 6, the output of a convolutional layer or fully connected layer serves as the input of batch normalization, and the output of BN in turn becomes the input of the following layer.

Fig. 6 Illustration of batch normalization

3.4 Dropout

Overfitting and long training time are two major difficulties in training deep neural networks. Dropout can effectively reduce overfitting and provides a form of regularization. The realization of dropout can be divided into two steps: first training the network with randomly dropped neurons, and then averaging the results of the whole collection of thinned networks. During training, dropout goes through the network layer by layer, dropping each neuron at random with probability P and keeping it with probability Q = 1 − P, where P is commonly set to 0.5. The output of every dropped neuron is set to zero, which results in a network with fewer nodes and fewer connections that is easier to train [9].

An example of a dropout neural network is shown in Fig. 7, where a blue solid circle represents a normal neuron and a dotted circle denotes a dropped neuron. It can be seen that each layer drops some neural units at a certain dropout rate while preserving the remaining ones. Taking the second layer in Fig. 7 as an example, three neural units are discarded and the other two are retained, which thins the original network layer. Clearly, the network after applying dropout has fewer nodes and is smaller.

Fig. 7 An example of dropout neural network
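The dropping step described in this section can be sketched in NumPy as below. Note one assumption that goes beyond the text: the sketch rescales the kept activations by 1/Q during training (the common "inverted dropout" variant), so that no rescaling is needed at test time. The function name and sample input are also hypothetical.

```python
import numpy as np

def dropout_forward(a, p_drop, rng, training=True):
    """Drop each activation with probability p_drop during training.
    Kept activations are divided by Q = 1 - p_drop (inverted dropout, an assumption)."""
    if not training or p_drop == 0.0:
        return a  # at test time all neurons are kept
    mask = rng.random(a.shape) >= p_drop  # True with probability Q = 1 - P
    return a * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones((4, 5))  # hypothetical activations of one layer
dropped = dropout_forward(a, p_drop=0.4, rng=rng)
print(dropped)       # entries are either 0 or 1 / 0.6
```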

3.5 Experiment setting

The experiment was developed in-house and run on a personal computer with a 2.5 GHz Core i7 CPU and 16 GB of memory, running Windows 10. The maximum number of epochs was set to 30 and the mini-batch size to 256. The global learning rate was initially set to 0.01 and was reduced to one-tenth of its previous value every 10 epochs. The settings are summarized in Table 3.

Table 3 Setting of neural network model
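The step-wise learning rate schedule described above can be written as a one-line function. This is a sketch only; the function name and the convention that epochs are counted from 0 are assumptions.

```python
def learning_rate(epoch, base_lr=0.01, drop_factor=0.1, drop_every=10):
    """Learning rate for a 0-indexed epoch: base_lr is multiplied by
    drop_factor once every drop_every epochs."""
    return base_lr * drop_factor ** (epoch // drop_every)

# With 30 epochs in total, the rate takes three values.
for epoch in (0, 9, 10, 19, 20, 29):
    print(epoch, learning_rate(epoch))
```

Under this convention, epochs 0 to 9 use 0.01, epochs 10 to 19 use 0.001, and epochs 20 to 29 use 0.0001.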

We evaluate the experimental results using “overall accuracy”, which is defined as the proportion of all samples that are correctly classified. It is computed by dividing the number of correctly predicted items by the total number of items to predict [2].
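The overall accuracy can be computed as in the sketch below. The function name and the label arrays are hypothetical; the example of 240 correct predictions is chosen only to show the arithmetic on a test set of 264 images and is not a reported result.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical example: 264 test images, 240 of them classified correctly.
y_true = np.zeros(264, dtype=int)
y_pred = np.zeros(264, dtype=int)
y_pred[:24] = 1  # 24 misclassified samples
print(f"{100 * overall_accuracy(y_true, y_pred):.2f}%")  # 90.91%
```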

4 Experiment results

4.1 Data augmentation results

We use the ch image, shown in the top-left corner of Fig. 2, as an example. The data augmentation results are shown in Fig. 8. Fig. 8a–f show the six augmentation methods: gamma correction, PCA color augmentation, affine transform, noise injection, scaling and random shift, respectively. A total of 180 new augmented ch images were generated, which made the new training set 181 times as large as before. Sufficient image data are beneficial for deep learning: data augmentation expands the data set, and it also helps to reduce over-fitting and improve classification accuracy.

Fig. 8 Data augmentation of ch sample image

4.2 Structure of proposed CNN

After tuning, we settled on an eight-layer CNN with 6 convolutional layers and 2 fully-connected layers. The details are listed in Table 4. For instance, in Block_2 the hyperparameters indicate 64 filters, each with a width of 3, a height of 3 and 32 channels. Batch normalization (BN), rectified linear units (ReLU) and stochastic pooling (SP) are used together with the convolutional layer, and the stride is set to 2. The other blocks are configured similarly. The first fully-connected layer includes dropout, and its dropout rate was set to 0.4 based on the experiments.

Table 4 Details of each layer in proposed CNN

4.3 Statistical analysis

We used the proposed eight-layer CNN with batch normalization, dropout, and stochastic pooling. The results of 10 runs are shown in Table 5. The highest accuracy is 90.91%, marked in bold, and the lowest is 87.12%. In addition, eight of the runs exceed 89% accuracy, and the overall accuracy reaches 89.32 ± 1.07%, which indicates relatively stable performance.

Table 5 Ten runs of our method

4.4 Pooling method comparison

In this experiment, we compared stochastic pooling (SP) with two conventional pooling methods: average pooling (AP) and maximum pooling (MP). The comparison is shown in Table 6. The average accuracies of SP, AP and MP are 89.32 ± 1.07%, 86.67 ± 1.01%, and 88.86 ± 1.42%, respectively. SP therefore yields the highest mean accuracy, exceeding AP by 2.65 percentage points and MP by 0.46 percentage points. SP and MP both achieve the highest single-run accuracy of 90.91%, while AP does not reach this level. Furthermore, the minimum accuracy of AP is 84.85%, which is much lower than the 87.12% of SP and MP.

Table 6 Comparison of average pooling, maximum pooling, and stochastic pooling

4.5 Dropout rate

We varied the dropout rate from 0% to 90% and recorded the 10-run results in Table 7. The error bars are shown in Fig. 9. In general, the overall accuracy rises as the dropout rate increases and peaks at 89.32 ± 1.07% with a dropout rate of 40%, which gives the best performance. It then begins to decline, although the second highest overall accuracy, 88.98 ± 1.29%, appears at a dropout rate of 70%. After that, the overall accuracy keeps dropping. Therefore, the optimal dropout rate was set to 40%.

Table 7 10-run results against different dropout rate

Fig. 9 Error bar of overall accuracy against dropout rate

4.6 Comparison to state-of-the-art approaches

In this experiment, we compared our method with three state-of-the-art approaches: HMM [11], SVM-HMM [13] and HCRF [38]. The comparison results are listed in Table 8. Our method achieves higher accuracy than HMM (83.77%), SVM-HMM (85.14%) and HCRF (78.00%). This advantage derives from deep learning combined with several techniques. Firstly, stochastic pooling can mitigate the overfitting and down-weight issues. Secondly, batch normalization helps to accelerate learning convergence and alleviate the vanishing gradient problem. Thirdly, dropout effectively reduces overfitting and provides regularization. Finally, data augmentation was applied to improve the generalization of the deep neural network. Thus, our method outperforms the other state-of-the-art approaches compared.

Table 8 Comparison with state-of-the-art approaches

5 Conclusion

This study proposed an optimized eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language. The results demonstrated that our method was superior to three state-of-the-art approaches, exceeding the second-best method, SVM-HMM, by about 4 percentage points in overall accuracy. We compared stochastic pooling against average pooling and maximum pooling, and the experimental outcomes indicated that stochastic pooling performed best and reduced overfitting effectively. Besides, batch normalization, dropout and data augmentation were employed to improve performance. Together, these techniques overcome common issues in traditional CNNs, which offers an opportunity to promote the integration of hearing-impaired people into society.

Nevertheless, some shortcomings remain. The current data set is insufficient, and more data need to be collected to improve accuracy. In addition, the hyperparameters were chosen empirically and need to be optimized further to achieve better experimental results.

In the future, we will try to identify a deep neural network of appropriate depth and adopt more advanced techniques to improve accuracy. The data set will also be further enlarged. We will try to transfer the benefits of this study to other fields, such as biomedical imaging, clinical oncology and blind fever screening, which will greatly help those in need. Besides, transfer learning [18, 19] is an alternative way to solve our task.