1 Introduction

Sign language is a visual language: instead of sound patterns, it conveys meaning through the movement and orientation of the hands, arms, and body, together with facial expressions. There is no uniform sign language across the world; each country has its own. In this study, we consider American Sign Language, the most widely used of the existing sign languages.

Previous studies on sign language recognition have failed to deliver a complete and reliable model free of restrictions. In particular, most of them are user-dependent; in other words, they cannot be applied in user-independent systems. This dependence typically arises in the feature extraction step of image-based systems. Moreover, signs can involve facial mimics or body postures that convey additional detail, for example portraying anger or other emotions through the hands. Studies on artificial neural networks (ANNs) have shown that they possess robust learning capability, and a variety of ANN systems have been used in hand posture recognition. Likewise, support vector machine (SVM) approaches have produced very effective results in recognition systems [10].

The novelty of this work is a new method of geometrical feature extraction that yields more accurate classification. Specifically, a new combination of extracted geometrical features of the hand is presented for a Sign Language recognition system. Furthermore, the proposed system uses a new, simple approach to segmentation against different backgrounds. Microsoft’s Kinect sensor provides 3D data that paves the way for new solutions to several challenging computer vision problems, including human activity analysis, object tracking, indoor 3D mapping, surveillance scenarios, and recognition tasks, especially hand gesture recognition. Changes in lighting conditions adversely affect the recognition process, and recognition is more difficult against a cluttered background than a plain one. These issues have an important impact on accuracy. To build a system that works in both simple and cluttered backgrounds, indoors or outdoors, under varying lighting conditions, a new approach is necessary.

Realistic, error-free Sign Language Recognition is an ambitious goal for many researchers in computer science, especially in pattern recognition. This research addresses the effects of illumination changes on hand recognition, as well as occlusion by other objects in cluttered scenes. In addition, finding hand features that are independent of hand orientation or direction is an important issue that this research tries to address. The proposed methods target the weaknesses of hand posture recognition systems in order to develop an SLR system. They are applied in the segmentation and feature extraction phases and increase overall accuracy thanks to depth-based images and geometrical features of the hand.

This paper uses SVM for DGSLR recognition, and all algorithms in each part are explained in detail. In its most basic case (two-class classification), the SVM model has a linear structure very similar to that of the multilayer perceptron (MLP) neural network; aside from some other differences, the two models effectively train a very similar structure in two different ways. In an MLP, the parameters are adjusted by error minimization, whereas in an SVM the risk of incorrect classification is taken as the objective function and the parameters are optimized accordingly. For some problems the error rate may reach zero, but among all zero-error models there is only one with the lowest structural risk. Therefore, in addition to its better performance, the SVM output can be more robust to changes and noise in the data, because it is designed and trained to withstand such uncertainties. The term 'neural network' for such systems is largely a metaphor drawn from nature; what matters is the mathematics behind them. From this perspective, many models used in machine learning share very similar (and sometimes identical) mathematical structures and differ only in how the problem is expressed and how the models are set up and described. For further study, we recommend the second edition of Simon Haykin's well-known book, Neural Networks: A Comprehensive Foundation (1999), whose introduction explains that the SVM can be viewed as a type of neural network; the third edition appeared under the new title Neural Networks and Learning Machines (2008). Another suitable reference is Neural Networks in a Soft-computing Framework, whose introduction and tenth chapter discuss support vector machines and argue that they are a special form of artificial neural network. Pattern Recognition and Machine Learning by Christopher M. Bishop is a further important and practical reference in this field.
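As a concrete statement of the risk-based objective mentioned above, the standard two-class soft-margin SVM can be written as the following optimization problem (a textbook formulation added here for illustration; the notation is ours, not taken from the cited books):

$$ \min_{{\mathbf{w}},\,b,\,\xi } \;\frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} + C\sum\limits_{i = 1}^{N} {\xi_{i} } \quad {\text{subject to}}\quad y_{i} \left( {{\mathbf{w}}^{ \top } {\mathbf{x}}_{i} + b} \right) \ge 1 - \xi_{i} ,\quad \xi_{i} \ge 0 $$

Maximizing the margin (minimizing \(\left\| {\mathbf{w}} \right\|\)) controls the structural risk, while the slack variables \(\xi_{i}\), weighted by the constant C, penalize misclassified or margin-violating samples; an MLP trained by backpropagation, in contrast, minimizes only the empirical error term.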

This paper introduces an American Sign Language alphabet recognition method based on hand gesture recognition to help deaf and hard-of-hearing people. It presents geometrical features of the hand for achieving more reliable recognition. The literature on depth-based hand gestures in sign language recognition systems is then reviewed. Next, the research methodology and procedure are described: the signer’s hand is segmented and the issues that arise are discussed; the level set method is implemented and its results reported; features are extracted according to the hand’s geometrical properties; and the support vector machine (SVM) algorithm classifies the extracted features to recognize the performed gestures. The implementation step is then expounded. Finally, the proposed SVM-based method is compared with two other classifiers, K-nearest neighbour (K-NN) and decision tree (DT). The system is evaluated and tested, the accuracy of the proposed method is presented in charts and tables, and errors due to wrong recognition are reported.

The idea behind this work is that users perform the desired signs while the proposed system detects them; the detected signs can then be converted to sound or text for hearing people. What distinguishes this research from other methods is its depth-based segmentation and geometrical features. It could be deployed with a depth camera embedded in a cell phone: the depth camera makes background subtraction easier, whether the background is simple or cluttered. Moreover, the geometrical features are independent of the orientation, location, and position of the hand, so emotional signing poses no problem for the recognition process. There is natural variability in executed signs because the hand occupies different positions for the same sign, and the observations are error-prone; a method other than exact feature matching is therefore needed, one that does not depend on the fingers’ positions. The system could also be deployed in public places such as airports and libraries, in educational settings such as universities, and in conferences or other scientific assemblies.

After this introduction to the American Sign Language alphabet recognition system, related work is reviewed in Sect. 2. Section 3 describes the research methodology and the depth-based geometrical features procedure, defining the dataset used, the segmentation method, the proposed feature extraction methods, and finally the classification step. Experimental results and discussion are given in Sect. 4. The paper ends with a conclusion and suggestions for further research that may make hand gesture recognition easier to apply in recognition systems.

2 Related works

There are several challenges that we try to solve; complex backgrounds and lighting conditions are the most important among them. The distance between the user and the Kinect camera during image capture can be considered a limitation of this research; ordinary cameras avoid this constraint but provide no depth information. The process is very sensitive to hand movements under changing illumination, which may cause some parts of the hand to be occluded by others. The two letters ‘J’ and ‘Z’ are motional signs, and it is preferable to remove them from the static hand posture recognition task: they are very similar to ‘I’ and ‘G’, respectively, and share similar features, which confuses the classifier.

Limitations and constraints of existing vision-based methods have led to unsatisfactory results in previous research. Object recognition in cluttered scenes, the requirement that the signer wear long sleeves, and the need for a motionless head or face are some of these restrictions. Likewise, steady hand movements, a stable body pose and location, a predetermined initial hand position, and a restricted vocabulary are other limitations discussed in this field.

Lee et al. (2013) described a computer-vision-based method for hand posture recognition and its application on an iPhone running iOS. The algorithm used YCbCr images [19] to locate skin regions and eliminated noise caused by slanted hand postures. An ANN was then used for sign recognition and deployed on the device. The recognition accuracy was 89% for moving hand postures and 94.6% for static hand postures, but skin detection was affected by the illumination conditions of the environment, causing low accuracy in some states and orientations.

The feature extraction step is one of the crucial steps in every recognition system. There is a large and diverse collection of feature extraction methods, each with advantages and disadvantages, such as the scale-invariant feature transform (SIFT) [8] (Gurjal and Kunnur, 2012), wavelet moments [7], histograms of oriented gradients (HOG) [24,25,26], and Gabor filters (GF) [1, 30]. These techniques are very robust in the recognition process, but only for a small number of simple hand postures [11]. For example, Dardas and Georganas [8] obtained an accuracy of 96.23% for recognizing six signs using SIFT features and an SVM classifier. Pugeault and Bowden [30] recognized 24 static ASL alphabet signs using Gabor filters, reporting a mean accuracy of 75%; moreover, their method had a high confusion rate of 17% between similar signs such as "r" and "u". In short, these methods are usually unable to achieve the desired accuracy when classifying a large number of ASL signs or their variations.

In addition, Dominio et al. [10] presented multiple depth-based descriptors, including features such as fingertip distance and elevation, the curvature of the hand’s contour, and properties of the palm region. They achieved an accuracy of 93.8% with an SVM classifier on an experimental set of 12 static and digit signs of the ASL alphabet. Liang et al. [20] improved per-pixel hand parsing with a distance-adaptive feature selection scheme and superpixel-partition-based Markov random fields (MRF); the improved algorithm raised per-pixel classification accuracy from 72 to 89%. These methods recognize only a small number of simple postures (fewer than 15), covering ASL digits and custom signs that form only a small portion of the ASL alphabet signs.

Changes in lighting conditions have a negative effect on recognition tasks because of shadows and other undesired effects on the objects [6] (Kishore and Kumar [18, 34]). Furthermore, recognition is more difficult against a cluttered background than a plain one [29]. Compared with body or skeleton recognition, recognizing the hand or another specific part of the body is a more sensitive task: other objects in the scene can cause occlusion and, consequently, wrong detections. These issues have an important impact on accuracy. To build a system that works in both simple and cluttered backgrounds, indoors or outdoors, under different lighting conditions, a new approach is necessary.

Most previous research is signer-dependent [6, 32]; in other words, the extracted hand features in earlier recognition systems depend on the position or direction of the signer’s hand [27, 33]. As a result, recognition works correctly only for a specific user and fails for generic users. Features that are independent of the user’s hand shape, orientation, location, position, and direction are therefore highly desirable. Moreover, much previous research used fingertips as features [20]; their main weakness is that fingertips can be occluded by other fingers. There is natural variability in executed signs because the hand occupies different positions for the same sign, and if the observations are error-prone, a method other than exact feature matching, one that does not depend on the fingers’ positions, is needed.

Kisel’ák et al. [16] introduced a new activation function, the ‘scaled polynomial constant unit activation function’ (SPOCU), for medical images in cancer detection. This novel activation function relates to complex patterns through the phenomenon of percolation and can thus outperform previously introduced activation functions such as SELU and ReLU. Discriminating between mammary cancer and mastopathy tissue plays a crucial role in clinical practice, and in that setting a more precise activation function that can capture the tissue and its complexity is necessary. In our case, however, such an activation function would only increase computational time.

This study uses SVM for classification because of its clarity and simplicity, and because it applies readily to a wide variety of problems, whereas approaches such as decision trees are not as easily applied across diverse problems. As Hinton (2008) noted, the SVM generalizes well on big datasets. Since a big dataset requires a complicated model, a full Bayesian framework is computationally very costly; in contrast, the SVM is faster while still generalizing well. Furthermore, despite using a very large set of nonlinear, task-independent features, the SVM has a clever way of avoiding overfitting.

3 Depth-based geometrical features in hand recognition

3.1 Dataset

Two separate datasets are employed in this research. The first is the dataset collected for this research, called DGSLR; the other is a standard dataset. For the DGSLR dataset, three novice users of Sign Language, one man and two women, took part in this study. They were asked to sit in front of the Kinect camera and perform the signs, repeating each letter five times.

After the preparation step and after teaching the signs to the signers, images were captured with the Kinect Explorer (WPF) application at 30 frames per second. In this colour image capturing application, the hand is highlighted in a distinct colour based on its depth.

The capturing process was performed against both plain and cluttered backgrounds under varying illumination. As Fig. 1 illustrates, the other objects in the cluttered background do not interfere with the detection procedure: farther objects are removed, and closer objects appear at a different depth from the user in the foreground. The hand is thus shown in a distinct colour in RGB mode (Fig. 1, left) and as a brighter region in depth mode (Fig. 1, right), and it is recognizable in both modes.

Fig. 1 Cluttered background in RGB and depth mode

To validate the data, a large standard dataset from the Centre for Vision, Speech and Signal Processing, University of Surrey (Pugeault and Bowden [30]), was used. Its images were captured from 9 people against backgrounds similar to those of the research dataset. The images gathered by the Kinect are depth-based only. There are more than 400 repetitions of each sign in different postures and directions, with the users changing their hand direction as well as their distance to the Kinect sensor.

Posture and gesture recognition methods can be divided into two types. The first uses Kinect (as in our work), Leap Motion, or other depth cameras to obtain depth information such as position. The second separates the gesture from the background by traditional methods and then extracts the apparent image characteristics of the posture with neural networks to perform recognition.

In the latter case, given the type of neural network (MLP), the learning paradigm (backpropagation), and the task at hand (pattern recognition), the ‘Fermi function’ (logistic sigmoid) can be used as the activation function. The kernel choice for the nonlinear SVM used in this study is discussed in Sect. 3.6.

3.2 Segmentation of the hand

Hand extraction is a crucial step in hand recognition systems because all subsequent processing is performed only on the segmented regions. The proposed segmentation scheme is based on depth data. In the scenario used in this research, users face the Kinect camera with their hands held in front of themselves. The hand therefore appears brighter than the other objects because of the depth information in the image: the body and other objects in the scene lie in deeper layers, so the hand is rendered in a different colour from the rest of the body. The distance between the user and the Kinect was 150 cm, and the lighting conditions varied during the signing process. A minimal sketch of this idea follows.
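The following MATLAB fragment illustrates the described scenario, assuming the raw Kinect frame is available as a matrix of millimetre distances; the file name, band width, and variable names are illustrative assumptions, not the authors' implementation.

```matlab
% Minimal sketch of depth-based hand segmentation (illustrative only).
% depthFrame: uint16 matrix of distances in millimetres, hand closest to sensor.
depthFrame = imread('depth_sample.png');          % hypothetical sample frame

validDepth = depthFrame(depthFrame > 0);          % zero means no depth reading
nearest    = double(min(validDepth));             % the hand is the nearest object
handBand   = 120;                                 % assumed hand thickness (mm)

% Keep only the pixels within the band closest to the sensor.
handMask = depthFrame > 0 & double(depthFrame) < nearest + handBand;

% Keep the largest connected component to discard stray near pixels.
handMask = bwareafilt(handMask, 1);
imshow(handMask)
```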

3.3 Morphological object dilation

The depth images obtained in this study contain some noisy points, so a post-processing procedure has been applied to improve them. These noisy points can result from hand movements or shaking during signing; the Kinect’s sensitivity to illumination conditions can also affect the images. A filtering operation could be applied to address this issue, but a review of filtering methods shows that they are commonly time-consuming (Chiang et al., 2013; Pal et al., 2014). In our depth images, on the other hand, there is no need to rectify the edges, and only some morphological operations are applied to smooth the binary depth-based image and remove the noisy points on the hand surface, as demonstrated in Fig. 2.

Fig. 2 The binary image before and after morphological operations

In the first step, all the images were resized to a 128-by-128 pixel matrix; a unified dataset of equally sized images allows for modifications in later stages if needed. The noisy points must be distinguishable from other black points, such as background points. Since the number of affected images in this study was small, the issue was resolved by a series of morphological functions in MATLAB, defined as follows.

The dilation of A by B is denoted \(A \oplus B\) and defined as:

$$ A \oplus B = \left\{ {z\left| {(\hat{B})_{z} \cap A \ne \emptyset } \right.} \right\} $$
(1)

where \(\hat{B}\) is the reflection of the structuring element B. In fact, \(A \oplus B\) is the set of pixel locations z where the reflected structuring element, translated to z, overlaps foreground pixels of A. In greyscale dilation, the structuring element has a height. The greyscale dilation of A(x,y) by B(x,y) is:

$$ (A \oplus B)(x,y) = \max \left\{ {A(x - x^{\prime},y - y^{\prime}) + B(x^{\prime},y^{\prime})\left| {(x^{\prime},y^{\prime}) \in D_{B} } \right.} \right\} $$
(2)

where D_B is the domain of the structuring element B, and A(x,y) is assumed to be − ∞ outside the domain of the image. To create a structuring element with nonzero height values, the syntax strel(sdom, height) is used, where height gives the height values and sdom the structuring element domain. Greyscale dilation is commonly performed with a flat structuring element (B(x,y) = 0); dilation with such a structuring element is equivalent to a local-maximum operator:

$$ (A \oplus B)(x,y) = \max \left\{ {A(x - x^{\prime},y - y^{\prime})\left| {(x^{\prime},y^{\prime}) \in D_{B} } \right.} \right\} $$
(3)
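As an illustration of the dilation defined above, the following MATLAB fragment applies the flat-disk case of Eqs. (1) and (3) to a binary hand mask; the structuring element size is an assumption, since the paper does not state it.

```matlab
% Sketch of the morphological smoothing on the binary depth image.
A  = imread('hand_mask.png') > 0;   % hypothetical binary hand mask
se = strel('disk', 3);              % flat structuring element B (assumed radius)

dilated = imdilate(A, se);          % Eq. (3): local maximum over the disk
cleaned = imerode(dilated, se);     % dilation then erosion = closing; fills
                                    % small holes and noisy points on the hand
% Equivalent one-liner: cleaned = imclose(A, se);
```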

3.4 Feature extraction

After hand segmentation and post-processing of the depth hand images, the selected feature vectors are expected to represent the positions of the fingers and palm. The fingers must therefore be characterized, at least roughly, by a robust approach.

3.4.1 Hand geometry

The hand area (HA) and hand perimeter (HP) are the first feature descriptors, calculated by morphological operators. To compute the hand perimeter, the distance between each adjacent pair of pixels around the hand contour is summed. Discontinuous areas in the hand region may lead to unexpected results, so all noisy points should be removed first to obtain reliable area and perimeter values. HA and HP are evaluated with all fingers closed and with all fingers open; these give the minimum and maximum values, respectively, so all other signs fall within this range. A minimal sketch of this computation follows.
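This sketch assumes handMask is the binary mask produced by the segmentation and smoothing steps above (variable and file names are ours):

```matlab
% Hand area (HA) and hand perimeter (HP) from the binary mask.
handMask = imread('hand_mask.png') > 0;        % hypothetical binary hand mask
stats = regionprops(handMask, 'Area', 'Perimeter');
HA = stats(1).Area;        % number of foreground pixels in the hand region
HP = stats(1).Perimeter;   % summed distance between adjacent contour pixels
```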

3.4.2 Convex hull of the hand

The convex hull of the hand is calculated to obtain the desired geometric information. Note that the forearm was removed from the initial images, as it contains no useful information. In the hand images of this research, the convex hull is stored as an n-by-2 matrix describing the smallest convex polygon containing the hand region: each of the n rows holds the two coordinates of one vertex of the polygon circumscribing the hand. The concept of a convex polygon is introduced next.

For a nonempty set of points in the plane, the convex hull is the smallest convex polygon that contains all the points in the set. For instance, in Fig. 3 the polygon around the points is a convex hull, and the six points on its boundary are called ‘hull points’.

Fig. 3 Convex hull of (left) a point set, (right) segmented hand

The convexity defects of the hand have geometric properties that can be used as features in the proposed system. The area of the convexity defects, CDA, was computed with an algorithm similar to that of the convex hull. Likewise, the number of convexity defects reflects the number of open or closed fingers: the empty spaces between open fingers are convexity defects, so the number of these spaces characterizes certain signs in the classification step, which is very useful for designing a reliable recognition system. A sketch of these computations is given below.
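The following MATLAB sketch is our reconstruction of the hull and defect measurements described above (CHA, CHP, CDA, and the defect count); it assumes a single-component binary hand mask and is not the authors' exact code.

```matlab
% Convex hull and convexity-defect features from the binary hand mask.
handMask = imread('hand_mask.png') > 0;          % hypothetical binary mask
props = regionprops(handMask, 'Area', 'ConvexArea', ...
                    'ConvexHull', 'ConvexImage', 'Image');

HA  = props(1).Area;                             % hand area
CHA = props(1).ConvexArea;                       % convex hull area

hull  = props(1).ConvexHull;                     % n-by-2 vertex coordinates
edges = diff(hull([1:end 1], :));                % close the polygon
CHP   = sum(sqrt(sum(edges.^2, 2)));             % convex hull perimeter

% Convexity defects = hull region minus hand region (within bounding box).
defects = props(1).ConvexImage & ~props(1).Image;
CDA = nnz(defects);                              % total convexity defect area
cc  = bwconncomp(defects);
numDefects = cc.NumObjects;                      % spaces between open fingers
```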

3.4.3 Ratio feature

Another extracted feature is the ratio between the hand area, HA, and the area of the convex polygon enclosing it, CHA, the convex hull introduced above. This ratio is named the convex hull area ratio:

$$ \Re_{{{\text{CHA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convex hull area }}({\text{CHA}})}} $$
(4)

The ratio between the perimeter of the hand shape (HP) and the convex hull perimeter (CHP) is another useful parameter. Gestures with closed fingers typically have a smaller perimeter than those with some fingers open, and for them the ratio of hand perimeter to convex hull perimeter is close to 1. The following equation shows this relationship.

$$ \Re_{{{\text{CHP}}}} = \frac{{{\text{Hand perimeter }}({\text{HP}})}}{{{\text{Convex hull perimeter }}({\text{CHP}})}} $$
(5)

Similarly to the convex hull ratios, the ratio of the hand area (HA) to the convexity defect area (CDA) can be considered an informative feature for a reliable recognition system. This ratio is calculated as:

$$ \Re_{{{\text{CDA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convexity defect area }}({\text{CDA}})}} $$
(6)

3.4.4 Distance feature

The height and width of the signer’s hand are further measurable features considered in this research, as their values help characterize the hand postures. Since similar signs have similar height and width values, they can be grouped into the same class for more clarity in the classifier. For example, as shown in Fig. 4, the three signs ‘A’, ‘S’, and ‘T’ have close height and width values; the same similarity occurs between ‘R’ and ‘U’.

Fig. 4 Similar signs with close geometrical values

To compute the height and width of the hand, the edge of the hand must first be detected. The longest diameters of the hand in the vertical and horizontal directions are then computed using the eigenvalue and eigenvector concepts. Finally, the distance feature is obtained as the Euclidean distance between the end points of these diameters on the hand boundary.

In the first step, the hand boundary should be calculated. There are predefined functions that can be applied to images to detect object edges; MATLAB also includes several algorithms for computing an object’s boundary, but the detected edges may span several adjacent rows of pixels, creating a ‘thick’ edge as shown in Fig. 5.

Fig. 5 Thick edge consisting of several points: original image, detected contour, and a more detailed view

In statistics, a covariance matrix is a matrix whose (i, j) element is the covariance between the ith and jth elements of a random vector. Each element of this vector is a scalar random variable, with either a finite number of observed empirical values or a finite or infinite number of possible values determined by the joint probability distribution of all the random variables.

The covariance between two jointly distributed real-valued random variables X and Y with finite second moments is (Statistics, 2002):

$$ \sigma (X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] $$

where E[X] is the expected value of X. Since all probabilities p_i add up to one, p_1 + p_2 + ... + p_k = 1, the expected value equals the weighted average:

$$ E[X] = \frac{{x_{1} p_{1} + x_{2} p_{2} + ... + x_{k} p_{k} }}{1} = \frac{{x_{1} p_{1} + x_{2} p_{2} + ... + x_{k} p_{k} }}{{p_{1} + p_{2} + ... + p_{k} }} $$
(7)

In linear algebra, an eigenvector of a square matrix is a vector whose direction is unchanged by the corresponding linear transformation. If v is a nonzero vector such that Av is a scalar multiple of v, then v is an eigenvector of the square matrix A. There is a correspondence between n-by-n square matrices and linear transformations: the linear transformation of n-dimensional vectors specified by an n-by-n matrix A is

$$ Av = w $$
(8)

where

$$ w_{i} = A_{i,1} v_{1} + A_{i,2} v_{2} + ... + A_{i,n} v_{n} = \sum\limits_{j = 1}^{n} {A_{i,j} } v_{j} $$
(9)

If w is a scalar multiple of v, then:

$$ Av = \lambda v $$
(10)

where v is an eigenvector of the linear transformation A and the factor λ is its eigenvalue.

The approximate longest diameter of the hand, and then the line perpendicular to it, should be computed as shown in Fig. 6. The coordinates of the points on the hand contour were obtained by the boundary detection algorithm, so the centre of gravity is easily found. The covariance matrix of the contour points is then computed, and the direction and length of the longest diameter are obtained from its eigenvalues and eigenvectors, as sketched after Fig. 6.

Fig. 6 Height and width of the hand
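A sketch of this eigen-based axis estimation in MATLAB, assuming the binary hand mask from the earlier steps; the variable names are ours.

```matlab
% Longest hand diameter via the covariance eigen-decomposition.
handMask = imread('hand_mask.png') > 0;   % hypothetical binary hand mask
B   = bwboundaries(handMask);             % cell array of boundary point lists
pts = double(B{1});                       % k-by-2 [row, col] contour coordinates

centre = mean(pts, 1);                    % centre of gravity of the contour
C = cov(pts);                             % 2-by-2 covariance matrix

[V, D] = eig(C);                          % columns of V are eigenvectors (Eq. 10)
[~, idx]  = max(diag(D));                 % eigenvector of the largest eigenvalue
mainAxis  = V(:, idx);                    % direction of the longest diameter
widthAxis = V(:, 3 - idx);                % perpendicular direction (hand width)
```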

3.5 Feature vector structure

All computed features for both the DGSLR and standard datasets were saved in two repositories as CSV (comma-separated values) files, which we handled in Microsoft Excel for convenience. The first repository, belonging to the DGSLR dataset, comprises three sheets, one per user; its rows and columns represent the letters and features, respectively, and the last column holds the label of each sign, from 1 to 26. Under the leave-one-out approach explained in the classification section, one person is held out for testing and the others are used for training. The second repository corresponds to the standard dataset and consists of 26 sheets, each belonging to a specific sign. Here we adopted a regular 70/30 split: 70% of the images were used for training and 30% for testing.

3.6 Classification

The last step of the proposed recognition system applies an appropriate machine learning method to classify the features extracted in the previous step and thereby recognize hand gestures. In this research, a multi-class one-versus-one SVM classifier is used, built from a set of n(n − 1)/2 binary SVM classifiers that test each gesture against every other (for n = 26 signs, 325 binary classifiers). Each binary output counts as a vote for a certain gesture, and the gesture with the maximum votes is the result of the recognition process. This study uses a nonlinear SVM, and since different kernel functions are available in the nonlinear SVM structure, choosing a kernel based on prior knowledge of invariances, as suggested by Cawley and Talbot [5], is an excellent idea.

The Gaussian radial basis function (GRBF) kernel, one of the most common kernels, is used in this research; it is given by Eq. 11.

$$ k(x_{i} ,x_{j} ) = \exp ( - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} )\quad {\text{for}}\;\gamma > 0 $$
(11)

The GRBF kernel corresponds to a feature space of infinite dimension. The maximum-margin classifier is well regularized, and it is widely believed that the infinite dimensionality does not spoil the results (Jin and Wang [15]). The GRBF kernel is a good default for a nonlinear model: it can yield an efficient, accurate approach without materializing the huge, potentially infinite-dimensional feature vector. Its optimized run time is another reason to employ it in this research’s classifier; the GRBF execution time is bounded by O(n log n), where n is the number of training samples.
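Since the implementation uses the LIBSVM MATLAB interface (see Sect. 4.3.4), the training and prediction step looks roughly like the sketch below; LIBSVM performs the one-versus-one voting internally, and the parameter values shown are placeholders, not the tuned values from the paper.

```matlab
% Multi-class RBF SVM with LIBSVM (one-versus-one voting is built in).
% X_train: m-by-d double matrix of feature vectors; y_train: m-by-1 labels 1..26.
model = svmtrain(y_train, X_train, '-t 2 -c 100 -g 0.1');   % -t 2 = RBF kernel
[y_pred, accuracy, ~] = svmpredict(y_test, X_test, model);  % test-set accuracy
```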

In this research, there are two datasets of depth-based images of the Sign Language alphabet. The classification process is first applied to the DGSLR dataset, so the training set contains data from the three available users. K-fold cross-validation with K equal to 5 and 10 is used in the testing step. In K-fold validation, the collected data are partitioned into K subsets; one subset is used for validation and the remaining K − 1 for training. The procedure is repeated K times so that each subset is used exactly once for validation, and the average over the K runs is taken as the final estimate.

The two parameters C and γ of the RBF kernel are searched over a regular grid, with C taking the values 1, 10, 100, and 1000 and γ the values 0.001, 0.01, 0.1, and 1. As with other classifiers, for each pair of parameters the training collection is divided into two parts, N − 1 users for training and the remaining user for validation; we retain the 70/30 split between training and testing. The accuracy is assessed, and the testing process is repeated for different numbers of iterations. Finally, the parameter pair giving the highest accuracy is selected and applied to the SVM. A sketch of this grid search follows.
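This sketch uses LIBSVM's built-in cross-validation switch ('-v 5' makes svmtrain return the fivefold CV accuracy instead of a model); the grid values follow the text, everything else is illustrative.

```matlab
% Grid search over (C, gamma) with fivefold cross-validation.
Cs     = [1 10 100 1000];
gammas = [0.001 0.01 0.1 1];
best   = struct('acc', -inf, 'C', NaN, 'g', NaN);
for C = Cs
    for g = gammas
        opts = sprintf('-t 2 -c %g -g %g -v 5 -q', C, g);
        acc  = svmtrain(y_train, X_train, opts);    % CV accuracy in percent
        if acc > best.acc
            best = struct('acc', acc, 'C', C, 'g', g);
        end
    end
end
% Retrain on the full training set with the winning parameter pair.
model = svmtrain(y_train, X_train, ...
                 sprintf('-t 2 -c %g -g %g -q', best.C, best.g));
```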

To measure classifier accuracy, two statistical parameters, ‘sensitivity’ and ‘specificity’, were used. Sensitivity, or the true-positive rate, measures the proportion of actual positive samples that are correctly identified; it is complementary to the false-negative rate. Specificity, or the true-negative rate, measures the proportion of negative samples that are correctly identified; similarly, it is complementary to the false-positive rate.

A perfect predictor would be 100% sensitive and 100% specific, but in practice no perfect predictor exists; theoretically, every predictor has a minimum error bound called the Bayes error rate. The four possible outcomes of a binary decision are as follows:

  • True positive (TP) = correctly identified

  • False positive (FP) = incorrectly identified

  • True negative (TN) = correctly rejected

  • False negative (FN) = incorrectly rejected

Two equations can be formulated and derived from a confusion matrix as follows (Fawcett, 2006, Powers, 2011):

$$ \begin{gathered} {\text{Sensitivity}} = {\text{True}}\;{\text{Positive}}\;{\text{Rate}}\;({\text{TPR}}) = \frac{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Positives}}}}{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Positives}} + {\text{Number}}\;{\text{of}}\;{\text{False}}\;{\text{Negatives}}}} \hfill \\ = \frac{{\sum {{\text{True}}\;{\text{Positive}}} }}{{\sum {{\text{Condition}}\;{\text{Positive}}} }} \hfill \\ \end{gathered} $$
(12)
$$ \begin{gathered} {\text{Specificity}} = {\text{True}}\;{\text{Negative}}\;{\text{Rate}}\;(TNR) = \frac{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Negatives}}}}{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Negatives}} + {\text{Number}}\;{\text{of}}\;{\text{False}}\;{\text{Positives}}}} \hfill \\ = \frac{{\sum {{\text{True}}\;{\text{Negative}}} }}{{\sum {{\text{Condition}}\;{\text{Negative}}} }} \hfill \\ \end{gathered} $$
(13)

These statistical parameters can be represented in the confusion matrix as shown in Table 1.

Table 1 Statistical parameters in confusion matrix to measure the classifier accuracy
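Equations 12 and 13 extend to the multi-class case by treating each sign one-versus-rest; a minimal MATLAB sketch, assuming y_test and y_pred hold the true and predicted labels:

```matlab
% Per-class sensitivity and specificity from the confusion matrix (Eqs. 12-13).
CM = confusionmat(y_test, y_pred);     % 26-by-26 for the ASL alphabet
TP = diag(CM);
FN = sum(CM, 2) - TP;                  % row sums give actual class totals
FP = sum(CM, 1)' - TP;                 % column sums give predicted totals
TN = sum(CM(:)) - TP - FN - FP;

sensitivity = TP ./ (TP + FN);         % true-positive rate per sign
specificity = TN ./ (TN + FP);         % true-negative rate per sign
```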

4 Experimental results and discussion

The experiments were divided into two categories: our own dataset and the standard dataset. The latter experiments were performed on a gesture dataset from the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK, allowing comparison with state-of-the-art techniques. A number of practical tests were performed to evaluate the proposed methods and to compute the system’s accuracy under different parameters. Using a larger dataset is expected to yield more accurate results: since the proposed method is independent of the size, angle, and rotation of the hand, increasing the dataset size leads to better learning and ultimately higher accuracy.

Some geometrical features were used as parameters:

  • hand area (HA),

  • hand perimeter (HP),

  • A convex hull, stored as an n-by-2 matrix of vertices, determines the smallest convex polygon containing the hand region. The area of the convex polygon (CHA) and the perimeter of the convex polygon (CHP) are therefore considered as further parameters.

  • The area of the convexity defects is computed by an algorithm similar to that of the convex hull. The number of convexity defects reflects the number of open or closed fingers: the empty spaces between open fingers are convexity defects, so the number of these spaces characterizes certain signs in the classification step. The area of the convexity defects of the hand (CDA) is another parameter considered in this paper.

  • The longest diameters of the hand in the vertical and horizontal directions are further parameters, computed from the eigenvalue and eigenvector concepts.

These parameters and the ratios between them enter the computational process. The parameters under study depend directly on the type of sign, and since some signs are very similar, their values are very close together. A threshold has therefore been considered for each sign to avoid interference and overlap. For example, for the two signs ‘I’ and ‘J’, the convexity defects are very close together, as can be seen in the figure.

For a single signer repeating the same sign, a tolerance of about ± 0.1 per repetition is error-prone, since the values depend on the size of the signer’s hand. Across different signers performing the same sign, the tolerance increases to about ± 0.5 per iteration.

The tolerance differed among the other parameters, so a different tolerance was considered for each parameter.

4.1 Data collection

After collecting the desired data from both the DGSLR and standard datasets, the signer’s hand must be separated from the rest of the body and the other objects in the scene. The proposed segmentation approach relies on the depth-image property. The greyscale images were converted to binary mode using Otsu’s thresholding algorithm (Batenburg and Sijbers [3]), as described in the previous section. Some samples of the experimental results are shown in Fig. 7.

Fig. 7 Hand segmentation
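A minimal sketch of this binarization step, assuming a greyscale depth frame (the file name is illustrative); graythresh implements Otsu's method in MATLAB.

```matlab
% Otsu thresholding of a greyscale depth image.
G     = imread('depth_gray.png');   % hypothetical greyscale depth frame
level = graythresh(G);              % Otsu's method: optimal global threshold
BW    = imbinarize(G, level);       % binary hand mask
```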

It is clearly observed that there is no need to trace the hand or determine a bounding box around the hand region. In addition, there is no difference between the left and right hand, because the coordinates of the hand location are unimportant; the hand can be segmented based on the illumination intensity alone. For more clarity, the segmented hands were cropped and zoomed, as shown in Fig. 8.

Fig. 8 Hand segmentation

To separate the wrist from the forearm, the hand contour was computed and a circle inscribed at the palm centre was drawn. The longest diameter of the hand was calculated from the eigenvector and eigenvalue of the hand image. Then, the line perpendicular to the longest diameter and tangent to the inscribed circle was plotted, as explained in detail in the previous section. The green star marks the tangent point between the inscribed circle and the perpendicular line (hand width) at the lowest point of the circle. Figure 9 shows some experimental results for a non-expert user.

Fig. 9 Removed forearm

To recognize the hand position, the level set method (LSM) was employed for its low computational cost and high speed (Gonzalez et al. [12]). It was applied to the signer’s image to recover parts of the hand that had been missed: as described in the previous section, some parts of the hand may be lost depending on the illumination direction and the position of the hand. The hand could be segmented by defining a set of arbitrary points around the hand region. Some experimental results of the level set method are highlighted in Fig. 10.

Fig. 10 Comparison between the Kinect and level set segmentation, a depth image, b Kinect segmentation, c LSM execution, d LSM segmentation
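MATLAB's activecontour function offers a level-set-style evolution (Chan–Vese model) and can stand in for the LSM step described here; the initial region and iteration count below are assumptions, not the authors' settings.

```matlab
% Level-set-style refinement of the hand region with an active contour.
G    = imread('depth_gray.png');            % hypothetical greyscale depth frame
init = false(size(G));
init(100:200, 120:220) = true;              % assumed rough region around the hand

refined = activecontour(G, init, 300, 'Chan-Vese');  % 300 evolution iterations
imshowpair(G, refined, 'montage')
```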

4.2 Feature extraction

The next step is extracting features from the segmented hand. These features will be used in the classification step for recognizing the performed gestures.

4.2.1 Hand geometry features

The geometrical properties of the hand are reliable features for hand gesture recognition systems because properties such as area and perimeter are invariant to rotation and to changes in the hand’s location. The signer may move slightly, the signer’s hand may shake and change its position, or one signer might use the right hand for some signs and the left for others. Table 2 shows HA and HP for the three signers in the DGSLR dataset.

Table 2 The area and perimeter of the hand for three different signers

4.2.2 Convex hull of the hand

The convex hull of the 2D depth-based hand shape was computed using interpolation and computational geometry routines. It can serve as one building block of existing descriptors for hand posture (Pedersoli et al. [28]). All the binary segmented hand images were resized, and the convex hull function was then applied to them. Some instances of the results are shown in Fig. 11.

Fig. 11 Convex hull of the hand shape

As can be observed in the figure above, similarity between signs may lead to similar convex hull polygons around them, as with the first and fifth signs, which represent ‘A’ and ‘E’. The same similarity is observed in Fig. 11 between the fourth and last signs, ‘D’ and ‘R’. Meanwhile, a small difference in geometrical features such as the area and perimeter of the convex polygon is acceptable for this classification process.

The convex hull area and perimeter are abbreviated CHA and CHP, respectively; their values for the DGSLR dataset are presented in Table 3.

Table 3 The area and perimeter of the convex hull for three different signers

4.2.3 Convexity defects of the hand

A practical way of estimating the shape of a specific object is to calculate its convex hull and then its convexity defects. As mentioned, the convexity defects are the parts of the convex hull that do not belong to the object itself. There are several ways to compute the convexity defects of an object (Keskin et al. [17]). Some experimental results of the procedure applied to obtain the convexity defects of the hand are illustrated in Fig. 12.

Fig. 12 Convexity defects of the hand shape

A great deal of informative data can be extracted from the convexity defects of the hand, as shown in Fig. 12. For some signs, such as ‘F’ (the fourth sign in the figure from the left), the number of open fingers can be read off by counting the convexity defect spaces between the fingers. Each space between two fingers contains one point that belongs to the hand and has the maximum distance to the convex hull; the number of these points also helps characterize the hand shape in the posture. The area computation then proceeds as in the convex hull process. The results of this procedure are shown in Table 4.

Table 4 The area of the convexity defect for three different signers

4.2.4 Hand ratio

Another useful feature considered in this study is the ratio of the hand shape’s area and perimeter to those of the convex hull enclosing it; the corresponding ratio is also computed for the convexity defect areas. Equations 14, 15, and 16 show these relationships, and Table 5 shows some sample results.

$$ \Re_{{{\text{CHA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convex hull area }}({\text{CHA}})}} $$
(14)
$$ \Re_{{{\text{CHP}}}} = \frac{{{\text{Hand perimeter }}({\text{HP}})}}{{{\text{Convex hull perimeter }}({\text{CHP}})}} $$
(15)
$$ \Re_{{{\text{CDA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convexity defect area }}({\text{CDA}})}} $$
(16)
Table 5 The area and perimeter ratio for three different signers

4.2.5 Distance features

The approximate longest diameter of the hand can be calculated via the eigenvalue and eigenvector concepts, and the approximate width is obtained by drawing a line perpendicular to it. Figure 13 presents selected results of this procedure for signer ‘A’, both with and without the hand contour. The results have been zoomed to 200% for clarity.

Fig. 13 Eigenvectors of the hand

As can be observed in Fig. 13, the vectors are drawn from the hand contour to the hand centre. If they are extended to the opposite points on the contour, the approximate length and width of the hand are easily computed.

After trying all the signs in the dataset, it was observed that this procedure does not give good results for some signs, as shown in Fig. 14, so a complementary idea, explained below, was used to resolve this issue.

Fig. 14 Bad results of the length and width calculation

4.3 Classification

4.3.1 Discussion on DGSLR dataset

In the final commands of the SVM run, the average accuracy on the training and test sets was calculated. The experimental results were computed for each extracted feature individually and also for combinations of them. In addition, the programme was run for 1 and 10 iterations with fivefold and 10-fold cross-validation. Tables 6 and 7 show the accuracy in fivefold cross-validation during training and the final testing accuracy for one iteration on the DGSLR dataset, which consists of three users and 390 depth-based images. As Tables 6 and 7 show, the accuracy increases considerably when features are combined: when the convexity defect is used as a single feature, the training accuracy is 23.88% and the validation accuracy 21.88%, whereas combining it with the other features raises recognition markedly, reaching 80.64% in the training phase and 80.81% in the testing phase.

Table 6 Accuracy of single extracted features from the DGSLR dataset
Table 7 Accuracy of combination of extracted features from the DGSLR dataset

The following figures present the results as line charts for clarity. The training and testing phases give very close results for both single and, especially, combined features. Referring to Fig. 15, the convex hull feature has the greatest impact on accuracy: at the two points related to the convex hull, (CHA + CHP) and (RCHA + RCHP), the accuracy is close to 80%. The corresponding value for the distance feature is approximately 50%, which shows that the distance feature is also an important feature in this case.

Fig. 15 Accuracy rate in a single and combined feature vector

The overall accuracy for a single feature vector is 53.195% in the training phase and 50.48% in the testing phase. For combined features, the recognition accuracy in the training and testing phases is 84.807% and 85.005%, respectively. Figure 15 shows the overall results for the DGSLR dataset.

Figure 16 shows the confusion matrix of the 26 signs in the DGSLR dataset for the three users. As observed, similar signs incur recognition errors and are not detected correctly in all cases. For example, sign ‘M’ is predicted correctly at a rate of 89.5% and is predicted as sign ‘A’ at a rate of 16.7%.

Fig. 16 (left) Confusion matrix, signs A-M, DGSLR dataset with three users, (right) confusion matrix, signs N-Z, DGSLR dataset with three users

Sign ‘T’ is predicted correctly in 87.8% of cases but is detected as ‘N’ and ‘S’ in 5.3% and 11.2% of tested cases, respectively. Some signs, such as ‘B’ and ‘V’, are predicted correctly in all cases. Consequently, the overall recognition rate is 90.250%, which is acceptable in view of previous work in this field.

Overall, for one and ten iterations with fivefold and 10-fold cross-validation in this case study of the multiclass RBF SVM, the average accuracy in the training and testing phases is as presented in the charts of Fig. 17.

Fig. 17 (left) Training phase accuracy rate, (right) testing phase accuracy rate

4.3.2 Discussion on standard dataset

The multi-class SVM classifier was also applied to the standard dataset, with the following results. The standard dataset comprises a huge set of depth-based images of nine users with approximately 400 repetitions of each sign, i.e. about 10,400 images per user; here, just one user has been considered. As Table 8 shows, the highest recognition accuracy for a single feature belongs to the convex hull, with 58.99% in training and 59.65% in testing; the second highest belongs to the ratio between the convex hull and the hand, mirroring the DGSLR results. Table 9 shows the feature combinations, where the highest accuracy belongs to the combination of the distance, hand, and convex hull features.

Table 8 Accuracy of single extracted features from the standard dataset
Table 9 Accuracy of combination of extracted features from the standard dataset

The following charts present the recognition accuracy for single and combined features, showing the recognition trend on the standard dataset. As with the DGSLR results, the accuracy in Fig. 17 (left) is higher than in Fig. 17 (right), and the trend for the combined features is steadily upward.

Table 10 presents a recognition accuracy comparison between the proposed method and previous works that used the Kinect sensor. According to the table, Random Occupancy Pattern and Eigenjoints show high accuracy among the examined classifiers. Some histogram-based classifiers also yield positive recognition results, and the graph-based classifiers achieve accuracies above 70%. Neural-network-based classifiers are widely used in recognition processes, but compared with the other classifiers they show lower accuracy here. The hidden Markov model has shown high recognition accuracy in Sign Language applications, as also discussed in the literature. The recognition accuracy of this research, based on SVM and examined on the DGSLR and standard datasets, exceeds 90% on the DGSLR dataset and 96% on the standard dataset, which is a good result compared with previous research.

Table 10 Recognition accuracy comparison

4.3.3 Discussion and comparison on benchmark

The experimental results of this research are compared against the principal previous research that used the same standard dataset. A brief review of that research is given here, followed by tabulated comparisons of the practical results of the two studies.

In the principal research on which this study builds, depth-based detection of the user’s hand was performed using the OpenNI + NITE framework (Middleware, OpenNI) on a Kinect. This library provides functions for detecting hands in 3D space from the depth image produced by the Kinect sensor. The hand is then segmented from the depth image under the assumption that it forms a continuous region. For feature extraction, the hand shape features were based on Gabor filtering of the depth and intensity images. Learning and classification were carried out with a multi-class random forest, discussed in detail earlier. The random forest has good learning accuracy (Daugman [9]), can handle large feature spaces and large datasets, and has shown fast training. The workflow of that research is presented as follows.

Figures 18 and 19 show the confusion matrices for the detection of all signs in the benchmark research and in this research, using a combined feature vector.

Fig. 18 Confusion matrix of all signs in the dataset in benchmark research

Fig. 19 Confusion matrix of all signs in the standard dataset in the proposed research

Considering both confusion matrices above, it can be seen that for signs with similar postures, such as ‘A’, ‘B’, ‘M’, ‘N’, ‘S’, and ‘T’, the recognition rates are close together. For example, the recognition rate for the ‘A’ sign is 0.64 (64%) and for ‘E’ it is 0.63; the corresponding benchmark rates are 0.75 and 0.63. Similar signs can be detected wrongly, as happens with ‘Y’ and ‘L’ or ‘F’ and ‘W’. The prediction errors show similar patterns: the ‘A’ sign is wrongly detected as ‘M’ at a rate of 0.05 in the principal study and 0.03 in this research, while the ‘O’ sign is predicted as ‘C’ at a rate of 0.3 there versus 0.05 here. Some recognition rates have improved while others have declined, but an overall look at both figures shows that most rates have improved in the proposed research. Another notable point concerns the two signs ‘J’ and ‘Z’, which are motional and involve movement while signing. Since this research studies still images, these two signs have low recognition rates in the figures. The benchmark research removed them from its scope, whereas this research averaged over the different poses of each sign: while the signer performed the sign, images were captured one by one, and the average of their geometrical features was calculated.

Consequently, in the benchmark research, the best results were obtained for the two signs ‘L’ and ‘V’, with a 0.87 prediction rate, and the lowest for the ‘O’ sign with 0.13 and for ‘S’ and ‘M’ with 0.17. In the proposed research, the lowest rates are 0.35 and 0.39 for ‘J’ and ‘Z’, respectively, meaning that the two motional signs have the lowest recognition rates of all performed signs. The overall recognition rate in the benchmark research is 52.95%, whereas it is 66.07% in the proposed research; as mentioned before, this rate reaches 90.25% on the DGSLR dataset with three users and 96.85% on the standard dataset.

4.3.4 Discussion and comparison based on different classifiers

The results of two common classifiers, K-nearest neighbours (K-NN) and decision tree (DT), were obtained and compared with the SVM. To obtain better results, the signs were divided into categories of five signs each: A to E, F to J, …, U to Z, labelled 1 to 5, 6 to 10, …, 21 to 26, respectively. First, the SVM results with fivefold and tenfold cross-validation are presented in Tables 11 and 12.

Table 11 SVM by fivefold cross-validation in training phase
Table 12 SVM by 10-fold cross-validation in training phase

Table 11 shows the recognition accuracy for the training and testing phases in each class of the SVM with fivefold cross-validation. The average accuracy in each class is also presented, followed by the final training and testing accuracies.

Table 12 shows the recognition accuracy for the training and testing phases in each class of the SVM with tenfold cross-validation. The average accuracy in each class is also presented, followed by the final training and testing accuracies.

Comparing Tables 11 and 12 shows that the accuracy of the SVM with 10-fold cross-validation is higher than with fivefold.

Table 13 presents the recognition accuracy of the K-NN classifier with k = 10; the last two iterations in each class are shown as samples. For example, in the third class, labels 11 to 15 corresponding to the signs ‘K’ to ‘O’, the sign ‘K’ with label 11 is given as input and the classifier predicts it as ‘O’ (label 15); in the next iteration it is predicted as ‘N’ (label 14). In the same class, the two signs ‘L’ and ‘M’, with labels 12 and 13, are predicted correctly. The total accuracy is roughly 85%, which is lower than that of the SVM classifier.

Table 13 K-NN accuracy recognition, K = 10

Table 14 presents the recognition accuracy of the K-NN classifier with k = 20; again, the last two iterations in each class are shown as samples. For example, in the second class, labels 6 to 10 corresponding to the signs ‘F’ to ‘J’, the sign ‘F’ with label 6 is given as input and the classifier predicts it as ‘H’ (label 8), and the same in the next iteration; in this class, the sign ‘G’ with label 7 is predicted correctly in the second iteration but as ‘J’ in the first. The total accuracy is just over 84%, which is lower than that of the SVM classifier.

Table 14 K-NN accuracy recognition, K = 20

Table 15 presents the results of the DT, the next classifier. Its total recognition accuracy is about 81%, lower than both K-NN and SVM, but thanks to its simple structure it is widely used for classification tasks.

Table 15 DT accuracy recognition

Figure 20 shows the recognition rates of the three classifiers. The SVM classifier clearly achieves the highest accuracy compared with K-NN and DT. Unexpectedly, K-NN with K = 10 achieves a higher accuracy rate than K-NN with K = 20. Finally, the DT classifier has the lowest recognition rate of the three.

Fig. 20 Comparison between recognition accuracy rate of SVM, K-NN, and DT

Lastly, a comparison between this research and its benchmark is presented in Table 16. We used MATLAB and the LIBSVM library to develop the algorithms, whereas the benchmark used the OpenNI and NITE libraries. The segmentation, feature extraction, and classification phases were implemented differently, but both studies used multiclass classification because of the number of letters in the sign language alphabet.

Table 16 Method Comparison

5 Conclusion

We aimed to examine the accuracy of the proposed hand recognition technique on both the DGSLR and standard datasets, which contain samples of the American Sign Language alphabet. The effectiveness of the proposed techniques was first evaluated on the DGSLR dataset with three users, and acceptable recognition rates were obtained. The evaluation was then carried out on the standard dataset, achieving considerable and very promising results. Besides the experimental results, tabular analyses and discussions of the charts are reported, and a comparison between the benchmark research and the proposed research with their final results is investigated. Furthermore, two classifiers, K-NN and DT, were employed and their results compared with the SVM classifier.

Since there are 26 different signs in the Sign Language alphabet, a multi-class one-versus-one SVM classifier with 26 classes and an RBF kernel was used, validating each class. The accuracy of the proposed method was evaluated, and the procedure was repeated while varying each parameter (C, γ) for validation; the pair giving the best average accuracy was selected, and the SVM was then trained on the selected training set with these optimal parameters. The recognition process also uses multiple feature descriptors, i.e. a combination of the extracted features. Experiments were conducted on the collected and standard datasets, and the combination of extracted features reveals the superiority of the proposed method over existing work on this subject. The collected dataset was produced by three different users, two of whom were novices in Sign Language, with each sign repeated five times to improve accuracy; the standard dataset has more than 400 repetitions of each sign. The process was run for 1 and 10 passes over all data with fivefold and 10-fold cross-validation. A confusion matrix is used in the proposed machine learning process, permitting visualization of the algorithm’s efficiency.

The significant finding of this research is the realization of substantial improvements in Sign Language recognition accuracy. Combined features give better results than any single feature, and the distance feature makes a major contribution to the recognition rate. Evaluations on the collected dataset give a recognition rate of 90.25%, while on the complete standard dataset the proposed approach achieves an identification rate of 96.85%, the best overall rate reported so far on that dataset.

According to the confusion matrix visualizations obtained from the benchmark and the proposed research, in specific cases alternative techniques and combinations of machine learning algorithms provide higher Sign Language recognition accuracy. The aim of this work is a generalized Sign Recognition process: our proposed machine learning pipeline for a generalized Sign Language Recognition system, capable of operating in cluttered environments with varied lighting, improves on previous research that used the chosen dataset.

In this research, geometric features have been employed; new features such as hand keypoints for estimation and tracking could additionally be used to recognize multi-frame videos of gestures with deep neural networks. However, feature learning with deep neural networks is time-consuming and prone to overfitting, so this remains a direction for future work.