1 Introduction

Sign language is a visual language: instead of sound patterns, it conveys meaning through the movement and orientation of the hands, arms, and body, together with facial expressions. There is no uniform sign language across the world; each country has its own. In this study, we consider American Sign Language, the most widely used of the existing sign languages.

Previous studies on sign language recognition have failed to deliver a complete and reliable model free of restrictions. In particular, most of them are user-dependent; in other words, they cannot be applied in user-independent systems. This dependence typically arises in the feature extraction step of image-based systems. Moreover, signs can involve facial mimics or body postures that convey additional detail, for example portraying anger or other emotions through the hands. Studies on artificial neural networks (ANNs) have shown that they possess robust learning capability, and a variety of ANN systems have been used in hand posture recognition. Likewise, support vector machine (SVM) approaches have produced very effective results in recognition systems [10].

The novelty of this work is a new method of geometrical feature extraction that yields more accurate classification. Specifically, a new combination of extracted geometrical features of the hand is presented for a Sign Language recognition system. Furthermore, the proposed system uses a new, simple approach to segmentation against different backgrounds. Microsoft’s Kinect sensor provides 3D data that paves the way for new solutions to several challenging computer vision problems, including human activity analysis, object tracking, indoor 3D mapping, surveillance scenarios, and recognition tasks, especially hand gesture recognition. Changes in lighting conditions adversely affect the recognition process, and recognition is more difficult against a cluttered background than a plain one. These issues have an important impact on accuracy. To build a system that works in both simple and cluttered backgrounds, indoors or outdoors, under varying lighting conditions, a new approach is necessary.

Realistic, error-free Sign Language Recognition is an ambitious goal for many researchers in computer science, especially in pattern recognition. This research addresses the effects of illumination changes on hand recognition, as well as occlusion by other objects in cluttered scenes. In addition, finding hand features that are independent of hand orientation or direction is an important issue that this research tries to address. The proposed methods target the weaknesses of hand posture recognition systems in order to develop an SLR system. They are applied in the segmentation and feature extraction phases and increase overall accuracy thanks to depth-based images and geometrical features of the hand.

This paper uses SVM for DGSLR recognition, and all algorithms in each part are explained in detail. In its most basic case (two-class classification), the SVM model has a linear structure very similar to that of the multilayer perceptron (MLP) neural network; aside from some other differences, the two models effectively train a very similar structure in two different ways. In an MLP, the parameters are adjusted by error minimization, whereas in an SVM the risk of incorrect classification is taken as the objective function and the parameters are optimized accordingly. For some problems the error rate may reach zero, but among all zero-error models there is only one with the lowest structural risk. Therefore, in addition to its better performance, the SVM output can be more robust to changes and noise in the data, because it is designed and trained to withstand such uncertainties. The term 'neural network' for such systems is largely a metaphor drawn from nature; what matters is the mathematics behind them. From this perspective, many models used in machine learning share very similar (and sometimes identical) mathematical structures and differ only in how the problem is expressed and how the models are set up and described. For further study, we recommend the second edition of Simon Haykin's well-known book, Neural Networks: A Comprehensive Foundation (1999), whose introduction explains that the SVM can be viewed as a type of neural network; the third edition appeared under the new title Neural Networks and Learning Machines (2008). Another suitable reference is Neural Networks in a Soft-computing Framework, whose introduction and tenth chapter discuss support vector machines and argue that they are a special form of artificial neural network. Pattern Recognition and Machine Learning by Christopher M. Bishop is a further important and practical reference in this field.
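As a concrete statement of the risk-based objective mentioned above, the standard two-class soft-margin SVM can be written as the following optimization problem (a textbook formulation added here for illustration; the notation is ours, not taken from the cited books):

$$ \min_{{\mathbf{w}},\,b,\,\xi } \;\frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} + C\sum\limits_{i = 1}^{N} {\xi_{i} } \quad {\text{subject to}}\quad y_{i} \left( {{\mathbf{w}}^{ \top } {\mathbf{x}}_{i} + b} \right) \ge 1 - \xi_{i} ,\quad \xi_{i} \ge 0 $$

Maximizing the margin (minimizing \(\left\| {\mathbf{w}} \right\|\)) controls the structural risk, while the slack variables \(\xi_{i}\), weighted by the constant C, penalize misclassified or margin-violating samples; an MLP trained by backpropagation, in contrast, minimizes only the empirical error term.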

This paper introduces an American Sign Language alphabet recognition method based on hand gesture recognition to help deaf and hard-of-hearing people. It presents geometrical features of the hand for achieving more reliable recognition. The literature on depth-based hand gestures in sign language recognition systems is then reviewed. Next, the research methodology and procedure are described: the signer’s hand is segmented and the issues that arise are discussed; the level set method is implemented and its results reported; features are extracted according to the hand’s geometrical properties; and the support vector machine (SVM) algorithm classifies the extracted features to recognize the performed gestures. The implementation step is then expounded. Finally, the proposed SVM-based method is compared with two other classifiers, K-nearest neighbour (K-NN) and decision tree (DT). The system is evaluated and tested, the accuracy of the proposed method is presented in charts and tables, and errors due to wrong recognition are reported.

The idea behind this work is that users perform the desired signs while the proposed system detects them; the detected signs can then be converted to sound or text for hearing people. What distinguishes this research from other methods is its depth-based segmentation and geometrical features. It could be deployed with a depth camera embedded in a cell phone: the depth camera makes background subtraction easier, whether the background is simple or cluttered. Moreover, the geometrical features are independent of the orientation, location, and position of the hand, so emotional signing poses no problem for the recognition process. There is natural variability in executed signs because the hand occupies different positions for the same sign, and the observations are error-prone; a method other than exact feature matching is therefore needed, one that does not depend on the fingers’ positions. The system could also be deployed in public places such as airports and libraries, in educational settings such as universities, and in conferences or other scientific assemblies.

After this introduction to the American Sign Language alphabet recognition system, related work is reviewed in Sect. 2. Section 3 describes the research methodology and the depth-based geometrical features procedure, defining the dataset used, the segmentation method, the proposed feature extraction methods, and finally the classification step. Experimental results and discussion are given in Sect. 4. The paper ends with a conclusion and suggestions for further research that may make hand gesture recognition easier to apply in recognition systems.

2 Related works

There are several challenges that we try to solve; complex backgrounds and lighting conditions are the most important among them. The distance between the user and the Kinect camera during image capture can be considered a limitation of this research; ordinary cameras avoid this constraint but provide no depth information. The process is very sensitive to hand movements under changing illumination, which may cause some parts of the hand to be occluded by others. The two letters ‘J’ and ‘Z’ are motional signs, and it is preferable to remove them from the static hand posture recognition task: they are very similar to ‘I’ and ‘G’, respectively, and share similar features, which confuses the classifier.

Limitations and constraints of existing vision-based methods have led to unsatisfactory results in previous research. Object recognition in cluttered scenes, the requirement that the signer wear long sleeves, and the need for a motionless head or face are some of these restrictions. Likewise, steady hand movements, a stable body pose and location, a predetermined initial hand position, and a restricted vocabulary are other limitations discussed in this field.

Lee et al. (2013) described a computer-vision-based method for hand posture recognition and its application on an iPhone running iOS. The algorithm used YCbCr images [19] to locate skin regions and eliminated noise caused by slanted hand postures. An ANN was then used for sign recognition and deployed on the device. The recognition accuracy was 89% for moving hand postures and 94.6% for static hand postures, but skin detection was affected by the illumination conditions of the environment, causing low accuracy in some states and orientations.

The feature extraction step is one of the crucial steps in every recognition system. There is a large and diverse collection of feature extraction methods, each with advantages and disadvantages, such as the scale-invariant feature transform (SIFT) [8] (Gurjal and Kunnur, 2012), wavelet moments [7], histograms of oriented gradients (HOG) [24,25,26], and Gabor filters (GF) [1, 30]. These techniques are very robust in the recognition process, but only for a small number of simple hand postures [11]. For example, Dardas and Georganas [8] obtained an accuracy of 96.23% for recognizing six signs using SIFT features and an SVM classifier. Pugeault and Bowden [30] recognized 24 static ASL alphabet signs using Gabor filters, reporting a mean accuracy of 75%; moreover, their method had a high confusion rate of 17% between similar signs such as "r" and "u". In short, these methods are usually unable to achieve the desired accuracy when classifying a large number of ASL signs or their variations.

In addition, Dominio et al. [10] presented multiple depth-based descriptors, including features such as fingertip distance and elevation, the curvature of the hand’s contour, and properties of the palm region. They achieved an accuracy of 93.8% with an SVM classifier on an experimental set of 12 static and digit signs of the ASL alphabet. Liang et al. [20] improved per-pixel hand parsing with a distance-adaptive feature selection scheme and superpixel-partition-based Markov random fields (MRF); the improved algorithm raised per-pixel classification accuracy from 72 to 89%. These methods recognize only a small number of simple postures (fewer than 15), covering ASL digits and custom signs that form only a small portion of the ASL alphabet signs.

Changes in lighting conditions have a negative effect on recognition tasks because of shadows and other undesired effects on the objects [6] (Kishore and Kumar [18, 34]). Furthermore, recognition is more difficult against a cluttered background than a plain one [29]. Compared with body or skeleton recognition, recognizing the hand or another specific part of the body is a more sensitive task: other objects in the scene can cause occlusion and, consequently, wrong detections. These issues have an important impact on accuracy. To build a system that works in both simple and cluttered backgrounds, indoors or outdoors, under different lighting conditions, a new approach is necessary.

Most previous research is signer-dependent [6, 32]; in other words, the extracted hand features in earlier recognition systems depend on the position or direction of the signer’s hand [27, 33]. As a result, recognition works correctly only for a specific user and fails for generic users. Features that are independent of the user’s hand shape, orientation, location, position, and direction are therefore highly desirable. Moreover, much previous research used fingertips as features [20]; their main weakness is that fingertips can be occluded by other fingers. There is natural variability in executed signs because the hand occupies different positions for the same sign, and if the observations are error-prone, a method other than exact feature matching, one that does not depend on the fingers’ positions, is needed.

Kisel’ák et al. [16] introduced a new activation function, the ‘scaled polynomial constant unit activation function’ (SPOCU), for medical images in cancer detection. This novel activation function relates to complex patterns through the phenomenon of percolation and can thus outperform previously introduced activation functions such as SELU and ReLU. Discriminating between mammary cancer and mastopathy tissue plays a crucial role in clinical practice, and in that setting a more precise activation function that can capture the tissue and its complexity is necessary. In our case, however, such an activation function would only increase computational time.

This study uses SVM for classification because of its clarity and simplicity, and because it applies readily to a wide variety of problems, whereas approaches such as decision trees are not as easily applied across diverse problems. As Hinton (2008) noted, the SVM generalizes well on big datasets. Since a big dataset requires a complicated model, a full Bayesian framework is computationally very costly; in contrast, the SVM is faster while still generalizing well. Furthermore, despite using a very large set of nonlinear, task-independent features, the SVM has a clever way of avoiding overfitting.

3 Depth-based geometrical features in hand recognition

3.1 Dataset

Two separate datasets are employed in this research. The first is the dataset collected for this research, called DGSLR; the other is a standard dataset. For the DGSLR dataset, three novice users of Sign Language, one man and two women, took part in this study. They were asked to sit in front of the Kinect camera and perform the signs, repeating each letter five times.

After the preparation step and after teaching the signs to the signers, images were captured with the Kinect Explorer (WPF) application at 30 frames per second. In this colour image capturing application, the hand is highlighted in a distinct colour based on its depth.

The capturing process was performed against both plain and cluttered backgrounds under varying illumination. As Fig. 1 illustrates, the other objects in the cluttered background do not interfere with the detection procedure: farther objects are removed, and closer objects appear at a different depth from the user in the foreground. The hand is thus shown in a distinct colour in RGB mode (Fig. 1, left) and as a brighter region in depth mode (Fig. 1, right), and it is recognizable in both modes.

Fig. 1 Cluttered background in RGB and depth mode

To validate the data, a large standard dataset from the Centre for Vision, Speech and Signal Processing, University of Surrey (Pugeault and Bowden [30]), was used. Its images were captured from 9 people against backgrounds similar to those of the research dataset. The images gathered by the Kinect are depth-based only. There are more than 400 repetitions of each sign in different postures and directions, with the users changing their hand direction as well as their distance to the Kinect sensor.

Posture and gesture recognition methods can be divided into two types. The first uses Kinect (as in our work), Leap Motion, or other depth cameras to obtain depth information such as position. The second separates the gesture from the background by traditional methods and then extracts the apparent image characteristics of the posture with neural networks to perform recognition.

In the latter case, given the type of neural network (MLP), the learning paradigm (backpropagation), and the task at hand (pattern recognition), the ‘Fermi function’ (logistic sigmoid) can be used as the activation function. The kernel choice for the nonlinear SVM used in this study is discussed in Sect. 3.6.

3.2 Segmentation of the hand

Hand extraction is a crucial step in hand recognition systems because all subsequent processing is performed only on the segmented regions. The proposed segmentation scheme is based on depth data. In the scenario used in this research, users face the Kinect camera with their hands held in front of themselves. The hand therefore appears brighter than the other objects because of the depth information in the image: the body and other objects in the scene lie in deeper layers, so the hand is rendered in a different colour from the rest of the body. The distance between the user and the Kinect was 150 cm, and the lighting conditions varied during the signing process. A minimal sketch of this idea follows.
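The following MATLAB fragment illustrates the described scenario, assuming the raw Kinect frame is available as a matrix of millimetre distances; the file name, band width, and variable names are illustrative assumptions, not the authors' implementation.

```matlab
% Minimal sketch of depth-based hand segmentation (illustrative only).
% depthFrame: uint16 matrix of distances in millimetres, hand closest to sensor.
depthFrame = imread('depth_sample.png');          % hypothetical sample frame

validDepth = depthFrame(depthFrame > 0);          % zero means no depth reading
nearest    = double(min(validDepth));             % the hand is the nearest object
handBand   = 120;                                 % assumed hand thickness (mm)

% Keep only the pixels within the band closest to the sensor.
handMask = depthFrame > 0 & double(depthFrame) < nearest + handBand;

% Keep the largest connected component to discard stray near pixels.
handMask = bwareafilt(handMask, 1);
imshow(handMask)
```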

3.3 Morphological object dilation

The depth images obtained in this study contain some noisy points, so a post-processing procedure has been applied to improve them. These noisy points can result from hand movements or shaking during signing; the Kinect’s sensitivity to illumination conditions can also affect the images. A filtering operation could be applied to address this issue, but a review of filtering methods shows that they are commonly time-consuming (Chiang et al., 2013; Pal et al., 2014). In our depth images, on the other hand, there is no need to rectify the edges, and only some morphological operations are applied to smooth the binary depth-based image and remove the noisy points on the hand surface, as demonstrated in Fig. 2.

Fig. 2 The binary image before and after morphological operations

In the first step, all the images were resized to a 128-by-128 pixel matrix; a unified dataset of equally sized images allows for modifications in later stages if needed. The noisy points must be distinguishable from other black points, such as background points. Since the number of affected images in this study was small, the issue was resolved by a series of morphological functions in MATLAB, defined as follows.

The dilation of A by B is denoted \(A \oplus B\) and defined as:

$$ A \oplus B = \left\{ {z\left| {(\hat{B})_{z} \cap A \ne \emptyset } \right.} \right\} $$
(1)

where \(\hat{B}\) is the reflection of the structuring element B. In fact, \(A \oplus B\) is the set of pixel locations z where the reflected structuring element, translated to z, overlaps foreground pixels of A. In greyscale dilation, the structuring element has a height. The greyscale dilation of A(x,y) by B(x,y) is:

$$ (A \oplus B)(x,y) = \max \left\{ {A(x - x^{\prime},y - y^{\prime}) + B(x^{\prime},y^{\prime})\left| {(x^{\prime},y^{\prime}) \in D_{B} } \right.} \right\} $$
(2)

where D_B is the domain of the structuring element B, and A(x,y) is assumed to be − ∞ outside the domain of the image. To create a structuring element with nonzero height values, the syntax strel(sdom, height) is used, where height gives the height values and sdom the structuring element domain. Greyscale dilation is commonly performed with a flat structuring element (B(x,y) = 0); dilation with such a structuring element is equivalent to a local-maximum operator:

$$ (A \oplus B)(x,y) = \max \left\{ {A(x - x^{\prime},y - y^{\prime})\left| {(x^{\prime},y^{\prime}) \in D_{B} } \right.} \right\} $$
(3)
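As an illustration of the dilation defined above, the following MATLAB fragment applies the flat-disk case of Eqs. (1) and (3) to a binary hand mask; the structuring element size is an assumption, since the paper does not state it.

```matlab
% Sketch of the morphological smoothing on the binary depth image.
A  = imread('hand_mask.png') > 0;   % hypothetical binary hand mask
se = strel('disk', 3);              % flat structuring element B (assumed radius)

dilated = imdilate(A, se);          % Eq. (3): local maximum over the disk
cleaned = imerode(dilated, se);     % dilation then erosion = closing; fills
                                    % small holes and noisy points on the hand
% Equivalent one-liner: cleaned = imclose(A, se);
```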

3.4 Feature extraction

After hand segmentation and post-processing of the depth hand images, the selected feature vectors are expected to represent the positions of the fingers and palm. The fingers must therefore be characterized, at least roughly, by a robust approach.

3.4.1 Hand geometry

The hand area (HA) and hand perimeter (HP) are the first feature descriptors, calculated by morphological operators. To compute the hand perimeter, the distance between each adjacent pair of pixels around the hand contour is summed. Discontinuous areas in the hand region may lead to unexpected results, so all noisy points should be removed first to obtain reliable area and perimeter values. HA and HP are evaluated with all fingers closed and with all fingers open; these give the minimum and maximum values, respectively, so all other signs fall within this range. A minimal sketch of this computation follows.
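This sketch assumes handMask is the binary mask produced by the segmentation and smoothing steps above (variable and file names are ours):

```matlab
% Hand area (HA) and hand perimeter (HP) from the binary mask.
handMask = imread('hand_mask.png') > 0;        % hypothetical binary hand mask
stats = regionprops(handMask, 'Area', 'Perimeter');
HA = stats(1).Area;        % number of foreground pixels in the hand region
HP = stats(1).Perimeter;   % summed distance between adjacent contour pixels
```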

3.4.2 Convex hull of the hand

The convex hull of the hand is calculated to obtain the desired geometric information. Note that the forearm was removed from the initial images, as it contains no useful information. In the hand images of this research, the convex hull is stored as an n-by-2 matrix describing the smallest convex polygon containing the hand region: each of the n rows holds the two coordinates of one vertex of the polygon circumscribing the hand. The concept of a convex polygon is introduced next.

For a nonempty set of points in the plane, the convex hull is the smallest convex polygon that contains all the points in the set. For instance, in Fig. 3 the polygon around the points is a convex hull, and the six points on its boundary are called ‘hull points’.

Fig. 3 Convex hull of (left) a point set, (right) segmented hand

The convexity defects of the hand have geometric properties that can be used as features in the proposed system. The area of the convexity defects, CDA, was computed with an algorithm similar to that of the convex hull. Likewise, the number of convexity defects reflects the number of open or closed fingers: the empty spaces between open fingers are convexity defects, so the number of these spaces characterizes certain signs in the classification step, which is very useful for designing a reliable recognition system. A sketch of these computations is given below.
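The following MATLAB sketch is our reconstruction of the hull and defect measurements described above (CHA, CHP, CDA, and the defect count); it assumes a single-component binary hand mask and is not the authors' exact code.

```matlab
% Convex hull and convexity-defect features from the binary hand mask.
handMask = imread('hand_mask.png') > 0;          % hypothetical binary mask
props = regionprops(handMask, 'Area', 'ConvexArea', ...
                    'ConvexHull', 'ConvexImage', 'Image');

HA  = props(1).Area;                             % hand area
CHA = props(1).ConvexArea;                       % convex hull area

hull  = props(1).ConvexHull;                     % n-by-2 vertex coordinates
edges = diff(hull([1:end 1], :));                % close the polygon
CHP   = sum(sqrt(sum(edges.^2, 2)));             % convex hull perimeter

% Convexity defects = hull region minus hand region (within bounding box).
defects = props(1).ConvexImage & ~props(1).Image;
CDA = nnz(defects);                              % total convexity defect area
cc  = bwconncomp(defects);
numDefects = cc.NumObjects;                      % spaces between open fingers
```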

3.4.3 Ratio feature

Another extracted feature is the ratio between the hand area, HA, and the area of the convex polygon enclosing it, CHA, the convex hull introduced above. This ratio is named the convex hull area ratio:

$$ \Re_{{{\text{CHA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convex hull area }}({\text{CHA}})}} $$
(4)

The ratio between the perimeter of the hand shape (HP) and the convex hull perimeter (CHP) is another useful parameter. Gestures with closed fingers typically have a smaller perimeter than those with some fingers open, and for them the ratio of hand perimeter to convex hull perimeter is close to 1. The following equation shows this relationship.

$$ \Re_{{{\text{CHP}}}} = \frac{{{\text{Hand perimeter }}({\text{HP}})}}{{{\text{Convex hull perimeter }}({\text{CHP}})}} $$
(5)

Similarly to the convex hull ratios, the ratio of the hand area (HA) to the convexity defect area (CDA) can be considered an informative feature for a reliable recognition system. This ratio is calculated as:

$$ \Re_{{{\text{CDA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convexity defect area }}({\text{CDA}})}} $$
(6)

3.4.4 Distance feature

The height and width of the signer’s hand are further measurable features considered in this research, as their values help characterize the hand postures. Since similar signs have similar height and width values, they can be grouped into the same class for more clarity in the classifier. For example, as shown in Fig. 4, the three signs ‘A’, ‘S’, and ‘T’ have close height and width values; the same similarity occurs between ‘R’ and ‘U’.

Fig. 4 Similar signs with close geometrical values

To compute the height and width of the hand, the edge of the hand must first be detected. The longest diameters of the hand in the vertical and horizontal directions are then computed using the eigenvalue and eigenvector concepts. Finally, the distance feature is obtained as the Euclidean distance between the end points of these diameters on the hand boundary.

In the first step, the hand boundary should be calculated. There are predefined functions that can be applied to images to detect object edges; MATLAB also includes several algorithms for computing an object’s boundary, but the detected edges may span several adjacent rows of pixels, creating a ‘thick’ edge as shown in Fig. 5.

Fig. 5 Thick edge consisting of several points: original image, detected contour, and a more detailed view

In statistics, a covariance matrix is a matrix whose (i, j) element is the covariance between the ith and jth elements of a random vector. Each element of this vector is a scalar random variable, with either a finite number of observed empirical values or a finite or infinite number of possible values determined by the joint probability distribution of all the random variables.

The covariance between two jointly distributed real-valued random variables X and Y with finite second moments is (Statistics, 2002):

$$ \sigma (X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] $$

where E[X] is the expected value of X. Since all probabilities p_i add up to one, p_1 + p_2 + ... + p_k = 1, the expected value equals the weighted average:

$$ E[X] = \frac{{x_{1} p_{1} + x_{2} p_{2} + ... + x_{k} p_{k} }}{1} = \frac{{x_{1} p_{1} + x_{2} p_{2} + ... + x_{k} p_{k} }}{{p_{1} + p_{2} + ... + p_{k} }} $$
(7)

In linear algebra, an eigenvector of a square matrix is a vector whose direction is unchanged by the corresponding linear transformation. If v is a nonzero vector such that Av is a scalar multiple of v, then v is an eigenvector of the square matrix A. There is a correspondence between n-by-n square matrices and linear transformations: the linear transformation of n-dimensional vectors specified by an n-by-n matrix A is

$$ Av = w $$
(8)

where

$$ w_{i} = A_{i,1} v_{1} + A_{i,2} v_{2} + ... + A_{i,n} v_{n} = \sum\limits_{j = 1}^{n} {A_{i,j} } v_{j} $$
(9)

If w is a scalar multiple of v, then:

$$ Av = \lambda v $$
(10)

where v is an eigenvector of the linear transformation A and the factor λ is its eigenvalue.

The approximate longest diameter of the hand, and then the line perpendicular to it, should be computed as shown in Fig. 6. The coordinates of the points on the hand contour were obtained by the boundary detection algorithm, so the centre of gravity is easily found. The covariance matrix of the contour points is then computed, and the direction and length of the longest diameter are obtained from its eigenvalues and eigenvectors, as sketched after Fig. 6.

Fig. 6 Height and width of the hand
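A sketch of this eigen-based axis estimation in MATLAB, assuming the binary hand mask from the earlier steps; the variable names are ours.

```matlab
% Longest hand diameter via the covariance eigen-decomposition.
handMask = imread('hand_mask.png') > 0;   % hypothetical binary hand mask
B   = bwboundaries(handMask);             % cell array of boundary point lists
pts = double(B{1});                       % k-by-2 [row, col] contour coordinates

centre = mean(pts, 1);                    % centre of gravity of the contour
C = cov(pts);                             % 2-by-2 covariance matrix

[V, D] = eig(C);                          % columns of V are eigenvectors (Eq. 10)
[~, idx]  = max(diag(D));                 % eigenvector of the largest eigenvalue
mainAxis  = V(:, idx);                    % direction of the longest diameter
widthAxis = V(:, 3 - idx);                % perpendicular direction (hand width)
```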

3.5 Feature vector structure

All computed features for both the DGSLR and standard datasets were saved in two repositories as CSV (comma-separated values) files, which we handled in Microsoft Excel for convenience. The first repository, belonging to the DGSLR dataset, comprises three sheets, one per user; its rows and columns represent the letters and features, respectively, and the last column holds the label of each sign, from 1 to 26. Under the leave-one-out approach explained in the classification section, one person is held out for testing and the others are used for training. The second repository corresponds to the standard dataset and consists of 26 sheets, each belonging to a specific sign. Here we adopted a regular 70/30 split: 70% of the images were used for training and 30% for testing.

3.6 Classification

The last step of the proposed recognition system applies an appropriate machine learning method to classify the features extracted in the previous step and thereby recognize hand gestures. In this research, a multi-class one-versus-one SVM classifier is used, built from a set of n(n − 1)/2 binary SVM classifiers that test each gesture against every other (for n = 26 signs, 325 binary classifiers). Each binary output counts as a vote for a certain gesture, and the gesture with the maximum votes is the result of the recognition process. This study uses a nonlinear SVM, and since different kernel functions are available in the nonlinear SVM structure, choosing a kernel based on prior knowledge of invariances, as suggested by Cawley and Talbot [5], is an excellent idea.

The Gaussian radial basis function (GRBF) kernel, one of the most common kernels, is used in this research; it is given by Eq. 11.

$$ k(x_{i} ,x_{j} ) = \exp ( - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} )\quad {\text{for}}\;\gamma > 0 $$
(11)

The GRBF kernel corresponds to a feature space of infinite dimension. The maximum-margin classifier is well regularized, and it is widely believed that the infinite dimensionality does not spoil the results (Jin and Wang [15]). The GRBF kernel is a good default for a nonlinear model: it can yield an efficient, accurate approach without materializing the huge, potentially infinite-dimensional feature vector. Its optimized run time is another reason to employ it in this research’s classifier; the GRBF execution time is bounded by O(n log n), where n is the number of training samples.
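Since the implementation uses the LIBSVM MATLAB interface (see Sect. 4.3.4), the training and prediction step looks roughly like the sketch below; LIBSVM performs the one-versus-one voting internally, and the parameter values shown are placeholders, not the tuned values from the paper.

```matlab
% Multi-class RBF SVM with LIBSVM (one-versus-one voting is built in).
% X_train: m-by-d double matrix of feature vectors; y_train: m-by-1 labels 1..26.
model = svmtrain(y_train, X_train, '-t 2 -c 100 -g 0.1');   % -t 2 = RBF kernel
[y_pred, accuracy, ~] = svmpredict(y_test, X_test, model);  % test-set accuracy
```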

In this research, there are two datasets of depth-based images of the Sign Language alphabet. The classification process is first applied to the DGSLR dataset, so the training set contains data from the three available users. K-fold cross-validation with K equal to 5 and 10 is used in the testing step. In K-fold validation, the collected data are partitioned into K subsets; one subset is used for validation and the remaining K − 1 for training. The procedure is repeated K times so that each subset is used exactly once for validation, and the average over the K runs is taken as the final estimate.

The two parameters C and γ of the RBF kernel are searched over a regular grid, with C taking the values 1, 10, 100, and 1000 and γ the values 0.001, 0.01, 0.1, and 1. As with other classifiers, for each pair of parameters the training collection is divided into two parts, N − 1 users for training and the remaining user for validation; we retain the 70/30 split between training and testing. The accuracy is assessed, and the testing process is repeated for different numbers of iterations. Finally, the parameter pair giving the highest accuracy is selected and applied to the SVM. A sketch of this grid search follows.
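This sketch uses LIBSVM's built-in cross-validation switch ('-v 5' makes svmtrain return the fivefold CV accuracy instead of a model); the grid values follow the text, everything else is illustrative.

```matlab
% Grid search over (C, gamma) with fivefold cross-validation.
Cs     = [1 10 100 1000];
gammas = [0.001 0.01 0.1 1];
best   = struct('acc', -inf, 'C', NaN, 'g', NaN);
for C = Cs
    for g = gammas
        opts = sprintf('-t 2 -c %g -g %g -v 5 -q', C, g);
        acc  = svmtrain(y_train, X_train, opts);    % CV accuracy in percent
        if acc > best.acc
            best = struct('acc', acc, 'C', C, 'g', g);
        end
    end
end
% Retrain on the full training set with the winning parameter pair.
model = svmtrain(y_train, X_train, ...
                 sprintf('-t 2 -c %g -g %g -q', best.C, best.g));
```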

To measure classifier accuracy, two statistical parameters, ‘sensitivity’ and ‘specificity’, were used. Sensitivity, or the true-positive rate, measures the proportion of actual positive samples that are correctly identified; it is complementary to the false-negative rate. Specificity, or the true-negative rate, measures the proportion of negative samples that are correctly identified; similarly, it is complementary to the false-positive rate.

A perfect predictor would be 100% sensitive and 100% specific, but in practice no perfect predictor exists; theoretically, every predictor has a minimum error bound called the Bayes error rate. The four possible outcomes of a binary decision are as follows:

  • True positive (TP) = correctly identified

  • False positive (FP) = incorrectly identified

  • True negative (TN) = correctly rejected

  • False negative (FN) = incorrectly rejected

Two equations can be formulated and derived from a confusion matrix as follows (Fawcett, 2006, Powers, 2011):

$$ \begin{gathered} {\text{Sensitivity}} = {\text{True}}\;{\text{Positive}}\;{\text{Rate}}\;({\text{TPR}}) = \frac{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Positives}}}}{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Positives}} + {\text{Number}}\;{\text{of}}\;{\text{False}}\;{\text{Negatives}}}} \hfill \\ = \frac{{\sum {{\text{True}}\;{\text{Positive}}} }}{{\sum {{\text{Condition}}\;{\text{Positive}}} }} \hfill \\ \end{gathered} $$
(12)
$$ \begin{gathered} {\text{Specificity}} = {\text{True}}\;{\text{Negative}}\;{\text{Rate}}\;(TNR) = \frac{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Negatives}}}}{{{\text{Number}}\;{\text{of}}\;{\text{True}}\;{\text{Negatives}} + {\text{Number}}\;{\text{of}}\;{\text{False}}\;{\text{Positives}}}} \hfill \\ = \frac{{\sum {{\text{True}}\;{\text{Negative}}} }}{{\sum {{\text{Condition}}\;{\text{Negative}}} }} \hfill \\ \end{gathered} $$
(13)

These statistical parameters can be represented in the confusion matrix as shown in Table 1.

Table 1 Statistical parameters in confusion matrix to measure the classifier accuracy
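Equations 12 and 13 extend to the multi-class case by treating each sign one-versus-rest; a minimal MATLAB sketch, assuming y_test and y_pred hold the true and predicted labels:

```matlab
% Per-class sensitivity and specificity from the confusion matrix (Eqs. 12-13).
CM = confusionmat(y_test, y_pred);     % 26-by-26 for the ASL alphabet
TP = diag(CM);
FN = sum(CM, 2) - TP;                  % row sums give actual class totals
FP = sum(CM, 1)' - TP;                 % column sums give predicted totals
TN = sum(CM(:)) - TP - FN - FP;

sensitivity = TP ./ (TP + FN);         % true-positive rate per sign
specificity = TN ./ (TN + FP);         % true-negative rate per sign
```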

4 Experimental results and discussion

The experiments were divided into two categories: our own dataset and the standard dataset. The latter experiments were performed on a gesture dataset from the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK, allowing comparison with state-of-the-art techniques. A number of practical tests were performed to evaluate the proposed methods and to compute the system’s accuracy under different parameters. Using a larger dataset is expected to yield more accurate results: since the proposed method is independent of the size, angle, and rotation of the hand, increasing the dataset size leads to better learning and ultimately higher accuracy.

Some geometrical features were used as parameters:

  • hand area (HA),

  • hand perimeter (HP),

  • A convex hull, stored as an n-by-2 matrix of vertices, determines the smallest convex polygon containing the hand region. The area of the convex polygon (CHA) and the perimeter of the convex polygon (CHP) are therefore considered as further parameters.

  • The area of the convexity defects is computed by an algorithm similar to that of the convex hull. The number of convexity defects reflects the number of open or closed fingers: the empty spaces between open fingers are convexity defects, so the number of these spaces characterizes certain signs in the classification step. The area of the convexity defects of the hand (CDA) is another parameter considered in this paper.

  • The longest diameters of the hand in the vertical and horizontal directions are further parameters, computed from the eigenvalue and eigenvector concepts.

These parameters and the ratios between them enter the computational process. The parameters under study depend directly on the type of sign, and since some signs are very similar, their values are very close together. A threshold has therefore been considered for each sign to avoid interference and overlap. For example, for the two signs ‘I’ and ‘J’, the convexity defects are very close together, as can be seen in the figure.

For a single signer repeating the same sign, a tolerance of about ± 0.1 per repetition is error-prone, since the values depend on the size of the signer’s hand. Across different signers performing the same sign, the tolerance increases to about ± 0.5 per iteration.

The tolerance differed among the other parameters, so a different tolerance was considered for each parameter.

4.1 Data collection

After collecting the desired data from both the DGSLR and standard datasets, the signer’s hand must be separated from the rest of the body and the other objects in the scene. The proposed segmentation approach relies on the depth-image property. The greyscale images were converted to binary mode using Otsu’s thresholding algorithm (Batenburg and Sijbers [3]), as described in the previous section. Some samples of the experimental results are shown in Fig. 7.

Fig. 7 Hand segmentation
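A minimal sketch of this binarization step, assuming a greyscale depth frame (the file name is illustrative); graythresh implements Otsu's method in MATLAB.

```matlab
% Otsu thresholding of a greyscale depth image.
G     = imread('depth_gray.png');   % hypothetical greyscale depth frame
level = graythresh(G);              % Otsu's method: optimal global threshold
BW    = imbinarize(G, level);       % binary hand mask
```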

It is clearly observed that there is no need to trace the hand or determine a bounding box around the hand region. In addition, there is no difference between the left and right hand, because the coordinates of the hand location are unimportant; the hand can be segmented based on the illumination intensity alone. For more clarity, the segmented hands were cropped and zoomed, as shown in Fig. 8.

Fig. 8 Hand segmentation

To separate the wrist from the forearm, the hand contour was computed and a circle inscribed at the palm centre was drawn. The longest diameter of the hand was calculated from the eigenvector and eigenvalue of the hand image. Then, the line perpendicular to the longest diameter and tangent to the inscribed circle was plotted, as explained in detail in the previous section. The green star marks the tangent point between the inscribed circle and the perpendicular line (hand width) at the lowest point of the circle. Figure 9 shows some experimental results for a non-expert user.

Fig. 9 Removed forearm

To recognize the hand position, the level set method (LSM) was employed for its low computational cost and high speed (Gonzalez et al. [12]). It was applied to the signer’s image to recover parts of the hand that had been missed: as described in the previous section, some parts of the hand may be lost depending on the illumination direction and the position of the hand. The hand could be segmented by defining a set of arbitrary points around the hand region. Some experimental results of the level set method are highlighted in Fig. 10.

Fig. 10 Comparison between the Kinect and level set segmentation, a depth image, b Kinect segmentation, c LSM execution, d LSM segmentation
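MATLAB's activecontour function offers a level-set-style evolution (Chan–Vese model) and can stand in for the LSM step described here; the initial region and iteration count below are assumptions, not the authors' settings.

```matlab
% Level-set-style refinement of the hand region with an active contour.
G    = imread('depth_gray.png');            % hypothetical greyscale depth frame
init = false(size(G));
init(100:200, 120:220) = true;              % assumed rough region around the hand

refined = activecontour(G, init, 300, 'Chan-Vese');  % 300 evolution iterations
imshowpair(G, refined, 'montage')
```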

4.2 Feature extraction

The next step is extracting features from the segmented hand. These features will be used in the classification step for recognizing the performed gestures.

4.2.1 Hand geometry features

The geometrical properties of the hand are reliable features for hand gesture recognition systems because properties such as area and perimeter are invariant to rotation and to changes in the hand’s location. The signer may move slightly, the signer’s hand may shake and change its position, or one signer might use the right hand for some signs and the left for others. Table 2 shows HA and HP for the three signers in the DGSLR dataset.

Table 2 The area and perimeter of the hand for three different signers

4.2.2 Convex hull of the hand

The convex hull of the 2D depth-based hand shape was computed using interpolation and computational geometry routines. It can serve as one building block of existing descriptors for hand posture (Pedersoli et al. [28]). All the binary segmented hand images were resized, and the convex hull function was then applied to them. Some instances of the results are shown in Fig. 11.

Fig. 11 Convex hull of the hand shape

As can be observed in the figure above, similarity between signs may lead to similar convex hull polygons around them, as with the first and fifth signs, which represent ‘A’ and ‘E’. The same similarity is observed in Fig. 11 between the fourth and last signs, ‘D’ and ‘R’. Meanwhile, a small difference in geometrical features such as the area and perimeter of the convex polygon is acceptable for this classification process.

The convex hull area and perimeter are abbreviated CHA and CHP, respectively; their values for the DGSLR dataset are presented in Table 3.

Table 3 The area and perimeter of the convex hull for three different signers

4.2.3 Convexity defects of the hand

A practical way of estimating the shape of a specific object is to calculate its convex hull and then its convexity defects. As mentioned, the convexity defects are the parts of the convex hull that do not belong to the object itself. There are several ways to compute the convexity defects of an object (Keskin et al. [17]). Some experimental results of the procedure applied to obtain the convexity defects of the hand are illustrated in Fig. 12.

Fig. 12 Convexity defects of the hand shape

A great deal of informative data can be extracted from the convexity defects of the hand, as shown in Fig. 12. For some signs, such as ‘F’ (the fourth sign in the figure from the left), the number of open fingers can be read off by counting the convexity defect spaces between the fingers. Each space between two fingers contains one point that belongs to the hand and has the maximum distance to the convex hull; the number of these points also helps characterize the hand shape in the posture. The area computation then proceeds as in the convex hull process. The results of this procedure are shown in Table 4.

Table 4 The area of the convexity defect for three different signers

4.2.4 Hand ratio

Another useful feature considered in this study is the ratio of the hand shape’s area and perimeter to those of the convex hull enclosing it; the corresponding ratio is also computed for the convexity defect areas. Equations 14, 15, and 16 show these relationships, and Table 5 shows some sample results.

$$ \Re_{{{\text{CHA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convex hull area }}({\text{CHA}})}} $$
(14)
$$ \Re_{{{\text{CHP}}}} = \frac{{{\text{Hand perimeter }}({\text{HP}})}}{{{\text{Convex hull perimeter }}({\text{CHP}})}} $$
(15)
$$ \Re_{{{\text{CDA}}}} = \frac{{{\text{Hand area }}({\text{HA}})}}{{{\text{Convexity defect area }}({\text{CDA}})}} $$
(16)
Table 5 The area and perimeter ratio for three different signers

4.2.5 Distance features

The approximate longest diameter of the hand can be calculated via the eigenvalue and eigenvector concepts, and the approximate width is obtained by drawing a line perpendicular to it. Figure 13 presents selected results of this procedure for signer ‘A’, both with and without the hand contour. The results have been zoomed to 200% for clarity.

Fig. 13 Eigenvectors of the hand

As can be observed in Fig. 13, the vectors are drawn from the hand contour to the hand centre. If they are extended to the opposite points on the contour, the approximate length and width of the hand are easily computed.

After trying all the signs in the dataset, it was observed that this procedure does not give good results for some signs, as shown in Fig. 14, so a complementary idea, explained below, was used to resolve this issue.

Fig. 14 Bad results of the length and width calculation

4.3 Classification

4.3.1 Discussion on DGSLR dataset

In the final commands of the SVM run, the average accuracy on the training and test sets was calculated. The experimental results were computed for each extracted feature individually and also for combinations of them. In addition, the programme was run for 1 and 10 iterations with fivefold and 10-fold cross-validation. Tables 6 and 7 show the accuracy in fivefold cross-validation during training and the final testing accuracy for one iteration on the DGSLR dataset, which consists of three users and 390 depth-based images. As Tables 6 and 7 show, the accuracy increases considerably when features are combined: when the convexity defect is used as a single feature, the training accuracy is 23.88% and the validation accuracy 21.88%, whereas combining it with the other features raises recognition markedly, reaching 80.64% in the training phase and 80.81% in the testing phase.

Table 6 Accuracy of single extracted features from the DGSLR dataset
Table 7 Accuracy of combination of extracted features from the DGSLR dataset

The following figures present the results as line charts for clarity. The training and testing phases give very close results for both single and, especially, combined features. Referring to Fig. 15, the convex hull feature has the greatest impact on accuracy: at the two points related to the convex hull, (CHA + CHP) and (RCHA + RCHP), the accuracy is close to 80%. The corresponding value for the distance feature is approximately 50%, which shows that the distance feature is also an important feature in this case.

Fig. 15 Accuracy rate in a single and combined feature vector

The overall accuracy for a single feature vector is 53.195% in the training phase and 50.48% in the testing phase. For combined features, the recognition accuracy in the training and testing phases is 84.807% and 85.005%, respectively. Figure 15 shows the overall results for the DGSLR dataset.

Figure 16 shows the confusion matrix of the 26 signs in the DGSLR dataset for the three users. As observed, similar signs incur recognition errors and are not detected correctly in all cases. For example, sign ‘M’ is predicted correctly at a rate of 89.5% and is predicted as sign ‘A’ at a rate of 16.7%.

Fig. 16 (left) Confusion matrix, signs A-M, DGSLR dataset with three users, (right) confusion matrix, signs N-Z, DGSLR dataset with three users

Sign ‘T’ is predicted correctly in 87.8% of cases but is detected as ‘N’ and ‘S’ in 5.3% and 11.2% of tested cases, respectively. Some signs, such as ‘B’ and ‘V’, are predicted correctly in all cases. Consequently, the overall recognition rate is 90.250%, which is acceptable in view of previous work in this field.

Overall, for one and ten iterations with fivefold and 10-fold cross-validation in this case study of the multiclass RBF SVM, the average accuracy in the training and testing phases is as presented in the charts of Fig. 17.

Fig. 17 (left) Training phase accuracy rate, (right) testing phase accuracy rate

4.3.2 Discussion on standard dataset

The multi-class SVM classifier was also applied to the standard dataset, with the following results. The standard dataset comprises a huge set of depth-based images of nine users with approximately 400 repetitions of each sign, i.e. about 10,400 images per user; here, just one user has been considered. As Table 8 shows, the highest recognition accuracy for a single feature belongs to the convex hull, with 58.99% in training and 59.65% in testing; the second highest belongs to the ratio between the convex hull and the hand, mirroring the DGSLR results. Table 9 shows the feature combinations, where the highest accuracy belongs to the combination of the distance, hand, and convex hull features.

Table 8 Accuracy of single extracted features from the standard dataset
Table 9 Accuracy of combination of extracted features from the standard dataset

The following charts present the recognition accuracy for single and combined features, showing the recognition trend on the standard dataset. As with the DGSLR results, the accuracy in Fig. 17 (left) is higher than in Fig. 17 (right), and the trend for the combined features is steadily upward.

Table 10 presents a recognition accuracy comparison between the proposed method and previous works that used the Kinect sensor. According to the table, Random Occupancy Pattern and Eigenjoints show high accuracy among the examined classifiers. Some histogram-based classifiers also yield positive recognition results, and the graph-based classifiers achieve accuracies above 70%. Neural-network-based classifiers are widely used in recognition processes, but compared with the other classifiers they show lower accuracy here. The hidden Markov model has shown high recognition accuracy in Sign Language applications, as also discussed in the literature. The recognition accuracy of this research, based on SVM and examined on the DGSLR and standard datasets, exceeds 90% on the DGSLR dataset and 96% on the standard dataset, which is a good result compared with previous research.

Table 10 Recognition accuracy comparison

4.3.3 Discussion and comparison on benchmark

The experimental results of this research are compared against the principal previous research that used the same standard dataset. A brief review of that research is given here, followed by tabulated comparisons of the practical results of the two studies.

In the principal research on which this study builds, depth-based detection of the user’s hand was performed using the OpenNI + NITE framework (Middleware, OpenNI) on a Kinect. This library provides functions for detecting hands in 3D space from the depth image produced by the Kinect sensor. The hand is then segmented from the depth image under the assumption that it forms a continuous region. For feature extraction, the hand shape features were based on Gabor filtering of the depth and intensity images. Learning and classification were carried out with a multi-class random forest, discussed in detail earlier. The random forest has good learning accuracy (Daugman [9]), can handle large feature spaces and large datasets, and has shown fast training. The workflow of that research is presented as follows.

Figures 18 and 19 show the confusion matrices for the detection of all signs in the benchmark research and in this research, using a combined feature vector.

Fig. 18 Confusion matrix of all signs in the dataset in benchmark research

Fig. 19 Confusion matrix of all signs in the standard dataset in the proposed research

Considering both confusion matrices above, it can be seen that for signs with similar postures, such as ‘A’, ‘B’, ‘M’, ‘N’, ‘S’, and ‘T’, the recognition rates are close together. For example, the recognition rate for the ‘A’ sign is 0.64 (64%) and for ‘E’ it is 0.63; the corresponding benchmark rates are 0.75 and 0.63. Similar signs can be detected wrongly, as happens with ‘Y’ and ‘L’ or ‘F’ and ‘W’. The prediction errors show similar patterns: the ‘A’ sign is wrongly detected as ‘M’ at a rate of 0.05 in the principal study and 0.03 in this research, while the ‘O’ sign is predicted as ‘C’ at a rate of 0.3 there versus 0.05 here. Some recognition rates have improved while others have declined, but an overall look at both figures shows that most rates have improved in the proposed research. Another notable point concerns the two signs ‘J’ and ‘Z’, which are motional and involve movement while signing. Since this research studies still images, these two signs have low recognition rates in the figures. The benchmark research removed them from its scope, whereas this research averaged over the different poses of each sign: while the signer performed the sign, images were captured one by one, and the average of their geometrical features was calculated.

Consequently, in the benchmark research, the best results were obtained for the two signs ‘L’ and ‘V’, with a 0.87 prediction rate, and the lowest for the ‘O’ sign with 0.13 and for ‘S’ and ‘M’ with 0.17. In the proposed research, the lowest rates are 0.35 and 0.39 for ‘J’ and ‘Z’, respectively, meaning that the two motional signs have the lowest recognition rates of all performed signs. The overall recognition rate in the benchmark research is 52.95%, whereas it is 66.07% in the proposed research; as mentioned before, this rate reaches 90.25% on the DGSLR dataset with three users and 96.85% on the standard dataset.

4.3.4 Discussion and comparison based on different classifiers

The results of two common classifiers, K-nearest neighbours (K-NN) and decision tree (DT), were obtained and compared with the SVM. To obtain better results, the signs were divided into categories of five signs each: A to E, F to J, …, U to Z, labelled 1 to 5, 6 to 10, …, 21 to 26, respectively. First, the SVM results with fivefold and tenfold cross-validation are presented in Tables 11 and 12.

Table 11 SVM by fivefold cross-validation in training phase
Table 12 SVM by 10-fold cross-validation in training phase

Table 11 shows the recognition accuracy for the training and testing phases in each class of the SVM with fivefold cross-validation. The average accuracy in each class is also presented, followed by the final training and testing accuracies.

Table 12 shows the recognition accuracy for the training and testing phases in each class of the SVM with tenfold cross-validation. The average accuracy in each class is also presented, followed by the final training and testing accuracies.

Comparing Tables 11 and 12 shows that the accuracy of the SVM with 10-fold cross-validation is higher than with fivefold.

Table 13 presents the recognition accuracy of the K-NN classifier with k = 10; the last two iterations in each class are shown as samples. For example, in the third class, labels 11 to 15 corresponding to the signs ‘K’ to ‘O’, the sign ‘K’ with label 11 is given as input and the classifier predicts it as ‘O’ (label 15); in the next iteration it is predicted as ‘N’ (label 14). In the same class, the two signs ‘L’ and ‘M’, with labels 12 and 13, are predicted correctly. The total accuracy is roughly 85%, which is lower than that of the SVM classifier.

Table 13 K-NN accuracy recognition, K = 10

Table 14 presents the recognition accuracy of the K-NN classifier with k = 20; again, the last two iterations in each class are shown as samples. For example, in the second class, labels 6 to 10 corresponding to the signs ‘F’ to ‘J’, the sign ‘F’ with label 6 is given as input and the classifier predicts it as ‘H’ (label 8), and the same in the next iteration; in this class, the sign ‘G’ with label 7 is predicted correctly in the second iteration but as ‘J’ in the first. The total accuracy is just over 84%, which is lower than that of the SVM classifier.

Table 14 K-NN accuracy recognition, K = 20

Table 15 presents the results of the DT, the next classifier. Its total recognition accuracy is about 81%, lower than both K-NN and SVM, but thanks to its simple structure it is widely used for classification tasks.

Table 15 DT accuracy recognition

Figure 20 shows the recognition rates of the three classifiers. The SVM classifier clearly achieves the highest accuracy compared with K-NN and DT. Unexpectedly, K-NN with K = 10 achieves a higher accuracy rate than K-NN with K = 20. Finally, the DT classifier has the lowest recognition rate of the three.

Fig. 20 Comparison between recognition accuracy rate of SVM, K-NN, and DT

Lastly, a comparison between this research and its benchmark is presented in Table 16. We used MATLAB and the LIBSVM library to develop the algorithms, whereas the benchmark used the OpenNI and NITE libraries. The segmentation, feature extraction, and classification phases were implemented differently, but both studies used multiclass classification because of the number of letters in the sign language alphabet.

Table 16 Method Comparison

5 Conclusion

We aimed to examine the accuracy of the proposed hand recognition technique on both the DGSLR and standard datasets, which contain samples of the American Sign Language alphabet. The effectiveness of the proposed techniques was first evaluated on the DGSLR dataset with three users, and acceptable recognition rates were obtained. The evaluation was then carried out on the standard dataset, achieving considerable and very promising results. Besides the experimental results, tabular analyses and discussions of the charts are reported, and a comparison between the benchmark research and the proposed research with their final results is investigated. Furthermore, two classifiers, K-NN and DT, were employed and their results compared with the SVM classifier.

Since there are 26 different signs in the Sign Language alphabet, a multi-class one-versus-one SVM classifier with 26 classes and an RBF kernel was used, validating each class. The accuracy of the proposed method was evaluated, and the procedure was repeated while varying each parameter (C, γ) for validation; the pair giving the best average accuracy was selected, and the SVM was then trained on the selected training set with these optimal parameters. The recognition process also uses multiple feature descriptors, i.e. a combination of the extracted features. Experiments were conducted on the collected and standard datasets, and the combination of extracted features reveals the superiority of the proposed method over existing work on this subject. The collected dataset was produced by three different users, two of whom were novices in Sign Language, with each sign repeated five times to improve accuracy; the standard dataset has more than 400 repetitions of each sign. The process was run for 1 and 10 passes over all data with fivefold and 10-fold cross-validation. A confusion matrix is used in the proposed machine learning process, permitting visualization of the algorithm’s efficiency.

The significant finding of this research is the realization of substantial improvements in Sign Language recognition accuracy. Combined features give better results than any single feature, and the distance feature makes a major contribution to the recognition rate. Evaluations on the collected dataset give a recognition rate of 90.25%, while on the complete standard dataset the proposed approach achieves an identification rate of 96.85%, the best overall rate reported so far on that dataset.

According to the confusion matrix visualizations obtained from the benchmark and the proposed research, in specific cases alternative techniques and combinations of machine learning algorithms provide higher Sign Language recognition accuracy. The aim of this work is a generalized Sign Recognition process: our proposed machine learning pipeline for a generalized Sign Language Recognition system, capable of operating in cluttered environments with varied lighting, improves on previous research that used the chosen dataset.

In this research, geometric features have been employed; new features such as hand keypoints for estimation and tracking could additionally be used to recognize multi-frame videos of gestures with deep neural networks. However, feature learning with deep neural networks is time-consuming and prone to overfitting, so this remains a direction for future work.