Keywords

1 Introduction

Sign Language plays a pivot role in the social life of speech and hearing impaired people. To exchange the information between normal people, the gestures of SL are translated into English words text or voice outputs by Sign Language Recognition (SLR) using HMI. Fingers and hand shapes possess maximum information related to HMI. The input forms of HMI can be gesture voltages or images. The common hindrances in image-based gesture recognition were complex backgrounds, fixed Field of View (FOV), more number of cameras required, overlapping of face expressions with hand movements and occlusion. The most preferred SLR is sensor-based because of cost effectiveness, portability and less power consumption. Sign language is not universal all over the world. There exist 135 legally approved Sign languages like Ukrainian Sign Language (USL), Japanese Sign Language (JSL), Arabic Sign Language (ArSL), etc. In this paper, ISL gestures are considered. ISL is a combination of verbal and nonverbal signs, a combination of single and double handed gestures and it has its own grammatical structuring style.

According to India’s National Association of the Deaf, there are about 1.3 million populations are speech and hearing impaired people. Among them the working percentage as main, marginal and non-workers are shown in Fig.1. It depicts that due to the inconvenience of expressing themselves they could not obtain good education and occupation facilities available in the society. Hence, an ISL-based prototype has been designed to excel the obstructions of the technology and the emotional issues of the speech and hearing impaired people.

Fig. 1
A bar graph plots number of disabled workers versus states in India. It includes bars for main, marginal, and non-workers. Kerala has the highest number of non-workers.

Source 2011 census

Disabled worker population state wise in India.

The other sections of this paper are organized as follows. Section II provides the related papers dealing with SLR based on sensors and stimulation results using Lab VIEW. Section III describes the components used in this setup, workflow of the glove and the generation of two datasets. Section IV presents the results of sensor over two datasets, variation between them and the classification procedure using MMSE algorithm and it discusses the recognition rate based on sensor fusion. Finally, Section V includes the conclusion of this study.

2 Literature Survey

The studies in the field of SLR started in the early 70’s itself. Researchers have developed glove-based systems for sign languages and other gesture control applications. A word recognition in time normalization using slope constraint and symmetric form algorithms [1]. The Markov model was used to recognize a 40-word lexicon for American Sign Language (ASL) images [2]. The gray scale images of the American Sign Language (ASL) by morphological filtering to classify hand gestures [3]. The gestures were recognized for the survey of feature extraction and acquisition for transitions and adoptions [4]. German Sign Language (GSL) was analyzed by intrinsic mode entropy by three native signers and differentials processed [5]. ISL was recognized for 10 images including ‘UP’ and ‘DOWN’ positions feature point method into text [6]. For the conversion of Unified Arabic Sign Language (UASL) to the American Sign Language (ASL), a dictionary was designed on the basis of a web-based application. It was presented using a graphical interface for 1400 signs and various categories as colors. These video sequences were captured at a width and height of 1920 × 1080 [7]. A set of 18 gestures were described and recognized using a 3-axis accelerometer and determined the corresponding actions by the concept of random projection [8]. A sign language recognition system based on Hindi language was designed with Cepstral features. The images were processed for two models processed using Support Vector Machine (SVM) and Neural Network (NN) and testes for best score efficiency [9]. The converted gestures of British Sign Language to text output using sensor gloves for single handed. Adithya et al. [10] designed a neural network with the images of finger spelling of ISL gestures [11]. The detection of the finger bending using piezoresistive sensors and by the method of MMSE text converted to speech [12].

The principle of surface electromyography signals was used in Bayesian classifier to manipulate mobile phone working [13]. The multisensory techniques developed to train a classifier to learn hand positions and orientation with a higher pose of reduced error by 30% [14]. The Arabic Sign Language alphabets captured by both image processing and sensor gloves and both were compared on the basis on cost, efficiency, user preference, etc. [15]. An Urdu Sign Language system was designed which has been captured for 9 images whose features have been extracted and classified by PCA [16]. A LabVIEW–VI-based glove was designed and developed for paralyzed patients based on their hand and eye movements. Commands translated were water, food and pain [17]. A rehabilitation robot was proposed for hand for data acquisition from five hands in the VI [18]. A robotic arm was implemented for limb disabled people using servo motors, working based on commands from the flex sensor on the other hand for lifting and other purposes [19]. A gesture controlled vehicle based on Ultrasonic sensors. Any obstacle on the path detected then vehicle would go around the obstacle [20]. A numerically evaluated rehabilitation device for specific set of four exercises were designed from a set of 12 subjects. Based on the degrees of freedom for the shoulder, elbow and wrist joints, different flexions were planned accordingly, to improve an automated therapeutic treatment [21]. Levenberg Fractional Bat Neural Network (LFBNN) was proposed for voice-based recognition system. Evaluation is based on the metric such as FAR, FRR and accuracy achieved till 95 % for any voice-based operating models [22]. A sign language recognition system based on Hindi language was designed with Cepstral features. The images were processed for two models processed using Support Vector Machine (SVM) and Neural Network (NN) and testes for best score efficiency [9].

Summarizing the blocks in the recognition techniques were as follows. The gesture image-based processing, only alphabets detected, implemented for the other countries’ sign languages, words were detected concatenating each letter’s gesture. The general hand gestures were recognized by application based and mostly single handed symbols. In some papers the types of movements were limited to up and down positions only. To overcome the above mentioned problems, this glove is designed and developed.

3 System Design Flow

3.1 Hardware Design

In this SLR an Sign and Sound (SS) glove is designed for both hands as shown in Fig. 2. In the first version of SS Glove, it is equipped with flex sensors stretched over the ten fingers to measure the bending degree of the three joints in the finger phalanges (Distal, Medial and Proximal). These flex sensors can accommodate within age group varying from 18 to 50 years and distinct sizing of fingers. In the second version of SS glove it is included with accelerometer to measure the rotation, orientation and tilt positions of both the hands. The 3-axis accelerometer is located over the metacarpal bones and carpal bones to obtain the entire hand inclinations. These accelerometers are light in weight and accurate in angle outputs. Sensors are stitched to the glove in a fit manner, as it remains fixed even during hand movements. The two different types of flex sensors employed in SS glove are 2.2 in.’ flex sensor (only for baby finger) and 4 in. single axis digital flex sensors are shown in Fig. 3. The bending resistance exhibited by the 2.2 in. is 10 K ohms and 4.5 inch ranges from 60 to 110 K (Source:www.sparkfun.com). Each 4.5 in. flex sensor cost 3800 Rupees and 2.2 in costs 680 rupees.

Fig. 2
A photograph of gesture recognition gloves without accelerometer.

Glove (First version without accelerometer)

Fig. 3
2 photographs. First is the flex sensor, 2.2-inch model. It has horizontal stripes across the band. Second is the digital flex sensor, 5-inch model.

Two variations in the flex sensor

Sensor glove selected material was neoprene nylon glove due to the following reasons:

  1. o

    Provides fine grip and good tensile strength to fingers attached with sensors.

  2. o

    Light in weight and durable for long term usage and frequent finger movements.

  3. o

    Easily stretch back to the original size when fingers are bent fully.

  4. o

    Absorbs sweat produced by the fingers and reduces the hazards of electrical shocks.

  5. o

    Does not get affected by any oily substances, as the glove repeals these substances.

  6. o

    On a large scale usage, Neoprene nylon glove does not provide any allergic to the user’s hand or finger.

The SLR system for this specialized glove has been divided into three units: a sensor acquisition unit, processing unit and voice module and LCD unit as shown in Fig. 4. The flex sensor and the digital flex sensor are operated at 5 V and 3.3 V @10 Hz respectively to produce an electrical output proportional to the bending of finger. The sensor data are preprocessed prior to feature extraction and then provided to SVM classifier for the recognition of ISL gestures. This SLR is mainly designed to recognize the 18 words of ISL as presented in Table 1. Gestures of the selected words according to the standardized ISL are illustrated in Fig. 5. These words are detected as the output of SVM and they are transferred to the text and voice units using ZigBee technology. The recognized words equivalent to gestures were displayed in LCD and corresponding voice outputs played through speaker. The power supply of 5V for the entire SLR system is provided by the rechargeable Lithium batteries.

Fig. 4
A flow diagram has 3 units of the S L R systems, sensor acquisition unit, processing unit, and output unit. Each unit has 2 components. The processing unit has a graph.

Overview of SLR system divided into three unit

Table 1 Word dataset
Fig. 5
A confusion matrix plots true class versus predicted class. It includes 18 variants with 3 of them underlined. These are help, suggestion, and question, with the value of 84 in each.

Confusion matrix of ISL word dataset

3.2 Experimental System Flow

Trained 24 subjects from speech and hearing impaired educational institutions have participated in the survey. The subjects taken into consideration were in the age group of 15 to 50 (ranging from school, college, trainers). The subjects should not have Upper Limb Disorders (ULDs): Carpal Tunnel Syndrome (CTS), tendonitis and Hand-Arm Vibration Syndrome (HAVS). Four datasets were formed variation in gender and signer.

  • First set (24 male participants signer dependent with varying age groups)

  • Second set (24 male participant signer independent with varying age groups)

  • Third set (24 females’ participant’s signer dependent with varying age groups)

  • Fourth set (24 females’ participant’s signer dependent with varying age groups).

All the subjects performed each word’s gesture repeated 6 times with a duration of 5 s for each word performed. The entire datasets sensor voltages are stored in memory card for further processing.

In the second version of SS glove flex sensors are combined with 3-axis accelerometer to acquire the inclinations of the hand. It is placed on the dorsal (i.e., back side of palm) side of SS glove.

ACC monitors angle, rotation and orientation of each gestures. ACC works on the principle based on a small mass over silicon surface suspended by beams which produces a force. This force displaces according to the movements of the MEMS accelerometer placed over glove and produces X, Y and Z axes measurement. ACC inclusion improves the accuracy of the SS glove without affecting cost and weight parameters.

4 Methodology

Four dataset’s results inferred that sensor outputs vary depending upon subject’s hand size. As for small size variations among results are less and for large hand sizes vice versa. Normalization of data avoids all these problems, subject’s sensor outputs are estimated to mean, standard deviation based on time domain approaches as in Eq. (1). These normalized sensor outputs for two ISL word gesture are shown in Table 2.

Table 2 Improvement in accuracy of words by sensor
$$\sigma = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} (x_{i} - \mu )^{2} }}{N}}$$
(1)

where σ is the standard deviation, μ is the mean, N is the number of subjects, xi current sensor output. Similar procedure is conducted for all words from each of the four datasets both flex sensor and accelerometer. Prior to all, sensors are calibrated at divergent positions. Flex sensors

are tested at no bend, half bend and full bend. ACC are tested at hand shapes which are inclined at up, down, tilting right and left to calculate the x, y, z axes measure. The flex sensor output is converted into electrical signal by voltage divide using a 10 k ohm resistor. In the 3-axis accelerometer the chip itself produces voltage output of X, Y and Z axes. These analog voltages values would be transformed into digital values. The process of digitization is done by analog to digital convertor (ADC) within the microprocessor. Hence, the digital output can be applied to observe variations in ISL word’s gestures. Including the difference taken from samples either from a single-user or a multi-user, as the selection of features is based on these factors. This process begins at the press of calibration button on the Arduino lily pad, indicated by blueLED.

4.1 Feature Selection

Words in the dataset, were selected on the frequency of usage by the speech and hearing impaired people on the day to day life. The words are categorized prescribing emotions, public communication, workplace, services, offerings, etc. To recognize these words, a unique differentiation among sensor outputs is required. Flexion degree after calibration, is measured for each word in the dataset. The voltage ranges of no bend (), half bend () and full bend (). Similarly, the ACC axis ranges from X-axis (), Y-axis () and Z-axis (). Flex sensor bends are indicated as no bend-FB0. Halfbend-FB1, full bend-FB2. ACC measures are represented as X-axis (XA), Y-axis (YA) and Z-axis (ZA).

These regions are exclusively listed for each word in the dataset as shown in Table 2. It was noticed that flex sensor exhibits similar outputs for words ‘Question’, ‘Month’, ‘Year’, ‘Family’, ‘Gift’, ‘Help’ and ‘Teacher’. To avoid this misclassification, ACC have been integrated into the SS glove.

Problems of same finger bending are overcome by estimating the inclination of the hand shape of the gesture by ACC. ACC yields three axes measurement wrt gesture hand positioning, palm parallel to ground indicates X-axis, perpendicular to ground indicates Y-axis and at an inclination to the ground indicates Z-axis. The words ‘Month’ and ‘Year’ mostly match in the finger bending, so to distinguish among them the X-axis of ‘Month’ fed an output of 3.45V and Y-axis of ‘Year’ fed an output of 4.01 V. The same pattern is observed to differentiate words like ‘Family’ and ‘Teacher’. Finally, some words ‘Gift’, ‘Help’ and ‘Question’ are yet to be differentiated by the estimated of two other parameters Mean absolute deviation (MAD) and Root Mean square (RMS). ACC exhibit X-axis for both words ‘Gift’ and ‘Help’, so these parameters are employed for further classification. The MAD and RMS of ‘Help’ were 3.57 V and 3.11 V and for ‘Gift’ were 2.99 V and 2.05 V, respectively. Hence MAD and RMS serve as support to complete the classification with high accuracy.

4.2 Gesture Classification

In this glove, the gestures are classified into 18 classes using Support Vector Machine (SVM). An SVM is a supervised machine learning for multiple data classification, since it is theoretical and computational efficient. With the labeled support vectors (among the samples) and optimal hyperplanes, the decision of classes are taken. Raw data are converted into 1D support vectors using kernels and 2D using transforms. The decision boundary should consider all the points in the dataset as stated in Eq. (2):

$$y_{i} \left( {w^{T} x_{i} + b} \right) > 1\,{\rm{for}}\,i = \, 1,2 \, \ldots . \, N$$
(2)

where x is the sample in dataset, y is the label assigned to dataset, w is the width between hyperplanes and support vectors, b is the proportionality constant. The optimal decision boundaries are selected by satisfying the condition as in Eq. (3).

$${\rm{Minimize}}{ 1/2 }\left| {\left| w \right|} \right|^{2}$$
(3)

The smallest distance between the support vectors is calculated for the entire sample size and minimized for optimization purpose. For each correct classified word label y yields +1 and incorrectly classified word yields −1 known as neutral class.

In this SLR, n-classes of SVM classifier are required to distinguish one class from the other. These classifiers average is estimated to a parameter called confidence value. Higher the confidence value (greater than 0.5) indicates the higher probability of selection of the corresponding class.

4.3 Feature Extraction

For the ISL words classification, a feature vector X has been considered which is comprised of 16 built in features. Features were flex sensors over the ten fingers, 3 readings from each ACC over two hands. This feature set is unique for 18 words in the dataset. On collective basis, the total amount of sample size is given as per Eq. 4.

$$\begin{aligned} {\text{No}}.\,{\text{of}}\,{\text{samples}}\,{\text{size}} & = ({\text{No}}\,{\text{of}}\,{\text{subjects}} \\ & \quad \times {\text{sampling}}\,{\text{rate}} \times {\text{no}}{\text{. of}}\,{\text{signs }} \\ & \quad \times {\text{time}}\,{\text{duration}} \times {\text{iteration}}) \\ & = \left( {24 \, \times 100\,{\text{Hz}} \times 18 \times 10{\text{ sec}} \times \, 6} \right) \\ & = 25,92,000 \\ \end{aligned}$$
(4)

Hence, 25,92,000 sample dataset are trained using SVM in a similar procedure for all words and calculate the accuracy, which is illustrated using a confusion matrix as in Fig. 5. A confusion matrix is a diagrammatic representation of the performance of the classifier and computes performance metrics as mentioned.

  1. (1)

    Accuracy = TP + TN/(Total)

  2. (2)

    Misclassification Rate = (FP + FN)/total

  3. (3)

    Precision (P) = TP/(TP + FP)

  4. (4)

    Recall (R) = TP/(FN + TP)

  5. (5)

    F-Score = 2[(P × R)/(P + R).

5 Results Anddiscussion

5.1 Experimental Results

Words were classified into three groups: Home, Public and Work as in Table 1. For the purpose of observing the notable variation, for instance two words ‘Children’ and ‘Gift’ are considered. The gesture for the word ‘Children’ is that both hands are half bend and with both palms facing toward ground side as each finger is bent to form a triangular shape.

The ‘Gift’ word gesture was indicated with both hand fully bent and placed over each other in the horizontal direction. Fig. 6a depicted a typical example of the flex sensor value plotting for the words ‘Children’ (blue) and ‘Gift’(orange). The regions which are analyzed are similar to those mentioned in the Table 2. Large and unique variations could not be observed from the flex sensor output which leads to misclassification of words. To solve this, sensor fusion technique has been employed by the inclusion of ACC as shown in Fig. 6b, which yields clear distinction as increase in the Y-axis of the ACC for the word ‘Gift’. By similar ways the misclassification is reduced and the accuracy rate has been increased by 30.2% from the first version to the second version for words ‘Question’, ’Gift’, ’Help’ and ‘Year’ in Table 2.

Fig. 6
A set of 2 line graphs plot slopes for children, and gift. Graph a, is for flex analysis and has the slope for children above gift. Graph b, is for A C C analysis and has a declining slope for children.

Sensor plot for ‘Children’ and ‘Gift’ a Flex sensor variation b ACC variation over both hands

Still there were minor misclassifications present in the second version. This was due to the smaller difference in flex sensor and ACC voltage outputs for the most subjects. Thus, half bend was interpreted as no bend and X-axis was combined with Y-axis reading. For instance, while considering words ‘Answers’ and ‘Teacher’ both the similar regions for flexion vector of index and middle fingers and in the same X-axis which produce false classification. To invariant these rates two additional parameters were trained in SVM.

Sensor outputs have heterogeneous features based on time domain, frequency domain and time-frequency domain. From these time-frequency domain characteristics are more complex and time consuming than time-domain characteristics. Therefore, time-domain features have been preferred as they are more advantageous. The first parameter determined was Mean Absolute Deviation (MAD) is a method of calculating average distance from each output to mean. MAD tells how the values in the dataset have been spread which have been analyzed by the Eq. (5).

The second parameter computed based on sensor output was

$$\sigma = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} (x_{i} - {\text{mean}})^{2} }}{N}}$$
(5)

Root Mean Square (RMS) is the square root of the mean or average value of the squared function of instantaneous value of any AC sinusoidal and non-sinusoidal signals. RMS value can be calculated by Eq. (6).

$$r_{MS} = \widetilde{{\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}^{2} }}$$
(6)

From the Table 2 it is observed that, sensor output modifies as per the ISL gestures. These outputs are presented along with their standard deviation values. On examining, the gestures can be judged and varied. This was done by the variation in the assessment of the Mean Absolute Deviation (MAD) and Root Mean Square (RMS). Similar approach can be done for all other word’s ISL symbol and gesture can be recognized to their corresponding words.

6 Conclusion

In this study, the specialized SS Glove has been designed and implemented successfully as a smart device to recognize the ISL words using SVM. A voice module with speaker translates the recognized gesture into voice in a time and portable efficient manner. The device was tested personally with the subjects and collected feedback in terms of accuracy, comfort and cost. Depicted in Fig. 6. This glove would serve as an aid to communication from speech and hearing impaired people of age ranging from 15 to 50 years. Kinesics of ISL is conquered by both the right and the left hands. Sensor fusion of Flex sensor and ACC are used to improve accuracy from 74.12% to 97.2 %. Some of the demerits of glove-based system would be to make the gesture movement cumbersome and weigh high due to sensors, which can be rectified in future versions. The possible extension of this proposed glove would be miniaturization of the entire device, inclusion of sentence structure of ISL and experiment gestures of other SL as ASL, JSL, etc.