1 Introduction

Nowadays robots can assist in various fields, for example supporting doctors during critical surgical operations, serving on the battlefield, and helping with household tasks. Such activities require interaction between humans and robots. Many algorithms and methodologies have already evolved, and much research is ongoing, to make robots/machines as intelligent as human beings. Both gesture and speech are good ways of establishing communication between a human and a robot [1, 2]. Gestures can be formed by hand and head movements, or sometimes by full-body movements, and are mostly used by the hearing-impaired community, which relies on sign language to communicate. In this work hand gestures have been taken into account. Previously, gestures were recognized using data-glove-based capturing techniques, in which sensors capture the joint-angle values of the hand and these angle values are used to identify hand movements. This method is not practical for recognizing hand gestures, and therefore vision-based gesture recognition techniques have evolved. Vision-based techniques capture gestures well because of advances in capturing devices such as cameras and in the processing of high-quality images, and they are also easy to handle. In today's scenario, sign language recognition attracts interest in systems that can make communication straightforward and easy for the hearing-impaired community. Every country has its own sign language: America has American Sign Language, Britain has British Sign Language, and in the same way India has its own sign language, called Indian Sign Language (ISL). ISL is the best way of establishing communication with the hearing-impaired community in India. In ISL, dynamic and static hand movements are performed with a single hand as well as with both hands. After spending considerable time trying to understand this language, we found that the beauty of communicating through sign gestures depends on the way they are made; sometimes hand gestures involve other body parts too. The ISL library helped us greatly in understanding ISL signs.

This paper is an extension of our previous work [25], in which we applied the MFCC technique to extract hand features. There the dataset was captured in a very structured environment (homogeneous lighting, black background) and we concentrated only on the hand portion of the full human body. Owing to these constraints the dataset is linear in nature and does not require any pre-processing. The MFCC technique provides good recognition accuracy when applied to such a structured dataset, but its performance decreases as real-time complexity increases: variations in background, lighting, shadow and illumination, and minor changes in the shape and colour of the hands, are difficult to discriminate, and performance also depends on the speed of the gesture. The major challenges we have solved in this paper are:

  1. (a)

    Hand segmentation from the upper half of the body image.

  2. (b)

    During the gesture segmentation process some gestures become deformed or undergo a slight change in shape.

  3. (c)

    Boundaries of each gesture may vary from one person to another.

To minimize all these challenges we have proposed a novel framework that works well in real-time scenarios. We first extract the hand from the upper half of the body image. The next step is to find appropriate features for gesture recognition, such as shape, orientation and spatio-temporal motion. In this framework we apply a combination of the WD and MFCC feature extraction techniques to recognize an unknown ISL gesture, with SVM and KNN as classifiers. This combination of the two feature extraction techniques has never been applied before for ISL gesture recognition, and it is very effective against translation, scaling, orientation, background variation and lighting variation when gestures are performed in a real-time environment. After WD, 12 MFCC coefficients of each frame are taken as the feature vector. All experiments are performed under various background conditions with different illuminations (e.g. red and yellow light), and we obtain 98 per cent classification accuracy. After classification, humanoid learning is performed by the HOAP-2 robot using the Webots robotic simulation software.

This paper is organized into six sections. The second section analyses previous research, describing the work that has already been done and the flaws in existing techniques. The third section describes the proposed framework in detail. Experimental results and analysis are explained in the fourth section. The fifth section presents a performance comparison between different gesture recognition techniques and explains how humanoid learning is performed. The sixth section gives the findings and the future scope of this area. The paper ends with the acknowledgements and references.

2 Analysis of previous research work

Many vision-based sign language gesture recognition techniques have already evolved; here we discuss a few of them. A vision-based method for hand gesture data acquisition requires several important considerations, such as the placement of the camera and the number of cameras. In general, one camera was used by Starner [5] and Martin [6], who suggested that a single camera can provide adequate information; more than one camera can be used to obtain depth or stereo information. In these works gestures are recognized in a real-time environment using the hidden Markov model (HMM), a very effective tool in the speech recognition field. The major drawback of this work lies in the features: only the mean and variance are used as input to the HMM, and these features lose information about the hand.

For feature extraction, Sturman [7, 8] and Watson [9] discussed a template-matching technique in which templates are created to collect the data values of each frame of each gesture. First the sensor readings are averaged and saved as templates; a test template is then classified by a closest-match criterion. This is easy to implement but not well suited to hand gestures, because templates overlap when the posture set is large. Histograms of local orientation were used as feature vectors by William T. Freeman and Michal Roth [10]; they cited orientation for robustness to lighting variations and histograms for translational invariance as the advantages. Nandy et al. [11] proposed an ISL gesture recognition technique in which features are extracted using histogram equalization and the atan2 function and classified using the Euclidean distance metric. In histogram equalization the cumulative distribution function (CDF) is used to normalize the image database. Their database was created with a fixed (black) background and different illumination conditions, and they also mention the flaws of this method under complex background conditions [12]. In addition, only fixed-background dynamic gestures are classified well, and a feature vector for every degree of hand angle is needed for correct classification, which makes the method expensive in terms of time. Jafreezal and Fatimah [13] jointly applied an MFCC technique for recognition of an ASL database. They evaluated the technique on 10 gestures and obtained a recognition rate of 95 %. However, they tested the results on a very small dataset, which is not sufficient for experimental purposes, and the technique is computationally expensive.

From the above literature we find that gesture segmentation and appropriate feature extraction are the major issues in any gesture recognition technique. In segmentation, hand extraction is the biggest challenge; after hand extraction, feature extraction is the next biggest challenge for gesture recognition. We solve these issues using a combination of wavelet descriptor (WD) and MFCC based ISL gesture recognition. The combination of these two methodologies is mainly used in speech processing [14, 15], where it is very prominent because of its inherent properties. A similar methodology has been applied in palmprint recognition [16], where features extracted by DWT and features extracted by MFCC are fused and the combined features are then classified, giving satisfactory results. This inspired us to apply the technique to ISL gesture recognition, where we expect it to perform equally well.

3 Proposed framework

In this paper we propose a gesture recognition method in which MFCC features are extracted from images that have first been transformed by wavelet descriptors. The proposed framework for ISL gesture recognition is shown in Fig. 1. As Fig. 1 shows, the overall framework is divided into five modules. In data acquisition, data are collected through a webcam and a Sony Handycam and each video is divided into a sequence of frames. These frames then pass to the pre-processing phase, in which hand extraction, silhouette image formation and boundary point extraction are performed; the silhouette images are converted into a 1-D signal and wavelet descriptors are applied to find boundary points that are moment invariant. These features are further processed to obtain MFCC coefficients in the feature extraction module. The MFCC feature extraction technique is widely used in speech processing; here we explore MFCC for hand gesture recognition in combination with WD and examine its performance on image frames. Owing to the spectral envelope property of MFCC, the proposed method gives a high recognition rate with little processing time. Probes are then classified using SVM and KNN in the classification phase. In the last module the classified gestures are performed by the HOAP-2 robot on the Webots platform. All modules of the proposed framework are described in detail in the subsequent sections.

Fig. 1
figure 1

Proposed framework

  1. (A)

    Database creation: Sir William Tomkins described 100 sign postures, guaranteed to be true Indian signs, in his book Universal Indian Sign Language [17]. From these 100 signs we created a database of 42 ISL static and dynamic gestures using a simple Logitech HD 720p camera, as shown in Table 1. The database contains both dynamic and static gestures recorded under different lighting conditions (yellow light, red light, white light, etc.) and against different backgrounds (white paper, paper with different symbols, red background, etc.). The entire database was created with the signer wearing a black full-sleeve dress. We captured full-body frames in which both hands appear, and the database includes gestures performed with a single hand as well as with both hands.

    Table 1 Database of static and dynamic gestures in different illumination condition
  2. (B)

    Pre-processing: Each gesture video is first divided into a sequence of frames of size (m, n), represented as Ii(m, n), where i is the frame index within the video. The hand region is then extracted from the whole image frame, as shown in Fig. 2.

    These images are then converted into binary images using a threshold T. Binarisation is necessary for extracting good, continuous edges, because the contours obtained otherwise are distorted at some points. Binarisation is performed using Otsu's thresholding method, in which the threshold is chosen so that the intra-class variance of the black and white pixels is minimal. WD is then applied to reduce the dimension of the image and to extract moment-invariant features that do not contain noise. A minimal code sketch of the binarisation and boundary extraction steps is given after the figure caption below.

    Fig. 2
    figure 2

    Pre-processing steps of each RGB frame. a RGB image, b Background, c Foreground, d HSV, e Bounding box, f Face extraction, g Hand image
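    The following is a minimal sketch of the binarisation and boundary-point extraction described above, assuming OpenCV is available; the file name is a placeholder for one cropped hand frame, and this is an illustration rather than the paper's MATLAB implementation.

```python
import cv2

# Placeholder input: one cropped hand frame I_i(m, n) from a gesture video.
frame = cv2.imread("hand_frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Otsu's method chooses the threshold T that minimises the intra-class
# variance of the black and white pixels, giving the binary silhouette.
T, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Continuous boundary points of the silhouette, used later by the wavelet
# descriptor (OpenCV >= 4 returns two values from findContours).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
boundary = max(contours, key=cv2.contourArea).squeeze()
print("Otsu threshold:", T, "number of boundary points:", len(boundary))
```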

    Wavelet descriptor In WD, the DWT [18, 19] is first applied up to the third level of decomposition of the binary image frames. At each level the image matrix is decomposed by row-wise multiplication of the image with the wavelet filter, followed by column-wise multiplication, yielding the four components of a single image [LL, LH, HL and HH]. LL contains the approximation coefficients, which discard all high-frequency content, while LH, HL and HH are the high-frequency sub-bands: LH contains the horizontal components, HL the vertical components and HH the diagonal components of the original image. The decomposition is performed as:

    $$LL = D_{w_{\emptyset}}(j,p,q) = \frac{1}{\sqrt{PQ}}\sum_{x = 0}^{P - 1}\sum_{y = 0}^{Q - 1} f(x,y)\,\emptyset_{j,p,q}(x,y)$$
    (1)
    $$i = LH,HL,HH = D_{w_{\varOmega}^{i}}(j,p,q) = \frac{1}{\sqrt{PQ}}\sum_{x = 0}^{P - 1}\sum_{y = 0}^{Q - 1} f(x,y)\,\varOmega_{j,p,q}^{i}(x,y)$$
    (2)

    where \(\emptyset\) and \(\varOmega\) are the scaling and wavelet functions, P and Q are the dimensions of the image matrix, j is the level of decomposition, f(x, y) is the image matrix obtained after binarisation, and p, q index the rows and columns of the transformed matrix. A short code sketch of this level-wise decomposition follows.
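    As an illustration of Eqs. 1–2 and the level-wise decomposition, the sketch below uses the PyWavelets package with the db2 (Daubechies) filter; the random array merely stands in for a binarised frame, and the paper's own implementation is in MATLAB.

```python
import numpy as np
import pywt

# Placeholder for one Otsu-binarised silhouette frame f(x, y).
binary = np.random.randint(0, 2, (240, 320)).astype(float)

approx = binary
for level in range(1, 4):                       # decompose up to the third level
    # Row-wise then column-wise filtering gives the LL, LH, HL, HH sub-bands.
    approx, (LH, HL, HH) = pywt.dwt2(approx, "db2")
    print(f"level {level}: LL {approx.shape}, LH {LH.shape}, "
          f"HL {HL.shape}, HH {HH.shape}")

# Only the low-frequency approximation is decomposed further at each level;
# each level roughly halves both image dimensions.
```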

    Suppose the image matrix I is of size (4, 4):

    $${\text{I}} = \left[ {\begin{array}{*{20}c} {c11} & {c12} & {c13} & {c14} \\ {c21} & {c22} & {c23} & {c24} \\ {c31} & {c32} & {c33} & {c34} \\ {c41} & {c42} & {c43} & {c44} \\ \end{array} } \right]$$

    The coefficients of the Daubechies wavelet filter H are defined as: Low pass: \(\frac{1}{4\sqrt 2 }(3 - \sqrt {3 } ,\;3 + \sqrt 3 ,\; 1 + \sqrt 3 , \;1 - \sqrt 3 )\) High pass: \(\frac{1}{4\sqrt 2 }(1 - \sqrt {3 } ,\; - 1 - \sqrt 3 , \;3 + \sqrt 3 , \; - 3 + \sqrt 3 )\)

    $${\text{H}} = \left[ {\begin{array}{*{20}c} {W11} & {W12} & {W13} & {W14} \\ {W21} & {W22} & {W23} & {W24} \\ {W31} & {W32} & {W33} & {W34} \\ {W41} & {W42} & {W43} & {W44} \\ \end{array} } \right]$$

    where W11, W12 … are wavelet filter bank coefficients.

    $${\text{Ir}} = {\text{I}}\; \times \;{\text{H }}$$
    (3)
    $${\text{a}}_{\text{r}}^{1} |{\text{d}}_{\text{r}}^{1} \approx \frac{1}{4\sqrt 2 }\left[ {\begin{array}{*{20}c} {r11,\;r12,\;r13,\;r14} \\ {r21,\;r22,\;r23,\;r24} \\ {r31,\;r32,\;r33,\;r34} \\ {r41,\;r42,\;r43,\;r44} \\ \end{array} } \right]$$
    (4)

    where \(1/4\sqrt 2\) is a normalizing factor. It normalizes the approximate ac and detail dc coefficients such that

    $$\left| {\left| {{\text{a}}_{\text{c}} } \right|} \right| = \left| {\left| {{\text{d}}_{\text{c}} } \right|} \right| = 1.$$
    $${\text{Ic}} = {\text{H}}^{\prime }\; \times \; {\text{Ir}}.$$
    (5)
    $${\text{a}}_{\text{c}}^{1} |{\text{d}}_{\text{c}}^{1} = \frac{1}{4\sqrt 2 }\left[ {\begin{array}{cc|cc} {D11} & {D12} & {D13} & {D14} \\ {D21} & {D22} & {D23} & {D24} \\ \hline {D31} & {D32} & {D33} & {D34} \\ {D41} & {D42} & {D43} & {D44} \\ \end{array} } \right]$$
    (6)

    where \({\text{Ic}}\) is an orthogonal matrix; an orthogonal transform preserves the magnitude and angle of any vector belonging to \(\mathbb{R}^{n}\).

    These two properties of WD show that the angle and amplitude of the original image do not change when it is converted into the transformed image, i.e. the image is less distorted after transformation.

    The matrix in Eq. 6, obtained after the first level of decomposition, has four components termed LL, LH, HL and HH, each of size (m/2, n/2). In the second level of decomposition the LL component is further decomposed into four parts, because it contains the maximum information with minimum noise and is most similar to the original image, whereas LH, HL and HH are high-frequency coefficients with a low signal-to-noise ratio. This process continues up to the third level of decomposition, finally yielding the four coefficient sets FF, FV, VF and VV. After the decomposition process the contour of the image is obtained by any known contour detection method. Moment-invariant features are then derived by converting the 2-D contour image G(x, y) into one dimension: the image in the x–y plane is converted into the r–θ plane as G(r, θ), where x = r cos θ and y = r sin θ.

    $$G_{ab} = \mathop {\iint }\nolimits G(r,\theta )g_{a } (r)e^{jb\theta } r drd\theta$$
    (7)

    where r is the radial coordinate, θ is the orientation angle, Gab is the moment of the hand, ga(r) is a radial basis function and a, b are constants. In the wavelet descriptor, ga(r) is treated as a wavelet basis function and replaced with \(\vartheta^{p,q} (r) = \frac{1}{\sqrt p }\vartheta \left( {\frac{r - q}{p}} \right)\), where p and q are the dilation and shifting parameters.

    The 2-D image is thus converted into 1-D form, which simplifies the feature extraction problem and improves performance. We choose the cubic B-spline (Gaussian approximation) function as the mother wavelet, defined as:

    $$\vartheta (r) = \frac{{4p^{n + 1} }}{{\sqrt {2\pi (n + 1)} }}\sigma_{y} \cos (2\pi g_{0} (2r - 1)) \times { \exp }\left( { - \frac{{(2r - 1)^{2} }}{{2\sigma_{y}^{2} (n + 1)}}} \right)$$

    To analyse the moments of the image shape, the dilation and shifting parameters p and q are chosen to be discrete, expressed as:

    $$p = p_{0}^{m} ,\quad m \;is\;an \;integer \; \text{and} \; q = nq_{0} p_{0}^{m} ,\quad n\;is\;an \;integer$$

    with p0 > 1 or p0 < 1 and q0 > 0. These constraints are imposed so that \(\vartheta \left( {\frac{r - q}{p}} \right)\) covers the complete shape of the gesture. Here a circle is used to represent the shape of the image, with r ≤ 1, and we choose p0 = q0 = 0.5. The wavelet basis function \(\vartheta^{p,q} (r)\) is then modified as:

    $$\vartheta_{m,n} (r) = 2^{{\frac{m}{2}}} \vartheta (2^{m} r - 0.5n)$$

    \(\vartheta_{m,n} (r)\) is defined for any orientation along the radial axis r and is used to find both local and global features of the hand by varying the values of m and n. After that we define the moment-invariant wavelet feature vector as:

    $$\left\| {G_{m,n,b}^{wavelet} } \right\| = \left\| {\mathop \int \nolimits f_{b} (r)\; \times \;\vartheta_{m,n} (r)r dr} \right\|$$
    (8)

    Comparing Eqs. 7 and 8 we get \({\text{ga(r)}} \; = \; \vartheta_{m,n} (r)\), and \(f_{b} (r) = \int {G(r,\theta )e^{jb\theta } d\theta }\) is the bth angular frequency feature of the image \(G(r,\theta )\) in the r–\(\theta\) plane, where 0 ≤ \(\theta\) ≤ 2π.

    \(\left\| {G_{m,n,b}^{wavelet} } \right\|\) is the wavelet transform of fb(r)r. It analyses the signal in both the spatial and the frequency domain and extracts features that are locally descriptive in nature. The features in Eq. 8 are moment invariant for each gesture, with feature vector \(\left\| { G_{m,n,b}^{wavelet} } \right\|\), where m = 0, 1, 2, 3 and n = 0, 1, …, 2m + 1.

    \(\left\| {G_{m,n,b}^{wavelet} } \right\|\) is the generalization of the moment \(f_{b} (r)\) at the mth scale level and nth shift position.

    In WD, \(\left\| {G_{m,n,b}^{wavelet} } \right\|\) represents the moment-invariant property of image I. If this image is rotated by an angle α, the rotated moment \(\left\| {G_{m,n,b}^{wavelet rotated} } \right\|\) is defined as:

    $$G_{m,n,b}^{wavelet rotated} = G_{m,n,b}^{wavelet } e^{jb\alpha }$$

    Since

    $$\begin{aligned} \left\| {G_{m,n,b}^{wavelet rotated} } \right\| & = \sqrt {\left( {\left\| {G_{m,n,b}^{wavelet rotated} } \right\|} \right)\left( {\left\| {G_{m,n,b}^{wavelet rotated} } \right\|} \right)^{*} } \\ & = \sqrt {\left( {\left\| {G_{m,n,b}^{wavelet } e^{jb\alpha } } \right\|} \right)\left( {\left\| {G_{m,n,b}^{wavelet } e^{ - jb\alpha } } \right\|} \right)} \\ & = \left\| {G_{m,n,b}^{wavelet } } \right\| \\ \end{aligned}$$

    which shows the moment-invariance property of WD.
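    A minimal NumPy sketch of the moment-invariant wavelet features of Eqs. 7–8 is given below. It assumes that the contour image has already been resampled onto an r–θ grid with r normalised to [0, 1], and the cubic B-spline constants (n = 3, a ≈ 0.697, f0 ≈ 0.409, σ² ≈ 0.561) are typical values from the wavelet-moment literature rather than values stated in the paper.

```python
import numpy as np

def mother_wavelet(r, n=3, a=0.697066, f0=0.409177, sigma2=0.561145):
    """Cubic B-spline (Gaussian approximation) mother wavelet; the constants
    are assumed literature values, not taken from the paper."""
    return (4 * a ** (n + 1) / np.sqrt(2 * np.pi * (n + 1)) * np.sqrt(sigma2)
            * np.cos(2 * np.pi * f0 * (2 * r - 1))
            * np.exp(-(2 * r - 1) ** 2 / (2 * sigma2 * (n + 1))))

def wavelet_moments(G_polar, b_max=3, m_max=3):
    """G_polar[i, j] = G(r_i, theta_j): contour image on an r-theta grid."""
    n_r, n_t = G_polar.shape
    r = np.linspace(0, 1, n_r, endpoint=False)
    theta = np.linspace(0, 2 * np.pi, n_t, endpoint=False)
    d_r, d_t = 1.0 / n_r, 2 * np.pi / n_t
    feats = []
    for b in range(b_max + 1):
        # f_b(r) = integral of G(r, theta) * e^{j b theta} d(theta)
        f_b = (G_polar * np.exp(1j * b * theta)).sum(axis=1) * d_t
        for m in range(m_max + 1):
            for n in range(2 * m + 2):          # n = 0 ... 2m + 1 as in the text
                psi = 2 ** (m / 2) * mother_wavelet(2 ** m * r - 0.5 * n)
                # ||G_{m,n,b}|| = | integral f_b(r) * psi_{m,n}(r) * r dr |  (Eq. 8)
                feats.append(abs((f_b * psi * r).sum() * d_r))
    return np.array(feats)

# Toy check on a ring-shaped "contour" resampled on a 64 x 128 r-theta grid.
rr, _ = np.meshgrid(np.linspace(0, 1, 64, endpoint=False),
                    np.linspace(0, 2 * np.pi, 128, endpoint=False), indexing="ij")
ring = ((rr > 0.4) & (rr < 0.5)).astype(float)
print(wavelet_moments(ring)[:5])
```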

  3. (C)

    MFCC feature extraction: Next, the MFCC coefficients of each image are calculated. Each image behaves like a signal, and the MFCC feature extraction technique is applied to this signal to calculate the spectral envelope of each frame. The spectral envelope property of MFCC features [20] provides robustness against background noise, such as the false hand edges obtained here. The spectral envelope is a property of an image that describes the image intensity over frequency through a smooth, regular curve with no discontinuity in the frequency domain. It is a curve in the frequency–amplitude plane that tightly encloses all the points of the magnitude spectrum, linking its peaks; these peaks represent the highest intensity values of the image and carry the maximum information. It is derived from the Fourier magnitude spectrum.

    The spectral envelope graph in Fig. 5 shows how tightly the curve covers the signal. The steps for calculating the MFCC features and the spectral envelope are as follows (a minimal code sketch of these steps is given at the end of this subsection):

    1. (a)

      Under noisy background conditions there is a large variation between consecutive pixel values. By assuming the signal is statistically stationary over short periods of time, the column vector is broken up into small sections called frames, each having N samples. Consecutive frames are separated by M samples (N > M), so they overlap by N − M samples.

    2. (b)

      To remove side ripples (spectral distortion) and maintain continuity in the signal, windowing is performed using the Hamming window function, which suppresses the side lobes in the spectrum of each image frame. It is expressed as:

      $${\text{hm(n)}} = 0.54 - 0.46\cos \left( {\frac{{2{\pi\text{n}}}}{{{\text{N}} - 1}}} \right),\quad 0 \le {\text{n}} \le {\text{N}} - 1$$
      (9)

      The resultant signal y(n) is

      $${\text{y(n)}} = {\text{x(n)}}\; \times \;{\text{hm(n)}}.$$
      (10)

      where hm(n) is the window function and x(n) is the wavelet-coefficient signal.

    3. (c)

      The fast Fourier transform (FFT) is then applied to the resultant signal y(n), converting the time-domain image frame into a frequency-domain frame represented as \({\text{S}}_{\text{k }}\):

      $${\text{S}}_{\text{k }} = \mathop \sum \limits_{{{\text{n}} = 0}}^{{{\text{N}} - 1}} {\text{y}}({\text{n}})\; \times \;{\text{e}}^{{ - \frac{{{\text{j}}2{\pi\text{ kn}}}}{\text{N}}}}$$
      (11)

      where k = 0, 1, 2 … N − 1.

    4. (d)

      The periodogram-based power spectral density \({\text{P}}_{\text{i}} \left( {\text{k}} \right)\) is calculated by taking the absolute value of the complex Fourier transform and squaring the result:

      $${\text{P}}_{\text{i}} ({\text{k}}) = \frac{1}{\text{N}}\left| {{\text{S}}_{\text{i}} ({\text{k}})} \right|^{2}$$
      (12)

      where positive frequencies 0 ≤ f < \({\text{fs}}\)/2 correspond to values 0 ≤ n ≤ N/2 − 1, while negative frequencies \(- {\text{fs}}\)/2 < f < 0 correspond to values N/2 + 1 ≤ n ≤ N − 1.

    5. (e)

      Irrelevant frequencies become mixed with very closely spaced relevant frequencies, and this effect becomes more pronounced as frequency increases. To obtain a clear picture of the exact energy amplitude at various frequencies, the PSD coefficients are therefore binned and correlated with each filter of a Mel filter bank. Mel is a nonlinear scale represented as:

      $${\text{Mel Frequency}} = 2595\; \times \;\log_{10} \left( {1 + \frac{\text{f}}{700}} \right)$$
      (13)

      where \({\text{f}}\) is the linear frequency in Hz; this scale is used to design the filter bank in the next step.

    6. (f)

      The filter bank uses this Mel scale: the centre frequencies of the filters are spaced according to the Mel frequency, so the filters near the lower frequencies have narrow bandwidths and the filter width increases as the frequency increases.

      $${\text{H(k}},{\text{b)}} = \left\{ {\begin{array}{ll} {0,} & {{\text{if}}\;{\text{f(k)}} < {\text{fc}}({\text{b}} - 1)} \\ {\dfrac{{{\text{f(k)}} - {\text{fc}}({\text{b}} - 1)}}{{{\text{fc}}({\text{b}}) - {\text{fc}}({\text{b}} - 1)}},} & {{\text{if}}\;{\text{fc}}({\text{b}} - 1) \le {\text{f(k)}} < {\text{fc}}({\text{b}})} \\ {\dfrac{{{\text{fc}}({\text{b}} + 1) - {\text{f(k)}}}}{{{\text{fc}}({\text{b}} + 1) - {\text{fc}}({\text{b}})}},} & {{\text{if}}\;{\text{fc(b)}} \le {\text{f(k)}} < {\text{fc}}({\text{b}} + 1)} \\ {0,} & {{\text{if}}\;{\text{f(k)}} \ge {\text{fc}}({\text{b}} + 1)} \\ \end{array} } \right.$$
      (14)

      where fc(b) is the centre frequency of the bth filter, b indexes the filters in the filter bank and f(k) is the frequency of the kth FFT bin.

    7. (g)

      Compute the log energy output of each Mel filter:

      $${\text{S(b)}} = \log \left( {\mathop \sum \limits_{{{\text{k}} = 0}}^{{{\text{N}} - 1}} {\text{H(k}},{\text{b)}}\; \times \;{\text{P(k)}}} \right)$$
      (15)

      where \({\text{b}} = 1, 2, \ldots ,{\text{m}}\), m is the number of filters and P(k) is the PSD from Eq. 12.

    8. (h)

      The coefficients me1, me2, … are generated by applying the discrete cosine transform (DCT) to these log Mel filter-bank energies:

      $${\text{me }} = {\text{DCT (s}}({\text{b}}) )$$
      (16)

      where me is the vector of cepstral coefficients. These coefficients are saved as the feature vector shown in Figs. 3 and 4.

      Fig. 3
      figure 3

      MFCC with DWT plot for gesture above

      Fig. 4
      figure 4

      MFCC with DWT plot for gesture across

      In Figs. 3 and 4 the X axis represents the MFCC coefficient index and the Y axis represents the log magnitude value corresponding to that coefficient. The graphs show the MFCC plot for each blocked frame of one image of the indicated gesture; 12 MFCC coefficients are taken for each image frame. Here the single image of one gesture is blocked into 11 frames, and the graph shows the MFCC plot for those 11 frames.

      The spectral envelope of the MFCC features is calculated as follows:

      The spectral envelope of the signal is calculated from the cepstral coefficients obtained in the last step of the MFCC feature extraction technique.

      1. 1.

        The number of bins in the spectral envelope of the signal is obtained by dividing the frequency range into n equal parts up to the Nyquist frequency fs/2, where fs is the sampling rate:

        $${\text{f}}_{\text{t}} = t\frac{{f_{s} /2}}{n},\quad {\text{t}} = 1\ldots {\text{n}}$$
        (17)
      2. 2.

        Calculate the angular frequency \(w_{t} = f_{t} \frac{2\pi }{{f_{s} }}\).

      3. 3.

        Finally, the spectral envelope ϑt at frequency ft is calculated as:

        $$\vartheta {\text{t}} = { \exp }\left( {\mathop \sum \limits_{i = 1}^{n} {\text{me}}_{i} \; \times \;\cos iw_{t} } \right)$$
        (18)

        where mei is the cepstral coefficient.

        The spectral envelope of the above gesture is shown in Fig. 5. From this graph we see that the spectral envelope spans 10 frequency bins, and the solid red line shows the spectral envelope of the signal, a curve generated by joining the peaks of the signal.

        Fig. 5
        figure 5

        Spectral envelope of a MFCC features, b WD with MFCC features

        From Fig. 5a, b we observe the nature of the spectral envelope curve: with plain MFCC the curve is not tightly bound and degenerates into a straight line where the signal vanishes, whereas with WD plus MFCC the spectral envelope is tightly bound around the peaks of the signal and therefore discriminates changes in the signal properly.
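    As promised above, the following is a minimal NumPy/SciPy sketch of steps (a)–(h) and of the spectral envelope in Eqs. 17–18. It is an illustration under assumed parameters (the sampling rate, frame size N, hop M and number of Mel filters are arbitrary example values), and the random input merely stands in for a wavelet-descriptor signal; it is not the MATLAB implementation used in the paper.

```python
import numpy as np
from scipy.fft import dct

def mel(f):                                    # Eq. 13 (log base 10)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_1d(signal, fs=8000, N=256, M=100, n_filters=20, n_ceps=12):
    """MFCC of a 1-D signal (e.g. one wavelet-descriptor feature row)."""
    # (a) framing: N samples per frame, frames start M apart (overlap N - M)
    frames = np.array([signal[s:s + N] for s in range(0, len(signal) - N + 1, M)])
    # (b) Hamming window, Eq. 9
    frames = frames * np.hamming(N)
    # (c)+(d) FFT and periodogram power spectral density, Eqs. 11-12
    psd = np.abs(np.fft.rfft(frames, N)) ** 2 / N
    # (e)+(f) triangular Mel filter bank, Eqs. 13-14
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, psd.shape[1]))
    for b in range(1, n_filters + 1):
        lo, c, hi = bins[b - 1], bins[b], bins[b + 1]
        fbank[b - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[b - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # (g) log Mel filter-bank energies, Eq. 15
    log_e = np.log(psd @ fbank.T + 1e-10)
    # (h) DCT -> cepstral coefficients me_1 ... me_12 per frame, Eq. 16
    return dct(log_e, type=2, axis=1, norm="ortho")[:, :n_ceps]

def spectral_envelope(ceps, n_bins=10):
    """Spectral envelope of one frame from its cepstral coefficients, Eqs. 17-18."""
    w = np.pi * np.arange(1, n_bins + 1) / n_bins        # w_t = f_t * 2*pi / fs
    i = np.arange(1, ceps.size + 1)[:, None]
    return np.exp((ceps[:, None] * np.cos(i * w)).sum(axis=0))

ceps = mfcc_1d(np.random.randn(2000))                    # placeholder signal
print(ceps.shape)
print(spectral_envelope(ceps[0]).round(3))
```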

  4. (D)

    Classification: The MFCC coefficients of the various gestures are classified using K nearest neighbour (KNN) [21] and the support vector machine (SVM) [22]; a minimal sketch of both classifiers is given after Fig. 7.

    1. (a)

      K nearest neighbour (KNN):

      KNN is a simple classifier used to classify an unknown gesture. It is based on the nearest-neighbour method, using the training samples that lie nearest to the unknown gesture, as shown in Fig. 6. In KNN the value of k is assumed first; then the distance between every test vector and every training vector is measured using the Euclidean distance, the k closest vectors (those with minimum Euclidean distance) are selected, the votes are counted and the label of the class with the majority vote is assigned. Here the proposed algorithm is tested with k = 1, 3, 5 and 7.

      Fig. 6
      figure 6

      KNN classification example for k = 1, 3, 5 or 7

      In Fig. 6 we see that for k = 1 the unlabelled sample is assigned to the square class, for k = 3 it is assigned to the triangle class, and for k = 7 it is assigned to the square class.

    2. (b)

      Support vector machine (SVM): SVM is a classifier that handles the nonlinearity present in the features by transforming them into a higher-dimensional space. Here we use a linear kernel function to discriminate between the different classes.

      $$y_{i} \left( {{\text{w}}\; \cdot tr_{i} - {\text{k}}} \right) \ge 0$$
      (19)

      where \(tr_{i}\) is the ith training vector, · is the dot product, w is the normal vector with ||w|| = 1, \(y_{i}\) is the class label (for a two-class problem \(y_{i}\) takes the values 1 and −1) and k is a constant. The constant \(\frac{k}{||w||}\) sets the width of the margin between the different classes. For a multiclass problem SVM uses multiple binary classifiers: given t classes, t binary classifiers f1, f2, …, ft are generated, each trained to separate one class from all the others. All the binary classifiers are combined to obtain the multiclass SVM decision, expressed as:

      $$argmax_{i = 1 \ldots t} F^{i} (x)$$

      where

      $$F^{i} (x) = \mathop \sum \limits_{n = 1}^{m} y_{n} \alpha^{i}_{n} K(te,tr_{n} ) + k^{n}$$
      (20)

      where \(\alpha^{i}_{n}\) is the Lagrangian multiplier and K is the linear kernel function used in Eq. 20. To classify unknown test data te among t classes, \(\frac{t \cdot (t - 1)}{2}\) binary classifiers are applied and the number of times the test data te is assigned to each particular class is counted; te is assigned to the class receiving the maximum number of votes. The binary SVM classification is shown in Fig. 7. Here we use the "multisvm" function of Matlab for classifying the test vector of an unknown gesture against the known gestures (one-versus-all strategy).

      Fig. 7
      figure 7

      SVM classification
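      The sketch below illustrates both classifiers using scikit-learn; the random arrays merely stand in for WD with MFCC feature vectors, and this is not the Matlab code used in the paper (scikit-learn's SVC builds the t(t − 1)/2 one-versus-one binary classifiers internally).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative stand-in for WD+MFCC feature vectors: 18 gesture classes,
# 30 training frames per class, 12 coefficients per frame.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(18 * 30, 12))
y_train = np.repeat(np.arange(18), 30)
X_test = rng.normal(size=(5, 12))

# KNN with Euclidean distance, evaluated at k = 1, 3, 5 and 7 as in the text.
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    print("KNN k =", k, "->", knn.fit(X_train, y_train).predict(X_test))

# Multiclass SVM with a linear kernel (one-vs-one voting over binary SVMs).
svm = SVC(kernel="linear", decision_function_shape="ovo").fit(X_train, y_train)
print("SVM ->", svm.predict(X_test))
```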

  5. (E)

    Humanoid learning: After the classification process, humanoid learning is performed by the HOAP-2 robot in the Webots robotic simulation software. Webots offers various types of robots, such as social robots and industrial robots, and is a software platform for establishing interaction between humans and robots. In this paper gestures performed by a human are used for this interaction, and all the classified gestures are then performed by the humanoid robot HOAP-2. HOAP-2 is a humanoid open architecture platform [3] with 25 degrees of freedom (DOF): 6 DOF per leg × 2, 4 DOF per arm × 2, 1 DOF at the waist, 1 DOF per hand × 2 and 2 DOF at the neck. The architecture of the HOAP-2 robot is shown in Fig. 8. It is a social robot used for human–robot interaction either through speech or through gestures (head motions, hand motions) [4]. To implement a gesture on HOAP-2 we need the joint-angle data of each joint of the human hand. When the user performs the gesture, the hand moves horizontally and vertically, up and down, and at each instant of time the joint-angle values change with respect to the initial coordinate frame. These joint-angle values are captured either with a data-glove sensor or with the atan2 function, and are then interpolated for smooth motion of the HOAP-2 joint angles. Finally these values are written to a .csv file, which is uploaded into the robot controller program. Here Webots is linked to MATLAB, where the gestures are classified. The range of each joint of HOAP-2 is shown in Table 2.

    Fig. 8
    figure 8

    Architecture of HOAP-2 Robot [3]

    Table 2 Allowable range of joint angle for humanoid robot HOAP-2

Of all the joints shown in Table 2 we use only 10 joints (RARM joints 1–5 and LARM joints 1–5) and their corresponding column values in the .csv file for performing any gesture; all other joint values are kept constant. The position of a joint of the humanoid robot at a particular instant is defined from the joint-angle value as:

$$Po = A \times \varphi$$

where Po is the position value, \(\varphi\) is the joint angle in degrees and A is the conversion coefficient in pulses/degree. In the HOAP-2 architecture the movement of the joints is commanded in pulses, because each joint has its own motor, and 1 degree of motion requires 209 pulses of the joint's motor. In the .csv file every column holds the position value of one joint, with some negative and some positive values, i.e. a joint sometimes moves in the positive direction and sometimes in the negative direction; the value is calculated by multiplying the number of degrees of motion by the pulse count per degree, as in the short sketch below.
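The following sketch shows only this degree-to-pulse conversion (Po = A × φ with A = 209 pulses/degree) and how one gesture trajectory could be written to a .csv file; the joint-angle numbers are made up for illustration, and the exact column layout expected by the Webots controller is not reproduced here.

```python
import csv

PULSES_PER_DEG = 209                # 1 degree of joint motion = 209 motor pulses

def degrees_to_pulses(angle_deg):
    """Po = A * phi: convert a joint angle in degrees to a position value in pulses."""
    return int(round(PULSES_PER_DEG * angle_deg))

# Hypothetical joint-angle trajectory (degrees) for the 10 arm joints used
# (RARM joints 1-5 followed by LARM joints 1-5), one row per time step.
trajectory_deg = [
    [10, -5, 0, 20, 0, -10, 5, 0, -20, 0],
    [12, -6, 1, 22, 0, -12, 6, 1, -22, 0],
]

with open("gesture_joint_positions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for step in trajectory_deg:
        # Positive and negative pulse values move the joint in opposite directions.
        writer.writerow([degrees_to_pulses(a) for a in step])
```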

4 Experimental results and analysis

A dataset of 42 ISL gestures such as big, bring and below (23 static and 19 dynamic) was created under three lighting conditions and different background conditions. In this paper five videos of each gesture are taken as the training set and three videos as the testing set, each video having 150 frames. For training we use the dataset of five persons and for testing the dataset of seven persons, six males and six females in total (all students of the Indian Institute of Information Technology, Allahabad). The Matlab Image Acquisition Toolbox was used to capture the ISL video data and was configured as follows: colour space RGB (320 × 240), triggering mode immediate, number of triggers 10, frame rate 30 frames per second.

After data acquisition and pre-processing, the proposed WD with MFCC features are extracted and illustrated through the feature space plots [27, 28] shown in Figs. 9, 10 and 11.

Fig. 9
figure 9

Feature space plot between feature 2 vs feature 10 a MFCC features, b WD with MFCC features

Fig. 10
figure 10

Feature space plot between feature 1 and feature 2 a MFCC features, b WD with MFCC features

Fig. 11
figure 11

Feature space plot between feature 2 and feature 12 (WD with MFCC features)

We compared the two feature extraction techniques, MFCC [25] and WD with MFCC, using the feature space plots [27] shown in Figs. 9 and 10, where the graphs plot feature 1 vs feature 2 and feature 2 vs feature 10; the three best feature space plots are shown here. From these graphs we find that the proposed technique has a greater capability to separate all 18 classes than the plain MFCC technique, which means that the overlap between the MFCC features is greater than between the WD with MFCC features. This overlap reflects the nature of the data: the greater the overlap, the greater the nonlinearity in the dataset, and vice versa. Figure 11 shows the feature space plot of the proposed methodology for five gestures and clearly demonstrates the discriminative ability of the proposed technique relative to the other existing technique. The statistical nature of these features is shown by calculating the between-class distance matrix and the within-class variance matrix, plotted in Figs. 12, 13, 14 and 15.

Fig. 12
figure 12

Distance plot for MFCC features

Fig. 13
figure 13

Distance plot for WD with MFCC features

Fig. 14
figure 14

Variance plot for MFCC features

Fig. 15
figure 15

Variance plot for WD with MFCC features

The graphs in Figs. 12, 13, 14 and 15 represent the variance of the within-class matrix and the distances of the between-class matrix. In Figs. 14 and 15 we see that, for WD with MFCC features, the mean variance of each class lies below a threshold of 2 dB for most classes, whereas for plain MFCC features the class variance fluctuates frequently, which indicates nonlinearity in the dataset. From Figs. 12 and 13 we see that the between-class distance is higher for WD with MFCC than for MFCC. In the between-class matrix the Euclidean distance within the same class should be zero while other classes have non-zero values; in Fig. 12 the same-class distance is in some cases not zero, whereas for WD with MFCC it is always zero, which shows that the proposed method has more discriminative power than the other method (MFCC [25]). A small sketch of how such matrices can be computed is given below.
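The paper does not spell out the exact computation behind Figs. 12–15, so the sketch below shows one common way to obtain a between-class distance matrix (Euclidean distances between class mean vectors) and per-class within-class variances; it is an assumption-labelled illustration, with random data standing in for the extracted features.

```python
import numpy as np

def class_statistics(features, labels):
    """Between-class distance matrix and within-class variance per class."""
    classes = np.unique(labels)
    means = np.array([features[labels == c].mean(axis=0) for c in classes])
    # Between-class distances: Euclidean distance between class mean vectors
    # (the diagonal, i.e. the distance of a class to itself, is zero).
    between = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    # Within-class variance: mean feature variance inside each class.
    within = np.array([features[labels == c].var(axis=0).mean() for c in classes])
    return between, within

# Random stand-in for WD+MFCC features of 5 gesture classes, 12 features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(5 * 40, 12))
y = np.repeat(np.arange(5), 40)
B, W = class_statistics(X, y)
print("between-class distance matrix:\n", B.round(2))
print("within-class variance per class:", W.round(3))
```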

The performance of the proposed method is analysed using the percentage classification rate and the confusion matrix, with KNN and SVM used as classifiers. The classification rate is calculated as:

$${\text{Classification rate}} = \left( { 1 - {\text{No}}.{\text{ of frames misclassified}}/{\text{total no}}.{\text{ of frames}}} \right) \times 100$$
(21)

For KNN, experiments are performed at various values of k such as k = 1, 3, 5 and 7 (k = 1 reduces to simple nearest-neighbour classification), with the Euclidean distance used to find the nearest neighbours of a particular class; we obtain approximately similar results for k = 3, 5 and 7. Similarly, for SVM we use the linear kernel function in a multiclass SVM to classify an unknown gesture.

From Table 3 we see that SVM gives better classification accuracy than KNN, because SVM handles the nonlinearity present in the dataset by using a kernel function, whereas KNN is a simple nearest-neighbour classifier that is not appropriate for classifying nonlinear data and is also very sensitive to noise. We also observe that the classification accuracy decreases when the shapes of two gestures are very similar.

Table 3 Frame based classification results for dynamic gestures using SVM and KNN

The performance of the proposed algorithm is also tested by calculating the confusion matrix shown in Table 4. Here three videos per gesture are taken for training and five videos for testing, each video containing 150 frames. The confusion matrix is created by calculating the true positives, true negatives, false positives and false negatives for each gesture.

Table 4 MFCC with DWT based classification for dynamic gestures (SVM as Classifier)

True positive = 50, True negative = 1018, False positive = 8, False negative = 7, Total positive = 58,

Total negative = 1025, Total population = 1083. Accuracy = (50 + 1018)/1083 = 1068/1083 = 98.61 %.
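For completeness, the quoted accuracy can be reproduced directly from these counts; precision and recall are not reported in the paper and are added here only as an illustration of the same arithmetic.

```python
# Confusion-matrix counts reported above for the dynamic gestures.
tp, tn, fp, fn = 50, 1018, 8, 7
total = tp + tn + fp + fn                     # 1083 frames in total
accuracy = (tp + tn) / total                  # (50 + 1018) / 1083 ~= 0.9861
precision = tp / (tp + fp)                    # not reported in the paper
recall = tp / (tp + fn)                       # not reported in the paper
print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```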

From Tables 3 and 4 we see that our proposed method gives satisfactory results for all 19 dynamic gestures. We also tested the method on static gestures: training is performed with one video, and all the other videos, recorded under different lighting conditions, are used for testing; each video contains 150 frames. The accuracy based on the confusion matrix is shown in Table 5.

Table 5 KNN k = 3 based classification for static gestures in different illumination condition

True positive = 7, False negative = 1, False positive = 1, True negative = 55, Total positive = 8,

Total negative = 56, Total population = 64; Accuracy = 62/64 = 96.875 %.

Table 5 shows that the proposed method gives close to 100 per cent classification results for the static gestures, because there is no change between frames and no change in orientation, so all the environmental noise and movement-variation problems are easily eliminated by the WD and MFCC method. We also observe that static gestures are constant with respect to time, whereas dynamic gestures vary with respect to time.

We also tested our algorithm on the Sheffield Kinect Gesture (SKIG) dataset created by L. Liu at the University of Sheffield [23, 24]. This dataset is dynamic in nature and contains ten types of action such as circle, triangle and up-down. The data were captured with three backgrounds (white wooden board, white plain paper, white paper with symbols), two lighting conditions (dark light, poor light) and two poses (clockwise and anti-clockwise); therefore each gesture has 3 × 2 × 2 = 12 samples. We tested our method on a total of 360 SKIG gesture samples, with results shown in Table 6.

Table 6 Classification results for SKIG kinect gestures dataset

Table 6 shows the classification results on the SKIG dataset for the different lighting conditions and backgrounds. We observe that when the lighting changes drastically (e.g. from white light to red light) and the background is very similar to skin colour, the performance of the proposed method degrades.

4.1 Comparative results and analysis

We performed experiments with the various parameters of the different methods explained in Table 7.

Table 7 Performance parameters

On the basis of these parameters, a comparative analysis of the feature extraction methods MFCC [25], orientation histogram (OH) [26] and WD with MFCC (the proposed method) was performed in terms of classification rate, as shown in Fig. 16. For OH [26] experiments are performed with 9, 18, 27 and 36 bins (number of features); for MFCC [25] we take 10, 12, 13, 14, 15 and 17 cepstral coefficients to analyse the recognition accuracy; and finally we tested our proposed algorithm with the Haar wavelet and the Daubechies wavelet descriptor combined with 10, 12, 13, 14, 15 and 17 MFCC coefficients. The number of decomposition levels in WD is 3.

The graph in Fig. 16 shows that the WD with MFCC method provides a higher classification rate than the other methods because of the multi-resolution and moment-invariance properties of WD and the spectral envelope property of MFCC. In the proposed framework WD discards 3/4 of the original image data at each level of decomposition, which reduces the time complexity and requires much less space to store the features. The contour of each image is created to find the moment-invariant features of each gesture image by converting the 2-D image matrix into a 1-D signal. MFCC then extracts the spectral envelope of the features obtained from WD, which minimises the illumination variation present in the image.

Fig. 16
figure 16

Comparative results of MFCC, OH, WD with MFCC based methods

We also compared the classification accuracy of the proposed gesture recognition algorithm for various numbers of MFCC coefficients (10, 12, 13, …, 20), as shown in Fig. 17. From this graph we find that the recognition accuracy is highest with 12 and 13 MFCC coefficients, compared with the other coefficient counts (10, 15, 20), when applied together with WD. We also compared the results of the proposed method with the existing methods OH and MFCC; these results are likewise shown in Fig. 17.

Fig. 17
figure 17

Comparative analysis of % of accuracy at different no. of MFCC coefficients

Figure 18 shows that as the complexity of the dataset (background variation, illumination conditions, etc.) increases, the recognition accuracy of both existing techniques decreases. The proposed method provides approximately 97 % average accuracy with 12 WD with MFCC coefficients, although it too fails when two gestures are approximately similar in shape. The time complexity of the different methods is also compared in Table 8; the system configuration is a Core i5 processor with 4 GB RAM running MATLAB 2013.

Fig. 18
figure 18

Comparative analysis of % of accuracy at different no. of MFCC coefficients

Table 8 Avg. processing time of 19 dynamic gestures

Table 8 compares the average processing time (CPU time) of the OH, MFCC and WD with MFCC methods. The processing times are approximately similar for all the methods, but the classification rate of WD with MFCC is much higher than that of the other two.

A comparative analysis of all three feature extraction techniques has also been performed on static gestures, shown in Table 9. This analysis shows that all three methods give good recognition accuracy for static gestures, because there is no variation between intermediate frames with respect to time. We also performed the comparative analysis of all three techniques on the SKIG dataset.

Table 9 Average classification accuracy for static gestures

From Table 10 we see that the performance of our proposed method on the SKIG dataset is good compared with the other feature extraction techniques, OH and MFCC.

Table 10 Comparative results of SKIG dataset for different feature extraction technique

Finally, learning is performed on the HOAP-2 robot using the Webots software. After an unknown gesture is classified, a .csv file is generated to perform the corresponding gesture on the HOAP-2 robot, as shown in Figs. 19 and 20.

Fig. 19
figure 19

Webots simulation of the 'ARISE' gesture

Fig. 20
figure 20

HOAP-2 performing the gestures 'Below', 'Add', 'Across' and 'Above'

Figures 19 and 20 show the Webots robotic simulation software with HOAP-2 performing the gestures 'ARISE', 'ADD', etc. On the right-hand side of Fig. 19 the Webots controller is shown, where the HOAP-2 programming is done; this controller is interfaced with the MATLAB classification result, from which the .csv file is uploaded.

5 Conclusion and future work

In ISL dynamic gesture recognition the major problems are hand segmentation and feature extraction, which are essential for classifying gestures in a real-time environment. In this work we proposed a novel WD with MFCC based ISL gesture recognition method in which both hands are used for performing the gestures. WD provides time and frequency resolution as well as moment-invariant properties for any gesture; these properties make it invariant against scale, orientation, movement, phase, etc. It also reduces the feature space of the dataset to 1/4 in the first level of decomposition, which lowers both the time complexity and the space complexity. After reduction and noise elimination, moment-invariant features are extracted by converting the 2-D contour image into a 1-D signal, and from these points the spectral envelope is calculated using the MFCC feature extraction technique. This technique copes better with different backgrounds and various illumination conditions than OH and plain MFCC, and it is also very effective in discriminating one gesture from another. MFCC has generally been used for speech/voice recognition; we applied it to gesture image analysis, and it can be said that it is an effective technique for gesture recognition as well. When used with WD it gives 97 % recognition accuracy. Gestures were also classified using different classifiers (KNN, SVM) with various parameters, such as k = 1, 3, 5, 7 and different numbers of cepstral coefficients. Analysis was also done by calculating FP, FN, TP and TN and building confusion matrices from these values; from the confusion matrices we observed that the proposed technique gives better accuracy than the other existing methods. The proposed technique was also tested on the SKIG dataset published by the University of Sheffield and achieved 93 per cent accuracy, whereas OH and MFCC provide only 84 and 85 per cent accuracy, which is much lower than the proposed method. After classifying an unknown gesture, HOAP-2 learning is performed through a .csv file on the Webots simulation software, through which HOAP-2 performs the same gesture that was classified by our proposed technique.

As an advancement of ISL gesture recognition, full-sentence recognition remains to be performed. ISL full-sentence recognition combined with voice recognition would be a great contribution to the ISL gesture recognition field. Humanoid robot interaction with deaf people should also be built on real-time voice and gesture recognition learning.