
1 Introduction

Emotion inference in daily human interaction remains a challenge owing to the ambiguous and multi-modal nature of expressive behavior. Emotions are mostly expressed through facial or bodily movements, although speech is the sole medium of emotional expression during a telephone conversation. In such settings, emotion detection is arguably even harder. A recognition system demands the extraction of suitable features that can best represent speech emotions.

Among the different feature extraction techniques (prosodic, spectral, voice quality, nonlinear), MFCC uses a frame-based approach and is widely accepted for speech and emotion analysis [1,2,3]. Although popular, the MFCC extraction process does not capture the temporal dynamics among features, which are essential for emotion analysis. This drawback can be mitigated using the derivative and acceleration coefficients of MFCC, as several researchers have done [4, 5]. However, MFCC-based features, including the deltas, convey amplitude and energy information and hence cannot provide adequate information on speech emotion. To improve recognition further, the multi-resolution capability of wavelet analysis in extracting MFCC features has been proposed for speaker verification [6]. This literature suggests that accuracy may be enhanced if the advantages of wavelet analysis and differential MFCC are combined. Hence, a novel effort has been made to extract wavelet-based delta MFCC (WDMFCC) features for comparison.

Most classifiers used in speech analysis are statistical, NN-based, fuzzy logic, or combinations of these [1,2,3,4,5,6,7]. GMM and HMM can model patterns involving large feature sets. Smaller datasets and the difficulty of modeling emotions with conventional statistical classifiers such as GMM and HMM have opened up new avenues for NN-based classifiers [7]. Compared to the multilayer perceptron (MLP), the PNN is much faster, simpler, easier to implement, and more accurate [8]. The need to adjust only one parameter, and the absence of any constraint on choosing that parameter with precision, makes it superior to the RBFN. This has motivated the authors to opt for the probabilistic neural network (PNN) classifier in this work.

The organization of the paper is as follows: the feature extraction techniques and the classification model used are explained in Sects. 2 and 3, respectively. A detailed description of the simulation results, with a comparison between the state-of-the-art and the proposed features, is given in Sect. 4. The conclusion and future research directions are provided in Sect. 5.

2 Feature Extraction Techniques

Initially, standard features based on wavelet analysis and MFCC are extracted and compared for their effectiveness in terms of recognition accuracy and mean square error (MSE). Next, delta MFCC features are extracted and compared with the wavelet-based MFCC and the proposed wavelet-based delta MFCC features. The proposed feature extraction technique is shown in Fig. 1.

Fig. 1 Proposed feature extraction technique

2.1 MFCC

One of the most dominant and effective cepstrum-based features is the MFCC, which uses the Mel scale to warp the original signal frequency onto the Mel-frequency axis. In this way the human auditory mechanism is taken into account, since the human ear perceives frequency logarithmically. The relationship used to convert the windowed signal frequency f into the Mel frequency \(f_{m}\) is given by

$$f_{m} = 2595 \log_{10} \left( {\frac{f}{700} + 1} \right).$$
(1)
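As a quick numerical check of Eq. (1), the following Python snippet warps a frequency onto the Mel scale (the function name is ours, for illustration only):

```python
import numpy as np

def hz_to_mel(f):
    """Warp a frequency f in Hz onto the Mel scale, Eq. (1)."""
    return 2595.0 * np.log10(f / 700.0 + 1.0)

# By construction of the Mel scale, 1 kHz maps to roughly 1000 Mel.
print(hz_to_mel(1000.0))  # ~1000.0
```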

2.2 Wavelet Analysis

In wavelet analysis, the emotional speech signal \(s(n)\) is decomposed into approximation and detail coefficients using a series of low-pass and high-pass filters, respectively [9]. For the low-pass filter having impulse response \(h(n)\), the output can be represented as a convolution:

$$X_{L} (n) = s(n)*h(n) = \sum\limits_{k = - \infty }^{\infty } {s(k)h(n - k)}$$
(2)

Similarly, the output of the high-pass filter, with impulse response \(g(n)\), is the convolution of \(g(n)\) with the signal under consideration. After sub-sampling the filter outputs by a factor of two, the decomposition can be represented as

$$X_{L} (n) = \sum\limits_{k = - \infty }^{\infty } {s(k)h(2n - k)}$$
(3)
$$X_{H} (n) = \sum\limits_{k = - \infty }^{\infty } {s(k)g(2n - k)} .$$
(4)
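The decimated filter-bank step of Eqs. (3) and (4) can be reproduced with an off-the-shelf wavelet library; a minimal sketch using PyWavelets (the variable names are ours) is:

```python
import numpy as np
import pywt

s = np.random.randn(1024)         # stand-in for a frame of emotional speech
# One analysis stage: convolve with the db4 low-/high-pass pair and
# sub-sample by two, yielding X_L(n) and X_H(n) of Eqs. (3)-(4).
x_low, x_high = pywt.dwt(s, 'db4')
print(x_low.shape, x_high.shape)  # each roughly half the input length
```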

2.3 MFCC in Wavelet Domain (WMFCC)

The steps for extracting the WMFCC features are explained below.

(1) Signal decomposition: Initially, decompose the signal into detail \(d_{i}\) and approximation \(a_{i}\) coefficients at the ith level. A three-level decomposition with the Daubechies-4 (db4) wavelet has been performed here. Emotional utterances of the Berlin database, sampled at 16 kHz, have been used for this work. Thus, for the 8 kHz bandwidth, the filtering distributes the sub-bands over the ranges 0–1, 1–2, 2–4, and 4–8 kHz.

(2) Sub-band MFCC extraction: The MFCC feature extraction technique shown in Fig. 1 is applied individually to these sub-bands. Five WMFCCs are extracted from each of the 1–2 and 2–4 kHz sub-bands, and three WMFCCs from each of the 0–1 and 4–8 kHz bands. More WMFCCs are taken from the middle bands (1–4 kHz) to capture more perceptual information, as speech energy is largely concentrated there. In this way the WMFCC representation is reduced to 16 coefficients per utterance (a sketch of the procedure follows this list).
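A minimal sketch of steps (1)–(2) is given below, assuming the MFCCs are computed directly on each sub-band coefficient sequence with librosa; the function name, frame settings (n_fft = 512, hop = 256), and the frame averaging used to obtain one vector per band are our assumptions, since the paper leaves these details open:

```python
import numpy as np
import pywt
import librosa

def extract_wmfcc(s, sr=16000):
    """Three-level db4 decomposition, then MFCCs per sub-band.

    wavedec returns [a3, d3, d2, d1], i.e. the 0-1, 1-2, 2-4 and
    4-8 kHz bands; 3 + 5 + 5 + 3 = 16 coefficients are kept.
    """
    bands = pywt.wavedec(s, 'db4', level=3)
    n_keep = (3, 5, 5, 3)             # outer bands 3, middle bands 5
    feats = []
    for band, k in zip(bands, n_keep):
        m = librosa.feature.mfcc(y=band.astype(float), sr=sr, n_mfcc=k,
                                 n_fft=512, hop_length=256)
        feats.append(m.mean(axis=1))  # average over frames: one vector/band
    return np.concatenate(feats)

print(extract_wmfcc(np.random.randn(16000)).shape)  # (16,) per utterance
```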

2.4 Delta MFCC (DMFCC)

The delta MFCC \((M_{\Delta } )\) features of emotional utterances are extracted using the relation:

$$M_{\Delta } (r) = \beta \times \sum\limits_{q = 1}^{2} {q \times [{\text{MFCC}}(r + q) - {\text{MFCC}}(r - q)]} ,\quad r = 1, 2, \ldots ,N$$
(5)

where \(M_{\Delta } (r)\) denotes the delta features and \(\beta\) is a scaling factor, set to \(\beta = 2\) in this work. N is the number of delta features per utterance, whereas q is the indexing parameter of the analysis window.
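A direct numpy transcription of Eq. (5) might look as follows; the edge padding is our assumption, since the equation is undefined for the first and last two positions:

```python
import numpy as np

def delta_features(c, beta=2.0):
    """Delta of a coefficient sequence c per Eq. (5), +/-2 frame window."""
    p = np.pad(c, 2, mode='edge')   # replicate edges so r-q, r+q stay valid
    return np.array([beta * sum(q * (p[r + 2 + q] - p[r + 2 - q])
                                for q in (1, 2))
                     for r in range(len(c))])

# A linear ramp yields a constant delta away from the padded edges.
print(delta_features(np.arange(8.0)))
```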

2.5 Proposed Wavelet Delta MFCC (WDMFCC)

In the proposed technique, wavelet analysis provides both time and frequency information about the signal, while the traditional MFCCs provide the energy or amplitude information. Applying the derivatives of the MFCC incorporates dynamic characteristics into the WMFCC features. The steps for extracting the WDMFCC features are explained below.

(a) Extract the wavelet features using Eqs. (3) and (4), as explained earlier and shown in Fig. 1.

(b) Apply the MFCC feature extraction technique to the wavelet features, as explained for the WMFCC features.

(c) Apply the derivative algorithm of Eq. (5) to the WMFCC features to extract 16 WDMFCC features per utterance of an emotion (a combined sketch follows this list).
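Chaining the three steps gives the sketch below; it carries over the same assumptions as the WMFCC and delta sketches above (librosa MFCCs on raw sub-band coefficients, frame averaging, edge padding), so it should be read as an illustration rather than the authors' exact pipeline:

```python
import numpy as np
import pywt
import librosa

def extract_wdmfcc(s, sr=16000, beta=2.0):
    """Steps (a)-(c): db4 sub-bands -> 16 WMFCCs -> Eq. (5) deltas."""
    bands = pywt.wavedec(s, 'db4', level=3)             # (a) a3, d3, d2, d1
    wmfcc = np.concatenate([                            # (b) 3+5+5+3 = 16
        librosa.feature.mfcc(y=b.astype(float), sr=sr, n_mfcc=k,
                             n_fft=512, hop_length=256).mean(axis=1)
        for b, k in zip(bands, (3, 5, 5, 3))])
    p = np.pad(wmfcc, 2, mode='edge')                   # (c) deltas, Eq. (5)
    return np.array([beta * sum(q * (p[r + 2 + q] - p[r + 2 - q])
                                for q in (1, 2))
                     for r in range(len(wmfcc))])

print(extract_wdmfcc(np.random.randn(16000)).shape)     # (16,) per utterance
```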

3 Classification Method

The PNN is a nonparametric network having input, pattern, summation, and decision layers as its main constituent sections, as shown in Fig. 2.

Fig. 2 The structure of PNN

For any input emotion pattern E with vector dimension x and smoothing parameter \(\delta\), the output of the pattern layer can be represented as

$$\phi_{u,v} (E) = \frac{1}{{\left( {2\pi } \right)^{{\frac{x}{2}}} \delta^{x} }}{ \exp }\left[ { - \frac{{\left( {E - E_{u,v} } \right)^{T} (E - E_{u,v} )}}{{2\delta^{2} }}} \right]$$
(6)

where \(E_{u,v}\) denotes the neuron vector and is considered the center of the kernel function. Here \(u = 1, 2, \ldots , U\) indexes the \(U\) emotional states, and \(v = 1, 2, \ldots , V_{u}\), where \(V_{u}\) is the total number of training patterns in class \(S_{u}\). Based on the probability density function (PDF) formed by all the neurons, the summation layer averages the neuron outputs of each designated class so as to classify the pattern E into a class \(S_{u}\):

$$p_{u} (E) = \frac{1}{{\left( {2\pi } \right)^{{\frac{x}{2}}} \delta^{x} }}\frac{1}{{V_{u} }}\sum\limits_{v = 1}^{{V_{u} }} {{ \exp }\left[ { - \frac{{\left( {E - E_{u,v} } \right)^{T} (E - E_{u,v} )}}{{2\delta^{2} }}} \right]}$$
(7)

The decision layer selects the predicted class by applying Bayes' decision rule to the summation-layer outputs:

$$\widehat{S}\,(E) = \arg \max_{u} \left\{ {p_{u} (E)} \right\}$$
(8)

where the computed class of the pattern E is represented by \(\widehat{S}\,(E)\).
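Eqs. (6)–(8) translate almost line for line into numpy; the sketch below (the names and the default smoothing value are ours) classifies one pattern against a stored training set:

```python
import numpy as np

def pnn_predict(E, train_feats, train_labels, delta=0.5):
    """PNN forward pass: pattern, summation, and decision layers.

    E            : (x,) input pattern
    train_feats  : (M, x) stored patterns E_{u,v}
    train_labels : (M,) class index u of each stored pattern
    delta        : smoothing parameter of the Gaussian kernel
    """
    x = E.shape[0]
    norm = 1.0 / ((2.0 * np.pi) ** (x / 2.0) * delta ** x)
    # Pattern layer: Gaussian kernel of Eq. (6) for every stored pattern
    phi = norm * np.exp(-np.sum((train_feats - E) ** 2, axis=1)
                        / (2.0 * delta ** 2))
    # Summation layer: class-wise average of kernel outputs, Eq. (7)
    classes = np.unique(train_labels)
    p = np.array([phi[train_labels == u].mean() for u in classes])
    # Decision layer: Bayes decision rule, Eq. (8)
    return classes[np.argmax(p)]

# Toy usage: two well-separated 16-dimensional classes
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (10, 16)), rng.normal(3, 1, (10, 16))])
labels = np.repeat([0, 1], 10)
print(pnn_predict(np.full(16, 3.0), feats, labels))  # -> 1
```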

4 Results and Discussion

The Berlin emotional speech database (EMO-DB) [10] has been chosen to distinguish five emotional states, namely angry, anxiety, happy, sad, and neutral. The utterances used for training the classifier are not involved in testing or validation. Approximately 70% of the data of every emotional state are used for training, with the remainder split equally between testing and validation.

The recognition accuracies obtained with the extracted features and the PNN classifier are given in Table 1. The comparison favors WDMFCC over the other features. The use of temporal information from the differential MFCC values, together with the multi-resolution capability of wavelet analysis, is the prime reason for the improvement in accuracy.

Table 1 The recognition accuracy using the extracted features with PNN classifier

As observed in Table 1, wavelet-based features such as WMFCC and WDMFCC outperform the delta MFCC and MFCC features. Incorporating either differential information or the wavelet approach into the MFCC extraction technique enhances the robustness of the resulting features compared with the standard MFCC or wavelet features.

A comparison of MSE gives similar results, as shown in Fig. 3. Standard wavelet-based features are found to be more reliable than MFCCs in terms of both classification accuracy and MSE.

Fig. 3 Comparison of MSE using variants of MFCC with PNN classifier

To assess the response time of the classifier with the different extracted MFCC features, the time elapsed during both feature extraction and classification is compared in Table 2.

Table 2 Real-time comparison of different features using PNN classifier for 45 utterances per emotion

5 Conclusion

Involving the differential values of the MFCC coefficients improves feature reliability owing to the inclusion of temporal emotional information. As the classification results suggest, applying the multi-resolution capability of wavelet analysis to the extraction of MFCC features yields more reliable information than the standard MFCC. Combining wavelet analysis and the differential algorithm to modify the standard MFCC is found to be a novel effort in this direction. Other modification techniques that can add valuable emotional information to MFCC features may open up new avenues in the field of emotion recognition.