1 Introduction

With the advancements in Machine Learning and Artificial Intelligence, effective Human-Machine interaction has gained significant importance. Emotional speech has little impact on Human-Machine interaction because machines cannot understand the emotional state of humans [13]. This has increased the need for analyzing emotions during Human-Machine interaction. Usually, human emotions are analyzed from recorded speech signals. Rather than using recorded speech signals, Electroglottographic (EGG) signals can be used for recognizing emotions. Apart from EGG signals, the glottal information (excitation source) can also be approximated by linear prediction residual signals [2, 5], which are derived from speech signals. By informal listening to EGG signals, humans can identify the emotion they carry. With the availability of EGG data in different emotions in the classic German emotional speech database (EmoDb), the present work focuses on using EGG signals for emotion recognition.

Fig. 1. Smooth nature of the EGG signal. Plot (a) Speech signal, (b) EGG signal, (c) Differenced EGG signal.

Figure 1 clearly depicts the difference between the speech and EGG signals. As seen from Fig. 1, EGG signals are smooth compared to speech signals, and when differenced they produce a sequence of impulses that mark the glottal closure instants (GCIs). This indicates that the glottal information is clearly present in the EGG signal, and that it carries perceptually relevant emotional information (excitation source information) in the low frequency regions. This phenomenon is shown clearly in Fig. 2. It is analyzed by computing wide-band spectrograms for utterances in different emotions (Anger, Happy, Boredom, and Fear) apart from Neutral signals present in the EGG data. The same vowel is chosen to represent the variations across emotions through the spectrogram. Figure 2(f)–(j) shows the spectrograms of the EGG signal produced when the vowel /a/ is elicited in different emotional states by the same speaker. The corresponding EGG signals are plotted in Fig. 2(a)–(e). In all the spectrograms, the low frequency regions are much darker than the high frequency regions, which shows that the emotional information in glottal signals is concentrated in the low frequency regions. A minimal sketch of this inspection is given after Fig. 2.

Fig. 2. Spectrogram analysis of glottal signals for the same utterance in different emotions. Plots (a)–(e) show the glottal signals for Neutral, Anger, Happy, Boredom and Fear, respectively; plots (f)–(j) are the corresponding wide-band spectrograms.
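The inspection in Figs. 1 and 2 can be reproduced with a few lines of signal processing code. The following is a minimal sketch, assuming the EGG channel of an utterance has been saved as a mono WAV file (the file name is hypothetical); a short analysis window of about 4 ms yields the wide-band spectrogram.

```python
# Minimal sketch: differenced EGG and wide-band spectrogram of an EGG signal.
# Assumes the EGG channel of an utterance is stored in "egg_utterance.wav"
# (hypothetical file name).
import numpy as np
import soundfile as sf
from scipy.signal import spectrogram

egg, fs = sf.read("egg_utterance.wav")

# First difference of the EGG signal: the impulses mark the
# glottal closure instants (GCIs) visible in Fig. 1(c).
degg = np.diff(egg)

# Wide-band spectrogram: a short (~4 ms) Hamming window trades frequency
# resolution for the temporal detail used in Fig. 2(f)-(j).
nperseg = int(0.004 * fs)
freqs, times, Sxx = spectrogram(egg, fs=fs, window="hamming",
                                nperseg=nperseg, noverlap=nperseg // 2)
```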

In this work, the state-of-the-art perceptually motivated Mel Frequency Cepstral Coefficients (MFCC) are considered as features of the EGG signals for Gaussian Mixture Modelling [10]. Since human perception of sound follows the Mel scale, Mel filters are used for computing MFCC features from speech signals. MFCCs are widely used as features in many applications such as Speaker Recognition, Speaker Identification, Speech Recognition and Emotion Recognition [9, 14, 15]. The work reported so far in the literature shows that emotion recognition is performed exclusively on emotive speech signals [1, 7, 11, 16]. Pati et al. [12] used Residual MFCC (RMFCC) features from linear prediction residual signals, which are an approximation of the glottal signals derived from speech. Since EGG signals carry only glottal information, the proposed emotion recognition system extracts MFCC features directly from them. The work presented in this paper therefore experiments on EGG signals by extracting the 13 static MFCC and the 39 dynamic MFCC features, varying the number of filters in the Mel filter bank so as to emphasize the low frequency regions. The organization of the work is as follows: Sect. 2 describes the development of the emotion recognition system using EGG. Section 3 explains the performance analysis of emotion recognition using EGG signals. The summary and conclusions of the present work are given in Sect. 4.

2 Development of Emotion Recognition System Using EGG

2.1 Production of EGG

EGG encompasses more emotional information, which is captured at the time of elicitation. It is recorded through a device called an Electroglottograph, shown in Fig. 3, which contains a pair of electrodes placed near the glottis to capture the vocal fold vibrations during speech production [8]. It measures the vibration by passing a small amount of current across the contact area of the vocal folds. The impedance across the electrodes varies with the vibrations of the vocal folds [6]. This variation of impedance produces the quasi-periodic (non-stationary) signals known as EGG signals.

Fig. 3. Electroglottograph.

MFCC Feature Extraction from EGG. A nonlinear triangular Mel scale filter bank [14], as shown in Fig. 4 (filters are linearly spaced in the low frequency regions (<1000 Hz) and logarithmically spaced in the high frequency regions (>1000 Hz)), has the potential to emphasize the lower frequency components over the higher ones. Mel filters are designed to mimic the human auditory perception of sound by concentrating more on the low frequency regions. As EGG signals are low frequency in nature, Mel Frequency Cepstral Coefficients can act as good features representing the emotional information present in the low frequency regions. To extract the features, the non-stationary signal is divided into short quasi-stationary frames of 20 ms using a Hamming window, with a frame shift of 10 ms; the Hamming window is used to avoid spectral leakage. Along with the 13 static MFCC, the \(\varDelta \) (velocity) and \(\varDelta \varDelta \) (acceleration) features are extracted from each frame and combined with the 13 MFCC features to form a 39-dimensional feature vector. The 13 and 39 MFCC features are extracted separately for filter bank sizes ranging from 14 to 46 (14, 16, 18, ..., 46), as sketched below.
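As a concrete illustration, here is a minimal sketch of this extraction step using librosa as a stand-in for HTK (the paper's experiments use HTK; the function and parameter names below follow librosa's API, and the exact filter bank shapes may differ slightly between the two toolkits).

```python
# Minimal sketch of MFCC extraction from an EGG signal with librosa.
import librosa
import numpy as np

def egg_mfcc(egg, fs=16000, n_filters=28, dynamic=False):
    """13 static MFCCs per 20 ms Hamming-windowed frame, 10 ms shift.
    With dynamic=True, delta and delta-delta are appended (39 dimensions)."""
    n_fft = int(0.020 * fs)   # 20 ms frame
    hop = int(0.010 * fs)     # 10 ms shift
    mfcc = librosa.feature.mfcc(y=egg, sr=fs, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming", n_mels=n_filters)
    if not dynamic:
        return mfcc                                 # 13 x n_frames
    delta = librosa.feature.delta(mfcc)             # velocity
    delta2 = librosa.feature.delta(mfcc, order=2)   # acceleration
    return np.vstack([mfcc, delta, delta2])         # 39 x n_frames
```

Varying `n_filters` over 14, 16, ..., 46 reproduces the sweep over Mel filter bank sizes described above.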

3 Performance Analysis of Emotion Recognition Using EGG Signals

The performance of emotion recognition using EGG signals is analyzed on the classic German emotional speech database (EmoDb) [3], which includes simultaneous recordings of speech and EGG signals. The database covers six emotions (Anger, Happy, Fear, Boredom, Sad, Disgust) apart from Neutral, recorded by 10 professional actors (5 male and 5 female) speaking 10 neutral sentences in each emotion. Of the six emotions, four (Anger, Happy, Fear, Boredom) along with Neutral are considered in this analysis. Each sample is recorded at a sampling rate of 48 kHz with 16 bits per sample. In this work, the speech and EGG channels of the German emotional speech data are separated, and the separated EGG signals are downsampled to 16 kHz. Training and testing are performed with 590 utterances: 474 utterances are used for training the GMMs and 116 for testing them. A series of experiments is conducted with the 13 static MFCC features and the 39 dynamic MFCC features for different filter bank sizes. These cepstral features are modelled with 512 Gaussian mixture components, keeping in view the small size of the training data, and the trained GMMs are tested for classification accuracy on the different emotional classes. The MFCC-GMM based emotion recognition system is implemented with the HTK (Hidden Markov Model Toolkit) [17]; the configuration files are available at https://drive.google.com/drive/folders/0BzHkgLdbz2n-OGR5dnJHTXhwVTg?usp=sharing. The observations inferred from the experiments are twofold.
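For illustration, a minimal sketch of the per-emotion GMM classifier follows, using scikit-learn in place of HTK; `egg_mfcc` is the extraction helper sketched in Sect. 2.1, and the training details (diagonal covariances, default initialization) are assumptions rather than the exact HTK configuration.

```python
# Sketch: one GMM per emotion, classification by maximum log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["neutral", "anger", "happy", "boredom", "fear"]

def train_gmms(train_feats, n_components=512):
    """train_feats: dict mapping emotion -> list of (13 x n_frames) arrays."""
    gmms = {}
    for emo in EMOTIONS:
        X = np.hstack(train_feats[emo]).T   # pool all frames, frames as rows
        gmms[emo] = GaussianMixture(n_components=n_components,
                                    covariance_type="diag").fit(X)
    return gmms

def classify(gmms, feats):
    """Pick the emotion whose GMM gives the highest average frame log-likelihood."""
    scores = {emo: gmm.score(feats.T) for emo, gmm in gmms.items()}
    return max(scores, key=scores.get)
```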

EGG signals show better performance in classifying the emotions with the 13 static MFCC features than with the 39 dynamic MFCC features (obtained by adding \(\varDelta \) and \(\varDelta \varDelta \)) for the conventional Mel filter bank of size 28, as seen from Table 1.

Table 1. Classification accuracies (%) for emotion recognition in EGG from German EmoDb with the 13 and 39 MFCC features for the conventional filter bank of size 28.

The rationale behind this result is that the 39 dynamic MFCC features account for the change in the dynamics of the vocal tract across frames of an audio signal. Unlike speech signals, EGG signals contain only glottal information (excitation source information), which is already captured by the 13 static MFCC features. Since EGG signals lack vocal tract information, adding the dynamic features (\(\varDelta \) and \(\varDelta \varDelta \)), which represent that information, does not improve the recognition performance. Therefore the proposed work experiments only with the 13 static MFCC features, increasing the number of filters in the low frequency regions to give them more emphasis.

Tables 2 and 3 summarize the series of experiments conducted with the 13 static MFCC features while varying the number of filters in the Mel filter bank.

Table 2. Classification accuracies (%) for emotion recognition in EGG from German EmoDb with the 13 static MFCC features containing different filter bank coefficients.
Table 3. Classification accuracies (%) for emotion recognition in EGG from German EmoDb with the 13 static MFCC features containing different filter bank coefficients.

It is evident from Tables 2 and 3 that the performance of the MFCC-GMM based emotion recognition system increases as the number of Mel filters in the low frequency regions increases. This is because EGG signals are low frequency in nature, and placing more filters in the low frequency regions captures more emotional information. The best performance, 80.17%, is obtained using 256 Gaussian mixtures with the 13 static MFCC features for the higher-order filter bank of size 38. When the number of filters is increased beyond 38, the recognition performance degrades: as the filters become denser in the low frequency regions, the width of the triangular filters decreases, which in turn fails to capture the relevant emotional information. Figure 4 shows the triangular Mel filter banks of size 28 and 36; it is evident from the figure that denser filters in the lower frequency regions give that band more emphasis. The small sketch below illustrates this narrowing effect numerically.
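The sketch assumes librosa's Mel frequency helper as a stand-in for HTK's filter bank design; it prints how the spacing of filter centers below 1 kHz (and hence the triangle width, which equals the distance between adjacent band edges) shrinks as the filter bank grows.

```python
# Sketch: spacing of Mel filter centers below 1 kHz for growing filter banks.
# The interior values of an (n_mels + 2)-point Mel-frequency grid are the
# center frequencies of the triangular filters (librosa's construction).
import numpy as np
import librosa

for n_mels in (28, 38, 46):
    centers = librosa.mel_frequencies(n_mels=n_mels + 2,
                                      fmin=0.0, fmax=8000.0)[1:-1]
    low = centers[centers < 1000.0]
    print(f"{n_mels} filters: {low.size} centers below 1 kHz, "
          f"mean spacing = {np.diff(low).mean():.0f} Hz")
```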

Fig. 4. Mel filter banks of size 28 and 36.

4 Summary and Conclusion

The work proposed in this paper focuses on using EGG signals for emotion recognition. As EGG signals are low frequency in nature and approximate the glottal information during the production of emotive speech, the perceptually motivated Mel Frequency Cepstral Coefficients (MFCC) are extracted from them for Gaussian Mixture Modelling. The conclusions drawn from this work are as follows:

  • The MFCC-GMM system with the 39 dynamic MFCC features (with \(\varDelta \) and \(\varDelta \varDelta \)) does not improve emotion recognition performance for EGG signals, whereas the 13 static MFCC features give better performance.

  • Increasing the number of Mel filters in the Mel filter bank up to a certain level for the computation of MFCC features emphasizes the low frequency components and helps to improve the emotion recognition performance.

Future work will concentrate on building an emotion recognition system using acoustic features derived from Munich's openEAR toolkit [4] for the same EGG signals in the classic German emotional speech database (EmoDb). The results obtained for emotion recognition with the conventional state-of-the-art MFCC-GMM system can be verified or improved using other classification algorithms based on deep networks.