
1 Introduction

The field of man-machine communication has witnessed tremendous improvement in recent years, yet we still have difficulty communicating with machines naturally. Speech emotion is believed to be particularly useful in human-computer interfaces, because emotion carries essential semantics and helps machines better understand human speech [1]. However, speech emotion recognition is technically challenging because it is not clear which speech features are salient enough to efficiently characterize different emotions [2, 3]. The aim of this study is to find new affect-salient features for speech emotion recognition.

Conventional speech emotion recognition approaches rely mostly on feature selection. Perceptual features have been intensively studied for estimating the emotion of speakers [2, 4, 5]. Perceptual features [6] consist of low-level descriptors (LLDs) and statistical features, as described in Table 1. The LLDs are the zero-crossing rate (ZCR) of the time signal, root-mean-square (RMS) frame energy, fundamental frequency (F0), harmonics-to-noise ratio (HNR) computed by the autocorrelation function, and mel-frequency cepstral coefficients (MFCC) 1–12; for each of these, the delta coefficients are additionally computed. Statistical features are statistical functionals of the LLDs. Using deep neural networks (DNN) to learn deep features from perceptual features is common in speech emotion recognition. For example, a DNN has been used to obtain the probability distribution of emotional states from perceptual features, with an extreme learning machine (ELM) for classification [7]. Other studies showed that the combination of a bi-directional long short-term memory recurrent neural network (BLSTM-RNN) and a fully connected network with an attention model performs well on perceptual features [8, 9]. These studies highlight perceptual features combined with deep networks for speech emotion recognition.

Table 1. The composition of perceptual features

Perceptual features are chosen by experience and are not comprehensive, so it is uncertain whether features selected from prior knowledge are adequate for good performance in all situations. Compared with perceptual features, using the spectrogram has proved successful for speech recognition [10,11,12]. It is recognized that the emotional content of utterances influences the spectral energy in the frequency domain [13]. Deep networks of different structures based on the raw spectrogram have shown significant improvements in speech emotion recognition. In [14,15,16], deep features were extracted from raw spectrograms with convolutional neural networks (CNN) and achieved good results. These studies indicate that CNNs can process spectrograms effectively and help identify emotions.

A comprehensive spectrogram can be obtained directly from the speech signal rather than chosen by experience from prior knowledge. However, when a CNN is applied to the spectrogram alone, it is difficult to sufficiently exploit the prior knowledge embodied in perceptual features for automatic speech emotion recognition. To overcome this problem, we propose novel features that utilize prior knowledge and comprehensive spectrographic information simultaneously. First, frame-level LLDs are arranged along the timeline to form time-sequence LLDs. Then, the segmental spectrogram and the time-sequence LLDs are fused along the timeline as CSF. Global features are considered important for speech emotion recognition [19], whereas the LLDs in CSF contain mostly local information. To address this and fully utilize perceptual features, statistical features, i.e., statistical functionals of the LLDs, are added to CSF to generate RSF. Finally, a CNN is employed to extract deep features from the proposed features. This is, to our knowledge, the first work to combine perceptual features and the spectrogram before feature extraction and to treat them as 2-D images fed into a CNN model to extract deep features for emotion classification.

The outline of this paper is as follows. The baseline system is described in Sect. 2. Section 3 introduces the proposed features and fusion methods. Sections 4 and 5 cover the experiments and conclusions.

2 Baseline System

In this section, our baseline system is described, following previous works [15,16,17,18]; its structure is shown in Fig. 1. First, speech signals are split into segments of fixed length. Second, the short-time Fourier transform (STFT) is used to transform the segmental signals into amplitude spectrograms, using 256 FFT points. Then, the segmental spectrograms are fed to a CNN to extract deep spectrogram features. Finally, a BLSTM is used to identify utterance-level emotions. The baseline system does not make use of prior knowledge (e.g., F0).

Fig. 1. Structure of the baseline system

The reasons why we primarily focus on the CNN-BLSTM architecture are as follows (a minimal model sketch is given after this list):

(1) Since a CNN models local temporal and spectral correlations [20], it is first used to extract deep features from the 2-D representations of our proposed features.

(2) Emotion is manifested in speech through temporal dependencies of variable range. The BLSTM is used to model the sequential dynamics of an utterance [21]; it is expected to capture both long- and short-term temporal dependencies among the CNN-based features of consecutive segments in an utterance.
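The sketch below is a minimal CNN-BLSTM in PyTorch with hypothetical layer sizes (the configuration actually used is the one listed in Table 2, not reproduced here). The CNN encodes each 2-D segment representation, e.g. a 25 x 129 segmental spectrogram, and the BLSTM aggregates the segment embeddings over an utterance before a softmax over the seven EmoDB emotions.

```python
# Minimal CNN-BLSTM sketch; all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    def __init__(self, n_freq=129, n_classes=7, hidden=128):
        super().__init__()
        # Segment-level CNN over the (time=25, freq=n_freq) feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),        # -> (batch*segments, 64, 1, 1)
        )
        # Utterance-level BLSTM over the sequence of segment embeddings.
        self.blstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, segments, 25, n_freq)
        b, s, t, f = x.shape
        feats = self.cnn(x.reshape(b * s, 1, t, f)).reshape(b, s, 64)
        out, _ = self.blstm(feats)               # (batch, segments, 2*hidden)
        return self.classifier(out.mean(dim=1))  # average over segments

# Example: a batch of 4 utterances, each split into 10 segments.
logits = CNNBLSTM()(torch.randn(4, 10, 25, 129))  # -> (4, 7)
```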

3 CNN Based on Spectrogram and Perceptual Features

3.1 Motivation for Fusing Spectrogram and Perceptual Features

In this section, the motivation for the feature fusion is described from visual and theoretical perspectives. Figures 2 and 3 show utterance-level spectrograms and time-sequence LLDs for different emotions with the same linguistic content. Figure 2(a) shows the spectrogram of a sad utterance, and Fig. 2(b) that of a neutral one. The depth of the reddish color indicates the level of frequency energy. It is clear from Fig. 2 that the spectrograms of sadness and neutral, both low-arousal emotions, have similar patterns. To classify emotions with similar arousal, utilizing LLDs may therefore be useful.

Fig. 2. Visualization of spectrograms

Fig. 3. Visualization of time-sequence LLDs

In Fig. 3, the horizontal axis represents the time domain of the utterance and the vertical axis represents the 32-dimensional LLDs. Figure 3(a) and (b) are clearly different, which means that in this situation the selected LLDs distinguish emotions of similar arousal more easily than the spectrogram does. However, it is unclear which kind of feature is more effective in other cases. To adapt the features to various situations, we attempt to fuse the spectrogram and perceptual features for speech emotion recognition.

From a theoretical perspective, CNNs are excellent at mining deep information from the raw spectrogram. In wide-band spectrograms, formants are emphasized but F0 is not, whereas F0 is known to be the main vocal cue for emotion recognition [22]. Perceptual features can provide much prior knowledge (e.g., F0) that is useful for emotion recognition. To make use of the spectrogram and perceptual features simultaneously, we propose the novel CSF and RSF features for speech emotion recognition.

3.2 Fusion Strategy of CSF and RSF

In this study, the fusion strategy of CSF and RSF consists of three steps.

The first step is to calculate the segmental spectrogram. After the STFT, the raw spectrographic matrix of size \(25 \times 129\) is obtained, where 25 is the number of time points and 129 depends on the selected frequency region and the frequency resolution.
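A sketch of this step is shown below. It assumes 16 kHz audio, a 25 ms frame with a 10 ms shift and 256 FFT points, which yields 256/2 + 1 = 129 one-sided frequency bins and 25 frames per 265 ms segment. The paper does not state the window type or how a 25 ms (400-sample) frame maps onto a 256-point FFT; using numpy's rfft, which crops the frame to 256 samples, is only one plausible reading.

```python
# Hedged sketch of segmental amplitude spectrogram extraction.
import numpy as np

def segment_spectrogram(segment, sr=16000, n_fft=256,
                        frame_ms=25, shift_ms=10):
    frame_len = int(frame_ms / 1000 * sr)    # 400 samples
    hop = int(shift_ms / 1000 * sr)          # 160 samples
    n_frames = 1 + (len(segment) - frame_len) // hop  # 25 for a 265 ms segment
    window = np.hamming(frame_len)           # window choice is an assumption
    frames = np.stack([segment[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # np.fft.rfft with n=n_fft crops each windowed frame to 256 points,
    # giving 256/2 + 1 = 129 one-sided frequency bins.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (25, 129)

segment = np.random.randn(int(0.265 * 16000))   # one 265 ms segment
S = segment_spectrogram(segment)                # (25, 129)
```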

The next step is to use the openSMILE toolkit to obtain the frame-level LLDs and segment-level statistical features described in Table 1. The LLDs are organized in time series: every 25 frame-level LLD vectors constitute one segmental time-sequence LLD matrix. After normalization, the time-sequence LLD matrix of size \( 25 \times 32\) is obtained, where 25 is the number of frames in a segment and 32 is the dimension of the LLDs.
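The following sketch arranges precomputed frame-level LLDs into segmental time-sequence matrices. The LLDs are assumed to have been extracted with openSMILE beforehand and loaded as a (n_frames, 32) array; per-dimension z-score normalization is one plausible choice, since the paper only states that the features are normalized.

```python
# Hedged sketch: arrange frame-level LLDs into (segments, 25, 32) blocks.
import numpy as np

def time_sequence_llds(llds, frames_per_segment=25):
    # llds: (n_frames, 32) frame-level LLDs of one utterance (from openSMILE)
    mean, std = llds.mean(axis=0), llds.std(axis=0) + 1e-8
    llds = (llds - mean) / std                 # z-score normalization (assumed)
    n_segments = len(llds) // frames_per_segment
    return llds[:n_segments * frames_per_segment].reshape(
        n_segments, frames_per_segment, 32)    # (n_segments, 25, 32)

L = time_sequence_llds(np.random.randn(250, 32))   # -> (10, 25, 32)
```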

The third step is the feature fusion. Based on the timeline, the segmental spectrogram and the time-sequence LLDs are spliced together as CSF, giving a CSF matrix of size \( 25 \times 161\). The CSF matrix of the j-th segment in the i-th utterance can be formulated as:

$$\begin{aligned} CSF_{ij} = [S_{ij}, L_{ij}] , \end{aligned}$$
(1)

where \(S_{ij}\) and \(L_{ij}\) are the spectrogram matrix and the time-sequence LLD matrix of the j-th segment in the i-th utterance, respectively.
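Concretely, CSF is a frame-wise concatenation of the two segment-level matrices along the feature axis, as in Eq. (1); a minimal sketch:

```python
import numpy as np

S_ij = np.random.randn(25, 129)   # segmental spectrogram
L_ij = np.random.randn(25, 32)    # segmental time-sequence LLDs
CSF_ij = np.concatenate([S_ij, L_ij], axis=1)   # (25, 161)
```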

To splice the spectrogram, time-sequence LLDs and statistical features, the 384-dimensional statistical features are first reduced to 375 dimensions using PCA. The reduced statistical features are then reshaped to \(25 \times 15\). Finally, the segmental spectrogram, time-sequence LLDs and reshaped statistical features are spliced together as RSF, giving an RSF matrix of size \( 25 \times 176\). The RSF matrix of the j-th segment in the i-th utterance can be formulated as:

$$\begin{aligned} RSF_{ij} = [S_{ij}, L_{ij}, C_{ij}] , \end{aligned}$$
(2)

where \(C_{ij}\) is the statistical feature matrix of the j-th segment in the i-th utterance. Figure 4 depicts the detailed feature extraction of RSF.
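A sketch of the RSF construction following Eq. (2) is given below. It assumes the PCA is fitted on the training set of segment-level statistical feature vectors; the paper does not specify the fitting set.

```python
# Hedged sketch: PCA (384 -> 375), reshape to 25 x 15, splice into RSF.
import numpy as np
from sklearn.decomposition import PCA

stats_train = np.random.randn(50000, 384)     # segment-level statistical features
pca = PCA(n_components=375).fit(stats_train)  # fitting set is an assumption

def rsf(S_ij, L_ij, stats_ij):
    C_ij = pca.transform(stats_ij[None, :]).reshape(25, 15)   # 375 -> 25 x 15
    return np.concatenate([S_ij, L_ij, C_ij], axis=1)          # (25, 176)

segment_rsf = rsf(np.random.randn(25, 129), np.random.randn(25, 32),
                  stats_train[0])                               # (25, 176)
```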

Fig. 4. Extraction of RSF. S represents the spectrogram, L the time-sequence LLDs, and C the statistical features.

4 Experiment

4.1 Experimental Setup

We choose speech materials from EmoDB [23], which contains seven emotion categories: disgust, sadness, fear, happiness, neutral, boredom and anger, with 46, 62, 69, 71, 79, 81 and 127 utterances, respectively. In total there are 535 simulated emotional utterances in German. All utterances are approximately 2–3 s long and sampled at 16000 Hz. Arousal describes the intensity of an emotion. In terms of the arousal space [24], anger, fear, disgust and happiness are high-arousal emotions, while sadness, boredom and neutral are low-arousal emotions.

According to [25], a speech segment longer than 250 ms contains sufficient emotional content. In our experiment, the utterances are split into segments with a 265-ms window. Each segment is divided into 25 frames using a 25-ms window shifted by 10 ms. About 50,000 segments are collected in this way. Table 2 lists the details of the network. Other parameter settings were also tested, and the configuration in Table 2 gave the best performance.
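The segmentation arithmetic is sketched below: a 265 ms segment holds exactly 25 frames of 25 ms shifted by 10 ms, since 24 x 10 ms + 25 ms = 265 ms. The segment shift used here (25 ms) is purely hypothetical; the paper does not report it, and it is chosen only so that roughly 50,000 segments result from 535 short utterances.

```python
# Hedged sketch of utterance segmentation with a hypothetical segment shift.
import numpy as np

SR = 16000
SEG_LEN = int(0.265 * SR)        # 4240 samples per 265 ms segment
SEG_SHIFT = int(0.025 * SR)      # hypothetical 25 ms segment shift

def split_utterance(signal):
    starts = range(0, len(signal) - SEG_LEN + 1, SEG_SHIFT)
    return np.stack([signal[s:s + SEG_LEN] for s in starts])

segments = split_utterance(np.random.randn(int(2.5 * SR)))
print(segments.shape)            # roughly (90, 4240) for a 2.5 s utterance
```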

Table 2. Parameters of the CNN-BLSTM network

Due to the limited size of the Berlin Emotion Database, we run 10-fold cross-validation. Weighted accuracy (WA), unweighted accuracy (UA), F1 and relative error reduction are used to evaluate the results. WA is the accuracy over all test utterances, UA is the average per-category recall, F1 is the harmonic mean of precision and recall, and relative error reduction is the ratio of the error reduction to the original error.
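The sketch below computes these metrics, assuming integer emotion labels and taking F1 as the macro-averaged F1 across the seven classes (the paper does not state the averaging); relative error reduction compares a proposed system against the baseline error rate.

```python
# Hedged sketch of WA, UA, F1 and relative error reduction.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    wa = accuracy_score(y_true, y_pred)                    # weighted accuracy
    ua = recall_score(y_true, y_pred, average='macro')     # mean per-class recall
    f1 = f1_score(y_true, y_pred, average='macro')         # averaging assumed
    return wa, ua, f1

def relative_error_reduction(acc_baseline, acc_proposed):
    return ((1 - acc_baseline) - (1 - acc_proposed)) / (1 - acc_baseline)

y_true = np.array([0, 1, 2, 2, 3, 4, 5, 6])
y_pred = np.array([0, 1, 2, 1, 3, 4, 5, 6])
print(evaluate(y_true, y_pred))
print(relative_error_reduction(0.80, 0.85))   # 0.25, i.e. 25% error reduction
```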

4.2 Experimental Results

This section presents the classification results of our proposed features. From Table 3, we conclude: (1) Compared with the spectrogram, the proposed time-sequence LLDs improve WA and UA by relative error reductions of 11.23% and 10.29%, respectively. One reason is that the time-series information of the LLDs is exploited more adequately by the BLSTM; another is that the selected LLDs perform better than the raw spectrogram on a small amount of training data. (2) CSF outperforms the spectrogram with relative error reductions of 33.76% (WA) and 32.04% (UA), and RSF outperforms it with 38.06% (WA) and 36.91% (UA). These results reveal that the spectrogram and perceptual features are complementary and that the proposed features are effective. (3) RSF performs better than CSF, indicating that adding statistical features to CSF is useful, and that reshaping the statistical features and treating them as an additional 2-D map for the CNN to learn from is effective.

Table 3. WA and UA of different features with CNN-BLSTM
Fig. 5. The F1 (%) of different features on different emotions

Figure 5 shows the contribution of the proposed features to classifying different kinds of emotion, compared with the spectrogram (baseline features). (1) CSF and RSF both outperform the spectrogram on all emotions, especially boredom. (2) RSF performs better than CSF on most emotions; however, for boredom and neutral, CSF performs better than RSF. We assume that the LLDs show no noticeable change in boredom and neutral utterances, so adding extra statistical features is unnecessary in this situation. (3) In terms of the average F1 over the seven emotions, CSF and RSF significantly outperform the spectrogram with relative error reductions of 33.68% and 38.80%, respectively. Overall, both CSF and RSF are effective for identifying different categories of emotion.

5 Conclusions and Future Works

In this paper, we proposed time-sequence LLDs, CSF and RSF for speech emotion recognition. The proposed features were individually fed into a CNN model to extract deep features, and a BLSTM was employed for the final classification. To our knowledge, this is the first work to combine the spectrogram and perceptual features for speech emotion recognition. Our results indicate that the spectrogram and perceptual features are complementary and that the proposed features are effective.

For future work, we will evaluate the proposed features on other, larger datasets and consider integrating speaker, gender and linguistic features into our experiments.