
1 Introduction

Speech Emotion Recognition (SER) has become an active research topic in recent years because it can identify a person's emotional state from his or her voice. This makes it an important component of Human-Computer Interaction (HCI), with applications in e-learning, robotics, healthcare, security, entertainment and more. In general, an SER system is a pattern recognition system that extracts a vector of speech features from an emotional speech database and uses a classifier to recognize a person's emotional state.

Since the feature extraction stage plays an important role in the performance of any pattern recognition system, the first issue in this area is finding the features that best increase SER accuracy. The literature identifies four categories of acoustic speech emotion features: voice quality, prosodic, spectral and wavelet features. According to Wang et al. [1], prosodic and spectral features are the most commonly used.

When working with spectral features, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP) and Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP), the first and most important question is how many coefficients to use. However, there are no guidelines for choosing the best number of coefficients. The tradeoff is that a large number of coefficients may capture more useful information in the feature vectors, but it also increases feature dimensionality and possible redundancy, which raises the computational cost. On the other hand, a small number of coefficients may not capture enough useful information, which may result in low recognition accuracy.

Fig. 1 Numbers of coefficients used with spectral features in SER systems, as reported in a survey of 40 papers published between 2000 and 2015

The literature shows that researchers have used various numbers of coefficients in developing their SER systems (Fig. 1). Pierre-Yves [2] used 10 MFCC coefficients. Rong et al. [3], Schuller et al. [4] and Lee et al. [5] used 12 MFCC coefficients. Lee et al. [6], Wang and Guan [7] and Lugger and Yang [8] used 13 MFCC coefficients. Schuller et al. [9] used 15 MFCC coefficients. Several authors chose the same number of coefficients for different spectral features; for example, Kim et al. [10] used 12 coefficients for both LPC and MFCC. Others chose different numbers of coefficients for different spectral features; for example, Nwe et al. [11] used 16 coefficients for LPCC and 12 coefficients for both MFCC and LFPC, while Fu et al. [12] selected 10 coefficients for LPCC and 12 coefficients for MFCC.

Some researchers also tested different numbers of coefficients for the same spectral features. For example, Koolagudi et al. [13] used 6, 8, 13, 21 and 29 coefficients for both LPCC and MFCC, Murugappan et al. [14] used 13, 15 and 20 coefficients for MFCC, and Milton et al. [15] used MFCC with 10, 15, 24 and 23 coefficients. To reduce the dimensionality and computational cost of the SER system, Hegde et al. [16] applied the F-ratio technique to select a subset of 12 MFCC coefficients within a Hidden Markov Model (HMM) framework, and concluded that selecting 8 MFCC coefficients offers better classification accuracy than using all 12 coefficients.

The works mentioned above make it clear that there is no uniform pattern for choosing a suitable number of coefficients. This paper proposes two approaches for selecting an optimized number of coefficients, depending on the classifier, that can help increase SER accuracy while reducing feature vector dimensionality.

2 The Proposed System

Figure 2 shows the architecture of the proposed speech emotion recognition system, which is based on optimized coefficients. The system operates as follows:

Fig. 2 The proposed system architecture

  1. The system starts with the speech records from the emotional database, which is described in Sect. 3.

  2. The feature step involves pre-processing and extracting the spectral features over the selected scope of coefficient numbers, followed by optimization of the number of coefficients for the spectral features; the main method and algorithm are described in Sect. 4.

  3. After the optimization process, the feature vectors are fed to the classifier, which produces the classification result (accuracy or class label). The classification method is described in Sect. 5.

3 The Berlin Emotional Database (EMO-DB)

A significant number of emotional speech databases have been developed for testing SER systems. Some are publicly available, while others were created to meet a researcher's particular needs. Emotional speech databases can be divided into three categories: acted, spontaneous and Wizard-of-Oz databases. It is more realistic to use a database collected from real-life situations, and such data can serve as a good baseline for building real-life applications in a specific industry. However, acted databases are considered the easiest to collect, and several studies have shown that they can yield strong results, making them suitable for theoretical research.

Within this study, the Berlin Emotional Database (EMO-DB), one of the most well-known acted emotional speech databases, was selected [17]. It has also been used with spectral features in many studies [18, 19]. EMO-DB is an acted German emotional speech database recorded at the Department of Acoustic Technology at TU-Berlin and funded by the German research community. It was recorded with a Sennheiser microphone at a sampling frequency of 16 kHz, using ten professional actors (five male and five female). The actors were asked to simulate seven emotions, namely anger, boredom, disgust, fear, happiness, sadness and a neutral state, for ten utterances each. After the recording, twenty judges listened to the utterances in random order in front of a computer monitor. They were allowed to listen to each sample only once before deciding on the emotional state of the speaker. After this selection process, the database contained a total of 535 speech files.

4 Features

4.1 Features Pre-Processing and Extraction

In this work, we considered five different spectral features, namely MFCC, LPC, LPCC, PLP and RASTA-PLP. MFCC is considered the most widely used speech feature [20–22]. It has been employed extensively in speech recognition and speech emotion recognition systems, and Poa et al. [23] reported it as the best and most frequently used acoustic feature in SER. LPC has also been considered one of the dominant techniques for speech analysis [23]. LPCC is an extension of LPC that has the advantages of lower computational cost and a more efficient algorithm, and it describes the vowels in a better manner [24].

PLP is also an improvement on LPC, using a perceptually based Bark filter bank; PLP analysis is computationally efficient and permits a compact representation [25]. RASTA-PLP improves on the PLP method by adding a special band-pass filter to each frequency sub-band of the traditional PLP algorithm, in order to smooth out short-term noise variations and remove any constant offset in the speech channel.

MATLAB R2012a was employed to compute 30 coefficients for each of the five features, using a frame length of 25 ms with a 10 ms shift. Ten statistical measurements, including minimum, maximum, standard deviation, median, mean, range, skewness and kurtosis, were then applied to the five spectral features for all speech samples.
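As a concrete illustration of this step, the sketch below extracts 30 MFCC coefficients with a 25 ms window and a 10 ms shift and summarizes each coefficient trajectory with frame-level statistics. It is a minimal sketch in Python using librosa, numpy and scipy rather than the paper's MATLAB code; it covers only the eight statistics named above, and the function name mfcc_functionals is an illustrative choice. The other four spectral features would be summarized in the same way.

```python
# Sketch of the MFCC extraction and statistical summarization step,
# written with librosa/numpy/scipy rather than the MATLAB toolchain
# used in the paper.
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_functionals(wav_path, n_coeffs=30, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    frame_len = int(0.025 * sr)   # 25 ms analysis window
    hop_len = int(0.010 * sr)     # 10 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs,
                                n_fft=frame_len, hop_length=hop_len)
    # Summarize each coefficient trajectory with the statistics named above.
    stats = [mfcc.min(axis=1), mfcc.max(axis=1), mfcc.std(axis=1),
             np.median(mfcc, axis=1), mfcc.mean(axis=1),
             mfcc.max(axis=1) - mfcc.min(axis=1),          # range
             skew(mfcc, axis=1), kurtosis(mfcc, axis=1)]
    # Keep the result coefficient-major: one row per coefficient,
    # one column per statistic.
    return np.stack(stats, axis=1)   # shape (n_coeffs, n_stats)
```

Keeping the output coefficient-major makes it straightforward to slice out the functionals belonging to a given number of coefficients during the optimization step described next.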

4.2 Coefficients Optimization

Within this study, two approaches are proposed for optimizing the number of coefficients for the spectral features. The classifier is used to compare different numbers of coefficients and then to select the number that offers the best accuracy with the lowest number of features for speech emotion recognition. According to the literature, the numbers of coefficients used in the past range from 2 to 29. Based on this, the search scope for the number of coefficients was set from 0 to 30 for MFCC, PLP and RASTA-PLP. However, the first coefficient of LPC and of LPCC has the same value for all records, namely 1 for LPC and −1 for LPCC, so the search scope for both of these features was set from 1 to 30. The coefficients optimization process, shown in Fig. 3, is as follows:

Fig. 3 The coefficients optimization process

  1. The first coefficient number in the search scope (0 for MFCC, PLP and RASTA-PLP, and 1 for LPC and LPCC) is chosen.

  2. The features corresponding to this coefficient number are selected from the extracted feature vector.

  3. The classification accuracy is calculated using SVM. These steps are repeated until the final coefficient number in the search scope (30 for all five features) is reached.

  4. The coefficient number that gives the highest accuracy with the lowest number of features is chosen, and the corresponding features are selected.

  5. Finally, the selected features are combined into one vector.

The first approach optimizes the number of coefficients for the five features separately. The second approach optimizes the number of coefficients for the five features in combination. The selection and evaluation of the coefficient number according to the classification accuracy were carried out manually.
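One plausible reading of this procedure, sweeping the number of retained coefficients and scoring each candidate with a cross-validated SVM, is sketched below in Python with scikit-learn. It assumes the coefficient-major layout produced by the earlier extraction sketch (one array of per-coefficient functionals per utterance, with coefficients indexed 0 to 30); the function name and the tie-breaking rule are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the coefficient-number search (Fig. 3), assuming each
# utterance is summarized as a (31, n_stats) array of per-coefficient
# functionals; this is an illustrative reading of the procedure, not
# the paper's original code.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def optimize_num_coeffs(per_coeff_feats, labels, first=0, last=30):
    """per_coeff_feats: array of shape (n_utterances, 31, n_stats)."""
    best_acc, best_n = 0.0, None
    for n in range(first, last + 1):
        # Keep the functionals of the coefficients up to index n and flatten.
        X = per_coeff_feats[:, : n + 1, :].reshape(len(labels), -1)
        acc = cross_val_score(SVC(kernel="rbf"), X, labels, cv=5).mean()
        # Strict improvement keeps the smallest n reaching the best accuracy.
        if acc > best_acc:
            best_acc, best_n = acc, n
    return best_n, best_acc

# For LPC and LPCC the first coefficient is constant across all records,
# so the search would start at first=1 instead of 0.
```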

5 Classification

Several types of classifiers have been used in SER systems, including the Hidden Markov Model (HMM), K-Nearest Neighbors (KNN), the Artificial Neural Network (ANN), the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM). According to the literature [25], SVM and ANN are the most popular classifiers. In this paper, SVM was adopted because it performs well with limited training data that has many features. SVM is fundamentally a binary classifier used for classification and regression, handling two-class problems directly. SVM classifiers rely on kernel functions to nonlinearly map the original features into a high-dimensional space in which the data can be effectively separated by a linear classifier.

Classification with all speech utterances and spectral features was performed in MATLAB R2012a. The radial basis function (RBF) kernel was employed with optimized g (in the Gaussian function) and C (penalty parameter), since optimizing these classifier parameters improves the classification accuracy. The search scope of g was 2^(−10:1:10) and the search scope of C was 2^(−5:1:5), and 5-fold cross-validation was performed for parameter selection. Performance was measured in terms of accuracy, that is, the percentage of correctly classified instances over the total number of instances.
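A minimal sketch of this parameter search, using scikit-learn's GridSearchCV in place of the MATLAB implementation, might look as follows; X and y stand for the optimized feature vectors and the emotion labels and are not defined here.

```python
# Sketch of the RBF-SVM parameter selection with the grids given above;
# scikit-learn is used here instead of the paper's MATLAB setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "gamma": 2.0 ** np.arange(-10, 11),  # g in 2^(-10:1:10)
    "C": 2.0 ** np.arange(-5, 6),        # C in 2^(-5:1:5)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=5, scoring="accuracy")
# search.fit(X, y)   # X: optimized feature vectors, y: emotion labels
# print(search.best_params_, search.best_score_)
```

Note that scikit-learn's SVC handles the seven-class problem internally via one-vs-one decomposition, so the binary SVM described above extends directly to the multi-class setting.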

6 Experiments and Analysis of Results

6.1 Optimization Based on Discrete Spectral Features

Within the first approach, the number of coefficients was optimized separately for each feature, and the accuracy of the individual features was calculated. The result is shown in Fig. 4, where the x-axis indicates the number of coefficients and the y-axis indicates the corresponding accuracy value. From the figure it can be observed that LPC gives its best accuracy of 58 % with 5 coefficients, and LPCC gives its best accuracy of 74 % with 12 coefficients.

Fig. 4 The accuracy of LPC and LPCC for numbers of coefficients from 1 to 30

For MFCC, as Fig. 5 shows, the best accuracy was 86 % with 20 coefficients. PLP gives its best accuracy of 62 % with 15 coefficients, and RASTA-PLP gives its best accuracy of 54 % with 4 coefficients.

Fig. 5 The accuracy of MFCC, PLP and RASTA-PLP for numbers of coefficients from 0 to 30

The results show that MFCC provides the best accuracy among all the features, although this result requires the largest number of coefficients. LPCC and PLP provide good accuracy with a reasonable number of coefficients, while LPC and RASTA-PLP need the fewest coefficients but give the worst accuracy. After determining the best number of coefficients for each feature separately, the five features were combined, giving an overall accuracy of 84 % with 437 features.

6.2 Optimization Based on Combined Spectral Features

In the second approach, the five features were combined first, before the coefficients optimization. Figure 6 shows that the best accuracy for the combined features was 88 % with 8 coefficients and 286 features.

Fig. 6 The accuracy of the combined features for numbers of coefficients from 1 to 30

The two approaches produced remarkable results, as shown in Table 1; however, the second approach offered the highest accuracy with the lowest number of features.

Table 1 Accuracy of the two approaches

When the proposed method is compared with the most frequently used numbers of coefficients in previous work, namely 12 and 13 coefficients (as shown in Fig. 1), Table 2 shows that the numbers of coefficients selected by the two proposed approaches achieve much greater accuracy than those used in the past, and do so with fewer features.

Table 2 Comparison with the most frequently used numbers of coefficients in SER

7 Conclusion

In this paper, two approaches for optimizing the number of coefficients of spectral features, and for building a speech emotion model based on the optimized coefficients, were proposed. Experiments have shown that optimizing the number of coefficients not only increases the accuracy of the system compared with the most commonly used coefficient numbers, but also reduces the number of features. They also show that optimizing the number of coefficients for the spectral features in combination results in fewer features and better speech emotion recognition performance than optimizing each feature separately before combination. Other approaches for optimizing the number of coefficients will be studied in future work.