1 Introduction

Emotions are key components of effective human-computer interaction. Affective computing is the field of science that deals with understanding emotions. Sentiment analysis aims to identify emotions and can be used in many systems, such as recommender systems, to improve customer relationships. This paper focuses on emotion recognition from speech. Building systems that can communicate with humans via speech is a challenge [54]. Recognizing, analysing, and responding to the emotional state of humans is important in many applications, such as car driving systems, call centre conversation analysis, robotics, call analysis in emergency services such as fire brigade and ambulance, speech-to-speech translation, and many more. In emotion recognition, one of the main research issues is the extraction of acoustic features from speech signals to achieve better accuracy. Several features have been mentioned in the literature, categorized as prosodic features, spectral features, and a combination of both [13].

Acoustic features include intonation, energy, speaking rate, fundamental frequency, duration, intensity, and spectral characteristics. Several machine learning algorithms (MLAs) have been used to recognize emotions based on the spectral and prosodic features of speech signals. The Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), linear discriminant classifiers, nearest neighbour classifiers, Support Vector Machines (SVM), Artificial Neural Networks (ANNs), and Convolutional Neural Networks (CNNs) are classifiers widely used to recognize emotions based on the acoustic features of utterances [38, 50]. Recently, researchers have explored deep learning approaches for emotion recognition using high-level features, whereas others have used hand-crafted low-level features to train CNN, RNN, and DNN models to improve recognition accuracy.

The accuracy of emotion recognition with these classifiers depends on the selection of spectral and prosodic features of speech and on the feature extraction techniques used. Noise is another major challenge that affects the performance of speech recognition systems [37]. The features of a speech signal vary across language, culture, speaker, and gender, which results in a large number of "hand-crafted" features; deep learning can learn such features automatically. In this paper, the capability of a Deep CNN (DCNN) architecture is investigated for recognizing emotions from actor-based corpora in three different languages: the German emotional speech database (EMODB), the British English Surrey Audio-Visual Expressed Emotion database (SAVEE), and the Italian emotional speech corpus (EMOVO). The proposed work aims to improve the accuracy of emotion recognition for any language and speaker.

The main contributions of this paper are as follows:

  • An algorithm for recognition of emotions independent of language and speaker is proposed using DCNN architecture.

  • Instead of using the audio files of the actor-based speech corpora directly, their RGB spectrograms (images) are created and normalized before training. A common size of 224x224x3 is maintained for all spectrograms for fine-tuning the DCNN.

  • Optimal features are then learned automatically by the DCNN architecture from labelled samples. Seven emotions are recognized using three databases in different languages, with better accuracy than earlier studies.

  • An improvement in accuracy is reported using an optimal learning rate as compared to a random learning rate. The computational cost of the model is reduced by down-sampling using strided convolutions along with a pooling layer.

The remainder of the paper is organized as follows: related work on speech corpora, speech features, and classifiers used for speech emotion recognition is briefly discussed in Section 2. The details of the proposed algorithm are given in Section 3. Experimental details and results are discussed in Section 4. Finally, conclusions and future scope are summarized in Section 5.

2 Related work

A typical speech emotion recognition process is divided into two parts: i) extraction of relevant, high-level features from the speech signal and ii) selection of classifiers for accurate recognition of emotions. In this section, the existing literature on recognizing human emotions from speech is discussed in terms of speech corpora, speech features, and classifiers.

2.1 Speech corpora

The performance of human emotion recognition is highly dependent on the quality and type of speech database. Speech corpora can be categorized into three types, namely actor-based (simulated), elicited (induced), and natural. Simulated or actor-based speech corpora are collected from trained and experienced professional artists, who are asked to speak neutral sentences in many different emotions. Elicited or induced corpora are collected by creating an artificial emotional situation without informing the speaker [51]; all types of emotions may not be present in such a dataset, and speech quality may be low due to overlapping utterances. Natural speech corpora are collected from real-world data, such as call centre conversations or dialogues between doctor and patient; background noise may result in poor-quality speech signals.

In the present work, actor-based speech corpora are considered because this type of corpus is available in most languages and includes all seven emotions. Many authors have worked on actor-based speech corpora in different languages to recognize emotions. A large number of public and private actor-based speech corpora are available; the details of some of the popular ones are summarized in chronological order in Table 1.

Table 1 Details of Actor-based speech corpora

The popular public speech corpora EMODB, SAVEE, and EMOVO are considered to analyse the performance of emotion recognition using the proposed algorithm.

2.2 Features

Spectral characteristics, such as the distribution of energy across different parts of the audible frequency range, are acoustic features. The vocal tract shape is distinct for each type of emotion and can be estimated using spectral analysis; spectral features are therefore used to recognize human emotions [3, 41]. To compute spectral features, speech signals are divided into frames (segments) of 25-50 ms, over which the signal is assumed to be stationary. There are many techniques to extract spectral features; some popular ones are the Short-time Coherence Method (SMC), Linear Predictor Coefficients (LPC), One-Sided Autocorrelation Linear Predictor Coefficients (OSALPC) [4], and the LP residual [6]. The epoch (glottal closure instant) is very useful in estimating features such as vocal tract frequency and pitch. Mel frequency speech power coefficients (MFSPC) were used by Nwe [33]. A normal frequency can be converted to the mel frequency by Eq. (1).

$$ m=2595\,\log_{10}\!\left(\frac{f}{700}+1\right) $$
(1)

where m is the mel frequency and f is the normal frequency in Hz. Epochs can be extracted using a zero-frequency filtered speech signal and the LP residual approach [24]. Linear approaches may not always work well for emotion recognition because pitch perception is not linear. The ExpoLog scale, mel-frequency scale, and modified mel-frequency approaches are used to estimate non-linear scales. For recognizing the stress emotion, non-linear scales were reported to perform better than linear ones.
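As a quick numerical check of Eq. (1), the short NumPy sketch below converts frequencies to the mel scale and back; the function names are ours and the printed values are approximate.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the mel scale, following Eq. (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from the mel scale back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Average male (120 Hz) and female (210 Hz) pitch values on the mel scale
print(hz_to_mel(np.array([120.0, 210.0])))   # approx. [178.3, 295.7] mel
```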

Prosodic features include the pitch, length, loudness, and quality of speech. These features are not useful to derive at the frame level; therefore, they are extracted at the sentence and utterance level. Pitch is also known as the fundamental frequency (F0). The average pitch is observed to be about 210 Hz for female speakers and about 120 Hz for male speakers. If the speech signal is quasi-stationary, there are many different approaches to compute the pitch value. The cycle-to-cycle variation of the pitch period is known as jitter, and it can be computed by Eq. (2) [14].

$$ J=\frac{1}{N-1}\sum_{i=1}^{N-1}\left| T_i - T_{i+1} \right| $$
(2)

where J is the jitter, N is the number of cycles, and Ti is the i-th pitch period. Better recognition of human emotions can be achieved by using autocorrelation-based pitch estimation. The short-time energy of a speech signal can be computed by Eq. (3) [1].

$$ E_n=\sum_{m=n-N+1}^{n}\left[x(m)\,w(n-m)\right]^2 $$
(3)

where En is the short-time energy at sample n, N is the window length, x(m) is the speech signal, and w(·) is the window function. Energy distribution and amplitude in the spectrum affect the perceived arousal of a speaker [9]; therefore, duration and energy features are useful for recognizing human emotions. In general, the male energy level was found to be higher than that of females for the anger emotion, while for the same emotion the male speech rate was found to be lower than that of females. Statistical values of pitch include the minimum, maximum, range, mean, median, standard deviation, slope minima, slope maxima, kurtosis, skewness, jitter, relative pitch, first-order difference, and so on. Similarly, statistical features of energy and duration include shimmer, voiced-duration ratio, and speech rate, along with the mean, maximum, minimum, standard deviation, and so on. Energy, pitch, and duration contours have been used as dynamic prosodic features to recognize the seven emotions [20, 40]. It was found that spectral features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), and OSALPC lack temporal information; to address this, modulation spectral features were used [42]. Temporal information was also exploited while recognizing human emotions from speech and found to be very useful [53]. A combination of spectral and prosodic features was also considered to recognize human emotions more accurately [25]. A critical analysis of databases of female and male speakers has been carried out, and features such as low-MFCC, standard MFCC, and pitch were found to perform better in recognizing emotions from speech [32, 57]; low-MFCC performed better in extracting stable pitch.
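To make Eqs. (2) and (3) concrete, the NumPy sketch below computes the absolute jitter from a sequence of pitch periods and the short-time energy of a windowed frame; the function names and the toy input values are ours, not from the cited works.

```python
import numpy as np

def jitter(pitch_periods):
    """Absolute jitter per Eq. (2): mean absolute difference of consecutive pitch periods."""
    T = np.asarray(pitch_periods, dtype=float)
    return np.mean(np.abs(np.diff(T)))           # (1/(N-1)) * sum_i |T_i - T_{i+1}|

def short_time_energy(x, n, window):
    """Short-time energy E_n per Eq. (3) for an N-point window w, evaluated at sample n."""
    N = len(window)
    frame = x[n - N + 1 : n + 1]                 # samples x(n-N+1) ... x(n)
    return np.sum((frame * window[::-1]) ** 2)   # pairs x(m) with w(n-m)

# Toy usage: made-up pitch periods (in seconds) and a random 1 s signal at 16 kHz
periods = [0.0048, 0.0050, 0.0049, 0.0051]
x = np.random.randn(16000)
print(jitter(periods))
print(short_time_energy(x, n=511, window=np.hanning(512)))
```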

2.3 Classifiers

Many classifiers can be used for the analysis of speech to detect emotions. Earlier, the temporal dynamics of speech signals were captured using a state transition matrix, and the majority of researchers used the HMM as a classifier to recognize emotions from speech. Since speech follows a left-to-right sequence, HMMs with a left-to-right structure were adopted for human emotion recognition [26, 43]. In an utterance, it is difficult to locate emotional cues in the sequential flow; they may be present at the beginning, middle, or end of the utterance. To resolve this problem, the ergodic HMM, in which it is possible to move from any state to any other state, was used. For human emotion recognition from speech using global features, the GMM was a more suitable classifier than the HMM; it is best suited when the data are not normally distributed. A boosted GMM was proposed to recognize human emotions from speech by Tang [48]. Second-order parameters such as the mean and standard deviation were used to capture the distribution in the GMM. Statistical learning theory introduced a new classification and regression technique called the SVM, which gave better results for many pattern recognition applications than other classifiers [44]. Many other approaches have been proposed to recognize human emotions from speech, including the K-Nearest Neighbour (K-NN) approach, a supervised learning method that is the simplest of all speech emotion recognition methods [17]. The choice of feature vector contributes significantly to recognizing emotions from speech.

The Artificial Neural Network (ANN) is another efficient classifier for extracting non-linear features in many pattern recognition applications. Even when the number of training samples was small, the ANN gave better results than the GMM and HMM. The performance of an ANN classifier depends entirely on the number of hidden layers and the number of neurons in each hidden layer. For better human emotion recognition, more than one ANN was also used [10, 45]. RBL was found to be the least used classifier and the MLP the most commonly used classifier for emotion recognition from speech. Generalized Feed Forward Neural Networks (GFNN) gave better results than the MLP in recognizing human emotions [16]. Other forms of neural networks used for human emotion recognition from speech were Auto-Associative Neural Networks (AANN) and 2D neural networks [21]. The ANN gave better results than the SVM classifier [36]. The performance of emotion recognition decreased when the number of emotions was increased [56]. An autoencoder-based unsupervised classifier was proposed by Deng [12, 52]. Motamed [31] proposed modified brain emotional learning for emotion recognition from speech. Some researchers extracted features using deep neural networks and then used different classifiers. The Deep Neural Network (DNN) was introduced for acoustic emotion recognition by Stuhlsatz [47]. The benefit of using deep learning for emotion recognition is the capacity to extract high-level features more accurately. The Deep Belief Network (DBN) was used to capture non-linear features [22]. The CNN was found to be more efficient in extracting high-level features of images [27]. A similar level of accuracy was achieved by using a neural network on information retrieved from spectrograms of the EMODB corpus [39], although only five classes of emotions were recognized. A CNN combined with an LSTM network was able to learn a representation of speech by itself from the raw time-domain signal [49, 55]. The findings of the related work are summarized in Table 2.

Table 2 Summary of related work

The literature shows that many researchers have used the CNN among deep learning approaches for emotion recognition from speech. Deep learning approaches increase accuracy, but the computational cost of the model also increases because of large pre-trained architectures. A few researchers have developed approaches for emotion recognition from speech using spectrograms as input [2, 7, 15, 19].

In this work, an algorithm for emotion recognition using a DCNN architecture is proposed. Strided convolutions along with a pooling layer are used for down-sampling, which reduces the computational cost of the proposed DCNN-based model and increases the recognition accuracy. The algorithm is evaluated on the speech corpora EMODB, EMOVO, and SAVEE. A comprehensive explanation of the proposed model is given in the next section.

3 Proposed algorithm

3.1 Pseudocode of the proposed algorithm

In this section, an algorithm using the DCNN architecture is proposed for emotion recognition from speech. The pseudo-code to implement the proposed algorithm is described in Table 3.

Table 3 Pseudo code to implement the proposed algorithm

The .wav files are read from collected speech corpora EMODB, SAVEE, and EMOVO and converted into spectrograms. The detailed process is depicted in Fig. 1.

Fig. 1
figure 1

Steps of the proposed algorithm

3.2 Details of DCNN architecture

In this algorithm, the audio files from the speech corpora are first converted into spectrograms and normalized to size 224x224x3. Spectrograms hold information that cannot easily be extracted directly from the speech signals, which improves the accuracy of emotion recognition. The DCNN architecture learns high-level features from the RGB spectrograms. It consists of convolutional layers, a max-pooling layer, and fully connected layers. To calculate the probability of each emotion, the fully connected layers are fed to a softmax classifier. Strided convolutions, rather than pooling layers, are mainly used for down-sampling the output features. Through the convolutional layers, optimal features (a combination of spectral and prosodic features such as pitch, MFCC, and energy) are learned automatically from the labelled samples. The details of the DCNN architecture used are shown in Fig. 2.

Fig. 2
figure 2

DCNN architecture details

In this DCNN architecture, the first convolutional layer (Conv-1) has 16 filters of size (7 × 7) applied to the normalized RGB spectrograms of size (224x224x3) with stride (2 × 2) and same padding. The output feature map from Conv-1 goes through max-pooling (2 × 2) with stride (1 × 1). Similarly, the second convolutional layer (Conv-2) has 32 filters of size (5 × 5) with stride (2 × 2). Conv-3 has the same number of filters as Conv-2, of size (3 × 3), with stride (2 × 2) and same padding. Conv-4 and Conv-5 have 64 filters of size (3 × 3) with stride (2 × 2), and Conv-5 has same padding. In the same way, Conv-6 has 128 filters and Conv-7 has 256 filters, each of size (3 × 3) with stride (2 × 2) and same padding. The ReLU activation function is used after each convolutional layer to rectify the output feature map; it is defined as:

$$ f(z)=\begin{cases} 0, & z \le 0 \\ z, & z > 0 \end{cases} \qquad \text{i.e. } f(z)=\max(0,z) $$

To regularize the DCNN model, the rectified outputs are followed by batch normalization with momentum 0.9. The last convolutional layer, Conv-7, is followed by a flatten layer, whose output is fed to the fully connected layers. The first fully connected layer (FC-1) has 1024 neurons and the last fully connected layer (FC-4) has 7 neurons, one per emotion class. FC-1 is followed by dropout with a ratio of 20%. Finally, to calculate the probability of each emotion, the FC-4 layer is fed to a softmax classifier. The main parameters used during training of the model are shown in Table 4, and a hedged code sketch of this architecture is given after the table.

Table 4 Main Parameters and their value
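The following minimal Keras-style sketch is intended only to make the layer-by-layer description above concrete. Filter counts, kernel sizes, strides, the max-pooling layer, the batch normalization momentum, the FC-1/FC-4 widths, and the 20% dropout follow the text; the padding of Conv-2 and Conv-4, the widths of FC-2 and FC-3, and the optimizer are assumptions, since they are not specified here.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters, kernel, stride):
    # Strided convolution + ReLU, followed by batch normalization (momentum 0.9)
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization(momentum=0.9)(x)

inputs = layers.Input(shape=(224, 224, 3))          # normalized RGB spectrogram
x = conv_block(inputs, 16, (7, 7), (2, 2))          # Conv-1
x = layers.MaxPooling2D((2, 2), strides=(1, 1))(x)  # max-pooling 2x2, stride 1
x = conv_block(x, 32, (5, 5), (2, 2))               # Conv-2
x = conv_block(x, 32, (3, 3), (2, 2))               # Conv-3
x = conv_block(x, 64, (3, 3), (2, 2))               # Conv-4
x = conv_block(x, 64, (3, 3), (2, 2))               # Conv-5
x = conv_block(x, 128, (3, 3), (2, 2))              # Conv-6
x = conv_block(x, 256, (3, 3), (2, 2))              # Conv-7
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)        # FC-1
x = layers.Dropout(0.2)(x)                          # 20% dropout after FC-1
x = layers.Dense(256, activation="relu")(x)         # FC-2 (width assumed)
x = layers.Dense(64, activation="relu")(x)          # FC-3 (width assumed)
outputs = layers.Dense(7, activation="softmax")(x)  # FC-4: one unit per emotion

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

Under these padding assumptions, the stride-2 convolutions reduce the 224 × 224 input to a 2 × 2 × 256 feature map before flattening, which keeps the number of parameters and the computational cost low.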

3.3 Converting speech signal into spectrograms

The .wav files are read from the collected speech corpora EMODB, SAVEE, and EMOVO, and all files are converted into spectrograms. The steps to obtain a spectrogram from a .wav file are given in detail in Table 5.

Table 5 Steps to get spectrograms

The parameters used for the STFT are a frame size of 25 ms, overlapFac=0.5 (50% overlap), and window=np.hanning (other window types such as Hamming or Kaiser may also be used). Mathematically, the STFT is calculated as in Eq. (4).

$$ X_m(\omega)=\sum_{n=-\infty}^{\infty} x(n)\,w(n-mR)\,e^{-j\omega n} $$
(4)

where Xm(ω) is the Discrete-Time Fourier Transform (DTFT) of the m-th windowed frame, x(n) is the input signal at time n, w(n) is the window function, and R is the hop size between successive frames. Some sample spectrograms for each emotion in each speech corpus are shown in Fig. 3, where the vertical axis represents frequency and the horizontal axis represents time. A hedged sketch of this conversion is given after Fig. 3.

Fig. 3
figure 3

Spectrograms of each type of emotion for EMODB, SAVEE, and EMOVO corpus
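As a concrete illustration of the steps in Table 5 and Eq. (4), the hedged Python sketch below reads a .wav file, computes a magnitude STFT with a 25 ms Hann window and 50% overlap, and saves the log-magnitude spectrogram as an RGB image resized to 224x224. The example file name, dB floor, and colour map are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt
from PIL import Image

def stft(x, frame_len, overlap_fac=0.5, window_fn=np.hanning):
    """Simple STFT per Eq. (4): windowed frames with hop R, then an FFT per frame."""
    hop = int(frame_len * (1 - overlap_fac))                  # R in Eq. (4)
    w = window_fn(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop : m * hop + frame_len] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)                        # X_m(w) for each frame m

sr, x = wavfile.read("03a01Wa.wav")                           # example EMODB file (assumed path)
if x.ndim > 1:
    x = x[:, 0]                                               # keep one channel if stereo
x = x.astype(float) / (np.abs(x).max() + 1e-9)                # normalize amplitude
frame_len = int(0.025 * sr)                                   # 25 ms frames
S = np.abs(stft(x, frame_len))
S_db = 20 * np.log10(S + 1e-10)                               # log-magnitude in dB

plt.figure(figsize=(3, 3))
plt.axis("off")
plt.imshow(S_db.T, origin="lower", aspect="auto", cmap="viridis")  # frequency vs. time
plt.savefig("spec.png", bbox_inches="tight", pad_inches=0)
plt.close()
Image.open("spec.png").convert("RGB").resize((224, 224)).save("spec_224.png")
```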

In step 3, all the spectrograms are normalized to size 224x224x3. In step 4, the spectrograms are divided into training, testing, and validation sets in the ratio 80%, 10%, and 10% respectively. Then, in step 5, the DCNN model is trained and validated using the proposed algorithm and saved (i.e., frozen) as the stage-1 model. In step 6, the optimal learning rate is found from the stage-1 model using the LR range test. In step 7, the stage-1 model is unfrozen, trained again with the optimal learning rate, and saved as the stage-2 model. A hedged sketch of this two-stage procedure is given below. The details of the experiments and results are discussed in Section 4.
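The sketch below illustrates steps 5-7 under stated assumptions: a simple learning-rate range test sweeps the learning rate over an exponential schedule for a few hundred mini-batches, the rate at the minimum smoothed loss is selected, and the stage-1 model is then trained further at that rate. The sweep bounds, smoothing width, epoch counts, file names, and the train_ds/val_ds dataset objects are assumptions rather than the paper's exact settings.

```python
import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Increase the learning rate exponentially each batch and record the loss."""
    def __init__(self, min_lr=1e-7, max_lr=1e-1, n_steps=300):
        super().__init__()
        self.lrs = np.geomspace(min_lr, max_lr, n_steps)
        self.losses, self.step = [], 0

    def on_train_batch_begin(self, batch, logs=None):
        self.model.optimizer.learning_rate.assign(self.lrs[min(self.step, len(self.lrs) - 1)])

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        self.step += 1
        if self.step >= len(self.lrs):
            self.model.stop_training = True

# Stage-1 model and spectrogram datasets (train_ds, val_ds) are assumed to exist already
model = tf.keras.models.load_model("stage1_model.h5")

finder = LRRangeTest()
model.fit(train_ds, epochs=100, callbacks=[finder], verbose=0)   # stops after n_steps batches

smoothed = np.convolve(finder.losses, np.ones(5) / 5, mode="valid")
optimal_lr = finder.lrs[int(np.argmin(smoothed)) + 2]            # +2 re-centres the 5-point average
print("optimal learning rate:", optimal_lr)

# Stage-2: continue training the (unfrozen) stage-1 model at the optimal rate
model.optimizer.learning_rate.assign(optimal_lr)
model.fit(train_ds, validation_data=val_ds, epochs=30)
model.save("stage2_model.h5")
```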

4 Experimental details and results

In this section, the proposed algorithm is applied to emotion recognition from speech on the EMODB, EMOVO, and SAVEE datasets. All experiments are performed on a standard Windows 10 laptop with an Intel(R) Core™ i5-7200U CPU @ 2.70 GHz, 8 GB RAM, and an x64-based processor. The performance of the proposed algorithm is compared with classifiers using handcrafted features and with recent deep learning approaches. Detailed experimental results are discussed in the following subsections.

4.1 Data sets

In this work, experiments are conducted on three publicly available, labelled, actor-based emotional datasets: EMODB, EMOVO, and SAVEE. These are summarized below:

  • EMODB is a German speech corpus consisting of the seven basic emotions, i.e. anger, boredom, disgust, fear, happiness, sadness, and neutral. It contains a total of 535 samples recorded by five male and five female actors. All 10 utterances are used in the present work (a hedged sketch for indexing this corpus is given after the list).

  • SAVEE is another popular public emotional speech corpus, recorded by four male actors. It consists of seven emotions, i.e. happiness, sadness, anger, fear, disgust, surprise, and neutral. There are a total of 15 utterances and 480 samples.

  • EMOVO is a publicly available Italian emotional speech corpus. It covers seven categories of emotions, namely happiness, sadness, anger, fear, disgust, surprise, and neutral. It consists of a total of 588 samples recorded by six actors: three males and three females.
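The sketch below illustrates how labelled EMODB samples could be indexed, assuming the standard EMODB file-naming convention in which the sixth character of each file name encodes the emotion (e.g. "03a01Wa.wav" is an anger utterance). The directory path is an assumption, and SAVEE and EMOVO use different naming schemes that would need analogous mappings.

```python
from pathlib import Path

# EMODB emotion codes (German initials) mapped to English labels
EMODB_CODES = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
               "F": "happiness", "T": "sadness", "N": "neutral"}

def load_emodb_index(corpus_dir="EMODB/wav"):
    """Return (wav_path, emotion_label) pairs for every EMODB utterance."""
    samples = []
    for wav in sorted(Path(corpus_dir).glob("*.wav")):
        emotion = EMODB_CODES[wav.name[5]]     # 6th character is the emotion code
        samples.append((wav, emotion))
    return samples

index = load_emodb_index()
print(len(index), "samples")                   # expected: 535 for the full corpus
```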

The statistics of these emotional speech corpora are summarised in Table 6.

Table 6 Number of samples in speech corpora used

4.2 Experimental results

The trained models of the proposed approach for all considered speech corpora (EMODB, SAVEE, and EMOVO) are saved as stage-1. The prediction performance of the proposed model at stage-1 is evaluated on the EMODB, SAVEE, and EMOVO datasets to show the efficiency of the proposed algorithm. Tables 7, 8, and 9 show the prediction performance of the proposed model at stage-1 in terms of confusion matrices on the EMODB, SAVEE, and EMOVO datasets respectively.

Table 7 Confusion matrix for emotions prediction on EMODB at stage-1
Table 8 Confusion matrix for emotions prediction on SAVEE at stage-1
Table 9 Confusion matrix for emotions prediction on EMOVO at stage-1

Tables 7, 8, and 9 show that the overall accuracy of the proposed model at stage-1 for the EMODB, SAVEE, and EMOVO speech corpora is 73.08%, 50%, and 57.15% respectively. To improve the accuracy, the LR range test is used and an optimal learning rate is found. The variation of the learning rate with the number of iterations and the variation of the loss with the learning rate are shown in Fig. 4.

Fig. 4
figure 4

Learning rate variation for the number of iterations and variation in a loss concerning learning rate

The learning rate controls how the parameters are updated. In Fig. 4, the X-axis shows the learning rate and the Y-axis shows the loss. From Fig. 4a it can be seen that once the learning rate passes 10−5 the loss starts getting worse, because fine-tuning is being done; based on the learning rate finder, an optimal learning rate of 10−5 was therefore chosen (for EMODB). From Fig. 4b, the loss is minimum at 10−6 and increases thereafter, so the optimal learning rate passed was 10−6. In Fig. 4c there was no range where the loss either decreased rapidly or got worse afterwards, so the optimal learning rate was set to 10−6, at which the loss is minimum. After finding the optimal learning rate, the stage-1 model was unfrozen, trained again with the optimal learning rate, and saved as stage-2. The prediction performance of the proposed model at stage-2 was evaluated on the EMODB, SAVEE, and EMOVO datasets to show the efficiency of the proposed algorithm. Tables 10, 11, and 12 show the prediction performance of the proposed model at stage-2 in terms of confusion matrices on the EMODB, SAVEE, and EMOVO datasets.

Table 10 Confusion matrix for emotions prediction on EMODB at stage-2
Table 11 Confusion matrix for emotions prediction on SAVEE at stage-2
Table 12 Confusion matrix for emotions prediction on EMOVO at stage-2

Tables 10, 11, and 12 show that the overall accuracy of the proposed model at stage-2 for the EMODB, SAVEE, and EMOVO speech corpora is 84.62%, 75%, and 69.65% respectively. This also shows that better accuracy is achieved by the stage-2 model than by the stage-1 model. The summary results of the proposed model at stage-1 and stage-2 (mean + standard deviation) are shown in Table 13.

Table 13 Accuracies at stage-1 and stage-2

The results show that, on average, 16.35% higher accuracy is achieved at stage-2 than at stage-1, with the highest improvement (25%) for the SAVEE corpus. From these results, it can be concluded that selecting an optimal learning rate is far more important than using a random learning rate.

4.3 Performance comparison of the proposed algorithm with state-of-the-art

The performance comparison of the proposed algorithm with other state-of-the-art algorithms is given in Table 14; the proposed algorithm outperforms the existing results on the EMODB, SAVEE, and EMOVO datasets using spectrograms as input.

Table 14 Performance comparison of the proposed algorithm with state-of-the-art methods

From Table 14, it is clear that for the EMODB corpus, the proposed algorithm at stage-2 gives better results (84.62% accuracy) as compared to SVM (71.12%), K-NN (63.74%), and MLP (81.32%) [34], Random Forest (77.18%), AlexNet (81.33%), and Decision Tree (72.82%) [2], 3D CRNN (82.82%) [7], CNN (84.50%) [19], CNN-LSTM (69.72%) [35], GoogLeNet (72.55%) [28], and pQPSO (82.82%) [11].

For the SAVEE corpus, the proposed algorithm at stage-2 gives better results (75% accuracy) as compared to SVM (72.39%), K-NN (53.37%), and MLP (71.17%) [34], CNN (69.00%) [19], DNN (59.70%) [15], CNN-LSTM (72.66%) [35], and pQPSO (60.79%) [11].

For the EMOVO corpus, the proposed algorithm at stage-2 gives better results (69.65% accuracy) as compared to SVM (60.40%), K-NN (39.05%), and MLP (58.58%) [34], and CNN-LSTM (53.24%) [35].

5 Conclusion and future scope

In this work, we have evaluated the proposed algorithm on three labelled speech corpora in different languages: EMODB, EMOVO, and SAVEE. All audio files (.wav) from all speech corpora are converted into spectrograms of size 224x224x3. The training, testing, and validation ratios are 80%, 10%, and 10% respectively. A deep learning model is then used to recognize emotions from the speech corpora. The model is developed in two stages: at stage-1 it is trained with a random learning rate, and at stage-2 with an optimal learning rate. The performance of the proposed algorithm is compared with classifiers using hand-crafted features, as well as between stage-1 and stage-2. The recognition rates are found to be 84.62%, 75%, and 69.65% for the EMODB, SAVEE, and EMOVO datasets respectively, which are better than those of existing studies. The proposed algorithm gives better results for any language and actor because it uses an optimal learning rate rather than a random learning rate. Cross-lingual and cross-corpus emotion recognition may be directions for future research.