1 Introduction

Speech signals are considered the most natural and intuitive means of social communication, an interactive episode that mostly comprises the conveying of different emotional states through conversation. The effective transfer of emotion from speaker to listener is therefore of utmost importance for interpreting the actual intent behind an individual's communication. This is where speech emotion recognition (SER) [5, 29] comes into play, the task being to accurately predict the emotion class to which a speech sample belongs. SER has been instrumental to the success of human-computer interaction (HCI) [19, 54], detection of clinical depression [7, 45] and therapy [39, 57] through verbal interpretation. Thus, it is crucial to develop a dependable and automated SER system for high-confidence emotion classification from speech audio clips.

SER has mostly been tackled using conventional machine learning (ML) approaches [14, 66] whose pipeline comprises the extraction of handcrafted features followed by classification into the respective emotion classes. The key to success for such techniques lies in the choice of the best-suited feature descriptor (such as LPC, MFCC or RASTA features), which requires manual feature engineering and is therefore subject to several rounds of trial-and-error. Furthermore, traditional feature extractors may vary in performance across different speech datasets, so a combination of multiple descriptors [17] is required to obtain a more optimal feature set, which demands more storage and also increases the number of trial combinations.

On the contrary, deep learning methods [29] alleviate the troubles of handcrafted feature extraction by providing a self-learning paradigm that can automatically generate the most informative features describing the raw data. Further, they also provide an end-to-end pipeline [31], removing the need for explicit feature engineering. In the context of SER, deep learning has been leveraged predominantly in two directions – modelling upon sequential raw audio data [26, 49] and using vision-based models on audio mel-spectrograms [31, 44]. While the former approach is computationally expensive, conversion of raw signals to spectrograms maps the temporal audio sequence to a frequency-based spatial spectrum, allowing the use of state-of-the-art vision models [25, 60, 67] for robust classification. However, a limitation of deep learning models is that they require huge amounts of data to achieve desirable performance, which is a bottleneck for the datasets curated for SER tasks [11, 33]. Transfer learning is one solution to this problem, where a model trained on a large corpus (such as ImageNet [16]) is reused on the problem at hand. In this study, we have used the mel-spectrogram transformations of the raw speech signals for emotion detection, employing a customized Wide-ResNet-50-2 [67] network pre-trained on the ImageNet database as the CNN feature extractor backbone.
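For concreteness, the following is a minimal sketch of the raw-signal-to-mel-spectrogram conversion discussed above, using torchaudio. The FFT size, hop length and number of mel bands shown are illustrative assumptions and not values reported in this paper.

```python
# Minimal sketch: raw waveform -> mel-spectrogram "image" for a vision CNN.
# The FFT size, hop length and number of mel bands are illustrative
# assumptions; "speech_clip.wav" is a placeholder path.
import torchaudio

waveform, sample_rate = torchaudio.load("speech_clip.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)               # mixdown to mono

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,        # assumed FFT window size
    hop_length=256,    # assumed hop between frames
    n_mels=128,        # assumed number of mel bands
)(waveform)

# Log-compress and normalise so the spectrogram behaves like a natural image.
mel_db = torchaudio.transforms.AmplitudeToDB()(mel)
mel_img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
# mel_img can now be resized to 224x224 and fed to an ImageNet-pretrained CNN.
```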

Feature selection (FS) [3, 50] aims at selecting the optimal subset from a given feature set with the objective of enhancing discriminatory performance as well as reducing storage requirements, thereby making the pipeline computationally efficient. The number of features extracted by the CNN backbone is quite large and, as such, may contain redundant information that limits the performance of the model, bringing forth the need for FS. Two prominent approaches have been used for FS: one is to rank features based on intrinsic properties among them [37, 53], and the other is to select the optimal subset for a heuristic objective [12, 23]. In this study, we have leveraged a two-tier FS approach that uses both intrinsic property-based feature ranking and a heuristic-based algorithm for dimensionality reduction of the feature space and enhanced classification performance. Specifically, we use a fuzzy entropy and similarity based metric [37] for ranking features, from which a top-q% subset is chosen; this subset is further optimized using the Whale Optimization Algorithm (WOA) [47], a nature-inspired meta-heuristic inspired by the social behaviour of humpback whales. The final feature subset selected by WOA is fed into a k-nearest neighbors (KNN) [8] classifier to make the final predictions.

The main contributions of the present research are as follows:

  1. A bi-stage wrapper-filter hybrid deep feature selection (HDFS) framework has been proposed for dimensionality reduction of the feature space and robust classification of emotions from speech data.

  2. A customized pre-trained Wide-ResNet-50-2 CNN network [67] has been fine-tuned to extract features from mel-spectrogram transformations of raw audio clips, after which a fuzzy entropy and similarity measure based FS strategy [37] has been employed to rank features based on metric scores. A top-q% subset is selected from the ranked features, the value of 'q' being set experimentally.

  3. WOA [47] has been used to further refine the top-q% feature subset and select the most discriminative features for enhanced performance.

  4. The proposed two-tier HDFS approach is evaluated on three publicly available benchmark SER datasets [11, 33, 36] and compared with several existing works in the literature. The proposed approach achieves classification accuracies of 93.64%, 96.25%, and 89.72% on the respective datasets, outperforming many existing state-of-the-art techniques by significant margins.

The rest of the paper is organized as follows: Section 2 reviews some of the recent developments in the relevant areas of speech emotion detection and feature selection; Section 3 elaborately describes the proposed SER framework; Section 4 discusses the results obtained upon evaluation of the proposed pipeline on three publicly available SER datasets along with a comparative study against several state-of-the-art works on SER in literature; and Section 5 concludes the findings of the present study.

2 Related work

SER [1, 5] has been a field of active research for over two decades, primarily due to its applications in healthcare, social robotics and understanding human behaviour. Mostly, researchers have leveraged traditional ML methods [14, 17, 51, 61, 66] for classification of handcrafted features extracted from audio signals. Danisman et al. [14] fused MFCC and energy-based features and trained an ensemble of support vector machine (SVM) classifiers for SER. Albornoz et al. [6] extracted acoustic and prosodic features from speech samples and employed a hierarchical classification scheme for emotion recognition. The authors of [51] proposed a handcrafted feature fusion framework followed by dimensionality reduction of the feature space, while Song et al. [61] introduced feature selection (FS) based transfer subspace for cross-corpus SER. More recent works that leverage handcrafted feature engineering include a hybrid meta-heuristic based FS framework [17]; a quantum-modified swarm-based algorithm [13] for dimensionality reduction of fused handcrafted features; and a clustering-based genetic algorithm (GA) [27] for optimization of raw audio features.

Deep learning-based methods [18, 24, 31, 43, 44], on the other hand, can learn relevant informative features automatically from the raw data, thereby alleviating the explicit need for handcrafted features to be extracted from data samples. Typically, SER has seen two prominent directions of research pertaining to deep learning-based approaches: (1) feeding raw audio samples (or features) into sequence-modelling neural networks such as LSTMs [26, 49], so as to learn temporal audio features; and (2) converting raw audio signals to mel-spectrograms and then passing them into 2D CNNs [18, 24, 44], so as to learn visual spectral features. Mao et al. [44] proposed a bi-stage pipeline comprising unsupervised feature learning from mel-spectrograms followed by disentangling affect-salient features to be fed into an SVM classifier. Mirsamadi et al. [49] used a local attention-guided deep RNN to model long-term contextual dependencies among emotionally salient parts of an audio clip for SER. Mansouri et al. [43] proposed a novel cross-modal enhancement approach using spiking neural networks for unsupervised SER, while the authors of [24] designed a complex architecture leveraging both handcrafted MFCC features and mel-spectrograms of audio signals and encapsulated them together to be fed into a 3D CNN for classification. Among recent works in SER, Ibrahim et al. [26] proposed a novel reservoir computing framework using bi-directional RNNs; Kwon et al. [31] employed a simple self-attention module in 2D CNNs for emotion classification from audio mel-spectrograms; Latif et al. [32] explored adversarial domain adaptation for cross-lingual SER. In a rather unique work, Liu et al. [35] introduced a GA-aided reinforcement learning-based approach for SER, mimicking the emotional processing mechanism of the limbic system in the brain.

The need for FS [3, 69] arises in order to alleviate potential redundancies captured by automated feature extraction pipelines, which limit the performance of a model at making accurate predictions. Broadly speaking, FS methods may be categorised as: (1) filter-based methods [30, 37, 53], which employ various scoring metrics based on intra-feature properties to rank features according to their discriminative (or regressive) importance; (2) wrapper-based approaches [4, 20, 50], which involve training a learning model with a subset of features followed by iterative inclusion/exclusion of features based on a heuristic objective; and (3) embedded methods [42, 63, 70], which combine filter and wrapper methods with an intrinsic FS mechanism built into the learning model. Intuitively, filter methods are computationally cheap as they only explore intra-feature properties without calculating an explicit performance objective, although wrapper methods typically perform better, as they aim at optimizing a heuristic function modelled upon the task-dependent performance metric [69]. Researchers have also introduced hybrid wrapper-wrapper [4, 59] and wrapper-filter [56, 58] methods to obtain performance superior to using a single algorithm.

FS has been leveraged in several tasks, especially in those where there is a possibility of redundancy among features [18, 21, 34, 66]. For SER tasks, various FS approaches have been employed primarily in association with traditional ML-based methods [13, 27, 34, 66]. Liu et al. [34] proposed a filter-based FS method followed by final classification using a decision tree-like classifier. Yildirim et al. [66] leveraged Cuckoo Search [65] and NSGA-II [15] to perform FS on extracted acoustic features for SER. FS has also been used with a deep learning-based approach by Farooq et al. [18], where the authors optimized the feature space obtained by training CNN models on mel-spectrograms derived from raw audio signals for emotion classification.

3 Proposed method

In this section, we elaborately describe each stage of the proposed HDFS framework for emotion detection from speech data. Figure 1 shows the overall workflow of the proposed pipeline, where the sequential stages are:

  • Feature extraction using Wide-ResNet-50-2

  • Filter-based FS using fuzzy entropy and similarity based feature ranking

  • Wrapper-based FS using WOA

  • Classification using a KNN classifier on the optimal feature subset.

Fig. 1: Overall workflow of the proposed hybrid deep feature selection framework for SER from raw audio signals

The stages have been explained in detail in the subsequent sections.

3.1 Deep feature extraction using Wide-ResNet-50-2

The first stage of our proposed pipeline involves feature extraction from speech mel-spectrograms representing the spatial time-frequency distribution of the audio signal, obtained by applying the fast Fourier transform (FFT) to the raw speech samples. In this study, a customized Wide-ResNet-50-2 [67] CNN model has been employed to capture a rich feature representation of the mel-spectrogram images for further engineering. The Wide-ResNet-50-2 architecture is shallower and wider than its predecessors in the ResNet [25] family, thereby reducing the training time and the number of parameters, and increasing computational efficiency without compromising performance. Further, to ensure that information is extracted effectively from the speech mel-spectrograms, a fully-connected (FC) layer is added after flattening the final pooling layer. The FC layer comprises 512 neurons with the Rectified Linear Unit (ReLU) activation function, and this is the layer from which the deep features (dim = 512) have been extracted. The FC layer reduces the loss of information when the feature representation is compressed from 2048 units (i.e. the flattened output of the last pooling layer) down to the classification layer (whose number of neurons equals the number of emotion classes in the dataset, N). The final classification layer uses the softmax activation function, which maps the outputs to a probability distribution (i.e. values between 0 and 1). A schematic diagram of the customized CNN architecture is provided in Fig. 2.
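The following PyTorch sketch illustrates the customisation described above: the stock Wide-ResNet-50-2 head is replaced by a 512-unit FC layer (ReLU) followed by an N-way classification layer, with the 512-d deep features read out from the added FC layer. The layer-by-layer forward pass mirrors torchvision's ResNet implementation; this is a sketch of the described surgery, not the authors' exact code.

```python
# Sketch of the customised Wide-ResNet-50-2: the 2048-d pooled representation
# is compressed to a 512-d FC layer (ReLU), from which deep features are read
# out, followed by an N-way classification head (softmax applied via the loss).
import torch
import torch.nn as nn
from torchvision import models

N_CLASSES = 8  # e.g. RAVDESS; set to the number of emotion classes

backbone = models.wide_resnet50_2(pretrained=True)
backbone.fc = nn.Sequential(
    nn.Linear(2048, 512),        # added FC layer; deep features (dim = 512)
    nn.ReLU(inplace=True),
    nn.Linear(512, N_CLASSES),   # classification layer
)

@torch.no_grad()
def extract_features(model, x):
    """Return the 512-d deep features for a batch of 224x224 spectrograms."""
    model.eval()
    h = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    h = model.layer4(model.layer3(model.layer2(model.layer1(h))))
    h = torch.flatten(model.avgpool(h), 1)   # (B, 2048)
    return model.fc[1](model.fc[0](h))       # (B, 512) after FC + ReLU
```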

Fig. 2: Schematic representation of the customized Wide-ResNet-50-2 CNN architecture used in this study

3.2 Filter-based FS: fuzzy entropy and similarity measures

FS is used to improve classification performance, particularly when the number of training examples is small relative to the number of measured features. Filter methods filter out undesirable features by checking the consistency of the data and eliminating repetitive features, the primary objective being to select an optimal feature subset as input to a learning algorithm. In this article, we have used a filter-based FS technique [37] based on similarity and fuzzy entropy measures.

Fuzziness measures, i.e. measures of impreciseness and vagueness, express how far a given fuzzy set is from a reference set. In this work, we have used the measure of probabilistic entropy (discussed later in this section). To use these fuzzy entropy measures with the similarity classifier, we first define the ideal vectors v_i = (v_i(f_1), ..., v_i(f_t)), which represent class i over t features; the ideal vector is computed using the generalised mean. Then the similarity between each sample x and an ideal vector v is measured by (1), and the class of the sample is decided according to the calculated similarity value. Ideally, if a sample is from class i, its similarity value S(x, v) will be 1, and 0 otherwise.

$$ S = \left(1 - \left|{I_{k,i}}^{p} - {x_{j,i}}^{p}\right|\right)^{1/p} $$
(1)

where, I_{k,i} is the i-th component of the ideal vector for the k-th class, x_{j,i} is the i-th feature of the j-th individual sample, and p is a parameter from the Łukasiewicz structure [38]. We take p = 1 for the standard Łukasiewicz structure.

We have used the following equation to measure the probabilistic entropy,

$$ H_{1}(A) = - \sum\limits_{j=1}^{n}(\mu_{A}(x_{j})\log{\mu_{A}}(x_{j}) + (1 - \mu_{A}(x_{j}))\log{(1 - \mu_{A}(x_{j})})) $$
(2)

where μ_A(x_j) are the fuzzy membership values. We have used this fuzziness measure to evaluate the global deviation from ordinary (crisp) sets; for any crisp set A_0, H_1(A_0) = 0.

Now, for a sample, while calculating the similarity values with the ideal vector, we obtain one similarity value per feature, and this is where fuzzy entropy measures are used to compute the relevance of each feature. The underlying idea is that when calculating the fuzzy entropy values (2), in which μ_A(x_j) is the similarity value, higher similarity values yield lower entropy, while similarity values near 0.5 yield higher entropy. Based on this, we have calculated the entropy values of the similarities between the samples to be classified and the ideal vectors computed initially, finally obtaining t entropy values for the t features of each sample. After calculating the entropy value for each feature, we use the above idea to rank the features by their entropy scores. The rationale is that the feature with the highest entropy contributes least to the separation between classes, whereas more informative features have lower entropy values. In the present study, this algorithm has been used to rank features and select a top-q% subset from the entire set. Experimentally, we have set q = 50%, implying that out of the 512 deep features, the top-ranked 256 features are chosen at this stage. A graphical representation and the pseudo-code of the algorithm described above are provided in Fig. 3 and Algorithm 1 respectively.
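A compact NumPy sketch of this ranking procedure follows, assuming features are min-max scaled to [0, 1] and taking the arithmetic mean as a special case of the generalised mean for the ideal vectors; the exact aggregation in [37] may differ slightly in detail.

```python
# Sketch of the fuzzy entropy/similarity feature ranking of Section 3.2,
# assuming features in [0, 1] and arithmetic-mean ideal vectors (p = 1).
import numpy as np

def fuzzy_entropy_rank(X, y, q=0.5, eps=1e-12):
    """X: (m, t) features in [0,1]; y: (m,) class labels.
    Returns indices of the top-q fraction of features, most informative first."""
    classes = np.unique(y)
    # Ideal vector per class: mean feature vector of that class's samples.
    ideals = np.stack([X[y == c].mean(axis=0) for c in classes])  # (l, t)
    entropy = np.zeros(X.shape[1])
    for ideal in ideals:
        # Lukasiewicz similarity with p = 1, per sample and feature (eq. (1)).
        S = np.clip(1.0 - np.abs(ideal[None, :] - X), eps, 1.0 - eps)
        # De Luca-Termini style entropy of the similarity values (eq. (2)).
        entropy += -(S * np.log(S) + (1.0 - S) * np.log(1.0 - S)).sum(axis=0)
    ranked = np.argsort(entropy)             # low entropy = more informative
    return ranked[: int(q * X.shape[1])]     # top-q% subset (q = 0.5: 256 of 512)
```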

Fig. 3: Flowchart for the fuzzy entropy and similarity based filter method for FS used in the present work

Algorithm 1: Pseudo-code of the fuzzy entropy and similarity based filter method for FS, where m is the number of samples, t is the number of features and l is the number of classes

3.3 Wrapper-based FS: whale optimization algorithm

WOA [47] is a meta-heuristic optimization algorithm that uses a spiral to simulate the bubble-net attacking mechanism of humpback whales, which dive into the water, form a bubble-net spiral around their prey and swim up towards the surface. The three stages of WOA are: (1) encircling the prey, (2) the bubble-net attacking phase (exploitation) and (3) searching for prey (exploration).

The first stage involves identifying the best search agent using a fitness function and updating the positions of the other search agents towards it; the current best solution is assumed to be close to the global optimum. In the bubble-net attacking stage, depending on the values of certain parameters, there is an equal (50%) chance of choosing between the shrinking encircling and spiral updating approaches. The "searching for prey" stage is an exploration stage, in which a search agent searches for prey randomly instead of using the spiral updating position.

Equations (3) and (4) are used to update the position of a search agent denoted by the position vector X^i, where A and C are coefficient vectors in the t-th iteration and X*_t is the position vector of the best search agent found so far.

$$ {D} = |{C}\cdot X^{*}_{t} - {X^{i}_{t}}| $$
(3)
$$ X^{i}_{t+1} = X^{*}_{t} - {A}\cdot {D} $$
(4)

The coefficient vectors A and C are calculated as follows:

$$ {A} = 2{a}\cdot {r} - {a} $$
(5)
$$ {C} = 2 \cdot {r} $$
(6)

where, a decreases linearly from 2 to 0 over the iterations in both the exploration and exploitation phases, and r is a random variable in [0,1]. After each iteration, the position of the best search agent (X*) is updated whenever a better solution is found, and any search agent that goes beyond the search space is repositioned. Declaring r as a random variable allows a search agent to reach any position in the vicinity of the best search agent, implementing the encircling of the prey.

The shrinking encircling approach in the exploitation phase is implemented by decreasing the value of a in (5). For the spiral updating position approach, we first calculate the distance (7) between the i-th whale and the prey (the best solution obtained up to the current iteration). Then, a spiral is created between the position of the whale and the prey to imitate the movement of the humpback whale. The equations are as follows,

$$ {D^{\prime}} = |X^{*}_{t} - {X^{i}_{t}}| $$
(7)
$$ X^{i}_{t+1} = {D^{\prime}}\cdot e^{bl}\cdot \cos (2{\pi}l) + X^{\ast}_{t} $$
(8)

where, b is a constant defining the shape of the logarithmic spiral, l is a random number ∈ [− 1,1], and (⋅) denotes element-by-element multiplication.

The approach is decided based on the value of p, a random number ∈ [0,1]. We assume a probability of 50% of choosing between the two exploitation approaches mentioned above to update the position of a whale during optimization. The mathematical model is as follows,

$$ X^{i}_{t+1} = \begin{cases} X^{\ast}_{t} - {A}\cdot {D} & \text{if } p<0.5\\ {D^{\prime}}\cdot e^{bl}\cdot \cos (2{\pi}l) + X^{\ast}_{t} & \text{if } p\geq 0.5 \end{cases} $$
(9)

The algorithm also involves an exploration phase that allows an agent to perform a randomised search of the search space, emphasizing random search based on the relative positions of the agents. The random values of the coefficient vector A decide between encircling and exploration: values with |A| < 1 drive the search agent towards the reference whale, whereas values with |A| > 1 force the search agent to move far away from the reference whale, which is what we use for exploration. For the randomised search, we modify (3) and (4) to use a random agent instead of the best search agent,

$$ {D} = |{C} \cdot {X^{j}_{t}} - {X^{i}_{t}}| $$
(10)
$$ X^{i}_{t+1} = {X^{j}_{t}} - {A}\cdot {D} $$
(11)

where, \({X^{j}_{t}}\) is a random position vector chosen from the current population, C is a coefficient vector (as in (6)), and t is the current iteration.

We have modified WOA to map its continuous search space to a binary one, in accordance with our feature selection problem. To this end, we have used the S-shaped sigmoid transfer function shown in the following equation:

$$ \mathcal{S}(z) = \frac{1}{1 + e^{-z}} $$
(12)

The position X^{i}_{t+1} of the i-th whale is then updated according to (13).

$$ X^{i}_{t+1} = \begin{cases} 1, & rand() < \mathcal{S}({X^{i}_{t}})\\ 0, & \text{otherwise} \end{cases} $$
(13)

where, rand() yields a random number ∈ [0,1].

FS is a multi-objective paradigm with two objectives: (1) maximization of classification accuracy and (2) minimization of the number of features. Evidently, the two objectives oppose each other. To resolve this, we combine them into a heuristic fitness function \(\mathcal{F}(\cdot)\) using a weight factor α, thereby reducing the problem to a single-objective optimization task. The fitness function (to be maximized) is shown in (14).

$$ \mathcal{F} = \alpha \times \eta + (1 - \alpha ) \times {\Delta} $$
(14)

where η is the classification accuracy of the feature subset (obtained using the KNN classifier), Δ is the feature reduction given by (15), and α ∈ [0,1] signifies the relative weight of classification accuracy and feature reduction. For the present work, we have considered α = 0.99 for all experimentation.

$$ {\Delta} = \left( \frac{|F| - |f|}{|F|} \right) $$
(15)

where, |F| is the original feature dimension and |f| is the number of features selected. In our work, |F| is the cardinality of the top-q% feature subset (= 256) obtained by filter-guided FS in the previous step.
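A condensed sketch of the binary WOA described above is given below: continuous position updates (3)–(11) are binarised through the sigmoid transfer function (12)–(13), and the fitness (14)–(15) is scored by a 5-NN classifier. The population size (40), iteration count (100) and α = 0.99 follow the values reported here; the spiral constant b and the cross-validation split inside the fitness are illustrative assumptions.

```python
# Condensed sketch of binary WOA for FS; not the authors' exact implementation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99):
    """Eq. (14): weighted sum of KNN accuracy (eta) and reduction (Delta)."""
    if mask.sum() == 0:
        return 0.0
    eta = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    delta = (mask.size - mask.sum()) / mask.size          # eq. (15)
    return alpha * eta + (1.0 - alpha) * delta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                       # eq. (12)

def binary_woa(X, y, pop=40, iters=100, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    pos = rng.uniform(-1, 1, (pop, dim))                  # continuous positions
    def binarise(v):                                      # eq. (13)
        return (rng.random(dim) < sigmoid(v)).astype(int)
    best_pos = pos[0].copy()
    best_mask = binarise(best_pos)
    best_fit = fitness(best_mask, X, y)
    for t in range(iters):
        a = 2.0 * (1.0 - t / iters)                       # a decreases 2 -> 0
        for i in range(pop):
            A = 2.0 * a * rng.random(dim) - a             # eq. (5)
            C = 2.0 * rng.random(dim)                     # eq. (6)
            if rng.random() < 0.5:                        # p < 0.5 in eq. (9)
                if np.abs(A).mean() < 1.0:                # exploitation
                    D = np.abs(C * best_pos - pos[i])     # eq. (3)
                    pos[i] = best_pos - A * D             # eq. (4)
                else:                                     # exploration
                    xr = pos[rng.integers(pop)]           # random agent
                    D = np.abs(C * xr - pos[i])           # eq. (10)
                    pos[i] = xr - A * D                   # eq. (11)
            else:                                         # spiral update
                l = rng.uniform(-1, 1)
                Dp = np.abs(best_pos - pos[i])            # eq. (7)
                pos[i] = Dp * np.exp(b * l) * np.cos(2 * np.pi * l) + best_pos
            mask = binarise(pos[i])
            f = fitness(mask, X, y)
            if f > best_fit:                              # keep best agent found
                best_fit, best_pos, best_mask = f, pos[i].copy(), mask
    return best_mask                                      # 0/1 mask over features
```

On the top-q% subset of Section 3.2, `selected = np.flatnonzero(binary_woa(X_top, y))` would then yield the indices of the final WOA-selected features.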

The main advantages of WOA include a wide exploration of the search space, owing to its randomised parameters and constraints that diversify the agents of the population, and a search for prey guided by both the best search agent and random agents of the population, which further aids exploration. The flowchart for WOA is shown in Fig. 4.

Fig. 4: Schematic diagram of WOA used in the present work. Here, X^i represents the i-th member of the population, X* is the best search agent found, a, p and l are random variables, and A and C are coefficient vectors

3.4 Classification using KNN classifier

KNN [8] is a simple non-parametric classification algorithm that relies on distance computation as the sole classification criterion. In this algorithm, the training samples of the dataset (i.e. the feature subset selected by WOA) are treated as data points in the embedding space, divided into several distinct classes. Among n data points {p_i : 1 ≤ i ≤ n} in the embedding space, to predict the class of a new point p_j, the distances between p_j and all other points are computed and its k nearest neighbors are identified. Finally, a majority vote among these k points decides the class assigned to p_j. Following the recommendations of [40, 41], in this study we have set k = 5 for all experimentation.
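As a minimal illustration with scikit-learn (the distance metric is assumed to be Euclidean, which the paper does not state explicitly; the data below are random placeholders standing in for the WOA-selected deep features):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 256-d WOA-selected deep features, 8 emotion classes.
X_train = np.random.rand(100, 256)
y_train = np.random.randint(0, 8, 100)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, following [40, 41]
knn.fit(X_train, y_train)
y_pred = knn.predict(np.random.rand(5, 256))  # majority vote of 5 neighbours
```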

3.5 Analysis of computational complexity

Although the fuzzy entropy and similarity measure based FS algorithm is known to be computationally cheap, the general notion in the literature regarding wrapper-based FS algorithms is that they are computationally more expensive than other feature selection algorithms. It is therefore necessary to be aware of the computational complexities of the algorithms used in the proposed method. In this section, we discuss the computational complexity of both implemented algorithms, for feature selection and feature optimisation respectively.

3.5.1 Pasi-Luukka’s algorithm:

In Pasi-Luukka's algorithm, the parameters considered for the calculation of computational complexity are: the number of classes (l), the number of samples (m) and the number of features (t). In a single iteration of the algorithm, the similarity values of the features are calculated using (1), by iterating over the entire feature set and over the classes for each feature. Therefore, the complexity of one iteration (iter) is,

$$ O(iter) = O(t)\cdot O(l) $$
(16)

After the calculation of similarity values for each sample, the entropy values are calculated using (2) by iterating through the feature set. Thus, the computational complexity of the algorithm is,

$$ O(P) = O(m)\cdot O(iter) + O(t) $$
(17)

that is,

$$ O(P) = O(m)\cdot O(t)\cdot O(l) + O(t) = O(mtl) $$
(18)

3.5.2 WOA:

In WOA, three parameters are considered for the calculation of the computational complexity of the algorithm: the number of iterations (t), the population size (i), and the number of features (j). In one iteration, the algorithm iterates over the entire population, and each agent is updated through an iteration over the feature set. Therefore, the complexity of one iteration (iter) is,

$$ O(iter) = O(i)\cdot O(j) $$
(19)

Therefore, over t iterations in total, the computational complexity of WOA is,

$$ O(W) = O(t)\cdot O(iter) $$
(20)

or,

$$ O(W) = O(t)\cdot O(i)\cdot O(j) = O(tij) $$
(21)

4 Results and discussion

In this section, we describe the SER datasets used to evaluate the proposed framework, as well as report the results obtained on each dataset, using a 5-fold cross-validation scheme. We also compare the proposed approach with some existing methods in literature, to justify the superiority and reliability of the proposed method.

4.1 Datasets used

The proposed framework has been robustly evaluated on three publicly available SER datasets using a five-fold cross-validation scheme:

  1. RAVDESS database by Livingstone et al. [36]

  2. URDU speech database by Latif et al. [33]

  3. EmoDB database by Burkhardt et al. [11]

For each of the aforementioned datasets, a train-validation split of 80%-20% has been used in each fold to evaluate the proposed pipeline. A brief description of the datasets is provided in the following subsections.

4.1.1 RAVDESS database

The RAVDESS [36] database is a multi-modal emotion recognition dataset comprising facial expressions as well as audio samples for speech and song. The dataset was recorded with a North American accent by 24 professional actors (12 female and 12 male) expressing eight emotions: calm, happy, sad, angry, fearful, surprised, neutral, and disgust. Overall, RAVDESS contains 1440 speech files for SER, which have been used in the present study. The class-wise distribution of the dataset is given in Table 1.

Table 1 Class-wise distribution of samples in each of the publicly available SER datasets used in this study

4.1.2 URDU speech database

The Urdu-language speech emotion database (URDU) was originally proposed in the context of cross-lingual SER [33] and comprises 400 audio samples covering four basic emotions: angry, happy, neutral and sad. The corpus was created from YouTube video clips of discussions and situations in talk shows. It is a class-balanced dataset, whose distribution is given in Table 1.

4.1.3 EmoDB database

The Berlin database of emotional speech (EmoDB) [11] is a German SER database produced by the Technical University of Berlin. It was recorded by 10 actors (5 female and 5 male, between the ages of 20 and 35) and covers seven emotion classes: anger, boredom, neutral, disgust, fear, happiness, and sadness, with a total of 535 audio samples. The class-wise distribution of the dataset is given in Table 1.

4.2 Implementation details

The proposed framework has been implemented in Python3 using the PyTorch Toolbox [52] on a 12GB Nvidia K80 GPU. The CNN feature extractor was trained for 100 epochs using the stochastic gradient descent (SGD) [62] optimizer with a learning rate of 0.0005 and a momentum of 0.9. All mel-spectrogram images were resized to 224 × 224 before being passed into the CNN backbone, with the training batch size set to 4.
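A sketch of this training configuration in PyTorch follows; the model head and the dataset below are placeholders standing in for the customised Wide-ResNet-50-2 of Section 3.1 and the resized mel-spectrogram images, so that the reported hyperparameters can be seen in context.

```python
# Training setup with the reported hyperparameters: SGD, lr = 0.0005,
# momentum = 0.9, batch size 4, 100 epochs, 224x224 inputs.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Placeholder model and data (stand-ins for the customised backbone and
# the real mel-spectrogram dataset).
model = models.wide_resnet50_2(pretrained=True)
model.fc = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(inplace=True),
                         nn.Linear(512, 8))
train_dataset = TensorDataset(torch.randn(16, 3, 224, 224),
                              torch.randint(0, 8, (16,)))

criterion = nn.CrossEntropyLoss()                       # softmax via the loss
optimizer = optim.SGD(model.parameters(), lr=0.0005, momentum=0.9)
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

for epoch in range(100):                                # 100 epochs, as reported
    for spectrograms, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(spectrograms), labels)
        loss.backward()
        optimizer.step()
```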

4.3 Evaluation metrics

Four commonly used evaluation measures have been considered in this study to evaluate the proposed framework on the aforementioned multi-class SER datasets, namely Accuracy, Precision, Recall and F1-Score. The formulae of these metrics are given in (22), (23), (24) and (25), all of which are derived from a confusion matrix C.

$$ Accuracy = \frac{{\sum}_{i=1}^{N}C_{ii}}{{\sum}_{i=1}^{N}{\sum}_{j=1}^{N}C_{ij}} $$
(22)
$$ Precision_{i} = \frac{C_{ii}}{{\sum}_{j=1}^{N}C_{ji}} $$
(23)
$$ Recall_{i} = \frac{C_{ii}}{{\sum}_{j=1}^{N}C_{ij}} $$
(24)
$$ F1-Score_{i} = \frac{2}{\frac{1}{Precision_{i}}+\frac{1}{Recall_{i}}} $$
(25)

Here, N denotes the number of emotion classes in a given dataset.
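The following self-contained sketch computes (22)–(25) directly from a confusion matrix C, with rows taken as true classes and columns as predictions, the convention implied by the formulae above.

```python
# Metrics (22)-(25) from a confusion matrix C (rows: true, columns: predicted).
import numpy as np

def metrics_from_confusion(C):
    C = np.asarray(C, dtype=float)
    accuracy = np.trace(C) / C.sum()            # eq. (22)
    precision = np.diag(C) / C.sum(axis=0)      # eq. (23), per class (column sums)
    recall = np.diag(C) / C.sum(axis=1)         # eq. (24), per class (row sums)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall) # eq. (25), harmonic mean
    return accuracy, precision, recall, f1

# Example on a toy 3-class confusion matrix:
acc, prec, rec, f1 = metrics_from_confusion([[8, 1, 1], [0, 9, 1], [1, 0, 9]])
```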

4.4 Results

A five-fold cross-validation scheme has been employed for robust and consistent evaluation of the proposed framework on each of the publicly available SER datasets described in Section 4.1. The results obtained on each dataset are discussed in the following sections.

4.4.1 Results on RAVDESS dataset

Table 2 tabulates the results for each evaluation metric, along with their mean and standard deviation (SD) values, obtained by the proposed framework on each fold of the 5-fold cross-validation scheme. Further, the accuracy scores and the number of features selected at each stage of our multi-stage pipeline are shown in Table 3. It can be seen that the method obtains consistent results across the five folds of cross-validation, demonstrating the robustness of the approach.

Table 2 Results obtained by the proposed method on each fold of the 5-fold cross-validation scheme on the RAVDESS dataset
Table 3 Accuracies and number of features obtained at each stage of the proposed framework on five folds of cross-validation on the RAVDESS dataset

For the feature extraction phase, the training curves of the Wide-ResNet-50-2 [67] CNN backbone are shown in Fig. 5, exhibiting moderately satisfactory convergence behaviour.

Fig. 5: Learning curves obtained during CNN training for feature extraction on the RAVDESS dataset

Further, the confusion matrices obtained by the proposed method on each fold of the cross-validation scheme on RAVDESS are shown in Fig. 6, describing the model's performance on each class of the dataset. It can be observed that the proposed pipeline consistently achieves high true positive counts on most of the emotion classes across the folds of cross-validation, which justifies the robust performance of the model. For a more concise depiction of class-wise performance, Fig. 7 provides the class-wise metric scores averaged over the five folds of cross-validation, where the model achieves perfect accuracy for the emotion "Surprised", in addition to perfect precision scores for two emotion classes ("Fearful" and "Surprised"). The performance of the proposed framework on such a challenging dataset highlights its potential for robust detection of emotion from human speech.

Fig. 6: Confusion matrices obtained by the proposed method using the 5-fold cross-validation procedure on the RAVDESS dataset

Fig. 7: Class-wise results obtained by the proposed method on the RAVDESS dataset

4.4.2 Results on URDU speech dataset

The performance of the proposed method on the URDU dataset across each fold is tabulated in Table 4, while the accuracy values obtained at each stage of the pipeline are shown in Table 5. Furthermore, the learning curves obtained during CNN backbone training, shown in Fig. 8, exhibit commendable convergence behaviour without any signs of overfitting, to which small datasets are highly prone. A high classification accuracy of 96.25% is obtained by the proposed approach, justifying its effectiveness and suitability for SER.

Table 4 Results obtained by the proposed method on each fold of 5-fold cross-validation scheme on URDU speech dataset
Table 5 Accuracies and number of features obtained at each stage of the proposed framework on five folds of cross-validation on the URDU speech dataset
Fig. 8: Learning curves obtained during CNN training for feature extraction on the URDU speech dataset

The class-wise performance of the proposed approach on the URDU speech corpus is illustrated by the confusion matrices in Fig. 9 and the class-wise metric scores in Fig. 10. Exemplary performance of the proposed framework across the emotion classes can be inferred from these figures, including high true positive counts and perfect accuracy scores for two emotion classes.

Fig. 9: Confusion matrices obtained by the proposed method using the 5-fold cross-validation procedure on the URDU speech dataset

Fig. 10: Class-wise results obtained by the proposed method on the URDU speech dataset

4.4.3 Results on EmoDB dataset

The evaluation metric scores, along with their mean and SD values over the five folds of cross-validation, obtained by the proposed study on EmoDB [11] are tabulated in Table 6. Our approach achieves a promising mean classification accuracy of 93.64% along with a precision of 94.45%. The accuracies as well as the number of features obtained after each stage of the pipeline are listed in Table 7. The learning curves obtained during training of the Wide-ResNet-50-2 [67] CNN feature extractor are shown in Fig. 11; the convergence behaviour is quite stable, showing very little tendency to overfit the dataset. These results testify to the reliable performance of our method.

Table 6 Results obtained by the proposed method on each fold of 5-fold cross-validation scheme on EmoDB dataset
Table 7 Accuracies and number of features obtained at each stage of the proposed framework on five folds of cross-validation on EmoDB dataset
Fig. 11: Learning curves obtained during CNN training for feature extraction on the EmoDB dataset

Finally, the confusion matrices depicting the class-wise performance of the proposed pipeline on each fold of the cross-validation procedure on EmoDB are shown in Fig. 12, while Fig. 13 depicts the average metric values for each emotion class. Our approach achieves perfect classification accuracy on two emotion classes ("Angry" and "Sadness"), as well as perfect precision scores on two other emotions ("Disgust" and "Happiness"). These results further validate the robustness of the proposed approach for SER tasks.

Fig. 12: Confusion matrices obtained by the proposed method using the 5-fold cross-validation procedure on the EmoDB dataset

Fig. 13: Class-wise results obtained by the proposed method on the EmoDB dataset

4.5 Comparison with state-of-the-art SER methods

Table 8 compares the proposed method against several works in the literature pertaining to SER on the publicly available datasets used in this study, based on the evaluation measures described in Section 4.3. It can be observed that the proposed framework outperforms all of the existing works on the RAVDESS [36] and URDU [33] datasets by significant margins. On the EmoDB [11] dataset, the proposed pipeline achieves performance equivalent to the state-of-the-art, outperforming several existing works in the literature. It may also be noted that several previous works report accuracy as the sole metric, which gives no insight into false positives (or true negatives) and hence is insufficient and unreliable for a multi-class classification task such as SER. Our results, in contrast, justify that the proposed study is a highly effective approach for detecting emotions from speech signals.

Table 8 Comparison of the proposed method with existing works in literature

The FS algorithm used in this study, WOA [47], has been compared to the following state-of-the-art meta-heuristics in the literature:

  1. Particle Swarm Optimization (PSO) by Kennedy et al. [28]

  2. Arithmetic Optimization Algorithm (AOA) by Abualigah et al. [2]

  3. Grey Wolf Optimizer (GWO) by Mirjalili et al. [48]

  4. Gravitational Search Algorithm (GSA) by Rashedi et al. [55]

  5. Cuckoo Search Algorithm (CSA) by Yang et al. [65]

  6. Sine Cosine Algorithm (SCA) by Mirjalili et al. [46]

For each optimization algorithm, a population size of 40 is chosen and the maximum number of iterations is set to 100. The average values over 10 independent runs on a given fold of a dataset, aggregated over the five folds of cross-validation, are reported in Table 9. On each of the datasets, WOA achieves the highest classification accuracy and also shows competitive performance in terms of feature space reduction. On RAVDESS [36], SCA and GWO rank second in terms of classification accuracy, while SCA selects the fewest features. On the URDU dataset [33], GWO, GSA and SCA equal WOA in terms of accuracy, although the latter selects the smallest feature subset. On the EmoDB dataset [11], GSA and SCA rank second in terms of accuracy, with the former showing the greatest feature reduction. Note that all of these experiments have been conducted on the top-q% feature subset (q = 50%) obtained by the filter-based FS method [37] described in Section 3.

Table 9 Comparison of accuracies and number of features selected (FS) among state-of-the-art optimization algorithms on each of the SER datasets

A graphical view of the aforementioned comparison, in terms of classification accuracy and number of features selected, is depicted in Figs. 14 and 15 respectively. The plots show that WOA performs robustly in optimising both of these objectives on each of the SER datasets, justifying its use in our proposed study.

Fig. 14: Comparison of accuracies obtained by state-of-the-art optimization algorithms on each of the SER datasets. The results reported are aggregated over 10 independent runs of each algorithm, averaged over five folds of cross-validation

Fig. 15: Comparison of the number of features selected by state-of-the-art optimization algorithms on each of the SER datasets. The results reported are aggregated over 10 independent runs of each algorithm, averaged over five folds of cross-validation

5 Conclusion and future work

The present study proposes a computationally efficient two-tier hybrid wrapper-filter FS pipeline for dimensionality reduction of the feature representation extracted by a CNN backbone from mel-spectrograms of speech audio clips, and for robust classification of speech signals into their respective emotion classes. Our approach alleviates the cumbersome process of handcrafted feature extraction, providing an end-to-end framework for SER. The proposed method has been evaluated on three publicly available standard speech datasets, where it has been found to outperform several existing works in the literature, justifying the reliability of the framework. The hybrid dimensionality reduction approach used in this study is a new addition to the FS literature and can thus be used as a stand-alone algorithm for traditional ML-based approaches requiring feature engineering. Further, the proposed pipeline is domain-independent and hence may be applied off-the-shelf to different facets of image classification, such as disease detection [12] or human action recognition [23], to name a few.

To further the research on SER, we intend to explore other speech datasets available in the public domain for greater generalization and reliability in real-world applications. We may also try other approaches to meta-heuristic FS, such as initialization using a clustering-guided population [22], hybrids of wrapper-based approaches [17] and local search-embedded optimization algorithms [12]. Last but not least, we also intend to explore the temporal features of raw audio signals using deep learning-based architectures to investigate emotion classification more deeply.