1 Introduction

Beyond conveying textual information, speech inherently carries emotional nuances such as joy and sadness. Even when a speaker articulates the same text, different emotional cues can significantly alter the intended meaning, so recognizing the emotional content of speech is paramount (Liu et al., 2022). Unlike computers, humans perceive emotions readily; the purpose of emotion recognition is to enable computers to emulate this perceptual ability, and speech, as a direct channel for expressing emotion, plays an essential role in natural human-computer interaction. Speech emotion recognition has significant application value in many scenarios, for example, detecting the emotional state of drivers to issue timely reminders in cases of hyperactivity or fatigue (Requardt et al., 2020). It also applies in education (Tanko et al., 2022), helping teachers assess students’ emotional states through their speech. Moreover, in medicine and healthcare, speech emotion recognition can be employed to discern whether a patient is experiencing depression or anxiety (Hansen et al., 2021).

To achieve intelligent and natural human-computer interaction (Pandey et al., 2022), extensive research has been conducted on emotion recognition in speech across different languages (Hu et al., 2021). Chattopadhyay et al. (2023) used linear prediction coding and linear predictive cepstral coefficients extracted from speech signals as features and applied a clustering-based equilibrium optimizer and an atom search optimization method for emotion recognition, finding that the method achieved high classification accuracy. Guo et al. (2022) introduced a dynamic relative phase method for feature extraction and employed a single-channel model and an attention-combined multi-channel model to learn acoustic features, yielding favorable results in emotion recognition experiments. Qiao et al. (2022) designed a Trumpet-6 method for identifying emotions in Chinese speech, achieving 95.7% accuracy in experiments on CASIA. Ocquaye et al. (2021) utilized a triple attentive CNN with an asymmetric architecture for cross-language speech emotion recognition; experiments on English, German, and Italian datasets demonstrated the method’s higher prediction accuracy.

Given the widespread use of English (Hyder, 2021), research on emotion recognition in English speech holds significant practical value across various domains. Combinations of convolutional neural networks (CNNs) with long short-term memory (LSTM) or gated recurrent units (GRUs) have found extensive application in speech emotion recognition, such as CNN + LSTM (Ahmed et al., 2023), CNN-bidirectional gated recurrent unit (BiGRU) (Hu et al., 2022), and CNN-n-GRU (Nfissi et al., 2022), but there is still room for improvement in their performance. Building on the combination of CNN and GRU, this paper proposes a recognition method that enhances English speech emotion recognition through feature fusion and structural improvements. By fusing features such as energy and the Mel-frequency cepstral coefficient (MFCC), richer emotional information is obtained. Furthermore, by incorporating skip connections, a Skip-BiGRU model is designed and combined with a CNN, resulting in the CNN-Skip-BiGRU method for English speech emotion recognition. Its effectiveness was validated through experiments on IEMOCAP, providing a novel approach for differentiating emotions in English speech. This article offers directions for further research on integrating CNNs with LSTM or GRU in speech emotion recognition and demonstrates the importance of feature fusion, providing a reference for extracting speech emotion features.

2 Feature fusion in English speech

2.1 Preprocessing of speech signals

English speech signals must first be preprocessed to provide higher-quality speech for subsequent recognition. First, pre-emphasis is performed on the original speech signal \( x\left(n\right)\) using a first-order digital filter to flatten the spectrum. The formula is written as:

$$ y\left(n\right)=x\left(n\right)-\mu x\left(n-1\right)$$
(1)

where \( \mu \) is the pre-emphasis factor, generally 0.97.

Based on the short-time smoothness characteristic of the signal, the original signal must also be segmented into shorter frames, generally using overlapping framing. After framing, the key waveforms are highlighted by applying a window frame by frame; in this paper, the Hamming window is used (Tan et al., 2020):

$$ w\left(n\right)=\left\{\begin{array}{c}0.54-0.46\text{cos}\left[2\pi n/\left(N-1\right)\right],0\le n\le N-1\\ 0,\text{otherwise}\end{array}\right.$$
(2)

The signal after adding the window is:

$$ y\left(n\right)=\sum _{m=-N/2+1}^{N/2}x\left(m\right)w\left(n-m\right)$$
(3)

where \( n\) is the time index and \( N\) is the frame length.
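As an illustration of this preprocessing pipeline, the following sketch performs pre-emphasis, overlapping framing, and Hamming windowing with NumPy; the frame length and hop size are illustrative assumptions (25 ms and 10 ms at 16 kHz), not values reported in this paper.

```python
import numpy as np

def preprocess(x, mu=0.97, frame_len=400, hop=160):
    """Pre-emphasis, overlapping framing, and Hamming windowing (a sketch)."""
    # Pre-emphasis: y(n) = x(n) - mu * x(n - 1), Eq. (1)
    y = np.append(x[0], x[1:] - mu * x[:-1])

    # Overlapping framing based on the short-time smoothness of the signal
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx]

    # Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), Eq. (2)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window
```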

2.2 Emotion feature extraction

Since a single feature captures emotional information only partially, the following features are fused to characterize the emotional content of the signal more comprehensively.

2.2.1 Energy

In general, a speaker’s voice is louder when happy or angry and softer when sad or calm, and the energy level reflects this difference. The energy feature of the signal can be derived from the short-time average amplitude. The corresponding formula is:

$$ {E}_{n}=\sum _{m=0}^{N-1}\left|{x}_{n}\left(m\right)\right|$$
(4)

where \( N\) is the frame length and \( {x}_{n}\left(m\right)\) denotes the \( n\)-th frame of the signal.

2.2.2 Short-time zero-crossing rate

The short-time zero-crossing rate denotes how often the signal waveform crosses the zero level within a frame (Zhu et al., 2021). The number of zero crossings varies with the emotional content of the signal. The formula is:

$$ {Z}_{n}=\frac{1}{2}\sum _{m=0}^{N-1}\left|sgn\left[{x}_{n}\left(m\right)\right]-sgn\left[{x}_{n}\left(m-1\right)\right]\right|$$
(5)
$$ sgn\left[x\right]=\left\{\begin{array}{c}1,x\ge 0\\ -1,x<0\end{array}\right.$$
(6)

2.2.3 Mel-frequency cepstral coefficient

The MFCC is a widely used acoustic feature that models the auditory characteristics of the human ear (Wibawa & Darmawan, 2021) and helps distinguish different emotional content. The relationship between the Mel frequency and the true frequency is:

$$ \text{Mel}\left(f\right)=2595\text{lg}\left(1+f/700\right)$$
(7)

The extraction process of MFCC is as follows.

① Fast Fourier transform (FFT) is performed on the signal:

$$ {X}_{j}\left(k\right)=\sum _{n=0}^{N-1}{x}_{j}\left(n\right){e}^{-\frac{j2\pi nk}{N}}, 0\le k\le K$$

② The signal passes through a set of Mel filters:

$$ {h}_{i}\left(k\right)=\left\{\begin{array}{c}0,k<f\left(i-1\right)\\ \frac{k-f\left(i-1\right)}{f\left(i\right)-f\left(i-1\right)},f\left(i-1\right)\le k\le f\left(i\right)\\ \frac{f\left(i+1\right)-k}{f\left(i+1\right)-f\left(i\right)},f\left(i\right)<k<f\left(i+1\right)\\ 0,k>f\left(i+1\right)\end{array}\right.$$

③ The energy output of each Mel filter is calculated:

$$ m\left(i\right)=\sum _{k=0}^{N-1}{\left|{X}_{j}\left(k\right)\right|}^{2}{h}_{i}\left(k\right), 1\le i\le M$$

④ The logarithm of each filter output is taken, and a discrete cosine transform (DCT) is applied to obtain the MFCC:

$$ MFCC\left(l\right)=\sqrt{\frac{2}{M}}\sum _{i=1}^{M}\text{lg}\,m\left(i\right)\text{cos}\left[\left(i-1/2\right)\frac{l\pi }{M}\right]$$

In the above equations, \( {x}_{j}\left(n\right)\) is the \( j\)-th frame of the English speech signal, \( K\) is the FFT length (512 in this paper), \( M\) is the number of Mel filters (24 in this paper), \( f\left(i\right)\) is the center frequency of the \( i\)-th filter, and \( l\) indexes the resulting cepstral coefficients.
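For reference, MFCC and ΔMFCC features of this kind can be obtained with librosa, as sketched below; the file name is hypothetical, and the hop setting is an illustrative assumption matching a 512-point FFT at 16 kHz.

```python
import numpy as np
import librosa

# Load a hypothetical utterance and apply pre-emphasis as in Eq. (1)
signal, sr = librosa.load("utterance.wav", sr=16000)
signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# 24-dimensional MFCC from a 24-filter Mel filter bank and a 512-point FFT
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=24, n_fft=512,
                            hop_length=160, n_mels=24)   # shape: (24, n_frames)

# 24-dimensional first-order difference dynamic feature (ΔMFCC)
delta_mfcc = librosa.feature.delta(mfcc, order=1)
```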

2.2.4 Statistical characteristic

To capture the emotional characteristics of the signal globally, this paper calculates the following statistical features:

① mean value: \( {f}_{mean}=\frac{1}{n}\sum _{i=1}^{n}{f}_{i}\);

② variance: \( {f}_{var}=\frac{1}{n}\sum _{i=1}^{n}{\left({f}_{i}-{f}_{mean}\right)}^{2}\);

③ maximum value: \( {f}_{max}=max\left({f}_{1},{f}_{2},\cdots,{f}_{n}\right)\);

④ minimum value: \( {f}_{min}=min\left({f}_{1},{f}_{2},\cdots,{f}_{n}\right)\);

⑤ median: \( {f}_{median}=\frac{{f}_{max}+{f}_{min}}{2}\), computed here as the midpoint of the maximum and minimum values.

In the subsequent emotion recognition process, this paper uses the following features: energy, short-time zero-crossing rate, the 24-dimensional MFCC, and the 24-dimensional first-order difference dynamic feature ∆MFCC. These features are fused, resulting in 50 dimensions per frame. Subsequently, the five statistical features are computed for each of the 50 dimensions, ultimately yielding a 250-dimensional feature vector.
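Putting the pieces together, the following sketch fuses the frame-level features and aggregates them with the five statistics; the shapes assumed here follow the earlier sketches (energy and zero-crossing rate as vectors over frames, MFCC and ΔMFCC as 24 × n_frames matrices).

```python
import numpy as np

def fuse_features(energy, zcr, mfcc, delta_mfcc):
    """Fuse frame-level features and aggregate them into a 250-dimensional vector.

    Energy (1) + ZCR (1) + MFCC (24) + delta-MFCC (24) = 50 dimensions per frame;
    five statistics per dimension then give 50 * 5 = 250 values.
    """
    frames = np.column_stack([energy, zcr, mfcc.T, delta_mfcc.T])  # (n_frames, 50)
    stats = [
        frames.mean(axis=0),                                # mean value
        frames.var(axis=0),                                 # variance
        frames.max(axis=0),                                 # maximum value
        frames.min(axis=0),                                 # minimum value
        0.5 * (frames.max(axis=0) + frames.min(axis=0)),    # "median" as defined above
    ]
    return np.concatenate(stats)                            # 250-dimensional feature
```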

3 Emotion recognition methods based on feature fusion

3.1 Convolutional neural network

CNNs are widely used to recognize images, text, speech, etc. (Ponmalar & Dhanakoti, 2022). In this paper, a CNN is used to learn higher-level representations from the fused feature obtained in the previous section. A CNN has three main types of layers; its structure is shown in Fig. 1.

Fig. 1 The structure of CNN

(1) Convolutional layer: it can autonomously learn features from the input English speech. For an input feature matrix \( I\) and an \( m\times n\) convolution kernel \( K\), the convolution operation can be written as:

$$ {O}_{i,j}=f\left(\sum _{m}\sum _{n}{I}_{i+m,j+n}{K}_{m,n}+{w}_{b}\right)$$
(8)

where \( {I}_{i+m,j+n}\) is the element of \( I\) at position \( \left(i+m,j+n\right)\), \( {K}_{m,n}\) is the element of \( K\) at position \( \left(m,n\right)\), \( f\) is the activation function, and \( {w}_{b}\) is the bias. A direct implementation of Eq. (8) is sketched after this list.

(2) Pooling layer: it downsamples the output of the convolutional layer to capture the most salient features (Li et al., 2019). Pooling operations can be divided into two types.

① Maximum pooling: Select the highest value from the local area as the output to obtain the most significant features.

② Mean pooling: Take the average value of the local area as the output to obtain an averaged representation of the overall features.

(3) Fully connected layer: it synthesizes the features extracted by the preceding layers to perform recognition and classification.

3.2 Bidirectional gated recurrent unit (BiGRU)

The CNN can extract richer emotional features from the fused feature, but it is insufficient at capturing temporal context; therefore, a BiGRU model is used on top of the CNN to learn temporal context in English speech signals. The BiGRU model uses both a forward and a backward GRU, enabling concurrent processing of past and future information (Niu et al., 2022). Compared with LSTM, the GRU has a more streamlined architecture and trains more effectively (Chen et al., 2021); its structure is presented in Fig. 2.

Fig. 2 The structure of GRU

According to Fig. 2, the update process of the reset gate can be written as:

$$ {r}_{t}=\sigma \left({W}_{r}{x}_{t}+{U}_{r}{h}_{t-1}+{b}_{r}\right)$$
(9)

The update process of the update gate can be written as:

$$ {z}_{t}=\sigma \left({W}_{z}{x}_{t}+{U}_{z}{h}_{t-1}+{b}_{z}\right)$$
(10)

The output of GRU can be written as:

$$ {\stackrel{\sim}{h}}_{t}=\text{tanh}\left[{W}_{h}{x}_{t}+{U}_{h}\left({r}_{t}\odot {h}_{t-1}\right)+{b}_{h}\right]$$
(11)
$$ {h}_{t}=\left(1-{z}_{t}\right)\odot {h}_{t-1}+{z}_{t}\odot {\stackrel{\sim}{h}}_{t}$$
(12)

where \( {x}_{t}\) is the input, \( {h}_{t-1}\) is the previous hidden state, \( W\) and \( U\) are the weight matrices, \( b\) is the bias, \( \sigma \) is the sigmoid function, \( {\stackrel{\sim}{h}}_{t}\) is the candidate output state, and \( {h}_{t}\) is the final GRU output state.

The hidden-layer outputs of the forward GRU and the backward GRU at time \( t\) are:

$$ \overrightarrow{{h}_{t}}=GRU\left({x}_{t},\overrightarrow{{h}_{t-1}}\right)$$
(13)
$$ \overleftarrow{{h}_{t}}=GRU\left({x}_{t},\overleftarrow{{h}_{t-1}}\right)$$
(14)

They are concatenated to obtain the output of the BiGRU at time \( t\):

$$ {h}_{t}=\left[\overrightarrow{{h}_{t}},\overleftarrow{{h}_{t}}\right]$$
(15)
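A plain NumPy sketch of one GRU update following Eqs. (9)-(12) is given below; the weight matrices and biases are assumed to be given. A BiGRU simply runs one such GRU forward and one backward over the sequence and concatenates their hidden states as in Eq. (15), which is what Keras's `Bidirectional(GRU(...))` wrapper does.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h):
    """One GRU update following Eqs. (9)-(12)."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate, Eq. (9)
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate, Eq. (10)
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state, Eq. (11)
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # new hidden state, Eq. (12)
```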

3.3 Emotion recognition method based on CNN-Skip-BiGRU

To improve the effectiveness of the BiGRU model in learning long sequences, this paper modifies its structure by adding skip connections, resulting in the CNN-Skip-BiGRU model for English speech emotion recognition. Its structure is as follows; an illustrative implementation sketch is given after Fig. 3.

(1) Input layer: the fused 250-dimensional English speech feature.

(2) CNN layer: it contains two convolutional layers and two pooling layers, all with a kernel size of 1 × 2 and a stride of 1.

(3) Skip-BiGRU layer (Fig. 3): it contains three BiGRU layers connected by skip connections, and the output of each layer is:

$$ {O}_{1}={GRU}_{1}\left({x}_{t}\right)$$
(16)
$$ {O}_{2}={GRU}_{2}\left({O}_{1}\right)$$
(17)
$$ {O}_{3}={GRU}_{3}\left({O}_{1}+{O}_{2}\right)$$
(18)
(4) Dense layer: the features obtained from the above learning are reshaped to a size of 256 × 64.

(5) Flatten layer: it flattens the multi-dimensional feature into one dimension.

(6) Softmax layer: it performs the recognition of the different English speech emotions, and the final output can be written as:

$$ Y=softmax\left(flatten\left(Dense\left({O}_{1}+{O}_{2}+{O}_{3}\right)\right)\right)$$
(19)
Fig. 3 Skip-BiGRU layer structure
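A Keras sketch of the CNN-Skip-BiGRU architecture is given below. It follows the layer description above, treating the 250-dimensional fused feature as a length-250 sequence with one channel; the filter counts and GRU units are illustrative assumptions, not values reported in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(250, 1))  # fused 250-dimensional feature as a 1-D sequence

# CNN layer: two convolutions and two poolings with kernel/pool size 2 and stride 1
x = layers.Conv1D(64, kernel_size=2, strides=1, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(pool_size=2, strides=1, padding="same")(x)
x = layers.Conv1D(64, kernel_size=2, strides=1, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(pool_size=2, strides=1, padding="same")(x)

# Skip-BiGRU layer: three BiGRU layers with skip connections, Eqs. (16)-(18)
o1 = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
o2 = layers.Bidirectional(layers.GRU(64, return_sequences=True))(o1)
o3 = layers.Bidirectional(layers.GRU(64, return_sequences=True))(layers.Add()([o1, o2]))

# Dense, flatten, and softmax layers, Eq. (19)
x = layers.Dense(64)(layers.Add()([o1, o2, o3]))
x = layers.Flatten()(x)
outputs = layers.Dense(4, activation="softmax")(x)  # four emotion classes

model = keras.Model(inputs, outputs)
```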

4 Results and analysis

4.1 Experimental setup

The experiments were conducted on the Ubuntu 16.04 operating system with Python 3.6 as the programming language. The Keras platform was used to implement the emotion recognition approach, and five-fold cross-validation was employed. The Adam optimizer was used with a batch size of 64, 150 epochs, and an initial learning rate of \( {10}^{-4}\). The IEMOCAP dataset was used (Ayadi & Lachiri, 2022), an English corpus recorded by ten professional performers and sampled at 16 kHz. The dataset contains approximately 12 h of audio and covers various emotion types such as happiness and anger. Due to category imbalance in the dataset, four emotions were selected for the experiments; their distributions are shown in Table 1.

Table 1 IEMOCAP dataset

The effectiveness of the emotion recognition method was assessed using the following two indicators; a sketch of how they can be computed within the cross-validation protocol follows their definitions.

(1) Unweighted accuracy rate (UAR): it refers to the accuracy over the entire test set:

$$ UAR=\frac{{N}_{acc}}{N}$$
(20)

where \( N\) is the total number of samples and \( {N}_{acc}\) is the number of correctly recognized samples.

(2) Weighted accuracy rate (WAR): it represents the mean recognition accuracy across emotions:

$$ WAR=\frac{1}{{n}_{class}}\sum _{i=1}^{{n}_{class}}\frac{{N}_{i}^{acc}}{{N}_{i}}$$
(21)

where \( {n}_{class}\) is the number of emotion categories, \( {N}_{i}^{acc}\) is the number of correctly recognized samples of the \( i\)-th emotion, and \( {N}_{i}\) is the total number of samples of the \( i\)-th emotion.
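The two indicators can be computed within the five-fold cross-validation protocol described above, as sketched below; `build_model()` is a hypothetical helper assumed to return the CNN-Skip-BiGRU model sketched in Section 3.3, `X` is assumed to have shape (n_samples, 250, 1), and `y` to hold integer emotion labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def uar_war(y_true, y_pred, n_class=4):
    """UAR and WAR as defined in Eqs. (20)-(21)."""
    uar = np.mean(y_pred == y_true)                                   # overall accuracy
    war = np.mean([np.mean(y_pred[y_true == c] == c)
                   for c in range(n_class)])                          # mean per-class accuracy
    return uar, war

def cross_validate(X, y, n_class=4):
    scores = []
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        model = build_model()  # hypothetical helper returning the CNN-Skip-BiGRU model
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        model.fit(X[tr], keras.utils.to_categorical(y[tr], n_class),
                  batch_size=64, epochs=150, verbose=0)
        y_pred = model.predict(X[te]).argmax(axis=1)
        scores.append(uar_war(y[te], y_pred, n_class))
    return np.mean(scores, axis=0)  # average UAR and WAR over the five folds
```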

4.2 Results analysis

First, the impact of different features on English speech emotion recognition was analyzed; the findings are shown in Table 2.

From Table 2, the CNN-Skip-BiGRU model achieved a UAR of 63.24% and a WAR of 63.36% on the IEMOCAP dataset when using only MFCC-related features as inputs. This indicated that the model was less effective in recognizing different emotion types under these conditions. When fusing energy and short-time zero-crossing rate with MFCC-related features to obtain the 50-dimensional fused feature as input, the CNN-Skip-BiGRU model showed a UAR of 67.45% and a WAR of 66.97%, marking an improvement of 4.21% and 3.61% compared to using only MFCC, respectively. Finally, when using the obtained 250-dimensional feature as input, the UAR was 70.31%, and the WAR was 70.88%, showing improvements of 7.07% and 7.52%, respectively, compared to using the MFCC features alone. These results demonstrated the effectiveness of the fused features selected for English speech emotion identification.

Table 2 The impact of different features on English speech emotion recognition

The performance of the Skip-BiGRU structure was evaluated with the 250-dimensional feature as input (Fig. 4).

Fig. 4 Emotion recognition effect of different structural models

From Fig. 4, the combination of CNN and LSTM only achieved a UAR of 59.87% and a WAR of 59.64% on the IEMOCAP dataset, indicating that the model was weak in distinguishing between different emotion types. The UAR of CNN-GRU was 61.26%, and the WAR was 61.37%, which were improved by 1.39% and 1.73% respectively compared to the CNN-LSTM model, and this demonstrated the superiority of GRU over LSTM. Subsequently, the UAR of the CNN-BiGRU model was 65.33%, and the WAR was 65.26%, marking a further improvement of 4.07% and 3.89% compared to the CNN-GRU model. This result demonstrated the effectiveness of using BiGRU in feature learning. Finally, the CNN-Skip-BiGRU model attained a UAR of 70.31% and a WAR of 70.88%, surpassing the CNN-BiGRU model by 4.98% and 5.62%, respectively. This result indicated that using skip connections to optimize BiGRU significantly improved the effectiveness of the model in recognizing emotions in English speech.

The CNN-Skip-BiGRU model was compared with other emotion recognition methods:

(1) 3D-CRNNs (Peng et al., 2018): a 3D convolutional recurrent neural network-based method;

(2) attention-BLSTM-FCNs + DNN (Zhao et al., 2018): a method that combines an attention-based bidirectional LSTM with fully convolutional networks to learn speech features and then uses a DNN to predict emotions;

(3) ABLSTM-AFCN (Zhao et al., 2019): an approach that integrates an attention-combined bidirectional LSTM with an attention-combined fully convolutional network.

Refer to Table 3 for the comparative results.

From Table 3, it is observed that most current emotion recognition methods were based on deep learning, introducing additional networks or attention mechanisms on top of the CNN-RNN combination to enhance emotion recognition. However, these attempts did not yield satisfactory results. The attention-BLSTM-FCNs + DNN model was the least effective among the compared methods, achieving 60.10% UAR and 59.70% WAR. The ABLSTM-AFCN model performed relatively well with 67.00% UAR and 68.10% WAR, while the CNN-Skip-BiGRU model attained 70.31% UAR and 70.88% WAR, outperforming the other methods. This result indicated the reliability of the proposed method for English speech emotion identification, demonstrating its ability to distinguish between various emotions effectively.

Table 3 Comparisons with other methods

5 Conclusion

This study proposes a CNN-Skip-BiGRU model for identifying emotions in English speech, which takes fused features as input. Experiments on the IEMOCAP dataset revealed that the fused feature offered an effective characterization of the emotional information in English speech, leading to improved emotion recognition performance. Compared with LSTM and similar models, the Skip-BiGRU structure effectively enhanced the model’s ability to distinguish between different emotions, and the proposed method outperformed other emotion recognition methods. These findings suggest that the designed CNN-Skip-BiGRU method holds promise for practical application in real-world English speech emotion recognition.

However, this study also has some limitations. For instance, it only focuses on the recognition of four emotions in the IEMOCAP dataset and fails to further validate the practicality of the method on a wider range of languages and more diverse datasets. Therefore, future work should take into account the issue of dataset balance, verify the reliability of the proposed approach in recognizing a broader range of emotions, and conduct experiments on more extensive datasets.