1 Introduction

Speech emotion recognition represents a critical area of research. Emotions serve as essential elements in human communication and expression, exerting considerable influence on an individual’s behavior and psychological well-being [18]. Understanding and accurately identifying emotions conveyed through speech are therefore of great significance. Emotion recognition has a wide range of applications, including human–computer interaction, the diagnosis and treatment of mental illness, and social media analysis.

Long-time speech emotion recognition studies the temporal changes of emotions over paragraph-length periods of speech. This is a highly challenging problem, as traditional algorithms mainly focus on the emotional state within a single sentence and lose the emotional information carried by the surrounding context over time [10].

The extraction, compensation, and optimization of speech emotion features constitute pivotal challenges in the field of speech emotion recognition. Addressing these challenges necessitates a holistic approach integrating signal processing and machine learning techniques. Speech emotion feature extraction involves the derivation of features from speech signals that effectively capture emotional characteristics. Key speech emotion features commonly utilized include pitch, formant, speaking rate, energy, and intonation [6, 17].

Feature extraction based on Mel frequency cepstral coefficients (MFCCs) converts speech signals into Mel-frequency spectrograms and extracts the cepstral coefficients as features. Intonation-based methods extract emotion features by analyzing changes in the pitch of the speech signal. Short-time energy analysis extracts emotion features by analyzing changes in the acoustic energy of the speech signal. Duration-based methods extract emotion features by analyzing the durations of different syllables in the speech signal.
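As a minimal illustration of these extraction methods, the sketch below computes MFCC, pitch, and short-time energy contours plus the utterance duration; it assumes the librosa toolkit and a hypothetical input file, since no specific implementation is prescribed here.

```python
# Sketch of frame-level emotional feature extraction (librosa assumed;
# the input file name is a hypothetical placeholder).
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical input file

# Spectral features: 13 Mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Prosodic features: fundamental frequency (pitch) contour via the YIN estimator.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)

# Short-time energy: root-mean-square value of each analysis frame.
energy = librosa.feature.rms(y=y)[0]

# Duration-related cue: total utterance length in seconds.
duration = librosa.get_duration(y=y, sr=sr)

print(mfcc.shape, f0.shape, energy.shape, duration)
```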

Apart from feature extraction, the compensation and optimization of speech emotion features are also important. In practical applications especially, differences in speech characteristics among individuals must be compensated and optimized. There has been relatively little research on the compensation of emotional features, and previous work mainly focused on normalizing features to compensate for the impact of individual differences [23].

The optimization of features can play a crucial role in enhancing the quality of speech emotion features and significantly improving the accuracy of speech emotion recognition systems. Common feature selection criteria include the correlation coefficient and mutual information. In addition, stochastic optimization methods such as genetic algorithms (GA) and swarm intelligence are valuable for selecting emotional features [19]. Such algorithms randomly sample feature combinations to obtain a new feature subset; if the new subset performs better than the current solution, it is accepted, and this step is repeated until a specified stopping criterion is reached. By repeatedly sampling, evaluating, and updating the solution, stochastic optimization can search for a near-optimal feature subset and effectively solve the feature selection problem, especially when the number of candidate emotional features is large.
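The accept-if-better loop described above can be sketched as follows; the dataset, classifier, and iteration budget are illustrative placeholders and do not correspond to the feature sets used later in the paper.

```python
# Minimal sketch of stochastic feature-subset search (illustrative data and
# classifier; the actual emotional feature sets and models differ).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)
rng = np.random.default_rng(0)

def evaluate(mask):
    """Fitness of a binary feature mask: cross-validated accuracy."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), X[:, mask], y, cv=3).mean()

best_mask = rng.random(X.shape[1]) < 0.5           # random initial subset
best_score = evaluate(best_mask)

for _ in range(100):                               # stopping criterion: iteration budget
    candidate = best_mask.copy()
    flip = rng.integers(X.shape[1])                # perturb one randomly chosen feature
    candidate[flip] = ~candidate[flip]
    score = evaluate(candidate)
    if score > best_score:                         # accept only improving subsets
        best_mask, best_score = candidate, score

print(best_score, best_mask.sum(), "features selected")
```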

Researchers have investigated the practical applications of emotion recognition and have conducted comparative analyses of various modeling algorithms in this domain. Albu et al. [3] explored various neural network approaches for children’s emotion recognition, focusing on speech signals and facial images. Their work highlights the influence of the number of centers chosen by the k-means algorithm on the recognition performance of radial basis function (RBF) networks and extreme learning machines (ELM). The findings emphasize the importance of child affective modeling alongside cognitive modeling for intelligent software applications and technology-enhanced learning, with implications for personalized tutoring.

In the context of long-time speech, the challenge in emotion recognition addressed by the paper lies in optimizing speech emotion feature extraction and compensation. This involves dealing with complex emotions expressed in long-time speech and capturing the dynamic relationships between emotions over time.

In this paper, we study the extraction, compensation, and optimization of emotional features in speech and their application in long-time speech emotion recognition. To improve emotion recognition in continuous long-time speech utterances, we propose a novel framework involving four main steps, as shown in Fig. 1: emotional feature extraction, sample adaptation using a neural network, feature optimization using the genetic algorithm (GA) and shuffled frog-leaping algorithm (SFLA), and a novel accentuation-based fusion method for long-time speech emotion recognition.

Fig. 1 Long-time speech emotion recognition

In the first step, a series of speech features are extracted for emotion recognition at the frame level. These features include prosodic and spectral features, such as pitch, intensity, formant, and MFCCs.

In the second step, a neural network is used for sample adaptation to enhance the versatility of the features. The network maps the sample features to a set of latent variables that capture the underlying emotional content of the speech. The features of a new speech sample are adapted through this mapping, improving the adaptability of the model.

In the third step, GA and SFLA are used to optimize the combination of features to improve the accuracy of emotion recognition. GA and SFLA are metaheuristic optimization algorithms that can search for the optimal solution in a large solution space.

In the fourth step, we propose a novel accentuation-based fusion algorithm to combine context information and accentuation information for long-time speech emotion recognition. Each utterance is modeled independently, and the utterances are jointly recognized to obtain the final emotion category.

We introduce a unique method that represents emotions as nodes and transitions as edges in a graph, utilizing the transformer model’s self-attention mechanism to predict future states based on previous states and encoder outputs. The incorporation of accentuation weights enhances emotion recognition accuracy, offering a promising solution for understanding and recognizing emotions in extended speech contexts.

The proposed method was tested on continuous long-time speech utterances, and the results showed that it effectively improved the accuracy of emotion recognition. By optimizing the combination of features and enhancing the versatility of features through sample adaptation, the proposed method was able to improve the performance of emotion recognition in continuous long-time speech utterances.

Overall, we studied a comprehensive approach to improving emotion recognition in continuous long-time speech utterances by addressing the challenges of feature compensation, optimization, and accentuation-based results fusion. The method can be applied in a variety of contexts, such as emotion recognition in therapy, education, and customer service.

1.1 Related Work

Many existing emotion recognition studies have considered the problems of feature selection and feature analysis [2, 14, 15, 21, 25]. Alex et al. [4] studied feature selection at the utterance level and syllable level for emotion recognition. Abdelhamid et al. [1] studied stochastic optimization for speech emotion models. Zhang et al. [28] proposed to study practical speech emotion using stochastic optimization algorithms; in their study, basic features were analyzed and feature combinations were used to model speech emotions, and they further studied emotion types of practical value. Xu et al. [26] studied a large set of speech emotional features, and the results were promising; although the novelty of the graph learning-based classifier was high, the generalization ability of the algorithms needs further discussion. Gat et al. [11] studied speaker feature normalization for emotion models. Huang et al. [13] studied feature normalization using speaker-sensitive features and proposed a general framework to improve emotion recognition performance. Saad et al. [24] studied emotion recognition across different languages and databases; the transfer of models is a very interesting topic, but the optimal set of emotional features needs further study. Cowen et al. [8] studied a large number of emotions with feature analysis. Hajarolasvadi et al. [12] studied convolutional neural networks and their application to spectrograms, proposing to model emotions using visual features of the spectrogram. Fahad et al. [9] studied a speaker-adaptive SER system and presented a promising solution to the issue of speaker variability, enhancing accuracy in emotion recognition tasks; feature-space maximum likelihood linear regression was used, and emotion-specific epoch-based features were explored.

Other researchers have focused on emotion models. Zou et al. [29] studied cognition-related emotions and their detection from speech; they proposed recording oral reports during math exercises and analyzed the emotional features. Although the results were promising, the relation between acoustic features and cognitive states still needs investigation. Anvarjon et al. [5] studied a lightweight detector based on a novel CNN architecture; although the results were promising, a greater variety of backbone networks could be explored together with novel emotional features. Jin et al. [16] studied support vector machines in a semi-supervised framework, applying self-training SVMs to speech emotion recognition and testing them on publicly available databases; although the results were promising, more recent algorithms need further discussion. Choudhary et al. [7] studied emotion recognition using deep neural networks; the representation learning requires a large number of training samples, and although the results were promising, the generalization of the model depends on the dataset. Oaten et al. [20] studied a special type of emotion, disgust, and its practical value in health.

2 Methodology

2.1 Feature Compensation

The problem of uneven sample distribution is an important challenge in feature engineering. In emotion modeling, we usually need a large number of samples so that the learned statistical distribution is consistent with the real situation. However, sample datasets are often limited, and the samples are unevenly distributed with respect to factors such as age, accent, and personality. As a direct consequence, the learned model is not highly versatile, and its performance on new samples is difficult to guarantee. It is therefore necessary to study feature compensation methods that mitigate this sample imbalance.

A deep neural network is used to normalize and compensate the features so that the imbalanced distribution in the samples is alleviated, as shown in Fig. 2. The compensated samples can then be analyzed and modeled more reliably. The input of the network is the one-dimensional emotional feature vector of each sample, and the output, supervised by reference targets, is the compensated one-dimensional emotional feature vector.

Feature compensation algorithms play a particularly important role in noisy scenarios. In the noise scenario, various types of noise are added to the test samples, which corrupts the original emotional features to some extent, whereas the training samples were collected in a relatively quiet environment with comparatively clean signals. The sample mismatch caused by noise is a major bottleneck in the practical application of speech emotion recognition. Deep-network feature compensation can learn the feature mapping from clean speech to noisy speech, or vice versa, and thereby reduce the influence of noise to a certain extent. It should be pointed out that existing emotional databases are rarely recorded in an absolutely quiet environment, so the training data inevitably contain various kinds of noise. In practical applications, the test environment may therefore be of higher quality than the training data; for example, close-talk recording on a smartphone can yield high-quality speech. The noise mismatch is thus not necessarily unidirectional: the sound quality of the test environment can be better than that of the training corpus, and simple noise-reduction methods cannot replace the feature compensation algorithm.
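A minimal sketch of such a compensation network is given below, assuming PyTorch and a generic fully connected denoising autoencoder; the layer sizes, noise level, and training settings are illustrative rather than the exact configuration used in the experiments.

```python
# Sketch of a denoising autoencoder for feature compensation (PyTorch assumed;
# layer sizes, noise level, and optimizer settings are illustrative).
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, feat_dim=120, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, feat_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# x_clean: reference feature vectors; corrupted inputs simulate noisy/mismatched samples.
x_clean = torch.randn(256, 120)                    # placeholder feature batch
x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)

for epoch in range(20):
    optimizer.zero_grad()
    x_hat = model(x_noisy)                         # compensated feature vectors
    loss = criterion(x_hat, x_clean)               # supervise with the clean targets
    loss.backward()
    optimizer.step()
```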

Fig. 2 Denoising autoencoder for feature compensation

2.2 Feature Optimization

When it comes to the selection and combination of features, it is difficult to exhaust the full range of possibilities. Therefore, in this paper, stochastic optimization algorithms are used to optimize the sample features. We compare the GA and SFLA algorithms and optimize the combination of features to improve the emotion recognition results at the frame level.

GA-Based Feature Selection

The fitness function is defined in Eq. 1:

$$\begin{aligned} fitness_i = f(\textbf{features}_i) \end{aligned}$$
(1)

where \(fitness_i\) is the fitness of the i-th individual, \(\textbf{features}_i\) is the binary vector representing the presence or absence of each feature for the i-th individual, and f() is the performance metric of the model trained on the selected features.

Individuals are selected for crossover and mutation using roulette wheel selection: \(p_i = \frac{fitness_i}{\sum _{j=1}^{pop\_size} fitness_j}\), where \(p_i\) is the probability of selecting the i-th individual, \(fitness_i\) is the fitness of the i-th individual, and \(\sum _{j=1}^{pop\_size} fitness_j\) is the sum of fitness values in the population.

The crossover of two individuals using a one-point crossover operator, as shown in Eq. 2, is:

$$\begin{aligned} \textbf{c}_1, \textbf{c}_2 = \text {crossover}(\textbf{p}_1, \textbf{p}_2, p_c) \end{aligned}$$
(2)

where \(\textbf{p}_1\) and \(\textbf{p}_2\) are the parent binary vectors, \(\textbf{c}_1\) and \(\textbf{c}_2\) are the offspring binary vectors, and \(p_c\) is the crossover probability.

The mutation of an individual using a bitwise mutation operator is shown in Eq. 3.

$$\begin{aligned} \textbf{m} = \text {mutation}(\textbf{p}, p_m) \end{aligned}$$
(3)

where \(\textbf{p}\) is the parent binary vector, \(\textbf{m}\) is the mutated binary vector, and \(p_m\) is the mutation probability.

The replacement of the worst individuals in the population with the best individuals from the new population, Eq. 4, is:

$$\begin{aligned} \begin{aligned} \text {if}~~~&fitness_{\text {new}} > fitness_{\text {worst}} \\&\text {then}~~~\text {replace}(worst_{individual}, new_{individual}) \end{aligned} \end{aligned}$$
(4)

where \(fitness_{\text {new}}\) is the fitness of the new individual, \(fitness_{\text {worst}}\) is the fitness of the worst individual in the population, and replace() is a function that replaces the worst individual with the new individual.
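For concreteness, Eqs. 1–4 can be combined into a compact GA loop as sketched below; the dataset, classifier, and hyperparameters are illustrative placeholders rather than the settings reported in Table 15.

```python
# Compact sketch of GA-based feature selection following Eqs. 1-4
# (illustrative data, classifier, and hyperparameters).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)
rng = np.random.default_rng(0)
pop_size, n_feat, p_c, p_m = 20, X.shape[1], 0.5, 0.01

def fitness(mask):                                  # Eq. 1: model performance on the subset
    return cross_val_score(SVC(), X[:, mask], y, cv=3).mean() if mask.any() else 0.0

pop = rng.random((pop_size, n_feat)) < 0.5          # random binary population
fit = np.array([fitness(ind) for ind in pop])

for _ in range(10):                                 # generations
    probs = fit / fit.sum()                         # roulette wheel selection
    parents = pop[rng.choice(pop_size, size=2, p=probs)]
    c1, c2 = parents.copy()
    if rng.random() < p_c:                          # Eq. 2: one-point crossover
        point = rng.integers(1, n_feat)
        c1[point:], c2[point:] = parents[1, point:], parents[0, point:]
    for child in (c1, c2):                          # Eq. 3: bitwise mutation
        flips = rng.random(n_feat) < p_m
        child[flips] = ~child[flips]
        f_new = fitness(child)
        worst = fit.argmin()
        if f_new > fit[worst]:                      # Eq. 4: replace the worst individual
            pop[worst], fit[worst] = child, f_new

print("best fitness:", fit.max())
```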

SFLA-Based Feature Selection

The fitness function is defined the same as the one used for GA (Eq. 1), in which \(fitness_i\) is now the fitness of the i-th frog, and \(\textbf{features}_i\) is the binary vector representing the presence or absence of each feature for the i-th frog.

The creation of a new frog population by combining the best two frogs from each memeplex is shown in Eq. 5.

$$ \begin{aligned} \textbf{x}_{\text {new}} = (\textbf{x}_{\text {best1}} \& \textbf{x}_{\text {best2}}) | \sim (\textbf{x}_{\text {best1}} | \textbf{x}_{\text {best2}}) \end{aligned}$$
(5)

where \(\textbf{x}_{\text {best1}}\) and \(\textbf{x}_{\text {best2}}\) are the binary vectors of the best two frogs in the memeplex, & is the bitwise AND operator, | is the bitwise OR operator, and \(\sim \) is the bitwise NOT operator.

The replacement of the worst frogs in each memeplex with the best frogs from the new population is shown in Eq. 6:

$$\begin{aligned} \begin{aligned} \text {if}~~~&fitness_{\text {new}} > fitness_{\text {worst}}\\&\text {then}~~~\text {replace}(worst_{frog}, new_{frog}) \end{aligned} \end{aligned}$$
(6)

where \(fitness_{\text {new}}\) is the fitness of the new frog, \(fitness_{\text {worst}}\) is the fitness of the worst frog in the memeplex, and replace() is a function that replaces the worst frog with the new frog.
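A short sketch of the memeplex update defined by Eqs. 5 and 6 is given below; the fitness function and memeplex contents are illustrative placeholders.

```python
# Sketch of the SFLA memeplex update in Eqs. 5-6 (illustrative fitness and data).
import numpy as np

rng = np.random.default_rng(0)
memeplex = rng.random((10, 40)) < 0.5               # 10 frogs, 40 binary feature flags

def fitness(mask):                                  # placeholder fitness for illustration
    return 1.0 / (1.0 + mask.sum())

fit = np.array([fitness(f) for f in memeplex])
best1, best2 = memeplex[np.argsort(fit)[-2:]]       # best two frogs in the memeplex

# Eq. 5: bitwise combination keeping the bits on which the two best frogs agree.
x_new = (best1 & best2) | ~(best1 | best2)

worst = fit.argmin()
if fitness(x_new) > fit[worst]:                     # Eq. 6: replace the worst frog
    memeplex[worst] = x_new
    fit[worst] = fitness(x_new)
```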

2.3 Accentuation-Based Fusion of Emotions

To better analyze the utterance-level characteristics of emotions, we use a graphical chain model to represent emotional behavior in long-time speech. This allows us to capture the complex and dynamic relationships between different aspects of emotion and to build more accurate models for long-time emotion recognition.

Fig. 3 Emotion labels fusion with accentuation weights

In continuous speech, emotion states also change continuously. In conventional frame-level recognition, emotion labels are assigned to short time periods; over long periods of an utterance, however, these emotion recognition results should be merged. Previous studies on continuous speech have focused on the linguistic meaning [22], while paralinguistic information, such as emotion, is not well studied. Since emotions in speech typically last from about one to several seconds, fusion of neighboring emotion labels is a direct implementation of long-time speech emotion recognition, as shown in Fig. 3.

For example, each node in the graph could represent a specific emotion label such as happy, sad, angry, and neutral, which is assigned to a specific utterance.

$$\begin{aligned} \text {EmotionSequence} = \{ {\textbf {e}}_1, {\textbf {e}}_2, {\textbf {e}}_3, \dots , {\textbf {e}}_m \} \end{aligned}$$
(7)

where \({\textbf {e}}_i\) represents an emotional label of an utterance.

The edges between these nodes could represent the transitions between these emotions. The weights on the edges could represent the likelihood of transitioning from one emotion to another, based on the emotional behavior characteristics of the speech.

Using a graphical chain model, we can estimate the transition probabilities. By estimating the probability of the next emotion label, conditioned on the previously observed label sequence, we can build a predictor to identify changes of emotion in long-time speech. Errors in segment-level emotion recognition can be corrected when abnormal edges with low posterior probability are detected.
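As an illustration of how such transition statistics can be obtained, the following sketch estimates an emotion transition matrix from labeled sequences by simple bigram counting; the label set and sequences are hypothetical, and the full method instead uses the transformer-based predictor described next.

```python
# Sketch of estimating an emotion transition matrix by counting label bigrams
# (hypothetical label set and sequences).
import numpy as np

emotions = ["happy", "sad", "angry", "neutral"]
idx = {e: i for i, e in enumerate(emotions)}

sequences = [                                       # example utterance-level label chains
    ["neutral", "happy", "happy", "neutral"],
    ["neutral", "sad", "sad", "angry", "neutral"],
]

counts = np.zeros((len(emotions), len(emotions)))
for seq in sequences:
    for prev, nxt in zip(seq[:-1], seq[1:]):
        counts[idx[prev], idx[nxt]] += 1

# Row-normalize to obtain P(next emotion | current emotion); a small smoothing
# term keeps unseen transitions at nonzero probability.
transition = (counts + 1e-3) / (counts + 1e-3).sum(axis=1, keepdims=True)
print(transition.round(3))
```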

A transformer model is adopted to build the predictor. The transformer is a neural network architecture suited to sequential data processing; it is a feedforward network that uses self-attention to process its inputs (here, the emotion label sequence in long-time speech).

The attention mechanism works by assigning a weight to each element of the input sequence based on its relevance to the output at each step of the processing. The weights are calculated using a function that takes into account the similarity between the current processing step and each element of the input sequence.

The prediction algorithm steps are as follows:

Input:

    Bidirectional emotion label sequence;

    Accentuation weights sequence.

The accentuation weights are related to each emotion label and can be calculated from utterance-level acoustic features.

Output:

    Predict the transition probability of the next node;

If the transition probability is lower than an empirical threshold, replace the recognized label with the predicted emotion label.

To predict the next state of emotion sequence in long-time speech, we can first input the sequence into the transformer encoder to obtain a sequence of encoder outputs. Then, we can use the decoder to generate the next state based on the previous states and the encoder outputs. Specifically, at each time step, the decoder generates an output representation based on the previous output and the encoder outputs and then generates a probability distribution over the possible next states using a softmax function.

To estimate the probability of state transitions, we can compute the probability of transitioning from the current state to each possible next state using the output distribution generated by the decoder. The transition probabilities can then be used to construct a state transition matrix that describes the probability of transitioning between any two states in the sequence.
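A simplified, encoder-only sketch of such a predictor is shown below; it uses PyTorch’s built-in transformer modules, a hypothetical four-class label set, and toy accentuation weights, whereas the full model also employs a decoder as described above.

```python
# Simplified sketch of a transformer predictor over emotion label sequences
# (encoder-only; hypothetical label set, weights, and hyperparameters).
import torch
import torch.nn as nn

NUM_EMOTIONS, D_MODEL = 4, 32                       # e.g. happy, sad, angry, neutral

class EmotionTransitionPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_EMOTIONS, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, NUM_EMOTIONS)

    def forward(self, labels, weights):
        # labels: (batch, seq) emotion ids; weights: (batch, seq) accentuation weights.
        h = self.embed(labels) * weights.unsqueeze(-1)   # scale tokens by accentuation
        h = self.encoder(h)
        logits = self.head(h[:, -1])                     # summary of the observed sequence
        return torch.softmax(logits, dim=-1)             # distribution over next states

model = EmotionTransitionPredictor()
labels = torch.tensor([[3, 0, 0, 3]])                    # e.g. neutral, happy, happy, neutral
weights = torch.tensor([[1.0, 1.4, 1.2, 0.8]])           # hypothetical accentuation weights
print(model(labels, weights))                            # P(next emotion | sequence)
```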

Furthermore, we consider accentuation, which is a cue indicating important utterances within a paragraph.

To identify the long-period emotion type over a paragraph, we consider the accentuation weights. A sliding window is used to generate the samples. The weights are estimated by a regression model that reflects the accentuation features. The accentuation-related features used for regression include pitch frequency, formant frequency, duration, and intensity.
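The regression step can be sketched as follows, assuming scikit-learn and synthetic data; the feature columns follow the list above, while the target accentuation weights are assumed to come from annotations or acoustic analysis of the corpus.

```python
# Sketch of regressing accentuation weights from utterance-level acoustic cues
# (synthetic data; targets are assumed annotated accentuation levels).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Columns: pitch frequency, formant frequency, duration, intensity.
X = np.column_stack([
    rng.normal(200, 40, 100),    # mean pitch (Hz)
    rng.normal(900, 150, 100),   # first formant (Hz)
    rng.normal(1.5, 0.4, 100),   # duration (s)
    rng.normal(65, 5, 100),      # intensity (dB)
])
w = rng.uniform(0.5, 1.5, 100)   # hypothetical accentuation weight targets

model = Ridge().fit(X, w)

# Predict weights for the utterances inside one sliding window of the paragraph.
window_feats = X[:5]
accent_weights = model.predict(window_feats)
print(accent_weights)
```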

By using a graph to represent emotions in this way, we can capture the dynamic changes in emotional expression over time and build models that improve the recognition of the speaker’s emotional state over long periods of speech.

3 Experimental Results

3.1 The Databases

In our experiments, we use EMO-DB for the feature compensation experiments and emotion recognition tests.

EMO-DB is a widely used emotional speech database that contains recordings of emotional speech in German. It was created at the Institute of Communication Science of the Technical University of Berlin, Germany.

The EMO-DB database comprises seven emotions: (1) anger; (2) boredom; (3) disgust; (4) fear; (5) happiness; (6) sadness; and (7) neutral. The data were recorded at a 16-kHz sampling rate.

EMO-DB has been used in various studies to analyze emotional speech and develop algorithms for speech emotion recognition. It is freely available for academic research purposes and has been used in numerous studies worldwide.

We also adopt a local database from Southeast University (SEU), whose long-time speech corpus is well suited to verifying our emotion recognition method.

The database involves three males and three females with performance or broadcasting experience, who had not had a cold recently and who speak accurate Mandarin. The recording is conducted in a quiet, echo-free room; the performers are in separate booths, while the recording staff remain outside the booths and can only hear the performers’ voices, not see their facial expressions or movements. Before recording, the performers are told to express the emotions in their own way. They can make whatever facial expressions and movements they want, as long as they do not make any noise that would interfere with the recording.

The recordings are monophonic, with 16-bit quantization and a sampling rate of 11,025 Hz. Each word and short phrase can express six types of emotions, as shown in Table 1. Each long paragraph is composed of four to six short sentences. Every long paragraph contains emotional or neutral content to some extent, and all selected long paragraphs carry identifiable emotions.

Table 1 The original sample distribution of SEU database

A listening test is conducted to verify the emotion annotations. The workflow of data annotation is shown in Fig. 4. Listening tests are commonly used to verify the accuracy of emotion annotations in speech datasets, and they are an important step in developing and evaluating speech emotion recognition systems. Human listeners are asked to listen to audio samples and provide their own annotations of the emotions expressed in the speech. These annotations are then compared to the existing annotations in the dataset, and any discrepancies or errors can be identified and corrected.

Fig. 4 Flowchart of the data annotation for emotional speech

Table 2 Statistical feature set used for emotion analysis

We constructed a number of statistical features for emotion recognition. As shown in Table 2, examples of these statistical features include MFCCs, pitch, and formant features. These features are derived from the acoustic properties of the speech signal and can provide information about the emotional state of the speaker.
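As an illustration of how such statistics can be computed from frame-level contours, the sketch below aggregates pitch and MFCC tracks into utterance-level descriptors; the particular statistics (mean, standard deviation, maximum, minimum, range) are assumed examples, and the exact set used is listed in Table 2.

```python
# Sketch of computing utterance-level statistics from frame-level features
# (the statistic names are assumed examples; Table 2 lists the exact set used).
import numpy as np

def statistics(contour):
    """Aggregate a 1-D frame-level contour into utterance-level statistics."""
    return {
        "mean": float(np.mean(contour)),
        "std": float(np.std(contour)),
        "max": float(np.max(contour)),
        "min": float(np.min(contour)),
        "range": float(np.max(contour) - np.min(contour)),
    }

# Placeholder frame-level contours (e.g. from the extraction step in Sect. 1).
pitch_contour = np.random.default_rng(0).normal(200, 30, 150)
mfcc_frames = np.random.default_rng(1).normal(0, 1, (13, 150))

features = {"pitch": statistics(pitch_contour)}
for i, row in enumerate(mfcc_frames):
    features[f"mfcc{i + 1}"] = statistics(row)
print(len(features), "feature groups")
```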

3.2 Feature Compensation Results

Comparing the results before and after feature compensation, the sample features show a clear improvement. The input feature sequence is mapped into the compensated feature sequence by the neural network, and the gender-related differences between samples are reduced. As shown in Tables 3 and 4, happiness is improved by around 3 percentage points and neutral by around 2 percentage points.

Table 3 Emotion recognition results before feature compensation (SEU database)
Table 4 Emotion recognition results after feature compensation (SEU database)

In Table 5, the presented results serve as a basis for comparing the effectiveness of feature compensation by contrasting the metrics before and after compensation. The evaluation metrics used are precision, recall, and F1 score, which assess the model’s performance in classifying different emotion classes. Before compensation, the model demonstrates moderate performance across most emotion classes, with varying levels of precision, recall, and F1 score.

Among the emotion classes evaluated, the highest performance is observed in the “Sad” class with a precision of 0.7732, recall of 0.791, and an F1 score of 0.7820. This indicates that the model exhibits relatively good capability in identifying instances of sadness, demonstrating a balanced precision and recall. On the other hand, the lowest performance is observed in the “Neutral” class with a precision of 0.7036, recall of 0.705, and an F1 score of 0.7043. Although the model shows a reasonably balanced precision and recall for neutral instances, there is room for improvement to achieve higher accuracy.

The results indicate areas for improvement, such as enhancing recall rates to reduce missed emotions. By comparing these metrics before and after compensation, we can determine the effectiveness of the feature compensation technique in enhancing the model’s emotion classification capabilities.

Table 5 Evaluation metrics for each emotion class before compensation (SEU database)

As shown in Table 6, the results after feature compensation demonstrate the effectiveness of the method in enhancing the emotion classification model’s performance. The two highest performing emotion classes are "Angry" and “Sad.” The “Angry” class exhibits the highest precision (0.8141), recall (0.784), and F1 score (0.7988) after compensation, indicating a significant improvement in accurately identifying instances of anger. The second highest performing class, "Sad," also shows notable enhancements in precision (0.7951), recall (0.819), and F1 score (0.8069), further validating the effectiveness of the compensation method. These results highlight the success of the feature compensation technique in improving the model’s ability to recognize and classify emotions, as reflected by the enhanced evaluation metrics.

Table 6 Evaluation metrics for each emotion class after compensation (SEU database)

Similar improvements can be observed on EMO-DB, as shown in Tables 7 and 8: neutral is improved by around 2 percentage points, and fear is improved by around 2 percentage points. From Tables 9 and 10, it is evident that feature compensation yields an improvement. After compensation, we observe higher values of precision, recall, and F1 score across most emotion classes compared to the results before compensation. For instance, the precision for Happy increased from 0.9151 to 0.9241, recall increased from 0.851 to 0.864, and F1 score increased from 0.8819 to 0.8930. Similar improvements are seen in the other emotion classes as well. This enhancement suggests that the feature compensation technique has resulted in better accuracy and performance for emotion classification on the EMO-DB dataset.

Table 7 Emotion recognition results before feature compensation (EMO-DB database)
Table 8 Emotion recognition results after feature compensation (EMO-DB database)
Table 9 Evaluation metrics for each emotion class before compensation (EMO-DB)
Table 10 Evaluation metrics for each emotion class after compensation (EMO-DB)
Fig. 5 Comparison of precision, recall, and F1 for emotion class recognition before and after compensation (SEU Database)

Fig. 6 Comparison of precision, recall, and F1 for emotion class recognition before and after compensation (EMO-DB Database)

Figures 5 and 6 display a comparison of precision, recall, and F1 metrics for emotion class recognition before and after compensation. The left subplot illustrates the performance metrics before compensation, with bars depicting the precision, recall, and F1 scores for each emotion class and a legend providing a clear, color-coded distinction of the metrics. The right subplot shows the corresponding metrics after compensation. Notably, after compensation, improvements are observed in precision, recall, and F1 scores across various emotion classes. The data highlight the effectiveness of the compensation approach in enhancing emotion class recognition performance in both databases. These findings contribute valuable insights to the field of emotion recognition and can aid in developing more accurate and reliable emotion recognition classifiers.

Fig. 7 Denoising autoencoder-based feature compensation (SEU)

Fig. 8 Denoising autoencoder-based feature compensation (EMO-DB)

Fig. 9 Comparison of SFLA and GA optimization results

Comparison Between Denoising Autoencoder and Existing Feature Compensation Algorithm

Feature compensation or normalization methods are employed to standardize the features or variables within a dataset, ensuring they are on a consistent scale or distribution. These techniques are valuable for enhancing the performance of machine learning algorithms and ensuring that all features contribute proportionately to the analysis.

Mean shift [27] can be utilized for feature compensation by applying it to clustering in the feature space. This approach aims to shift the feature distribution towards a desired target distribution, thereby facilitating feature normalization or compensation. Normalization and standardization can then be performed within each arbitrarily shaped cluster to improve the data distribution for modeling purposes.

As shown in Figs. 7 and 8, the denoising autoencoder method exhibits higher F1 scores of 0.735 (Happy), 0.772 (Fear), 0.807 (Sad), 0.754 (Surprise), 0.799 (Angry), and 0.731 (Neutral). In comparison, the mean-shift method achieved slightly lower F1 scores: 0.713 (Happy), 0.764 (Fear), 0.789 (Sad), 0.751 (Surprise), 0.763 (Angry), and 0.705 (Neutral). These results indicate that the denoising autoencoder approach outperforms the mean-shift method in improving feature compensation based on the SEU dataset.

Similarly, on EMO-DB, the data represents the F1 scores for emotion classes such as Happy, Disgust, Fear, Sad, Boredom, Angry, and Neutral, using the same denoising autoencoder and mean-shift methods. The denoising autoencoder method yields higher F1 scores. In contrast, the mean-shift method achieves slightly lower F1 scores. These results highlight that the denoising autoencoder approach demonstrates superior performance over the mean-shift method in enhancing feature compensation on both the SEU dataset and EMO-DB dataset.

Overall, we can observe that the denoising autoencoder method exhibits better improvement in feature compensation compared to the mean-shift method, as evidenced by the higher F1 scores achieved across various emotion classes in both datasets.

3.3 Feature Optimization Results

We compared the GA and SFLA algorithms and optimized the combination of features to improve the emotion recognition results at the frame level.

In Fig. 9, we compare the average recognition rates on the EMO-DB and SEU datasets. The SFLA results are better than those of GA, yielding higher frame-level recognition rates.

In our experiment, we use GMM and SVM to construct the classifiers. As we focus on feature optimization, we use GMM and SVM as two typical classifiers for emotion recognition to verify the feature optimization method. It can be seen that SVM gives better average recognition rates. In the subsequent experiments, we use SVM for long-time speech emotion recognition experiments.

In the subsequent experiments, we use the feature combination given by SFLA, as shown in Tables 11 and 12. The feature optimization results using GA are provided in Tables 13 and 14.

Table 11 Optimized combination of speech emotional features using SFLA (SEU dataset)
Table 12 Optimized combination of speech emotional features using SFLA (EMO-DB)
Table 13 Optimized combination of speech emotional features using GA (SEU dataset)
Table 14 Optimized combination of speech emotional features using GA (EMO-DB)

The population size for GA is set to 200 as the starting point. The rank-based selection method is used. The crossover rate is set to 0.5 and the mutation rate is set to 0.001 as the starting point. The population size for SFLA is set to 100 as the starting point. The mutation rate is set to 0.01. The crossover rate is set to 0.5. Detailed information is summarized in Tables 15 and 16.

Table 15 Parameter settings for genetic algorithm (GA) in feature selection
Table 16 Parameter settings for shuffled frog-leaping algorithm (SFLA) in feature selection

3.4 Long-Time Emotion Recognition Results

Using our proposed method, which considers accentuation, context, and long-time dependency, the speech emotion recognition results can be further improved.

The transformer model is configured with various parameter settings to optimize its performance during training and inference. A dropout rate of 0.1 is applied to the model, randomly deactivating 10% of the neurons during training to prevent overfitting. The learning rate is set to 0.001, controlling the step size of parameter updates during optimization. A batch size of 64 is utilized, determining the number of training examples processed together in each iteration. The maximum sequence length is limited to 512, ensuring efficient processing and memory utilization. Finally, the model undergoes 10 training epochs, indicating the number of times the entire training dataset is processed. These parameter settings collectively define the behavior and capacity of the transformer model, allowing it to effectively process and learn from sequential data, such as long-time speech emotion sequences.
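For reference, the reported settings can be collected into a single configuration as sketched below; the field names and the dataclass wrapper are illustrative, not part of the actual implementation.

```python
# The transformer hyperparameters reported above, gathered into a configuration
# (field names and the dataclass wrapper are illustrative).
from dataclasses import dataclass

@dataclass
class FusionModelConfig:
    dropout: float = 0.1         # 10% of neurons deactivated during training
    learning_rate: float = 1e-3  # step size of parameter updates
    batch_size: int = 64         # training examples processed per iteration
    max_seq_len: int = 512       # maximum emotion-label sequence length
    epochs: int = 10             # passes over the training set

config = FusionModelConfig()
print(config)
```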

As shown in Tables 17 and 18, the final results on the SEU database show that the proposed methods achieve promising performance for long-time speech emotion recognition, and the recognition accuracy is further improved. Figures 10 and 11 further illustrate the improvement obtained with our proposed algorithms.

Table 17 Emotion recognition results using long-time emotion fusion (SEU database)
Table 18 The final evaluation metrics for each emotion class
Fig. 10 Improvement of the emotion recognition results using the feature compensation algorithm

Fig. 11 Improvement of the emotion recognition results using the long-time emotion fusion algorithm

Fig. 12 Comparison with the existing method in long-time speech emotion recognition

As shown in Fig. 12, we compared the conventional averaged weights fusion with the proposed accentuation-based approach. In the context of long-time emotional speech analysis, two fusion techniques that can be employed are averaged weights fusion and accentuation-based fusion. These techniques aim to combine multiple sources of information or features to enhance the overall performance of emotion recognition systems.

Averaged weights fusion involves assigning equal importance to all the input features or sources, which is conventionally used when not considering the different characters in long-time emotional speech. Each feature is weighted equally, and their contributions are averaged to obtain a combined representation. This fusion method assumes that all features are equally relevant and can provide valuable information for emotion recognition. However, it may not consider the varying importance of different features in capturing emotional cues in long-time speech.

From Fig. 12, we can see that the green line represents the accentuation-based method, and the orange line represents the averaged weights method. Both methods have F1 scores for each emotion class plotted for comparison.

We can observe that the F1 scores vary for different emotion classes and between the two methods. In general, the “Sad” emotion class has the highest F1 score for both methods, followed by “Angry” and “Surprise”. The lowest F1 scores are seen for the “Neutral” emotion class.

The proposed accentuation-based method consistently achieves higher F1 scores than the averaged weights method across the emotion classes. We further calculated a significance p-value of 0.0061, which is smaller than the common alpha levels of 0.05 and 0.01. Based on this, the proposed method outperforms the conventional method in long-time speech emotion recognition, achieving consistently, if modestly, higher F1 scores.
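The specific statistical test behind this p-value is not detailed here; one plausible setup is a paired test over per-class F1 scores, sketched below with placeholder numbers.

```python
# Hedged sketch of a paired significance test over per-class F1 scores
# (placeholder values; the reported p = 0.0061 may come from a different test).
from scipy import stats

f1_accentuation = [0.78, 0.81, 0.84, 0.79, 0.82, 0.74]   # illustrative per-class F1
f1_averaged     = [0.75, 0.78, 0.82, 0.77, 0.79, 0.71]

t_stat, p_value = stats.ttest_rel(f1_accentuation, f1_averaged)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```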

4 Discussion

Speech emotion recognition is a challenging task because it requires not only understanding the words being spoken but also the emotional content conveyed by the speaker. One of the key factors that can affect the accuracy of speech emotion recognition is the context in which the speech is spoken. Understanding the context of the speech can help improve the accuracy of emotion recognition by providing additional information about the speaker’s intentions and emotional state.

We can improve the accuracy of speech emotion recognition using the long-time dependency of emotions. Emotions are often expressed over an extended period, and they can change rapidly, making it difficult to accurately capture and classify them. To address this, we have explored various techniques such as feature compensation, feature optimization, and fusion of recognition results in long-time speech.

5 Conclusion

In this paper, we studied long-time speech emotion recognition, an important topic for real-world applications that still lacks systematic and in-depth research. First, we used a denoising autoencoder for emotion feature compensation and then used SFLA for feature selection. Second, we studied accentuation-based emotion fusion, using a transformer to predict the probability of the next emotion and correcting errors from the viewpoint of the emotion sequence. Third, we verified our methods on two different databases: the feature compensation and optimization were tested and compared on both databases, and the long-time emotion fusion and recognition were tested on a local database. The results show that our framework is suitable for emotional feature extraction and long-time emotion recognition.

In future work, we will consider more factors related to long-term emotional behavior and explore context information at the paragraph level. This approach provides a more comprehensive understanding of emotions and their dynamics, going beyond traditional sentence-level analysis. By considering factors that influence emotional behavior over time and incorporating paragraph-level context, researchers can uncover patterns, trends, and changes in emotions within specific contexts. This research has practical applications in areas such as mental health, customer experience, and education.