1 Introduction

Recently, increasing attention has been drawn to identifying emotions from speech signals. One main reason for this popularity is that speech is the most natural and important medium of human communication. However, despite the tremendous research on speech recognition done since the late 1950s, emotion remains one of the largest gaps between humans and machines [1]. Recognizing human emotion from speech enables promising applications such as healthcare services, commercial conversations, virtual humans, emotion-based indexing, and information retrieval.

An utterance (a phrase, short sentence, etc.) is often considered the fundamental unit of analysis and is recognized on the basis of global, utterance-wise statistics of derived segments, so the segment features are transformed into a single feature vector for each emotional utterance [2–6]. In recent research, however, an increasing number of scientists and psychologists have argued that changes in emotional activity occur within a very short period of time. Several studies have emphasized the importance of the temporal dynamics of emotions [7, 8], and one study illustrates that emotions are inherently dynamic [9]: within 2.6 s, a person went through several emotional states, such as surprise, fear, an aggressive stance, and relaxation. In addition, another study demonstrates that emotional effects occur within hundreds of milliseconds [10].

Motivated by these findings, we focused on a novel scheme that improves speech emotion recognition by using segment-level features instead of utterance-wise features [11–13]. Many researchers have recently questioned whether the utterance-level approach is the right choice for modeling emotions [14], because utterance-wise statistics have difficulty avoiding the influence of spoken content. Moreover, valuable information that is neglected when only utterance-wise statistics are calculated can be utilized by a segment-level feature extraction approach. This hypothesis is also supported by several studies [15, 16] showing that adding segment-level features to the common utterance-level features yields improvements.

In this study, we considered a purely segment-level strategy for recognizing speech emotion and abandoned utterance-wise features in order to reduce noise such as spoken content and to utilize the information that is neglected when calculating utterance-wise statistics. One issue with segment-level speech emotion recognition is that it makes training considerably more difficult because a single utterance is divided into a number of segments. The aim of this paper is to design an approach that recognizes utterance-level emotion by aggregating segment-level labels and that extracts additional information such as emotion strength. The concept is illustrated in Fig. 1.

Fig. 1 Concept of short time analysis of utterance f(t)

2 Experimental Design for Emotion Database

A well-annotated database is needed to construct a robust method for recognizing emotions from speech signals [17]. Our experiment emphasizes “natural speech”: the participants were kept from becoming aware that they were in an experimental environment, which is much more realistic than experiments conducted with scripted speech. Natural speech is difficult to analyze but more suitable than scripted speech for validating the robustness of an emotion analysis method.

2.1 Experimental Procedure

The experimental setup is composed of one instructor, one coordinator, and two subjects. The coordinator cooperates with the subjects to help better stimulate their emotions; however, the coordinator pretends to be one of the participants to avoid being an extra obstruction for the subjects. The stimulation process unfolds through conversations with the aid of videos. The steps are as follows, and the experiment is illustrated in Fig. 2.

Fig. 2 Illustration of experiment

  • The instructor sets up the experimental environment, such as a projector for the videos and microphones for collecting the speech signals, and gives instructions to the participants.

  • The instructor also explains the steps to the participants, including the coordinator, who are asked to freely give their impressions of the videos.

  • Self-introductions are made to create an easy speaking atmosphere.

  • After each emotion-evoking video, which lasts several minutes, the participants' spoken impressions are recorded.

The emotion corresponding to each utterance in the recorded speech is assessed not only by the subjects themselves (self-assessment) but also by ten other people (others-assessment) after the experiment. Therefore, the reliability of the utterance emotion labels can be evaluated.

2.2 Data Information

Ninety-six people (53 males and 43 females), ranging in age from their early teens to their 40s, participated in the experiments. We selected samples in two steps to obtain reliable data. First, only the samples whose labels (pleasure or displeasure) from the self-assessment and others-assessment agreed were taken into consideration. Second, to balance the number of samples for each label, we selected the 300 highest-ranked utterances according to the others-assessment, consisting of 150 pleasure utterances and 150 displeasure utterances obtained from 50 participants. Ten specialists labeled each utterance in the others-assessment, and the rank of each utterance was calculated as the ratio of specialists whose labels were consistent with the self-assessment label.
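For clarity, the two-step selection can be summarized with the following sketch (pandas is assumed, and the column names and data layout are hypothetical rather than those of our actual tooling).

```python
# Hypothetical sketch of the two-step sample selection described above.
import pandas as pd

def select_balanced_samples(df: pd.DataFrame, n_per_class: int = 150) -> pd.DataFrame:
    """df rows are utterances with (assumed) columns:
       'self_label'   : 'pleasure' or 'displeasure' (self-assessment)
       'other_labels' : list of 10 labels from the others-assessment
    """
    df = df.copy()
    # Step 1: the majority others-assessment label must agree with the self-assessment.
    df["others_majority"] = df["other_labels"].apply(
        lambda labels: max(set(labels), key=labels.count))
    df = df[df["others_majority"] == df["self_label"]]

    # Rank: ratio of the ten assessors whose label matches the self-assessment.
    df["agreement"] = df.apply(
        lambda r: r["other_labels"].count(r["self_label"]) / len(r["other_labels"]),
        axis=1)

    # Step 2: keep the top-ranked utterances per class to balance the data set.
    return (df.sort_values("agreement", ascending=False)
              .groupby("self_label")
              .head(n_per_class))
```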

3 Emotion Recognition Method Based on Purely Segment-Level Features

The proposed methodology for emotion recognition is based on purely segment-level speech frames. The important issues here are the increased number of samples, which raises the computational burden in terms of both memory capacity and execution speed, and the decline in the generalization ability of the classifier. In this work, we quantitatively analyze various analytical schemes related to segment-level emotion recognition and propose an automatic approach for decreasing the number of samples in order to reduce the computational complexity and improve the classifier's generalization ability. The algorithm is illustrated in Fig. 3.

Fig. 3 Flowchart of emotion recognition method based on purely segment-level features

3.1 Segmentation Approach

We propose novel segmentation strategies based on the analysis of short-time speech segments. The proposed approach is illustrated in Fig. 4.

Fig. 4 Illustration of proposed segmentation approach

A classifier is trained by using the information contained in the input feature vectors. In practice, the final uncertainty after training will not be zero because the input information is insufficient, and the classifier might also be “confused” by ambiguities in that information. The most obvious remedy is to increase the number of training samples, but this is undesirable in our case since the large increase in sample count caused by splitting an utterance into segments already imposes a great computational burden. A more efficient way is to find more informative segments by minimizing the mutual information between pairs of feature vectors. In this study, fixed-length segments are constructed at positions selected on the basis of designed indexes. When only a much smaller number of selected segments is considered, more precise segment labels can be defined; thus, not all parts of the utterance are used in the analysis. A sliding window with no overlap is used to process the utterance signal and rank the fixed-length segments, and a correlation coefficient is adopted to obtain a small fixed number of segments from each utterance.

The correlation coefficient [18], which is also known as the Pearson product-moment correlation coefficient, is a measure of the linear dependence between two feature vectors. It is defined as

$$ \gamma = \frac{{\sum\nolimits_{x \in X,y \in Y} {\left( {x - \bar{X}} \right)\left( {y - \bar{Y}} \right)} }}{{\sqrt {\sum\nolimits_{x \in X} {\left( {x - \bar{X}} \right)^{2} } } \sqrt {\sum\nolimits_{y \in Y} {\left( {y - \bar{Y}} \right)^{2} } } }} $$
(1)
$$ \bar{X} = \frac{1}{n}\sum\limits_{x \in X} x $$
(2)
$$ \bar{Y} = \frac{1}{n}\sum\limits_{y \in Y} y , $$
(3)

where n is the number of features.
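As a concrete reading of Eqs. (1)–(3) and the selection step, the sketch below (Python/NumPy) frames an utterance with a non-overlapping window and greedily keeps the segments whose feature vectors are least correlated with those already selected. The greedy rule, the starting segment, and the default values (50-ms frames, 20 segments, as in Sect. 4) are illustrative choices; only the use of the correlation coefficient itself is prescribed above.

```python
# Minimal sketch of the correlation-based segment selection (illustrative only).
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    # Eqs. (1)-(3): linear dependence between two n-dimensional feature vectors.
    xm, ym = x - x.mean(), y - y.mean()
    denom = np.sqrt(np.sum(xm ** 2)) * np.sqrt(np.sum(ym ** 2))
    return float(np.sum(xm * ym) / denom) if denom > 0 else 0.0

def split_into_frames(signal: np.ndarray, sr: int, frame_ms: int = 50) -> np.ndarray:
    # Non-overlapping sliding window over the utterance signal.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def select_segments(vectors: np.ndarray, n_select: int = 20) -> list:
    """vectors: (n_frames, n) matrix, one feature vector (or raw frame) per row.
    Greedily keeps the rows least correlated with those already selected."""
    selected = [0]
    while len(selected) < min(n_select, len(vectors)):
        remaining = [i for i in range(len(vectors)) if i not in selected]
        # Redundancy score: largest absolute correlation with the selected set.
        scores = [max(abs(pearson(vectors[i], vectors[j])) for j in selected)
                  for i in remaining]
        selected.append(remaining[int(np.argmin(scores))])
    return sorted(selected)
```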

The concept of the proposed segmentation methods is illustrated in Fig. 5.

Fig. 5 Fixed length segment positions using proposed segmentation approaches (20-segment selecting situation shown; positions represented with gray lines)

3.2 Feature Extraction

We focused on a set of 162 acoustic features obtained from each speech segment: 50 mel-frequency cepstral coefficients (MFCCs), 50 linear predictive coefficients (LPCs), 10 statistical features (mode, median, mean, range, interquartile range, standard deviation, variance, absolute deviation, skewness, and kurtosis) calculated from each of the five levels of detail wavelet coefficients obtained with the discrete wavelet transform (DWT), pitch, energy, zero-crossing rate (ZCR), the first seven formants, and the spectral centroid and 95 % roll-off point of the FFT spectrum.
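The sketch below shows how such a 162-dimensional segment vector could be assembled. The libraries (librosa, PyWavelets, SciPy), the sampling rate, the wavelet, and the analysis parameters are our assumptions for illustration, and formant tracking is only stubbed out; this is not the exact extraction code used in our experiments.

```python
# Hedged sketch of a 162-dimensional segment feature vector (assumed libraries
# and parameters; formant extraction is stubbed out).
import numpy as np
import librosa
import pywt
from scipy import stats

def wavelet_stats(seg: np.ndarray, wavelet: str = "db4", levels: int = 5) -> np.ndarray:
    # 10 statistics for each of the 5 DWT detail-coefficient levels (50 values).
    details = pywt.wavedec(seg, wavelet, level=levels)[1:]   # drop approximation
    feats = []
    for d in details:
        mode = stats.mode(np.round(d, 3), keepdims=False).mode
        feats += [mode, np.median(d), d.mean(), d.max() - d.min(),
                  np.percentile(d, 75) - np.percentile(d, 25), d.std(), d.var(),
                  np.mean(np.abs(d - d.mean())), stats.skew(d), stats.kurtosis(d)]
    return np.asarray(feats)

def segment_features(seg: np.ndarray, sr: int = 16000) -> np.ndarray:
    n_fft = min(512, len(seg))
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=50, n_fft=n_fft,
                                n_mels=64).mean(axis=1)              # 50
    lpc = librosa.lpc(seg, order=50)[1:]                             # 50
    pitch = float(np.mean(librosa.yin(seg, fmin=60, fmax=400, sr=sr,
                                      frame_length=n_fft)))
    energy = float(np.sum(seg ** 2))
    zcr = float(librosa.feature.zero_crossing_rate(seg).mean())
    centroid = float(librosa.feature.spectral_centroid(y=seg, sr=sr,
                                                       n_fft=n_fft).mean())
    rolloff = float(librosa.feature.spectral_rolloff(y=seg, sr=sr, n_fft=n_fft,
                                                     roll_percent=0.95).mean())
    formants = np.zeros(7)      # placeholder: LPC-root formant tracking omitted
    return np.concatenate([mfcc, lpc, wavelet_stats(seg),
                           [pitch, energy, zcr], formants, [centroid, rolloff]])
```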

3.3 Decision Model

The label of an utterance is decided on the basis of the segment labels predicted by a classifier. We simply use a majority vote, which assigns the utterance the label held by the majority of its segments, so that we can concentrate on examining the effectiveness of the proposed segment-level approach to speech emotion recognition. The decision model is shown in Fig. 6. Our decision model is based on a classifier called the “probabilistic neural network” (PNN) [19], which is organized as a multilayered feed-forward network with four layers. The network has many advantages compared with other kinds of artificial neural networks and nonlinear learning algorithms, including a very fast learning speed and a small number of parameters.
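A compact sketch of this decision model is given below (NumPy only). The Gaussian smoothing parameter sigma is an assumed value rather than a setting reported here; the structure follows the standard Parzen-window formulation of the PNN.

```python
# Sketch of the decision model: a Gaussian-kernel PNN scores each segment,
# and the utterance label is the majority vote over its segments.
import numpy as np

class PNN:
    """Probabilistic neural network (Parzen-window classifier)."""
    def __init__(self, sigma: float = 0.5):      # sigma is an assumed value
        self.sigma = sigma

    def fit(self, X: np.ndarray, y: np.ndarray) -> "PNN":
        self.X, self.y = X, np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        preds = []
        for x in X:
            # Pattern layer: Gaussian kernel activation for every training sample.
            d2 = np.sum((self.X - x) ** 2, axis=1)
            act = np.exp(-d2 / (2.0 * self.sigma ** 2))
            # Summation layer: average activation per class; output layer: argmax.
            scores = [act[self.y == c].mean() for c in self.classes]
            preds.append(self.classes[int(np.argmax(scores))])
        return np.asarray(preds)

def utterance_label(segment_features: np.ndarray, model):
    # Majority vote over the predicted labels of the utterance's segments.
    labels = model.predict(segment_features)
    values, counts = np.unique(labels, return_counts=True)
    return values[int(np.argmax(counts))]
```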

Fig. 6 Decision model of purely segment-level approach for speech emotion recognition

4 Results

Tenfold cross validation was used to evaluate and test our proposed approaches and to compare them with previous research, since it is used in much other emotion recognition research to validate general models [15]. A review of recent research on classifiers shows that the support vector machine (SVM) is one of the most robust and popular classifiers in affective research and beats many other kinds of classifiers in terms of recognition accuracy [20]. Thus, our evaluation results based on the PNN are compared with those based on an SVM (Fig. 7). Twenty segments, each 50 ms long, were used for voting in our proposal.
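For reference, the evaluation protocol can be sketched as follows with scikit-learn. The SVM uses library-default hyperparameters rather than the settings of our experiments, and the PNN sketched in Sect. 3.3 can be plugged in through the same fit/predict interface.

```python
# Sketch of utterance-level tenfold cross validation with segment-level training.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validate(utterances, labels, make_model, n_splits=10):
    """utterances: list of (n_segments, n_features) arrays; labels: one per utterance."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accs = []
    for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):
        # Segment-level training set: every segment inherits its utterance label.
        X_tr = np.vstack([utterances[i] for i in train_idx])
        y_tr = np.concatenate([[labels[i]] * len(utterances[i]) for i in train_idx])
        model = make_model().fit(X_tr, y_tr)
        # Utterance-level test: majority vote over the predicted segment labels.
        hits = 0
        for i in test_idx:
            seg_labels = model.predict(utterances[i])
            values, counts = np.unique(seg_labels, return_counts=True)
            hits += values[np.argmax(counts)] == labels[i]
        accs.append(hits / len(test_idx))
    return float(np.mean(accs))

# e.g. acc_svm = cross_validate(utts, labs, lambda: SVC())
# The PNN sketched in Sect. 3.3 can be passed in the same way.
```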

Fig. 7 Accuracy (%) of conventional methods using global features and our proposal

5 Discussion on Segment-Level Features

Previous research has reported strategies for improving the accuracy of speech emotion recognition by utilizing segment-level features together with global features extracted from utterances, and their effectiveness has been demonstrated in many reports [14, 15]. This research goes further and develops an approach that entirely abandons the global utterance features. The analytical results indicate the robustness of this advancement, which achieves a higher level of recognition accuracy by using only segment-level features in the proposed decision model. We proposed a segmentation method that adopts a correlation coefficient to select an appropriate number of segments within an utterance. The generated segments therefore carry less redundant information into the decision model, which contributes to better prediction of the utterance label.

We used a 162-dimension feature set for a complete analysis, but one remaining point is that we did not include a feature selection procedure before segmentation. Instead, the full set of extracted features is used, and the segmentation algorithm is left to decide which segments better represent the utterance label. This is acceptable for our feature dimensionality when the number of samples is large. However, the interaction between feature selection and the segmentation approach, and its implications, will be discussed as a future issue.

6 Application for Emotion Strength Analysis

A very interesting potential application of segment-level speech emotion recognition is emotion strength analysis. We use majority voting to decide the utterance label (pleasure or displeasure) under the assumption that the segment label in the majority represents the utterance label. To better exploit the segment labels, we looked further into the ratio of predicted segment labels, which can represent the strength of an utterance's emotion. All speech frames are used so that every segment is examined in terms of emotion.
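A minimal sketch of this indicator follows; the function and the direct use of the pleasure-segment ratio are our illustration of the idea rather than an exact implementation.

```python
# Emotion-strength indicator: ratio of segments predicted as "pleasure"
# within a single utterance (illustrative sketch).
import numpy as np

def emotion_strength(segment_features: np.ndarray, model) -> float:
    """Returns the fraction of the utterance's segments predicted as 'pleasure'.
    A ratio far from 0.5 (toward 1.0 for pleasure, toward 0.0 for displeasure)
    is read as a stronger emotion; a ratio near 0.5 as a weaker, mixed one."""
    labels = model.predict(segment_features)
    return float(np.mean(np.asarray(labels) == "pleasure"))
```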

6.1 Experimental Data

The International Affective Picture System (IAPS) [21] is used for evoking emotions with different strengths. The IAPS is an emotion stimulation system built from the results of many emotion experiments and is composed of about 1,000 pictures labeled with a standard scale of valence (pleasure-displeasure) and arousal (exciting-sleepy). Therefore, it meets our requirement for stimulating emotions with different strengths. Figure 8 shows the four kinds of emotion strengths we defined by using the IAPS.

Fig. 8 Defined emotion strengths based on IAPS

The experiment was made up of four parts corresponding to pleasure and displeasure stimulation at the two defined emotion strengths (weak and strong). The detailed experimental procedure is shown in Fig. 9. Pictures selected from the IAPS were projected on a screen during the stimulation period to evoke emotions. Speech signals were then collected while the participants read designed scripts with their evoked emotion. The participants were asked to close their eyes and relax during the control time. Seven Japanese males took part in the experiment. Data were collected by using the previously described procedures for estimating the four emotional strengths and contained 312 samples: 156 pleasure (78 strong, 78 weak) and 156 displeasure (78 strong, 78 weak).

Fig. 9 Experimental procedure for evoking emotions with different strengths

6.2 Results of Emotion Strength Analysis

We statistically analyzed the emotion components of all of the samples and visualized the results as a bar chart with standard deviations to illustrate the correlations between the stimulations and the emotion components, which are represented by the segment-level predictions within an utterance obtained with the proposed segment-level speech emotion recognition method (Fig. 10).
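The bar chart in Fig. 10 can be reproduced with a short matplotlib sketch such as the following; the condition names and the input structure are assumptions, and no experimental values are embedded.

```python
# Sketch of the Fig. 10-style visualization: mean segment-label ratio per
# stimulation condition with standard-deviation error bars.
import numpy as np
import matplotlib.pyplot as plt

def plot_emotion_components(ratios_by_condition):
    """ratios_by_condition maps a condition name (e.g. 'strong pleasure') to the
    per-utterance ratios of segments predicted as 'pleasure'."""
    names = list(ratios_by_condition)
    means = [np.mean(ratios_by_condition[k]) for k in names]
    stds = [np.std(ratios_by_condition[k]) for k in names]
    plt.bar(range(len(names)), means, yerr=stds, capsize=4)
    plt.xticks(range(len(names)), names, rotation=20)
    plt.ylabel("ratio of 'pleasure' segments")
    plt.tight_layout()
    plt.show()
```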

Recognizing the emotion of utterances is one of the more attractive topics in speech analysis for human–computer interaction (HCI), healthcare, and other fields. Emotion strength analysis, however, has been an essential but difficult research area. We discussed the potential of using segment-level frames for such analysis. As shown in Fig. 10, the proposed method can indeed reflect the strengths of emotions in utterance clusters over a short period of time. However, it is difficult to apply to a single utterance because of the variance in the emotional components across utterances. Although additional research is necessary to collect more solid findings on the emotion strength analysis of utterances, segment-level speech emotion analysis promises to be a new method for machines to better recognize the strength of human emotion.

Fig. 10 Statistical analysis for emotion components with segment-level speech emotion analysis for all speech samples

7 Conclusion

An emotion recognition method using short-time speech analysis was proposed. To make the method more efficient and accurate, an advanced relative segmentation method was introduced that uses correlation coefficients for fixed-length segment selection, which is essential for realizing the purely segment-level approach. The proposed method increases the accuracy of emotion recognition by more than 20 % compared with the conventional method using the global features of utterances, as validated on a database of speech signals from 50 participants. The proposed method was also shown to be effective in determining the emotion strength of utterances over a period of time, and it can provide hints about emotion strength according to our validation results with the IAPS database.