1 Introduction

The goal of this study is to develop an automated emotional speech recognition system for inclusion in a Korean emotional database, under the category of “real emergency situation.” Emotional speech recognition is an important topic in the Human-Computer Interaction research field and has reached a level of accuracy that allows widespread application in media services such as call centers and voice mail [14, 27, 39, 55]. Research continues in order to develop more accurate systems. Because emotional speech recognition systems are modeled on human communication, further development of this technology requires a deeper understanding of how emotion is expressed in human communication.

In human communication, emotion is one of the most important factors for conveying intent in a connoted way. Thus, emotional information, defined as emotional intelligence, is an important part of natural human communication [52]. Studies have shown that during communication emotion can be expressed both deliberately and subconsciously through different channels, including the utterance itself, paralinguistic features such as pitch, energy, or facial expressions, or a combination of these. With increasing interest in providing better interaction between humans and computers, emotional intelligence is now required for accurate emotion perception [36]. For this reason, emotional speech recognition systems, which aim to automatically identify a person’s emotional state, have become a significant research topic in the Human-Computer Interaction (HCI) field [40]. Emotional speech recognition systems are especially beneficial for the management of service quality in media services such as call centers [43]. Automated telephone systems are already used in many areas, such as consumer service centers and emergency operations centers, because their ability to detect users’ emotions through spoken language has substantial potential for dealing with users’ requirements [34]. For example, if the system detects that a caller’s emotion is too negative for the automated system to handle, the call can be switched to a human service provider who is better equipped to respond adequately [4, 33]. Moreover, if an emotional speech recognition system can respond to the user naturally, it can make the user feel less daunted when interacting with computers. Existing emotion-based speech recognition systems in call centers and medical centers are currently used to monitor callers’ or patients’ emotions and take appropriate measures [47].

Emotional speech recognition systems are developed by extracting features from emotional acoustic data. To this end, there has been a great deal of research using existing corpora [5, 59], such as the Berlin Database of Emotional Speech (BDES; a German corpus published by the Department of Acoustic Technology at the Technical University of Berlin), the Danish Emotional Speech corpus (DES; published by Aalborg University, Denmark), and telephone messages, which have been widely used in system development. However, no comparable emotional database is available in Korea. Thus, a prerequisite to the development of a Korean emotional speech recognition system is the building of a Korean emotional speech database, which requires the collection of emotional speech data. Recorded calls from emergency centers were considered as part of these efforts. Emergency call centers aim to offer precise and suitable instructions to callers in emergency situations to minimize potential risks. Because the calls are made by people in emergency situations requiring proper instruction, callers typically exhibit panic or fear, and those emotions are naturally represented in their voices. In Korean emergency centers, it is legislated that each interaction between caller and call recipient shall be recorded. Hence, the acoustic data from emergency call centers can be considered an optimal source for a negative emotional speech database.

One reason that building an emotional database using acoustic data from emergency call centers is important is that it provides real-life acoustic data. Although there are considerable emotional corpora, much of this material consists of artificially recorded data, produced either by professional actors in Wizard-of-Oz settings or by non-professionals instructed to read sentences with specific emotions. Artificially produced emotional data are very clear and discernible; thus, emotional speech recognition systems built on such corpora lack the finesse of those which use acoustic data produced in real-life contexts. This is because, unlike in corpora recorded by actors, emotions in the real world are not always so obviously expressed and can be represented in much more complicated ways [60]. Hence, studies on emotional speech recognition using acoustic data from real-life situations are becoming increasingly important. There have been few studies using conversations which occur in real life, especially telephone conversations. Existing real-world acoustic data are mostly amassed at telecommunications service centers [43]. However, these data do not include urgent situations in which the caller’s normal cognitive function may be affected by stress. Emotional speech recognition systems which use acoustic data from the real world, including urgent situations, need to be studied further [57]. Therefore, in this study we use a considerable amount of real-life speech recorded under real emergency situations at an emergency call center. We anticipate that these data will be valuable for finding emotional and behavioral features that allow emotional speech patterns in real-life situations, specifically among Korean speakers, to be recognized with higher accuracy.

To verify whether the acoustic data from the emergency call center were valid for a negative emotional database, our priority was to identify which category the callers’ emotions fell into. To this end, we looked into emotion studies in the field of psychology, because emotion has long been an important research topic there. Of the many emotion theories, two classic accounts of emotional distinction stand out: discrete emotion theory and dimensional theory. Discrete emotion theory, devised initially by Tomkins (1962) and further developed by many researchers, argues that individual emotions (e.g., surprise, interest, joy) have inherent biological and neurological profiles [18, 41, 65]; in its basic form, this theory identifies eight basic discrete emotions (surprise, interest, joy, rage, fear, disgust, shame, and anguish). On the other hand, dimensional theory [46] suggests that emotions are distinguished along only two dimensions: ‘valence’, which judges how positive or negative an emotion is, and ‘arousal’, which is the intensity of an emotion. According to this theory, the basic emotions are defined by these dimensional spaces [46, 49]. Therefore, emotions can be distinguished either as the eight basic discrete emotions or as three simple classes: positive, neutral, or negative. Studies of emotional speech recognition systems have largely relied on discrete emotion theory for emotion distinction [14, 28, 37, 43, 47, 59, 61, 64]. However, in real life, emotional utterances are far more complicated, and it is therefore important to consider the two theories together as mutually supplementary views [1, 12, 16, 31]. Yik et al. (1999) demonstrated eight basic discrete emotion clusters based on a dimensional account called the circumplex model of affect. This model was advanced in 2011 into a 12-point affect circumplex (Fig. 1), which arranges discrete emotion clusters on the dimensional structure [31, 65]. According to this model, the emotional state of a person calling the emergency services would fall into the negative (i.e., ‘displeasure’), high-activation (i.e., ‘activation’), tense or jittery region (boxed in orange in Fig. 1). The current study bases its machine-based emotional speech recognition on Yik et al.’s affect circumplex model. We concluded that, under this model, the emotional states of callers to emergency call centers in real-life situations correspond to high arousal and negative valence, falling into the activated-displeasure clusters (e.g., frenzied, jittery).

Fig. 1

Circumplex model of affect [65]. This model demonstrates the discrete emotions clustered on the basis of the dimensional theory. According to this model, the caller’s emotional state falls into the negative (i.e., displeasure in this figure) and high-activation (i.e., activation in this figure) region: tense or jittery (boxed in orange)

There are many acoustic features which are widely used in emotional speech recognition [48]. The relationship between acoustic features and their representation in specific discrete-dimensional emotional accounts has been well reviewed (Table 1) [19, 21]. Considering both the circumplex model of affect (i.e., negative-valence, high-arousal emotional states) and the vocal expression of emotions (Table 1), ‘the voice under an emergency’ is represented by significant variation in voice intensity as well as in the fundamental frequency of speech (i.e., F0, pitch). Among the acoustic features, we analyzed variability in F0 (i.e., pitch), variability in voice intensity (i.e., energy), and Mel-Frequency Cepstral Coefficients (MFCCs).

Table 1 Acoustic features and their representation in the specific discrete-dimensional emotional accounts, summarized and adapted from Juslin & Laukka (2003) and Laukka & Juslin (2005)

In this study, we adopted ‘disfluency’ in addition to prosodic parameters (e.g., pitch, energy, speaking rate), although this particular behavioral feature is seldom used in emotional speech recognition systems [34, 43]. Even though disfluency is a key feature that reflects both emotional state and cognitive dysfunction, there has been little research on including it as a factor in speech emotion pattern recognition. Disfluency is defined as “the hesitation phenomena in utterance” [6] or “all kinds of speech features which contrast with fluency” [63]. Disfluency phenomena take various forms, mainly categorized as stuttered or non-stuttered. Stuttered disfluencies include sound, syllable, or one-syllable-word repetition (e.g., “r-r-really”, “I-I-I”) and sound prolongations (e.g., “mmmmeans”) [15], whereas non-stuttered disfluencies include interjections between utterances (e.g., “uh”, “umm”); word or phrase repetition (e.g., “turn, turn left”, where the word turn is repeated); deletion, also known as a false start, which occurs when a speaker starts saying something and then begins a new sentence (e.g., “No it’s* do you see the sign over there”, where the speaker started with “no it’s” and then restarted with “do you”); word or sequence substitution (e.g., “Do you have a caravan?”*, where caravan is substituted for summer cottage); insertion, when a speaker restarts an utterance with more than one additional word (e.g., “I want to* I really want to”, where the speaker restarts the utterance with really inserted); and articulation errors (e.g., “turn lift* there”, where the word left is misarticulated as lift) [26].

These disfluencies appear when the speaker searches for the next word or phrase, or at the beginning of an utterance, due to either greater demands on cognitive load or low cognitive control [6]. In Lindström et al. (2008), in which participants were asked to speak while handling a steering wheel (i.e., an in-vehicle spoken dialogue task), disfluencies such as interjections (e.g., “uh”, “umm”) and filler words (e.g., “like”, “you know”) were observed more often when the speaker was performing tasks that required higher cognitive load than during tasks requiring less cognitive load [26]. In addition, Lindsey et al. (1995) and Dibble et al. (2013) both argue that people also tend to show disfluency in their utterances when they convey information that may arouse negative emotions [10, 25]. To test the emotional impact on cognitive processing, Coffman and Davison (1997) asked subjects to imagine specific situations from audio recordings and express their thoughts and emotional reactions [7]. They found that subjects showed a higher proportion of interjections or filler words in emotionally provocative scenarios than in neutral ones. Thus, disfluency can be regarded as a reasonable and theoretically grounded paralinguistic criterion for determining whether a voice is highly emotional. Although some interjections occur in normal situations, the appearance and pattern of disfluency are distinctive and more frequent in emergency situations than in normal ones. Therefore, interjections, which belong to the non-stuttered disfluency category, may be considered as a feature for emotional speech recognition systems because they are quantifiable and can be analyzed.

Thus, we aimed to develop an automatic classification of negative emotional states in speech from emergency calls using the Support Vector Machine (SVM) as the classification method, with several acoustic features and the behavioral feature ‘interjection.’ We believe this may initiate the collection and building of a Korean emotional speech database for Korean emotional speech classification, in order to encourage and enable many emotion-based automated services.

This paper is organized as follows: in Section 2, we present related studies on emotional speech recognition systems. In Section 3, we explain the research method, comprising the survey procedure, the definition of emotional voice, data acquisition and annotation, and pattern classification. In Section 4, we describe the experimental procedure. In Section 5, we present the results and discussion. Our conclusion forms the final section.

2 Related work

2.1 Emotion recognition in real-life spontaneous speech

Previous emotion recognition classification studies used induced emotional corpora which were mostly recorded at high quality in unnatural surroundings [11]. The characteristics of the databases used for emotion recognition in speech are summarized in Table 2.

Table 2 Databases used in speech emotion recognition systems, revised from El Ayadi et al. (2011)

Because there are many differences between acted corpora and natural acoustic data, their classification results also differ. Induced emotional corpora recorded at high quality in unnatural surroundings tend to yield high accuracy. Sato and Obuchi (2007) reported that recognition results for acted emotions are much better than those for voice portal data [53]. On the other hand, Morrison et al. (2007) concluded that emotionally induced corpora fail to model subtle real-world emotions, which supports the argument that studies on emotion recognition in speech should be conducted using databases acquired in a natural context [34]. With recent databases containing natural emotional states occurring in real life (mainly call-center interactions), more sophisticated classification studies on emotion recognition systems have become possible [9]. Several initial studies on emotion recognition using real-life utterances have focused on the detection of negative emotion (i.e., anger), because of the significant importance of negative emotions in call-center interactions [14, 37]. Neiberg and Elenius (2008) studied the automatic detection of negative emotion (i.e., anger) occurring in real life using corpora from a Swedish telephone company, demonstrating 83% accuracy using SVM. Galanis et al. (2013) also used a corpus of 135 real-life customer service call-center conversations from a Greek telecom company for emotion classification using SVM [37]. Devillers et al. (2005) further established several classification methods (SVM, decision trees) for the recognition of non-basic and blended emotions (e.g., anger or fear/anxiety or annoyance) using two sets of corpora consisting entirely of real-life dialogues (i.e., from financial call centers and medical emergency centers) [9]. Considering these studies on real-life acoustic data, we concluded that using acoustic data from real-world situations leads to more accurate classification in emotion recognition systems, so we used real-world data covering both urgent and non-urgent situations.

2.2 Behavioral features in emotional voice recognition

While previous studies on emotion recognition have been limited largely to acoustic information [9, 43], behavioral features are increasingly being used for emotion recognition in speech. This may be due to the growing number of databases of real-life situations. Unlike induced or acted emotional corpora, emotional utterances in real life have a high probability of being accompanied by behavioral features such as laughter, sighs, and disfluency. The more natural the utterances in a database are, the more these behavioral features have to be taken into account. A study by Lee and Narayanan (2005) used three sources of information (i.e., acoustic, lexical, and discourse) for emotion recognition, focusing on negative and non-negative emotions; the results showed that using both acoustic and language information improved emotion classification [22]. Polzehl et al. (2011) elaborated anger classification in recorded emotional corpora using both behavioral and acoustic features; the fusion of both feature types showed a slight improvement in overall scores (from a baseline of 60% accuracy to a maximum of 79%) [47]. However, the features used in those studies were phrase or word usage frequencies (e.g., Emotional Salience, Bag-of-Words, Term Frequency), not paralinguistic features derived from behavior [22, 24, 47]. Devillers et al. (2004) suggest that disfluency, such as the use of fillers or pauses, can be used as a cue for emotion detection [8]. In subsequent research, Devillers et al. (2005) introduced this feature to detect anger/fear and showed the detection rate improving to 60%; they argue that behavioral features such as lexical, prosodic, disfluency, and contextual cues should be taken into consideration for accurate emotion detection in real-life speech [9]. Because of disfluency’s late adoption as a feature for emotion classification, related studies are rare. We hypothesized that including the disfluency paralinguistic feature would improve the detection rate of voices in urgent situations. Thus, interjection, a phenomenon represented in disfluency, was identified and selected for classification in this study.

3 Method

3.1 Survey of emotional voices at an emergency call center

To identify which psychologically defined negative emotions (according to the 12-point affect circumplex model [65]) in a caller’s voice correspond to a voice under urgent or emergency situations, we conducted a survey of emergency call recipients at the Seoul Emergency Operation Center. In emergency situations, a caller’s cognitive ability to answer the call recipient’s questions and describe the situation is extremely important. However, when a caller becomes too emotional, their cognitive capacity often becomes limited and they have trouble conveying the required information, which results in the call recipient struggling to take appropriate measures. Thus, the purpose of the survey was to identify and categorize the particular emotional states in which callers under urgent circumstances struggle to convey specific or detailed information. The main question in the survey was, “In which emotional situations have you had difficulties understanding the callers’ utterances in order to obtain information, and how difficult was it?” In total, 23 different emotional situations were described (e.g., “I found it difficult to understand the situation and to give the right instructions when the caller was angry”). Each situation was depicted using words that implied the caller’s emotional state. These words were adjectives selected from both the Positive and Negative Affect Schedule (PANAS) [62] and the Pleasure-Arousal-Dominance (PAD) framework [32]; the former was devised by Watson (1988; translated into Korean by Lee et al., 2003) to measure positive and negative affect [23]; the latter is a systematic framework of emotions introduced by Mehrabian and Russell (1974; translated into Korean for this research). Respondents were asked to rate their difficulty on a 5-point Likert scale (i.e., 1 = not difficult, 5 = very difficult). Questions about specific work experience, the number of calls received per day, and which gender/age group they struggled with most followed. There were a total of 68 respondents (male = 60) (Table 3).

Table 3 Detailed information of respondents in the survey

3.2 Defining emotional voice

The inquiries used in the survey first needed to be validated. To test the survey’s reliability, the internal consistency (Cronbach’s alpha) was calculated. To identify which emotional states represented in voices could be grouped as ‘fearful voices’, a factor analysis was conducted (principal component analysis with OBLIMAX rotation), with only two factors defined as valid (fear or non-fear). For a more specific definition and to narrow down the candidates, a regression analysis with ‘fear’ as the dependent variable was conducted, allowing us to verify which emotional states correlate with the ‘fear’ state. All statistics were calculated using SPSS (ver. 18) software.
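For readers without SPSS, the internal-consistency step can be reproduced with a short script. The following is a minimal sketch assuming numpy (not the SPSS procedure used in the study); the factor analysis and regression steps, which depend on SPSS-specific rotation and model options, are not reproduced here.

```python
# Minimal sketch of Cronbach's alpha over the 23 Likert items (assumption:
# ratings are stored as a respondents-by-items matrix; this is not the
# study's SPSS output, only an equivalent computation).
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_respondents, n_items) array of 1-5 Likert scores."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_variance = ratings.sum(axis=1).var(ddof=1)     # variance of respondent totals
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)
```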

Internal consistency analysis showed that the questions used in the survey were highly reliable (Cronbach’s alpha = 0.8), implying that the adjectives used in the survey corresponded well to the emotional states. The survey results revealed that emergency call recipients had the most difficulty when communicating with callers who were ‘excited’, followed by ‘temperamental’ and then ‘angry’ (difficulty ratings of 4.25, 4.16, and 4.01, respectively), followed by the other emotional states (Fig. 2).

Fig. 2

Ratings of survey results. The x-axis indicates the caller’s emotional state (e.g., embarrassed, excited) and the y-axis indicates the degree of difficulty in obtaining information from the caller in order to take appropriate action or provide instructions to alleviate the situation. Mean and standard error are displayed

The factor analysis showed that, among the 23 items describing the callers’ emotional states, 11 adjectives including ‘fear’ (i.e., terrified, distressed, sad, furious, disgust, misery, tense, scared, unpleasant, and angry) were grouped together as the factor defined as ‘fear’ in this study, with an eigenvalue of 7.068 explaining 30.729% of the variance; the remaining 12 adjectives were grouped under the factor defined as ‘non-fear’, with an eigenvalue of 3.196 explaining 13.894% of the total variance. The Kaiser-Meyer-Olkin measure of sampling adequacy was 0.753, above the commonly used threshold of 0.6, and Bartlett’s test of sphericity was also significant (χ2(253) = 779.555, p < 0.0005) (Table 4 and Fig. 3).

Table 4 Structural matrix from factor analysis result
Fig. 3

Scree plot of eigenvalue against factor number. Values were grouped into two components according to the pre-defined factor number (two): Component 1 (Fear) - fear, distressed, embarrassed, sad, furious, disgust, miserable, nervous, frightened, excited, and angry; Component 2 (Non-fear) - temperamental, irritated, humiliated, flat, guilty, depressed, gloomy, tired, calm, ashamed, restless, and anxious

Subsequently, the regression analysis used for cross-validation indicated that the words ‘terrified’ and ‘excited’ explained ‘fear’ to a statistically significant degree (F(1, 63) = 2.286, R2 = 0.414; p = 0.004 for terrified, p = 0.0001 for excited). Therefore, we drew the conclusion that the fear emotion is similar to the ‘terrified’ and ‘excited’ emotions. Considering the statistical results and the psychological emotional categorization together, the ‘fear’ represented in callers’ voices falls into the category of a high-arousal, negative, unpleasant emotional state. Accordingly, we were able to establish a norm for the phonological features expressed in a caller’s voice under emergency situations, based on the ‘fear’ emotion.

3.3 Acoustic data acquisition and emotion annotation

We obtained phone calls, defined here as ‘acoustic data’, from an emergency center at a fire station (a total of 10,319 recordings). The data were recorded at an 8000 Hz sampling rate on two channels while the call recipient communicated with the caller during the emergency. These 10,319 recordings were initially screened with a specific focus on emotional state. The total duration of the data is approximately 187.7 h, with individual call durations ranging from 3 s to 20 min. Each recorded conversation took place between a call recipient and a caller; each caller was a different individual, while the same operator handled several calls. The distribution of recording durations across calls is shown in Fig. 4.

Fig. 4

Distribution and detailed information of recording duration per call. The most frequent call durations were in the 30–60 s range

The data were annotated into fear and non-fear categories by majority vote (3 out of 5 researchers). The researchers listened to each recording, excluded those with too much background noise to identify the voice accurately, and annotated the rest as either fear or non-fear. Of the total data, 860 recordings fell into the fear category and 9452 into the non-fear category during initial annotation; the remaining recordings were tagged as neither category because they were mistaken calls. After this initial tagging, which focused only on the caller’s voice itself and not the background noise (e.g., other people’s voices), the emotional states obtained from the factor analysis of the survey were matched to the initially annotated acoustic data. As a result, of the 860 recordings initially tagged as fear, only 162 were labelled as valid fearful voices.

3.4 Classification method

There are several well-known pattern classification methods: the Artificial Neural Network (ANN), the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), the Maximum Likelihood Bayesian Classifier (MLC), Kernel Regression, K-Nearest Neighbors (KNN), the Support Vector Machine (SVM), and many others [30, 54]. Because each classifier has its own benefits and limitations, there is no specific agreement on what the best classifier for emotion recognition is. Petrushin (1999) developed an emotion recognition system with an ensemble of neural networks [43]. Xiang et al. (2012) devised a speech emotion recognition system with SVM classification, suggesting that SVM is a simple and computationally efficient machine learning algorithm given limited training data [40]. Tin Lay et al. (2003) proposed an HMM-based system for recognizing six archetypal emotions. A study by Neiberg and Elenius (2008) indicated that the GMM method has a reasonable accuracy of 83%. Recently, to increase accuracy, there has been a tendency to combine two or more classifiers [34]. Of these various methods, SVM is known to be a simple and computationally efficient machine learning algorithm, and it is widely used for pattern recognition and classification problems. Especially in cases of limited training data, SVM provides very good classification performance compared to other classifiers [42], particularly for binary classification (i.e., fear or non-fear in this study). Thus, we selected SVM as our classification method.

Taking the total acoustic data as a training set (Strain) and a testing set (Stest), each sample i in Strain and Stest is represented as a tuple (li, fi), where li ∈ {1, −1} is a class label and fi is a feature vector. The optimal hyperplane separating the two categories (e.g., fear/non-fear) is given by

$$ \mathbf{w}\bullet \mathrm{x}-\mathrm{b}=0 $$
(1)

where w is the normal vector to the hyperplane (Eq. 1). To maximize the distance between the two parallel hyperplanes that separate the data, the hyperplanes can be defined as (Eqs. 2 and 3)

$$ \mathbf{w}\bullet \mathrm{x}-\mathrm{b}=1 $$
(2)
$$ \mathbf{w}\bullet \mathrm{x}-\mathrm{b}=-1 $$
(3)

where the distance between them is \( \frac{2}{\left\Vert \mathrm{w}\right\Vert } \). Thus, maximizing the distance between the two hyperplanes means minimizing ‖w‖ while satisfying the constraint (Eq. 4)

$$ {l}_i\left(\mathbf{w}\cdot \mathrm{x}\hbox{--} \mathrm{b}\right)\ge 1 $$
(4)

since there should be no data points between hyperplanes and li ∈{1,-1} [44].
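For clarity, Eqs. 1–4 can be summarized as the standard primal form of the maximum-margin problem (a textbook formulation added here for reference; it is not drawn from [44]):

$$ \underset{\mathbf{w},\mathrm{b}}{\min }\ \frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2\kern1em \mathrm{subject}\ \mathrm{to}\kern0.5em {l}_i\left(\mathbf{w}\cdot {\mathrm{x}}_i-\mathrm{b}\right)\ge 1\kern0.5em \mathrm{for\ all}\ i $$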

An example view of the method is shown in Fig. 5, which illustrates a typical two-dimensional case; the support vectors are circled.

Fig. 5

Linear separating hyperplanes for the separable case [3]

Figure 6 shows the architecture of the real-time classifier; its main components are described below. The classifier consists of an ensemble of linear SVMs. First, the model Ti,j is loaded and the class labels are assigned. Then, the audio is segmented, keeping only the portions above an empirically preset threshold. From each segmented audio sample, the feature C(i,j) is extracted. After that, n(n−1)/2 pairwise SVMs for n classes classify the features into 1 or 0 in parallel. Finally, the average probability ψi of the binary SVMs is determined.
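As an illustration of this flow, the following is a minimal sketch, not the authors’ implementation, assuming scikit-learn SVMs trained with probability estimates enabled; segment_audio and extract_features are hypothetical helpers standing in for the segmentation and feature-extraction stages of Fig. 6.

```python
# Minimal sketch of the real-time scoring path described for Fig. 6
# (assumption: pairwise_svms are sklearn SVC models fit with probability=True).
import numpy as np

def classify_call(audio, pairwise_svms, segment_audio, extract_features,
                  energy_threshold):
    """Return a call-level score by averaging segment-level SVM probabilities."""
    segment_scores = []
    for seg in segment_audio(audio, energy_threshold):  # keep segments above threshold
        feats = extract_features(seg).reshape(1, -1)    # C(i, j) in the text
        # Each pairwise binary SVM outputs a probability; average them (psi_i).
        probs = [svm.predict_proba(feats)[0, 1] for svm in pairwise_svms]
        segment_scores.append(np.mean(probs))
    return float(np.mean(segment_scores))               # e.g., probability of 'fear'
```

For the binary fear/non-fear case of this study, n(n−1)/2 reduces to a single pairwise SVM, so the ensemble average is simply that model’s probability.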

Fig. 6

The classifier architecture. SVMi,j computes the probabilities pi and pj for labels i, j, using features f(ci,j) [44]

4 Experiments

4.1 Acoustic feature selection

There have been many studies demonstrating the correlation between emotional voices and acoustic features [17, 20, 28, 29, 48, 56, 66, 67]. Widely known characteristics of acoustic features in several high-arousal and negative discrete emotions are summarized in Table 5.

Table 5 Selected acoustic features and their representation in the specific discrete (negative) and dimensional (high-arousal) emotional accounts from Juslin & Laukka (2003) and Laukka & Juslin (2005)

Among these acoustic features, for fear/non-fear voice recognition we used F0 variability, which is the variability in the fundamental frequency of speech and corresponds to pitch variability, and voice intensity variability (i.e., variability in the energy or loudness of the voice) [19]. Because the acoustic data were phone calls from real-life situations, the callers ranged in age from children to the elderly and included both males and females. Thus, the data inevitably had significant individual variance, making normalization more difficult. For this reason, we measured the intra-speaker variability in both F0 and voice intensity for each set of acoustic data (one set = one call). Each acoustic feature was extracted by partitioning the speech signal into small frames of 20 msec. In addition to these acoustic features, MFCCs were used in this study. We empirically set 19 filters in the Mel filter bank and used the first 16 coefficients as feature vectors [51].
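As an illustration of this step, the following is a minimal sketch (not the authors’ code) of per-call feature extraction, assuming the librosa library; apart from the parameters stated above (8000 Hz sampling, 20 msec frames, 19 mel filters, 16 MFCC coefficients), the pitch range, frame overlap, and the use of the standard deviation as the variability measure are illustrative assumptions.

```python
# Minimal per-call feature extraction sketch (assumptions noted in comments).
import numpy as np
import librosa

SR = 8000                  # sampling rate of the emergency-call recordings
FRAME = int(0.020 * SR)    # 20 msec analysis frames
HOP = FRAME                # non-overlapping frames (assumption)

def extract_features(path):
    y, _ = librosa.load(path, sr=SR)

    # Intra-speaker F0 variability: std of voiced-frame pitch (pyin marks
    # unvoiced frames as NaN; the 60-400 Hz range is an assumption).
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=SR,
                            frame_length=4 * FRAME, hop_length=HOP)
    f0_variability = np.nanstd(f0)

    # Voice-intensity variability: std of frame-level RMS energy.
    rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)[0]
    intensity_variability = np.std(rms)

    # MFCCs: 19 mel filters, first 16 coefficients; averaging over frames to
    # get one vector per call is an assumption, not stated in the paper.
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=16, n_mels=19,
                                n_fft=FRAME, hop_length=HOP)
    mfcc_mean = mfcc.mean(axis=1)

    return np.concatenate(([f0_variability, intensity_variability], mfcc_mean))
```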

4.2 Behavioral feature selection

Disfluencies are one of the main behavioral features enabling us to distinguish fear from non-fear voices, and there have been many disfluency studies showing correlations with emotional speech [22]. Among the many disfluency phenomena (e.g., repetition, hesitation, pause, laughter, inspiration, expiration), we used the interjection phenomenon, which provides quantifiable parameters and yields significant results for distinguishing fear and non-fear voices. Hence, the specific interjection features represented in fearful voices were extracted for emotional voice pattern recognition. To this end, the interjection occurrence patterns in both fear-defined and non-fear-defined phonological data in the sample were observed and compared. Three components of the interjection feature were commonly observed in both fear and non-fear voices: a pre-pause (a), a quick phonological segment (b), and a post-pause (c). We empirically set the average duration ranges of the first two components (the pre-pause and the quick phonological segment) as 20–160 msec and 60–220 msec, respectively (Fig. 7).

Fig. 7

Interjection spectrum during utterances and the specific phases represented in fear situations (left: the utterance section in which the interjection occurred; right: magnified view of the interjection from the left). An interjection is divided into three phases: pre-pause, quick phonological segment, and post-pause. The average durations of the phases were empirically set at 20~160 msec for the pre-pause and 60~220 msec for the quick phonological segment, with no specific average duration for the post-pause
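The following is a minimal sketch, not the authors’ implementation, of how interjection candidates could be flagged from a frame-level energy contour using the three-phase structure described above; the duration windows for the pre-pause (20–160 msec) and quick phonological segment (60–220 msec) follow the paper, while the energy threshold and the minimum post-pause length are illustrative assumptions.

```python
# Minimal interjection-candidate detector over a 20 msec energy contour.
import numpy as np

FRAME_MS = 20  # frame step used for the energy contour

def interjection_candidates(rms, silence_thresh, min_post_pause_ms=60):
    """rms: 1-D array of frame RMS energies (one value per 20 msec frame)."""
    speech = rms > silence_thresh          # True = phonation, False = pause
    # Run-length encode the frame sequence into (is_speech, duration_ms) runs.
    runs, start = [], 0
    for i in range(1, len(speech) + 1):
        if i == len(speech) or speech[i] != speech[start]:
            runs.append((bool(speech[start]), (i - start) * FRAME_MS))
            start = i
    # Look for pause / speech / pause triples matching the duration windows.
    hits = 0
    for (s1, d1), (s2, d2), (s3, d3) in zip(runs, runs[1:], runs[2:]):
        if (not s1 and 20 <= d1 <= 160 and            # pre-pause
                s2 and 60 <= d2 <= 220 and            # quick phonological segment
                not s3 and d3 >= min_post_pause_ms):  # post-pause (assumed minimum)
            hits += 1
    return hits
```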

There was a general difference between fear and non-fear voices in the durations of the three interjection components, and interjections occurred less often in non-fear voices than in fear voices. The average durations of the three components differed between fear and non-fear voices (96.8 msec vs. 231.6 msec for the pre-pause, 160.8 msec vs. 249.1 msec for the quick phonological segment, and 224.7 msec vs. 288.3 msec for the post-pause, respectively). To determine whether these differences were significant, the durations of the pre-pause, quick phonological segment, and post-pause were statistically compared. Paired t-test results showed that the durations of two components of the interjection (pre-pause and quick phonological segment) were statistically different between fear and non-fear voices (t(9) = 1.86, p < 0.05 and t(21) = 2.99, p < 0.005, respectively; Fig. 8). Therefore, we set these durations as standards for distinguishing fear and non-fear voices in an interjection.

Fig. 8

Duration of each interjection phase for fear (blue) vs. non-fear (orange) voices (msec). Mean and standard error are displayed. Paired t-tests showed statistically significant differences in the duration of the pre-pause (left) and the quick phonological segment (middle) between fear and non-fear voices (*p < 0.05, **p < 0.005)
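A minimal sketch of the paired comparison reported above, assuming SciPy; the duration arrays are placeholders supplied by the caller, not the study’s measurements.

```python
# Minimal sketch of the per-phase paired t-tests shown in Fig. 8.
from scipy import stats

def compare_phase_durations(phase_durations):
    """phase_durations: dict mapping phase name -> (fear_ms, nonfear_ms) arrays
    of matched interjection durations. Returns a (t, p) result per phase."""
    return {phase: stats.ttest_rel(fear_ms, nonfear_ms)
            for phase, (fear_ms, nonfear_ms) in phase_durations.items()}
```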

Disfluencies such as interjections and word repetition occur even in normal situations, but these phenomena appear more frequently in situations where the cognitive demand is greater [6, 63], and the greater cognitive load can be caused by negative arousal emotions [10]. Thus, we expected that interjections would be observed more frequently in fear voices than in non-fear voices, and this was confirmed during behavioral feature selection. In addition, we found a distinctive property of interjections in fear compared with non-fear voices: the durations of the pre-pause and the quick phonological segment were shorter in fear voices than in non-fear voices. We speculate that these differences might be due to the tension between the greater cognitive load elicited by the caller’s negative arousal emotion and the caller’s pressing sense of responsibility for the urgent situation. Under emergency circumstances, a caller’s cognitive function temporarily becomes limited, so it takes longer to produce the appropriate words or sentences needed to convey the required information; at the same time, the caller may feel pressure to speak as quickly and accurately as possible. This conflict may give rise to the shorter average durations of the pre-pause and the quick phonological segment typical of fear voices.

4.3 Classification procedure

Six feature sets (i.e., F0 variability only, voice intensity variability only, MFCCs only, interjection only, the combination of voice intensity variability with MFCCs, and the combination of voice intensity variability, MFCCs, and interjection) were produced, and then SVM was used to classify the emotion categories. The results are shown as a confusion matrix in Table 6, where “condition positive” in the true condition indicates the fear state and “condition negative” indicates the non-fear state. Accuracy counts only the true positives and true negatives for the fear and non-fear states, respectively (Table 6).

Table 6 Confusion matrix for four-fold cross-validation support vector machines and calculation for accuracy with 162 data sets for each state (i.e., fear/non-fear)
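For illustration, the following is a minimal sketch (not the authors’ pipeline) of the four-fold cross-validated SVM comparison over the six feature sets, assuming scikit-learn; the feature table df and its column names are hypothetical stand-ins for the features extracted in Sections 4.1 and 4.2.

```python
# Minimal sketch of the feature-set comparison with 4-fold cross-validation.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

MFCC_COLS = [f"mfcc_{k}" for k in range(16)]   # hypothetical column names
FEATURE_SETS = {
    "F0 variability": ["f0_var"],
    "Voice intensity variability": ["int_var"],
    "MFCC": MFCC_COLS,
    "Interjection": ["interjection"],
    "Intensity + MFCC": ["int_var"] + MFCC_COLS,
    "Intensity + MFCC + Interjection": ["int_var", "interjection"] + MFCC_COLS,
}

def evaluate(df, labels):
    """df: per-call feature table (pandas DataFrame); labels: 1 = fear, 0 = non-fear."""
    for name, cols in FEATURE_SETS.items():
        clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
        pred = cross_val_predict(clf, df[cols].to_numpy(), labels, cv=4)
        print(name, accuracy_score(labels, pred))
        print(confusion_matrix(labels, pred))
```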

5 Results and discussion

Table 7 shows the accuracy rate for each model on the 162 sets of test acoustic data used in the current study. The MFCC model displayed the best accuracy rate (80.2%). The voice intensity variability model achieved 68% accuracy, the interjection model 58.9%, and the F0 variability model 48%. Of the combination models, the combination of voice intensity variability and MFCCs reached 66.5% accuracy, while the combination of voice intensity variability, MFCCs, and interjection reached 73% (Table 7).

Table 7 Accuracy rate results for each model in the test set data

Our results showed that the MFCC feature model displayed the best accuracy of any single feature or combination of features. The MFCC feature is widely used in speech recognition and emotional speech recognition systems. Also, contradicting our expectation that the interjection behavioral feature would provide a reasonable criterion for greater improvement in emotional speech recognition, it reached only the moderate accuracy stated above. We attribute these results to the following reasons. The accuracy rates for all models were quite low compared to the results of other studies classifying emotion in speech, which achieved 50 to 95% accuracy [5, 11, 14, 22, 24, 28, 34, 37, 38, 40, 42, 43, 45, 47, 48, 50, 54, 55, 57, 58, 59, 61, 66]. Unlike previous emotional speech recognition studies using databases consisting mostly of individual sentences recorded with high-quality microphones in unnatural, non-real-life surroundings [43], we used real acoustic data (i.e., recordings from an emergency call center) from natural surroundings, including background noise, in which the emotions of real emergency situations were naturally induced. In addition, the data length varied from 3 s to 1156 s (Fig. 4), so the data contained a significant amount of contextual information and wide fluctuations in the callers’ emotional states. These characteristics may have made accurate pattern recognition difficult and led to the overall low accuracy rates. This limitation needs to be surmounted.

The simplified classification scheme (i.e., fear/non-fear) could be another reason for the overall low accuracy rate. Only two categories were needed for efficient and effective operations at an emergency call center (‘neutral’ was considered part of the non-fear category). This simple distinction, associated with many acoustic features, led to broader modeling than we expected and may have made it more difficult to distinguish emotions in speech. Nevertheless, the MFCC feature model showed the highest accuracy rate at 80.2%. In fact, among the few studies employing real acoustic data, the MFCC is considered the most important acoustic feature for emotional speech recognition systems [58], and some research using only the MFCC feature has yielded high precision [53]. Considering the nature of the data used here and the simplified categories, we verified that the MFCC is a vital acoustic feature for emotional speech recognition systems, even for acoustic data from real-life situations in which emotions were naturally induced. Regarding the interjection feature model: although this model reached only slightly above chance-level accuracy (i.e., 58.9%), it is encouraging that a behavioral feature like interjection was introduced for emotional speech recognition. There has been some research considering disfluencies as a feature for emotional speech recognition [2, 13, 35]; inspired by those studies, a few researchers have considered behavioral features such as laughter, stutters, and fillers in emotional speech classification, but such studies are very few in number [9]. This study applied a behavioral feature to emotional speech recognition on real acoustic data, not acted recordings, in which real-life callers became emotional. Moreover, even though the result did not show as high an accuracy rate as the best model, the combination of features including interjection (i.e., voice intensity variability, MFCCs, and interjection) achieved a better accuracy rate (73%) than the combination without it (66.5%; i.e., voice intensity variability and MFCCs). This implies that behavioral features are useful for improving speech emotion recognition systems, especially when using real-world acoustic data from natural situations.

6 Conclusion

In this study, we aimed to determine whether the naturally induced emotional states in acoustic data obtained from an emergency call center correspond to the fear emotion. To do so, we first identified the emotional states of callers in urgent situations and sorted the audio files into two emotional categories: fear and non-fear. Then, using SVM, we attempted to automatically classify the speech data into fear/non-fear voices. Acoustic features widely used for emotional speech recognition, including F0 variability, voice intensity variability, and MFCCs, were used for classification. Importantly, we introduced interjection, one of the disfluencies in utterances often observed when a speaker becomes negatively emotional, as a behavioral feature by which to classify the fear/non-fear state. We obtained an 80.2% accuracy rate using only MFCCs, verifying that the MFCC is an important feature for emotional speech recognition systems. Because we confined the classification to fear or non-fear, this may have lowered the accuracy of our classification system; even so, dealing with real, emotionally aroused acoustic data in the fear state is meaningful. We also found that, even though its accuracy was lower than that of the MFCC model, the combination of features that included the interjection behavioral feature improved classification accuracy compared with models that used only acoustic feature combinations. Because emotion recognition in speech occurring in natural contexts is increasingly required, classification using behavioral features such as interjection should be considered in further study. To develop emotion recognition systems further, our work needs to be extended to other real emotional acoustic data.