
1 Introduction

Several nuances are embedded in human speech. A spoken utterance can be analyzed for what was spoken (speech recognition), how it was spoken (emotion recognition), and who spoke it (speaker biometrics). These three aspects form the basis of most of the work being actively carried out by speech researchers. In this paper, we concentrate on the how aspect of a spoken utterance, namely emotion recognition.

Perceiving emotions from different real-life signals is a natural and inherent characteristic of a human being. For this reason emotion plays a very important role in intelligent human–computer interaction. Machine perception of human emotion not only helps machines communicate more humanely, it also helps improve the performance of associated technologies like Automatic Speech Recognition (ASR) and Speaker Identification (SI).

With the mushrooming of the services industry there has been significant growth in voice-based call centers (VbCC), where identifying emotion in spoken speech has gained importance. The primary goal of a VbCC is to maintain a high level of customer satisfaction, which means understanding the customer just in time (in real time and automatically) and deciding how to communicate with the customer (what to say and how to say it). While several things related to the customer are available a priori, thanks to advances in data mining, the one thing that is crucial is the emotion of the customer at that point in time, so that the agent can plan what and how to converse to keep the customer happy, and also know when to make a pitch to up-sell.

Much of the initial emotion recognition research has been successfully validated on acted speech (for example [8,9,10,11]). Emotions expressed by trained actors are easy to recognize, primarily because they are expressed explicitly and dramatically, with significant intensity. Such deliberately magnified emotions can be easily distinguished from one another. However, when the expression of an emotion is not explicit or loud, it is very difficult to distinguish one emotion of the speaker from another. Such mild, implicitly expressed emotion is most likely to occur in spontaneous, natural, day-to-day conversational speech.

The rest of the paper is organized as follows. In Sect. 2 we dwell on the different challenges facing emotion recognition in spontaneous speech. In Sect. 3 we propose a framework that has provision to use prior knowledge to address emotion recognition in spontaneous speech. We conclude in Sect. 4.

2 Challenges

Historically, emotion has been represented using two affective dimensions, namely arousal (also referred to as activation) and valence. Note that any point in this 2D space (Fig. 1) can be looked upon as a vector and represents an emotion. Table 1 gives the mapping of a few emotions in terms of the affective states. For example, +ve valence and +ve arousal (first quadrant) would represent happy, while −ve valence and +ve arousal could represent anger. We now enumerate the challenges in machine recognition of emotion in spontaneous speech.

Fig. 1 Emotions expressed in the (arousal, valence) space. Emotion in spontaneous speech is subtle compared to acted speech

Table 1 Emotions expressed in the (arousal, valence) space. Map of the affective states to known human emotion

 

Intensity of Emotion in Spontaneous Speech:

Acted speech usually exhibits a higher degree of intensity in both the arousal and valence dimensions, resulting in an emotion vector of larger radius compared to spontaneous (non-acted) speech. For this reason, it is easy to mis-recognize one emotion for another in spontaneous speech. Consequently, if the first quadrant (Fig. 1) represents emotion \(E_1\) and the fourth quadrant represents emotion \(E_2\), then only a small error (\(\varDelta {r}\)) is needed to mis-recognize \(E_1\) as \(E_2\) (and vice versa) in spontaneous speech, whereas a much larger error in judgment (\(\varDelta {t}\)) is required for acted speech. For this reason, recognizing emotion in spontaneous speech becomes challenging and is more prone to misrecognition.
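As a hedged numeric illustration (the radii \(r = 1.0\) and \(r = 0.3\) and the quadrant angles below are assumed for exposition and are not values taken from Fig. 1): if \(E_1\) lies at \(45^{\circ }\) in the first quadrant and \(E_2\) at \(-45^{\circ }\) in the fourth quadrant, the separation between the two emotion vectors is the chord

$$\begin{aligned} \Vert E_1 - E_2\Vert = 2r\sin \left( \frac{\theta }{2}\right) : \quad 2 \times 1.0 \times \sin 45^{\circ } \approx 1.41 \;\; \text{(acted)} \quad \text{vs.} \quad 2 \times 0.3 \times \sin 45^{\circ } \approx 0.42 \;\; \text{(spontaneous)}, \end{aligned}$$

where \(\theta = 90^{\circ }\) is the angle between the two vectors. The separation for spontaneous speech is roughly a third of that for acted speech, so a correspondingly smaller perturbation \(\varDelta {r}\) in the estimated vector is enough to confuse \(E_1\) with \(E_2\).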

What Works for Acted Speech Does Not Work for Spontaneous Speech:

Though speech emotion recognition has a rich literature, most of the work has been done on acted speech using machine learning-based systems. Namely, one trains a system (for example, a Support Vector Machine, an Artificial Neural Network, or a Deep Neural Network) on a set of annotated speech data and then classifies the test dataset using the trained system. Speech emotion recognition systems that perform with high accuracies on acted speech datasets do not perform as well on realistic natural speech [12] because of the mismatch between the train (acted) and test (spontaneous) datasets. This is another challenge in addressing spontaneous speech emotion recognition.
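The following is a minimal sketch of this conventional train-and-classify pipeline, assuming pre-computed fixed-length acoustic feature vectors per utterance; the feature dimension, label set, and the choice of scikit-learn are placeholders and are not prescribed by the paper.

```python
# Minimal sketch of the conventional pipeline: train a classifier on
# emotion-annotated (acted) speech features, then classify unseen utterances.
# X_train, y_train, X_test stand in for real acoustic feature matrices
# (one fixed-length feature vector per utterance) and emotion labels.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 40))      # placeholder acoustic features
y_train = rng.integers(0, 4, size=200)    # placeholder emotion labels (4 classes)
X_test = rng.normal(size=(20, 40))

# SVM with probability=True so that posterior-like scores are also available.
clf = make_pipeline(StandardScaler(), SVC(probability=True))
clf.fit(X_train, y_train)
predicted_emotions = clf.predict(X_test)
```

Any other probabilistic classifier (for example, a neural network with a softmax output) could be substituted for the SVM without changing the overall pipeline.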

Clearly, this challenge could be addressed if there existed an emotion-annotated spontaneous speech dataset that could be used to train a spontaneous speech emotion recognition system.

Spontaneous Speech Corpus:

For any given spoken utterance, there are two points of view in terms of associating an emotion with the utterance, namely (a) encoded emotion and (b) decoded emotion. The emotional state of the speaker who uttered the audio is called the encoded emotion, while the interpretation of the same audio by a listener, who is different from the speaker, is called the decoded emotion. For example, the audio s(t) in Fig. 2 can have two emotion labels associated with it. When the speaker annotates it and assigns an emotion label, it is called the encoded emotion of s(t); when a listener (different from the speaker) listens to s(t) and assigns an emotion label, it is called the decoded emotion.

For acted speech the decoded and encoded emotions are more likely to be the same; however, for spontaneous speech there is bound to be a wide gap between the encoded and decoded emotion. Building a realistic spontaneous speech corpus would need a person to speak in a certain emotional state and/or annotate what he or she spoke; generating such a realistic spontaneous speech corpus is extremely difficult and is a huge challenge.

Annotating Spontaneous Speech Corpus:

The next best thing is to have a decoded spontaneous speech corpus. However, one of the problems associated with emotion recognition of spontaneous speech is the lack of a reliably emotion-annotated spontaneous speech database suitable for emotion recognition. The difficulty in annotating a spontaneous speech corpus stems largely from the lower degree of emotion expression (as seen in Fig. 1).

In [1] we showed that there is a fair amount of disagreement among evaluators when they are asked to annotate spontaneous spoken utterances. The disagreement, however, decreases when the evaluators are provided with the contextual knowledge associated with the utterance. Fleiss’ Kappa score [13, 14] was used to determine the agreement between evaluators. When the evaluators were asked to annotate spontaneous speech (decoded emotion) the agreement was 0.12, while for the same set of evaluators, when provided with the context associated with the spontaneous speech, the agreement increased to 0.65. This suggests that there is a higher degree of agreement between evaluators when they are provided with the associated contextual knowledge while annotating spontaneous speech.
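For reference, the agreement statistic can be computed as in the hedged sketch below, which uses statsmodels on made-up annotation data rather than the actual evaluator labels from [1].

```python
# Hedged sketch: computing Fleiss' Kappa for inter-annotator agreement.
# `annotations` is an (n_utterances x n_annotators) array of categorical
# emotion labels; the values below are illustrative, not the data from [1].
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

annotations = np.array([
    ["happy",   "happy",   "neutral"],
    ["anger",   "sad",     "anger"],
    ["neutral", "neutral", "neutral"],
    ["happy",   "anger",   "sad"],
])

# aggregate_raters converts label assignments into per-item category counts,
# which is the table format expected by fleiss_kappa.
counts, _categories = aggregate_raters(annotations)
print("Fleiss' kappa:", fleiss_kappa(counts))
```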

 

Fig. 2 Decoded versus encoded emotions

As illustrated above, several challenges exist in spontaneous speech emotion recognition. Clearly, the literature that deals with emotion recognition of acted speech does not help with spontaneous speech; however, as observed, the use of prior knowledge can help in recognizing emotions in spontaneous speech. In the next section we propose a framework for recognizing emotions in spontaneous speech based on this observation.

3 A Framework for Spontaneous Speech Emotion Recognition

Let \({s}(t)\) be a speech signal, say of duration T seconds and let

$$\begin{aligned} \mathcal{E} = \{E_1 = \text{anger}, E_2 = \text{happy}, \ldots , E_n\} \end{aligned}$$

be the set of n emotion labels. In the literature, the emotion of the speech signal \({s}(t)\) is computed as

$$\begin{aligned} {\mu }_{k,{s}(t)}^{p} = P(E_k | {s}(t)) = \frac{P({s}(t)| E_k) P(E_k)}{P({s}(t))} \end{aligned}$$
(1)

where \({\mu }_{k,{s}(t)}^{p} = P(E_k | {s}(t))\) is the posterior probability or score associated with \({s}(t)\) being labeled as emotion \(E_k \in \mathcal{E}\). Generally, these posteriors are calculated by learning the likelihood probabilities from a reliable training dataset using some machine learning algorithm. Note that in practice the features \(\mathcal{F}({s}(t))\) extracted from the speech signal are used in (1) instead of the actual raw speech signal \({s}(t)\). Conventionally, the emotion of the speech signal \({s}(t)\) is given by

$$\begin{aligned} E_{k*} = \arg \max _{1 \le k \le n} \{ {\mu }_{k,{s}(t)}^{p} \}. \end{aligned}$$
(2)

Note that \(E_{k*} \in \mathcal{E}\) is the estimated emotion of the signal \({s}(t)\).
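A hedged sketch of (1) and (2) is given below, assuming a trained probabilistic classifier that exposes class posteriors (such as the pipeline sketched in Sect. 2) and a pre-computed feature vector \(\mathcal{F}({s}(t))\); the function and argument names are placeholders.

```python
# Sketch of Eqs. (1)-(2): obtain the posterior scores mu^p_{k,s(t)} for one
# utterance and pick the emotion label with the maximum posterior.
import numpy as np

def recognize_emotion(clf, features, emotion_labels):
    """features: 1-D feature vector F(s(t)); clf: trained probabilistic classifier."""
    posteriors = clf.predict_proba(features.reshape(1, -1))[0]  # Eq. (1)
    k_star = int(np.argmax(posteriors))                         # Eq. (2)
    return emotion_labels[k_star], posteriors
```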

This process of emotion extraction works well for acted speech because the entire speech utterance carries one emotion. However, the assumption that the complete speech signal carries a single emotion is seldom true for spontaneous conversational speech (for example, a call center audio recording between an agent and a customer). As mentioned in an earlier section, additional challenges exist in that emotions in spontaneous speech are not explicitly demonstrated and hence cannot be robustly identified even by human annotators in the absence of sufficient context surrounding the spoken utterance.

These observations lead us to a novel framework for recognizing emotions in spontaneous speech [1]. The framework takes care of the facts that (a) the emotion within a single speech signal may vary over time and (b) human annotators recognize emotions better when they are provided with context associated with the speech signal.

The essential idea of the framework is to compute emotion for smaller duration (\(2 \varDelta \tau \)) segments of the speech signal (\({{s}_{{\tau }}}(t)\)), namely,

$$ {{s}_{{\tau }}}(t) = {s}(t) \times \left\{ U({\tau }- \varDelta {\tau }) - U({\tau }+ \varDelta {\tau }) \right\} $$

where U(t) is a unit step function defined as

$$ U(t) = \begin{cases} 1 & \text{for } t \ge 0 \\ 0 & \text{for } t < 0, \end{cases} $$

instead of computing the emotion for the complete signal (\({s}(t)\)). Note that (a) \({{s}_{{\tau }}}(t) \subset {s}(t)\) and is of length \(2\varDelta {\tau }\) and (b) \({\tau }\in [0,T]\). As done conventionally, the emotion of \({{s}_{{\tau }}}(t)\) is computed as earlier, namely,
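A minimal sketch of this segmentation step follows; the sampling rate, the half-window \(\varDelta \tau \), and the hop between consecutive segments are assumptions made for illustration.

```python
# Sketch: slice s(t) into segments of length 2*delta_tau seconds, which is
# equivalent to multiplying s(t) by the difference of shifted unit steps.
import numpy as np

def segment_signal(s, sample_rate, delta_tau, hop=None):
    """Yield consecutive segments s_tau(t) of duration 2*delta_tau seconds."""
    seg_len = int(2 * delta_tau * sample_rate)
    hop = seg_len if hop is None else int(hop * sample_rate)
    for start in range(0, max(len(s) - seg_len + 1, 1), hop):
        yield s[start:start + seg_len]

# Example: a 10-second signal at 16 kHz split into non-overlapping 2-second segments.
sr = 16000
s = np.random.default_rng(0).normal(size=10 * sr)
segments = list(segment_signal(s, sr, delta_tau=1.0))
```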

$$\begin{aligned} {\mu }^p_{k,{{s}_{{\tau }}}(t)} = P({E}_k | {{s}_{{\tau }}}(t)) \end{aligned}$$
(3)

for \(k=1, 2, \ldots , n\). In addition, however, we also make use of the emotions computed from the previous \({\eta }\) speech segments, namely \( {\mu }^p_{k,{s}_{{\tau }-v}(t)}\) for \(v=1, 2, \ldots , {\eta }\). Thus, the posterior score associated with the speech segment \({{s}_{{\tau }}}(t)\) being labeled \({E}_k\) becomes

$$\begin{aligned} '{\mu }^p_{k,{{s}_{{\tau }}}(t)} = {\mu }^p_{k,{{s}_{{\tau }}}(t)} + \sum _{{v}=1}^{{\eta }} {\omega }_{v}{\mu }^p_{k,{s}_{{\tau }-{v}}(t)} \end{aligned}$$
(4)

where \({\omega }_1, {\omega }_2, \ldots , {\omega }_{\eta }\) are monotonically decreasing weights, all less than 1. Equation (4) ensures that the posterior score of the speech segment \({{s}_{{\tau }}}(t)\) is influenced by the weighted sum of the posterior scores of the previous speech segments. This is generally true of spontaneous conversational speech, where the emotion of the speaker is influenced by the emotions experienced earlier in the conversation.
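A hedged sketch of (4) is given below, assuming the per-segment posterior vectors have already been computed and taking geometric decay \({\omega }_{v} = {\omega }^{v}\) with \({\omega } < 1\) as one possible choice of monotonically decreasing weights; the paper does not fix a particular weight schedule.

```python
# Sketch of Eq. (4): bias the posterior of the current segment with a weighted
# sum of the posteriors of the previous eta segments.
import numpy as np

def history_weighted_posterior(posterior_history, eta=3, omega=0.5):
    """posterior_history: list of per-segment posterior vectors, newest last."""
    current = np.asarray(posterior_history[-1], dtype=float)
    biased = current.copy()
    previous = posterior_history[:-1][::-1]        # most recent previous segment first
    for v, past in enumerate(previous[:eta], start=1):
        biased += (omega ** v) * np.asarray(past, dtype=float)   # omega_v * mu^p
    return biased
```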

Further, the output posterior scores from the emotion recognizer, namely \({\mu }^p_{k,{{s}_{{\tau }}}(t)}\) (3), are given as input to a knowledge-based system that modifies the scores depending upon the time lapse (how far \(\tau \) is from the beginning of the spoken conversation) of the speech segment (utterance) in the audio signal. This can be represented as

$$\begin{aligned} {\mu }_{k,{{s}_{{\tau }}}(t)}^{{\kappa }}=w_{{\tau }}{\mu }_{k,{{s}_{{\tau }}}(t)}^{p} \end{aligned}$$
(5)

where \({\mu }_{k,{{s}_{{\tau }}}(t)}^{p}\) (3) and \(w_{{\tau }}\) (see Fig. 3) are the posterior probability score and the weight at time instant \({\tau }\), respectively, and \({\mu }_{k,{{s}_{{\tau }}}(t)}^{{\kappa }}\) is the emotion score computed based on knowledge.

The motivation for (5) is based on the observation that the duration of the audio call plays an important role in the induction of (or change in) the user’s emotion. As mentioned in [1], the weight \(w_{{\tau }}\) is expected to increase or decay exponentially as \({\tau }\) increases, depending upon the type of emotion. As an example (see Fig. 3), \(w_{{\tau }}\) for anger and sad close to the end of the conversation is expected to be larger than for the same emotion at the beginning of the call. As seen in Fig. 3, the weight components are expected to increase exponentially with the time index for anger and sad, while \(w_{{\tau }}\) is expected to decrease exponentially with the time index for happy.

Fig. 3 Knowledge regarding the time lapse of the utterances in the call. The weights \(w_{{\tau }}\) of emotions like happy decrease with \({\tau }\), while the weights increase for emotions like anger and sad
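As a hedged illustration of such a weighting scheme (the rate constant and the normalization below are assumptions; [1] only indicates that \(w_{{\tau }}\) grows exponentially for emotions like anger and sad and decays exponentially for emotions like happy):

```python
# Sketch of the knowledge weights w_tau of Eq. (5): exponentially increasing
# for anger/sad and exponentially decreasing for happy as the call progresses.
import numpy as np

def knowledge_weight(emotion, tau, call_duration, rate=2.0):
    """Return w_tau for a segment centred at time tau within a call."""
    progress = tau / call_duration            # 0 at the start, 1 at the end of the call
    if emotion in ("anger", "sad"):
        return float(np.exp(rate * progress) / np.exp(rate))    # grows towards 1
    if emotion == "happy":
        return float(np.exp(-rate * progress))                  # decays from 1
    return 1.0                                # leave other emotions unchanged
```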

We can combine \({\mu }_k^{p}\) and \({\mu }_k^{{\kappa }}\) to better estimate the emotion of the spontaneous utterance \({{s}_{{\tau }}}(t)\) as

$$\begin{aligned} e^{k} = \lambda _p ('{\mu }_k^{p}) + \lambda _{\kappa }({\mu }_k^{{\kappa }}) \end{aligned}$$
(6)

where \(\lambda _{\kappa }= 1 - \lambda _p\). The framework makes use of knowledge when \(\lambda _{\kappa }\ne 0\). The emotion of the spontaneous speech utterance \({{s}_{{\tau }}}(t)\) with the incorporation of knowledge (\({\mu }_k^{{\kappa }}\)) is then given by

$$\begin{aligned} E_{k*} = \arg \max _{1 \le k \le n} \left\{ e^{k} \right\} . \end{aligned}$$
(7)

Knowledge regarding the time lapse of the utterance in an audio call, especially in a conversational system, provides useful information for recognizing the emotion of the speaker. Therefore, incorporation of this knowledge is useful in extracting the actual emotion of a user. The proposed framework for spontaneous speech emotion recognition is shown in Fig. 4.

Fig. 4 Proposed framework for spontaneous speech emotion recognition
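Putting the pieces together, the following is a hedged end-to-end sketch of (4)–(7), reusing the hypothetical helpers sketched above; the emotion label set and the value of \(\lambda _p\) are placeholders, not choices made in [1].

```python
# Sketch of Eqs. (5)-(7): scale the posterior by the knowledge weight, combine
# it with the history-biased posterior of Eq. (4), and pick the maximum-scoring emotion.
import numpy as np

EMOTIONS = ["anger", "happy", "sad", "neutral"]   # placeholder label set

def combined_emotion(posterior_history, tau, call_duration, lambda_p=0.7):
    """posterior_history: per-segment posterior vectors for s_tau(t), newest last."""
    mu_p_biased = history_weighted_posterior(posterior_history)          # Eq. (4)
    mu_p = np.asarray(posterior_history[-1], dtype=float)
    w_tau = np.array([knowledge_weight(e, tau, call_duration) for e in EMOTIONS])
    mu_kappa = w_tau * mu_p                                              # Eq. (5)
    scores = lambda_p * mu_p_biased + (1 - lambda_p) * mu_kappa          # Eq. (6)
    return EMOTIONS[int(np.argmax(scores))]                              # Eq. (7)
```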

As shown in [1], there is a performance improvement in recognizing emotion in spontaneous speech when this framework is used. For different classifiers, they report an absolute improvement of almost \(11\%\) in emotion recognition for interactive voice response type calls, and the performance further improves to \(14\%\) absolute for real call center conversations.

4 Conclusion

Emotion recognition has a rich literature for acted speech, and this leads to the belief that the techniques that work well for acted speech can be directly used for spontaneous speech. However, there are several dissimilarities between acted and spontaneous speech which do not allow one to use techniques and algorithms that work well for acted speech to recognize emotion in spontaneous speech. Emotion recognition techniques are generally machine learning based algorithms which (a) require a sufficient amount of training data and (b) require the test and the train data to match. The main challenge in using trained models that work for acted speech on spontaneous speech is this mismatched condition. Additionally, in the case of spontaneous speech it is very challenging to (a) generate spontaneous speech data and (b) obtain robust annotation of the speech data. For this reason, techniques and algorithms that work best for spontaneous speech cannot be built afresh.

In this paper, we first established the importance of spontaneous speech emotion recognition and then enumerated several challenges and hurdles faced during emotion recognition in spontaneous speech. Based on our previous work, we proposed a framework that exploits a priori knowledge to enable reliable spontaneous speech emotion recognition. The main idea behind the proposed framework is to assist the machine learning algorithm with prior knowledge associated with the spontaneous speech. It has been shown [1] that this framework can improve the emotion recognition accuracies of spontaneous speech by as much as 11–14% in absolute terms.