
1 Introduction

Current state-of-the-art synthesizers support the simulation of specific speaking styles in one way or another. A specific form of speaking style is emotional speech. For decades, the literature has described strategies for simulating a single emotional expression, specified either by a categorical designation or by a single point in an emotion-dimensional space; see [1] or [12] for more recent examples. The expression of only one emotional state in speech is a first step towards more naturalness. Nonetheless, it is an over-simplification to model only one emotional state at any given time. In the real world, many situations are conceivable in which at least two emotion-related states influence the speaking style, especially when the term “emotion” is broadened to “emotion-related state”, i.e. when it includes mood, alertness or personality.

Psychologists have been very interested in the topic of mixed or blended emotions, emphatically debating the degree to which conflicting emotions can be simultaneously experienced. One perspective suggests that the ability to experience conflicting emotions simultaneously is limited, as positive and negative emotions represent opposite dimensions on a bipolar scale. A second perspective argues the opposite, namely, that emotional valence is represented by two independent dimensions. Thus, not only can one simultaneously experience conflicting emotions, such a joint experience may be natural and frequently occurring [2, 18]. For the case of facial expressions, mixed emotions have been successfully acted by providing situational descriptions and prototypical pictures [8], and even models to blend basic emotions exist [13].

The research on the simulation of affective speaking styles with speech synthesis has a long history [3, 14, 15] and started with the simulation of a single speaking style or emotional expression. Mixing two speaking styles has also been studied later; for example, [17] interpolated the HMM models of two different emotional speaking styles to generate a mixed expression. However, they did not report whether listeners actually perceived the result as a mixture of the two emotions.

In a similar fashion, [11] learned parameter clusters for HMM speech synthesis to model speaker identity and emotional expression. This method made it possible to model expression even for speakers whose models were not trained on emotional data, by using prosodic models trained on speakers for whom expressive samples were available, while the spectral features were meant to encode the speaker identity. The foremost aim of this research was thus to transplant expressive speaking styles from one source speaker to another.

To our knowledge, no one has so far reported an attempt to display more than one affective state at the same time without interpolating between speaker expression models.

We describe an approach to simulate more than one emotion utilizing the open-source program “Emofilt”, which itself is based on the diphone synthesizer “Mbrola” [9] as well as a text-to-phoneme converter, for example the text-to-speech framework “Mary” [16]. The approach is based on the idea of mixing configurations for several feature categories during the synthesis process; feature categories are, for example, articulation, phonation, pitch or duration parameters. We evaluated this approach with a perception experiment. In a systematic combination, each of Darwin’s four “basic emotions” (joy, sadness, fear and anger) was combined with every other emotion and used as an emotional model to synthesize four target phrases taken from the Berlin emotional database EmoDB. The German target phrases were generated with a male and a female Mbrola voice (de6 and de7).

This article is structured as follows. Firstly we describe the speech synthesizer in Sect. 2. We then report on the way we approached the simultaneous simulation of two affective states in Sect. 3. Section 4 describes the perception experiment that was used to verify our approach. Lastly, Sect. 5 discusses the results and insights that could be gained from the experiment. We conclude the paper with an overview and some ideas for improvements in Sect. 6.

2 Emofilt

Emofilt [4] is a software program intended to simulate emotional arousal with speech synthesis based on the free-for-non-commercial-use MBROLA synthesis engine [9]. It acts as a transformer between the phonetisation and the speech-generation component. Originally developed at the Technical University of Berlin in 1998, it was revived in 2002 as an open-source project and completely rewritten in the Java programming language.

Fig. 1. Emofilt Developer Graphical User Interface.

The input format for Emofilt is MBROLA’s PHO-format. Each phoneme is represented by one line, consisting of the phoneme’s name and its duration (in milliseconds). This may optionally be followed by a set of \(F_0\) description tuples, each consisting of a time value denoting a percentage of the phoneme’s duration and the \(F_0\) value (in Hertz) at that point. Here is an example of such a file:

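The original example figure is not reproduced here; the fragment below is an illustrative sketch of the documented PHO format, with phoneme names, durations and pitch values invented for illustration (each pitch tuple gives the position as a percentage of the phoneme’s duration and the \(F_0\) value in Hertz at that point):

    _   50
    h   62  10 110
    a   120 20 118  80 135
    l   55
    o:  140 50 160  99 120
    _   50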

Emofilt’s language-dependent modules are controlled by external XML files, and it is as multilingual as MBROLA, which currently supports 35 languages.

Emofilt consists of three main interfaces:

  • Emofilt-Developer: a graphical editor for emotion-description XML-files with visual and acoustic feedback (see Fig. 1).

  • Emofilt itself, taking the emotion-description files as input to act as a transformer in the MBROLA framework.

  • A storyteller interface that can be used to mark phrases in a dialog with colors that correspond to emotional expression [6].

The valid phoneme names are declared in the MBROLA database for a specific voice and must be known to Emofilt.

In a first step, each syllable is assigned a stress type. Emofilt distinguishes three stress types:

  • unstressed

  • word-stressed

  • (phrase) focus-stressed

As stress assignment would require an elaborate syntactic and semantic analysis, and this information is not part of the MBROLA PHO format, Emofilt simply assigns focus stress to those syllables that carry local pitch maxima. However, for research scenarios it is possible to annotate the PHO files manually with syllable and stress markers.
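The following Python sketch illustrates this pitch-maximum heuristic. It is a simplification for illustration only, not Emofilt’s actual (Java) implementation.

    # Simplified sketch of focus-stress assignment via local pitch maxima.
    # Not Emofilt's actual code; for illustration only.
    def assign_stress(syllable_peaks):
        """Given the peak F0 (in Hz) of each syllable, mark syllables whose
        peak is a local maximum as focus-stressed; all others stay unstressed."""
        stress = []
        for i, peak in enumerate(syllable_peaks):
            left = syllable_peaks[i - 1] if i > 0 else float("-inf")
            right = syllable_peaks[i + 1] if i + 1 < len(syllable_peaks) else float("-inf")
            stress.append("focus" if peak > left and peak > right else "unstressed")
        return stress

    assign_stress([110, 140, 120, 160, 130])
    # -> ['unstressed', 'focus', 'unstressed', 'focus', 'unstressed']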

The emotional simulation is achieved by a set of parameterized rules that describe manipulation of the following aspects of a speech signal:

  • Pitch changes, for example: “Model a rising contour for the whole utterance by ordering each syllable pitch contour in a rising manner”.

  • Duration changes, for example: “Shorten each voiceless fricative by 20%”.

  • Voice Quality, for example the simulation of jitter by alternating F0 values and support of a multiple-voice-quality database.

  • Articulation precision changes by a substitution of centralized and decentralized vowels.

The rules were motivated by descriptions of emotional speech found in the literature [3]. As we naturally cannot foresee all modifications that future researchers might want to apply, we extended Emofilt with a plugin mechanism that enables users to integrate customized modifications more easily.
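As an illustration of how such a rule could operate on parsed PHO data, the following Python sketch applies the duration rule “shorten each voiceless fricative by 20%”. It is a simplified stand-in for Emofilt’s rule engine, and the phoneme set is an assumption for a German SAMPA voice.

    # Simplified sketch of a duration rule applied to parsed PHO data.
    # Not Emofilt's actual implementation; phoneme set assumed for German SAMPA.
    VOICELESS_FRICATIVES = {"f", "s", "S", "C", "x", "h"}

    def shorten_voiceless_fricatives(phonemes, factor=0.8):
        """phonemes: list of (name, duration_ms, pitch_targets) tuples.
        Returns a copy with voiceless fricatives shortened by 20%."""
        result = []
        for name, duration, targets in phonemes:
            if name in VOICELESS_FRICATIVES:
                duration = round(duration * factor)
            result.append((name, duration, targets))
        return result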

3 Data Generation

As stated in Sect. 2, Emofilt’s modification rules fall into four categories: pitch, duration, voice quality and articulation.

A first, naive idea for simulating two different states at the same time would be to simply fuse the modification parameters for each desired expression by averaging them. For example, if anger calls for a 20% increase of some parameter on stressed syllables and sadness for a 20% decrease, the average would be a 0% modification. But, as this example shows directly, averaging easily leads to an equalization of the two expressions, so that neither would be detectable.

Instead, we used the prosodic feature categories (i.e. pitch and duration) to express the “foreground” emotion and the remaining categories, namely voice quality and articulation, to express the secondary emotional state. This split has no basis in psychological models; it was chosen for purely pragmatic reasons.
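A minimal sketch of this bundle split is given below. It is illustrative Python rather than Emofilt’s actual rule mechanism, and the parameter names and values are partly invented.

    # Illustrative sketch of the feature-bundle split (not Emofilt's actual code).
    PROSODIC = {"pitch", "duration"}                 # carries the primary emotion
    NON_PROSODIC = {"voiceQuality", "articulation"}  # carries the secondary emotion

    happy = {
        "pitch":        {"contour": "wave"},
        "duration":     {"voicelessFricativeRate": 1.4},  # lengthen by 40%
        "voiceQuality": {"effort": "loud"},                # invented value
        "articulation": {"target": "overshoot"},           # invented value
    }
    sad = {
        "pitch":        {"contour": "falling"},            # invented value
        "duration":     {"globalRate": 0.8},               # invented value
        "voiceQuality": {"jitter": True, "effort": "soft"},
        "articulation": {"target": "undershoot"},
    }

    def combine(primary, secondary):
        """Take pitch/duration rules from the primary emotion and
        voice-quality/articulation rules from the secondary one."""
        mixed = {c: r for c, r in primary.items() if c in PROSODIC}
        mixed.update({c: r for c, r in secondary.items() if c in NON_PROSODIC})
        return mixed

    happy_sad = combine(happy, sad)  # "happy" prosody, "sad" phonation/articulation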

The following example shows the configuration for happiness as the primary and sadness as the secondary emotion.

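The original configuration figure is not reproduced here; the XML sketch below conveys the idea, but its element and attribute names are invented for illustration and do not follow the actual Emofilt schema.

    <!-- Hypothetical sketch only; names do not follow the actual Emofilt schema. -->
    <mixedEmotion name="happy_sad">
      <!-- set 1 (prosody), taken from the model for happiness -->
      <pitch contour="wave"/>
      <duration voicelessFricatives="140"/>  <!-- lengthen to 140% -->
      <!-- set 2 (phonation, articulation), taken from the model for sadness -->
      <voiceQuality jitter="on" effort="soft"/>
      <articulation target="undershoot"/>
    </mixedEmotion>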

To display happiness, the pitch contour is assigned the so-called “wave model” (a fluent up-and-down contour between stressed syllables, see [4] for details) and the duration of the voiceless fricatives is lengthened by 40%. At the same time, the phonation and articulation parameters are altered according to the emotion model defined for sadness, i.e. jitter is added, the vocal effort is set to “soft” and the articulation target values are set to “undershoot”.

To generate test samples for evaluation in a systematic combination, each of Darwin’s four “basic emotions” (joy, sadness, fear and anger) was combined with all other emotions and used as primary as well as secondary emotional state. As a reference we added neutral versions, but did not combine neutral with the emotional states. This resulted in 17 emotion configurations (4 emotions by 4 plus neutral). The target phrases were taken from the Berlin emotional database EmoDB [5]; we used two short and two longer ones.

All target phrases were synthesized with a male and a female German Mbrola voice (de6 and de7). The resulting number of samples was thus 136 (17 × 4 × 2).

4 Perception Experiment

In a forced-choice listening experiment, 32 listeners (16 male, 16 female, 20–39 years old, mean = 27.26, standard deviation = 3.75) assigned all stimuli to one of the four emotions or “neutral”. A second rating was asked for as an “alternative” categorization. The “neutral” category was introduced as a default in case of uncertainty. The evaluation was done with the Speechalyzer Toolkit [7]. The stimuli were played back in randomized order over AKG K-601 headphones. A single session took about 40 minutes.

A validation of the full emotions (256 ratings per category) confirmed the synthesis quality for basic emotions, as all five synthesized categories were, on average, labeled as intended in 52.4% of the cases (see Table 1).

Table 1. Confusion matrix for the single basic emotions only. Primary rating in % divided by 100. Highest values bold.

The intended complex emotions were categorized with a primary label 3072 times. Excluding all full single emotions, and thus also all primary ratings for “neutral”, resulted in 2244 answers. The complex emotions as intended with set 1 (prosody) are recognized most frequently. However, anger is equally often confused with fear (Table 2).

A similar confusion matrix for the second intended emotion (voice quality, articulation), however, shows no identification by the listeners except for anger (Table 3).

The alternative ratings are dominated by “neutral”, indicating difficulties in assigning two separate emotions to the stimuli (Tables 4 and 5). The remaining data without any “neutral” responses, i.e. those actually assigned to the four emotions in question, account for only 38% of the 3072 responses. Still, systematic results are visible (Table 6): within the limits of those who actually rated a secondary emotion, combinations of anger and fear as well as of fear and sadness are predominantly classified as intended, regardless of which emotion was assigned to which feature bundle. The combination of joy and fear is most often rated correctly when joy is synthesized with the prosodic information. In sum, fear was the emotion that performed best in combination with others. Interestingly, all confusions retained one of the intended emotions, whereas the other was dominantly replaced with fear.

5 Discussion

The pure emotions were all recognized above chance. Results for the complex emotions indicate that the prosodic parameters significantly elicit the intended emotion, whereas the second bundle (voice quality and articulation precision) shows mixed results, even for the primary rating. In particular, the secondary rating was dominated by “neutral”. Nevertheless, when analyzing the pairs of non-neutral ratings, the intended complex emotions that include fear work especially well. Even the confusion patterns for the other targets show systematic effects in favor of fear, always retaining one of the intended emotions independently of the feature bundle it was assigned to. These results therefore most likely originate in the quality of the material and the evaluation method at the current state of synthesizing complex emotions, and cannot be taken to indicate the invalidity of the concept of complex emotions.

Table 2. Confusion matrix for the emotions synthesized with prosody. Primary rating in % divided by 100. Highest values bold.
Table 3. Confusion matrix for the emotions synthesized with voice quality and articulation. Primary rating in % divided by 100. Highest values bold.
Table 4. Confusion matrix for the emotions synthesized with prosody. Secondary rating in % divided by 100. Highest values bold.
Table 5. Confusion matrix for the emotions synthesized with voice quality and articulation. Secondary rating in % divided by 100. Highest values bold.
Table 6. Confusion matrix for the complex emotions separated for prosodic and non-prosodic feature order. Primary and Secondary ratings pooled (in % divided by 100). Highest values bold, intended categories in italics.

While the results are promising, the ultimate aim of validly synthesizing two emotions simultaneously was not fully reached. Apparently, some emotions (notably fear) dominate perception, and the salience or quality of the synthesis does not seem to be equally distributed over the two feature bundles.

From a methodological point of view, hiding the true aim while assessing two emotions per stimulus proved difficult. However, asking for only one emotion and analyzing the frequencies of replies would require comparable perceptual salience of each emotion involved. Fortunately, judging from conversations with the participants and the large number of neutral second ratings, the cover story of asking for a first and an alternative impression worked.

As an alternative, openly asking for a mixture of emotions risks inducing social-desirability effects; this might still allow testing the quality of synthesizing stereotypical emotion combinations, but not the validity of the complex emotions. Therefore, a more sophisticated evaluation paradigm based on social situations in which complex emotions do occur might be more meaningful.

6 Conclusions and Outlook

We described an approach to simulate a primary and a secondary emotional expression simultaneously in synthesized speech. The approach is based on combining different parameter sets with the open-source system “Emofilt”, which utilizes the diphone synthesizer “Mbrola”. The technique was evaluated in a perception experiment, which showed only partial success.

The ultimate aim of validly synthesizing two emotions simultaneously was not fully reached. As the results are promising, however, the synthesis quality, especially for voice quality and articulation, needs to be improved in order to establish comparable strength and naturalness of the emotions over both feature bundles. In particular, the simulation of articulation precision, which is done by replacing centralized phonemes with decentralized ones and vice versa [4], could be enhanced by using a different synthesis technique. Data-based synthesis (such as diphone synthesis or non-uniform unit-selection synthesis) is not well suited to manipulations of articulation precision or voice quality. In this respect the simulation rules based on prosodic manipulation (set 1) were naturally more effective.

As unrestricted text-to-speech synthesis is not essential while this is still predominantly a research topic, one possibility would be to use articulatory synthesis, in which the parameter sets can be modeled more elaborately by rules.

After quality-testing such optimizations, an improved evaluation methodology should be applied to study the validity of complex emotions synthesized with “Emofilt”.

The approach did succeed for emotions that are neighbors in the emotional space spanned by the PAD dimensions pleasure, arousal and dominance. For example, the combinations of sadness and anger as well as of fear and sadness share two of the three dimensions and were recognized by the majority of the judges.

For future work, one possibility would be to try combinations of emotions that listeners can envisage more easily than a systematic variation, for example by embedding the test sentences in situations that are appropriate for the targeted emotion mix.

It would also be interesting to investigate the acoustic manifestation of mixed emotions by analyzing natural data, for example the Vera am Mittag corpus [10]. As this corpus consists of real-life emotional expressions recorded in a TV show, mixed emotions are very likely to occur. A set of clear instances would have to be identified in a new labeling process and then analyzed for their acoustic properties. The outcomes could then be synthesized to validate the findings in a more controlled environment.