1 Introduction

While researchers in speech technology have presented various efficient techniques and applications that can recognize and/or synthesize the human voice, there are still only a few proposals on how to cope with feelings in the human voice. Characterizing and recognizing emotional states from human speech is a challenging task due to difficulties in defining emotions, which stem from both cultural and individual differences in the perception and expression of emotion. In order to advance the recognition of emotion in a particular language, it is necessary to collect and construct an emotional speech corpus of that language, annotated with moods in a systematic and consistent way. In the past, speech corpora for recognizing emotion were constructed in several languages, such as English, Spanish, Chinese, and Japanese, and an intensive survey on emotional speech corpus construction was presented by Ververidis and Kotropoulos (2006). Collecting natural emotional speech is difficult since we cannot know where or when speech expressing a particular emotion will occur. Therefore, most researchers have constructed emotion corpora by asking a professional actor to perform a mood state when speaking (simulated emotional speech) or by creating a situation to elicit a response from a person expressing the target emotion in his/her speech (elicited emotional speech) (Busso et al. 2008).

Currently, to the best of our knowledge, there is no extant Thai emotional speech corpus. In this paper, two emotional state models are applied to represent emotion in speech: (1) a numerical state model, namely Pleasure-Arousal-Dominance (PAD), and (2) a categorical state model comprising four basic emotional types with twelve subtypes. To ease the collection of emotional speech with clear emotional states, we decided to construct a corpus of simulated emotional speech, uttered by professional actors and actresses. Here, the three steps of corpus construction are transcription (subtitling), metadata preparation (formatting) and emotion annotation (labeling). In this work, the selected Thai drama series contains approximately 1520 min of video clips. These are segmented into 8987 turns of conversation; each turn is transcribed, enriched with metadata to facilitate the tagging process, and finally tagged using the two emotional state models. To characterize our corpus, we have extracted a number of statistics based on speakers, annotators and emotion tags.

In the rest of our study, Sect. 2 provides the background on human emotion theories as well as a literature review on emotional speech corpus construction. Our corpus design and construction are described in Sect. 3. The corpus design covers two main concerns: the tag format, based on a Document Type Definition (DTD) in the form of Extensible Markup Language (XML), and the tagging guidelines for corpus annotators. As for the corpus construction, the three-step process is described in the order of subtitling, formatting and labeling. Section 4 presents a number of statistics extracted from our constructed corpus, namely EMOLA (i.e., a Thai emotional speech corpus from Lakorn), in several aspects, mainly for analyzing the tagging results and the annotators' tagging styles. Finally, conclusions and future work are summarized in Sect. 5.

2 Literature review

This section presents the background on human emotion theories as well as a literature review on emotional speech corpus construction. As for the former, definitions of psychological emotional states and definition issues are addressed, followed by how emotional states are expressed in speech signals. For the latter, a number of previous works related to construction of emotional speech corpora are described.

2.1 Theories of human emotion

As research on human emotion has increased significantly over the past two decades in several fields, including psychology, neuroscience, endocrinology, medicine, history, sociology, and computer science, numerous theories have been developed to explain the origin, neurobiology, experience, and function of emotions, such as primary emotions (Plutchik 1980, 1984), basic emotions (Stein and Oatley 1992), normal emotions (Kaiser and Scherer 1998), and emotional responses (Scherer 1986).

Among these, several studies have explored relations between emotions and speech signals using various emotion types, such as anger, happiness, sadness, fear, and neutral (Scherer and Tannenbaum 1986). Human emotion involves two modes, a speaker mode and a listener mode, and can accordingly be studied from two perspectives: vocal expression and perception of emotion. In this work, we focus on perception, since it is difficult to establish the emotional intention behind speech, whereas the vocal emotion that we perceive can be interpreted directly, and our goal is to create a system that recognizes emotions. Thus, the most challenging part of our task is how to cope with individuals' differing interpretations of emotion when they perceive speech. The definition and classification of emotion are described in the next section.

2.1.1 Definition and classification of emotion

So far there have been several approaches proposed by psychologists to clarify human emotion. These approaches are based on different theories, and their definitions of emotion are inconsistent, uncertain, and arguable. In this subsection, some interesting theories are briefly reviewed. As an example of an early modern theory, Plutchik proposed the so-called Plutchik's wheel of emotions to illustrate different emotions in a complete and comprehensive way (Plutchik 1980). The wheel is composed of eight primary bipolar emotions: joy versus sadness; anger versus fear; trust versus disgust; and surprise versus anticipation. Additionally, each emotion is expressed by a color, and it is possible to express emotion intensity by color intensity and to mix emotions, like mixing colors, to form different emotions. As part of Plutchik's ten postulates, it is possible to analyze and interpret emotion by using basic emotions, emotion combination, emotion opposites, emotion similarity, and emotion intensity. So far there have been two fundamental approaches in research on emotion classification: (1) emotions as discrete and fundamentally different constructs and (2) emotions as points characterized by a dimensional basis in a coordinate system (Cowie and Cornelius 2003).

(1) Emotions as discrete categories

A naive and straightforward way to explain emotions is to use emotional categories, each of which is expressed by an emotional word referring to a human's state of emotion. In this discrete emotion theory, an innate set of basic emotions that are cross-culturally recognizable is defined. These basic emotions can be distinguished by an individual's facial expression and biological processes (Colombetti 2009). However, two issues arise: (1) how to define the set of basic emotions, and (2) how a listener perceives an emotion. In research on the first issue, Paul Ekman and his colleagues (Ekman 1992) conducted an intensive cross-cultural study on basic emotions and concluded that the six basic emotions are anger, disgust, fear, happiness, sadness, and surprise. The work reported that each emotion has particular characteristics attached, allowing it to be expressed in varying degrees, and that emotions act as discrete categories rather than variations of a single emotional state.

In an independent work (Bann and Bryson 2012), Bann and Bryson proposed a theory that people convey their understanding of emotions through the language they use surrounding emotion keywords. They suggest that the more distinct the language used to express a certain emotion, the more distinct the perception of that emotion, and thus the more basic it is. In their experiments, Bann and Bryson's most semantically distinct emotion set coincided with the basic emotion set proposed in Ekman et al. (1972).

As for the second issue, how a listener perceives an emotion, it is common that different listeners may have different opinions and thus provide different emotional words to describe a certain emotional state. Matching between emotional states and emotional words is somewhat subjective, depending on individual perception shaped by language usage, experience, circumstance, and the personality of the listener.

(2) Emotions in a dimensional model

As the second approach, an alternative is to design a set of primitive properties, each of which refers to a dimension in a systematic space. Several researchers prefer this approach for both theoretical and practical reasons. As the pioneer of modern psychology, Wilhelm Max Wundt proposed in 1897 that emotions can be described by three dimensions: “pleasurable versus unpleasant”, “arousing versus subduing” and “strain versus relaxation” (Wundt 1897). Around half a century after this first proposal, in 1954, Harold Schlosberg named three dimensions of emotion: “pleasantness–unpleasantness”, “attention–rejection” and “level of activation” (Schlosberg 1954).

The so-called Positive Activation-Negative Activation (PANA) model, originally created by Watson and Tellegan in 1985, suggests that positive affect and negative affect are two separate systems (Watson and Tellegan 1985). In the PANA model, the vertical axis represents low to high positive affect and the horizontal axis represents low to high negative affect. The dimensions of valence and arousal lie at a 45-degree rotation over these axes. More recently, the circumplex emotion model developed by Posner and his colleagues has suggested that emotions are distributed in a two-dimensional circular space, containing arousal and valence dimensions (Posner et al. 2005). This two-dimensional model attempts to characterize human emotions by incorporating the dimensions of valence and arousal together with their intensity.

Another model developed by Cowie and Cornelius suggested two dimensions, namely, activation and evaluation spaces (Cowie and Cornelius 2003). A three-dimensional model, the PAD emotional state model, was developed by Albert Mehrabian and James A. Russell, to describe and measure emotional states in the dimensions of Pleasure, Arousal and Dominance (Mehrabian and Russell 1974; Russell and Mehrabian 1977; Mehrabian 1996). Originally, some researchers used sixteen scales for a pleasure dimension, nine scales for an arousal dimension, and nine scales for a dominance dimension (Mehrabian 1995). They strongly believe these three dimensions characterize all emotions and any other concepts of emotional states. More specifically, Zhang et al. (2008) found anger to have PAD values of [− 0.90, + 0.79, + 0.95] while happiness tended to be expressed by [+ 0.68, + 0.68, + 0.43] on average. In Havlena and Holbrook (1986), anger is [− 0.85, + 0.23, − 0.32] and happiness is [+ 0.93, − 0.12, + 0.45].

2.1.2 Emotional states in speech

Compared with non-verbal communication, such as facial expression and gesture, speaking is a straightforward way to exchange information and emotions. Studying emotions in speech is therefore an important step towards understanding people's emotions. Although speech conveys emotions through two channels, content semantics and acoustic properties, understanding emotion via acoustic properties extracted from speech signals is particularly important. In many cases, speech with the same content may be interpreted differently in emotional terms when it is uttered with different acoustic patterns. Some psychologists have summarized emotional states in relation to acoustic properties rather than content, for example, Scherer (1995). It is well known that amplitude, the energy of the speech signal, or pitch variation can express differences between anger and happiness, while pitch contours show differences between states of sadness and pleasantness.

Cowie and Cornelius (2003) have studied relations between speech and emotions and have proposed a method to describe emotions. This study reported that 84% of clips are given a neutral label and only a very few clips present some other emotions (Cowie and Cornelius 2003). In general, most utterances produced by speakers are emotionally neutral. It was reported that acoustic stress, physical parameters, and speaker attitude affect emotions in speech. Some works state that acoustic stress (i.e., pitch movement) and other physiological/acoustic properties in many situations can be used as a clue to indicate emotions in speech (Johnstone and Scherer 1999). Moreover, there has been much evidence in the literature that a speaker’s attitude also affects his/her emotional states in speech (Schubiger 1958; Crystal 1975, 1976; O’Connor and Arnold 1973).

2.2 Previous works on emotional speech corpus construction

As material for research on emotional speech, we need to construct an emotional speech corpus with tagging information. The constructed speech corpus can be used for emotional speech analysis and for evaluating and comparing the performance of emotional speech recognition systems. So far, many emotional speech corpora have been constructed with different characteristics, such as corpus language, number of speakers (subjects), corpus purpose, number of emotion states, and type of collected speech. Ververidis and Kotropoulos (2006) have listed sixty-four emotional speech corpora and provided significant details of each corpus. We additionally enumerate thirty-seven emotional speech corpora reported during 1995–2016, as shown in Table 1. Among these corpora, 34 are monolingual while the remaining three are multilingual, covering two, two and four languages, respectively; the languages covered are 10 English, 8 Japanese, 5 Chinese, 4 Italian, 4 French, 2 German, 1 Basque, 1 Dutch, 1 Greek, 1 Hindi, 1 Indonesian, 1 Persian, 1 Polish, 1 Slovenian, and 1 Spanish. The corpora were constructed under different circumstances to express particular emotions, and speakers of different mother tongues may not express emotions in the same way or style due to cultural differences and the individual nature of the languages. In these works, the researchers strongly believe that emotion is related to vocabulary since some words can express emotions. Furthermore, some languages use a descriptive emotion word (verb) for both action and feeling expressions, such as “love” (Kövecses 2003).

Table 1 A list of available emotional speech corpora (EN, English; JA, Japanese; FR, French; SL, Slovenian; ES, Spanish; IT, Italian; PL, Polish; EU, Basque; ZH, Chinese; NL, Dutch; FA, Persian; EL, Greek; HI, Hindi; ID, Indonesian)

As shown in the fifth column of Table 1, the corpora were constructed with different numbers of speakers (subjects), ranging from one speaker to a few hundred speakers. The corpus with the most speakers was constructed by Cole in 2005 with 780 children participating in the project (Cole 2005; Ververidis and Kotropoulos 2006). The corpora also vary according to the type of speaker, including children versus adults, male versus female, native versus non-native, actor versus non-actor, and general purpose versus specific purpose, though some of them are of mixed types. For example, there are four corpora which collected emotions from children (Ververidis and Kotropoulos 2006; Dadkhah et al. 2008). Besides speaker characteristics, the corpora also vary according to the number of annotators who participated in labeling each utterance with emotion states. Most of the corpora required at least three annotators for tagging, and the majority rule is usually applied, i.e., a label is assigned when it obtains more than half of the votes.

In terms of computing aim or purpose (the sixth column in Table 1), an emotion corpus may be designed for emotion recognition, emotion synthesis, or analysis (such as psychological study and/or market research), but some corpora have an unspecified purpose. From the viewpoint of emotion definition, while there are various emotional states in human beings, researchers designed the emotional states (categories) in their corpora depending on their application tasks and the objectives of the corpus. Among the 38 corpora listed in Table 1, thirty-three were tagged with categorical emotions, twelve with emotions expressed by a dimensional model, and seven with both categorical and dimensional emotions, as specified in the seventh and eighth columns of Table 1. The number of categorical emotions (the seventh column) varies from a single emotion (it exists or does not exist) to over 10 emotion states, depending on the definitions of emotions that the researchers are interested in for their application. The most popular sets of emotion states are {‘anger’, ‘happiness’, ‘neutral’, and ‘sadness’}, {‘anger’, ‘fear’, ‘happiness’, ‘sadness’, and ‘surprise’}, and {‘anger’, ‘fear’, ‘happiness’, ‘sadness’, ‘surprise’, and ‘disgust’}. While the four most common emotions are ‘anger’, ‘happiness’, ‘neutral’, and ‘sadness’, some popular complementary emotions are ‘disgust’, ‘fear’ and ‘surprise’. However, some works did not include ‘neutral’ as a human emotional state since it was treated as ‘no emotion’. One major factor in developing an emotional speech corpus for emotion recognition is the need to define emotional states clearly in annotation. Matching between emotional states and emotional words is somewhat subject to individual perception due to language usage, experience, circumstance, and the personality of the listener. Another issue in tagging emotion states in speech is that speech may sometimes include more than one categorical emotion. With regard to dimensional emotion (the eighth column), some corpora keep a number of numeric values or polarities, in place of categories, as emotion dimensions, such as levels of Activation, Evaluation, Intensity, Valence, Positive–Negative, Stress, Arousal, Pleasantness, Dominance, Credibility, or Interest (Mori et al. 2011).

The last characteristic (the ninth column) indicates the circumstances in which emotional speech is acquired. Three typical ways to acquire emotional speech are (1) the natural method, (2) the elicited method, and (3) the simulated or acted method. In the first type, natural speech is collected from spontaneous conversation in unrestricted circumstances when emotions are expressed naturally. Even though genuine emotion is the most important type of speech for research, it is hard to collect due to its sparsity in nature, i.e., we do not know when and where people will speak with genuine emotion. Most utterances naturally produced by speakers are emotionally neutral, and speech with emotions rarely occurs. Due to this difficulty, some researchers recorded video and audio of people's reactions to customer service or other specific scenes, where people usually come to give information or complain, for the purpose of obtaining speech with emotions.

With regard to the second type, i.e. elicited speech, another way to obtain speech close to natural speech is to create a situation that will evoke emotions (Ververidis and Kotropoulos 2006). Researchers create a situation to make participants express the target emotion; once the participants have been stimulated toward the target emotion, they may express that emotion naturally. To make this method succeed, researchers must design a task or a game with scenarios that involve the participants. They usually impose difficult conditions on the participants in a particular task or game, for example, a difficult spelling test (Bachorowski 1999) or interacting with a malfunctioning system in a Wizard-of-Oz scenario (Batliner et al. 2003).

In the last type, the simulated method is widely used in place of the natural or elicited methods since it is the easiest way to collect emotional speech data (Moriyama et al. 2009; Li 2015). This method requires asking professional actors to express or simulate emotional speech. Speech collected in this way is known as acted speech or simulated speech since speakers have been instructed to produce the target emotion by using self-induced simulation.

3 EMOLA: a Thai emotional speech corpus from Lakorn

This section presents two types of designs: a corpus design and a corpus construction design. In the former, the source of materials and the emotion types for our corpus are explained. In the latter, the corpus construction process and supporting tools are described in detail. The output of these designs is a Thai emotional speech corpus, namely EMOLA, built from a Thai TV drama series, known as “Lakorn” (the Thai word for a drama series).

3.1 Corpus design

As a convenient way to collect data for our corpus, we use spoken speech with simulated emotions, where professional actors and actresses express or simulate emotional speech according to performance scripts that normally include daily emotions close to real situations. The most common emotions observed in the Thai drama series are anger, happiness, sadness, fear, joy, jealousy, envy and rage. In this work, the selected Thai TV drama series includes around 10–20 leading roles, acted by popular professional actors and actresses, producing several dialogues (conversation turns) in each scene where emotions are expressed.

Even though the actors and actresses sometimes overact, they are skillful in expressing their emotions. By using this series, we were able to use several forms of data, such as video, audio, or even drama scripts available online. However, in reality, we found that the online drama scripts do not match the speech in the actual performance, so we had to prepare the scripts or subtitles ourselves. In the corpus design, we segment the drama series into a number of short video/audio clips, each of which corresponds to a conversation turn and is associated with a subtitle and its timestamps (start and end times).

As our final outcome, each conversation turn is annotated with two levels of categorical emotions and one three-dimensional emotional scale. In assigning categorical emotions to the conversation turns, the first level is a basic emotion, selected from the four emotions ‘anger’, ‘happiness’, ‘neutral’, and ‘sadness’, while the second level consists of optional emotions, a few of which are chosen in order of relevance from twelve emotional labels: ‘anger’, ‘doubt’, ‘excitement’, ‘fear’, ‘fun’, ‘jealousy’, ‘happiness’, ‘hate’, ‘rage’, ‘sadness’, ‘satisfaction’, and ‘surprise’. We chose one of the most popular models, the Pleasure-Arousal-Dominance (PAD) emotional model, for describing dimensional emotion (feeling) in this work. In PAD, each emotion is described by three measurements, each of which is generally scaled from − 1.0 to 1.0, and the three dimensions are theoretically independent of each other. PAD ratings are somewhat subjective and vary according to culture, language usage, circumstances, annotators, and the settings of the experiment.

3.2 Corpus construction design and tools

This section presents the design of our corpus construction process, which is composed of three sub processes: data enrichment (transcribing, time-stamping, actor ID & environment tagging), data preparation (video segmentation & XML formatter), and emotion annotation as shown in Fig. 1. In the design of our emotional speech corpus, a Thai TV drama series video (Lakorn) will be transcribed and each conversation turn in the video will be marked with its start/end timestamps, corresponding actor ID(s) and environments, and kept as a subtitle file. After that, the timestamps are used to segment the video into a set of video clips. These video clips, together with their corresponding transcriptions, actor IDs and noise types will be formatted and kept as XML-based transcription/metadata files. Finally, the transcription/metadata files together with the video clips are presented to an annotator for emotion labeling, and the result will be kept as XML-based corpus metadata. The corpus metadata and video clips constitute our emotion-tagged speech corpus.

Fig. 1

Three steps in EMOLA corpus construction

3.2.1 Data enrichment: transcribing, time-stamping, actor ID & environment tagging

We collected emotional speech from TV drama series as our raw material, which consisted of video files extracted from four DVD discs. These video files are arranged into a number of episodes in order. As no subtitle or transcription is provided, each video file is manually transcribed into texts with their timestamps. By setting conversation turns (approximately one speaker’s utterance) as a unit of transcription, it is possible to focus on the real emotions of a particular scene and usually we were able to assign one main emotion for each utterance. Besides the transcription and timestamps of the speech, we also tag the ID of the speaker who makes the speech, as well as the environment type of the speech. The three environment types are none (clean speech), melody (background music) and noise (environmental sound).

Figure 2 displays the process of data enrichment, where a data editor analyzes a video (audio and visual components) and then provides start/end timestamps, a transcription (subtitle), an actor ID and an environment tag in the format of a subtitle file (.srt). In practice, the original video files in the VOB format (.vob) are converted to AVI-formatted files (.avi) in order to be compatible with a subtitling-support tool, namely Subtitle Edit. As free software, the Subtitle Edit tool enables a data editor to associate transcriptions, timestamps, actor IDs and environment tags with a video file by outputting a supplementary subtitle file. As shown in Fig. 3, we can play a part of the video in the top-right corner of the program window, and the program will then display the speech (or conversation) waveform corresponding to the playing part of the video in the bottom-right corner of the window. In the figure, we blurred the faces of the speakers (actors) for copyright reasons, but during the process of emotion labelling, the annotator watched the original video without any blurring.

Fig. 2

Data Enrichment: Timestamping, Transcript writing, and Actor ID & Environment tagging

Fig. 3

A sample screen using Subtitle Edit

At this point, the data editor can determine the period of the waveform for which he/she would like to provide a transcription, and can replay the waveform several times to confirm a suitable waveform period for that transcription. The data editor's input is inserted in a text box, located in the middle-left part of the window, in the format [Environment] (Transcription [Actor ID])+. Here, [Environment] specifies one of three environment types, [CLEAN], [BG-MUSIC] and [EX-MUSIC], standing for clean speech, speech with background music and speech with exciting music/sound, respectively. Transcription is the transcription, usually in the Thai language, and [Actor ID] is the bracketed identifier of the actor who utters the speech corresponding to the Transcription. In situations where multiple actors speak simultaneously, multiple pairs of Transcription and [Actor ID] can be inserted. The list of transcriptions and timestamps tagged for each time interval (waveform period) is shown in the top-left corner of the window.
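For illustration, a single subtitle entry following this convention might look like the hypothetical SubRip (.srt) record below; the index, timestamps, Thai text and actor ID are invented for the example and are not taken from the corpus.

```
42
00:12:03,250 --> 00:12:08,100
[BG-MUSIC] (สวัสดีค่ะ [ActF01])
```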

3.2.2 Data preparation: video segmentation and XML formatter

To prepare the video data for emotion annotation, video segmentation and XML formatting are performed as described below.

(1) Video Segmentation

After obtaining the subtitle files (.srt), the video file (.avi) is segmented into a number of video clips (.avi) according to the timestamps in the subtitle files, using the video and audio converter ffmpeg. Since ffmpeg can be invoked from a script, it is very convenient to produce the video clips by executing a series of ffmpeg commands with different parameters. With these commands, the original video is divided into a set of video clips, indexed by three digits at the end of the output filename. These indices are used for mapping the clips to their transcriptions.
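As a rough illustration only (the paper does not give the exact commands), the segmentation could be scripted along the following lines; the file names, timestamp format and clip-naming scheme are assumptions for the sketch.

```python
# Hypothetical sketch of clip cutting with ffmpeg; not the authors' script.
import subprocess

def cut_clip(src_avi, start, duration, index, prefix="ep01"):
    """Cut one conversation turn out of src_avi.
    start is an 'HH:MM:SS.mmm' string, duration is in seconds."""
    out_file = f"{prefix}_{index:03d}.avi"   # three-digit index, e.g. ep01_042.avi
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", start,              # seek to the turn's start timestamp
         "-i", src_avi,
         "-t", str(duration),       # keep only the turn's duration
         "-c", "copy",              # copy streams without re-encoding
         out_file],
        check=True)
    return out_file

# e.g. cut_clip("episode01.avi", "00:12:03.250", 4.85, 42)
```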

(2) XML Formatter

To keep the transcription and metadata in a standard format, XML is applied and the resulting file has the XML extension (.xml). We designed our own XML DTD to be simple and to match our purpose. The top-left part of Fig. 4 shows the Document Type Definition (DTD) (.dtd) designed for describing transcriptions (subtitles) and metadata in the form of the filename of the video source as well as a set of video-clip filenames with their corresponding subtitles, start timestamps, duration intervals, actor IDs and speaking environments. The right-hand side of Fig. 4 illustrates an example of a data description based on the XML DTD on the left. For clarity, the translation of each transcription in the XML-tagged data is given in the bottom-left of Fig. 4.

Fig. 4

Document Type Definition (DTD) for transcription (subtitle) and metadata (top left), an example of XML-tagged data (right), and the translation of each transcription in the XML-tagged data (bottom left)
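Since Fig. 4 is referenced here only by its caption, the fragment below is a hypothetical sketch of what such a DTD and data description could look like, covering the fields named above (source filename, clip filename, subtitle, start timestamp, duration, actor ID, environment); the element and attribute names, and the sample values, are invented and need not match the authors' actual DTD.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch only; names and values are illustrative. -->
<!DOCTYPE dataset [
  <!ELEMENT dataset (source, clip+)>
  <!ELEMENT source (#PCDATA)>
  <!ELEMENT clip (subtitle)>
  <!ATTLIST clip file CDATA #REQUIRED
                 start CDATA #REQUIRED
                 duration CDATA #REQUIRED
                 actor CDATA #REQUIRED
                 env (CLEAN|BG-MUSIC|EX-MUSIC) #REQUIRED>
  <!ELEMENT subtitle (#PCDATA)>
]>
<dataset>
  <source>episode01.avi</source>
  <clip file="ep01_042.avi" start="00:12:03.250" duration="4.85"
        actor="ActF01" env="BG-MUSIC">
    <subtitle>สวัสดีค่ะ</subtitle>
  </clip>
</dataset>
```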

3.2.3 Emotion annotation

The last process (the third process in Fig. 1), emotion annotation, is performed to manually tag an emotion for each video clip (utterance). As part of our corpus design, we decided to keep both the emotional category and the three-dimensional emotion of each utterance, for two purposes: emotion recognition and analysis of the relations between category and dimension. For consistency, we held a training course and several meetings for discussion among the six emotion annotators, who were trained to annotate all utterances in the whole corpus in the same manner. The annotators were requested to use the self-developed software, namely “EmoAnnotator”, to label emotions for each transcription. Using EmoAnnotator, an annotator can watch the video clip, with both audio and visual functions, together with its corresponding transcription for emotion labeling. The guidelines (instructions) for the annotators are as follows.

I. Provision of Annotator Information: The annotator fills in his/her annotator ID, name, and age.

II. Error Checking: The annotator has to check for any errors before assigning an emotion label to a transcription. If any unacceptable errors are found in the data, no label is assigned to the transcription. The following three types of unacceptable errors are considered.

(1) Incorrect Segmentation: Ideally, one video clip includes one (or more) complete utterance(s) from only one speaker, with only one emotion, under only one environment (clean, background music, noise). However, two types of incorrect segmentation are sometimes found.

a. Over-segmentation: a video clip that includes multiple utterances from multiple speakers with more than one emotion.

b. Under-segmentation: a video clip that includes incomplete (unfinished) utterances.

(2) Multiple Speakers: Sometimes it is impossible to make a video clip that includes utterance(s) from only one speaker, since two or more speakers are speaking at the same time.

(3) Wrong Transcription: For some reason, the transcription (subtitle) does not match the video clip it corresponds to. Sometimes the transcription is completely different from the video clip, and we therefore judge it to be an error. However, in some cases only some words are missing or added in the transcription; this is then judged not as a wrong subtitle but as an incomplete subtitle. Such incomplete subtitles are not considered errors since the transcription is partially useful.

If the annotator marks one of the above errors, he/she does not need to perform any of the following steps.

III. Annotation of Primary Emotion: According to our design, the four possible primary emotions are anger, happiness, sadness, and neutral. These primary emotions are easy to distinguish. Therefore, all annotators were trained to give the most suitable primary emotion to a video clip.

IV. Annotation of Secondary Emotion: If the video clip does not match perfectly with any of the primary emotions, the annotators are instructed to select any (one or more) of twelve secondary emotions: anger, confusion, disgust, excitement, fear, happiness, jealousy, pleasure, rage, sadness, satisfaction, and surprise. Moreover, in the case that none of these twelve emotions matches the emotion in the video clip, the annotators are allowed to manually add an appropriate new emotion into the text box provided.

V. Determination of Three-Dimensional Emotion: Besides the primary and secondary emotional categories, annotators are asked to provide values in the Pleasure-Arousal-Dominance (PAD) space, the three-dimensional emotion model described in Sect. 2.1.1. Each of the three PAD dimensions is rated on a scale of − 1.0 to 1.0 with an interval of 0.25, so the nine possible values for each dimension are − 1.00, − 0.75, − 0.50, − 0.25, 0.00, + 0.25, + 0.50, + 0.75, and + 1.00. Here, a positive value (+) refers to pleasure, non-arousal, and submissiveness, while a negative value (−) refers to displeasure, arousal, and dominance.

VI. Provision of Supplementary Quality Information: The annotators are also asked to provide further information related to the audio/visual quality of the video clip, including incomplete subtitles, non-synchronization, too-long clips, and low amplitude. Even if these quality problems are not very serious, it is useful to keep such information.

Similar to the format for the transcription and metadata, we also designed the Document Type Definition (DTD) ( .dtd ), as shown in Fig. 5, for describing the personal information of the annotator (including his/her identification, name, gender, and age), the input filename of the transcription and metadata, the annotation settings (that is the usage of video, sound, and transcription during annotation), the annotation result (primary emotion, secondary emotion and three-dimensional emotion), and the supplementary quality information (incomplete subtitle, non-synchronization, too-long a clip, and low amplitude).

Fig. 5

Document Type Definition (DTD) for describing personal information of the annotator (identification, name, gender, and age), the input filename of transcription and metadata (dataset name), the annotation settings (the usage of video, sound, and transcription during annotation), the primary emotion (Label 1), secondary emotion (Label 2), optional emotion (Label 3), three-dimensional emotion (Pleasure, Arousal, and Dominance), and supplementary quality information (incomplete subtitle, non-synchronization, too-long a clip and low amplitude)
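To make the structure of the annotation output concrete, the record below is a hypothetical sketch consistent with the fields listed in the Fig. 5 caption; all element names and values are invented for illustration and are not taken from the actual DTD or corpus.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical annotation record; names and values are illustrative. -->
<annotation>
  <annotator id="AnnF01" name="Annotator F01" gender="F" age="25"/>
  <dataset name="episode01.xml"/>
  <settings video="on" sound="on" transcription="on"/>
  <result clip="ep01_042.avi">
    <label1>Anger</label1>                 <!-- primary emotion -->
    <label2 rank="1">Rage</label2>         <!-- ranked secondary emotion -->
    <label3/>                              <!-- optional new emotion, if any -->
    <pad pleasure="-0.75" arousal="-0.50" dominance="-0.25"/>  <!-- invented values -->
    <quality incomplete_subtitle="no" non_synchronization="no"
             too_long="no" low_amplitude="no"/>
  </result>
</annotation>
```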

As an example, the EmoAnnotator tool starts by requesting an annotator to input his/her information, as shown in Fig. 6a, and then the annotation window is displayed as shown in Fig. 6b. In the window for inputting annotator information, the annotator has to input his/her identification, name/surname, gender, and age, as well as the name of the data set and the video-clip folder/filename he/she would like to annotate. He/she can select to show or hide the video, audio (sound) and subtitle during annotation, for further exploration of the effects of the annotation environment on tagging quality. However, in our corpus construction, the annotators are advised to turn on all three of these options. In the window for emotion annotation, there are eleven components: a video display, a transcription (subtitle) display, a replay button, error-marking checkboxes, a help button, a primary-emotion radio button, secondary-emotion ranked checkboxes, new-emotion text boxes, PAD sliders, checkboxes for supplementary quality information and a next-video-clip button.

Fig. 6

a Annotator information input window, and b emotion annotation window

By watching a video clip and reading the transcription, or replaying the video clip with the replay button on the left side of the window, an annotator can mark one or more error types that may occur in the bottom left of the window. If any error is marked for the video clip, the annotator cannot perform any further annotation. He/she can use the ‘Help’ button in the top right-hand corner to access the annotation instructions or guidelines. Otherwise, he/she can tag one primary emotion, as well as, optionally, a few ranked secondary emotions or a new emotion, for the video clip in the top right part of the window, and provide three values for the P, A and D dimensions in the middle right of the window. A slider will be green if the annotator feels positive (+ 1.00) and red if he/she has a negative feeling (− 1.00). If any additional or supplementary quality information on the speech is available, the annotator can check one or more options in the bottom right of the window. The ‘Next’ button should be pressed to watch the next video clip.

During labeling, annotators can take a break by terminating the program and later resuming their work by selecting the data set, the video folder and the video file as shown in Fig. 6a. The annotation results are stored in the form of XML as shown in Fig. 5. The annotators work independently on the process; they are not allowed to take any advice from the others. Because of this, all labels are subjective and based on an individual’s judgment.

4 Statistics of annotated data and evaluation

A Thai emotional speech corpus can be constructed using the process described in the previous section. To characterize the constructed corpus, we have provided three levels of statistics: collection-level, annotator-oriented, and actor-emotion-oriented statistics.

4.1 Collection-level statistics related to transcription/metadata

A popular Thai TV drama series was selected for the construction of an emotion-annotated corpus, namely EMOLA. Table 2 presents the major statistics of the material for corpus construction, related to the transcription and metadata. This corpus material comprised video with a length of approximately 23 h and 17 min, of which only 14 h and 28 min are subtitled, since the rest consists of silence, music, background noise, or non-speech portions. Therefore, sixty-two percent of the video was transcribed. Of a total of 8987 transcriptions, 208 (2.31%) are transcriptions in which multiple actors speak simultaneously, while most transcriptions (97.69%, or 8779 transcriptions) contain a single actor's speech. The 208 transcriptions with multiple actors speaking occupy 453 speaker scenes. In this study, a transcription with n speaking actors is counted as n speaker scenes. The corpus has a total of 9232 speaker scenes (5269 speaker scenes by 20 actors and 3963 speaker scenes by 31 actresses). The average, minimum and maximum transcription lengths are 5.80, 1.18, and 72.40 s, respectively, with a standard deviation of 3.80 s. Some transcriptions are from clean speech, some from scenes with background music, and the rest are speech in a noisy environment.

Table 2 Major statistics of the material for the EMOLA corpus

In general, the interpretation of emotion in speech is subjective in nature. In this work, we recruited six annotators (four females and two males) to annotate emotion in all the transcriptions in this corpus. Later, the results of the annotated data were analyzed and evaluated in order to compare individual perceptions as can be seen in the next section.

4.2 Annotator-oriented statistics of emotion annotation

After finishing the preparation of the transcriptions and metadata, the emotion annotation was performed. It should be noted that all the video clips were individually labeled by six annotators (four females and two males) aged between 23 and 39. Note that ‘F’ refers to female annotators and ‘M’ refers to male annotators in the annotator ID; for example, AnnF01 is a female annotator and AnnM01 is a male annotator. This section describes the statistics of the emotion annotations. All the annotators were asked to select one label per transcription (utterance); however, some of them may not have been able to label certain transcriptions due to errors in the data. Table 3 shows the number of emotional labels in the two levels of categories assigned by each annotator (Level 1: anger, happiness, sadness, and neutral; Level 2: anger, happiness, sadness, confusion, disgust, excitement, fear, jealousy, joy, rage, satisfaction, and surprise). We observed some variations in the annotated labels among the annotators, as follows.

Table 3 Two most assigned labels

Discussion on the level-1 labels

(1) AnnF01 and AnnF02 assigned ‘Happiness’ and ‘Anger’ labels to many utterances.

(2) AnnF01, AnnF02, AnnF03, and AnnM01 rarely assigned a ‘Neutral’ label, while AnnF04 and AnnM02 assigned a ‘Neutral’ label to many utterances.

(3) AnnF03, AnnF04 and AnnM01 assigned an ‘Error’ label to many utterances.

(4) The two most assigned labels for each annotator are summarized in Table 3.

Discussion on the level-2 labels

(1) The results of the second level seem to match well with the results of the first level. ‘Anger’ in the first level seems to match well with ‘Anger’ in the second level. Utterances that are assigned a ‘Happiness’ label in the first level usually have ‘Happiness’ or ‘Satisfaction’ labels in the second level, while those with ‘Sadness’ in the first level also receive ‘Sadness’ in the second level.

(2) AnnF01, AnnF02, and AnnM01 sometimes tag ‘Anger’ utterances with ‘Confusion’ as their second-level label. AnnF03, AnnF04, and AnnM02 sometimes tag ‘Anger’ utterances with ‘Rage’ as their second-level label.

(3) AnnF01 usually assigns ‘Sadness’ utterances ‘Fear’ in the second level, while AnnM01 often assigns ‘Sadness’ utterances ‘Confusion’ in the second level.

From these observations, it can be seen that some annotators have similar tagging concepts while others have different concepts.

In addition to the comparison of the individual annotators in Table 4, we also investigated annotator agreement on the annotated labels, as shown in Table 5. From this table, we found that 7792 transcriptions have majority emotion labels and 1195 transcriptions have no majority label. The most popular emotion label was ‘Anger’ while the least popular label was ‘Sadness’. Moreover, there were ten possible scenarios of annotator agreement. In each scenario, one point (1.0) was assigned to the majority label; in the case where the largest number of votes was shared equally by two or more labels, the point was divided by the number of labels sharing it. In Table 5, a-b-c-d-e indicates the numbers of annotators who provided the same label, in descending order. For example, the column 6-0-0-0-0 of the first row (‘Anger’) has a value of 399, which means there are 399 transcriptions assigned ‘Anger’ by all six annotators. The column 3-2-1-0-0 of the first row (‘Anger’) has a value of 414, which means there were 414 transcriptions assigned ‘Anger’ by three annotators, another emotion by two annotators, and yet another emotion by one annotator. The column 2-2-2-0-0 of the first row (‘Anger’) has a value of 44.67: 134 transcriptions were assigned ‘Anger’ by two annotators along with two other emotions, each with two votes. Since 2-2-2-0-0 has three equally voted emotions, each emotion obtained 0.33 (= 1/3) points, and with 134 transcriptions, 0.33 × 134 equals 44.67. Likewise, each annotated label received 0.5 (= 1/2) points in the 3-3-0-0-0 scenario and 0.33 (= 1/3) points in the 2-2-2-0-0 scenario.

Table 4 The number of emotional labels and their associated labels in the second level
Table 5 Number of transcriptions by label agreement patterns
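The scoring rule can be summarized in a short sketch; this is our reading of the rule described above, not the authors' code, and the function and variable names are invented.

```python
# Minimal sketch of the weighted-point rule: each transcription contributes
# one point, split equally among the labels tied for the most votes.
from collections import Counter

def weighted_points(labels):
    """labels: the six annotators' labels for one transcription,
    e.g. ['Anger', 'Anger', 'Anger', 'Sadness', 'Sadness', 'Sadness']."""
    votes = Counter(labels)
    top = max(votes.values())                          # size of the largest vote block
    winners = [e for e, n in votes.items() if n == top]
    return {e: 1.0 / len(winners) for e in winners}    # 1.0, 0.5 or 1/3 per label

# A 3-3-0-0-0 pattern yields 0.5 points for each of the two tied labels;
# a 2-2-2-0-0 pattern yields 1/3 points for each of the three tied labels.
```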

The ten possible scenarios are grouped into five groups according to the largest number of votes, as described below.

(1) The first case is idealistic labeling, in which all six annotators share the same opinion (6-0-0-0-0) when labeling an utterance. There are only 752 such utterances out of 8987; in other words, all six annotators agreed on the same label for approximately 8% of the utterances. Sadness is the most ambiguous emotional state, with only 5 such utterances, while anger is the clearest emotional state, with 399 utterances.

(2) In the 5-1-0-0-0 case, one annotator did not label the emotion state in the same way as the others. In this case the number of unlabeled data is highest, with 742 utterances, while sadness has only 33 utterances. The total number of matched utterances, 2095, is the highest among the individual scenarios.

(3) The majority label was voted for by four annotators. There are two sub-scenarios: (1) 4-1-1-0-0, where the winning label was voted for by four annotators and the other two annotators voted for two other labels, and (2) 4-2-0-0-0, where the winning label was voted for by four annotators and the other two annotators voted for one other label. The total number of utterances in this group was 2724, which outnumbered all the other majority cases.

(4) For the cases where the majority is three, there are three scenarios: (1) 3-1-1-1-0, (2) 3-2-1-0-0, and (3) 3-3-0-0-0. The first two scenarios (2180 transcriptions) have a majority vote. The last scenario (1070 transcriptions) has two emotions tied as majority votes, and each emotion obtained 0.5 points. Therefore, there were 535 weighted transcriptions for 3-3-0-0-0.

(5) For the cases where the majority is two, there are three scenarios: (1) 2-1-1-1-1, (2) 2-2-2-0-0, and (3) 2-2-1-1-0. The first scenario (41 transcriptions) has a majority vote. The last two scenarios have no majority emotion. There were 702 transcriptions for 2-2-2-0-0, which are equivalent to 702 × 1/3 = 234 weighted transcriptions, and 852 transcriptions for 2-2-1-1-0, which are equivalent to 852 × 1/2 = 426 weighted transcriptions.

(6) In summary, 7792 out of 8987 transcriptions had majority votes, while only 1195 transcriptions had no majority vote, covering the three scenarios 3-3-0-0-0, 2-2-2-0-0, and 2-2-1-1-0.

Moreover, we analyzed annotator clustering based on inter-annotator agreement (Cohen's kappa). In Fig. 7, for each actor of ActF01-ActF05 and ActM01-ActM06, the agreement among annotators was calculated based on Cohen's kappa. The average over actors was then taken as the measure of agreement between two annotators, as shown in the last column of Fig. 7a. These averages are summarized in Fig. 7b to present the measure of agreement between any pair of annotators. Based on the definition in Fig. 7c, the clustering results of single linkage and complete linkage are shown in Fig. 7d, e, respectively. In conclusion, the three annotators who had the most similar opinions in emotion labeling are AnnF02, AnnF03, and AnnM01, for both single and complete linkage, when the threshold was set to 0.44. We further investigated annotator agreement on labels for only these three annotators (AnnF02, AnnF03 and AnnM01) and the results are shown in Table 6. From this table, it can be seen that 8211 transcriptions had a majority emotion label and 776 transcriptions had no majority label. The most popular agreed emotion label was ‘Anger’ (2332), while the least popular label was ‘Neutral’ (284). The two possible scenarios with a majority were 3-0-0 and 2-1-0 (1.0 point for the majority), whereas the one scenario with no majority was 1-1-1 (1/3 points each).
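A sketch of how such an agreement-based clustering could be computed is given below; it computes pairwise Cohen's kappa over all commonly labeled utterances (a simplification of the per-actor averaging described above) and converts agreement into a distance for hierarchical clustering, so the tooling and the distance transform are assumptions rather than the authors' exact procedure.

```python
# Illustrative sketch of annotator clustering from pairwise Cohen's kappa.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_annotators(labels, threshold=0.44, method="complete"):
    """labels: dict annotator ID -> list of emotion labels, one per utterance,
    with the same utterances in the same order for every annotator."""
    ids = sorted(labels)
    n = len(ids)
    dist = np.zeros((n, n))
    for (i, a), (j, b) in combinations(enumerate(ids), 2):
        kappa = cohen_kappa_score(labels[a], labels[b])
        dist[i, j] = dist[j, i] = 1.0 - kappa     # higher agreement -> smaller distance
    Z = linkage(squareform(dist), method=method)  # single or complete linkage
    # cut the dendrogram where agreement drops below the threshold
    return ids, fcluster(Z, t=1.0 - threshold, criterion="distance")
```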

Fig. 7

Annotator clustering based on emotion label agreement (Cohen’s kappa): single linkage and complete linkage. a Emotion label agreement (Cohen’s kappa) with respect to the total number of speaker scenes, by actor. b Similarities between annotators based on average emotion label agreement. c Clustering by single linkage. d Clustering by complete linkage

Table 6 Number of transcriptions by label agreement patterns for three annotators (AnnF02, AnnF03 and AnnM01)

4.3 Actor-emotion-oriented statistics and analysis of emotion annotation data

Emotion annotation results can be applied to investigate the emotion similarities among actors. In this analysis, the results of the three most similar annotators, AnnF02, AnnF03, and AnnM01, were used. As shown in Table 2, there were 51 actors (20 males and 31 females) in this corpus, so for the sake of simplicity we selected the TOP-5 dominant (most seen) actresses (ActF01, ActF02, ActF03, ActF04 and ActF05), who were 18, 69, 19, 48, and 22 years old, and the TOP-5 actors (ActM01, ActM02, ActM03, ActM04, and ActM05), aged 24, 26, 62, 53, and 18, respectively, to represent all the actors, since the total number of these actors' utterances was 7486, almost 83.30% of all utterances in the corpus. The emotion distributions of the ten major actors/actresses tagged by AnnF02, AnnF03 and AnnM01 are shown in Table 7. Here, the leading actors were ActF01 and ActM01, and their conversations covered 36% (3249 utterances) of the drama. On average, the order of emotion expression was ‘Anger’ > ‘Happiness’ > ‘Sadness’ > ‘Neutral’. According to Table 7, most actors usually expressed ‘Anger’, except ActF02, ActM03, and ActM04, who expressed more ‘Sadness’, and ActM02 and ActM05, who expressed more ‘Happiness’. On average, ‘Neutral’ was the least expressed emotion. Moreover, on average the female actors generated more ‘Anger’ and ‘Sadness’ than the male actors. The ‘Error’ row indicates the number of transcriptions (conversation turns) for which the annotator could not assign an emotion. The ‘Sum’ and ‘No Error’ rows are identical in number but differ in proportions. The ‘Total’ row presents the sum of ‘Error’ and ‘No Error’, equal to the total number of transcriptions (conversation turns) of each actor. ActM03's emotions seemed vague, resulting in different emotion labels being assigned by the three annotators. ActM04 usually expressed ‘Anger’ and ‘Sadness’ (more than 80% of his total conversation turns). In a nutshell, the male actors generally expressed more ‘Happiness’ while the female actors expressed more ‘Anger’ and ‘Sadness’.

Table 7 Emotion distribution annotated by AnnF02, AnnF03 and AnnM01 for five major actors and actresses

To analyze the actor characteristics, we introduced the Kullback–Leibler (KL) divergence, which is a measure of the difference between two probability distributions (here, the proportions of two actors' emotions). We calculated $p_{j}^{i}(x)$, the probability of an emotion ($x$) labelled by an annotator ($i$) for an individual actor ($j$), by Eq. (1). The sum of the probabilities of all four emotions (anger, happiness, neutral, and sadness) for each actor is 1.

$$ p_{j}^{i} \left( x \right) = \frac{{N_{j}^{i} ( x )}}{{\mathop \sum \nolimits_{x \in E} N_{j}^{i} ( x )}} $$
(1)

For example, $p_{\mathrm{ActF02}}^{\mathrm{AnnF01}}(\mathrm{AG})$ refers to the probability of the anger (AG) emotion expressed by ActF02, as labelled by AnnF01. The KL divergence of two actors (here A and B) judged by an annotator (i) is defined by the difference between the two probability distributions over all the target emotions (x) of those two actors (A and B), as assigned by annotator (i). The formal description is given in Eq. (2).

$$ D_{KL}^{i} \left( {A,B} \right) = D_{KL} (p_{A}^{i} \left( x \right)||p_{B}^{i} \left( x \right)) = \mathop \sum \limits_{x \in E} p_{A}^{i} \left( x \right)\log_{2} \frac{{p_{A}^{i} \left( x \right)}}{{p_{B}^{i} \left( x \right)}} $$
(2)

Since the KL divergence is not symmetric ($D_{KL}^{i}(A,B) \neq D_{KL}^{i}(B,A)$), we instead applied the Jensen-Shannon divergence (JSD), a symmetric KL-divergence-based measure, as shown in Eq. (3).

$$ D_{JSD}^{i} \left( {A,B} \right) = \frac{1}{2}D_{KL}^{i} \left( {A,B} \right) + \frac{1}{2}D_{KL}^{i} \left( {B,A} \right) $$
(3)
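A minimal numerical sketch of Eqs. (1)–(3) is given below; it assumes the per-actor emotion counts are available as dictionaries, and the small smoothing constant (added to avoid division by zero or log of zero for unseen emotions) is our addition, not part of the paper's formulation.

```python
# Sketch of Eqs. (1)-(3); names and the eps smoothing are illustrative choices.
import numpy as np

EMOTIONS = ["Anger", "Happiness", "Neutral", "Sadness"]

def emotion_probs(counts, eps=1e-9):
    """Eq. (1): normalize one actor's label counts into a distribution."""
    v = np.array([counts.get(e, 0) for e in EMOTIONS], dtype=float) + eps
    return v / v.sum()

def kl(p, q):
    """Eq. (2): KL divergence in bits (log base 2)."""
    return float(np.sum(p * np.log2(p / q)))

def jsd(p, q):
    """Eq. (3): the symmetrized divergence used in the paper."""
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

# e.g. jsd(emotion_probs({"Anger": 120, "Sadness": 30}),
#          emotion_probs({"Happiness": 80, "Neutral": 40}))
```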

To investigate further similarities among the actors, we used the JSD to express the distance between each actor pair, and then applied the complete linkage method to the TOP-10 group of actors (5 females, 5 males). The result is shown in Fig. 8 (node 1 = ActF01, 2 = ActF02, 3 = ActF03, 4 = ActF04, 5 = ActF05, 6 = ActM01, 7 = ActM02, 8 = ActM03, 9 = ActM04 and 10 = ActM05; black circles refer to actors while white circles refer to actresses). At first glance, the female actors seemed to share a similar proportion of emotions, and the male actors also expressed a similar proportion of emotions, except for ActM01 (No. 6 in the figure) and ActF02 (No. 2 in the figure). Figure 8a shows that ActF03 (No. 1) and ActF04 (No. 4) share the most similar emotions according to AnnF02. Moreover, most male actors, ActM02, ActM03, ActM04, and ActM05 (Nos. 7–10), are considered to have similar emotions in the drama. As shown in Fig. 8b, AnnF03 judged ActF01 (No. 1) and ActF04 (No. 4) to be the most similar pair, with a complete-linkage distance of 0.018. ActF02 (No. 2) was the only female actor who had characteristics similar to two of the male actors, ActM01 and ActM04 (Nos. 6 and 9). Figure 8c shows the similarities based on AnnM01's annotations. Although the result is similar to that of the previous annotators (AnnF02 and AnnF03), his decisions show that ActF03 (No. 3), ActF05 (No. 5) and ActM01 (No. 6) are the most similar, with distances of 0.002 and 0.015. Averaging over the three annotators (Fig. 8d), most of the female actors (Nos. 1, 3, 4, and 5) expressed their emotions in a way similar to the male actors (Nos. 7–10), even though they had a high JSD. ActM01 (No. 6) is a male actor whose emotion distribution was close to that of the female actors. On the other hand, ActF02 (No. 2) is a female actor whose emotion distribution was similar to that of the male actors.

Fig. 8

Complete linkage clustering of actors’ emotions labeled by three annotators (a AnnF02, b AnnF03, and c AnnM01) and d sum of three annotators

5 Conclusion and future work

This paper presents the construction of a Thai emotional speech corpus, namely EMOLA, from a Thai drama series (Lakorn). The design, construction, and annotation process are discussed. In the corpus design, the four basic emotion types of anger, happiness, sadness and neutral were selected, and twelve subtypes of emotion were used: anger, confusion, disgust, excitement, fear, happiness, jealousy, pleasure, rage, sadness, satisfaction, and surprise. In addition to the categories of emotion, emotions are also described using the Pleasure-Arousal-Dominance (PAD) emotional state model, where each emotion is represented by three values: P, A, and D. An XML DTD was designed to store the transcriptions, metadata and emotion tags. The mapping between the categories of emotion and the PAD representations was analyzed.

In the process of corpus construction, we transcribed and added metadata to 8987 video clips totalling approximately 868 min (from a video of 1397 min), and we assigned one basic type and a few subtypes to each video clip. This corpus was developed from a Thai drama series with 20 actors and 31 actresses; 208 utterances were produced by multiple actors. All utterances were annotated by six annotators (AnnF01, AnnF02, AnnF03, AnnF04, AnnM01 and AnnM02). The characteristics of the corpus were investigated in three aspects: the video material, the annotators, and the actors. The relationship between basic emotions (level 1) and subtypes of emotions (level 2), as well as the label agreement patterns among the annotators, were analyzed. According to the analysis of the annotators' work, AnnF02, AnnF03 and AnnM01 were the top-3 annotators who most often shared the same opinions in emotion labeling. Moreover, we applied the Kullback–Leibler divergence and the Jensen-Shannon divergence to measure the similarities between two actors' emotions, and we used complete linkage clustering to group the actors' characteristics. The results for all three annotators show that ActF01 and ActF03 were the most similar in this corpus.

This database is expected to be a resource that helps us understand the use of certain emotions in the Thai language, and it should be useful for modeling the relation between video material, actors, and annotators. In the future, we plan to use this corpus for emotion recognition in Thai speech. The annotation differences among annotators will be investigated with regard to speech mood recognition. Since emotion is not usually expressed throughout the whole of a speech utterance, the location of moods within the utterance should also be studied.