
1 Introduction

This contribution deals with the requirements on representation languages employed in planning and displaying communicative multimodal behaviour of embodied conversational agents (ECAs). The term ECA was coined in Cassell et al. (2000) and refers to human-like virtual characters that typically engage in face-to-face communication with the human user, employing various synchronised channels of communication such as facial expression, hand–arm gestures and body posture, as well as tone of voice and text. The embodiment of ECAs ranges from talking heads to full-bodied 2D and 3D characters. Underlying them are complex AI systems that model the character's capabilities to meaningfully engage in communicative situations with the human user or other ECAs. This includes modelling of the character's self, its tasks, goals, related beliefs and intentions, its personality, emotions and interpersonal stances, but also modelling of the character's (perceived) environment, including social scenarios, communication situations and partners, in order to generate situationally adequate and believable behaviours.

ECA systems may implement a face-to-face dialogue with the user, model scenarios where humans and artificial agents interact with each other in a virtual or mixed environment, or generate communicative interactions between different artificial characters in which the user actively takes part or which are displayed to the user, as in a product presentation, a TV spot or a stage play. See, for instance, André and Rist (2000) and Nijholt (2006), where one or more virtual agents present information to the user; Krenn et al. (2002), where a virtual car seller and buyer engage in a conversation about various features of cars; and Mateas and Stern (2003), where in an interactive drama the interaction of a user with two characters representing a couple on the verge of divorce influences the outcome of the couple's story. Rehm and André (2005) present GAMBL, an interactive test-bed for human–ECA interaction. Other examples of ECA implementations are the Real Estate Agent REA (Cassell et al., 1999), which implements a full perception–action loop of communication by interpreting multimodal user input and generating multimodal agent behaviour; the pedagogical agent Steve (Rickel and Johnson, 1998), which functions as a tutor in training situations; MAX (Kopp and Wachsmuth, 2004), a virtual character geared towards simulating multimodal behaviour; and Carmen (Marsella et al., 2003), a system that supports humans in emotionally critical situations such as advising parents of infant cancer patients. ECAs can adopt several roles, such as being a teacher (Johnson et al., 2005; Moreno, 2007), a doctor (De Rosis et al., 2003), a museum guide (Kopp and Wachsmuth, 2004), a real-estate agent (Cassell et al., 1999) or a companion (Bickmore and Cassell, 2005; Hall et al., 2006). Even though the above examples represent only a small fraction of the vast and constantly growing work on ECAs, they are well suited for illustrating the broad range of applications.

Human communicative behaviour covers a broad range of skills, including natural language generation and production, co-verbal gesture, eye gaze and facial expression. People produce such behaviours in real time with ease and in a broad range of circumstances. In order to simulate human-like multimodal communicative behaviour, advanced ECA systems need to incorporate a whole range of complex processing steps, from intent to behaviour planning to behaviour realisation including some sort of scene or story generation, multimodal natural language generation, speech synthesis, the temporal alignment of verbal and non-verbal behaviours and behaviour realisation employing particular animation libraries and engines. In the current contribution, we focus on the role of behaviour representation frameworks as part of the processing chain from intent planning to the planning and generation of multimodal communicative behaviours.

2 Background

Imagine a situation where we want to model the following encounter between a character C and a user U: C is in a good mood, encountering the appearance of U in the system makes C particularly happy and leads C to greet U effusively. Finally, we want to see an animation including the following behaviours: C displays a neutral but friendly face, directs its attention to U, broadly smiles at U and says Hello my friend! Good to see you after such a long time! As regards the spoken utterance, we want to put emphasis on hello, good and long.

At the intentional level we thus have something like the following ingredients, which we present in a pseudo-notation (in reality, XML formats are widely used):

Example 1.

Intentional Level pseudo-notation

mood(C) = happy;
event = encounter(C,U) -->
    emotion(C) = happy & communication_act(C) = greet(U).
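For illustration, the same intentional-level information might be serialised in an XML format roughly as follows. The element and attribute names are invented for this sketch and are not taken from any particular language.

<!-- Hypothetical XML serialisation of Example 1; element names are illustrative only -->
<intentionalState character="C">
  <mood type="happy"/>
  <event type="encounter" actor="C" patient="U"/>
  <emotion type="happy" cause="encounter"/>
  <communicativeAct type="greet" addressee="U"/>
</intentionalState>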

At the behaviour planning level, we need to further specify the non-verbal and verbal behaviours, bring them into a temporal order and specify the relative dependencies between the communication channels involved. Employing our pseudo-notation, this might be represented as follows:

Example 2.

Behaviour planning level pseudo-notation

c1: face(C) = neutral_friendly;
c2: gaze(target(C)) = user;
c3: face(C) = broad_smile;
c4: gesture(C) = wave;
c5: utterance(C) = emph{hello} my friend! m1 emph{good} to see you after such a emph{long} time!;
start(c1) = t0;
start(c2) = encounter(C,U);
start(c3) = encounter(C,U);
end(c3) = start(c5);
start(c4) = encounter(C,U);
end(c4) = m1.

The above notation defines the partial behaviours (c1 to c5) including face, gaze, gesture and utterance, puts them into a relative temporal order and specifies the following dependencies: At time t0, the beginning of the animation, C looks neutral_friendly and starts to look at the user upon encountering them. At the same time, C starts a broad smile and waves. When beginning to speak, C stops smiling. The wave ends after friend has been spoken and before the onset of good.
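In an XML rendering, the same behaviour plan might look roughly as follows. The notation is again invented for illustration; it anticipates the style of the behaviour markup languages discussed below without following any particular specification.

<!-- Hypothetical XML rendering of Example 2; relative timing only, element names illustrative -->
<behaviourPlan character="C">
  <face id="c1" expression="neutral_friendly" start="t0"/>
  <gaze id="c2" target="user" start="event:encounter"/>
  <face id="c3" expression="broad_smile" start="event:encounter" end="c5:start"/>
  <gesture id="c4" type="wave" start="event:encounter" end="c5:m1"/>
  <utterance id="c5">
    <emph>Hello</emph> my friend! <mark id="m1"/>
    <emph>Good</emph> to see you after such a <emph>long</emph> time!
  </utterance>
</behaviourPlan>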

At the behaviour realisation level, we need to specify the actual behaviours at an even greater level of detail, such that the realisation engine is able to generate a sequence of integrated multimodal behaviours for playing. In particular, the concrete animations are selected, the utterance is synthesised and the timing of the partial behaviours is transformed from relative to absolute. For lip-synchronised speech, phonemes (transcripts of consonants and vowels) are aligned with visemes (visualisations of mouth shapes which may also include the tongue). Depending on the speech synthesis component employed, a markup of prosodic information including syllables, intonation phrases and related accents may also be available. This information is necessary, for instance, to synchronise eyebrow raises with accents.
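A realisation-level fragment could then contain absolute times and the phoneme–viseme alignment, for instance along the following lines; all element names and values are made up for illustration.

<!-- Hypothetical realisation-level fragment; times in seconds, all names and values illustrative -->
<realisation character="C">
  <speech audio="greeting.wav">
    <phoneme symbol="h" viseme="breathy_open" start="0.00" end="0.07"/>
    <phoneme symbol="e" viseme="mid_open" start="0.07" end="0.16"/>
    <!-- ... remaining phonemes of the utterance ... -->
    <accent syllable="hel" start="0.00"/>
    <phraseBoundary time="1.25"/>
  </speech>
  <animation channel="face" clip="broad_smile_01" start="0.00" end="1.25"/>
  <animation channel="right_arm" clip="wave_medium" start="0.00" end="0.90"/>
</realisation>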

As our small example has shown, there is a variety of information that needs to be modelled in order to represent multimodal communicative behaviour. Accordingly, considerable effort has been put into the development and documentation of representation formats in recent years. The overall complexity of ECA systems has motivated different strands of development. On the one hand, it gave rise to a number of XML-based markup languages which aim at providing means for human authors to easily annotate text with multimodal behaviours. On the other hand, attention was geared towards the specification of representation formats for the exchange of information between sub-components of an ECA system.

The goal of this contribution is to give an overview of the representation formats proposed so far and to discuss specific representational needs raised by selected sub-tasks, such as modelling emotional display at different levels of processing and modelling spoken dialogue accompanying bodily behaviours. An evaluation in terms of the general acceptance and dissemination of different representation formats will be provided. Moreover, we will present an initiative towards a common architecture for ECA systems and the prospects for the future development of representation languages. The community is still investigating strategies for unifying the existing variety of representation formats. We expect benefits from such an endeavour only if representation formats and the source code of related processing components are made freely available to the research community.

3 Different Views on Representation Languages/Formats for Behaviour Generation of ECAs

Numerous representation languages or formats have been proposed in the literature. They include markup languages for annotating text with behaviour directives, representation languages that declaratively and to different degrees of detail model various aspects of information required at different stages of behaviour generation, and languages that incorporate procedural knowledge in their annotations. Some languages attempt to cover a broad range of information relevant for behaviour generation. Most of the representations, however, have been designed for specific applications. In order to structure the wealth of proposals, we offer, in the following, two views on existing representations: First, we provide examples of different representation formats, ranging from text markup to representations that contain aspects of high-level programming languages. Second, we present examples of representation languages that attempt to cover a broad range of information versus representations that have been designed with a specific application purpose in mind.

3.1 Formats – Markup- Versus Representation- Versus Scripting Languages

Markup languages typically define sets of markups that give the non-expert user (usually a web designer) the possibility to annotate text with high-level behavioural information in order to produce pre-scripted presentations for ECAs. VHML (Beard and Reid, 2002) is an example of this type of language. It has been designed for creating interactive multimodal applications with talking heads or full-bodied ECAs. Other examples of ECA markup languages where text is annotated with high-level concepts are APML (De Carolis et al., 2004) and MPML (Zong et al., 2000).

Representation languages, in contrast, aim at the technically detailed annotation of theory-specific information. In this respect, the Emotion Annotation and Representation Language (EARL) addressed in Schröder et al. (2010) is more a representation than a markup language. This holds in general for all languages that become more and more detailed in modelling and describing multimodal behaviours. Representation languages are thus well suited to function as data representation formats inside a system, especially at the interfaces between the individual sub-components. RRL (Rich Representation Language, Piwek et al., 2002) is an example of such a language; it defines an XML format for representing the input and output of all the components used for realising the processing steps from intent to behaviour planning to behaviour realisation.

Scripting languages, in addition, incorporate means for encoding procedural knowledge, e.g. conditional execution of behaviours such as "if event X occurs, then execute behaviour Y". Scripting languages are thus comparable to high-level programming languages. Examples in the field of ECAs are STEP and its XML variant XSTEP (Huang et al., 2003), and ABL (Mateas and Stern, 2004). The expressive power of scripting languages comes at a price, though: the complexity of writing specifications in ABL, for example, comes close to programming in Java. The choice of the appropriate level of representation thus has to take different constraints into account. On the one hand, markup languages are indispensable in application development, because the application designer need not necessarily be an expert in all the fields underlying the development of ECA systems. On the other hand, representation languages are crucial in research contexts, because of the necessity to represent highly specific, low-level information. With the increased demand for truly interactive systems, the need to include at least some procedural capabilities typical of scripting languages into the representations becomes more and more of an issue, resulting in new hybrid formats. This trend is exemplified, e.g., in the evolution from MPML to MPML3D (Nischt et al., 2006). While MPML was designed as a markup language that allows the non-expert user to create pre-synchronised presentations, MPML3D has evolved into an authoring system that allows for the embedding of scripts and for the design of reactive scenarios. Summing up, while multimodal markup languages are designed to allow non-experts to create multimodal presentations easily, representation languages in the above definition are designed to ease the integration and exchange of components in multimodal systems. Rist (2004) argues, though, that with the advent of more and more sophisticated authoring tools in the future, simplicity will eventually become less of a design criterion for markup languages and the distinction between the different types of representation languages will become increasingly blurred.
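To illustrate the difference, a hybrid format might embed a conditional rule of the kind mentioned above roughly as follows. The syntax is invented for illustration and does not reproduce MPML3D, STEP or ABL.

<!-- Purely illustrative: procedural knowledge embedded in a markup document -->
<rule>
  <condition event="userEntersScene"/>
  <then><perform behaviour="greetingSequence"/></then>
  <else><perform behaviour="idleLoop"/></else>
</rule>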

3.2 Scope – General Purpose Versus Application-Specific

To give an assessment of different strands of endeavour, we present, in the following, attempts to develop representation formats of broad scope and contrast them with languages that have been developed to serve much more restricted purposes, either developed in the context of, and thus particularly geared towards, certain ECA implementations, or aimed at the representation of certain aspects of multimodal behaviour generation.

3.2.1 General-Purpose Initiatives

HumanML (Brooks, 2000) was an initiative hosted by the "Organization for the Advancement of Structured Information Standards" (OASIS) to come up with a markup language for describing virtually all properties not only of artificial characters but of human beings. It set off with the goal of providing information for human-to-human and human-to-machine communications in a machine-readable form. The language aimed to encode information related to human communicative behaviour from high level (culture, emotion) to low level (signal, kinesics) and aimed to be of relevance for such diverse areas as anthropology, medicine, business communication and virtual reality. It was planned to specify tags related to physiology, proxemics, kinesics, haptics, beliefs, intentions, emotions, etc. and to provide attributes related to community, culture, context/location of the conversation, personality, thoughts and signals. The initiative came to a halt soon after proposing a rough XML scheme with place-holders for the high-level concepts, which were never specified in more detail.

The Virtual Human Markup Language (VHML) presents a more down-to-earth initiative for a language intended to facilitate interaction with a talking head or a full-bodied virtual human (Beard and Reid, 2002). It was designed as a confederation of various relatively simple sub-languages, each of them concerned with a sub-task: dialogue management, emotion, facial animation, body animation, hypertext and speech. VHML has a hierarchical structure, i.e. elements of a lower level inherit information from the higher level. A typical example of this hierarchical encoding is emotion tags, which are inherited by all the sub-components for speech, face and gesture. The specification of VHML never left the draft stage and was mainly geared towards the control of a talking head, for which sample implementations were also provided. That is, the most detailed specification was available for the head and the face, while the gesture markup language only comprised a small set of six atomic behaviours (shrug, agree, disagree, concentrate, emphasise, smile).

3.2.2 Special-Purpose Applications

Languages that have been developed for specific purposes are, e.g., SiGML for sign language (Elliott et al., 2004) and MURML for the reproduction of gesture kinematics (Kranstedt and Kopp, 2002). Other examples are MPML for a presentation agent (Mori et al., 2003); RRL to represent information relevant at the interfaces of system components in a pipeline for generation of animated presentation dialogues (Piwek et al., 2002); APML (De Carolis et al., 2004), AML and CML (Arafa et al., 2002) for agent communicative behaviour; and BEAT for verbal and non-verbal synchronisation (Cassell et al., 2001). We will describe MPML, RRL and APML in greater detail directly in the following. MURML will be addressed when it comes to gesture coding (Sect. 4.3.2) and BEAT when we talk about multimodal natural language generation (Sect. 4.2.1).

The Multimodal Presentation Markup Language (MPML) aims at providing a language for easily creating interactive presentations with animated agents. Agents may be set up on the web and the user can interact directly with them. The general goal of MPML is that, unlike with most other web-based presentation agents, information is no longer presented sequentially; instead, its content is generated dynamically as the conversation between the agent and the user evolves (Mori et al., 2003). Furthermore, MPML has been designed for mouse control, voice control, text-to-speech and the description of agent actions (Tsutsui et al., 2000). A specialised scripting language, SCRipting Emotion-based Agent Minds (SCREAM), may be interfaced with MPML (Prendinger et al., 2004). SCREAM has been designed to create emotionally and socially appropriate responses of animated agents placed in an interactive environment. It is specialised in scripting the agent's mind and may be used within applications that compute the verbal content of the interaction between the user and the agent; its role is to compute the emotions that may arise during the conversation. The instantiation of the signals and their intensity for a given emotion is then computed taking into account many factors, such as the social setting of the conversation as well as the agent's mental state.

The Rich Representation Language, RRL, has been developed to manage interactive dialogue scenes between two or more virtual agents (Piwek et al., 2002). RRL is used as a link between a scene generator, a multimodal natural language generator, a speech synthesis component, a gesture assignment component and finally a media player. A scene description contains information related to the set of communicative acts and the temporal ordering of these acts. An affective reasoner is embedded in the scene description to compute the corresponding emotion that may be triggered by given acts. The emotion is defined by its type, intensity and optionally by the object that causes the emotion. A scene description is input to the multimodal natural language generator, which computes the corresponding linguistic and non-linguistic forms of the communicative acts. The role of the speech synthesis and gesture assignment components is to instantiate the acoustic and visual data for a given emotion and dialogue act. This results in an XML script representing the multimodal dialogue in which the verbal and non-verbal behaviour is fully specified and the integrated temporal specification of the various communication channels (speech, facial expression, gesture) is absolute. The script forms the common basis from which the specific animation directives required to drive individual players are derived via syntactic transformation.

The Affective Presentation Markup Language (APML) is based on a taxonomy of communicative functions proposed by Isabella Poggi (Poggi et al., 2000). A communicative function is defined as a pair (meaning, signal) where the meaning corresponds to the communicative value the agent wants to communicate and the signal is the behaviour used to convey this meaning. Communicative values are differentiated into four categories namely information about the speaker’s beliefs, intentions, affective state and mental state. APML tags correspond to the meaning of a given communicative function. The conversion from meaning to signals is done by looking up a library of meaning–signal pairs.
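Such a library can be thought of as a set of meaning–signal entries of roughly the following kind; the entries and the notation are invented for illustration and do not reproduce the actual APML tag set.

<!-- Illustrative meaning-signal library entries in the spirit of APML; not actual APML syntax -->
<entry meaning="performative:greet" signals="smile head_nod"/>
<entry meaning="affective:joy" signals="broad_smile raised_cheeks"/>
<entry meaning="belief:certain" signals="head_nod"/>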

Noot et al. (2004) developed GESTYLE, a complex representation language which is based on several dictionaries. Each dictionary reflects a certain aspect of the style of a character, e.g. cultural characteristics, profession or personality, and defines the association of meaning to signals. In addition physical information such as manner of gesturing (smooth, jerky, etc.) or tiredness can be specified. To create an agent with style one needs to specify this set of parameters (e.g. an Italian extrovert professor), and the proper set of mappings between meanings and signals is then instantiated.

To summarise, as we can see from the examples in this section, all proposed languages somehow model the relation between the ECA’s mental states (goals, beliefs, intentions, emotions) and their display via concrete verbal and non-verbal behaviours. The details addressed in the various languages, however, differ widely, as they depend on the particular application the ECA system is built for, and thus on the system components realised.

4 Inventory of Information Relevant in the Generation Process

As we have seen in the previous section, all the languages or representation formats presented so far model some specific mix of information required in the complex process of multimodal behaviour generation. Obviously, there is considerable overlap between different representations, but at the same time they are not directly compatible and cannot easily be adapted to the needs of individual ECA projects. Therefore, in the past, representations have been (re-)built over and over again. In this section, we take a step back and concentrate not so much on existing languages, but look at the kinds of information relevant at the different steps of the generation process.

Figure 1 depicts one way to conceptually organise the multitude of different processing steps and associated modules into fairly high-level blocks and helps to divide the overall task into separate sub-components. This grouping into three top-level modules, namely intent planning, behaviour planning and behaviour realisation, has been proposed by the SAIBA (Situation, Agent, Intention, Behaviour, Animation) framework (Kopp et al., 2006; Vilhjálmsson et al., 2007). All three boxes have to be understood as complex systems with a variety of sub-components. One of the main guiding principles in the design of the SAIBA architecture was to aim for a clear distinction between function and behaviour. FML thus stands for Functional Markup Language and BML for Behaviour Markup Language. Functional markup is to include all information regarding the agent's mental, communicative and affective state that is necessary to create a link between intent and behaviour planning. It needs to provide a large spectrum of information, including semantic, communicative, discursive, pragmatic and epistemic information. Behaviour markup comprises all those representations that are necessary for the realisation of behaviour. This includes textual and prosodic information, facial display, gestures and postures, eye gaze, etc. and, very importantly, directives for the temporal synchronisation of behaviours. This format nevertheless aims at specifying behaviours independently of specific behaviour realisation systems, in particular independently of concrete animation engines. In the following, we organise the discussion of information requirements and existing representation formats along the lines of the SAIBA architecture.

Fig. 1 Layout of the processing pipeline for the generation of ECA behaviour as proposed by the SAIBA framework (Kopp et al., 2006)

4.1 Intent Planning

In order to be able to specify an ECA's communicative behaviours, first of all the underlying intent needs to be determined. At this stage, the basic semantic units related to the communicative event are computed. There is no reference to any physical or verbal behaviour yet. All in all, intent planning requires the computation of the mental, affective and communicative state of the agent. One possibility for implementing the mental state of the agent is a BDI approach (Belief, Desire, Intention; Louis and Martinez, 2007). FML has not yet been defined in as much detail as BML. However, several proposals have been made for relevant concepts to be modelled during intent planning; see the collection of papers in Heylen et al. (2008). Contributions cover, amongst others, the specification of communicative actions (Kopp and Pfeiffer-Leßmann, 2008), different cognitive functions such as remembering or recalling (Mancini and Pelachaud, 2008), planning and regulating conversation (Lee et al., 2008), as well as emotional states (Krenn and Sieber, 2008).

4.1.1 Personality and Emotion

Personality and emotion are important aspects guiding the display of human behaviour. Emotions and emotion-related states influence the way communication proceeds, its wording, voice quality, facial expression and other aspects of bodily behaviour such as posture and the dynamics of gesture. While emotion is responsible for temporary changes in the quality of expression, personality determines the global tendencies of an individual's expression and thus may function as a means to establish coherency and consistency in the behaviour of an individual so that it becomes more predictable for the observer (Ortony, 2003). The expression of joy in an extrovert person, for instance, is overall much more pronounced than in an introvert person.

In the ECA community two approaches to personality and emotion are widely employed, the Five-Factor (OCEAN) model (Wiggins, 1996) and the OCC model (Ortony et al., 2003), respectively. This is reflected in the inventory of several representation languages for ECAs: e.g. PAR (Allbeck and Badler, 2002) makes provisions for OCC and OCEAN; MPML incorporates labels for the 22 emotions defined in OCC (Zong et al., 2000); RRL encodes the OCC labels plus their extension by Elliott (1992), a subset of OCEAN labels and a politeness attribute; APML defines its own set of emotion labels geared towards the communication situation (medical counselling) and the type of ECA (a talking head) used.

OCC is an example of an appraisal model. It defines emotions as positive and negative reactions to events, to other characters' or people's actions and to objects. Events are evaluated with respect to their desirability, actions according to their praiseworthiness, and objects in terms of their appeal to the agent. The subjective appraisal of a situation depends on the goals, standards and attitudes of the agent. Whereas attitudes are long-term affective states, emotions have a strong onset but diminish over time. The latter is typically modelled via a decay function (Gebhard et al., 2003). Gebhard (2005) has presented a computational model that integrates emotion, mood and personality, standing for short-, mid- and long-term aspects of affect, respectively. As the terminology to describe human emotional life is manifold, we refer the reader to the opening chapter of the handbook (Cowie et al., 2010) for a discussion of concepts and definitions.
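The cited works do not prescribe a single decay function; one simple choice, given here purely as an illustration, is exponential decay of the emotion intensity I after its elicitation at time t_0:

I(t) = I_0 · e^{−λ (t − t_0)}

where I_0 is the intensity assigned at elicitation time and λ controls how quickly the emotion fades back towards the baseline mood.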

Modelling of emotion comes into play in intent and behaviour planning as well as in behaviour realisation. Whereas in intent planning appraisal models have been shown to be well suited (Ortony et al., 2003; Egges et al., 2003; Gebhard et al., 2003; Gebhard, 2005; Rank and Petta, 2005; Ochs et al., 2008; Marsella and Gratch, 2009), basic emotion categories (Ekman, 1993) are still the predominant representation when it comes to behaviour realisation. As Ekman's original research on basic emotions focused on facial expressions, it is unsurprisingly still very influential in the field of facial animation for ECAs. Alternatively, dimensional models (Scherer, 2000) have been successfully employed for modelling emotional speech (Schröder et al., 2001; Schröder, 2004).

Personality models have been integrated into agents to model behaviour tendencies as well as intent planning (Moffatt, 1995; André et al., 2000; Johns and Silverman, 2001; Ball and Breese, 2000; Egges et al., 2004; Kshirsagar and Magnenat-Thalmann, 2002). The Five-Factor model of personality (McCrae and Costa, 1996) is used in most of the cited works. The interplay between personality and emotion has also been studied: Moffatt (1995) views personality and emotion as similar states that differ in time and duration. Moreover, personality ensures coherency of reactions to similar events, i.e. the emotional responses of an individual to such events are coherent over time. See also Ortony (2003).

4.1.2 Dialogue

Modelling of affect and personality is not only required for generating believable non-verbal behaviour, but also influences the agent's dialogue; see Piwek (2003) for a survey. Even more importantly, automatic dialogue generation requires a representation of the domain, as it determines to a large extent what the virtual actors can talk about. Moreover, behaviour, including dialogue, adheres to social conventions. Depending on the social relationship between the communication partners and the formality or informality of the situation, things are said differently and different display rules for body behaviour apply (Walker et al., 1996; Rehm and André, 2005; De Carolis et al., 2001; Niewiadomski and Pelachaud, 2007; Ball and Breese, 2000); e.g. people would normally avoid crying in a business meeting, whereas such behaviour is much more likely in a private, intimate setting. Which kinds of behaviour are socially acceptable and which are not strongly depends on cultural conventions. Just think about what is considered good table manners in Europe as opposed to China. For instance, smacking and burping while eating is considered rude in Europe, but may be expected in China as an indicator of appreciation of the food.

For modelling the communicative state, some dialogue planner is required that generates the initial version of a dialogue as a sequence of dialogue acts. A dialogue act represents an abstract communicative function, such as requesting information, answering a question or giving feedback. Such communicative functions can be realised in many different ways depending, for example, on the personality and affective state of the actor. The structures produced by the dialogue planner represent communicative strategies that can be observed in a particular genre or domain and (partial) plans of how the communication should proceed. These plans include choice points at which the communication proceeds differently depending on input from the outside world. This can be utterances from the communication partner(s), but also events occurring in the environment. van Deemter et al. (2008), e.g., describe the plan generation process for whole scenes of car sales dialogues enacted by two virtual characters, a customer and a seller.
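For illustration, a fragment of such a plan with a choice point might be written roughly as follows; the element names and values are invented and are not taken from any of the cited systems.

<!-- Illustrative dialogue-act plan fragment with a choice point; notation invented -->
<dialoguePlan domain="car_sales">
  <act id="d1" type="request_info" speaker="customer" addressee="seller" topic="fuel_consumption"/>
  <act id="d2" type="answer" speaker="seller" addressee="customer" topic="fuel_consumption"/>
  <choice on="customer_reaction">
    <case value="satisfied"><act id="d3" type="make_offer" speaker="seller"/></case>
    <case value="sceptical"><act id="d4" type="elaborate" speaker="seller" topic="fuel_consumption"/></case>
  </choice>
</dialoguePlan>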

Presenter agents of this kind (see also André et al., 2000) realise a very specific case of dialogue in which the whole dialogue is planned in one go, depending on the initial settings given by the user. While the dialogue proceeds, no intervention from the user is possible. Carmen's Bright IDEAS (Marsella, 2003), FearNot! (Hall et al., 2006) and Façade (Mateas and Stern, 2003) are examples where the way the story proceeds depends on the user's contributions to the dialogue.

4.2 Behaviour Planning

Given a particular intention and/or emotional state the agent aims to communicate, the system needs to decide which non-verbal signals will be used. The behaviour planner takes as input a given intention and/or emotional state, for instance to greet the communication partner happily, and outputs representations of the visual and acoustic signals to be generated by the behaviour realisation modules, such as a waving hand gesture, a broadly smiling face and a greeting utterance, for instance "Hello my friend! It is great to see you!". The behaviour planner has to instantiate into signals the communicative acts which were generated during the intent planning phase and which are defined in terms of their meaning, in our case greeting with the emotional colouring happy, and then decide which modalities (facial expression, gaze, gesture, posture, voice, etc.) will be used to convey the particular meaning. Apart from selecting the modalities to convey meaning, proper synchronisation between the modalities is crucial. Examples of ECA systems that use such an approach are, for instance, the Greta behaviour engine (Pelachaud, 2005), BEAT (Cassell et al., 2001), SmartBody (Thiébaux et al., 2008) and MAX (Kopp and Wachsmuth, 2004). The task of behaviour planning thus comprises at least the sub-tasks of multimodal planning and multimodal alignment.

Going back to Example 2, multimodal planning accounts for modelling the communication channels face, gaze, gesture and utterance expressed in c1 to c5, whereas multimodal alignment takes care of the timing of the expressions in the different channels relative to each other, which we have modelled with the start–end mechanism. Absolute timing is only available at the stage of behaviour realisation, when the speech is synthesised and concrete animations have been selected.

4.2.1 Multimodal Behaviour Planning

Non-verbal and verbal behaviour needs to be tightly integrated. Thus, in the most sophisticated ECA implementations, planning of non-verbal behaviour is coupled with natural language generation (NLG), leading to multimodal natural language generation (MNLG). In NLG the overall generation task is traditionally divided into two separate phases. In strategic generation it is decided "what to say", i.e. which propositions are to be expressed, in a still language-independent representation. Tactical generation then deals with "how to say it", i.e. it is responsible for coming up with the concrete wording.

As in NLG, the multimodal generation process is divided into a planning phase, where the communicative acts are semantically outlined, and a realisation phase, where the behaviours that are going to be actually displayed are specified. Gestures are planned on the basis of the semantic and pragmatic content of the natural language utterances and are aligned with the respective representations of the utterance. At this stage, information on the concrete realisation of the body behaviour as well as the surface realisation of the utterances, i.e. the concrete wording, is still under-specified. The idea of intertwining gestural and syntactic structure has been proposed in different works. Cassell et al. (2000) describe a mechanism for applying the SPUD natural language generator (Stone et al., 2000) to multimodal generation. SPUD makes use of the "Lexicalized Tree Adjoining Grammar" (LTAG) formalism (Joshi et al., 1975; Schabes, 1990) and integrates the natural language grammar with motion events. Integration of gestures and syntax is particularly suitable for gestures that can express semantic content and therefore present an alternative to the linguistic expression of the same content. For instance, if one wants to express that some X has a square shape, one could say "X is square", "X is of square shape", etc. However, one could as well say "X looks like this" and produce a gesture depicting a square. Gestures can also be used to express discourse functions. For instance, a question can be accompanied by an eyebrow raise or a head tilt, assertions by a head nod, etc. van Deemter et al. (2008) describe an approach to MNLG which is based on typed feature structures representing deep syntactic, semantic and pragmatic content of dialogue acts and referring expressions. For the semantic representations, Discourse Representation Theory is employed (Kamp and Reyle, 1993). An extra module associates gestures and body postures with specific dialogue acts.

4.2.2 Alignment of Multimodal Behaviours

At the stage of behaviour planning, signals across modalities must be aligned with each other. The most prevalent task is the temporal synchronisation between verbal and non-verbal signals, i.e. typically between eye gaze, facial expression, gestures and speech. Emphasis, e.g., is usually encoded via the synchronisation of an eyebrow movement and a beat gesture with the stressed syllable of the word to be emphasised. Speech, gaze and deictic gestures are also typically synchronised. Take for example an utterance such as "give me this cake", which might be accompanied by an eye gaze and a deictic gesture towards the cake, with gaze and pointing being aligned with the phrase "this cake". This example shows that apart from the temporal alignment across channels, the spatial alignment of movements and objects in the world also needs to be handled. Getting back to the example, there is the cake as a target of pointing and gaze, and there is the addressee of the utterance, who also needs to be looked at in order to establish and maintain the communication channel. In order to enable the behaviour realisation modules to take care of the actual synchronisations, the representations at the level of behaviour planning should make provisions for appropriate synchronisation points. Knowledge about the location of stressed syllables and sentence accents, e.g., is most crucial for proper synchronisation in the speech modality. For gestures, the exact placement of the stroke phase, which is defined as the most meaningful part of the gesture, is usually essential. Figure 2 depicts the especially fine-grained inventory of synchronisation points offered by BML. The purpose of the different points is explained as follows:

The preparation for or visible anticipation of the behaviour occurs between start and ready, and the retraction back to neutral or previous state occurs between relax and end. The actual behaviour takes place between the ready and relax, with the most significant or semantically rich motion during the stroke phase, between stroke-start and stroke-end, with the greatest effort coinciding with the stroke point. (…) If no preparation or relaxation is needed, then start and ready refer to the same point in time, and relax and end refer to the same point in time. Quoted from Kopp et al. (2006)

Fig. 2 The synchronisation points of a communicative behaviour (Kopp et al., 2006)
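For illustration, a BML-style fragment might align a beat gesture and an eyebrow raise with a synchronisation point inside an utterance roughly as follows. The snippet follows the style of the published BML drafts, but element and attribute names may differ from later versions of the specification.

<!-- BML-style sketch (after Kopp et al., 2006); exact element/attribute names may differ across BML versions -->
<bml id="greet1">
  <speech id="s1">
    <text>Good to <sync id="tm1"/>see you!</text>
  </speech>
  <!-- the stroke of the beat gesture is aligned with sync point tm1 in the utterance -->
  <gesture id="g1" type="beat" stroke="s1:tm1"/>
  <!-- the eyebrow raise spans the stroke phase of the gesture -->
  <face id="f1" type="eyebrow_raise" start="g1:stroke-start" end="g1:stroke-end"/>
</bml>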

4.3 Behaviour Realisation

Behaviour realisation is concerned with generating the concrete realisation of the behaviours. This component deals with tasks such as selecting the single most appropriate deictic gesture from a repository of candidate gestures to be finally realised by the media player, realising a very specific facial expression, generating speech on the basis of natural language text and replacing the relative timing of the synchronised behaviours from the planning phase with absolute time values. To date, it is common in ECA implementations to use speech as the guiding medium for temporal alignment. In the best case, the granularity of temporal information in the speech channel goes down to the level of phoneme durations (in milliseconds), though in many applications only information on the location of word boundaries is employed. With the availability of fine-grained prosodic information, facial and gesture animation can be time-aligned to individual phonemes, accented syllables or boundaries of intonation phrases.

What is still missing is an integrated approach where not only speech defines the timing of its accompanying facial expressions and gestures, but also motor activation constrains voice quality, e.g. to lengthen the duration of a prosodic phrase and postpone the onset of its successor in order to wait for an accompanying gesture to finish.

4.3.1 Speech

Text-to-speech (TTS) systems take as input text, possibly annotated with additional information. The TTS determines pronunciation and prosodic properties, such as the location and type of pitch accents, prosodic phrase boundaries and the duration of phonemes. Based on this information, sound files are generated. For generating multimodal behaviour, the TTS should provide fine-grained temporal information, such as the list of phonemes and the location of phoneme boundaries to allow for the synchronisation of visemes (mouth shapes related to sounds/phonemes), or the location of accented syllables to allow for the exact temporal alignment of beat gestures. Though this information is at some point available within virtually every TTS system, there is no standard way to gain access to it, if it is accessible at all. Unfortunately, many of the commercial products do not provide interfaces to this sort of information, and only a few of the research-related systems provide access as easily as, e.g., the Mary system (Schröder and Trouvain, 2004) or the Festival TTS (Black and Taylor, 1997).

The W3C Speech Synthesis Markup Language (SSML) has been developed to assist the generation of synthetic speech. It provides a standard way to mark up text in order to control aspects of speech such as pronunciation, volume, pitch and rate, and it is supported by a variety of speech synthesis systems. SSML can thus be seen as an example of a success story when it comes to the specification of a unimodal markup language. This standard, however, only provides a very rough interface both to and from the synthesis engine. For instance, the only feedback mechanism is the insertion of event-throwing tags in the text, which limits the temporal granularity of the feedback to the level of orthographic words. For the purpose of multimodal generation, in addition to SSML, some standardised speech synthesis output format would be highly desirable in order to make the TTS's internal decisions on pronunciation, timing, pitch, accenting, phrasing, etc. accessible to other components.
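For instance, the emphasis and the marker m1 from Example 2 can be expressed directly in SSML; the <mark> element is the event-throwing tag referred to above.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <s>
    <emphasis>Hello</emphasis> my friend!
    <mark name="m1"/>
    <emphasis>Good</emphasis> to see you after such a <emphasis>long</emphasis> time!
  </s>
</speak>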

As previously mentioned, speech is usually the leading modality, providing the timeline for the other modalities (face, gaze, body, gesture) to align with. This, however, needs to change in favour of models where the timing of speech is sensitive to restrictions from other modalities, posing additional challenges for the specification of proper interfaces to and from the speech synthesis component. One example is the generation of explanations while manipulating objects, where the manipulations become the leading modality and pauses between intonation phrases are adapted accordingly. In this case, control of sub-sentential chunks has to be guaranteed (Kopp and Jung, 2000). Another possible application where a demand for increased temporal control is foreseeable is the implementation of immediate reactivity, e.g. an ECA interrupting an utterance as an instant reaction to a barge-in or to some other observed user behaviour.

4.3.2 Gestures

Gestures are complex, being composed of one or a sequence of basic gesture elements, each of which describes a basic hand–arm movement trajectory. A trajectory is defined by a sequence of key points, where each point corresponds to a position of the wrist in 3D space and a hand configuration. Depending on the granularity of representation, a gesture spans one or more phases, e.g. preparation, stroke, hold, retract; see also Fig. 2. Methods for encoding gestures can be classified on a continuum ranging from purely semantic representations (related only to the meaning of the gesture) to formats which encode the form of the gesture exclusively. Most existing representation languages for computational systems are founded upon annotation systems in psychology and sign language research. In the following, we briefly review some of the foundational works and then give an outline of scripting languages used for ECA systems. A more detailed review of existing gesture coding schemes can be found in Serenari et al. (2002).

McNeill (1992) provides a semantic classification into iconic, metaphoric, deictic and beat gestures. To localise gestures, a grid-like gesture space is introduced in front of the actor. The descriptions of gesture form are holistic-imagistic, though, and not readily adaptable to automatic processing, because the shape of a gesture, especially in the case of iconic and some metaphoric gestures, refers to the meaning of the gesture. For example, a cup-shaped hand in certain contexts (for instance when visitors come to the house in France) carries the meaning "to offer something to the guests", as it is interpreted as representing a bowl of food. Making use of this representation for behaviour realisation requires instantiating the rough semantic classification into concrete representations, such as shape descriptions. Calbris (1990) thus describes gestures by the meaningful form of their components' morphology (segment, configuration, orientation, localisation and movement). Components may be linked to physical properties; for example, the flat vertical hand held between the speaker and the listener to mean stop symbolises the erection of a wall between the speaker and the listener so as to show refusal. This act of refusal can also be performed by throwing the head backward or even turning the head away. Calbris (1990) describes how gesture variants can be gathered as a class of gestures carrying one meaning. Conversely, a gesture may be associated with several meanings depending on the context it occurs in. Stokoe (1978) introduces the concept of breaking gestural configurations down into formational parameters such as location and orientation of the wrist and hand shape. A hand shape is described by a thumb orientation and the shapes of the other four fingers. The more recent HamNoSys notation framework (Prillwitz et al., 1989; Hanke, 2004), originally developed for sign languages, follows this breakdown into formational parameters and provides a dictionary of the most frequently used configurations.

MURML (Kranstedt and Kopp, 2002) is an XML-based description language which has been influenced by HamNoSys. It allows for detailed control of parallel and sequential components of gestures, whereas hand shapes and facial expressions are specified by simple labels. Several parameters have been defined, such as the wrist location in space, the palm and finger orientation, the hand shape and the wrist trajectory. Values of these parameters create hand and arm configurations. Such configurations must be described for each phase of a gesture (preparation, pre-stroke hold, stroke, etc.). Timing information related to the duration of a gesture or to the temporal constraints between body parts involved within the gesture can be provided. MURML thus allows for a precise description of behaviour. Hartmann et al. (2002) describe a language for ECA animation that unites features of McNeill and HamNoSys. FORM2 by Martell et al. (2003) is an annotation scheme that captures the exact configuration and orientation of the arms and hands of a gesturer. It focuses on a detailed description of the geometrical properties of gestures. A corpus of annotated video material is available.
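For illustration, a form-based description of a single gesture phase in the spirit of MURML or HamNoSys might specify the parameters listed above roughly as follows; the element names are invented and do not reproduce the actual MURML syntax.

<!-- Illustrative form-based description of one gesture phase; not actual MURML syntax -->
<gesturePhase type="stroke">
  <handShape value="flat_open"/>
  <palmOrientation value="up"/>
  <fingerOrientation value="away_from_body"/>
  <wristLocation region="centre_right" height="chest"/>
  <wristTrajectory shape="straight" direction="rightwards"/>
</gesturePhase>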

Scripting Technology for Embodied Persona, STEP (Huang et al., 2004), works for ECAs based on H-Anim. It offers a set of sensors and effectors through which agents can perceive the world they are placed in and take appropriate actions. STEP includes two main primitive actions (turn, move) to specify body movement. The first defines the rotation to apply to a given joint of the virtual agent, while the second relates to the displacement of one effector. More complex actions can be obtained by combining these primitive actions with three operators: seq, the sequential operator; par, the parallel operator; and T, the repeat operator.
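A simple hand wave could then be sketched roughly as follows in an XSTEP-like notation; the element and attribute names are approximations used for illustration, not the exact XSTEP vocabulary.

<!-- Illustrative XSTEP-flavoured action; names and values approximate -->
<action name="wave_right_hand">
  <seq>
    <turn part="r_shoulder" direction="front_up" speed="fast"/>
    <par>
      <turn part="r_wrist" direction="left" speed="fast"/>
      <turn part="head" direction="towards_user" speed="slow"/>
    </par>
  </seq>
</action>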

4.3.3 Facial Expression

As with the other kinds of information relevant in multimodal behaviour generation, several coding schemes to describe facial expressions have also been devised. MPEG-4, for instance, is an ISO/IEC standard which defines specifications for the animation of face and body models within an MPEG-4 terminal (Doenges et al., 1997; Ostermann, 1998; Pandzic and Forchheimer, 2002). Two sets of parameters describe and animate the 3D facial model: the facial animation parameters (FAPs) and the facial definition parameters (FDPs). FDPs define the shape of the model, while FAPs define the facial actions. FAPs represent a large set of basic facial actions, including head, tongue, eye and mouth movements; in combination they represent facial expressions. Facial expressions may also be coded using the Facial Action Coding System (FACS) developed by Ekman and colleagues (Ekman and Friesen, 1978; Ekman et al., 2002). It is a framework to measure facial signals using minimal action units (AUs). With FACS, facial action units can be encoded on a scale of five intensities; behaviour changes along this scale are carefully described. Paradiso and L'Abbate (2001) have established an algebra to create facial expressions, elaborating operators that combine and manipulate facial expressions. Another definition language has been proposed by De Carolis et al. (2004) and by Paradiso and L'Abbate (2001). In their language, an expression may be defined at a high level (a facial expression is a combination of other, already pre-defined facial expressions) or at a low level (a facial expression is a combination of facial parameters). The low-level facial parameters correspond to the MPEG-4 FAPs. The language is also suitable for creating easily extendable facial display dictionaries.
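An entry in such a facial display dictionary might, for illustration, combine low-level parameters into a named expression roughly as follows; the wrapper elements are invented, and the parameter names only approximate the MPEG-4 FAP set.

<!-- Illustrative facial display definitions; wrapper elements invented, FAP names approximate MPEG-4 -->
<expression name="broad_smile">
  <fap name="stretch_l_cornerlip" intensity="0.8"/>
  <fap name="stretch_r_cornerlip" intensity="0.8"/>
  <fap name="raise_l_i_eyebrow" intensity="0.2"/>
  <fap name="raise_r_i_eyebrow" intensity="0.2"/>
</expression>
<expression name="happy_greeting">
  <!-- high-level definition as a combination of pre-defined expressions -->
  <combine expressions="broad_smile eyebrow_raise"/>
</expression>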

5 Towards a Common Framework for Representations in Multimodal Behaviour Generation

The non-exhaustive overview above provides an impression of the amount of effort that has been invested up to now in the design of representation formats. In the last decade we have seen the development of a significant number of ECA systems and applications. In parallel, we have experienced the publication of an almost equal number of usually XML-based markup and representation languages – cf. Arafa et al. (2003) for an impressive list of such languages. To some extent the proliferation of representation languages can also be explained by the enthusiasm about XML in its early heyday. Employing XML came with an implicit promise of reusability and ease of application, and, in addition to presenting ECA systems, the associated representation formats suddenly became a topic of interest worth publishing. Though there are identifiable similarities between existing representation languages, there are still not many examples where a sharing of representation formats – not to speak of software – has taken place. Considering the work that has been put into this topic, there is a significant lack of formats that ever got reused outside their original institution or ECA project. This lack of success in terms of acceptance in the community is most obvious for VHML and especially HumanML, which were explicitly designed with the aim of becoming standards. For other languages, reusability might not have been the primary goal in the first place, as they were designed to fit the needs of a concrete ECA application. But even if the main motivation for existing representations might not have been to trigger and sustain the development of reusable and exchangeable system components via the specification of open interface formats, there is an obvious demand for such formats.

The implementation of ECA systems requires expertise in such diverse research topics as emotion modelling, behaviour planning, natural language generation, speech synthesis and computer animation. Only for some of these tasks are off-the-shelf modules available; other components are still in their early stages of research and development. Given the overall complexity of ECA systems, exchangeable sub-components that would allow for a plug-and-play approach to system development would clearly be desirable. Due to the lack of common standards and architectures, similar functionalities need to be implemented over and over again for different systems. On second thought, the problem can also be stated the other way round: it is not so much the lack of standardised interfaces which prevents the development of common software modules, but the lack of widely used system components which hinders the establishment of common representation languages. There is a mutual dependency between interface specifications and the availability of system components, but of course software modules that are both useful and usable for a broader group of developers do indeed provide a high incentive to promote the interface formats connected with these components. These considerations have led to a renewed interest in the development of common interfaces, which should then foster the development of system components that could be shared and reused among different ECA projects, thus avoiding the replication of effort. An exemplary activity in this direction is the work of the SAIBA initiative, where several research groups have joined forces to come up with a commonly agreed-upon framework for multimodal behaviour generation.
The goal of this initiative is to develop common specifications for representation languages, which are meant to be application independent and graphics-model independent, and to present a clear-cut separation between information types (function versus behaviour specification). Intermediate results of this joint endeavour are documented in Kopp et al. (2006) and Vilhjálmsson et al. (2007). Experience with earlier initiatives in this direction, like VHML, provides strong evidence that the success or failure of such a representation format is tightly coupled to the availability of software components that actually provide an immediate benefit for the system developer. This, of course, is a chicken-and-egg problem not easily solved. We thus re-evaluate exemplary sub-topics for which we think that shared representations and jointly developed software modules could succeed in the intermediate future. It is not by chance that the work within the SAIBA initiative has by now mostly focused on issues concerning the specification of BML, i.e. on schemes that deal with the encoding of behaviours. Though still complicated enough, there is a joint understanding of the concepts necessary to describe the communicative actions of the human body. Human anatomy and the specifics and needs of existing animation techniques and speech synthesisers help to guide the development of BML. For FML, i.e. the encoding of functional categories, it is much more difficult to come up with general, application-independent representations. Among the information types affiliated with the functional domain in SAIBA, the representation of emotions is a prime candidate for the development of a joint representation format; see the chapter "Representing Emotions and Related States in Technological Systems" by Schröder et al. (2010) in this part.

One important issue in the current development of BML is the specification of non-verbal communicative behaviour. Existing schemes which use joint angles and segment translations, such as MPEG-4 or BVH (Pandzic and Forchheimer, 2002), do provide exact and detailed physical information on body shapes. Nevertheless, they are viewed as far too specific: they lack flexibility in the modification of temporal and spatial properties, e.g. they run into collision problems if body proportions are changed. Also, functional information, e.g. the identification of stroke phases, which is crucial for the proper temporal alignment across modalities, is completely missing. Coming up with a common higher-level format for the representation of facial expression, gesture and posture, which would function as a sort of middle layer between the specification of intentions and the formats used for actually rendering body movements, is by no means a trivial task. But still the problem seems to be confined in a tractable way. There is much agreement on the overall requirements for such a representation, and the development can partly be based on experience gained with existing languages like MURML and FORM. The prospect of coming up with a representation for the physical appearance and bodily actions of an ECA that is independent of individual animation engines is obviously appealing. A strong motivation for working on such a representation for behaviours has been the idea of using it to provide repositories of communicative gestures, for which the terms gesticon and gestuary have been proposed (Krenn and Pirker, 2004), i.e. collections of gestures reusable in different ECA systems. At the same time, the immediate requirement of supplementing the representation format with concrete software that actually provides non-trivial functionality becomes evident. In the terminology of the SAIBA architecture these would be, e.g., behaviour realisation modules which interpret BML representations and render them to different animation engines. As long as no components for interpreting and transforming to player-specific code are available, this intermediate representation does not provide any additional functionality, and developers of ECA systems will skip it and stick to their own representations.

When it comes to representation formats for speech synthesis, we are facing a special situation. Speech synthesis is the one domain where exchangeable off-the-shelf components actually do exist. Basically all these systems provide an input interface and an output interface that are universally accepted, namely text and audio files, respectively. Even if developers of ECA systems might not always be happy with the quality of the synthesiser's output, these systems deal with a clearly defined, specialised and complex task, and not many developers feel inclined to meddle with this functionality themselves. Open issues when it comes to ECAs and speech synthesis are missing standards for specifying emotion and the missing ability of incremental speech synthesis, which would be desirable for truly interactive systems that, e.g., react to interruptions by the user in mid-sentence. But there is another issue concerning representation formats and speech synthesis that is not so much a technical problem. In most implementations of multimodal systems, speech is the leading modality which provides the temporal grid to which the other modalities (lip movements, gestures) are synchronised.
Information on the temporal locus of stressed syllables, phonemes, accents, etc. is thus crucial, but in most TTS systems these data are not made available to the user. This is not due to technical reasons; the information is usually suppressed simply because the existing demand for it goes unrecognised. The promotion of a standardised format for this kind of information, i.e. a kind of speech output format, could increase the awareness of TTS suppliers of this information demand by ECA developers. As stated above, not only does the fate of any representation format rely on design factors such as expressive power, flexibility and ease of application, but its acceptance in a wider community also strongly depends on the availability of implementations that actually support the creation and interpretation of the proposed format.

We conclude this contribution with a short survey of possible insights which could either be gained from or shared with research domains outside the narrow ECA paradigm. Descriptive schemes used for ECAs of course have much in common with coding schemes that describe the behaviour, physical appearance or emotional states of real humans, and ECA developers have adapted coding schemes originally designed for humans in the past. FACS (Ekman and Friesen, 1978; Ekman et al., 2002), the coding scheme for facial expressions, was developed in the context of psychological and anthropological research and was very influential for specifying codes for facial animation. HamNoSys (Hanke, 2004), a representation format for sign language, has been adapted for the encoding of gestures in ECAs. Descriptive schemes for gestures have also been developed for the manual annotation of multimodal conversational data, e.g. ANVIL (Kipp, 2001) and CoGest (Trippel et al., 2004). Demands for a common gesture description language have also been brought up in the field of automatic gesture recognition and interpretation (Kölsch and Martell, 2006). Another source of inspiration and possible synergies is the gaming industry. Yue and de Byl (2006) present a standardisation initiative for AI components used in games. They deal with problems related to intent and behaviour planning rather than to behaviour realisation, e.g. path finding and steering in a game environment, and action planning. On an abstract level these are functionalities that resemble those of the behaviour planning component in SAIBA. There are insights to be gained from the way this standardisation process is organised. One interesting aspect is that this initiative does not bother with XML formats but deals with the specification of Application Programming Interfaces (APIs), i.e. it formulates its interfaces in directly implementation-related terms.

The gaming industry also provides examples of how pseudo-standards may actually emerge from the spreading of tools and vice versa. BVH (Biovision Hierarchy) is a graphics format developed for storing motion-captured data. Tools for creating key-frame animations in BVH are also emerging, e.g. Cal3D, an open source character animation library, and plug-ins for exporting the BVH format from widely used commercial graphics programs such as 3D Studio Max. The BVH format is also used for animating avatars in the virtual world of Second Life and is an example of how an application provides the incentive for the development of tools and thus for the proliferation of specific representation formats. It remains to be seen whether a similar momentum can be gained in the development of representation formats for ECAs in the future.