Keywords

Introduction

Do gestures communicate? Yes, they do. This has been the conclusion of several meta-studies on the impact of gesture (Goldin-Meadow 2005; Hostetter 2011; Kendon 1994). It is also one of the distinguishing features of gestures in animation. While all movement communicates to some degree, gestures often play a role that is explicitly communicative. Another distinguishing feature of the gestures we are most often interested in is that they are co-verbal. That is, they occur with speech and are inextricably linked to that speech in both content and timing. McNeill argues that gestures and language are not separate; rather, gestures are part of language (McNeill 2005).

There are different forms of movement that can broadly be called “gesture.” Building on the categories of Kendon (1988), McNeill defined “Kendon’s Continuum” (McNeill 1992, 2005) to capture the range of gesture types people employ:

  • Gesticulation: gesture that conveys a meaning related to the accompanying speech.

  • Speechlike gestures: gestures that take the place of a word or phrase in a sentence.

  • Emblems: conventionalized signs, like a thumbs-up.

  • Pantomime: gestures that tell a story and are produced without speech.

  • Sign language: full languages in which signs are lexical words.

As you move along the continuum, the degree to which speech is obligatory decreases, and the degree to which gestures themselves have the properties of a language increases. This chapter will focus on gesticulations, gestures that co-occur with speech, as they are most relevant to conversational characters. Synthesis of the whole spectrum, however, presents worthwhile animation problems. Emblems and pantomimes are useful in situations where speech may not be possible. Sign languages are the native language of many members of the deaf community, and sign synthesis can increase their access to computational resources. The problems of gesticulations are unique, however, since they are co-present with speech and do not have linguistic structure on their own.

Kendon introduced a three-level hierarchy to describe the structure of gestures (Kendon 1972). The largest structure is the gesture unit. Gesture units start in a retraction or rest pose, continue with a series of gestures, and then return to a rest pose, potentially different from the initial rest pose. A gesture phrase encapsulates an individual gesture in this sequence. Each gesture phrase can in turn be broken down into a sequence of gesture phases. A preparation is a motion that takes the hands to the required position and orientation for the start of the gesture stroke. A prestroke hold is a period of time in which the hands are held in this configuration. The stroke is the main meaning-carrying movement of the gesture and has the most focused energy. It may be followed by a poststroke hold in which the hands are held at the end position. The final phase is a retraction that returns the hands to a rest pose. All phases are optional except the stroke. There are some gestures in which the stroke does not involve any movement (e.g., a raised index finger). These are variously called an independent hold (Kita et al. 1998) or a stroke hold (McNeill 2005). The pre- and poststroke holds were proposed by Kita (1990) and act to synchronize the gesture with speech. The prestroke hold delays the gesture stroke until the corresponding speech begins, and the poststroke hold occurs while the corresponding speech is completing. Much like they allow mental processing in humans, they can be used in synthesis systems to allow time for planning or other processing to take place.
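
This hierarchy maps naturally onto a nested representation in a synthesis system. The sketch below is a minimal illustration of one possible encoding, not the data model of any published system; the class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PhaseType(Enum):
    PREPARATION = "preparation"
    PRE_STROKE_HOLD = "pre-stroke hold"
    STROKE = "stroke"            # the only obligatory phase
    POST_STROKE_HOLD = "post-stroke hold"
    RETRACTION = "retraction"


@dataclass
class GesturePhase:
    phase_type: PhaseType
    start: float   # seconds, relative to the start of the utterance
    end: float


@dataclass
class GesturePhrase:
    """A single gesture: optional preparation, holds, and retraction around one stroke."""
    phases: List[GesturePhase]

    @property
    def stroke(self) -> GesturePhase:
        # Every phrase must contain exactly one stroke (possibly a stroke hold).
        return next(p for p in self.phases if p.phase_type is PhaseType.STROKE)


@dataclass
class GestureUnit:
    """A sequence of phrases between leaving a rest pose and returning to one."""
    phrases: List[GesturePhrase] = field(default_factory=list)
```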

The existence of gesture units is important for animation systems as it indicates a potential need to avoid generating a sequence of singleton gestures that return to a rest pose after each gesture. While this would offer the simplest synthesis solution, people are quite sensitive to the structure of gestural communication. A study (Kipp et al. 2007) showed that people found a character that used multi-phrase gesture units more natural, friendly, and trustworthy than a character that performed singleton gestures, which was viewed as more nervous. These significant differences in appraisal occurred despite only 1 of 25 subjects being able to actually identify the difference between the multi-phrase and single-phrase gesture-unit clips. This illustrates what appears to be a common occurrence in our gesture research: people will react to differences in gesture performance without being consciously aware of what those differences are.

Gestures are synchronized in time with their co-expressive speech. About 90% of the time, the gesture occurs slightly before the co-expressive speech (Nobe 2000) and rarely occurs after it (Kendon 1972). Research on animated characters does indicate a preference for this slightly earlier timing of gesture, but also suggests that people may not be particularly sensitive to errors in timing, at least within a ±0.6 second range (Wang and Neff 2013).
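
As a toy illustration of how a synthesis system might apply these findings, the following sketch places a stroke onset slightly before the onset of the co-expressive word and clamps any offset to the tolerance reported above; the default lead time is an assumed value, not one taken from the cited studies.

```python
def schedule_stroke(word_onset: float, lead: float = 0.2, tolerance: float = 0.6) -> float:
    """Return a stroke onset time (seconds) slightly before the co-expressive word.

    `lead` is a hypothetical default lead time; the cited studies report that
    gestures usually precede their co-expressive speech and that viewers tolerate
    timing errors within roughly +/- 0.6 s.
    """
    stroke_onset = word_onset - lead
    # Clamp so the resulting offset never exceeds the reported tolerance.
    return max(word_onset - tolerance, min(word_onset + tolerance, stroke_onset))


print(schedule_stroke(3.4))  # word at 3.4 s -> stroke begins at 3.2 s
```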

A number of categorizations of gesture have been proposed. One of the best known is from McNeill and Levy (McNeill 1992; McNeill and Levy 1982) and contains the classes iconics, metaphorics, deictics, and beats. Iconic gestures create images of concrete objects or actions, such as illustrating the size of a box. Metaphorics create images of the abstract. For instance, a metaphoric gesture could make a cup shape with the hand, but refer to holding an idea rather than an actual object. Metaphoric gestures are also used to locate ideas spatially, for instance, putting positive things on the left and negative to the right and then using this space to categorize future entities in the conversation. Deictics locate objects and entities in space, as with pointing, creating a reference and context for the conversation. They are often performed with a hand that is closed except for an extended index finger, but can be performed with a wide range of body parts. Deixis can be abstract or concrete. Concrete deixis points to an existing reference (e.g., an object or person) in space, whereas abstract deixis creates a reference point in space for an idea or concept. Beats are small back-and-forth or up-and-down movements of the hand, performed in rhythm to the speech. They serve to emphasize important sections of the speech.

In later work, McNeill (2005) argued that it is inappropriate to think of gesture in terms of categories, but the categories should instead be considered dimensions. This reflects the fact that any individual gesture may contain several of these properties (e.g., deixis and iconicity). He suggests additional dimensions of temporal highlighting (the function of beats) and social interactivity, which helps to manage turn taking and the flow of conversation.
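
For annotation and synthesis systems, one practical consequence is that a gesture is better described by a set of dimension weights than by a single category label. The sketch below is purely illustrative: the dimension names follow McNeill, but the idea of numeric scores in [0, 1] is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class GestureDimensions:
    """Scores in [0, 1] for how strongly each dimension is present in one gesture."""
    iconicity: float = 0.0
    metaphoricity: float = 0.0
    deixis: float = 0.0
    temporal_highlighting: float = 0.0   # the function beats serve
    social_interactivity: float = 0.0    # turn taking and conversation flow


# An abstract pointing gesture that also carries metaphoric content:
example = GestureDimensions(deixis=0.8, metaphoricity=0.5)
print(example)
```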

State of the Art

Generation of conversational characters has achieved substantial progress, but the bar for success is extremely high. People are keen observers of human motion and will make judgments based on subtle details. By way of analogy, people readily distinguish good actors from bad ones and judge whether an actor is well suited to a particular role but not to another – and actors are human, with all the capacity for naturalness and expressivity that comes with that. The bar for conversational characters is that of a good actor, effectively performing a particular role. The field remains a long way from being able to do this automatically, for a range of different characters and over prolonged interactions with multiple subjects.

Gesture Generation Tasks

Gesture Specification

When generating virtual conversational characters, one of the primary challenges is determining what gestures a character should perform. Different approaches have trade-offs in terms of the type of input information they require, the amount of processing time needed to determine a gesture, and the quality of the gesture selection, both in how accurately it reflects a particular character’s personality and in how appropriate it is for the co-expressed utterance.

One approach is to generate gestures based on prosody variations in the spoken audio signal. Prosody includes changes in volume and pitch. Such approaches have been applied for head nods and movement (Morency et al. 2008), as well as gesture generation (Levine et al. 2009, 2010). A main advantage of the approach is that good-quality audio can be highly expressive, and using it as an input for gesture specification allows the gestures to match the expressive style of the audio. Points of emphasis in the audio appear to be good landmarks for placing gestures, and their use provides consistent emphasis across the channels. Prosody-based approaches have been used to generate gesture in real time as a user speaks (Levine et al. 2009, 2010). The drawback of using only prosody is that it does not capture semantics, so the gestures will likely not match the meaning of the audio and certainly cannot supplement the utterance with information that is not present in the audio. This concern can be at least partially addressed by also parsing the spoken text (Marsella et al. 2013). It is believed that in human communication, the brain co-plans the gesture and the utterance (McNeill 2005), so approaches that do not use future information about the planned utterance are unlikely to match the sophistication of human gesture-speech coordination.
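
To make the prosody-driven idea concrete, the following sketch computes a short-time energy envelope from raw audio samples and treats local peaks above a threshold as candidate anchor points for beat gestures. This is a hand-rolled toy, not the method of the cited systems, which also exploit pitch and learned models; the frame size and threshold are arbitrary illustrative values.

```python
import numpy as np


def emphasis_anchors(samples: np.ndarray, sr: int,
                     frame_ms: float = 40.0, threshold: float = 1.5) -> list:
    """Return times (seconds) of loudness peaks that could anchor beat gestures.

    `threshold` is a hypothetical multiplier of the mean frame energy; real
    prosody-based systems use pitch and trained models rather than a rule like this.
    """
    frame_len = int(sr * frame_ms / 1000.0)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1))          # RMS energy per frame

    anchors = []
    for i in range(1, n_frames - 1):
        is_peak = energy[i] > energy[i - 1] and energy[i] > energy[i + 1]
        if is_peak and energy[i] > threshold * energy.mean():
            anchors.append(i * frame_len / sr)
    return anchors


# Toy usage: one second of quiet noise with a louder burst in the middle.
sr = 16000
audio = np.random.randn(sr) * 0.05
audio[7000:8000] += np.random.randn(1000) * 0.8
print(emphasis_anchors(audio, sr))
```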

Another approach generates gestures based on the text of the dialogue that is to be spoken. A chief benefit of these techniques is that text captures much of the information being conveyed, so they can generate gestures that support the semantics of the utterance. Text can also be analyzed for emotional content and rhetorical style, providing a rich basis for gesture generation. Rule-based approaches (Cassell et al. 2001; Lee and Marsella 2006; Lhommet and Marsella 2013; Marsella et al. 2013) can determine both the gesture locations and the type of gestures to be performed. Advantages of these techniques are that they can handle any text covered by their knowledge bases and are extensible in flexible and straightforward ways. Disadvantages are that some amount of manual work is normally required to create the rules, and it is difficult to know how to author rules that produce a particular character, so behavior tends to be generic. Other work uses statistical approaches to predict the gestures that a particular person would employ (Bergmann et al. 2010; Kipp 2005; Neff et al. 2008). These techniques support the creation of individualized characters, which are essential for many applications, such as anything involving storytelling. Individualized behavior may also outperform the averaged behavior that generic rules encode (Bergmann et al. 2010). These approaches, however, are largely limited to reproducing characters similar to the subjects modeled, and creating arbitrary characters remains an open challenge. Recent work has begun applying deep learning to the mapping from text and prosody to gesture (Chiu et al. 2015). This is a potentially powerful approach, but it requires a large quantity of data, and ways to produce specific characters must still be developed. While the divide between prosody-driven and text-based approaches is useful for understanding techniques, current systems increasingly rely on a combination of text and prosody information (e.g., Lhommet and Marsella 2013; Marsella et al. 2013).
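
A minimal sketch of the rule-based flavor of these techniques is given below. The keyword patterns and gesture labels are invented for illustration; the cited systems rely on far richer linguistic analysis and knowledge bases.

```python
import re
from typing import List, Tuple

# Hypothetical keyword-to-gesture rules; each maps a regular expression over the
# utterance text to a gesture class to request from the animation engine.
RULES: List[Tuple[str, str]] = [
    (r"\b(this|that|here|there)\b", "deictic"),
    (r"\b(huge|tiny|wide|tall)\b", "iconic_size"),
    (r"\b(on the other hand|however)\b", "metaphoric_contrast"),
    (r"\b(everyone|all of us)\b", "metaphoric_inclusive"),
]


def specify_gestures(text: str) -> List[Tuple[str, str]]:
    """Return (matched phrase, gesture class) pairs for one utterance."""
    requests = []
    for pattern, gesture in RULES:
        for match in re.finditer(pattern, text.lower()):
            requests.append((match.group(0), gesture))
    return requests


print(specify_gestures("That idea, however, affects all of us."))
```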

Techniques based on generating gesture from text are limited to ideas expressed in the text. The information we convey through gesture is sometimes redundant with speech, although expressed in a different form, but often expresses information that is different from that in speech (McNeill 2005). For example, I might say “I saw a [monster.],” with the square brackets indicating the location of a gesture that holds my hand above my head, with my fingers bent 90° at the first knuckle and then held straight. The gesture indicates the height of the monster, information completely lacking from the verbal utterance. Evidence suggests that gestures are most effective when they are nonredundant (Goldin-Meadow 2006; Hostetter 2011; Singer and Goldin-Meadow 2005). This implies the need to base gesture generation on a deeper notion of a “communicative intent,” which may not be solely contained in the text and which describes the full message to be delivered.

The SAIBA (situation, agent, intention, behavior, animation) framework represents a step toward establishing a computational architecture to tackle the fundamental multimodal communication problem of moving from a communicative intent to output across the various agent channels of gesture, text, prosody, facial expressions, and posture (SAIBA Working Group website 2012). The approach defines stages in production and markup languages to connect them. The first stage is planning the communicative intent. This is communicated using the Function Markup Language (Heylen et al. 2008) to the behavior planner, which decides how to achieve the desired functions using the agent modalities available. The final behavior is then sent to a behavior realizer for generation using the Behavior Markup Language (Kopp et al. 2006; Vilhjalmsson et al. 2007). Such approaches echo, at least at the broad conceptual level, theories of communication like McNeill’s growth point hypothesis that argue gesture and language emerge in a shared process from a communicative intent (McNeill 2005). Recent work has sought to develop cognitive (Kopp et al. 2013) and combined cognitive and linguistic models (Bergmann et al. 2013) to explore the distribution of communicative content across output modalities.
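
The hand-off from behavior planner to behavior realizer in this pipeline is a BML document. The sketch below assembles a BML-like fragment programmatically: a speech element containing a sync point and a gesture whose stroke is bound to it. The element and attribute names are modeled loosely on BML, but the exact schema should be taken from the BML specification rather than from this illustration.

```python
import xml.etree.ElementTree as ET

# Assemble a BML-like fragment: speak a sentence and time a gesture stroke to a
# sync point inside it. Names here follow the general shape of BML but are not
# guaranteed to validate against the official specification.
bml = ET.Element("bml", id="bml1")

speech = ET.SubElement(bml, "speech", id="s1")
text = ET.SubElement(speech, "text")
text.text = "I saw a "
sync = ET.SubElement(text, "sync", id="tm1")   # marks the word the stroke should hit
sync.tail = "monster."

# Request a gesture whose stroke is synchronized to the sync point above.
ET.SubElement(bml, "gesture", id="g1", lexeme="ICONIC_TALL", stroke="s1:tm1")

print(ET.tostring(bml, encoding="unicode"))
```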

Gesture Animation

Generation of high-quality gesture animation must satisfy a rich set of requirements:

  • Match the gesture timing to that of the speech.

  • Connect individual gestures into fluent gesture units.

  • Adjust the gesture to the character’s context (e.g., to point to a person or object in the scene).

  • Generate appropriate gesture forms for the utterance (e.g., show the shape of an object, mime an action being performed, point).

  • Vary the gesture based on the personality of the character.

  • Vary the gesture to reflect the character’s current mood and tone of the speech.

While a wide set of techniques has been used for gesture animation, the need for precise agent control, especially in interactive systems, has often favored the use of kinematic procedural techniques (e.g., Chi et al. 2000; Hartmann et al. 2006; Kopp and Wachsmuth 2004). For example, Kopp and Wachsmuth (2004) present a system that uses curves derived from neurophysiological research to drive the trajectory of gesturing arm motions. Procedural techniques allow full control of the motion, making it easy to adjust the gesture to both the spatial and the timing demands of the speech.
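
As a deliberately simplified example of this kinematic style of control, the sketch below drives a wrist position between two key points with a minimum-jerk profile, one common bell-shaped velocity model; the cited systems use their own, more elaborate curve formulations.

```python
import numpy as np


def min_jerk(p0: np.ndarray, p1: np.ndarray, duration: float, dt: float = 1 / 60):
    """Interpolate a wrist position from p0 to p1 with a minimum-jerk profile.

    Returns an array of positions sampled every `dt` seconds, giving the smooth
    acceleration and deceleration typical of goal-directed arm movement.
    """
    t = np.arange(0.0, duration + dt, dt)
    tau = np.clip(t / duration, 0.0, 1.0)
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5     # minimum-jerk time scaling
    return p0[None, :] + s[:, None] * (p1 - p0)[None, :]


# Move the wrist 40 cm forward and 20 cm up over 0.5 s (a stroke-like motion).
start = np.array([0.0, 0.0, 0.0])
end = np.array([0.0, 0.2, 0.4])
trajectory = min_jerk(start, end, duration=0.5)
print(trajectory.shape, trajectory[-1])
```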

While gesture is less constrained by physics than motions like tumbling, physical simulation has still been used for gesture animation and can add important nuance to the motion (Neff and Fiume 2002, 2005; Neff et al. 2008; Van Welbergen et al. 2010). These approaches generally include balance control and a basic approximation of muscle behavior, such as a proportional-derivative controller. The balance control adds full-body movement to compensate for arm movements, and the controllers can add subtle oscillations and arm swings. These effects require proper tuning.
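
The sketch below shows the basic ingredient: a proportional-derivative controller tracking a target angle on a one-degree-of-freedom arm. The gains, inertia, and target are arbitrary illustrative values; real systems add balance control, gravity compensation, and careful tuning.

```python
import numpy as np

# One-DOF arm segment driven by a PD "muscle" toward a target angle.
# With moderate gains the response overshoots slightly and oscillates,
# the kind of nuance physical simulation can add to a gesture.
I = 0.1            # moment of inertia (kg m^2), illustrative
kp, kd = 8.0, 0.4  # proportional and derivative gains, illustrative
theta, omega = 0.0, 0.0          # current angle (rad) and angular velocity
target = np.radians(60.0)        # desired shoulder angle
dt = 1 / 240                     # simulation time step

for step in range(int(1.0 / dt)):                 # simulate one second
    torque = kp * (target - theta) - kd * omega   # PD control law
    alpha = torque / I                            # angular acceleration
    omega += alpha * dt                           # semi-implicit Euler integration
    theta += omega * dt

print(np.degrees(theta))   # oscillates toward and settles near 60 degrees
```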

Motion capture data has seen increasing use in an attempt to improve the realism of character motion. These techniques often employ versions of motion graphs (Arikan and Forsyth 2002; Kovar et al. 2002; Lee et al. 2002), which concatenate segments of motion to create a sequence, as in Fernández-Baena et al. (2014) and Stone et al. (2004). Motion capture data can provide very high-quality motion, but control is more limited, so it can be a challenge to adapt the motion to novel speech or to generate different characters. Gesture relies heavily on hand shape, and it can be difficult to capture good-quality hand motion while simultaneously capturing body motion. Some techniques seek to synthesize acceptable hand motion from the body motion alone (Jörg et al. 2012). For a fuller discussion of the issues around hand animation, please refer to Wheatland et al. (2015).
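
The core motion-graph idea can be sketched briefly: allow a transition from one clip to another whenever the pose at the end of the first is close to the pose at the start of the second, then synthesize motion by walking the resulting graph. The pose representation, distance threshold, and random data below are placeholders; real systems compare windows of frames, include velocities, and blend across transitions.

```python
import numpy as np

# Each "clip" is an array of poses (frames x joint angles); toy random data here.
rng = np.random.default_rng(0)
clips = [rng.standard_normal((30, 12)) for _ in range(5)]

THRESHOLD = 4.0   # illustrative pose-distance threshold for a valid transition


def can_transition(a: np.ndarray, b: np.ndarray) -> bool:
    """Allow a cut from the end of clip `a` to the start of clip `b`."""
    return float(np.linalg.norm(a[-1] - b[0])) < THRESHOLD


# Build the graph: edges[i] lists the clips that may follow clip i.
edges = {i: [j for j, b in enumerate(clips) if j != i and can_transition(a, b)]
         for i, a in enumerate(clips)}


def synthesize(start: int, length: int) -> np.ndarray:
    """Concatenate clips by randomly walking the graph (no blending, for brevity)."""
    sequence, current = [clips[start]], start
    while len(sequence) < length and edges[current]:
        current = int(rng.choice(edges[current]))
        sequence.append(clips[current])
    return np.concatenate(sequence, axis=0)


print(edges)
print(synthesize(0, 4).shape)
```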

As part of the SAIBA effort, several research groups have developed “behavior realizers,” animation engines capable of realizing commands in the Behavior Markup Language (Vilhjalmsson et al. 2007) that is supplied by a higher level in an agent architecture. These systems emphasize control and use a combination of procedural animation and motion clips (e.g., Heloir and Kipp 2009; Kallmann and Marsella 2005; Shapiro 2011; Thiebaux et al. 2008; Van Welbergen et al. 2010). The SmartBody system, for example, uses a layering approach based on a hierarchy of controllers for different tasks (e.g., idle motion, locomotion, reach, breathing). These controllers may control different or overlapping parts of the body, which creates a coordination challenge. They can be combined, or one controller may override another (Shapiro 2011).
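
A highly simplified version of the layering idea is sketched below: controllers are evaluated in order, and each blends into, or overrides, the joints it owns. The controller names, pose model, and blending rule are invented for illustration and do not reproduce SmartBody’s actual scheduler.

```python
from typing import Callable, Dict, List

Pose = Dict[str, float]   # joint name -> angle (radians), a deliberately tiny pose model


class Layer:
    def __init__(self, name: str, joints: List[str],
                 evaluate: Callable[[float], Pose], weight: float = 1.0):
        self.name, self.joints, self.evaluate, self.weight = name, joints, evaluate, weight


def compose(layers: List[Layer], t: float) -> Pose:
    """Apply layers in order; later layers blend over earlier ones on shared joints."""
    pose: Pose = {}
    for layer in layers:
        contribution = layer.evaluate(t)
        for joint in layer.joints:
            base = pose.get(joint, 0.0)
            pose[joint] = (1 - layer.weight) * base + layer.weight * contribution[joint]
    return pose


# Idle sway on the spine and shoulder, overridden on the shoulder by a gesture layer.
idle = Layer("idle", ["spine", "r_shoulder"], lambda t: {"spine": 0.05, "r_shoulder": 0.1})
gesture = Layer("gesture", ["r_shoulder"], lambda t: {"r_shoulder": 1.2}, weight=1.0)

print(compose([idle, gesture], t=0.0))   # the gesture layer wins on r_shoulder
```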

Often gesture specification systems will indicate a particular gesture form that is required, e.g., a conduit gesture in which the hand is cupped and moves forward. Systems often employ a dictionary of gesture forms that can be used in synthesis. These gestures have been encoded using motion capture clips, hand animation, or numerical spatial specifications. Some techniques (Kopp et al. 2004) have sought to generate the correct forms automatically, for example, based on a description of the image the gesture is trying to create.
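
In its simplest form, such a dictionary is a lookup from a gesture lexeme to a stored form, as in the sketch below. The lexeme names and form fields are invented placeholders; real systems store motion clips or detailed numerical specifications in these entries.

```python
from dataclasses import dataclass


@dataclass
class GestureForm:
    handshape: str          # e.g., "cupped" or "index-extended"
    movement: str           # coarse description of the stroke movement
    default_duration: float # seconds


# Hypothetical gesture lexicon; entries would normally reference clips or
# numerical spatial specifications rather than text descriptions.
GESTURE_DICTIONARY = {
    "CONDUIT": GestureForm("cupped", "move forward from the body", 0.8),
    "POINT":   GestureForm("index-extended", "extend toward the referent", 0.6),
    "BEAT":    GestureForm("relaxed", "small downward flick", 0.3),
}

print(GESTURE_DICTIONARY["CONDUIT"])
```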

Gesture animation is normally deployed in scenarios where it is desirable for the characters to portray clear personalities and show variations in emotion and mood. For these reasons, controlling expressive variation of the motion has been an important focus. A set of challenges must be solved. These include determining how to parameterize a motion to give expressive control, understanding what aspects of motion must be varied to generate a desired impact, ensuring consistency over time, determining how to expose appropriate control structures to the user or character control system, and, finally, synthesizing the motion to contain the desired properties. Chi et al. (2000) use the Effort and Shape components of Laban Movement Analysis to provide an expressive parameterization of motion. Changing any of the four Effort qualities (Weight, Space, Time, and Flow) or the Shape qualities (Rising-Sinking, Spreading-Enclosing, Advancing-Retreating) will vary the timing and path of the gesture, along with the engagement of the torso. Hartmann et al. (2005) use tension, continuity, and bias splines (Kochanek and Bartels 1984) to control arm trajectories and provide expressive control through parameters for activation, spatial and temporal extent, fluidity, and repetition. Neff and Fiume (2005) develop an extensible set of movement properties that can be varied and a system that allows users to write character sketches that reflect a particular character’s movement tendencies and then layer additional edits on top.
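
The flavor of such parameterizations is illustrated below with two commonly used controls, spatial extent and temporal extent, applied to a stored stroke trajectory. The parameter names echo Hartmann et al., but the scaling rule itself is a simplification invented for this example.

```python
import numpy as np


def apply_expressivity(trajectory: np.ndarray, timestamps: np.ndarray,
                       spatial_extent: float = 1.0,
                       temporal_extent: float = 1.0):
    """Scale a stroke trajectory around its start point and stretch its timing.

    spatial_extent > 1 produces larger, more expansive gestures;
    temporal_extent > 1 slows the gesture down. Both default to the original motion.
    """
    origin = trajectory[0]
    scaled = origin + spatial_extent * (trajectory - origin)   # enlarge the path
    retimed = timestamps * temporal_extent                     # stretch the timing
    return scaled, retimed


# A small forward stroke, made 50% larger and 20% slower.
traj = np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.2], [0.0, 0.15, 0.35]])
times = np.array([0.0, 0.25, 0.5])
print(apply_expressivity(traj, times, spatial_extent=1.5, temporal_extent=1.2))
```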

While gestures are often thought of as movements of the arms and hands, and are often represented this way in computational systems, they can use the whole body. A character can nod its head, gesture with its toe, etc. More importantly, while the arms are the dominant appendages for most gestures, engaging the entire body can lead to clearer and more effective animation. Lamb called this engagement of the whole body during gesturing Posture-Gesture Merger and argued that it led to more fluid and attractive motion (Lamb 1965).

Additional Considerations

Conversations are interactions between people and this must be reflected in the animation. Both the speaker(s) and listener(s) have roles to play. Visual attention must be managed through appropriate gaze behavior to indicate who is paying attention and how actively, along with indicating who is thinking or distracted. Attentive listeners will provide back channel cues, like head nods, to indicate that they are listening and understanding. These must be appropriately timed with the speaker’s dialogue. Holding the floor is also actively managed. Speakers may decide to yield their turn to another. Listeners may interrupt, and the speaker may yield in response or refuse to do so. Floor management relies on both vocal and gestural cues. Proxemics are also highly communicative to an audience and must be managed appropriately. This creates additional animation challenges in terms of small-scale locomotion in order to fluidly manage character placement.

Gestural behavior must adapt to the context. Gestures will be adjusted based on the number of people in the conversation and their physical locations relative to one another. As characters interact, they may also begin to mirror each other’s behavior and postures. Gestures are also often used to refer to items in the environment and hence must be adapted based on the character’s location. Finally, characters will engage in conversations while also simultaneously performing other activities, such as walking, jogging, or cleaning the house. The gesture behavior must be adapted to the constraints of this other behavior, for example, gestures performed while jogging tend to be done with more bent arms and are less frequent than standing gestures (Wang et al. 2016).

Future Directions

While significant progress has been made, the bar for conversational gesture animation is very high. We are a long way from being able to easily create synthetic characters that match the expressive quality, range, and realism of a skilled actor, and applications that rely on synthetic characters are impoverished by this gap. Some of the key issues to address include:

Characters with large gesture repertoires: It currently takes a great deal of work to build a movement set for a character, generally involving recording, cleaning, and retargeting motion capture or hand animating movements. This places a practical limitation on the number of gestures that they can perform. Methods that allow large sets of gestures to be rapidly generated are needed. A particular challenge is being able to synthesize novel gestures on the fly to react to the character’s current context.

Motion quality: While motion quality has improved, it remains well short of photo-realism, particularly for interactive characters. Hand motion remains a particular challenge, as does appropriate full-body engagement. Most systems focus on standing characters, whereas people engage in a wide range of activities while simultaneously gesturing. A significant challenge is correctly orchestrating a performance across the various movement modalities (breath, arm movements, body movements, facial expressions, etc.), especially when the motion diverges from playback of a recording or hand-animated sequence.

Planning from communicative intent: Systems that can represent an arbitrary communicative intent and can distribute it across various communication modes, and do so in different ways for different speakers, remain a long-term goal. This will likely require both improved computational models and a more thorough understanding of how humans formulate communication.

Customization for characters and mood: While people tend to have their own, unique gesturing style, it is a challenge to imbue synthetic characters with this expressive range without an enormous amount of manual labor. It is also a challenge to accurately reflect a character’s current mood: anger, sadness, irritation, excitement, etc.

Authoring controls: If a user wishes to create a particular character with a given role, personality, etc., there must be tools to allow this to be authored. Substantial work is required to allow authors to go from an imagined character to an effective realization.

Cross-References