1 Introduction

Synthesizing conversational gestures can be helpful for animation, entertainment, education and virtual reality applications. To accomplish this, the complex relationship between speech, facial expressions, emotions, speaker identity and semantic meaning of gestures has to be carefully considered in the design of the gesture synthesis models.

Fig. 1. Overview. BEAT is a large-scale, multi-modal mo-cap human gestures dataset with semantic and emotional annotations, diverse speakers, and multiple languages.

While synthesizing conversational gestures based on audio [20, 32, 52] or text [3, 5, 8, 53] has been widely studied, synthesizing realistic, vivid, human-like conversational gestures remains unsolved and challenging for several reasons. i) Quality and scale of the dataset. Previously proposed methods [32, 52] were trained on limited mo-cap datasets [17, 46] or on pseudo-labeled [20, 21, 52] datasets (cf. Table 1), which results in limited generalization capability and a lack of robustness. ii) Rich and paired multi-modal data. Previous works adopted one or two modalities [20, 52, 53] to synthesize gestures and reported that conversational gestures are determined by multiple modalities together. However, due to the lack of paired multi-modal data, the analysis of other modalities, e.g., facial expressions, for gesture synthesis is still missing. iii) Speaker style disentanglement. All available datasets, as shown in Table 1, either have only a single speaker [17] or many speakers who talk about different topics [20, 21, 52]. Speaker-specific styles were therefore rarely investigated in previous studies due to the lack of data. iv) Emotion annotation. Existing work [7] analyzes emotion-conditioned gestures by extracting implicit sentiment features from text. Due to the unlabeled and limited emotion categories in the dataset [52], it cannot cover enough of the emotions present in daily conversations. v) Semantic relevance. Due to the lack of semantic relevance annotation, only a few works [31, 52] analyze the correlation between generated gestures and semantics, and only through subjective visualization examples. Existing semantic labels for gestures would enable synthesizing context-related, meaningful gestures. In conclusion, the absence of a large-scale, high-quality multi-modal dataset with semantic and emotional annotation is the main obstacle to synthesizing human-like conversational gestures.

There are two design choices for collecting unlabeled multi-modal data: i) the pseudo-label approach [20, 21, 52], i.e., extracting conversational gestures and facial landmarks from in-the-wild videos using 3D pose estimation algorithms [12], and ii) the motion capture approach [17], i.e., recording speakers on predefined themes or texts. In contrast to the pseudo-labeling approach, which allows low-cost, semi-automated access to large-scale training data, e.g., 97 h [52], motion-captured data requires higher cost and more manual work, resulting in smaller dataset sizes, e.g., 4 h [17]. However, because motion capture can be strictly controlled and designed in advance, it can ensure the quality and diversity of the data, e.g., eight different emotions of the same speaker, and different gestures of 30 speakers speaking the same sentences. Besides, high-quality motion capture data are indispensable for evaluating the effectiveness of pseudo-label training.

Table 1. Comparison of Datasets. We compare with all 3D conversational gesture and face datasets. “#", “LM" and “BSW" indicate the number, landmark and blendshape weight, respectively. Our dataset is the largest mo-cap dataset with multi-modal data and annotations

Based on the above analysis, to address these data-related problems, we built a mo-cap dataset, BEAT, containing semantic and eight-class emotional annotations (cf. Fig. 1), from 30 speakers in four modalities of Body-Expression-Audio-Text, with a total of 30M annotated frames. The motion capture environment is strictly controlled to ensure quality and diversity, with 76 h and more than 2500 topic-segmented sequences. Speakers with different levels of language mastery also provided paired data in three other languages with different durations. The ratio of actors/actresses, the range of phonemes, and the variety of languages are carefully designed to cover natural language characteristics. For emotional gestures, feedback on the speakers’ expressions was provided by professional instructors during the recording process, and sequences were re-recorded in case of non-expressive gesturing to ensure the expressiveness and quality of the entire dataset. After statistical analysis on BEAT, we observed a correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity.

Additionally, we propose a baseline neural network architecture, Cascaded Motion Network (CaMN), which learns to synthesize body and hand gestures from all six modalities mentioned above. The proposed model consists of cascaded encoders and decoders for enhancing the contribution of the audio and facial modalities. Besides, in order to evaluate semantic relevancy, we propose Semantic-Relevant Gesture Recall (SRGR), which weights the Probability of Correct Keypoint (PCK) by the semantic scores of the ground truth data. Overall, our contributions can be summarized as follows:

  • We release BEAT, which is the first gesture dataset with semantic and emotional annotation, and the largest motion capture dataset in terms of duration and available modalities to the best of our knowledge.

  • We propose CaMN as a baseline model that inputs audio, text, facial blendshape weights, speaker identity, emotion and semantic score to synthesize conversational body and hand gestures through a cascaded network architecture.

  • We introduce SRGR to evaluate the semantic relevancy as well as the human preference for conversational gestures.

Finally, qualitative and quantitative experiments demonstrate the data quality of BEAT, the state-of-the-art performance of CaMN, and the validity of SRGR.

2 Related Work

Conversational Gestures Dataset. We first review mo-cap and pseudo-label conversational gesture datasets. Volkova et al. [47] built a mo-cap emotional gestures dataset of 89 min with text annotation, and Takeuchi et al. [45] captured an interview-like audio-gesture dataset totaling 3.5 h with two Japanese speakers. Ferstl and Mcdonnell [17] collected a 4-hour dataset, Trinity, with a single male speaker discussing hobbies, etc., which is the most commonly used mo-cap dataset for conversational gesture synthesis. On the other hand, Ginosar et al. [20] used OpenPose [12] to extract 2D poses from YouTube videos as training data, totaling 144 h, called the S2G dataset. Habibie et al. [21] extended it to a full 3D body with facial landmarks; the latest available data is 33 h. Similarly, Yoon et al. [52] used VideoPose3D [39] to build on the TED dataset, which is 97 h with 9 upper-body joints. The limited amount of mo-cap data and the noise in pseudo-label ground truth impose a trade-off between the trained network’s generalization capability and quality. Similar to our work, several datasets have been built for talking-face generation, which can be divided into 3D scanned faces, e.g., VOCA [46] and MeshTalk [42], and RGB images [4, 11, 15, 26, 49]. However, these datasets cannot be adopted to synthesize human gestures.

Semantic or Emotion-Aware Motion Synthesis. Semantic analysis of motion has been studied in the action recognition and sign-language analysis/synthesis research domains. For example, some action recognition datasets [9, 13, 14, 25, 28, 34, 40, 43, 44, 48] use clips annotated with a single action label, e.g., running or walking [41]. Another example is audio-driven sign-language synthesis [27], where hand gestures have specific semantics. However, these datasets do not apply to conversational gesture synthesis since gestures used in natural conversations are more complex than single actions, and their semantic meaning differs from sign-language semantics. Recently, Bhattacharya [7] extracted emotional cues from text and used them for gesture synthesis. However, the proposed method is limited by the accuracy of the emotion classification algorithm and the diversity of emotion categories in the dataset.

Conditional Conversational Gestures Synthesis. Early baseline models were released with datasets such as text-conditioned gesture [53], audio-conditioned gesture [17, 20, 45], and audio-text-conditioned gesture [52]. These baseline models were based on CNNs and LSTMs for end-to-end modelling. Several efforts have tried to improve the performance of the baseline models through input/output representation selection [19, 30], adversarial training [18] and various generative modeling techniques [1, 36, 50, 51], which can be summarized as “estimating a better distribution of gestures based on the given conditions." As an example, StyleGestures [2] uses a flow-based model [23] and an additional control signal to sample gestures from the distribution. Probabilistic gesture generation enables diversity driven by noise, as achieved by CGAN [51] and WGAN [50]. However, due to the lack of paired multi-modal data, the analysis of other modalities, e.g., facial expressions, for gesture synthesis is still missing.

3 BEAT: Body-Expression-Audio-Text Dataset

In this section, we introduce the proposed Body-Expression-Audio-Text (BEAT) Dataset. First, we describe the dataset acquisition process and then introduce text, emotion, and semantic relevance information annotation. Finally, we use BEAT to analyze the correlation between conversational gestures and emotions and show the distribution of semantic relevance.

3.1 Data Acquisition

Motion Capture System. The motion capture system, shown in Fig. 2a, is based on 16 synchronized cameras recording motion at 120 Hz. We use Vicon’s suits with 77 markers (cf. supplementary materials for the locations of the markers on the body). The facial capture system uses ARKit with the depth camera on an iPhone 12 Pro, which extracts 52 blendshape weights at 60 Hz. The blendshape targets are designed based on the Facial Action Coding System (FACS) and are widely used in industry, even by novice users. The audio is recorded in 48 kHz stereo.

Fig. 2. Capture System and Subject Distribution of BEAT. (a) A 16-camera motion capture system is adopted to record data in the conversation and self-talk sessions. (b) Gestures are divided into four categories in the conversation session. (c) Seven additional emotion categories are set in equal proportions in the self-talk session. (d) Our dataset includes four languages, mainly English, (e) spoken by 30 speakers from ten countries with different recording durations.

Design Criteria. BEAT is equally divided into conversation and self-talk sessions, which consist of 10-min and 1-min sequences, respectively. The conversation takes place remotely between the speaker and the instructor, i.e., to ensure only the speaker’s voice is recorded. As shown in Fig. 2b, the speaker’s gestures are divided into four categories: talking, instantaneous reactions to questions, the state of thinking (silence), and asking. We timed each category’s duration during the recording process. Topics were selected from 20 predefined topics, covering 33% debate and 67% description topics. Conversation sessions recorded neutral conversations without acting to ensure the diversity of the dataset. The self-talk sessions consist of 120 1-minute self-talk recordings, in which speakers answer questions about daily conversation topics, e.g., personal experiences or hobbies. The answers were written and proofread by three native English speakers, and the phonetic coverage was controlled to be similar to that of the 3000 most frequently used words [24]. We covered 8 emotions in the dataset, neutral, anger, happiness, fear, disgust, sadness, contempt and surprise, following [35]; the ratio of each emotion is shown in Fig. 2c. Among the 120 questions, 64 were for neutral emotions, and the remaining seven emotions had eight questions each. Different speakers were asked to talk about the same content with their own personalized gestures. Details about the predefined answers and pronunciation distribution are available in the supplementary materials.

Speaker Selection and Language Ratio. We strictly control the proportion of languages as well as accents to ensure the generalization capability of the dataset. As shown in Fig. 2d, the dataset consists mainly of English data: 60 h (81%), with 12 h of Chinese and 2 h each of Spanish and Japanese. The Spanish and Japanese portions are each about 50% of the size of the previous mo-cap dataset [17]. The English component includes 34 h from 10 native English speakers, from the US, UK, and Australia, and 26 h from 20 fluent English speakers from other countries. As shown in Fig. 2e, the 30 speakers (including 15 females) from different ethnicities can be grouped by their total recording duration into 4-h (10 speakers) and 1-h (20 speakers) groups, where the 1-h data is intended for few-shot learning experiments. We refer the reader to the supplementary material for details of the speakers.

Recording. Speakers were asked to read the self-talk answers proficiently. However, they were not guided to perform a specific style of gesture but were encouraged to show a natural, personal, daily style of conversational gestures. Before talking with a particular emotion, speakers watched 2–10 min of emotionally stimulating videos corresponding to that emotion. A professional instructor would guide them to elicit the corresponding emotion correctly. We re-recorded any unqualified data to ensure the correctness and quality of the data.

3.2 Data Annotation

Text Alignment. We use an in-house Automatic Speech Recognizer (ASR) to obtain the initial text for the conversation session, which is then proofread by annotators. Then, we adopt the Montreal Forced Aligner (MFA) [37] for temporal alignment of the text with the audio.

Emotion and Semantic Relevance. The 8-class emotion labels of the self-talk sessions are predefined, and on-site supervision guarantees their correctness. For the conversation session, annotators watched the video with the corresponding audio and gestures to perform frame-level annotation. For the semantic relevance, we collected scores on a scale of 0–10 from 600 assigned annotators on Amazon Mechanical Turk (AMT). The annotators were asked to annotate a small amount of test data as a qualification check, and only 118 annotators passed the qualification phase for the final data annotation. We paid \(\sim \) $10 per annotator per hour for this task.

Fig. 3. Emotional Gesture Clustering and Examples. (a) T-SNE visualization of gestures in eight emotion categories. Gestures with different emotions are largely separated into different groups, e.g., Happiness (blue) and Anger (orange). (b) Examples of Happiness (top) and Anger gestures from speaker-2. (Color figure online)

3.3 Data Analysis

The collection and annotation of BEAT make it possible to analyze correlations between conversational gestures and other modalities. While the connection between gestures and audio, text, and speaker identity has been widely studied, we further discuss the correlations between gestures and facial expressions, emotions, and semantics.

Facial Expression and Emotion. Facial expressions and emotions are strongly correlated (excluding some of the lip movements), so we first analyze the correlation between conversational gestures and emotion categories here. As shown in Fig. 3a, we visualized the gestures with T-SNE based on a 2 s rotation representation, and the results show that gestures have different characteristics under different emotions. For example, as shown in Fig. 3b, speaker-2 has different gesture styles when angry and when happy, e.g., the gestures are larger and faster when angry. The T-SNE results also differ significantly between Happiness (blue) and Anger (orange). However, the gestures of the different emotions are still not perfectly separable by the rotation representation, and gestures of different emotions partly overlap in each region, which is also consistent with subjective perception.
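The clustering in Fig. 3a can be reproduced with standard tools. Below is a minimal sketch using scikit-learn, assuming each 2 s gesture clip is flattened into a fixed-length rotation feature vector and paired with an emotion label; array shapes and variable names are illustrative, not the exact analysis pipeline used for the figure.

```python
# Minimal sketch of the T-SNE analysis in Fig. 3a: project fixed-length
# rotation features of 2 s gesture clips into 2D and color them by emotion.
# Shapes and names are illustrative assumptions, not the authors' pipeline.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

EMOTIONS = ["neutral", "anger", "happiness", "fear",
            "disgust", "sadness", "contempt", "surprise"]

# clips: (N, 30 * 75 * 3) -> 2 s at 15 FPS, 75 joints, 3 rotation values each
# (hypothetical layout; any flattened per-clip representation works here)
rng = np.random.default_rng(0)
clips = rng.normal(size=(400, 30 * 75 * 3)).astype(np.float32)
labels = rng.integers(0, len(EMOTIONS), size=400)

# 2D embedding of the per-clip features
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(clips)

for k, name in enumerate(EMOTIONS):
    mask = labels == k
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=6, label=name)
plt.legend(markerscale=2, fontsize=7)
plt.title("Gesture clips by emotion (t-SNE)")
plt.savefig("tsne_emotions.png", dpi=200)
```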

Fig. 4. Distribution of Semantic Labels. (a) Different speakers uttering the same phrase show different levels of semantic relevance and different gesture styles. (b) The overall semantic-relevance distribution of BEAT. (c) The semantic relevance of high-frequency words, grouped by lexical category in different colors. (d, e) The words i and was show different distributions of semantic relevance even though they share almost the same average relevance level. (Color figure online)

Distribution of Semantic Relevance. There is large randomness in the semantic relevance between gestures and text, as shown in Fig. 4: the frequency, position, and content of the semantically related gestures vary from speaker to speaker even when the same text content is uttered. To better understand the distribution of the semantic relevance of the gestures, we conducted a semantic relevance study based on four hours of data from two speakers. As shown in Fig. 4b, for the overall data, 83% of the gestures have low semantic scores (\(\le \) 0.2). At the word level, the semantic distribution varies between words, e.g., i and was share a similar average semantic score but differ in their score distributions. Besides, Fig. 4c shows the average semantic scores of nine high-frequency words in the text corpus. Notably, the scores of the be-verbs are comparatively lower than those of the pronouns and prepositions, which are shown in blue and yellow, respectively. Ultimately, each word presents a different probability distribution over its semantically related gestures.
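Word-level statistics like those in Fig. 4c–e amount to a simple aggregation over the aligned annotations. The sketch below assumes each aligned word comes with a semantic-relevance score in [0, 1]; the record format and numbers are hypothetical placeholders, not BEAT data.

```python
# Sketch of the word-level aggregation behind Fig. 4c-e: group aligned words
# by token and summarize their semantic-relevance scores. The (word, score)
# record format is a simplified, hypothetical view of the BEAT annotations.
from collections import defaultdict
from statistics import mean

aligned_words = [            # (token, semantic score in [0, 1]) - dummy values
    ("i", 0.05), ("was", 0.04), ("really", 0.35), ("huge", 0.80),
    ("i", 0.50), ("was", 0.02), ("this", 0.10), ("huge", 0.60),
]

scores_by_word = defaultdict(list)
for token, score in aligned_words:
    scores_by_word[token].append(score)

for token, scores in sorted(scores_by_word.items()):
    low = sum(s <= 0.2 for s in scores) / len(scores)   # share of low-relevance uses
    print(f"{token:>8s}  mean={mean(scores):.2f}  share<=0.2={low:.2f}")
```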

4 Multi-modal Conditioned Gestures Synthesis Baseline

In this section, we propose a baseline that inputs all of the modalities to generate vivid, human-like conversational gestures. The proposed baseline, Cascaded Motion Network (CaMN), shown in Fig. 5, encodes text, emotion condition, speaker identity, audio, and facial blendshape weights to synthesize body and hand gestures in a multi-stage, cascaded structure. In addition, semantic relevancy is adopted as a loss weight to make the network generate more semantically relevant gestures. The network choices for the text, audio, and speaker ID encoders follow [52] and are customized for better performance. All input data have the same time resolution as the output gestures so that the synthesized gestures can be processed frame by frame through a sequential model. The gestures and facial blendshape weights are downsampled to 15 FPS, and padding tokens are inserted into the word sequence to correspond to silences in the audio (a sketch of this frame-level alignment is given below).
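As referenced above, the word sequence must be expanded to the 15 FPS frame grid, with padding tokens filling silent frames. A minimal sketch of that expansion, assuming MFA-style (word, start, end) intervals; the "<PAD>" token and helper name are ours, not the paper's exact preprocessing.

```python
# Sketch: expand time-aligned words (e.g., from a forced aligner) to a 15 FPS
# token sequence, inserting a padding token for silent frames. The "<PAD>"
# token and this helper are illustrative, not the authors' exact code.
FPS = 15
PAD = "<PAD>"

def words_to_frames(word_intervals, num_frames):
    """word_intervals: list of (word, start_sec, end_sec)."""
    tokens = [PAD] * num_frames
    for word, start, end in word_intervals:
        first = int(round(start * FPS))
        last = min(int(round(end * FPS)), num_frames)
        for t in range(first, last):
            tokens[t] = word
    return tokens

intervals = [("hello", 0.1, 0.5), ("there", 0.8, 1.2)]
print(words_to_frames(intervals, num_frames=int(1.5 * FPS)))
```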

Fig. 5. Cascaded Motion Network (CaMN). As a multi-modal gesture synthesis baseline, CaMN takes text, emotion label, speaker ID, audio, and facial blendshape weights as input in a cascaded architecture; the audio and facial features are extracted by concatenating the features of the previous modalities. The fused features are reconstructed into body and hand gestures by two cascaded LSTM+MLP decoders.

Text Encoder. First, words are converted to a word embedding set \({\textbf {v}}^\text {T} \in \mathbb {R}^{300} \) by the pre-trained model in FastText [10] to reduce dimensions. Then, the word sets are fine-tuned by a customized encoder \(E_\text {T}\), which is an 8-layer temporal convolution network (TCN) [6] with skip connections [22], as

$$\begin{aligned} z^\text {T}_{i} = E_\text {T}(v^\text {T}_{i-f}, ..., v^\text {T}_{i+f}), \end{aligned}$$
(1)

For each frame i, the TCN fuses the information from \(2f=34\) frames to generate the final latent text feature; the set of features is denoted as \({\textbf {z}}^\text {T} \in \mathbb {R}^{128} \).
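To make the encoder concrete, here is a condensed PyTorch sketch of an 8-layer temporal convolution stack with skip connections mapping 300-d word embeddings to 128-d per-frame text features. The kernel size, dilation pattern, and hidden width are our assumptions, not the paper's exact hyper-parameters.

```python
# Sketch of the text encoder E_T: an 8-layer TCN with skip connections
# mapping 300-d word embeddings to 128-d per-frame features. Kernel size,
# dilations, and hidden width are assumptions.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return x + self.act(self.conv(x))   # skip connection

class TextEncoder(nn.Module):
    def __init__(self, in_dim=300, hidden=128, out_dim=128, layers=8):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.blocks = nn.Sequential(
            *[TCNBlock(hidden, dilation=2 ** (i % 4)) for i in range(layers)])
        self.out = nn.Conv1d(hidden, out_dim, kernel_size=1)

    def forward(self, v_text):               # (B, T, 300) word embeddings
        x = v_text.transpose(1, 2)            # (B, 300, T)
        z = self.out(self.blocks(self.inp(x)))
        return z.transpose(1, 2)              # (B, T, 128) per-frame z^T

z_text = TextEncoder()(torch.randn(2, 34, 300))
print(z_text.shape)  # torch.Size([2, 34, 128])
```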

Speaker ID and Emotion Encoders. The initial representations of speaker ID and emotion are both one-hot vectors, \({\textbf {v}}^\text {ID} \in \mathbb {R}^{30}\) and \({\textbf {v}}^\text {E} \in \mathbb {R}^{8}\). Following the suggestion in [52], we use an embedding layer as the speaker ID encoder, \(E_\text {ID}\). Since the speaker ID does not change instantly, we only use the current frame's speaker ID to calculate its latent feature. On the other hand, we use a combination of an embedding layer and a 4-layer TCN as the emotion encoder, \(E_\text {E}\), to extract temporal emotion variations.

$$\begin{aligned} z^\text {ID}_{i} = E_\text {ID}(v^\text {ID}_{i}), z^\text {E}_{i} = E_\text {E}(v^\text {E}_{i-f}, ..., v^\text {E}_{i+f}), \end{aligned}$$
(2)

where \({\textbf {z}}^\text {ID} \in \mathbb {R}^{8}\) and \({\textbf {z}}^\text {E} \in \mathbb {R}^{8} \) are the latent features for speaker ID and emotion, respectively.
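The following sketch mirrors these two encoders: an embedding layer for the 30 speaker IDs and an embedding layer followed by a 4-layer temporal convolution for the 8 emotion labels, both producing 8-d per-frame features. Layer widths and kernel sizes are assumptions.

```python
# Sketch of E_ID and E_E: a speaker-ID embedding and an emotion embedding
# followed by a small 4-layer TCN, each yielding 8-d per-frame features.
import torch
import torch.nn as nn

class SpeakerIDEncoder(nn.Module):
    def __init__(self, num_speakers=30, dim=8):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, dim)

    def forward(self, speaker_id):         # (B, T) int64 ids
        return self.embed(speaker_id)       # (B, T, 8) = z^ID

class EmotionEncoder(nn.Module):
    def __init__(self, num_emotions=8, dim=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(num_emotions, dim)
        convs = []
        for _ in range(layers):             # 4-layer TCN over time
            convs += [nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                      nn.LeakyReLU(0.2)]
        self.tcn = nn.Sequential(*convs)

    def forward(self, emotion_id):          # (B, T) int64 labels
        x = self.embed(emotion_id).transpose(1, 2)    # (B, 8, T)
        return self.tcn(x).transpose(1, 2)            # (B, T, 8) = z^E

ids = torch.randint(0, 30, (2, 34))
emos = torch.randint(0, 8, (2, 34))
print(SpeakerIDEncoder()(ids).shape, EmotionEncoder()(emos).shape)
```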

Audio Encoder. We adopt the raw waveform representation of audio and downsample it to 16 kHz; treating audio as 15 FPS, for each frame we have \({\textbf {v}}^\text {A} \in \mathbb {R}^{1067}\). We feed the audio jointly with the text, speaker ID, and emotion features into the audio encoder \(E_\text {A}\) to learn better audio features, as

$$\begin{aligned} z^\text {A}_{i} = E_\text {A}(v^\text {A}_{i-f}, ..., v^\text {A}_{i+f}; v^\text {T}_{i}; v^\text {E}_{i}; v^\text {ID}_{i}), \end{aligned}$$
(3)

\(E_\text {A}\) consists of a 12-layer TCN with skip connections and a 2-layer MLP; the features of the other modalities are concatenated with the 12th-layer audio features, so the final MLP layers serve as audio feature refinement. The final latent audio feature is \({\textbf {z}}^\text {A} \in \mathbb {R}^{128}\).
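The sketch below illustrates the cascading idea for this encoder: deep audio features are concatenated with the features of the earlier modalities before a 2-layer MLP refinement. The TCN depth is shortened, the widths are assumptions, and whether raw or encoded earlier-modality features are concatenated is a design detail; this sketch uses the encoded features.

```python
# Sketch of the cascaded audio encoder E_A: a TCN over per-frame raw-audio
# vectors whose deep features are concatenated with the earlier text /
# emotion / speaker features before a 2-layer MLP refinement. Depth is
# shortened and widths are assumptions; the late concatenation is the point.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, in_dim=1067, hidden=128, cond_dim=128 + 8 + 8, out_dim=128):
        super().__init__()
        layers = [nn.Conv1d(in_dim, hidden, kernel_size=1), nn.LeakyReLU(0.2)]
        for _ in range(3):                      # stand-in for the 12-layer TCN
            layers += [nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.2)]
        self.tcn = nn.Sequential(*layers)
        self.mlp = nn.Sequential(               # refinement after concatenation
            nn.Linear(hidden + cond_dim, out_dim), nn.LeakyReLU(0.2),
            nn.Linear(out_dim, out_dim))

    def forward(self, v_audio, z_text, z_emo, z_id):
        # v_audio: (B, T, 1067); z_text: (B, T, 128); z_emo, z_id: (B, T, 8)
        a = self.tcn(v_audio.transpose(1, 2)).transpose(1, 2)   # (B, T, 128)
        fused = torch.cat([a, z_text, z_emo, z_id], dim=-1)
        return self.mlp(fused)                                   # (B, T, 128) = z^A

B, T = 2, 34
z_a = AudioEncoder()(torch.randn(B, T, 1067), torch.randn(B, T, 128),
                     torch.randn(B, T, 8), torch.randn(B, T, 8))
print(z_a.shape)  # torch.Size([2, 34, 128])
```

The facial expression encoder described next follows the same cascaded pattern, so it is not sketched separately.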

Facial Expression Encoder. We take \({\textbf {v}}^\text {F} \in \mathbb {R}^{52}\) as the initial representation of the facial expression. An encoder \(E_\text {F}\) based on an 8-layer TCN and a 2-layer MLP is adopted to extract the facial latent feature \({\textbf {z}}^\text {F} \in \mathbb {R}^{32} \), as

$$\begin{aligned} z^\text {F}_{i} = E_\text {F}(v^\text {F}_{i-f}, ..., v^\text {F}_{i+f}; v^\text {T}_{i}; v^\text {E}_{i}; v^\text {ID}_{i}; v^\text {A}_{i}), \end{aligned}$$
(4)

where the features are concatenated at the 8th layer and the MLP is used for refinement.

Body and Hands Decoders. We implement the body and hands decoders in a separated, cascaded structure, based on the conclusion in [38] that body gestures can be used to estimate hand gestures. The two decoders, \(D_\text {B}\) and \(D_\text {H}\), are based on an LSTM structure for latent feature extraction and a 2-layer MLP for gesture reconstruction. They combine the features of the five modalities with previous gestures, i.e., seed poses, to synthesize the latent gesture features \({\textbf {z}}^\text {B} \in \mathbb {R}^{256}\) and \({\textbf {z}}^\text {H} \in \mathbb {R}^{256}\). The final estimated body gestures \(\hat{{\textbf {v}}}^\text {B} \in \mathbb {R}^{27\times 3}\) and hand gestures \(\hat{{\textbf {v}}}^\text {H} \in \mathbb {R}^{48\times 3}\) are calculated as,

$$\begin{aligned} z^\text {M}_{i} = z^\text {T}_{i} \otimes z^\text {ID}_{i} \otimes z^\text {E}_{i} \otimes z^\text {A}_{i} \otimes z^\text {F}_{i} \otimes v^\text {B}_{i} \otimes v^\text {H}_{i}, \end{aligned}$$
(5)
$$\begin{aligned} {\textbf {z}}^\text {B} = D_\text {B}(z^\text {M}_{0}, ..., z^\text {M}_{n}), {\textbf {z}}^\text {H} = D_\text {H}(z^\text {M}_{0}, ..., z^\text {M}_{n}; {\textbf {z}}^\text {B}), \end{aligned}$$
(6)
$$\begin{aligned} \hat{{\textbf {v}}}^\text {B} = MLP_{\text {B}}({\textbf {z}}^\text {B}), \hat{{\textbf {v}}}^\text {H} = MLP_{\text {H}}({\textbf {z}}^\text {H}), \end{aligned}$$
(7)

where \({\textbf {z}}^\text {M} \in \mathbb {R}^{549}\) is the merged feature of all modalities. For Eq. 5, the length of the seed pose is four frames.
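A condensed sketch of the two decoders is given below: the merged per-frame feature drives a body LSTM+MLP, and the hand decoder additionally consumes the body latent, mirroring the body-to-hands cascade of Eqs. 6–7. Hidden sizes follow the text (256); other details, such as how the seed poses enter the merged feature, are assumptions.

```python
# Sketch of the cascaded decoders D_B and D_H: z^M drives a body LSTM+MLP,
# and the hand decoder additionally consumes the body latent z^B.
import torch
import torch.nn as nn

class CascadedDecoders(nn.Module):
    def __init__(self, merged_dim, body_joints=27, hand_joints=48, hidden=256):
        super().__init__()
        self.body_lstm = nn.LSTM(merged_dim, hidden, batch_first=True)
        self.body_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                      nn.Linear(hidden, body_joints * 3))
        self.hand_lstm = nn.LSTM(merged_dim + hidden, hidden, batch_first=True)
        self.hand_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                      nn.Linear(hidden, hand_joints * 3))

    def forward(self, z_merged):                  # (B, T, merged_dim)
        z_body, _ = self.body_lstm(z_merged)       # (B, T, 256) = z^B
        body = self.body_mlp(z_body)               # (B, T, 27*3)
        z_hand, _ = self.hand_lstm(torch.cat([z_merged, z_body], dim=-1))
        hands = self.hand_mlp(z_hand)              # (B, T, 48*3)
        return body, hands

B, T = 2, 34
# z^M: concatenation of text/ID/emotion/audio/face features and seed body/hand poses
z_m = torch.randn(B, T, 128 + 8 + 8 + 128 + 32 + 27 * 3 + 48 * 3)
body, hands = CascadedDecoders(merged_dim=z_m.shape[-1])(z_m)
print(body.shape, hands.shape)
```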

Loss Functions. The final supervision of our network is based on a gesture reconstruction loss and an adversarial loss:

$$\begin{aligned} \ell _{\text {Gesture Rec.}} =\mathbb {E}\left[ \left\| {\textbf {v}}^{B}-\hat{{\textbf {v}}}^{B}\right\| _{1}\right] + \alpha \mathbb {E}\left[ \left\| {\textbf {v}}^{H}-\hat{{\textbf {v}}}^{H}\right\| _{1}\right] , \end{aligned}$$
(8)
$$\begin{aligned} \ell _\text {Adv.}=-\mathbb {E}[\log (Dis(\hat{{\textbf {v}}}^{B};\hat{{\textbf {v}}}^{H}))], \end{aligned}$$
(9)

where the input to the discriminator for adversarial training is only the gesture itself. We also adopt a weight \(\alpha \) to balance the body and hand penalties. During training, we adjust the weights of the L1 loss and the adversarial loss using the semantic-relevancy label \(\lambda \). The final loss function is

$$\begin{aligned} \ell =\lambda \beta _{0} \ell _{\text{ Gesture } \text{ Rec. }}+\beta _{1} \ell _{\text{ Adv }}, \end{aligned}$$
(10)

where \(\beta _{0}\) and \(\beta _{1}\) are predefined weights for the L1 and adversarial losses. When the semantic relevancy is high, we encourage the network to generate gestures that are spatially as similar to the ground truth as possible, thus strengthening the L1 penalty and decreasing the adversarial penalty.
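A minimal sketch of this objective (Eqs. 8–10) follows: an L1 reconstruction term for body and hands balanced by \(\alpha\), plus a non-saturating adversarial term, with the semantic-relevancy label scaling the reconstruction weight. The function name, placeholder weights, and the generic discriminator logits are assumptions.

```python
# Sketch of the training objective in Eqs. 8-10: L1 reconstruction for body
# and hands (balanced by alpha) plus an adversarial term, with the semantic
# relevancy lambda scaling the reconstruction weight. Weight values are
# placeholders, not the paper's settings.
import torch
import torch.nn.functional as F

def camn_loss(body, hands, body_gt, hands_gt, disc_fake_logits,
              sem_lambda, alpha=1.0, beta0=1.0, beta1=0.1):
    rec = F.l1_loss(body, body_gt) + alpha * F.l1_loss(hands, hands_gt)
    adv = -torch.mean(torch.log(torch.sigmoid(disc_fake_logits) + 1e-8))
    return sem_lambda * beta0 * rec + beta1 * adv

# usage with dummy tensors
B, T = 2, 34
loss = camn_loss(torch.randn(B, T, 81), torch.randn(B, T, 144),
                 torch.randn(B, T, 81), torch.randn(B, T, 144),
                 disc_fake_logits=torch.randn(B), sem_lambda=0.8)
print(loss.item())
```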

5 Metric for Semantic Relevancy

We propose the Semantic-Relevant Gesture Recall (SRGR) to evaluate the semantic relevancy of gestures, which can also be interpreted as whether the gestures are vivid and diverse. We utilize the semantic scores as weights for the Probability of Correct Keypoint (PCK) between the generated gestures and the ground truth gestures, where PCK is the number of joints successfully recalled against a specified threshold \(\delta \). The SRGR metric is calculated as follows:

$$\begin{aligned} D_{SRGR} = \lambda \sum \frac{1}{T \times J} \sum _{t=1}^{T} \sum _{j=1}^{J} \textbf{1}\left[ \left\| p_{t}^{j}-\hat{p}_{t}^{j}\right\| _{2}<\delta \right] , \end{aligned}$$
(11)

where \(\textbf{1}\) is the indicator function, and T and J are the number of frames and joints, respectively. We believe that SRGR, which emphasizes recalling gestures in the clips of interest, is more in line with subjective human perception of a gesture's valid diversity than the L1 variance of the synthesized gestures.
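For concreteness, a minimal NumPy sketch of Eq. 11 is given below: each clip's joint recall within threshold \(\delta\) is weighted by its semantic score \(\lambda\) and summed over clips. Array shapes, the per-clip granularity of \(\lambda\), and any normalization of the weights are our assumptions.

```python
# Sketch of SRGR (Eq. 11): semantic-relevance-weighted PCK. Each clip's
# recall of joints within threshold delta is weighted by its semantic score
# lambda and summed over clips. Shapes are illustrative.
import numpy as np

def srgr(pred, gt, sem_scores, delta=0.1):
    """pred, gt: (N, T, J, 3) joint positions; sem_scores: (N,) lambda per clip."""
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, T, J)
    recall = (dist < delta).mean(axis=(1, 2))            # PCK per clip, in [0, 1]
    return float(np.sum(sem_scores * recall))

rng = np.random.default_rng(0)
gt = rng.normal(size=(5, 34, 75, 3))
pred = gt + 0.05 * rng.normal(size=gt.shape)
print(srgr(pred, gt, sem_scores=np.full(5, 0.2)))
```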

6 Experiments

In this section, we first evaluate the SRGR metric’s validity, then demonstrate our dataset’s data quality based on subjective experiments. Next, we demonstrate the validity of our baseline model using subjective and objective experiments, and finally, we discuss the contribution of each modality based on ablation experiments.

6.1 Validness of SRGR

A user study is conducted to evaluate the validity of SRGR. First, we randomly trim the motion sequences with rendered results into clips of around 40 s. For each clip, participants are asked to evaluate the gesture's diversity, i.e., the number of non-repeated gestures, and then to score its attractiveness, based on the motion itself rather than the content of the speech. In total, 160 participants took part in the evaluation study, and each evaluated 15 random clips of gestures. There are 200 gesture clips in total, including results generated by Seq2Seq [53], S2G [20], A2G [32], MultiContext [52], and the ground truth, with 40 clips for each using the same speaker data. Both questions follow a 5-point Likert scale. As shown in Fig. 6, we found a large variance in L1 diversity even though we used 100 gesture segments to calculate the average L1 distance (usually around 40 segments are used [32, 33]). Secondly, generated results with strong semantic relevance but a smaller motion range, such as Seq2Seq, obtained a lower L1 diversity than A2G, which has a larger motion range, yet the statistics show that humans perceive Seq2Seq as more diverse than A2G. An explanation is that humans evaluate diversity not only by the range of motion but also by other implicit features, such as expressiveness and the semantic relevancy of the motion.

Fig. 6. Comparison of Metrics by Group. SRGR is consistent with human perception and shows lower variance than L1 diversity in the evaluation.

6.2 Data Quality

To evaluate the quality of the captured ground truth motion data, we compare our proposed dataset with the widely used mo-cap dataset Trinity [17] and the in-the-wild dataset S2G-3D [20, 21]. We conducted a user study comparing clips sampled from the ground truth and from results generated by motion synthesis networks trained on each dataset. The Trinity dataset has a total of 23 sequences of 10 min each. We randomly divide the data into 19:2:2 for train/valid/test since there is no standard split.

Table 2. User Study Comparison with Trinity for Data Quality. Compared with Trinity [17], BEAT receives a higher user preference score in terms of ground truth data quality. “-b" and “-h" indicate body and hands, respectively.
Table 3. User Study Comparison with S2G-3D. BEAT receives similar user preferences in terms of naturalness. Based on the score, a model trained on the BEAT dataset would fit a more physically correct, diverse, and attractive distribution.

We used S2G [20], as well as the SoTA algorithm A2G [32], to cover both GAN and VAE models. The output layer of the S2G model was adapted to output 3D coordinates. As in the ablation study, the final generated 3D skeleton results were rendered and composited with audio for comparison in the user study. A total of 120 participants compared clips randomly sampled from Trinity and our dataset, with lengths of 5–20 s. The participants were asked to evaluate gesture correctness, i.e., physical correctness, diversity, and gesture-audio synchrony. Furthermore, the body and hands were evaluated separately in the gesture correctness test. The results, shown in Table 2, demonstrate that our dataset received higher user preference in all aspects. Especially for the hand movements, we outperform the Trinity dataset by a large margin, probably due to the noise of older motion capture devices and the lack of markers on the hands. Table 3 shows the preference ratios (%) of 60 subjects who each watched 20 randomly rendered 3D skeleton pairs per subjective test. Based on the score, the model trained on the BEAT dataset fits a more physically correct, diverse, and attractive distribution.

6.3 Evaluation of the Baseline Model

Training Setting. We use the Adam optimizer [29] with a learning rate of 2e-4, and the 4-speaker data is trained in an NVIDIA V100 environment. For evaluation metrics, the L1 distance has been demonstrated to be unsuitable for evaluating gesture performance [32, 52]; thus, we adopt FGD [52] to evaluate the distance between the distributions of the generated gestures and the ground truth. It computes the Fréchet distance between latent features extracted by a pretrained network; we use an LSTM-based autoencoder as the pretrained network (a sketch of the FGD computation is given below). In addition, we adopt SRGR and BeatAlign to evaluate diversity and synchrony. BeatAlign [33] is a Chamfer distance between audio and gesture beats that evaluates gesture-audio beat similarity.
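As referenced above, FGD follows the standard Fréchet-distance formula over Gaussians fitted to latent features of real and generated gestures. The sketch below assumes the features have already been extracted by the pretrained autoencoder; feature dimensions and names are illustrative.

```python
# Sketch of FGD: the Frechet distance between Gaussians fitted to latent
# features of real and generated gestures (features come from a pretrained
# autoencoder, not computed here). Standard FID-style formula.
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_fake):
    """feat_*: (N, D) latent features from the pretrained encoder."""
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 32)),
                       rng.normal(loc=0.3, size=(500, 32))))
```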

Table 4. Evaluation on BEAT. Our CaMN performs best in terms of FGD, SRGR and BeatAlign; all methods are trained on our dataset (BEAT)
Table 5. Results of Ablation Study.

Quantitative Results. The final results are shown in Table 4. In addition to S2G and A2G, we also compare our results with the text-to-gesture and audio&text-to-gesture algorithms Seq2Seq [53] and MultiContext [52]. The results show that both our end-to-end model and cascaded model achieve SoTA performance on all metrics (cf. supplementary materials for video results).

6.4 Ablation Study

Effectiveness of Cascaded Connection. As shown in Table 5, in contrast to the end-to-end approach, the cascaded connection achieves better performance because it introduces prior human knowledge that helps the network extract features from the different modalities.

Effectiveness of Each Modality. We gradually removed the data of one modality at a time (cf. Table 5). Synchrony is significantly reduced after removing the audio, which is intuitive. However, some synchronization remains, coming from the padding and time-aligned annotation of the text and the lip motion in the facial expressions. In contrast, eliminating the weighted semantic loss improves synchrony, which means that semantic gestures are usually not perfectly aligned with the audio. There is also a relationship between emotion and synchrony, but speaker ID has little effect on synchrony. The removal of audio, emotion, and facial expression does not significantly affect the semantically relevant gesture recall, which depends mainly on the text and the speaker ID. Data from each modality contributes to improving the FGD, which means using different modalities of data enhances the network's mapping ability. The use of audio and facial expressions, especially facial expressions, improves the FGD significantly. We found that removing emotion and speaker ID also impacts the FGD scores. This is because the integrated network increases the diversity of features, which leads to more diverse results, increasing the variance of the distribution and making it more like the original data.

Emotional Gestures. As shown in Table 6, we train a classifier consisting of an additional 1D-CNN + LSTM network and invite 60 subjects to each classify 12 random real test clips (with audio). The classifier is trained and tested on speaker-4's ground truth data (a sketch of such a classifier is given after Table 6).

Table 6. Emotional Gesture Classification. The classification accuracy (%) gap between the test real and generated data (1344 clips, 10 s each) is 15.85.
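As referenced above, a minimal sketch of such an emotion classifier is shown below: a small 1D-CNN over the pose dimension followed by an LSTM over time and a linear head for the 8 emotion classes. Layer sizes, the pose dimensionality (75 joints x 3), and the use of the last hidden state are our assumptions.

```python
# Sketch of an emotion classifier in the spirit of Table 6: a 1D-CNN over
# the pose features followed by an LSTM over time and a linear head for the
# 8 emotion classes. Layer sizes are assumptions.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, pose_dim=75 * 3, hidden=128, num_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(pose_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, poses):                    # (B, T, pose_dim)
        x = self.cnn(poses.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)                  # use the last hidden state
        return self.head(h[-1])                   # (B, 8) class logits

logits = EmotionClassifier()(torch.randn(4, 150, 75 * 3))  # 10 s at 15 FPS
print(logits.shape)  # torch.Size([4, 8])
```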

6.5 Limitation

Impact of Acting. Self-talk sessions might reflect the impact of acting, which is inevitable but controlled. Inevitable: the impact is probably caused by the pre-defined content. However, to explore semantic relevancy and personality, it is necessary to control the variables, i.e., different speakers should talk with the same text and emotion so that personality can be carefully examined. Controlled: speakers recorded the conversation session first and were encouraged to keep the same style in the self-talk session. We also filtered out about 21 h of data and six speakers due to inconsistencies in their styles.

Calculation of SRGR. SRGR is currently calculated from the semantic annotation, which limits its use on unlabelled datasets. To solve this problem, training a scoring network or a semantic discriminator is a possible direction.

7 Conclusion

We build a large-scale, high-quality, multi-modal dataset with semantic and emotional annotations to enable generating more human-like, semantically and emotionally relevant conversational gestures. Together with the dataset, we propose a cascaded baseline model for gesture synthesis based on six modalities and achieve SoTA performance. Finally, we introduce SRGR for evaluating semantic relevancy. In the future, we plan to expand cross-dataset checks for AU and emotion recognition benchmarks. Our dataset and the related statistical experiments could benefit a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional motion recognition.