1 Introduction

Synthesizing conversational gestures can be helpful for animation, entertainment, education and virtual reality applications. To accomplish this, the complex relationship between speech, facial expressions, emotions, speaker identity and semantic meaning of gestures has to be carefully considered in the design of the gesture synthesis models.

Fig. 1. Overview. BEAT is a large-scale, multi-modal mo-cap human gestures dataset with semantic and emotional annotations, diverse speakers, and multiple languages.

While synthesizing conversational gestures based on audio [20, 32, 52] or text [3, 5, 8, 53] has been widely studied, synthesizing realistic, vivid, human-like conversational gestures remains unsolved and challenging for several reasons. i) Quality and scale of the dataset. Previously proposed methods [32, 52] were trained on limited mo-cap datasets [17, 46] or on pseudo-labeled [20, 21, 52] datasets (cf. Table 1), which results in limited generalization capability and a lack of robustness. ii) Rich and paired multi-modal data. Previous works adopted one or two modalities [20, 52, 53] to synthesize gestures and reported that conversational gestures are determined by multiple modalities together. However, due to the lack of paired multi-modal data, the analysis of other modalities, e.g., facial expressions, for gesture synthesis is still missing. iii) Speaker style disentanglement. All available datasets, as shown in Table 1, either have only a single speaker [17] or many speakers who talk about different topics [20, 21, 52]. Speaker-specific styles were therefore rarely investigated in previous studies due to the lack of data. iv) Emotion annotation. Existing work [7] analyzes emotion-conditioned gestures by extracting implicit sentiment features from text. Due to the unlabeled and limited emotion categories in the dataset [52], it cannot cover enough of the emotions present in daily conversations. v) Semantic relevance. Due to the lack of semantic relevance annotation, only a few works [31, 52] analyze the correlation between generated gestures and semantics, and only through subjective visualization examples. Existing semantic labels for gestures would enable synthesizing context-related, meaningful gestures. In conclusion, the absence of a large-scale, high-quality multi-modal dataset with semantic and emotional annotation is the main obstacle to synthesizing human-like conversational gestures.

There are two design choices for collecting unlabeled multi-modal data: i) the pseudo-label approach [20, 21, 52], i.e., extracting conversational gestures and facial landmarks from in-the-wild videos using 3D pose estimation algorithms [12], and ii) the motion capture approach [17], i.e., recording speakers on predefined themes or texts. In contrast to the pseudo-labeling approach, which allows low-cost, semi-automated access to large-scale training data, e.g., 97 h [52], motion-captured data requires higher cost and more manual work, resulting in smaller dataset sizes, e.g., 4 h [17]. However, because motion capture can be strictly controlled and designed in advance, it can ensure the quality and diversity of the data, e.g., eight different emotions of the same speaker, and different gestures of 30 speakers speaking the same sentences. Besides, high-quality motion capture data are indispensable for evaluating the effectiveness of pseudo-label training.

Table 1. Comparison of Datasets. We compare with all 3D conversational gesture and face datasets. “#", “LM" and “BSW" indicate the number, landmark and blendshape weight, respectively. Our dataset is the largest mo-cap dataset with multi-modal data and annotations

Based on the above analysis, to address these data-related problems, we built a mo-cap dataset, BEAT, containing semantic and eight-class emotional annotations (cf. Fig. 1), from 30 speakers in four modalities of Body-Expression-Audio-Text, with a total of 30M annotated frames. The motion capture environment is strictly controlled to ensure quality and diversity, with 76 h and more than 2500 topic-segmented sequences. Speakers with different levels of language mastery also provided paired data in three other languages with different durations. The ratio of actors/actresses, the range of phonemes, and the variety of languages are carefully designed to cover natural language characteristics. For emotional gestures, feedback on the speakers’ expressions was provided by professional instructors during the recording process, and sequences were re-recorded in case of non-expressive gesturing to ensure the expressiveness and quality of the entire dataset. After statistical analysis on BEAT, we observed a correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity.

Additionally, we propose a baseline neural network architecture, Cascaded Motion Network (CaMN), which learns to synthesize body and hand gestures from all six modalities mentioned above. The proposed model consists of cascaded encoders and decoders for enhancing the contribution of the audio and facial modalities. Besides, in order to evaluate semantic relevancy, we propose Semantic-Relevant Gesture Recall (SRGR), which weights the Probability of Correct Keypoint (PCK) by the semantic scores of the ground truth data. Overall, our contributions can be summarized as follows:

  • We release BEAT, which is the first gesture dataset with semantic and emotional annotation, and the largest motion capture dataset in terms of duration and available modalities to the best of our knowledge.

  • We propose CaMN as a baseline model that inputs audio, text, facial blendshape weights, speaker identity, emotion and semantic score to synthesize conversational body and hand gestures through a cascaded network architecture.

  • We introduce SRGR to evaluate the semantic relevancy as well as the human preference for conversational gestures.

Finally, qualitative and quantitative experiments demonstrate the data quality of BEAT, the state-of-the-art performance of CaMN, and the validity of SRGR.

2 Related Work

Conversational Gestures Dataset. We first review mo-cap and pseudo-label conversational gesture datasets. Volkova et al. [47] built a mo-cap emotional gestures dataset of 89 min with text annotation, and Takeuchi et al. [45] captured an interview-like audio-gesture dataset totaling 3.5 h with two Japanese speakers. Ferstl and Mcdonnell [17] collected a 4-hour dataset, Trinity, with a single male speaker discussing hobbies, etc., which is the most commonly used mo-cap dataset for conversational gesture synthesis. On the other hand, Ginosar et al. [20] used OpenPose [12] to extract 2D poses from YouTube videos as training data, totaling 144 h, called the S2G dataset. Habibie et al. [21] extended it to a full 3D body with facial landmarks; the latest available data is 33 h. Similarly, Yoon et al. [52] used VideoPose3D [39] to build on the TED dataset, which is 97 h with 9 upper-body joints. The limited amount of mo-cap data and the noise in pseudo-label ground truth impose a trade-off between the trained network’s generalization capability and quality. Similar to our work, several datasets have been built for talking-face generation, which can be divided into 3D scanned faces, e.g., VOCA [46] and MeshTalk [42], and RGB images [4, 11, 15, 26, 49]. However, these datasets cannot be adopted to synthesize human gestures.

Semantic or Emotion-Aware Motion Synthesis. Semantic analysis of motion has been studied in the action recognition and sign-language analysis/synthesis research domains. For example, some action recognition datasets [9, 13, 14, 25, 28, 34, 40, 43, 44, 48] use clips annotated with a single action label, e.g., running or walking [41]. Another example is audio-driven sign-language synthesis [27], where hand gestures have specific semantics. However, these datasets do not apply to conversational gesture synthesis since gestures used in natural conversations are more complex than single actions, and their semantic meaning differs from sign-language semantics. Recently, Bhattacharya [7] extracted emotional cues from text and used them for gesture synthesis. However, the proposed method is limited by the accuracy of the emotion classification algorithm and the diversity of emotion categories in the dataset.

Conditional Conversational Gestures Synthesis. Early baseline models were released with datasets such as text-conditioned gesture [53], audio-conditioned gesture [17, 20, 45], and audio-text-conditioned gesture [52]. These baseline models were based on CNNs and LSTMs for end-to-end modelling. Several efforts have tried to improve the performance of the baseline models through input/output representation selection [19, 30], adversarial training [18] and various generative modeling techniques [1, 36, 50, 51], which can be summarized as “estimating a better distribution of gestures based on the given conditions." As an example, StyleGestures [2] uses a flow-based model [23] and an additional control signal to sample gestures from the distribution. Probabilistic gesture generation enables diversity driven by noise, as achieved by CGAN [51] and WGAN [50]. However, due to the lack of paired multi-modal data, the analysis of other modalities, e.g., facial expressions, for gesture synthesis is still missing.

3 BEAT: Body-Expression-Audio-Text Dataset

In this section, we introduce the proposed Body-Expression-Audio-Text (BEAT) Dataset. First, we describe the dataset acquisition process and then introduce text, emotion, and semantic relevance information annotation. Finally, we use BEAT to analyze the correlation between conversational gestures and emotions and show the distribution of semantic relevance.

3.1 Data Acquisition

Motion Capture System. The motion capture system, shown in Fig. 2a, is based on 16 synchronized cameras recording motion at 120 Hz. We use Vicon’s suits with 77 markers (cf. supplementary materials for the locations of the markers on the body). The facial capture system uses ARKit with the depth camera on an iPhone 12 Pro, which extracts 52 blendshape weights at 60 Hz. The blendshape targets are designed based on the Facial Action Coding System (FACS) and are widely used in industry, even by novice users. The audio is recorded in 48 kHz stereo.

Fig. 2. Capture System and Subject Distribution of BEAT. (a) A 16-camera motion capture system is adopted to record data in the conversation and self-talk sessions. (b) Gestures are divided into four categories in the conversation session. (c) Seven additional emotion categories are set in equal proportions in the self-talk session. (d) Our dataset includes four languages, mainly English, (e) spoken by 30 speakers from ten countries with different recording durations.

Design Criteria. BEAT is equally divided into conversation and self-talk sessions, which consist of 10-min and 1-min sequences, respectively. The conversation takes place remotely between the speaker and the instructor, i.e., to ensure only the speaker’s voice is recorded. As shown in Fig. 2b, the speaker’s gestures are divided into four categories: talking, instantaneous reactions to questions, the state of thinking (silence), and asking. We timed each category’s duration during the recording process. Topics were selected from 20 predefined topics, covering 33% debate and 67% description topics. Conversation sessions recorded neutral conversations without acting to ensure the diversity of the dataset. The self-talk sessions consist of 120 1-minute self-talk recordings, in which speakers answer questions about daily conversation topics, e.g., personal experiences or hobbies. The answers were written and proofread by three native English speakers, and the phonetic coverage was controlled to be similar to that of the 3000 most frequently used words [24]. We covered 8 emotions in the dataset, neutral, anger, happiness, fear, disgust, sadness, contempt and surprise, following [35]; the ratio of each emotion is shown in Fig. 2c. Among the 120 questions, 64 were for neutral emotions, and the remaining seven emotions had eight questions each. Different speakers were asked to talk about the same content with their own personalized gestures. Details about the predefined answers and pronunciation distribution are available in the supplementary materials.

Speaker Selection and Language Ratio. We strictly control the proportion of languages as well as accents to ensure the generalization capability of the dataset. As shown in Fig. 2d, the dataset consists mainly of English data: 60 h (81%), with 12 h of Chinese and 2 h each of Spanish and Japanese. The Spanish and Japanese portions are each about 50% of the size of the previous mo-cap dataset [17]. The English component includes 34 h from 10 native English speakers, from the US, UK, and Australia, and 26 h from 20 fluent English speakers from other countries. As shown in Fig. 2e, the 30 speakers (including 15 females) from different ethnicities can be grouped by their total recording duration into 4-h (10 speakers) and 1-h (20 speakers) groups, where the 1-h data is intended for few-shot learning experiments. We refer the reader to the supplementary material for details of the speakers.

Recording. Speakers were asked to read the self-talk answers proficiently. However, they were not guided to perform a specific style of gesture but were encouraged to show a natural, personal, daily style of conversational gestures. Before talking with a particular emotion, speakers watched 2–10 min of emotionally stimulating videos corresponding to that emotion. A professional instructor would guide them to elicit the corresponding emotion correctly. We re-recorded any unqualified data to ensure the correctness and quality of the data.

3.2 Data Annotation

Text Alignment. We use an in-house Automatic Speech Recognizer (ASR) to obtain the initial text for the conversation session, which is then proofread by annotators. Then, we adopt the Montreal Forced Aligner (MFA) [37] for temporal alignment of the text with the audio.

Emotion and Semantic Relevance. The 8-class emotion labels of the self-talk sessions are predefined, and on-site supervision guarantees their correctness. For the conversation session, annotators watched the video with the corresponding audio and gestures to perform frame-level annotation. For the semantic relevance, we collected scores on a scale of 0–10 from 600 assigned annotators on Amazon Mechanical Turk (AMT). The annotators were asked to annotate a small amount of test data as a qualification check, and only 118 annotators passed the qualification phase for the final data annotation. We paid \(\sim \) $10 per annotator per hour for this task.

Fig. 3. Emotional Gesture Clustering and Examples. (a) T-SNE visualization of gestures in eight emotion categories. Gestures with different emotions are largely separated into different groups, e.g., Happiness (blue) and Anger (orange). (b) Examples of Happiness (top) and Anger gestures from speaker-2. (Color figure online)

3.3 Data Analysis

The collection and annotation of BEAT make it possible to analyze correlations between conversational gestures and other modalities. While the connection between gestures and audio, text, and speaker identity has been widely studied, we further discuss the correlations between gestures and facial expressions, emotions, and semantics.

Facial Expression and Emotion. Facial expressions and emotions are strongly correlated (excluding some of the lip movements), so we first analyze the correlation between conversational gestures and emotion categories here. As shown in Fig. 3a, we visualized the gestures with T-SNE based on a 2 s rotation representation, and the results show that gestures have different characteristics under different emotions. For example, as shown in Fig. 3b, speaker-2 has different gesture styles when angry and when happy, e.g., the gestures are larger and faster when angry. The T-SNE results also differ significantly between Happiness (blue) and Anger (orange). However, the gestures of the different emotions are still not perfectly separable by the rotation representation, and gestures of different emotions partly overlap in each region, which is also consistent with subjective perception.
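The clustering in Fig. 3a can be reproduced with standard tools. Below is a minimal sketch using scikit-learn, assuming each 2 s gesture clip is flattened into a fixed-length rotation feature vector and paired with an emotion label; array shapes and variable names are illustrative, not the exact analysis pipeline used for the figure.

```python
# Minimal sketch of the T-SNE analysis in Fig. 3a: project fixed-length
# rotation features of 2 s gesture clips into 2D and color them by emotion.
# Shapes and names are illustrative assumptions, not the authors' pipeline.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

EMOTIONS = ["neutral", "anger", "happiness", "fear",
            "disgust", "sadness", "contempt", "surprise"]

# clips: (N, 30 * 75 * 3) -> 2 s at 15 FPS, 75 joints, 3 rotation values each
# (hypothetical layout; any flattened per-clip representation works here)
rng = np.random.default_rng(0)
clips = rng.normal(size=(400, 30 * 75 * 3)).astype(np.float32)
labels = rng.integers(0, len(EMOTIONS), size=400)

# 2D embedding of the per-clip features
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(clips)

for k, name in enumerate(EMOTIONS):
    mask = labels == k
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=6, label=name)
plt.legend(markerscale=2, fontsize=7)
plt.title("Gesture clips by emotion (t-SNE)")
plt.savefig("tsne_emotions.png", dpi=200)
```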

Fig. 4. Distribution of Semantic Labels. (a) Different speakers uttering the same phrase show different levels of semantic relevance and different gesture styles. (b) The overall semantic-relevance distribution of BEAT. (c) The semantic relevance of high-frequency words, grouped by lexical category in different colors. (d, e) The words i and was show different distributions of semantic relevance even though they share almost the same average relevance level. (Color figure online)

Distribution of Semantic Relevance. There is large randomness in the semantic relevance between gestures and text, as shown in Fig. 4: the frequency, position, and content of the semantically related gestures vary from speaker to speaker even when the same text content is uttered. To better understand the distribution of the semantic relevance of the gestures, we conducted a semantic relevance study based on four hours of data from two speakers. As shown in Fig. 4b, for the overall data, 83% of the gestures have low semantic scores (\(\le \) 0.2). At the word level, the semantic distribution varies between words, e.g., i and was share a similar average semantic score but differ in their score distributions. Besides, Fig. 4c shows the average semantic scores of nine high-frequency words in the text corpus. Notably, the scores of the be-verbs are comparatively lower than those of the pronouns and prepositions, which are shown in blue and yellow, respectively. Ultimately, each word presents a different probability distribution over its semantically related gestures.
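Word-level statistics like those in Fig. 4c–e amount to a simple aggregation over the aligned annotations. The sketch below assumes each aligned word comes with a semantic-relevance score in [0, 1]; the record format and numbers are hypothetical placeholders, not BEAT data.

```python
# Sketch of the word-level aggregation behind Fig. 4c-e: group aligned words
# by token and summarize their semantic-relevance scores. The (word, score)
# record format is a simplified, hypothetical view of the BEAT annotations.
from collections import defaultdict
from statistics import mean

aligned_words = [            # (token, semantic score in [0, 1]) - dummy values
    ("i", 0.05), ("was", 0.04), ("really", 0.35), ("huge", 0.80),
    ("i", 0.50), ("was", 0.02), ("this", 0.10), ("huge", 0.60),
]

scores_by_word = defaultdict(list)
for token, score in aligned_words:
    scores_by_word[token].append(score)

for token, scores in sorted(scores_by_word.items()):
    low = sum(s <= 0.2 for s in scores) / len(scores)   # share of low-relevance uses
    print(f"{token:>8s}  mean={mean(scores):.2f}  share<=0.2={low:.2f}")
```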

4 Multi-modal Conditioned Gestures Synthesis Baseline

In this section, we propose a baseline that inputs all of the modalities to generate vivid, human-like conversational gestures. The proposed baseline, Cascaded Motion Network (CaMN), shown in Fig. 5, encodes text, emotion condition, speaker identity, audio, and facial blendshape weights to synthesize body and hand gestures in a multi-stage, cascaded structure. In addition, semantic relevancy is adopted as a loss weight to make the network generate more semantically relevant gestures. The network choices for the text, audio, and speaker ID encoders follow [52] and are customized for better performance. All input data have the same time resolution as the output gestures so that the synthesized gestures can be processed frame by frame through a sequential model. The gestures and facial blendshape weights are downsampled to 15 FPS, and padding tokens are inserted into the word sequence to correspond to silences in the audio (a sketch of this frame-level alignment is given below).
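As referenced above, the word sequence must be expanded to the 15 FPS frame grid, with padding tokens filling silent frames. A minimal sketch of that expansion, assuming MFA-style (word, start, end) intervals; the "<PAD>" token and helper name are ours, not the paper's exact preprocessing.

```python
# Sketch: expand time-aligned words (e.g., from a forced aligner) to a 15 FPS
# token sequence, inserting a padding token for silent frames. The "<PAD>"
# token and this helper are illustrative, not the authors' exact code.
FPS = 15
PAD = "<PAD>"

def words_to_frames(word_intervals, num_frames):
    """word_intervals: list of (word, start_sec, end_sec)."""
    tokens = [PAD] * num_frames
    for word, start, end in word_intervals:
        first = int(round(start * FPS))
        last = min(int(round(end * FPS)), num_frames)
        for t in range(first, last):
            tokens[t] = word
    return tokens

intervals = [("hello", 0.1, 0.5), ("there", 0.8, 1.2)]
print(words_to_frames(intervals, num_frames=int(1.5 * FPS)))
```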

Fig. 5. Cascaded Motion Network (CaMN). As a multi-modal gesture synthesis baseline, CaMN takes text, emotion label, speaker ID, audio, and facial blendshape weights as input in a cascaded architecture; the audio and facial features are extracted by concatenating the features of the previous modalities. The fused features are reconstructed into body and hand gestures by two cascaded LSTM+MLP decoders.

Text Encoder. First, words are converted to a word embedding set \({\textbf {v}}^\text {T} \in \mathbb {R}^{300} \) by the pre-trained model in FastText [10] to reduce dimensions. Then, the word sets are fine-tuned by a customized encoder \(E_\text {T}\), which is an 8-layer temporal convolution network (TCN) [6] with skip connections [22], as

$$\begin{aligned} z^\text {T}_{i} = E_\text {T}(v^\text {T}_{i-f}, ..., v^\text {T}_{i+f}), \end{aligned}$$
(1)

For each frame i, the TCN fuses the information from \(2f=34\) frames to generate the final latent text feature; the set of features is denoted as \({\textbf {z}}^\text {T} \in \mathbb {R}^{128} \).
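To make the encoder concrete, here is a condensed PyTorch sketch of an 8-layer temporal convolution stack with skip connections mapping 300-d word embeddings to 128-d per-frame text features. The kernel size, dilation pattern, and hidden width are our assumptions, not the paper's exact hyper-parameters.

```python
# Sketch of the text encoder E_T: an 8-layer TCN with skip connections
# mapping 300-d word embeddings to 128-d per-frame features. Kernel size,
# dilations, and hidden width are assumptions.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return x + self.act(self.conv(x))   # skip connection

class TextEncoder(nn.Module):
    def __init__(self, in_dim=300, hidden=128, out_dim=128, layers=8):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.blocks = nn.Sequential(
            *[TCNBlock(hidden, dilation=2 ** (i % 4)) for i in range(layers)])
        self.out = nn.Conv1d(hidden, out_dim, kernel_size=1)

    def forward(self, v_text):               # (B, T, 300) word embeddings
        x = v_text.transpose(1, 2)            # (B, 300, T)
        z = self.out(self.blocks(self.inp(x)))
        return z.transpose(1, 2)              # (B, T, 128) per-frame z^T

z_text = TextEncoder()(torch.randn(2, 34, 300))
print(z_text.shape)  # torch.Size([2, 34, 128])
```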

Speaker ID and Emotion Encoders. The initial representations of speaker ID and emotion are both one-hot vectors, \({\textbf {v}}^\text {ID} \in \mathbb {R}^{30}\) and \({\textbf {v}}^\text {E} \in \mathbb {R}^{8}\). Following the suggestion in [52], we use an embedding layer as the speaker ID encoder, \(E_\text {ID}\). Since the speaker ID does not change instantly, we only use the current frame's speaker ID to calculate its latent feature. On the other hand, we use a combination of an embedding layer and a 4-layer TCN as the emotion encoder, \(E_\text {E}\), to extract temporal emotion variations.

$$\begin{aligned} z^\text {ID}_{i} = E_\text {ID}(v^\text {ID}_{i}), z^\text {E}_{i} = E_\text {E}(v^\text {E}_{i-f}, ..., v^\text {E}_{i+f}), \end{aligned}$$
(2)

where \({\textbf {z}}^\text {ID} \in \mathbb {R}^{8}\) and \({\textbf {z}}^\text {E} \in \mathbb {R}^{8} \) are the latent features for speaker ID and emotion, respectively.
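The following sketch mirrors these two encoders: an embedding layer for the 30 speaker IDs and an embedding layer followed by a 4-layer temporal convolution for the 8 emotion labels, both producing 8-d per-frame features. Layer widths and kernel sizes are assumptions.

```python
# Sketch of E_ID and E_E: a speaker-ID embedding and an emotion embedding
# followed by a small 4-layer TCN, each yielding 8-d per-frame features.
import torch
import torch.nn as nn

class SpeakerIDEncoder(nn.Module):
    def __init__(self, num_speakers=30, dim=8):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, dim)

    def forward(self, speaker_id):         # (B, T) int64 ids
        return self.embed(speaker_id)       # (B, T, 8) = z^ID

class EmotionEncoder(nn.Module):
    def __init__(self, num_emotions=8, dim=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(num_emotions, dim)
        convs = []
        for _ in range(layers):             # 4-layer TCN over time
            convs += [nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                      nn.LeakyReLU(0.2)]
        self.tcn = nn.Sequential(*convs)

    def forward(self, emotion_id):          # (B, T) int64 labels
        x = self.embed(emotion_id).transpose(1, 2)    # (B, 8, T)
        return self.tcn(x).transpose(1, 2)            # (B, T, 8) = z^E

ids = torch.randint(0, 30, (2, 34))
emos = torch.randint(0, 8, (2, 34))
print(SpeakerIDEncoder()(ids).shape, EmotionEncoder()(emos).shape)
```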

Audio Encoder. We adopt the raw waveform representation of audio and downsample it to 16 kHz; treating audio as 15 FPS, for each frame we have \({\textbf {v}}^\text {A} \in \mathbb {R}^{1067}\). We feed the audio jointly with the text, speaker ID, and emotion features into the audio encoder \(E_\text {A}\) to learn better audio features, as

$$\begin{aligned} z^\text {A}_{i} = E_\text {A}(v^\text {A}_{i-f}, ..., v^\text {A}_{i+f}; v^\text {T}_{i}; v^\text {E}_{i}; v^\text {ID}_{i}), \end{aligned}$$
(3)

\(E_\text {A}\) consists of a 12-layer TCN with skip connections and a 2-layer MLP; the features of the other modalities are concatenated with the 12th-layer audio features, so the final MLP layers serve as audio feature refinement. The final latent audio feature is \({\textbf {z}}^\text {A} \in \mathbb {R}^{128}\).
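The sketch below illustrates the cascading idea for this encoder: deep audio features are concatenated with the features of the earlier modalities before a 2-layer MLP refinement. The TCN depth is shortened, the widths are assumptions, and whether raw or encoded earlier-modality features are concatenated is a design detail; this sketch uses the encoded features.

```python
# Sketch of the cascaded audio encoder E_A: a TCN over per-frame raw-audio
# vectors whose deep features are concatenated with the earlier text /
# emotion / speaker features before a 2-layer MLP refinement. Depth is
# shortened and widths are assumptions; the late concatenation is the point.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, in_dim=1067, hidden=128, cond_dim=128 + 8 + 8, out_dim=128):
        super().__init__()
        layers = [nn.Conv1d(in_dim, hidden, kernel_size=1), nn.LeakyReLU(0.2)]
        for _ in range(3):                      # stand-in for the 12-layer TCN
            layers += [nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.2)]
        self.tcn = nn.Sequential(*layers)
        self.mlp = nn.Sequential(               # refinement after concatenation
            nn.Linear(hidden + cond_dim, out_dim), nn.LeakyReLU(0.2),
            nn.Linear(out_dim, out_dim))

    def forward(self, v_audio, z_text, z_emo, z_id):
        # v_audio: (B, T, 1067); z_text: (B, T, 128); z_emo, z_id: (B, T, 8)
        a = self.tcn(v_audio.transpose(1, 2)).transpose(1, 2)   # (B, T, 128)
        fused = torch.cat([a, z_text, z_emo, z_id], dim=-1)
        return self.mlp(fused)                                   # (B, T, 128) = z^A

B, T = 2, 34
z_a = AudioEncoder()(torch.randn(B, T, 1067), torch.randn(B, T, 128),
                     torch.randn(B, T, 8), torch.randn(B, T, 8))
print(z_a.shape)  # torch.Size([2, 34, 128])
```

The facial expression encoder described next follows the same cascaded pattern, so it is not sketched separately.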

Facial Expression Encoder. We take \({\textbf {v}}^\text {F} \in \mathbb {R}^{52}\) as the initial representation of the facial expression. An encoder \(E_\text {F}\) based on an 8-layer TCN and a 2-layer MLP is adopted to extract the facial latent feature \({\textbf {z}}^\text {F} \in \mathbb {R}^{32} \), as

$$\begin{aligned} z^\text {F}_{i} = E_\text {F}(v^\text {F}_{i-f}, ..., v^\text {F}_{i+f}; v^\text {T}_{i}; v^\text {E}_{i}; v^\text {ID}_{i}; v^\text {A}_{i}), \end{aligned}$$
(4)

where the features are concatenated at the 8th layer and the MLP is used for refinement.

Body and Hands Decoders. We implement the body and hands decoders in a separated, cascaded structure, based on the conclusion in [38] that body gestures can be used to estimate hand gestures. The two decoders, \(D_\text {B}\) and \(D_\text {H}\), are based on an LSTM structure for latent feature extraction and a 2-layer MLP for gesture reconstruction. They combine the features of the five modalities with previous gestures, i.e., seed poses, to synthesize the latent gesture features \({\textbf {z}}^\text {B} \in \mathbb {R}^{256}\) and \({\textbf {z}}^\text {H} \in \mathbb {R}^{256}\). The final estimated body gestures \(\hat{{\textbf {v}}}^\text {B} \in \mathbb {R}^{27\times 3}\) and hand gestures \(\hat{{\textbf {v}}}^\text {H} \in \mathbb {R}^{48\times 3}\) are calculated as,

$$\begin{aligned} z^\text {M}_{i} = z^\text {T}_{i} \otimes z^\text {ID}_{i} \otimes z^\text {E}_{i} \otimes z^\text {A}_{i} \otimes z^\text {F}_{i} \otimes v^\text {B}_{i} \otimes v^\text {H}_{i}, \end{aligned}$$
(5)
$$\begin{aligned} {\textbf {z}}^\text {B} = D_\text {B}(z^\text {M}_{0}, ..., z^\text {M}_{n}), {\textbf {z}}^\text {H} = D_\text {H}(z^\text {M}_{0}, ..., z^\text {M}_{n}; {\textbf {z}}^\text {B}), \end{aligned}$$
(6)
$$\begin{aligned} \hat{{\textbf {v}}}^\text {B} = MLP_{\text {B}}({\textbf {z}}^\text {B}), \hat{{\textbf {v}}}^\text {H} = MLP_{\text {H}}({\textbf {z}}^\text {H}), \end{aligned}$$
(7)

where \({\textbf {z}}^\text {M} \in \mathbb {R}^{549}\) is the merged feature of all modalities. For Eq. 5, the length of the seed pose is four frames.
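A condensed sketch of the two decoders is given below: the merged per-frame feature drives a body LSTM+MLP, and the hand decoder additionally consumes the body latent, mirroring the body-to-hands cascade of Eqs. 6–7. Hidden sizes follow the text (256); other details, such as how the seed poses enter the merged feature, are assumptions.

```python
# Sketch of the cascaded decoders D_B and D_H: z^M drives a body LSTM+MLP,
# and the hand decoder additionally consumes the body latent z^B.
import torch
import torch.nn as nn

class CascadedDecoders(nn.Module):
    def __init__(self, merged_dim, body_joints=27, hand_joints=48, hidden=256):
        super().__init__()
        self.body_lstm = nn.LSTM(merged_dim, hidden, batch_first=True)
        self.body_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                      nn.Linear(hidden, body_joints * 3))
        self.hand_lstm = nn.LSTM(merged_dim + hidden, hidden, batch_first=True)
        self.hand_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                      nn.Linear(hidden, hand_joints * 3))

    def forward(self, z_merged):                  # (B, T, merged_dim)
        z_body, _ = self.body_lstm(z_merged)       # (B, T, 256) = z^B
        body = self.body_mlp(z_body)               # (B, T, 27*3)
        z_hand, _ = self.hand_lstm(torch.cat([z_merged, z_body], dim=-1))
        hands = self.hand_mlp(z_hand)              # (B, T, 48*3)
        return body, hands

B, T = 2, 34
# z^M: concatenation of text/ID/emotion/audio/face features and seed body/hand poses
z_m = torch.randn(B, T, 128 + 8 + 8 + 128 + 32 + 27 * 3 + 48 * 3)
body, hands = CascadedDecoders(merged_dim=z_m.shape[-1])(z_m)
print(body.shape, hands.shape)
```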

Loss Functions. The final supervision of our network is based on a gesture reconstruction loss and an adversarial loss:

$$\begin{aligned} \ell _{\text {Gesture Rec.}} =\mathbb {E}\left[ \left\| {\textbf {v}}^{B}-\hat{{\textbf {v}}}^{B}\right\| _{1}\right] + \alpha \mathbb {E}\left[ \left\| {\textbf {v}}^{H}-\hat{{\textbf {v}}}^{H}\right\| _{1}\right] , \end{aligned}$$
(8)
$$\begin{aligned} \ell _\text {Adv.}=-\mathbb {E}[\log (Dis(\hat{{\textbf {v}}}^{B};\hat{{\textbf {v}}}^{H}))], \end{aligned}$$
(9)

where the input to the discriminator for adversarial training is only the gesture itself. We also adopt a weight \(\alpha \) to balance the body and hand penalties. During training, we adjust the weights of the L1 loss and the adversarial loss using the semantic-relevancy label \(\lambda \). The final loss function is

$$\begin{aligned} \ell =\lambda \beta _{0} \ell _{\text{ Gesture } \text{ Rec. }}+\beta _{1} \ell _{\text{ Adv }}, \end{aligned}$$
(10)

where \(\beta _{0}\) and \(\beta _{1}\) are predefined weights for the L1 and adversarial losses. When the semantic relevancy is high, we encourage the network to generate gestures that are spatially as similar to the ground truth as possible, thus strengthening the L1 penalty and decreasing the adversarial penalty.
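A minimal sketch of this objective (Eqs. 8–10) follows: an L1 reconstruction term for body and hands balanced by \(\alpha\), plus a non-saturating adversarial term, with the semantic-relevancy label scaling the reconstruction weight. The function name, placeholder weights, and the generic discriminator logits are assumptions.

```python
# Sketch of the training objective in Eqs. 8-10: L1 reconstruction for body
# and hands (balanced by alpha) plus an adversarial term, with the semantic
# relevancy lambda scaling the reconstruction weight. Weight values are
# placeholders, not the paper's settings.
import torch
import torch.nn.functional as F

def camn_loss(body, hands, body_gt, hands_gt, disc_fake_logits,
              sem_lambda, alpha=1.0, beta0=1.0, beta1=0.1):
    rec = F.l1_loss(body, body_gt) + alpha * F.l1_loss(hands, hands_gt)
    adv = -torch.mean(torch.log(torch.sigmoid(disc_fake_logits) + 1e-8))
    return sem_lambda * beta0 * rec + beta1 * adv

# usage with dummy tensors
B, T = 2, 34
loss = camn_loss(torch.randn(B, T, 81), torch.randn(B, T, 144),
                 torch.randn(B, T, 81), torch.randn(B, T, 144),
                 disc_fake_logits=torch.randn(B), sem_lambda=0.8)
print(loss.item())
```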

5 Metric for Semantic Relevancy

We propose the Semantic-Relevant Gesture Recall (SRGR) to evaluate the semantic relevancy of gestures, which can also be interpreted as whether the gestures are vivid and diverse. We utilize the semantic scores as weights for the Probability of Correct Keypoint (PCK) between the generated gestures and the ground truth gestures, where PCK is the number of joints successfully recalled against a specified threshold \(\delta \). The SRGR metric is calculated as follows:

$$\begin{aligned} D_{SRGR} = \lambda \sum \frac{1}{T \times J} \sum _{t=1}^{T} \sum _{j=1}^{J} \textbf{1}\left[ \left\| p_{t}^{j}-\hat{p}_{t}^{j}\right\| _{2}<\delta \right] , \end{aligned}$$
(11)

where \(\textbf{1}\) is the indicator function, and T and J are the number of frames and joints, respectively. We believe that SRGR, which emphasizes recalling gestures in the clips of interest, is more in line with subjective human perception of a gesture's valid diversity than the L1 variance of the synthesized gestures.
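For concreteness, a minimal NumPy sketch of Eq. 11 is given below: each clip's joint recall within threshold \(\delta\) is weighted by its semantic score \(\lambda\) and summed over clips. Array shapes, the per-clip granularity of \(\lambda\), and any normalization of the weights are our assumptions.

```python
# Sketch of SRGR (Eq. 11): semantic-relevance-weighted PCK. Each clip's
# recall of joints within threshold delta is weighted by its semantic score
# lambda and summed over clips. Shapes are illustrative.
import numpy as np

def srgr(pred, gt, sem_scores, delta=0.1):
    """pred, gt: (N, T, J, 3) joint positions; sem_scores: (N,) lambda per clip."""
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, T, J)
    recall = (dist < delta).mean(axis=(1, 2))            # PCK per clip, in [0, 1]
    return float(np.sum(sem_scores * recall))

rng = np.random.default_rng(0)
gt = rng.normal(size=(5, 34, 75, 3))
pred = gt + 0.05 * rng.normal(size=gt.shape)
print(srgr(pred, gt, sem_scores=np.full(5, 0.2)))
```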

6 Experiments

In this section, we first evaluate the SRGR metric’s validity, then demonstrate our dataset’s data quality based on subjective experiments. Next, we demonstrate the validity of our baseline model using subjective and objective experiments, and finally, we discuss the contribution of each modality based on ablation experiments.

6.1 Validness of SRGR

A user study is conducted to evaluate the validity of SRGR. First, we randomly trim the motion sequences with rendered results into clips of around 40 s. For each clip, participants are asked to evaluate the gesture's diversity, i.e., the number of non-repeated gestures, and then to score its attractiveness, based on the motion itself rather than the content of the speech. In total, 160 participants took part in the evaluation study, and each evaluated 15 random clips of gestures. There are 200 gesture clips in total, including results generated by Seq2Seq [53], S2G [20], A2G [32], MultiContext [52], and the ground truth, with 40 clips for each using the same speaker data. Both questions follow a 5-point Likert scale. As shown in Fig. 6, we found a large variance in L1 diversity even though we used 100 gesture segments to calculate the average L1 distance (usually around 40 segments are used [32, 33]). Secondly, generated results with strong semantic relevance but a smaller motion range, such as Seq2Seq, obtained a lower L1 diversity than A2G, which has a larger motion range, yet the statistics show that humans perceive Seq2Seq as more diverse than A2G. An explanation is that humans evaluate diversity not only by the range of motion but also by other implicit features, such as expressiveness and the semantic relevancy of the motion.

Fig. 6. Comparison of Metrics by Group. SRGR is consistent with human perception and shows lower variance than L1 diversity in the evaluation.

6.2 Data Quality

To evaluate the quality of the captured ground truth motion data, we compare our proposed dataset with the widely used mo-cap dataset Trinity [17] and the in-the-wild dataset S2G-3D [20, 21]. We conducted a user study comparing clips sampled from the ground truth and from results generated by motion synthesis networks trained on each dataset. The Trinity dataset has a total of 23 sequences of 10 min each. We randomly divide the data into 19:2:2 for train/valid/test since there is no standard split.

Table 2. User Study Comparison with Trinity for Data Quality. Compared with Trinity [17], BEAT receives a higher user preference score in terms of ground truth data quality. “-b" and “-h" indicate body and hands, respectively.
Table 3. User Study Comparison with S2G-3D. BEAT receives similar user preferences in terms of naturalness. Based on the score, a model trained on the BEAT dataset would fit a more physically correct, diverse, and attractive distribution.

We used S2G [20], as well as the SoTA algorithm A2G [32], to cover both GAN and VAE models. The output layer of the S2G model was adapted to output 3D coordinates. As in the ablation study, the final generated 3D skeleton results were rendered and composited with audio for comparison in the user study. A total of 120 participants compared clips randomly sampled from Trinity and our dataset, with lengths of 5–20 s. The participants were asked to evaluate gesture correctness, i.e., physical correctness, diversity, and gesture-audio synchrony. Furthermore, the body and hands were evaluated separately in the gesture correctness test. The results, shown in Table 2, demonstrate that our dataset received higher user preference in all aspects. Especially for the hand movements, we outperform the Trinity dataset by a large margin, probably due to the noise of older motion capture devices and the lack of markers on the hands. Table 3 shows the preference ratios (%) of 60 subjects who each watched 20 randomly rendered 3D skeleton pairs per subjective test. Based on the score, the model trained on the BEAT dataset fits a more physically correct, diverse, and attractive distribution.

6.3 Evaluation of the Baseline Model

Training Setting. We use the Adam optimizer [29] with a learning rate of 2e-4, and the 4-speaker data is trained in an NVIDIA V100 environment. For evaluation metrics, the L1 distance has been demonstrated to be unsuitable for evaluating gesture performance [32, 52]; thus, we adopt FGD [52] to evaluate the distance between the distributions of the generated gestures and the ground truth. It computes the Fréchet distance between latent features extracted by a pretrained network; we use an LSTM-based autoencoder as the pretrained network (a sketch of the FGD computation is given below). In addition, we adopt SRGR and BeatAlign to evaluate diversity and synchrony. BeatAlign [33] is a Chamfer distance between audio and gesture beats that evaluates gesture-audio beat similarity.
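As referenced above, FGD follows the standard Fréchet-distance formula over Gaussians fitted to latent features of real and generated gestures. The sketch below assumes the features have already been extracted by the pretrained autoencoder; feature dimensions and names are illustrative.

```python
# Sketch of FGD: the Frechet distance between Gaussians fitted to latent
# features of real and generated gestures (features come from a pretrained
# autoencoder, not computed here). Standard FID-style formula.
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_fake):
    """feat_*: (N, D) latent features from the pretrained encoder."""
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 32)),
                       rng.normal(loc=0.3, size=(500, 32))))
```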

Table 4. Evaluation on BEAT. Our CaMN performs best in terms of FGD, SRGR and BeatAlign; all methods are trained on our dataset (BEAT)
Table 5. Results of Ablation Study.

Quantitative Results. The final results are shown in Table 4. In addition to S2G and A2G, we also compare our results with the text-to-gesture and audio&text-to-gesture algorithms Seq2Seq [53] and MultiContext [52]. The results show that both our end-to-end model and cascaded model achieve SoTA performance on all metrics (cf. supplementary materials for video results).

6.4 Ablation Study

Effectiveness of Cascaded Connection. As shown in Table 5, in contrast to the end-to-end approach, the cascaded connection achieves better performance because it introduces prior human knowledge that helps the network extract features from the different modalities.

Effectiveness of Each Modality. We gradually removed the data of one modality at a time (cf. Table 5). Synchrony is significantly reduced after removing the audio, which is intuitive. However, some synchronization remains, coming from the padding and time-aligned annotation of the text and the lip motion in the facial expressions. In contrast, eliminating the weighted semantic loss improves synchrony, which means that semantic gestures are usually not perfectly aligned with the audio. There is also a relationship between emotion and synchrony, but speaker ID has little effect on synchrony. The removal of audio, emotion, and facial expression does not significantly affect the semantically relevant gesture recall, which depends mainly on the text and the speaker ID. Data from each modality contributes to improving the FGD, which means using different modalities of data enhances the network's mapping ability. The use of audio and facial expressions, especially facial expressions, improves the FGD significantly. We found that removing emotion and speaker ID also impacts the FGD scores. This is because the integrated network increases the diversity of features, which leads to more diverse results, increasing the variance of the distribution and making it more like the original data.

Emotional Gestures. As shown in Table 6, we train a classifier consisting of an additional 1D-CNN + LSTM network and invite 60 subjects to each classify 12 random real test clips (with audio). The classifier is trained and tested on speaker-4's ground truth data (a sketch of such a classifier is given after Table 6).

Table 6. Emotional Gesture Classification. The classification accuracy (%) gap between the test real and generated data (1344 clips, 10 s each) is 15.85.
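As referenced above, a minimal sketch of such an emotion classifier is shown below: a small 1D-CNN over the pose dimension followed by an LSTM over time and a linear head for the 8 emotion classes. Layer sizes, the pose dimensionality (75 joints x 3), and the use of the last hidden state are our assumptions.

```python
# Sketch of an emotion classifier in the spirit of Table 6: a 1D-CNN over
# the pose features followed by an LSTM over time and a linear head for the
# 8 emotion classes. Layer sizes are assumptions.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, pose_dim=75 * 3, hidden=128, num_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(pose_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, poses):                    # (B, T, pose_dim)
        x = self.cnn(poses.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)                  # use the last hidden state
        return self.head(h[-1])                   # (B, 8) class logits

logits = EmotionClassifier()(torch.randn(4, 150, 75 * 3))  # 10 s at 15 FPS
print(logits.shape)  # torch.Size([4, 8])
```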

6.5 Limitation

Impact of Acting. Self-talk sessions might reflect the impact of acting, which is inevitable but controlled. Inevitable: the impact is probably caused by the pre-defined content. However, to explore semantic relevancy and personality, it is necessary to control the variables, i.e., different speakers should talk with the same text and emotion so that personality can be carefully examined. Controlled: speakers recorded the conversation session first and were encouraged to keep the same style in the self-talk session. We also filtered out about 21 h of data and six speakers due to inconsistencies in their styles.

Calculation of SRGR. SRGR is currently calculated from the semantic annotation, which limits its use on unlabelled datasets. To solve this problem, training a scoring network or a semantic discriminator is a possible direction.

7 Conclusion

We build a large-scale, high-quality, multi-modal dataset with semantic and emotional annotations to enable generating more human-like, semantically and emotionally relevant conversational gestures. Together with the dataset, we propose a cascaded baseline model for gesture synthesis based on six modalities and achieve SoTA performance. Finally, we introduce SRGR for evaluating semantic relevancy. In the future, we plan to expand cross-dataset checks for AU and emotion recognition benchmarks. Our dataset and the related statistical experiments could benefit a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional motion recognition.