1 Introduction

Understanding and describing videos that contain multiple events often requires establishing “who is who”: who the participants in these events are and who is doing what. Most prior work on automatic video description focuses on individual short clips and ignores the aspect of participants’ identity. In particular, prior works on movie description tend to replace all character identities with a generic label SOMEONE [35, 41]. While reasonable for individual short clips, this becomes an issue for longer video sequences. As shown in Fig. 1, descriptions that contain SOMEONE would not be satisfying to visually impaired users, as they do not unambiguously convey who is engaged in which action in the video.

Several prior works attempt to perform person re-identification  [5, 36, 39] in movies and TV shows, sometimes relying on associated subtitles or textual descriptions [27, 31, 40]. Most such works take the “linking tracks to names” problem statement, i.e. trying to name all the detected tracks with proper character names. Others like [29] aim to “fill in” the character proper names in the given ground-truth movie descriptions.

Fig. 1.

Comparison of video description with SOMEONE labels vs. Identity-Aware Video Description: in the first case it may be difficult for a visually impaired person to follow what is going on in the video, while in the second case it becomes clear who is performing which action.

In this work, we propose a different problem statement, which does not require prior knowledge of movie characters and their appearance. Specifically, we group several consecutive movie clips into sets and aim to establish person identities locally within each set of clips. We then propose the following two tasks. First, given ground-truth descriptions of a set of clips, the goal is to fill in person identities in a coherent manner, i.e. to predict the same ID for the same person within a set (see Fig. 2). Second, given a set of clips, the goal is to generate video descriptions that contain corresponding local person IDs (see Fig. 1). We refer to these two tasks as Fill-in the Identity and Identity-Aware Video Description. The first (auxiliary) task is of interest by itself, as it requires establishing person identities in a multi-modal context of video and description.

We experiment with the Large Scale Movie Description Challenge (LSMDC) dataset and associated annotations  [34, 35], as well as collect more annotations to support our new problem statement. We transform the global character information into local IDs within each set of clips, which we use for both tasks.

Fill-in the Identity. Given textual descriptions of a sequence of events, we aim to fill in the person IDs in the blanks. In order to do that, two steps are necessary. First, we need to attend to a specific person in the video by relating visual observations to the textual descriptions. Second, we need to establish links within a set of blanks by relating corresponding visual appearances and textual context. We learn to perform both steps jointly, as the only supervision available to us is that of the identities, not the corresponding visual tracks. Our key idea is to consider an entire set of blanks jointly, and exploit the mutual relations between the attended characters. We thus propose a Transformer model  [42] which jointly infers the identity labels for all blanks. Moreover, to support this process we make use of one additional cue available to us: the gender of the person in question. We train a text-based gender classifier, which we integrate in our model, along with an additional visual gender prediction objective, which aims to recognize gender based on the attended visual appearance.

Fig. 2.

Example of the Fill-in the Identity task.

Identity-Aware Video Description. Given a set of clips, we aim to generate descriptions with local IDs. Here, we take a two-stage approach, where we first obtain descriptions with SOMEONE labels, and then apply our Fill-in the Identity method to give the SOMEONEs their IDs. We believe there is potential in exploring other models that would incorporate the knowledge of identities into the generation process, and leave this to future work.

Our contributions are as follows. (1) We introduce a new task of Identity-Aware Video Description, which extends prior work in that it aims to obtain multi-sentence video descriptions with local person IDs. (2) We also introduce a task of Fill-in the Identity, which, we hope, will inform future directions of combining identity information with video description. (3) We propose a Transformer model for this task, which learns to attend to a person and use gender evidence along with other visual and textual cues to correctly fill in the person’s ID. (4) We obtain state-of-the-art results on the Fill-in the Identity task, compared to several baselines and two recent methods. (5) We further leverage this model to address Identity-Aware Video Description via a two-stage pipeline, and show that it is robust enough to perform well on the generated descriptions.

2 Related Work

Video Description. Automatic video description has attracted a lot of interest in the last few years, especially since the arrival of deep learning techniques [2, 9, 22, 23, 33, 44, 54, 58]. Here are a few trends present in recent works   [20, 30, 47, 62]. Some works formulate video description as a reinforcement learning problem  [19, 28, 49]. Several methods address grounding semantic concepts during description generation  [52, 59, 61]. A few recent works put an emphasis on the use of syntactic information  [14, 45]. New datasets have also been proposed  [12, 50], including a work that focuses on dense video captioning  [17], where the goal is to temporally localize and caption individual events.

In this work, we generate multi-sentence descriptions for long video sequences, as in [32, 37, 55]. This is different from dense video captioning, where one does not need to obtain one coherent multi-sentence description. Recent works that tackle multi-sentence video description include [56], who generate fine-grained sport narratives, [53], who jointly localize events and decide when to generate the following sentence, and [25], who introduce a new inference method that relies on multiple discriminators to measure the quality of multi-sentence descriptions.

Person Re-identification. Person re-identification aims to recognize whether two images of a person show the same individual. This is a long-standing problem in computer vision, with numerous deep learning based approaches introduced over the years [5, 26, 36, 39, 46]. We rely on [36] as our face track representation.

Connections to Prior Work. Next, we detail how our work compares to the most related prior works.

Identity-Aware Video Description: Closely related to ours is the work of [34]. They address video description of individual movie clips with grounded and co-referenced (re-identified) people. In their problem statement re-identification is performed w.r.t. a single previous clip during description generation. Unlike [34], we address multi-sentence video description, which requires consistently re-identifying people over multiple clips at once (on average 5 clips).

Fill-in the Identity: Our task of predicting local character IDs for a set of clips given ground-truth descriptions with blanks is related to the work of [29]. However, they aim to fill in global IDs (proper names). In order to learn the global IDs, they use 80% of each movie for training. Our problem statement is different, as it requires no access to the movie characters’ appearance during training: we maintain disjoint training, validation and test movies. A number of prior works attempt to link all the detected face tracks to global character IDs in TV shows and movies [3, 11, 15, 21, 27, 31, 38, 40], which is different from our problem statement, where character IDs are filled in locally with textual guidance. We compare to two recent approaches to Fill-in the Identity in Sect. 5.2.

3 Connecting Identities to Video Descriptions

An integral part of understanding a story depicted in a video is to establish who the key participants are and what actions they perform over the course of time. Being able to correctly link the repeating appearances of a certain person could potentially help follow the story line of this person. We first address the task of Fill-in the Identity, where we aim to solve a related problem: fill in the persons’ IDs based on the given video descriptions with blanks. Our approach is centered around two key ideas: joint prediction of IDs via a Transformer architecture [42], supported by gender information inferred from textual and visual cues (see Fig. 3). We then present our second task, Identity-Aware Video Description, which aims to generate multi-sentence video descriptions with local person IDs. We present a two-stage pipeline, where our baseline model gives us multi-sentence descriptions with SOMEONE labels, after which we leverage the Fill-in the Identity auxiliary task to link the predicted SOMEONE entities.

Fig. 3.

Overview of our approach to the Fill-in the Identity task. See Sect. 3.1.

3.1 Fill-in the Identity

For a set of video clips \(V_i\) and their descriptions with blanks \(D_i\), \(i = 1, 2, \ldots , N\), we aim to fill in character identities that are locally consistent within the set. We first detect faces that appear in each clip and cluster them based on their visual appearance. Then, for every blank in a sentence, we attend over the face cluster centers using visual and textual context, to find the cluster best associated with the blank. We process all the blanks sequentially and pass their visual and textual representations to a Transformer model [42], which analyzes the entire set at once and predicts the most probable sequence of character IDs (Fig. 3).

Visual Representation. Now, we describe the details of obtaining local face descriptors and other global visual features for a clip \(V_i\). We detect all the faces in every frame of the clip, using the face detector of [60]. Then, we extract 512-dim face feature vectors with the FaceNet model [36]Footnote 1 trained on the VGGFace2 dataset [5]. The feature vectors are clustered using the DBSCAN algorithm [10], which does not require specifying the number of clusters as a parameter. We take the mean of the face features in each cluster, resulting in F face feature vectors \((c^i_1, ..., c^i_F)\).
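
The following is a minimal sketch of this per-clip clustering step, assuming the 512-dim face embeddings and their frame indices have already been extracted. The Euclidean metric and the min_samples setting are our assumptions; only \(\epsilon = 0.2\) is taken from Sect. 5.1.

```python
# Hedged sketch of per-clip face clustering; metric and min_samples are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(face_embeddings, frame_ids, eps=0.2):
    """face_embeddings: (num_faces, 512) array of FaceNet features; frame_ids: frame index per face."""
    labels = DBSCAN(eps=eps, min_samples=1, metric="euclidean").fit_predict(face_embeddings)
    clusters = []
    for lbl in sorted(set(labels)):
        mask = labels == lbl
        clusters.append({
            "feature": face_embeddings[mask].mean(axis=0),  # cluster mean c^i_f
            "frames": np.asarray(frame_ids)[mask],          # kept for temporal alignment below
        })
    return clusters
```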

In addition to the face features, we extract spatio-temporal features that describe the clip semantically. These features help the model locate where to look for the relevant face cluster for a given blank. We extract I3D  [6] features and apply mean pooling across a fixed number of T segments following [48], giving us a sequence \((g^i_1, ..., g^i_T)\). We then associate each face cluster with the best temporally aligned segment as follows. For each face cluster \(c^i_f\), we keep track of its frame indices and get a “center” index. Dividing this index by the total number of frames in a clip gives us a relative temporal position of the cluster \(r^i_f\), \(0 \le r^i_f < 1\). We get the corresponding segment index \(t^i_f = \lfloor r^i_f \cdot T \rfloor \) and obtain the global visual context \(v^i_f = g^i_{t^i_f}\). We concatenate face cluster features \(c^i_f\) with the associated global visual context \(v^i_f\) as our final visual representation.
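
Continuing the sketch above, the alignment of each face cluster with an I3D segment could look as follows; taking the median frame index as the cluster “center” is our assumption, since the paper does not specify how this index is computed.

```python
import numpy as np

def align_clusters_to_segments(clusters, i3d_segments, num_frames):
    """i3d_segments: (T, d) mean-pooled I3D segment features; num_frames: clip length in frames."""
    T = i3d_segments.shape[0]
    aligned = []
    for c in clusters:
        center = int(np.median(c["frames"]))               # "center" frame index of the cluster
        r = center / num_frames                            # relative position, 0 <= r < 1
        t = min(int(r * T), T - 1)                         # segment index t^i_f = floor(r * T)
        aligned.append(np.concatenate([c["feature"], i3d_segments[t]]))  # [c^i_f ; v^i_f]
    return np.stack(aligned)                               # (F, d_face + d)
```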

Filling in the Blanks. Suppose there are B blanks in the set of N sentences \({D_i}\)Footnote 2. One way to fill in these blanks is to train a language model, such as BERT  [8], by masking the blanks to directly predict the character IDs. As we aim to incorporate visual information in our model, we take the following approach.

First, each blank b in a sentence \(D_i\) has to receive a designated textual encoding. We use a pretrained BERT model, which has been shown effective for numerous NLP tasks. Instead of relying on a generic pretrained model, we train it to predict the gender corresponding to each blank, which often can be inferred from text. For example, in Fig. 2, one can infer that the person in the first clip is male due to the phrase “His brow”. We process all sentences in the set jointly. To obtain a representation for each blank, we take the output embedding of the [CLS] token, a special sentence classification token in [8] whose representation captures the meaning of the entire input (here, all sentences in the set), together with the last-layer hidden state associated with the specific blank token. Note that the same [CLS] token is used for all blanks in the set. The final representation \(t_b\) is a concatenation of the [CLS] and the blank token representations.
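
A rough sketch of this step with the HuggingFace transformers library is given below. Marking blanks with the [MASK] token and loading bert-base-uncased are our assumptions; in practice the model would be the one fine-tuned for gender prediction.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")  # assumed stand-in for the gender-tuned BERT

def blank_representations(set_text):
    """set_text: all sentences of a set joined into one string, with blanks marked by [MASK]."""
    enc = tokenizer(set_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]           # (seq_len, 768)
    cls_vec = hidden[0]                                      # shared [CLS] representation
    blank_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return [torch.cat([cls_vec, hidden[p]]) for p in blank_pos]  # one t_b per blank
```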

For each clip \(V_i\), we obtain F face cluster representations \((c^i_1, \ldots , c^i_F)\), which we combine with the corresponding clip-level representations \((v^{i}_{1}, \ldots , v^{i}_{F})\). To find the best matching face cluster for the blank b, we predict the attention weights \(\alpha _{bf}\) over all clusters in the clip based on \(t_b\), and compute a weighted sum over the face clusters, \(\hat{c_b}\):

$$\begin{aligned} e_{bf}&= W_{\alpha _2}\tanh (W_{\alpha _1}[c^i_f;v^i_f;t_b]), \\ \alpha _{bf}&= \frac{\exp (e_{bf})}{\sum _{k=1}^F{\exp (e_{bk})}}, \qquad \hat{c_b} = \sum _{f=1}^F{\alpha _{bf}c^i_f} \end{aligned}$$
(1)

We concatenate the visual representation \(\hat{c_b}\) with \(t_b\) as the final representation for the blank: \(s_b = [\hat{c_b}; t_b]\).
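
A PyTorch sketch of this attention step (Eq. 1) might look as follows; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BlankFaceAttention(nn.Module):
    """Scores face clusters against a blank encoding t_b and pools them into c_hat (Eq. 1)."""
    def __init__(self, d_face, d_vid, d_text, d_hidden=512):
        super().__init__()
        self.w1 = nn.Linear(d_face + d_vid + d_text, d_hidden)  # W_alpha_1
        self.w2 = nn.Linear(d_hidden, 1)                         # W_alpha_2

    def forward(self, c, v, t_b):
        """c: (F, d_face) cluster means, v: (F, d_vid) aligned segments, t_b: (d_text,)."""
        t_rep = t_b.unsqueeze(0).expand(c.size(0), -1)
        e = self.w2(torch.tanh(self.w1(torch.cat([c, v, t_rep], dim=-1)))).squeeze(-1)
        alpha = torch.softmax(e, dim=0)                 # attention weights alpha_bf
        c_hat = (alpha.unsqueeze(-1) * c).sum(dim=0)    # attended face representation
        return torch.cat([c_hat, t_b]), alpha           # s_b = [c_hat ; t_b]
```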

Given a set of B blanks represented by \((s_1, ..., s_B)\), we now aim to link the corresponding identities to each other. Instead of making pairwise decisions w.r.t. matching and non-matching blanks, we want to predict an entire sequence of IDs jointly. We thus choose the Transformer [42] architecture to let the self-attention mechanisms model multiple pairwise relationships at once. Specifically, we train a Transformer with \((s_1, ..., s_B)\) as inputs and local person IDs \((l_1, ..., l_B)\) as outputs. As we fill in the blanks in a sequential manner, we prevent the future blanks from impacting the previous blanks by introducing causal masking in the encoder. We train the entire model end-to-end and learn the attention mechanism in Eq. 1 jointly with the ID prediction. Denoting by \(\theta \) all the parameters in the model, the loss function we minimize is:

$$\begin{aligned} \begin{aligned} L_{character}(\theta ) = - \sum _{b=1}^B {p_\theta (l_b \mid s_1, ..., s_{b-1}, l_1, ..., l_{b-1})} \end{aligned} \end{aligned}$$
(2)
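
A rough sketch of the joint ID prediction is shown below, assuming a Transformer encoder with a manually built causal mask and a standard cross-entropy loss as a stand-in for Eq. 2; the layer configuration follows Sect. 5.1 where stated and is assumed otherwise.

```python
import torch
import torch.nn as nn

class IDTransformer(nn.Module):
    """Maps blank representations (s_1, ..., s_B) to local person ID logits."""
    def __init__(self, d_blank, num_ids, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.proj = nn.Linear(d_blank, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_ids)   # PERSON1, PERSON2, ...

    def forward(self, s):                               # s: (B_blanks, d_blank)
        x = self.proj(s).unsqueeze(1)                   # (B_blanks, batch=1, d_model)
        n = x.size(0)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # hide future blanks
        h = self.encoder(x, mask=causal)
        return self.classifier(h.squeeze(1))            # (B_blanks, num_ids)

# training step (cross-entropy as a stand-in for Eq. 2):
# loss = nn.functional.cross_entropy(model(s), local_ids)
```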

We explore the effect of an additional component, a gender prediction loss \(L_{gender}\), that forces the attended visual representation to be gender-predictive. We add a single layer perceptron that takes the predicted feature \(\hat{c_b}\) and aims to recognize the gender \(g_b\) for the blank b. The final loss function we minimize is as follows:

$$\begin{aligned} \begin{aligned} L_{gender}(\theta ) = - \sum _{b=1}^B {p_\theta (g_b \mid \hat{c_b})} \\ L(\theta ) = L_{character} + \lambda _{gen} L_{gender} \end{aligned} \end{aligned}$$
(3)

where \(\hat{c_b}\) is calculated in Eq. 1 and \(\lambda _{gen}\) is a hyperparameter.
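
For completeness, the combined objective in Eq. 3 could be written as in the following sketch, where the gender logits come from a single-layer perceptron applied to \(\hat{c_b}\) and cross-entropy again stands in for the per-blank terms.

```python
import torch.nn.functional as F

def total_loss(id_logits, local_ids, gender_logits, genders, lambda_gen=0.2):
    """id_logits / gender_logits: per-blank predictions; local_ids / genders: integer targets."""
    l_character = F.cross_entropy(id_logits, local_ids)   # stand-in for Eq. 2
    l_gender = F.cross_entropy(gender_logits, genders)    # visual gender prediction loss
    return l_character + lambda_gen * l_gender            # Eq. 3, lambda_gen = 0.2 (Sect. 5.1)
```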

We also notice that it is possible to boost the performance of our Transformer model by a simple training data augmentation. Note that there are various ways to split the training data into clip sequences of length N: one can consider a non-overlapping segmentation, i.e. \(\{1,...,N\}, \{N+1,...,2N\}, ...\), or additionally add all the overlapping sets \(\{2,...,N+1\}, \{3,...,N+2\}, ...\). Since we predict the local IDs, every such set results in a unique data point, meaning that using all possible sets can increase the amount of training data by a factor of N.
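
A sketch of this sliding-window augmentation, under the assumption that each window is subsequently relabelled with local IDs as described in Sect. 4:

```python
def all_clip_sets(movie_clips, n=5):
    """Every window of n consecutive clips becomes its own training set."""
    if len(movie_clips) <= n:
        return [movie_clips]
    return [movie_clips[i:i + n] for i in range(len(movie_clips) - n + 1)]
```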

3.2 Identity-Aware Video Description

Here, given a set of N clips \(V_i\), we aim to predict their descriptions \(D_i\) that also contain the relevant local person IDs. First, we follow prior works [12, 25] to build a multi-sentence description model with SOMEONE labels. It is an LSTM-based decoder that takes as input a visual representation and the sentence generated for the previous clip. Here, our visual representation for \(V_i\) consists of mean-pooled I3D [6] and ResNet-152 [13] temporal segment features. Once we have obtained a multi-sentence video description with SOMEONE labels, we process the generated sentences with our Fill-in the Identity model. We demonstrate that this approach is effective, although the Fill-in the Identity model is only trained on ground-truth descriptions. Note that this two-stage pipeline could be applied to any video description method.

4 Dataset

As our main test-bed, we choose the Large Scale Movie Description Challenge (LSMDC)  [35], while leveraging and extending the character annotations from  [34]. They have marked every sentence in the MPII Movie Description (MPII-MD) dataset where a person’s proper name (e.g. Jane) is mentioned, and labeled the person-specific pronouns (he, she) with the associated names (e.g. he is John). For each of the 94 MPII-MD movies, we are given a list of all unique identities and their genders. We extend these annotations to 92 additional movies, covering the entire LSMDC dataset (except for the Blind Test set).

Table 1. Statistics for our tasks, based on the LSMDC dataset. See Sect. 4.

We use these annotations as follows. (1) We drop the pronouns and focus on the underlying IDs. (2) We split each movie into sets of 5 consecutive clips (the last set in a movie may contain fewer than 5 clips). (3) We relabel global IDs into local IDs within each set. For example, if we encounter a sequence of IDs Jane, John, Jane, Bill, it becomes PERSON1, PERSON2, PERSON1, PERSON3. This relabeling is applied for both tasks that we introduce in this work.
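
The relabeling itself is straightforward; a minimal sketch matching the example above:

```python
def to_local_ids(global_ids):
    """Map global character names to local PERSONk labels in order of first appearance."""
    mapping = {}
    return [mapping.setdefault(name, f"PERSON{len(mapping) + 1}") for name in global_ids]

assert to_local_ids(["Jane", "John", "Jane", "Bill"]) == ["PERSON1", "PERSON2", "PERSON1", "PERSON3"]
```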

We provide dataset statistics, including number of movies, individual sentences, sets and blanks in Table 1Footnote 3. Around 52% of all blanks correspond to PERSON1, 31% – to PERSON2, 12% – to PERSON3, 4% – to PERSON4, 1% or less – to PERSON5, PERSON6, etc. This shows that the clip sets tend to focus on a single person, but there is still a challenge in distinguishing PERSON1 from PERSON2, PERSON3, ... (up to PERSON11 in training movies).

In our experiments we use the LSMDC Validation set for development, and LSMDC Public Test set for final evaluation.

5 Experiments

5.1 Implementation Details

For each clip, we extract I3D  [6] features pre-trained on the Kinetics  [16] dataset and ResNet-152  [13] features pre-trained on the ImageNet  [7] dataset. We mean pool them temporally to get \(T=5\) segments [48]. We detect on average 79 faces per clip (and up to 200). In the DBSCAN algorithm used to cluster faces, we set \(\epsilon = 0.2\), which is the maximum distance between two samples in a cluster. Clustering the faces within each clip results in about 2.2 clusters per clip, and clustering over a set results in 4.2 clusters per set. BERT  [8] models use the BERT-base model architecture with default settings as in [51]. The Transformer  [42] has a feedforward dimension of 2048 and 6 self-attention layers. We train the Fill-in the Identity model for 40 epochs with a learning rate of 5e-5 and the hyperparameter \(\lambda _{gen} = 0.2\). We train the baseline video description model for 50 epochs with a learning rate of 5e-4. We fix the batch size at 16 across all experiments, where each batch contains a set of clips and descriptions.

5.2 Fill-in the Identity

Evaluation Metrics. First, we discuss the metrics used to evaluate the Fill-in the Identity task. Given a sequence of blanks and corresponding ground-truth IDs, we consider all unique pairwise comparisons between the IDs. A pair is labeled as “Same ID” if the two IDs are the same, and “Different ID” otherwise. We obtain such labeling for the ground-truth and predicted IDs. Then, we can compute the ratio of matching labels between the ground-truth and predicted pairs, e.g. if 6 out of 10 pair labels match, the prediction gets an accuracy of 0.6Footnote 4. The final accuracy is averaged across all sets. When defined like this, we obtain an instance-level accuracy over ID pairs (“Inst-Acc”). Note that it is important to correctly predict both “Same ID” and “Different ID” labels, which can be seen as a 2-class prediction problem. The instance-level accuracy does not distinguish between these two cases. Thus, we introduce a class-level accuracy, where we separately compute accuracy over the two subsets of ID pairs (“Same-Acc”, “Diff-Acc”) and report the harmonic mean between the two (“Class-Acc”).
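
A sketch of these metrics for a single set of blanks is given below; the handling of sets that contain only same-ID or only different-ID pairs is our assumption.

```python
from itertools import combinations

def id_accuracies(gt_ids, pred_ids):
    """Pairwise instance- and class-level ID accuracy for one set of blanks (assumes >= 2 blanks)."""
    pairs = list(combinations(range(len(gt_ids)), 2))
    correct = [(gt_ids[i] == gt_ids[j]) == (pred_ids[i] == pred_ids[j]) for i, j in pairs]
    same = [c for (i, j), c in zip(pairs, correct) if gt_ids[i] == gt_ids[j]]
    diff = [c for (i, j), c in zip(pairs, correct) if gt_ids[i] != gt_ids[j]]
    inst_acc = sum(correct) / len(pairs)                          # Inst-Acc
    same_acc = sum(same) / len(same) if same else 1.0             # edge-case handling is assumed
    diff_acc = sum(diff) / len(diff) if diff else 1.0
    class_acc = (2 * same_acc * diff_acc / (same_acc + diff_acc)  # harmonic mean: Class-Acc
                 if (same_acc + diff_acc) else 0.0)
    return inst_acc, class_acc
```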

Baselines and Ablations. Table 2 summarizes our experiments on the LSMDC Validation set. We include two simple baselines: “The same ID” (all IDs are the same) and “All different IDs” (all IDs are distinct: 1, 2, ...). “GT Gender as ID” directly uses the ground-truth male/female gender as a character ID (Person 1/2), and serves as an upper bound for gender prediction. We consider two vision-only baselines, where we cluster all the detected faces within a set of 5 clips, and pick a random cluster (“Random Face Cluster”) or the most frequent cluster (“Most Frequent Face Cluster”) within a clip for each blank. We also consider a language-only baseline, “BERT Character LM”, which uses a pretrained BERT model to directly fill in all the blanks. Then we include our Transformer model with our pretrained BERT Gender model. We show the effect of our training data augmentation and use augmentation in the following versions. Finally, we study the impact of adding each visual component (“+ Face” and “+ Video”), and introduce our vision-based gender loss (full model).

Table 2. Fill-in the Identity accuracy of several baselines, our full method and its ablations on the LSMDC Validation set. We report the predicted ID accuracy at class and instance level, as well as gender accuracy. See Sect. 5.2 for details.

We make the following observations. (1) The instance accuracy for all-same/all-distinct IDs provides insight into how the pairs are distributed (\(40.7\%\) of all pairs belong to the “Same ID” class, \(59.3\%\) – to the “Different ID” class). Neither is a good solution, getting a Class-Acc of 0. (2) Our Transformer model with the BERT Gender representation improves over the vanilla BERT Character model (57.9 vs. 62.6 in Class-Acc). (3) This is also higher than the 60.1 of “GT Gender as ID”, i.e. our model relies on other language cues besides gender. (4) Training with our data augmentation scheme further improves Class-Acc to 64.4. (5) Introducing face features boosts Class-Acc to 65.3, and adding video features improves it to 65.7. (6) Finally, the visual gender prediction loss leads to an overall Class-Acc of 65.9. Note that the instance-level accuracy (Inst-Acc) does not always reflect the improvements, as it may favor the majority class (“Different ID”).

We also report gender accuracy (Gen Acc) for the variants of our model (last 4 rows in Table 2). For models without the visual gender loss, we report the accuracy based on their BERT language model trained for gender classification. We see that data augmentation on the language side helps improve gender accuracy (80.3 vs 81.8). Incorporating visual representation with the gender loss boosts the accuracy further (81.8 vs 83.0).

Table 3. Fill-in the Identity human performance and our method evaluated on 200 random Test clip sets.

Human Performance. We also assess human performance in this task in two scenarios: with and without seeing the video. The former provides an overall upper-bound accuracy, while the latter gives an upper-bound for the text-only models. We randomly select 200 sets of Test clips and ask 3 different Amazon Mechanical Turk (AMT) workers to assign the IDs to the blanks. For each set we compute a median accuracy across the 3 workers to remove the outliers and report the average accuracy over all sets. Table 3 reports the obtained human performance and the corresponding accuracy of our model on the same 200 sets. Human performance gets a significant boost when the workers can see the video, indicating that video provides many valuable cues, and not everything can be inferred from text. Our full method outperforms “Human w/o video” but falls behind the final “Human” performance.

Comparison to State-of-the-Art. Since the data for our new tasks has been made public, other research groups have reported results on it, including the works by Yu et al.  [57] and Brown et al.  [4]Footnote 5. Yu et al. propose an ensemble over two models: Text-Only and Text-Video. The Text-Only model builds a pairwise matrix for a set of blanks and learns to score each pair by optimizing a binary cross entropy loss. The Text-Video model considers two tasks: linking blanks to tracks and linking tracks to tracks. The two tasks are trained separately, with the help of additional supervision from an external dataset  [29], using triplet margin loss. While Yu et al. use gender loss to pre-train their language model, we introduce gender loss on the visual side.

Table 4. Fill-in the Identity accuracy of our method and two recent works on the LSMDC Test set.

Brown et al. train a Siamese network on positive (same IDs) and negative (different IDs) pairs of blanks. The network relies on an attention mechanism over the face features to identify the relevant person given the blank encoding.

Table 4 reports the results on the LSMDC Test set. As we see, our approach significantly outperforms both methods. To gain further insights, we analyze some differences in behavior between these methods and our approach.

We take a closer look at the distribution of the predicted IDs (PERSON1, PERSON2, ...) by our method and the two other approaches. Figure 4 provides a histogram over the reference data and the compared approaches. We can see that our predictions align well with the true data distribution, while the two other methods struggle to capture it. Notably, both methods favor more diverse IDs (2, 3, ...), failing to link many of the re-occurring character appearances. This may be in part due to the difference in the objective used in our approach vs. the others. While they implement a binary classifier that selects the best matching face track for each blank, we use the Transformer to fill in the blanks jointly, allowing us to better capture both local and global context.

Figure 5 provides a qualitative example, comparing our approach, Yu et al. and Brown et al. As suggested by our analysis, these two methods often predict diverse IDs instead of recognizing the true correspondences.

Fig. 4.

Fill-in the Identity: histogram over the frequencies of predicted IDs for our method, its text-only version and two SOTA works. See Sect. 5.2.

Fig. 5.

Qualitative example for Fill-in the Identity task, comparison between our approach and two recent methods. Correct/incorrect predictions are labeled with green/red, respectively. P1, P2, ... are person IDs. See Sect. 5.2 for details. (Color figure online)

Fig. 6.

Qualitative example for Fill-in the Identity task, comparison between our final approach with visual representation and a text-only ablation (“Transf. + BERT Gender LM + Augm.” in Table 2). We include the predicted character ID (P1, P2, ...) and gender for each blank. See Sect. 5.2 for details.

Ours vs. Ours Text Only. In Fig. 6, we compare our full model vs. its text-only version. After having seen the two characters in the first clip (P1 and P2), our full model recognizes that the man and woman appearing in the subsequent clips are the same two characters, and successfully links them as P1 in the second and P2 in the third and fourth clip, with the correct genders. In the last clip with two characters, the full model is also able to visually ground the same woman as the one “heading out” and assign her the ID P2. On the other hand, without a visual signal the text-only model cannot tell that the first two characters reappear in the subsequent clips, and incorrectly assigns the blanks after the second blank to different character IDs, P3 and P4. The text-only model also fails to link the character in the third and last clips as the same ID, due to the limited information available in textual descriptions alone. This shows that the task is hard for the text-only model, while our final model learns to incorporate visual information successfully.

Table 5. Identity-Aware Video Description scores for our method on the LSMDC Test set. See Sect. 5.3 for details.

5.3 Identity-Aware Video Description

Finally, we evaluate our two-stage approach for the Identity-Aware Video Description task. One issue with directly evaluating predicted descriptions with character IDs is that the evaluation can strongly penalize predictions that do not closely follow the ground-truth. Consider an example with ground-truth [P1] approaches [P2]. [P2] gets up. vs. prediction [P1] is approached by [P2]. [P1] stands up. As we see, a direct comparison would yield a low score for this prediction, due to the different phrasing, which leads to different local person IDs. If we instead consider the ID permutation [P2] approaches [P1]. [P1] gets up., we get a better match with the prediction. Thus, we consider all possible ID permutations as references, evaluate our prediction w.r.t. all of them, and choose the reference that gives the highest BLEU@4 to compute the final scores. The reference with the best ID permutation is then used to evaluate the prediction at the set level using the standard automatic scores (METEOR  [18], BLEU@4  [24], CIDEr-D  [43]).
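
A sketch of selecting the best-matching ID permutation for a reference is shown below. Using NLTK’s sentence-level BLEU as the BLEU@4 scorer is an assumption (a standard captioning evaluation toolkit would normally be used), and enumerating permutations is feasible only because the number of IDs per set is small.

```python
import re
from itertools import permutations
from nltk.translate.bleu_score import sentence_bleu

def best_id_permutation(reference, prediction, num_ids):
    """Return the ID-permuted reference that maximizes BLEU@4 against the prediction."""
    ids = [f"[P{k + 1}]" for k in range(num_ids)]
    best, best_score = reference, -1.0
    for perm in permutations(ids):
        remap = dict(zip(ids, perm))
        cand = re.sub(r"\[P\d+\]", lambda m: remap.get(m.group(0), m.group(0)), reference)
        score = sentence_bleu([cand.split()], prediction.split())   # 4-gram BLEU by default
        if score > best_score:
            best, best_score = cand, score
    return best
```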

Results. In Table 5, we compare results from our captioning model (Sect. 3.2) with different Fill-in the Identity approaches to fill in the IDs on the LSMDC Test set. Our model outperforms the baseline approaches, including Ours Text-Only model. This confirms that our Fill-in the Identity model successfully uses visual signal to perform well, on both ground truth and predicted sentences.

Fig. 7.

Qualitative example for the Identity-Aware Video Description task. P1, P2, ... are person IDs. See Sect. 5.3 for details.

Figure 7 provides an example output of our two-stage approach. We show the ground truth descriptions with person IDs on the left, and our generated descriptions with the predicted IDs on the right. Our Fill-in the Identity model consistently links [P1] as the woman who “sits down”, “looks at [P2]”, and “steps out”, while [P2] as the man who “smiles” and “stares” across the clips.

While we experiment with a fairly straightforward video description model, our two-stage approach can add character IDs to the output of any video description model.

6 Conclusion

In this work we address the limitation of existing literature on automatic video and movie description, which typically ignores the aspect of person identities.

Our main effort in this paper is on the Fill-in the Identity task, namely filling in the local person IDs in the given descriptions of clip sequences. We propose a new approach based on the Transformer architecture, which first learns to attend to the faces that best match the blanks, and next jointly establishes links between them (infers their IDs). Our approach successfully leverages gender information, inferred both from textual and visual cues. Human performance on the Fill-in the Identity task shows the importance of visual information, as humans perform much better when they see the video. While we demonstrate that our approach benefits from visual features (higher ID accuracy and gender accuracy, better ID distribution), future work should focus on better ways of incorporating visual information in this task. Finally, we compare to two state-of-the-art multi-modal methods, showing a significant improvement over them.

We also show that our Fill-in the Identity model enables us to use a two-stage pipeline to tackle the Identity-Aware Video Description task. While this is a simple approach, it is promising to see that our model can handle automatically generated descriptions, and not only ground-truth descriptions. We hope that our new proposed tasks will lead to more research on bringing together person re-identification and video description and will encourage future works to design solutions that go beyond a two-stage pipeline.