Fig. 1.

We learn through self-supervision to represent a video as a set of discrete audio-visual objects. Our model groups a scene into object instances and represents each one with a feature embedding. We use these embeddings for speech-oriented tasks that typically require object detectors: (a) multi-speaker source separation, (b) speaker localization, (c) synchronizing misaligned audio and video, and (d) active speaker detection. Using our representation, these tasks can be solved without any labeled data, and on domains where off-the-shelf detectors are not available, such as cartoons and puppets. Please see our webpage for videos: http://www.robots.ox.ac.uk/~vgg/research/avobjects.

1 Introduction

When humans organize the visual world into objects, hearing provides cues that affect the perceptual grouping process. We group different image regions together not only because they look alike, or move together, but also because grouping them together helps us explain the causes of co-occurring audio signals.

In this paper, our objective is to replicate this organizational capability, by designing a model that can ingest raw video and transform it into a set of discrete audio-visual objects. The network is trained using only self-supervised learning from audio-visual cues. We demonstrate this capability on videos containing talking heads.

This organizational task must overcome a number of challenges if it is to be applicable to raw videos in the wild: (i) there are potentially many visually similar sound generating objects in the scene (multiple heads in our case), and the model must correctly attribute the sound to the actual sound source; (ii) these objects may move over time; and (iii) there can be multiple other objects in the scene (clutter) as well.

To address these challenges, we build upon recent works on self-supervised audio-visual localization. These include video methods that find motions temporally synchronized with audio onsets [13, 40, 44], and single-frame methods [6, 31, 46, 52] that find regions that are likely to co-occur with the audio. However, their output is typically a “heat map” that indicates whether a given pixel is likely (or unlikely) to be attributed to the audio; they do not group a scene into discrete objects; and, if only using semantic correspondence, they cannot distinguish which of several object instances is making a sound.

Our first contribution is to propose a network that addresses all three of these challenges; it is able to use synchronization cues to detect sound sources, group them into distinct instances, and track them over time as they move. Our second contribution is to demonstrate that object embeddings obtained from this network facilitate a number of audio-visual downstream tasks that have previously required hand-engineered supervised pipelines.

As illustrated in Fig. 1, we demonstrate that the embeddings enable: (a) multi-speaker sound source separation [2, 20]; (b) detecting and tracking talking heads; (c) aligning misaligned recordings [12, 15]; and (d) detecting active speakers, i.e. identifying which speaker is talking [13, 50]. In each case, we significantly outperform other self-supervised localization methods, and obtain comparable (and in some cases better) performance to prior methods that are trained using stronger supervision, despite the fact that we learn to perform them entirely from a raw audio-visual signal.

The trained model, which we call the Look Who’s Talking Network (LWTNet), is essentially “plug and play” in that, once trained on unlabeled data (without preprocessing), it can be applied directly to other video material. It can easily be fine-tuned for other audio-visual domains: we demonstrate this functionality on active speaker detection for non-human speakers, such as animated characters in The Simpsons and puppets in Sesame Street. This demonstrates the generality of the model and learning framework, since this is a domain where off-the-shelf supervised methods, such as methods that use face detectors, cannot transfer without additional labeling.

2 Related Work

Sound Source Localization. Our task is closely related to the sound source localization problem, i.e. finding the location in a video that is the source of a sound. Early work performed localization [7, 22, 34, 39] and segmentation [37] by doing inference on simple probabilistic models, such as methods based on canonical correlation analysis.

Recent efforts learn audio and video representations using self-supervised learning [13, 40, 44] with synchronization as the proxy task: the network has to predict whether video and audio are temporally aligned (or synthetically shifted). Owens and Efros [44] show via heat-map visualizations that their network often attends to sound sources, but do not quantitatively evaluate their model. Recent work [38] added an attention mechanism to this model. Other work has detected sound-making objects using correspondence cues [6, 31, 35, 36, 46, 48, 52, 54], e.g. by training a model to predict whether audio and a single video frame come from the same (or different) videos. Since these models do not use motion and are trained only to find the correspondence between object appearance and sound, they would not be able to identify which of several objects of the same category is the actual source of a sound. In contrast, our goal is to obtain discrete audio-visual objects from a scene, even when they belong to the same category (e.g. multiple talking heads). In a related line of work, [25] distill visual object detectors into an audio model using stereo sound, while [27] use spatial information in a scene to convert mono sound to stereo.

Fig. 2.

The Look Who’s Talking Network (LWTNet): (1) Computes an audio-visual attention map \(S_{av}\) by solving a synchronization task, (2) accumulates attention over time, (3) selects audio-visual objects by computing the N highest peaks with non-maximum suppression (NMS) from the accumulated attention map, each corresponding to the trajectory of a pixel over time; (4) for every audio-visual object, it extracts embedding vectors from a spatial window \(\rho \), using the local attention map \(S_{av}\) to select visual features, and (5) provides the audio-visual objects as inputs to downstream tasks.

Active Speaker Detection (ASD). Early work on active speaker detection trained simple classifiers on hand-crafted feature sets [16]. Later, Chung and Zisserman [13] used synchronization cues to solve the active speaker detection problem. They used a hand-engineered face detection and tracking pipeline to select candidate speakers, and ran their model only on cropped faces. In contrast, our model learns to do ASD entirely from unlabeled data. Chung et al. [11] extended the pipeline by enrolling speaker models from visible speaking segments. Recently, Roth et al. [50] proposed an active speaker detection dataset and evaluated a variety of supervised methods for it.

Source Separation. In recent years, researchers have proposed a variety of methods for separating the voices of multiple speakers in a scene [2, 20, 23, 44]. These methods either only handle a single on-screen speaker [44] or use hand-engineered, supervised face detection pipelines. Afouras et al. [2] and Ephrat et al. [20], for example, detect and track faces and extract visual representations using off-the-shelf packages. In contrast, we use our model to separate multiple speakers entirely via self-supervision.

Other recent work has explored separating the sounds of musical instruments and other sound-making objects. Gao et al. [26, 28] use semantic object detectors trained on instrument categories, while [51, 58] do not explicitly group a scene into objects and instead either pool the visual features or produce a per-pixel map that associates each pixel with a separated audio source. Recently, [57] added motion information from optical flow. We, too, use flow in our model, but instead of using it as a cue for motion, we use it to integrate information from moving objects over time [24, 47] in order to track them. In concurrent work, [36] propose a model that groups and separates sound sources.

Representation Learning. In recent years, researchers have proposed a variety of self-supervised learning methods for learning representations from images [10, 18, 32, 33, 41, 43, 55, 56], videos [29, 30] and multimodal data [5, 40, 42, 45, 46]. Often the representation learned by these methods is a feature set (e.g., CNN weights) that can be adapted to downstream tasks by fine-tuning. By contrast, we learn an additional attention mechanism that can be used to group discrete objects of interest for downstream speech tasks.

3 From Unlabeled Video to Audio-Visual Objects

Given a video, the function of our model is to detect and track (possibly several) audio-visual objects, and extract embeddings for each of them. We represent an audio-visual object as the trajectory of a potential sound source through space and time, which in the domain that we experiment on is often the track of a “talking head”. Having obtained these trajectories, we use them to extract embeddings that can be then used for downstream tasks.

In more detail, our model uses a bottom-up grouping procedure to propose discrete audio-visual objects from raw video. It first estimates local (per-pixel and per-frame) synchronization evidence, using a network design that is more fine-grained in space and time than prior models. It then aggregates this evidence over time via optical flow, making the model robust to camera and object motion, and groups the aggregated attention into sound sources by detecting local maxima. The model represents each object as a separate embedding, temporal track, and attention map that can be adjusted in downstream tasks.

We will now give an overview of the model, which is shown in Fig. 2, followed by the learning framework which uses self-supervision based on synchronization. For architecture details, please refer to the arXiv version.

3.1 Estimating Audio-Visual Attention

Before we group a scene into sound sources, we estimate a per-pixel attention map that picks out the regions of a video whose motions have a high degree of synchronization with the audio. We propose an attention mechanism that provides highly localized spatio-temporal attention, and which is sensitive to speaker motion. As in [6, 31], we estimate audio-visual attention via a multimodal embedding (Fig. 2, step 1). We learn an embedding vector for each audio clip and one for each pixel, such that a pixel whose vector has a high dot product with that of the audio is likely to belong to that sound source. For this, we use a two-stream architecture similar to those in other sound-source localization work [6, 31, 52], with a network backbone similar to [11].

Video Encoder. Our video feature encoder is a spatio-temporal VGG-M [9] with a 3D convolutional layer first, followed by a stack of 2D convolutions. Given a \(T \times H \times W \times 3\) input RGB video, it extracts a video embedding map \(f_v(x,y,t)\) with dimensions \(T \times h \times w \times D\).

Audio Encoder. The audio encoder is a VGG-M network operating on log-mel spectrograms, treated as single-channel images. Given an audio segment, it extracts a D-dimensional embedding \(f_a(t)\) for every corresponding video frame t.
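As a concrete illustration, the sketch below (PyTorch-style; the layer sizes and the embedding dimension D are placeholders and do not reproduce the exact VGG-M configuration) shows only the shape contract of the two streams: the video encoder maps a \(T \times H \times W \times 3\) clip to a \(T \times h \times w \times D\) feature map, while the audio encoder maps per-frame log-mel spectrogram slices to one D-dimensional vector per video frame.

```python
# Illustrative sketch of the two-stream encoder (shapes only; layer sizes and D
# are placeholders, not the paper's exact VGG-M configuration).
import torch
import torch.nn as nn

D = 256  # assumed embedding dimension

class VideoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D conv first to capture short-range motion, then per-frame 2D convs.
        self.conv3d = nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3))
        self.conv2d = nn.Sequential(
            nn.ReLU(), nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.ReLU(), nn.Conv2d(128, D, 3, stride=2, padding=1))

    def forward(self, video):                    # video: (B, 3, T, H, W)
        x = self.conv3d(video)                   # (B, 64, T, H', W')
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)
        x = self.conv2d(x)                       # (B*T, D, h, w)
        return x.view(B, T, D, *x.shape[-2:])    # f_v: (B, T, D, h, w)

class AudioEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                # input: log-mel spectrogram "images"
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, D))

    def forward(self, spec):                     # spec: (B*T, 1, n_mels, n_steps)
        return self.net(spec)                    # f_a: (B*T, D), one vector per frame
```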

Computing Fine-Grained Attention Maps. For each space-time pixel, we ask: how correlated is it with the events in the audio? To estimate this, we measure the similarity between the audio and visual features at every spatial location. For every space-time feature vector \(f_v(x, y, t)\), we compute the cosine similarity with the audio feature vector \(f_a(t)\):

$$\begin{aligned} {S}_{av}(x,y,t) = f_v(x,y,t) {\cdot } f_a(t), \end{aligned}$$
(1)

where we first \(l_2\) normalize both features. We refer to the result, \({S}_{av}(x,y,t)\), as the audio-visual attention map.
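For clarity, a minimal sketch of Eq. (1), assuming the feature tensors are laid out as above:

```python
# Sketch of Eq. (1): cosine similarity between l2-normalised visual features
# f_v (B, T, D, h, w) and per-frame audio features f_a (B, T, D).
import torch
import torch.nn.functional as F

def attention_map(f_v, f_a):
    f_v = F.normalize(f_v, dim=2)                      # unit norm along D
    f_a = F.normalize(f_a, dim=2)
    return torch.einsum('btdhw,btd->bthw', f_v, f_a)   # S_av: (B, T, h, w)
```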

Fig. 3.

Intermediate representations from our model. We show the per-frame attention maps \({S}_{av}(t)\), the aggregated attention map \({S}_{av}^{tr}\) and the two highest scoring extracted audio-visual objects. We show the audio-visual objects for a single frame, with a square of constant width.

3.2 Extracting Audio-Visual Objects

Given the audio-visual evidence, we parse a video into object representations.

Integrating Evidence Over Time. Audio-visual objects may only intermittently make sounds. Therefore, we need to integrate sparse attention evidence over time. We also need to group and track sound sources between frames, while accounting for camera and object motion. To make our model more robust to these motions, we aggregate information over time using optical flow (Fig. 2, step 2). We extract dense optical flow for every frame, chain the flow values together to obtain long-range tracks, and average the attention scores over these tracks. Specifically, if \(\mathcal {T}(x, y, t)\) is the tracked location of pixel (x, y) from frame 1 to the later frame t, we compute the score:

$$\begin{aligned} S_{av}^{tr} (x,y) = \frac{1}{T} \sum _{t = 1}^{T} S_{av}(\mathcal {T}( x, y, t), t), \end{aligned}$$
(2)

where we perform the sampling using bilinear interpolation. The result is a 2D map containing a score for the future trajectory of every pixel of the initial frame through time. Note that any tracking method can be used in place of optical flow (e.g. with explicit occlusion handling); we use optical flow for simplicity.
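A possible implementation of Eq. (2), assuming the chained flow tracks have already been computed and converted to the normalised coordinates expected by grid_sample:

```python
# Sketch of Eq. (2): average per-frame attention along optical-flow tracks.
import torch
import torch.nn.functional as F

def aggregate_attention(s_av, tracks):
    # s_av:   (T, 1, h, w)  per-frame attention maps
    # tracks: (T, h, w, 2)  location of each frame-1 pixel in frame t,
    #                       as (x, y) normalised to [-1, 1]
    sampled = F.grid_sample(s_av, tracks, mode='bilinear', align_corners=False)
    return sampled.mean(dim=0)[0]      # S_av^tr: (h, w), one score per frame-1 pixel
```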

Grouping a Scene into Instances. To obtain discrete audio-visual objects, we detect spatial local maxima (peaks) on the temporally aggregated synchronization maps, and apply non-maximum suppression (NMS). More specifically, we find peaks in the time-averaged synchronization map, \(S_{av}^{tr}(x, y)\), and sort them in decreasing order; we then choose the peaks greedily, each time suppressing the ones that are within a \(\rho \times \rho \) box. The selected peaks can now be viewed as distinct audio-visual objects. Examples of the intermediate representations extracted at the steps described so far are shown in Fig. 3.
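The greedy selection can be written in a few lines; the sketch below assumes the aggregated map and \(\rho \) are both expressed at attention-map resolution:

```python
# Sketch of step (3): greedily pick the N strongest peaks of S_av^tr and
# suppress everything within a rho x rho box around each selected peak.
import torch

def select_audio_visual_objects(s_tr, n_objects, rho):
    s = s_tr.clone()                               # s_tr: (h, w)
    peaks = []
    for _ in range(n_objects):
        idx = torch.argmax(s).item()
        y, x = divmod(idx, s.shape[1])
        peaks.append((y, x, s_tr[y, x].item()))
        y0, y1 = max(0, y - rho // 2), min(s.shape[0], y + rho // 2 + 1)
        x0, x1 = max(0, x - rho // 2), min(s.shape[1], x + rho // 2 + 1)
        s[y0:y1, x0:x1] = float('-inf')            # non-maximum suppression
    return peaks                                   # list of (y, x, score)
```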

Extracting Object Embeddings. Now that the sound sources have been grouped into distinct audio-visual objects, we can extract feature embeddings for each one of them for use in downstream tasks. Before extracting these features, we locate the position of the sound source in each frame. A simple strategy would be to follow the object’s optical flow track throughout the video. However, these tracks are imprecise and may drift away from the true location of the sound source. Therefore, we “snap” the track location to the nearest peak in the attention map. More specifically, in frame t, we search in an area of \(\rho \times \rho \) centered on the tracked location \(\mathcal {T}(x, y, t)\), and select the pixel location with the largest attention value. Then, having tracked the sound source in each frame, we select the corresponding spatial feature vector from the visual feature map \(f_v\) (Fig. 2, step 4). These per-frame embedding features, \(f_{v}^{att}(t)\), can then be used to solve downstream tasks (Sect. 4). One can equivalently view this procedure as an audio-visual attention mechanism that operates on \(f_v\).
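A sketch of the snapping and feature-selection step, under the same assumptions as above (track coordinates and \(\rho \) at attention-map resolution):

```python
# Sketch of step (4): snap the flow-tracked location to the strongest attention
# peak in a rho x rho window, then read the visual feature there as f_v^att(t).
import torch

def extract_object_embedding(f_v, s_av, track, rho):
    # f_v: (T, D, h, w), s_av: (T, h, w), track: list of T (y, x) positions
    feats = []
    for t, (y, x) in enumerate(track):
        y0, y1 = max(0, y - rho // 2), min(s_av.shape[1], y + rho // 2 + 1)
        x0, x1 = max(0, x - rho // 2), min(s_av.shape[2], x + rho // 2 + 1)
        window = s_av[t, y0:y1, x0:x1]
        dy, dx = divmod(torch.argmax(window).item(), window.shape[1])
        feats.append(f_v[t, :, y0 + dy, x0 + dx])
    return torch.stack(feats)                      # f_v^att: (T, D)
```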

3.3 Learning the Attention Map

Training our model amounts to learning the attention map \(S_{av}\) from which the audio-visual objects are subsequently extracted. We obtain this map by solving a self-supervised audio-visual synchronization task [13, 40, 44]: we encourage the embedding at each pixel to be correlated with the true audio and uncorrelated with shifted versions of it. We estimate the synchronization evidence for each frame by aggregating the per-pixel synchronization scores. Following common practice in multiple instance learning [6], we measure the per-frame evidence by the maximum spatial response:

$$\begin{aligned} S_{av}^{att}(t) = \max _{x,y} S_{av}(x,y,t). \end{aligned}$$
(3)

We maximize the similarity between a video frame and its true audio track while minimizing its similarity to N shifted (i.e. misaligned) versions of the audio. Given visual features \(f_v\) and true audio \(a_i\), we sample N other audio segments from the same video clip: \(a_1, a_2, ..., a_N\), and minimize the contrastive loss [15, 43]:

$$\begin{aligned} \mathcal {L} = -\log \frac{\exp (S_{av}^{att}(v,a_i))}{\exp (S_{av}^{att}(v,a_i)) + \sum _{j=1}^N \exp (S_{av}^{att}(v,a_j))}. \end{aligned}$$
(4)

For the negative examples, we select all audio features (except for the true example) in a temporal window centered on the video frame.
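Eqs. (3) and (4) together amount to a (1 + N)-way classification over the candidate audio segments; a minimal sketch, assuming the attention maps for the true and shifted audio have already been computed:

```python
# Sketch of Eqs. (3)-(4): spatial max-pooling of the attention map followed by
# a contrastive loss against N misaligned audio segments.
import torch
import torch.nn.functional as F

def sync_contrastive_loss(s_true, s_neg):
    # s_true: (T, h, w)     attention with the aligned audio
    # s_neg:  (N, T, h, w)  attention with N shifted audio segments
    pos = s_true.flatten(1).max(dim=1).values                 # (T,)   Eq. (3)
    neg = s_neg.flatten(2).max(dim=2).values                  # (N, T)
    logits = torch.cat([pos.unsqueeze(0), neg], dim=0).t()    # (T, 1 + N)
    target = torch.zeros(logits.shape[0], dtype=torch.long)   # true audio = class 0
    return F.cross_entropy(logits, target)                    # Eq. (4), mean over frames
```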

In addition to the synchronization task, we also consider the correspondence task of Arandjelović and Zisserman [6], which chooses negative audio samples from random video clips. Since this problem can be solved with even a single frame, it results in a model that is less sensitive to motion.

4 Applications of Audio-Visual Object Embeddings

We use our learned audio-visual objects for a variety of applications.

4.1 Audio-Visual Object Detection and Tracking

We can use our model for spatially localizing speakers. To do this, we use the tracked location of an audio-visual object in each frame.

4.2 Active Speaker Detection

For every frame in our video, our model can locate potential speakers and decide whether or not they are speaking. In our setting, this can be viewed as deciding whether an audio-visual object has strong evidence of synchronization in a given frame. For every tracked audio-visual object, we extract the visual features \(f_v^{att}(t)\) (Sect. 3.2) for each frame t. We then obtain a score that indicates how strong the audio-visual correlation for frame t is, by computing the dot product: \(f_v^{att}(t){\cdot }f_a(t)\). Following previous work [13], we threshold the result to make a binary decision (active speaker or not).
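In code, the ASD decision reduces to a per-frame dot product followed by a threshold; a minimal sketch (how the threshold is chosen is described in Sect. 5.3):

```python
# Sketch of Sect. 4.2: score each frame of an audio-visual object by the dot
# product of its visual embedding with the audio embedding, then threshold.
import torch

def active_speaker_detection(f_v_att, f_a, threshold):
    # f_v_att, f_a: (T, D)
    scores = (f_v_att * f_a).sum(dim=1)     # (T,) audio-visual correlation per frame
    return scores, scores > threshold       # raw scores and speaking / not-speaking
```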

Fig. 4.

Multi-speaker separation. We isolate the sound of each speaker’s voice by combining our audio-visual objects with a network similar to [2]. Given a spectrogram of a noisy sound mixture, the network isolates the voice of each speaker, using the visual features provided by their audio-visual object.

4.3 Multi-speaker Source Separation

Our audio-visual objects can also be used for separating the voices of speakers in a video. We consider the multi-speaker separation problem [2, 20]: given a video with multiple people speaking on-screen (e.g., a television debate show), we isolate the sound of each speaker’s voice from the audio stream. We note that this problem is distinct from on/off-screen audio separation [44], which requires only a single speaker to be on-screen.

We train an additional network that, given a waveform containing an audio mixture and an audio-visual object, isolates the speaker’s voice (Fig. 4, full details in the arXiv version of the paper). We use an architecture that is similar to [2], but conditions on our self-supervised representations instead of detections from a face detector. More specifically, the method of [2] runs a face detection and tracking system on a video, computes CNN features on each crop, and then feeds those to a source separation network. We, instead, simply provide the same separation network with the embedding features \(f_v^{att}(t)\).
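A minimal sketch of this conditioning interface (the layer sizes are hypothetical and the architecture of [2] is considerably more elaborate; the sketch also assumes the object embeddings have been resampled to the spectrogram frame rate):

```python
# Sketch of conditioning a mask-based separator on one audio-visual object.
import torch
import torch.nn as nn

class ConditionalSeparator(nn.Module):
    def __init__(self, n_freq=257, d_emb=256, d_hidden=512):
        super().__init__()
        self.rnn = nn.GRU(n_freq + d_emb, d_hidden,
                          batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * d_hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, f_v_att):
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the mixture
        # f_v_att:  (B, T, d_emb)  object embeddings, time-aligned to mix_spec
        x, _ = self.rnn(torch.cat([mix_spec, f_v_att], dim=2))
        return mix_spec * self.mask(x)       # masked spectrogram for this speaker
```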

4.4 Correcting Audio-Visual Misalignment

We can also use our model to correct misaligned audio-visual data—a problem that often occurs in the recording and television broadcast process. We follow the problem formulation proposed by Chung and Zisserman [13]. While this is a problem that is typically solved using supervised face detection [13, 15], we instead tackle it with our learned model. During inference, we are given a video with unsynchronized audio and video tracks, and we shift the audio to discover the offset \(\hat{\varDelta t}\) that maximizes the audio-visual evidence:

$$\begin{aligned} \hat{\varDelta t} = \mathop {\mathrm {arg\,max}}\limits _{\varDelta t} \frac{1}{T} \sum _{t=1}^T S_{{\varDelta t}}^{att}(t), \end{aligned}$$
(5)

where \(S_{{\varDelta t}}^{att}(t)\) is the synchronization score of frame t after shifting the audio by \({\varDelta t}\). This can be estimated efficiently by recomputing the dot products in Eq. 1.

In addition to treating this alignment procedure as a stand-alone application, we also use it as a preprocessing step for our other applications (a common practice in other speech analysis work [2]). When given a test video, we first compute the optimal offset \(\hat{\varDelta t}\), and use it to shift the audio accordingly. We then recompute \(S_{av}(t)\) from the synchronized embeddings.
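A sketch of the offset search of Eq. (5), assuming one audio feature per video frame and using a circular shift for simplicity:

```python
# Sketch of Eq. (5): sweep offsets, shift the audio features, and keep the
# offset with the highest mean per-frame synchronisation score.
import torch

def best_offset(f_v, f_a, max_shift=15):
    # f_v: (T, D, h, w) visual features; f_a: (T, D) audio features (l2-normalised)
    best_score, best_dt = float('-inf'), 0
    for dt in range(-max_shift, max_shift + 1):
        f_a_shift = torch.roll(f_a, shifts=dt, dims=0)         # wraps at the ends
        s = torch.einsum('tdhw,td->thw', f_v, f_a_shift)       # recompute Eq. (1)
        score = s.flatten(1).max(dim=1).values.mean().item()   # mean S^att over frames
        if score > best_score:
            best_score, best_dt = score, dt
    return best_dt
```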

5 Experiments

5.1 Datasets

Human Speech. We evaluate our model on the Lip Reading Sentences (LRS2 and LRS3) datasets and the Columbia active speaker dataset. LRS2 [1] and LRS3 [3] are audio-visual speech datasets containing 224 and 475 h of videos respectively, along with ground truth face tracks of the speakers. The Columbia dataset [8] contains footage from an 86-minute panel discussion, where multiple individuals take turns in speaking, and contains approximate bounding boxes and active speaker labels, i.e. whether a visible face is speaking at a given point in time. All datasets provide (pseudo-)ground truth bounding boxes obtained via face detection, which we use for evaluation. We resample all videos to a resolution of \(H \times W = 270\times 480\) pixels before feeding them to our model, which outputs \(h\times w = 18\times 31\) attention maps. We train all models on LRS2, and use LRS3 and Columbia only for evaluation.

Non-human Speakers. To evaluate our method on non-human speakers, we collected television footage from The Simpsons and Sesame Street shows (Table 3a). For testing, we obtained ASD and speaker localization labels using the VIA tool [19]: we asked human annotators to label frames that they believed to contain an active speaker and to localize them. For every dataset, we create a single-head and a multi-head set, where clips are constrained to contain a single active speaker or multiple heads (talking or not), respectively. We provide dataset statistics in Table 3a and more details in the arXiv version of the paper.

5.2 Training Details

Fig. 5.

Talking head detection and tracking on the LRS3 dataset. For each of the 4 examples, we show the audio-visual attention score at every spatial location for the depicted frame, and a bounding box centered on the largest value, indicating the speaker location. Please see our webpage for video results.

Fig. 6.

Handling motion: Talking head detection and tracking on continuous scenes from the validation set of LRS2. Despite the significant movement of the speakers and the camera, our method accurately tracks them.

Audio-Visual Object Detection Training. To make training easier, we follow [40] and use a simple learning curriculum. At the beginning of training, we sample negatives from random video clips, then switch to shifted audio tracks later in training. To speed up training, we also begin by taking the mean dot product (Eq. 3), and then switch to the maximum. We set \(\rho \) to 100 pixels.

Source Separation Training. Training takes place in two steps: we first train our model to produce audio-visual objects by solving a synchronization problem. Then, we train the multi-speaker separation network on top of these learned representations. We follow previous work [2, 20] and use a mix-and-separate learning procedure. We create synthetic videos containing multiple talking speakers by 1) selecting two or three videos at random from the training set, depending on the experiment, 2) summing their waveforms together, and 3) vertically concatenating the video frames together. The model is then tasked with extracting a number of talking heads equal to the number of mixed videos and predicting the corresponding original waveform for each.
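As an illustration of the mix-and-separate construction (the tensor shapes are assumptions of this sketch):

```python
# Sketch of mix-and-separate: sum the waveforms of k random clips and stack
# their frames vertically; the k original waveforms are the training targets.
import torch

def make_mixture(videos, waveforms):
    # videos: list of k tensors (T, 3, H, W); waveforms: list of k tensors (L,)
    mixed_audio = torch.stack(waveforms).sum(dim=0)   # (L,) audio mixture
    stacked_video = torch.cat(videos, dim=2)          # (T, 3, k*H, W) vertical stack
    targets = torch.stack(waveforms)                  # (k, L) per-speaker targets
    return stacked_video, mixed_audio, targets
```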

Non-human Model Training. We fine-tune the best model from LRS2 separately on each of the two datasets with non-human speakers. The lip motion for non-human speakers, such as the motion of a puppet’s mouth, is only loosely correlated with speech, suggesting that there is less of an advantage to obtaining our negative examples from temporally shifted audio. We therefore sample our negative audio examples from other video clips rather than from misaligned audio (Sect. 3.3) when computing attention maps (Fig. 7).

Fig. 7.

Active speaker detection on the Columbia dataset, and an example from the Friends TV show. Active and inactive speakers are indicated by differently colored boxes. The corresponding detection scores are noted above the boxes (the threshold has been subtracted, so that positive scores indicate active speakers). (Color figure online)

5.3 Results

1. Talking Head Detection and Tracking. We evaluate how well our model is able to localize speakers, i.e. talking heads (Table 1a). First, we evaluate two simple baselines: the random one, which selects a random pixel in each frame, and the center one, which always selects the center pixel. Next, we compare with two recent sound source localization methods: Owens and Efros [44] and AVE-Net [6]. Since these methods require input videos that are longer than most of the videos in the test set of LRS2, we only evaluate them on LRS3. We also perform several ablations of our model. To evaluate the benefit of integrating the audio-visual evidence over flow trajectories, we create a variation of our model, called No flow, that instead computes the attention \(S_{av}^{tr}\) by globally pooling over time throughout the video. Finally, we also consider a variation of this model that uses a larger NMS window (\(\rho =150\)).

Fig. 8.

Active speaker detection for non-human speakers. We show the two highest-scoring audio-visual objects in each scene, along with the aggregated attention map. Please see our webpage for video results.

We found that our method obtains very high accuracy, and that it significantly outperforms all other methods. AVE-Net solves a correspondence task that does not require motion information, and uses a single video frame as input. Consequently, it does not take advantage of informative motion, such as moving lips. As can be seen in Fig. 5, the localization maps produced by AVE-Net [6] are less precise, as it only loosely associates the appearance of a person with speech, and does not consistently focus on the same region. The method of Owens and Efros [44], by contrast, has a large temporal receptive field, which results in temporally imprecise predictions, causing very large errors when the subjects are moving. The No flow baseline fails to track a talking head once it moves outside the NMS area, and its accuracy is consequently lower on LRS3. Enlarging the NMS window partially alleviates this issue, but the accuracy is still lower than that of our model. We note that the LRS2 test set contains very short clips (usually 1–2 seconds long) with predominantly static speakers, which explains why using flow does not provide an advantage there. We show some challenging examples with significant speaker and camera motion in Fig. 6. Please refer to the arXiv version of the paper for further analysis of camera and speaker motion.

Table 1. (a): Talking head detection and tracking accuracy. A detection is considered correct if it lies within the true bounding box. (b): Active speaker detection accuracy on the Columbia dataset [8]. F1 Scores (%) for each speaker, and the overall average.

2. Active Speaker Detection. Next, we ask how well our model can determine which speaker is talking. Following previous work that uses supervised face detection [14, 53], we evaluate our method on the Columbia dataset [8]. For each video clip, we extract 5 audio-visual objects (an upper bound on the number of speakers), each of which has an ASD score indicating the likelihood that it is a sound source (Sect. 4.2). We then associate each ground truth bounding box with the audio-visual object whose trajectory most closely follows it. For comparison with existing work, we report the F1 measure (the standard for this dataset) per individual speaker as well as averaged over all speakers. To calculate the F1, we set the ASD threshold to the one that yields the Equal Error Rate (EER) for the pretext task on the LRS2 validation set. As shown in Table 1b, our model outperforms all previously reported results on this dataset, even though (unlike other methods) it does not use labeled face bounding boxes for training.

3. Multi-speaker Source Separation. To evaluate our model on speaker separation, we follow the protocol of [2]. We create synthetic examples from the test set of LRS2, using only videos that are between 2 and 5 seconds long, and evaluate performance using Signal-to-Distortion-Ratio (SDR) [21] and Perceptual Evaluation of Speech Quality (PESQ, varies between 0 and 4.5) [49] (higher is better for both). We also assess the intelligibility of the output by computing the Word Error Rate (WER, lower is better) of transcriptions of the separated signals, obtained with the Google Cloud speech recognition system. Following [3], we train and evaluate separate models for 2 and 3 speakers, though we note that if the number of speakers were unknown, it could be estimated using active speaker detection.

For comparison, we implement the model of Afouras et al. [2], and train it on the same data. For extracting visual features to serve as its input, we use a state-of-the-art audio-visual synchronization model [15], rather than the lip-reading features from Afouras et al. [4]. We refer to this model as Conversation-Sync. This model uses bounding boxes from a well-engineered face detection system, and thus represents an approximate upper limit on the performance of our self-supervised model. Our main model for this experiment is trained end-to-end and uses \(\rho =150\). We also perform a number of ablations: a model that freezes the pretrained audio-visual features, and a model with a smaller window (\(\rho =100\)).

We observed (Table 2a) that our self-supervised model obtains results close to those of [2], which is based on supervised face detection. We also asked how much error is introduced by not using a face detector: to this end, we extract the local visual descriptors using tracks obtained with face detectors instead of our audio-visual object tracks. This model, Oracle-BB, obtains results similar to ours, suggesting that the quality of our face localization is high.

Table 2. (a): Source separation on LRS2. #Spk indicates the number of speakers. The WER on the ground truth signal is 20.0%. (b): Audio-visual synchronization accuracy (%) evaluation for a given number of input frames.

4. Correcting Misaligned Visual and Audio Data. We use the same metric as [15] to evaluate on LRS2. The task is to determine the correct audio-to-visual offset within a ±15 frame window. An offset is considered correct if it is within 1 video frame from the ground truth. The distances are averaged over 5 to 15 frames. We compare our method to two state-of-the-art synchronization methods: SyncNet [13] and Perfect Match [15]. We note that [15] represents an approximate upper limit to what we would expect our method to achieve, since we are using a similar network and training objective; the major difference is that we use our audio-visual objects instead of image crops from a face detector. The results (Table 2b) show that our self-supervised model obtains comparable accuracy to these supervised methods.

5. Generalization to Non-human Speakers. We evaluate the LWTNet model’s generalization to non-human speakers using the Simpsons and Sesame Street datasets described in Sect. 5.1. The results of our evaluation are summarized in Table 3b. Since supervised speech analysis methods are often based on face detection systems, we compare our method’s performance to off-the-shelf face detectors, using the single-head subset. As a face detector baseline, we use the state-of-the-art RetinaFace [17] detector, with both the MobileNet and ResNet-50 backbones. We report localization accuracy (as in Table 1a) and Average Precision (AP). Our model outperforms the face detectors in both localization and retrieval performance on both datasets.

The second evaluation setting is detecting active speakers in videos from the multi-head test set. As expected, our model’s performance decreases in this more challenging scenario; however, the AP for both datasets indicates that our method can be useful for retrieving the speaker in this entirely new domain. We show qualitative examples of ASD on the multi-head test sets in Fig. 8.

Table 3. (a): Label statistics for non-human test sets. S is single head and M multi-head. (b): Non-human speaker evaluation for ASD and localization tasks on Simpsons and Sesame Street. MN: MobileNet; RN: ResNet50.

6 Conclusion

In this paper, we have proposed a unified model that learns from raw video to detect and track speakers. The embeddings learned by the model are effective for many downstream speech analysis tasks, such as source separation and active speaker detection, that in previous work required supervised face detection.