An illustration of the weakly-supervised audio-visual video parsing (AVVP) task and our pseudo label exploration method. a Given a video and its event label ("speech" and "vacuum cleaner"), b the AVVP task requires predicting and localizing the audio events, visual events, and audio-visual events. Note that "vacuum cleaner" exists only in the visual track, while "speech" exists in both the audio and visual tracks, resulting in the audio-visual event "speech". c To ease this challenging weakly-supervised task, we explicitly assign reliable segment-level audio and visual pseudo labels. In our pseudo label generation process, the pretrained CLAP and CLIP models are used to tell which events occur in each audio and visual segment, respectively. d We further propose a pseudo label denoising strategy to improve the obtained visual pseudo labels by examining segments with abnormally large forward loss values. In the example, the visual event vacuum cleaner in the third segment is assigned an incorrect pseudo label '0' and receives a large forward loss. Our pseudo label denoising strategy amends this, producing the accurate pseudo label '1'

1 Introduction

Acoustic and visual signals flood our lives in abundance, and each signal may carry various events. For example, we often see cars driving and pedestrians walking on the street; meanwhile, we hear the beeping of car horns and the sound of people talking. Humans achieve such a comprehensive understanding of audio-visual events in large part thanks to the simultaneous use of their auditory and visual sensors. To imitate this kind of intelligence in machines, many research works started from fundamental single-modality understanding tasks, such as audio classification (Hershey et al., 2017; Kong et al., 2018; Kumar et al., 2018; Gong et al., 2021), video classification (Karpathy et al., 2014; Long et al., 2018a, b; Tran et al., 2019), and temporal action localization (Zeng et al., 2019; Chao et al., 2018; Zhu et al., 2021; Gao et al., 2022). The audio classification task focuses on recognizing the audio modality, while the video classification and temporal action localization tasks focus on the visual modality. As research deepened, many works further explored multi-modal audio-visual perception (Wei et al., 2022), giving birth to tasks such as sound source localization (Arandjelovic & Zisserman, 2017; Rouditchenko et al., 2019; Arandjelovic & Zisserman, 2018; Senocak et al., 2018; Hu et al., 2020, 2019; Qian et al., 2020; Zhao et al., 2018; Afouras et al., 2020; Zhou et al., 2022b, 2023b; Sun et al., 2023), audio-visual event localization (Tian et al., 2018; Wu et al., 2019; Xu et al., 2020; Zhou et al., 2021; Mahmud & Marculescu, 2022; Rao et al., 2022b; Xia & Zhao, 2022; Wu et al., 2022; Zhou et al., 2023a; Wang et al., 2023), audio-visual video description (Shen et al., 2023), and question answering (Yun et al., 2021; Li et al., 2022; Yang et al., 2022; Song et al., 2022; Li et al., 2023).

Recently, Tian et al. (Tian et al., 2020) proposed a new multi-modal scene understanding task, namely Audio-Visual Video Parsing (AVVP). Given an audible video, the AVVP task asks to identify what events occur in the audio and visual tracks and in which video segments these events occur. Accordingly, the category and temporal boundary of each event are expected to be predicted for each modality. Since both the audio and visual tracks may contain multiple distinct events, and these events usually span different consecutive segments, it is labor-intensive to provide segment-level event labels for each modality as strong supervision. In practice, AVVP is performed in a weakly-supervised setting where only the video label is provided during model training. In the example shown in Fig. 1a, we only know that this video contains the event set of speech and vacuum cleaner. For each event, the model needs to judge whether it exists in the audio modality (audio event), visual modality (visual event), or both (audio-visual event), and locate the specific temporal segments. Notably, as illustrated in Fig. 1b, in the AVVP task the audio-visual event is the intersection of the audio event and visual event, whereas the video label is the union of the audio event and visual event.

In this work, we emphasize two main challenges in the AVVP task. (1) Cross-modal interference from the video label. As shown in Fig. 1b, given the weakly-supervised video label, the audio and visual tracks share the same supervision, i.e., {speech, vacuum cleaner}. However, the audio and visual tracks contain distinct events: the vacuum cleaner exists only in the visual modality. Thus, during model training, the label vacuum cleaner will interfere with the audio event parsing. Similarly, the visual event parsing may be interfered with by audio event labels in other samples. (2) Temporal segment distinction. Even if we successfully identify that a vacuum cleaner event exists in the visual modality, it remains hard to distinguish which segments contain this event (segment level) under the weakly-supervised labels (video level). These two challenges make AVVP an intractable Multi-modal Multi-Instance Learning (MMIL) problem, namely distinguishing the events from both the modality and temporal perspectives.

In the pioneering work (Tian et al., 2020), a benchmark named Hybrid Attention Network (HAN) is proposed to encode the audio-visual features; it uses attentive pooling to aggregate the audio and visual features to predict the events of the video. The weak video label is used as the main supervision. To address this task, the authors obtain pseudo labels for the separate audio and visual modalities by processing the known video label with the label smoothing (Szegedy et al., 2016) technique. Their experimental results indicate that generating pseudo labels for each modality brings significant benefits for supervising event parsing (Tian et al., 2020). Subsequent studies diverge into two branches. Most of them focus on designing effective networks to implicitly aggregate the multi-modal features for prediction (Mo & Tian, 2022; Pasi et al., 2022; Lamba et al., 2021; Yu et al., 2022; Lin et al., 2021; Jiang et al., 2022; Gao et al., 2023), while using the video-level pseudo labels generated by HAN (Tian et al., 2020). In contrast, other works (Wu & Yang, 2021; Cheng et al., 2022) are devoted to generating better pseudo labels for each modality on top of the HAN backbone. However, the generated pseudo labels are denoised from the video label and remain limited to the video level, which only indicates what events exist in each modality (modality perspective). Therefore, they fail to address the second challenge because it remains difficult to distinguish which segments contain the event (temporal perspective).

To deal with the above two challenges, our work starts from the question of whether we can explicitly generate pseudo labels for each segment of each modality to facilitate this MMIL task. This is inspired by two observations: (1) AVVP models are expected to be well guided by segment-level labels, as such fine-grained labels provide more explicit supervision and directly fit the goal of the AVVP task (temporal perspective); (2) humans process audio and visual signals with independent sensors, so we can indeed annotate each modality, specifically for what we hear or see, by leveraging unimodal input (modality perspective). To this end, we propose a Visual-Audio Pseudo LAbel exploratioN (VAPLAN) method that aims to generate high-quality segment-level pseudo labels for both the visual and audio modalities and further advance this weakly-supervised AVVP task.

To obtain the visual or audio pseudo labels, a natural idea is to borrow free knowledge from models pretrained for image or audio classification. However, such a strategy suffers from a category misalignment problem between the source and target datasets. Taking visual pseudo label generation as an example, models pretrained on ImageNet (Deng et al., 2009) would classify an instance of the AVVP task into the predefined ImageNet categories. However, the predicted category may not exist in the target LLP dataset of the AVVP task, causing the category misalignment. Different from traditional image classification models, vision-language pretraining (Alayrac et al., 2022; Jia et al., 2021; Radford et al., 2021) has recently attracted tremendous attention; such models can flexibly classify images over an open-category vocabulary and show impressive zero-shot performance. Among these works, Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) is a representative one. Given an image, its potential category names are inserted into a predefined text prompt. CLIP then scores the categories according to the similarity between the encoded texts and the image features, and the category with a high similarity score is identified as the classification result. Similarly, in the audio community, Contrastive Language-Audio Pretraining (CLAP) (Wu et al., 2023) is trained on a large-scale corpus that pairs texts with semantically aligned audio. With similar training and inference schemes, CLAP is able to perform audio classification in a zero-shot manner and satisfactorily identify the category of a given audio clip from an open vocabulary as well.
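To make this zero-shot scoring scheme concrete, below is a minimal sketch of prompt-based classification with the openai CLIP package. The prompt string matches the one adopted later in Sect. 4.1, but the small category list and the frame path are illustrative placeholders rather than the exact configuration of our pipeline.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative category subset; the AVVP experiments use the 25 LLP event classes.
categories = ["speech", "vacuum cleaner", "dog", "rooster"]
texts = clip.tokenize([f"This photo contains the {c}" for c in categories]).to(device)
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder frame path

with torch.no_grad():
    img_feat = model.encode_image(image)   # [1, d]
    txt_feat = model.encode_text(texts)    # [C, d]

# Cosine similarity between the image and each prompted category, normalized by softmax.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
scores = (img_feat @ txt_feat.T).softmax(dim=-1)  # [1, C]
print({c: round(s.item(), 3) for c, s in zip(categories, scores[0])})
```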

Inspired by such benefits of large-scale pretraining, we propose a Pseudo Label Generation (PLG) module that seeks guidance from CLIP (Radford et al., 2021) and CLAP (Wu et al., 2023) to generate reliable segment-level visual and audio pseudo labels. A simple illustration of PLG is shown in Fig. 1c. Given all the potential event labels, CLIP/CLAP, acting like an intelligent robot, is asked to answer whether each event is contained in the given visual/audio segment. In brief, the queried event categories whose cross-modal similarity scores exceed the preset threshold \(\tau _v\)/\(\tau _a\) are regarded as the visual/audio pseudo labels. This process is applied to each video segment, so we obtain segment-level pseudo labels; more implementation details are provided in Sect. 4.1. The generated pseudo labels are used to provide full supervision for each modality. Going a step further, we consider that the generated pseudo labels may contain noise since they are not manually annotated. In particular, some video instances can be challenging even for human annotators due to issues inherent in the collected videos, such as the objects of a visual event being too small or obscured. In the example shown in Fig. 1d, only part of the vacuum cleaner is visible in the third segment. PLG uses only a single frame to generate pseudo labels and, without contextual information, fails to recognize the visual event vacuum cleaner for this segment, giving the incorrect pseudo label '0' for this category (denoted by the brown box). To alleviate such noise in the pseudo labels generated by PLG, we further propose a Pseudo Label Denoising (PLD) strategy that re-examines the generated pseudo labels and amends the incorrect ones. Samples with noisy labels are usually hard to learn and often incur a large forward propagation loss (Hu et al., 2021a; Kim et al., 2022; Huang et al., 2019). In our work, the large loss comes from the data on which the model cannot give predictions consistent with the pseudo labels. For the video example shown in Fig. 1d, the third segment indeed suffers an abnormally large forward loss with a value of 3.39, whereas the values are almost zero for the other segments in the same video, which are assigned accurate labels. This motivates us to perform segment-wise denoising by checking for abnormally large forward losses along the timeline. The pseudo labels of such controversial segments are reassigned, providing a more accurate version. More discussions and implementation details of PLD are introduced in Sect. 4.3.

PLG and PLD enable the production of high-quality pseudo labels. Furthermore, we find that the obtained segment-level audio and visual pseudo labels contain rich information, indicating how many categories of events happen in each audio/visual segment (category richness) and in how many audio/visual segments a certain event category exists (segment richness). Take the visual modality as an example. As shown in Fig. 1b, the video-level label indicates that there are at most two events in the visual track, i.e., speech and vacuum cleaner. In practice, only the fourth segment contains both events, while the first segment contains only one event, namely the vacuum cleaner. Therefore, the visual category richness of the first and fourth segments is 1/2 and 1, respectively. Similarly, from the perspective of event categories, the vacuum cleaner event appears in four of the five segments of the video, while the speech event exists in only one (the fourth) segment. Thus, the visual segment richness of the vacuum cleaner and speech events is 4/5 and 1/5, respectively. Such category richness and segment richness information can also be observed in the audio track. An AVVP model should be aware of the differences in category richness and segment richness to give correct predictions. Based on this, we propose a Pseudo Label Exploitation (PLE) strategy that uses a novel richness-aware loss to align the richness information contained in the model predictions with that contained in the pseudo labels. Our experiments verify that the generated pseudo labels, combined with the proposed richness-aware loss, significantly boost the video parsing performance.

For the challenging audio-visual video parsing task, we conduct a comprehensive study on the exploration of the segment-wise audio and visual pseudo labels, including their generation, exploitation, and denoising. Extensive experimental results demonstrate the effectiveness of our main designs. Besides, our method can also be extended to the related weakly-supervised audio-visual event localization (AVEL) (Tian et al., 2018; Wu et al., 2019; Zhou et al., 2021) task. Overall, our contributions can be summarized as follows:

  • We introduce a new approach to explore the pseudo-label strategy for the AVVP task from a more fine-grained level, i.e., the segment level.

  • Our proposed pseudo label generation and label denoising strategies successfully provide high-quality segment-wise audio and visual pseudo labels.

  • We propose a new richness-aware loss function for superior model optimization, effectively exploiting the segment-richness and category-richness present in the pseudo labels.

  • Our method achieves new state-of-the-art performance on all types of event parsing, including audio event, visual event, and audio-visual event parsing.

  • The proposed core designs can be seamlessly integrated into existing frameworks for the AVVP and AVEL tasks, leading to enhanced performance.

2 Related Work

Audio-Visual Video Parsing (AVVP). The AVVP task needs to recognize what events happen in each modality and localize the corresponding video segments where the events exist. Tian et al. (Tian et al., 2020) first propose this task and design a hybrid attention network to aggregate the intra-modal and inter-modal features. They also use the label smoothing (Szegedy et al., 2016) strategy to mitigate the modality label bias introduced by the single video-level label. Some methods focus on network design. Yu et al. (2022) propose a multimodal pyramid attentional network that consists of multiple pyramid units to encode the temporal features. Jiang et al. (2022) use two extra independent visual and audio prediction networks to alleviate the label interference between the audio and visual modalities. Mo & Tian (2022) use learnable class-aware tokens to group the semantics from the separate audio and visual modalities. To overcome label interference, Wu & Yang (2021) swap the audio and visual tracks of two event-independent videos to construct new data for model training and generate pseudo labels according to the predictions on the reconstructed videos. Cheng et al. (2022) first estimate the noise ratio of the video labels and reverse a certain percentage of the labels with large forward losses. Although these methods bring considerable improvements, they can only generate event labels at the video level. Unlike them, we aim to directly obtain high-quality pseudo labels for both the audio and visual modalities at the segment level, which further helps train the video parsing system.

CLIP/CLAP Pre-Training. Here, we discuss the pre-training technique and elaborate on why we choose CLIP/CLAP as the base large-scale models for generating pseudo labels in this work. CLIP (Radford et al., 2021) is trained on a dataset of 400 million image-text pairs using a contrastive learning objective. This large-scale pretraining enables CLIP to learn efficient representations of images and texts and to demonstrate impressive performance on zero-shot image classification. Its zero-shot transfer ability opens a new scheme for solving many tasks and has spawned a large number of research works, such as image captioning (Barraco et al., 2022), video captioning (Tang et al., 2021), and semantic segmentation (Ma et al., 2022; Ding et al., 2022; Xu et al., 2021; Zhou et al., 2022a; Rao et al., 2022b). Most of these works freeze or fine-tune the image and text encoders of CLIP to extract advanced features for downstream tasks (Tang et al., 2021; Wang et al., 2022; Barraco et al., 2022; Ma et al., 2022; Zhou et al., 2022c). For zero-shot semantic segmentation, some methods use the pretrained CLIP to generate pixel-level pseudo labels, which are annotator-free and helpful (Zhou et al., 2022a; Rao et al., 2022b). Analogously, CLAP (Wu et al., 2023) is trained with a similar contrastive objective but on 630k audio-text pairs and achieves state-of-the-art zero-shot audio classification performance. Recently, some works have started to use CLAP to facilitate downstream tasks, such as audio source separation (Liu et al., 2023b), text-to-audio generation (Liu et al., 2023a), and speech emotion recognition (Pan et al., 2023). In this work, we make a new attempt to borrow the prior knowledge of CLIP/CLAP to ease the challenging weakly-supervised audio-visual video parsing task.

Learning with Pseudo Labels. Deep neural networks achieve remarkable performance in various tasks, largely owing to the large amount of labeled data available for training. Recently, researchers have attempted to generate massive pseudo labels for unlabeled data to further boost model performance. Most methods directly generate and use pseudo labels, which has proven beneficial for various tasks, such as image classification (Yalniz et al., 2019; Xie et al., 2020; Pham et al., 2021; Rizve et al., 2021; Zoph et al., 2020; Hu et al., 2021b), speech recognition (Kahn et al., 2020; Park et al., 2020), and image-based text recognition (Patel et al., 2023). For the studied AVVP task, few works study the impact of pseudo labels, and the several existing methods focus on disentangling the event pseudo label for each modality from the known video label (Tian et al., 2020; Wu & Yang, 2021; Cheng et al., 2022). However, the obtained pseudo labels are confined to the video level. On the other hand, some recent works in other fields notice the potential noise contained in pseudo labels and propose effective methods to learn with noisy pseudo labels (Hu et al., 2021a; Kim et al., 2022). Specifically, Hu et al. (2021a) propose to optimize the network by giving more weight to clean samples and less to hard-to-learn samples. For the weakly-supervised multi-label classification problem, Kim et al. (2022) propose to correct the false negative labels that are likely to have larger losses. However, these works focus on label refinement for image tasks. Returning to our video task, we conduct a comprehensive exploration of pseudo labels, encompassing both their generation and denoising. Specifically, we propose to assign explicit pseudo labels to each segment of each modality. We achieve this goal by flexibly sending all the possible event categories to reliable large-scale text-vision/audio models to pick the most likely event categories for each video segment. Furthermore, we propose a new pseudo-label denoising strategy, which performs segment-wise denoising to provide pseudo labels with more accurate temporal boundaries within each video. We also provide more in-depth discussions on pseudo-label quality assessment and the denoising effects in different modalities in Sect. 5.2.

3 Preliminary

In this section, we formulate the AVVP task in detail and briefly introduce the baseline framework HAN (Tian et al., 2020), which is used in both our approach and prior works employing video-level pseudo labels (Wu & Yang, 2021; Cheng et al., 2022) for the AVVP task.

Task Formulation. Given a T-second video sequence \(\{V_t, A_t\}_{t=1}^{T}\), \(V_t\) and \(A_t\) denote the visual and audio components of the t-th video segment, respectively. The event label of the video is \( {\varvec{y}}^{v \cup a} = \{ y^{v \cup a}_c \,|\, y^{v \cup a}_c \in \{0, 1\}, c = 1, 2,..., C\} \in \mathbb {R}^{1 \times C}\), where C is the total number of event categories and the superscript ‘\(v\cup a\)’ denotes that the event label of the entire video is the union of the labels of the audio and visual modalities; a value of 1 for \({y}^{v \cup a}_c\) indicates that an event of the c-th category happens in the video. Note that \({\varvec{y}}^{v \cup a}\) is a weakly-supervised label at the video level; the label of each individual modality for each video segment is unknown during training. However, the audio events, visual events, and audio-visual events contained in each segment need to be predicted for evaluation. We denote the probabilities of the video-level visual and audio events as \(\{\{{\varvec{p}}^v; {\varvec{p}}^a\} \in \mathbb {R}^{1 \times C} \,|\, {p}^v_c, {p}^a_c \in [0, 1] \}\), and \({\varvec{p}}^{v \cap a}={\varvec{p}}^{v}*{\varvec{p}}^{a}\) represents their intersection. Accordingly, the probabilities of the visual events, audio events, and audio-visual events of all video segments, denoted as \(\{{\varvec{P}}^v; {\varvec{P}}^a; {\varvec{P}}^{v \cap a} \} \in \mathbb {R}^{T \times C} \), need to be predicted.
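To make the tensor shapes and the union/intersection relations of the formulation concrete, the following is a minimal PyTorch sketch with randomly generated binary labels; T = 10 and C = 25 match the LLP configuration used later, and everything else is purely illustrative.

```python
import torch

T, C = 10, 25                               # segments per video, event categories (LLP)

# Hypothetical segment-level binary event labels for one video.
Y_v = torch.randint(0, 2, (T, C)).float()   # visual events per segment
Y_a = torch.randint(0, 2, (T, C)).float()   # audio events per segment

# Audio-visual events: the per-segment intersection of audio and visual events.
Y_av = Y_v * Y_a                            # [T, C]

# Weak video-level label y^{v ∪ a}: the union over modalities and segments.
y_union = ((Y_v + Y_a).sum(dim=0) > 0).float()   # [C]
```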

Baseline Framework. The baseline network HAN (Tian et al., 2020) uses the multi-head attention (MHA) mechanism of the Transformer (Vaswani et al., 2017) to encode intra-modal and cross-modal features for the audio and visual modalities. We denote the initial audio and visual features extracted by pretrained neural networks (Hershey et al., 2017; He et al., 2016) as \( {\varvec{F}}^a \), \({\varvec{F}}^v \in \mathbb {R}^{T \times d}\), where d is the feature dimension. The process of HAN can be summarized as,

$$\begin{aligned} \left\{ \begin{aligned} {\varvec{\dot{F}}}^a = {\varvec{F}}^a + \texttt {MHA}({\varvec{F}}^a, {\varvec{F}}^a) + \texttt {MHA}({\varvec{F}}^a, {\varvec{F}}^v), \\ {\varvec{\dot{F}}}^v = {\varvec{F}}^v + \texttt {MHA}({\varvec{F}}^v, {\varvec{F}}^v) + \texttt {MHA}({\varvec{F}}^v, {\varvec{F}}^a), \end{aligned} \right. \end{aligned}$$
(1)

where \({\varvec{\dot{F}}}^a\), \({\varvec{\dot{F}}}^v \in \mathbb {R}^{T \times d}\) are the updated audio and visual features. The probabilities of segment-wise events for the audio and visual modalities are predicted through a fully-connected (FC) layer and a sigmoid function, denoted as \({\varvec{P}}^a \in \mathbb {R}^{T \times C}\) and \({\varvec{P}}^v \in \mathbb {R}^{T \times C}\). An attentive pooling layer is further used to transform the segment-level predictions \(\{ {\varvec{P}}^a; {\varvec{P}}^v\}\) into video-level predictions \(\{ {\varvec{p}}^a; {\varvec{p}}^v\} \in \mathbb {R}^{1 \times C}\). By combining the audio and visual predictions \({\varvec{p}}^a\) and \({\varvec{p}}^v\), we obtain the event prediction of the entire video, \({\varvec{p}}^{v \cup a} \in \mathbb {R}^{1 \times C}\). The basic video-level objective for model training is:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {bce}}({\varvec{p}}^{v \cup a}, {\varvec{y}}^{v \cup a}) + \mathcal {L}_{\text {bce}}({\varvec{p}}^a, \overline{{\varvec{y}}}^a) + \mathcal {L}_{\text {bce}}({\varvec{p}}^v, \overline{{\varvec{y}}}^v), \end{aligned}$$
(2)

where \(\mathcal {L}_{\text {bce}}\) is the binary cross-entropy loss, \({\varvec{y}}^{v \cup a} \in \mathbb {R}^{1 \times C} \) is the video-level ground truth label and \(\{\overline{{\varvec{y}}}^v; \overline{{\varvec{y}}}^a \} \in \mathbb {R}^{1 \times C} \) are the video-level visual and audio pseudo labels generated using label smoothing (Szegedy et al., 2016) from \({\varvec{y}}^{v \cup a}\).
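For intuition, here is a minimal PyTorch sketch of the hybrid-attention update in Eq. 1 followed by the segment-level prediction head. The module name, the number of attention heads, the shared classifier, and the omission of attentive pooling are simplifying assumptions rather than the exact HAN implementation.

```python
import torch
import torch.nn as nn

class HybridAttentionBlock(nn.Module):
    """Simplified HAN-style encoder: self-attention plus cross-modal attention (Eq. 1)."""
    def __init__(self, d=512, num_heads=4, num_classes=25):
        super().__init__()
        self.self_a = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.cls = nn.Linear(d, num_classes)   # FC head shared here for brevity

    def forward(self, F_a, F_v):               # F_a, F_v: [B, T, d]
        F_a_new = F_a + self.self_a(F_a, F_a, F_a)[0] + self.cross_a(F_a, F_v, F_v)[0]
        F_v_new = F_v + self.self_v(F_v, F_v, F_v)[0] + self.cross_v(F_v, F_a, F_a)[0]
        P_a = torch.sigmoid(self.cls(F_a_new))  # segment-level audio probabilities, [B, T, C]
        P_v = torch.sigmoid(self.cls(F_v_new))  # segment-level visual probabilities, [B, T, C]
        return P_a, P_v

# Example: a batch of 2 videos, T=10 one-second segments, d=512 features.
P_a, P_v = HybridAttentionBlock()(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```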

Fig. 2

Overview of our method. As a label refining method, we aim to produce high-quality and fine-grained segment-wise event labels. For the backbone, any existing network for the AVVP task can be used to generate event predictions; here, we adopt the baseline HAN (Tian et al., 2020). In our solution, we design a pseudo label generation (PLG) module, where the pretrained CLIP (Radford et al., 2021) and CLAP (Wu et al., 2023) are used to generate segment-level pseudo labels for the visual and the audio modality, respectively. Notably, the parameters of CLIP and CLAP are frozen. In the figure, we detail the visual pseudo label generation and simplify that of the audio modality since they share a similar pipeline. In brief, the pseudo labels are identified by thresholding the similarity between the visual/audio embeddings and the (event) text embeddings. For the t-th segment, the video label ‘speech’ is filtered out for the visual modality, and only ‘rooster’ remains for the audio modality. After that, with the generated pseudo labels, we propose pseudo label exploitation (PLE) by designing a richness-aware loss as a new fully supervised objective that helps the model align the category richness and segment richness of the predictions and the pseudo labels. Lastly, we design a pseudo label denoising (PLD) strategy that further refines the pseudo labels by reversing the positions with anomalously large forward loss values. Specifically, we re-examine the pseudo labels along the timeline; pseudo labels of segments with an abnormally high binary cross-entropy forward loss are refined (the motivation and implementation details can be found in Sect. 4.3). The updated pseudo labels are further used as new supervision for model training. \(\otimes \) denotes matrix multiplication and \(\odot \) denotes element-wise multiplication

4 Our Method

An overview of our method is shown in Fig. 2. We focus on producing reliable segment-level audio and visual pseudo labels to better supervise the model for audio-visual video parsing. For the backbone, we simply adopt the baseline HAN (Tian et al., 2020) to generate event predictions. Our method introduces the following innovations. (1) We propose a pseudo label generation module that uses the pretrained CLIP (Radford et al., 2021) and CLAP (Wu et al., 2023) models to generate reliable visual and audio pseudo labels, respectively, at the segment level. (2) We then propose a pseudo label exploitation strategy to utilize the obtained pseudo labels. Specifically, we design a new richness-aware loss that regularizes the predictions to be aware of the category richness and segment richness contained in the pseudo labels, which is helpful for model optimization. (3) We also propose a pseudo label denoising strategy that further improves the generated visual pseudo labels for data with abnormally high forward loss values caused by incorrect pseudo labels. Next, we elaborate on these strategies.

4.1 Pseudo Label Generation (PLG)

PLG aims to generate high-quality visual and audio pseudo labels at the segment level, which are expected to alleviate the video-level label interference for each single modality and better supervise the model in distinguishing video segments. As discussed in Sect. 1, we select the pretrained CLIP (Radford et al., 2021) and CLAP (Wu et al., 2023) to achieve this goal due to their flexible open-vocabulary classification capabilities.

Taking the visual modality as an example, we detail the pseudo label generation process. Specifically, each video instance is evenly split into several segments, and we sample the middle frame to represent each segment. As shown in Fig. 2-1, the sampled frame \(\mathcal {I}_t\) of the t-th segment is fed into the CLIP image encoder to obtain the visual feature, denoted as \({\varvec{f}}^{\mathcal {I}}_t \in \mathbb {R}^{1 \times d}\). As for the event category encoding, the default text input of the CLIP text encoder follows the prompt “A photo of a [CLS]”, where [CLS] can be replaced by the potential category names. For the AVVP task, we empirically change the prompt to a more appropriate one, “This photo contains the [CLS]” (an ablation study of the prompt for the CLIP text encoder is shown in Sect. 5.2). By replacing [CLS] in this prompt with each event category and sending the generated texts to the CLIP text encoder, we obtain the text (event category) features of all C classes, \({\varvec{f}}^{\mathcal {T}} \in \mathbb {R}^{C \times d}\). Then the normalized cosine similarity \({\varvec{s}}_t \in \mathbb {R}^{1 \times C}\) between the image and the event categories can be computed by,

$$\begin{aligned} {\varvec{s}}_t = {\textrm{softmax}}\left( \frac{ {\varvec{f}}^{\mathcal {I}}_t }{\Vert {\varvec{f}}^{\mathcal {I}}_t \Vert _2 } \otimes (\frac{ {\varvec{f}}^{\mathcal {T}}}{ \Vert {\varvec{f}}^{\mathcal {T}} \Vert _2})^{\top }\right) , \end{aligned}$$
(3)

where \(\otimes \) denotes the matrix multiplication, and \(\top \) is the matrix transposition. A high similarity score in \({\varvec{s}}_t\) indicates that the event category is more likely to appear in the image.

We use a threshold \(\tau _v\) to select the categories with higher confidence scores in \({\varvec{s}}_t\) and obtain the score mask \({\varvec{m}}_t\). After that, we impose the score mask \({\varvec{m}}_t\) on the known video-level label \({\varvec{y}}^{v \cup a}\) with element-wise multiplication \(\odot \) to obtain the visual events occurring at the t-th segment, \({\varvec{\hat{y}}}^v_t \in \mathbb {R}^{1 \times C}\). This process can be formulated as,

$$\begin{aligned} \left\{ \begin{aligned}&{\varvec{m}}_t = \mathbbm {1}({{\varvec{s}}_t - \tau _v}), \\&{\varvec{\hat{y}}}^v_t = {\varvec{m}}_t \odot {\varvec{y}}^{v \cup a}, \end{aligned} \right. \end{aligned}$$
(4)

where \(\mathbbm {1}(x_i)\) outputs ‘1’ when the input \(x_i \ge 0\) and ‘0’ otherwise, with \( i=1, 2,..., C\), and \({\varvec{m}}_t \in \mathbb {R}^{1 \times C}\).

This pseudo label generation process can be applied to all the segments. Therefore, we obtain the segment-level visual pseudo label for each video, denoted as \({\varvec{\hat{Y}}}^v =\{ {\varvec{\hat{y}}}^v_t \}\in \mathbb {R}^{T \times C}\). Note that the video-level visual pseudo label \(\hat{{\varvec{y}}}^v \in \mathbb {R}^{1 \times C}\) can be easily obtained from \({\varvec{\hat{Y}}}^v\), where \( \hat{{y}}^v_c = \mathbbm {1}(\sum _{t=1}^{T} {\varvec{\hat{Y}}}^v_{t,c})\), meaning that if an event category exists in at least one video segment, it is included in the video-level label.

The audio pseudo labels can be generated in a similar way, with several adjustments. For brevity, we outline the main steps here. (1) We use the CLAP model instead of CLIP for audio pseudo label generation. (2) The audio waveform of the entire video is split into T equal-length segments, and each segment is sent to the CLAP audio encoder. (3) We use the prompt “This sound contains the [CLS]” with the event categories as the input to the CLAP text encoder. (4) We compute the similarity scores between the text and audio features extracted by CLAP (as in Eq. 3) and use an independent threshold \(\tau _a\) (replacing \(\tau _v\) in Eq. 4) to select high similarity values. In this way, we obtain the segment-level audio pseudo label \({\varvec{\hat{Y}}}^a \in \mathbb {R}^{T \times C}\) and the video-level audio pseudo label \(\hat{{\varvec{y}}}^a \in \mathbb {R}^{1 \times C}\) for each video sample.
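As a concrete reference, the following is a minimal PyTorch sketch of the PLG thresholding and masking in Eqs. 3-4, operating on precomputed CLIP/CLAP features. The helper name `generate_segment_pseudo_labels`, the feature dimension, and the toy inputs are illustrative assumptions; the example threshold value corresponds to the \(\tau _v\) reported in Sect. 5.2.

```python
import torch
import torch.nn.functional as F

def generate_segment_pseudo_labels(seg_feats, text_feats, video_label, tau):
    """seg_feats: [T, d] CLIP image (or CLAP audio) features, one per segment.
    text_feats: [C, d] CLIP/CLAP text features of the prompted event categories.
    video_label: [C] binary video-level label y^{v ∪ a}.
    tau: similarity threshold (tau_v or tau_a)."""
    seg_feats = F.normalize(seg_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    s = (seg_feats @ text_feats.t()).softmax(dim=-1)   # Eq. 3: [T, C] similarity scores
    m = (s >= tau).float()                             # Eq. 4: score mask
    Y_hat = m * video_label.unsqueeze(0)               # segment-level pseudo labels, [T, C]
    y_hat = (Y_hat.sum(dim=0) > 0).float()             # video-level pseudo label, [C]
    return Y_hat, y_hat

# Toy usage with random features (T=10 segments, C=25 categories, d=512).
Y_hat, y_hat = generate_segment_pseudo_labels(
    torch.randn(10, 512), torch.randn(25, 512),
    torch.randint(0, 2, (25,)).float(), tau=0.041)
```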

4.2 Pseudo Label Exploitation (PLE)

The weakly-supervised AVVP task requires predictions for each segment, but only the video-level label is provided. This task would be greatly advanced if segment-level supervision were additionally available. In this part, we exploit the pseudo labels at both the video level and the segment level, since we have obtained pseudo labels of these two levels, namely \(\hat{{\varvec{y}}}^m \in \mathbb {R}^{1 \times C}\) and \(\hat{{\varvec{Y}}}^m \in \mathbb {R}^{T \times C}\), where \(m \in \{v, a\}\) denotes the modality. In particular, for the segment-level supervision, we propose a new richness-aware optimization objective to help the model align the predictions and pseudo labels. We introduce our pseudo label exploitation strategy from the two aspects below.

Basic video-level loss. Existing methods usually adopt the objective function formulated in Eq. 2 for model training (Wu & Yang, 2021; Yu et al., 2022; Cheng et al., 2022; Mo & Tian, 2022), where \(\overline{{\varvec{y}}}^m \in \mathbb {R}^{1 \times C}\) is the video-level label obtained by label smoothing. Instead, we use the video-level pseudo label \(\hat{{\varvec{y}}}^m \in \mathbb {R}^{1 \times C}\) generated by our PLG module as new supervision. The objective is then updated to,

$$\begin{aligned} \mathcal {L_V} = \mathcal {L}_{\text {bce}}({\varvec{p}}^{v \cup a}, {\varvec{y}}^{v \cup a}) + \mathcal {L}_{\text {bce}}({\varvec{p}}^a, \hat{{\varvec{y}}}^a) + \mathcal {L}_{\text {bce}}({\varvec{p}}^v, \hat{{\varvec{y}}}^v). \end{aligned}$$
(5)

New segment-level loss. With the segment-wise pseudo label \(\hat{{\varvec{Y}}}^m\), we propose a new richness-aware loss inspired by the following observations. (1) Each row of the segment-wise pseudo labels, e.g., \(\hat{{\varvec{Y}}}^m_{t\cdot } \in \mathbb {R}^{1 \times C}\), the t-th row of the pseudo label, indicates which events appear in the t-th segment. For example, we show the visual pseudo label \(\hat{{\varvec{Y}}}^m\) (with \(m=v\)) in Fig. 2-2. There are three visual events in the first segment, i.e., the dog, rooster, and speech, so \(\hat{{\varvec{Y}}}^v_{1\cdot } = [1,1,1]\), while the last segment contains only the rooster event, i.e., \(\hat{{\varvec{Y}}}^v_{T\cdot } = [0,1,0]\). This reflects the richness of event categories in different segments, i.e., how many event categories exist in each segment. Similarly, the audio pseudo label \(\hat{{\varvec{Y}}}^a\) tells the category richness of the audio events. We define the category richness of the t-th segment, \({cr}_t^m\), as the ratio of the number of categories in the t-th segment to the total number of categories in the video, written as,

$$\begin{aligned} {cr}_t^m = \frac{ \sum _{c=1}^C \hat{{\varvec{Y}}}^m_{t,c} }{ \sum _{c=1}^C {\varvec{y}}^{v \cup a}_{c}}, \end{aligned}$$
(6)

where \(m \in \{v, a\}\) denotes the visual or audio modality. Therefore, we can obtain the category richness vector of all segments \({\varvec{cr}}^m \in \mathbb {R}^{T \times 1}\) for each modality. In the example shown in Fig. 2-2, the visual category richness for the first and last segments, i.e., \(cr_1^v\) and \(cr_T^v\), is equal to 1 and 1/3, respectively.

(2) On the other hand, each column of the pseudo labels, e.g., \(\hat{{\varvec{Y}}}^m_{\cdot c}\in \mathbb {R}^{T \times 1}\), \(m \in \{v, a\}\), indicates how many visual/audio segments contain the event of the c-th category. We define the segment richness of the c-th category, \({sr}_c^m\), as the ratio of the number of segments containing category c to the total number of segments in the video, written as,

$$\begin{aligned} {sr}_c^m = \frac{1}{T} \sum _{t=1}^T \hat{{\varvec{Y}}}^m_{t,c}. \end{aligned}$$
(7)

In the example shown in Fig. 2-2, the visual segment richness for the event categories dog and speech, i.e., \(sr_1^v\) and \(sr_3^v\), is equal to 1/2 and 1/4, respectively. Extending this to all C event categories, we obtain the segment richness vector of all categories, \({\varvec{sr}}^m \in \mathbb {R}^{1 \times C}\), where \(m \in \{v, a\}\) denotes the visual and audio modalities.

So far, for each modality \(m \in \{v, a\}\), we have obtained the category richness \({\varvec{cr}}^m\) and segment richness \({\varvec{sr}}^m\) of the pseudo labels. With the prediction \({\varvec{P}}^m \in \mathbb {R}^{T \times C}\) from the baseline network, we compute its category richness and segment richness in the same way, denoted as \({\varvec{pcr}}^m \in \mathbb {R}^{T \times 1}\) and \({\varvec{psr}}^m \in \mathbb {R}^{1 \times C}\). Then, we design the segment-level richness-aware loss \(\mathcal {L_S}\) to align the richness of the predictions with that of the pseudo labels, calculated by,

$$\begin{aligned} \mathcal {L_S} = \sum _{m\in \{v, a\}} {\mathcal {L}_{\text {bce}}({\varvec{pcr}}^m, {\varvec{cr}}^m) + \mathcal {L}_{\text {bce}}({\varvec{psr}}^m, {\varvec{sr}}^m)}. \end{aligned}$$
(8)

The total objective function \(\mathcal {L}_{\text {total}}\) for AVVP in this work is the combination of the basic loss \(\mathcal {L_V}\) and the richness-aware loss \(\mathcal {L_S}\), i.e.,

$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L_V} + \lambda \mathcal {L_S}, \end{aligned}$$
(9)

where \(\lambda \) is a weight parameter.
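A minimal PyTorch sketch of the richness computation (Eqs. 6-7) and the per-modality richness-aware term of Eq. 8 is given below. The helper names and the clamping used to keep the BCE inputs valid are implementation assumptions, and the summation over modalities as well as the combination with \(\mathcal {L_V}\) (Eq. 9) are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def richness(Y, video_label):
    """Y: [T, C] segment-level labels or predicted probabilities; video_label: [C] binary y^{v ∪ a}."""
    cr = Y.sum(dim=1) / video_label.sum().clamp(min=1)   # category richness per segment (Eq. 6), [T]
    sr = Y.mean(dim=0)                                   # segment richness per category (Eq. 7), [C]
    return cr, sr

def richness_aware_loss(P, Y_hat, video_label, eps=1e-6):
    """Richness-aware loss for one modality (one term of the sum in Eq. 8)."""
    pcr, psr = richness(P, video_label)                  # richness of the predictions
    cr, sr = richness(Y_hat, video_label)                # richness of the pseudo labels
    # Clamping keeps the BCE inputs/targets inside [0, 1]; this is an implementation choice.
    return F.binary_cross_entropy(pcr.clamp(eps, 1 - eps), cr.clamp(0, 1)) + \
           F.binary_cross_entropy(psr.clamp(eps, 1 - eps), sr.clamp(0, 1))

# Toy usage for one modality (T=10, C=25); summing over m ∈ {v, a} gives L_S of Eq. 8.
video_label = torch.randint(0, 2, (25,)).float()
Y_hat = torch.randint(0, 2, (10, 25)).float() * video_label   # pseudo labels respect y^{v ∪ a}
L_S_v = richness_aware_loss(torch.rand(10, 25), Y_hat, video_label)
```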

4.3 Pseudo Label Denoising (PLD)

In general, PLG produces trustworthy segment-level pseudo labels, especially when combined with the proposed richness-aware loss, which significantly improves the audio-visual video parsing performance; this is verified by the experiments in Sect. 5.3. Going a step further, we posit that the generated pseudo labels may still contain some noise. By our observation, the video-level event categories are pseudo-annotated satisfactorily, but misclassification of specific segments occurs along the timeline within each video, particularly for hard video instances that are difficult to annotate at the segment level. We specifically trace such challenges in the visual modality and observe that, without contextual information, individual frames sent to CLIP may be misclassified, especially when the visual objects in the images are too small, the images suffer from blurriness or inadequate lighting, or portions of the objects are obscured and thus hard to discern. As shown in Fig. 2, the dog in the last two segments is mostly obscured by the rooster, and CLIP fails to recognize the visual event dog without contextual information. In this case, the generated pseudo labels do not accurately capture the temporal boundary of the event and would be detrimental to model training. We believe that the segment-level visual pseudo labels can be further refined. As for the audio modality, the audio signal is represented as a waveform and keeps good continuity even when split into multiple segments for pseudo-labeling. This characteristic helps resist disturbances along the timeline when generating segment-level audio pseudo labels with CLAP. In fact, the quality of the audio pseudo labels is indeed better than that of the visual pseudo labels. For example, the segment-level F-score of the audio pseudo labels is \(\sim \)10 points higher than that of the visual pseudo labels, as demonstrated in Tables 1 and 2. This implies the high quality of the audio pseudo labels produced by PLG and highlights the greater difficulty in enhancing the accuracy of the visual pseudo labels. We present further discussions with more experimental results in Sect. 5.2.

In this section, we propose a pseudo label denoising (PLD) strategy that rechecks the pseudo labels generated by PLG and further refines the inaccurate ones (noisy pseudo labels). Our PLD is inspired by works that conduct label denoising with the help of the forward propagation loss for image tasks (Kim et al., 2022; Hu et al., 2021a). In general, a large forward loss means that the trained model does not give the same prediction as the label for a sample. There are two main reasons for this: (1) the provided label is correct but the video data is hard to learn and the model has not learned an effective representation for it; (2) the label itself is incorrect. In this work, our PLD leverages the forward loss to check the temporal continuity of segment-level pseudo labels in each video and amends the abnormal segments when they belong to the second case.

Specifically, we first use the objective function in Eq. 9 to train a baseline model. Then, we compute the element-wise forward loss matrix by measuring the binary cross-entropy between the prediction \({\varvec{P}}^m\) and the pseudo label \({\varvec{\hat{Y}}}^m\), denoted as \( {\varvec{\mathcal {M}}}^m = \mathcal {L}_{\text {bce}}({\varvec{P}}^m, {\varvec{\hat{Y}}}^m) \in \mathbb {R}^{T \times C}\), where \(m\in \{v,a\}\) denotes the visual or the audio modality. The j-th column of \({\varvec{\mathcal {M}}}^m\), denoted as \({\varvec{\mathcal {M}}}_{\cdot j}^m \in \mathbb {R}^{T \times 1}\), gives the loss values of all segments for the j-th event category. In the example shown in Fig. 2-3, we display the forward loss matrix for the visual modality and find that the last two video segments have much larger forward losses than the other segments for the dog category, although they actually contain this event like the other segments. The abnormally large loss values are caused by the fact that the last two segments are assigned incorrect visual pseudo labels. Therefore, the matrix \({\varvec{\mathcal {M}}}^m\) can reveal the segments whose pseudo labels contain potential noise and require refinement.

Note that the pseudo label \({\varvec{\hat{y}}}^m \in \mathbb {R}^{1 \times C}\) indicates the predicted event categories that appear in each modality. We trust the event categories in \({\varvec{\hat{y}}}^m\) and use it to mask the matrix \({\varvec{\mathcal {M}}}^m\). There are two steps for masking the matrix \({\varvec{\mathcal {M}}}^m\). Step I: For event categories that do not occur in the video sample, the corresponding entries of \({\varvec{\mathcal {M}}}^m\) are set to zero. For the example shown in Fig. 2-2, we only need to denoise the pseudo labels of the three columns corresponding to the predicted event categories dog, rooster, and speech in \({\varvec{\hat{y}}}^m\). The masked matrix \({\varvec{\mathcal {M}'}}^m\) is computed by,

$$\begin{aligned} \begin{aligned}&{\varvec{\mathcal {M}'}}^m = f_{\text {rpt-T}}({\varvec{\hat{y}}}^m) \odot {\varvec{\mathcal {M}}}^m, \end{aligned} \end{aligned}$$
(10)

where \({\varvec{\mathcal {M}'}}^m \in \mathbb {R}^{T \times C}\), and \(f_{\text {rpt-T}}({\varvec{\hat{y}}}^m) \in \mathbb {R}^{T \times C}\) denotes the operation of repeating \({\varvec{\hat{y}}}^m\) T times along the temporal dimension.

Step II: Returning to the masked forward losses of all video segments for the j-th category, \({\varvec{\mathcal {M}'}}^m_{\cdot j} \in \mathbb {R}^{T \times 1}\), we treat the average of the top-K smallest loss values of \({\varvec{\mathcal {M}'}}^{m}_{\cdot j}\) as the threshold \(\mu _j^m\), i.e., the tolerable forward loss within a video sample. If the loss of some segments is abnormally larger than \(\mu _j^m\), they may have incorrect pseudo labels. Comparing the forward loss of each segment with the scaled threshold, we obtain a binary mask vector \({\varvec{\varphi }}_j^m \in \mathbb {R}^{T \times 1}\), where ‘1’ indicates that the segment’s loss exceeds \(\alpha \cdot \mu _j^m\). This process can be written as,

$$\begin{aligned} \left\{ \begin{aligned}&\mu _j^m = f_{\text {avg}}(f_{\Bbbk }({\varvec{\mathcal {M}'}}^m_{\cdot j})), \\&{\varvec{\varphi }}_j^m = \mathbbm {1}({\varvec{\mathcal {M}'}}^m_{\cdot j} - \alpha \cdot \mu _j^m), \end{aligned} \right. \end{aligned}$$
(11)

where \(f_{\Bbbk }\) and \(f_{\text {avg}}\) denote the top-K minimum loss selection and the averaging operation, respectively. Note that we set a scaling factor \(\alpha \) to magnify the averaged loss; it helps ensure that an anomalous loss is caused by incorrect pseudo labels rather than by data that is simply not well learned.

Extending Eq. 11 to all event categories, we obtain the binary mask matrix of the video, \({\varvec{\varPhi }}^m = \{ {\varvec{\varphi }}_j^m \} \in \mathbb {R}^{T \times C}\). Afterwards, the segment-level pseudo label \({\varvec{\hat{Y}}}^m\) produced by PLG is refined by reversing the positions with unusually large loss values indicated by \({\varvec{\varPhi }}^m\), denoted as \( {\varvec{\widetilde{Y}}}^m = f_{\sim }({\varvec{\hat{Y}}}^m, {\varvec{\varPhi }}^m)\). As shown in Fig. 2-2, for the dog event again, the visual pseudo labels generated by PLG are ‘0’ for the last two segments (indicating that there is no dog), which receive large forward losses (marked by the purple box in Fig. 2-3). This indicates that the visual pseudo labels of these two segments are incorrect (they actually contain the dog), and they are thus reversed during the denoising process. We display more examples of the pseudo label denoising process in Fig. 7. Finally, the pseudo labels refined by PLD are taken as new supervision for model training.
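The following is a minimal PyTorch sketch of the PLD masking and label flipping (Eqs. 10-11) for a single video and modality. The function name and the example K and \(\alpha \) values are illustrative, and batching over videos is omitted.

```python
import torch
import torch.nn.functional as F

def denoise_pseudo_labels(P, Y_hat, y_hat, K=5, alpha=30.0):
    """P: [T, C] predictions of a trained model; Y_hat: [T, C] PLG pseudo labels;
    y_hat: [C] video-level pseudo label. Returns refined labels Y_tilde, [T, C]."""
    # Element-wise forward BCE loss matrix M (no reduction).
    M = F.binary_cross_entropy(P.clamp(1e-6, 1 - 1e-6), Y_hat, reduction="none")
    # Step I (Eq. 10): keep only the categories predicted at the video level.
    M_masked = y_hat.unsqueeze(0) * M                      # [T, C]
    # Step II (Eq. 11): per-category threshold from the top-K smallest losses.
    K = min(K, M_masked.size(0))
    mu = M_masked.topk(K, dim=0, largest=False).values.mean(dim=0)   # [C]
    phi = (M_masked >= alpha * mu.unsqueeze(0)).float()    # segments flagged as noisy
    phi = phi * y_hat.unsqueeze(0)                         # never touch absent categories
    # Reverse (flip) the flagged pseudo labels.
    Y_tilde = torch.where(phi.bool(), 1.0 - Y_hat, Y_hat)
    return Y_tilde

# Toy usage (T=10, C=25).
y_hat = torch.randint(0, 2, (25,)).float()
Y_tilde = denoise_pseudo_labels(torch.rand(10, 25),
                                torch.randint(0, 2, (10, 25)).float() * y_hat, y_hat)
```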

5 Experiments

5.1 Experimental Setup

Dataset. Experiments for the AVVP task are conducted on the publicly available Look, Listen, and Parse (LLP) (Tian et al., 2020) dataset. It contains 11,849 videos spanning 25 common audio-visual categories, involving scenes such as humans, animals, vehicles, and musical instruments. Each video is 10 s long, and around 61% of the videos contain more than one event category. The videos of the LLP dataset are split into 10,000 for training, 649 for validation, and 1,200 for testing. The training set is provided with only the video-level labels, i.e., the label union of the audio events and visual events. For the validation and test sets, the segment-wise event labels of each audio and visual modality are additionally provided.

Evaluation metrics. Following existing works (Tian et al., 2020; Cheng et al., 2022; Wu & Yang, 2021; Yu et al., 2022), we evaluate our method by measuring the parsing results of all event types, namely audio events (A), visual events (V), and audio-visual events (AV, both audible and visible). The average parsing result over the three types is denoted as the “Type@AV” metric. Different from Type@AV, the “Event@AV” metric computes the F-score considering the predictions of the audio and visual events together. For the above event types, both segment-level and event-level F-scores are used as evaluation metrics. The segment-level metric measures the quality of the predicted events by comparing them with the ground truth for each video segment. The event-level metric treats consecutive segments containing the same event category as a whole event and computes the F-score using mIoU = 0.5 as the threshold. Therefore, the event-level F-score is more demanding because it requires the model to predict a satisfactory temporal boundary for each event.
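For intuition only, below is a simplified sketch of a micro-averaged segment-level F-score for one modality; the reported numbers in this paper come from the official LLP evaluation protocol, which is more involved (per-type metrics, event-level mIoU matching), so this snippet is merely illustrative.

```python
import torch

def segment_level_f1(pred, gt, thr=0.5):
    """pred, gt: [T, C] predicted probabilities and binary ground truth for one modality.
    A simplified micro F-score over all segment-category pairs."""
    p = (pred >= thr).float()
    tp = (p * gt).sum()
    fp = (p * (1 - gt)).sum()
    fn = ((1 - p) * gt).sum()
    return (2 * tp / (2 * tp + fp + fn + 1e-8)).item()

print(segment_level_f1(torch.rand(10, 25), torch.randint(0, 2, (10, 25)).float()))
```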

Implementation details. (1) Feature extraction. For the LLP dataset, each video is divided into 10 consecutive 1-second segments. For a fair comparison, we adopt the same feature extractors as prior works to extract the audio and visual features. Specifically, the VGGish (Hershey et al., 2017) network pretrained on the AudioSet (Gemmeke et al., 2017) dataset is used to extract the 128-dim audio features. The pretrained ResNet152 (He et al., 2016) and R(2+1)D (Tran et al., 2018) are used to extract the 2D and 3D visual features, respectively; the low-level visual feature is the concatenation of the 2D and 3D visual features. (2) Pseudo label preparation. For each video in the training set of the LLP dataset, we first generate the segment-wise visual and audio pseudo labels offline using our PLG module. We use the ViT-B/32-based CLIP (Radford et al., 2021) and the HTSAT-RoBERTa-based CLAP (Wu et al., 2023) to conduct the pseudo label generation, and their parameters are frozen. (3) Training procedure. The objective function \(\mathcal {L}_{\text {total}}\) shown in Eq. 9 is used to train the baseline model HAN (Tian et al., 2020). The hyperparameter \(\lambda \) in Eq. 9, which balances the video-level and segment-level losses, is empirically set to 0.5. This pretrained model is then used in our PLD to further refine the pseudo labels, and the refined pseudo labels are used to supervise the baseline model training again. For all training processes, we adopt the Adam optimizer with a mini-batch size of 32 and a learning rate of \(3 \times 10^{-4}\). The total number of training epochs is set to 30. All experiments are conducted with PyTorch (Paszke et al., 2019) on a single NVIDIA GeForce RTX 2080 Ti GPU. The codes, pseudo labels, and pretrained models will be released.

5.2 Parameter Studies

We perform parameter studies of the essential parameters used in our method, namely the score thresholds \(\tau _v\)/\(\tau _a\) and the text prompts for CLIP/CLAP used in the PLG module, and the top-K and scaling factor \(\alpha \) used in the PLD strategy. Experiments in this section are conducted on the validation set of the LLP dataset, for which the segment-level event labels are accessible. Thus, we also verify the quality of the pseudo labels through correctness measurements in this part.

Table 1 Parameter study of the threshold \(\tau _v\) and prompt used in the VISUAL pseudo label generation

Study of the thresholds and prompts in PLG. \(\tau _v\)/\(\tau _a\) is the threshold used to select high cosine-similarity scores between the event categories and the visual/audio segment in the mask calculation (Eq. 4). We first explore the impact of \(\tau _v\) on visual pseudo label generation. As shown in the upper part of Table 1, we use the default prompt VP1—“This photo contains the [CLS]” and test several values of \(\tau _v\) to generate visual pseudo labels. We then report the category precision between the pseudo labels and the ground truth at the video level, as well as the segment-level and event-level F-scores, to measure the quality of the generated pseudo labels. As shown in the table, the pseudo labels with the best quality are obtained when \(\tau _v = \) 0.041, and all the evaluation metrics drop significantly when \(\tau _v\) changes from 0.041 to 0.042. We attribute such sensitivity to the softmax operation in Eq. 3, which squeezes the similarity scores into small values (a uniform distribution over the C = 25 categories gives 0.04 per class, so usable thresholds lie around this value). The metrics for the visual modality are acceptable up to the threshold \(\tau _v\) = 0.041. Using the same experimental strategy, we explore the impact of the threshold \(\tau _a\) on audio pseudo label generation. The experimental results are shown in Table 2, and we find that the optimal audio pseudo labels are obtained when \(\tau _a\) equals 0.038.

Table 2 Parameter study of the threshold \(\tau _a\) and prompt used in the AUDIO pseudo label generation

Furthermore, we explore the impact of the prompts used in PLG. The prompts are combined with the event categories and sent as text inputs to the CLIP or CLAP text encoder. For visual pseudo label generation, we test four types of prompts, i.e., our default VP1—“This photo contains the [CLS]”, VP2—“This photo contains the scene of [CLS]”, VP3—“This photo contains the visual scene of [CLS]”, and VP4—“This is a photo of the [CLS]”. We use these different prompts to generate pseudo labels and compare them with the ground truth. As shown in the lower part of Table 1, the visual pseudo labels generated with these different prompts remain relatively consistent, and the pseudo labels reach the highest F-score with the VP1 prompt. Therefore, we use VP1 as the default prompt for visual pseudo label generation in the following experiments. Notably, the precision of the video-level visual pseudo labels reaches about 87% under the optimal setup, whereas the precision of directly assigning video labels as the visual event labels (i.e., without prompt) is only \(\sim \)67%. This reveals that PLG can satisfactorily disentangle visual events from weak video labels. For audio pseudo label generation, we test two prompts, i.e., AP1—“This is a sound of [CLS]” and AP2—“This sound contains the [CLS]”, to generate segment-level audio pseudo labels. We then report the segment-level and event-level F-scores of the audio events under different setups and use their average to select the best one. As shown in Table 2, the performance changes moderately under different setups, and the best performance is obtained when using the AP2 prompt with \(\tau _a\) equal to 0.038. We thereby use this optimal setup as the default for audio pseudo label generation. It is noteworthy that the event-level F-score is only around 64% if we simply assign the video labels to all the audio segments (without prompt), whereas this metric is around 72% for our generated audio pseudo labels. This reveals the vital role of segment-level event identification.

Table 3 Parameter study of the K and scaling factor \(\alpha \) used in the VISUAL pseudo label denoising

Study of the K and \(\alpha \) in PLD. For each predicted event category, the average of the top-K smallest forward losses along the temporal dimension is magnified by \(\alpha \) and used as the threshold to determine which segments’ pseudo labels should be refined (Eq. 11). The segment-level and event-level F-scores of the events are used to evaluate the quality of the denoised pseudo labels. For visual pseudo label denoising, the results in Table 3 indicate that the denoised visual pseudo labels are significantly better than the original labels generated by PLG; in particular, the event-level F-score is improved by 3.46%. Observing Table 3, the optimal setup is \(K = \) 5 and \(\alpha = \) 30. Under this setup, the segment-level and event-level F-scores of the visual pseudo labels on the validation set reach 72.99% and 68.28%, respectively. For audio pseudo label denoising, as shown in Table 4, the denoised audio pseudo labels are only slightly better than the pseudo labels generated by PLG under the optimal setup (\(K = \) 6, \(\alpha \) = 400). As discussed in Sect. 4.3, PLD is proposed to alleviate the potentially discontinuous pseudo event labels produced by PLG and to provide better temporal boundaries for the events. We argue that discontinuity of the audio event pseudo labels rarely occurs due to the temporal characteristics of audio data, which leads to the slight improvement for the audio modality shown in Table 4. Besides, from Tables 3 and 4, we observe an interesting phenomenon: the segment-level and event-level F-scores of the audio pseudo labels without PLD (80.32% and 71.54%) remain superior to those of the denoised visual pseudo labels (72.99% and 68.28%). This suggests the high quality of the audio pseudo labels generated by PLG and underscores the greater difficulty of denoising visual pseudo labels. We ultimately strike a balance between the additional computational cost and the denoising improvement and refrain from applying PLD to the audio modality in our experimental setup.

Table 4 Parameter study of the K and scaling factor \(\alpha \) used in the AUDIO pseudo label denoising
Table 5 Ablation study of the main modules
Fig. 3

Event-level F-scores of pseudo labels for each event category. a We display the event-level F-scores of audio and visual pseudo labels generated by PLG. b Compared to PLG, PLD further improves the event-level F-scores for most categories, providing more accurate visual pseudo labels. All the results are reported on the validation set of the LLP dataset

5.3 Ablation Studies

In this section, we provide some ablation studies to explore the impact of each module in our method. The experimental results are shown in Table 5. The row with id-\(\textcircled {1}\) denotes the performance of the baseline HAN (Tian et al., 2020).

Impact of the PLG. To further verify the benefits of PLG, we use the generated pseudo labels to supervise the model training. Note that the vanilla HAN (id-\(\textcircled {1}\) in Table 5) is trained with the video-level pseudo labels obtained by applying label smoothing to the given weak label (Eq. 2). For a fair comparison, we first use only the video-level pseudo labels generated by PLG as the model supervision (Eq. 5). As shown in row-\(\textcircled {2}\) of Table 5, utilizing the video-level pseudo labels generated by our PLG significantly improves the visual event parsing performance: the visual metric (V) increases from 52.9 to 64.1% at the segment level and from 48.9 to 60.2% at the event level. These improvements indicate that our PLG generates more accurate video-level pseudo labels for the visual modality, better distinguishing the event categories and guiding the model training. The improvement in audio event parsing is not pronounced in this situation. We conjecture that the temporally continuous audio segments are more challenging to distinguish under weak video-level supervision, and that the visual features encapsulate more distinct event semantics, thereby promoting model optimization that is more beneficial to the visual modality. Even so, utilizing the more fine-grained, segment-level pseudo labels generated by our PLG (see ids \(\textcircled {3}\) and \(\textcircled {4}\) in Table 5) significantly enhances both the audio and visual event parsing performance.

Our PLG is able to generate high-quality pseudo labels at the segment level, which is verified by the results shown in Tables 1 and 2. In Fig. 3a, we further display the event-level F-scores of the generated audio and visual pseudo labels for each event category and discuss them in more detail. As seen, the audio and visual pseudo labels have satisfactory F-scores for most categories; the highest F-scores are 93.5% for the audio event Accordion and 91.7% for the visual event Blender. We also find that each modality faces some intractable event categories, such as speech for the visual modality and cat for the audio modality. We attribute this to the imbalanced data distribution and to the fact that some categories, such as speech, cheering, and clapping, are particularly difficult to recognize visually. Nevertheless, our PLG generally provides reliable audio and visual pseudo labels at both the video level and the segment level, ensuring better model learning.

Impact of the PLE. Our PLE uses the proposed richness-aware loss \(\mathcal {L_S}\) in Eq. 8 to exploit the pseudo labels at the segment level, complementing the video-level supervision. We first ablate the effect of each richness component. In Table 6, “\(\mathcal{S}\mathcal{R}\)” and “\(\mathcal{C}\mathcal{R}\)” denote the segment richness loss and the category richness loss between the predictions and pseudo labels, respectively. Table 6 shows that each component effectively improves the model performance, since the studied AVVP task requires distinguishing both the video segments and the event categories. When both types of richness information are used, the pseudo labels are exploited most effectively for model optimization. To further validate its superiority, we compare the richness-aware loss with a naive variant that directly computes the binary cross-entropy loss between the predictions and the pseudo labels, denoted as \(\mathcal {L'_S} = \mathcal {L}_{\text {bce}}({\varvec{P}}^v, \hat{{\varvec{Y}}}^v) + \mathcal {L}_{\text {bce}}({\varvec{P}}^a, \hat{{\varvec{Y}}}^a)\). As shown in rows \(\textcircled {3}\) and \(\textcircled {4}\) of Table 5, both \({\mathcal {L}}'_{\mathcal {S}}\) and the proposed \({\mathcal {L}}_{\mathcal {S}}\) benefit audible video parsing since both provide segment-level supervision. Nevertheless, the proposed richness-aware loss is more helpful. The conventional cross-entropy loss relies on ‘hard’ segment-wise alignments between predictions and pseudo labels. In contrast, our richness-aware loss exploits the pseudo labels by aligning predictions along two independent dimensions: category richness (cr) and segment richness (sr). According to the definitions of cr (Eq. 6) and sr (Eq. 7), their values are expressed as percentages (‘soft’ ratios) and are independent. This design enables the model trained with our richness-aware loss to automatically balance and exploit the soft supervision from category richness and segment richness. The experimental results in Table 6 indicate the superiority of this flexible design.
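As a rough illustration of the two richness terms, the sketch below treats category richness as the fraction of active categories per segment and segment richness as the fraction of segments containing each category, and aligns the prediction-side and pseudo-label-side ratios with an L1 penalty. The exact definitions and loss form are those of Eqs. 6–8; the L1 alignment and all identifiers here are assumptions made for illustration only.

```python
import torch

def richness_aware_loss_sketch(pred, pseudo):
    """Hedged sketch of a richness-aware loss (cf. Eqs. 6-8), not the paper's exact formulation.

    pred:   (T, C) segment-level event probabilities
    pseudo: (T, C) binary segment-level pseudo labels
    """
    T, C = pred.shape
    cr_pred, cr_pl = pred.sum(dim=1) / C, pseudo.sum(dim=1) / C   # category richness per segment ('soft' ratios)
    sr_pred, sr_pl = pred.sum(dim=0) / T, pseudo.sum(dim=0) / T   # segment richness per category ('soft' ratios)
    return (cr_pred - cr_pl).abs().mean() + (sr_pred - sr_pl).abs().mean()
```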

Table 6 Richness-aware loss \(\mathcal {L_S}\) under different configurations
Table 7 Comparison with the state-of-the-arts
Table 8 Generalization of our method on other audio-visual video parsing backbones
Table 9 Generalization of our method on the weakly-supervised audio-visual event localization task

Impact of the PLD. The impact of PLD can be observed from two aspects. On the one hand, PLD provides more accurate pseudo labels than PLG. As measured on the validation set in Table 3, the average F-score of the visual pseudo labels is 67.95% for PLG and 70.63% for PLD. In Fig. 3b, we show the event-level F-scores of the visual pseudo labels obtained by PLG and PLD for each event category. PLD further improves the F-scores for most categories (18/25); e.g., the scores for the events Fire alarm and Blender increase by 18.2% and 6.8%, respectively. On the other hand, the visual pseudo labels generated by PLD are more helpful for model training than those generated by PLG. We use the denoised visual pseudo labels as the new supervision to train the HAN model. As shown in row-\(\textcircled {5}\) of Table 5, the model achieves superior performance on all types of event parsing. This again reveals that the visual pseudo labels obtained by PLD are more accurate than those from PLG and can better supervise the multi-modal parsing model. These results verify the effectiveness of the label denoising strategy in PLD.

Fig. 4 Qualitative examples for the weakly-supervised audio-visual event localization task. This task aims to temporally locate those segments containing events that are both audible and visible. The previous state-of-the-art method, CMBS (Xia & Zhao, 2022), utilizes only the video-level weak labels for model training and predictions. In contrast, our method can generate high-quality segment-level pseudo labels, offering fine-grained supervision during training and producing more accurate localization results. “GT” denotes the ground truth. “PL-A” and “PL-V” represent our segment-level pseudo labels for the audio and visual modalities, respectively. The audio-visual event pseudo labels (“PL-AV”) result from the intersection of “PL-A” and “PL-V”. Our method surpasses the vanilla CMBS model in distinguishing between the background and audio-visual events (a) as well as among different audio-visual event categories (b)

5.4 Comparison with the State-of-the-Arts

We report the performance of our VAPLAN on the test set of the LLP dataset. The comparison with existing methods is shown in Table 7. Our method achieves superior performance on all types of event parsing. First, our method significantly improves over the baseline HAN (Tian et al., 2020), on which it is built. Especially for visual event parsing (V in the table), the segment-level metric is lifted from 52.9% to 66.7% (\(\uparrow \) 13.8%), and the event-level metric is improved from 48.9% to 63.3% (\(\uparrow \) 14.4%). Second, our method outperforms other competitors on the track of generating pseudo labels for the AVVP task. As shown in the lower part of Table 7, our method generally exceeds the previous state-of-the-art JoMoLD (Cheng et al., 2022) by about 1.5 points for audio event parsing and around 3 points for visual event and audio-visual event parsing. Both JoMoLD (Cheng et al., 2022) and MA (Wu & Yang, 2021) generate audio-visual pseudo labels at the video level, while our method provides audio-visual pseudo labels at the more fine-grained segment level; consequently, our video parsing model is better supervised and optimized, resulting in better performance. Furthermore, we report the result of our method using visual and audio features extracted by CLIP and CLAP, respectively. As shown in the last row of Table 7, all types of event parsing performance are further significantly improved. In particular, audio event parsing benefits more from such advanced feature representations, improving by 6.6% and 6.2% in the segment-level and event-level F-scores, respectively. These improvements demonstrate the effectiveness and superiority of our method.

Fig. 5 Qualitative examples of audio-visual video parsing by different methods. We compare our method with HAN (Tian et al., 2020), MA (Wu & Yang, 2021), and JoMoLD (Cheng et al., 2022). “GT” denotes the ground truth. Our method successfully recognizes that there is only one visual event, violin in (a) and basketball bounce in (b). Our method also parses the audio events and audio-visual events more accurately, providing better temporal boundaries of the events

5.5 Generalization of Our Method

Generalization on other AVVP backbones. A core contribution of our method is that it provides high-quality segment-level audio and visual pseudo labels, which in turn better guide model optimization. Our method can also be applied to other existing backbones for the AVVP task. To explore this, we examine two recently proposed networks, MGN (Mo & Tian, 2022) and MM-Pyramid (Yu et al., 2022). Specifically, we train the models using the pseudo labels generated by our PLG and refined by our PLD, respectively. The experimental results are shown in Table 8. Both PLG and PLD significantly boost the vanilla models, especially for visual event and audio-visual event parsing. Taking MM-Pyramid (Yu et al., 2022) as an example, the segment-level visual event parsing performance is improved from 54.4% to 65.4% with our PLG and to 66.4% with our PLD; PLD is superior due to its additional label denoising strategy. Similar improvements are observed for MGN (Mo & Tian, 2022). It is also worth noting that these two backbones perform better with our (segment-level) pseudo labels than with the (video-level) pseudo labels generated by the previous method MA (Wu & Yang, 2021). These results again indicate that our method provides better fine-grained pseudo labels and demonstrate its superiority and generalizability.
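One natural way to plug such pseudo labels into an arbitrary AVVP backbone, sketched below under that assumption, is to keep the backbone’s original video-level objective and add a segment-level term computed against the pseudo labels. The generic backbone interface, the weighting factor lambda_s, and the plain BCE segment loss (standing in for the richness-aware loss) are placeholders rather than the exact training recipe used in our experiments.

```python
import torch.nn.functional as F

def train_step(backbone, optimizer, feats_a, feats_v, video_label, seg_pl_a, seg_pl_v, lambda_s=1.0):
    """Schematic training step for an AVVP backbone (e.g., HAN, MGN, MM-Pyramid)
    augmented with segment-level pseudo labels; the interface and loss form are assumptions."""
    p_a, p_v, p_video = backbone(feats_a, feats_v)            # segment-level and video-level predictions
    loss_weak = F.binary_cross_entropy(p_video, video_label)  # original weak (video-level) supervision
    loss_seg = (F.binary_cross_entropy(p_a, seg_pl_a)         # added pseudo-label supervision
                + F.binary_cross_entropy(p_v, seg_pl_v))
    loss = loss_weak + lambda_s * loss_seg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```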

Generalization on the AVEL task. We also extend our pseudo label generation strategy to the related audio-visual event localization (AVEL) task. We explore the challenging weakly-supervised setting, where the model needs to localize the video segments containing audio-visual events (events that are both audible and visible) given only the video-level event category label. Previous AVEL methods merely use the known video-level labels as the training objective. Here, we generate segment-level pseudo labels for this task as we did for the weakly-supervised AVVP task: the pretrained CLIP and CLAP models produce segment-level visual and audio pseudo labels, respectively, and the audio-visual event pseudo labels are their intersection. In this way, we know whether an audio-visual event exists in each video segment, and such segment-level pseudo labels can serve as a new objective to supervise model training. We test three representative audio-visual event localization methods whose official code is available, namely AVEL (Tian et al., 2018), PSP (Zhou et al., 2021), and CMBS (Xia & Zhao, 2022). We conduct experiments on the corresponding AVE (Tian et al., 2018) dataset, and the results are shown in Table 9. The second column shows the performance of the vanilla models with only video-level supervision; the last column shows that these models are significantly improved, by around 2 points, when using our segment-level pseudo labels.
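Since the audio-visual event pseudo labels are defined above as the intersection of the audio and visual ones, this step reduces to an element-wise product of the two binary label maps; the brief sketch below illustrates it, with hypothetical identifiers.

```python
def avel_pseudo_labels(seg_pl_audio, seg_pl_visual):
    """Audio-visual event pseudo labels for weakly-supervised AVEL.

    seg_pl_audio, seg_pl_visual: (T, C) binary segment-level pseudo labels
    obtained with CLAP and CLIP, respectively (as in PLG).
    """
    return seg_pl_audio * seg_pl_visual   # an event is kept only where it is both audible and visible
```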

Fig. 6 Typical and challenging visualization examples of the generated audio and visual pseudo labels. “\(\textcircled {1}\)” and “\(\textcircled {2}\)” denote the ground truth and the obtained pseudo labels, respectively. a In these typical cases where the events are clearly represented in the audio and visual signals, our method can generate accurate segment-level pseudo labels. We also display some challenging examples: the audio event is mixed with other sounds (b) or the visual event is hard to perceive (c). In general, our method can provide satisfactory audio and visual pseudo labels

Fig. 7 Qualitative visualization examples of the pseudo label denoising. Here, we take the visual modality as an example since it faces more challenges in both the pseudo label generation and denoising processes. “GT” denotes the ground truth. “F-loss” represents the forward loss between the model predictions and the pseudo labels generated by PLG (Eq. 10). PLG basically disentangles the visual event(s) from the weak video label, yielding well-defined segment-wise event categories. PLD further alleviates potential label noise for those segments in the same video whose PLG-generated pseudo labels incur abnormally large forward loss values. The improved labels are highlighted by the dotted box

We also present some qualitative examples for a more intuitive comparison. As shown in Fig. 4a, the audio-visual event church bell occurs exclusively in the last five video segments. The previous state-of-the-art method, CMBS, incorrectly predicts this event as present in the first five segments as well, whereas our method yields accurate localization results. The reason is that the vanilla CMBS relies solely on the known weak event label (video-level) to supervise model training, while our method can generate high-quality pseudo labels at the segment level. In the lower part of Fig. 4a, we illustrate our pseudo label generation process. Our method accurately identifies that the church bell event exists in all the visual segments but only in the last five audio segments, which yields a precise audio-visual event pseudo label and in turn better supervises model training and prediction. Similar benefits can be observed in Fig. 4b: the vanilla CMBS incorrectly classifies the audio-visual event guitar as ukulele, while our method generates accurate segment-level pseudo labels and thereby ensures superior predictions. These results again verify the generalizability of our method, and we believe it can also help address other related audio-visual tasks that lack fine-grained supervision.

5.6 Qualitative Results on the AVVP Task

Visualization examples of the audio-visual video parsing. We first display some qualitative video parsing examples in Fig. 5. We compare our method with HAN (Tian et al., 2020), MA (Wu & Yang, 2021), and JoMoLD (Cheng et al., 2022). Both MA and JoMoLD are built on HAN and generate video-level pseudo labels for better model training. As shown in Fig. 5a, two events exist in the video, i.e., speech and violin, while the visual track only contains violin. For audio event parsing, although all methods correctly recognize the two events occurring in the audio track, our method locates more precise temporal segments. Our method also accurately recognizes the visual event violin and provides superior audio-visual event parsing. In Fig. 5b, both the events speech and basketball bounce exist in the video. All methods miss the audio event speech, possibly because it occurs only in the second segment and the audio signal is mixed with outdoor noise, making it hard to distinguish. For visual and audio-visual event parsing, only our method provides a satisfactory prediction for the basketball bounce event. Although our method identifies the third segment as containing this event, which differs from the ground truth, we argue that this may be an annotation mistake: the basketball player is actually clearer in this segment than in the second one. If so, our result is the more accurate one. These video samples demonstrate the superiority of our method, which leverages high-quality segment-level pseudo labels to better supervise model training.

Visualization examples of the obtained pseudo labels. In this part, we display the pseudo labels of some typical and challenging video samples. Our method is able to provide high-quality segment-level audio and visual pseudo labels. As shown in Fig. 6a, the baby cry event is clearly represented in the video, and our method successfully recognizes it in both the audio and visual tracks; the temporal boundaries of the generated pseudo labels closely match the ground truth. Our method performs well in similar cases with explicit audio and visual event signals. Turning to Fig. 6b, our method generates accurate pseudo labels for the visual event frying food and the audio event speech, but the audio event frying food in the eighth segment is not identified. The difficulty is that the sound of frying food is mixed with the louder sound of speech, causing this event to be missed; such compound audio classification remains a challenging problem in the community. In Fig. 6c, our method satisfactorily generates segment-level pseudo labels for all the audio events but fails to recognize the visual event dog. The dog in the visual frames is too small (located around the man’s feet in the figure) to be identified, a case that is hard to judge even for a human annotator. The pseudo labels could be further improved in the future by adopting more specialized techniques for these challenging cases. Nevertheless, our method can generally provide reliable segment-level pseudo labels.

Visualization of the pseudo label denoising. Figure 7 presents two visualization examples that reflect the impact of pseudo label denoising. Here, we take the more challenging visual pseudo label denoising as an example. As shown in Fig. 7a, the video-level label contains the events speech and cat, where speech does not exist in the visual modality. PLG successfully recognizes that only the cat event happens in the visual track. However, since the object is too blurry in the first two segments, the cat event is incorrectly labeled there. As a result, the forward loss values for these two segments are significantly larger, roughly 300 to 400 times those of the other segments, as shown in Fig. 7a. Thanks to the proposed pseudo label denoising (PLD) strategy, these labels are corrected. In Fig. 7b, there are no visual events, yet PLG mistakenly classifies a few segments as the event clapping because the player’s movements are complex in these segments. This inaccuracy is again evident from the abnormally high forward losses, and PLD rectifies these erroneous pseudo labels. In summary, the pseudo labels generated by PLG rely on the prior knowledge of event categories from the pretrained CLIP, while PLD benefits from an additional revision process, namely the joint examination of the predictions and pseudo labels through the forward loss computed within each video, to correct inaccurate segment-level pseudo labels from PLG.

6 Conclusion

We propose a Visual-Audio Pseudo LAbel exploratioN (VAPLAN) method for the weakly-supervised audio-visual video parsing task. VAPLAN is a new attempt to generate segment-level pseudo labels in this field. It starts with a pseudo label generation module that uses the reliable CLIP and CLAP models to determine the visual and audio events occurring in each segment of the respective modality, yielding segment-level pseudo labels. We then exploit the category richness and segment richness contained in the pseudo labels and propose a new richness-aware loss as fine-grained supervision for the AVVP task. Furthermore, we propose a pseudo label denoising strategy to refine the visual pseudo labels and better guide the predictions. Qualitative and quantitative experimental results on the LLP dataset corroborate that our method can effectively generate and exploit high-quality segment-level pseudo labels, and all the proposed techniques can be directly adopted by the community. We also extend our method to the related weakly-supervised audio-visual event localization task, and the experimental results verify its effectiveness and generalizability. We believe this work will not only facilitate future research on the studied audio-visual video parsing task but also inspire other related audio-visual topics seeking better supervision.