1 Introduction

Sound source separation [1,2,3,4] is a classical audio processing problem, where the objective is to recover the original component signals from a given audio mixture. A well-known example of such a task is the cocktail party problem, where multiple people are talking simultaneously (e.g. at a cocktail party) and the observer attempts to follow one of the discussions. The general form of the problem is challenging and highly underdetermined. Fortunately, one can often leverage additional constraints from external cues, such as vision. For instance, the cocktail party problem becomes more tractable by observing the lip movements of the speakers [5]. Similar visual cues have also been applied in other sound separation tasks [6,7,8,9,10,11,12]. This type of problem setup is often referred to as visually guided sound source separation (see e.g. Fig. 1).

Besides separating the component signals from the mixture, one is often interested in identifying the source location. Such a task would be intractable from a single audio channel, but can be approached using e.g. microphone arrays [13]. Alternatively, the sound source location can be determined from the visual data [14, 15], which is more often available.

This paper proposes a new approach for visually guided sound source separation and localisation. Our system (Fig. 2), referred to as Cascaded Opponent Filter (COF), consists of an initial separation stage and one or more subsequent cascaded Opponent Filter (OF) modules (Fig. 4). The OF module utilises visual cues from all videos to reconstruct each component audio. This is in contrast to most previous works (e.g. [8, 9]), where the separation is done based only on the corresponding video. The OF module is very light, containing only 17 parameters in our case, and we show that it greatly improves the sound separation performance over the recent single-stage systems [8, 9] and the recursive method [10].

Moreover, since motion is strongly correlated with sound formation [9], we build our system on both appearance and motion representations. To this end, we examine multiple options based on video frames, optical flows, dynamic images [16], and their combinations. Finally, we introduce a Sound Source Location Masking (SSLM) network that, in conjunction with COF, is able to pinpoint a pixel-level segmentation of the sound source location. Qualitative results indicate sharper and more accurate localisation compared to the baselines [8,9,10]. The entire system is trained in a self-supervised setup with a large set of unlabelled videos.

Fig. 1. Visually guided sound source separation aims at splitting the input mixture (column (a)) into component signals corresponding to the given visual cues (column (b)). The proposed COF approach yields better separation performance than the baseline methods SoP [8], SoM [9], and MP-Net [10] on the MUSIC dataset [8].

2 Related Work

Cross-modal Learning from Audio and Vision. Aytar et al. [17] presented a method for learning joint audio-visual embeddings by minimizing the KL-divergence between their representations. Owens et al. [18] proposed a synchronization-based cross-modal approach for visual representation learning. Arandjelovic et al. [19, 20] associated the learnt audio and visual embeddings by asking whether they originate from the same video. Nagrani et al. [21] learned to identify face and voice correspondences. More recent works include transferring mono to binaural audio using visual features [11], audio-video deep clustering [22], talking face generation [23], audio-driven 3D facial animation prediction [24], vehicle tracking with stereo sound [25], visual-to-auditory conversion [26, 27], audio-visual navigation [28, 29], and speech embedding disentanglement [30]. Unlike these works, (visually guided) sound source separation aims at splitting the input audio into its original component signals.

Video Sequence Representations. Most early works on video representations were largely based on direct extensions of image-based models [31,32,33]. More recently, these have been replaced by deep learning alternatives operating on stacks of consecutive video frames. These works can be roughly divided into the following categories: 1) 3D CNNs applied on the spatio-temporal video volume [34]; 2) two-stream CNNs [35,36,37] applied on video frames and separately computed optical flow frames; 3) LSTM [38], graph CNN [39] and attention cluster [40] based techniques; and 4) 2D CNNs with the concept of dynamic images [16]. Since most of these methods are proposed for the action recognition problem, it is unclear which representation is best suited for self-supervised sound source separation. Therefore, this paper evaluates multiple options and discusses their pros and cons.

(Visually Guided) Sound Source Separation. The sound source separation task is extensively studied in the audio processing community. Early works were mainly based on probabilistic models [1,2,3,4], while recent methods utilise deep learning architectures [41,42,43,44]. Despite the substantial improvements, purely audio-based source separation remains a challenging task. At the same time, visually guided sound source separation has gained increasing attention. Ephrat et al. [5] extracted face embeddings to facilitate speech separation. Similarly, Gao et al. [6, 12] utilised object detection and category information to guide source separation. Gan et al. [45] associated body and finger movements with audio signals by learning a keypoint-based structured representation. While impressive, these methods rely on external knowledge of the video content (e.g. speaking faces, object types, or keypoints).

The works by Zhao et al. [8, 9] and Xu et al. [10] are most related to ours. In [8], the input spectrogram is split into components using a U-Net [46] architecture and the separated outputs are constructed as linear combinations of these. The mixing coefficients are estimated by applying a dilated ResNet to the keyframes representing the sources. The subsequent work [9] introduced motion features and improvements to the output spectrogram prediction. Both of these methods operate in a single-stage manner, directly predicting the final output. Alternatively, Xu et al. [10] proposed to separate sounds by recursively removing large energy components from the sound mixture. Our work explores multiple approaches to utilising appearance and motion information to refine the sound source separation over multiple stages. The proposed Opponent Filter uses the visual features of a sound source to look for incorrectly assigned sound components in the opponent sources, resulting in accurate sound separation.

Sound Source Localization. Early work by Hershey et al. [47] localised sound sources by modelling the audio-visual synchrony as a non-stationary Gaussian process. Barzelay et al. [48] applied cross-modal association and visual localisation based on temporal coincidences. Based on canonical correlations, Kidron et al. [49] localised visual events associated with sound sources. Recently, Senocak et al. [50] learned to localise sound sources in visual scenes by transferring sound-guided visual concepts to a sound context vector. Arandjelovic et al. [20] obtained locations by comparing visual and audio embeddings over a coarse grid. Class activation maps were used in [7, 51]. Gao et al. [12] localised potential sound sources via a separate object detector. Rouditchenko et al. [52] segmented visual objects by leveraging the sound separation task. Zhao et al. [8, 9] and Xu et al. [10] visualise the sound sources by calculating the sound volume at each spatial location. In contrast to these methods, which either produce coarse sound locations or rely on external knowledge, we propose a self-supervised SSLM network to localise sound sources at the pixel level.

Fig. 2. Architecture of the proposed Cascaded Opponent Filter (COF) network. COF operates in multiple stages: in the first stage, visual representations (vision network) and sound features (sound network) are passed to the sound separator, which produces a binary mask \(\hat{b}\) (Eqs. (1), (2)) for each output source. Stage two refines the separation result \(\hat{Y}\) using the Opponent Filter (OF) module guided by the visual cues. Later stages are identical to the second stage with its OF module. The sound networks share parameters only if they are in the same stage. The vision networks share parameters (within and across stages) if they have the same architecture.

3 Method

This section describes the proposed visually guided sound source separation method. We start with a short overview and then describe each component in detail.

3.1 Overview

The inputs to our system consist of a mixture audio (e.g. a band playing) and a set of videos, each representing one component of the mixture (e.g. a person playing a guitar). The objective of the system is to recover the component sound signals corresponding to each video sequence. Figure 2 illustrates an overview of the approach. Note that the audio signals are represented as spectrograms, which are obtained from the audio stream using the short-time Fourier transform (STFT).

The proposed system consists of multiple cascaded stages. The first stage contains three components: 1) a sound network that splits the input spectrogram into a set of feature maps; 2) a vision network that converts the input video sequences into compact representations; and 3) a sound separator that produces spectrum masks (not shown in Fig. 2) of the component audios (one per video) based on the outputs of the sound and vision networks.

The second stage contains sound and vision networks similar to those of the first stage (their internal details may differ). However, instead of the sound separator, the second stage contains a special Opponent Filter (OF) module, which enhances the separation result by transferring sound components between the sources. The output of the filter is passed to the next stage or used as the final output. The following stages are identical to the second one and, for this reason, we refer to our method as the Cascaded Opponent Filter (COF) network. The final component audios are produced by applying the inverse STFT to the predicted component spectrograms.

In addition, we propose a new Sound Source Location Masking (SSLM) network (not shown in Fig. 2) that indicates the pixels with the highest impact on the sound source separation (i.e. the source location). The entire network is trained in an end-to-end fashion using artificially generated examples. That is, we take two or more videos and create an artificial mixture by summing the corresponding audio tracks. The created mixture and video frames are provided to the system, which then has to reproduce the original component audios. In the following sections, we present each component in more detail and provide the learning objective used in the training phase.

Fig. 3. Architecture of (a) MA(C2D-RGB, C3D-RGB), (b) MA(C2D-RGB, C3D-FLO), (c) C2D-DYN, and (d) MA: Mutual Attention module.

3.2 Vision Network

The vision network aims at converting the input video sequence (or keyframe) into a compact representation that contains the necessary information about the sound source. Sometimes the pure appearance of the source (e.g. the instrument type) may already be sufficient, but in most cases motion provides vital cues that facilitate the separation (e.g. hand motion, mouth motion, etc.). A suitable representation may, however, have a high model and computation complexity, and to seek a balance between complexity and performance we study several visual representation options. The models are introduced in the following and the detailed network architectures are provided in the supplementary material. In all cases, we assume that the input video sequence is of size \(\textit{3}\times \textit{16H}\times \textit{16W}\) and has \(\textit{T}\) frames.

The first option, referred to as C2D-RGB, is a pure appearance-based representation. It is obtained by applying a dilated ResNet18 [53] to a single keyframe extracted from the sequence. More specifically, given an input RGB image of size \(\textit{3}\times \textit{16H}\times \textit{16W}\), C2D-RGB produces a representation of size \(\textit{K}\times \textit{H}\times \textit{W}\). The dynamic image [16] is a compact representation which summarises the appearance and motion of the entire video sequence into a single RGB image by rank pooling the original pixel data. The second option, referred to as C2D-DYN, first converts the input video into a dynamic image (size \(\textit{3}\times \textit{16H}\times \textit{16W}\)) and then applies a dilated ResNet18 [53] to produce a representation of size \(\textit{K}\times \textit{H}\times \textit{W}\). Figure 3c illustrates the C2D-DYN option.
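As a concrete illustration, the dynamic image can be computed with the approximate rank pooling described in [16]: each frame is weighted by a coefficient derived from harmonic numbers and the weighted frames are summed. The sketch below follows that formulation; the final rescaling to an 8-bit image is our own assumption rather than a detail taken from the paper.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a (T, H, W, 3) frame stack into one RGB-like image via
    approximate rank pooling [16].  The per-frame weights are
    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), with harmonic numbers
    H_t = sum_{i=1..t} 1/i, so later frames receive larger weights and the
    temporal order of the clip is encoded in a single image."""
    T = frames.shape[0]
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    d = np.tensordot(alpha, frames.astype(np.float32), axes=(0, 0))  # (H, W, 3)
    # Rescale to [0, 255] so the result can be fed to a standard 2D CNN
    # (this normalisation is an assumption, not specified in the paper).
    d = 255.0 * (d - d.min()) / (d.max() - d.min() + 1e-8)
    return d.astype(np.uint8)
```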

The third option, referred to as C3D-RGB, applies a 3D CNN to extract appearance and motion information from the sequence simultaneously. C3D-RGB uses a 3D version of ResNet18 and produces a representation of size \({\textit{T}}^{'}\times \textit{K}\times \textit{H}\times \textit{W}\). Optical flow [35, 54, 55] explicitly describes the motion between video frames. The fourth option, referred to as C3D-FLO, first estimates the optical flow between consecutive video frames using LiteFlowNet [55], and then applies a 3D ResNet18 to the obtained flow sequence. C3D-FLO produces a representation of size \({\textit{T}}^{'}\times \textit{K}\times \textit{H}\times \textit{W}\).

In addition, following recent work [36] in action recognition, we propose a set of two-stream options that combine pairs of the C2D-RGB, C3D-RGB, and C3D-FLO representations using a Mutual Attention (MA) module. The module is depicted in Fig. 3d. It enhances the motions relevant to the sound source and suppresses motion-irrelevant appearance through a mutual attention mechanism. The two-stream structures produce mutually attentive features of dimension \({\textit{T}}^{'}\times \textit{K}\times \textit{H}\times \textit{W}\), which we refer to as MA(C2D-RGB, C3D-RGB) and MA(C2D-RGB, C3D-FLO). Figure 3a and 3b illustrate these options. We omit the two-3D-stream model MA(C3D-RGB, C3D-FLO) due to the large size of the resulting model.
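The exact MA architecture is specified in the supplementary material; the sketch below only illustrates the general idea of mutual cross-gating between a 2D appearance stream and a 3D motion stream. The layer choices (1×1 convolutions and sigmoid gates) and the fusion rule are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossGating(nn.Module):
    """Illustrative mutual-attention block (the MA design used in the paper is
    given in its supplementary material).  Each stream gates the other with a
    sigmoid attention map, so appearance features can suppress sound-irrelevant
    motion and motion features can suppress static clutter."""
    def __init__(self, k=16):
        super().__init__()
        self.att_from_app = nn.Conv2d(k, k, kernel_size=1)  # appearance -> gate for motion
        self.att_from_mot = nn.Conv3d(k, k, kernel_size=1)  # motion -> gate for appearance

    def forward(self, app, mot):
        # app: (B, K, H, W) appearance features; mot: (B, K, T', H, W) motion features.
        gate_m = torch.sigmoid(self.att_from_app(app)).unsqueeze(2)  # (B, K, 1, H, W)
        gate_a = torch.sigmoid(self.att_from_mot(mot))               # (B, K, T', H, W)
        fused = mot * gate_m + app.unsqueeze(2) * gate_a             # (B, K, T', H, W)
        return fused
```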

3.3 Sound Network

The sound network splits the input audio spectrogram into a set of feature maps. The network is implemented using a U-Net [46] architecture and converts the input spectrogram of size \(\textit{HS}\times \textit{WS}\) into an output of size \(\textit{HS}\times \textit{WS}\times \textit{K}\). Note that the number of produced feature maps K is equal to the visual feature dimension K in the previous section. At the first stage, the input to the sound network is the original mixture spectrogram \(\textit{X}_{\textit{mix}}\), while in later stages the sound network operates on the current estimates of the component spectrograms. This allows the stages to focus on different details of the spectrogram. In the following, we denote the kth feature map produced by the sound network for an input spectrogram \(\textit{X}\) as \(S(X)_{k}\).
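For intuition, a drastically simplified stand-in for the sound network is sketched below; it only demonstrates the \(\textit{HS}\times \textit{WS} \rightarrow \textit{K}\times \textit{HS}\times \textit{WS}\) mapping with a single skip connection, whereas the actual network is a full U-Net [46] whose depth and channel widths are given in the supplementary material.

```python
import torch
import torch.nn as nn

class TinySoundUNet(nn.Module):
    """Illustrative, much shallower stand-in for the sound network: it maps a
    1 x HS x WS spectrogram to K feature maps of the same spatial size.
    The depth and channel widths here are placeholders, not the paper's."""
    def __init__(self, k=16, ch=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(2 * ch, k, 1)   # skip concatenation -> K feature maps

    def forward(self, x):                    # x: (B, 1, HS, WS)
        e = self.enc(x)                      # (B, ch, HS, WS)
        d = self.up(self.down(e))            # (B, ch, HS, WS)
        return self.out(torch.cat([e, d], dim=1))   # (B, K, HS, WS)
```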

Fig. 4. An illustration of the Opponent Filter (OF) module at stage j in the case of two sound sources. The input consists of the visual representation \(\mathbf {z}\) and the previous spectrum mask \([g]_{j-1}\) for both sources. First, we obtain the spectrograms \(\hat{Y}\) for both sources from the spectrum masks (Eq. (2)). Second, the spectrograms are turned into feature maps F with the sound network (Sect. 3.3). Third, the visual representation \(\mathbf {z}_2\) and the feature map \(F_1\) are used to identify components of source 1 that should belong to source 2 (\(r_{1->2}\) in the figure). The spectrum masks are updated accordingly by subtracting from \([g_1]_{j-1}\) and adding to \([g_2]_{j-1}\). A similar operation is performed for source 2. Finally, the updated spectrum masks \([g_1]_{j}\) and \([g_2]_{j}\) are passed to the next stage.

3.4 Sound Separator

The sound separator combines the visual representations with the sound network output and produces an estimate of the component signals. First, we apply a global max pooling operation over the spatial dimensions (\(\textit{H}\times \textit{W}\)) of the visual representation. For the 3D CNN-based options, we additionally apply max pooling along the temporal dimension (\(\textit{T}^{'}\)). As a result, we obtain a feature vector \(\mathbf {z}\) with \(\textit{K}\) elements. We combine \(\mathbf {z}\) with the sound network output using a linear combination to predict the spectrum mask g, as in Eq. (1):

$$\begin{aligned} \textit{g}(\mathbf {z}, X) = \sum ^K_{k=1} \alpha _{k} \, \mathbf {z_{k}} * S(X)_{k} + \beta , \end{aligned}$$
(1)

where \(\alpha _{\textit{k}}\) and \(\beta \) are learnable weight parameters, \(\mathbf {z}_k\) is the kth element of the visual vector \(\mathbf {z}\), and \(S(X)_{k}\) is the kth sound network feature map for a spectrogram \(\textit{X}\).
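In tensor form, Eq. (1) is a per-channel scaling of the K sound feature maps by the visual vector, followed by a weighted sum over channels. A minimal PyTorch sketch (batch dimension added, variable names ours):

```python
import torch

def sound_separator(z, s_maps, alpha, beta):
    """Eq. (1): spectrum mask as a learned linear combination of the K sound
    feature maps, each scaled by the matching element of the visual vector z.
    z: (B, K), s_maps: (B, K, HS, WS), alpha: (K,) learnable, beta: scalar."""
    weights = alpha.view(1, -1, 1, 1) * z.view(*z.shape, 1, 1)  # (B, K, 1, 1)
    g = (weights * s_maps).sum(dim=1) + beta                    # (B, HS, WS)
    return g
```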

3.5 Opponent Filter Module

The structure of the Opponent Filter (OF) module is illustrated in Fig. 4 for the case of two sound sources. The main idea of the OF is to use the visual representation of source n to identify spectrum components that should belong to source n but are currently assigned to source m. These components are then transferred from source m to n. The motivation behind this construction is to utilise all visual representations \(z_1,\ldots ,z_N\) when determining each component audio, instead of using only the corresponding one. This is in contrast to the previous works SoP [8] and SoM [9] (and, approximately, MP-Net [10]), where the output for each source is determined solely by its own visual input. Our approach leads to a more efficient use of the visual cues, which is reflected in the performance improvements shown in the experiments (see Sect. 4.2). Moreover, in our case \((K=16)\), the selected architecture requires only 17 parameters (16 \(\alpha _{k}\) values and one \(\beta \), as shown in Eq. (3)), which makes it very light and efficient to learn. The OF module is used in all but the first stage of COF.

More specifically, the OF module takes the visual representation \(\mathbf {z}\) and the previous spectrum mask \([g]_{j-1}\) of each sound source as input. First, the spectrum masks are converted to spectrograms \(\hat{Y}\) as

$$\begin{aligned} \hat{b} = th( \sigma ({g})), \quad \hat{Y} = \hat{b} \otimes X_{\textit{mix}} \end{aligned}$$
(2)

where \(\sigma \) denotes the sigmoid function, th represents the thresholding operation with value 0.5, and \(\otimes \) is the element-wise product. In other words, we first map g into a binary mask \(\hat{b}\), and then produce the estimate of the component spectrogram as an element-wise multiplication between the binary mask \(\hat{b}\) and the original mixture spectrogram \(X_{\textit{mix}}\). g and \(\hat{Y}\) are provided as inputs to the next stage (or used as the final output). We denote the outputs corresponding to the nth video at stage j as \([{g}_n]_{j}\), \([\hat{b}_n]_{j}\), and \([\hat{Y}_n]_{j}\). The obtained spectrograms are passed to the sound network (see Sect. 3.3), which converts them into feature maps of size \(\textit{HS}\times \textit{WS}\times \textit{K}\), denoted by \(F_n\) for source n.

Second, the OF module takes one source at a time, referred to by the index \(n\in [1,N]\), and iterates over the remaining sources \(m\in \{[1,N] \,|\, m\not =n \}\), where N is the number of sources in the sound mixture. For each pair (n, m), the filter determines the component of source m that should be reassigned to source n as

$$\begin{aligned} {r}_{m->n} = \sum ^K_{k=1} \alpha _{k} \mathbf {z_{n,k}} * F_{m,k} + \beta \end{aligned}$$
(3)

where \(\mathbf {z}_{n,k}\) is the kth element of the visual representation of source n, and \(F_{m,k}\) is the kth sound network feature map of source m. The term \({r}_{m->n}\) denotes the residual spectrum components identified from source m that should belong to source n but are currently assigned to m. This component is subtracted from the spectrum mask \([{g}_m]_{j-1}\) and added to \([{g}_n]_{j-1}\) as follows:

$$\begin{aligned}{}[{g}_m]_{j}&= [{g}_m]_{j-1} \ominus {r}_{m->n} \end{aligned}$$
(4)
$$\begin{aligned}{}[{g}_n]_{j}&= [{g}_n]_{j-1} \oplus {r}_{m->n} \end{aligned}$$
(5)

where \([g_n]_{j}\) is the spectrum mask (Eq. (1)) of the nth video at stage j and \({r}_{m->n}\) is the residual spectrum component transferred from source m to source n. \(\oplus \) and \(\ominus \) denote element-wise addition and subtraction, respectively.

The overall process is summarized in Algorithm 1.
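A minimal sketch of one OF stage, following Eqs. (2)–(5) for an arbitrary number of sources, is given below. The helper `sound_net`, the threshold argument, and the accumulation of residuals over all source pairs (which reduces exactly to Eqs. (4)–(5) for two sources) reflect our reading of the algorithm rather than the authors' reference implementation.

```python
import torch

def opponent_filter_stage(g_prev, z, x_mix, sound_net, alpha, beta, thr=0.5):
    """One OF stage (Eqs. (2)-(5)) for N sources.
    g_prev: list of N spectrum masks (B, HS, WS); z: list of N visual vectors (B, K);
    x_mix: mixture spectrogram (B, HS, WS); sound_net: maps (B, 1, HS, WS) -> (B, K, HS, WS);
    alpha: (K,) and beta: scalar are the module's 17 learnable parameters (for K = 16)."""
    n_src = len(g_prev)
    # Eq. (2): binarise each mask and cut the component spectrogram out of the mixture.
    y_hat = [(torch.sigmoid(g) > thr).float() * x_mix for g in g_prev]
    # Sect. 3.3: re-encode each estimated spectrogram into K feature maps.
    feats = [sound_net(y.unsqueeze(1)) for y in y_hat]           # each (B, K, HS, WS)
    g_new = [g.clone() for g in g_prev]
    for n in range(n_src):
        for m in range(n_src):
            if m == n:
                continue
            # Eq. (3): residual components of source m claimed by the visual cue of source n.
            w = alpha.view(1, -1, 1, 1) * z[n].view(*z[n].shape, 1, 1)
            r = (w * feats[m]).sum(dim=1) + beta                 # (B, HS, WS)
            g_new[m] = g_new[m] - r                              # Eq. (4)
            g_new[n] = g_new[n] + r                              # Eq. (5)
    return g_new
```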

3.6 Learning Objective

The model parameters are optimised with respect to the binary cross entropy (BCE) loss that is evaluated between the predicted and ground truth masks over all stages. More specifically,

$$\begin{aligned} \mathcal {L}_{\textit{sep}} = \sum ^J_{j=1} r_j \, \textit{BCE}([\hat{b}]_j, b_{gt}) \end{aligned}$$
(6)

where \(r_j\) is a weight parameter, \([\hat{b}]_j\) is the predicted binary mask, \(b_{gt}\) is the ground truth mask (determined by whether the target sound is the dominant component in the mixture), and J is the total number of stages.
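Assuming the per-stage predictions entering the loss are the sigmoid outputs \(\sigma(g)\) in [0, 1] before thresholding (the natural input for BCE), Eq. (6) can be written as:

```python
import torch.nn.functional as F

def separation_loss(pred_masks_per_stage, gt_mask, stage_weights):
    """Eq. (6): stage-weighted binary cross entropy between the predicted
    masks (values in [0, 1]) and the ground-truth binary mask b_gt."""
    return sum(r * F.binary_cross_entropy(pred, gt_mask)
               for r, pred in zip(stage_weights, pred_masks_per_stage))
```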

3.7 Sound Source Location Masking Network

The objective of the Sound Source Location Masking (SSLM) network is to identify a minimal set of input pixels for which the COF network would produce almost the same output as for the entire image. In practice, we follow the ideas presented in [56] and build an auxiliary network that estimates a sound source location mask which is applied to the input RGB frames. The SSLM is trained together with the overall model in a self-supervised manner (please see the supplementary material). The input video frames are first passed through the SSLM component, which outputs a weighted location mask with values in [0, 1] and the same spatial size as the input frame. The input video frames are multiplied element-wise with the mask, and the result is passed to the COF model. We illustrate the overall structure of the SSLM in Fig. 5a. The final optimisation is done by minimising the following loss function:

$$\begin{aligned} \mathcal {L} = \sum ^J_{j=1} r_j \, l_{\textit{diff}}([\hat{b}_{\textit{SSLM}}]_j, [\hat{b}]_j) + \lambda \frac{1}{q} \parallel \textit{SSLM}(I)\parallel _1, \end{aligned}$$
(7)

where \(l_{\textit{diff}}\) denotes the \(L_1\) difference between \([\hat{b}_{\textit{SSLM}}]_j\) and \([\hat{b}]_j\), \([\hat{b}_{\textit{SSLM}}]_j\) is the separation mask obtained using only the selected pixels, and \([\hat{b}]_j\) is the separation mask for the original image. \(r_j\) and \(\lambda \) are hyperparameters which control the contribution of each loss term. The term \(\lambda \frac{1}{q} \parallel \textit{SSLM}(I)\parallel _1\) encourages a location mask with only a small number of non-zero values, where q is the total number of pixels of \(\textit{SSLM}(I)\).
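A corresponding sketch of Eq. (7), under the assumption that \(l_{\textit{diff}}\) is an averaged \(L_1\) difference (the paper does not state whether it is summed or averaged), is:

```python
import torch

def sslm_loss(b_sslm_per_stage, b_full_per_stage, mask, stage_weights, lam):
    """Eq. (7): stage-weighted L1 difference between masks predicted from the
    SSLM-selected pixels and from the full frames, plus an L1 sparsity term on
    the location mask SSLM(I) (values in [0, 1], q = number of pixels)."""
    diff = sum(r * torch.mean(torch.abs(b_s - b_f))
               for r, b_s, b_f in zip(stage_weights, b_sslm_per_stage, b_full_per_stage))
    sparsity = lam * torch.sum(torch.abs(mask)) / mask.numel()
    return diff + sparsity
```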

Fig. 5. (a) The diagram of the Sound Source Location Masking (SSLM) network and (b) visualisation of the sound source locations produced by our method in comparison with the baseline models SoP [8], SoM [9], and MP-Net [10] on the MUSIC dataset.

4 Experiments

We evaluate the proposed approach using the Multimodal Sources of Instrument Combinations (MUSIC) [8] dataset and two subsets of AudioSet [57]: A-MUSIC and A-NATURAL. The proposed model is trained using artificial examples, generated by adding the audio signals from two or more training videos. The performance of the final sound source separation is measured in terms of standard metrics: Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), and Signal to Artifact Ratio (SAR). Higher is better for all metrics.
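These metrics can be computed, for example, with the `mir_eval` toolbox (whether the authors used this particular implementation is not stated). A self-contained example on toy waveforms:

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
# Toy waveforms standing in for ground-truth and separated sources, shape (n_sources, n_samples).
reference = rng.standard_normal((2, 11000 * 6))
estimated = reference + 0.1 * rng.standard_normal((2, 11000 * 6))

# bss_eval_sources returns per-source SDR, SIR, SAR (in dB) and the best permutation.
sdr, sir, sar, _perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(f"SDR {sdr.mean():.2f} dB  SIR {sir.mean():.2f} dB  SAR {sar.mean():.2f} dB")
```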

4.1 Datasets and Implementation Details

MUSIC. The Multimodal Sources of Instrument Combinations (MUSIC) [8] dataset is relatively small but of high quality. Most of the video frames are well aligned with the audio signals and contain little off-screen noise. Part of the original MUSIC dataset is no longer available on YouTube (10% missing at the time of writing). In order to keep the dataset size, we replaced the missing entries with similar YouTube videos. The baseline methods (e.g. SoP [8]) split the dataset into 500 training and 130 validation videos and report the performance on the validation set (the train/test split is not published). Instead, we follow the standard practice of reporting the performance on a separate hold-out test set. For this purpose, we randomly split the dataset into 400 training, 100 validation, and 130 test videos. This leads to 20% fewer training videos compared to [8]. All tested methods are trained and evaluated with the same data and pre-processing steps (see implementation details).

Table 1. Sound separation results of the proposed COF network, conditioned on appearance cues, on the MUSIC test set

A-MUSIC and A-NATURAL. AudioSet consists of an expanding ontology of 632 audio event classes and is a collection of over 2 million 10-second sound clips drawn from YouTube videos. Many of the AudioSet videos have limited quality and sometimes the visual content may be uncorrelated with the audio track. A-MUSIC is a musical instrument dataset trimmed from AudioSet. It has around 25k videos spanning ten instrument categories. A-NATURAL is a natural sound dataset trimmed from AudioSet. It contains around 10k videos covering 10 categories of natural sounds. We split both the A-MUSIC and A-NATURAL samples into 80%, 10%, and 10% as training, validation, and test sets. More details of the datasets are given in the supplementary material.

Implementation Details. We sub-sample each audio signal at 11 kHz and randomly crop an audio clip of 6 s for training. A Time-Frequency (T-F) spectrogram of size \(512\times 256\) is obtained by applying an STFT, with a Hanning window of size 1022 and a hop length of 256, to the input sound clip. We further re-sample this spectrogram to a T-F representation of size \(256\times 256\) on a log-frequency scale. We extract video frames at 8 fps and feed a single RGB image to the C2D-RGB model, and \(T=48\) frames to C2D-DYN and all the discussed C3D models. Further implementation details are provided in the supplementary material.
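A sketch of this audio preprocessing using `librosa` is given below; the exact log-frequency re-sampling procedure is not specified above, so the warping used here is only an approximation.

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=11000, clip_sec=6, n_fft=1022, hop=256):
    """Load a clip, take a 6-second excerpt, and compute a 512 x 256 magnitude
    spectrogram (Hanning window 1022, hop 256), then warp it onto a 256-bin
    log-frequency axis.  The warping is an approximation of the paper's step."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = y[: sr * clip_sec]                                       # 6-second clip
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                               window="hann"))                   # (512, ~258)
    spec = spec[:, :256]                                         # 512 x 256 T-F grid
    # Map the 512 linear-frequency bins onto 256 log-spaced bins.
    log_bins = np.logspace(0, np.log10(spec.shape[0] - 1), 256).astype(int)
    return spec[log_bins, :]                                     # 256 x 256
```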

4.2 Opponent Filter

In this section, we assess the performance of the OF module. For simplicity, we use only the appearance-based features (C2D-RGB) at all stages. The baseline is provided by the basic single-stage version, denoted as COF - 1 stage, which does not contain the opponent filter module. The results provided in Table 1 indicate a clear improvement from the OF stages.

In addition, we evaluate the impact of the “addition” and “subtraction” branches of the OF module. To this end, we implement two versions, COF\(_{addition}\) and COF\(_{subtraction}\), which include only the “addition” (Eq. (5)) or “subtraction” (Eq. (4)) operation in the OF, respectively. The corresponding results in Table 1 indicate that both versions obtain similar performance, which lies between the baseline and the full model. We conclude that both operations are an essential part of the OF module and contribute equally to the sound separation result.

Table 2. Sound separation results with COF, conditioned on different visual cues, on the MUSIC test set. The table contains three blocks: 1) single-stage COF with visual cues predicted by MA-RGB, MA-FLO, and C2D-DYN; 2) two-stage extensions of the models in the first block; 3) two-stage COF with C2D-RGB at stage 1 and C3D-RGB, C3D-FLO, or C2D-DYN at stage 2

4.3 Visual Representations

We first separate sounds with a single-stage network using the appearance and motion cues discussed in Sect. 3.2. We denote MA(C2D-RGB, C3D-RGB) and MA(C2D-RGB, C3D-FLO) as MA-RGB and MA-FLO in Table 2. As shown in block 1 of Table 2, the results with appearance and motion cues clearly surpass the network with only appearance cues from C2D-RGB in Table 1, which suggests that the motion representation is important for the sound separation quality. Block 2 shows how the visual information separates sounds in a two-stage manner. Explicitly, we replace the vision network at each stage in Fig. 2 with MA-RGB, MA-FLO, or C2D-DYN. Blocks 1 and 2 show that the three two-stage networks obtain similar performance and outperform their single-stage counterparts from block 1 by a large margin.

Finally, we evaluate an option where the first stage utilises only the appearance-based representation and the second stage applies motion cues. In practice, we combine C2D-RGB with C3D-RGB, C3D-FLO, or C2D-DYN. The results in block 3 of Table 2 indicate that this combination obtains similar or even better performance than the options where motion information was provided at both stages. We conclude that the appearance information is enough to facilitate a coarse separation at the first stage, and that the motion information is only needed at later stages to provide higher separation quality. It is worth noting that the COF(C2D-RGB, C2D-DYN) combination has a lower computational complexity and better performance compared to the 3D CNN alternatives. Therefore, we apply C2D-RGB at the first stage and C2D-DYN at the later stages in all remaining experiments.

Table 3. Sound separation performance of the 2- and 3-stage COF models compared with three recent baselines, SoP [8], SoM [9], and MP-Net [10], on the MUSIC, A-MUSIC, and A-NATURAL datasets. The top 2 results are bolded.

4.4 Comparison with the State-of-the-Art

We compare the 2-stage and 3-stage versions of the proposed COF model with three recent baseline methods: SoP [8], SoM [9], and MP-Net [10]. For SoP we use the publicly available implementation from the original authors. For SoM and MP-Net we use our own implementations, since there were no publicly available versions. The corresponding results for the MUSIC, A-MUSIC, and A-NATURAL datasets are provided in Table 3, Fig. 1, and Fig. 6. The quantitative results indicate that our model outperforms the baselines by a large margin across all three datasets.

Increasing the Number of Stages: We observe that the computational cost increases approximately linearly with the number of stages, while the performance generally improves until it reaches a plateau. COF with 2, 3, 4, and 5 stages obtains SDRs of 9.17, 10.07, 10.12, and 10.32 on the MUSIC dataset, respectively. The corresponding FLOPs (GMACs) are 8.05, 12.06, 16.06, and 20.07. The performance plateaus at 3 stages, which we therefore adopt as a compromise between accuracy and cost.

Mixture of Three Sources: We assess the COF model using a mixture of three sound sources from the MUSIC dataset. In this case, the two-stage model obtains SDR: 3.33, SIR: 10.32, and SAR: 6.70, which are clearly higher than the SDR: 1.30, SIR: 8.66, and SAR: 5.73 obtained with MP-Net [10], a method particularly designed for the multi-source case. As discussed in Sect. 3.5, the computational cost of COF scales approximately linearly with the number of sources. For instance, the FLOPs (GMACs) for 2, 3, 4, 5, 10, and 15 sources are 8.05, 11.09, 14.12, 17.16, 32.36, and 47.62, respectively.

Fig. 6. Visualizing the sound source separation of our 2-stage COF model on the A-MUSIC and A-NATURAL datasets, in comparison with the baseline methods SoP [8], SoM [9], and MP-Net [10].

4.5 Visualizing Sound Source Locations

We compare the sound source localisation capability of our best two-stage model with state-of-the-art methods in Fig. 5b. Columns (2)–(5) display the sound energy distribution over spatial locations as heatmaps on the input frame during inference. COF produces precise associations between the visual representation and the separated sounds, even though column (5) shows only the visualisation from the first stage of COF. The spatial features from a ConvNet usually have a small resolution (\(14\times 14\) pixels in this work), so the final visualised location is generally coarse after up-sampling the heatmap to the resolution of the input image. In contrast, our proposed SSLM learns to predict a pixel-level sound source location mask, as shown in column (6), which precisely localises the sound sources while preserving a high sound separation quality. Further examples are provided in the supplementary material.

5 Conclusions

We proposed a novel visually guided Cascaded Opponent Filter (COF) network that recursively refines sound separation with visual cues of the sound sources. The proposed Opponent Filter (OF) module uses the visual features of all sound sources to look for incorrectly assigned sound components in the opponent sources, resulting in accurate sound separation. For this purpose, we studied different visual representations based on video frames, optical flows, dynamic images, and their combinations. Moreover, we introduced a Sound Source Location Masking (SSLM) network which, together with COF, precisely localises the sound sources.