1 Introduction

We now extend the study of sound scene and event classification to settings where the observations are obtained from multiple sensors, each sensor contributing a particular view of the data; we refer to such observations as multiview data. Instances include multichannel audio data, as acquired by microphone arrays, and, more generally, multimodal data, i.e., heterogeneous data involving two or more modalities, such as the audio and visual modalities in video recordings.

Be it for applications in machine perception (at the heart of robots’ and virtual agents’ intelligence systems) or video description (as part of video surveillance or multimedia indexing systems), multiview approaches can lead to a significant boost in performance in challenging real-world situations. Indeed, multiplying the sources of information through different views should result in a more robust overall “picture” of the scene being analyzed, where sensors, and consequently views, that are not reliable (e.g., noisy) at a particular time instant are hopefully backed up by others. This is, for instance, the case in video recordings where sound-emitting target events are not visible onscreen because of poor lighting conditions or occlusions.

Such an endeavor is as promising as it is challenging, primarily because of the significant increase in the volume of the data to be analyzed, but also owing to the potential heterogeneity of the different streams of information (e.g., audio and visual streams), which in addition may not be perfectly synchronized. Another difficulty is that it is usually not possible to determine which streams are unreliable at every time instant. To see this, consider the scenario of scene analysis using a robot’s sensors. The available data views may be composed of the multiple audio streams acquired by the robot’s microphone array, as well as RGB and depth-image streams captured by its cameras, possibly along with other signals recorded by inertial measurement units. As the cameras are pointed at an interactant, events of interest may appear only partially in their field of view, and be present in the audio recording only at a very low signal-to-noise ratio. This may be due to background noise (including the robot’s internal noise, the so-called ego-noise, typically produced by its cooling fans or its actuators) and the voice of the interactant, or of the robot itself, in the foreground.

In this chapter, we briefly introduce some of the techniques that can be exploited to effectively combine the data conveyed by the different views under analysis for a better interpretation. Numerous good surveys have been written on the general topic of multimodal data fusion, notably the quite comprehensive paper by Atrey et al. [10]. Therefore, we first provide a high-level presentation of generic methods that are particularly relevant in the context of multiview and multimodal sound scene analysis (Sect. 9.3). It is worth noting that some of the techniques presented have not yet been considered in the context of scene and event recognition as envisaged in this book. We still briefly cover them here, as we believe they hold good potential for such applications. We then more specifically present a selection of techniques used for audiovisual event detection and microphone array-based scene analysis (in Sects. 9.4 and 9.5, respectively).

2 Background and Overview

2.1 Multiview Architectures

Figure 9.1 depicts an overview of the main fusion strategies exploited when analyzing multiview data, namely (1) fusion at the representation or feature level (upper row of the figure), and (2) fusion at the decision level, usually implying the integration of partial classifier outputs. Each of these methods is discussed further in Sect. 9.3. In particular, we focus on a special case of representation/feature-level fusion referred to here as joint subspace learning, where the aim is to learn or non-trivially transform the representations based on inter-relationships across the views.

Fig. 9.1 Overview of multiview data analysis approaches

As previously mentioned, views can be either of the same nature, in which case they are referred to as channels (typically audio channels), each corresponding to a particular microphone, or of different natures, as in multimodal scenarios where, for example, some of the views may correspond to different audio channels while others correspond to video images recorded by different cameras.

2.2 Visual Features

Since videos are central to the content of this chapter, a short note on commonly employed visual features is in order. Features extracted from visual streams can be broadly classified into two categories: appearance-based and motion-based features. For the former, several local and global descriptors representing appearance attributes, namely color, texture, and shape, are extracted. While some works utilize the raw pixel data or color histograms, others rely on standard features such as the scale-invariant feature transform (SIFT) [96] and histograms of oriented gradients (HOG) [40]. Lately, features extracted from convolutional neural networks have dominated [88].

Motion-based features are typically computed using optical flow or tracking data. It is possible to represent temporal changes of segmented regions, objects, and shapes by calculating velocity and acceleration, i.e., optical flow and its derivative. Other popular features include histograms of optical flow (HOF) [154] and motion boundary histograms (MBH) [154]. As MBH is computed from optical flow derivatives, it is not affected by constant motion. This makes it robust to camera motion. The reader is referred to [79, 101] for an extensive review of visual representations used for multimodal analysis.

In multiview settings, temporal synchronization across views is quite challenging. Notably, in the audiovisual case, since the video frame rate, typically around 25–30 frames per second, differs significantly from the audio sampling rate, features from both modalities must be appropriately sampled to establish temporal correspondence. Moreover, the natural asynchrony that exists between the two modalities must also be taken into account: cues for an audiovisual event might not appear simultaneously in both modalities.

3 General Techniques for Multiview Data Analysis

The techniques discussed in the following two subsections operate at either the representation level or the decision level, as described next.

3.1 Representation and Feature Integration/Fusion

Representation or feature integration/fusion is the process of combining different types of features or low-level data representations from different views into a common representation (usually to be exploited by a prediction system).

In practice, this can be achieved primarily by concatenating the feature vectors \( \mathbf{o}_{m,t} \), extracted from views m, 1 ≤ m ≤ M, at the same time positions t, to build integrated feature vectors \( \bar{\mathbf{o}}_{t} = \left [\mathbf{o}_{1,t}^{\mathsf{T}},\ldots,\mathbf{o}_{M,t}^{\mathsf{T}}\right ]^{\mathsf{T}} \), provided that the data analysis-rate and cross-view synchronization issues have been previously addressed.

However, the dimensionality of the resulting representation is often too high, which has led researchers to resort to dimensionality reduction methods. A common approach is to use feature transform techniques such as principal component analysis (PCA) [31], independent component analysis (ICA) [139], or linear discriminant analysis (LDA) [31] (see Chap. 4, Sect. 4.5.1). An interesting alternative is feature selection (see Chap. 4, Sect. 4.5.2): when applied to the feature vectors \( \bar{\mathbf{o}}_{t} \), the selection will hopefully retain a subset of the most “relevant” features across the various views (with respect to a selection criterion).
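As a minimal sketch of this concatenate-then-reduce pipeline (with random placeholder features standing in for real per-view descriptors; all dimensions and component counts are arbitrary), one might write:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder features: M = 3 views, N = 500 synchronized frames,
# with view-specific dimensionalities (arbitrary numbers).
N = 500
views = [rng.standard_normal((N, d)) for d in (60, 128, 20)]

# Early fusion: concatenate per-view feature vectors frame by frame,
# assuming frame rates and synchronization have already been handled.
o_bar = np.concatenate(views, axis=1)          # shape (N, 60 + 128 + 20)

# Reduce the dimensionality of the fused representation with PCA.
pca = PCA(n_components=40)
o_reduced = pca.fit_transform(o_bar)           # shape (N, 40)
print(o_reduced.shape, pca.explained_variance_ratio_.sum())
```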

Nevertheless, in multimodal settings, the previous methods often turn out to be limited owing to the different physical nature of the features to be combined. In particular, the features do not necessarily live in the same metric spaces, and are not necessarily extracted from the same temporal segments. Consequently, a number of works have attempted to address these limitations.

An interesting approach, within the framework of multiple kernel learning, consists in considering separate kernels for the different features and building optimal convex combinations of these kernels for classification, as done, for example, in [30, 157].

Another approach that is worthy of note is the construction of joint multimodal representations, as done in video analysis applications, where various types of audiovisual representations have been envisaged. Examples include the creation of audiovisual atoms [78] or audiovisual grouplets [76], both exploiting audiovisual correlations. A joint audiovisual representation may in particular be built using one of the joint subspace learning methods described in the following.

3.1.1 Feature-Space Transformation

A number of techniques have been suggested to map the observed feature vectors from two modalities to a low-dimensional space where a measure of “dependency” between them can be computed. Let us assume the N observed feature vectors from two modalities, \( \mathbf{o}_{1,t} \in \mathbb{R}^{J_{1}} \) and \( \mathbf{o}_{2,t} \in \mathbb{R}^{J_{2}} \) (t = 1, …, N), are assembled column-wise in matrices \( \mathbf{O}_{1} \in \mathbb{R}^{J_{1}\times N} \) and \( \mathbf{O}_{2} \in \mathbb{R}^{J_{2}\times N} \), respectively. The methods we describe here aim to find two mappings \( f_{1} \) and \( f_{2} \) (that reduce the dimensions of the feature vectors in each modality) such that a dependency measure \( S_{12}(f_{1}(\mathbf{O}_{1}),f_{2}(\mathbf{O}_{2})) \) is maximized. Various approaches can be described using this same formalism. The advantages of doing so are twofold: (1) it appropriately modifies the feature spaces to uncover relationships between views specified by the measure of dependency, and (2) by projecting data into the same space, the dimensionality difference between views is eliminated and direct comparison across views is made possible. Fisher et al. [51] choose the mutual information [35] as a dependency measure and seek single-layer perceptrons \( f_{1} \) and \( f_{2} \) projecting the audiovisual feature vectors to a 2-dimensional space. Other more popular approaches (for which closed-form solutions can be found) use linear mappings to project the feature streams:

  • Canonical correlation analysis (CCA), first introduced by Hotelling [67], aims at finding pairs of unit-norm vectors \( \mathbf{t}_{1} \) and \( \mathbf{t}_{2} \) such that

    $$ \displaystyle{ (\mathbf{t}_{1},\mathbf{t}_{2}) =\mathop{ \mathrm{arg\,max}}\limits _{(\mathbf{t}_{1},\mathbf{t}_{2})\in \mathbb{R}^{J_{1}}\times \mathbb{R}^{J_{2}}}\mbox{ corr}\left (\mathbf{t}_{1}^{\mathsf{T}}\mathbf{O}_{ 1},\mathbf{t}_{2}^{\mathsf{T}}\mathbf{O}_{ 2}\right ) } $$
    (9.1)

    CCA can be considered equivalent to mutual information maximization for the particular case where the underlying distributions are elliptically symmetric [83]. Several variants have been proposed to incorporate sparsity and non-negativity into the optimization problem to resolve issues with interpretability and ill-posedness, respectively [84, 138]. In the context of multimodal neuronal data analysis, temporal kernel CCA [15] has been proposed to take into account the temporal dynamics.

  • An alternative to the previous methods (expected to be more robust than CCA) is co-inertia analysis (CoIA). It consists in maximizing the covariance between the projected audio and visual features:

    $$ \displaystyle{ (\mathbf{t}_{1},\mathbf{t}_{2}) =\mathop{ \mathrm{arg\,max}}\limits _{(\mathbf{t}_{1},\mathbf{t}_{2})\in \mathbb{R}^{J_{1}}\times \mathbb{R}^{J_{2}}}\mbox{ cov}\left (\mathbf{t}_{1}^{\mathsf{T}}\mathbf{O}_{ 1},\mathbf{t}_{2}^{\mathsf{T}}\mathbf{O}_{ 2}\right ) } $$
    (9.2)

    A possible reason for CoIA’s stability is that it is a trade-off between CCA and PCA, and thus benefits from the advantages of both [21].

  • Yet another configuration, known as cross-modal factor analysis (CFA) and found to be more robust than CCA in [92], seeks two matrices \( \mathbf{T}_{1} \) and \( \mathbf{T}_{2} \) such that

    $$ \displaystyle{ (\mathbf{T}_{1},\mathbf{T}_{2}) =\mathop{ \mathrm{arg\,max}}\limits _{(\mathbf{T}_{1},\mathbf{T}_{2})}\left (1 -\|\mathbf{T}_{1}\mathbf{O}_{1} -\mathbf{T}_{2}\mathbf{O}_{2}\|_{F}^{2}\right ) =\mathop{ \mathrm{arg\,min}}\limits _{ (\mathbf{T}_{1},\mathbf{T}_{2})}\|\mathbf{T}_{1}\mathbf{O}_{1} -\mathbf{T}_{2}\mathbf{O}_{2}\|_{F}^{2} } $$
    (9.3)

    with \( \mathbf{T}_{1}\mathbf{T}_{1}^{\mathsf{T}} = \mathbf{I} \) and \( \mathbf{T}_{2}\mathbf{T}_{2}^{\mathsf{T}} = \mathbf{I} \), where \( \|\mathbf{V}\|_{F} \) denotes the Frobenius norm of matrix V.

Note that all the previous techniques can be kernelized to study nonlinear coupling between the modalities considered (see, for instance, [64, 90]).

The interested reader is referred to [64, 67, 92] for further details on these techniques, and to [58] for a comparative study.
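As a concrete illustration of the linear case, the following sketch applies scikit-learn’s CCA to two synthetic views sharing a latent signal; the data, dimensionalities, and number of components are placeholders, not taken from the works cited above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)

# Two views sharing a low-dimensional latent signal (toy data).
N, J1, J2 = 1000, 30, 40
latent = rng.standard_normal((N, 5))
O1 = latent @ rng.standard_normal((5, J1)) + 0.5 * rng.standard_normal((N, J1))
O2 = latent @ rng.standard_normal((5, J2)) + 0.5 * rng.standard_normal((N, J2))

# Note: scikit-learn expects samples in rows, i.e., the transpose of the
# J x N convention used in the text.
cca = CCA(n_components=2)
U, V = cca.fit_transform(O1, O2)   # projected views, shape (N, 2) each

# Canonical correlations of the first two component pairs.
for k in range(2):
    r = np.corrcoef(U[:, k], V[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.3f}")
```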

3.1.2 Multimodal Dictionary Learning

While the previous approaches relied on modeling the association between the features across modalities, this class of techniques targets the extraction of meaningful multimodal structures to jointly represent all the modalities. This is useful because feature transformation techniques like CCA impose simplifying assumptions such as linearity and are adversely affected by lack of data. To this end, Monaci et al. [106] propose to learn multimodal dictionaries wherein the dictionary elements are learned using an algorithm that enforces synchrony between modalities and decorrelation between the learned dictionary elements. The learned templates can then be used for performing various tasks. Monaci et al. improve upon this foundational work by proposing a bimodal matching pursuit algorithm which integrates dictionary learning and coding [107]. The sparse shift-invariant generative model used for the audiovisual case is given by defining multimodal dictionary elements \( \{\phi _{d}\}_{d=1}^{D} = \left (\phi _{d}^{a}(t),\phi _{d}^{v}(x,y,t)\right ) \), consisting of audio, \( \phi _{d}^{a} \), and visual, \( \phi _{d}^{v} \), parts, and a spatio-temporal shift operator \( T_{(pqr)}\phi _{d} = \left (\phi _{d}^{a}(t - r),\phi _{d}^{v}(x - p,y - q,t - r)\right ) \), such that the multimodal signal s is approximated by the following equation:

$$ \displaystyle{ s \approx \sum _{d=1}^{D}\sum _{ i=1}^{n_{d} }c_{d_{i}}T_{(pqr)d_{i}}\phi _{d} } $$
(9.4)

where \( n_{d} \) is the number of instances of \( \phi _{d} \) and \( c_{d_{i}} \) specifies the weights of the AV components of \( \phi _{d} \) at the i-th instance. Several limitations of this approach were later addressed through a new objective function and algorithm that balance the two modalities, reduce computational complexity, and improve robustness [94].

3.1.3 Co-Factorization Techniques

Matrix factorization techniques can be profitably used to extract meaningful representations for the data being analyzed.

When dealing with multichannel data, i.e., data views of the same nature (e.g., multichannel audio or images), observations from multiple channels may be profitably assembled in multi-way arrays, i.e., tensors, before being modeled by tensor factorization methods. For multichannel audio data, a popular approach consists in collecting the spectrograms of the signals from the different channels (originating from different microphones) in a 3-way tensor, as illustrated in Fig. 9.2, before processing it with the so-called PARAFAC (PARAllel FACtor analysis) decomposition method, possibly with non-negativity constraints. This can be interpreted as an attempt to explain the audio spectra observations \( v_{tm} \) as linear combinations of elementary spectra \( w_{k} \), temporally weighted by activation coefficients \( o_{kt} \), up to spatial modulation coefficients \( q_{mk} \).

Fig. 9.2 PARAFAC decomposition of multichannel audio spectra

Such decompositions were found particularly useful in multichannel audio source separation [52, 118]. For more information about tensor factorization methods, we refer the reader to [33, 87, 159].

In contrast to the previous setting, data from different modalities usually live in feature spaces of completely different topology and dimensionality (think of audio as opposed to video), preventing the possibility of “naturally” representing them by the same tensor. In this case, one may resort to so-called co-factorization techniques, that is, techniques performing two (or more) factorizations in parallel, which are linked in a particular way. Because of the different nature of the modalities, this link usually has to be characterized through dependencies between the temporal activations in cross-modal correspondence, rather than through dependencies between dictionary elements of different modalities.

Assume that appropriate nonnegative features have been extracted at the same rate from the two modalities being analyzed (say, the audio and images of a video), so that two observation matrices \( \mathbf{V}_{1} \in \mathbb{R}_{+}^{J_{1}\times N} \) and \( \mathbf{V}_{2} \in \mathbb{R}_{+}^{J_{2}\times N} \) are available for the audio and visual data, respectively. One may then seek a model \( (\mathbf{W}_{1},\mathbf{W}_{2},\mathbf{O}) \) such that:

$$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{V}_{1} \approx \mathbf{W}_{1}\mathbf{O} \quad \\ \mathbf{V}_{2} \approx \mathbf{W}_{2}\mathbf{O}\, \quad \\ \mathbf{W}_{1} \geq 0,\,\mathbf{W}_{2} \geq 0,\,\mathbf{O} \geq 0\,;\quad \end{array} \right. } $$
(9.5)

in such a way that the temporal activations are the same for both modalities. This is referred to as hard co-factorization, an approach that has been followed in a number of works (see, e.g., [53, 160, 161]). Clearly, this approach is limited in that it does not account for possible local discrepancies across the modalities. This happens, for example, when there is a mismatch between the audio and the image information, say because of a visual occlusion in video analysis scenarios. This motivates the soft co-factorization model of Seichepine et al. [134], which merely encourages the temporal activations corresponding to each modality to be close, as opposed to equal, according to:

$$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{V}_{1} \approx \mathbf{W}_{1}\mathbf{O}_{1} \quad \\ \mathbf{V}_{2} \approx \mathbf{W}_{2}\mathbf{O}_{2}\, \quad \\ \mathbf{O}_{1} \approx \mathbf{O}_{2} \quad \\ \mathbf{W}_{1} \geq 0,\,\mathbf{W}_{2} \geq 0,\,\mathbf{O}_{1} \geq 0,\,\mathbf{O}_{2} \geq 0.\quad \end{array} \right. } $$
(9.6)

The model (9.6) is estimated by solving the following optimization problem:

$$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathop{\min }\limits_{\boldsymbol{\theta }}\,\,C_{c}(\boldsymbol{\theta })\,\,;\,\,\boldsymbol{\theta }\,\stackrel{\varDelta }{=}\,\left (\mathbf{W}_{1},\mathbf{O}_{1},\mathbf{W}_{2},\mathbf{O}_{2}\right ) \quad \\ \mathbf{W}_{1} \geq 0,\,\mathbf{W}_{2} \geq 0,\,\mathbf{O}_{1} \geq 0,\,\mathbf{O}_{2} \geq 0\,;\quad \end{array} \right. } $$
(9.7)
$$ \displaystyle{ C_{c}(\boldsymbol{\theta })\,\stackrel{\varDelta }{=}\,D_{1}(\mathbf{V}_{1}\,\vert \,\mathbf{W}_{1}\mathbf{O}_{1}) +\gamma D_{2}(\mathbf{V}_{2}\,\vert \,\mathbf{W}_{2}\mathbf{O}_{2}) +\delta P\left (\mathbf{O}_{1},\mathbf{O}_{2}\right )\,; } $$
(9.8)

where:

  • \( D_{1}(\cdot \,\vert \,\cdot ) \) and \( D_{2}(\cdot \,\vert \,\cdot ) \) are the measures of fit relating to the first and second modalities, respectively; note that they may be chosen to be different divergences, each well suited to the corresponding feature space;

  • \( P\left (\cdot,\cdot \right ) \) is a penalty on the difference between (properly rescaled) activation values occurring at the same instant; it can be either the ℓ1- or ℓ2-norm of the difference between the rescaled activations;

  • γ and δ are regularization parameters controlling, respectively, the relative importance of each modality and the coupling penalty.

The interested reader is referred to [134] for more details on the algorithms. The soft co-factorization scheme has proven effective for multichannel [134] and multimodal audio source separation [120], as well as multimodal speaker diarization [133]. It is believed to be promising for audiovisual event detection tasks.
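For illustration, here is a minimal multiplicative-update sketch of the hard co-factorization of Eq. (9.5), i.e., shared activations with a weighted sum of Frobenius costs. The soft scheme of [134] additionally rescales the activations and couples them through the penalty of Eq. (9.8), which is not reproduced here; all data and sizes below are toy placeholders.

```python
import numpy as np

def hard_cofactorization(V1, V2, K, gamma=1.0, n_iter=200, eps=1e-9):
    """Jointly factorize V1 ~ W1 @ O and V2 ~ W2 @ O with shared activations O,
    using multiplicative updates for ||V1 - W1 O||_F^2 + gamma ||V2 - W2 O||_F^2."""
    rng = np.random.default_rng(0)
    J1, N = V1.shape
    J2, _ = V2.shape
    W1, W2, O = rng.random((J1, K)), rng.random((J2, K)), rng.random((K, N))
    for _ in range(n_iter):
        W1 *= (V1 @ O.T) / (W1 @ O @ O.T + eps)
        W2 *= (V2 @ O.T) / (W2 @ O @ O.T + eps)
        O *= (W1.T @ V1 + gamma * W2.T @ V2) / (W1.T @ W1 @ O + gamma * W2.T @ W2 @ O + eps)
    return W1, W2, O

# Toy nonnegative "audio" and "visual" feature matrices sharing common activations.
rng = np.random.default_rng(2)
O_true = rng.random((4, 300))
V1 = rng.random((64, 4)) @ O_true   # e.g., audio spectra
V2 = rng.random((20, 4)) @ O_true   # e.g., visual motion features
W1, W2, O = hard_cofactorization(V1, V2, K=4)
print("relative reconstruction error:", np.linalg.norm(V1 - W1 @ O) / np.linalg.norm(V1))
```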

3.1.4 Neural Networks and Deep Learning

Lately, rapid progress in the application of deep learning methods to representation learning has motivated researchers to use them for fusing multiview data [3, 112, 140]. The primary advantage of neural networks is their ability to model very complex nonlinear correlations between multiple views. Early insights into their use for multiview data were provided by Yuhas et al. [163], who trained a network to predict audio using visual input. Subsequently, Cutler et al. [39] proposed to learn audiovisual correlations for the task of speaker detection using a time-delay neural network (TDNN). Recently, various multimodal autoencoder architectures for learning shared representations have been proposed, even for the case where only a single view is present at training and testing time [112]. Another interesting line of work extends CCA to learning two deep encodings, one for each view, such that their correlation is maximized [3]. Regularized deep neural networks [158] have also been proposed to construct shared representations taking into account the feature inter-relationships. Each of these methods has been developed independently in different settings, and their application to event analysis and detection remains to be explored. This is a rapidly growing area of research; we refer the interested reader to [56, 68] for recently proposed multimodal fusion architectures.
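As a generic sketch only (not reproducing any specific architecture from the works cited above), a two-branch network with view-specific encoders feeding a shared fused representation could look as follows in PyTorch; all layer sizes and the dummy data are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TwoViewFusionNet(nn.Module):
    """Two view-specific encoders whose outputs are concatenated into a
    shared representation used for classification (a generic sketch)."""
    def __init__(self, d_audio=64, d_video=128, d_hidden=32, n_classes=10):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, d_hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, n_classes))

    def forward(self, x_a, x_v):
        # Shared (fused) representation obtained by concatenating the encodings.
        z = torch.cat([self.enc_a(x_a), self.enc_v(x_v)], dim=-1)
        return self.head(z)

model = TwoViewFusionNet()
x_a, x_v = torch.randn(8, 64), torch.randn(8, 128)        # a dummy mini-batch
logits = model(x_a, x_v)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()                                            # end-to-end training step
print(logits.shape)
```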

3.2 Decision-Level Integration/Fusion

Decision-level fusion, also known as late integration, refers to the idea of combining intermediate decisions, i.e., partial classifier outputs, in order to achieve a more accurate multimodal characterization of the content, an idea which has been explored extensively under various configurations. This can be seen as a particular case of ensemble learning [125] where the base classifiers (to be combined) operate on different views of the data.

Numerous works rely on majority voting procedures whereby final global decisions are made based on a weighted sum of individual votes, each typically corresponding to a decision taken on a particular view. The weights are often chosen using either heuristics or trial-and-error procedures (see, for example, [93]). This idea can be better formalized using a Bayesian framework that allows the uncertainty about each classifier’s decisions to be taken into account [71, 103].

3.2.1 Probabilistic Combination Rules

When using classifiers providing local probabilistic outputs \( p(G_{c}\mid \mathbf{o}_{m,t}) \) for the t-th observation of the m-th view, \( \mathbf{o}_{m,t} \), a simplistic decision strategy assumes feature-vector observations from different views to be independent, and the decision rule consequently takes the form:

$$ \displaystyle{ \hat{G} =\mathop{ \mathrm{arg\,max}}\limits _{c}\log [p(G_{c}\mid \mathbf{o}_{0,t},\,\ldots,\,\mathbf{o}_{M-1,t})] =\mathop{ \mathrm{arg\,max}}\limits _{c}\sum _{m=0}^{M-1}\log p(G_{ c}\mid \mathbf{o}_{m,t})\,. } $$
(9.9)

It is worth mentioning that alternative simple combination rules have also been employed that are discussed in Kittler et al. [86].
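A minimal numerical sketch of the rule in Eq. (9.9), with made-up per-view posteriors purely for illustration:

```python
import numpy as np

# Posteriors p(G_c | o_m,t) from M = 3 view-specific classifiers over C = 4 classes
# (arbitrary numbers); rows are views, columns are classes.
posteriors = np.array([
    [0.70, 0.10, 0.10, 0.10],   # view 1: confident about class 0
    [0.25, 0.25, 0.25, 0.25],   # view 2: uninformative
    [0.60, 0.20, 0.10, 0.10],   # view 3: also favors class 0
])

# Assuming conditionally independent views, sum the log-posteriors and take the argmax.
log_scores = np.log(posteriors).sum(axis=0)
c_hat = int(np.argmax(log_scores))
print("fused decision:", c_hat, "scores:", np.round(log_scores, 2))
```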

The previous approach does not allow prior knowledge about the dependency structure in the data, in particular the cross-modal and temporal dependencies, to be incorporated. To this end, more sophisticated dynamic classifiers have been utilized, ranging from variants of (multistream) hidden Markov models (HMMs) [7, 60, 85, 111], through more general dynamic Bayesian networks [32, 59, 109], to even more general graphical models such as conditional random fields (CRFs) [20].

3.2.2 Neural Networks

Neural networks can also be used for late integration. Some works have utilized them to adaptively learn the weights for fusing multiple classifiers or system outputs [74, 113]. This is typically carried out by training the network to minimize the error between estimated and oracle weights [74]. Moreover, in order to take into account the temporal and multiview dependencies, a frequently used strategy is to perform end-to-end training with “fusion” integrated as a layer (usually close to the output layer) into the architecture [49, 82]. Such methods cannot be termed late fusion in the conventional sense, as the feature learning and decision fusion stages are not independent.

3.2.3 Other Methods

Another widespread strategy consists in using the monomodal classifier outputs as features, on the basis of which a new classifier, expected to optimally perform the desired multimodal fusion, is learned [13, 156]. Also, solutions to deal with the potential imprecision of some views have been proposed using the Dempster–Shafer theory [54].

Finally, it is important to note that the techniques described in this section are not mutually exclusive: in practice one may jointly consider different integration strategies for different features and views (possibly being driven by some expert knowledge), and different analysis time-horizons. This raises the difficult issue of effectively and efficiently exploiting, at the final prediction stage, heterogeneous representations: low-level instantaneous features, possibly over varying time-scales, intermediate prediction results—sometimes seen as outputs of event or concept detectors—bags-of-words or bags-of-systems extracted over longer texture-windows, etc.

4 Audiovisual Event Detection

4.1 Motivation

The goal of audiovisual event detection (AVED) is to detect specific events that occur in an audiovisual recording or real-time stream, and to identify the class of those events. Though the task is more widely addressed through the analysis of the video images, the information conveyed by the soundtrack may become key for proper detection. Indeed, the visual information may not be sufficient, since occlusions may occur and events may be localized in space, hence not visible in the images, given that the camera field of view is necessarily restricted. Also, the images may not be usable because of poor lighting conditions or fast camera motion. AVED then enables a more reliable detection of these events by combining audio and visual cues.

4.1.1 Examples in Video Content Analysis and Indexing

Researchers continue to explore various techniques for improving video content analysis and indexing for better navigation and user experience. In this context, AV event analysis at various levels of granularity provides useful insights into the composition of such data in terms of objects, activities, and scenes. This not only improves retrieval but also provides a representation closer to our understanding of the physical world. For example, a user could search a database for activity videos such as “dribbling a basketball” or “playing a violin.” Evidently, these are two very distinct tasks where the differences can be readily detected based on auditory and motion information. Moreover, joint analysis could reveal the presence of various objects (e.g., violin, basketball) and also the surroundings (e.g., concert hall, court).

Such an analysis makes object detection and segmentation [72], concept classification [75, 76, 78], scene segmentation and change detection [149], activity analysis, and various other related tasks possible. Several systems submitted to the TRECVID video content analysis tasks of multimedia event detection, story segmentation, and search rely on AV analysis [2, 77, 156].

4.1.2 Examples in AV Surveillance and Robot Perception

Video has recently become an increasingly important resource for forensics and surveillance [104, 124]. Video captured by CCTV systems or video recorded from mobile devices (and possibly shared on multimedia platforms) can provide essential clues in solving criminal cases. For example, when considering an investigation about a missing person, video documents can help to localize the missing person or a suspect, providing crucial information about their whereabouts. The analysis of videos linked with a missing person or her/his social network can also help to understand the conditions of the disappearance (was it a kidnapping, a runaway, etc.) and largely influence the investigation.

An investigator looking for a video in a large dataset may want to retrieve information based on the type of scene where the video was recorded or, at a finer level of granularity, based on specific events that occurred during the recording. In addition, the detection of specific events can help to confirm (or deny) the fact that a video was recorded in a particular scene. Some events are indeed representative of particular scenes. For example, train noise in all probability indicates that the scene takes place in a train station; plates and cutlery noises indicate that the scene is probably taking place in a restaurant [22]. On the other hand, some events are unlikely to happen in particular scenes. AVED can then help track anomalies to detect abnormal events (gunshots, crowd panic, etc.) [97] or to identify a recording scene where information has been voluntarily concealed. This is the case, for example, when a kidnapper sends a ransom video recorded from inside a building but a church bell or a train passing nearby can be heard during the video. This type of information, which is not present visually, can help to localize the place where the video was recorded [136].

4.2 AV Event Detection Approaches

4.2.1 AV Event Detection and Concept Classification

Approaches to AV event detection have been very varied and data dependent. Many works for traditional event detection utilize Markov model variants such as the duration-dependent input–output Markov model (DDIOMM) [110], the multistream HMM, or the coupled HMM [69]. The former uses a decision-level fusion strategy, while the latter two fuse at an intermediate level. These methods have been shown to perform better than single-modality approaches, with coupled HMMs being particularly useful for modeling AV asynchrony.

Specifically, with regard to event detection in surveillance videos, Cristiani et al. [37] propose to use the AV concurrence matrix to identify salient events. The idea is to model the audio/video foreground and construct this matrix based on the assumption that simultaneously occurring AV foreground patterns are likely to be correlated. Joint AV analysis has also been employed extensively for sports video analysis and for broadcast analysis in general. In one approach, several feature detectors are built to encode various characteristics of field sports, and their decisions are then combined using a support vector machine (SVM) [127]. Several approaches for structuring TV news videos have also been proposed.

On the other hand, joint codebook-based approaches have been quite popular for the task of multimedia concept classification. In essence, each element of these multimodal codebooks captures some part of a salient AV event. Work on short-term audiovisual atoms (S-AVA) [78] aims to construct a codebook from multimodal atoms which are a concatenation of features extracted from tracked short-term visual regions and audio. To tackle the problem of video concept classification, this codebook is built through multiple instance learning. Following this work, AV grouplets (AVGs) [76] were proposed, where separate dictionaries are constructed from coarse audio and visual foreground/background separation. Subsequently, AVGs are formed based on the mixed-and-matched temporal correlations. For instance, an AVG could consist of frames where a basketball player is seen in the foreground with the audio of the crowd cheering in the background. As an alternative, Jhuo et al. [75] determine the relations between the audio and visual modalities by constructing a bipartite graph from their bag-of-words representations; spectral clustering is then performed to partition it and obtain bimodal words. Unlike S-AVA and bimodal words, AVG has the advantage of explicitly tackling temporal interactions. However, like S-AVA, it relies on video region tracking, which is quite difficult for unconstrained videos.

4.2.2 AV Object Localization and Extraction

AV object localization and extraction refers to the problem of identifying sources visually and/or aurally. This section serves to show how objects responsible for audiovisual events can be extracted from either of the modalities through joint analysis. The general approach is to first associate the two modalities using methods discussed in Sect. 9.3. The parameters learned during the former step can then be utilized for object localization and segmentation (visual part), audio source separation (audio counterpart), or unsupervised AV object extraction in both modalities. We now discuss approaches to each of these application scenarios.

Object localization and segmentation have been popular research problems in the computer vision community. Various approaches have leveraged the audio modality to better perform this task, with the central idea of associating visual motion and audio. Fisher et al. [51] proposed to use joint statistical modeling to perform this task using mutual information. Izadinia et al. [72] consider the problem of moving-sounding object segmentation, using CCA to correlate audio and visual features. The video features, consisting of mean velocity and acceleration computed over spatio-temporal segments, are correlated with the audio. The magnitude of the learned video projection vector indicates the strength of association between the corresponding video segments and the audio. Several other works have followed the same line of reasoning while using different video features to represent motion [84, 138]. The effectiveness of CCA can be illustrated with a simple example of a video with a person dribbling a basketball [72] (see Fig. 9.3). Simplifying Izadinia et al.’s [72] visual feature extraction methodology, we compute the optical flow and use the mean velocity calculated over 40 × 40 blocks as the visual representation and mel-spectra as the audio representation. The heat map in Fig. 9.3 shows the correlation between each image block and the audio. Areas with high correlation correspond to regions with motion. If we instead use a soft co-factorization model [134], it is indeed possible to track the image blocks correlated with the audio in each frame.

Fig. 9.3 CCA illustration: heat map showing the correlation between video image regions and audio. Black squares indicate the highest correlation
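A deliberately simplified sketch of this kind of analysis (not the exact pipeline of [72]): assuming per-block mean motion magnitudes and an audio energy envelope already resampled to the video frame rate, one can correlate each block’s motion time series with the audio to obtain a crude heat map. The data below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder inputs: motion[t, b] = mean optical-flow magnitude of block b at frame t,
# audio_env[t] = audio energy (e.g., summed mel power) resampled to the video frame rate.
T, n_blocks = 200, 12 * 16           # e.g., a 12 x 16 grid of 40 x 40 blocks
motion = rng.random((T, n_blocks))
audio_env = rng.random(T)
motion[:, 100] += 2.0 * audio_env    # make one block "sound-correlated" for the demo

# Per-block Pearson correlation with the audio envelope.
m = motion - motion.mean(axis=0)
a = audio_env - audio_env.mean()
corr = (m * a[:, None]).sum(axis=0) / (np.linalg.norm(m, axis=0) * np.linalg.norm(a) + 1e-12)

heat_map = corr.reshape(12, 16)      # reshape to the block grid for display
print("most correlated block:", np.unravel_index(np.argmax(heat_map), heat_map.shape))
```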

Another approach worth mentioning uses Gestalt principles for locating sound sources in videos [105]: inspired by the Gestalt principle of temporal proximity, the authors propose to detect synchronous audiovisual events. A particularly different approach was taken by Casanovas et al. [29], who proposed an audiovisual diffusion coefficient to remove information from video image parts which are not correlated with the audio.

Audio source separation is the audio counterpart of the previously discussed problem. The aim is to extract sound produced by each source using video information. As done for videos, mutual information maximization has been used to perform source separation in a user-assisted fashion by identifying the source spatially. Recent methods perform this within the NMF-based source separation framework [120, 132].

Several other approaches deal with both object segmentation and source separation together in a completely unsupervised manner. Work by Barzeley et al. [11] considers onset coincidence to identify AV objects and subsequently perform source separation. A particular limitation of this method is the requirement of setting multiple parameters for optimal performance on each example. Blind AV source separation has also been attempted using nonnegative CCA [138] and sparse representations [28]. Independent component analysis over concatenated features from both modalities also extracts meaningful audiovisual objects [139]; however, its application is limited to static scenes. Finally, multimodal dictionary learning has also been utilized in this context [94].

While the methods discussed in this section have been shown to work well in controlled environments, their performance is expected to degrade in dense audiovisual scenarios. Moreover, they make a simplifying assumption that all the objects are seen onscreen. It must be emphasized that most of these techniques can be considered symmetric, in the sense that they can be applied to tasks in either of the modalities with appropriate representations.

5 Microphone Array-Based Sound Scene Analysis

In complex sound scenes, the sounds coming from different sources can overlap in time and frequency. Single-channel processing can discriminate sources as long as they are separated in either time or frequency. Trying to detect or classify sound events that overlap in both time and frequency directly from a single-channel signal will generally result in confusion between events. An alternative approach is to attempt to separate individual events prior to detection or classification. However, trying to separate sounds that overlap in both time and frequency with single-channel techniques is known to be problematic and will inevitably introduce a loss of information, resulting in a degradation of the subsequent detection and classification performance. Microphone arrays enable the usage of multichannel techniques that exploit not only temporal and spectral diversity between sources but also spatial information about their location.

Historically, microphone arrays were composed of a set of microphones placed along a straight line (with constant spacing between microphones for linear arrays, or variable spacing for logarithmic arrays). Less constrained arrays, such as spherical and circular arrays, have been used for specific purposes, and more recently arrays without spatial constraints, as in wireless acoustic sensor networks, have become a popular research topic. Some approaches and concepts presented here are applicable only to specific array topologies, or at least when the topology is known beforehand (see also below). In this chapter we also assume that the signals coming from the different microphones are synchronized at the sampling stage in order to allow for the exploitation of spatial cues. Readers should keep in mind that, at the time of writing of this book, dealing with unsynchronized microphone arrays is still an open research problem.

5.1 Spatial Cues Modeling

In order to exploit spatial information about the sound sources, audio scene analysis algorithms usually first model the spatial cues and then estimate the corresponding parameters. Both deterministic and probabilistic modeling of such spatial cues have been widely considered in the literature. The former usually relies on (a) the point source assumption, where sound from a source is assumed to come from a single position, and (b) the narrowband approximation, where the mixing process from an audio source to the microphone array is characterized by a frequency-dependent mixing vector [98]. Probabilistic modeling is usually applied for reverberated or diffuse sources, where sound from a source may come from many directions due to reverberation; it has been used, e.g., in source localization [18, 63, 116], separation [46, 73, 99], and beamforming systems [17, 48]. This section will discuss some typical spatial cue models, in both a deterministic and a probabilistic sense, for different audio scene analysis applications.

5.1.1 Binaural Approach

Humans generally combine cues from several audiovisual streams to localize sound sources spatially. The main cues for localization in the horizontal plane are related to binaural hearing, relying on the difference between the signal reaching the right ear and the signal reaching the left ear. All these cues are encoded in the so-called interaural transfer function (ITF), which includes the following:

  • The interaural time difference (ITD) is the difference between the time-of-arrival of a signal at the left ear and the right ear. It is useful to localize sounds based on their onset and at low frequency (below 1.5 kHz) [89] (see also Fig. 9.4a).

    Fig. 9.4 Artificial representation of the binaural cues. (a) ITD, (b) IPD, (c) IID

  • The interaural phase difference (IPD) is the phase difference between the signal at the left ear and the right ear. It is useful to localize on-going sound as long as the wavelength is larger than the diameter of the head (below 1.5 kHz) [162] (see also Fig. 9.4b);

  • The interaural intensity difference (IID) is the difference in level between the signal at the left ear and the right ear due to the acoustic shadow produced by the head for sounds above 3 kHz (below this frequency, the so-called head shadow effect is not present) [108] (see also Fig. 9.4c).

All the concepts mentioned above can be extended to general microphone array setups. The ITD and IPD concepts directly generalize to linear microphone arrays, where they relate rather straightforwardly to the time difference of arrival (TDOA) and the direction of arrival (DOA). In this case, however, the arrays have to be designed carefully to prevent spatial aliasing. The IID concept is less applicable to small linear arrays as it relies on the head shadow effect: in small arrays, the level difference between the signal impinging on two consecutive microphones might not be significant. However, in ad-hoc arrays where the topology is unconstrained, the microphones can be quite far apart and the IID can become insightful as well, provided that the microphone positions are known beforehand. These spatial cues are extensively exploited to extract a signal of interest from the mixture using the beamforming approaches described in Sect. 9.5.1.2 (for example, the delay-and-sum beamformer directly relies on the ITD). Spatial cues can also be used directly for sound source localization (see also Sect. 9.5.2.3) and, by proxy, for source separation (see also Sect. 9.5.2.1) and sound event detection (see also Sect. 9.5.2.2).
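As a small sketch of how such interchannel cues can be computed from a synchronized stereo signal, the snippet below derives phase and level differences from the two-channel STFT; the signal is synthetic, with a pure integer-sample delay standing in for an ITD.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
left = np.sin(2 * np.pi * 440 * t)
delay = 8                                   # integer-sample delay standing in for an ITD
right = np.roll(left, delay)

# STFT of both channels (synchronized sampling is assumed).
f, frames, L = stft(left, fs=fs, nperseg=1024)
_, _, R = stft(right, fs=fs, nperseg=1024)

ipd = np.angle(L * np.conj(R))              # interchannel (interaural-style) phase difference
ild = 20 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))  # level difference in dB

# At 440 Hz the IPD should be close to 2*pi*440*delay/fs (wrapped to [-pi, pi]).
k = np.argmin(np.abs(f - 440))
print("mean IPD at 440 Hz:", ipd[k].mean(), "expected:", 2 * np.pi * 440 * delay / fs)
```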

5.1.2 Beamforming Methods

Fixed beamformers constitute a first, simple class of multichannel algorithms which can separate signals coming from different directions. A fixed beamformer tries to steer toward the direction from which the desired sound signal comes and to reject signals coming from other directions. The main categories of fixed beamformers include delay-and-sum beamformers, filter-and-sum beamformers [66], superdirective microphone arrays [36], and the original formulation of the minimum variance distortionless response beamformer (MVDR) [26]. Adaptive beamformers try to steer toward the direction of the desired sound signal and to adaptively minimize the contributions of the undesired sources coming from other directions. This typically yields a constrained optimization problem. Frost introduced the linearly constrained minimum variance beamformer (LCMV) as an adaptive framework for the MVDR [55].

The generalized side lobe canceler (GSC), also known as the Griffiths-Jim beamformer, is an alternative approach to the LCMV where the optimization problem is reformulated as an unconstrained problem [62]. The GSC can be decomposed as a fixed beamformer steering toward the desired source, a blocking matrix, and a multichannel adaptive filter [65].

The multichannel Wiener filter (MWF) represents another class of multichannel signal extraction algorithms, defined by an unconstrained optimization problem [45]. MWF-based algorithms can be implicitly decomposed into a spatial filter and a spectral filter, and can indeed be considered beamformers [135]. Moreover, a reformulation of the MWF allows the spectral distortion introduced to be explicitly controlled [45, 135].
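A compact frequency-domain delay-and-sum sketch for a uniform linear array, steering over a grid of directions and picking the direction of maximum output power; the array geometry, source frequency, and signals are illustrative assumptions.

```python
import numpy as np

c, fs = 343.0, 16000          # speed of sound (m/s), sampling rate
I, d = 4, 0.05                # 4 microphones, 5 cm spacing (assumed ULA)
theta_src = np.deg2rad(30)    # true source direction

# Simulate a far-field plane wave at 1 kHz arriving from theta_src
# (fractional delays applied in the frequency domain for simplicity).
n = np.arange(4096)
freqs = np.fft.rfftfreq(len(n), 1 / fs)
S = np.fft.rfft(np.sin(2 * np.pi * 1000 * n / fs))
delays = d * np.arange(I) * np.sin(theta_src) / c                        # per-mic delays (s)
X = np.stack([S * np.exp(-2j * np.pi * freqs * tau) for tau in delays])  # (I, F)

def delay_and_sum(X, theta):
    """Steer the array toward angle theta and return the output power."""
    taus = d * np.arange(I) * np.sin(theta) / c
    W = np.exp(2j * np.pi * freqs[None, :] * taus[:, None])   # phase-compensation weights
    y = (W * X).mean(axis=0)                                  # align and average channels
    return np.sum(np.abs(y) ** 2)

angles = np.deg2rad(np.arange(-90, 91))
powers = [delay_and_sum(X, th) for th in angles]
print("estimated DOA (deg):", np.rad2deg(angles[int(np.argmax(powers))]))
```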

5.1.3 Nonstationary Gaussian Model

The nonstationary Gaussian framework has emerged in audio source separation [46, 50, 114, 119] as a probabilistic model of reverberated sources. It was then also applied in, e.g., multichannel acoustic echo cancellation [144] and multichannel speech enhancement [145]. In this paradigm, the short-time Fourier transform (STFT) coefficients of the source images \( \mathbf{c}_{j}(t,f) \), i.e., the contribution of the j-th source (1 ≤ j ≤ J) at the microphone array, are modeled as a zero-mean Gaussian random vector whose covariance matrix \( \widehat{\mathbf{R}}_{j}(t,f) = \mathbb{E}\left (\mathbf{c}_{j}(t,f)\mathbf{c}_{j}^{H}(t,f)\right ) \) can be factorized as

$$ \displaystyle{ \widehat{\mathbf{R}}_{j}(t,f) = v_{j}(t,f)\mathbf{R}_{j}(t,f), } $$
(9.10)

where \( v_{j}(t,f) \) are scalar time-varying variances encoding the spectro-temporal power of the sources and \( \mathbf{R}_{j}(t,f) \) are I × I spatial covariance matrices encoding their spatial position and spatial width. This model relies neither on the point source assumption nor on the narrowband assumption, hence it is applicable to reverberated or diffuse sources. In the general situation where a sound source may be moving, the spatial cues encoded by \( \mathbf{R}_{j}(t,f) \) are time-varying. However, in most cases where the source position is fixed and the reverberation is moderate, the spatial covariance matrices are time-invariant: \( \mathbf{R}_{j}(t,f) = \mathbf{R}_{j}(f) \). Different ways of parameterizing \( \mathbf{R}_{j}(f) \) have been considered in the literature, resulting in either rank-1 or full-rank matrices, where the latter case was shown to be more appropriate for modeling reverberated and diffuse sources as it directly accounts for the interchannel correlation in the off-diagonal entries of \( \mathbf{R}_{j}(f) \) [46].
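Under this model, each source image is typically recovered by the multichannel Wiener filter \( \widehat{\mathbf{c}}_{j}(t,f) = v_{j}(t,f)\mathbf{R}_{j}(f)\left (\sum _{k}v_{k}(t,f)\mathbf{R}_{k}(f)\right )^{-1}\mathbf{x}(t,f) \) applied to the mixture. A toy sketch for a single time-frequency point, with made-up parameters, is given below.

```python
import numpy as np

rng = np.random.default_rng(4)
I, J = 2, 3                         # 2 channels, 3 sources (toy setting)

# Made-up model parameters at one time-frequency point (t, f):
v = rng.random(J) + 0.1             # time-varying source variances v_j(t, f)
R = []                              # spatial covariance matrices R_j(f), Hermitian PSD
for _ in range(J):
    A = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    R.append(A @ A.conj().T + 0.1 * np.eye(I))

# Mixture covariance and an observed mixture STFT coefficient x(t, f).
Rx = sum(vj * Rj for vj, Rj in zip(v, R))
x = rng.standard_normal(I) + 1j * rng.standard_normal(I)

# Multichannel Wiener filter estimate of each source image c_j(t, f).
c_hat = [v[j] * R[j] @ np.linalg.solve(Rx, x) for j in range(J)]
print("sum of source-image estimates equals the mixture:", np.allclose(sum(c_hat), x))
```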

5.2 Spatial Cues-Based Sound Scene Analysis

This section will discuss the use of spatial cue models presented in the previous section in some specific applications, namely sound source separation, acoustic event detection, and moving sound source localization and tracking.

5.2.1 Sound Source Separation

In daily life, recorded sound scenes often result from the superposition of multiple sound sources, which prevents both humans and machines from properly localizing and perceiving the target sound sources. Thus, source separation plays a key role in sound scene analysis; its goal is to extract the signals of the individual sound sources from an observed mixture [98]. It offers many practical applications in, e.g., communication, hearing aids, robotics, and music information retrieval [6, 14, 100, 152].

Most source separation algorithms operate in the time-frequency (T-F) domain with the mixing process formulated as

$$ \displaystyle{ \mathbf{x}(t,f) =\sum _{ j=1}^{J}\mathbf{c}_{ j}(t,f) } $$
(9.11)

where \( \mathbf{x}(t,f) \in \mathbb{C}^{I\times 1} \) denotes the STFT coefficients of the I-channel mixture at T-F point (t, f), and \( \mathbf{c}_{j}(t,f) \in \mathbb{C}^{I\times 1} \) is the j-th source image. As \( \mathbf{c}_{j}(t,f) \) encodes both spectral information about the sound source itself and spatial information about the source position, a range of spectral and spatial models has been considered in the literature, resulting in various source separation approaches. In the determined case where I ≥ J, non-Gaussian modeling such as frequency-domain independent component analysis (FDICA) has been well studied [122, 128]. In the under-determined situation where I < J, sparse component analysis (SCA) has been largely investigated [19, 61, 81]. As a specific example of the nonstationary Gaussian modeling presented in Sect. 9.5.1.3, the parameters are usually estimated by the expectation-maximization (EM) algorithm derived in either the maximum likelihood (ML) sense [46] or the maximum a posteriori (MAP) sense [47, 117, 119]; source separation is then achieved by multichannel Wiener filtering. Readers are referred to, e.g., [95, 150] for surveys of recent advances in both blind scenarios and informed scenarios, the latter exploiting some prior knowledge about the sources themselves [119] or the mixing process [47] to better guide the separation.

5.2.2 Sound Event Detection

As different sound events usually occur at different spatial locations in the sound scene, spatial cues obtained from microphone array processing intrinsically offer important information for sound event detection (SED). As an example, information about the source directions inferred from the interchannel time differences of arrival (TDOA) was used in [151] to help partition home environments into several areas containing different types of sound events. The combination of these spatial features with classic MFCCs was reported to improve event classification in the experiments. Motivated by binaural processing, in [1] stereo log mel-band energies are extracted from stereo recordings to train neural networks, providing a cue similar to the IID.
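A short sketch of extracting stereo log mel-band energies in the spirit of [1], using librosa; the file path and parameter values are illustrative assumptions, not prescribed by the cited work.

```python
import numpy as np
import librosa

# Load a stereo recording without downmixing (the file path is a placeholder).
y, sr = librosa.load("stereo_scene.wav", sr=44100, mono=False)   # y.shape = (2, n_samples)

def log_mel(channel, sr, n_mels=40):
    S = librosa.feature.melspectrogram(y=channel, sr=sr, n_fft=2048,
                                       hop_length=1024, n_mels=n_mels)
    return librosa.power_to_db(S)            # (n_mels, n_frames)

# Stack the per-channel log mel-band energies; level differences between the
# two channels carry a cue similar to the IID.
features = np.stack([log_mel(y[0], sr), log_mel(y[1], sr)])      # (2, n_mels, n_frames)
print(features.shape)
```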

5.2.3 Localization and Tracking of Sound Sources

Sound source localization and tracking are concerned with estimating and following the position of a target source within a sound scene. This active field of research in microphone array processing finds important applications, e.g., in surveillance or video conferencing, where the camera should be able to follow a moving speaker and can even automatically switch the capture to an active sound source in multiple-source environments [153]. Spatial cues offered by the multichannel audio capture play a key role in deriving the algorithms.

The problem of acoustic source localization has been an active topic in the audio processing literature for the past three decades because of its applicability to a wide range of applications [41, 146]. The most effective solutions rely on spatial distributions of microphones, which sample the sound field at several locations. Spurious events, reverberation, and environmental noise, however, can be a significant cause of localization error. In order to ease the problem, at least for those errors that are contained in a limited number of time frames, source tracking techniques can come in handy, as they are able to perform trajectory regularization, even on the fly. Typical approaches are based on particle [5, 91, 155], Kalman [4, 57], or distributed Kalman filtering [143].

Different methodologies have been developed for the localization of acoustic sources through microphone arrays. Those that have gained popularity are based on measurements of the time delay between array microphones. Working in the time domain is often a suitable choice for wideband signals, and most techniques rely on the analysis of the generalized cross-correlation (GCC) of the signals [27] and variants thereof. Localization in the frequency domain, however, can be shown to attain good results for narrowband or harmonic sources immersed in wideband noise, and relies on the analysis of the covariance matrix of the array data. A taxonomy of the localization techniques is represented in Fig. 9.5.

Fig. 9.5 Taxonomy of source localization techniques

5.2.3.1 Time-Domain Localization

Steered response power (SRP, [42, 43, 102]) and global coherence field (GCF, [115]) methods proceed through the computation of a coherence function that maps the GCC values at different microphone pairs onto hypothesized source locations. A source location estimate is found as the point in space that maximizes the coherence function. In [23] the scenario of multiple sources is accommodated through a two-step procedure that, after localizing the most prominent source, de-emphasizes its contribution to the GCC so that other sources can be localized. These techniques are known for their high level of accuracy, and are suitable for networks of microphone arrays, where synchronization can only be guaranteed between microphones of the same array. One limitation of such solutions is their computational cost, which is proportional to the number of hypothesized source locations: increasing the spatial resolution results in higher computational costs. Some solutions have been proposed in the literature to mitigate this problem. In [165] the authors propose a hierarchical method that begins with a coarser grid and refines the estimate at successive steps by computing the map for finer grids concentrated around the candidate locations estimated at the previous step. In [44] a similar approach is adopted, but a stochastic region contraction strategy is used for going from a coarser to a finer grid. An example of a steered response power with stochastic region contraction map is shown in Fig. 9.6.

Fig. 9.6 Example of a coherence map using the steered response power with stochastic region contraction technique (SRP-SRC, from [44])

Less cumbersome are the solutions based on the time difference of arrival (TDOA), which is estimated as the time lag at which the GCC exhibits its maximum value. The TDOA is then converted into a range difference (RD), which measures the difference in range between the source and the two microphones of the pair. The locus of candidate source locations corresponding to a given TDOA is a branch of a hyperbola whose foci are at the microphone locations and whose aperture is proportional to the measured TDOA. The most straightforward localization technique consists in intersecting the branches of hyperbolas corresponding to the TDOA measurements coming from different pairs of microphones. The cost function based on this procedure is strongly nonlinear, which makes the method sensitive to measurement errors. Least squares cost functions provide a good approximation [12, 34, 70, 129]. The main drawback of TDOA-based localization is its sensitivity to outlier measurements. In [24, 25, 130] techniques for the removal of outliers were presented; in particular, the DATEMM algorithm [130] is based on the observation that TDOAs over a closed loop must sum to zero.
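A minimal GCC-PHAT sketch for estimating the TDOA between two synchronized microphone signals (synthetic data with an integer-sample delay, for simplicity):

```python
import numpy as np

def gcc_phat(sig, refsig, fs, max_tau=None):
    """Return the delay (s) of `sig` relative to `refsig`, via PHAT-weighted GCC."""
    n = 2 * len(sig)                               # zero-pad to avoid circular effects
    SIG, REF = np.fft.rfft(sig, n), np.fft.rfft(refsig, n)
    cross = SIG * np.conj(REF)
    gcc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # PHAT weighting
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    gcc = np.concatenate((gcc[-max_shift:], gcc[:max_shift + 1]))
    lag = np.argmax(np.abs(gcc)) - max_shift
    return lag / fs

fs = 16000
rng = np.random.default_rng(5)
s = rng.standard_normal(fs)                        # 1 s of white noise as the source
delay = 12                                          # samples, i.e., 0.75 ms
x1 = s.copy()
x2 = np.concatenate((np.zeros(delay), s[:-delay]))  # delayed copy at the second mic

print("estimated TDOA (ms):", 1000 * gcc_phat(x2, x1, fs))   # expect about +0.75 ms
```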

5.2.3.2 Frequency-Domain Localization

Techniques in the frequency domain are based on the observation that different microphones in the array receive differently delayed replicas of the source signals, which in the frequency domain corresponds to a phase offset. For distant sources the phase offset between adjacent microphones is constant throughout the array. Delay-and-sum beamformers compensate the offsets so that the components related to a given direction sum up coherently while the others do not. The estimation of the direction of arrival (DOA) of the target source proceeds by searching for the direction that maximizes the output energy of the beamformer over a grid of directions [141, Chapter 6]. The most straightforward nonparametric beamformer is the delay-and-sum, which is known for its low resolution, making it difficult to distinguish sources seen at close angles from the array viewpoint. The minimum variance distortionless response beamformer (MVDR, [26]) partially improves the resolution capabilities. Parametric techniques, among which it is worth mentioning multiple signal classification (MUSIC, [131]) and estimation of signal parameters via rotational invariance techniques (ESPRIT, [126]), bring improvements in terms of resolution. However, they are known for their sensitivity to noise and reverberation, which tends to introduce spurious localizations. The superdirective data-independent beamformer [16] was shown to partially mitigate this problem. An interesting solution to the sensitivity to reverberation was proposed in [137] for the detection of gunshots using networks of sensors, each equipped with four or more microphones. For each sensor, both DOAs and TDOAs are measured. The source location is estimated by intersecting the loci of potential source locations (hyperbolas and directions of arrival) obtained from the two kinds of measurements from all the sensors. In reverberant conditions and in the presence of interferers, some TDOAs and some DOAs could be related to spurious paths, thus providing multiple estimates of the gunshot location. The actual gunshot location is found as the one that maximizes the number of consistent TDOAs and DOAs.
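For illustration, a compact MUSIC sketch for narrowband DOA estimation with a uniform linear array and synthetic snapshots; the geometry, frequency, and noise level are arbitrary assumptions, and the number of sources is assumed known.

```python
import numpy as np
from scipy.signal import find_peaks

c, f0 = 343.0, 1000.0             # speed of sound, narrowband frequency (Hz)
I, d = 6, 0.05                    # 6-microphone ULA, 5 cm spacing (assumed)
thetas_true = np.deg2rad([-20, 35])
n_snap = 400
rng = np.random.default_rng(6)

def steering(theta):
    return np.exp(-2j * np.pi * f0 * d * np.arange(I) * np.sin(theta) / c)

# Simulate narrowband snapshots: two sources plus noise.
A = np.stack([steering(th) for th in thetas_true], axis=1)            # (I, 2)
S = rng.standard_normal((2, n_snap)) + 1j * rng.standard_normal((2, n_snap))
N = rng.standard_normal((I, n_snap)) + 1j * rng.standard_normal((I, n_snap))
X = A @ S + 0.1 * N

# Covariance matrix and its noise subspace (the number of sources is assumed known).
R = X @ X.conj().T / n_snap
eigval, eigvec = np.linalg.eigh(R)                                     # ascending order
En = eigvec[:, : I - 2]                                                # noise subspace

# MUSIC pseudo-spectrum over a grid of candidate directions.
grid = np.deg2rad(np.arange(-90, 90.5, 0.5))
p = np.array([1.0 / np.linalg.norm(En.conj().T @ steering(th)) ** 2 for th in grid])
idx, _ = find_peaks(p)
top = idx[np.argsort(p[idx])[-2:]]
print("estimated DOAs (deg):", np.sort(np.rad2deg(grid[top])))
```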

It is important to notice that TDOA-based and frequency-domain source localization techniques require the synchronization of the microphones within the array. This, in fact, becomes an issue when multiple independent small arrays are deployed in different locations. In [25] the authors propose a technique for the localization without requiring a preliminary synchronization of the arrays by including the time offsets between the arrays into the unknowns, along with the location of the source. Another important issue is the self-calibration of the array, i.e., the estimation of the mutual relative positions of the microphones [38, 147]. The widespread diffusion of mobile phones and devices equipped with one or more microphones enables the implementation of a wireless acoustic sensor network in seconds, for goals ranging from teleconferencing to security. In this context, however, both calibration and synchronization are needed before normal operation [123].

5.2.3.3 Acoustic Source Tracking

Independently of the adopted localization method, reverberation and interferers can introduce spurious localizations. The goal of source tracking is to alleviate the influence of such outliers. The idea behind tracking is that measurements related to the actual source must follow a dynamical model, whereas those related to spurious sources must not [155]. Another goal that can be pursued with tracking systems is that of fusing information coming from both audio and visual localization systems [9, 142]. Several solutions have been presented in the literature. The Kalman filter [57] is a linear system characterized by two equations: the state equation models the evolution of the state of the system (location and speed of the source) from one time frame to the next, while the observation equation links the state variables with the observable measurements. The goal of the Kalman filter is to estimate the current state from the time series of observations.
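A minimal constant-velocity Kalman filter sketch for smoothing noisy 2-D location estimates; all matrices, noise levels, and the simulated trajectory are illustrative assumptions.

```python
import numpy as np

dt = 0.1                                    # time step between location estimates (s)
F = np.block([[np.eye(2), dt * np.eye(2)],  # state transition: constant-velocity model
              [np.zeros((2, 2)), np.eye(2)]])
H = np.hstack([np.eye(2), np.zeros((2, 2))])         # only the (x, y) position is observed
Q = 0.01 * np.eye(4)                        # process noise covariance (assumed)
Rm = 0.25 * np.eye(2)                       # measurement noise covariance (assumed)

x = np.zeros(4)                             # state: [x, y, vx, vy]
P = np.eye(4)

rng = np.random.default_rng(7)
true_pos = np.stack([np.linspace(0, 5, 50), np.linspace(0, 2, 50)], axis=1)
measurements = true_pos + 0.5 * rng.standard_normal(true_pos.shape)   # noisy localizations

track = []
for z in measurements:
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new location measurement.
    S = H @ P @ H.T + Rm
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    track.append(x[:2].copy())

print("last smoothed position:", np.round(track[-1], 2), "true:", true_pos[-1])
```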

Recently, distributed Kalman filters have been used, which enable the tracking of acoustic sources also in the case of distributed array networks [164], without requiring that all nodes communicate the whole state of the system.

Inherent assumptions in the use of the Kalman filter are the linearity and Gaussianity of the measurement and state models. In order to gain robustness against nonlinearity, the use of the extended Kalman filter has been proposed [142], which linearizes the nonlinear system around the working point. In order to gain robustness against non-Gaussian conditions, however, one has to resort to a different modeling of the source dynamics. In recent years the particle filter has gained interest in the source localization community, as it is also suitable for tracking in nonlinear, non-Gaussian systems and, more generally, for its higher performance [155]. Particle filtering [8] assumes that both state and measurement vectors are known in probabilistic form. Once a new measurement vector is available, the likelihood function of the current observation given a state is sampled through particles. Each particle is assigned a weight, which determines its relevance in the likelihood function. Only relevant particles are propagated to the next step. The source location is determined as the centroid of the set of particles. An example of tracking one, two, or three acoustic sources on a given trajectory from DOA measurements is shown in Fig. 9.7.

Fig. 9.7 Example of tracking of one, two, or three sources over a prescribed trajectory (from [148])

In audio surveillance contexts, it is important to enable localization when multiple sources are active at any time, with a small convergence time when acoustic sources alternate. This is important, for example, in events that involve multiple acoustic sources (brawls, people yelling, etc.). In recent years, swarm particle filtering has been shown to address this scenario particularly well [121]. It is based on the idea that the propagation of each particle to the next step is determined not only by the previous history of the particle itself, but also by the particle that exhibits the best likelihood at the current time instant. Consequently, the overall behavior of the system resembles that of a bird flock, rapidly moving toward the active source. An example of the behavior of swarm particle filtering is shown in Fig. 9.8, where two sets of particles at four consecutive time frames estimate the location of a source using particle filtering (PF) and swarm particle filtering (Swarm). The two sets are initialized identically. It can be noticed that after four steps the swarm particles cluster around the source location, while the PF is still converging.

Fig. 9.8 Example of the behavior of two sets of particles propagated using particle filtering (PF) and swarm particle filtering (SWARM). The two sets of particles occupy the same location at the first time frame (from [121])

6 Conclusion and Outlook

Multichannel and multimodal data settings represent opportunities to address complex real-world scene and event classification problems more effectively. The availability of concurrent, hence potentially complementary, streams of data is amenable to a more robust analysis when the streams are effectively combined using appropriate techniques, be it at the input representation level, the feature level, or the decision level. Successful applications of such techniques have been realized in various multichannel audio and audiovisual scene analysis tasks.

Yet, a number of research questions remain open in these settings. Notably, it is still not clear how to generically detect when some of the data views are temporarily not reliable (typically noisy or out of focus, with respect to the classes of interest) and which strategies should be developed that can efficiently ignore such views and proceed with the classification (or any other similar data processing) using models which were perhaps trained assuming all views are available.

Also, given the complexity of accurately annotating all data views, especially for instantaneous multi-label event classification tasks, that is, when multiple events may occur simultaneously, it is important to consider learning methods that can take advantage of very coarse ground-truth labels, which may have been obtained based on just one of the views, without necessarily being relevant for the others. An example of this is the “blind” annotation of the audio track of a video (without considering the images), where sound events may not be visible onscreen at the same time stamps. Multiple instance learning and weakly supervised learning techniques may turn out to be effective learning paradigms to address these difficulties.