1 Introduction

Acoustic event detection (AED), which determines both the types and the happening times (beginning and end times) of different acoustic events, enables automatic systems to obtain a better understanding of what is happening in an acoustic scene. It has been applied in many areas including security [12, 73, 80], life assistance [36, 61] and human–computer interaction [78, 94]. Due to the important and indispensable role of AED, there has been increasing research activity in this area [29, 59, 62, 85]. Various evaluation campaigns including CLEAR [75], DCASE 2013 [26], DCASE 2016 [83] and DCASE 2017 [4] were organized to address the challenges in and to promote research on AED.

The AED task is typically treated as a frame-based classification or regression problem, with each frame mapped to an acoustic event type and its continuous-valued localization. Frame-wise classification of acoustic events is applied over sliding time windows, with statistical models used to represent the acoustic features. A straightforward idea is to use conventional machine learning algorithms, such as Gaussian mixture models (GMMs) [93], to model the means and variances of the acoustic features for each event type. During testing, the likelihoods from the GMMs are summed across time and the type with the highest probability is chosen as the final detected result. Other statistical machine learning algorithms such as hidden Markov models (HMMs) [72, 90], support vector machines (SVMs) [57, 79] and nonnegative matrix factorization (NMF) [19, 40] have also been applied to perform the classification task. Non-speech sounds were also detected using the fuzzy integral (FI) [77], which showed results comparable to the high-performing SVM feature-level fusion in [77]. In [62], the authors proposed a technique for the joint detection and localization of non-overlapping acoustic events using random forests (RFs). Multivariate random forest regressors are learned for each event category to map each frame to continuous estimates of the event onset and offset times. Heittola et al. [30] proposed two iterative approaches based on the expectation maximization (EM) algorithm [91] to select the most likely stream to contain the target sound: one by always selecting the most likely stream and the other by gradually eliminating the most unlikely streams from the training data.

Although some improvements have been achieved using the aforementioned learning algorithms, these conventional machine learning techniques show limited power as the AED task becomes more challenging, for example when the acoustic events are polyphonic or when strongly labeled annotations (both the presence and the localizations of the acoustic events are given in the training process) are not available. Inspired by the successful applications of deep learning techniques in computer vision, speech signal processing and natural language processing, the AED task is now also widely performed using different neural network-based deep learning approaches.

Neural network-based deep learning algorithms originated in the 1940s, leading to the first wave of artificial intelligence (AI) algorithms with the creation of the single-layer perceptron (SLP) and the multilayer perceptron (MLP) [66, 67]. In [32], a layer-wise greedy learning-based training method was proposed for the deep neural network (DNN). Hidden layers in a network are pretrained one layer at a time in an unsupervised manner, which considerably accelerates the subsequent supervised learning through the back-propagation algorithm [23, 46]. A convolutional neural network (CNN)-based approach achieved a new record error rate of 0.39% on the handwritten digits database MNIST in [64], marking significant progress since the classical prototype LeNet-5 [47]. In [33], the authors proposed auto-encoders (AEs) to pretrain the feed-forward neural network (FNN). Afterward, variations of AEs, including the de-noising auto-encoder (DAE) [81, 82], the sparse auto-encoder (SAE) [50] and the variational auto-encoder (VAE) [38], were proposed to enhance the ability of feature learning and representation. However, the VAE introduces potentially restrictive assumptions about the approximate posterior distribution, making the generated data blurry [37]. Generative adversarial networks (GANs) were proposed as a distinct approach based on a game-theoretic formulation for training the generative model [27] and can produce high-quality images [18, 65]. GANs [31, 58, 65, 71] have been widely adopted to generate training data in [6, 55, 92].

Table 1 Some conventional machine learning approaches applied to the acoustic event detection

The aforementioned neural network-based deep learning approaches have shown superior performance in AED tasks. In [9], the DNN was used to perform the polyphonic acoustic event detection task. The deep neural network-based system outperformed the conventional learning method that used nonnegative matrix factorization at the preprocessing stage and an HMM as the classifier. The recurrent neural network (RNN) was applied to the AED system in [49, 60] to capture long-term context information. In [60], the authors presented a technique based on the bidirectional long short-term memory (BLSTM). The multi-label BLSTM was trained to map acoustic features of multiple classes to binary activity indicators of each event class. In [34], the authors presented a polyphonic AED system based on multiple models. In that work, one DNN was used to detect the acoustic event "car" and five bidirectional gated recurrent unit–recurrent neural networks (BGRU-RNNs) were used to detect the other acoustic events. In [35], the CNN [48] was used to extract high-level features that are invariant to local spectral and temporal variations. The authors in [4, 59] combined the RNN and CNN by adopting the convolutional recurrent neural network (CRNN) to model the audio features and achieved state-of-the-art performance.

Tables 1 and 2 comprehensively list recent works applying conventional machine learning and neural network-based deep learning approaches to AED. The works in Tables 1 and 2 adopt different evaluation datasets and metrics. In order to highlight the advantages of the deep learning approaches over the conventional machine learning techniques, Table 3 lists selected top systems based on the unified evaluation databases in the DCASE Challenges from 2013 to 2017. The first block in Table 3 shows the system performances using the conventional machine learning approaches. It is worth noting that deep learning approaches were not applied to the AED systems in 2013. The second block shows that the neural network-based deep learning approaches outperformed the conventional machine learning techniques. The detection performance was pushed further with improved error rates using the deep learning approaches in 2017. The number of deep learning-based systems also increased dramatically from 0 in 2013 to 33 in 2017, dominating the 36 submitted systems. To show the computational load of different training approaches, Table 4 gives the detailed information of some typical neural network structures and the corresponding detection error rates when the Mel-band energy is adopted as the acoustic feature. The aforementioned trend motivated us to write this survey of neural network-based deep learning approaches applied to the acoustic event detection task.

Table 2 Neural network-based deep learning approaches in acoustic event detection
Table 3 Selected top systems using conventional machine learning and deep learning approaches on the DCASE Challenge evaluation databases from the years 2013 to 2017
Table 4 Some typical neural network structures and the corresponding detection error rates in acoustic event detection

The purpose of this paper is to provide a comprehensive survey of the neural network-based deep learning approaches applied to the acoustic event detection task. Two types of acoustic event detection, namely the strongly and the weakly labeled acoustic event detection, are surveyed in this paper. Our survey differs from that of [15] by including works on the important weakly labeled acoustic event detection problem (where only the presence of acoustic events is given in the training process), which is one of the new tasks of the DCASE 2017 Challenge.

This survey is organized as follows. Some common acoustic event databases and evaluation metrics are introduced in Sect. 2. In Sect. 3, the strongly labeled acoustic event detection is first introduced. Afterward, applications of some state-of-the-art deep learning approaches to the strongly labeled acoustic event detection are elaborated. In Sect. 4, we introduce the weakly labeled acoustic event detection task and the recent advances in that area. The reasons why neural network-based deep learning approaches benefit the AED task and the issues to be studied further are given in Sect. 5. We conclude the paper in Sect. 6.

2 Metrics and Databases in Acoustic Event Detection

2.1 Evaluation Metrics

There are two ways of evaluating the system performance, namely the segment-based and the event-based statistics [54], in which the system output and the ground truth labels are compared in fixed-length intervals or at the event instance level, respectively.

  • Segment-based metric For the segment-based metric, the predicted active acoustic events are determined in a fixed short time interval, with the true positive (tp), false positive (fp), false negative (fn) and true negative (tn) defined as follows. A true positive means that the acoustic event exists in both the system output and the ground truth label simultaneously. A false positive denotes that the system determines the acoustic event as active while the true label for the acoustic event is inactive. A false negative means that the system fails to detect the acoustic event when the reference indicates the acoustic event to be active. A true negative means that the system and the ground truth both determine the acoustic event as inactive.

  • Event-based metric For the event-based evaluation metric, the system output and the ground truth label are compared event by event. Similarly, the true positive, false positive, false negative and true negative are defined. All these statistics are calculated based on whether the temporal position of the system output overlaps with the temporal position in the ground truth label. A tolerance with respect to the ground truth label is usually allowed.

The F-score and the error rate (ER) are commonly adopted as the final evaluation metrics once the segment-based and event-based statistics are available.

  • F-score

    Based on the segment-based and event-based statistics, the segment-based and event-based F-scores can be calculated as:

    $$\begin{aligned} F = \frac{2\times P \times R}{P + R} \end{aligned}$$
    (1)

    where P and R denote the precision and recall, respectively, which are expressed as:

    $$\begin{aligned} \begin{aligned} P&= \frac{tp}{tp+fp}\\ R&= \frac{tp}{tp+fn} \end{aligned} \end{aligned}$$
    (2)
  • Error rate

    The error rate measures the number of prediction errors in terms of insertions (I), deletions (D) and substitutions (S). The errors are calculated segment by segment. In a segment seg, the number of insertions I(seg) is the number of incorrect system outputs, the number of deletions D(seg) is the number of ground truth events that are not correctly identified, and the number of substitutions S(seg) is the number of ground truth events for which the system outputs some other acoustic event instead of the correct one. A sketch computing both the F-score and the error rate from these statistics is given after this list. The error rate can be calculated as:

    $$\begin{aligned} ER = \frac{\sum _{seg=1}^{K}I(seg) + \sum _{seg=1}^{K}D(seg) + \sum _{seg=1}^{K}S(seg)}{\sum _{seg=1}^{K}N(seg)} \end{aligned}$$
    (3)

    where seg denotes the \(seg^{th}\) segment, K is the total number of segments, N(seg) is the number of acoustic events annotated as active in the segment seg, and I(seg), D(seg) and S(seg) can be expressed as:

    $$\begin{aligned} \begin{aligned} I(seg)&= \text {max}(0,fp(seg)-fn(seg))\\ D(seg)&= \text {max}(0,fn(seg)-fp(seg))\\ S(seg)&= \text {min}(fn(seg),fp(seg)) \end{aligned} \end{aligned}$$
    (4)
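
To make these definitions concrete, the following minimal Python sketch (using NumPy; the function name and the binary activity-matrix representation are our own choices, and degenerate cases such as empty references are not handled) computes the segment-based F-score and error rate:

```python
import numpy as np

def segment_based_metrics(ref, est):
    """Segment-based F-score and error rate.

    ref, est: binary arrays of shape (num_segments, num_classes), where 1
    marks an event class annotated (ref) or predicted (est) as active in a
    segment.
    """
    tp = np.sum((ref == 1) & (est == 1))          # true positives
    fp = np.sum((ref == 0) & (est == 1))          # false positives
    fn = np.sum((ref == 1) & (est == 0))          # false negatives

    precision = tp / (tp + fp)                    # P, Eq. (2)
    recall = tp / (tp + fn)                       # R, Eq. (2)
    f_score = 2 * precision * recall / (precision + recall)   # Eq. (1)

    # Per-segment insertions, deletions and substitutions, Eq. (4)
    fp_seg = np.sum((ref == 0) & (est == 1), axis=1)
    fn_seg = np.sum((ref == 1) & (est == 0), axis=1)
    insertions = np.maximum(0, fp_seg - fn_seg)
    deletions = np.maximum(0, fn_seg - fp_seg)
    substitutions = np.minimum(fp_seg, fn_seg)

    n_active = ref.sum(axis=1)                    # N(seg)
    error_rate = (insertions.sum() + deletions.sum()
                  + substitutions.sum()) / n_active.sum()      # Eq. (3)
    return f_score, error_rate
```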

2.2 Datasets

In this part, several commonly used acoustic event detection databases are listed in Table 5. Table 5 shows the number of acoustic event classes, the acoustic event segments and the corresponding references. A detailed description of the acoustic event detection databases can be found in Footnote 1.

Table 5 Some commonly adopted acoustic event detection databases

3 Strongly Labeled Acoustic Event Detection

For strongly labeled acoustic event detection, the acoustic event types and the event localizations are annotated in the training set. The task is to detect the acoustic event types and the happening times given an audio stream during testing.

The inputs of the AED system are the acoustic features \({{\varvec{X}}}_t\), and the acoustic features of each frame are associated with one output label vector, which can be written in binary format as:

$$\begin{aligned} {{\varvec{y}}}_t = \{y_{t,1},y_{t,2},\ldots ,y_{t,e},\ldots ,y_{t,E}\} \end{aligned}$$
(5)

where \(y_{t,e}\) is equal to 1 when the eth event type is active at time index t and is set to 0 otherwise. E is the total number of acoustic event types of interest.

The training space \(\varOmega _{strong}\) for the strongly labeled AED system training can be expressed as:

$$\begin{aligned} \varOmega _{strong} = \{{{\varvec{X}}}_t,{{\varvec{y}}}_t\} \end{aligned}$$
(6)
Fig. 1
figure 1

The flowchart of the DNN-based strongly labeled acoustic event detection system

Figure 1 shows the general flowchart of the DNN-based strongly labeled AED system. As shown in Fig. 1, each frame corresponds to one input feature vector \({{\varvec{X}}}_t\) and one output training label \({{\varvec{y}}}_t\). The neural network classifier is trained in a supervised way and outputs continuous values representing the probability that each frame belongs to each of the event classes of interest. The binary cross-entropy function [17] is adopted as the training criterion, which can be expressed as:

$$\begin{aligned} L = -q \times log(p) - (1-q) \times log(1-p) \end{aligned}$$
(7)

where q is the target probability from the training database and p is the estimated probability that the current frame belongs to a certain event type. The q is equal to 1 if the corresponding event type is active in the ground truth label of the current frame, and p is the sigmoid output of the deep neural network.

During testing, with the trained acoustic model and the given test audio stream, each time index t will correspond to E output probability predictions, which are expressed as:

$$\begin{aligned} \hat{{\varvec{y}}}_{t} = \{\hat{y}_{t,1},\hat{y}_{t,2},\ldots ,\hat{y}_{t,e},\ldots ,\hat{y}_{t,E}\} \end{aligned}$$
(8)

where \(\hat{y}_{t,e}\) represents the probability that the current frame t belongs to the eth event type. Afterward, a global threshold \(\tau \), which is empirically set, is applied to \(\hat{{\varvec{y}}}_t\). Event classes with a higher probability than the global threshold are detected as the final active acoustic events.
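
As an illustration of this training and thresholding scheme, the following minimal PyTorch sketch (the feature dimension, layer sizes, optimizer and threshold value are our own assumptions, not those of any cited system) maps each frame of acoustic features to E sigmoid outputs trained with the binary cross-entropy of Eq. (7) and thresholds the predictions of Eq. (8) during testing:

```python
import torch
import torch.nn as nn

NUM_FEATURES = 40   # e.g., Mel-band energies per frame (assumed value)
NUM_EVENTS = 6      # E, the number of event classes of interest (assumed)

# Frame-wise DNN classifier: acoustic features -> E sigmoid outputs
model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, NUM_EVENTS), nn.Sigmoid(),
)
criterion = nn.BCELoss()          # binary cross-entropy of Eq. (7), per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, labels):
    """features: (batch, NUM_FEATURES); labels: binary 0/1 float targets y_t of Eq. (5)."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def detect(features, tau=0.5):
    """Apply an empirically set global threshold tau to the predictions of Eq. (8)."""
    with torch.no_grad():
        return (model(features) >= tau).int()   # active event classes per frame
```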

3.1 CNN in Strongly Labeled AED

The flowchart in Fig. 1 can be applied to the CNN-based strongly labeled AED when the DNN classifier is replaced by the CNN classifier in the training process. Figure 2 shows the training process of the CNN-based strongly labeled AED system. The convolutional neural network model structure includes convolutional layers, max-pooling layers, a flattening layer and a sigmoid output layer. The convolution operation performs the high-level feature extraction. The sub-sampling is performed by max-pooling operations carried out over the entire sequence length. Typically, the ReLU or the sigmoid activation function is used for the kernels. As there may be more than one acoustic event happening at the same time index, a sigmoid output layer composed of fully connected neurons is used. The binary cross-entropy is adopted as the loss function in training.

Fig. 2
figure 2

The training process of the CNN-based strongly labeled acoustic event detection system (Figure extracted from [88])
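
A minimal sketch of such an architecture is given below (the layer counts, filter numbers and input patch size are illustrative assumptions of our own, not the configuration of [88]):

```python
import torch.nn as nn

NUM_EVENTS = 6   # E, the number of event classes (assumed)

# Illustrative CNN classifier; the input is a single-channel time-frequency
# patch of 64 frames x 40 Mel bands (sizes are assumptions).
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),   # convolutional layer
    nn.MaxPool2d(2),                                         # max-pooling (sub-sampling)
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                            # flattening layer
    nn.Linear(64 * 16 * 10, NUM_EVENTS),                     # fully connected neurons
    nn.Sigmoid(),                                            # multi-label sigmoid output
)
# As in Sect. 3, the network is trained with the binary cross-entropy loss.
```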

3.2 RNN in Strongly Labeled AED

The flowchart in Fig. 1 can likewise be applied to the RNN-based strongly labeled AED when the DNN-based classifier is replaced with the RNN classifier. The RNN classifier is adopted in order to utilize the long context information.

Fig. 3
figure 3

Training process for the RNN-based strongly labeled acoustic event detection. The current hidden layer output depends on both the input and the previous hidden neurons, which effectively utilizes the long context information

Figure 3 shows the basic concept of the RNN training process. As shown in Fig. 3, the current hidden layer output depends on both the input and the previous hidden neurons. Multiple RNN hidden layers are stacked on top of each other. The hidden state of layer l at time t is computed from the output of the layer below and its own previous state as:

$$\begin{aligned} h^{l}_t = H(h_t^{l-1},h_{t-1}^l)\ (1 \le l \le L) \end{aligned}$$
(9)

Here, \(h^{0}_t\) denotes the network input (the acoustic feature vector) when l equals 0, and L is the number of hidden layers. The output of the RNN is expressed as:

$$\begin{aligned} \mathbf o = W_{h,y}h^L_T + b_y \end{aligned}$$
(10)

where \(W_{h,y}\), \(h^L_T\) and \(b_y\) are the weight parameters between the last hidden layer and the output layer, the last hidden layer output and the bias, respectively. Afterward, \(\mathbf o \) is passed through a sigmoid layer to obtain the predicted probabilities \(\hat{{{\varvec{y}}}}_t\).
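
A minimal PyTorch sketch of such a stacked recurrent classifier is shown below (the GRU cell, layer sizes and the use of frame-wise outputs are our own illustrative choices):

```python
import torch
import torch.nn as nn

class RNNEventDetector(nn.Module):
    """Stacked recurrent layers (Eq. 9) followed by a linear output layer and a
    sigmoid (Eq. 10). Unlike Eq. (10), which uses the last hidden state, frame-wise
    outputs are produced here so that every time index t yields a prediction."""

    def __init__(self, num_features=40, hidden_size=128,
                 num_layers=2, num_events=6):
        super().__init__()
        # h_t^l = H(h_t^{l-1}, h_{t-1}^l); a GRU realizes the function H
        self.rnn = nn.GRU(num_features, hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, num_events)   # W_{h,y} h + b_y

    def forward(self, x):
        # x: (batch, time, num_features) acoustic feature sequence
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))                # per-frame probabilities
```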

3.3 CRNN in Strongly Labeled AED

Figure 4 shows the training process of the CRNN-based strongly labeled AED system. The testing process is the same as the testing process in Fig. 1. As shown in Fig. 4, three function blocks are used for the training, namely the convolution layer block, the recurrent layer block and the feed-forward layer block. The convolution layer block extracts high-level features with the acoustic features of consecutive frames as the input. The stacked features from the convolutional and max-pooling layers are then fed to the recurrent layer block. The feed-forward layer block with sigmoid activation function is used as the output layer for the classification, and the cross-entropy is adopted as the loss function, which is expressed as:

$$\begin{aligned} J_{CE}({{\varvec{W}}},{{\varvec{b}}}) = -\sum _{t}\sum _{e}\log v_{e,t}^{L} \end{aligned}$$
(11)

Here \(v_{e,t}^{L}\) is the probability estimated from the neural network \(P_{NN}(e|{{\varvec{X}}}_t)\), which is the RNN output in the training process.

Fig. 4
figure 4

Training process of the CRNN-based strongly labeled AED. The extracted acoustic features of consecutive frames are fed to the convolutional layers. Stacked outputs of the convolutional layers are fed to the recurrent network, activations of which act as the inputs to the feed-forward layers

During testing, with the trained acoustic neural network model and the given test audio stream, each time index t will correspond to E output probabilities, which are expressed as:

$$\begin{aligned} \hat{{\varvec{y}}}_{t} = \{\hat{y}_{t,1},\hat{y}_{t,2},\ldots ,\hat{y}_{t,e},\ldots ,\hat{y}_{t,E}\} \end{aligned}$$
(12)

where \(\hat{y}_{t,e}\) represents the probability that the current frame t belongs to the eth event type.
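
A minimal PyTorch sketch of the three blocks is given below (filter numbers, pooling sizes and the recurrent layer width are our own assumptions, not the settings of [4, 59]):

```python
import torch
import torch.nn as nn

class CRNNEventDetector(nn.Module):
    """The three function blocks of Fig. 4: convolutional, recurrent and
    feed-forward. Filter counts and pooling sizes are illustrative assumptions."""

    def __init__(self, num_mels=40, num_events=6):
        super().__init__()
        # Convolution layer block: high-level feature extraction; pooling is
        # applied along the frequency axis so every frame keeps an output.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 4)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 4)),
        )
        # Recurrent layer block over the stacked convolutional features.
        self.rnn = nn.GRU(64 * (num_mels // 16), 128, batch_first=True)
        # Feed-forward layer block with sigmoid activation for classification.
        self.out = nn.Linear(128, num_events)

    def forward(self, x):
        # x: (batch, 1, time, num_mels) consecutive frames of acoustic features
        z = self.conv(x)                          # (batch, 64, time, num_mels//16)
        z = z.permute(0, 2, 1, 3).flatten(2)      # stack features per frame
        h, _ = self.rnn(z)
        return torch.sigmoid(self.out(h))         # per-frame probabilities, Eq. (12)
```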

4 Weakly Labeled Acoustic Event Detection

Weakly labeled acoustic event detection has recently become a hot research topic [4, 74]. Only the presence of the acoustic events is annotated in each audio segment, which makes the acoustic event detection more challenging. Let \({{\varvec{R}}} = \{R_s:s=1\rightarrow N\}\) be the collection of audio recordings and \({{\varvec{EV}}}_s =\{EV_{s1},EV_{s2},\ldots ,EV_{se},\ldots ,EV_{sE}\}\) be the corresponding acoustic events present in the sth audio recording. Here, N and E denote the number of audio recordings and the number of event types of interest, respectively. For each \(R_s\), the presence of the acoustic events \({{\varvec{EV}}}_s\) is annotated, but the localization of each event in \(R_s\) is not (a weak label for each acoustic event).

The AED system inputs are the acoustic features of a certain audio stream \({{\varvec{X}}}_s\), and the training outputs are the recording-wise labels \({{\varvec{EV}}}_s\) rather than the frame-wise labels \({{\varvec{y}}}_t\). The training space \(\varOmega _{\mathrm{weak}}\) for the weakly labeled AED system can be expressed as:

$$\begin{aligned} \varOmega _{\mathrm{weak}} = \{{{\varvec{X}}}_s,{{\varvec{EV}}}_s\} \end{aligned}$$
(13)

Here s is the audio recording index and \({{\varvec{X}}}_s\) is the acoustic feature vector for the sth audio recording. The \({{\varvec{EV}}}_s\) are the training output labels represented as binary vectors.

Although the localization of the active acoustic events is not known in the training set for the weakly labeled AED, the task of the weakly labeled AED is exactly the same as that of the strongly labeled AED, which is to predict both the types and the localizations of the active acoustic events in the test audio stream.

Since the happening times of the present acoustic events are not known in the training set, it is impossible to use audio segments that contain only the events of interest to train the acoustic models in a supervised way. In this section, how to train the acoustic models using the weakly labeled data to perform the acoustic event detection task is elaborated.

4.1 Multiple Instance Learning for Weakly Labeled AED

One common technique to perform the weakly labeled AED is to treat the task as a multiple instance learning problem [20, 42], in which labels are known only for collections of instances rather than for individual instances.

4.1.1 Multiple Instance Learning

Multiple instance learning is based on bag–label pairs rather than the instance–label pairs. Here, the “bag” is a collection of instances. Two types of bags, namely the positive bag and the negative bag, are used in the training process. The positive bags contain at least one positive instance which belongs to the target class to be classified. The negative bags are only collections of negative instances.

Let the bag–label pairs be \((B_s,Y_s)\), where \(B_s\) and \(Y_s\) denote the sth bag and the label assigned to this bag. Each bag \(B_s\) contains multiple instances \(a_{sj}\), where j ranges from 1 to \(n_s\) and \(n_s\) is the number of instances in \(B_s\). The bag \(B_s\) can be expressed as:

$$\begin{aligned} B_s = \{a_{sj}: j = 1 \rightarrow n_s\} \end{aligned}$$
(14)

If all the instances in \(B_s\) are negative, the label for \(B_s\) is -1. The label for \(B_s\) is 1 if there is at least one positive instance in \(B_s\). The label \(Y_s\) for the bag \(B_s\) can be expressed as:

$$\begin{aligned} Y_s = \max _{1 \le j \le n_s}\{y_{sj}\} \end{aligned}$$

where \(y_{sj}\) denotes the actual label for the jth instance in the sth bag.

4.1.2 Multiple Instance Learning for Weakly Labeled AED

Figure 5 shows the general flowchart of the multiple instance learning-based weakly labeled AED. As shown in Fig. 5, the bag construction, model training and the event localization constitute the AED system.

Fig. 5
figure 5

The weakly labeled AED system based on the multiple instance learning

  • Bag construction

    To perform the weakly labeled AED, each audio recording \(R_s\) and its label \(EV_{s}\) can be treated as one bag and the corresponding bag label. The audio recording \(R_s\) is segmented into a number of short sub-segments \(\{R_{s,1},R_{s,2},\ldots ,R_{s,k},\ldots ,R_{s,K}\}\), where k and K denote the sub-segment index and the total number of instances in the bag, respectively.

  • Model training

    In the training process, the conventional instance-level-based loss function is replaced by the multiple instance learning-based loss function due to the fact that only the bag-level labels are provided in the training set. The multiple instance learning-based loss function can be expressed as:

    $$\begin{aligned} J_{\mathrm{loss}} = \sum _{s = 1}^{s = S}J_{s,\mathrm{loss}} \end{aligned}$$
    (15)

    where \(J_{s,\mathrm{loss}}\) is the loss of the bag \(B_s\), which is computed as:

    $$\begin{aligned} J_{s,\mathrm{loss}} = \frac{1}{2}\left( \max _{1 \le j \le n_s}o_{s,j}-d_s\right) ^2 \end{aligned}$$

    where \(o_{s,j}\) is the neural network output for the jth instance in the sth bag and \(d_s\) is the manually annotated bag label for the sth bag. It is worth noting that the weight parameters are updated only after all instances in a bag have been fed forward through the network. The process is continued until the overall divergence falls below a desired tolerance or the maximum number of iterations has been reached. By training the acoustic models based on the bag–label pairs, the trained model outputs both the instance-wise probabilities \(o_{sj}\) for each instance in the bag and the bag labels when a maximal-scoring strategy is applied to the instance-wise probabilities. A minimal sketch of this bag-level loss and of the subsequent localization step is given after this list.

  • Event localization

    Once the training is complete, the trained models can classify the individual instances by outputting the instance probability \(o_{s,j}\), which means the constructed system can detect the presence of an event not only in a test recording as a whole but also in its individual segments. In order to perform the localization task, the test recording RT is first split into K short sub-segments \(\{RT_1, RT_2,\ldots ,RT_K\}\). The localization of the sub-segment \(RT_{k}\) is between \((k-1)\times l^{'}\) and \((k-1)\times l^{'} + l\), where l is the length of each segment in seconds and \(l^{'}\) is the shift (hop) between consecutive segments, so that consecutive segments overlap by \(l - l^{'}\). If one acoustic event is detected in \(RT_k\), then the detected acoustic event is localized between \((k-1)\times l^{'}\) and \((k-1)\times l^{'} + l\).
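
To make the bag-level training and the localization step concrete, the following minimal Python sketch (assuming PyTorch tensors; the detection threshold and variable names are our own) implements the per-bag loss \(J_{s,\mathrm{loss}}\) and the segment-to-time mapping described above; in practice, the per-bag losses are summed over all bags as in Eq. (15):

```python
import torch

def bag_loss(instance_outputs, bag_label):
    """Loss for one bag B_s (Sect. 4.1.2). instance_outputs is a 1-D tensor of
    network outputs o_{s,j} for the instances (sub-segments) of the bag, and
    bag_label is the annotated bag label d_s. The bag score is the maximum
    instance score, and the squared error against d_s gives J_{s,loss}."""
    bag_score = instance_outputs.max()
    return 0.5 * (bag_score - bag_label) ** 2

def localize_events(instance_probs, hop, seg_len, threshold=0.5):
    """Map instance-level detections back to time. Sub-segment k (1-indexed)
    covers [(k-1)*hop, (k-1)*hop + seg_len) seconds; the detection threshold
    is an assumed value."""
    return [((k - 1) * hop, (k - 1) * hop + seg_len)
            for k, p in enumerate(instance_probs.tolist(), start=1)
            if p >= threshold]
```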

4.2 Variational Deep Learning Approaches in Weakly Labeled AED

Deep learning approaches from the strongly labeled AED can also be applied to the weakly labeled AED with different scoring or training strategies.

  • Scoring Strategy

    In [45], the authors used a global-input CNN model and separated-input models to perform the weakly labeled AED. The global-input CNN model takes the spectrogram as the input and the bag labels as the output. The separated-input models are trained using n-second waveform segments as the input. All the short sub-segments that make up the audio recording \(R_s\) are assigned the same label (the bag label). As the global-input CNN model is expected to perform better than the separated-input models because it uses the correct label, its predictions are spread evenly over the sub-segments and then averaged with the predictions of the separated-input models. The work of [16] also proposed to use the sub-segments of each audio recording to train separated acoustic models.

  • Training Strategy

    In [89], the authors proposed a gated convolutional recurrent neural network (GCRNN) for the weakly labeled AED. In that work, in order to obtain the localization information, an additional feed-forward neural network with softmax as the activation function is introduced. The pooling operation was applied only in the frequency domain rather than the time domain, to keep the time resolution of the whole audio spectrogram. The authors in [3] also adopted the same strategy by performing the max-pooling along the frequency domain to preserve the input time resolution. During training, the weak labels help control the learning of strong labels by weighting the loss at the weak and the strong outputs differently, as illustrated by the sketch after this list.
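
As a minimal illustration of this strategy (our own sketch, not the exact formulation of [3, 89]), frame-level probabilities from a network whose pooling preserves the time resolution can be aggregated into a clip-level prediction, so that the weak label supervises training while the frame-level outputs remain available for localization:

```python
import torch
import torch.nn.functional as F

def weak_label_loss(frame_prob, weak_label):
    """frame_prob: (batch, time, num_events) frame-level probabilities from a
    network that pools along the frequency axis only, preserving the time
    resolution. weak_label: binary (batch, num_events) clip-level annotation.
    A simple max over time yields the clip-level prediction; the cited works
    use more elaborate aggregation (e.g., an additional softmax-activated
    localization branch)."""
    clip_prob = frame_prob.max(dim=1).values
    return F.binary_cross_entropy(clip_prob, weak_label)
```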

5 Discussions

The recent advances in neural network-based deep learning for AED are reviewed in this paper. The introduction of neural network-based deep learning approaches undoubtedly has boosted the acoustic event detection performance. In this section, we briefly discuss the key reasons behind the success of the neural network-based deep learning methods and several potential issues for further consideration in the area of acoustic event detection.

5.1 Benefits of Deep Learning

Here, we list three main advantages of the neural network-based deep learning in AED as follows:

  • Effective Representations Neural network-based deep learning approaches in acoustic event detection can learn more comprehensive and informative representations from the raw data, which greatly benefits the detection when the large variety of event classes and the noise-like characteristics of the acoustic events challenge the detection performance. Intraclass variations and spectral–temporal properties across classes pose further challenges to acoustic event detection, and the deep learning approaches deal with these challenges effectively. The state-of-the-art acoustic event detection system [59] adopted the CNN to extract high-level information from the spectrogram, with a subsequent recurrent neural network to utilize the long context information. Due to this effective learning and representation, the neural network-based deep learning approaches greatly improve and extend the frontier of acoustic event detection.

  • Powerful Relationship Modeling With their various types of activation functions, neural networks have the ability to model nonlinear and complicated relationships between inputs and outputs. This is a great advantage for dealing with natural signals sampled from real-world scenarios and for predicting unseen data. In real-world acoustic event detection, some acoustic events sound similar but are actually different, such as "car" and "bus". The deep learning approaches can learn the inherent relationship between different event classes rather than training the acoustic models on one specific event class as in the conventional machine learning methods. That is one key reason why the deep learning approaches achieve higher detection performance than conventional machine learning methods, such as GMM and SVM, in recent acoustic event detection campaigns [52].

  • Flexible Setting of Networks Neural network-based deep learning approaches can be applied to the AED task using flexible architectures with diverse combinations of different neural networks. The deep learning approaches in the strongly labeled acoustic event detection system, where the training process is instance–label pair-based, can be flexibly transferred to the weakly labeled acoustic event detection with some variations. The works [3, 89] applied the CRNN from the strongly labeled acoustic event detection to the weakly labeled acoustic event detection by pooling only along the frequency axis and keeping the time resolution fixed, thus performing the detection task when the annotations are weakly labeled. The authors in [16, 45] similarly adopted the CNN structures from the strongly labeled acoustic event detection to perform the weakly labeled acoustic event detection by splitting the audio recordings into multiple segments and training global-input and separated-input models, respectively.

5.2 Future Issues

Although the neural network-based deep learning approaches have been successfully applied to the acoustic event detection task, there are still some issues that need to be resolved in order to further extend the frontier of acoustic event detection. Here, we list two main challenges facing the deep learning-based acoustic event detection as follows:

  • Weakly Labeled and Imbalanced Training Data A powerful neural network is always associated with a large amount of training data. However, in the area of acoustic event detection, audio recordings can be obtained easily while the annotation process is always expensive, especially when the precise localizations of the polyphonic acoustic events need to be labeled, which leads to the weakly labeled data problem. The other problem is imbalanced training data. For some acoustic events, such as "brake squeaking", the audio collection process is not as easy as that of common events such as "people speaking" and "car". How to effectively deal with limited and imbalanced training data using neural network-based deep learning approaches remains an open challenge for acoustic event detection.

  • Hyper-parameters and Architectures High-performance acoustic models are associated with good neural network structures and fine-tuning strategies. The deep learning models are influenced by various aspects, such as the network topology, the training method and the hyper-parameters. In the area of acoustic event detection, the large number of event classes and the uncertainties when different acoustic events overlap with each other require the deep learning approaches to handle the hyper-parameters and the neural network structures carefully in order to avoid common traps, such as over-fitting or poor local optima.

6 Conclusion

In this paper, the recent neural network-based deep learning approaches on the acoustic event detection task are reviewed. Two types of acoustic event detection, namely the strongly labeled acoustic event detection and the weakly labeled acoustic event detection, are first introduced with subsequent elaboration on different deep learning approaches applied to these two acoustic event detection tasks.

Neural network-based deep learning approaches have demonstrated remarkable success in the acoustic event detection task and have outperformed other conventional machine learning techniques. Meanwhile, advances in hardware, such as high-performance GPUs, have also accelerated the development of the deep learning approaches in the acoustic event detection task. However, there are still many challenges, such as limited training data and hyper-parameter fine-tuning, facing the deep learning-based acoustic event detection.