1 Introduction

Over the past few years, many researchers have been working on developing sound-based surveillance tools to automatically detect environmental sounds [1,2,3,4]. Developing sound surveillance systems is a popular research field due to its potential benefits in both public and private environments. Recently, some efforts have been directed toward systems capable of detecting and classifying these sounds [5]. Environmental sounds occur in domestic, business, and outdoor environments, and most investigations concentrate on a restricted domain. For example, a system capable of recognizing sounds in a specific indoor environment may be of great importance for monitoring and security applications [4, 6]. These functionalities can also be used in portable assistive devices to alert disabled and elderly persons with hearing impairment to specific sounds such as doorbells, alarm signals, etc.

Recently, sound event recognition (SER) has gained significant interest due to its wide applications in multimedia context analysis and automated audio surveillance [2, 3]. In automated surveillance, audio sensors play a vital role at night when compared to video cameras [7, 8]. SER is important for classifying environmental sounds as normal or abnormal events. For instance, door slam, knock, laughter, and coughing sounds are grouped as normal sounds, whereas suspicious events such as glass breaking and screaming are considered abnormal sounds. The SER task also helps robots and smart cars recognize their context or environment [9, 10].

Some of the challenges related to SER include the following: the existence of multiple sound sources (overlapping or polyphonic events); the recognition of confusable sound events; and the lack of compact representation techniques for sound events. These challenges increase the complexity of learning acoustic events and complicate real-time automated surveillance systems. Recent SER systems therefore focus on learning representations that can accurately capture the characteristics of a given sound event. Various audio features have been proposed for sound event recognition tasks in different applications.

The most widely used handcrafted sound features are Mel-Frequency Cepstral Coefficients (MFCCs). On the other hand, spectrogram-based visual features are extracted by transforming the sound signal into its two-dimensional time-frequency representation. Some of the visual features used for the recognition task are the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Histogram of Oriented Gradients (HOG), and Local Binary Pattern (LBP). Previous studies have analyzed the performance of audio and visual features in automatic speech recognition (ASR) [11] and Music Information Retrieval (MIR) [12] tasks.

In this paper, we focus on forming a multi-view representation by combining visual features extracted from spectrograms with the well-known MFCC features. MFCCs essentially capture nonlinear information from the power spectrum of the signal. HOG-based features are invariant to small time and frequency translations; they encode the local direction of variation of the power spectrum, which is not provided by MFCCs [13, 14]. Hence, we propose a multi-view representation that combines the advantages of both MFCC and HOG features. In this work, we propose two variants of the multi-view representation. The first variant combines HOG features extracted from the spectrogram with cepstral features such as MFCCs. The second variant combines statistical features computed from the spectrogram with the MFCCs extracted from the sound signal. The multi-view representations are then fed as input to a discriminative classifier such as a Support Vector Machine (SVM) to recognize the given sound signal.

The rest of this paper is organized as follows. Related work on sound event representations is briefly presented in Sect. 2. In Sect. 3, we describe the proposed multi-view representation approaches for the SER task. In Sect. 4, we present experimental studies and discussion, and Sect. 5 concludes the paper.

2 Related work

In recent studies, a lot of work has been done on feature learning and recognition of sound events [15,16,17,18,19,20,21]. Cakir et al. [15] studied three types of features, namely Mel-Frequency Cepstral Coefficients (MFCCs), Mel-band energies, and log Mel-band energies. Kim and Kim [22] proposed segmental 2-D MFCCs, which rely on the cosine transform. Lim et al. [23] proposed a bag-of-audio-words approach in order to capture the universal characteristics of an environmental sound event. Eronen et al. [24] combined frequency-domain and time-domain features such as Zero Crossing Rate (ZCR), spectral centroid, spectral roll-off, short-time average energy, MFCCs, and linear prediction coefficients to classify 24 contexts using hidden Markov models (HMMs). Chu et al. [25] proposed the use of Matching Pursuit (MP) to obtain effective time- and frequency-based features; the MP-based features were then combined with MFCC-based features for acoustic event recognition. Ye et al. [26] incorporated local statistics such as the mean and standard deviation of local pixels to establish a robust Local Binary Pattern (LBP). In addition, the L2-Hellinger normalization method was applied to the proposed features to further increase their robustness and discriminative power.

The choice of a compact representation influences the outcome of any learning task. Most recent methods convert segments of acoustic signals into spectrograms. A spectrogram is a visual time-frequency representation of a sound signal. Some of the visual features extracted from spectrograms are the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Histogram of Oriented Gradients (HOG) [14, 27]. A few other methods extract sound features such as Mel-Frequency Cepstral Coefficients, the Constant-Q chromagram, and spectral flatness directly from raw sound signals. The extracted feature vectors are then used as sound event descriptors to train the model.

Feroze et al. [28] proposed a method using features such as loudness, MFCCs, and perceptual linear predictive (PLP) features. Their experimental studies concluded that PLP-based features outperformed MFCC-based features for sound event recognition tasks; however, MFCCs are generally preferred over other audio features. Jayalakshmi et al. [6] proposed an approach based on statistical moments computed from MFCC features with a Support Vector Machine (SVM) classifier. This approach outperformed generative model-based classifiers such as the Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) for sound event recognition. A system submitted to the DCASE2016 challenge for sound event recognition in synthetic audio (Task 2) was based on semi-supervised non-negative matrix factorization (NMF) combined with a mixture of local dictionaries (MLD).

Fig. 1 Block diagram of the proposed multi-view representation (MVR) approach for the sound event recognition (SER) task

The system proposed by Li et al. [29] consists of two main steps: deep audio feature (DAF) extraction and bidirectional long short-term memory classification. MFCCs were extracted from each frame of audio, and DAF features were learnt using deep neural networks. Finally, a combination of LSTMs and bidirectional recurrent neural networks was used for classification, which showed a moderate performance improvement over existing systems. Another deep network was built by combining a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) [30]. CNNs have been shown to be robust to local and temporal spectral variations and capable of extracting high-level features, while RNNs can learn longer-term temporal context; the two are combined into a CRNN for polyphonic sound event recognition. Recently, Yu et al. [31] proposed a system to classify audio events from EEG signals by monitoring the brain activity of participants. In this work, we focus on sound event recognition using the complementary information present in two different modalities of sound signals.

3 Multi-view representation for sound event recognition

The performance of machine learning approaches depends predominantly on a compact data representation, and effective recognition of sound events depends significantly on the representations derived from sound samples. However, single-view techniques often fail to capture the significant discriminative patterns of sound classes in unconstrained environments, because the content of a sound signal depends highly on the context of the environmental scene. Therefore, we focus on combining multiple modes (views) of the input that complement each other and generalize to an effective representation. In general, multi-view representation aims to combine the multiple views into a single, compact representation that exploits the complementary knowledge contained in those views to comprehensively represent the data.

This paper aims to utilize multiple views of sound event data and to propose a new MVR-based system that yields better results when given as input to traditional shallow models instead of data-hungry deep models. We propose two variants of the Multi-View Representation (MVR)-based approach for the SER task. The first variant uses auditory image-based visual features extracted from spectrograms as one view and cepstral features extracted from sound signals as the other. The second variant uses auditory image-based statistical features together with the cepstral features of the sound signal. In addition to these two variants, we have also explored Constant-Q Transform (CQT) and Variable-Q Transform (VQT) image-based visual features for multi-view representations. The block diagram of the proposed approach is given in Fig. 1.

3.1 Visual features from auditory images

A spectrogram represents the spectrum of frequencies of a signal as it varies with time. Spectrograms are used extensively in music and speech processing. A spectrogram is depicted as an image in which intensity is shown by varying color and brightness.
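For illustration, the following is a minimal sketch of how such a spectrogram image could be generated with librosa; the file name, sampling rate, and frame sizes are assumptions and not settings reported in this paper.

```python
import numpy as np
import librosa

# Load one sound event clip (file name and sampling rate are illustrative).
y, sr = librosa.load("keyboard_example.wav", sr=44100)

# Short-time Fourier transform; n_fft and hop_length are typical values,
# not the settings used in this work.
stft = librosa.stft(y, n_fft=1024, hop_length=512)

# Log-magnitude spectrogram: the auditory image used as the visual input.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)  # (frequency bins, time frames)
```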

Figures 2, 3, and 4 illustrate the spectrograms of the 'Keyboard', 'Cough', and 'Laugh' sound event classes, respectively, from the DCASE2016 Task 2 dataset. It can be observed from Fig. 2 that the spectrograms generated for sound signals of the same sound class do not vary much and look similar. In contrast, for the 'Cough' and 'Laugh' sounds, which are acoustically similar but belong to different sound classes, the corresponding spectrograms are dissimilar, as shown in Figs. 3 and 4. Even though these two sound classes sound similar (overlapping), their spectrograms show discrimination. This helps to reduce the overlap between two different but acoustically overlapping sound classes, leading to improved performance.

We explore two types of features, namely the Histogram of Oriented Gradients (HOG) and statistical features. In general, several key points can be extracted from spectrogram images. They provide a temporal analysis of the sound event, thus giving an auditory image that is easier to interpret. One of the visual feature descriptors popularly used in computer vision and object recognition is the histogram of oriented gradients (HOG). This method counts occurrences of gradient orientations in localized portions of an image and describes the local characteristics of a given spectrogram image in terms of gradient directions. HOG can be used to capture the modulation of the scene along the temporal axis. The steps involved in computing HOG are as follows: (1) the gradient of either the spectrogram or the constant-Q transform representation is calculated; (2) the angles of all pixel gradients are calculated; (3) non-overlapping cells of the image are formed; (4) each cell histogram is normalized based on the histograms of its neighbors. Finally, filtering and pooling are performed to optimize the HOG descriptor.
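As a rough sketch (not the authors' exact implementation), the HOG descriptor of a spectrogram image can be computed with scikit-image. The 16*16 cell size and 8 orientations follow the settings reported in Sect. 4.3; the block size and the image resizing step are assumptions.

```python
from skimage.feature import hog
from skimage.transform import resize

# spectrogram_db is the 2-D log-magnitude spectrogram from the previous sketch.
# Resizing to a fixed shape (an assumption) keeps the HOG descriptor length
# identical across clips of different durations.
image = resize(spectrogram_db, (128, 128))

hog_vector = hog(
    image,
    orientations=8,            # as reported in Sect. 4.3
    pixels_per_cell=(16, 16),  # cell size as reported in Sect. 4.3
    cells_per_block=(1, 1),    # assumed; not specified in the paper
    feature_vector=True,
)
print(hog_vector.shape)
```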

Fig. 2 Spectrograms of two samples of the 'Keyboard' sound class from the DCASE2016 Task 2 dataset

Fig. 3 Spectrogram of the 'Cough' sound

Fig. 4 Spectrogram of the 'Laugh' sound

Apart from spectrograms, other auditory images generated using the Constant-Q Transform (CQT) and Variable-Q Transform (VQT) have also been analyzed. The CQT is given by the following formula [32]:

$$\begin{aligned} x[k]=\frac{1}{L[k]} \sum _{n=0}^{L[k]-1} S[k,n] s[n]e^{-j2\pi \frac{Qn}{L[k]}} \end{aligned}$$
(1)

where 2\(\pi \)Qn/L[k] gives the frequency of the kth component and s[n] is the nth sample of the digitized time-domain signal. S[k,n] represents the window function, which depends on both k and n. Similarly, the Variable-Q Transform (VQT) has been implemented; it is similar to the CQT but allows a different filter bank to be used in each downsampled octave.
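Both transforms are available in librosa; the sketch below is only illustrative, and the hop length, number of bins, and gamma value are assumed defaults rather than settings from this work.

```python
import numpy as np
import librosa

y, sr = librosa.load("keyboard_example.wav", sr=None)  # illustrative file name

# Constant-Q transform: logarithmically spaced bins with a constant Q factor.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84))

# Variable-Q transform: like the CQT, but the gamma parameter changes the
# filter bandwidths, so a different effective filter bank is used per octave.
vqt = np.abs(librosa.vqt(y, sr=sr, hop_length=512, n_bins=84, gamma=20))

# Converted to dB, these matrices serve as alternative auditory images.
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)
vqt_db = librosa.amplitude_to_db(vqt, ref=np.max)
```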

3.2 Cepstral features from sound signals

Handcrafted feature extraction methods have proved to be effective for tasks such as object recognition and audio tagging, and visual features have proved to work well in unconstrained and generalized environments. From this, it can be inferred that some generic audio and visual feature extraction techniques complement each other based on their characteristics. In the cepstral feature extraction, N-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features are extracted from the short-term power spectrum of the sound on a Mel-frequency scale [33]. MFCC features have proved to be effective in many sound-related surveillance applications [3, 12].
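A minimal sketch of the frame-level cepstral feature extraction with librosa is shown below, assuming the 26 coefficients per frame used in Sect. 4.2; the window and hop lengths are library defaults, not values reported here.

```python
import librosa

y, sr = librosa.load("keyboard_example.wav", sr=None)  # illustrative path

# 26 MFCCs per frame (as in Sect. 4.2); frame and hop lengths are librosa
# defaults rather than settings reported in the paper.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26)
print(mfcc.shape)  # (26, number of frames)
```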

In this work, we propose an efficient multi-view representation (MVR) for sound events, as shown in Fig. 1. The proposed multi-view representations combine handcrafted cepstral features with auditory image-based visual features that are complementary to each other. Here we propose two such combinations based on auditory images of sound. The first combines statistical moment-based features with MFCC features: moments are computed from the pixel values along either of the image axes and used as a feature vector. We have experimented with both axes (X axis as well as Y axis) to find which gives better results, and the Y axis proved better than the X axis. In this approach, statistics such as skewness, mean, median, variance, minimum, maximum, and kurtosis are calculated for each column along the X axis or each row along the Y axis. Different sound events often have different sound properties, which change the texture pattern of the auditory images and alter the image intensity. The multi-view representation is then formed by combining the statistical moment-based features with the MFCC features.
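The sketch below is one interpretation of this moment computation, using numpy and scipy: with the spectrogram stored as a (frequency, time) array, pooling over axis 1 yields one statistic per row, which is how we read the "Y axis" variant; the function name is ours, not the authors'.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def spectrogram_moments(spec, axis=1):
    """Seven statistics per row (axis=1) or per column (axis=0) of a 2-D
    auditory image, concatenated into a single feature vector."""
    stats = [
        np.mean(spec, axis=axis),
        np.median(spec, axis=axis),
        np.var(spec, axis=axis),
        np.min(spec, axis=axis),
        np.max(spec, axis=axis),
        skew(spec, axis=axis),
        kurtosis(spec, axis=axis),
    ]
    return np.concatenate(stats)

# One reading of the better-performing 'Y axis' variant: one statistic per
# frequency bin, pooled over time.
moment_features = spectrogram_moments(spectrogram_db, axis=1)
```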

Another variant of the MVR-based approach combines the HOG features with MFCCs. The multi-view representations of the sound signals are given as input to a Support Vector Machine (SVM) to recognize the sound events. In addition to spectrogram images, we have also experimented with images formed using Constant-Q Transform (CQT) and Variable-Q Transform (VQT) coefficients for further analysis.
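A hedged end-to-end sketch of this step with scikit-learn is given below; the feature arrays are assumed to be precomputed per clip, and the RBF kernel, C value, and feature scaling are illustrative choices rather than settings reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed precomputed arrays (one row per clip):
#   X_visual   - HOG or moment-based features from the auditory images
#   X_cepstral - pooled MFCC features from the sound signals
#   y          - sound event class labels
X_mvr = np.hstack([X_visual, X_cepstral])  # multi-view representation

# SVM classifier on the fused representation; kernel and C are illustrative.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_mvr, y)
predicted_labels = clf.predict(X_mvr)
```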

4 Experimental studies and discussions

4.1 Datasets used for studies

We have used sound events from the following three datasets: the Environmental Sound Classification (ESC-50) dataset [12], the DCASE2016 Task 2 dataset [34], and the DCASE2018 Task 2 dataset [35]. Each dataset contains various types of sounds recorded in different environments with different subjects and noise levels.

ESC-50 Dataset This dataset consists of 2000 labeled environmental sound event examples covering 50 classes with 40 instances per class [12]. The data are grouped into 5 major categories of 10 classes each: animal sounds, natural soundscapes and water sounds, human (non-speech) sounds, interior/domestic sounds, and exterior/urban noises. Examples include animal sounds such as dog barking, cow, and frog sounds; outdoor sound events such as rain, sea waves, and fire crackling; human sounds such as snoring and clapping; indoor sounds such as vacuum cleaner, washing machine, and alarm clock; and other sounds such as vehicles, church bells, and hand saws. Fivefold cross-validation is carried out to evaluate the proposed approach.

DCASE2016 Task 2 Dataset This dataset is provided for sound event recognition in synthetic audio [34]. It consists of isolated sound events from 11 sound event classes related to an office environment, with 20 samples per class: clearing throat, coughing, door knock, door slam, drawer, human laughter, keyboard, keys (placed on a table), page turning, phone ringing, and speech. We used all the data in DCASE2016 Task 2 and carried out fivefold cross-validation.

DCASE2018 Task 2 Dataset This dataset contains 41 sound classes [35]. We used the training data of all 41 sound event classes for our experimental studies. Some of the sound events in the dataset are shattering, fireworks, keyboard sounds, and keys jingling. The training set is composed of 9473 examples. Fivefold cross-validation is carried out to evaluate the performance of the proposed approach.
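For context, the fivefold evaluation might look like the sketch below with scikit-learn; the fold construction (a stratified random split) and the classifier settings are assumptions, since the paper and the challenge datasets may define their own folds.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X_mvr and y are the multi-view feature matrix and labels for one dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
scores = cross_val_score(clf, X_mvr, y, cv=cv, scoring="accuracy")
print(scores.mean())  # mean fivefold recognition accuracy
```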

Table 1 Comparison of recognition accuracy (\(\%\)) for ESC-50 dataset

4.2 Feature extraction

To capture the characteristics of sound events, we use 26-dimensional MFCC features as the sound features. For each sound event, we compute statistical features across the 26-dimensional MFCC features: mean, median, minimum, maximum, variance, skewness, and kurtosis. These statistics, computed per coefficient, are concatenated to form a fixed-dimensional representation (26*7). Similarly, for every example, the spectrogram-based HOG visual feature is extracted. In addition, from each spectrogram we extract seven moment-based statistical features, namely mean, median, minimum, maximum, standard deviation, skewness, and kurtosis, which are combined to form a fixed-dimensional vector for each spectral image. CQT and VQT image-based feature extraction is also performed on the given input sound signal. After the feature extraction step, the extracted visual features are combined with the sound features to form a multi-view representation for effective learning.
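A minimal sketch of assembling the 26*7 = 182-dimensional cepstral vector described above is given below, assuming the frame-level MFCC matrix from Sect. 3.2; the helper name is ours, not the authors'.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def mfcc_statistics(mfcc):
    """Pool frame-level MFCCs (shape 26 x n_frames) into a fixed
    26*7 = 182 dimensional vector of per-coefficient statistics."""
    stats = [
        np.mean(mfcc, axis=1),
        np.median(mfcc, axis=1),
        np.min(mfcc, axis=1),
        np.max(mfcc, axis=1),
        np.var(mfcc, axis=1),
        skew(mfcc, axis=1),
        kurtosis(mfcc, axis=1),
    ]
    return np.concatenate(stats)  # 182-dimensional cepstral representation

cepstral_vector = mfcc_statistics(mfcc)
```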

4.3 Performance analysis

Table 2 Comparison of recognition accuracy (\(\%\)) for DCASE2016 Task 2 dataset
Table 3 Comparison of recognition accuracy (\(\%\)) for DCASE2018 Task 2 dataset

In all experiments, MFCC features are used as the basic sound features. The performance of the proposed approach and other conventional approaches is shown in Tables 1, 2, and 3 for the ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2 datasets, respectively. The first method is the spectrogram with HOG-based approach as a single-view representation. In the HOG computation, the cell size is fixed at 16*16, the number of orientations at 8, and the pooling operator is the average operator (pooling over frequency), as given in [39]. The second method uses statistical features in two ways: (i) statistical features computed along the pixel values of every row (Spectrogram + moments (X axis) + MFCC + SVM) and (ii) statistical features computed along the pixel values of every column (Spectrogram + moments (Y axis) + MFCC + SVM).

The proposed Spectrogram + HOG + MFCC-based multi-view representation gives better results than the spectrogram with HOG-based approach. The proposed approach gives accuracies of 72.7\(\%\), 91.3\(\%\), and 80.17\(\%\) for the ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2 datasets, respectively. Even as the size of the dataset increases, the method is consistent in its performance compared to single-view representations. The second approach, based on statistical moments, achieves accuracies of 68.9\(\%\), 83\(\%\), and 80.17\(\%\) for the ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2 datasets, respectively. We can observe that the proposed approaches work better than conventional approaches such as CQT- and VQT-based representations.

Table 4 Confusion matrix of single-view representation-based approach (Spectrogram + HOG + SVM) for DCASE2016 Task 2 dataset
Table 5 Confusion matrix of multi-view representation-based approach (Spectrogram + HOG + MFCC + SVM) for DCASE2016 Task 2 dataset

In the case of the ESC-50 dataset, Table 1 shows the baseline performance reported in [12]. Piczak [36] evaluated the potential of convolutional neural networks with log-scaled Mel-spectrograms for recognizing short-duration environmental sounds. The proposed MVR approach produces a much better performance, with an improvement of 33% in accuracy over the baseline system reported in [12].

Similarly, Table 2 shows the recognition accuracy of the proposed approach and some of the state-of-the-art methods reported in the literature for the DCASE2016 Task 2 sound event dataset. The proposed approach outperforms the baseline system with a 50% increase in classification accuracy. Table 2 also includes some of the systems submitted to the DCASE2016 Task 2 challenge for sound event recognition [34]. The baseline system [37] of DCASE2016 Task 2 uses a dictionary of spectral templates with a supervised NMF approach. The proposed MVR approach outperformed the following systems in the DCASE2016 Task 2 challenge: constant-Q transform (CQT) representations with an RNN classifier, the Gammatone cepstrum with a Random Forest classifier, Bidirectional Long Short-Term Memory (BLSTM) with Mel-filter bank features, and Non-Negative Matrix Factorization with a Mixture of Local Dictionaries (NMF-MLD). From these observations, it is clear that significant improvement can be achieved with simple handcrafted features and shallow models trained on meaningful multi-view representations rather than data-hungry deep feature learning techniques.

The results of the single-view and multi-view representations studied on the three datasets are analyzed with the help of confusion matrices. As an example, in the case of the DCASE2016 Task 2 dataset, the proposed HOG + MFCC-based MVR approach reduces the overlap between classes such as {Human laughter, Page turning} and {Phone ringing, Page turning}, as can be seen in Tables 4 and 5. This leads to slightly improved performance compared to the single-view HOG-based approach.

Table 3 shows the recognition accuracy of the proposed MVR-based approaches and the single-view approaches. There are 11 sound event classes in the DCASE2016 Task 2 dataset and 41 sound event classes in the DCASE2018 Task 2 dataset. As the number of sound classes increases, the MVR using the HOG + MFCC approach outperforms the single-view HOG-only approach, as given in Table 3.

5 Conclusion

In this paper, a Multi-View Representation (MVR)-based approach for sound event recognition has been proposed. The proposed approach combines auditory image-based visual features with cepstral features to form a compact and effective representation. The proposed handcrafted feature-based MVR with a simple shallow model such as an SVM classifier leads to improved performance over other state-of-the-art methods on the ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2 datasets. The proposed approach is particularly suitable for recognizing acoustically similar but different sound classes.