1 Introduction

Sound Event Detection (SED) is an emerging research area with applications in multimedia, IoT, robotics, software modules, and more. Sound detection supports voice assistants, environmental monitoring, border security systems, health systems, machinery fault detection, and similar tasks. At the same time, detecting sounds in varied environments remains a challenging task for researchers. Detection becomes complicated under environmental conditions such as forests, rain, and enclosed spaces, where sound can arrive from any direction and often produces reflections. A better approach is for the system to perform sound localization together with detection. Estimating the direction of arrival alongside sound event detection makes the research more demanding; in applications such as smart homes, robots, and forest monitoring, localization is an essential system parameter. We also need to process polyphonic rather than monophonic sound events, where a monophonic framework yields only a sequence of non-overlapping sound events. Polyphonic SED is capable of recognizing multiple sound events occurring at the same time. The number of sound events active at any instant is not known a priori, which introduces an additional degree of difficulty in detection. A polyphonic SED system requires multi-label classification, which has not been broadly tested in sound data processing tasks. Acoustic sound detection yields better results when machine learning techniques are applied. Models can be developed with supervised or unsupervised learning-based algorithms, and current research strategies are mainly based on supervised learning. This requires different datasets depending on the model to be evaluated, and the training data may be strongly or weakly labeled.

Several datasets are available for sound event detection and localization tasks, catering to various environments and challenges. These include UrbanSound, AudioSet, the DCASE challenge datasets (DCASE 2013–2018), the Freesound Dataset, TUT Sound Events 2017 and 2018, the CHiME-Home dataset, the MIMII dataset for industrial audio anomaly detection [1,2,3,4], the Detection of Sound Events in Urban Areas dataset, BUMD, and others [5,6,7]. These datasets encompass a wide range of real-world sounds, annotated with labels for different sound events, enabling researchers and practitioners to develop and evaluate sound event detection and localization algorithms effectively.

In the classification process, feature extraction becomes very important for obtaining better classification results.

In sound event detection, various feature extraction techniques are employed to capture essential characteristics of audio signals. These techniques include Mel-Frequency Cepstral Coefficients (MFCCs) and log Mel spectrograms, which represent spectral features on a Mel frequency scale. Additionally, methods such as Gammatone filterbank features and auditory spectrograms are utilized to mimic human auditory perception. Deep learning-based approaches, including Convolutional Neural Networks (CNNs), allow features to be extracted directly from raw waveforms, while pre-trained models such as VGGish provide log-mel spectrogram embeddings [8,9,10,11,12,13]. Other techniques, such as the wavelet transform, decompose signals into time–frequency representations, while rhythm-based features capture temporal patterns. Statistical properties, zero-crossing rate, energy distribution, pitch-related information, and harmonic/timbral features further enrich the feature set. Combining these techniques provides a comprehensive representation of audio data essential for accurate sound event detection and localization tasks [14,15,16,17].
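As an illustration of the first two techniques mentioned above, the following minimal sketch (assuming the librosa library and an arbitrary audio file path) extracts a log-mel spectrogram and MFCCs from a clip; the parameter values are illustrative defaults rather than settings from any specific cited work.

```python
# Hedged sketch: MFCC and log-mel spectrogram extraction with librosa.
# The file path and parameter values are illustrative assumptions.
import librosa
import numpy as np

def extract_features(path, n_mels=64, n_mfcc=40, n_fft=2048, hop_length=512):
    y, sr = librosa.load(path, sr=None)                # load audio at native rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # log-mel spectrogram in dB
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return log_mel, mfcc                               # shapes (n_mels, T), (n_mfcc, T)
```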

Existing research has applied different algorithms such as hidden Markov models (HMM), non-negative matrix factorization (NMF), support vector machines (SVM), and random forests. Recent approaches rely on deep learning methods using deep neural networks (DNN), convolutional neural networks (ConvNet), recurrent neural networks (RNN), and convolutional recurrent neural networks (CRNN) [18,19,20,21,22]. These algorithms are implemented with key techniques such as 1D and 2D ConvNets, multilayer CNNs, GCC-PHAT, BiGRU, and LSTM. The sound is segmented, features are extracted with MFCC, Mel spectrogram, RMS, and similar descriptors, and then various classification algorithms are applied.

1.1 Motivation

In the domain of sound event detection research, considerable survey studies have been conducted, primarily focusing on classifying sounds or acoustics across diverse environmental contexts. Despite the wealth of information provided in these studies, there are often certain gaps or missing components that researchers have acknowledged. Recognizing the importance of addressing these limitations, researchers have made efforts to include relevant information based on the specific needs of their studies. In exploring the literature, we examined several survey papers with high citation counts and significant impact. Through this process, we have identified specific limitations in each of these papers, ranging from methodological constraints to gaps in coverage or analysis. In doing so, we have tried to provide a comprehensive overview of the state of the art in sound event detection research, taking into account the challenges highlighted in previous studies.

Gabriel et al. [66] thoroughly categorized the techniques applied in the aforementioned scientific domains, using criteria from the literature to categorize sound source localization systems. Additionally, a comparison between traditional approaches based on propagation models and approaches based on deep learning and machine learning techniques was presented. The potential applications of mathematical relationships, artificial intelligence, and physical phenomena in determining accurate source localization were carefully considered. The paper also emphasizes the importance of these techniques in both military and civilian settings. However, the authors did not focus on datasets and feature extraction techniques.

Dang et al. [69] presented a survey of sound event detection covering deep learning models and the challenges initiated by DCASE 2016 and 2017. The authors mainly focused on the various deep learning models used for sound classification, such as RNN, CNN, and CRNN, and did not cover feature extraction, datasets, or comparisons of results.

Nunes et al. [70] surveyed anomalous sound detection, which determines whether the sounds produced by an object are typical or anomalous. The paper presents a Systematic Review (SR) of research on anomalous sound detection employing machine learning (ML) methods; 31 documents published between 2010 and 2020 were analyzed. The most recent developments are covered, including evaluation metrics such as AUC and F1-score, ML models such as the Autoencoder (AE) and Convolutional Neural Network (CNN), and datasets such as ToyADMOS, MIMII, and Mivia. The authors did not focus on a comparative study of the various models.

Chandrakala et al. [72] surveyed sound event and scene representation and recommended appropriate machine-learning methods for audio surveillance projects. Different benchmark datasets are categorized based on the actual audio surveillance application scenarios. Several state-of-the-art methods are evaluated on two benchmark datasets intended for the sound event and audio scene detection tasks to obtain a quantitative understanding. Finally, future directions for improving environmental audio scene and sound event detection are delineated.

Teck Kai et al. [73] surveyed sound event classification from several perspectives. The authors clearly explained the model implementations with a comparison of results, organizing the sections around the evaluated parameters. However, this survey paid little attention to feature extraction methods and datasets. Table 1 presents the various review studies on sound event detection and localization and their limitations.

Table 1 Various review research on sound event detection and localization and their limitations

1.2 Contributions

This paper endeavors to enhance the understanding of Sound Event Detection and Localization through a comprehensive review. Our primary focus lies in conducting a valuable comparative analysis of various research endeavors about sound event detection. We examine numerous recent models, highlighting their contributions and addressing significant challenges they present. Moreover, in a well-structured manner, we discuss key components of sound, including datasets, feature extraction techniques, machine learning models, and localization methodologies.

Below are the notable contributions of our research:

  1. We conducted a thorough analysis of existing models in Sound Event Detection and Localization to enrich our review of the current state of research.

  2. By identifying gaps in the literature, we have contributed to the progression of SED knowledge. Furthermore, we recommend future research directions and strategies to address these gaps effectively.

  3. Our comparative study, through a meticulously prepared and extensive literature review, has provided valuable insights and enhanced the domain of Sound Event Detection.

The rest of this paper is organized as follows. Section 2 describes the different datasets used to detect sound events. Section 3 discusses the feature extraction techniques used in various models. Section 4 reviews machine learning algorithms, and Sect. 5 presents a comparative study of neural network models. Section 6 reviews the localization, or direction of arrival, of sounds. Section 7 describes key parameters for evaluating sound event detection and localization. Section 8 discusses research scope in various environments, and Sect. 9 covers current challenges. Finally, Sect. 10 illustrates real-world applications related to sound event detection.

2 Datasets

The field of sound research offers various specialized datasets, each targeting different aspects of audio analysis and classification. Well-known datasets cover environmental sounds, urban areas, parks, rooms, offices, motors, vehicles, traffic, the health sector, drones, animals, birds, and more. The UrbanSound dataset has ten classes of urban sounds, offering researchers a comprehensive collection ranging from street music to car horns and sirens [21, 26]. The AudioSet dataset by Google has millions of 10-s sound clips sourced from YouTube, covering over 600 labeled audio events, thus providing a rich resource for various research endeavors [28,29,30]. DCASE challenge datasets focus on real-world sound event detection and classification, offering recordings of various acoustic scenes and events. The Freesound Datasets and FSD50K, drawn from the collaborative database Freesound, provide researchers with extensive collections of Creative Commons licensed sounds, facilitating tasks such as audio tagging and event detection [31, 32]. TUT Acoustic Scenes offers audio recordings from diverse acoustic environments supplemented with annotations for sound event detection and classification tasks. The Speech Commands Dataset offers short audio clips of spoken words, which are key for keyword spotting and speech recognition research. ESC-10, smaller than ESC-50, streamlines experiments and educational use with its condensed 10-class environmental sound collection [41, 42]. The MIVIA Audio Events Dataset captures various events in indoor and outdoor settings, serving as a valuable resource for audio event detection and classification studies. The CHiME-Home and DESED datasets zoom into domestic environments, providing recordings of household activities and events for sound event detection and localization research [43,44,45].

When dealing with a small dataset in machine learning, the risk of overfitting becomes more evident. Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant patterns that do not generalize to new data. Researchers have applied several strategies to handle this issue. Wang et al. [48] opted for fewer parameters in their approach to reduce the risk of overfitting. Hu et al. [49] implemented cross-validation techniques, such as k-fold cross-validation, to reduce overfitting and estimate the model's performance by evaluating it on multiple validation sets.
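As a minimal illustration of k-fold cross-validation on clip-level features, the sketch below uses scikit-learn with placeholder feature and label arrays; the SVM classifier and the 5-fold split are assumptions for demonstration only.

```python
# Hedged sketch: 5-fold cross-validation on pre-extracted clip-level features,
# assuming X is an (n_clips, n_features) array (e.g., mean MFCCs) and y holds
# integer class labels. Placeholder data stands in for a real feature matrix.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))        # placeholder features (e.g., mean MFCCs)
y = rng.integers(0, 5, size=200)      # placeholder labels for 5 sound classes

scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5)
print("fold accuracies:", scores, "mean:", scores.mean())
```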

Augmenting the dataset through techniques like data augmentation increases its effective size and helps the model generalize better. Additionally, feature selection can reduce model complexity by retaining relevant features and eliminating redundant ones. Monitoring the model's performance on a validation set during training and stopping early when performance declines can also prevent overfitting. Furthermore, leveraging pre-trained models through transfer learning can exploit existing knowledge to improve model performance on small datasets.
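The following sketch shows two common waveform-level augmentations of the kind referred to above, a random time shift and additive noise at a target SNR; the parameter values are assumptions for illustration.

```python
# Hedged sketch of two simple waveform-level augmentations: a random circular
# time shift and additive Gaussian noise at a target SNR. Parameter values
# are illustrative assumptions, not settings from any cited system.
import numpy as np

def random_time_shift(y, max_fraction=0.1, rng=None):
    rng = rng or np.random.default_rng()
    max_shift = max(1, int(len(y) * max_fraction))
    shift = rng.integers(-max_shift, max_shift + 1)
    return np.roll(y, shift)                      # circular shift of the waveform

def add_noise(y, snr_db=20.0, rng=None):
    rng = rng or np.random.default_rng()
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=y.shape)
    return y + noise                              # noisy copy at roughly snr_db dB SNR
```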

Bubashait et al. [57] compared the accuracies of various models on the UrbanSound8K dataset. Features were extracted from urban sounds as Mel-scale cepstral (MEL) spectrum images, with sound processing handled through the open-source library Librosa. CNN and LSTM models were evaluated against a baseline ANN model for classification on UrbanSound8K. The CNN model exhibited lower performance, achieving an accuracy of 87.15% and an F1 score of 85.63%, compared to the DNN baseline and the LSTM model. Conversely, the LSTM model outperformed the CNN, demonstrating superior accuracy on test data with a rate of 90.15% and an F1 score of 90.15%.

Fonseca et al. [63] introduced FSD50K, an open dataset with over 51,000 audio clips totaling more than 100 h of audio, manually annotated across 200 classes from the AudioSet Ontology. They provided a comprehensive account of the FSD50K creation process, tailored specifically to the unique characteristics of Freesound data, including insights into encountered challenges and the solutions implemented. Furthermore, they conducted sound event classification experiments, presenting baseline systems and key insights into factors to consider when partitioning Freesound audio data for SER purposes.

Piczak et al. [64] presented the ESC-50 dataset, a new annotated collection of 2000 short clips spanning 50 categories of common sound events. It also offers a comprehensive collection of 250,000 unlabeled audio excerpts from recordings available through the Freesound project. Furthermore, it evaluates human accuracy in classifying environmental sounds, contrasting it with the performance of selected baseline classifiers utilizing features derived from mel-frequency cepstral coefficients and the zero-crossing rate. Table 2 presents the various datasets and their descriptions.

Table 2 Different dataset with number of classes

2.1 Challenges

Diverse Characteristics

Datasets are created for specific environments with unique characteristics (such as environmental sounds, human sounds, urban sounds, and vehicle sounds), which makes preparing a generalized dataset that spans different sound types difficult.

Feature Extraction

Different sound types may require different preprocessing techniques and feature extraction methods based on the environment and sound type.

Domain Discrepancies

Synthetic datasets reduce processing complexity for pre-defined models, but models trained on them often perform poorly in real-world environments.

Class Imbalance

Datasets are often prepared with unequal numbers of samples per class, depending on requirements and resource availability. This leads to biased models that perform poorly on under-represented classes, so researchers must apply additional augmentation and preprocessing.

Overfitting

With limited data samples, models cannot be trained properly and tend to overfit.

3 Feature extraction

Sound feature extraction methods are techniques for extracting relevant information or characteristics from audio signals. These features are then used for various purposes, such as speech recognition, music analysis, sound classification, and more.

Feature extraction is vital for sound processing as it reduces high-dimensional sound data into representative features, reducing computational complexity while preserving essential information. These features facilitate efficient analysis, enabling speech recognition, music classification, and sound event detection [25, 26]. Moreover, they enhance noise robustness by focusing on discriminating aspects less affected by noise, promoting interpretability by revealing underlying sound characteristics and ensuring adaptability across diverse scenarios [27].

Time-domain features, derived directly from the amplitude values of the sound waveform, provide insights into the signal's temporal characteristics. Examples include zero-crossing rate (ZCR), energy, root mean square (RMS) amplitude, and temporal statistics such as mean and variance. On the other hand, frequency-domain features represent the frequency content of the sound signal [28,29,30]. Techniques like the STFT are employed for their extraction. Examples of frequency-domain features include spectral centroid, spectral bandwidth, spectral roll-off, and spectral flux. Cepstral features, such as MFCC, are derived from the spectral envelope of the sound signal. Pitch and harmonic features describe the pitch and harmonic structure of the sound signal. They include fundamental frequency (pitch), harmonic-to-noise ratio (HNR), and cepstral peak prominence (CPP), providing insights into the tonal properties of the audio. Temporal features include rhythm features like beat and tempo, as well as onset detection features that identify the starting points of sound events. Spectral features include spectral flatness, spectral contrast, and spectral entropy, which are useful for tasks like sound classification and acoustic scene analysis [31,32,33]. Wavelet and time–frequency features are derived from time–frequency representations of the sound signal obtained using techniques like the wavelet transform or spectrograms [34, 35]. These features simultaneously capture time and frequency information, offering a detailed representation of the signal. Deep learning-based features represent a recent advancement: features are learned directly from raw sound data using neural networks. Features extracted from CNNs trained on spectrograms or RNNs for sequence modeling have shown promising results in various audio processing tasks [56, 75]. Figure 1 illustrates the different kinds of feature types and relevant techniques.

Fig. 1 Different kinds of feature types and relevant techniques
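For illustration, the sketch below computes a few of the time-domain and spectral features listed above (zero-crossing rate, RMS energy, spectral centroid, and spectral roll-off) frame-wise with librosa; the frame and hop lengths are assumed values.

```python
# Hedged sketch of a few time-domain and spectral features computed frame-wise
# with librosa; y and sr are assumed to come from librosa.load.
import librosa
import numpy as np

def basic_features(y, sr, frame_length=2048, hop_length=512):
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length)
    # stack into a (4, T) frame-wise feature matrix
    return np.vstack([zcr, rms, centroid, rolloff])
```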

Wang et al. [48] discussed an approach to detecting and locating sound events in real-world environments. For the audio-only part, they used ResNet-Conformer architecture as the primary acoustic model. For the audio-visual task, they utilized object and human body detection algorithms in videos to identify potential sound events, combining these findings with acoustic features to enhance detection. The model mainly used log-mel spectra features extracted from multichannel audio. They augmented the data using the ACS strategy and obtained about 192 h of data on the dev-test set of the STARSS23 dataset.

Jinbo et al. [49] explained how they tackled Task 3 of the DCASE 2023 Challenge, which deals with identifying and locating sounds in real environments. They assessed the proposed approach using STARSS23's dev-test set. Using their data-generation strategy, they produced a significant amount of data, comprising 50,000 5-s clips (dataset C) from computationally generated SRIRs and 2700 1-min clips (datasets A and B) from the TAU-SRIR DB, where PANNs were used to clean the sound event examples of B and C. A CNN model classified the sounds based on extracted MFCC and Mel spectrogram features and achieved an accuracy of 82.2%.

Cheimariotis et al. [50] created a system designed to identify sound occurrences for in-home sound classification. The model addressed Task 4a of DCASE 2023, which is to identify 10 typical events that take place in homes within 10-s audio samples. The main components of the methodology were data augmentation applied to the mel-spectrograms representing the audio clips, BiGRU for sequence modeling, fusion of these features with BEATs embeddings, and feature extraction through a frequency-dynamic convolutional network enhanced with an attention module at each convolutional layer. The model achieved an accuracy of 0.798.

Changmin et al. [86] suggested a model with a frequency-dynamic CRNN structure. They first adjusted the sigmoid function with a temperature parameter to obtain a soft confidence value. Secondly, they employed a weak SED branch, which sets the timestamp to the duration of the audio clip and only makes weak predictions. Third, PSDS scenario 2 benefited from adding the FSD50K dataset to the weakly labeled dataset. Log-mel spectrogram features were then extracted from the expanded dataset, using 128 mel-frequency bands, a 256-sample hop length, and a 2048-sample frame length. FDY-CRNN was used to implement the student and teacher models. The system achieved its best PSDS scenario 1 of 0.473 and PSDS scenario 2 of 0.695 on the domestic-environment SED real validation dataset.
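The temperature-scaled sigmoid used to obtain soft confidence values can be sketched as follows; the temperature value here is an assumption for illustration, not the value used by the authors.

```python
# Hedged sketch of a temperature-scaled sigmoid for softening frame-level
# confidences; the temperature is an assumed illustrative value.
import numpy as np

def soft_confidence(logits, temperature=2.0):
    # temperature > 1 flattens the sigmoid, yielding softer confidence values
    return 1.0 / (1.0 + np.exp(-logits / temperature))
```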

Soo-Jong et al. [87] addressed the challenge of weakly labeled datasets using a novel time–frequency (T-F) segmentation framework. They utilized a CNN for segmentation and global weighted rank pooling for classification, with features extracted as log-mel spectrograms. Validation on DCASE 2018 data showed significant performance improvements over baseline scores, with F1 scores of 0.534, 0.398, and 0.167 achieved in audio tagging, frame-wise SED, and event-wise SED, respectively. Additionally, their method achieves an F1 score of 0.218 in T-F segmentation, a task previously unattainable. Table 3 presents the different feature extraction techniques and outcomes with respect to ML-based algorithms.

Table 3 Different feature extraction techniques and outcomes

3.1 Challenges

Feature extraction is essential in many machine learning and signal processing applications, especially audio processing. It involves transforming raw data into a set of measurable features that a model can use as input for prediction.

Feature Selection

The selection of features significantly impacts model performance, since different features capture different characteristics of the sound signal; a poor choice of features can noticeably degrade a model's accuracy and performance.

Computational Complexity

Extracting deep features or complex spectrograms requires significant computational resources.

Real-Time Applications

In real-time scenarios such as live audio streaming or real-time speech recognition, audio data must be processed with low latency, which constrains the feature extraction methods that can be used.

4 Machine learning sound event detection

Several studies have been conducted on sound event detection with multi-channel and polyphonic sounds, implementing machine learning models such as HMM, NMF, SVM, linear regression, and others. SVM is utilized for its robustness in classification tasks, while KNN offers simplicity and effectiveness by classifying based on neighboring data points. Random forest, an ensemble method, combines multiple decision trees for accurate classification. Decision trees are favored for their interpretability and ability to handle non-linear relationships in data. Linear regression, though primarily for continuous target variables, can be adapted for classification, although it's more commonly used in related tasks like sound source localization.

T. Heittola et al. [2] used 15 different types of 30-s acoustic recordings as the dataset for their research. They divided the data into development and evaluation sets; the development set was further divided into training and test sets for cross-validation during system development. MFCCs were calculated over 40-ms frames with 40 mel bands and 50% overlap. A GMM classifier was applied to the dataset, and the classification was measured based on accuracy and correctly classified segments. The authors considered the error rate and F-score in fixed time intervals: the sound events in each one-second segment are compared between the system output and the ground truth. The scenario was evaluated based on precision, recall, and F-score. The F1 score for wind blowing was 14.2 in the residential area, and the F1 score for running water taps was 41.2 in the home environment. Kawaguchi et al. [7] implemented a model to classify sound using non-negative matrix factorization (NMF), and this model was also compared with semi-supervised NMF (SSNMF). These models mainly relate their results to non-negative matrix underapproximation (NMU).
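Segment-based metrics of the kind used above can be computed from binary activity matrices as in the following sketch, which follows the commonly used definitions of segment-based precision, recall, F-score, and error rate; the one-second segment length and the input format are assumptions for illustration.

```python
# Hedged sketch of segment-based SED metrics, assuming ref and est are binary
# (n_segments, n_classes) activity matrices obtained by comparing system output
# with ground truth in fixed-length (e.g., one-second) segments.
import numpy as np

def segment_based_metrics(ref, est):
    tp = np.sum((ref == 1) & (est == 1))
    fp = np.sum((ref == 0) & (est == 1))
    fn = np.sum((ref == 1) & (est == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # segment-based error rate: substitutions, deletions, insertions per segment
    fn_k = np.sum(ref * (1 - est), axis=1)        # missed events per segment
    fp_k = np.sum(est * (1 - ref), axis=1)        # false alarms per segment
    subs = np.minimum(fn_k, fp_k).sum()
    dels = np.maximum(0, fn_k - fp_k).sum()
    ins = np.maximum(0, fp_k - fn_k).sum()
    er = (subs + dels + ins) / max(np.sum(ref), 1)
    return precision, recall, f1, er
```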

Selver Ezgi et al. [8] defined a model to detect multimedia events. They initially extracted MFCC features from sound-related data samples and used an SVM to perform the classification. The occurrences of an actual class are represented in the rows of the confusion matrix, while the examples of a predicted class are represented in each column. The confusion matrix shows that significant gains may be achieved by performing appropriate parameter optimization. The system's overall recognition rate is generally good across the different classes.

Parathai et al. [10] proposed a solution to classify events from a single noisy mixture. It consists of two main steps: separating the acoustic events and classifying them. Complex matrix factorization (CMF) with adaptive L1 sparsity was used to decompose the noisy single-channel mixture, where the method encodes the spectrogram and estimates the phase of the incoming signal in the time–frequency domain. A multiclass support vector machine (MSVM) strategy was then applied on a one-versus-one (OvsO) basis to classify the unmixed audio into the category of the matched audio event. The MSVM method divides the separated signals into blocks and encodes three features of each block. The OvsO SVM approach learns mel-frequency cepstral coefficients, short-time energy, and short-time zero-crossing rate from several classes of audio events.

Huy Dat et al. [11] introduced an SVM classification model that uses the distribution of the sub-band temporal envelope (STE) and kernel techniques based on a sub-band probability distance (SPD). Generalized gamma modeling, well suited to characterizing the sound, and the probability-distance kernel provide a closed-form solution for calculating the distance, greatly reducing the computational cost. Experiments were carried out on a database of 10 categories of sound events. According to the findings, the proposed classification scheme significantly outperformed traditional SVM classifiers based on mel-frequency cepstral coefficients (MFCCs).

Xianjun et al. [14] introduced a strategy that utilizes a pre-trained CNN to extract bottleneck features, coupled with random forest classifiers for event detection. The study comprehensively details these techniques along with their practical applications. Additionally, the authors propose a method to incorporate context into the classification process by modeling the temporal evolution of event classes using an HMM. Through rigorous evaluation on two publicly available datasets, TUT Acoustic Scenes 2017 and TUT Sound Events 2017, the authors demonstrate the effectiveness of their methodology and achieve an accuracy of 91%. Yuanjun Zhao et al. [15] introduced a novel approach to sound event identification leveraging multiple optimized kernels. Their method demonstrates improved categorization performance through the integration of diverse kernels. The technique involved training several SVMs with various kernel functions and aggregating their outcomes for decision-making, as elaborated in their research. Additionally, the authors advocated a grid search approach to fine-tune kernel parameters effectively. Through extensive evaluation on publicly available datasets—TUT Acoustic Scenes 2017 and TUT Sound Events 2017—the authors showcase the effectiveness of their methodology, achieving state-of-the-art results in terms of accuracy and F1-score. Table 4 illustrates the different machine learning models for polyphonic sound event detection.

Table 4 Different machine learning models to find polyphonic sound event detection

5 Neural networks models

Neural network architectures such as CNN and RNN play a vital role in accurately classifying large datasets. Emerging CNN variants like VGG16, VGG19, ResNet, AlexNet, MobileNet, DenseNet, and EfficientNet, among others, have been used in classification tasks. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models are employed to preserve previous layer outputs in current layers, often extended with bidirectional LSTM and bidirectional GRU for enhanced performance [4, 36, 37]. These architectures extract complex features from audio data, capture temporal dependencies, and learn hierarchical patterns within sound events. With large labeled datasets and techniques such as transfer learning, SED models achieve good accuracy and generalization across diverse applications, from environmental monitoring to smart home systems. Despite noise robustness and scalability challenges, neural network models hold promise for elevating audio analysis systems and driving real-world implementations of SED forward. CNN-based models are adequate for capturing local spectrogram features, representing sound as a function of time and frequency. RNN, LSTM, and GRU models are adept at handling temporal dynamics in sequential data, making them suitable for modeling long-range dependencies in audio sequences. On the other hand, CRNNs merge the strengths of CNNs and RNNs, enabling them to capture local and temporal features simultaneously [38,39,40]. The choice of model depends on factors such as data characteristics, computational resources, and task specifics. Each architecture has its advantages and limitations, and selecting the most appropriate model requires careful consideration of these factors to ensure optimal performance in sound event detection tasks.
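As a minimal sketch of the CRNN family discussed above, the following PyTorch model stacks convolutional blocks over a log-mel input, a bidirectional GRU, and a per-frame sigmoid output for polyphonic SED; the layer sizes are illustrative assumptions and do not reproduce any specific cited architecture.

```python
# Hedged sketch of a small CRNN for polyphonic SED: CNN blocks over a log-mel
# input, a bidirectional GRU, and per-frame multi-label sigmoid outputs.
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                     # pool along frequency only
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (n_mels // 16), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, 1, time, n_mels)
        z = self.cnn(x)                   # (batch, 64, time, n_mels // 16)
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * n_mels // 16)
        z, _ = self.gru(z)
        return torch.sigmoid(self.head(z))     # per-frame class activities
```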

Annamaria et al. [1] investigated "Sound Event Detection in the DCASE 2017 Challenge," presenting an analysis of the DCASE 2017 challenge's sound event detection task, which aimed to advance the state of the art in sound event detection. The authors describe the dataset and the evaluation metric used in the challenge in detail, highlighting the difficulties associated with sound event detection, such as varying acoustic conditions and class imbalances. They also discuss how deep neural networks (DNNs), CNNs, and RNNs are used in high-performing systems, as well as data augmentation and ensemble learning. Overall, the paper provides a comprehensive analysis of the DCASE 2017 sound event detection task, offering insights into its challenges and the state-of-the-art techniques used to address them.

Hyungui Lim et al. [3] introduced a 1D ConvNet to detect rare sound events. The authors used 1D convolutional neural network, RNN, and LSTM algorithms with the log-amplitude mel-spectrogram as the input acoustic feature. The frame-wise log-amplitude mel-spectrogram is fed into the proposed model, and the model returns an output for every incoming sequence. They implemented a spectral-side 1D ConvNet that enables frame-level investigation. Their network used two RNN layers with 128 LSTM units each and applied a unidirectional backward RNN-LSTM procedure to improve accuracy. The performance on the test set of the development dataset yields an error rate of 0.07 and an F-score of 96.26 on the event-based metric.

Adavanne et al. [5] proposed utilizing a convolutional bidirectional RNN (CBRNN) to detect bird calls, treating the Bird Audio Detection (BAD) challenge as an SED task. The CRNN architecture combines the modeling capabilities of CNN, RNN, and fully connected (FC) layers. Here, CRNNs were expanded to handle multiple feature classes, with the CNN feature maps processed by a bidirectional RNN, forming the convolutional bidirectional RNN. The CBRNN model achieves an AUC of 95.5% on five cross-validation folds of the development data and 88.1% on unseen evaluation data.

Qiuqiang et al. [6] proposed a model that integrates a CNN-Transformer, which is similar to a CRNN. In their approach, they implemented threshold optimization based on mean average precision (mAP) for SED. The authors combined a CNN with BiGRU, a gated recurrent architecture related to LSTM. The automatic threshold optimization system achieves state-of-the-art results, with an audio tagging F1 score of 0.646, surpassing the score of 0.629 obtained without threshold optimization, and a sound event detection F1 score of 0.584, outperforming the score of 0.564 without threshold optimization.

Kyoungjin et al. [24] implemented a CNN-based model for SED. They chose a multi-channel environment and extracted STFT coefficients from the channels. The model was also extended with weighted prediction error (WPE) dereverberation, and MVDR beamforming was carried out with source and noise masks estimated by a DNN. The experiments covered multiple test cases. Evaluation was based on a binary analysis of the test data in which the numbers of true positives (TP), false positives (FP), and false negatives (FN) were aggregated, from which precision (P), recall (R), and F-score were computed. The proposed end-to-end system is explained in three parts: the first is dereverberation, the second is MVDR beamforming, and the final stage is CRNN-based SED, which detects the presence and absence of sound events.

Xu et al. [27] introduced a gated convolutional neural network and a temporal attention-based localization technique for audio classification. The model employed a CRNN with learnable gated linear units (GLUs) applied to the log Mel spectrogram. Additionally, they introduced a temporal attention mechanism across frames to predict the event locations within a chunk derived from the weakly labeled data. The model excelled in both sub-tasks of the DCASE 2017 challenge, achieving an F-value of 55.6% and an equal error rate of 0.73, respectively.

Adavanne et al. [29] aimed to focus on Sound Event Localization and Detection (SELD) for the DCASE 2019 challenge. A baseline method utilizing a convolutional recurrent neural network establishes benchmark performance on this reverberant dataset. The results consider different numbers of overlapping sound events and varied reverberant environments. Overall, SELDnet demonstrated slightly superior performance on the FOA dataset compared to the MIC dataset. Furthermore, SELDnet exhibits enhanced performance in scenarios devoid of polyphony across datasets. Notably, the SELDnet model trained with data from all five environments displays the best performance, particularly excelling in the initial environment with an F1-Score of 85.0 and an error rate of 0.25.

Jingyang et al. [30] presented a comprehensive approach to tackle the SELD task, consisting of data augmentation, network prediction, and post-processing stages. Their approach employed the CRNN architecture for model prediction. Given the scarcity of data in the challenge setting, they advocate data augmentation to enhance the system's performance. Evaluation on the DCASE 2019 Challenge Task 3 Development Dataset reveals that their system achieves approximately a 59% reduction in Sound Event Detection (SED) error rate and a 13% reduction in direction-of-arrival (DOA) error compared to the baseline system, specifically on the Ambisonic dataset.

Turab Iqbal et al. [31] focused on two-stage polyphonic sound event detection and localization, employing log mel features for event detection and an intensity vector along with Generalized Cross-Correlation (GCC) features for localization, all computed from a microphone array. Log mel features were primarily utilized for event detection, while intensity vector and GCC features were employed for precise localization. Additionally, an intensity vector in log mel space and GCC with phase transform (GCC-PHAT) features were utilized for DOA estimation. The methodology involved constructing 2D CNN layers, referred to as feature layers, comprising four groups of 2D CNN layers with subsequent 2 × 2 average pooling. Each group comprised two 2D convolutions with a receptive field of 3 × 3, a stride of 1 × 1, and a padding size of 1 × 1. The two-stage approach yielded promising results with an error rate of 0.13, an F1-Score of 0.930, and a DOA error of 6.61 degrees.

Ying Tong et al. [32] proposed a model that operates by taking consecutive spectrogram time frames as input and generating two outputs simultaneously. Firstly, it conducts Sound Event Detection (SED) through multi-label classification on each time frame, effectively capturing temporal activity for all sound event classes. Secondly, it performs localization by estimating the 3-D Cartesian coordinates of the direction-of-arrival (DOA). Compared to various baselines, including SED and DOA estimation methods, the proposed approach shows robustness across diverse structures, adaptability to unseen DOA values, resilience to reverberation, and effectiveness in low-SNR scenarios. Within this architecture, local shift-invariant features of the spectrogram are learned through multiple layers of 2D Convolutional Neural Networks (CNNs). Each CNN layer applies Rectified Linear Unit (ReLU) activation over 3 × 3 × 2C receptive fields along the time–frequency–channel axes. This model achieves an accuracy of 87%.
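A dual output of the kind described above can be sketched as two heads on top of a shared feature extractor: a sigmoid branch for per-frame class activities and a tanh branch regressing 3-D Cartesian DOA coordinates per class; the dimensions below are illustrative assumptions and do not reproduce the cited architecture.

```python
# Hedged sketch of a joint SED + DOA output: one branch predicts per-frame
# class activities (sigmoid) and the other regresses 3-D Cartesian DOA
# coordinates per class (tanh, since unit-sphere coordinates lie in [-1, 1]).
import torch
import torch.nn as nn

class SELDHeads(nn.Module):
    def __init__(self, feat_dim=128, n_classes=11):
        super().__init__()
        self.sed = nn.Linear(feat_dim, n_classes)        # event activity branch
        self.doa = nn.Linear(feat_dim, 3 * n_classes)    # (x, y, z) per class

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        sed = torch.sigmoid(self.sed(feats))  # (batch, time, n_classes)
        doa = torch.tanh(self.doa(feats))     # (batch, time, 3 * n_classes)
        return sed, doa
```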

Thi Ngoc Tho et al. [34] proposed a novel approach to sound event localization and detection by employing a CRNN-based Sequence Matching Network (SMN). The authors accounted for overlapping sounds and their onset and offset parameters, aligning them with the active segments of the output and incorporating a DOA estimator alongside sound classes. Implementation involved utilizing BiGRU coupled with fully connected layers. In the second phase, a CRNN-based SMN was trained to align the output sequences of the event detector and DOA estimator. The estimated DOAs were then associated with relevant sound classes. This modular and hierarchical approach significantly enhanced the performance of the SELD task across all evaluation metrics. The proposed ensemble achieved a localization error of 9.3°, a localization recall of 90%, and secured the second position in the team category of the DCASE 2020 sound event localization and detection challenge. Table 5 presents the different neural network models for polyphonic sound event detection. Table 6 illustrates the different neural network models with various parameters.

Table 5 Different neural network models to find polyphonic sound event detection
Table 6 Different neural network models with various parameters

6 Approaches for sound event localization

Real-world noise is composed of many overlapping sound waves in different frequency bands. Sound event detection and localization are two tasks that work together: identifying when sounds such as horns, barking dogs, or heavy engines are active, estimating their respective spatial trajectories, and associating textual labels with the sound events. Generally, the angle of arrival of a sound is challenging to estimate, and several researchers have worked on localizing sound events. Sound event localization, crucial in applications like surveillance, robotics, and augmented reality, uses several approaches to accurately determine the spatial coordinates of sound sources. One common method involves microphone arrays, where the Time Difference of Arrival (TDOA) or Direction of Arrival (DOA) of sound signals across multiple microphones is analyzed. GCC-PHAT is one widely used signal-processing algorithm for sound event localization, especially with microphone arrays. It works by finding the time delay between signals received by different microphones due to variations in sound source arrival times, known as the TDOA [45,46,47]. GCC-PHAT calculates the cross-correlation between microphone signals to identify the time delay that maximizes correlation. Before computing the cross-correlation, the signals undergo a preprocessing step called the Phase Transform (PHAT), which normalizes the signals based on their phase to reduce the influence of signal magnitude. This normalization improves TDOA estimation accuracy and sound source localization, making GCC-PHAT effective in mitigating reverberation and noise for more precise localization results.
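The GCC-PHAT procedure described above can be sketched as follows for a single microphone pair; the FFT length and peak search are simplified assumptions.

```python
# Hedged sketch of GCC-PHAT TDOA estimation between two microphone signals,
# following the phase-transform weighting described above.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    n = len(sig) + len(ref)                      # zero-pad to avoid wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)                # generalized cross-correlation
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift    # lag with maximum correlation
    return shift / float(fs)                     # estimated TDOA in seconds
```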

Grondin et al. [28] based their proposed system on two CRNNs that perform sound event detection using time differences of arrival (TDOA) and localization using the DOA. The system uses a four-microphone array, combining results from its six microphone pairs to provide the 3-D direction of arrival (DOA) and the final classification. The proposed sound event detection and localization system was submitted to the DCASE 2019 challenge. The CRNN architecture uses both spectrogram and GCC-PHAT features to perform SED and estimate the TDOA.

Archontis Politis et al. [29] presented sound event localization and detection based on the DCASE 2019 challenge. The research was conducted on a multi-room reverberant dataset provided for the task. In this approach, a DNN was implemented for classification and regression. Sound events were sampled at an average SNR of 30 dB, and the authors also considered temperature as a parameter when determining sound detection and localization. The model implemented a CRNN with bidirectional GRU to identify the direction of arrival separately from sound event detection.

Zhang et al. [30] aimed to detect polyphonic sound events and their localization. The authors described data augmentation, network prediction, and a post-processing stage. In the post-processing stage, they proposed prior knowledge-based regulation (PKR), which averages the localization predictions, and they showed that this reduces the mean squared error. They implemented a CRNN to perform both localization and sound detection; the SED branch was trained with STFT features and the DOA branch with Mel-spectrogram features.

Adavanne et al. [32] explained the key concept of using the direction of arrival to localize sound, based on the phase and magnitude spectra of sound waves arriving from multiple directions. In this research, localization was obtained by estimating the 3-D Cartesian coordinates of the DOA. The phase and magnitude of the sound signal are evaluated separately to achieve better localization, and the entire process is built on a CRNN baseline.

Ying Tong et al. [33] implemented a new SELD method based on multi-direction-of-arrival (multi-DOA) beamforming and multitask learning. Multi-DOA beamforming is used to achieve signal separation and provides a varied sound field description. For SED and sound source localization (SSL), they designed two networks and used a multitask scheme in which the SSL-related task acts as regularization. Instead of estimating the DOA of each source, they proposed forming multiple beams steered evenly towards different DOAs, so that spatially distributed sources and noise signals can be separated. The DOA output signals are used to extract features for both SSL and SED. Based on CPS and SPP, a steering vector is calculated for each DOA and used to design beamformers for the multiple DOAs. A three-task learning scheme is used, combining regression-based and classification-based SSL criteria to regularize the SED network.

Nguyen et al. [34] performed sound event detection and localization in two separate stages, one for detection and the other for localization. Detection relies on time–frequency patterns to distinguish different sound classes, while localization and direction-of-arrival estimation use magnitude or phase differences between microphones. They trained a CNN on frequency patterns using the magnitude and phase of the signal. The system was also extended with a sequence matching network (SMN): the model first detects sound events using a CRNN and estimates the DOAs with a single-source histogram method, and then a trained CRNN-based sequence matching network matches the two output sequences of the event detector and the DOA estimator.

Trowitzsch et al. [35] proposed a system that uses a robotic binaural setup to detect and localize sound events. The approach robustly binds localization with sound event detection in a robotic binaural system. They use simulations of a complete set of test scenes with different co-occurring sound sources and propose performance measures for a systematic examination of the effect of scene complexity on this joint identification of sound types. Investigating the impact of spatial scene design, they show how a robot could perform better through an optimal head rotation. In addition, they explore the detection performance given possible localization errors as well as errors in estimating the number of active sources.

Xianjun et al. [37] generated blanket representations of SELD using traditional microphone array signal processing and implemented a new SELD method based on multi-direction-of-arrival (DOA) beamforming and multitask learning. Multi-DOA beamforming is used to achieve signal separation and provides a varied sound field description. For sound event detection and sound source localization (SSL), they designed two networks and used a multitask scheme in which the SSL-related task acts as regularization. Instead of estimating the DOA of each source, they formed multiple beams steered evenly towards different DOAs, so that spatially distributed sources and noise signals can be separated. The DOA output signals are used to extract features for both SSL and SED. Based on CPS and SPP, a steering vector is calculated for each DOA and used to design beamformers for the multiple DOAs. A three-task learning scheme is used, combining regression-based and classification-based SSL criteria to regularize the SED network. Experimental results using the DCASE 2019 SELD task database show that the suggested technique achieves state-of-the-art results. Table 7 describes the sound arrival direction, localization-related models, and key approaches.

Table 7 Models for sound event localizations

7 Acoustic parametric analysis

Acoustic monitoring has become a widely used process for assessing the status and diversity of sound-producing species. Different acoustic metrics are utilized to assess the accuracy of sound detection, such as the Acoustic Complexity Index (ACI), Acoustic Diversity Index (ADI), and Acoustic Evenness Index (AEI). Extensive analysis is needed to identify and detect the various types of audio signals, and this process is time-consuming. Studies conducted in various environments and geographic regions report inconsistencies in the correlation between acoustic diversity and biodiversity indicators, indicating a need for further studies to evaluate acoustic monitoring.
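For illustration, the Acoustic Complexity Index can be sketched from a magnitude spectrogram as below, following its commonly used definition (per-bin sums of absolute intensity differences between adjacent frames, normalized by the per-bin total intensity); this is an assumption-based illustration, not the exact implementation used in the studies discussed next.

```python
# Hedged sketch of the Acoustic Complexity Index (ACI) from a magnitude
# spectrogram, using the commonly cited per-frequency-bin formulation.
import numpy as np

def acoustic_complexity_index(spec):
    # spec: (n_freq_bins, n_frames) magnitude spectrogram
    diffs = np.abs(np.diff(spec, axis=1)).sum(axis=1)   # per-bin frame-to-frame change
    totals = spec.sum(axis=1) + 1e-12                   # per-bin total intensity
    return float(np.sum(diffs / totals))                # sum of per-bin ratios
```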

Moreno-Gómez et al. [38] investigated acoustic indices in rainforest and biodiversity hotspots. Seven acoustic indices were evaluated to assess their reliability as surrogates for variation in bird and anuran richness and vocal diversity. They used three automated sound recorders (SM1, SM3, and SM4), each deployed at a single sampling station. As a first approach to assess the relationship between bird and anuran richness and the acoustic indices, correlations among the variables were computed for every station using a bootstrap technique with 1000 iterations; in each iteration, samples were drawn randomly with replacement and the correlation analysis was rerun. The top-ranked model, M1, encompassing ACI, H, Hf, Ht, and BI, was identified as the most suitable. This model featured fixed effects for intercept, bird richness, and anuran richness, random intercepts for station, hour, and month, and random slopes for birds and anurans by station. With AICc weights exceeding 0.95 and Delta-AICc values surpassing 7 compared to the second-ranked model, M1 demonstrated substantial support, suggesting that the factors within it effectively explain the variance observed across the five acoustic indices. Notably, ACI, BI, and Ht exhibited the highest effect sizes for species richness, with Ht particularly influencing bird richness. Their AICc weights were below 0.5, and Delta-AICc values were less than 1 relative to the subsequent models, indicating that other candidate models should not be disregarded. Figure 2 presents the results reported in [38].

Fig. 2 Results from bootstrapped correlations computed by Spearman tests between bird and anuran richness and acoustic diversity indices in the three stations [38]

Eldridge et al. [39] investigated acoustic metrics using autonomous recording devices, which enable large-scale audio monitoring and scanning. Twenty-six acoustic indices were calculated and compared with observed differences in species diversity. Five acoustic diversity indices (the Acoustic Complexity Index, Acoustic Diversity Index, Acoustic Evenness Index, acoustic entropy, and the Normalized Difference Soundscape Index) and three simple audio descriptors were evaluated. Highly significant correlations (around 65%) between the acoustic indices and bird species richness were observed in temperate habitats, while weak correlations were observed in neotropical habitats that host many sound types other than birds. Multivariate classification analysis showed that each habitat has a distinct acoustic scene, and the indices trace the differences in community composition that depend on the habitat. Rapid Acoustic Survey (RAS) was suggested as a non-invasive approach to biodiversity assessment that is of interest to researchers and policymakers. They analyzed an ACI value of 0.49 and an ADI value greater than 0.5.

Felipe Carmo et al. [40] investigated acoustic indices in the rainforest. They deployed recorders at 12 point stations, following a bird monitoring protocol with GPS-located points spaced 350 to 500 m apart. They used autonomous recorder units (ARUs, SM2 devices) as the sampling method, performing automatic sound monitoring with nine ARUs; one recorder stopped recording and was excluded from the analysis. To avoid recording human-generated noise, such as stepping on branches on the forest floor, the ARUs were used at the 12 point stations, each recorder fitted with two omnidirectional microphones. The study ran for 18 days with the ARU method; the first nine days also included point-count sampling. The collected data were used to compare the acoustic indices during the researchers' presence and absence.

Fairbrass et al. [41] investigated acoustic indices for measuring biodiversity in urban areas. They used acoustic recordings collected over 7 days to capture daily and weekly activity and to increase the variability of biotic sounds in the recordings. The acoustic indices (AIs) were tested using a threshold frequency; for consistency, all AIs were tested with an upper threshold of 12 kHz, and the authors acknowledged that frequencies above the threshold are included for the BI and NDSI. Acoustic diversity was identified from the various sound events associated with the same sound class in each recording. Most sites were influenced by both low- and high-frequency sound activity. Anthropogenic sound was prominent in the dataset and comprised a wide variety of sound types, dominated by traffic sounds, followed by human voices, crackles from the recorders, electrical buzzes, and environmental sounds.

Machado et al. [42] performed a survey on bird communication. They assessed how two specific indices (the Acoustic Diversity Index, ADI, and the Normalized Difference Soundscape Index, NDSI) reflect bird species richness and composition in a protected area close to Brasília. They hypothesized that ADI should mirror bird richness in the cerrado and in the gallery forest, i.e., with higher values in gallery forest than in the cerrado. Based on habitat structure, they also hypothesized that NDSI should be lower in less complex habitat and lower in regions near urbanized areas. They evaluated 30 areas by installing automatic recorders to generate 15-min wave files. Manual inspection of the files revealed the presence of 107 bird species, and the results showed that ADI was significantly associated with species richness, being higher in gallery forest than in the cerrado. Acoustic indices for ecological investigations and biodiversity monitoring are one of the automatic approaches for data analysis. In their assessment, the relationship between the acoustic diversity index and biodiversity was examined by applying a linear model comparing the mean ADI value and the bird species richness recorded in each area.

Siddagangaiah et al. [46] presented a noise-resilient approach for detecting biophonic sounds from fish choruses based on complexity-entropy (hereinafter referred to as C-H). The C-H approach was tested with data collected in Changhua and Miaoli (Taiwan) in the spring of 2016 and 2017. Miaoli was subjected to constant maritime traffic, which resulted in a 10 dB rise in noise levels. They suggested that the C-H technique could help overcome the limits of acoustic indices in noisy maritime environments. They developed an approach for detecting fish choruses based on the C-H method and compared its detection performance to AIs such as ACI, ADI, and BI. Fish chorusing was shown to be positively correlated with C but negatively correlated with H, with |r| > 0.9. The use of marine acoustic biological activity as a proxy for tracking trends in biodiversity levels and ecosystem functioning could be very useful. The C-H approach was developed and tested in marine habitats to fill the gaps left by other indices originally designed and used for terrestrial settings. Noise from shipping operations or natural sources such as wind and tides had no effect on the C-H technique, which remained strongly linked with fish chorusing. When used in conjunction with other current acoustic indices, the C-H technique could be a useful tool for managers and decision-makers to track changes in the composition of animal communities. Table 8 presents the various parameters used to evaluate sound event detection at different accuracy levels.

Table 8 Various parameters to evaluate sound events

8 Sound processing in different environments

Research on sound event detection has become increasingly prominent in recent years and is applied in diverse domains for different purposes. SED finds application in environmental monitoring, which tracks environmental sounds like bird calls, animal noises, and weather patterns, aiding ecological studies and disaster management. In surveillance and security, SED identifies suspicious sounds such as glass breaking or gunshots, enhancing public safety measures. Integrating SED into smart home systems enables the recognition of specific events like smoke alarms or appliance malfunctions, enhancing home safety and convenience. Healthcare systems also benefit from sound classification for monitoring patient conditions by identifying medically relevant sounds such as heartbeats. SED detects equipment failures or anomalies in industrial environments, while automotive safety systems utilize SED to detect sounds like horns or sirens, contributing to road safety. Speech recognition systems leverage SED to filter out background noise, improving accuracy. Entertainment and gaming applications utilize SED for audio experiences and interactive events. Additionally, echo monitoring systems serve various purposes, including sonar systems for underwater object detection, medical ultrasound imaging for visualizing internal structures, radar systems for tracking objects, and structural health monitoring for assessing structural integrity. These applications underscore the versatility and significance of SED and echo monitoring systems across multiple fields, promising further advancements in the future.

Imoto et al. [79] introduced a novel SED method based on multitask learning (MTL) of SED and acoustic scene classification (ASC), employing soft labels for acoustic scenes to better represent the nuanced relationship between sound events and scenes. Experimental evaluations on the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets showed that their approach improves SED performance by 3.80% in F-score compared with conventional MTL-based methods. Specifically, their CNN-BiGRU model achieves an F1 score of 49.82% and an error rate of 0.691, outperforming the baseline model, which reaches an F1 score of 42.17% and an error rate of 0.756.
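
A multitask CNN-BiGRU of this kind can be sketched as a shared convolutional recurrent trunk with a frame-wise SED head and a clip-level ASC head trained against soft scene labels. The layer sizes, loss weights, and input shape below are illustrative assumptions, not the exact configuration reported by Imoto et al. [79].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_mtl_sed_asc(n_frames=500, n_mels=64, n_events=6, n_scenes=4):
    """Shared CNN-BiGRU trunk with a frame-wise SED head and a clip-level ASC head."""
    inp = layers.Input(shape=(n_frames, n_mels, 1))
    x = inp
    for filters in (64, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)   # pool frequency only, keep time resolution
    # collapse the frequency/channel axes -> (time, features)
    x = layers.Reshape((n_frames, x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

    # SED head: frame-wise multi-label event activity
    sed = layers.TimeDistributed(layers.Dense(n_events, activation="sigmoid"), name="sed")(x)
    # ASC head: clip-level distribution over scenes, trained against soft labels
    pooled = layers.GlobalAveragePooling1D()(x)
    asc = layers.Dense(n_scenes, activation="softmax", name="asc")(pooled)

    model = Model(inp, [sed, asc])
    model.compile(optimizer="adam",
                  loss={"sed": "binary_crossentropy",
                        "asc": "kullback_leibler_divergence"},
                  loss_weights={"sed": 1.0, "asc": 0.1})
    return model
```

The KL-divergence loss on the ASC head is one natural choice when the scene targets are soft label distributions rather than one-hot vectors; the 0.1 loss weight simply keeps the auxiliary scene task from dominating the event-detection objective.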

Mingying Zhu et al. [80] introduced a new method to classify bird sounds automatically. It first divides the audio into sections with a sliding window and selects the five sections with the highest energy, then extracts features from them using orthogonal matching pursuit (OMP). For a dataset with 14 bird species, they achieve a classification accuracy of 98.96% and an F1-score of 98.93% using a 2D-CNN-v2. For another dataset with 18 species, the highest accuracy is 97.82% and the F1-score is 97.47%, obtained with the 2D-CNN-v2 using a Bark-scaled SFM as input. The xeno-canto dataset covers 14 bird species prevalent in Queensland, Australia, sourced from the Xeno-Canto website and resampled at 11,025 Hz to accommodate the predominant frequencies below 5 kHz in these recordings.
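
The segment-selection step can be reproduced with a simple energy criterion, and scikit-learn's OrthogonalMatchingPursuit can produce sparse codes over a fixed dictionary. The window length, hop, number of retained segments, dictionary, and sparsity level below are placeholders; the exact OMP configuration and Bark-scaled SFM front end of [80] are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def top_energy_segments(y, sr, win_s=1.0, hop_s=0.5, k=5):
    """Slide a window over the clip and keep the k highest-energy segments."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    segments = [y[i:i + win] for i in range(0, len(y) - win + 1, hop)]
    energies = [float(np.sum(s ** 2)) for s in segments]
    top = sorted(np.argsort(energies)[-k:])
    return [segments[i] for i in top]

def omp_features(segment, dictionary, n_nonzero=20):
    """Sparse-code one segment against a fixed dictionary (one atom per column)
    with orthogonal matching pursuit; the coefficient vector is the feature."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero)
    omp.fit(dictionary, segment)   # dictionary shape: (len(segment), n_atoms)
    return omp.coef_
```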

Minhyuk et al. [81] proposed a model to classify human activities from sound, leveraging a residual neural network. The dataset comprised ten classes of daily indoor activities. After data collection, features were extracted as log Mel-filterbank energies, and a residual neural network with 34 convolutional layers was trained on them. The model reached an accuracy of 87.6%; precision varied between 76.8% and 92.6% across activity classes, recall ranged from 75.8% to 98.6%, and the F1 score ranged from 78.6% to 93.7%.
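
Log Mel-filterbank energies of this kind can be computed with librosa; the sample rate, number of Mel bands, and frame parameters below are typical values assumed for illustration rather than those reported in [81].

```python
import numpy as np
import librosa

def log_mel_energies(path, sr=16000, n_mels=40, n_fft=400, hop_length=160):
    """Log Mel-filterbank energies (frames x Mel bands) suitable as CNN/ResNet input."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).T
```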

Yuren et al. [82] proposed a CNN model to classify Borneo rainforest sounds such as animal calls, wind, and bird calls. They found that accuracy was already good when ample data were available, and that data augmentation and transfer learning improved performance substantially when only very little data were available. This shows that CNNs can be useful for identifying animal sounds even in small projects with many rare species. Their modified version of the Keras VGG-19 model achieved 90.4% accuracy on balanced data and 93.2% on imbalanced data.
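
A minimal sketch of transfer learning with the Keras VGG-19 backbone on spectrogram images is shown below: the ImageNet-pretrained convolutional base is frozen and a small classification head is trained on top. The head size, dropout, and input shape are assumptions and do not reproduce the exact modification used in [82].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_transfer_classifier(n_classes, input_shape=(224, 224, 3)):
    """VGG-19 backbone pre-trained on ImageNet, reused on spectrogram images."""
    base = VGG19(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze convolutional features; fine-tune later if needed
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(base.input, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the backbone is what makes this approach usable with very little labeled data, which is the regime the authors highlight for rare species.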

Messner et al. [83] proposed a model that uses an RNN to detect the first (S1) and second (S2) heart sounds, which mark the onset of systole and diastole, respectively. They used the PhysioNet/CinC Challenge 2016 dataset, comprising heart sound recordings with annotated states, and employed spectral and envelope features extracted from these recordings. The model achieved an average F1 score of approximately 96%. Table 9 lists the various environments covered in sound-related research.
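
One commonly used envelope feature for heart-sound state detection is the homomorphic envelope, sketched below with SciPy; the filter order and cut-off are assumed values, and this is only one example of the spectral and envelope features that could feed the RNN described in [83].

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def homomorphic_envelope(x, sr, lpf_hz=8.0):
    """Smoothed amplitude envelope of a heart-sound recording: magnitude of the
    analytic signal, log-compressed, low-pass filtered, then exponentiated."""
    env = np.abs(hilbert(x))
    log_env = np.log(env + 1e-8)
    b, a = butter(2, lpf_hz / (sr / 2), btype="low")
    return np.exp(filtfilt(b, a, log_env))
```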

Table 9 Various kinds of environments are included in the sound-related research

9 Current research challenges

Sound event detection poses several challenges for current research. Real-world environments are filled with background noise from many sources, which can significantly degrade the performance of sound event detection systems by masking the target sound events. Robustness to domain shifts and adaptation to new acoustic conditions are essential for practical deployment in diverse real-world scenarios. Building a robust SED model requires large amounts of annotated audio covering many sound events in varied acoustic environments, and collecting and annotating such datasets is time-consuming and expensive. Feature extraction is a crucial step in SED systems: it is challenging to extract features that effectively represent sound events while suppressing irrelevant background noise and interference, and extracting high-level, semantically meaningful features that capture common characteristics across different sound event categories can improve the generalization ability of SED systems. Finally, processing audio data and training complex machine-learning models for SED can be computationally demanding, especially when dealing with large datasets or deploying models on resource-constrained devices.

Sound events in real-world environments are often accompanied by background noise, which can degrade the performance of SED models. Robust noise reduction and interference rejection techniques are necessary to improve the reliability of event detection. Table 10 presents the challenges of different models with various feature combinations.

Table 10 Challenges of different models with various feature combinations

10 Applications

Sound event detection using machine learning has led to the development of various related technologies with a wide range of applications.

  a. Acoustic Scene Analysis: Technologies for analyzing the acoustic characteristics of environments, such as identifying the presence of specific sounds (e.g., sirens, alarms, speech) and categorizing acoustic scenes (e.g., indoor, outdoor, urban, rural).

  b. Keyword Spotting and Wake Word Detection: Techniques for detecting specific keywords or wake words within audio streams. They are commonly used in voice-activated devices and virtual assistants to trigger actions or initiate interactions (e.g., Apple's Siri, Google Assistant, Samsung's Bixby, Amazon Alexa).

  c. Environmental Monitoring Systems: Systems equipped with sensors and sound event detection algorithms for monitoring and analyzing sounds in natural habitats, urban areas, or industrial environments to track biodiversity, assess noise pollution, monitor traffic patterns, or detect anomalies.

  d. Healthcare Monitoring Devices: Devices and applications capable of monitoring health-related sounds (e.g., coughing, snoring, breathing patterns) for telemedicine, sleep analysis, monitoring patients with respiratory conditions, or detecting signs of distress.

  e. Security and Surveillance Systems: Technologies for detecting and classifying sounds related to security threats or abnormal events, such as glass breaking, footsteps, gunshots, or vehicle alarms, in surveillance camera footage or audio recordings for enhanced security monitoring.

  f. Smart Home Automation: Integration of sound event detection capabilities into smart home systems to automate tasks based on detected sounds (e.g., turning on lights in response to doorbell rings or alerting homeowners to potential security breaches).

  g. Industrial Monitoring and Predictive Maintenance: Solutions for monitoring machinery and equipment in industrial settings by analyzing sounds to detect anomalies, predict failures, schedule maintenance, and optimize performance to minimize downtime and improve operational efficiency.

  h. Assistive Technologies for People with Disabilities: Technologies designed to assist individuals with hearing or other disabilities by analyzing sounds and providing relevant feedback or alerts (e.g., sound-based navigation aids and assistance in identifying environmental sounds).

  i. Entertainment and Gaming: Integration of sound event detection algorithms into gaming and entertainment systems to create immersive experiences, enhance virtual reality environments, or provide interactive gameplay based on detected sounds.

  j. Automotive Safety and Driver Assistance Systems: Sound event detection capabilities incorporated into vehicles to improve driver safety, detect potential hazards (e.g., sirens, horns, tire screeches), and enhance driver assistance features such as collision avoidance and emergency braking.

These technologies illustrate the diverse range of applications enabled by machine-learning-based sound event detection, spanning industries and domains to address needs related to monitoring, safety, automation, and user experience.

11 Conclusion

In this paper, we surveyed and analyzed different models of sound event detection, reviewed various algorithms and key techniques for achieving better results, and described multiple parameters and metrics for evaluating sound event detection and localization. Polyphonic sound detection is one of the central topics of this review. The importance of precise definitions when evaluating sound metrics cannot be overstated: different algorithms can only be compared fairly when benchmark databases are used under a uniform assessment procedure. The work presented in this paper is part of our effort toward establishing a reference point and a better understanding of task-based metrics for polyphonic sound event detection. In future research, we plan to implement sound classification in forest areas to recognize tree-cutting sounds and help protect natural forest resources; such a system could help government bodies detect illegal logging. Forests contain many overlapping sounds, such as bird calls, animal noises, vehicle sounds from nearby roads, wind, and tree-cutting activities, and our future model will classify and identify the relevant sound events within this mixture according to the environmental conditions. Our research will therefore focus mainly on preserving forests' natural resources through sound classification.