
1 Introduction

Impulse noise generated at military installations propagates to the surrounding communities, resulting in public annoyance [1, 2]. This type of noise is characterized by a high pressure level of very short duration [3]. Blast noise typically refers to impulse noise generated at military bases [4]. The military can be described as the institution that uses armed force, such as weapons and lethal power, to protect and defend the public interest. Military weapon sounds include machine-gun fire, bomb blasts, tank firing, suicide bombings, AK-47 assault rifle fire, double-barrel shotgun firing, and so on [5]. The goal of the powerful offensive weapons used by the military is to overpower those fighting against the country through long-range and highly accurate lethal strikes. Non-impulse sounds come from sources other than military weapons, such as the cry of a baby, wind, aircraft, and so on. Assessing the annoyance caused by impulse and non-impulse noise is important for establishing a reliable follow-up process; recorded sound events also provide additional evidence for any damage claims [6].

Several research works have addressed the measurement and analysis of weapon noise, such as gunfire noise detection systems, but classifying military impulse noise against other noise sources remains problematic; for example, wind noise causes a severe overlap of the Interquartile Range (IQR) between the two classes [7]. Figure 1 shows a diagrammatic representation of a noise classification system.

Fig. 1. A diagrammatic representation of a noise classification system.

Military sounds from larger weapons are much deeper than those from lighter ones. This suggests that there may be a discernible difference in frequencies that spectrogram analysis can quantify [8]. False detections mostly occur when classifying a waveform as either impulse noise or non-impulse noise such as wind.

According to [9], it is very useful to have information not only about direction and distance but also about the specific weapon category and the sound event it could belong to. Such information can also help during investigations of crime incidents in civilian life where sound evidence is available.

Several techniques have been used to classify military impulse and non-impulse sounds, such as the Bayesian classifier, Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Fast Random classifier, and Artificial Neural Network (ANN) classifier. In this study, a Deep Convolutional Neural Network (DCNN) classifier was applied to classify both military impulse and non-impulse sounds. The rest of the paper is organized as follows: Sect. 2 gives a detailed literature review of the methodologies deployed in related studies, while Sect. 3 describes the proposed methodology. The results and discussion are presented in Sect. 4, and the paper concludes with recommendations in Sect. 5.

2 Literature Review

This section gives a comprehensive review of existing studies based on Machine Learning methods and deep convolutional neural networks in sound classification.

Machine Learning (ML) is a field that emerged from Artificial Intelligence and has gained wide application in research areas ranging from industry to basic science [10]. Its primary aim is to make machines exhibit or mimic human-like intelligence for purposes such as decision making, classification, and detection. Machine learning approaches include supervised learning, unsupervised learning, reinforcement learning, and others. The literature shows that the majority of methods used to identify military impulse noise are ML-based, ranging from ANN and SVM to KNN classifiers [11].

Deep Learning is a subarea of Machine Learning based on algorithms inspired by the architectural structure and function of the human brain, known as Artificial Neural Networks. The Convolutional Neural Network (CNN) is a specialized NN for data with a grid-like topology; it uses multiple filters with fewer connections and parameters and is therefore easier to train [12, 13].

Yang and Chen presented a review of machine recognition of music emotion in [14], giving a comprehensive study of existing methods and proposed solutions for recognizing music emotion. [9] identified gunshot sounds using spectral characteristics: the amplitude and frequency of the collected samples were normalized, and the signal was converted from the time domain to the frequency domain through the Fourier transform to extract the required features; the implementation used MATLAB with the Neural Network toolbox and the C programming language. [15] classified frog sounds by species using three features: signal bandwidth, threshold-crossing rate, and spectral centroid; the frog sounds were segmented into syllables before being classified with SVM and k-Nearest Neighbour (KNN) classifiers. [16], however, developed an accurate method for noise classification, with event detection at peak levels down to 100 dB (decibels), using several ANN structures, and showed that the nonlinear capabilities of ANNs give them an edge over linear classifiers; time- and frequency-domain features were used for the classification. [7] developed an ANN-based classifier for 330 military impulse and 660 non-impulse noise samples, respectively, using time-domain metrics and custom frequency-domain metrics for ANN structure selection. The ANN structures were: SVM with radial basis function, SOM, MLP, and a least-squares classifier. The results of [7] showed that the time-domain metrics (kurtosis and crest factor) were effective in classifying impulse noise. Military aircraft sound was classified by [17] using a neural network and a compact feature vector: an ANN technique was introduced for aircraft engine signal classification in which compact features were extracted using Frequency Domain Metrics (FDM), namely spectral centroid and signal bandwidth. [18] employed three architectures: CNN, ANN, and softmax regression. 480 sound samples were captured at 240 bpm for two minutes from 13 objects using a drum kit and guitar. All three architectures failed to achieve accuracy above 20% on the raw time-domain representation after 500 iterations; however, CNN and ANN obtained accuracies above 80% on the frequency-space representation, where softmax regression still failed to classify the data successfully while CNN achieved an accuracy of over 97%. [19] applied different ANN structures, Self-Organizing Map (SOM), MLP, image recognition, and SVM, for sound classification using time-domain and frequency-domain feature extraction; MLP was the most accurate among all the ANN structures.

Cakir applied a multilabel Deep Neural Network (DNN) in [20] for real-time detection of multiple recorded sound events. Kumar and Raj applied a deep CNN to weakly labeled web data for audio event recognition [21]; the approach emphasized temporal localization and was able to train and test on recordings of variable length accurately. Piczak proposed sound event classification using a DCNN classifier [22]. All sounds were input using log-scaled Mel-spectrograms as the feature extraction technique. The proposed system utilized a DCNN architecture consisting of a convolutional layer, a max-pooling layer, a second convolutional layer, a fully connected layer, a dropout layer, and two further fully connected layers. In conclusion, the DCNN gave an accuracy of 73%. Bucci and Vipperman developed an ANN-based classifier for identifying military impulse noise [16]. The study was based on two time-domain metrics (kurtosis and crest factor) and two frequency-domain metrics (spectral slope and weighted square error). The study concluded that the system gave up to 100% accuracy during training and testing. A summary of the reviewed related works is shown in Table 1.

Table 1. Overall summary of related works.

3 Methodology

This study uses a deep learning technique to classify six different military impulse and non-impulse sounds [35]. The basic steps required for military impulse sound classification are data collection, feature extraction, and sound classification.

3.1 Data Collection

The experiment was conducted on a dataset of six military-related noise types [35]. The dataset consists of six different sounds, which we classified as impulse and non-impulse: bomb blast, wind, machine gun, aircraft, vehicle, and thunder. All sounds were represented using 25 signal metrics as inputs, with an overall total of 37,464 sound records. A summary of the data collected is depicted in Table 2.

Table 2. Summary of data collected.

3.2 Feature Extraction

For better human interpretation of military impulse and non-impulse sounds, it is important to extract the required features. Feature extraction starts from the initial set of data and derives values (features) intended to be informative and non-redundant. Following [19], feature extraction requires the following signal metrics as ANN inputs:

  i. Time-domain metrics: the variation of signal amplitude with time.

  ii. Frequency-domain metrics: how much of the signal lies within a given frequency range.

Both classes of metrics have been used successfully in the past to check for faults in the direct analysis of the input data [19], which is most likely a similar problem to identifying military impulse noise. Kurtosis and crest factor are referred to as time-domain metrics, while weighted square error and spectral slope are frequency-domain metrics. For good sound classification performance, the input sounds need to be regularized to avoid overfitting.

Spectral Slope (m): this is computed by creating a least-squares fit to a line, as depicted in Eq. (1).

$$ y = mx + b $$
(1)

Where:

\( y = \log_{10} PSD \) is the base-10 logarithm of the power spectral density (PSD), and \( x = \log_{10} f \) is the base-10 logarithm of frequency.
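For illustration, a minimal Python sketch of this computation is given below. This is not the authors' code; in particular, estimating the PSD with Welch's method is our assumption, since the paper does not state how the PSD was obtained.

import numpy as np
from scipy.signal import welch

def spectral_slope(signal, fs):
    # Estimate the PSD (Welch's method is an assumption here).
    freqs, psd = welch(signal, fs=fs)
    mask = freqs > 0              # drop the DC bin before taking log10
    x = np.log10(freqs[mask])     # x = log10(frequency)
    y = np.log10(psd[mask])       # y = log10(PSD)
    m, b = np.polyfit(x, y, 1)    # least-squares fit of y = mx + b (Eq. 1)
    return m                      # spectral slope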

Weighted Square Error: This can be expressed as WSE:

$$ WSE = \sum\nolimits_{i = 1}^{n} {\left[ {y_{i} - \hat{y}_{i} } \right]^{2} \left[ {f_{i + 1} - f_{i} } \right]} $$
(2)
$$ y_{i} = \frac{{\log_{10} PSD_{i} - { \hbox{min} }[\log_{10} PSD]}}{{\hbox{max} \left[ {\log_{10} PSD} \right] - { \hbox{min} }[\log_{10} PSD ] }} $$
(3)

Where:

  • \( y_{i} \) is the normalized \( \log_{10} (PSD_{i}) \) of the ith frequency bin (Eq. 3);

  • \( \hat{y}_{i} \) is the estimate of \( y_{i} \) from the linear curve fit;

  • \( f_{i} \) is the base-10 logarithm of the ith frequency;

  • n is the number of input data points.

The term \( [y_{i} - \hat{y}_{i}]^{2} \) keeps the WSE positive and reflects the total magnitude of the error, while \( [f_{i+1} - f_{i}] \) adds greater weight to the error in the lower frequency bins. Distinguishing between military impulse noise and non-impulse noise works best with features that occur at the lower end of the bandwidth under consideration.
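A minimal Python sketch of Eqs. (2) and (3) follows (illustrative only; with n frequency bins, the last error term is dropped because the weight \( f_{i+1} - f_{i} \) needs the next bin):

import numpy as np

def weighted_square_error(psd, freqs):
    log_f = np.log10(freqs)
    # Eq. (3): normalize log10(PSD) into the [0, 1] range.
    log_psd = np.log10(psd)
    y = (log_psd - log_psd.min()) / (log_psd.max() - log_psd.min())
    # Linear least-squares fit gives the estimate y_hat of y.
    m, b = np.polyfit(log_f, y, 1)
    y_hat = m * log_f + b
    # Eq. (2): squared error weighted by the log-frequency bin width,
    # which is larger at low frequencies for linearly spaced bins.
    widths = np.diff(log_f)                       # f_{i+1} - f_i
    return np.sum((y[:-1] - y_hat[:-1]) ** 2 * widths)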

Kurtosis and Crest Factors:

When comparing wind noise and military impulse sounds, the crest factor shows a slight IQR overlap; kurtosis, however, shows no IQR overlap between military impulse noise and the other noise sources. Kurtosis describes or estimates a distribution's peakedness and the frequency of extreme values, and can be computed as:

$$ K = \frac{1}{{\delta^{4} T}}\int_{0}^{T} {(x - \mu )^{4} dt} $$
(4)

Where:

  • \( x \) refers to the signal;

  • \( \delta \) is the standard deviation of the signal;

  • \( \mu \) is the mean acoustic pressure;

  • T is the time frame over which the kurtosis is measured [16].

Crest factor is the peak value of the waveform (\( P_{PK} \)) divided by the Root Mean Square value (\( P_{RMS} \)) of the signal, and is calculated as:

$$ Crest\,factor = \frac{P_{PK}}{P_{RMS}} $$
(5)
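Both time-domain metrics are straightforward to compute from a sampled signal; a short Python sketch of Eqs. (4) and (5) is given below (illustrative only):

import numpy as np

def kurtosis(x):
    # Eq. (4): fourth moment about the mean, normalized by the
    # fourth power of the standard deviation.
    mu = np.mean(x)
    delta = np.std(x)
    return np.mean((x - mu) ** 4) / delta ** 4

def crest_factor(x):
    # Eq. (5): peak value of the waveform over its RMS value.
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    return peak / rms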

3.3 DCNN Classifier

A DCNN is a network classifier with multiple (non-linear) hidden layers that can learn the complicated relationship between the input data and the required output. The classifier is a biologically inspired variant of the MLP, applied here to classify military impulse sounds [36]. A DCNN consists of three layer types: the convolutional layer, the pooling layer, and the fully connected layer.

  i. Convolutional layer: the first layer and the core building block of the DCNN, which applies learned filters to the input. This layer helps the DCNN model train faster regardless of the amount of data; without a convolutional layer there is no DCNN model.

  ii. Pooling layer: this layer immediately follows the convolutional layer, taking the convolutional output as its input, and helps to simplify (downsample) the information further.

  iii. Fully connected layer: this layer receives all the inputs from the preceding layers. After each layer, an activation function is applied to give the model the flexibility to capture arbitrary relations. There are various activation functions; the Rectified Linear Unit (ReLU) is applied here, while the sigmoid and hyperbolic tangent functions are given in the equations below.

The sigmoid activation function is represented as:

$$ y = \frac{1}{{1 + e^{ - net} }} $$
(7)

Figure 2 shows the dataflow diagram for the proposed DCNN model.

Fig. 2. Flow diagram for the proposed DCNN model.

The hyperbolic tangent (sigmoidal) activation function is represented as:

$$ y = \frac{{e^{net} - e^{ - net} }}{{e^{net} + e^{ - net} }} $$
(8)

Where: y is the activated output, and net is the weighted sum of the layer's inputs before activation.
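Since the paper does not list the exact layer sizes, the following Keras sketch is only one plausible instantiation of the described architecture (convolutional, pooling, fully connected, and dropout layers over the 25 signal metrics, with six output classes); all filter counts and kernel sizes are our assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_METRICS = 25   # signal-metric inputs per record (Sect. 3.1)
NUM_CLASSES = 6    # bomb blast, wind, machine gun, aircraft, vehicle, thunder

model = models.Sequential([
    layers.Input(shape=(NUM_METRICS, 1)),
    layers.Conv1D(32, kernel_size=3, activation='relu'),  # convolutional layer
    layers.MaxPooling1D(pool_size=2),                     # pooling layer
    layers.Conv1D(64, kernel_size=3, activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),                  # fully connected layer
    layers.Dropout(0.5),                                  # regularization
    layers.Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.002,
                                       beta_1=0.9, beta_2=0.999),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)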

3.4 ADAM (Adaptive Moment) Algorithm

Adam is an optimizer that can be used to solve problems with large and noisy parameters in the field of deep learning. The optimizer was implemented using tested default settings for machine learning problems, namely \( \alpha = 0.002 \), \( \beta_{1} = 0.9 \), and \( \beta_{2} = 0.999 \), with \( \beta_{1}^{t} \) denoting \( \beta_{1} \) raised to the power of t. The learning rate with the bias-correction term for the first moment of ADAM is \( \frac{\alpha}{{1 - \beta_{1}^{t} }} \). The procedure for the ADAM algorithm is given below.

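A minimal Python sketch of the standard Adam update (after Kingma and Ba, with the settings stated above) is given here; it is an illustrative reconstruction rather than the authors' exact listing.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.002, beta1=0.9,
              beta2=0.999, eps=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * grad            # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # biased second moment
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t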

4 Results and Discussion

A total of 37,464 impulse and non-impulse military sound records was used to train and test the developed DCNN classifier. The data was partitioned into training (67%) and testing (33%) datasets of 25,101 and 12,363 sound records, respectively. The results obtained are presented in the subsections below.
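This 67/33 partition can be reproduced, for example, with scikit-learn, where X and y denote the 25-metric feature matrix and the class labels; the random seed and stratification are our assumptions, not stated in the paper.

from sklearn.model_selection import train_test_split

# 67%/33% split of the 37,464 records (25,101 train / 12,363 test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)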

4.1 Performance Evaluation

The DCNN classifier was evaluated using a confusion matrix containing True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), together with Precision, the Matthews Correlation Coefficient (MCC), Accuracy (Acc), the Receiver Operating Characteristic (ROC) curve, and the Area Under the ROC Curve (AUC). The partitioning for each sound type is depicted in Table 3.

Table 3. Partitioned dataset for each sound type.
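For reference, the scalar metrics reported below follow directly from the confusion-matrix counts; a short sketch of their standard definitions is given here (illustrative only):

import math

def evaluation_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, accuracy, mcc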

4.2 DCNN Model Performance Result

The experimental results obtained for the performance of the DCNN on the six sound-type classes are depicted in Table 4, which analyses the number of predicted results against the actual results. The table shows, for each of the six sound classes, the values of TP, TN, FP, and FN.

Table 4. Positive and negative detections in DCNN.

The positive and negative detections of the DCNN are shown in Table 4, while the overall performance for the six sound classes, based on Precision, MCC, and Accuracy, is shown in Table 5; the DCNN classifier returned its best accuracy results for machine gun, wind, and thunder at 97.43%, 96.98%, and 95.16%, respectively.

Table 5. Precision, MCC, and accuracy of the DCNN for each sound type.

Table 5 further shows that machine gun has the lowest classification error rate at 2.56%, followed by wind at 3.02% and thunder at 4.84%. This result indicates that the performance of the DCNN classifier on machine gun, wind, thunder, and blast sounds is quite encouraging, as its error rate remains within acceptable standard rates.

5 Conclusion

This study developed a DCNN model, a variant of the MLP, to classify six categories of sounds (military impulse and non-impulse). The experimental results showed that the DCNN model gave optimal accuracy when classifying the machine gun, wind, and thunder sound types, at 97.43%, 96.98%, and 95.16%, respectively. Classification of the aircraft and vehicle sound types was lower, at 87.0% and 88.83%, respectively. However, the average classification error rate for the six sound types was 6.57%, which shows that the DCNN is a promising classifier. We plan to compare the obtained results with previous ANN implementations using varying numbers of input features, e.g., 4, 8, and 25.