1 Introduction

The protection of wildlife is becoming increasingly important as populations grow smaller every year, as is evident from ongoing poaching activities [1]. Wildlife Department officers pursue people involved in poaching in Semporna, Sabah [1]. Part of the problem is that legal loggers sometimes break the rules by entering wildlife zones [2]. To address this, the Sabah Forestry Department favors setting up a dedicated wildlife enforcement team, as intruders have become more daring in forests and reserve areas [3]. Even though protection initiatives have been made, the numbers of wildlife species have continued to decline, with some species residing in sanctuaries now near extinction. Many approaches have been used to protect wildlife, and all face challenges. A recent study stressed the urgent need for new or combined approaches to be taken up as research challenges to enable better protection against poaching in wildlife zones [4].

One of the challenges is the implementation of security in remote areas. It requires special equipment, such as camera traps, designed to endure rainforest conditions. Such cameras require high maintenance because their locations have no power grid, forcing them to rely on batteries for surveillance, and there is a high probability of their being spotted by intruders [5]. The equipment and cameras can be stolen or destroyed by trespassers (WCS, 2017). The camera-trap surveillance used by the Wildlife Conservation Society (WCS) Malaysia required a large amount of memory for data storage and suffered from fog and blockages of the camera view. The stealth of the cameras is low due to the limited angle of view, and maintenance such as changing batteries and memory cards is troublesome. In addition, remote locations with no cellular network access make it difficult to transmit video data.

There is a need for a better solution that also takes maintenance cost and security into account. Low investment has been cited as a reason for the lack of wildlife protection in Southeast Asia [6]. A solution with low power consumption is therefore desirable, since it allows less frequent maintenance and saves cost. Computing solutions for detecting intruders have been proposed, mainly in acoustic surveillance: sounds recorded in the wildlife zone are classified into two types, intrusion and non-intrusion. In such systems, the Fast Fourier Transform (FFT) spectrum of the audio signal is used to extract information, and a similarity threshold is calculated to classify an intrusion.

Many studies have focused on signal classification for several types of applications, including acoustic classification [7,8,9,10,11,12,13,14,15]. Machine learning methods are still widely used for acoustic signals even though more recent methods, such as Convolutional Neural Networks and deep learning, have been applied to acoustic classification [16, 17]. Quadratic discriminant analysis has been used to classify the audio signals of passing vehicles using features based on short-time energy, average zero-crossing rate, and the pitch frequency of periodic signal segments, and demonstrated acceptable accuracy compared with methods in previous studies [18]. Feature extraction from audio signals is thus a task of prime importance. For instance, features based on the spectrum distribution and on the wavelet packet transform have shown different performance with the k-nearest neighbor algorithm and the support vector machine classifier [19]. This paper aims to identify a technique that can efficiently identify audio signals of intrusion events, namely vehicle engines and chainsaw activities against environmental noise in wildlife reserves, and to evaluate an audio intrusion detection approach using datasets from WCS Malaysia.

2 Related Work

2.1 Signal Processing

An audio recording is a waveform whose frequency range is audible to humans. Stacks of audio signals are used to describe the varying data format of stimulus audio signals [20]. To characterize the output signal, classification systems analyze both the stimulus signal and the audio signal, which is helpful for catching any variation in a speech signal [21]. Prior to classification, features are extracted from the audio signal to minimize the amount of data [22]. A feature is a numerical representation that can later be used to characterize a segment of an audio signal, and valuable features can be used in the design of the classifier [23]. Audio signal features that can be extracted include the Mel Frequency Cepstral Coefficients (MFCC), pitch and sampling frequency [22].

MFCC represents an audio signal measured on the Mel scale [24]. These features are commonly used for speech signals. MFCC is calculated by grouping the STFT coefficients of each frame into sets of 40 coefficients using a set of 40 weighting contours that simulate human frequency perception. The Mel scale relates the perceived frequency of a pure tone to its actual measured frequency.
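For illustration only, a minimal sketch of MFCC extraction using the open-source librosa library is shown below. This is not part of the present study's pipeline, which uses LPC features (Sect. 2.2); the file name and the number of coefficients are assumptions.

    # Minimal MFCC extraction sketch (illustrative only; this study itself uses LPC features).
    # "clip.wav" is a hypothetical 5 s mono recording; 13 coefficients is an assumed setting.
    import librosa

    y, sr = librosa.load("clip.wav", sr=44100, mono=True)   # load audio at 44.1 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # one set of coefficients per frame
    print(mfcc.shape)                                       # (n_mfcc, n_frames)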

Pitch determination is important for speech processing algorithms [25]. Pitch is the quality of a sound that is chiefly correlated with the rate of the vibration producing it, that is, the degree of lowness or highness of the tone. The sound produced by the vocal cords starts at the larynx and ends at the mouth. When unvoiced sounds are produced, the vocal cords do not vibrate and remain open, whereas when voiced sounds are produced, the vocal cords vibrate and generate pulses known as glottal pulses [24].

2.2 Feature Extraction

One of the main techniques in audio and speech signal processing is Linear Predictive Coding (LPC). It is frequently used to extract the spectral envelope of a digital audio signal in a compact form by applying a linear predictive model. LPC provides very accurate speech parameter estimates for speech analysis [25]. The LPC coefficient representation is normally used to extract features that capture the spectral envelope of the original analog signal [26]. Linear prediction is based on a mathematical computation in which the upcoming values of a discrete-time signal are estimated as a linear function of previous samples. LPC is regarded as a subset of filter theory in digital signal processing. It applies mathematical operations such as the autocorrelation method of autoregressive modeling to obtain the filter coefficients. LPC feature extraction is sufficient for acoustic event detection tasks.
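As a sketch of the autocorrelation (Levinson-Durbin) method described above, the following pure-NumPy function estimates the predictor coefficients of a single frame. It is illustrative only: the study itself used MATLAB's lpc function, and the placeholder frame and the order of 10 (matching the L1 to L10 features used later) are assumptions.

    # Sketch of LPC via the autocorrelation (Levinson-Durbin) method; illustrative only.
    import numpy as np

    def lpc_coefficients(x, order=10):
        """Return `order` predictor coefficients for frame x (non-silent frame assumed)."""
        x = np.asarray(x, dtype=float)
        # Autocorrelation of the frame for lags 0..order
        r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
        a = np.zeros(order)          # predictor coefficients a_1..a_order
        err = r[0]                   # prediction error energy
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc -= a[j - 1] * r[i - j]
            k = acc / err            # reflection coefficient
            a_new = a.copy()
            a_new[i - 1] = k
            for j in range(1, i):
                a_new[j - 1] = a[j - 1] - k * a[i - j - 1]
            a = a_new
            err *= (1.0 - k * k)
        return a

    frame = np.sin(2 * np.pi * 440 * np.arange(2048) / 44100)   # placeholder audio frame
    print(lpc_coefficients(frame, order=10))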

The selection of extracted features is important to obtain optimized values from a set of features [27]. Selecting features from a large set of available features allows a more scalable approach. The selected features are then used to determine the nature of the audio signal for classification purposes. Feature selection aims to choose the optimum values that keep accuracy and performance high while minimizing computational cost. If no optimum feature set is developed, accuracy can suffer drastically and more computational cost is required [28]. Reduction of features can improve prediction accuracy and may become a necessary, embedded step of the prediction algorithm [29].

2.3 Random Forest Algorithm

Random forests are a type of ensemble method that predicts by averaging the predictions of several independent base models [30]. Each independent model is a tree, and many trees make up a forest [31]. Random forests are built by combining the predictions of trees that are trained separately [32]. The construction of a random tree involves three choices [33]:

  • Method for splitting the leaves.

  • Type of predictor to use in each leaf.

  • Method for injecting randomness into the trees.

The trees in a random forest are randomized regression trees, whose combination forms an aggregated regression estimate [34]. The ensemble size, that is, the number of trees generated by the random forest algorithm, is an important factor to consider, as its effect differs between situations [35]. In past implementations of the random forest algorithm, the ensemble size had a major effect on the accuracy level. A bag of features is used as the input data for predictions [36]. Studies of the ensemble size show slightly better accuracy when the number of trees is set to a large value [37].
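A minimal sketch of how the ensemble size is varied in a random forest classifier is shown below. It is illustrative only: the data here are synthetic placeholders for the ten LPC features and three event classes used later in this paper, and scikit-learn is used rather than the paper's own tooling.

    # Sketch of varying the ensemble size (number of trees); illustrative only.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))     # placeholder for 10 LPC features per clip
    y = rng.integers(0, 3, size=60)   # placeholder classes: vehicle, nature, chainsaw

    for n_trees in (4, 50, 200):      # compare small and large ensembles
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(n_trees, round(score, 3))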

3 Development of an Audio Event Recognition for Intrusion Detection

3.1 System Architecture

The development of the audio event recognition for intrusion detection starts with the identification of the system architecture. Figure 1 shows the system architecture and its main components in block diagram form. The system should be able to classify audio as intrusive or non-intrusive so that accurate intrusion alarms can notify rangers. Figure 2 shows the system flow diagram, consisting of a loop of real-time audio recording and classification.

Fig. 1. System architecture

Fig. 2. System flow diagram

3.2 Data Acquisition and Preparation

This section explains the data processing and feature extraction processes. A set of recordings (the signal dataset) was provided by WCS Malaysia. The recordings consist of 60 s of ambient rainforest audio and of a vehicle engine revving towards the recording unit in the rainforest.

Since the raw data acquired are unstructured and unsuitable for machine learning, the data must be put into a standard form so that the system can learn from this source. A standardized format was formulated to allow a more manageable approach to the problem: 5 s duration, mono-channel waveform audio files at a sampling frequency of 44100 Hz. Two 5-s segments from the raw audio files are combined using Sony Vegas, an audio and video manipulation application, to resynthesize training data. Independent audio files of vehicle engines and rainforest background are overlapped in various combinations, as described in the scenarios below. The vehicle audio level is lowered to simulate various distances between the vehicle and the device; to produce a long-distance scenario, the vehicle audio is attenuated by 5 dB up to 20 dB. The composed audio is then verified by human listening to validate that it remains plausible in terms of hearing ability and classification. Figures 3, 4, 5 and 6 visualize the four scenarios of resynthesizing two layers of audio signals, where the upper track is the natural environment and the lower track is the vehicle engine audio segment.
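A minimal sketch of this resynthesis step, overlaying an attenuated vehicle segment onto a rainforest segment, is given below for illustration. The study itself performed this step in Sony Vegas; the file names and the rescaling at the end are assumptions.

    # Sketch of overlaying a vehicle clip onto a rainforest clip with dB attenuation; illustrative only.
    import numpy as np
    from scipy.io import wavfile

    sr_a, forest = wavfile.read("rainforest_5s.wav")   # hypothetical 5 s mono clip, 44100 Hz
    sr_b, vehicle = wavfile.read("vehicle_5s.wav")     # hypothetical 5 s mono clip, 44100 Hz
    assert sr_a == sr_b == 44100

    attenuation_db = 20                                # simulate a distant vehicle
    gain = 10 ** (-attenuation_db / 20)                # dB -> linear amplitude
    mix = forest.astype(np.float64) + gain * vehicle.astype(np.float64)

    # Rescale to avoid clipping and write the resynthesized training clip
    mix = mix / np.max(np.abs(mix)) * 0.9
    wavfile.write("scenario_long_distance.wav", sr_a, (mix * 32767).astype(np.int16))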

Fig. 3. Scenario 1, direct vehicle pass through

Fig. 4. Scenario 2, last 2 s vehicle pass through

Fig. 5. Scenario 3, first 2 s vehicle pass through

Fig. 6. Scenario 4, middle 3 s vehicle pass through

The resynthesized audio files created as training data are divided into three separate audio event classes. The recordings acquired are edited using software to extract five-second segments containing audio indications of vehicle or chainsaw activity and typical rainforest conditions. The vehicle audio data consist of 4 × 4 vehicles in motion. Since machine learning requires the data in numerical form, the training audio data are not yet ready for modelling; the next step is to extract LPC features from the audio files created above. The feature extraction from the waveform audio files is done with the LPC function of the MATLAB R2017b digital signal processing toolbox.
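For illustration, the sketch below builds the feature table: one row of ten LPC coefficients (L1 to L10) per 5-s clip, together with its class label. This is not the study's own pipeline, which used MATLAB's lpc function; the folder layout and the use of librosa are assumptions.

    # Sketch of building a table of LPC features (L1..L10) per training clip; illustrative only.
    import glob
    import librosa
    import numpy as np

    rows, labels = [], []
    for label in ("vehicle", "nature", "chainsaw"):            # three event classes
        for path in glob.glob(f"training/{label}/*.wav"):      # hypothetical folder layout
            y, sr = librosa.load(path, sr=44100, mono=True)
            a = librosa.lpc(y, order=10)                       # [1, a1, ..., a10]
            rows.append(a[1:])                                 # keep L1..L10
            labels.append(label)

    X = np.array(rows)
    print(X.shape, len(labels))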

4 Results and Discussion

4.1 Audio Data Analysis Using Welch Power Spectral Density Estimate

To further examine the waveform audio files, they are converted from the time domain to the frequency domain. Using the Welch Power Spectral Density Estimate function in MATLAB R2017b, Fig. 7a–f shows the different scenarios and their representation as power spectral density estimate graphs.
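An analogous Welch power spectral density estimate can be sketched in Python as follows; this is illustrative only, the study used the MATLAB function, and the input file name and segment length are assumptions.

    # Sketch of a Welch power spectral density estimate; illustrative only.
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import welch

    sr, x = wavfile.read("scenario_long_distance.wav")   # hypothetical 5 s mono clip
    f, pxx = welch(x, fs=sr, nperseg=4096)                # averaged periodogram

    plt.semilogy(f, pxx)
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("PSD (power/Hz)")
    plt.show()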

Fig. 7. (a) Very low vehicle noise with high-intensity rainforest environment background, (b) low vehicle noise with medium-intensity rainforest environment background, (c) obvious vehicle noise with low-intensity rainforest environment background, (d) low-intensity rainforest environment audio, (e) medium-intensity rainforest environment audio and (f) high-intensity rainforest environment audio.

The composition is constructed by overlapping the environmental audio with the engine activity attenuated by 20 dB. This audio file was validated by human listening, but the listeners reported no presence of vehicle activity, showing that even humans cannot hear the vehicle at this level. This finding indicates that machines have the capability of performing such surveillance accurately.

4.2 Results of a Random Forest Simulation

The simulation of the random forest used "sklearn", a Python machine learning library, and "Graphviz", a visualization library, to create the decision trees. The simulation is done by producing 4 trees from several subsets of the entire dataset. The Gini index or entropy is normally used to create a decision tree on each subset with random parameters. Testing is done on all 4 trees with the same input data, and the output returned by the majority of trees is taken as the result. The random forest tree generation process is a series of random selections from the main training dataset into smaller subsets consisting of evenly classed data [27]. In this case the dataset is broken into 2 subsets, and each subset is used to generate one tree with the Gini index and one with the entropy method, producing an ensemble of 4 trees that can be used for prediction with the random forest technique. Figure 8 displays the random forest dataset selection and tree generation process.
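A minimal sketch of this simulated ensemble, two data subsets each fitted with one Gini tree and one entropy tree and exported for Graphviz, is given below. It is illustrative only; the feature matrix and labels are placeholders for the LPC data.

    # Sketch of the 4-tree simulation (2 subsets x {gini, entropy}); illustrative only.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))                          # placeholder L1..L10 features
    y = rng.integers(0, 3, size=60)                        # placeholder classes

    subsets = np.array_split(rng.permutation(len(X)), 2)   # two random subsets
    trees = []
    for idx in subsets:
        for criterion in ("gini", "entropy"):
            t = DecisionTreeClassifier(criterion=criterion, random_state=0)
            t.fit(X[idx], y[idx])
            trees.append(t)

    # Export the first tree as DOT text for visualization with Graphviz
    export_graphviz(trees[0], out_file="tree1.dot",
                    feature_names=[f"L{i}" for i in range(1, 11)])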

Fig. 8.
figure 8

Random Forest tree generation method

Test sets A, B and C consist of features extracted from vehicle, natural environment and chainsaw audio, respectively. Variables L1, L2, L3, L4, L5, L6, L7, L8, L9 and L10 are the LPC features extracted from the audio files. Table 1 shows the test inputs for the experiment. Figures 9 and 10 show examples of the generated and visualized trees.

Table 1. Test sets variables and target class
Fig. 9. Tree 1 generated and visualized

Fig. 10. Tree 2 generated and visualized

Each test set A, B and C is tested on all 4 generated trees, and the majority class is taken as the most common result among the trees. Table 2 shows the results for each tree and test set. It can be concluded that, although individual trees may produce false results, the ensemble as a whole allows a better interpretation of the overall prediction; a cumulative majority result helps avoid false positives.
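The majority vote over the four trees can be sketched as follows; the vote values are placeholders and purely illustrative.

    # Sketch of the majority vote over the 4 simulated trees for one test set; illustrative only.
    from collections import Counter

    votes = ["vehicle", "vehicle", "nature", "vehicle"]   # one prediction per tree (placeholders)
    majority_class, count = Counter(votes).most_common(1)[0]
    print(majority_class, count)                          # -> vehicle 3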

Table 2. Result of random forest simulation

Table 3 shows the results of the MATLAB 2017b TreeBagger classifier in the Classification Learner App together with a series of tests performed on the WEKA platform. On both platforms, an average of 86% correct prediction was obtained based on the 10 LPC feature variables. The results are promising given that the training data are limited.

Table 3. Results using MATLAB 2017b tree bagger function.

The Classification Learner App in MATLAB 2017b allows many classifiers to be run. It was found that the Linear Discriminant method is more accurate in predicting the classes of LPC features extracted from audio files containing events such as vehicles, chainsaws and natural acoustic events. Basic decision tree results may differ based on the maximum number of splits, which can be controlled to produce a diversity of results. The performance of each type of tree is assessed on the entire dataset. A fine tree is defined by increasing the maximum number of splits allowed in the generation process; a medium tree lies between a fine tree and a coarse tree, with a moderate number of maximum splits; a coarse tree allows a low total number of splits. Table 4 shows the results of all basic decision trees generated with their respective parameters.
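A sketch of the fine/medium/coarse idea, capping tree complexity at three levels, is shown below. It is illustrative only: scikit-learn is used as an analogue of the MATLAB Classification Learner trees, max_leaf_nodes stands in for the maximum number of splits, and the cap values and data are placeholders.

    # Sketch of "fine / medium / coarse" trees via a complexity cap; illustrative only.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))          # placeholder L1..L10 features
    y = rng.integers(0, 3, size=60)        # placeholder classes

    for name, max_leaves in (("fine", 100), ("medium", 20), ("coarse", 4)):
        t = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
        score = cross_val_score(t, X, y, cv=5).mean()
        print(name, max_leaves, round(score, 3))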

Table 4. Results of basic decision trees using the Gini diversity index on the LPC dataset.

5 Conclusions

The random forest technique with linear predictive coding feature extraction has been found to be efficient, and this combination is the best among those compared with past studies. The current study only achieved 86% accuracy, which is believed to be related to the variance and amount of data collected for training the model. It can therefore be concluded that an implementation of random forest requires an adequate dataset for training to allow better results. LPC extraction and classification of audio signals require very little computing power. In future work, other techniques such as deep learning and different types of signal datasets can be evaluated for a better solution.