1 Introduction

Audio segmentation and classification have long been of interest for applications where the first task is to categorize an incoming signal as music, speech or silence. The result can then be analyzed further, for example for speaker identification, instrument identification or genre detection. This work focuses on classifying speech and guitar signals using Empirical Mode Decomposition (EMD) and evaluates its performance with different classifiers. Although much work has been carried out in the past on speech vs. music classification for broadcast news data, classification of speech and guitar signals remains largely unexplored. Guitar signals are an example of low-frequency music signals and broadly lie in the range of 80–1200 Hz. Such signals are typically encountered in live performances or beat poetry, where an actor speaks between guitar interludes. Since both guitar and speech signals are non-stationary and share a common region of the frequency spectrum, their classification is a complicated task.

1.1 Related works

Early work in speech vs. music classification was conducted by Saunders [24], who used statistical features based on the energy contour and the zero-crossing (ZC) rate to separate speech and music. Accuracy of up to 98% was reported when probability measures on the signal energy were used in addition to the skewness of the ZC rate distribution. Zhang and Kuo [35] proposed a heuristic rule-based method for segmenting and classifying audio into song, speech, environmental noise and silence. Features such as the fundamental frequency, average ZC rate and spectral peak tracks yielded an accuracy of more than 90% for audio classification and 95% for audio segmentation. Scheirer and Slaney [25] explored the Power Spectral Density (PSD) and proposed features such as spectral flux, spectral roll-off and spectral centroid for speech and music discrimination. They tested these features with different classifiers and reported an accuracy of 98.2% on 2.4 s segments. Alexandre et al. [1] used spectral features along with Mel Frequency Cepstral Coefficients (MFCCs) and the high zero-crossing rate ratio with a Fisher Linear Discriminant classifier and k-Nearest Neighbour. Another method proposed probability-based features using a Hidden Markov Model (HMM) [31], with a Gaussian likelihood ratio test used for classification.

This work explores the use of Empirical Mode Decomposition to analyze non-stationary signals and extract features that provide discriminatory evidence between speech and guitar signals. EMD and the Hilbert spectrum have been explored extensively for nonlinear and non-stationary time series analysis [11]. EMD has been applied to speech/music discrimination with promising results [13] and has also been used for detecting situational interest amongst students during learning [2]. EMD acts as a dyadic filter for the incoming signal and extracts the different frequency scales present in it [6, 32]. These extracted scales are known as Intrinsic Mode Functions (IMFs). Because of the information embedded in them, IMFs have been exploited in applications such as speech analysis, climate analysis and biomedical processing, with promising results [4, 9, 10]. Cepstral coefficients have been explored in [17, 26] for classification and segmentation of speech and music signals using Gaussian Mixture Models and SVMs. Speech-specific features for the classification task were proposed in [12], where a significant improvement was observed when they were combined with existing features. A Convolutional Neural Network operating on audio spectrograms was proposed in [20]. Speech and music classification using IIR-CQT spectrogram based statistical descriptors and an extreme learning machine was proposed in [3]. A fast and efficient technique for segmentation and classification of speech and music using amplitude and Zero Crossing Rate (ZCR) was explored in [18]. The use of a modified SVM for speech/music discrimination within the Selectable Mode Vocoder (SMV) framework was explored in [16]. New feature vectors based on a sinusoidal model were explored with SVM and GMM classifiers in [28]. In [22], the fundamental frequency was estimated for the classification task. An audio-driven algorithm for detecting speech and music events in multimedia content was introduced in [29]. Bykhovsky et al. [5] improved robust voiced/unvoiced decisions in the presence of environmental noise using a generalized likelihood ratio test (GLRT); automatic threshold evaluation techniques adopting both Constant False Alarm Rate (CFAR) and Bayes criterion thresholds were proposed in that work.

1.2 Motivation

The aim of this study is to investigate the efficacy of EMD-based statistical features in classifying speech and a low-frequency music signal that share a common spectral range, for different tuning parameters of state-of-the-art classifiers. A speech signal is governed by the source-filter model, which is broadly similar across speakers. Guitar signals, on the other hand, are produced by the controlled vibration of strings. The IMFs of both music and speech signals are expected to contain information about the source: the IMFs of speech signals should carry information about glottal activity, while the IMFs of guitar signals should reflect the characteristics of the vibrating strings. This fundamental difference in production should result in different IMF patterns. The aim of this study is to extract features that exploit these differences and use them to classify speech and guitar signals. The extracted features are tested with four different classifiers and the results are compared.

Major objectives:

  • To investigate and understand the efficacy of EMD-based statistical features in classifying speech and a low-frequency music signal that share a common spectral range.

  • To study how the performance of the models varies with different tuning parameters of state-of-the-art classifiers.

  • To identify and rank the best-performing statistical features based on experimental results and inference.

  • To study the improvement in model performance when different combinations of the best-performing isolated features are used.

  • To study the impact of feature selection on the raw data and to verify whether the experimentally derived choice of best-performing hybrid features matches the results obtained from two feature selection techniques.

The rest of the paper is organized as follows: Section 2 briefly describes the EMD process. Section 3 describes the database used for the experiments and presents an analysis of the IMFs extracted from speech and guitar signals. Section 4 describes the feature extraction process. Section 5 discusses the classifiers used in this work. Experimental results and observations, including the effect of feature selection on classifier performance, are discussed in Section 6. Section 7 draws a comparative analysis with past work, and Section 8 presents the conclusion and future scope of this work.

2 Empirical mode decomposition for audio signals

EMD has found application in many real-world analyses, where it extracts the AM-FM components of a complex signal by breaking it into a number of IMFs [13]. The method decomposes a signal without any parametric optimization and without a priori information. Table 1 describes the EMD process in brief [13].

Table 1 Steps to evaluate EMD

An IMF represents a single frequency scale and must satisfy the condition that its numbers of zero crossings and extrema are equal or differ by at most one. Because this criterion is rigid, researchers have proposed several sifting stopping criteria [30]. In this work, the decomposition of each signal is limited to 10 IMFs to preserve the dyadic nature of EMD; this also avoids unnecessary processing of the later IMFs, which contain mostly low-frequency trends for both speech and guitar signals.
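As an illustration of the sifting procedure summarized in Table 1, a minimal MATLAB sketch is given below. It assumes the Signal Processing Toolbox functions findpeaks and spline for envelope construction, uses a fixed number of sifting passes instead of a formal stopping criterion, and is not the exact implementation used in the experiments.

```matlab
function imfs = simple_emd(x, maxImf)
% Minimal EMD sketch: extract up to maxImf IMFs from signal x by sifting.
% Envelope estimation and stopping rules are simplified for illustration only.
x = x(:)';  n = numel(x);  t = 1:n;
imfs = zeros(maxImf, n);                          % unused rows stay zero if EMD stops early
residue = x;
for m = 1:maxImf
    h = residue;
    for s = 1:10                                  % fixed number of sifting passes
        [pks, pLoc] = findpeaks(h);   pks = pks(:)';   pLoc = pLoc(:)';   % local maxima
        [vls, vLoc] = findpeaks(-h);  vls = vls(:)';   vLoc = vLoc(:)';   % local minima (negated)
        if numel(pLoc) < 2 || numel(vLoc) < 2, break; end
        upper = spline([1 pLoc n], [h(1) pks h(end)], t);    % upper envelope
        lower = spline([1 vLoc n], [h(1) -vls h(end)], t);   % lower envelope
        h = h - (upper + lower)/2;                % subtract the local mean
    end
    imfs(m, :) = h;
    residue = residue - h;                        % residue keeps the lower-frequency content
    if numel(findpeaks(residue)) < 2, break; end  % stop once the residue is (near) monotonic
end
end
```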

3 Database and analysis of IMFs

Earlier work on music vs. speech discrimination mostly used the Scheirer-Slaney database [25]. Since this work focuses on evaluating EMD-based features for classifying speech and low-frequency music, that database could not be used in its original form; instead, only its speech samples were retained and the music samples were discarded. Each speech sample was down-sampled from 22.05 kHz to 8 kHz. Guitar sound samples were downloaded from YouTube [34]: a continuous one-hour recording of guitar chords was downloaded, down-sampled to 8 kHz and split into 80 clips of 15 seconds each to match the number of speech files in the Scheirer-Slaney database. For both the speech and guitar sets, 60 of the 80 files were used for training and the remaining 20 for testing. The spectral spread of these files was also examined. Figure 1 shows the FFT of a speech sample and a guitar sample. Most of the peaks in the speech spectrum lie between 0 and 4000 Hz, whereas for the guitar signal the peaks span mostly 0–1500 Hz, with small peaks around 2600 Hz.

Fig. 1 Single Sided Amplitude Spectrum of (a) Speech Signal (b) Guitar Signal

All simulations were carried out with MATLAB R2015b running on an Intel® Core™ i7-6700 64-bit processor with 8 GB RAM.
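The database preparation described above can be sketched as follows. The file names are placeholders, the source recording is assumed to be stereo at its original sampling rate, and this is only a minimal illustration of the resampling and segmentation steps, not the exact script used.

```matlab
% Sketch: down-sample a continuous guitar recording to 8 kHz and split it
% into 80 clips of 15 seconds each (file names are placeholders).
[g, fsIn] = audioread('guitar_chords_1hour.wav');
g  = resample(mean(g, 2), 8000, fsIn);            % mono, down-sampled to 8 kHz
fs = 8000;  clipLen = 15*fs;  nClips = 80;
for i = 1:nClips
    clip = g((i-1)*clipLen + 1 : i*clipLen);
    audiowrite(sprintf('guitar_%02d.wav', i), clip, fs);
end
```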

3.1 Analysis of extracted IMF

This section discusses the IMFs extracted from the speech and guitar signals. Figure 2 shows the first seven IMFs extracted from a speech and a guitar audio sample, respectively.

Fig. 2 IMF 1–7 from EMD decomposition of (a) Speech Signal (b) Guitar Signal

3.1.1 Analysis of IMFs of speech signal

The complex speech signal is decomposed into seven different IMFs in decreasing order of frequency. Past works have performed AM-FM analysis of speech, attempting to represent a speech signal in terms of AM-FM components [19]. The AM-FM nature of the speech signal is clearly visible in the first three IMFs in Fig. 2a. Sinusoidal waveforms are also spread across different IMFs, especially IMFs 5, 6 and 7. These sinusoids reflect the voiced speech segments and have been used by researchers to detect glottal activity [27]. However, the task is complicated by mode mixing, in which a single frequency scale is distributed among different IMFs and different frequency scales are merged into one IMF [2, 8]. This problem is addressed by advanced versions of EMD such as Ensemble Empirical Mode Decomposition (EEMD) [33] and its variants. EEMD adds small-amplitude white noise to the original data and takes the ensemble mean of the IMFs extracted from the noisy copies. Over many iterations the added white noise averages out, leaving behind the original components of the signal while separating the modes into their proper IMFs.
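EEMD can be sketched on top of any basic EMD routine (for example the simple_emd sketch in Section 2, or a toolbox implementation); the noise amplitude and ensemble size below are illustrative parameters rather than the values recommended in [33].

```matlab
function imfs = simple_eemd(x, maxImf, nEnsemble, noiseStd)
% EEMD sketch: average the IMFs obtained from noise-perturbed copies of x.
x = x(:)';
imfs = zeros(maxImf, numel(x));
for e = 1:nEnsemble
    noisy = x + noiseStd * std(x) * randn(size(x));   % add low-amplitude white noise
    imfs  = imfs + simple_emd(noisy, maxImf);         % accumulate IMFs of the noisy copy
end
imfs = imfs / nEnsemble;                              % the added noise averages out
end
```

In the literature, ensembles of a few hundred realizations with noise on the order of 0.1–0.2 of the signal's standard deviation are commonly used.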

3.1.2 Analysis of IMFs of guitar signal

Unlike speech samples, which are governed by the source-filter theory, the source of a music sound is completely different, and hence a difference in the IMFs of the two was expected. The guitar wave files contain musical chords, which are played by sounding three or more notes simultaneously. The presence of different frequency components is clearly visible in the first six IMFs in Fig. 2b. A clear difference is seen in the first two IMFs of the speech and guitar signals: while the former reflect the AM-FM nature of speech, very few high-frequency components are observed in the IMFs extracted from the guitar signal. The guitar IMFs are more sinusoidal in nature, and mode mixing is clearly visible in IMFs 3 to 6. A low-frequency, trend-like waveform is observed in IMF 7 of the guitar signal, unlike the speech sample, whose residue still shows oscillations.

4 Feature extraction

The main objective of this work is to perform statistical analysis on the IMFs generated by Empirical Mode Decomposition of speech and guitar signals and to observe the discriminatory characteristics embedded in them, which can then be used by machine learning algorithms to solve the classification problem. In the past, such statistical features have shown significant performance in different classification problems, especially in EEG, ECG, speech and music signal processing [2, 13, 15]. Statistical analysis quantifies the data through various statistical operations so that it can be used efficiently by classification algorithms. In this work, five statistical operations are used: mean, absolute mean, variance, skewness and kurtosis.

Figure 3 illustrates the training and testing of the models using audio samples of 15 seconds each. Training samples of speech and guitar signals were pre-processed and down-sampled to 8 kHz. The down-sampled signals are fed to the Empirical Mode Decomposition (EMD) algorithm to generate ten IMFs. These IMFs are divided into non-overlapping frames of one second, giving 15 frames per IMF. Five statistical features are evaluated for each frame in the feature extraction block. Hence, every training and testing sample is represented, for each feature, by a matrix of size 10 × 15, i.e., 15 one-second frames for each of the 10 IMFs. The features are individually normalized and fed to the different classifiers along with the target labels for training. Once a model is trained, the features of the test data are fed to it to label each sample as either speech or guitar. Table 2 lists the features used.
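A minimal sketch of this feature-extraction pipeline is given below; it assumes an EMD routine that returns the IMFs row-wise (such as the simple_emd sketch above) and the skewness and kurtosis functions from the Statistics and Machine Learning Toolbox, and it illustrates only the framing and statistics rather than the exact experimental code.

```matlab
% Sketch: 5 statistics over 15 one-second frames of each of 10 IMFs (8 kHz, 15 s sample x).
fs = 8000;  frameLen = fs;  nFrames = 15;  nImf = 10;
imfs = simple_emd(x, nImf);                       % nImf x (15*fs) matrix of IMFs
feat = struct('mn', [], 'absmn', [], 'vr', [], 'sk', [], 'kt', []);
for m = 1:nImf
    frames = reshape(imfs(m, 1:nFrames*frameLen), frameLen, nFrames);  % one column per second
    feat.mn(m, :)    = mean(frames);              % mean
    feat.absmn(m, :) = mean(abs(frames));         % absolute mean
    feat.vr(m, :)    = var(frames);               % variance
    feat.sk(m, :)    = skewness(frames);          % skewness
    feat.kt(m, :)    = kurtosis(frames);          % kurtosis
end
% Each field is a 10 x 15 matrix; each feature is normalized before classification.
normalize01 = @(F) (F - min(F(:))) ./ (max(F(:)) - min(F(:)));
```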

Fig. 3 Speech & Guitar Signal Classification (a) Training (b) Testing

Table 2 Description of features

5 Classifiers

This work is focused on a binary classification task. Previous work on speech/music discrimination has made extensive use of the Support Vector Machine (SVM) and k-Nearest Neighbour (k-NN) classifiers with satisfactory results [15, 25], which motivated testing these two classifiers on the present task. In addition, experiments were also run with Naïve Bayes and Artificial Neural Network classifiers. A comparative analysis is tabulated in the following sections.

5.1 Support Vector Machine

The SVM is a discriminative classifier that separates two classes by a hyperplane defined by a weight vector w and a bias b. Earlier works have explored the performance of SVM for speech/music classification [14]. The distance between the closest data points and the hyperplane is called the margin of separation, and the points closest to the hyperplane are called support vectors. The algorithm finds the hyperplane that maximizes the margin of separation. In 2D this hyperplane is a line dividing the plane into two parts, with each class lying on one side. For more complex data that are not linearly separable, SVM uses kernels to map the data into a higher-dimensional space. Three kernels, namely linear, Radial Basis Function (Gaussian) and polynomial, were experimented with in this work and compared. Eq. 12 gives the general equation of the separating hyperplane, and Eqs. 13–15 describe the kernels used in this experiment. All simulations were done with the fitcsvm function in MATLAB, which uses Sequential Minimal Optimization (SMO) as the solver for binary classification and optimally finds the kernel width parameter γ and the cost parameter C. For a set of training vectors xj with categories yj = ±1 in some dimension d, where x ∊ Rd, the equation of the hyperplane is

$$ f(x)={x}^{\prime }w+b=0 $$
(12)

Where, w and b are weight vector and bias respectively.

5.1.1 Non-linear transformation using kernels

As discussed earlier, when a simple hyperplane fails to separate the classes, kernel functions are used that retain all the properties of an SVM separating hyperplane. For a class of functions G(x1, x2), a function φ maps x to a linear space S such that

$$ G\left({x}_1,{x}_2\right)=<\varphi \left({x}_1\right),\varphi \left({x}_2\right)> $$
(13)

The functions used are:

(i) Polynomial: For some positive integer p,

$$ G\left({x}_1,{x}_2\right)={\left(1+{x_1}^{\prime }{x}_2\right)}^p $$
(14)

For this experiment, p = 2 has been used.

(ii) Radial Basis Function (Gaussian):

$$ G\left({x}_1,{x}_2\right)=\exp \left(-{\left\Vert {x}_1-{x}_2\right\Vert}^2\right) $$
(15)
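The three kernels can be selected through the name-value options of fitcsvm. The following is a minimal sketch, where Xtrain and Xtest hold the normalized feature vectors (one observation per row) and ytrain the corresponding speech/guitar labels; the option values are illustrative, since the exact settings used in the experiments are not reported above.

```matlab
% Train binary SVMs with the three kernels discussed above.
mdlLin  = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'linear');
mdlPoly = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'polynomial', 'PolynomialOrder', 2);
mdlRbf  = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'rbf', 'KernelScale', 'auto');
yhat    = predict(mdlRbf, Xtest);                 % predicted class labels for the test set
```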

5.2 K-Nearest Neighbours

The K-Nearest Neighbours (k-NN) classifier stores labelled training data points spread in a multidimensional feature space. It is easy to interpret and requires very little computation time. 'k' is a user-defined constant indicating the number of neighbours that take part in the vote: a test sample is assigned the label that occurs most frequently among the k training samples closest to it. The use of k-NN has been explored in [25]. Euclidean distance is the most commonly used distance metric; however, in this work the performance of the Chebychev and Mahalanobis distance metrics is also examined. All simulations were done with the fitcknn function in MATLAB.

(i) Euclidean distance

For a given set of points (x1, y1), (x2, y2), Euclidean distance is given by

$$ d=\sqrt{{\left({y}_2-{y}_1\right)}^2+{\left({x}_2-{x}_1\right)}^2} $$
(16)
(ii) Chebychev distance

For a given set of points (x1, y1), (x2, y2), the Chebychev distance is given by

$$ d=\max \left(\left|{y}_2-{y}_1\right|,\left|{x}_2-{x}_1\right|\right) $$
(17)
(iii) Mahalanobis distance

For a given vector of data x, Mahalanobis distance is given by

$$ {d}^2={\left(x-m\right)}^T{C}^{-1}\left(x-m\right) $$
(18)

Where m is the mean vector and C−1 is the inverse covariance matrix. The Mahalanobis distance transforms the variables into uncorrelated variables with unit variance; it thus normalizes the spread of the variables and then computes a simple Euclidean distance.
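The three distance metrics can be chosen through fitcknn; a minimal sketch follows, with the same Xtrain/ytrain conventions as in the SVM sketch. The number of neighbours k = 5 is an illustrative value, since the value used in the experiments is not stated above.

```matlab
% k-NN models with the three distance metrics discussed above.
mdlEuc = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5, 'Distance', 'euclidean');
mdlChe = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5, 'Distance', 'chebychev');
mdlMah = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5, 'Distance', 'mahalanobis');
yhat   = predict(mdlEuc, Xtest);                  % predicted class labels for the test set
```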

5.3 Naïve Bayes Classifier

The Naïve Bayes classifier is based on the application of Bayes' theorem. It is a simple probabilistic classifier that assumes strong independence between the features. Using Bayes' rule,

$$ P\left(Y\mid {X}_1,\dots, {X}_n\right)=\frac{P\left({X}_1,\dots, {X}_n\mid Y\right)P(Y)}{P\left({X}_1,\dots, {X}_n\right)} $$
(19)

Where,

\( P\left({X}_1,\dots, {X}_n\mid Y\right) \) = likelihood probability, \( P\left(Y\mid {X}_1,\dots, {X}_n\right) \) = posterior probability, P(Y) = prior probability, and X1, …, Xn = set of feature vectors.

If all the features are independent,

$$ P\left({X}_1,\dots, {X}_n\mid Y\right)=\prod \limits_{i=1}^nP\left({X}_i\mid Y\right) $$
(20)

This reduces the computational complexity. Using these equations, a Naïve Bayes probability model is generated and combined with a decision rule. One such rule is to select the most probable outcome, also known as the maximum a-posteriori (MAP) rule. Simulations were carried out with the fitcnb function in MATLAB.
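A corresponding sketch with fitcnb is given below; by default each predictor is modelled with a normal (Gaussian) distribution per class, which matches the independence assumption above.

```matlab
% Gaussian Naive Bayes on the same feature matrices.
mdlNb = fitcnb(Xtrain, ytrain);                   % one normal distribution per feature and class
yhat  = predict(mdlNb, Xtest);                    % maximum a-posteriori class labels
```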

5.4 Artificial Neural Network

An Artificial Neural Network is a machine learning framework inspired by biological neural networks. It has an input layer, hidden layers and an output layer. Features are presented at the input layer and passed to fully connected hidden layers, which finally feed the output layer, a softmax layer. Each node has a weight vector and a bias; the sum of the products of the input and weight vectors plus the bias is fed to an activation function, generally the sigmoid function given by Eq. 21. The difference between the target output and the evaluated output is the error signal, which is fed back into the network to update the weights. The number of hidden layers is varied in this experiment to observe its impact on classification accuracy. The MATLAB tool nnstart is used to run the simulations; it uses scaled conjugate gradient back-propagation for training.

$$ f(x)=1/\left(1+{e}^{-x}\right) $$
(21)

The softmax function applies the standard exponential function to each variable and divides by the sum of the exponentials over all variables; the denominator acts as a normalizing constant so that the outputs sum to 1. Eq. 22 gives the softmax function

$$ \sigma {(x)}_j=\frac{e^{x_j}}{\sum \limits_{k=1}^K{e}^{x_k}}\ for\ j=1\ to\ K $$
(22)
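The network configured through nnstart can also be set up programmatically with patternnet, its command-line counterpart. The sketch below assumes that ytrain holds class indices 1 and 2 and interprets N as the size of a single hidden layer, which is how the nnstart pattern-recognition tool is typically configured; this is an illustration rather than the exact setup used.

```matlab
% Pattern-recognition network trained with scaled conjugate gradient (patternnet default).
N   = 10;                                         % hidden layer size (varied in the experiments)
net = patternnet(N);
X   = Xtrain';                                    % the toolbox expects observations in columns
T   = full(ind2vec(double(ytrain(:)')));          % one-hot targets, 2 x (number of observations)
net = train(net, X, T);
scores = net(Xtest');                             % softmax outputs (Eq. 22) for the test set
[~, predictedClass] = max(scores, [], 1);         % assumed label convention: 1 = speech, 2 = guitar
```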

6 Simulation results & discussion

This section presents the simulation results. Initial experiments were run on isolated features to select those with the best discriminatory evidence, followed by a study of hybrid features.

6.1 Isolated features

Initially, all five features were analyzed independently for their discriminatory nature across the first seven IMFs. Figure 4 displays the line plots of the normalized values of the five features for a sample speech and guitar file, computed over 15 seconds; the green line represents the speech signal and the blue line the guitar signal. Figure 4a shows the variance plot of the IMFs. The variance of the speech signal is comparatively higher than that of the guitar signal for most samples in IMFs 1 and 2, and the variance plots for IMFs 3 to 7 also show good discriminatory evidence, with very few samples overlapping. In contrast, the kurtosis and skewness plots show considerable overlap between the speech and guitar feature sets, especially for IMFs 4 to 7. Poor discriminatory evidence is also seen in IMFs 1–4 for the mean feature, while IMFs 5–7 give satisfactory results for it. Good discriminatory evidence is seen in the absolute-mean plot for almost all IMFs. The conclusions drawn from Fig. 4 were further validated by scatter plots. Figures 5 and 6 show the scatter plots of all features computed for IMFs 2 and 3 and for IMFs 5 and 6, respectively. It is clearly evident from Figs. 5 and 6 that absolute mean and variance provide good discriminatory evidence for classifying speech and guitar signals. The kurtosis feature shows better discriminatory evidence for IMFs 5 and 6 than for IMFs 2 and 3; both kurtosis and skewness are less spread in the plots, with many data points overlapping. The mean feature also shows good discriminatory evidence in both Figs. 5 and 6. These features were then fed to the different classifiers and their performance evaluated.

Fig. 4 Normalized line plot for (a) Variance (b) Kurtosis (c) Skewness (d) Mean (e) Absolute Mean

Fig. 5 Scatter plot for (a) Variance (b) Kurtosis (c) Skewness (d) Mean (e) Absolute Mean on IMF 2 and 3

Fig. 6 Scatter plot for (a) Variance (b) Kurtosis (c) Skewness (d) Mean (e) Absolute Mean on IMF 5 and 6

Classification accuracy, defined in Eq. 23, was used as the evaluation parameter. The numbers in the following tables are the classification accuracies (in %) of the models for the different features and classifier parameters. The Overall column gives the net classification accuracy of a model (in %), obtained by averaging the classification accuracies of the individual classes (speech and guitar).

$$ Classification\ accuracy\left( in\%\right)=\left( TP+ TN\right)/\left( TP+ TN+ FP+ FN\right)\ast 100 $$
(23)

Where, TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
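As a small illustration, Eq. 23 can be evaluated directly from the confusion matrix of true and predicted labels:

```matlab
% Classification accuracy (Eq. 23) from true and predicted labels.
C   = confusionmat(ytrue, yhat);                  % rows: true class, columns: predicted class
acc = 100 * sum(diag(C)) / sum(C(:));             % correct decisions over all decisions
```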

Table 3 displays the performance evaluation of the SVM classifier for the five features across three different kernels. The absolute-mean feature outperformed all other features, with a classification accuracy of 68.83% when used with the polynomial (order = 2) kernel. Absolute mean and variance performed equally well for both the speech and guitar classes, unlike skewness, which performed better only for the guitar class. The performance of mean and kurtosis was satisfactory. The performance of the SVM classifier for the different feature vectors and kernels was also validated using Receiver Operating Characteristic (ROC) curves in Fig. 7. The parameter used to judge the efficiency of a feature is the Area Under the Curve (AUC): a perfect feature set has an AUC of 1, and a curve closer to the upper left corner indicates a comparatively better feature set. The best results were observed with the Radial Basis Function (RBF) kernel for the kurtosis, absolute-mean and variance features, as reflected in their ROC plots in Fig. 7c.

Table 3 Classification accuracy of SVM classifier (in %) for different Kernels
Fig. 7 ROC Curve for different feature vectors for SVM with (a) Linear (b) Polynomial (c) RBF Kernel

Table 4 displays the performance evaluation of the KNN classifier for three different distance metrics. The Euclidean distance metric performed best, followed by the Mahalanobis and Chebychev metrics. For KNN, better classification accuracy was observed for the guitar signal in all three scenarios. The absolute-mean feature gave the best results among the five features, with a classification accuracy of 62.83%. Variance, skewness and kurtosis performed satisfactorily, while the mean feature showed the lowest performance.

Table 4 Classification accuracy of KNN classifier in % for different distance metric

Table 5 displays the performance evaluation of the Naïve Bayes classifier. The best results were seen for variance, with an overall classification accuracy of 55.99%. While kurtosis performed well for the speech class, skewness and mean performed exceptionally well for the guitar class, with classification accuracies of 88.33% and 95.00% respectively. Absolute mean was the second-best feature, with an overall classification accuracy of 54.16%. Figure 8 shows the ROC plots for the different feature vectors with the Naïve Bayes classifier; only average performance, with poor AUC, was observed.

Table 5 Classification accuracy of Naïve Bayes Classifier (in %)
Fig. 8 ROC Curve for Naïve Bayes Classifier

Table 6 displays the performance evaluation of the ANN for three different numbers of hidden layers. While the performance of some feature vectors improved as N was increased from 5 to 10, that of others declined. The classification accuracy for variance and skewness improved as N was increased from 5 to 10, whereas skewness and absolute mean saw their performance dip from 52.83% and 65.33% to 48% and 58.84%, respectively. As N was further increased to 20, the performance of all features declined except kurtosis, which improved by 6.64%. The best result was observed for variance, with an accuracy of 68% for N = 10.

Table 6 Classification accuracy of ANN(in %) for different no. of hidden layers

Comparative results for all four classifiers are presented in Table 7 and Fig. 9. The best result was observed for the combination of SVM and absolute mean, with an accuracy of 68.83%. Among the classifiers, SVM performed best with an overall accuracy of 61.73%, followed by ANN (61.43%), KNN (55.49%) and Naïve Bayes (52.56%). Among the features, absolute mean performed best with an overall accuracy of 62.78%, followed by variance (61.62%), kurtosis (56.45%) and skewness (52.29%).

Table 7 Comparative performance of different classifiers (in %)
Fig. 9 Performance of different classifiers

6.2 Hybrid features

From the above study on the use of isolated features with different classifiers for the task of guitar and speech signal classification, it was concluded that absolute mean and variance were the two best-performing features. The experiments were continued with hybrid features formed by concatenating two or more features, and their classification accuracies were observed. To verify the discriminatory characteristics of the two best-performing features, i.e. absolute mean and variance, a scatter plot was drawn: Fig. 10 shows the scatter plot of absolute mean vs. variance for two different IMFs, and the discriminatory evidence is prominent in both. Comparative results for the hybrid features are tabulated in Table 8. A sharp improvement is seen in the performance of all classifiers when the combination of absolute mean and variance is used. The best result was observed for SVM (polynomial kernel) when the model was trained with the hybrid of absolute mean, variance and kurtosis (82.00%). Figure 11 displays the corresponding ROC plot; the AUC for the RBF and linear kernels also indicates promising results.

Fig. 10 Scatter plot for Absolute Mean and Variance for (a) IMF3 (b) IMF7

Table 8 Performance comparison for Hybrid Features (in %)
Fig. 11 ROC Curve for Hybrid feature vectors

6.3 Feature selection

Data reduction addresses the problem of high dimensionality, improving computational complexity and data acquisition cost by selecting the most efficient features with maximum discriminatory evidence. In this study, two different feature selection techniques are used.

6.3.1 Feature selection using fisher method

The Fisher method assigns a score to each feature; this score is the ratio of inter-class separation to intra-class variance [7]. The final feature selection keeps the m top-ranked features, where m is a user-defined constant ranging from one to the total number of features. The Fisher method was applied to the raw statistical features and its performance was evaluated for different values of m. The Feature Selection Library (FSLib 2018) was used for the simulations [21]. Table 9 tabulates how the accuracy of the different SVM models varies with the number of feature vectors used; the highest accuracy of 80.33% is obtained when the RBF kernel is trained with 49 features. Figure 12 summarizes the ranking of the feature vectors. The ROC curves in Fig. 13 show the performance of the SVM (RBF) classifier for varying numbers of feature vectors selected with the Fisher method: as the number of features is increased from 10 to 50, the AUC also increases, indicating improved results.
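The experiments above used the FSLib implementation of the Fisher score; for illustration, one common formulation of the score for the two-class case can be sketched as follows, where X holds the observations row-wise and y the class labels.

```matlab
% Fisher score per feature column; a higher score means a more discriminative feature.
classes = unique(y(:))';
mu  = mean(X, 1);                                 % overall mean of each feature
num = zeros(1, size(X, 2));  den = zeros(1, size(X, 2));
for c = classes
    Xc  = X(y == c, :);
    num = num + size(Xc, 1) * (mean(Xc, 1) - mu).^2;   % between-class separation
    den = den + size(Xc, 1) * var(Xc, 1, 1);           % within-class variance
end
fisherScore = num ./ den;
[~, ranking] = sort(fisherScore, 'descend');      % keep the m top-ranked features
```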

Table 9 Classification Accuracy with Increasing number of features for Fisher Method
Fig. 12 Rank distribution of different features using Fisher Method

Fig. 13 ROC Curve for Hybrid feature vectors selected using Fisher Method

The distribution of the feature vectors over ranks 1 to 50 was studied and a histogram was plotted. The best-performing features were variance, absolute mean and mean, with the most occurrences among the top-ranked vectors. Skewness appeared most often across ranks 21–30, while kurtosis was spread across the histogram with most occurrences across ranks 41–50. This study confirms the ordering of discriminatory evidence among the statistical features used. The discriminatory evidence of the feature vectors was investigated further using the F-ratio.

6.3.2 Feature selection using F-ratio

F-ratio is a measure of variance of multi-class data [23]. It is the ratio of the variance of means between classes and the average variance within each class. Mathematically,

$$ F- ratio=\frac{\frac{1}{k}{\sum}_{j=1}^k{\left({\mu}_j-\overline{\mu}\right)}^2}{\frac{1}{k}{\sum}_{j=1}^k\frac{1}{n_j}{\sum}_{i=1}^{n_j}{\left({x}_{ij}-{\mu}_j\right)}^2} $$
(24)

Where k is the total number of classes, μj is the mean of the jth class, \( \overline{\mu} \) is the overall mean, nj is the number of data points in class j and xij is the ith data point in the jth class. A higher F-ratio indicates greater similarity within a class and greater dissimilarity across classes. The following steps were performed to find the optimum features using the F-ratio.

Step 1. The F-ratio of each of the five statistical feature vectors is evaluated using Eq. 24.

Step 2. The maximum of these F-ratios is found.

Step 3. The threshold value for the F-ratio is given by Eq. 25, where k is varied from 1 to 100.

$$ Thres=\left(F\text{-}ratio\right)/k\quad \mathrm{where}\ 0<k<100 $$
(25)
Step 4. Feature vectors with an F-ratio greater than the set threshold are selected for training and testing the model. The results are tabulated in Table 10, and a sketch of these steps is given below.
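The following minimal sketch of Steps 1–4 assumes that the F-ratios of Eq. 24 have already been computed for all feature vectors (fRatio), that the class labels are numeric, and that an RBF-kernel SVM is used as in Table 10; the comparison uses ≥ rather than > so that at least one feature is always selected.

```matlab
% Steps 1-4: F-ratio thresholding followed by SVM training and testing.
accuracy = zeros(1, 100);
for k = 1:100
    thres    = max(fRatio) / k;                   % Eq. 25: threshold decreases as k grows
    selected = fRatio >= thres;                   % feature vectors passing the threshold
    mdl      = fitcsvm(Xtrain(:, selected), ytrain, 'KernelFunction', 'rbf');
    yhat     = predict(mdl, Xtest(:, selected));
    accuracy(k) = 100 * mean(yhat == ytest);      % classification accuracy for this k
end
```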

Table 10 Classification Accuracy for different threshold values of F-ratio

To evaluate the performance of the SVM model, k was varied from 1 to 100, changing the threshold value. As k increases and the threshold decreases, more feature vectors are appended to the training matrix. For many values of k the number of selected feature vectors remained the same, as no new feature vector had an F-ratio exceeding the threshold; such rows were removed from Table 10 to avoid repetition. The best result was seen for the RBF kernel with k = 66, which selected 38 feature vectors and gave an accuracy of 82.16%. The remaining 11 feature vectors had very small F-ratios and therefore never participated in the performance evaluation.

Figure 14 plots the feature distribution for different values of k; a clear dominance of variance and absolute mean can be seen. A total of 39 feature vectors are selected for k = 95: 10 each from variance and absolute mean, 8 from mean, 6 from kurtosis and 5 from skewness. Figure 15 plots the ROC curve of the SVM for the best-performing set of hybrid features.

Fig. 14 Feature Distribution for different values of ‘k’

Fig. 15 ROC Curve for 38 Hybrid feature vectors selected using F-Ratio

7 Comparison with past work

The aim of the study was to investigate and understand the efficacy of EMD-based statistical features in classifying speech and a low-frequency music signal (guitar) sharing a common spectral range, for different tuning parameters of state-of-the-art classifiers. This work does not propose a new set of features for speech/music discrimination; rather, it analyzes the performance of EMD-based features for the classification task. Most works in the literature are based on commercial speech and music segments recorded from television or radio stations, with a wider spectral spread, so a direct comparison with those works would not be meaningful. However, we compare and validate our results against a similar work by Khonglah et al., who explored statistical features from EMD to discriminate speech and music samples using the Scheirer-Slaney database of broadcast radio recordings [13]. They explored SVM and KNN and achieved a highest accuracy of 90.83%. The observations are tabulated in Table 11.

Table 11 Comparative Analysis

From Table 11 we conclude that classifying speech and low-frequency music is more challenging than classifying speech and commercial music, since even after applying feature selection to the raw features the classification accuracy remains below that of the earlier reported work. However, the feature rankings of the two works largely match, with absolute mean, variance and kurtosis being the top-ranked features in both scenarios, which further validates our experiments. The results of this study may extend to Indian instruments such as the Bansuri and the Been, which share a similar spectral spread.

8 Conclusion

This work analyzes the performance of statistical features extracted from EMD for the classification of speech and low-frequency guitar signals. Each signal was decomposed into 10 IMFs, and each IMF was divided into one-second frames for feature extraction. The variation of the extracted features across the IMFs was studied using line plots, and the discriminatory evidence in them was further validated using scatter and ROC plots. Initial experiments were run on isolated features with four different classifiers; absolute mean and variance were the best-performing features, while SVM and ANN were the best classifiers. The best classification accuracy of 68.83% was observed for the absolute-mean feature when used with SVM (RBF). Further analysis of the discriminatory characteristics of hybrid features was done using scatter plots. Different hybrid features were created by combining two or more isolated features, and an overall improvement in performance was seen for all four classifiers. The best result was observed for the hybrid of absolute mean, variance and kurtosis, with an accuracy of 82.00%, a relative improvement of 19.13% over the best-performing isolated feature. To further validate the results, two feature selection techniques were evaluated on the dataset; both the Fisher method and the F-ratio indicated variance and absolute mean as the best-performing features. The best accuracy of 82.16% was obtained with the SVM classifier (RBF) using 38 features (10 Variance + 10 Absolute Mean + 8 Mean + 5 Skewness + 5 Kurtosis). Future work may concentrate on the application of EMD and its variants to the analysis of polyphonic and folk music.