1 Introduction

Multimedia databases usually store thousands of audio recordings such as music, speech, and other sounds. The immense amounts of audio data in these domains necessitate the development of computerized methods for efficient, automated, content-based segmentation and classification of audio data [14]. Such methods have important applications in professional media production, audiovisual archive management, education, entertainment, and surveillance [20]. There are two major applications of audio content analysis. One is to segment an audio stream into a number of constituent audio segments; the other is to classify audio streams into different sound classes such as speech, music, environmental sound, and silence. A number of methods have been proposed to address issues inherent in audio segmentation and classification, such as detecting audio-cuts at which the audio signal changes, selecting audio-segments using the audio-cuts, and classifying segments into different types of audio groups [6, 8–10, 12–15, 17, 18, 20]. These methods are based on both perceptual and acoustical features.

Previously proposed audio segmentation methods can be categorized into two groups depending on how they detect and segment audio-cuts: (1) threshold processing based approaches [6, 9, 10, 20] and (2) fuzzy based approaches [12]. Threshold processing based approaches detect audio-cuts by applying threshold comparisons either to audio features or to differences between audio features at two different times. These methods are therefore subject to performance degradation when segmenting audio streams that contain audio effects such as fade-in, fade-out, and cross-fade. In previous studies, Zhang et al. [20] proposed a heuristic rule-based procedure for audio classification along with threshold processing based segmentation, based on morphological and statistical analysis of the time-varying functions of simple audio features including the energy function, average zero-crossing rate, fundamental frequency, and spectral peak tracks; this ensures the feasibility of real-time processing while sacrificing some classification accuracy. Liu et al. [9] also used the threshold processing based method in audio segmentation, proposing a set of low-level audio features extracted from the time and frequency domains and then using these features as input for a neural net classifier. However, only non-speech and non-music audio data from TV programs were considered, and therefore the generality of their system was not demonstrated. Lu et al. [10] proposed a step-by-step classification method for classifying audio streams into different categories. They introduced a set of new audio features in both the energy and frequency domains and proposed a segmentation algorithm that is based on quasi-GMM and line spectral pair (LSP) correlation analysis. This method can segment audio streams of open-set speakers in real time without a priori knowledge about particular speakers’ speech characteristics. Gang et al. [6] proposed a classification-independent segmentation (CIS) method that calculates the similarities between audio feature vectors. All of these methods are based on threshold processing in audio-cut detection and are consequently vulnerable to the aforementioned problem of performance degradation when audio streams contain sound effects. To overcome this problem, a fuzzy method was proposed by Noaki et al. [12]. This is a soft-segmentation method that utilizes the fuzzy c-means (FCM) algorithm for audio-cut detection.

A number of other methods have also been proposed that consider only audio stream classification. Wold et al. [18] presented an audio retrieval system, and their study is treated as a milestone because it presented a method of content-based audio analysis that distinguished it from previous works [17]. Statistical values for several time and frequency domain measurements are used in Wold et al.’s method to represent perceptual features like loudness, brightness, bandwidth, and pitch. Since this method considers only statistical values, it is only suitable for classifying sounds with a single timbre. Audio classification by support vector machine (SVM) methods was proposed in [17], in which Mel-frequency cepstral coefficients (MFCC) are taken as features. Since MFCCs do not accurately represent the timbres of sounds, this method fails to distinguish music and environmental sounds with different timbre characteristics. Audio classification by a Hidden Markov Model (HMM) approach was proposed in [8]. However, this method requires prior training data, which decreases the robustness of the system. Park et al. proposed three different fuzzy methods for classification of audio in [13–15]. These methods utilize the Gradient-Based FCM algorithm (GBFCM) for soft classification of audio contents. However, the FCM-based methods proposed in [12–15] do not consider the temporal relationships of audio features between audio frames or segments in segmentation or classification, respectively. Therefore, further research is warranted with the goal of improving the performance of GBFCM methods.

Through our intensive study of different methods of audio segmentation and classification, we observed that the FCM algorithm can effectively detect audio-cuts even if the audio signal contains fade-in, fade-out, and cross-fade. The FCM algorithm interprets the existence of audio-cuts as real values between 0 and 1 and thereby detects segments. Further, it subdivides audio-segments into different audio classes such as silence, speech, music, speech with music background, and speech with noise. To improve the performance of audio segmentation and classification based on audio content analysis, this paper proposes a correlation intensive FCM (CIFCM) algorithm that employs audio features and incorporates the concepts of the FCM-based methods proposed in [12–15]. Unlike conventional FCM clustering approaches, the CIFCM algorithm utilizes temporal correlation information between neighboring frames and segments in the context of the current frame and segment for audio-cut detection and classification. We assess recall rate and precision rate to evaluate the performance of the CIFCM method for audio-cut detection and segmentation [12]. We analyze the classification performance of the proposed method on a labeled audio-segment dataset of five broad audio genres. In addition, we employ the four well-known cluster validity functions summarized in [11] to evaluate the performance of CIFCM for audio classification. Our experimental results indicate that the proposed CIFCM algorithm outperforms the conventional FCM-based method [12] for audio segmentation and classification.

The rest of this paper is organized as follows. Section 2 describes the conventional FCM algorithm and cluster validity functions. Section 3 introduces the proposed CIFCM algorithm for audio segmentation and classification, and Section 4 presents experimental results. Section 5 concludes the paper.

2 Background information

2.1 Fuzzy c-means algorithm

The FCM algorithm is an unsupervised clustering method. It was developed by Dunn in 1973, and revised by Bezdek in 1981 [1]. The FCM algorithm has been successfully applied to feature analysis, clustering, and especially pattern recognition [2, 7]. The effects of correlations between the features of the current experimental data and those of neighboring data in conventional FCM clustering were intensively analyzed in [11].

The conventional FCM algorithm is an iterative method of clustering that allows one piece of data to belong to two or more clusters. Let an unlabelled data set X = (x 1 , x 2 , x 3 ,…, x n ) represent the features of the n items. The FCM algorithm sorts the data set X into c clusters. The standard FCM objective function with the Euclidian distance metric is defined as follows:

$$ {{\text{J}}_{\text{m}}}\left( {U,V} \right) = \sum\nolimits_{{i = 1}}^c {\sum\nolimits_{{k = 1}}^n {u_{{ik}}^m{d^2}\left( {{x_k},{v_i}} \right)} } $$
(1)

where d 2 (x k , v i ) represents the Euclidian distance between the data point x k and the center v i of the i-th cluster, and u ik is the degree of membership of the data x k in the i-th cluster, subject to the constraint \( \sum\nolimits_{{i = 1}}^c {{u_{{ik}}} = 1} \). The parameter m controls the fuzziness of the resulting partition, with m > 1, and c is the total number of clusters. Local minimization of the objective function J m (U,V) is accomplished by repeatedly adjusting the values of u ik and v i according to the following equations:

$$ {u_{{ik}}} = {\left[ {\sum\nolimits_{{j = 1}}^c {{{\left( {\frac{{{d^2}\left( {{x_k},{v_i}} \right)}}{{{d^2}\left( {{x_k},{v_j}} \right)}}} \right)}^{{\frac{1}{{m - 1}}}}}} } \right]^{{ - 1}}} $$
(2)
$$ {v_i} = \frac{{\sum\nolimits_{{k = 1}}^n {u_{{ik}}^m{x_k}} }}{{\sum\nolimits_{{k = 1}}^n {u_{{ik}}^m} }},1 \leqslant i \leqslant c $$
(3)

As J m is iteratively minimized, v i becomes more stable. The iteration of the FCM algorithm is terminated when the condition \( \mathop{{\max }}\limits_{{1 \leqslant i \leqslant c}} \left\{ {abs\left( {v_i^t - v_i^{{t - 1}}} \right)} \right\}{ < }\varepsilon \) is satisfied, where \( v_i^{{t - 1}} \) is the center of the i-th cluster at the previous iteration, abs() stands for the absolute value, and ε is the predefined termination threshold. Finally, all data points are assigned to clusters according to their maximum membership u ik . In addition, the fuzzy partition matrix U is retained for further operations that evaluate the efficiency of the clustering.
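
For concreteness, the update loop of (1)–(3) can be sketched in a few lines of NumPy, as below. The random center initialization, the numerical guard against zero distances, and the array layout are our own illustrative choices rather than part of the algorithm as specified in [1].

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Conventional FCM; X is an (n, p) array of n feature vectors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, size=c, replace=False)]                     # initial cluster centers
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)     # squared distances, shape (c, n)
        d2 = np.fmax(d2, 1e-12)                                      # guard against zero distance
        U = d2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=0, keepdims=True)                            # memberships u_ik, Eq. (2)
        V_new = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)   # cluster centers v_i, Eq. (3)
        if np.max(np.abs(V_new - V)) < eps:                          # termination condition
            V = V_new
            break
        V = V_new
    return U, V
```

After convergence, each data point is assigned to the cluster of its maximum membership, e.g., labels = U.argmax(axis=0).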

2.2 Cluster validity functions

Two important types of cluster validity functions are used for the quantitative evaluation of clustering performance: those based on the fuzzy partition [3] and those based on the feature structure of the data set [5, 19]. The fuzzy-partition-based type uses two parameters, Bezdek’s partition coefficient v pc and the partition entropy v pe [3], which are defined as follows:

$$ {v_{{pc}}}(U) = \frac{1}{n}\sum\nolimits_{{j = 1}}^n {\sum\nolimits_{{i = 1}}^c {u_{{ij}}^2} } $$
(4)
$$ {v_{{pe}}}(U) = - \frac{1}{n}\left[ {\sum\nolimits_{{j = 1}}^n {\sum\nolimits_{{i = 1}}^c {\left( {{u_{{ij}}}\log \left( {{u_{{ij}}}} \right)} \right)} } } \right] $$
(5)

When v pc is maximal or v pe is minimal, optimal clustering is achieved. However, these two parameters depend only upon the membership values of data in the clusters, not the data themselves. To overcome this shortcoming, two other validity functions based on the feature structure of the data set have been proposed: the Fukuyama-Sugeno function v fs [5] and the Xie-Beni function v xb [19]. These are defined as follows:

$$ {v_{{fs}}}\left( {U,V,X} \right) = \sum\nolimits_{{i = 1}}^c {\sum\nolimits_{{j = 1}}^n {u_{{ij}}^m\left( {{{\left\| {{x_j} - {v_i}} \right\|}^2} - {{\left\| {{v_i} - \bar{v}} \right\|}^2}} \right)} } $$
(6)
$$ {v_{{xb}}}(U) = \frac{{\sum\nolimits_{{i = 1}}^c {\sum\nolimits_{{j = 1}}^n {u_{{ij}}^2{{\left\| {{x_j} - {v_i}} \right\|}^2}} } }}{{n\left( {\mathop{{\min }}\limits_{{i \ne k}} \left( {{{\left\| {{v_i} - {v_k}} \right\|}^2}} \right)} \right)}} $$
(7)

where \( \bar{v} = \frac{1}{c}\sum\nolimits_{{i = 1}}^c {{v_i}} \). The smaller the values of v fs or v xb , the better the clustering results.
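
With the same conventions (U a c × n membership matrix, V the c cluster centers, X the n data points), the four validity functions in (4)–(7) can be evaluated as in the following sketch; the small constant guarding log(0) and the handling of the i ≠ k minimum are implementation details we have assumed.

```python
import numpy as np

def validity(U, V, X, m=2.0):
    """Cluster validity functions of Eqs. (4)-(7)."""
    c, n = U.shape
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)        # ||x_j - v_i||^2, shape (c, n)
    v_pc = (U ** 2).sum() / n                                      # partition coefficient, Eq. (4)
    v_pe = -(U * np.log(np.fmax(U, 1e-12))).sum() / n              # partition entropy, Eq. (5)
    v_bar = V.mean(axis=0)
    sep = ((V - v_bar) ** 2).sum(axis=1)[:, None]                  # ||v_i - v_bar||^2
    v_fs = ((U ** m) * (d2 - sep)).sum()                           # Fukuyama-Sugeno, Eq. (6)
    dv2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    dv2[np.eye(c, dtype=bool)] = np.inf                            # exclude i == k from the minimum
    v_xb = ((U ** 2) * d2).sum() / (n * dv2.min())                 # Xie-Beni, Eq. (7)
    return v_pc, v_pe, v_fs, v_xb
```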

The aforementioned four cluster validity functions are used as the bases for comparing the performance of the proposed CIFCM and the conventional FCM [12] for audio segmentation and classification.

3 The correlation intensive fuzzy c-means algorithm (CIFCM) for audio segmentation and classification

3.1 The proposed CIFCM algorithm

The traditional FCM algorithm for audio segmentation operates by detecting audio-cuts in a frame using the attributes of only that frame [12]. It classifies each segment in an audio stream using only the features of that segment [12–15]. However, the general aspects of an audio frame or segment are highly correlated with those of neighboring frames or segments due to the similarity of their temporal features, and ignoring this correlation degrades the accuracy of the segmentation and classification procedures. This aspect of FCM was comprehensively explored by Luong et al. in the image segmentation domain [11]. In audio segmentation, when frames contain sound effects such as fade-in, fade-out, or cross-fade, abrupt changes (i.e., audio-cuts) in the signal cannot be detected from the differences between two consecutive frames alone. Therefore, it is important to consider the impact of changes in neighboring frames within a specified window length. Similar conditions also occur when classifying audio segments. To solve this problem, which is inherent in audio segmentation and classification, we propose the CIFCM algorithm, which utilizes not only the attributes of the current frame or segment but also the memberships of its neighboring frames or segments, respectively, by modifying the membership function of the traditional FCM algorithm. The membership of each data element is calculated as a weighted sum of the current element's membership and the memberships of the previous and following neighboring elements within a window of length w f , where the center element x k is the current element.

In CIFCM, a neighboring impact factor, called P ik , is used to consider temporal information about the neighbors to determine the fit of the data element x k in the cluster i. The smaller the distances between the center element and its neighbors, the higher the probability that a given element and its neighbors are in the same cluster. The neighboring impact factor is defined as follows:

$$ {p_{{ik}}} = \sum\nolimits_{{j = k - \frac{{{w_f}}}{2}}}^{{k + \frac{{{w_f}}}{2}}} {h\left( {{x_k},{x_j}} \right){u_{{ij}}}} $$
(8)

where the function h(x k , x j ) is the distance coefficient between the center element x k and the neighbor x j , and u ij is the membership value of the neighbor x j in the cluster i, as described in Section 2 for conventional FCM. To derive an appropriate form for h(x k , x j ), given in (9), we adopt the following hypotheses:

  1.

    The neighboring impact factor P ik ranges over [0, 1], where j, in the range \( k - \frac{{{w_f}}}{2}:k + \frac{{{w_f}}}{2} \), indexes the neighboring elements.

  2.

    If all elements in the range of w f belong completely to cluster i, then the impact factor value P ik  = 1. This implies that this segment is mostly impacted by its neighbors.

To determine the function h(x k , x j ), it is assumed that u ij =1. As a result \( \sum\nolimits_{{j = k - \frac{{{w_f}}}{2}}}^{{k + \frac{{{w_f}}}{2}}} {h\left( {{x_k},{x_j}} \right)} = 1 \) when the neighbor impact factor P ik =1. The function h(x k , x j ) is defined as follows:

$$ h\left( {{x_k},{x_j}} \right) = {\left[ {\sum\nolimits_{{l = k - \frac{{{w_f}}}{2}}}^{{k + \frac{{{w_f}}}{2}}} {\left( {\frac{{{d^2}\left( {{x_k},{x_j}} \right)}}{{{d^2}\left( {{x_k},{x_l}} \right)}}} \right)} } \right]^{{ - 1}}} $$
(9)

where longer distances between x k and x j generate smaller values of h(x k , x j ).

Subsequently, the function p ik is defined as follows:

$$ {p_{{ik}}} = {\left( {\sum\nolimits_{{l = k - \frac{{{w_f}}}{2}}}^{{k + \frac{{{w_f}}}{2}}} {\frac{1}{{{d^2}\left( {{x_k},{x_l}} \right)}}} } \right)^{{ - 1}}}\left( {\sum\nolimits_{{j = k - \frac{{{w_f}}}{2}}}^{{k + \frac{{{w_f}}}{2}}} {\frac{{{u_{{ij}}}}}{{{d^2}\left( {{x_k},{x_j}} \right)}}} } \right) $$
(10)
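
A direct transcription of (10) for a single element x k and cluster i might look like the sketch below; the handling of the window at the stream boundaries and the exclusion of the center element itself (whose distance to x k is zero) are our own assumptions, as they are not spelled out in the text.

```python
import numpy as np

def neighbor_impact(X, U, k, i, w_f):
    """Neighboring impact factor p_ik of Eq. (10) for element k and cluster i."""
    half = w_f // 2
    idx = [j for j in range(max(0, k - half), min(len(X), k + half + 1)) if j != k]
    d2 = np.array([((X[k] - X[j]) ** 2).sum() for j in idx])
    d2 = np.fmax(d2, 1e-12)                    # guard against identical neighbors
    weights = (1.0 / d2) / (1.0 / d2).sum()    # distance coefficients h(x_k, x_j) of Eq. (9)
    return float((weights * U[i, idx]).sum())  # weighted sum of neighbor memberships
```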

To make use of this impact factor, we include it in the distance measurement of the conventional FCM in (2), and generate a new distance function as follows, in place of simple Euclidian distance:

$$ d_{{new}}^2\left( {{x_k},{v_i}} \right) = {d^2}\left( {{x_k},{v_i}} \right){\left( {f\left( {{p_{{ik}}}} \right)} \right)^{{ - 1}}} $$
(11)

where f(P ik ) denotes a function of P ik , which is taken to be 1/P ik in this study. Thus, the new membership function using the new distance metric is calculated as:

$$ {w_{{ik}}} = {\left[ {\sum\nolimits_{{j = 1}}^c {{{\left( {\frac{{d_{{new}}^2\left( {{x_k},{v_i}} \right)}}{{d_{{new}}^2\left( {{x_k},{v_j}} \right)}}} \right)}^{{\frac{1}{{m - 1}}}}}} } \right]^{{ - 1}}} $$
(12)

By simplifying (12), we obtain the membership function for CIFCM in (13) and the center of clusters in (14) in place of (2) and (3), respectively, in the conventional FCM algorithm:

$$ {w_{{ik}}} = \frac{{{u_{{ik}}}{{\left( {f\left( {{p_{{ik}}}} \right)} \right)}^{{\frac{1}{{m - 1}}}}}}}{{\sum\nolimits_{{j = 1}}^c {{u_{{jk}}}{{\left( {f\left( {{p_{{jk}}}} \right)} \right)}^{{\frac{1}{{m - 1}}}}}} }} $$
(13)
$$ {v_i} = \frac{{\sum\nolimits_{{k = 1}}^n {w_{{ik}}^m} {x_k}}}{{\sum\nolimits_{{k = 1}}^n {w_{{ik}}^m} }},1 \leqslant i \leqslant c $$
(14)

The steps of CIFCM for audio segmentation and classification are summarized as follows:

  1.

    Distribute the data elements of the audio stream into data set X and initiate the center values \( {V^0} = \left( {v_1^0,v_2^0,.....,v_c^0} \right) \).

  2.

    Calculate the membership values u ik from (2) and the impact factor P ik from (10).

  3.

    Compute the new membership values w ik from (13).

  4.

    Calculate the new center values of the clusters using (14).

  5.

    Evaluate the termination condition \( \mathop{{\max }}\limits_{{1 \leqslant i \leqslant c}} \left\{ {abs\left( {v_i^t - v_i^{{t - 1}}} \right)} \right\}{ < }\varepsilon \). The process is finished if this condition is satisfied; otherwise, repeat the process starting from step 2.

  6.

    Assign each data element to the cluster of its maximum membership, as derived from the membership matrix U c×N .

The proposed CIFCM algorithm, sketched below, is then used for audio segmentation and classification.
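
Putting steps 1–6 together, one possible end-to-end sketch is given below. It reuses the conventional membership of (2), the impact factor of (10) (computed for every element over a window of w f temporal neighbors), f(P ik ) = 1/P ik as stated above, and the updates (13) and (14); initialization, boundary handling, and the numerical guards are our own assumptions rather than details from the paper.

```python
import numpy as np

def cifcm(X, c, w_f=10, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """CIFCM sketch: X is an (n, p) array of temporally ordered feature vectors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, size=c, replace=False)]                    # step 1: initial centers
    half = w_f // 2
    for _ in range(max_iter):
        d2 = np.fmax(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        U = d2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=0, keepdims=True)                          # step 2a: u_ik, Eq. (2)
        P = np.empty_like(U)
        for k in range(n):                                         # step 2b: p_ik, Eq. (10)
            idx = [j for j in range(max(0, k - half), min(n, k + half + 1)) if j != k]
            dd = np.fmax(((X[idx] - X[k]) ** 2).sum(-1), 1e-12)
            h = (1.0 / dd) / (1.0 / dd).sum()
            P[:, k] = U[:, idx] @ h
        F = 1.0 / np.fmax(P, 1e-12)                                # f(p_ik) = 1/p_ik
        W = U * F ** (1.0 / (m - 1))
        W /= W.sum(axis=0, keepdims=True)                          # step 3: w_ik, Eq. (13)
        V_new = (W ** m) @ X / (W ** m).sum(axis=1, keepdims=True) # step 4: centers, Eq. (14)
        if np.max(np.abs(V_new - V)) < eps:                        # step 5: termination check
            V = V_new
            break
        V = V_new
    return W, V                                                    # step 6: W.argmax(axis=0) gives labels
```

Feature vectors are passed in temporal order, so the window indices correspond to temporal neighbors of the current frame or segment.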

3.2 Audio segmentation and classification using CIFCM

In order to segment and classify an audio stream, we must first extract the audio features. After studying the different audio features described in [12, 16, 20], we calculate the following characteristic parameters of audio signals (a compact implementation sketch is given after the list):

  1.

    The power of the audio signal E (n) in a frame of w l samples is given as follows:

    $$ E(n) = \frac{1}{{{w_l}}}{\sum\nolimits_{{k = 1}}^{{{w_l}}} {\left( {\frac{{si{g_n}(k)}}{{\max \left( {abs\left( {si{g_n}} \right)} \right)}}} \right)}^2} $$
    (15)

    where n is the index of the frames in the signal, sig n (k) is the amplitude of the k th sample in the n th frame, and sig n is the array of all sample values within the n th frame. This provides a convenient representation of the amplitude variation in the signal over time [20].

  2.

    The parameter sequence C(n) is defined as:

    $$ C(n) = \frac{{\sum\nolimits_{{j = 0}}^{{{w_c} - 1}} {E\left( {n + j} \right)E\left( {n - {w_c} + j} \right)} }}{{\sqrt {{\sum\nolimits_{{j = 0}}^{{{w_c} - 1}} {E{{\left( {n + j} \right)}^2}} \sum\nolimits_{{j = 0}}^{{{w_c} - 1}} {E{{\left( {n - {w_c} + j} \right)}^2}} }} }} $$
    (16)

    where w c indicates the number of frames in a pre-specified window. This feature is useful for identifying abrupt changes in the signal [12]. The closer the value of C (n) is to 0, the higher the possibility that an audio-cut exists in the n th frame. The length w c must be set in such a way that multiple audio-cuts do not occur within a window.

  3.

    The mean μ E and the standard deviation σ E of the power sequence E(n) are also computed; the patterns of variation of these two values help to classify audio signals into different groups.

  4.

    The center of gravity G(n) is a parameter that captures alterations of the signal in the low-frequency domain [12] and is computed as follows:

    $$ G(n) = \frac{{\sum\nolimits_{{k = 1}}^{{{w_l}}} {k \times {{\left\{ {{F_n}(k)} \right\}}^2}} }}{{\sum\nolimits_{{k = 1}}^{{{w_l}}} {{{\left\{ {{F_n}(k)} \right\}}^2}} }} $$
    (17)

    where F n (k) is the Fourier transform coefficient of the k th sample in the n th frame.

  5.

    The mean μ G and the standard deviation σ G of the center of gravity G(n) facilitate the analysis of the spectral shape of the audio data [16].

  6.

    A zero-crossing occurs in a discrete-time signal when successive samples have different signs. The rate at which zero-crossings occur is a simple measure of the frequency content of a signal. The zero-crossing rate Z(n) is calculated as follows [20]:

    $$ Z(n) = \frac{1}{{{w_l}}}\sum\nolimits_{{k = 1}}^{{{w_l}}} {\frac{1}{2}\left| {{\rm sgn} \left[ {si{g_n}(k)} \right] - {\rm sgn} \left[ {si{g_n}\left( {k - 1} \right)} \right]} \right|} $$
    (18)

    where \( {\rm sgn} \left[ {si{g_n}(k)} \right] = \left\{ {\begin{array}{*{20}{l}} { - 1,} & {si{g_n}(k) < 0} \\ {1,} & {si{g_n}(k) \geqslant 0} \\ \end{array} } \right. \)

  7.

    The zero ratio Z R , which is defined as the ratio of the number of zero indices to the total number of indices in a signal, plays an important role in measuring the noisiness in audio signals of different classes [16]. It can be derived from (18) as shown below:

    $$ {Z_R} = \frac{1}{N}\sum\nolimits_{{n = 1}}^N {Z(n)}, $$
    (19)

    where N is the number of frames in the signal.
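
The characteristic parameters above can be computed as in the following NumPy sketch, which is the implementation sketch referred to before the list. Non-overlapping framing, the use of an FFT magnitude spectrum for F n (k), and the numerical guards are our own simplifying assumptions.

```python
import numpy as np

def frame_features(sig, w_l=400, w_c=10):
    """Per-frame features E(n), G(n), Z(n), C(n) of Eqs. (15)-(18) for a 1-D signal."""
    n_frames = len(sig) // w_l
    frames = np.asarray(sig[:n_frames * w_l], dtype=float).reshape(n_frames, w_l)
    norm = np.fmax(np.abs(frames).max(axis=1, keepdims=True), 1e-12)
    E = ((frames / norm) ** 2).mean(axis=1)                          # power E(n), Eq. (15)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2                  # |F_n(k)|^2
    k = np.arange(1, spec.shape[1] + 1)
    G = (k * spec).sum(axis=1) / np.fmax(spec.sum(axis=1), 1e-12)    # center of gravity G(n), Eq. (17)
    s = np.where(frames < 0, -1.0, 1.0)                              # sgn[sig_n(k)]
    Z = 0.5 * np.abs(np.diff(s, axis=1)).sum(axis=1) / w_l           # zero-crossing rate Z(n), Eq. (18)
    C = np.ones(n_frames)                                            # parameter sequence C(n), Eq. (16)
    for n in range(w_c, n_frames - w_c + 1):
        a, b = E[n:n + w_c], E[n - w_c:n]
        C[n] = (a * b).sum() / np.fmax(np.sqrt((a ** 2).sum() * (b ** 2).sum()), 1e-12)
    return E, G, Z, C
```

The segment-level statistics μ E , σ E , μ G , and σ G then follow directly as the means and standard deviations of E(n) and G(n) over a segment, and the zero ratio Z R of (19) is the mean of Z(n).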

Audio-cut detection and segmentation: Audio-cuts can be detected efficiently by observing the parameter sequence C (n) [12]. Therefore, the proposed CIFCM utilizes C (n) to detect audio-cuts. In audio signal segmentation, three vectors defined in (20), (21) and (22) are grouped into two clusters by applying the proposed CIFCM algorithm:

$$ {{\mathbf{P}}_n} = {\left[ {C(n),\ldots,C\left( {n + {w_x} - 1} \right)} \right]^T} $$
(20)
$$ {{\mathbf{P}}_{{n - \nabla }}} = {\left[ {C\left( {n - \nabla } \right),\ldots,C\left( {n - \nabla + {w_x} - 1} \right)} \right]^T} $$
(21)
$$ {\mathbf{Z}} = {\left[ {0,\ldots,0} \right]^T} $$
(22)

where w x is the number of frames in a predefined window, and ∇ and T represent the step size of the window and the transpose of the vector, respectively. We determine the values of w x and ∇ analytically so that audio-cuts do not occur simultaneously within both the period from n to \( \left( {n + {w_x} - 1} \right) \) and the period from (n–∇) to \( \left( {n - \nabla + {w_x} - 1} \right) \), as shown by the windows L 1 and R 1 in Fig. 1.

Fig. 1 Transition of the parameter C (n) at an audio-cut [12]

Audio-cuts are detected by the membership of P n in the cluster of z. When an audio-cut exists in the time interval from n to \( \left( {n + {w_x} - 1} \right) \), as illustrated by the window R 1 in Fig. 1, the element C(n) of vector P n is close to 0, while all elements of vector P n−∇ remain well above 0. Since the elements of P n are closer to 0 than those of P n−∇, the distance between P n and z becomes smaller and the membership of P n in the cluster of z increases. P n and z are therefore assigned to the same cluster, and an audio-cut can be identified in the n th frame, as shown in Fig. 2, which plots the signal amplitude and the membership value of P n in the cluster containing z against time. For example, the membership values become very high at approximately 9 s and 11 s, as indicated by the red dotted lines; two audio-cuts are detected at these two points of the signal and act as the delimiters of a speech audio-segment. On the other hand, when no audio-cut exists in either of the two time intervals from n to \( \left( {n + {w_x} - 1} \right) \) and from (n−∇) to \( \left( {n - \nabla + {w_x} - 1} \right) \), as illustrated by the windows L 2 and R 2 in Fig. 1, the elements of both P n and P n−∇ are significantly greater than 0. Thus, the distance between P n and P n−∇ becomes shorter, and both move away from z.
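
The construction of the two-cluster problem around each frame can be illustrated as follows. For brevity, the sketch computes a plain FCM-style membership of P n in the cluster seeded at z rather than running the full CIFCM iteration of Section 3.1, and seeding the second cluster at P n−∇ is our own assumption.

```python
import numpy as np

def cut_memberships(C, w_x=10, step=10, m=2.0):
    """Membership of P_n in the cluster of z, built from the vectors of Eqs. (20)-(22)."""
    n_frames = len(C)
    memb = np.zeros(n_frames)
    for n in range(step, n_frames - w_x + 1):
        P_n  = C[n:n + w_x]                          # Eq. (20)
        P_nd = C[n - step:n - step + w_x]            # Eq. (21), P_{n - step}
        z    = np.zeros(w_x)                         # Eq. (22)
        data = np.stack([P_n, P_nd, z])
        V = np.stack([z, P_nd])                      # one cluster seeded at z, one at P_{n - step}
        d2 = np.fmax(((data[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
        U = d2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)            # membership of each vector in each cluster
        memb[n] = U[0, 0]                            # membership of P_n in the cluster of z
    return memb
```

Frames where this membership becomes high are taken as audio-cut candidates; in the paper this assignment is made by the CIFCM clustering itself rather than by a fixed threshold.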

Fig. 2 An example of the signal amplitude and membership function values of the vector P n in the cluster of z, where arrows indicate the existence of an audio-cut with correspondingly higher membership

Audio-segment classification: The audio-cuts indicate the segment boundaries in the signal, which allows us to identify the segments. After segmenting a long audio stream that includes different classes of sounds, each segment is classified into one of five groups: silence, speech, music, speech with music background, and speech with noise background. The selected feature vector for classifying audio signals is shown in (23):

$$ {{\mathbf{V}}_f} = \left[ {{\mu_E},{\sigma_E},{\mu_G},{\sigma_G},{Z_R}} \right] $$
(23)

This vector comprises five features drawn from the aforementioned characteristic parameters of audio signals defined in (15)–(19). The distributions of these features for the five audio classes are depicted in [12]. The rule-based characterization of the five audio classes in terms of the feature vector V f is summarized below:

  • Silence: This type of signal only contains quasi-stationary background noise and has relatively low values for μ E and σ E , but high values for Z R .

  • Speech: An audio signal that contains the voices of human beings, such as the sound of conversation, and has relatively high values for μ E and σ E , and low values for Z R compared to silence and music.

  • Music: These are audio signals that contain sounds made by musical instruments. These sounds have relatively low values for Z R , μ G and σ G .

  • Speech with music background: These audio signals contain speech in an environment with music in the background. These can be discriminated from pure music by differences in σ G .

  • Speech with noise background: This type of audio signal contains speech in an environment with noise in the background. These sounds have relatively higher values for μ G and σ G than those of other audio classes.

We employ the proposed CIFCM approach to identify the specific class of sound represented by each audio segment, and we determine the audio class of a segment according to its highest class membership. The steps of the heuristic rule-based fuzzy clustering procedure are listed below, followed by an illustrative sketch:

  • Step 1: Acquisition of the audio signals.

  • Step 2: Extraction of the feature C (n) by calculating E (n) from (15) and applying (16), and definition of the vectors P n , P n−∇ , and z using (20)–(22).

  • Step 3: Application of the CIFCM for segmentation.

  • Step 4: Detection of segments from the audio-cuts.

  • Step 5: Extraction of the feature vector V f in (23) from each segment by applying (15) and (17)–(19).

  • Step 6: Application of the CIFCM to classify segments into five audio classes by the values in V f.

  • Step 7: Determination of the specific class of each segment depending upon its highest class membership.
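
The sketch below illustrates Steps 5–7 under the same assumptions as the earlier sketches: frame_features and cifcm are the illustrative functions defined above, the per-segment feature vector follows (23), and mapping the resulting cluster indices onto the five named classes via the rules above is left to the user.

```python
import numpy as np

def classify_segments(segments, w_l=400):
    """Steps 5-7: build V_f of Eq. (23) per segment, cluster with CIFCM (c = 5),
    and label each segment by its highest class membership."""
    feats = []
    for seg in segments:                             # each seg is a 1-D signal array
        E, G, Z, _ = frame_features(seg, w_l=w_l)    # per-frame features, Eqs. (15), (17), (18)
        feats.append([E.mean(), E.std(),             # mu_E, sigma_E
                      G.mean(), G.std(),             # mu_G, sigma_G
                      Z.mean()])                     # zero ratio Z_R, Eq. (19)
    X = np.asarray(feats)                            # feature vectors V_f, Eq. (23)
    W, V = cifcm(X, c=5)                             # CIFCM sketch from Section 3.1
    return W.argmax(axis=0)                          # cluster index of the highest membership
```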

4 Experimental results

This section evaluates the performance of the proposed CIFCM algorithm in audio segmentation and classification experiments. This section also compares the performance of the proposed approach and the conventional FCM algorithm [12].

4.1 Experimental environment

To evaluate the performance of the proposed CIFCM algorithm, we developed a graphical user interface (GUI) using MATLAB 7.6, shown in Fig. 3. The GUI includes two frames: a signal browsing frame and a classification distribution frame. The signal browsing frame presents the signal being processed as well as the segmentation results, while the classification distribution frame shows the audio-segment classification results. A “Detect Audio-cuts” button is used to determine audio-cuts, and a “Classify Audio-segs” button is used for classification.

Fig. 3 Screenshot of the GUI system implemented for CIFCM-based audio segmentation and classification

In this study, we used a number of audio samples obtained from TV programs, including music and drama programs, as the input signals of the proposed algorithm. We used the following empirical parameters: the fuzzy weighting exponent m = 2.0, the convergence tolerance ε = 0.001, the maximum number of iterations in CIFCM = 100, the number of samples in each frame w l  = 400, the step size ∇ = 10, and the pre-defined window length w c  = w x =10.

In this study, two kinds of errors were evaluated for audio-cut detection: (1) misdetection, in which the algorithm fails to detect an existing audio-cut, and (2) over-detection, in which it detects an audio-cut where none exists. In addition, we used two metrics to measure the effectiveness of audio-cut detection by the proposed approach: (1) recall rate and (2) precision rate [12], which are defined as follows:

$$ {\text{Recall rate}} = \frac{{{\text{Number of correctly detected audio - cuts}}}}{{{\text{Number of manually detected audio - cuts}}}} $$
(24)
$$ {\text{Precision rate}} = \frac{{{\text{Number of correctly detected audio - cuts}}}}{{{\text{Number of all detected audio - cuts}}}} $$
(25)

The values of both rates are within the range [0:1]. If the recall and precision rates approach 1, there are few misdetections and over-detections, respectively. We use a well-known metric to measure the effectiveness of classification [12], which is given as follows:

$$ {\text{Classification Precision rate, CPR}} = \frac{\text{Number of correctly classified audio segments}}{{\text{Number of all audio segments}}} \times 100\% $$
(26)

The value of CPR is within the range [0:100]%. Higher values indicate higher accuracy in classification.
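
For completeness, the evaluation metrics (24)–(26) reduce to simple ratios; a trivial sketch is shown below, where the counts of correctly detected cuts, manually identified cuts, all detections, and segments are assumed to be available from the experiment.

```python
def recall_rate(correct_cuts, manual_cuts):
    """Eq. (24): correctly detected audio-cuts over manually detected audio-cuts."""
    return correct_cuts / manual_cuts

def precision_rate(correct_cuts, detected_cuts):
    """Eq. (25): correctly detected audio-cuts over all detected audio-cuts."""
    return correct_cuts / detected_cuts

def classification_precision_rate(correct_segments, all_segments):
    """Eq. (26): percentage of correctly classified audio segments (CPR)."""
    return 100.0 * correct_segments / all_segments
```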

The clustering performance of the proposed algorithm for audio classification is also measured in terms of four cluster validity functions described in Section 2.2.

4.2 Analysis and results

Figure 4(b–e) depicts the power sequence E (n), the center of gravity G (n), the zero-crossing rate Z (n), and the parameter sequence C (n), respectively, extracted from a selected audio sample using the proposed algorithm. These values depend on the types of sounds present in the audio signal, as stated in Section 3.2 and analyzed in [12]. E (n) is used to discriminate speech signals from the others. G (n) and Z (n) help to classify music, speech-with-music, and speech-with-noise signals. The parameter sequence C (n) is used to detect audio-cuts. For instance, the signal depicted in Fig. 4(a) contains speech-with-music in approximately the 10–15 s time interval and only music in approximately the 15–20 s interval. If we observe E (n) of the signal, depicted in Fig. 4(b), we do not see a sufficient difference in these values and cannot discriminate the speech-with-music and music signals. However, we can discriminate them by using the values of G (n), since the mean and variance of G (n) vary sufficiently between the 10–15 s and 15–20 s intervals, as shown in Fig. 4(c).

Fig. 4 An example of audio features extracted from a selected audio file, DEMO: (a) input signal, (b) power sequence E (n), (c) center of gravity G (n), (d) zero-crossing rate Z (n), and (e) parameter sequence C (n)

Table 1 shows the audio-cut detection results of the proposed CIFCM and the conventional FCM algorithm [12]. These results were generated from two tests, TEST1 and TEST2, which were conducted on two different audio signals with a total of 241 audio-cuts. Here ‘the number of all audio-cuts’ indicates the number of manually detected audio-cuts in the audio signals; manually detected audio-cuts are identified by listening to the audio signals and by observing the waveforms at the frame level. We observe that the proposed CIFCM algorithm outperforms the conventional FCM algorithm for audio-cut detection in terms of both recall rate and precision rate. The proposed CIFCM algorithm still generates a number of over-detections of audio-cuts due to relatively long periods of silence between consecutive sentences in the speech segments, yet it remains better than the conventional FCM algorithm. These over-detections can be reduced by merging three consecutive segments whenever two segments of the same audio group are separated by a short silence segment. Misdetections were also observed in the results due to gradual signal changes at transition points; however, this type of error is reduced considerably by the proposed CIFCM compared with the conventional FCM algorithm.

Table 1 Audio-cut detection results

There is no standard dataset on broad audio genres of the sort we are interested in for analyzing classification performance [12]. Thus we used a number of audio signals obtained from TV programs and the Internet (including the signals used in audio-cut detection) to create the dataset. We compiled an experimental dataset of 573 audio-segments covering all five broad categories. To ensure robustness in the experimental analysis, we included male speech, female speech, and conversations between male and female speakers in the speech audio-segments, and instrumental as well as vocal songs of different music genres in the musical audio-segments. In addition, we included speech-with-noise at different levels of environmental noise. The confusion matrices obtained using the proposed CIFCM algorithm and the conventional FCM are shown in Tables 2 and 3, respectively. The rows represent the manually classified results and the columns represent the classification results of these algorithms. We found misclassifications for different groups in the results of Table 2, especially for the speech and speech-with-music groups. Misclassifications into the speech group (first column) occurred when the fraction of music or noise in speech-with-music or speech-with-noise segments was relatively low, and in some music-segments lacking continuous strings. Similarly, in the speech-with-music group, misclassification occurred mostly due to fast speech-segments. Classification in the other groups is remarkably accurate.

Table 2 Resulting confusion matrix of genres from applying the proposed CIFCM for classification of audio-segments
Table 3 Confusion matrix results of genres from applying the FCM for classification of audio-segments

Table 4 presents the performance comparison among the proposed CIFCM, the one-against-all multiclass SVM-based approach [4], and the conventional FCM [12] in classifying audio-segments into five broad audio genres. We used the same feature vector for all three classification approaches. In addition, as SVM is a supervised learning method, we used a small, manually labeled training set drawn from our dataset to train the SVM. From the comparison results, we observed that by considering the temporal correlations of neighboring audio data, the proposed CIFCM algorithm achieves better classification performance (89.53%). The experimental results thus indicate that the CIFCM algorithm outperforms both the conventional FCM algorithm and the one-against-all multiclass SVM-based approach [4] in audio-segment classification. Quantitative evaluation using cluster validity functions was also performed to analyze the effectiveness of the proposed CIFCM in forming better clusters than the traditional FCM algorithm [12]. As described in Section 2.2, superior clustering performance, as assessed by compactness, is achieved when V pc is maximized and V pe , V fs , and V xb are minimized. Table 5 compares the clustering performance of audio-segment classification using the proposed CIFCM and the traditional FCM in terms of the cluster validity functions. The results show that the proposed CIFCM algorithm achieves better clustering performance than the conventional FCM algorithm in both classification experiments (TEST1 and TEST2), owing to the increased compactness of the fuzzy clusters of segments achieved by considering neighboring impact factors.

Table 4 Comparison results of the audio-segments classification
Table 5 Comparison of clustering performances in terms of fuzzy-cluster validity functions in audio-segments classification

Overall, the proposed CIFCM algorithm outperforms the conventional FCM in both audio-cut detection and audio-segment classification.

5 Conclusions

This paper presents a CIFCM algorithm that was designed to improve audio segmentation and classification performance. Unlike the conventional FCM approach, the CIFCM algorithm efficiently incorporates the influence of neighboring frames and segments from the audio stream for improved segmentation and classification of the current frame and segment, respectively. In addition, we analyzed a number of characteristic audio features to explore the differences among different types of audio data. Our experimental results indicate that the proposed CIFCM algorithm outperforms the conventional FCM algorithm and one-against-all multiclass SVM-based approach in audio segmentation and classification.

In the future, we will explore audio feature extraction from compressed audio data, since most digital audio data that are currently available are in compressed formats such as WMA, MP3, AAC, etc. Numerous different types of audio signals will be studied as well for segmentation and classification. In addition, different statistical classifiers will be investigated with the goal of producing a fully featured automatic audio retrieval system.