Abstract
In this work, we present an algorithm for voiced/unvoiced decision and pitch estimation from speech signals. Our approach classifies the peaks of the autocorrelation of the speech multi-scale product, which is obtained by multiplying the speech wavelet transform coefficients at three successive dyadic scales. The autocorrelation function of the multi-scale product is computed over frames of a fixed length. The experimental results show the robustness and effectiveness of our approach. Moreover, the proposed method outperforms several existing algorithms in both clean and noisy environments.
Introduction
The classification of the speech signal into voiced, unvoiced and silence provides a preliminary acoustic segmentation of speech, which is important for speech analysis. The aim of the classification is to determine whether a speech signal is present and, if so, whether its production involves the vibration of the vocal folds. The vibration of the vocal folds produces periodic or quasi-periodic excitation of the vocal tract for voiced speech, whereas purely transient and/or turbulent noise provides aperiodic excitation of the vocal tract for unvoiced speech [1].
This type of classification [2] also finds applications in fundamental frequency estimation, formant extraction and syllable marking, among others. In fact, pitch detectors for speech signals can only work correctly if the fundamental frequency estimation is linked with a reliable voiced–unvoiced decision.

Moreover, the fundamental frequency is an important parameter in speech analysis and synthesis, and it plays a prominent role in speech production and perception. Reliable pitch estimation is required in application areas such as speech enhancement, analysis and prosody modelling, low-bit-rate coding, and speaker recognition [3].
A wide variety of sophisticated voicing classification and pitch detection algorithms (PDAs) have been proposed in the speech processing literature [4–8, 13].
Most voicing decision algorithms exploit elementary speech signal parameters that can be computed independently of the type of input signal: energy, amplitude, short-term autocorrelation coefficients, zero-crossing count, the ratio of signal amplitudes in different sub-bands, parameters obtained after pre-processing such as the linear prediction error, or the salience of a pitch estimate. Voicing decision algorithms fall into three essential categories: (1) simple threshold analysis algorithms, which exploit only a few basic parameters; (2) more complex algorithms based on pattern recognition methods; and (3) integrated algorithms for joint voicing and pitch determination.
Besides, pitch estimation from the speech signal alone relies on some transformation of the signal, which can operate in one of three domains.

The first approach works in the time domain; the common transformation is the autocorrelation function (ACF), as in the YIN algorithm and the Praat software [9–12]. The second approach works in the frequency domain; the most frequently used transformation is the spectrum [13, 14]. The third approach combines both time and frequency domains, using the short-time Fourier transform (STFT) and the wavelet transform (WT) [15].
Although many PDAs have been proposed, there is still no reliable algorithm suitable for all speech processing applications. The difficulty of accurate and robust pitch estimation stems from several factors, such as the fast variation of the instantaneous pitch and of the formants.
In this paper, we detail and evaluate our improved algorithm, called Multi-Scale Product Autocorrelation, for voicing decision and fundamental frequency estimation from both clean and noisy speech.
The proposed algorithm was originally inspired by our work reported in [16, 17], where we used the speech multi-scale product spectrum (SMP) for pitch estimation and voicing decision.
The paper is organised as follows. After this introduction, we present the multi-scale product (MP) method used in this work to provide the derived speech signal. Section “Autocorrelation of the Speech Multi-Scale Product” introduces the multi-scale product autocorrelation (MPA) approach for voicing detection and fundamental frequency estimation, and section “Voicing Decision and Pitch Estimation” details the decision rules. In section “Evaluation”, we evaluate our approach and compare it with other well-known algorithms. Evaluation results are also presented for speech corrupted by real noise at various SNR levels.
Multi-Scale Product
The WT is a multi-scale analysis that has been shown to be very well suited to speech processing tasks such as glottal closure instant (GCI) detection, pitch estimation, and speech enhancement and recognition. Moreover, a speech signal can be analysed at the specific scales corresponding to the range of human speech [18–21].
One of the most important WT applications is signal singularity detection. The continuous WT produces modulus maxima at signal singularities, allowing their localisation. However, single-scale analysis is not accurate, so several works propose decision algorithms that use multiple scales to circumvent this problem [22, 23].
The MP was essentially introduced to improve signal edge detection. It is based on the multiplication of the wavelet transform coefficients (WTC) at several scales. This non-linear combination of the WTC enhances the peaks of the gradients caused by true edges while suppressing spurious peaks.
This method was first used in image processing. Xu et al. [24] rely on the variations of the WT across decomposition levels: they multiply the WT of the image at adjacent scales to distinguish important edges from noise. Sadler and Swami [25] studied the MP of a signal in the presence of noise.
The choice of the mother wavelet is crucial for detecting discontinuities. It depends essentially on the number of vanishing moments of the wavelet and on its support. A WT with n vanishing moments can be interpreted as a multi-scale differential operator of nth order applied to the smoothed signal. This provides a relationship between the differentiability of the signal and the decay of the wavelet modulus maxima at fine scales.
It has been demonstrated that a wavelet \( \psi \) with n vanishing moments can be expressed as follows:

\( \psi (t) = ( - 1)^{n} \frac{d^{n} \theta (t)}{dt^{n}} \)

where θ is a smoothing function. So, the WT of a function f at scale s can be written as:

\( w_{s} f(t) = s^{n} \frac{d^{n}}{dt^{n}}\left( {f * \theta_{s} } \right)(t) \)

with

\( \theta_{s} (t) = \frac{1}{s}\,\theta \left( \frac{t}{s} \right) \)

So, if the wavelet is chosen to have one vanishing moment, the modulus maxima appear at the discontinuities of the signal and represent the maxima of the first derivative of the smoothed signal.
The MP [25] consists of making the product of the WTC of the function f(n) at successive dyadic scales as follows:

\( p(n) = \prod\limits_{j = j_{0}}^{j_{0} + 2} {w_{2^{j}} f(n)} \)

where \( w_{2^{j}} f(n) \) is the WT of the function f at scale \( 2^{j} \). The MP is taken over three levels, an odd number, so that the edge sign is preserved.
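As an illustration of this edge-enhancing behaviour, the MP can be sketched in NumPy using a derivative-of-Gaussian kernel as a stand-in for the paper's quadratic spline wavelet (both have one vanishing moment); the kernel, its width and the scale values here are illustrative assumptions, not the paper's exact filters:

```python
import numpy as np

def dog_kernel(scale, half_width=4.0):
    # First derivative of a Gaussian: a one-vanishing-moment wavelet,
    # used as a stand-in for the quadratic spline wavelet of the paper.
    t = np.arange(-int(half_width * scale), int(half_width * scale) + 1,
                  dtype=float)
    return -t / scale**2 * np.exp(-t**2 / (2.0 * scale**2))

def multiscale_product(x, scales=(0.5, 1.0, 2.0)):
    # Pointwise product of the wavelet responses at three successive
    # dyadic scales; true edges reinforce, isolated noise peaks do not.
    p = np.ones(len(x))
    for s in scales:
        p *= np.convolve(x, dog_kernel(s), mode="same")
    return p
```

On a rectangular pulse, |p| peaks sharply at the two edges and stays near zero elsewhere, which is the behaviour the MP exploits on glottal discontinuities.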
In this work, we use the MP because it provides a derived speech signal that is simpler to analyse. Figure 1 summarises the steps of the MP.
The MP of voiced speech has a periodic structure with reinforced singularities marked by extrema; its structure recalls that of the differentiated laryngograph signal. Hence, the autocorrelation function can be applied to the obtained signal.
Autocorrelation of the Speech Multi-Scale Product
We propose a new technique to determine voiced frames together with an estimation of the fundamental frequency. The method is based on the autocorrelation analysis of the speech MP and can be decomposed into three essential steps, as shown in Fig. 2. The first step consists of computing the speech MP. The obtained signal is then decomposed into overlapping frames; each frame includes N samples and is weighted by a Hanning window \( s_{w} (n) \), n = 0, 1, …, N − 1 (N = 1,024 samples with an overlap of 512 points at a sampling frequency of 20 kHz). The wavelet used in this MP analysis is the quadratic spline function with a support of 0.8 ms, at scales \( s_{1} = 2^{ - 1} \), \( s_{2} = 2^{0} \) and \( s_{3} = 2^{1} \). The second step consists of calculating the ACF of each frame extracted from the obtained signal. The third step consists of searching for the ACF maxima, which are classified to make a voicing decision and, for voiced frames, to give the fundamental frequency estimation.
For the first step, the MP computation is detailed in the previous section. Then, the product p[n] is divided into frames of length N by multiplication with a sliding Hanning analysis window w[n]:

\( p_{wi} [n] = p\left[ {n + i\frac{N}{2}} \right]\,w[n],\quad n = 0, \ldots ,N - 1 \)

where i is the window index and N/2 is the overlap.
The weighting w[n] is assumed to be non-zero in the interval [0, N − 1]. The frame length N is chosen so that, on the one hand, the parameters to be measured remain constant over the frame and, on the other hand, there are enough samples of p[n] within the frame to guarantee reliable frequency parameter determination.
The choice of the windowing function influences the values of the short-term parameters: the shorter the window, the greater its influence [14].
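The framing step above can be sketched in a few lines; the frame length and hop follow the values given in the text, while the stacked-array layout is an implementation choice of this sketch:

```python
import numpy as np

def frame_signal(p, frame_len=1024, hop=512):
    # Split the multi-scale product p[n] into overlapping frames,
    # p_wi[n] = p[n + i*hop] * w[n], each weighted by a Hanning window.
    w = np.hanning(frame_len)
    n_frames = 1 + (len(p) - frame_len) // hop
    return np.stack([p[i * hop:i * hop + frame_len] * w
                     for i in range(n_frames)])
```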
In the second step, we compute the short-term autocorrelation function of each weighted block \( p_{wi} [n] \) as follows:

\( R_{i} (k) = \sum\limits_{n = 0}^{N - 1 - k} {p_{wi} [n]\,p_{wi} [n + k]} \)
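The short-term ACF of a weighted frame can be computed with `np.correlate`; normalising by the zero-lag value, so that the ACF maximum is 1, is an assumption made here to match the fixed threshold used later:

```python
import numpy as np

def short_term_acf(frame):
    # Short-term autocorrelation R(k) = sum_n x[n] x[n+k] for k >= 0,
    # normalised so that R(0) = 1 (an assumption for later thresholding).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return r / r[0]
```

For a periodic frame, the first dominant maximum away from lag 0 falls at the period of the signal.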
The third step is detailed in the next section.
Voicing Decision and Pitch Estimation
After calculating the ACF of the speech MP in the ith frame, we store all the peak positions (lags) in a vector \( P_{i} \). Peaks whose value falls below a fixed threshold T are removed; T is set to Max(ACF)/5, i.e. 0.2 for an ACF normalised to a maximum of 1.
If no peaks remain, the frame is declared unvoiced. Otherwise, we compute the distances separating successive peak positions, \( D_{ij} = P_{i,j + 1} - P_{ij} \), which constitute the elements of the vector \( D_{i} \), where i is the frame index, j is the peak index (j = 1, 2, …, M) and M is the number of peaks.
These elements are sorted in ascending order to form the vector \( E_{i} \). To make a voicing decision, we look for well-defined groups among the elements \( E_{ij} \). The groups are formed as follows:
If \( E_{i2} - E_{i1} < S \), where S is a threshold chosen to be 12, then \( E_{i1} \) and \( E_{i2} \) belong to the same group \( G_{i1} \) and we next evaluate \( E_{i3} - E_{i1} \); otherwise, \( E_{i1} \) is in \( G_{i1} \), \( E_{i2} \) is in \( G_{i2} \), and we next evaluate \( E_{i3} - E_{i2} \), and so on until the last element of \( E_{i} \). Once the groups are formed, we count their number \( N_{i} \). If \( N_{i} = 1 \), the ith frame is voiced; otherwise, it is unvoiced.
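The decision rule above can be sketched as follows. Peak picking by simple local maxima, grouping by comparing consecutive *sorted* spacings (a simplification of the anchored comparison in the text), and declaring single-peak frames unvoiced are all assumptions of this sketch:

```python
import numpy as np

def voicing_decision(acf, T_ratio=0.2, S=12):
    # Keep local ACF maxima above T = max(ACF)/5, sort the spacings
    # between successive peaks, and group spacings that differ by less
    # than S samples; the frame is voiced iff a single group remains.
    T = T_ratio * np.max(acf)
    peaks = [k for k in range(1, len(acf) - 1)
             if acf[k] > acf[k - 1] and acf[k] >= acf[k + 1] and acf[k] > T]
    if len(peaks) < 2:          # no spacing can be formed: declare unvoiced
        return False
    d = np.sort(np.diff(peaks))
    return bool(np.all(np.diff(d) < S))
```

Regularly spaced peaks (a periodic ACF) yield one group and a voiced decision; irregular spacings split into several groups and the frame is rejected.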
Figure 3 shows a voiced speech signal followed by its MP. The MP has a periodic structure and reveals maxima corresponding to the glottal opening instants (GOIs) and clear minima corresponding to the GCIs. Figure 4 shows the autocorrelation function of the speech MP depicted in Fig. 3. The calculated function is clearly periodic, with the same period as the speech signal, and its first maximum at a non-zero lag corresponds to the pitch period.
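For a voiced frame, reading the pitch off the ACF amounts to locating that first dominant maximum. The search band of 60–500 Hz below is an assumption of this sketch (the paper does not specify one), as is the sampling rate default:

```python
import numpy as np

def pitch_from_acf(acf, fs=20000, fmin=60.0, fmax=500.0):
    # The lag of the dominant ACF maximum inside a plausible pitch band
    # gives the pitch period; F0 is its inverse (in Hz).
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(acf[lo:hi]))
    return fs / lag
```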
On the other hand, Fig. 5 illustrates the MP of an unvoiced speech signal. Here the MP shows randomly spaced maxima and minima.

Figure 6 illustrates the autocorrelation function of the unvoiced speech MP. This function shows randomly spaced extrema of weak amplitude. These two contrasting behaviours (voiced versus unvoiced) allow us to make a voicing decision.
We now examine the ability of the MP to reduce the effect of noise added to a speech signal.

Figure 7 depicts a noisy voiced speech signal with an SNR of −5 dB, followed by its MP. The MP lessens the noise effects, leading to an autocorrelation function with clearer maxima compared with the one calculated directly on the noisy speech signal, as shown in Fig. 8.
Evaluation
To evaluate the performance of our algorithm, we use the Keele pitch reference database [26, 27]. This database consists of speech signals of five male and five female English speakers, each reading the same phonetically balanced text, with durations varying between about 30 and 40 s. All the speech signals were sampled at a rate of 20 kHz. The Keele database includes reference files containing a voiced–unvoiced segmentation and a pitch estimate for 25.6 ms segments with a 10 ms overlap. The reference files also mark uncertain pitch and voicing decisions. The reference pitch estimation is based on a simultaneously recorded laryngograph signal. Unvoiced frames are indicated with zero pitch values, and negative values are used for uncertain frames.
The commonly used criteria for evaluating pitch estimation performance are the gross pitch error (GPE) and the root mean square error (RMS). A GPE occurs when the estimated fundamental frequency \( F_{0} \) deviates by more than 20% from the reference value. The RMS is computed as the root mean square difference in Hertz between the reference \( F_{0} \) and the estimate over all frames without a GPE.
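These two criteria can be computed directly from paired estimate/reference tracks; the convention of zero and negative reference values marking unvoiced and uncertain frames follows the Keele description above:

```python
import numpy as np

def gpe_and_rms(f0_est, f0_ref, tol=0.2):
    # GPE: % of voiced reference frames whose estimate deviates by more
    # than 20% from the reference; RMS: error in Hz over non-GPE frames.
    est = np.asarray(f0_est, dtype=float)
    ref = np.asarray(f0_ref, dtype=float)
    voiced = ref > 0                  # zero/negative mark unvoiced/uncertain
    gross = voiced & (np.abs(est - ref) > tol * ref)
    fine = voiced & ~gross
    gpe = 100.0 * gross.sum() / voiced.sum()
    rms = np.sqrt(np.mean((est[fine] - ref[fine]) ** 2))
    return gpe, rms
```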
To evaluate a voicing decision algorithm, we calculate the V-UV error, the percentage of voiced frames misclassified as unvoiced, and the UV-V error, the percentage of unvoiced frames classified as voiced, i.e. the false-alarm rate.
Evaluation in a Clean Environment
Table 1 reports evaluation results for voicing classification of the proposed method in a clean environment. We compare our method to other state-of-the-art algorithms [8, 17, 28–30] that are based on the same reference database.
As can be seen, our method performs well compared with these well-known approaches, achieving the lowest V-UV and UV-V error rates of 1.8 and 2.7%, respectively. Moreover, the autocorrelation of the speech MP outperforms our previously proposed SMP approach.
Table 2 presents the evaluation results of the proposed algorithm (MPA) for pitch estimation in a clean environment, compared with other state-of-the-art algorithms [8, 11, 12, 17, 28–31].
The MPA achieves a low GPE rate of 0.61% and an RMS of 1.72 Hz. It is clearly more accurate than the SMP, which has a GPE rate of 0.75% and an RMS of 2.41 Hz.
Evaluation in a Noisy Environment
To test the robustness of our algorithm, we add various background noises (white, babble and vehicle) at four SNR levels to the Keele database speech signals. The noise is taken from the noisex-92 database [32].
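Mixing noise into clean speech at a prescribed SNR is done by scaling the noise to the target speech-to-noise power ratio; this is a standard construction, not a detail given in the paper:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise) equals the
    # requested SNR in dB, then add it to the speech signal.
    noise = noise[:len(speech)]
    gain = np.sqrt(np.mean(speech**2) /
                   (np.mean(noise**2) * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```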
Table 3 presents evaluation results for voicing decision of the proposed method in a noisy environment.
As reported in Table 3, the performance of the proposed approach degrades as the SNR level decreases, but it remains robust and outperforms the SMP and NMF-HMM-PI methods.
Table 4 illustrates the GPE of the proposed approach, the SMP [17], the PRAAT [12], the YIN [11], the RCEPS [31] and the NMF-HMM-PI [28] in a noisy environment. As depicted in Table 4, when the SNR level decreases, the MPA algorithm remains robust even at −5 dB and appears to be the most efficient approach for pitch estimation. The SMP method has a higher GPE than the NMF-HMM-PI for babble and vehicle noises.
Besides, the MPA method presents the lowest RMS values, showing its suitability for pitch estimation in difficult conditions.
As depicted in Table 5, when the SNR level decreases, the MPA algorithm remains reliable even at −5 dB and appears to be the most accurate approach for pitch estimation.
Moreover, the voiced/unvoiced decision and the pitch estimation accuracy depend closely on the threshold T. We studied the GPE rate as a function of T and found, as depicted in Fig. 9, that the chosen value T = Max(ACF)/5 = 0.2 yields the lowest GPE rate.
Conclusion
In this paper, we have presented a voicing classification and pitch estimation method that relies on the autocorrelation analysis of the speech multi-scale product. The proposed approach can be summarised in three essential steps. First, we compute the product of the speech WTC at three successive dyadic scales; the obtained signal is divided into frames of 1,024 samples, each weighted by a Hanning window of the same length. Second, we calculate the autocorrelation function of each weighted frame. Third, we detect the peaks of this function and classify them according to the defined rules to make a voiced/unvoiced decision for the frame. For a voiced frame, the pitch period is the non-zero lag of the first ACF maximum, and the fundamental frequency is estimated as its inverse.
The experimental results show the efficiency of our proposed approach for voicing detection and pitch estimation from clean speech, and its robustness in noisy environments compared with state-of-the-art algorithms. The MPA approach outperforms the algorithms cited in this work not only for the voiced/unvoiced decision but also for pitch estimation.
Future work will extend the proposed approach to multi-pitch estimation.
References
Qi Y, Hunt BR. Voiced-unvoiced-silence classifications of speech using hybrid features and a network classifier. IEEE Trans Speech Audio Process. 1993;1(2):250–6.
Martin A, Charlet D, Mauuary L. Robust speech/non-speech detection using LDA applied to MFCC. IEEE Int Conf Acoust Speech Signal Process. 2001;1:237–40.
O'Shaughnessy D. Speech communications: human and machine. 2nd ed. Piscataway, NJ: IEEE Press; 1999.
Childers DG, Hahn M, Larar JN. Silent and voiced/unvoiced/mixed excitation classification of speech. IEEE Trans Acoust Speech Signal Process. 1989;37(11):1771–4.
Liao L, Gregory M. Algorithms for speech classification. IEEE Int Conf Signal Process Appl. 1999;2:623–7.
Hess W. Pitch determination of speech signals: algorithms and devices. New York: Springer; 1983.
Bagshaw PC, Hiller SM, Jack MA. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In: The 3rd European conference on speech communication and technology; 1993.
Talkin D. A robust algorithm for pitch tracking. In: Kleijn WB, Paliwal KK, editors. Speech coding and synthesis. Amsterdam: Elsevier; 1995. p. 495–518.
Rabiner L. On the use of autocorrelation analysis for pitch detection. IEEE Trans Acoust Speech Signal Process. 1977;25(1):24–33.
Krubsack DA, Niederjohn RJ. An autocorrelation pitch detector and voicing decision with confidence measures developed for noise-corrupted speech. IEEE Trans Signal Process. 1991;39(2):319–29.
De Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Amer. 2002;111(4):1917–30.
Boersma P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc Inst Phon Sci. 1993;17:97–110.
Noll AM. Cepstrum pitch determination. J Acoust Soc Amer. 1967;41(2):293–309.
Shimamura T, Takagi H. Noise-robust fundamental frequency extraction method based on exponentiated band-limited amplitude spectrum. IEEE Int Conf Midwest Symposium on Circuits and Systems. 2004;47(2):141–4.
Shahnaz C, Zhu WP, Ahmad MO. A spectro-temporal algorithm for pitch frequency estimation from noisy observations. In: IEEE international symposium on circuits and systems. Seattle, WA; 2008. p. 1704–7.
Ben Messaoud MA, Bouzid A, Ellouze N. Spectral multi-scale product analysis for pitch estimation from noisy speech signal. In: Solé-Casals J, Zaiats V, editors. Advances on non-linear speech processing, International conference on non-linear speech processing, NOLISP’09, LNAI, vol. 5933. Berlin: Springer; 2010. p. 95–102.
Ben Messaoud MA, Bouzid A, Ellouze N. A new method for pitch tracking and voicing decision based on spectral multi-scale analysis. Signal Process: An Int J. 2009;3(5):144–9.
Burrus CS, Gopinath RA, Guo H. Introduction to wavelets and wavelet transforms: a primer. Englewood Cliffs: Prentice Hall; 1998.
Mallat S. A wavelet tour of signal processing: the sparse way. 3rd ed. Burlington, VT: Academic Press; 2008.
Berman Z, Baras JS. Properties of the multiscale maxima and zero-crossings representations. IEEE Trans Signal Process. 1993;41(12):3216–31.
Kadambe S, Boudreaux-Bartels GF. Application of the wavelet transform for pitch detection of speech signals. IEEE Trans Inf Theory. 1992;38(2):917–8.
Bouzid A, Ellouze N. Voice source parameter measurement based on multi-scale analysis of electroglottographic signal. Speech Commun. 2009;51(9):782–92.
Bouzid A, Ellouze N. Open quotient measurements based on multiscale product of speech signal wavelet transform. New York: Hindawi Publishing Corp, Res Lett Signal Process; 2007. p. 1–6.
Xu Y, Weaver JB, Healy DM, Lu J. Wavelet transform domain filters: a spatially selective noise filtration technique. IEEE Trans Image Process. 1994;3(6):747–58.
Sadler BM, Swami A. Analysis of multi-scale products for step detection and estimation. IEEE Trans Inf Theory. 1999;45(3):1043–9.
Meyer G, Plante F, Ainsworth WA. A pitch extraction reference database. The 4th European conference on speech communication and technology, EUROSPEECH. Madrid, Spain; 1995. p. 837–40.
Keele Pitch Database. In: Psychology Home page-human machine perception. University of Liverpool. 1995. http://www.liv.ac.uk/Psychology/hmp/projects/pitch/speech/keele_pitch_database.html. Accessed 24 April 2010.
Joho D, Bennewitz M, Behnke S. Pitch estimation using models of voiced speech on three levels. IEEE Int Conf Acoust Speech Signal Process. 2007;4:1077–80.
Sha F, Saul LK. Real time pitch determination of one or more voices by nonnegative matrix factorization. In: Saul LK, Weiss Y, Bottou L, editors. Advances in neural information processing systems. Cambridge: MIT Press; 2005. p. 1233–40.
Sha F, Burgoyne JA, Saul LK. Multiband statistical learning for f0 estimation in speech. IEEE Int Conf Acoust Speech Signal process. 2004;5:661–4.
Nakatani T, Irino T. Robust and accurate fundamental frequency estimation based on dominant harmonic components. J Acoust Soc Amer. 2004;116(6):3690–700.
Noisex92. In: Signal Processing Information Base (SPIB). The Signal Processing Society and the National Science Foundation. 2007. http://spib.rice.edu/spib/select_noise.html. Accessed 24 April 2010.
Acknowledgments
The authors are very grateful to Alain de Cheveigné for providing the fundamental frequency estimation software (Yin algorithm).
Cite this article
Ben Messaoud, M.A., Bouzid, A. & Ellouze, N. Autocorrelation of the Speech Multi-Scale Product for Voicing Decision and Pitch Estimation. Cogn Comput 2, 151–159 (2010). https://doi.org/10.1007/s12559-010-9048-1