Introduction

The classification of the speech signal into voiced, unvoiced and silence segments provides a preliminary acoustic segmentation that is important for speech analysis. The task is to determine whether a speech signal is present and, if so, whether its production involves vibration of the vocal folds. Vocal fold vibration produces periodic or quasi-periodic excitation of the vocal tract for voiced speech, whereas purely transient and/or turbulent noise provides aperiodic excitation of the vocal tract for unvoiced speech [1].

This type of classification [2] also finds applications in fundamental frequency estimation, formant extraction and syllable marking. In fact, pitch detectors for speech signals can only work correctly if the fundamental frequency estimation is coupled with a reliable voiced–unvoiced decision.

Moreover, the fundamental frequency is an important parameter in speech analysis and synthesis, and it plays a central role in speech production and perception. Reliable pitch estimation is required in application areas such as speech enhancement, analysis and prosody modelling, low-bit-rate coding, and speaker recognition [3].

A wide variety of sophisticated voicing classification and pitch detection algorithms (PDAs) have been proposed in the speech processing literature [4–8, 13].

Most voicing decision algorithms exploit elementary speech signal parameters that can be computed independently of the type of input signal: energy, amplitude, short-term autocorrelation coefficients, the zero-crossing count, the ratio of signal amplitudes in different sub-bands or after pre-processing, the linear prediction error, or the salience of a pitch estimate. Voicing decision algorithms can be grouped into three essential categories: (1) simple threshold analysis algorithms, which exploit only a few basic parameters; (2) more complex algorithms based on pattern recognition methods; and (3) integrated algorithms for both voicing and pitch determination.

Besides, pitch estimation from the speech signal alone relies on different types of signal transformation, which can operate in three domains:

The first approach works in the time domain; the most common transformation is the autocorrelation function (ACF), used for example by the YIN algorithm and the Praat software [9–12]. The second approach works in the frequency domain, where the most frequently used transformation is the spectrum [13, 14]. The third approach combines both time and frequency domains, using the short-time Fourier transform (STFT) and the wavelet transform (WT) [15].

Although many PDAs have been proposed, there is still no single reliable algorithm that can be used across the various speech processing applications. Accurate and robust pitch estimation of speech is difficult for several reasons, such as the fast variation of the instantaneous pitch and of the formants.

In this paper, we detail and evaluate our improved algorithm called Multi-Scale Product Autocorrelation for voicing decision and fundamental frequency estimation from both clean and noisy speech.

The proposed algorithm was originally inspired by our previous work reported in [16, 17], where we used the speech multi-scale product spectrum (SMP) for pitch estimation and voicing decision.

The paper is organised as follows. After the introduction, we present the multi-scale product (MP) method used in this work to provide the derived speech signal. Section “Autocorrelation of the Speech Multi-Scale Product” introduces the multi-scale product autocorrelation (MPA) approach, and section “Voicing Decision and Pitch Estimation” details the voicing detection and fundamental frequency estimation rules. In section “Evaluation”, we evaluate our approach and compare it with other well-known algorithms; evaluation results are also presented for speech corrupted by real noise at various SNR levels.

Multi-Scale Product

The WT is a multi-scale analysis that has been shown to be well suited to speech processing tasks such as glottal closure instant (GCI) detection, pitch estimation, speech enhancement and recognition. Moreover, a speech signal can be analysed at specific scales corresponding to the range of human speech [18–21].

One of the most important WT applications is signal singularity detection. The continuous WT produces modulus maxima at signal singularities, allowing their localisation. However, single-scale analysis is not accurate, so decision algorithms using multiple scales have been proposed in different works to circumvent this problem [22, 23].

The MP is essentially introduced to improve signal edge detection. It is based on the multiplication of the wavelet transform coefficients (WTC) at several scales. This non-linear combination of the WTC attempts to enhance the gradient peaks caused by true edges, while suppressing spurious peaks.

This method was first used in image processing. Xu et al. [24] rely on the variations across WT decomposition levels and multiply the WT of the image at adjacent scales to distinguish important edges from noise. Sadler and Swami [25] studied the MP of a signal in the presence of noise.

The choice of the mother wavelet is crucial for detecting discontinuities. It depends essentially on the number of vanishing moments and on the wavelet support. A WT with n vanishing moments can be interpreted as a multi-scale differential operator of order n applied to the smoothed signal. This provides a relationship between the differentiability of the signal and the decay of the wavelet modulus maxima at fine scales.

It has been demonstrated that a wavelet with n vanishing moments can be expressed as follows:

$$ \Psi(t) = (-1)^{n}\,\frac{\mathrm{d}^{n}\theta(t)}{\mathrm{d}t^{n}} $$
(1)

where θ is a smoothing function. So, the WT of a function f can be written as:

$$ Wf(u,s) = s^{n}\,\frac{\mathrm{d}^{n}}{\mathrm{d}u^{n}}\,(f * \bar{\theta}_{s})(u) $$
(2)

with

$$ \bar{\theta}_{s}(t) = \frac{1}{\sqrt{s}}\,\theta\!\left(\frac{-t}{s}\right) $$
(3)

So, if the wavelet is chosen to have one vanishing moment, modulus maxima appear at the discontinuities of the signal and correspond to the maxima of the first derivative of the smoothed signal.

The MP [25] consists of computing the product of the WTC of the function f(n) at successive dyadic scales as follows:

$$ p(n) = \prod_{j = j_{0}}^{j_{L}} w_{2^{j}}f(n) $$
(4)

where \( w_{2^{j}}f(n) \) is the WT of the function f at scale \( 2^{j} \). The MP is computed over three levels, an odd number of scales, so that the sign of the edges is preserved.
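To make the construction concrete, the following minimal Python sketch computes such a product at three dyadic scales. It uses a first derivative of a Gaussian as a stand-in wavelet with one vanishing moment (the quadratic spline wavelet used later in this work is not assumed here); the scale values and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def multiscale_product(x, scales=(0.5, 1.0, 2.0)):
    """Product of wavelet-like coefficients of x at three dyadic scales (Eq. 4)."""
    x = np.asarray(x, dtype=float)
    p = np.ones(len(x))
    for s in scales:
        # order=1 yields the first derivative of the Gaussian-smoothed signal,
        # i.e. a transform with one vanishing moment at scale s (stand-in wavelet).
        p *= gaussian_filter1d(x, sigma=s, order=1)
    return p
```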

In this work, we use the MP because it provides a derived speech signal that is simpler to analyse. Figure 1 summarises the steps of the MP.

Fig. 1 Diagram of the speech multi-scale product

The MP of voiced speech has a periodic structure in which the singularities are reinforced and marked by extrema. Its structure resembles that of the differentiated laryngograph signal, so the autocorrelation function can be applied to the obtained signal.

Autocorrelation of the Speech Multi-Scale Product

We propose a new technique to determine voiced frames together with an estimation of the fundamental frequency. The method is based on the autocorrelation analysis of the speech MP and can be decomposed into three essential steps, as shown in Fig. 2. The first step consists of computing the speech MP. The obtained signal is then decomposed into overlapping frames; each frame includes N samples and is weighted by a Hanning window \( s_{w}(n) \), n = 0, 1, …, N − 1 (N = 1,024 samples with an overlap of 512 points at a sampling frequency of 20 kHz). The wavelet used in this MP analysis is the quadratic spline function with a support of 0.8 ms, at scales \( s_{1} = 2^{-1} \), \( s_{2} = 2^{0} \) and \( s_{3} = 2^{1} \). The second step consists of calculating the ACF of each frame extracted from the obtained signal. The third step consists of looking for the ACF maxima, which are classified to make a voicing decision and then to give the fundamental frequency estimation for the voiced frames.
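For reference, the analysis settings stated above can be gathered as constants. This is only a sketch; the variable names are ours.

```python
FS = 20_000                           # sampling frequency in Hz
FRAME_LEN = 1024                      # N samples per frame, i.e. 1024 / 20000 = 51.2 ms
HOP = 512                             # half-frame overlap, i.e. 25.6 ms
SCALES = (2 ** -1, 2 ** 0, 2 ** 1)    # dyadic scales s1, s2, s3 of the MP analysis
```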

Fig. 2 Block diagram of the proposed approach for voiced/unvoiced decision and the fundamental frequency estimation

For the first step, the MP computation is detailed in the previous section. The product p[n] is then divided into frames of length N by multiplication with a sliding Hanning analysis window w[n]:

$$ P_{wi}[k] = p[k]\,w[k - iN/2] $$
(5)

where i is the window index, and N/2 is the overlap.

The weighting w[n] is assumed to be non-zero in the interval [0, N − 1]. The frame length N is chosen in such a way that, on the one hand, the parameters to be measured remain approximately constant within the frame and, on the other hand, there are enough samples of p[n] to guarantee a reliable frequency parameter determination.

The choice of the windowing function influences the values of the short-term parameters: the shorter the window, the greater its influence [14].
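A minimal numpy sketch of the framing of Eq. (5) is given below, assuming np.hanning as the analysis window w[n]; the function name and defaults are illustrative.

```python
import numpy as np

def hann_frames(p, N=1024, hop=512):
    """Cut the MP signal p[n] into Hanning-weighted frames of N samples (Eq. 5)."""
    w = np.hanning(N)                         # analysis window w[n]
    starts = range(0, len(p) - N + 1, hop)    # hop = N/2 gives 50% overlap
    return np.array([p[s:s + N] * w for s in starts])
```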

In the second step, we compute the short-term autocorrelation function of each weighted block \( p_{wi}[n] \) as follows:

$$ \begin{aligned} R_{i}(k) & = \sum_{l = 0}^{N - 1} p_{wi}(l)\,p_{wi}(l + k) \\ \mathrm{ACF}_{i}(k) & = \frac{R_{i}(k)}{R_{i}(0)} \end{aligned} $$
(6)
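A direct numpy sketch of Eq. (6) for one weighted frame follows; the use of np.correlate is an assumption about the implementation.

```python
import numpy as np

def normalized_acf(frame):
    """ACF_i(k) = R_i(k) / R_i(0) for non-negative lags k."""
    N = len(frame)
    r = np.correlate(frame, frame, mode="full")[N - 1:]   # R_i(k), k = 0 .. N-1
    return r / r[0]
```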

The third step is detailed in the next section.

Voicing Decision and Pitch Estimation

After calculating the ACF of the speech MP in the ith frame, we store all the peak positions in a vector \( P_{i} \); these lag positions correspond to candidate frequencies. Peaks with a very low value, below a fixed threshold T, are removed; since the ACF is normalised so that Max(ACF) = 1, T is fixed to Max(ACF)/5 = 0.2.

If there are no peaks, the frame is declared unvoiced. Otherwise, we calculate the distances separating successive peak positions, \( D_{ij} = P_{i,j+1} - P_{ij} \), which constitute the elements of the vector \( D_{i} \), where i is the frame index, j is the peak index (j = 1, 2, …, M) and M is the number of peaks.

These elements are sorted in ascending order to compose the vector \( E_{i} \). To make a voicing decision, we look for well-defined groups formed from the set of \( E_{ij} \). The groups are built as follows:

If \( E_{i2} - E_{i1} < S \), where S is a threshold chosen to be 12, then \( E_{i1} \) and \( E_{i2} \) belong to the same group \( G_{i1} \) and we next compare \( E_{i1} \) with \( E_{i3} \); otherwise, \( E_{i1} \) is placed in \( G_{i1} \), \( E_{i2} \) in \( G_{i2} \), and we next compare \( E_{i2} \) with \( E_{i3} \), and so on until the last element of the vector \( E_{i} \) is reached. Once the groups are formed, we count their number \( N_{i} \): if \( N_{i} = 1 \), the ith frame is voiced; otherwise, the frame is unvoiced.
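The peak-picking, grouping and decision rules above can be sketched as follows, assuming scipy.signal.find_peaks for the peak search; the handling of frames with a single peak and the function names are our own assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def voicing_and_pitch(acf, fs=20_000, T=0.2, S=12):
    """Voiced/unvoiced decision and F0 (Hz) from the normalised ACF of one frame."""
    # Peak positions P_i above the threshold T, ignoring the lag-zero maximum.
    peaks, _ = find_peaks(acf[1:], height=T)
    peaks += 1
    if len(peaks) == 0:
        return False, 0.0      # no peaks: the frame is declared unvoiced
    if len(peaks) == 1:
        return False, 0.0      # a single peak gives no distances to group (our choice)
    # Distances D_ij between successive peak positions, sorted in ascending order (E_i).
    E = np.sort(np.diff(peaks))
    # Consecutive sorted distances closer than S samples fall into the same group.
    n_groups = 1 + int(np.sum(np.diff(E) >= S))
    if n_groups != 1:
        return False, 0.0
    # Voiced frame: the pitch period is the lag of the first non-zero-lag maximum.
    return True, fs / peaks[0]
```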

Figure 3 shows a voiced speech signal followed by its MP. The MP has a periodic structure and reveals maxima corresponding to the glottal opening instants (GOI) and clear minima corresponding to the GCIs. Figure 4 shows the autocorrelation function of the speech MP depicted in Fig. 3. The calculated function is clearly periodic and has the same period as the speech signal; its first maximum at a non-zero lag corresponds to the pitch period.

Fig. 3 a Voiced speech signal. b Its multi-scale product

Fig. 4 Autocorrelation of the voiced speech multi-scale product

On the other hand, Fig. 5 illustrates the MP of an unvoiced speech signal. Here, the MP shows maxima and minima that are randomly spaced.

Fig. 5 a Unvoiced speech signal. b Its multi-scale product

Figure 6 illustrates the autocorrelation function of the MP of the unvoiced speech signal. This function shows extrema that are also randomly spaced and of weak amplitude. These two different behaviours (voiced and unvoiced cases) allow us to make a voicing decision.

Fig. 6 Autocorrelation of the unvoiced speech multi-scale product

We now underline the ability of the MP to reduce the effect of noise added to a speech signal.

Figure 7 depicts a noisy voiced speech signal with an SNR of −5 dB followed by its MP. The MP lessens the noise effects, leading to an autocorrelation function with clear maxima compared with the one calculated directly on the noisy speech signal, as shown in Fig. 8.

Fig. 7 a Voiced and noisy speech signal (SNR = −5 dB). b Its multi-scale product

Fig. 8 a Autocorrelation of the noisy voiced speech. b Autocorrelation of the noisy voiced speech multi-scale product

Evaluation

To evaluate the performance of our algorithm, we use the Keele pitch reference database [26, 27]. This database consists of speech signals of five male and five female English speakers, each reading the same phonetically balanced text, with durations varying between about 30 and 40 s. All the speech signals were sampled at a rate of 20 kHz. The Keele database includes reference files containing a voiced–unvoiced segmentation and a pitch estimate for 25.6 ms segments with a 10 ms overlap. The reference files also mark uncertain pitch and voicing decisions. The reference pitch estimation is based on a simultaneously recorded laryngograph signal. Unvoiced frames are indicated with zero pitch values, and negative values are used for uncertain frames.

The commonly used criteria for evaluating pitch estimation performance are the gross pitch error (GPE) and the root mean square error (RMS). A GPE is identified when the estimated fundamental frequency \( F_{0} \) value is more than 20% higher or lower than the reference one. The RMS is computed as the root mean square difference, in hertz, between the reference \( F_{0} \) and the estimate over all frames having no GPE.

To evaluate a voicing decision algorithm, we calculate the V-UV error, corresponding to the percentage of voiced frames misclassified as unvoiced, and the UV-V error, defined as the percentage of unvoiced frames classified as voiced, i.e. the false alarm rate.
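A minimal sketch of these four measures is given below for per-frame F0 tracks in Hz, with 0 marking unvoiced frames and uncertain frames assumed to be already excluded; the normalisation conventions are our assumptions where the text leaves them implicit.

```python
import numpy as np

def evaluate(ref_f0, est_f0):
    """V-UV (%), UV-V (%), GPE (%) and RMS (Hz) for per-frame F0 tracks."""
    ref = np.asarray(ref_f0, dtype=float)
    est = np.asarray(est_f0, dtype=float)
    v_ref, v_est = ref > 0, est > 0

    # Voicing errors, normalised by the number of voiced / unvoiced reference frames.
    v_uv = 100.0 * np.sum(v_ref & ~v_est) / max(np.sum(v_ref), 1)
    uv_v = 100.0 * np.sum(~v_ref & v_est) / max(np.sum(~v_ref), 1)

    # Gross pitch errors: the estimate deviates by more than 20% from the reference.
    both = v_ref & v_est
    gross = both & (np.abs(est - ref) > 0.2 * ref)
    gpe = 100.0 * np.sum(gross) / max(np.sum(both), 1)

    # RMS difference in Hz over the frames having no GPE.
    fine = both & ~gross
    rms = float(np.sqrt(np.mean((est[fine] - ref[fine]) ** 2))) if fine.any() else 0.0
    return v_uv, uv_v, gpe, rms
```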

Evaluation in a Clean Environment

Table 1 reports evaluation results for voicing classification of the proposed method in a clean environment. We compare our method with other state-of-the-art algorithms [8, 17, 28–30] evaluated on the same reference database.

Table 1 Voicing decision performances in a clean environment

As can be seen, our method performs well in comparison with well-known approaches, achieving the lowest V-UV and UV-V rates of 1.8 and 2.7%, respectively. Moreover, the autocorrelation of the speech MP outperforms our previously proposed SMP approach.

Table 2 presents the evaluation results of the proposed algorithm (MPA) for pitch estimation in a clean environment, compared with other state-of-the-art algorithms [8, 11, 12, 17, 28–31].

Table 2 Evaluation results of the MPA algorithm and others for pitch estimation in a clean environment

The MPA shows a low GPE rate of 0.61% and an RMS of 1.72 Hz. It is clearly more accurate than the SMP, which has a GPE rate of 0.75% and an RMS of 2.41 Hz.

Evaluation in a Noisy Environment

To test the robustness of our algorithm, we add various background noises (white, babble and vehicle) at four SNR levels to the Keele database speech signals. The noise is taken from the NOISEX-92 database [32].
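The noise mixing can be sketched as follows, assuming numpy; the noise is simply an array of samples here, and the gain is chosen so that the speech-to-noise power ratio matches the target SNR.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)               # match the lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```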

Table 3 presents evaluation results for voicing decision of the proposed method in a noisy environment.

Table 3 Performance comparison of the MPA algorithm and others for voicing decision in a noisy environment

As reported in Table 3, when the SNR level decreases, the performance of the proposed approach degrades but remains robust and better than that of the SMP and NMF-HMM-PI methods.

Table 4 illustrates the GPE of the proposed approach, the SMP [17], PRAAT [12], YIN [11], the RCEPS [31] and the NMF-HMM-PI [28] in a noisy environment. As depicted in Table 4, when the SNR level decreases, the MPA algorithm remains robust even at −5 dB and appears to be the most efficient approach for pitch estimation. The SMP method has a greater GPE than the NMF-HMM-PI in the case of babble and vehicle noises.

Table 4 GPE rate for some pitch estimation algorithms in a noisy environment

Besides, the MPA method presents the lowest RMS values, showing its suitability for pitch estimation in difficult conditions.

As depicted in Table 5, when the SNR level decreases, the MPA algorithm remains reliable even at −5 dB and appears to be the most accurate approach for pitch estimation.

Table 5 RMS (in Hz) for different pitch estimation algorithms in a noisy environment

Moreover, the voiced/unvoiced decision and the pitch estimation accuracy are closely related to the threshold T. We studied the GPE rate as a function of T and found, as depicted in Fig. 9, that the GPE rate is lowest for the chosen value T = Max(ACF)/5 = 0.2.

Fig. 9 GPE rate variation versus the threshold T

Conclusion

In this paper, we have presented a voicing classification and pitch estimation method that relies on the autocorrelation analysis of the speech multi-scale product. The proposed approach can be summarised in three essential steps. First, we compute the product of the speech WTC at three successive dyadic scales; the obtained signal is divided into frames of 1,024 samples, and each frame is weighted by a Hanning window of the same length. Second, we calculate the autocorrelation function of each weighted frame. Third, we detect the peaks of this function and classify them according to the defined rules in order to make a voiced/unvoiced decision for the frame. For a voiced frame, the pitch period is given by the non-zero lag of the first maximum, and the fundamental frequency is estimated as the inverse of the pitch period.

The experimental results show the efficiency of the proposed approach for voicing detection and pitch estimation from clean speech, and its robustness in noisy environments compared with state-of-the-art algorithms. The MPA approach outperforms the algorithms cited in this work not only for the voiced/unvoiced decision but also for pitch estimation.

Future work will concern the extension of the proposed approach to multi-pitch estimation.