1 Introduction

Pitch carries important information about speech and music sounds and is central to speech analysis, speech recognition, prosody analysis, music information retrieval, chord recognition, automatic music transcription, and onset detection (Gavat et al. 2002; Klapuri 2004; Bello et al. 2005; Roy et al. 2011). Acoustic music sounds are often quasi-periodic and depart from perfect periodicity in different ways. Generally, pitch is the subjective perception of a note, and every pitched musical instrument can produce individual notes with well-detectable fundamental frequencies (F0) (Muller et al. 2011).

Many methods for pitch determination have been proposed. In most cases, they are based on spectral analysis, on a psychoacoustic model of human hearing, or on a combination of the two. Comparative studies that compare the methodologies and performance of the various methods are presented in (Hess 1992; Klapuri 2000).

In the time domain, many pitch estimation methods rely on the autocorrelation function (ACF) (Brown and Zhang 1991) or the average magnitude difference function (AMDF) (Li et al. 2006). Another significant variant, the approach introduced by de Cheveigné (De Cheveigné and Kawahara 2002), analyzes the speech signal with a modified autocorrelation and applies the subtraction operation used in the AMDF, a parabolic interpolation, and a cumulative mean normalization to lower the error rate. However, methods based on the ACF and the AMDF tend to estimate twice the true period, which produces a sub-harmonic error in the spectral domain.

In the frequency domain, the commonly used transformations are the spectrum and the cepstrum. One constraint of these techniques, however, is that they attribute the same weight to all harmonic frequencies, which tends to cause octave (doubling) errors (Klapuri 2000). In addition, methods based on spectrum autocorrelation have been proposed, such as the logarithmic spectrum ACF (Kunieda et al. 1996). Unfortunately, a major limitation of these methods is that they also produce octave errors when predominant harmonics exist, since they act as harmonic position estimators. Examples of more efficient harmonic position estimators are the sub-harmonic to harmonic ratio (SHR) (Sun 2000) and the sawtooth waveform inspired pitch estimator (SWIPE) (Camacho and Harris 2008).

To recapitulate, salient harmonics, missing harmonics, and other challenges found in speech and music signals make efficient estimation of F0 very hard. Furthermore, the methods presented above cannot determine the F0 of imperfectly harmonic sound signals: because of the non-stationarity and quasi-periodicity of speech and of the physical vibration of musical instruments, the harmonics of the produced signals are not spaced at exact integer multiples but shift marginally from their ideal positions.

Although a large number of fundamental frequency determination algorithms have been described in the state of the art for clean sound, fundamental frequency determination from noisy sound has been attempted by only a few researchers (Shimamura and Kobayashi 2001; Shahnaz et al. 2007, 2008). In practical applications, however, the fundamental frequency determination task has to be performed using only the given noisy sound. For example, for speech separation in noisy environments, extraction of the fundamental frequency of the dominant speech is required, as it can be used as a cue for separating concurrent speech. In (Mahmoodzadeh et al. 2012), experiments were carried out to separate a signal from background noise based on computational auditory scene analysis (CASA). The fundamental frequency determination algorithm employed in (Mahmoodzadeh et al. 2012) for determining the concurrent time–frequency region of the desired sound is based on instantaneous amplitude comb filtering and can determine the F0 of vowels in noisy environments, but its estimates are not accurate enough.

In the weighted autocorrelation method (Shimamura and Kobayashi 2001), exploiting the fact that the AMDF and the ACF reflect the same periodicity, the ACF is weighted by the reciprocal of the AMDF in order to emphasize the true pitch peak for noisy sound. Since, under a high level of noise, the global minimum of the AMDF or the global maximum of the ACF may occur at a lag that is a multiple or sub-multiple of the true pitch period, the weighted ACF may wrongly emphasize peaks at non-pitch locations more than those at the true pitch location. This causes inaccurate fundamental frequency determination at low SNR. It is worth mentioning that most fundamental frequency determination algorithms report performance for sound corrupted by white noise only. In general, determination performance deteriorates significantly when the sound is corrupted by realistic noise (Prasanna and Yegnanarayana 2004), yet many practical applications demand accurate fundamental frequency determination from sound corrupted by realistic noise, such as multi-talker babble, at very low SNR levels.

The objective of this work is to present a robust approach that effectively overcomes the limitations of existing fundamental frequency determination methods under severe noise while remaining efficient on clean speech and monophonic music.

To this end, this work presents an efficient method for pitch estimation of speech and music signals based on the spectrum of the multi-scale product of a gammachirp auditory filter-bank (GAMMA-MP). The proposed method is based on characteristics of the auditory system and relies on both the spectral behaviour of the cochlea and the outer/middle ear filtering. The outer/middle ear filtering is modeled by a low-pass filter, and the cochlear filtering by a gammachirp filter-bank. The F0 is then deduced in each channel by computing the fast Fourier transform of the multi-scale product analysis, and the best candidate is selected as the F0 estimate. In this paper, the F0 is determined in all voiced segments of the evaluated databases; the study of voicing decision methods is left for future work.

The rest of the paper is organized as follows. Section 2 provides the details of the approach. Experimental evaluations and discussions are described in Section 3. Finally, Section 4 concludes the work and presents future directions.

2 The pitch estimation method of speech and music sounds

The overall approach is decomposed into two stages. A block diagram of the proposed approach is depicted in Fig. 1. In the first stage, the sound signal is analyzed by a gammachirp filter bank, which decomposes it into a time–frequency map. The output is then split into two channels: the channel below 1 kHz is low-pass filtered, and the high channel is half-wave rectified and low-pass filtered. In the second stage, we compute the multi-scale product in the frequency domain in each channel to extract the fundamental frequency F0.

Fig. 1 Block diagram of the proposed method for pitch estimation of sound signal

2.1 First stage: Auditory feature extraction

Auditory filter modeling refers to mathematical models that simulate the psychophysical aspects and basic perceptual characteristics of human hearing (Lyon et al. 2010). Fundamental frequency has a complex relationship to the physical properties of speech and ultimately refers to auditory perception, so it is natural to apply approaches that model human perception. A peripheral auditory model using time-domain processing of periodicity properties can simulate many features of pitch perception that are often considered to be more central (Meddis and O’Mard 1997).

First, the sound signal passes through a second-order low-pass filter that models the outer/middle ear (Van Immerseel and Martens 1992). The main motivation for this filter is to increase the strength of the high-frequency harmonics so that, at its output, the harmonics have roughly the same amplitude. It is based on the transfer function below:

$$H(x) = \frac{\omega_{0}^{2}}{\omega_{0}^{2} + 0.33\,\omega_{0} x + x^{2}}$$
(1)

where \(f_{r} = \omega_{0}/2\pi\) is the resonance frequency, equal to 4 kHz.
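
As an illustration, the following Python sketch shows one way this stage could be realized digitally, taking Eq. (1) as the analog prototype with a 4 kHz resonance; the bilinear-transform discretization and the synthetic test signal are assumptions on our part, since the paper specifies only the analog transfer function.

```python
import numpy as np
from scipy.signal import bilinear, lfilter

def outer_middle_ear_filter(x, fs, fr=4000.0):
    """Second-order low-pass of Eq. (1) with resonance at fr, discretized by
    a bilinear transform (the discretization is an assumption)."""
    w0 = 2.0 * np.pi * fr                      # resonance angular frequency
    b_analog = [w0 ** 2]                       # numerator of Eq. (1)
    a_analog = [1.0, 0.33 * w0, w0 ** 2]       # denominator of Eq. (1)
    b, a = bilinear(b_analog, a_analog, fs)    # analog -> digital coefficients
    return lfilter(b, a, x)

# Example: filter one second of a synthetic vowel-like signal sampled at 20 kHz.
fs = 20000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * k * 150 * t) / k for k in range(1, 10))
y = outer_middle_ear_filter(x, fs)
```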

Then, we apply a gammachirp filter-bank to simulate the sound processing in the cochlea. The gammachirp filter is an extension of the gammatone filter with a frequency-modulation (chirp) term, and it provides an approximation of the frequency response of the basilar membrane in the cochlea (Patterson et al. 2003; Irino and Patterson 2006). The impulse response of a gammachirp filter is defined by the equation below:

$$g(t) = A\, t^{k-1} e^{-2\pi m\,\mathrm{ERB}(p_{0})\, t}\, e^{\,j 2\pi p_{0} t + j r \ln(t) + j\varphi}$$
(2)

where t > 0, and r, \(p_{0}\), φ, and A are the chirp rate, the asymptotic frequency, the phase, and the amplitude, respectively. The parameters k and m define the gamma envelope. Following the work of Tolonen and Karjalainen (2000), the sound is separated into two channels, below and above 1 kHz. The low-channel sound is low-pass filtered, and the high-channel sound is half-wave rectified and then low-pass filtered. The high channel is phase-sensitive since it follows the amplitude envelope of the sound in the frequency band above 1000 Hz; thus, all phase-sensitivity in our approach comes from the high channel. This differs from the standard system, in which all channels are phase-sensitive since each follows the envelope of the sound in its own frequency band.
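
A minimal numerical sketch of the gammachirp impulse response of Eq. (2) is given below; the parameter values k = 4, m = 1.019 and the chirp rate r are illustrative values typical of the gammatone/gammachirp literature, not values reported in this paper.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth of Eq. (3), f in Hz."""
    return 24.7 + 0.108 * f

def gammachirp_ir(fs, p0, k=4, m=1.019, r=-1.0, A=1.0, phi=0.0, dur=0.025):
    """Impulse response of Eq. (2); k, m and r are illustrative values."""
    t = np.arange(1, int(dur * fs)) / fs                          # t > 0
    env = A * t ** (k - 1) * np.exp(-2 * np.pi * m * erb(p0) * t)
    carrier = np.exp(1j * (2 * np.pi * p0 * t + r * np.log(t) + phi))
    return np.real(env * carrier)

# Example: impulse response of a channel centred at 500 Hz at fs = 20 kHz;
# the channel output is obtained by convolving the input with h.
h = gammachirp_ir(fs=20000, p0=500.0)
```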

ERB(\(p_{0}\)) is the equivalent rectangular bandwidth (ERB) of the gammachirp auditory filter centred at \(p_{0}\) (Wang and Brown 2006). The ERB is given by the following equation:

$$ERB\left( p \right) = 24.7 + 0.108p.$$
(3)

The ERB-rate scale is a logarithmic function that maps a frequency value to a number of ERBs. ERB-rate(p) is given as follows:

$$\mathrm{ERB}\text{-rate}(p) = 21.4\log_{10}\left(\frac{4.37}{1000}p + 1\right).$$
(4)
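
For illustration, the ERB and ERB-rate relations of Eqs. (3) and (4) can be used to place filter-bank centre frequencies uniformly on the ERB-rate scale; the inverse mapping and the channel count used below are assumptions on our part, as the paper does not state them.

```python
import numpy as np

def erb_rate(f):
    """ERB-rate scale of Eq. (4), f in Hz."""
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def erb_rate_inv(e):
    """Inverse of Eq. (4), derived here (not stated in the paper)."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def center_frequencies(fmin, fmax, n_channels):
    """Centre frequencies spaced uniformly on the ERB-rate scale."""
    e = np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels)
    return erb_rate_inv(e)

# Example: 32 channels between 50 Hz and 8 kHz.
print(center_frequencies(50.0, 8000.0, 32).round(1))
```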

The low and high channels have different time responses. Since the high-frequency channel responds faster than the low-frequency channel, a time alignment of the channels is needed.
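
The following sketch illustrates the two-channel formation described above on a single wideband signal; the Butterworth order, the zero-phase filtering (which sidesteps the alignment issue), and the direct high-pass split are simplifying assumptions, since the paper forms the channels from the gammachirp filter-bank outputs.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def make_channels(x, fs, cutoff=1000.0):
    """Low channel: low-pass below 1 kHz.  High channel: content above
    1 kHz, half-wave rectified and low-pass filtered to keep its envelope."""
    bl, al = butter(4, cutoff / (fs / 2.0), btype="low")
    bh, ah = butter(4, cutoff / (fs / 2.0), btype="high")
    low = filtfilt(bl, al, x)                          # low channel
    high = np.maximum(filtfilt(bh, ah, x), 0.0)        # half-wave rectification
    high = filtfilt(bl, al, high)                      # envelope of the high channel
    return low, high

# Example on a two-component test signal (200 Hz + 2.4 kHz).
fs = 20000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2400 * t)
low, high = make_channels(x, fs)
```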

2.2 Second stage: Multi-scale product in frequency domain

In recent years, the wavelet transform (WT) has been used successfully in many speech processing applications. The WT can analyze time–frequency characteristics of sound and track abrupt changes, which makes it a powerful tool for F0 determination. The WT shows whether details of a certain scale are present in a sound and quantifies their contribution. Generally, the WT offers good frequency resolution at low frequencies. Wavelets suitable for this task have several properties: vanishing moments to isolate the useful information, uncorrelated coefficients to reduce the temporal correlation, and compact support to ensure local analysis. The quadratic spline wavelet (QSW) was introduced by Mallat (1999). A wavelet appropriate for discontinuity detection is one that is the second derivative of a smoothing function, which is the case for the QSW. We denote a shifted and dilated version of the QSW by:

$$\Psi_{2^{j+1},\,i}(x) = \sqrt{\frac{1}{2^{j}}}\; \Psi\!\left(\frac{x - i}{2^{j}}\right)$$
(5)

where \(i\) and \(2^{j+1}\) represent the translation and dilation factors, respectively.

We use the dyadic WT, a special case of the continuous WT.

The dyadic WT is described by the equation below:

$$W_{2^{j+1},\,i} f(x) = \left(f * \Psi_{2^{j},\,i}\right)(x)$$
(6)

where j = −1, 0, and 1.

By using multi-scale products (mp), the extrema due to abrupt transitions are reinforced, because they appear in a correlated way across scales, while extrema due to noise are suppressed, because noise is mostly confined to the finest scales.

This idea was first applied to magnetic resonance images by Xu (1994). In our paper, we combine information from several scales to extract feature points.

The mp consists of the multiplication of the WT coefficients of the signal, \(Wf(i, s_{j})\), at successive dyadic scales, as follows:

$$mp(i) = \prod_{j} Wf(i, s_{j})$$
(7)

where \(Wf(i, s_{j})\) is the WT of the sound signal f at scale \(s_{j} = 2^{j}\), and the product runs over the three successive dyadic scales j = −1, 0, 1.
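
A compact sketch of Eqs. (6) and (7) follows. The quadratic spline wavelet is approximated here by the second derivative of a Gaussian, which shares the defining property used above (a second derivative of a smoothing function); the mapping of the dyadic scales to millisecond values is also an assumption on our part.

```python
import numpy as np

def d2_gaussian(scale, fs, half_width=4.0):
    """Second derivative of a Gaussian, used here as a stand-in for the
    quadratic spline wavelet (both are second derivatives of a smoothing
    function)."""
    sigma = scale / fs                                   # scale in samples -> seconds
    t = np.arange(-half_width * sigma, half_width * sigma, 1.0 / fs)
    g = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    psi = (t ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g
    return psi / np.sqrt(scale)                          # scale normalisation, cf. Eq. (5)

def multiscale_product(x, fs, scales_ms=(0.5, 1.0, 2.0)):
    """Eq. (7): point-wise product of the WT coefficients (Eq. (6), a
    convolution) at the dyadic scales 2^j, j = -1, 0, 1, here mapped to
    0.5, 1 and 2 ms (an assumed mapping)."""
    mp = np.ones_like(x)
    for s in scales_ms:
        psi = d2_gaussian(s * 1e-3 * fs, fs)
        w = np.convolve(x, psi, mode="same")             # Eq. (6): WT as convolution
        mp *= w
    return mp

# Example on a synthetic harmonic frame (F0 = 160 Hz).
fs = 20000
t = np.arange(2048) / fs
frame = sum(np.sin(2 * np.pi * k * 160 * t) for k in range(1, 6))
mp = multiscale_product(frame, fs)
```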

Figure 2 recapitulates the steps of the mp computation.

Fig. 2 Block diagram of the sound signal multi-scale product

In each output filter channel, the product mp(i) is divided into frames by multiplication with a sliding analysis window d(i):

$$mp_{d}(i, l) = mp(i)\, d(i - l\,\Delta i)$$
(8)

where Δi is the frame shift and l is the window index.

Then, we compute the fast Fourier transform (FFT) of the multi-scale product for each channel.
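
The framing and per-frame FFT of Eqs. (8) and (9) can be sketched as follows; the Hamming window is the one mentioned in the conclusion, while the 25.6 ms frame and 10 ms hop mirror the Keele reference segmentation and are assumptions on our part.

```python
import numpy as np

def framed_fft_of_mp(mp, fs, frame_ms=25.6, hop_ms=10.0):
    """Eqs. (8)-(9): split mp(i) into overlapping frames with a sliding
    Hamming window d(i) and take the magnitude FFT of each frame."""
    n = int(frame_ms * 1e-3 * fs)                        # frame length in samples
    hop = int(hop_ms * 1e-3 * fs)                        # frame shift in samples
    win = np.hamming(n)
    n_frames = max(1 + (len(mp) - n) // hop, 0)
    spectra = np.empty((n_frames, n // 2 + 1))
    for l in range(n_frames):
        block = mp[l * hop:l * hop + n] * win            # Eq. (8): windowed frame
        spectra[l] = np.abs(np.fft.rfft(block))          # Eq. (9): FFT magnitude
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    return freqs, spectra

# Usage (assuming mp from the multi-scale product stage):
# freqs, S = framed_fft_of_mp(mp, fs=20000)
```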

The effect of the FFT of mp analysis is shown in Figs. 3 and 4.

Fig. 3 Pitch estimation of speech sound. a Speech sound frame, b multi-scale product analysis (mp), c FFT of mp

Fig. 4 Pitch estimation of music sound. a Music sound frame, b multi-scale product analysis (mp), c FFT of mp

The FFT of each weighted block \(mp_{d}\) is given by:

$$MP_{d}^{l}(k) = \sum_{i=1}^{N} mp_{d}(i, l)\, e^{-j 2\pi i k / N}.$$
(9)

After this, we compute the harmonic summation on the FFT of the multi-scale product (HSMP) in the lth frame. It sums the contributions of all harmonic elements of a pitch candidate in each frame.

The HSMP of a pitch candidate in the lth frame is defined as:

$$HSMP(pc_{i}, t) = \sum_{m=1,\; m\,pc_{i} < p_{\max}} MP\big(h(m\,pc_{i}), t\big)$$
(10)

where \(pc_{i}\) is a pitch candidate, \(m\,pc_{i}\) is the frequency of its mth harmonic element, and \(p_{\max}\) is equal to 1700 Hz. The function \(h(m\,pc_{i})\) maps \(m\,pc_{i}\) to the centre frequency of the nearest bin of the FFT of the MP. The frequency that maximizes \(HSMP(pc_{i}, t)\) is then selected as the fundamental frequency.
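
A sketch of the harmonic summation of Eq. (10) and the final F0 selection is given below; sampling the candidates on the FFT bin grid is our choice, as the paper does not specify how the candidate set is formed.

```python
import numpy as np

def select_f0(freqs, spectrum, f0_min=50.0, f0_max=800.0, p_max=1700.0):
    """Eq. (10): harmonic summation over the FFT of the multi-scale product
    for one frame, then selection of the maximising candidate."""
    bin_width = freqs[1] - freqs[0]
    candidates = freqs[(freqs >= f0_min) & (freqs <= f0_max)]
    scores = np.zeros(len(candidates))
    for i, pc in enumerate(candidates):
        harmonics = np.arange(pc, p_max, pc)                   # m*pc below p_max = 1700 Hz
        bins = np.round(harmonics / bin_width).astype(int)     # h(m*pc): nearest FFT bin
        scores[i] = spectrum[bins].sum()
    return candidates[np.argmax(scores)]

# Example with a synthetic MP spectrum having harmonics of 90 Hz on exact bins.
freqs = np.fft.rfftfreq(2000, 1.0 / 20000)                     # 10 Hz bins
spec = np.zeros_like(freqs)
for k in range(1, 6):
    spec[int(90 * k / 10)] = 1.0 / k
print(select_f0(freqs, spec))                                  # -> 90.0
```

Note that when a sub-harmonic of the true F0 also lies in the search range, the unweighted sum can tie with the true candidate, so a weighting or tie-breaking rule may be needed in practice.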

Figure 3 shows a clean voiced speech frame followed by its multi-scale product (mp) and the FFT of the mp.

Figure 4 shows a monophonic music frame followed by its multi-scale product (mp) and the FFT of the mp.

Figures 3 and 4 show the efficiency of the MP method for pitch estimation. In Figs. 3c and 4c, the obtained signal exhibits spectral rays: the first corresponds to the fundamental frequency F0, and the following rays correspond to its harmonics.

Figure 5 shows a voiced speech frame corrupted by white noise at −5 dB SNR, followed by its multi-scale product (mp) and the FFT of the mp.

Fig. 5 Pitch estimation of speech sound corrupted by a −5 dB white noise. a Speech sound frame, b multi-scale product analysis (mp) of the noisy voiced speech, c FFT of mp

The mp in Fig. 5b attenuates the noise, leading to an FFT with clear maxima from which the F0 is determined (see Fig. 5c).

3 Experiments and evaluation

The performance of our pitch estimation approach is evaluated on the Keele database (Meyer et al. 1995) for speech and on the Musical Instrument Samples (University of Iowa 2012) for monophonic music. The Keele database contains recordings of 10 speakers sampled at 20 kHz. It provides a reference fundamental frequency track and a voiced/unvoiced segmentation based on 25.6 ms segments with a 10 ms overlap. The reference fundamental frequency of the Keele database is derived from a laryngograph signal recorded simultaneously with the speech. For all compared methods, the F0 is determined in each reference voicing frame.

The Musical Instrument Samples consist of 4000 notes and 150 minutes of sound played by twenty different musical instruments. All music signals were sampled at 44.1 kHz and down-sampled to 10 kHz. The notes are given in sequence along a chromatic scale. Each file usually covers one octave and is identified by the name of the instrument and its initial and final notes. The files of the musical database were separated into files containing a single note without silence; for this purpose, we used an automatic segmentation method and then checked the quality of the segmentation (Ben Messaoud et al. 2015).

We use the gross pitch error (GPE) criterion and the root mean square error (RMSE) to evaluate pitch estimation performance. A GPE is counted when the estimated fundamental frequency is more than 20 % higher or lower than the reference value. The RMSE is defined as the square root of the average squared estimation error, computed over the estimation errors smaller than the GPE threshold of 20 Hz; it is used to evaluate the speech sound.
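
For reference, a sketch of the two evaluation measures follows; it applies the 20 % relative threshold for the GPE and the 20 Hz threshold for the fine errors entering the RMSE, as stated above, and restricts both to voiced reference frames (an assumption about how unvoiced frames are handled).

```python
import numpy as np

def gpe_and_rmse(f0_est, f0_ref, gpe_tol=0.20, fine_tol_hz=20.0):
    """Gross pitch error rate (in %) and RMSE (in Hz) over voiced frames."""
    f0_est = np.asarray(f0_est, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    voiced = f0_ref > 0                                  # voiced reference frames only
    err = f0_est[voiced] - f0_ref[voiced]
    rel = np.abs(err) / f0_ref[voiced]
    gpe = 100.0 * np.mean(rel > gpe_tol)                 # estimates off by more than 20 %
    fine = err[np.abs(err) < fine_tol_hz]                # fine errors below the 20 Hz threshold
    rmse = np.sqrt(np.mean(fine ** 2)) if fine.size else 0.0
    return gpe, rmse

# Example: one octave error among five voiced frames.
ref = [110, 112, 115, 0, 230, 232]
est = [111, 113, 230, 0, 231, 233]
print(gpe_and_rmse(est, ref))                            # approximately (20.0, 1.0)
```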

For all compared approaches, we use a default pitch search range of 50–800 Hz for speech and 30–1700 Hz for music. Each method was asked to produce a fundamental frequency estimate every millisecond, using its default settings. We follow the recommendations suggested by the authors of the algorithms:

The SWIPE method matches the input spectrum to that of a sawtooth waveform in the frequency domain. [p,t] = swipe(x, fs, [50 800], 0.001, 1/96, 0.1, −Inf);

The TEMPO method uses the instantaneous frequency of the outputs of a filter-bank; it is tested only with monophonic music. f0raw = exstraightsource(x, fs);

The YIN algorithm is based on the normalized difference function and a parabolic interpolation. p.minf0 = 50; p.maxf0 = 800; p.hop = 20; p.sr = fs; r = yin(x, p);

The SHR method applies the subharmonic-to-harmonic ratio. [t, p] = shrp(x, fs, [50 800], 40, 1, 0.4, 1250, 0, 0);

3.1 Results in clean speech and monophonic music sound

Table 1 presents the GPE and RMSE measures of the proposed approach (GAMMA-MP), SWIPE (Camacho and Harris 2008), YIN (De Cheveigné and Kawahara 2002), and SHR (Sun 2000) on the speech database.

Table 1 Fundamental frequency estimation for speech sound

For all compared methods, the fundamental frequency is determined in exactly the same reference voicing frames of the Keele database.

GAMMA-MP shows a low GPE rate of 0.64 % and a low RMSE of 1.68 Hz; it is clearly more accurate than the other methods.

Table 2 reports the GPE, split into over-estimation and under-estimation, of the proposed approach (GAMMA-MP), SWIPE (Camacho and Harris 2008), TEMPO (Kawahara et al. 1999), YIN (De Cheveigné and Kawahara 2002), and SHR (Sun 2000) on the musical instrument database.

Table 2 Fundamental frequency estimation for musical instrument

Table 2 shows that GAMMA-MP has the lowest GPE in both over-estimation and under-estimation. SWIPE and YIN perform better than TEMPO, while SHR produces the largest GPE over the whole database.

Thus, GAMMA-MP appears as the most accurate approach for pitch estimation of musical instruments.

Table 3 presents the GPE results by instrument group. We classified the musical instruments into five groups. The bowed group contains cello, violin, double bass, and viola. The brass group contains bass, trumpet, trombones, tuba, and French horn. The plucked group contains violin and double bass. The woodwinds group contains clarinets, saxophones, and flutes. The last group contains the piano.

Table 3 Fundamental frequency estimation for musical instrument by instrument group

In Table 3, our approach performs better than the other methods except for the plucked group, for which TEMPO gives practically no error. On the other hand, the performance of SWIPE on the piano is relatively poor compared with the correlation-based algorithms. The brass group obtains the fewest GPE errors, whereas the bowed and plucked groups give the most; this may be caused by pizzicato sounds.

Table 4 shows the GPE for the musical instrument by octave.

Table 4 Fundamental frequency estimation for musical instrument by octave

As depicted in Table 4, the GAMMA-MP approach presents the best performance.

3.2 Results in noisy speech

To test the robustness of our algorithm, we add various background noises (white, babble, and vehicle) from the NOISEX-92 database (Varga 1993) to the Keele speech signals at three SNR levels.
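
The noisy test signals can be generated by mixing a noise recording into the clean speech at a target SNR, as sketched below; loading the NOISEX-92 recordings from file is omitted, and the synthetic example is for illustration only.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise recording into a speech signal at a target SNR (dB)."""
    noise = np.resize(noise, speech.shape)               # tile/truncate to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: corrupt a synthetic 200 Hz tone with white noise at -5 dB SNR.
rng = np.random.default_rng(0)
fs = 20000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)
noisy = add_noise(clean, rng.standard_normal(fs), snr_db=-5)
```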

Table 5 illustrates the GPE of GAMMA-MP, SWIPE, YIN and SHR methods in a noisy environment.

Table 5 Pitch estimation Performance of GPE in a noisy environment

As depicted in Table 5, when the SNR level decreases, our proposed approach remains robust, even at −5 dB in hard conditions.

As seen, the GPE of the SWIPE method degrades with babble and white noise at −5 dB. This can be explained by the fact that SWIPE does not handle weak voicing states well, such as the beginning and end of a voiced sound. Our proposed approach, however, gives the best performance in all cases, which is consistent with the behaviour illustrated in Fig. 5.

3.3 Computational complexity of our approach

The proposed approach uses only two channels and does not attempt to follow human frequency resolvability directly, yet it produces results comparable to those of elaborate multi-channel pitch analysis models. The computational demands of multi-channel F0 analysis models have prevented their application in practical cases (Meddis et al. 2010); their complexity is mostly determined by the number of channels in the auditory filter-bank. In this paper, we have therefore presented a model of pitch perception suitable for practical applications. Computational efficiency was assessed by testing our approach on a 2.13 GHz Core Duo processor.

Table 6 presents the obtained results. For every file, the total execution time of all stages is approximately 20 s.

Table 6 Execution time results of our approach

4 Conclusions

The proposed method (GAMMA-MP) estimates the fundamental frequency of speech and music sounds. It is based on a new auditory feature extraction technique combined with a multi-scale product analysis in the frequency domain. The auditory model applies the outer/middle ear filtering and simulates the behaviour of the cochlea in the frequency domain with a gammachirp filter-bank whose centre frequencies are selected according to the equivalent rectangular bandwidth scale. For the two channels, the resulting sound signal is divided into frames, and each frame is weighted by a Hamming window. Next, we compute the fast Fourier transform of the multi-scale product of each weighted frame. Finally, a harmonic summation technique is applied to determine the fundamental frequency F0. The experimental results show the efficiency of the proposed method for pitch estimation on a large speech and musical instrument database and its high accuracy compared with state-of-the-art methods. Future work may address the extension of the proposed method to F0 determination for multi-talker speech and polyphonic music.