1 Introduction

Pitch is among the most prominent parameters in speech. From a phonological point of view, pitch is responsible for intonation and accentuation, whereas on the acoustic side, it is quantified by the voiced/unvoiced (V/UV) decision and the \(f_{0}\) contour. Along with acoustic energy, it conveys most para-verbal content and may dramatically change the meaning of the verbal component, for instance by signaling an interrogative form or ironic intent. It is also a major component of emotion, a key human–machine communication mode which is still in its infancy, and can be used to diagnose several neuropsychological conditions such as early cognitive impairment or depression.

These application domains have received a strong boost in the past few years with the widespread diffusion of vocal assistants, vocal interfaces and personal health devices, and with the developments in collaborative and cooperative robotics, all of which call for detection methods that are effective, computationally lightweight, and more accurate than the state of the art.

Pitch detection is probably the speech processing problem that has attracted the most interest. Several techniques have been implemented over the last half century to provide an accurate measure of such a highly variable speech feature. Indeed, pitch depends on a variety of parameters, mainly the speaker’s gender and age and the language type, i.e., tonal or non-tonal. A classification of the main pitch detection techniques can be made according to the domain of analysis, whether temporal, spectral or time-frequency [12]. In [17], another classification is proposed, dividing pitch detection methods into event-detection techniques, like peak-picking and zero-crossing, and short-time average \(f_{0}\) detection techniques, such as cepstrum [28], autocorrelation [32], average magnitude difference functions (AMDF) [34], minimal distance methods [17] and harmonic analysis-based techniques [12, 16, 38]. As a common point, the aforementioned techniques are applied on short time frames, to reduce the effects of the non-stationarity of the speech signal. However, such short-time processing may lead to errors when estimating the pitch periods [19]. On the other hand, multiresolution analysis methods, such as the discrete wavelet transform (DWT), are also utilized to extract pitch [21]. Nevertheless, their performance is limited by their inherent shortcomings, mainly poor time-frequency resolution and spectral leakage, as noticed in [19].

To tackle these issues, another concept has emerged in the last two decades, based on techniques applied along the whole signal instead of short-time analysis. The majority of these techniques are based on the analysis of the instantaneous frequency (\(f_{i}\)), a theoretical concept. By definition, \(f_{i}\) is the time-derivative of the phase of the analytic signal, a complex signal obtained by the Hilbert transform [6]. However, using \(f_{i}\) values to extract the \(f_{0}\) contour still suffers from the lack of a direct/explicit relationship between the two quantities.

Therefore, a novel relationship, although still empirical, is proposed in this work, in order to (a) determine the voiced vs. unvoiced parts of the speech signal, and (b) extract the \(f_{0}\) contour from \(f_{i}\) values in the voiced parts. The remainder of this paper is organized as follows: Sect. 2 reviews the related work, with a focus on harmonic analysis-based and \(f_{i}\)-based pitch detection techniques; Sect. 3 presents the adopted method, including the proposed empirical relationship between \(f_{i}\) and \(f_{0}\) in the speech signal, and details the algorithm developed to extract \(f_{0}\) from \(f_{i}\) through this relationship. Section 4 presents the objective evaluation protocol and the results obtained by applying the proposed algorithm to clean and simulated noisy speech. Finally, the performance measures are commented on and discussed, with some proposals for improvement.

We note that the empirical relationship between \(f_{0}\) and \(f_{i}\) was already presented in [26], whereas the description of the proposed PDA has recently been accepted in [27]. The present paper contains an extended description of the utilized method, and especially a comprehensive evaluation, including not only the performance of the proposed algorithm with respect to its parameters, but also the comparison to state-of-the-art techniques, in various noise conditions and separately for male and female speakers.

2 Related Work

Even though pitch detection is a classical audio and speech processing problem, research in this field has never ceased. In fact, several challenges are still open, such as multi-pitch detection, accurate pitch detection in noisy environments and real-time pitch tracking. Two main directions are followed: (i) classical harmonic analysis based on the short-time Fourier transform (STFT) and (ii) instantaneous spectrum analysis based on the Hilbert transform (HT).

2.1 Pitch Detection by Harmonic Analysis

The use of harmonics to detect \(f_{0}\) in speech and music signals has been the key idea of several pitch detection algorithms, such as the subharmonic summation (SHS) [16], subharmonic-to-harmonic ratio (SHR) [38] and residual harmonics [12].

In [40], a harmonic model is applied to estimate voiced speech parameters; in particular, the fundamental frequency (\(f_{0}\)) is computed by maximum a posteriori estimation. In [45], an algorithm named YAAPT (Yet Another Algorithm for Pitch Tracking) leverages the STFT spectrum of the filtered squared signal to provide a primary estimate of \(f_{0}\). The \(f_{0}\) candidate values are then refined, and dynamic programming is applied to select the path through the series of \(f_{0}\) candidates that minimizes a cost function composed of a merit term and a transition term. This method has proved particularly robust to white and babble noise.

In [5], the BaNa algorithm is presented. This algorithm is qualified as hybrid since it combines two main pitch detection approaches, i.e., harmonic ratios, as in SHR [38], and cepstrum analysis [28]. The final pitch value is calculated using a Viterbi algorithm to decode the optimal path, where the cost function is defined as the sum of the log-ratio of each pair of \(f_{0}\) candidates plus a weighted confidence score. The evaluation of this method against state-of-the-art PDA’s such as PRAAT [7] and YIN [10] shows that it is more robust to noise.

In [42], a method based on the multi-band summary correlogram (MBSC) is developed for pitch detection in noisy environments. The input signal is filtered by multiple wide-band FIR filters. The envelope extracted in each frequency band is filtered again with a multi-channel comb filter, and a harmonic-to-subharmonic ratio (HSR) is computed. Then, the MBSC is computed on the samples selected by HSR. Finally, the pitch value retained is the candidate with the smallest MBSC.

In [14], a PDA named PEFAC (Pitch Estimation Filter with Amplitude Compression) has proved quite efficient on noisy speech, even at negative \(\text {SNR}\). This PDA proceeds by: (i) normalizing the signal to remove channel dependencies, (ii) attenuating strong noise components by harmonic summation and (iii) applying temporal continuity constraints to the selected pitch candidates.

More recently, [44] proposed harmonic enhancement to cope with missed and submerged harmonics in the spectrum before performing pitch detection; lately, [33] developed a spectrum-based PDA for singing voices. The latter method is based on detecting candidate harmonic partials using a random forest classifier. Each triplet of successive \(f_{0}\) candidates undergoes a harmonicity/inharmonicity check using adapted thresholds.

2.2 Instantaneous Frequency-Based Pitch Detection

Using the instantaneous frequency (\(f_{i}\)) for pitch detection is an alternative way to get around some problems of conventional methods, such as harmonic analysis and multiresolution analysis. In fact, the \(f_{i}\) pattern can be analyzed continuously along the whole signal, which avoids some constraints, such as short-time analysis, usually required to reduce the effect of the non-stationarity of the speech signal, and wavelet scale adjustment, necessary to enhance the time-frequency resolution [19].

Even though instantaneous analysis is less frequently used for pitch detection, a few PDA’s based on \(f_{i}\) analysis were proposed in the literature [1, 2, 18, 31] and more recently in [22], with valuable performance. These \(f_{i}\)-based methods extract the \(f_{0}\) contour as a continuous function of time in voiced regions. For instance, Qiu et al. [31] proceed as follows: First, the harmonics are attenuated using a band-pass filterbank; secondly, the discrete instantaneous frequency (DIF) is estimated at different scales of the band-pass filterbank; and finally, the V/UV decision is taken upon certain criteria related to: (i) the DIF value (unvoiced if \(\text {DIF}\le 50~\hbox {Hz}\) or \(\text {DIF}\ge 500~\hbox {Hz}\)), (ii) the variation between neighboring DIF’s (unvoiced if \(\varDelta (\text {DIF})\ge 1.4~\hbox {Hz}\)), or (iii) the duration of sustained DIF (unvoiced if it is less than \(20~\hbox {ms}\)). However, this technique is likely to cause pitch halving/doubling, where the low harmonics, i.e., the multiples of \(f_{0}\) below \(500~\hbox {Hz}\), could also be taken for \(f_{0}\) values. To cope with this issue, multiple scales of the filterbank are used, and the smallest non-zero DIF is retained as the \(f_{0}\) value.

In [1], Abe et al. used the \(f_{i}\) pattern to extract \(f_{0}\) by tracking the harmonics. To this end, the signal is decomposed into harmonic components by applying a filterbank with a variable center frequency. The \(f_{i}\) values of each component are then considered as the harmonic pattern. Finally, the lowest \(f_{i}\) pattern, i.e., the lowest harmonic, is retained as the \(f_{0}\) contour [1]. Continuing this work, the same authors proposed in [2] an IF-based method where the IF is extracted from the STFT spectrum to enhance harmonics by suppressing aperiodic components. This method was reported to perform well in the presence of noise, in comparison to the contemporaneous state of the art, such as the dynamic programming-based cepstrum method proposed by [23].

In [18], the Hilbert–Huang transform (HHT) is applied to detect pitch from the \(f_{i}\) pattern. HHT is a twofold process: first, empirical mode decomposition (EMD) decomposes the signal into intrinsic mode functions (IMF) through an iterative process called sifting; then, each resulting IMF is characterized by its instantaneous frequency (\(f_{i}\)) and its instantaneous amplitude (\(A_{i}\)). After extracting all IMF’s, \(f_{0}\) and the V/UV decision are estimated by first filtering all IMF’s: only \(f_{i}\) values between \(50~\hbox {Hz}\) and \(600~\hbox {Hz}\) are kept, and \(f_{i}\) values are set to zero if \(\varDelta f_{i}\ge 100~\hbox {Hz}\) within a \(5~\hbox {ms}\) frame or when the instantaneous amplitude \(A_{i}(t) \le \frac{\max (A_{i})}{10}\). At each instant, the \(f_{i}\) value corresponding to the highest \(A_{i}\) value across all IMF’s is retained as the \(f_{0}\) value. Finally, the extracted \(f_{0}\) contour is merged and smoothed by post-filtering.

More recently, [22] leveraged aperiodicity bands and short-time Fourier transform (STFT)-based channel-wise instantaneous frequency, defined as in (1)

$$\begin{aligned} f_{i}(t,\omega )=\frac{1}{2\pi }\frac{\partial \Phi (t,\omega )}{\partial t} \end{aligned}$$
(1)

where \(\Phi (t,\omega )\) is the STFT phase spectrum, to estimate \(f_{0}\) for speech signal synthesis, using a three-stage process. In the first stage, aperiodicity bands are detected using a wavelet-based analysis filter with a highly selective temporal and spectral envelope; in this stage, the instantaneous frequency is filtered to yield a periodicity probability map. The second stage generates a first estimate of the \(f_{0}\) trajectory from the periodicity probability map and signal power information. Finally, the third stage refines the estimated \(f_{0}\) trajectory using the deviation measure of each harmonic component and \(f_{0}\) time warping. It is worth noting that this PDA has been included in Google’s vocoder named YANG (Yet ANother Generalized Vocoder) [3].

2.3 Interaction Between Fundamental Frequency and Instantaneous Frequency

Most of the aforementioned \(f_{i}\)-based pitch extraction techniques have compared favorably to state-of-the-art methods, yielding very accurate V/UV decisions and \(f_{0}\) values, which shows that using \(f_{i}\) is a good alternative for extracting \(f_{0}\) without worrying about the non-stationarity of the speech signal. Nevertheless, these methods are mostly based on empirical hypotheses, such as considering \(f_{0}\) as a filtered discrete instantaneous frequency [31], as the lowest harmonic [1], or as the instantaneous frequency matching the greatest instantaneous amplitude among the intrinsic mode functions (IMF) extracted from the signal by empirical mode decomposition (EMD) [18]. Thus, none of these techniques is based on a direct or explicit relationship between \(f_{i}\) and \(f_{0}\), even though in each case the \(f_{0}\) contour is extracted from \(f_{i}\) values.

Such a relationship could fill the gap between accurate empirical methods and the lack of a theoretical link between the two quantities, i.e., \(f_{0}\) and \(f_{i}\). It should be noted that in [24], some interaction between the Fourier transform and the Hilbert transform is established for harmonic signals. Actually, [24] states that the Fourier transform (\(\text {FT}\)) weakly generates the Hilbert transform (\(\text {HT}\)) for any well-defined function g(x), as follows:

$$\begin{aligned} \text {FT}(\text {HT}(g(x)))=\text {i}\sigma (x)\text {FT}(g(x)) \end{aligned}$$
(2)

where \(\text {i}\) is the imaginary unit and \(\sigma (\cdot )\) is the sign function. However, (2) is not enough to generalize the relationship between \(f_{0}\) and \(f_{i}\), for two major reasons: First, the speech signal is far from being strictly harmonic (notwithstanding the possibility to model speech by a harmonic-plus-noise model (HNM) [36]); and secondly, the \(f_{0}\) contour is not continuous over the speech signal, due to the binary voiced/unvoiced (V/UV) decision.
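As a quick numerical illustration of (2) (not part of the original derivation in [24]), the following Python sketch compares the spectrum of \(\text {HT}(g)\) with the sign-weighted spectrum of g for a random signal. Note that SciPy's analytic-signal convention yields a factor \(-\text {i}\,\sigma (\omega )\); the sign depends on the adopted FT convention.

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
g = rng.standard_normal(1024)

# HT(g) is the imaginary part of the analytic signal z = g + j*HT(g)
ht_g = np.imag(hilbert(g))

lhs = np.fft.fft(ht_g)                                   # FT(HT(g))
rhs = -1j * np.sign(np.fft.fftfreq(g.size)) * np.fft.fft(g)

mask = np.ones(g.size, dtype=bool)
mask[[0, g.size // 2]] = False                           # DC/Nyquist bins carry no sign
print(np.allclose(lhs[mask], rhs[mask]))                 # True
```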

Also, Shimauchi et al. [35] have recently established an explicit relationship between the STFT magnitude spectrum and the channel-wise instantaneous frequency, defined by (1), such that:

$$\begin{aligned} f_{i}(t,\omega )=\frac{1}{2\pi \sigma ^2} \frac{\partial \log (A(t,\omega ))}{\partial \omega }+\frac{\omega }{2\pi }, \end{aligned}$$
(3)

where \(\Phi (t,\omega )\) and \(A(t,\omega )\) are the STFT phase and magnitude, respectively, and t and \(\omega \) are the time and the angular frequency in the Fourier domain, respectively. This relationship has been used for phase retrieval, i.e., phase estimation given the magnitude spectrum only. Nevertheless, it does not lead to a direct/explicit relationship between the STFT-based channel-wise instantaneous frequency and the Hilbert transform-based one.

3 Method

In this work, a direct relationship is proposed, albeit still an empirical one. It relies on the same assumptions used in the aforementioned \(f_{i}\)-based techniques, and is then used to implement an algorithm able to: (a) determine the voiced/unvoiced parts and (b) extract the \(f_{0}\) contour from \(f_{i}\) values in the voiced regions.

3.1 Computation of Instantaneous Frequency

The instantaneous frequency (\(f_{i}\)) is a theoretical concept defined as the time-derivative of the phase of the analytic signal z(t), the complex signal given by:

$$\begin{aligned} z(t)=s(t)+j s_{H}(t) = a(t)e^{\,j\phi (t)}, \end{aligned}$$
(4)

where

$$\begin{aligned} s_{H}(t)=\text {HT}(s(t))=\text {pv}\left( \int _{-\infty }^{+\infty }\frac{s(t-\tau )}{\pi \tau }\,\mathrm{d}\tau \,\right) . \end{aligned}$$
(5)

\(\text {HT}\) and \(\text {pv}\) denote the Hilbert transform and the Cauchy principal value, respectively, whereas a(t) and \(\phi (t)\) are the instantaneous amplitude and the instantaneous phase, respectively. Since z(t) is unique for a given s(t) [13], we have:

$$\begin{aligned} s(t)=a(t)\cos (\phi (t)). \end{aligned}$$
(6)

Since no restrictions are required concerning the stationarity or the linearity of the system that generates s(t), (6) is valid for any natural signal. In [6], based on the earlier works of [30, 43], the generalized instantaneous phase \(\phi (t)\) can be written as:

$$\begin{aligned} \phi (t)=2\pi \int _{0}^{t}f(\tau )\,\mathrm{d}\tau . \end{aligned}$$
(7)

It is obvious that \(\phi (t)\) reduces to the classical formula \(\phi (t)~=~2\pi ft+\phi _{0}\) in the case of a simple harmonic signal. Hence, the instantaneous frequency \(f_{i}\) can be defined as the time-derivative of the instantaneous phase \(\phi \), as in (8), based on [6]:

$$\begin{aligned} f_{i}(t)=\frac{1}{2\pi }\frac{\mathrm{d}\phi (t)}{\mathrm{d}t}=\frac{1}{2\pi }\frac{\mathrm{d}\arg (z(t))}{\mathrm{d}t}. \end{aligned}$$
(8)

For discrete signals, \(f_{i}\) is easily calculated by (9), where z(n) is the associated discrete analytic signal and \(f_{s}\) is the sampling frequency (for \(n \ge 1\)):

$$\begin{aligned} f_{i}(n)=\frac{f_{s}}{4\pi }(\arg (z(n+1))-\arg (z(n-1))). \end{aligned}$$
(9)
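As an illustration, a minimal Python sketch of (4)–(9): the analytic signal is obtained with SciPy's Hilbert transform and \(f_{i}\) with a central difference of the unwrapped phase (the function and parameter names are ours):

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(s, fs):
    """Compute f_i(n) for a real discrete signal s sampled at fs, as in (9)."""
    z = hilbert(s)                     # analytic signal z(n), cf. (4)-(5)
    phi = np.unwrap(np.angle(z))       # instantaneous phase arg(z(n))
    f_i = np.empty_like(phi)
    f_i[1:-1] = fs / (4 * np.pi) * (phi[2:] - phi[:-2])  # central difference, (9)
    f_i[0], f_i[-1] = f_i[1], f_i[-2]  # replicate the border values
    return f_i
```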

3.2 Proposed Empirical Relationship Between Pitch and Instantaneous Frequency

In spite of the absence of a direct relationship between \(f_{i}\) and \(f_{0}\), both types of frequency share a common point, which is continuity over time, at least in the regions where the \(f_{0}\) contour is defined, such as the voiced parts of a speech signal. This suggests that in such a region, the observed instantaneous frequency can be a relative integer multiple of \(f_{0}\), with or without some residual frequency (note that in (9), \(f_{i}\) can be negative). Starting from this assumption, some working notations are defined in the following, with the sole aim of describing the proposed method.

3.2.1 Instantaneous Pitch

It is defined as the value of \(f_{0}\) at every discrete instant n inside the voiced regions only (\(f_{0}\) is undefined in unvoiced segments). This differs from conventional PDA’s, where pitch is usually obtained as one value per frame and the \(f_{0}\) contour is then obtained by interpolation.

3.2.2 Instantaneous Pitch Multiples

They are defined at each instant n as the positive integer multiples of instantaneous pitch \(f_{0}(n)\) below \(|f_{i}(n)|\). The highest instantaneous multiple is defined as the closest one to \(|f_{i}(n)|\). Consequently, the maximum order of instantaneous pitch multiples, denoted \(H_{\max }(n)\), is defined as:

$$\begin{aligned} H_{\max }(n)=\left\lfloor \frac{|f_{i}(n)|}{f_{0}(n)}\right\rfloor . \end{aligned}$$
(10)

We also highlight that in this particular case and for mathematical rigor, we avoided using the term harmonics to refer to pitch multiples for the following reasons: (a) Harmonics are related to Fourier transform, whereas \(f_{i}\) is obtained by Hilbert transform; (b) to the best of our knowledge, no explicit relationship has been proved so far between \(f_{0}\) and \(f_{i}\), even though some interaction may exist in harmonic signals [24].

3.2.3 Instantaneous Residual Frequency

It is defined as the difference between \(|f_{i}|\) and an instantaneous pitch multiple:

$$\begin{aligned} f_{ir}(n)=|f_{i}(n)|-H(n)f_{0}(n)~\forall ~H(n)\le H_{\max }(n), \end{aligned}$$
(11)

where \(1\le H(n) \le H_{\max }(n)\) are the orders of the instantaneous pitch multiples at time n.

3.3 Estimation of Instantaneous Pitch from Residual Instantaneous Frequency

It is obvious that for the highest instantaneous pitch multiple order \(H_{\max }(n)\), the residual instantaneous frequency \(f_{ir}\) is minimal and we have:

$$\begin{aligned} f_{ir}(n)\le f_{0}(n). \end{aligned}$$

In this particular case, we empirically notice that the \(f_{0}\) contour can be obtained as the upper envelope of the instantaneous residual frequency \(f_{ir}\). This envelope is calculated on overlapping frames of small duration (less than 40 ms):

$$\begin{aligned} f_{0, \text {est}}(n_{k})=\max _{n_{k}-\frac{L}{2}\le l < n_{k}+\frac{L}{2}} f_{ir}(l), \end{aligned}$$
(12)

where \(n_{k}\) and L are the center and the length of the kth frame, respectively.
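The following sketch illustrates (10)–(12) in the validation setting of the next paragraphs, where the ground-truth \(f_{0}\) of a voiced segment is assumed known, positive and aligned with \(f_{i}\); the frame parameters are illustrative choices below the 40 ms bound:

```python
import numpy as np

def f0_envelope(f_i, f0, fs, frame_s=0.032, shift_s=0.010):
    """Residual instantaneous frequency (11) with H = H_max from (10),
    and its framewise upper envelope (12)."""
    f_abs = np.abs(f_i)
    h_max = np.floor(f_abs / f0)                  # (10), f0 > 0 assumed
    f_ir = f_abs - h_max * f0                     # (11)
    L, hop = int(frame_s * fs), int(shift_s * fs)
    centers = np.arange(L // 2, f_ir.size - L // 2, hop)
    f0_est = np.array([f_ir[c - L // 2 : c + L // 2].max()
                       for c in centers])         # (12)
    return centers, f0_est
```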

Fig. 1

Example of the established relationship between ground-truth \(f_{0}\) and the instantaneous frequency \(f_{i}\) extracted using (4)–(9)

Figure 1 shows the results for a speech signal, as follows:

  • Figure 1 (top plot) shows the instantaneous frequency (\(f_{i}\)) of a speech signal, calculated by (9).

  • Figure 1 (middle plot) illustrates the residual instantaneous frequency (\(f_{ir}\)) and its envelope, calculated by (11) and (12), respectively.

  • In Fig. 1 (bottom plot), the ground-truth \(f_{0}\) contour is superposed on the envelope of \(f_{ir}\), showing a quasi-superposition of the two patterns. This leads us to consider the envelope of \(f_{ir}\) as an estimate of \(f_{0}\), as proposed in (12).

To validate the results given by (12), the ground-truth \(f_{0}\) values provided by the standard pitch tracking database PTDB-TUG [29] were utilized. The ground-truth \(f_{0}\) contour was first aligned to the instantaneous frequency \(f_{i}\); then the residual frequency \(f_{ir}\) and the estimated fundamental frequency \(f_{0,\mathrm{est}}\) were calculated using (10)–(12) for different values of the order of instantaneous pitch multiples H(n). This confirms that maximizing H(n), i.e., using the \(H_{\max }(n)\) obtained by (10) in (11), improves the superposition between the ground-truth \(f_{0}\) and the \(f_{0,\mathrm{est}}\) given by (12), as shown in Fig. 2.

To further check the validity of this empirical result, the root mean square error (RMSE) between the two contours, i.e., the ground-truth \(f_{0}\) and the \(f_{0,\mathrm{est}}\) contour, was measured for a large subset of signals from the PTDB-TUG database [29]. In addition, the areas covered by the two contours were compared, to confirm their superposition (cf. Table 1).

The test signals correspond to 400 randomly selected speech signals, uttered by 10 male and 10 female speakers. The results in Table 1 show that increasing the maximum order of instantaneous pitch multiples \(H_{\max }\) in (10)–(12) makes the difference between the ground-truth \(f_{0}\) contour and the estimated contour \(f_{0,\mathrm{est}}\) small enough to consider them superposed. However, since the ground-truth \(f_{0}\) is already used to calculate \(f_{0,\mathrm{est}}\) (cf. (10)–(12)), there is a recursive relationship between the two, so the problem becomes how to extract \(f_{0,\mathrm{est}}\) directly from the instantaneous frequency \(f_{i}\), such that it approximates the ground-truth \(f_{0}\).
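For completeness, a sketch of the Table 1 comparison between two aligned contours; the covered-area error below is our interpretation (relative difference of the areas under the contours), as its exact definition is not reproduced here:

```python
import numpy as np

def contour_errors(f0_ref, f0_est):
    """RMSE and (assumed) covered-area error between aligned f0 contours."""
    rmse = np.sqrt(np.mean((f0_ref - f0_est) ** 2))
    area_err = abs(f0_ref.sum() - f0_est.sum()) / f0_ref.sum()
    return rmse, area_err
```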

Fig. 2

Estimated \(f_{0}\) contour (dotted line) versus ground-truth \(f_{0}\) (continuous line) for different orders of instantaneous pitch multiples (\(H_{\max }\)) using (10)–(12)

Table 1 Covered area error and root mean square error (RMSE) between the contours of ground-truth \(f_{0}\) and \(f_{0,\mathrm{est}}\) estimated using (10)–(12)

3.4 Proposed Pitch Detection Algorithm

To extract \(f_{0}\) from the instantaneous frequency \(f_{i}\) using (10)–(12), the following algorithm is proposed. The algorithm is divided into three main steps: (a) preprocessing, where the instantaneous frequency is calculated using (4)–(9) and the V/UV decision is evaluated; (b) \(f_{0}\) extraction using (10)–(12); and (c) postprocessing, where the obtained \(f_{0}\) contour is smoothed and segmented into voiced and unvoiced parts. The pseudocode is detailed in Algorithms 1–3, whereas Table 2 lists the settings of the parameters and thresholds used; minimal code sketches of Steps 1 and 2 are given after the listing below.

Step 1: Preprocessing

  • Initialization

    1. Extract \(f_{i}\) from a digital speech signal using (4)–(9).

    2. Set the range of minimum and maximum \(f_{0}\) values \([f_{0_\text {min}},f_{0_\text {max}}]\).

    3. Set the sweeping step of \(f_{0}\) candidates (\(f_{0_\text {cand}}\)) within the range \([f_{0_\text {min}},f_{0_\text {max}}]\).

  • V/UV decision

    4. At each time n, calculate the differential instantaneous frequency, defined as

       $$\begin{aligned} \varDelta f_{i}(n)~=~(f_{i}(n+1)-f_{i}(n-1))/2. \end{aligned}$$

       If \(\varDelta f_{i}(n)~\ge ~Th_{1}\), the point n is considered unvoiced (cf. Table 2), otherwise voiced.

    5. If the ratio of points marked as voiced within a frame is higher than the threshold \(Th_{2}\) (cf. Table 2), the whole frame is marked as voiced, otherwise unvoiced.

Step 2: \(f_{0}\) extraction

    6. Fix a set of \(M\ge 1\) values of \(f_{0}\) candidates, equally spaced by \(f_{0,step}\) and ranging between \(f_{0_\text {min}}\) and \(f_{0_\text {max}}\) (cf. Table 2): \(f_m = (f_{0_\text {max}}-f_{0_\text {min}})(m-1)/(M-1)+f_{0_\text {min}},\) \(m=1,\dots ,M\).

    7. Set the maximum order of instantaneous pitch multiples \(H_{\max }\) to be calculated at each instant n.

    8. For each instant n, calculate the vector of orders of instantaneous pitch multiples \(1 \le (H_{m})_{m=1,\dots ,M} \le H_{\max }\) corresponding to each \(f_{0}\) candidate value \(f_{0_\mathrm{cand}}(n,m)\), such that

       $$\begin{aligned} H_{\mathrm{max},m}(n)=\min \left( H_{\max }, \left\lfloor \frac{|f_{i}(n)|}{f_{0_\mathrm{cand}}(n,m)}\right\rfloor \right) . \end{aligned}$$
    9. For each \(f_{0}\) candidate value \(f_{0_\mathrm{cand}}(n,m)\) and each corresponding maximum multiple order \(H_{\mathrm{max},m}(n)\), calculate the instantaneous residual frequency \((f_{ir}(n,m))_{m=1,\dots ,M}\), using (11).

    10. Calculate the value of \(f_{0_\mathrm{cand}}(n,\hat{m})\) at instant n such that

        $$\begin{aligned} \hat{m}=\arg \underset{m=1\dots M}{\min }(|f_{ir}(n,m)-f_{0_\mathrm{cand}}(n,m)|). \end{aligned}$$
    11. If \(|f_{ir}(n,\hat{m})-f_{0_\mathrm{cand}}(n,\hat{m})|~\le ~Th_{3}\) (cf. Table 2), then \(f_{0_\mathrm{cand}}(n,\hat{m})\) is kept as a potential \(f_{0}\) value at point n.

    12. For each set of potential \(f_{0}\) values kept at time n, i.e., \(\{f_{0_\mathrm{cand}}(n,\hat{m})\}_{\hat{m}=1,\dots ,\hat{M}}\), if a subset of values are multiples of other ones, keep only the lowest value within this subset; e.g., if {80 Hz, 160 Hz, 240 Hz} and {90 Hz, 180 Hz, 270 Hz} satisfy the conditions of passes 8–11 of Step 2, then the kept \(f_{0}\) candidates are {80 Hz, 90 Hz}. Note that, to bypass strict numerical inaccuracies, a kept \(f_{0}\) candidate value \(f_{0_\mathrm{cand}}(n,\hat{m}_{2})\) is considered an integer multiple of a smaller \(f_{0_\mathrm{cand}}(n,\hat{m}_{1})\) if \(\mod \left( \frac{f_{0_\mathrm{cand}}(n,\hat{m}_{2})}{f_{0_\mathrm{cand}}(n,\hat{m}_{1})}\right) ~<~Th_{4}\) (cf. Table 2).

    13. At the end of this process, if more than one \(f_{0}\) candidate value (\(\hat{M}~>~1\)) at point n still satisfies the conditions above, choose the \(f_{0}\) candidate whose highest multiple is the closest to \(|f_{i}(n)|\), i.e.,

        $$\begin{aligned} f_{0}(n)=\arg \underset{\hat{m}=1\dots \hat{M}}{\min }\left( ~\mod ~\left( \frac{|f_{i}(n)|}{f_{0_\mathrm{cand}}(n,\hat{m})}\right) \right) . \end{aligned}$$

Step 3: Postprocessing

    14. Smoothing: Apply a smoothing filter, i.e., median or linear, to the extracted \(f_{0}\) values to smooth the obtained \(f_{0}\) contour.

    15. V/UV segmentation: Apply element-wise multiplication of the smoothed \(f_{0}\) contour and the voiced/unvoiced vector obtained at Step 1, to set \(f_{0}\) to zero in the unvoiced frames.
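As announced above, a minimal Python sketch of the V/UV decision of Step 1 (passes 4–5); taking the absolute value of the differential is our assumption, and the threshold values stand in for the tuned settings of Table 2:

```python
import numpy as np

def vuv_decision(f_i, fs, th1=50.0, th2=0.5, frame_s=0.032, shift_s=0.010):
    """Frame-wise V/UV decision from f_i (Step 1, passes 4-5).
    th1 (Hz) and th2 (ratio) are illustrative, cf. Table 2."""
    d_fi = np.zeros_like(f_i)
    d_fi[1:-1] = (f_i[2:] - f_i[:-2]) / 2             # pass 4
    voiced_pt = np.abs(d_fi) < th1                    # point-wise decision
    L, hop = int(frame_s * fs), int(shift_s * fs)
    return np.array([voiced_pt[k:k + L].mean() > th2  # pass 5
                     for k in range(0, f_i.size - L + 1, hop)])
```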

Algorithms 1–3 (pseudocode)
Table 2 Specific parameters and thresholds of the proposed algorithm
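Similarly, a condensed sketch of the \(f_{0}\) extraction of Step 2 (passes 6–13) for a single instant n; the candidate grid, the \(H_{\max }\) cap and the thresholds \(Th_{3}\), \(Th_{4}\) are illustrative placeholders, not the tuned values of Table 2:

```python
import numpy as np

def extract_f0_at(fi_n, f0_min=80.0, f0_max=400.0, f0_step=1.0,
                  h_cap=50, th3=10.0, th4=0.05):
    """f0 extraction at a single instant n from f_i(n) (Step 2, passes 6-13)."""
    cands = np.arange(f0_min, f0_max + f0_step, f0_step)   # pass 6
    h = np.minimum(h_cap, np.floor(np.abs(fi_n) / cands))  # passes 7-8
    f_ir = np.abs(fi_n) - h * cands                        # pass 9, cf. (11)
    kept = cands[np.abs(f_ir - cands) <= th3]              # passes 10-11
    if kept.size == 0:
        return 0.0                                         # no f0 at this instant
    # pass 12: drop candidates that are near-integer multiples of a smaller one
    kept = [f for f in kept
            if not any(f > g and round(f / g) >= 2
                       and abs(f / g - round(f / g)) < th4 for g in kept)]
    # pass 13: candidate whose highest multiple is the closest to |f_i(n)|
    return min(kept, key=lambda f: (abs(fi_n) / f) % 1.0)
```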

4 Objective Evaluation

Before undertaking the objective evaluation, an experimental protocol was set up to meet the requirements of such an evaluation, following recent PDA reviews [20, 37].

Table 3 Description of the PDA’s used for benchmarking

4.1 Evaluation Protocol

  1. Select a random subset from the standard pitch tracking database, PTDB-TUG [29], containing 400 signals equally divided between the 10 male and the 10 female speakers of the database. This database also provides ground-truth \(f_{0}\), extracted from the high-pass-filtered laryngograph (LAR) signals.

  2. Mix the evaluation wave files, initially containing clean speech, with babble noise and Gaussian white noise at different \(\text {SNR}\) levels, ranging from \(20~\hbox {dB}\) down to \(0~\hbox {dB}\), to obtain simulated noisy speech signals.

  3. Extract the \(f_{0}\) contour from the input signals using state-of-the-art PDA’s, namely PRAAT [7], RAPT [41], SWIPE [8], YIN [10], SHR [38], YANG [22], and finally the proposed algorithm (Prop.) [25]. More details about these PDA’s are given in Table 3. These PDA’s have been selected for benchmarking based on their high performance for clean and noisy speech in a recent review [20].

  4. For each pair of ground-truth and extracted \(f_{0}\) contours, calculate the standard metrics used in pitch detection evaluation, i.e., V/UV decision error (\(\text {VDE}\) (%)), gross pitch error (\(\text {GPE}\) (%)), \(f_{0}\) frame error (\(\text {FFE}\) (%)) and fine pitch error (\(\text {FPE}\) (cents)) [9, 12]. These measures are calculated as follows [9]:

  • V/UV decision error (\(\text {VDE}(\%)\)): The rate of misclassified V/UV decisions, i.e., the sum of \(\text {V}\rightarrow \text {U}(\%)\) and \(\text {U}\rightarrow \text {V}(\%)\), which are, respectively, the rate of voiced frames detected as unvoiced (false negatives) and the rate of unvoiced frames detected as voiced (false positives).

    $$\begin{aligned} \text {VDE}(\%)=\text {V}\rightarrow \text {U}(\%)+\text {U}\rightarrow \text {V}(\%), \end{aligned}$$
    (13)

    where

    $$\begin{aligned} \text {V}\rightarrow \text {U}(\%)=\frac{\text {N}_{\text {V} \rightarrow \text {U}}}{\text {N}}\times 100, \end{aligned}$$
    (14)

    and

    $$\begin{aligned} \text {U}\rightarrow \text {V}(\%)=\frac{\text {N}_{\text {U} \rightarrow \text {V}}}{\text {N}}\times 100. \end{aligned}$$
    (15)

    \(\text {N}_{\text {V}\rightarrow \text {U}}\), \(\text {N}_{\text {U}\rightarrow \text {V}}\) and \(\text {N}\) are the number of false negatives, false positives and total frames, respectively.

  • Gross pitch error (\(\text {GPE}(\%)\)): The rate of frames correctly detected as voiced for which the relative error between the ground-truth \(f_{0}\) and \(f_{0,\mathrm{est}}\) is higher than \(20\%\), among the total true positives \(\text {N}_{\text {V} \rightarrow \text {V}}\):

    $$\begin{aligned} \text {GPE}(\%)=\frac{\text {N}_{\text {GPE}}}{\text {N}_{\text {V} \rightarrow \text {V}}}\times 100. \end{aligned}$$
    (16)
  • \(f_{0}\) frame error (\(\text {FFE}(\%)\)): The rate of frames affected either by a \(\text {VDE}\) error, i.e., \(\text {N}_{\text {V} \rightarrow \text {U}}+\text {N}_{\text {U} \rightarrow \text {V}}\), or by a \(\text {GPE}\) error, i.e., \(\text {N}_{\text {GPE}}\), among all frames:

    $$\begin{aligned} \text {FFE}(\%)=\text {VDE}(\%)+\frac{\text {N}_{\text {V}\rightarrow \text {V}}}{\text {N}}\times \text {GPE}(\%). \end{aligned}$$
    (17)
  • Fine pitch error (FPE(\(\text {cents}\))): The standard deviation of the relative error (in cents) of pitch values in the voiced frames where there is no gross pitch error:

    $$\begin{aligned} \text {FPE(cents)}=\sqrt{\frac{\sum _{i=1}^{\text {N}_{\text {FPE}}}(\epsilon _{i}-\overline{\epsilon })^{2}}{\text {N}_{\text {FPE}}}}\times 100 \end{aligned}$$
    (18)

    \(\text {N}_{\text {FPE}}\) is the number of true positives without gross pitch error, \(\epsilon _{i}\) is the relative error between the ground-truth \(f_{0}\) and \(f_{0,\mathrm{est}}\) for such a frame, and \(\overline{\epsilon }\) is the mean of this error over all concerned frames.
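A code restatement of (13)–(18) for two aligned frame-wise contours (0 marking unvoiced frames), with FPE computed exactly as in (18), i.e., 100 times the standard deviation of the relative error:

```python
import numpy as np

def pitch_error_measures(f0_ref, f0_est):
    """VDE, GPE, FFE (%) and FPE, cf. (13)-(18)."""
    v_ref, v_est = f0_ref > 0, f0_est > 0
    n = f0_ref.size
    v2u = 100 * np.sum(v_ref & ~v_est) / n          # (14)
    u2v = 100 * np.sum(~v_ref & v_est) / n          # (15)
    vde = v2u + u2v                                 # (13)
    tp = v_ref & v_est                              # true positives
    rel = np.abs(f0_est[tp] - f0_ref[tp]) / f0_ref[tp]
    gpe = 100 * np.sum(rel > 0.20) / tp.sum()       # (16)
    ffe = vde + tp.sum() / n * gpe                  # (17)
    fpe = 100 * np.std(rel[rel <= 0.20])            # (18)
    return vde, gpe, ffe, fpe
```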

Table 4 Average pitch error measures of benchmarking PDA’s for clean speech over all speakers
Table 5 Average pitch error measures of the proposed algorithm for noisy speech over all speakers

4.2 Evaluation Results

For coherence of the error measures, the same frame and shift durations used for the extraction of the ground-truth \(f_{0}\) were set for all the evaluated algorithms, i.e., \(32~\hbox {ms}\) and \(10~\hbox {ms}\), respectively. Also, the same \(f_{0}\) boundaries were set, i.e., \([80~\hbox {Hz}, 270~\hbox {Hz}]\) for male speakers and \([120~\hbox {Hz}, 400~\hbox {Hz}]\) for female ones. The evaluation was made by measuring the \(\text {VDE}\), \(\text {GPE}\), \(\text {FFE}\) and \(\text {FPE}\) rates for clean and noisy speech, calculated using (13)–(18), in a systematic way, as follows:

  1. Performance for clean vs. noisy speech for all benchmarking PDA’s is reported in Tables 4 and 5, respectively.

  2. The results are analyzed in different qualitative aspects for all benchmarking PDA’s, i.e., by type of noise (cf. Fig. 3) and by gender of speaker (cf. Fig. 4).

  3. The performance of the proposed algorithm is evaluated with respect to its intrinsic parameters, i.e., by the preset maximum order of instantaneous pitch multiples (\(H_{\max }\)) (cf. Fig. 6) and by the sweeping step of \(f_{0}\) candidates (\(f_{0,step}\)) (cf. Fig. 7).

4.2.1 Performance for Clean Versus Noisy Speech

Performance for clean speech: The analysis of Table 4 shows that the proposed algorithm is as good as the state-of-the-art algorithms YIN [10] and SWIPE [8] in V/UV decision detection, i.e., \(\text {VDE}(\%)\), thanks especially to its low rate of false negatives, i.e., \(\text {V}\rightarrow \text {U}(\%)\). However, its rate of false positives, i.e., \(\text {U}\rightarrow \text {V}(\%)\), is slightly higher. It should also be noted that the low performance of the SHR algorithm [38] is due to the unified frame length and shift imposed on all algorithms during evaluation; SHR should give better results for shorter frames.

A finer analysis shows that the performance of (Prop.) drops when looking at the \(\text {FFE}\) rate. This is likely due to its higher \(\text {GPE}\) rate, as \(\text {FFE}\) is a weighted mean of \(\text {VDE}\) and \(\text {GPE}\), cf. (17). Nevertheless, the proposed PDA (Prop.) provides the best \(\text {FPE}\) in clean speech. This confirms the effect of good voicing detection, since \(\text {FPE}\) concerns only the true positives without gross pitch error, cf. (18). It also means that, for a region detected as voiced, if the gross pitch error is less than 20%, then the \(f_{0}\) contour estimated by (Prop.) is closer to the ground truth than that of any other PDA.

Performance for noisy speech: Table 5 shows the performance of the proposed PDA (Prop.) in different noise conditions, namely babble and white noise, with SNR ranging from 20 dB down to 0 dB. The results show that for both types of noise, (Prop.) does well only at low noise levels (\(\text {SNR} \ge 15~\hbox {dB}\)), whereas at higher noise levels, all rates are less satisfactory. In particular, the VDE rate remains as good as for clean speech, which shows that the proposed method succeeds in (a) detecting the presence of speech activity, particularly in white noise, and (b) distinguishing the target voice from other voices in babble noise. Figure 3 shows the measured rates for all PDA’s in both babble and white noise. The following remarks can be made:

  • For \(\text {SNR} \ge 15~\hbox {dB}\), most PDA’s perform as well as in clean speech, notably PRAAT and RAPT, and to a lesser degree SWIPE, YIN and (Prop.); for higher noise levels, the performance of all PDA’s gets worse.

  • At higher noise levels, (Prop.) manages to keep an intermediate position, especially for babble noise, whereas the top PDA’s like PRAAT and RAPT lose their efficiency.

  • For all tested PDA’s, the FPE rate becomes very low at \(\text {SNR} \le 5~\hbox {dB}\) (cf. Fig. 3g, h).

Comparison with \(f_{i}\)-based PDA’s: YANG is a PDA that was recently proposed by [22] and utilized in Google’s vocoder for speech synthesis. Its particularity lies in the fact that it is based on another type of instantaneous frequency, i.e., the channel-wise STFT instantaneous frequency (cf. (1)), which makes it interesting to compare with the proposed approach. For clean speech, Table 4 shows that YANG outperforms the proposed PDA in all metrics except FPE. However, the main difference lies in GPE, which also influences FFE, whereas the difference between the two PDA’s in VDE is less pronounced.

For noisy speech, Fig. 3a–h shows that even if YANG does better than the proposed PDA at low noise levels, i.e., \(\text {SNR}~\ge ~15~\hbox {dB}\), its performance decreases at higher levels of both types of noise, i.e., babble and white, whereas the proposed PDA remains more stable. This confirms the robustness of the proposed PDA to high noise levels.

4.2.2 Qualitative Performance Evaluation

Evaluation by type of noise: First, for babble noise (cf. Table 5 and Fig. 3a, c, e, g), the proposed algorithm is among the top PDA’s at low noise levels, i.e., \(\text {SNR}\ge 15~\hbox {dB}\). This means that it is capable of distinguishing the pitch of the target speaker among other voices. Also at high noise levels, i.e., \(\text {SNR}\le 10~\hbox {dB}\), the proposed PDA ranks among the top ones, even though no benchmarking PDA is very efficient there.

Secondly, for white noise (cf. Table 5 and Fig. 3b, d, f, h), the proposed algorithm is remarkably efficient at low noise levels, with error rates close to those for clean speech, cf. Table 5. However, this trend is less well maintained at high noise levels, i.e., \(\text {SNR}\le 10~\hbox {dB}\), where the proposed algorithm is less efficient than some benchmarking PDA’s such as RAPT, SWIPE and YIN.

Finally, it is important to note that the low \(\text {FPE}\) value at \(\text {SNR} \le 15~\hbox {dB}\) for all benchmarking PDA’s (cf. Fig. 3g, h) is caused by the poor V/UV estimation, i.e., a high \(\text {VDE}\) (cf. Fig. 3a, b), rather than by a good pitch estimation. Indeed, if most frames are detected as unvoiced, the overall \(\text {FPE}\) is calculated only on the few voiced frames without gross pitch error.

Fig. 3

Performance of benchmarking PDA’s by type of noise for all speakers

Evaluation by gender of speaker: Figure 4 shows that there is no substantial difference in the performance of the PDA’s between male and female speakers. In particular, the proposed algorithm registers similar levels of each type of error measure for both genders. This means that the parameters are set correctly; in fact, the search range \([f_{0,min},f_{0,max}]\) and the voicing threshold \(Th_{1}\), both for clean and for noisy speech, depend on the speaker’s gender (cf. Table 2).

Fig. 4

Performance of benchmarking PDA’s by gender of speaker for all types of noise

4.2.3 Evaluation of the Proposed PDA with Respect to Its Intrinsic Parameters

Evaluation by the preset maximum order of instantaneous pitch multiples: Figure 6 illustrates the results of the proposed algorithm for different values of the maximum order of instantaneous pitch multiples (\(H_{\max }\)), for both babble- and white-noised speech, at different \(\text {SNR}\) levels (from clean speech to a high noise level, i.e., \(\text {SNR}=0~\hbox {dB}\)). It is worth noting that the voicing decision error \(\text {VDE}\) does not depend on \(H_{\max }\), since the V/UV decision is calculated using \(f_{i}\) and its differential \(\varDelta f_{i}\) (cf. Algorithm 1); therefore, only \(\text {GPE}\) and \(\text {FPE}\) are reported in Fig. 6a–d.

The main observation is that both evaluation metrics, i.e., \(\text {GPE}\) and \(\text {FPE}\), depend more on the level of noise than on the maximum order of instantaneous pitch multiples \(H_{\max }\), which is preset as a parameter using the values mentioned in Table 2. Figure 6a–d shows that from \(H_{\max }=50\) onward, the performance of the proposed PDA does not change appreciably, for any type or level of noise. This can be counted as an advantage, since setting a low \(H_{\max }\) as an upper bound for instantaneous pitch multiples reduces the number of potential \(f_{0}\) candidates (cf. Algorithm 2), hence significantly reducing the computational load.

We also examined the effective number of pitch multiples used, \(H_{max,m}(n)\), at each time n and for every \(f_{0,cand}\) index \(m=1,\dots ,M\), by setting \(H_{\max }\) to the maximum value, i.e., 1000 (cf. Algorithm 2). The histograms shown in Fig. 5 indicate the distribution of the effective \(H_{max,m}(n)\) calculated for speech signals uttered by male and female speakers. It is interesting to note that for both genders, the majority of effective \(H_{max,m}(n)\) values are less than 50, and only in a few cases do they reach high values, i.e., more than 100. In particular, for male voices, most values of \(H_{max,m}(n)\) are around 15, whereas for female voices, the most common value is 10. Following the setting \(H_{max,m}(n) = \min (H_{\max }, \lfloor \frac{|f_i(n)|}{f_{0,cand}(n,m)}\rfloor )\) (cf. Algorithm 2), the obtained histograms confirm that: (a) the choice of \(H_{\max }\) is not critical provided it is high enough, and (b) most values of the instantaneous frequency \(f_i(n)\) fall within \(15\times f_{0,cand}\) for male voices and \(10\times f_{0,cand}\) for female ones, i.e., in a bandwidth bounded by nearly 4 kHz if we set \(f_{0,max}\) to 270 Hz for male and 400 Hz for female speakers, which is, interestingly, the same bandwidth containing \(f_{0}\) and the three main formants.

Fig. 5

Distribution of the effective number of pitch multiples \(H_{max,m}(n)\) at each time (n) and for each \(f_{0,cand}\) candidate value (m)

Evaluation by sweeping step of \(f_{0}\) candidates: Another parameter that may influence the quality of the extracted \(f_{0}\) contour is the sweeping step (\(f_{0,step}\)). As explained in Table 2, this parameter defines the precision of \(f_{0}\) candidate selection within the search interval \([f_{0,min}, f_{0,max}]\). For each \(f_{0}\) candidate value, the algorithm decides whether it corresponds to the best fitting \(f_{0}\) value at each instant n. Table 2 therefore recommends a step within the interval [0.1 Hz, 2 Hz], to preserve the tradeoff between the computational load and the precision of the extracted \(f_{0}\) contour. Indeed, precision standards of \(f_{0}\) detection do not require a step finer than 0.1 Hz, as pitch variations below 1 Hz are not perceptible, whereas a step coarser than 2 Hz would introduce perceptible errors. Figure 7 confirms this trend, since for \(f_{0,step}\) within [0.1 Hz, 2 Hz], most of the GPE and FPE measures are stable. For the same reason as for the maximum order of pitch multiples, only \(\text {GPE}\) and \(\text {FPE}\) are reported in Fig. 7a–d.

The analysis of these figures shows that for a small sweeping step, i.e., \(f_{0,step} \le 1~\hbox {Hz}\), the pitch errors, whether gross (\(\text {GPE}\)) or fine (\(\text {FPE}\)), are smaller for any level of noise. For a bigger step, i.e., \(f_{0,step} > 1~\hbox {Hz}\), \(\text {FPE}\) decreases; nevertheless, this is not due to a better estimation of \(f_{0}\), but rather to a high rate of \(\text {GPE}\), since \(\text {FPE}\) is calculated only on frames without gross pitch error. An exception is registered for the \(\text {GPE}\) of babble noise (cf. Fig. 7a), which decreases when \(f_{0,step}\) increases for \(20~\hbox {dB}\ge \text {SNR} \ge 5~\hbox {dB}\). This is due to the high \(\text {VDE}\) for noisy speech, which causes many voiced frames to be classified as unvoiced. Finally, the very low value of FPE at high noise levels (cf. Fig. 7c, d) cannot be counted as a positive result since, as already explained, it comes from the high VDE at those noise levels (cf. Table 5).

Fig. 6

Performance of the proposed PDA by maximum order of instantaneous pitch multiples (\(H_{\max }\)) for all speakers and each type of noise

Fig. 7

Performance of the proposed PDA by \(f_{0}\) candidate selection step (\(f_{0,step}\)) for all speakers and each type of noise

4.3 Discussion

The main advantages and shortcomings of the proposed PDA can be weighed against each other as follows, with some proposed solutions.

V/UV decision: It is satisfactory for both clean and noisy speech, at least at low noise levels (\(\text {SNR} \ge 15~\hbox {dB}\)). This confirms the role of the instantaneous frequency in detecting periodicity in the speech signal. On the other hand, it does not clearly outperform the top state-of-the-art methods. In particular, it is less efficient at high noise levels (\(\text {SNR} \le 10~\hbox {dB}\)), even though this observation is common to all benchmarking PDA’s.

The parametric structure of the proposed algorithm: It allows improving performance by combining the values of different parameters and thresholds through a grid search. In particular, a small step of \(f_{0}\) candidates (\(f_{0,step}\)) and a high order of instantaneous pitch multiples (\(H_{\max }\)) should improve the overall performance. Nevertheless, this may increase the computational load, which makes it difficult to run online for real-time applications such as on-the-fly pitch tracking. To cope with this shortcoming, some solutions can be suggested to reduce the computational load, e.g., storing the optimal parameters for each type and level of noise, gender of speaker, etc. in a look-up table, or implementing an adaptive parameter adjustment scheme.

The instantaneous frequency: It is computed for each sample along the whole signal, hence there is no need for short-time analysis, which avoids assuming local stationarity. However, such a sample-wise procedure is computationally heavy, especially at a high sampling rate. An intermediate solution, keeping the tradeoff between temporal and frequency resolution, could be to subsample the signal before computing the instantaneous and the fundamental frequencies.

Shortcomings and proposals: The method, as proposed, may be considered heuristic rather than rigorously theoretical. In fact, while studying this problem, we reviewed past works that could help in finding a mathematical proof; however, all we found were results in limited cases, as mentioned in Sect. 2.3. Therefore, we believe that, in spite of this limitation, the method may be useful for the following reasons: (a) it shows that natural signals, such as speech, may have properties that are still to be investigated, to provide more accurate models representing either the signal itself or its parameters, such as \(f_{0}\); (b) the results obtained by this method, at least during the validation of the proposed relationship (cf. Sect. 3.3), may hopefully spark the curiosity of the signal processing community in general, and the speech/audio processing community in particular, to study this problem and to determine whether it holds for any type of signal and under which conditions, or whether it is just a property of a particular class of signals; (c) finally, in case no explicit mathematical proof can be worked out for the proposed relationship between \(f_{0}\) and \(f_{i}\), machine learning could be an alternative to build a mapping between the two, taking into consideration the particularities of each type of signal.

5 Conclusion

In this paper, a novel pitch detection algorithm was presented. The key idea relies on an empirical relationship between the fundamental frequency \(f_{0}\) and the instantaneous frequency \(f_{i}\). This relationship stipulates that the \(f_{0}\) contour can be approximated as the smoothed envelope of the residual \(f_{i}\), which is calculated as the remainder of the division of the absolute value of \(f_{i}\) by the highest pitch multiple at each instant. The superposition of the so-estimated \(f_{0}\) and the ground-truth values was verified. Then, an algorithm was implemented based on this relationship, in order to detect voiced/unvoiced regions and to extract the \(f_{0}\) contour from \(f_{i}\) values in the voiced parts. In comparison to some well-rated state-of-the-art PDA’s, the proposed algorithm has been highly successful in taking accurate V/UV decisions, and quite satisfactory in approximating \(f_{0}\) values in voiced parts, for both clean and simulated noisy speech at low noise levels.

The proposed algorithm has two major advantages: (a) it does not rely on short-time signal analysis and is thus able to perform instantaneous pitch detection; (b) its parametric structure allows adapting it to several conditions, such as the type and level of noise, the gender of the speaker, etc., by fine-tuning its specific parameters, such as \(H_{\max }\) and \(f_{0,step}\), in addition to its thresholds. Further improvement can be achieved by investigating the proposed empirical relationship between \(f_{0}\) and \(f_{i}\) in more depth, in order to make it more explainable and interpretable. Finally, the proposed method can be useful for audio and speech signal analysis, reconstruction and synthesis, using an \(f_{i}\)-based vocoder, as in [22]. Besides, it can be extended to other audio and speech applications such as compressive sensing, where only a small amount of data is required to reconstruct the signal.