1 Introduction

Emotions in speech signals are reflected as subtle variations in the excitation source parameters and the vocal tract parameters [42]. Both sets of parameters contribute equally toward the characterization of various emotions. Nevertheless, the literature places special emphasis on the analysis and recognition of emotion using the source parameters [7, 19, 23, 28, 29, 31, 38, 42]. This is mainly due to the availability of reference electroglottographic signals [4] and well-established tools for the estimation of source parameters [35, 53].

Instantaneous pitch [3, 29, 50], strength of excitation [23, 28, 38] and glottal flow parameters [5, 6] are reported as the major emotion-dependent source parameters. Among these, instantaneous pitch is widely used for the analysis and synthesis of emotional speech. For instance, Bulut et al. [3] reported that statistical measures derived from instantaneous pitch alone show significant emotion class discrimination. They also reported that instantaneous pitch is more important than average pitch for emotional speech synthesis. Besides, the instantaneous pitch contour plays a significant role in the analysis stage of applications such as emotion recognition [7, 23, 29, 38] and emotion conversion [5, 6, 16, 20].

The instantaneous pitch contour of a given speech signal is derived as the inverse of the time interval between successive epoch locations (glottal closure instants) [35]. This in turn demands the accurate estimation of epochs, which are the instants of major excitation during the vibration of the vocal folds [11, 12]. Furthermore, the analysis of other source features, such as the strength of excitation and glottal flow parameters, also requires accurate epoch estimation from the emotional speech signal [5, 19, 38, 41]. Hence, the objective of the present work is the estimation of epoch locations from speech for emotion analysis.

The estimation of epochs from the speech signal is a challenging task due to the interfering response of the vocal tract [11]. Many efficient algorithms exist that provide accurate epoch estimates by removing the vocal tract influence to a large extent. These methods are discussed briefly in the next subsection.

1.1 Existing Methods for Epoch Estimation

The methods proposed for the estimation of epochs from the speech signal employ different criteria for the identification or localization of epochs. The first type includes methods that rely on the residual signal extracted from the speech signal using linear prediction (LP) analysis [11, 39]. The LP residual exhibits large error values, appearing as discontinuities around epoch locations. However, the bipolar nature of the peaks in the LP residual creates ambiguities in locating epochs [1]. Therefore, the Hilbert envelope (HE) of the LP residual was proposed by Ananthapadmanabha et al. [1] for unambiguous epoch estimation, exploiting its unipolar nature. Nevertheless, the use of the prediction error for epoch estimation is found to be less effective, since the LP residual is often influenced by the vocal tract system [11]; the inverse filter does not remove the vocal tract response completely.

The other criteria include zero crossings of the phase slope function derived from the LP residual or the wavelet transform [36, 46], properties of impulse-like excitation [35], singularity exponents [27] and the structure of the glottal flow derivative [30].

The dynamic programming phase slope algorithm (DYPSA) [36] identifies candidate epoch locations as the negative zero crossings of the phase slope function of the LP residual. Besides, the algorithm employs a phase slope projection technique to recover undetected epoch locations. True epoch locations are then obtained by N-best dynamic programming. Later, 'yet another GCI/GOI algorithm' (YAGA) [46] was proposed by modifying DYPSA. In contrast to DYPSA, YAGA identifies epoch locations by applying the phase slope function to the wavelet transform of the source signal. The epoch identification rate is improved in YAGA by a GCI refinement process, which is not performed in DYPSA.

The zero frequency filtering (ZFF)-based method proposed by Murty et al. [35] exploits the nature of the impulse excitation during glottal closures: the discontinuities due to impulse excitation are reflected across all frequencies, including the zero frequency. Hence, the speech signal is passed through two cascaded zero frequency resonators. The resonator output is then trend removed to obtain the zero frequency filtered signal (ZFFS). The trend removal is performed by subtracting the local mean computed over 1–2 times the average pitch period of the speech signal. The positive zero crossings of the ZFFS are identified as the epoch locations.

In the speech event detection using the residual excitation and a mean-based signal (SEDREAMS) algorithm [12], the first step is to obtain a mean-based signal from the speech signal; again, the window length is fixed based on the average pitch period. This mean-based signal is used to determine the intervals where an epoch is present. Finally, the peak of the LP residual within each interval is taken as the epoch. This, however, requires prior estimation of the polarity of the speech signal [17, 25] to decide the sign of the peaks corresponding to epochs.

The micro-canonical multi-scale formalism (MMF) [27] relies on the estimation of a multi-scale parameter called the singularity exponent; epoch locations correspond to samples with lower singularity exponents. Finally, the glottal closure/opening instant estimation forward-backward algorithm (GEFBA) [30] estimates epochs only in the voiced regions of the speech signal, exploiting the structure of the glottal flow derivative using simple time-domain criteria.
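To make the ZFF pipeline concrete, the following is a minimal NumPy/SciPy sketch of the steps described above (cascaded zero frequency resonators, trend removal over a window tied to the average pitch period, and positive zero crossing detection). The window factor of 1.5 pitch periods, the number of trend removal passes and the assumed average pitch are illustrative choices, not values prescribed by [35].

```python
import numpy as np
from scipy.signal import lfilter

def zff_epochs(x, fs, avg_pitch_hz=150.0):
    """Sketch of zero frequency filtering (ZFF) epoch estimation.

    avg_pitch_hz is an assumed average pitch; the original method
    derives it from the signal itself."""
    # Difference the signal to remove any slowly varying DC offset.
    d = np.diff(x, prepend=x[0])
    # Two cascaded ideal zero frequency resonators:
    # y[n] = 2*y[n-1] - y[n-2] + d[n], i.e., 1 / (1 - z^{-1})^2 applied twice.
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Trend removal: repeatedly subtract the local mean computed over
    # ~1.5 times the average pitch period (within the 1-2x range in [35]).
    half = int(1.5 * fs / avg_pitch_hz) // 2
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode='same')
    # Epochs: positive (negative-to-positive) zero crossings of the ZFFS.
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
```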

1.2 Drawbacks of the Existing Methods in the Context of Emotive Utterances

Most of the aforementioned methods identify more than one epoch candidate per glottal cycle, followed by a candidate selection procedure. Therefore, the thresholds or window sizes fixed for localizing the true epochs directly affect the reliability of epoch estimation. For example, fixing the window length based on the average pitch period causes missed or spurious epochs in the zero frequency filtering approach. In summary, the performance of epoch estimation is mainly dependent on factors such as vocal tract resonances, analysis window length, algorithmic thresholds, signal polarity and uncontrolled variations in pitch. These factors are not reported to cause any significant degradation of epoch estimation performance in neutral speech [24]. However, all of them contribute to degraded epoch estimation in emotional speech [18, 24]. Several studies have examined the robustness of epoch estimation techniques to additive noise and reverberation [27, 30], but attempts focusing on epoch estimation from emotional speech signals are limited.

1.3 Methods Proposed Exclusively for Emotive Speech Signals

In the literature, the only method proposed exclusively for estimating epochs from emotional speech is a modification of the ZFF method (m-ZFF). Besides emotive speech, various other types of speech signals, such as singing [26], laughter [13, 32, 45] and telephonic voices [8, 21], have also been analyzed using modified ZFF methods. For instance, Kumar et al. [32] used m-ZFF to estimate excitation source information (instantaneous pitch and epoch strength) for the analysis and characterization of laughter signals. Later, Thati et al. [45] modified the estimated excitation source features for the synthesis of laughter signals. Kadiri et al. [26] studied the effect of the wider pitch range of the singing voice using the m-ZFF method for the extraction of GCIs. Furthermore, Govind et al. proposed an m-ZFF approach [18] for epoch extraction from emotional speech by re-filtering the ZFF signal with a low-pass filter. Even though this method provides fair epoch estimation results for emotional speech, it introduces artifacts due to block processing [16]. Recently, Kadiri et al. proposed a method [24] based on the multi-scale product (MSP) of the single frequency filtered signal for deriving impulse-like events from the emotional speech signal; prominent epochs are then identified from the derived impulses using the m-ZFF approach. Nevertheless, the performance evaluation results of these approaches show that there is still scope for improvement.

1.4 Motivation and Formulation of the Proposed Method

The state-of-the-art methods approximate the resonance effect of the vocal tract system on the glottal excitation signal with a linear filter model. However, this approximation is not appropriate for the highly nonlinear source-filter interaction during the production of emotional speech, and it consequently affects the performance of epoch estimation. Hence, it is more appropriate to analyze the speech signal using techniques meant for nonlinear signal processing. This has motivated us to explore a relatively new adaptive time series decomposition technique, variational mode decomposition (VMD), for analyzing non-stationary emotional speech signals.

The discontinuities due to impulse excitation at epochs occur at a fundamental frequency defined for each glottal cycle [33]. These variations can be analyzed by decomposing the given emotional speech signal around this fundamental frequency. Among three well-known adaptive signal decomposition techniques, namely the empirical wavelet transform (EWT) [15], empirical mode decomposition (EMD) [22] and variational mode decomposition (VMD) [10], VMD has been extensively used in biomedical, speech and seismic signal processing [34, 47, 51]. The advantage of VMD is that it captures the relevant center frequencies while ensuring good frequency separation [10]. Moreover, VMD is efficient at identifying the various discontinuities present in a non-stationary signal [33, 43]. Lal et al. [33] and Deshpande et al. [9] propose the estimation of GCIs from the electroglottographic signal using VMD. Furthermore, the VMD algorithm has been applied iteratively to the neutral speech signal for voiced/unvoiced detection and estimation of the instantaneous pitch frequency [47, 48]. Experimental results of Upadhyay et al. show that the iterative application of VMD separates the fundamental frequency (F\(_0\)) component from the neutral speech signal; they do not use epoch information for estimating the instantaneous fundamental frequency. However, there is no guarantee that the vocal tract system generates similar speech waveforms for each impulse-like excitation [53], nor can any periodicity in the impulsive excitation at epochs be assured. Hence, it is more advantageous to use an epoch-based approach for the estimation of the instantaneous fundamental frequency.

In contrast to Upadhyay et al. [47] and Lal et al. [33], the proposed method tries to estimate epochs from emotive speech signals whose characteristics are completely different from neutral speech signals and EGG signals. Thus, the novelty of the proposed work is the effective utilization of the VMD algorithm in capturing the glottal source characteristics of emotive speech utterances for the estimation of epochs. Precisely, the proposed method tries to decompose the emotional speech signal to a sub-signal (mode) similar in structure to that of the excitation signal. The important characteristic of the desired mode is that its center frequency of oscillation should be close to the fundamental frequency (F\(_0\)) defined for each glottal cycle. Finally, we use this center frequency characteristic of the sub-signal for the estimation of epochs.

The rest of the paper is organized as follows. In Sect. 2, we describe the methodology for the estimation of epochs from emotional speech signals. Section 3 discusses the database used, the empirical experiments conducted for fixing the tuning parameters of VMD, the performance evaluation of the proposed method and the performance comparison with other popular methods. Finally, Sect. 4 draws conclusions and outlines future directions.

2 Proposed Method for Epoch Estimation Using VMD

In the proposed method, we decompose the emotional speech signal iteratively using VMD. The desired VMD mode signal is then analyzed for the identification of epoch locations. A brief description of the VMD algorithm is given first.

2.1 VMD Algorithm

VMD is a non-recursive, adaptive decomposition technique for non-stationary signals. It decomposes a non-stationary signal into a set of sub-signals, or modes, whose number is specified a priori [10]. Each decomposed mode has compact support around a corresponding center frequency. The VMD algorithm identifies these modes by minimizing the sum of their bandwidths, under the constraint that the decomposed modes sum to the original signal. The procedure for identifying each mode is as follows.

  1. The one-sided frequency spectrum of the mode is obtained using the Hilbert transform.

  2. The frequency spectrum is shifted to the baseband region by multiplying by a complex exponential tuned to the estimated center frequency.

  3. The bandwidth is estimated through the H1 Gaussian smoothness of the demodulated signal, i.e., the squared L2-norm of the gradient.

The mathematical representation of the procedure is given below.

$$\begin{aligned} \mathop {\min }\limits _{\{x_k\},\{\omega _k\}} \left\{ \sum \limits _k \left\| \frac{\partial }{\partial t}\left[ \left( \delta (t) + \frac{j}{\pi t} \right) * x_k(t) \right] \mathrm{e}^{-j\omega _k t} \right\| _2^2 \right\} \quad \mathrm {s.t.} \quad \sum \limits _{k = 1}^{K} x_k(t) = x(t) \end{aligned}$$
(1)

where \(\frac{\partial }{\partial t}\left[ \,\cdot \, \right] \) denotes the partial derivative with respect to time. Further, \(x_k(t)\) corresponds to the kth component of the signal \(x(t)\) with center frequency \(\omega _k\), and K represents the total number of modes. The analytic signal corresponding to \(x_k(t)\) is obtained by convolution with \(\left( \delta (t) + \frac{j}{\pi t} \right) \) (the Hilbert transform kernel). Here, \(j = \sqrt{-1}\) and \(\delta (t)\) is the unit impulse function, whose value is zero everywhere except at the origin. The resulting signal has a unilateral spectrum, which is shifted to the baseband by mixing with \(\mathrm{e}^{-j\omega _k t}\) tuned to the mode's center frequency \(\omega _k\). Finally, the bandwidth of the mode is estimated as the squared L2-norm of the gradient. Precisely, the formulation tries to find the K center frequencies and the corresponding modes \(x_k(t)\).

Now, this optimization procedure is converted into an unconstrained one as follows.

$$\begin{aligned} {\mathcal {L}}(x_k,\omega _k,\lambda ) :=\;&\alpha \sum \limits _k \left\| \partial _t \left[ \left( \delta (t) + \frac{j}{\pi t} \right) * x_k(t) \right] \mathrm{e}^{-j\omega _k t} \right\| _2^2 \nonumber \\&+ \left\| x(t) - \sum \limits _{k = 1}^{K} x_k(t) \right\| _2^2 + \left\langle \lambda (t),\, x(t) - \sum \limits _k x_k(t) \right\rangle \end{aligned}$$
(2)

In Eq. 2, \(\mathcal {L}\) represents the augmented Lagrangian, \(\lambda \) is the Lagrangian multiplier, and \(\alpha \) is the bandwidth control parameter.

The above unconstrained problem is solved using the alternating direction method of multipliers (ADMM) [10], which updates one variable at a time while holding the others fixed. Firstly, the update for \(x_k(t)\) is obtained by absorbing the inner product term \(\left\langle \lambda (t), x(t) - \sum \limits _k x_k(t) \right\rangle \) into the quadratic penalty \(\left\| x(t) - \sum \limits _{k = 1}^{K} x_k(t) \right\| _2^2\). Therefore,

$$\begin{aligned} x_k^{n + 1} =&\mathop {\arg \min }\limits _{x_k(t)}\; \alpha \sum \limits _k \left\| \frac{\partial }{\partial t}\left[ \left( \delta (t) + \frac{j}{\pi t} \right) * x_k(t) \right] \mathrm{e}^{-j\omega _k t} \right\| _2^2 \nonumber \\&+ \left\| x(t) - \sum \limits _{k = 1}^{K} x_k(t) + \frac{\lambda (t)}{2} \right\| _2^2 \end{aligned}$$
(3)

Equation 3 is solved in the spectral domain by noting that the L2-norm in the time domain equals that in the frequency domain (Parseval's theorem). The solution for the updated mode is obtained as follows.

$$\begin{aligned} \hat{X}_k^{n + 1}(\omega ) = \frac{\hat{x}(\omega ) - \sum \limits _{i \ne k} \hat{X}_i(\omega ) + \frac{\hat{\lambda }(\omega )}{2}}{1 + 2\alpha (\omega - \omega _k)^2} \end{aligned}$$
(4)

where \(\hat{x}(\omega )\), \(\hat{X}_i(\omega )\), \(\hat{\lambda }(\omega )\) and \(\hat{X}_k^{n + 1}(\omega )\) represent the Fourier transforms of \(x(t)\), \(x_i(t)\), \(\lambda (t)\) and \(x_k^{n + 1}(t)\), respectively. Similarly, the update for the center frequency is obtained by solving the first term of Eq. 2 in the spectral domain. The updated center frequency is as follows.

$$\begin{aligned} \hat{\omega }_k^{n + 1} = \frac{\int _0^\infty \omega \left| \hat{X}_k(\omega ) \right| ^2 \,\mathrm{d}\omega }{\int _0^\infty \left| \hat{X}_k(\omega ) \right| ^2 \,\mathrm{d}\omega } \end{aligned}$$
(5)

The complete algorithm of VMD can be found in [10].
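As a concrete illustration of Eqs. 4 and 5, the following is a minimal NumPy sketch of the VMD update loop. It omits the Lagrangian multiplier update (\(\lambda = 0\)) and the mirror extension of the signal used in the full algorithm [10], and its \(\alpha \) acts on a normalized frequency axis (cycles/sample), so the values are not directly comparable to those quoted later in the paper; the initial center frequencies are arbitrary assumptions.

```python
import numpy as np

def vmd(x, K=2, alpha=2000.0, n_iter=500, tol=1e-7):
    """Simplified VMD sketch: alternate the mode update (Eq. 4, with the
    Lagrangian multiplier set to zero) and the center frequency update
    (Eq. 5) on the one-sided spectrum until the frequencies settle."""
    N = len(x)
    freq = np.fft.fftfreq(N)                 # normalized frequency (cycles/sample)
    X = np.fft.fft(x)
    X[freq < 0] = 0.0                        # analytic (one-sided) spectrum
    Xk = np.zeros((K, N), dtype=complex)     # mode spectra
    wk = np.linspace(0.05, 0.25, K)          # assumed initial center frequencies
    for _ in range(n_iter):
        wk_old = wk.copy()
        for k in range(K):
            # Eq. (4): Wiener-filter-like update of the k-th mode spectrum.
            residual = X - Xk.sum(axis=0) + Xk[k]
            Xk[k] = residual / (1.0 + 2.0 * alpha * (freq - wk[k]) ** 2)
            # Eq. (5): center frequency as the spectral centroid of the mode.
            power = np.abs(Xk[k]) ** 2
            wk[k] = np.sum(freq * power) / (np.sum(power) + 1e-12)
        if np.max(np.abs(wk - wk_old)) < tol:
            break
    # Real modes from the one-sided spectra (factor 2 restores the amplitude).
    modes = 2.0 * np.real(np.fft.ifft(Xk, axis=1))
    return modes, wk                         # multiply wk by fs to get Hz
```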

Fig. 1
Variational mode decomposition of a multi-component synthetic signal. Waveform and linear magnitude spectrum of a, b the synthetic signal and the corresponding c, d mode 1 component and e, f mode 2 component

In order to illustrate the effectiveness of VMD in decomposing a multi-component signal, we simulated a synthetic signal resembling a voiced segment of a speech signal. A voiced speech signal can be represented in the low-frequency region (50–500 Hz) as an amplitude- and frequency-modulated (AM–FM) signal as follows [47].

$$\begin{aligned} F_\mathrm{LFR}(n) = \sum \limits _{k = 1}^{N} a_k(n) \cos \left( 2\pi k f_0[n]\,n + \theta _k[n] \right) \end{aligned}$$
(6)

where \(f_0[n]\) is the time-varying fundamental frequency, \(a_k(n)\) and \(\theta _k[n]\) are the time-varying amplitude and phase of the kth harmonic of \(f_0[n]\), and N is the number of harmonics. Here, we simulate a signal containing frequency components at 200 and 400 Hz. The time-varying amplitudes are fixed at 1 and 0.5, respectively, and the phase term is neglected. The sampling frequency is 8 kHz. Further, we added white Gaussian noise (SNR of 10 dB) to the signal. The noisy simulated signal and its linear magnitude spectrum are shown in Fig. 1a, b. This input signal is decomposed into two modes using the VMD algorithm. The modes and their linear magnitude spectra are shown in Fig. 1c, d and e, f. The center frequencies of the two modes are 199.75 and 401.06 Hz, respectively. This confirms the effectiveness of VMD in separating the frequency components of the signal.
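For reference, the synthetic test signal of Eq. 6 can be generated as follows; the 0.5 s duration and the noise realization are arbitrary choices, while the component frequencies, amplitudes, SNR and sampling rate follow the values stated above.

```python
import numpy as np

fs = 8000                                    # sampling frequency, Hz
n = np.arange(int(0.5 * fs))                 # 0.5 s of signal (assumed duration)
# Eq. (6) with two components: 200 Hz (amplitude 1) and 400 Hz (amplitude 0.5),
# constant amplitudes and zero phase.
x = np.cos(2 * np.pi * 200 * n / fs) + 0.5 * np.cos(2 * np.pi * 400 * n / fs)

# Add white Gaussian noise at 10 dB SNR.
noise = np.random.randn(len(x))
noise *= np.sqrt(np.mean(x ** 2) / (10 ** (10.0 / 10.0) * np.mean(noise ** 2)))
x_noisy = x + noise

# The two largest spectral peaks sit at 200 and 400 Hz (cf. Fig. 1b),
# which VMD with K = 2 is expected to separate into the two modes.
spectrum = np.abs(np.fft.rfft(x_noisy))
freqs = np.fft.rfftfreq(len(x_noisy), d=1.0 / fs)
print(np.sort(freqs[np.argsort(spectrum)[-2:]]))   # -> [200. 400.]
```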

In VMD, the decomposition of a particular mode around a compact center frequency depends largely on two tuning parameters: the number of modes K and the bandwidth control parameter \(\alpha \). K is fixed based on the number of sub-signals or components required, while \(\alpha \) is fixed based on the center frequency of interest [52]. In theory, \(\alpha \) is inversely proportional to the bandwidth of the components of the original signal. Further, the number of modes K controls the energy distribution among the modes. A combination of a very small \(\alpha \) and very few modes results in components being shared among the modes; this sharing is termed mode mixing [10, 33]. Mode mixing occurs when the center frequencies of neighboring modes are very near to each other. Also, combining \(\alpha \) with a superfluous number of modes K leads to redundant VMD information [52]. A smaller value of \(\alpha \) makes the bandwidth of the mode filters wider, which tends to add more background noise to the VMD results; conversely, too narrow a bandwidth produces distorted VMD results [52]. A well-chosen combination of \(\alpha \) and K will include all the frequency components of the input in the VMD results. Hence, proper selection of these two parameters is essential for ensuring the accuracy of VMD.

Emotion-specific source features such as the locations of glottal closures (epochs) and the strength of excitation have their roots in the glottal waveform itself. Hence, the glottal waveform alone is sufficient for the estimation of epoch locations. However, this glottal excitation signal is filtered by the vocal tract system to produce the speech signal [11]. Moreover, in emotional speech there is an increase in the fundamental frequency and in the energy of higher harmonics due to the rapid vibration of the vocal folds. Hence, the excitation characteristic must be separated from the influence of higher harmonics for reliable estimation of epoch locations. The decomposition should be such that one of the modes preserves the excitation characteristics; the other mode corresponds to higher-frequency oscillations and is discarded. Therefore, we fix the number of modes K at two for the decomposition. Precisely, the center frequency of one of the modes should be near the fundamental frequency (F\(_0\)) defined for each glottal cycle. However, since VMD is a non-recursive algorithm, a single pass might not bring the center frequency of a mode close to this fundamental frequency. Therefore, we apply VMD iteratively to the emotional speech signal until the center frequency of a mode is near the fundamental frequency. The average F\(_0\) of the emotional speech signal is obtained using the fxrapt algorithm [2, 44].

The determination of \(\alpha \) is challenging because the excitation characteristics of the various emotional speech signals differ considerably. In this work, we fix \(\alpha \) for the first iteration of VMD based on the center frequency of interest; that is, we select \(\alpha \) such that the deviation of the center frequency from the average F\(_0\) is minimal. The results of this empirical evaluation are discussed in Sect. 3.1. The value of \(\alpha \) for successive iterations is fixed such that the gross error and mean absolute error in the estimation of instantaneous pitch from emotional speech are minimized; the results of these pitch evaluation experiments are also discussed in Sect. 3.1. Precisely, we use the empirically obtained optimal \(\alpha \) combination (100,000, 10,000), i.e., 100,000 for the first iteration and 10,000 for successive iterations, for the estimation of epochs from the emotional speech signal.

2.2 Procedure for Epoch Estimation

The flow diagram of the proposed method is given in Fig. 2. The procedure is as follows.

Fig. 2
a Flow diagram of the proposed method. \(\mathrm{CF}_\mathrm{sm}\) indicates the center frequency of the selected mode signal; \(\mathrm{CFD}_\mathrm{mm}\) denotes the absolute value of the difference between the center frequencies of the two modes. The threshold is fixed as \({\left( {1/4} \right) }\)th of the minimum pitch. b Waveform representation of the flow graph

  1. Apply VMD to the emotional speech signal with K = 2 and \(\alpha \) = 100,000.

  2. Select the mode with the lower center frequency and discard the other mode. A mode with center frequency below 80 Hz is also discarded, since human pitch typically ranges from 80 to 400 Hz.

  3. If \(\mathrm{CF}_\mathrm{sm}\) is less than or equal to the average \(F_{0}\) of the emotive speech signal, the VMD iteration is stopped; the selected mode signal is taken as the VMD output signal, and we proceed to step 5.

  4. If \(\mathrm{CF}_\mathrm{sm}\) is greater than the average \(F_{0}\), apply VMD iteratively to the mode with the lower center frequency (K = 2 and \(\alpha \) = 10,000). The iteration is stopped once \(\mathrm{CF}_\mathrm{sm}\) is less than or equal to the average \(F_{0}\). The selection of a particular mode or the combination of modes as the VMD output signal is then based on the \(\mathrm{CFD}_\mathrm{mm}\) between the two modes: if \(\mathrm{CFD}_\mathrm{mm}\) is greater than the threshold, choose the mode with the lower center frequency as the VMD output signal; if \(\mathrm{CFD}_\mathrm{mm}\) is less than or equal to the threshold, choose the combination of the two modes.

  5. The positive-to-negative zero crossings of the VMD output signal are hypothesized as epoch locations.

The parameters \(\mathrm{CF}_\mathrm{sm}\), \(\mathrm{CFD}_\mathrm{mm}\) and threshold are defined as follows.

  • \(\mathrm{CF}_\mathrm{sm}\) indicates the center frequency of the selected mode signal.

  • \(\mathrm{CFD}_\mathrm{mm}\) denotes the absolute value of the difference between the center frequency of the two modes.

  • The threshold, against which \(\mathrm{CFD}_\mathrm{mm}\) is compared, is kept at \({\left( {1/4} \right) }\)th of the minimum pitch (80 Hz), i.e., 20 Hz.

Figure 2b demonstrates the flow graph using the waveforms obtained at each step. Here, we can observe that the center frequency converges to the average \(F_{0}\) in the second VMD iteration. The selected mode signal is identified as the VMD output signal, and its positive-to-negative zero crossings are hypothesized as epoch locations. A minimal code sketch of this procedure is given below.
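The sketch below assumes the simplified vmd() helper from Sect. 2.1 (center frequencies in cycles/sample, converted to Hz here), takes the average F\(_0\) from an external estimator such as fxrapt, and interprets the 'combination of modes' in step 4 as their sum; the \(\alpha \) values follow the (100,000, 10,000) pair, though their scale depends on the VMD implementation used.

```python
import numpy as np

def estimate_epochs(x, fs, avg_f0, min_pitch=80.0, max_iter=10):
    """Sketch of the procedure in Sect. 2.2 using the vmd() helper above."""
    threshold = min_pitch / 4.0                    # 20 Hz for an 80 Hz pitch floor
    alpha, signal = 1e5, x                         # first-iteration alpha (step 1)
    for _ in range(max_iter):
        modes, wk = vmd(signal, K=2, alpha=alpha)
        order = np.argsort(wk)
        modes, wk_hz = modes[order], wk[order] * fs
        # Step 2: take the lower-CF mode unless it falls below the 80 Hz floor.
        sel = 0 if wk_hz[0] >= min_pitch else 1
        out, cf_sm = modes[sel], wk_hz[sel]
        if cf_sm <= avg_f0:                        # steps 3-4: stopping criterion
            cfd_mm = abs(wk_hz[1] - wk_hz[0])
            if cfd_mm <= threshold:                # modes spectrally close:
                out = modes.sum(axis=0)            # keep their combination (sum)
            break
        signal, alpha = modes[sel], 1e4            # iterate with the second alpha
    # Step 5: positive-to-negative zero crossings as epoch locations.
    return np.where((out[:-1] > 0) & (out[1:] <= 0))[0]
```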

3 Experimental Results and Discussion

In this study, we perform the following experiments with regard to epoch estimation from the emotional speech signal.

  1. Experiments for determining the optimal value of \(\alpha \) for VMD.

  2. Experiments for evaluating the performance of epoch estimation in emotional speech signals.

  3. Experiments for performance comparison with the state-of-the-art methods.

Firstly, we provide a brief description of the speech material and ground truth used for conducting the aforesaid experiments.

Database and ground truth The proposed method has been evaluated on the German emotional speech corpus (EMO-DB), which includes simultaneous recordings of electroglottogram (EGG) signals [4]. The database comprises six basic emotions (boredom, sad, disgust, fear, anger and happiness) along with corresponding neutral versions [4]. It includes approximately 100 speech utterances (10 test sentences per emotion) spoken by 10 professional German actors (5 male and 5 female) with simultaneous EGG recordings. The recordings were initially sampled at 48 kHz and later downsampled to 16 kHz [4].

The ground truth for evaluating the performance of epoch estimation in the emotional speech signal is obtained manually from the corresponding DEGG (differentiated EGG) signals. We used the Wavesurfer tool [49] to create the manual reference epochs. The labeling is done by marking the locations corresponding to the prominent positive peaks in the DEGG signal. Besides the manual reference epochs, we collected algorithmic reference epochs based on the method proposed in Lal et al. [33], where we show that epochs can be estimated accurately and reliably from the EGG signal using the VMD algorithm. Thus, even when manual references are not available, one can use the complementary algorithmic references obtained using that method for evaluating the performance of epoch estimation.

Fig. 3
Illustration of epoch estimation from the emotional EGG signal using VMD. a Voiced segment of an anger EGG signal, corresponding b DEGG signal, c mode 1 component, d mode 2 component, e mode 3 component. Epoch locations corresponding to positive peaks in the DEGG signal are marked using thick red lines (Color figure online)

Figure 3 shows an illustration of epoch estimation from the emotional EGG signal using the method proposed in Lal et al. [33]. Here, Fig. 3a, b depicts a voiced region of the EGG signal corresponding to anger speech and its first-order derivative (DEGG). Figure 3c–e shows the three modes obtained from VMD. From the decomposition results, it is observed that the positive-to-negative zero crossings [marked '×' (blue)] of the second mode coincide with the locations of the prominent positive peaks in the DEGG signal [marked as thick red lines in Fig. 3d]. This occurs because the center frequency of the second mode coincides with the fundamental frequency of oscillation (\(F_{0}\)) of the EGG signal. Therefore, the positive-to-negative zero crossings of the second mode (Fig. 3d) correspond to epoch locations. This phenomenon cannot be seen in the other modes because their center frequencies are far from \(F_{0}\).

3.1 Determination of the Optimal \(\alpha \) Value by Empirical Evaluation

The optimal \(\alpha \) value pair for the VMD iterations is determined empirically from experiments conducted on emotional speech signals taken from EMO-DB. Initially, we search for the best \(\alpha \) value for the first iteration of VMD, i.e., the one that minimizes the deviation of the center frequency from the average \(F_{0}\). For successive iterations, we fix the \(\alpha \) value such that the best performance is attained in the estimation of instantaneous pitch values. The standard performance measures for pitch evaluation are given below [40, 53].

  1. Mean absolute error (MAE) The mean of the absolute value of the difference between the estimated and reference pitch values:

    $$\begin{aligned} \mathrm{MAE},\ \bar{e} = \frac{1}{N}\sum \limits _{i = 1}^{N} \left| e(m_i) \right| \qquad \text {where}\ e(m_i) = P_i(r) - P_i(e) \end{aligned}$$
    (7)

    In Eq. 7, \(P_i(r)\) and \(P_i(e)\) represent the reference and estimated pitch values of the ith voiced frame, and N is the total number of voiced frames.

  2. Standard deviation (SD) The standard deviation of the difference between the estimated and reference pitch values:

    $$\begin{aligned} \mathrm{SD},\ \sigma = \sqrt{\frac{1}{N}\sum \limits _{i = 1}^{N} e^2(m_i) - \bar{e}^2} \end{aligned}$$
    (8)

  3. Gross error (GE) The percentage of voiced frames with an estimated pitch value that deviates from the reference pitch value by more than 20\(\%\):

    $$\begin{aligned} \mathrm{GE} = \frac{D_p}{N} \times 100 \end{aligned}$$
    (9)

    where \(D_p\) is the number of voiced frames with a pitch deviation greater than 20\(\%\).
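These three measures translate directly into code; a minimal NumPy sketch, assuming aligned arrays of reference and estimated pitch values over the N voiced frames, is given below.

```python
import numpy as np

def pitch_metrics(p_ref, p_est):
    """MAE (Eq. 7), SD (Eq. 8) and GE (Eq. 9) for voiced-frame pitch values."""
    p_ref = np.asarray(p_ref, dtype=float)
    e = p_ref - np.asarray(p_est, dtype=float)     # e(m_i) = P_i(r) - P_i(e)
    mae = np.mean(np.abs(e))                       # Eq. (7)
    sd = np.sqrt(np.mean(e ** 2) - mae ** 2)       # Eq. (8), e-bar from Eq. (7)
    ge = 100.0 * np.mean(np.abs(e) > 0.2 * p_ref)  # Eq. (9): > 20 % deviation
    return mae, sd, ge
```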

In our experiments, we used the gross error and mean absolute error as measures for fixing the best \(\alpha \) value for successive VMD iterations.

Table 1 Deviation in center frequency from the average \(F_{0}\) (\(\mathrm{CF}_\mathrm{{error}}\)) averaged over the ten emotive utterances

Experiments conducted We find that a lower \(\alpha \) value includes high-frequency oscillations in the modes estimated from the emotional speech signal. This is expected from a theoretical point of view, since \(\alpha \) is inversely proportional to the bandwidth of the modes. Precisely, a lower value of \(\alpha \) (< 5000) makes the bandwidth of the mode filters wider, so the estimated modes contain oscillations at frequencies higher than the fundamental frequency of the emotive utterance; this in turn degrades the epoch estimation performance. Thus, a higher value of \(\alpha \) is required to confine the decomposed modes to a frequency band close to the fundamental frequency range (80–400 Hz) of an adult human. Moreover, Yang et al. [52] report that the value of \(\alpha \) should be fixed based on the center frequency of interest.

Here, we compare the influence of lower and higher \(\alpha \) values on capturing a center frequency close to the average fundamental frequency. The experiments are performed on the ten sentences of EMO-DB spoken in the anger emotion by a female speaker. We varied the \(\alpha \) value from a low value (5000) to a high value (100,000) for the first iteration of VMD on each utterance. Then, we measured the deviation of the center frequency of the selected mode from the average \(F_{0}\) for each \(\alpha \) value considered, and computed the average deviation in center frequency (denoted \(\mathrm{CF}_\mathrm{{error}}\)) over the ten utterances. The results of these empirical studies are given in Table 1. From the results, it is evident that the deviation error is high for lower \(\alpha \) values. The deviation error reduces beyond \(\alpha = 50{,}000\) and attains its least value at \(\alpha = 100{,}000\). Hence, one can choose an \(\alpha \) value anywhere between 50,000 and 100,000 for capturing the required center frequency from the emotive utterance. In this work, we use an \(\alpha \) value of 100,000 for the first iteration of VMD on the emotional speech signal. However, if a high \(\alpha \) value is used for every iteration of VMD, the correct center frequencies of the modes are not captured. Therefore, we conducted pitch evaluation experiments to obtain the optimal \(\alpha \) combination for the iterative procedure. The experiments are conducted on a test sentence ('Das will sie am Mittwoch abgeben,' meaning 'She will hand it in on Wednesday') spoken in all emotions by 10 speakers. Reference pitch values are obtained by taking the inverse of the time interval between two successive ground truth epoch locations (manually labeled epochs). Pitch evaluation is then performed based on the gross error and mean absolute error measures.
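The first-iteration sweep summarized in Table 1 can be reproduced along the following lines. This is a hedged sketch assuming the vmd() helper from Sect. 2.1, an illustrative grid of \(\alpha \) values between the stated endpoints of 5000 and 100,000 (the exact grid used for Table 1 is not reproduced here), and an average F\(_0\) supplied by an external estimator such as fxrapt.

```python
import numpy as np

def cf_error_sweep(x, fs, avg_f0,
                   alphas=(5e3, 1e4, 2.5e4, 5e4, 7.5e4, 1e5)):
    """Deviation of the selected mode's center frequency from the average F0
    after one VMD pass, for each candidate alpha (cf. Table 1)."""
    errors = {}
    for a in alphas:
        _, wk = vmd(x, K=2, alpha=a)         # single (first) VMD iteration
        cf_sm = np.min(wk) * fs              # lower center frequency, in Hz
        errors[a] = abs(cf_sm - avg_f0)      # CF_error for this utterance
    return errors
```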

During the first experiment, we kept the same \(\alpha \) value of 100,000 for every iteration of VMD on the selected mode signal. Then, using the proposed method discussed in Sect. 2.2, epoch locations are identified. Further, pitch values are estimated by taking the inverse of the difference between two successive epoch locations. Mathematically, it is expressed as follows.

$$\begin{aligned} F_i(t) = \frac{1}{e_l(t+1) - e_l(t)} \end{aligned}$$
(10)

where \(F_i(t)\) represents the instantaneous pitch (fundamental frequency) of the tth glottal cycle and \(e_l(t)\) represents the epoch location at the beginning of that pitch period.
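Equation 10 amounts to differencing the epoch locations; a small sketch with epochs given as sample indices follows.

```python
import numpy as np

def instantaneous_pitch(epochs, fs):
    """Eq. (10): pitch per glottal cycle as the inverse of the interval
    between successive epochs (sample indices -> Hz)."""
    t = np.asarray(epochs, dtype=float) / fs   # epoch times in seconds
    return 1.0 / np.diff(t)                    # one F_i value per cycle
```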

Table 2 Empirical results of various experiments conducted on emotional speech signals for fixing \(\alpha \)
Fig. 4
Influence of the \(\alpha \) value on capturing the correct frequency of oscillation. a Voiced segment of an anger speech signal, corresponding b EGG signal, c selected mode after the first iteration of VMD using \(\alpha \) = 100,000, d VMD output signal using \(\alpha \) = 100,000 for successive iterations, e VMD output signal using \(\alpha \) = 10,000 for successive iterations

Finally, we computed the gross error and mean absolute error between the estimated and the reference pitch values. During the second experiment, we kept the \(\alpha \) value at 100,000 only for the first iteration of VMD on the emotional speech signal, and changed it to 75,000 for the successive iterations of VMD on the selected mode signal. During the third experiment, the \(\alpha \) value was changed to 50,000 for all successive VMD iterations after the first. For the fourth, fifth and sixth experiments, the \(\alpha \) value pairs used are (100,000, 25,000), (100,000, 10,000) and (100,000, 1000), respectively.

Furthermore, we conducted the same pitch evaluation experiments for a lower \(\alpha \) value of 2000, without iteration. The performance measures, gross error and mean absolute error, are observed to be very high compared with those of the optimal \(\alpha \) combination (100,000, 10,000).

Thus, applying VMD in a non-iterative manner with a low \(\alpha \) value is not effective in improving the epoch estimation performance. Further, we checked the feasibility of \(\alpha = 2000\) in an iterative manner; again, the error measures are on the higher side. Moreover, the number of VMD iterations required for a low \(\alpha \) value is always larger than for the proposed \(\alpha \) value pair. For instance, the maximum number of iterations recorded for \(\alpha = 2000\) in the pitch evaluation experiment is eight, whereas for \(\alpha \) = (100,000, 10,000) the number of iterations went up to a maximum of only three. Table 2 gives the gross error and mean absolute error obtained for the various experiments. From the results, it is evident that the \(\alpha \) value pair (100,000, 10,000) yields the least gross error and mean absolute error.

The influence of the \(\alpha \) value pairs (100,000, 100,000) and (100,000, 10,000) on capturing the fundamental frequency of oscillation in the emotional speech signal is demonstrated in Fig. 4. A voiced segment of an anger speech signal and the corresponding EGG signal are given in Fig. 4a, b. Figure 4c depicts the selected mode (the mode with the lower center frequency) after the first iteration of VMD with \(\alpha \) set to 100,000. By visual inspection of Fig. 4c, the selected mode is not close to the fundamental frequency of oscillation of the glottal wave; hence, the iteration continues on the selected mode. In the first case, we obtained the VMD output signal using an \(\alpha \) value of 100,000 for the successive VMD iterations; in the second case, we used a lower \(\alpha \) value of 10,000. Figure 4d shows the VMD output signal obtained using an \(\alpha \) value of 100,000 from the second iteration onwards; it is evident that the fundamental frequency of oscillation is not captured. In contrast, the VMD output signal obtained using an \(\alpha \) value of 10,000 (Fig. 4e) clearly captures the fundamental frequency of oscillation of the glottal wave.

3.2 Performance Evaluation of Epoch Estimation in Emotional Speech Signals Using the Proposed Method

To illustrate the proposed method for epoch estimation in the emotional speech signal, an anger speech signal is taken and decomposed into two modes with \(\alpha \) set to 100,000. The results of the decomposition are given in Fig. 5, where (a) shows a voiced segment of anger speech and (b)–(c) show the two modes obtained, with center frequencies of 230 and 570 Hz, respectively. The average fundamental frequency calculated using the fxrapt algorithm is approximately 193 Hz. Hence, the VMD iteration (\(\alpha \) = 10,000) continues on the mode with the lower center frequency (Fig. 5b). The modes obtained after the fourth iteration, with center frequencies of 198 and 184 Hz, are shown in Fig. 5d, e.

Fig. 5
Emotional speech signal decomposition using VMD. a Voiced segment of an anger speech signal, corresponding b, c mode 1 and mode 2 after the first VMD iteration, d, e mode 1 and mode 2 after the final VMD iteration

Fig. 6
Linear magnitude spectra corresponding to the modes selected after the first and final iterations of VMD on an anger speech segment. The waveform and corresponding linear magnitude spectrum of a, b a voiced segment of the anger speech signal, c, d the EGG segment corresponding to the anger speech, e, f the mode selected after the first VMD iteration, g, h the VMD output signal

The VMD iteration is now halted, since the center frequency of one of the modes has fallen below the average fundamental frequency. Finally, based on step 4 of the epoch estimation procedure described in Sect. 2.2, the combination of modes is taken as the VMD output signal. Figure 6 plots the linear magnitude spectra of the mode selected after the first iteration and of the VMD output signal. By visual inspection, the spectrum of the mode selected after the first iteration (Fig. 6f) shows spectral peaks beyond the fundamental frequency, whereas the spectrum of the VMD output signal shows only the spectral peak corresponding to the fundamental frequency.

Fig. 7
Illustration of epoch estimation in the emotional speech signal using the proposed method. a Voiced segment of an anger speech signal, corresponding b EGG segment, c DEGG signal with manually labeled reference epochs indicated using '\(\scriptstyle {\times }\)' (magenta), d VMD output signal from the EGG with reference epochs indicated using '\(\scriptstyle {+}\)' (blue), e VMD output signal. Estimated epochs are marked using 'o' (red) (Color figure online)

Precisely, the spectral peak in Fig. 6h resembles the spectral peak corresponding to the fundamental frequency of the glottal waveform (Fig. 6d). Analysis of the VMD output signal shows rapid changes around the positive-to-negative zero crossings. From Fig. 7, it is evident that the time instants corresponding to these rapid changes represent the epoch locations. Here, Fig. 7a depicts the same segment of the anger speech signal used in Fig. 5a. Figure 7b, c plots the corresponding EGG and DEGG waveforms, respectively. The reference epoch locations labeled using Wavesurfer are indicated in the DEGG signal as '\(\scriptstyle {\times }\)' (magenta). Besides, the reference epoch locations estimated from the EGG signal using VMD are marked as '\(\scriptstyle {+}\)' (blue) in the corresponding selected mode component (mode 2) [Fig. 7d]. It is observed that the positive-to-negative zero crossings [marked 'o' (red)] of the VMD output signal (Fig. 7e) closely coincide with the reference epoch locations shown in Fig. 7c, d. Hence, the time instants corresponding to the positive-to-negative zero crossings are identified as epoch locations.

VMD is suitable for extracting noise-robust components since it follows a Wiener filter structure [47]. However, we found that the signal-to-noise ratio should be at least 5 dB for reliable estimation of epochs as positive-to-negative zero crossings. This is validated by measuring the reliability of the proposed method for emotive speech with additive noise at SNR levels from 0 to 30 dB. Firstly, we briefly describe the measures used for testing the reliability and accuracy of the proposed method.

Performance measures Performance evaluation is carried out on the voiced regions of the speech signal by defining the larynx cycle as in [36].

If the rth reference epoch occurs at \({e_r}\), then the larynx cycle is defined as the range of samples \((1/2)({e_{r - 1}} + {e_r}) \le n \le (1/2)({e_r} + {e_{r + 1}})\). Based on the larynx cycle, two sets of measures are defined for evaluating the reliability and accuracy of the proposed method. The first set includes the following.

  1. Identification rate (IDR) The percentage of larynx cycles for which exactly one epoch is detected.

  2. Miss rate (MR) The percentage of larynx cycles for which no epoch is detected.

  3. False alarm rate (FAR) The percentage of larynx cycles for which more than one epoch is detected.

The IDR, MR and FAR quantify the reliability of epoch estimation. The second set includes the following.

  1. Identification error (\(\zeta \)) The timing error between the reference epoch and the estimated epoch in larynx cycles for which exactly one epoch is identified.

  2. Identification accuracy (IDA, in 'ms') The standard deviation of the identification error \(\zeta \).

  3. Accuracy to ±0.25 ms (IDA to ±0.25 ms, in '\(\%\)') The percentage of larynx cycles for which exactly one epoch is identified and \(\zeta \) is within ±0.25 ms.

IDA in 'ms' and in '\(\%\)' quantify the accuracy of epoch estimation. For IDA in 'ms', a lower value indicates higher accuracy. Further, IDA in '\(\%\)' is measured as follows.

$$\begin{aligned} \text {IDA in } \text {`}\%\text {'} = \frac{\text {Number of epochs having } \zeta \text { within } \pm 0.25\ \text {ms}}{\text {Total number of correctly identified epochs}} \times 100 \end{aligned}$$
(11)
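The larynx-cycle measures defined above translate directly into code. The sketch below computes IDR, MR, FAR, IDA in ms and IDA to ±0.25 ms from reference and estimated epoch sample indices, under the stated larynx-cycle definition; the handling of the first and last reference epochs (which lack a complete cycle) is an implementation choice.

```python
import numpy as np

def epoch_eval(ref, est, fs, tol_ms=0.25):
    """Reliability (IDR/MR/FAR) and accuracy (IDA in ms, Eq. 11) measures."""
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    hits = miss = false_alarm = 0
    errors_ms = []
    for r in range(1, len(ref) - 1):
        lo = 0.5 * (ref[r - 1] + ref[r])       # larynx cycle around ref[r]
        hi = 0.5 * (ref[r] + ref[r + 1])
        inside = est[(est >= lo) & (est < hi)]
        if len(inside) == 0:
            miss += 1
        elif len(inside) > 1:
            false_alarm += 1
        else:
            hits += 1
            errors_ms.append(1000.0 * (inside[0] - ref[r]) / fs)
    n_cycles = len(ref) - 2                    # cycles with both neighbors defined
    errors_ms = np.array(errors_ms)
    idr = 100.0 * hits / n_cycles
    mr = 100.0 * miss / n_cycles
    far = 100.0 * false_alarm / n_cycles
    ida_ms = errors_ms.std()                                # IDA in ms
    ida_pct = 100.0 * np.mean(np.abs(errors_ms) <= tol_ms)  # Eq. (11)
    return idr, mr, far, ida_ms, ida_pct
```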

3.2.1 Assumption on Signal-to-Noise Ratio

We identified the positive-to-negative zero crossings of the VMD output signal as epochs. However, the VMD method is sensitive to noise. Hence, we conducted the following experiments to check the conditions under which this assumption holds.

The experiments are conducted on a test sentence ('Das will sie am Mittwoch abgeben,' meaning 'She will hand it in on Wednesday') spoken in all emotions by a female speaker. In the first experiment, we calculated the average IDR and the average deviation in center frequency (from the average \(F_{0}\) of each utterance) for the clean emotive signals; the results obtained are 93.91\(\%\) and 5.90 Hz, respectively. In the subsequent experiments, we added white Gaussian noise at different SNR levels to the emotive utterances and recalculated both measures for each SNR value considered. The results are given in Table 3. From the table, it is clear that the identification rate reduces by more than 2\(\%\) for SNR values below 5 dB. Also, the deviation in center frequency from the average \(F_{0}\) (denoted \(\mathrm{CF}_\mathrm{{error}}\)) is larger for SNR levels below 5 dB. Therefore, we conclude that the identification of epoch locations as positive-to-negative zero crossings is robust to white Gaussian noise down to an SNR level of 5 dB.

Table 3 Empirical results of the test for robustness to additive noise

3.2.2 Performance Evaluation Using Manual Reference and VMD-Based Reference

Now, the performance of epoch estimation in the emotional speech signal is evaluated across the six basic emotions (boredom, disgust, fear, anger, sad and happiness) taken from the German emotional speech corpus (EMO-DB). The results of the performance evaluation are given in Table 4. We provide results based on both the manual reference and the VMD-based reference. From the results, we can observe that the IDR values are lower for highly aroused emotions such as anger and happiness. This is due to the reduced strength of excitation in the speech signal for these emotions, which might have reduced the energy associated with the modes, leading to spurious epochs (as indicated by the higher FAR values). In contrast, the IDR values of emotions with lower loudness levels (boredom, sad and disgust) are on the higher side. Furthermore, the IDR values obtained using the manual references and the VMD-based references show small deviations for the same emotion.

Table 4 Performance evaluation of the proposed method for epoch estimation in emotional speech signals

For example, the IDR of boredom for the VMD-based reference is lower than that for the manually labeled reference, whereas the IDR of happy for the VMD-based reference is higher. This inconsistent variation is due to the following two reasons.

  • The epochs in the voicing offset regions of the EGG signal are clearly identified using VMD [22], whereas the manual method fails to identify epochs at the end of a voiced segment where the DEGG signal is almost zero.

  • Again, manual reference epoch creation is prone to human error.

Therefore, the number of reference epochs considered for performance evaluation differs between the two cases. This in turn results in a small difference (around 1–1.5\(\%\)) in the performance in terms of IDR.

Table 5 Performance comparison results of epoch estimation in emotional speech signals using manually labeled reference
Table 6 Performance comparison results of epoch estimation in emotional speech signals using VMD-based reference

Among the accuracy measures IDA in 'ms' and IDA in '\(\%\)', the latter seems to be the better measure for discussing the accuracy of the proposed method, because it identifies the percentage of epochs estimated within ±0.25 ms of the reference epochs. From the results, it appears that the accuracy of epoch estimation decreases as the arousal level increases: the accuracy for boredom, sad and disgust is on the higher side compared to anger, fear and happy. A similar trend can be observed in the IDA measure in 'ms'; that is, IDA (ms) values are lower (indicating higher accuracy) for low-arousal emotions and vice versa. Further, the difference in accuracy between the manual and VMD-based references is due to the difference in IDR obtained for the same emotion category; for a lower IDR, the chance of an increase in accuracy is higher.

The general observation is that the average reliability and accuracy of the proposed method are almost equal for both types of reference. The results confirm the effectiveness of using VMD-based reference epochs from the EGG signal for the performance evaluation of the proposed method.

3.3 Performance Comparison of the Proposed Method with Existing Methods

The performance of the proposed method for epoch estimation in emotional speech signals is compared with popular methods: ZFF, SEDREAMS, DYPSA, MMF, GEFBA and modified ZFF. All methods are evaluated in the voiced regions of the emotional speech signal using both the manually labeled reference epochs and the epochs estimated from the EGG signals using the VMD algorithm [33]. The comparative results obtained for the manual and algorithmic references are shown in Tables 5 and 6, respectively.

The IDR of all methods decreases as the arousal level of the emotion increases; that is, the reliability is lower for anger and happy than for boredom and sad. SEDREAMS and GEFBA perform well on the boredom and sad emotions but show reduced IDR performance on the other emotions, owing to the rapid changes in the glottal excitation characteristics of highly aroused emotions. Among the methods compared, DYPSA and MMF show the lowest IDR performance in all emotions. The standard ZFF method performs better than SEDREAMS, GEFBA, DYPSA and MMF, but worse than the modified ZFF approach. This is because the ZFF method uses a single window length, based on the average pitch period, to obtain the trend-removed signal; with a fixed window length, the chances of spurious or missed epochs in the zero frequency filtered signal are higher. In contrast, the proposed method uses the average fundamental frequency only to check whether the iteration has brought the decomposed modes close to the fundamental frequency of oscillation defined for each glottal cycle. The center frequency of a decomposed mode is controlled only by the tuning parameters of VMD, and proper selection of these parameters helps the VMD output signal oscillate at the fundamental frequency defined for each glottal cycle of the emotional speech signal. This in turn improves the reliability of the proposed method. Precisely, the IDR of the proposed method is higher than that of the five standard methods (SEDREAMS, GEFBA, DYPSA, MMF and ZFF) for highly aroused emotions.

Further, the m-ZFF method gives better reliability in epoch identification across the various emotions. This improved performance of m-ZFF is due to the local pitch period oscillations in the refined ZFF signal. The proposed method gives a close match in IDR performance to the m-ZFF method, especially in emotions such as anger, happy, fear and sad, because the VMD output signal also oscillates close to the fundamental frequency. The average reliability of the proposed method is comparable with that of the m-ZFF approach.

The identification accuracy (in terms of IDA in 'ms' and IDA in '\(\%\)') shows a similar trend with respect to arousal level: the accuracy of all methods is higher for boredom and sad than for anger and happy. Among the seven methods, the standard ZFF method shows the best identification accuracy for boredom and sad.

The accuracy of the proposed method is slightly lower than that of the standard ZFF method for the boredom and sad emotions; however, the proposed method outperforms the standard ZFF method in the other emotions. Further, SEDREAMS, DYPSA and MMF show reduced epoch identification accuracy in all emotion categories. GEFBA shows slightly better identification accuracy for the sad emotion (with the manual reference); however, this increase in accuracy is due to its decrease in IDR. In contrast, the proposed method shows almost equivalent IDA performance with a better identification rate for the sad emotion.

The m-ZFF approach gives better identification accuracy across the highly aroused emotions (anger, fear, happy and disgust) compared to the other methods; this is attributed to its extraction of impulsive excitations directly from the emotive utterance. Furthermore, the accuracy of the proposed method is higher than that of the m-ZFF method by 1–3\(\%\) for the low-arousal emotions (boredom and sad), whereas m-ZFF outperforms the proposed method in the highly aroused emotions, with a deviation in accuracy of around 1–3\(\%\).

In summary, we can conclude that, in terms of identification accuracy, the proposed method is superior to the five other methods (all except m-ZFF) for highly aroused emotions. The average identification accuracy of the proposed method is comparable with that of the m-ZFF approach.

Figure 8 depicts the histogram of the epoch timing error averaged over the emotional database for the proposed method and the m-ZFF approach. The peaks of the distributions are concentrated near the origin, and the histogram of the proposed method is similar to that of the m-ZFF approach.

Even though the m-ZFF approach provides a slightly better epoch estimation performance than the proposed method, it suffers from the following disadvantages [16].

  1. m-ZFF uses block processing, which introduces unwanted spectral leakage during the post-processing of the ZFF signal.

  2. The local pitch (F\(_0\)) value obtained for each frame is crucial in the estimation of epochs. A small change in the estimated F\(_0\) will degrade the performance.

The proposed method holds an advantage over m-ZFF in that it does not use any kind of block processing: the entire emotional speech signal is processed at once to estimate the epoch locations, so artifacts due to block processing and windowing are avoided. Further, the miss rate of the proposed method is lower than that of the m-ZFF approach.

3.3.1 Performance Comparison of the Proposed Method with m-ZFF Method for Degraded Emotive Speech Signals

We evaluated the robustness of the proposed method and the m-ZFF approach under different noise degradations. The evaluation is done by calculating the IDR over the entire database in the presence of three additive noises (white, babble and pink) taken from the NOISEX database [37] at SNR levels of 0 and 10 dB. The average results obtained for the different noise degradations are given in Table 7. From the results, it is evident that the proposed method gives a significantly higher identification rate than m-ZFF for each of the noise degradations considered. The improved performance of the proposed method is attributed to the selection of a noise-robust VMD output signal for epoch estimation: VMD embeds Wiener filtering in updating the modes directly in the frequency domain [10], which enables the extraction of noise-robust modes [14]. In contrast, the m-ZFF method only refines the conventional ZFF signal by block processing and re-filtering; therefore, the major issues associated with the ZFF method (such as speech contaminated with interference from other speakers, spurious impulse-like sequences and so on) [35] prevail in the m-ZFF method as well. This results in reduced epoch estimation performance of the m-ZFF method for degraded emotive speech signals.

Fig. 8
Histogram of the epoch timing error averaged over six different emotions. a Proposed method (IDA to ± 0.25 ms is 61.42\(\%\)), b m-ZFF method (IDA to ± 0.25 ms is 61.84\(\%\))

Table 7 Performance comparison in terms of IDR for the m-ZFF and the proposed method over the total database in noise-degraded conditions at SNR levels of 0 and 10 dB

4 Conclusion

This paper proposes a novel method for the estimation of epoch locations from the emotive utterance. The proposed approach benefits from the effectiveness of the VMD algorithm in decomposing the emotional speech signal into modes with correct center frequencies. Finally, the decomposed modes are analyzed for the estimation of epochs.

The major contributions of the proposed work are:

  • Effective utilization of the VMD algorithm in capturing the glottal source characteristics of the emotive speech utterances.

  • Reliable estimation of epoch locations from clean and noise-degraded emotional speech signal using center frequency criterion of VMD.

We show that applying the VMD algorithm iteratively to the emotional speech signal helps capture the required center frequency of the mode. The center frequency of the selected mode is found to be close to the fundamental frequency of the glottal excitation signal. This is significant because the epochs occur at a fundamental frequency defined for each glottal cycle. The center frequency characteristic of the corresponding mode is utilized for reliable and accurate estimation of epoch locations, which are hypothesized as the positive-to-negative zero crossings of the VMD output signal.

We evaluated the performance of the proposed method in terms of identification rate and identification accuracy on emotional speech signals taken from the German emotional database. Further, we compared the effectiveness of the proposed method with state-of-the-art epoch estimation methods. The performance comparison results show that the proposed method is almost as reliable and accurate as the m-ZFF approach for clean speech signals. Besides, we show that the proposed method outperforms m-ZFF in the presence of additive noise degradations. Therefore, the proposed method can serve as a better approach toward epoch estimation in emotive speech degraded with additive noise. Moreover, it can be used as a tool for accurate emotion analysis by deriving instantaneous pitch contours from the estimated epochs.

Furthermore, like the other methods, the reliability of the proposed method is lower for highly aroused emotions such as anger and happiness. The reduced strength of excitation in these emotions might have reduced the energy associated with the decomposed modes, which in turn results in spurious epoch estimates. Future work will explore suitable modifications of the proposed method to address this limitation.