1 Introduction

Vowels are speech sounds produced with an open configuration of the vocal tract and with vibration of the vocal cords. The air stream is not severely obstructed in this process. Vowels are primary units of the sound system of a language [1]. Vowel region detection is the task of identifying vowel occurrences in continuous speech with precise boundary markings. The vowel onset point (VOP) is the time instant at which a vowel region begins, and the vowel end-point (VEP) is the time instant at which the vowel region ends. These two landmarks constitute the vowel boundaries. The speech signal is non-stationary, and its generation can be viewed as a sequence of changes known as events. VOP and VEP are two such acoustic events.

Precise vowel region detection in continuous speech matters in many applications [5]. Accurate vowel regions are needed in automatic speech recognition, smart audio filtering, recognition of CV units in Indian languages, determination of vowel durations, speech rate manipulation and multimedia synchronization [3, 9, 11, 12]. Knowledge of phonetic classes can be integrated into statistical automatic speech recognition (ASR) systems to improve their overall performance. This mechanism can provide a platform for ASR systems to use a segment-based approach instead of the dominant frame-based HMM models [4, 13]. The motivation for the proposed work is that the performance of the above-mentioned applications depends on the accurate estimation of vowel boundaries.

In Indian languages, most syllables are of the consonant-vowel (CV) type. A vowel region starts at the vowel onset point and ends at the vowel end-point. In the literature, the vowel region is taken as a fixed 100 ms duration from the vowel onset point. In addition, VOP and VEP detection methods are prone to speech production uncertainties. A robust vowel region detection method that addresses both the fixed-duration assumption and the speech production uncertainties is therefore required. In this paper, an improved method is presented for the detection of vowel regions using acoustic features. The proposed method is carried out at two levels. At the first level, the landmarks (vowel onset point and vowel end-point) are detected under the assumption that vowels exhibit high sonority and loudness. A combined approach, built on previous works, uses features of the zero frequency filtered signal for vowel region detection. The landmarks are detected from the variation in spectral content intensity around the glottal closure instants of the zero frequency filtered speech signal. The zero frequency filtered speech signal around the glottal closure regions is processed to bring out the information pertaining to the vowel regions. At the second level, two cues, the uniformity of epochs and the strength of the excitation of the speech signal, are used to eliminate spurious vowel regions and to correct the onset points and end-points of vowels.

The rest of the paper is organized as follows. Related work on vowel region detection is briefly discussed in Section 2. Section 3 describes baseline methods for detecting vowel regions using vowel onset points and end-points. The zero frequency filtering technique and the extraction of the strength of the excitation of the speech signal are also briefly described in that section. The proposed method for vowel region detection is described in Section 4. The performance of the proposed method is studied using the TIMIT acoustic-phonetic speech corpus, and the results are discussed in Section 5. Section 6 provides the summary and conclusions of this work.

2 Related works

Acoustic-phonetic feature based landmark detection has received considerable attention in recent years. In the literature, the problem of detecting vowel onset points and vowel end-points has been studied through the extraction of distinctive features of the speech signal. Several vowel onset point detection methods based on acoustic features of the speech signal have been proposed, and a few of them are reviewed here [8, 10, 14, 16]. In [8], VOPs are detected from the excitation information of the speech signal. The speech signal is processed using linear prediction (LP) analysis to extract the LP residual. The Hilbert envelope of the LP residual is used as evidence to detect VOPs, as this signal contains the excitation information of the speech signal. Vowel onset point detection methods have also been proposed based on the rising slope of spectral amplitude in the magnitude spectrum of the speech signal. A method combining evidence from the excitation source, spectral peaks and modulation spectrum has been proposed for the detection of vowel onset points [10]. Each of these pieces of evidence carries complementary information pertaining to the VOPs, and their combination can enhance vowel detection performance. Vowel onset point detection for vowel region extraction has been proposed based on the spectral energy present in the glottal closure regions of the speech signal [14, 16]. In this method, formants are extracted using the group delay function, and glottal closure instants are extracted using the zero frequency filtering technique. The spectral energy computed around the glottal closure instants is high and robust, and is therefore used as evidence for detecting the VOPs. The locations of the vowel onset points were corrected based on the uniformity of epoch intervals in the vowel region [15]. In another attempt, two methods were proposed to detect vowel end-points from the speech signal [18]. The first method uses the combined evidence from the excitation source, spectral peaks and modulation spectrum to determine the vowel end-point. In the second method, the spectral energy within the glottal closure region is used to determine the vowel end-point. In both methods, the enhanced evidence is convolved with a first order Gaussian difference (FOGD) operator to detect the locations of the VEPs. The region between a vowel onset point and the following vowel end-point is considered a vowel region. These methods report that most vowel onset and end-points fall within a 40 ms deviation. Recently, a technique was proposed [6] that improves the detection of vowel onset points and vowel end-points using a three-class classifier with a front-end feature extraction technique that exploits spectral and temporal characteristics of the excitation source information of the speech signal.

3 Baseline methods for VOP and VEP detection techniques, Zero frequency filtering and strength of the excitation of the speech signal

Vowel boundaries start with the vowel onset point and end with the vowel end-point. In this section, state of the art methods for vowel detection using VOPs and VEPs in continuous speech are discussed. Two vowel region detection methods are formulated using the detection of VOPs and VEPs. These methods are based on the previous works in [10, 14, 18]. The experimental results obtained for the proposed method are compared with these methods. In addition, we briefly discuss the speech signal processing technique known as zero frequency filtering and the strength of the excitation of the speech signal.

3.1 VOP region detection by combining the evidence from excitation source, spectral peaks and modulation spectrum

The combined evidence for the boundary markings of a vowel region is derived from three parameters, namely the excitation source, spectral peaks and modulation spectrum [10]. The LP residual corresponds to the excitation source information and is a bipolar signal. The Hilbert envelope of the linear prediction (LP) residual is estimated, which is unipolar. The smoothed Hilbert envelope of the LP residual is obtained by convolving it with a Hamming window of 50 ms. This is enhanced using the first order difference (FOD) to improve the detection of the VOP. The enhanced signal is convolved with the first order Gaussian difference (FOGD) operator, and the convolved output is the evidence for the VOP. The event is identified from the nature of the gradient of this output signal.
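The excitation-source evidence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [10]: it assumes NumPy, estimates one set of LP coefficients over the whole signal with the autocorrelation method (practical systems usually work frame by frame), and the function names `lp_residual`, `hilbert_envelope` and `smooth_envelope` are hypothetical.

```python
import numpy as np

def lp_residual(s, order=10):
    """LP residual via the autocorrelation method (whole-signal analysis, an assumption)."""
    s = np.asarray(s, dtype=np.float64)
    # Autocorrelation values r[0..order]
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], with R a Toeplitz matrix built from r
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # Predict each sample from the previous `order` samples
    pred = np.zeros_like(s)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * s[:-k]
    return s - pred  # bipolar residual

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT (unipolar)."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def smooth_envelope(envelope, fs, window_ms=50.0):
    """Smooth with a unit-gain Hamming window of window_ms milliseconds."""
    length = max(1, int(round(fs * window_ms / 1000.0)))
    window = np.hamming(length)
    return np.convolve(envelope, window / window.sum(), mode="same")
```

A call such as `smooth_envelope(hilbert_envelope(lp_residual(s)), fs)` gives the smoothed contour to which the FOD and FOGD steps are then applied.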

A 256-point DFT of the speech signal, computed over 20 ms frames with 50% overlap, gives the amplitude spectrum. The ten largest spectral peaks among the first 128 points are summed and plotted as a function of time, and this sum represents the energy of the spectral peaks. The VOP can be observed, in a complementary manner, as a significant change in this time-varying signal. The change at the VOP is enhanced in the spectral energy signal using FOD. The variations that represent the VOP are extracted using the FOGD operator.
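The spectral-peak energy contour can be sketched as below. This is an illustrative reading of the description, assuming NumPy and a hypothetical function name `spectral_peak_energy`; a "peak" is taken here to be a local maximum of the magnitude spectrum, which the text does not spell out. Note that `np.fft.fft` crops a frame longer than `n_fft` samples, so the example uses an 8 kHz sampling rate (an assumption), where a 20 ms frame has 160 samples.

```python
import numpy as np

def spectral_peak_energy(signal, fs, frame_ms=20.0, overlap=0.5, n_fft=256, n_peaks=10):
    """Per-frame sum of the n_peaks largest local maxima in the first n_fft // 2 DFT bins."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    contour = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start : start + frame_len]
        # Zero-pads (or crops) the frame to n_fft points before the DFT
        mag = np.abs(np.fft.fft(frame, n=n_fft))[: n_fft // 2]
        # Local maxima: larger than the left neighbour, not smaller than the right one
        interior = mag[1:-1]
        is_peak = (interior > mag[:-2]) & (interior >= mag[2:])
        peaks = interior[is_peak]
        # Sum of the largest n_peaks peaks (fewer if the frame has fewer peaks)
        contour.append(np.sort(peaks)[-n_peaks:].sum())
    return np.asarray(contour)
```

With 50% overlap, one contour value is produced every 10 ms; a silent frame yields zero and a voiced frame a large value, so vowel onsets show up as rises in the contour.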

The information in the speech signal can also be represented by changes in the time domain envelope, which is referred to as the amplitude-modulation spectrum or simply the modulation spectrum. The modulation spectrum of speech is dominated by low-frequency components. The speech signal is analyzed using 18 trapezoidal critical band filters between 0 and 4 kHz. An amplitude envelope signal is computed in each band using half wave rectification and low pass filtering. The amplitude envelope signals are down-sampled to 80 samples/s and normalized by the average envelope of that channel, measured over the entire utterance. The modulations of the normalized envelope signals are analyzed by computing the DFT over 250 ms with an overlap of 5% in order to capture the dynamic properties of the signal. The 4-16 Hz components are added together across all critical bands to derive the modulation spectrum energy. The resulting signal is enhanced using FOD and then processed with the FOGD operator to obtain the evidence for the VOP. This method combines three independent and complementary pieces of evidence into a single combined evidence, obtained by adding the three sample by sample. The VOP detection performance of the combined evidence is better than that of any individual evidence. The positive peaks of the combined evidence are marked as VOPs in the speech signal.

3.2 VEP detection by combining the evidence from excitation source, spectral peaks and modulation spectrum

The time instant at which the offset of a vowel takes place in the speech signal is known as the vowel end-point [18]. The individual pieces of vowel evidence are derived from the excitation source, spectral peaks and modulation spectrum respectively. Each is enhanced and convolved with the FOGD operator, that is, the first order difference of a Gaussian window. The three convolved outputs are then added sample by sample to obtain the combined evidence. The negative peaks in the combined vowel evidence signal are hypothesized as the locations of vowel end-points.

Vowel regions are detected by locating the VOPs and VEPs using the evidence from the excitation source, spectral peaks and modulation spectrum. This vowel region detection method is termed the COMB method in this paper. Figure 1 shows the experimental result of the COMB method for the speech utterance /“personal predispositions tend to blunt”/. Figure 1a shows the continuous speech utterance with the ground truth. The combined evidence is shown in Fig. 1b. Figure 1c, d, e and f show the mean smoothed evidence contour, the enhanced evidence using the first order difference, the VOP and VEP markings after convolving with the FOGD operator, and the prediction, respectively. This method produced a high false alarm rate due to spurious peaks.

Fig. 1 Detection of vowel regions using the COMB method for the speech utterance /“personal predispositions tend to blunt”/. a Speech signal with phonemic marking of vowel regions given in the TIMIT acoustic-phonetic speech corpus. b Combined evidence. c Mean smoothed evidence contour. d Enhanced evidence using FOD. e Hypothesized VOPs and VEPs for the speech signal. f Predicted vowel regions

3.3 VOP detection using spectral energy around glottal closure instants

The vocal tract is completely isolated from the trachea and lungs during the glottal closure phase. Spectrum estimation during the glottal closure phase is more accurate, as the true vocal tract resonances are present during this period. In this method, the spectral energy at the glottal closure instants is used as evidence to detect VOPs in continuous speech [14]. First, the glottal closure instants are extracted using zero frequency filtering. Around the glottal closure instants, formants are computed for 30% of the speech samples using the group delay function. The spectral energy of the speech signal is derived from the first three formants and plotted as a function of time. This contour is smoothed using a mean smoothing window of 50 ms and enhanced using the FOD operator. Significant changes in the spectral characteristics of the enhanced signal are detected by convolving it with a FOGD operator of 100 ms length and 25 ms standard deviation. After the spurious peaks are eliminated, the positive peaks of this signal represent the locations of the VOPs.

3.4 VEP detection using spectral energy around glottal closure instants

In [18], another method of VEP detection is also proposed. The spectral energy contour of the speech signal is derived from the first three formants around the glottal closure region. This is enhanced using the first order difference function and convolved with a FOGD operator of 100 ms length and 25 ms standard deviation. The convolved output represents the vowel evidence. The spurious negative peaks of this evidence signal are removed using a distance criterion, and the remaining negative peaks represent the locations of the VEPs.

The regions between the VOPs and VEPs obtained using spectral energy are marked as vowel regions. This method is referred to as the FGCI method in this paper. Figure 2 shows the result of the FGCI method for the speech utterance /“personal predispositions tend to blunt”/. Figure 2a, b, c, d, e and f show the continuous speech utterance, the sum of the first three formant peaks, the mean smoothed evidence contour, the enhanced evidence using the first order difference operator, the VOP and VEP markings after convolving with the FOGD operator, and the prediction, respectively.

Fig. 2 Detection of vowel regions using the FGCI method for the speech utterance /“personal predispositions tend to blunt”/. a Speech signal with phonemic marking of vowel regions given in the TIMIT acoustic-phonetic speech corpus. b Combined evidence. c Mean smoothed evidence contour. d Enhanced evidence using FOD. e Hypothesized VOPs and VEPs for the speech signal. f Predicted vowel regions

3.5 Zero frequency filtering and extraction of strength of the excitation information

Zero frequency filtering of a speech signal highlights the information in the lower frequency bands [7, 20]. Consider a speech signal s(n). High frequency boosting is first performed by differencing, as it is noted that higher frequencies are more important for signal disambiguation than lower frequencies.

$$ x\left( n\right)=s\left( n\right)-s\left( n-1\right) $$
(1)

The signal x(n) is fed to a resonator centered at 0 Hz, realized as a cascade of two ideal second order digital resonators at zero frequency. The output of this cascade is computed as

$$ y\left( n\right)=\sum\limits_{k=1}^{4}\alpha_{k}y\left( n-k\right)+x\left( n\right) $$
(2)

where \(\alpha_{1}=4\), \(\alpha_{2}=-6\), \(\alpha_{3}=4\) and \(\alpha_{4}=-1\). The transfer function of the system is given by

$$ H\left( z\right)=\frac{1}{\left( 1-z^{-1}\right)^{4}} $$
(3)
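As a check, expanding the denominator of Eq. (3) recovers the coefficients of the difference equation in Eq. (2):

$$ \left( 1-z^{-1}\right)^{4}=1-4z^{-1}+6z^{-2}-4z^{-3}+z^{-4}\quad\Rightarrow\quad y\left( n\right)=4y\left( n-1\right)-6y\left( n-2\right)+4y\left( n-3\right)-y\left( n-4\right)+x\left( n\right) $$

Each factor \(1/\left(1-z^{-1}\right)\) is a running sum, so the filter is equivalent to four accumulators in cascade.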

The output y(n) of this filter grows as a polynomial function of time. This trend can be removed using a trend removal operation, which subtracts the local mean of the output signal at every instant of time. This is represented by the following expression.

$$ \hat{y}\left( n\right)=y\left( n\right)-\tilde{y}\left( n\right) $$
(4)

where \(\tilde{y}\left(n\right)=\frac{1}{2N+1}\sum_{m=-N}^{N}y\left(n+m\right)\). Here 2N + 1 is the size of the window used for computing the local mean, which is typically the average pitch period. The resulting output signal is called the zero frequency filtered (ZFF) signal. The strength of the excitation (SoE) can be computed from the zero frequency filtered signal [19]. The negative to positive zero crossings of the zero frequency filtered signal correspond to the glottal closure instants (GCIs). The gradient of the zero frequency filtered signal at each GCI is termed the SoE [2, 17, 19]. The SoE contour can be used as complementary evidence for finding the positions of VOPs and VEPs.
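Equations (1) to (4), together with the GCI and SoE definitions, can be sketched as follows. This is a minimal illustration assuming NumPy; the function names, the 10 ms default pitch period and the repetition of the local-mean subtraction (`n_passes`) are assumptions made for this sketch and are not specified in the text. Samples within `n_passes * N` of either end are affected by zero padding and should be ignored.

```python
import numpy as np

def zero_frequency_filter(s, fs, mean_pitch_ms=10.0, n_passes=3):
    """ZFF signal following Eqs. (1)-(4); mean_pitch_ms and n_passes are assumed values."""
    s = np.asarray(s, dtype=np.float64)
    # Eq. (1): first order difference x(n) = s(n) - s(n-1)
    x = np.diff(s, prepend=s[0])
    # Eqs. (2)-(3): 1 / (1 - z^-1)^4 is four running sums in cascade
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)
    # Eq. (4): subtract the local mean over 2N + 1 samples (about one pitch period)
    N = int(round(fs * mean_pitch_ms / 1000.0 / 2.0))
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    for _ in range(n_passes):  # repeated so that the polynomial trend is fully removed
        y = y - np.convolve(y, kernel, mode="same")
    return y

def gci_and_soe(zff):
    """GCIs as negative-to-positive zero crossings; SoE as the slope at each GCI."""
    zff = np.asarray(zff, dtype=np.float64)
    gci = np.flatnonzero((zff[:-1] < 0) & (zff[1:] >= 0)) + 1
    soe = zff[gci] - zff[gci - 1]
    return gci, soe
```

For a synthetic excitation of negative impulses spaced one pitch period apart, the detected GCIs fall within a few samples of the impulses and the intervals between them equal the pitch period, which is the epoch-interval information used later for post processing.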

4 Proposed vowel region detection method

This section describes an improved method to detect vowel regions in continuous speech. Zero frequency filtering serves as a tool that emphasizes the low-frequency content of the speech signal. The sharp rise and fall of energy around the glottal closure instants can be interpreted as VOPs and VEPs. The proposed method for vowel region detection is implemented in two stages. In the first stage, vowel onset and end-points are detected from the zero frequency filtered speech signal. In the second stage, the positions of the VOPs and VEPs are corrected and spurious vowel regions are removed. The steps involved in the proposed method are illustrated in Fig. 3.

Let x(n), n = 0, 1, 2, ..., N - 1 be a zero frequency filtered speech signal. This signal is viewed as a set of segments, possibly overlapping, of length L, with the starting points of the segments D units apart. A window w(n) of size L is used to select each speech segment for processing. It can be written as follows

$$ x_{p}(n)\equiv x\left( n+(p-1)D\right) w(n),\quad p=1,2,\ldots,P $$
(5)

where P denotes the number of segments of the speech signal under consideration. The entire speech signal in terms of overlapping segments can be represented as

$$ x(n)\equiv\{x_{p}(n)\},\quad p=1,2,\ldots,P $$
(6)

In this work, the speech signal is processed in blocks of 20 ms with a shift of 10 ms. For each segment of length L, a 256 point discrete Fourier transform (DFT) is computed in the following manner

$$ X_{p}(k)=\frac{1}{L}\sum\limits_{n=0}^{L-1}x_{p}(n)\exp\left( -j\frac{2\pi nk}{L}\right),\quad k=0,1,2,\ldots,L-1 $$
(7)

The ten largest peaks in the magnitude spectrum of each block are selected, and their sum is plotted as a function of time. This gives the spectral energy contour of the speech signal. The VOPs and VEPs are characterized by the spectral peaks and valleys of this contour. The energy contour is enhanced by computing its slope using the first order difference function. A peak finding method using the first order Gaussian differentiator (FOGD) is used for automatic detection of the peaks and valleys of the spectral energy contour of the speech signal. An M-point Gaussian window function is defined as

$$ w(n)=\exp\left( -\frac{1}{2}\frac{(n-M/2)^{2}}{\sigma^{2}}\right),\quad n=0,1,2,\ldots,M $$
(8)

where \(M/2\) is the mean and \(\sigma^{2}\) is the variance. The derivative of the Gaussian, which gives the slope of a series at each sample, is computed as

$$ w_{d}(n)=w(n+1)-w(n) $$
(9)

The Gaussian window and its first order derivative are shown in Fig. 4. The convolution of the FOGD function with the spectral contour highlights the positive and negative slopes in the contour. A positive slope in the contour appears as a peak in the output, and a negative slope appears as a valley. These peaks and valleys correspond to the VOPs and VEPs in the speech signal. These landmarks are susceptible to the uncertainties associated with speech production and noise. This procedure is termed the ZFF method in this paper. Experimental results for a speech utterance are shown in Fig. 5. The zero frequency filtered signal, the smoothed contour of spectral peaks, the enhanced FOD signal and the FOGD operated signal are shown in Fig. 5b, c, d and e respectively. The vowel regions detected by the ZFF method (first stage) are not in line with the phonemic marked vowel regions of the TIMIT acoustic-phonetic speech corpus. The prediction signal contains spurious and misaligned vowel regions, as depicted in Fig. 5f.
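Equations (8) and (9) and the peak and valley picking can be sketched as follows. This is an illustrative sketch assuming NumPy; the function names and the relative threshold used to suppress small ripples are hypothetical, and the operator length and spread are given in contour samples (frames), which is an assumption about units.

```python
import numpy as np

def fogd_operator(M, sigma):
    """First order Gaussian difference: Eq. (8) followed by Eq. (9)."""
    n = np.arange(M + 1)
    w = np.exp(-0.5 * (n - M / 2.0) ** 2 / sigma ** 2)  # Gaussian window, Eq. (8)
    return w[1:] - w[:-1]                               # w(n+1) - w(n), Eq. (9)

def fogd_evidence(contour, M, sigma):
    """Convolve the energy contour with the FOGD operator."""
    contour = np.asarray(contour, dtype=np.float64)
    return np.convolve(contour, fogd_operator(M, sigma), mode="same")

def pick_landmarks(evidence, rel_threshold=0.3):
    """Peaks above and valleys below a relative threshold (threshold is an assumption)."""
    e = np.asarray(evidence, dtype=np.float64)
    mid = e[1:-1]
    idx = np.arange(1, len(e) - 1)
    is_max = (mid > e[:-2]) & (mid >= e[2:])
    is_min = (mid < e[:-2]) & (mid <= e[2:])
    vops = idx[is_max & (mid > rel_threshold * e.max())]   # rising energy
    veps = idx[is_min & (mid < rel_threshold * e.min())]   # falling energy
    return vops, veps
```

Applied to a rectangular energy pulse, the evidence has a single peak at the rising edge and a single valley at the falling edge, which is the behaviour described above for VOPs and VEPs.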

Fig. 3 Flow chart describing the sequence of steps in the proposed method for detecting vowel regions

Fig. 4 a 801 point Gaussian window with spread = 100. b The FOGD operator

Fig. 5 Detection of vowel regions using the ZFF method (stage I) for the speech utterance /“personal predispositions tend to blunt”/. a Speech signal with phonemic marking of vowel regions given in the TIMIT acoustic-phonetic speech corpus. b ZFF signal. c Sum of 10 spectral peaks in the ZFF signal. d Enhanced spectral energy signal. e Hypothesized VOPs and VEPs for the speech signal. f Predicted vowel regions

In the second stage, spurious vowel regions are removed and the positions of the vowel landmarks are corrected. The spurious vowel regions are eliminated based on the uniformity of the epochs, with the assumption that successive pitch cycles are the same within a vowel region [15]. The positions of the VOPs and VEPs are corrected based on the strength of the excitation, as it exhibits a positive trend from a local minimum at the VOP and a negative trend from a local minimum at the VEP. The ZFF method is used to find the epoch intervals and the strength of the excitation signal. This mechanism is demonstrated in Fig. 6 using a continuous speech utterance. Figure 6a shows the continuous speech signal with the ground truth. Figure 6b, c, d, e and f show the prediction from the first level, the epoch intervals, the vowel regions after spurious removal, the strength of the excitation of the continuous speech, and the hypothesized prediction obtained by post processing of the VOPs and VEPs, respectively. The results produced by the proposed method are better than those of the COMB and FGCI methods in terms of detection rate and false alarms.
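The epoch-uniformity check can be sketched as below. The text does not give the decision rule, so this is only one plausible reading: a candidate region is kept when the intervals between successive GCIs inside it stay within a relative tolerance of their median. The function names, the tolerance `max_rel_dev` and the minimum epoch count `min_epochs` are all hypothetical; NumPy is assumed.

```python
import numpy as np

def is_uniform_region(gci, start, end, max_rel_dev=0.15, min_epochs=3):
    """True if epoch intervals inside [start, end] are nearly constant (assumed rule)."""
    gci = np.asarray(gci)
    inside = gci[(gci >= start) & (gci <= end)]
    if len(inside) < min_epochs:
        return False  # too few epochs to judge periodicity
    intervals = np.diff(inside)
    median = np.median(intervals)
    return bool(np.all(np.abs(intervals - median) <= max_rel_dev * median))

def remove_spurious_regions(regions, gci, **kwargs):
    """Keep only the (start, end) regions whose epochs are uniform."""
    return [(a, b) for (a, b) in regions if is_uniform_region(gci, a, b, **kwargs)]
```

Regions are given as `(start, end)` pairs in the same sample units as the GCI locations, for example those returned by a zero-crossing detector on the ZFF signal.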

Fig. 6 Detection of vowel regions using the ZFF method with post processing (stage II) for the speech utterance /“personal predispositions tend to blunt”/. a Speech signal with phonemic marking of vowel regions given in the TIMIT acoustic-phonetic speech corpus. b Prediction. c Epoch intervals of the speech signal. d Prediction without spurious vowel regions. e Strength of the excitation of the speech signal. f Corrected vowel regions obtained by modifying the positions of the VOPs and VEPs of the speech signal

5 Performance of the proposed method

In this section, experimental results are reported and the proposed method is compared with existing methods. The proposed two-stage vowel region detection method is evaluated on a subset of the TIMIT acoustic-phonetic speech corpus: 1000 test utterances spoken by 100 speakers (50 male and 50 female). The detected vowel regions are compared with the phonemic marked vowel boundaries given in the corpus, which are taken as the ground truth for vowel region marking. The region between a VOP and the following VEP in continuous speech is considered a vowel region. The performance is evaluated using the detection rate and the false alarm rate for different amounts of overlap between the detected vowel region and the ground truth. The detection rate refers to the number of vowel regions identified correctly in line with the ground truth, and the false alarm rate corresponds to the number of non-vowels interpreted as vowels. Tables 1 and 2 compare the first stage of the proposed method (ZFF method) with the combined method and the formant energy based method on the TIMIT acoustic-phonetic speech corpus, for clean and noisy speech respectively. The first column indicates the vowel region detection technique. Columns 2–6 give the results for different percentages of overlap with the ground truth. The last column gives the percentage of false alarms. From these tables, it is observed that the first level processing produces results competitive with state of the art techniques. However, the false alarm rate is high, and the detection rate for 90% overlap with the ground truth is low (47%). This is attributed to the different kinds of uncertainties associated with the speech signal.
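The two metrics can be sketched as follows. The exact scoring rules are not stated in the text, so this sketch makes two assumptions: a ground-truth vowel counts as detected at a given overlap level if some predicted region covers at least that fraction of it, and a predicted region counts as a false alarm if it overlaps no ground-truth vowel. The function names are hypothetical, and regions are `(start, end)` pairs in any consistent time unit.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def detection_rate(truth, predicted, min_overlap):
    """Percentage of ground-truth regions covered by a prediction to at least min_overlap."""
    hits = sum(
        1
        for t in truth
        if any(overlap(t, p) >= min_overlap * (t[1] - t[0]) for p in predicted)
    )
    return 100.0 * hits / len(truth)

def false_alarm_rate(truth, predicted):
    """Percentage of predicted regions that overlap no ground-truth region."""
    false_alarms = sum(1 for p in predicted if all(overlap(t, p) == 0 for t in truth))
    return 100.0 * false_alarms / len(predicted)
```

Evaluating `detection_rate` at several values of `min_overlap` gives one row of a table in the layout described above, with `false_alarm_rate` filling the last column.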

Table 1 Performance analysis of vowel region detection using combined method (COMB), group delay based formants around GCI (FGCI) and ZFF method for a clean speech on TIMIT database (Det = Detection; FA = False alarm)
Table 2 Performance analysis of vowel region detection using combined method (COMB), group delay based formants around GCI (FGCI) and ZFF method for a noisy speech on TIMIT database (Det = Detection; FA = False alarm)

Tables 3 and 4 show the results obtained after the post processing of the VOPs and VEPs. From these results, it is evident that the performance of the proposed vowel detection method is significantly better than that of the other state of the art methods in clean and noisy conditions. About 30% more vowel regions are detected for 90% overlap with the ground truth. In addition, the false alarm rate (6% and 8%) is also reduced in clean and noisy conditions using the proposed method. The post processing of the VOPs and VEPs thus produces a significant improvement in vowel region detection. The improvement is most pronounced at higher amounts of overlap with the ground truth.

Table 3 Performance analysis of vowel region detection using combined method (COMB), group delay based formants around GCI (FGCI) and ZFF method with post processing for a clean speech on TIMIT database with landmark correction (Det = Detection; FA = False alarm)
Table 4 Performance analysis of vowel region detection using combined method (COMB), group delay based formants around GCI (FGCI) and ZFF method with post processing for a noisy speech on TIMIT database with landmark correction (Det = Detection; FA = False alarm)

6 Summary and conclusions

In this paper, an improved method for vowel region detection through the post-processing of VOPs and VEPs was discussed. Although energy transition detection is used for vowel region detection in continuous speech because of its simplicity and accuracy, it is difficult to locate the VOP and VEP accurately due to speech production and noise uncertainties. A post processing methodology was introduced to circumvent this difficulty. We used a combination of speech signal processing techniques to detect vowel regions and then post processed them. Experiments were carried out on the zero frequency filtered speech signal, which highlights the low-frequency content of speech. To enhance the detection performance, the proposed scheme jointly used the uniformity of the epochs and the strength of the excitation to eliminate spurious vowel regions and to correct the positions of the VOPs and VEPs respectively. The performance of the proposed method was compared with recent state of the art methods for both clean and noisy speech. The evaluation used the TIMIT acoustic-phonetic speech corpus, and a significant improvement in vowel region detection was observed with the proposed method compared to the existing methods. The robustness of the proposed method can be further evaluated by detecting vowel regions on different forms of speech corpora for different applications.