1 Introduction

A speech signal is the result of variations in articulatory movements and carries several sources of variability, such as phonetic content and distressing traits, the characteristics of the individual speaker's speech production structure, and sometimes the behavioral state of the speaker while speaking [1]. This wide variability across individual speakers makes automatic speech recognition as well as speaker recognition challenging. The physiology of the human speech production system is fundamental to characterizing speech sounds. From a practical perspective, human speech production starts at the vocal cords (vocal folds) and ends at the mouth (lips) or nose.

In practice, the voice production system can be modeled as a series of connected acoustic tubes whose response has peaks (called formants) as well as valleys that depend on the nature of the speech sound. Formants are the perceptual characteristics of vowels: frequencies at which acoustic energy is concentrated, referred to as the first formant, second formant, and so on. Formants characterize a speech sound and vary considerably because of the physiological as well as behavioral aspects of a person while speaking.

Research has been carried out on the use of formant frequencies for automatic speech recognition. The main approaches adopted in the literature are linear prediction analysis [1,2,3], analysis of speech using the Fourier spectrum [4], and features such as the peaks of the homomorphically smoothed cepstrum [5]. Vowels, which are typically voiced, are the most useful speech sounds for reliable formant estimation [6].

In this paper, a formant estimation model is analyzed for vowel sounds. The proposed algorithm tracks the most prominent formants while accounting for the variability that arises during speech production. The following section discusses the methodology used to track and estimate the formants of different speech sounds.

2 Formant Estimation Algorithm

The input analog speech signal is converted to discrete form using a sampling frequency of 8 kHz and framed using a 20 ms Hamming window. Pre-emphasis is performed using a Butterworth IIR high-pass filter. Pre-emphasis reduces spectral tilt and flattens the spectrum by providing more gain for high-frequency components. The pre-emphasized signal is then passed through a Hilbert transform (an all-pass filter) to create an analytic signal from the real signal. A set of adaptive FIR (all-zero) bandpass filters with linear phase characteristics is designed and cascaded with the formant filters. Before each individual formant is estimated, the speech signal is filtered by this set of bandpass filters (filter bank). The most recent formant estimates are used to update the magnitude response of the filters. This allows each formant frequency to be tracked over time and suppresses nearby formants and surrounding noise.
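The framing step can be illustrated with a minimal sketch. It assumes non-overlapping frames (the text does not state a hop size) and uses the hypothetical helper names `hamming` and `frame_signal`; pre-emphasis and the Hilbert transform are not shown.

```python
import math

def hamming(length):
    """Return a Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def frame_signal(samples, fs=8000, frame_ms=20):
    """Split samples into Hamming-windowed frames.

    Non-overlapping frames are an assumption made for this sketch.
    """
    frame_len = int(fs * frame_ms / 1000)  # 160 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        frames.append([x * w for x, w in zip(chunk, window)])
    return frames

# One second of a 700 Hz test tone sampled at 8 kHz (illustrative input)
fs = 8000
tone = [math.sin(2.0 * math.pi * 700.0 * n / fs) for n in range(fs)]
frames = frame_signal(tone, fs=fs, frame_ms=20)
```

At 8 kHz a 20 ms frame holds 160 samples, so one second of speech yields 50 such frames under the non-overlapping assumption.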

The center frequencies of these bandpass filters are 0.7 kHz for the first formant (F1), 1.5 kHz for the second formant (F2), 2.2 kHz for the third formant (F3), and 3 kHz for the fourth formant (F4). These four formants are spectrally separated using the adaptive filter bank. To keep the pitch frequency (F0) from interfering with the first formant (F1), an additional zero is placed in the transfer function of the F1 filter. Because the Hilbert-transformed signal is analytic, the filters have complex-valued coefficients, which help in designing filters with normalized gain and zero phase at the center frequency of each filter.

The kth all-zero formant filter transfer function for k = 2, 3, 4 is given by [7]:

$$H_{F_k} \left( {z,n} \right) = K_{k} \left( n \right)\mathop \prod \limits_{l = 1, l \ne k}^{4} \left( {1 - r_{z} {\text{e}}^{{j2\pi F_{l} \left( {n - 1} \right)}} z^{ - 1} } \right)$$
(1)

Here, \(r_{z} = 0.98\) is the radius of the zeros, and \(F_{l} \left( {n - 1} \right)\) is the previous estimate of the lth formant frequency, expressed as a fraction of the sampling frequency. Equation (1) places a zero at every formant estimate except the kth, so that each formant filter has minimum response at the other formants. The term \(K_{k} \left( n \right)\) ensures normalized magnitude response and zero phase at the kth estimated formant frequency:

$$K_{k} \left( n \right) = \frac{1}{{\mathop \prod \nolimits_{l = 1,l \ne k}^{4} \left( {1 - r_{z} {\text{e}}^{{j2\pi \left( {F_{l} \left( {n - 1} \right) - F_{k} \left( {n - 1} \right)} \right)}} } \right)}}$$
(2)

A supplementary zero is added to the transfer function of the first formant filter at the pitch frequency of 200 Hz. This zero prevents the pitch frequency from interfering with the first formant. With the pitch frequency indexed as l = 0, the transfer function of the first formant filter is given by:

$$H_{F_1} \left( {z,n} \right) = K_{1} \left( n \right)\mathop \prod \limits_{l = 0,l \ne 1}^{4} \left( {1 - r_{z} {\text{e}}^{{j2\pi F_{l} \left( {n - 1} \right)}} z^{ - 1} } \right)$$
(3)

where

$$K_{1} \left( n \right) = \frac{1}{{\mathop \prod \nolimits_{l = 0,l \ne 1}^{4} \left( {1 - r_{z} {\text{e}}^{{j2\pi \left( {F_{l} \left( {n - 1} \right) - F_{1} \left( {n - 1} \right)} \right)}} } \right)}}$$
(4)
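The effect of the normalization term can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the formant frequencies are expressed as fractions of the 8 kHz sampling rate, uses the center frequencies given above as the previous estimates, and evaluates the all-zero response on the unit circle. The function name `all_zero_response` is hypothetical.

```python
import cmath
import math

R_Z = 0.98  # zero radius from the text

def all_zero_response(freqs, k, f_eval, r_z=R_Z):
    """Evaluate the kth all-zero filter at normalized frequency f_eval.

    freqs maps an index l to the previous estimate F_l(n-1) in cycles per
    sample. Every index other than k contributes one zero.
    """
    z_inv = cmath.exp(-2j * math.pi * f_eval)
    numerator = 1.0 + 0.0j
    gain_inv = 1.0 + 0.0j
    for l, f_l in freqs.items():
        if l == k:
            continue
        zero = r_z * cmath.exp(2j * math.pi * f_l)
        numerator *= 1.0 - zero * z_inv
        # The normalization term is the reciprocal of the same product at F_k
        gain_inv *= 1.0 - r_z * cmath.exp(2j * math.pi * (f_l - freqs[k]))
    return numerator / gain_inv

fs = 8000.0
# Center frequencies from the text; index 0 is the 200 Hz pitch zero
formants = {0: 200 / fs, 1: 700 / fs, 2: 1500 / fs, 3: 2200 / fs, 4: 3000 / fs}

# The F2 filter has zeros at F1, F3, and F4 only, so the pitch entry is dropped
f2_set = {l: f for l, f in formants.items() if l != 0}
h_at_f2 = all_zero_response(f2_set, 2, f2_set[2])  # response at its own formant
h_at_f1 = all_zero_response(f2_set, 2, f2_set[1])  # response at a neighbor

# The F1 filter additionally carries the pitch zero
h1_at_f0 = all_zero_response(formants, 1, formants[0])
```

Under these assumptions, `h_at_f2` has unit magnitude and zero phase, while `h_at_f1` and `h1_at_f0` are strongly attenuated because a zero of radius 0.98 lies next to the evaluation point.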

The output of each all-zero FIR filter is then passed through a first-order IIR filter. The pole of each of these filters is updated using the formant frequency estimated for that filter in the previous frame. The transfer function of the kth single-pole IIR filter at time instant n is:

$$H_{k} \left( {n,z} \right) = \frac{{1 - r_{p} }}{{1 - r_{p} {\text{e}}^{{j2\pi F_{k} \left( {n - 1} \right)}} z^{ - 1} }}$$
(5)

Here, \(r_{p} = 0.9\) is the radius of the pole, which determines how sharply the response peaks at the formant frequency, and \(F_{k} \left( {n - 1} \right)\) is the estimate of the kth formant at index (n − 1). Equation (5) defines four formant filters with complex-valued coefficients. These filters divide the spectrum of the Hilbert-transformed speech signal into four spectrally separated regions, and the formant frequencies are estimated based on the updated filter coefficients [8].
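The behavior of the single-pole filter can be illustrated with a short sketch. It assumes a fixed formant estimate (the adaptive update is omitted) and a frequency expressed as a fraction of the 8 kHz sampling rate; the function names `dtf_response` and `dtf_filter` are hypothetical.

```python
import cmath
import math

R_P = 0.9  # pole radius from the text

def dtf_response(f_k, f_eval, r_p=R_P):
    """Frequency response of the single-pole filter at f_eval."""
    pole = r_p * cmath.exp(2j * math.pi * f_k)
    return (1.0 - r_p) / (1.0 - pole * cmath.exp(-2j * math.pi * f_eval))

def dtf_filter(x, f_k, r_p=R_P):
    """Apply y[n] = (1 - r_p) x[n] + r_p e^{j 2 pi f_k} y[n-1] to x."""
    pole = r_p * cmath.exp(2j * math.pi * f_k)
    y, prev = [], 0.0 + 0.0j
    for sample in x:
        prev = (1.0 - r_p) * sample + pole * prev
        y.append(prev)
    return y

fs = 8000.0
f1 = 700.0 / fs
gain_on = abs(dtf_response(f1, f1))             # at the tracked formant
gain_off = abs(dtf_response(f1, 1500.0 / fs))   # at a neighboring formant

# A complex tone at the tracked frequency passes unchanged once the
# start-up transient has decayed
tone = [cmath.exp(2j * math.pi * f1 * n) for n in range(400)]
out = dtf_filter(tone, f1)
```

The numerator (1 − r_p) fixes the gain at the tracked frequency to one, so the pole radius only controls how quickly the gain falls off for neighboring frequencies.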

The analytic speech signal contains all types of phonetic content, of which the voiced part is most useful for detecting speech-specific formants [9, 10]. Because formants are related to the voicing properties of speech, generally of vowels, a voicing detector is used to distinguish voiced frames from unvoiced frames. A simple zero-crossing rate (ZCR) detector is used to classify the voiced and unvoiced parts of the speech. The zero-crossing rate is the number of times the speech signal crosses the zero (reference) amplitude. In general, unvoiced or noisy speech has a higher ZCR than voiced speech. It is calculated using the signum function as:

$$Z_{m} = \mathop \sum \limits_{n = - \infty }^{\infty } \left| {{\text{sgn}}\left[ {s\left( n \right)} \right] - {\text{sgn}}\left[ {s\left( {n - 1} \right)} \right]} \right|w\left( {m - n} \right)$$
(6)

Here, sgn() is the signum function, which is 1 for s(n) ≥ 0 and −1 for s(n) < 0, and w(n) is a window function of N samples. The formant frequencies are then estimated for each speech frame based on the decision of the voicing detector. This avoids redundant estimation of formants during unvoiced or silent parts of the speech.
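The zero-crossing measure can be sketched as follows. The sketch assumes a rectangular window over one 20 ms frame and halves the sum so that each sign change counts once; the threshold `max_crossings` is a hypothetical tuning parameter, since no value is given in the text, and the two test tones merely stand in for a low-frequency (voiced-like) and a high-frequency (noise-like) frame.

```python
import math

def sgn(x):
    """Signum as defined in the text: +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1

def zero_crossing_count(frame):
    """Sum |sgn(s[n]) - sgn(s[n-1])| over a frame (rectangular window).

    Each sign change contributes 2 to the sum, so it is halved.
    """
    total = sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
                for n in range(1, len(frame)))
    return total // 2

def is_voiced(frame, max_crossings):
    """Label a frame voiced when its crossing count is below the threshold.

    max_crossings is a hypothetical parameter chosen for illustration.
    """
    return zero_crossing_count(frame) < max_crossings

fs = 8000
# 160-sample frames; the 0.1 rad phase offset keeps samples off exact zeros
low = [math.sin(2 * math.pi * 200 * n / fs + 0.1) for n in range(160)]
high = [math.sin(2 * math.pi * 3000 * n / fs + 0.1) for n in range(160)]

low_count = zero_crossing_count(low)
high_count = zero_crossing_count(high)
```

The low-frequency frame produces far fewer crossings than the high-frequency frame, which is the property the voicing decision relies on.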

3 Results and Discussion

Figure 1 shows the narrowband spectrograms of speech samples of the vowel ‘ae’ from two different female speakers, obtained with the formant estimation algorithm discussed in Sect. 2. It is observed that the variation of the first formant is nearly the same over the entire duration, whereas the positions (frequencies) of the higher formants differ for the same speech sound. A similar analysis is carried out for 12 vowel sounds from speech samples of five female speakers.

Fig. 1 Narrowband spectrogram of two female speakers for vowel sound ‘ae’

Fig. 2 shows the analysis plots of the first four formants of 12 different vowel sounds and their variation among five female speakers, denoted W1 to W5.

Fig. 2 Formant analysis of vowel sounds of five female speakers

The plots in Fig. 2 show the variation of the formants for 12 different vowel sounds spoken by five female speakers. From the experimental results, it is observed that the first formant, F1 (excluding the fundamental frequency), of each vowel sound is almost constant across speakers. All higher formants (F2, F3, F4) of a single vowel sound vary considerably across speakers. These characteristics allow formants to be used as features for speech as well as speaker recognition. Thus, from the experimental analysis, it can be concluded that the first formant is the most appropriate feature of speech sounds, whereas the higher formants represent the characteristics of the speaker.

4 Conclusion

In this paper, formant analysis of vowel sounds is carried out to explore the significance of formants for speech as well as speaker recognition. A set of twelve vowel sounds is used for the experimental analysis, whose purpose is to study the significance of formants in relation to the characteristics of speech sounds and speakers. As an initial study, experiments are carried out on speech samples from five female speakers. Using the methodology described, the work will be extended to continuous speech, a larger speaker database, and a variety of real-world dynamic conditions for applications such as speech recognition and speaker recognition.