Abstract
Traditionally, the information in speech signals is represented by features derived from short-time Fourier analysis. This analysis typically uses features extracted from the magnitude of the Fourier transform (FT) and ignores the phase component. Although several studies over the past three decades have highlighted the significance of the FT phase, its features have not been fully exploited because the phase is difficult to compute and the phase function is difficult to process. The information in the short-time FT phase function can be extracted by processing the negative derivative of the FT phase, i.e., the group delay function. In this paper, the properties of group delay functions are reviewed, highlighting the importance of the FT phase for representing information in the speech signal. Methods for processing the group delay function are discussed that capture the characteristics of the vocal-tract system, either in the form of formants or through a modified group delay function. Applications of group delay functions in speech processing are discussed in some detail. They include segmentation of speech at syllable boundaries, which exploits the additive and high-resolution properties of group delay functions. The effectiveness of this segmentation, and of the features derived from the modified group delay function, is demonstrated in applications such as language identification, speech recognition and speaker recognition. The paper thus demonstrates the need to exploit the potential of group delay functions in the development of speech systems.
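As a rough illustration of why the group delay function sidesteps the difficulty of computing the phase, the sketch below evaluates it directly from two Fourier transforms, without unwrapping the phase. It relies on the standard identity tau(w) = (X_R Y_R + X_I Y_I) / |X|^2, where X is the FT of x(n) and Y is the FT of n x(n), under the usual convention that the group delay is the negative derivative of the phase. The function name `group_delay` and the parameters `n_fft` and `eps` are hypothetical choices for this example; it is not the authors' implementation and does not include the modified group delay function.

```python
import numpy as np

def group_delay(x, n_fft=512, eps=1e-12):
    """Group delay (in samples) of a short-time frame x, without phase unwrapping.

    Assumption: tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, with Y the FT of n*x(n).
    eps is a hypothetical guard against division by zero at spectral nulls.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # FT of x(n)
    Y = np.fft.rfft(n * x, n_fft)    # FT of n * x(n)
    power = X.real ** 2 + X.imag ** 2
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(power, eps)

# A unit impulse delayed by 5 samples has a constant group delay of 5 samples.
frame = np.zeros(16)
frame[5] = 1.0
tau = group_delay(frame)
```

The same function can be used to check the additive property mentioned above: the group delay of the convolution of two signals equals the sum of their individual group delays, provided `n_fft` is at least as long as the convolved signal.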
Cite this article
MURTHY, H.A., YEGNANARAYANA, B. Group delay functions and its applications in speech technology. Sadhana 36, 745–782 (2011). https://doi.org/10.1007/s12046-011-0045-1
DOI: https://doi.org/10.1007/s12046-011-0045-1