Vocal Tract Length Normalization Features for Audio Search

Madhavi, Maulik C.; Sharma, Shubham; Patil, Hemant A.

doi:10.1007/978-3-319-24033-6_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9302)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

This paper presents speaker normalization approaches for the audio search task. The conventional state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), is known to implicitly contain both speaker-specific and linguistic information. This can create a problem for speaker-independent audio search. In this paper, a universal warping-based approach is used for vocal tract length normalization in audio search. In particular, features based on the scale transform and warped linear prediction are used to compensate for speaker variability in audio matching. The advantage of these features over the conventional feature set is that they apply the same universal frequency warping to both templates being matched during audio search. The performance of Scale Transform Cepstral Coefficients (STCC) and Warped Linear Prediction Cepstral Coefficients (WLPCC) is about 3% higher than that of the state-of-the-art MFCC feature set on the TIMIT database.
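The intuition behind universal warping can be illustrated with a small numerical sketch. It is not the paper's STCC or WLPCC pipeline. It only demonstrates the general property that a uniform scaling of the frequency axis becomes a plain shift once the axis is sampled logarithmically, and that the magnitude of a Fourier transform is insensitive to such a shift. The synthetic envelope, the scaling factor `alpha`, the grid limits, and all function names below are assumptions chosen for illustration.

```python
import numpy as np

def envelope(freq_hz):
    """Synthetic spectral envelope: three Gaussian 'formant' bumps (made-up values)."""
    centers = (500.0, 1500.0, 2500.0)
    width = 100.0
    return sum(np.exp(-0.5 * ((freq_hz - c) / width) ** 2) for c in centers)

def magnitude_distance(x, y):
    """Relative L2 distance between the DFT magnitudes of two sampled envelopes."""
    mx = np.abs(np.fft.rfft(x))
    my = np.abs(np.fft.rfft(y))
    return np.linalg.norm(mx - my) / np.linalg.norm(mx)

alpha = 1.1  # hypothetical speaker-dependent frequency scaling factor
n = 512
f_lin = np.linspace(50.0, 8000.0, n)   # uniformly spaced frequency grid
f_log = np.geomspace(50.0, 8000.0, n)  # logarithmically spaced frequency grid

# Speaker A is envelope(f); speaker B is modeled as envelope(alpha * f).
d_lin = magnitude_distance(envelope(f_lin), envelope(alpha * f_lin))
d_log = magnitude_distance(envelope(f_log), envelope(alpha * f_log))

print(f"linear grid distance: {d_lin:.4f}")
print(f"log grid distance:    {d_log:.6f}")
```

On the logarithmic grid the two envelopes differ only by a translation, so their transform magnitudes nearly coincide, whereas on the linear grid the scaled envelope is compressed and the magnitudes differ noticeably. Because the same warping is applied to both signals, no speaker-specific warping factor has to be estimated.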

Author information

Authors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India
Maulik C. Madhavi & Hemant A. Patil
Indian Institute of Science, Bangalore, India
Shubham Sharma

Corresponding author

Correspondence to Maulik C. Madhavi.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Pavel Král
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek

Cite this paper

Madhavi, M.C., Sharma, S., Patil, H.A. (2015). Vocal Tract Length Normalization Features for Audio Search. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_44

DOI: https://doi.org/10.1007/978-3-319-24033-6_44
Published: 11 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)
