Abstract
This paper presents speaker normalization approaches for the audio search task. The conventional state-of-the-art feature set, Mel Frequency Cepstral Coefficients (MFCC), is known to carry both speaker-specific and linguistic information implicitly, which can be a problem for speaker-independent audio search. In this paper, a universal warping-based approach is used for vocal tract length normalization in audio search. In particular, features based on the scale transform and on warped linear prediction are used to compensate for speaker variability in audio matching. The advantage of these features over the conventional feature set is that they apply the same universal frequency warping to both templates being matched during audio search. On the TIMIT database, the performance of Scale Transform Cepstral Coefficients (STCC) and Warped Linear Prediction Cepstral Coefficients (WLPCC) is about 3% higher than that of the state-of-the-art MFCC feature set.
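The idea behind universal warping can be illustrated with a small numerical sketch. It is not the STCC or WLPCC pipeline from the paper; it only assumes that two speakers' spectral envelopes differ by a frequency scaling factor, and shows that sampling both on the same log-spaced frequency grid turns that scaling into a shift, which the magnitude of a Fourier transform ignores. The toy envelope, the grid limits, and all names (`envelope`, `alpha`, `shift_bins`) are hypothetical choices made for illustration.

```python
import numpy as np

def envelope(f_hz):
    """Toy spectral envelope with two formant-like peaks (hypothetical values)."""
    return (np.exp(-0.5 * ((f_hz - 700.0) / 100.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((f_hz - 1800.0) / 150.0) ** 2))

# Log-spaced frequency grid: the same ("universal") warping for every speaker.
n_bins = 512
freqs = np.geomspace(100.0, 8000.0, n_bins)
ratio = freqs[1] / freqs[0]            # constant ratio between adjacent bins

# Assumption: speaker B's envelope is speaker A's, scaled in frequency by alpha.
shift_bins = 12
alpha = ratio ** shift_bins            # roughly 1.11
spec_a = envelope(freqs)
spec_b = envelope(alpha * freqs)

# On the linear frequency axis the two envelopes clearly differ ...
max_raw_difference = np.max(np.abs(spec_a - spec_b))

# ... but on the log grid, scaling by alpha is just a shift of shift_bins samples,
max_shift_error = np.max(np.abs(spec_b[:-shift_bins] - spec_a[shift_bins:]))

# and the DFT magnitude is (up to edge effects) unaffected by that shift.
mag_a = np.abs(np.fft.rfft(spec_a))
mag_b = np.abs(np.fft.rfft(spec_b))
max_mag_error = np.max(np.abs(mag_a - mag_b))

print(f"alpha = {alpha:.3f}")
print(f"raw envelope difference:   {max_raw_difference:.3f}")
print(f"difference after shifting: {max_shift_error:.2e}")
print(f"DFT-magnitude difference:  {max_mag_error:.2e}")
```

Because the warping is fixed in advance, no per-speaker warping factor has to be estimated for either template; the price is that the phase, which carries the shift, is discarded.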
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Madhavi, M.C., Sharma, S., Patil, H.A. (2015). Vocal Tract Length Normalization Features for Audio Search. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_44
DOI: https://doi.org/10.1007/978-3-319-24033-6_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science (R0)