Abstract
A pronunciation lexicon for speech synthesis is a key component of a modern speech synthesizer, containing the orthography and phonemic transcriptions of a large number of words. A lexicon may contain words with multiple pronunciations, such as reduced and full versions of (function) words, homographs, or other types of words with multiple acceptable pronunciations such as foreign words or names. Pronunciation variants should therefore be taken into account during voice-building (e.g. segmentation and labeling of a speech database), as well as during synthesis.
In this paper we outline a strategy to deal with these variants automatically, resulting in a speaker-specific pronunciation. Based on a labeled speech database, the pronunciation lexicon is pruned to remove as much pronunciation variation as possible. This pruned lexicon can be used to train speaker-specific letter-to-sound rules. If the speaker has uttered a word in different ways, these variants are not pruned. Instead, decision trees are trained for each of those words and are used to select the most suitable pronunciation during synthesis. We tested our approach on five speech databases, with two lexicons per speech database. The automatic selection of pronunciation variants yielded a small improvement over the baseline (always selecting the most common variant).
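The pruning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prune_lexicon`, the data layout (pronunciations as tuples of phoneme symbols), and the choice to leave words the speaker never uttered untouched are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def prune_lexicon(lexicon, observed):
    """Prune pronunciation variants the speaker never used (hypothetical sketch).

    lexicon:  dict mapping word -> list of variants (tuples of phoneme symbols)
    observed: iterable of (word, variant) pairs taken from the labeled database
    Returns (pruned_lexicon, words_with_several_observed_variants).
    """
    # Count how often the speaker produced each variant of each word.
    counts = defaultdict(Counter)
    for word, variant in observed:
        counts[word][variant] += 1

    pruned = {}
    multi_variant_words = set()
    for word, variants in lexicon.items():
        used = [v for v in variants if counts[word][v] > 0]
        if not used:
            # Assumption: words absent from the database keep all variants.
            pruned[word] = list(variants)
            continue
        # Most frequent variant first; ties keep the lexicon order.
        used.sort(key=lambda v: -counts[word][v])
        pruned[word] = used
        if len(used) > 1:
            # These words would need a per-word selection model at synthesis.
            multi_variant_words.add(word)
    return pruned, multi_variant_words


if __name__ == "__main__":
    toy_lexicon = {
        "the": [("dh", "ax"), ("dh", "iy")],
        "either": [("iy", "dh", "er"), ("ay", "dh", "er")],
    }
    toy_observed = [
        ("the", ("dh", "ax")),
        ("the", ("dh", "ax")),
        ("the", ("dh", "iy")),
        ("either", ("ay", "dh", "er")),
    ]
    print(prune_lexicon(toy_lexicon, toy_observed))
```

In this sketch the first entry of each pruned list is the speaker's most common variant, which corresponds to the baseline mentioned above. The words returned in the second value are the ones for which a per-word classifier, such as a decision tree, would be trained.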
The research reported in this paper was partly supported by the projects IWT-SPACE, iMinds-RAILS, iMinds-SEGA and EC FP7 ALIZ-E (FP7-ICT-248116).
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Latacz, L., Mattheyses, W., Verhelst, W. (2013). Speaker-Specific Pronunciation for Speech Synthesis. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science(), vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_63
DOI: https://doi.org/10.1007/978-3-642-40585-3_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)