Abstract
In this paper we present a system for automatically predicting prosodic breaks in synthesized speech using a Random Forests classifier. In our experiments the classifier is trained on a large dataset of audiobooks that is automatically labeled with phone, word, and pause labels. A rule-based POS tagger supplies part-of-speech (POS) tags for the text. We use cross-validation so that we can examine not only the results for a specific subset of the data but also the system's reliability across the dataset. The experimental results show that the system performs well and consistently on the audiobook database; the results are poorer and less robust on a smaller database of read speech, even though part of that database was labeled manually.
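The role of cross-validation described above (scoring every held-out subset and then looking at how much the scores vary) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy POS-tag data, the choice of F1 as the metric, and in particular the stand-in learner, which is a simple majority vote per POS tag and not the Random Forests classifier used in the paper.

```python
import random
import statistics


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def f1_score(gold, predicted):
    """F1 for the positive (break) class; 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validate(features, labels, train_fn, k=5, seed=0):
    """Return one F1 score per fold.

    train_fn(train_X, train_y) must return a function mapping one
    feature value to a boolean break prediction.
    """
    scores = []
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_X = [features[i] for i in range(len(features)) if i not in held]
        train_y = [labels[i] for i in range(len(labels)) if i not in held]
        predict = train_fn(train_X, train_y)
        gold = [labels[i] for i in held_out]
        predicted = [predict(features[i]) for i in held_out]
        scores.append(f1_score(gold, predicted))
    return scores


def train_majority_by_pos(train_X, train_y):
    """Stand-in learner (NOT Random Forests): predict a break after a POS tag
    if breaks followed that tag in more than half of its training occurrences."""
    counts = {}
    for pos, label in zip(train_X, train_y):
        yes, total = counts.get(pos, (0, 0))
        counts[pos] = (yes + int(label), total + 1)

    def predict(pos):
        yes, total = counts.get(pos, (0, 1))  # unseen tag: predict no break
        return yes * 2 > total

    return predict


# Hypothetical toy data: one POS tag per word, break label True after punctuation.
toy_X = ["PUNCT", "NOUN", "VERB", "PUNCT", "ADJ",
         "NOUN", "PUNCT", "VERB", "NOUN", "PUNCT"] * 5
toy_y = [tag == "PUNCT" for tag in toy_X]

fold_scores = cross_validate(toy_X, toy_y, train_majority_by_pos, k=5)
mean_f1 = statistics.mean(fold_scores)      # overall quality
spread_f1 = statistics.pstdev(fold_scores)  # consistency across folds
```

A low spread across folds is what the abstract refers to as consistent results; a high spread would indicate that the reported score depends strongly on which subset happens to be held out.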
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Khomitsevich, O., Chistikov, P., Zakharov, D. (2014). Using Random Forests for Prosodic Break Prediction Based on Automatic Speech Labeling. In: Ronzhin, A., Potapova, R., Delic, V. (eds) Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol 8773. Springer, Cham. https://doi.org/10.1007/978-3-319-11581-8_58
DOI: https://doi.org/10.1007/978-3-319-11581-8_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11580-1
Online ISBN: 978-3-319-11581-8
eBook Packages: Computer Science, Computer Science (R0)