Abstract
In common unit selection implementations, F0 continuity is measured as one of the concatenation cost features, with the expectation that a smooth transition between units (with regard to speech melody) is ensured when the F0 difference is low enough. This measure generally uses a static F0 value computed at the unit boundary. In the present paper we show, however, that static F0 values are not sufficient for smooth concatenation of speech units, and that the dynamic nature of the F0 contour must be taken into account. Two schemes of dynamic F0 handling are presented, and speech generated by both schemes is compared by means of listening tests on specially selected phrases which are known to carry unnatural artefacts. Advantages and disadvantages of the individual schemes are also discussed.
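The distinction between a static and a dynamic F0 measure can be pictured with a small sketch. It is an illustration only and is not either of the two schemes evaluated in the paper: the function names, the frame-based contour representation, and the use of a least-squares slope as the "dynamic" feature are all assumptions made for the example.

```python
def static_f0_cost(left_f0, right_f0):
    """Static measure: absolute F0 difference (Hz) at the join,
    using only the last frame of the left unit and the first frame
    of the right unit."""
    return abs(left_f0[-1] - right_f0[0])


def f0_slope(contour, frame_shift=1.0):
    """Least-squares slope of an F0 contour (Hz per time unit).
    `frame_shift` is a hypothetical spacing between F0 samples."""
    n = len(contour)
    if n < 2:
        return 0.0
    xs = [i * frame_shift for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(contour) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, contour))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def dynamic_f0_cost(left_f0, right_f0, frame_shift=1.0):
    """Assumed dynamic measure: mismatch between the F0 trend at the
    end of the left unit and the trend at the start of the right unit."""
    return abs(f0_slope(left_f0, frame_shift) - f0_slope(right_f0, frame_shift))


# A rising contour joined to a falling one: the boundary values match,
# so the static cost sees no problem, while the slopes clearly disagree.
left = [100.0, 105.0, 110.0]
right = [110.0, 105.0, 100.0]
print(static_f0_cost(left, right), dynamic_f0_cost(left, right))
```

In this toy case the boundary F0 values are identical, yet the melody changes direction abruptly at the join, which is the kind of situation a static boundary value cannot detect.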
The research has been supported by the European Regional Development Fund (ERDF), project “New Technologies for Information Society” (NTIS), European Centre of Excellence, ED1.1.00/02.0090, and by the Technology Agency of the Czech Republic, project No. TA01011264.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Tihelka, D., Matoušek, J., Hanzlíček, Z. (2014). Modelling F0 Dynamics in Unit Selection Based Speech Synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_55
DOI: https://doi.org/10.1007/978-3-319-10816-2_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science (R0)