Abstract
Unit selection speech synthesis systems generally rely on target and concatenation costs to select the best unit sequence. Although these costs often take contextual features into account, they consist mainly of local distances that are accumulated afterwards. In this paper, we describe a new duration target cost that takes the whole sequence into account. It aims to select a sequence that is good globally, rather than one that is very good almost everywhere but contains a few local leaps in duration cost that are counter-balanced by other units. The problem of weighting this new duration cost against the other sub-costs is also investigated. Experiments showed that the new measure performed well on sentences featuring duration artefacts, while not degrading the others.
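The masking effect described in the abstract can be illustrated with a minimal sketch. This is not the duration target cost defined in the paper: the per-unit cost values are made up, and `peak_aware_cost` with its `leap_weight` parameter is a hypothetical sequence-level score introduced only to show how an accumulated sum can hide a single large leap.

```python
from typing import Sequence


def accumulated_cost(local_costs: Sequence[float]) -> float:
    """Sum of per-unit costs: local distances accumulated afterwards."""
    return sum(local_costs)


def peak_aware_cost(local_costs: Sequence[float], leap_weight: float = 1.0) -> float:
    """Hypothetical sequence-level score (an assumption, not the paper's cost):
    the mean local cost plus a penalty on the single worst unit."""
    mean = sum(local_costs) / len(local_costs)
    return mean + leap_weight * max(local_costs)


# Made-up per-unit duration costs for two candidate unit sequences.
candidates = {
    "leap": [0.0, 0.0, 0.0, 4.0, 0.0],  # very good almost everywhere, one leap
    "even": [1.0, 1.0, 1.0, 1.0, 1.0],  # acceptable everywhere, no leap
}

# The accumulated sum prefers the sequence containing the leap (4.0 < 5.0).
best_by_sum = min(candidates, key=lambda name: accumulated_cost(candidates[name]))

# The sequence-level score prefers the sequence without any leap.
best_by_peak = min(candidates, key=lambda name: peak_aware_cost(candidates[name]))

print(best_by_sum, best_by_peak)
```

With these illustrative numbers the summed cost selects the sequence containing the leap, while the sequence-level score selects the evenly acceptable one. How such a term should be weighted against other sub-costs is the question the abstract raises and is not addressed by this sketch.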
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Guennec, D., Chevelu, J., Lolive, D. (2015). Defining a Global Adaptive Duration Target Cost for Unit Selection Speech Synthesis. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science(), vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_17
DOI: https://doi.org/10.1007/978-3-319-24033-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)