Abstract
This paper addresses the issue of producing synthetic speech for an interpreted dialogue where the attitudinal colouring of an original utterance is to be preserved; it describes differences in speaking style between read and spontaneous speech from the viewpoint of synthesis research and discusses the design of a synthesis system incorporating labels to encode the prosodic and segmental variation simultaneously. Spontaneous speech confronts us with phenomena that were not encountered in corpora of prepared or read speech, and to account for these we have to identify increasingly higher-level factors of discourse structure and speaker involvement. The paper makes three specific claims: (a) that it is better to label the distinctive characteristics of speech through higher-level context dependencies, and to select units for synthesis from appropriate contexts, rather than attempt to predict and modify fine phonetic detail; (b) that the labelling of segmental and prosodic characteristics can be done adequately for speech synthesis using automatic techniques, leaving the human labeller free to identify higher-level discourse-related aspects of the speech; and (c) that instead of minimizing the size of the source database of speech units, we should rather be concerned to maximize its variety and to efficiently select from it the units that most closely express the characteristics of the target speech. The CHATR resynthesis toolkit performs many of these tasks.
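Claim (c), selecting from a large and varied database the units that best match the target, can be illustrated with a minimal sketch. This is not CHATR's implementation. The names `Unit`, `Target`, `target_cost`, `join_cost`, and `select_units`, the feature set (stress, speaking style, mean F0), and all cost weights are assumptions made for illustration only. The sketch uses a generic dynamic-programming search that trades a context-mismatch cost against a join-smoothness cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A database unit with hypothetical labels (not CHATR's actual label set)."""
    phone: str      # segmental label
    stressed: bool  # prosodic context label
    style: str      # higher-level label, e.g. "read" or "spontaneous"
    f0: float       # mean pitch in Hz, used only for the join cost


@dataclass(frozen=True)
class Target:
    """The desired context for one position in the utterance."""
    phone: str
    stressed: bool
    style: str


def target_cost(target, cand):
    # Penalise context mismatches; the weights are illustrative assumptions.
    cost = 0.0
    if target.stressed != cand.stressed:
        cost += 1.0
    if target.style != cand.style:
        cost += 2.0
    return cost


def join_cost(prev, cand):
    # Penalise pitch discontinuity between adjacent units (assumed scaling).
    return abs(prev.f0 - cand.f0) / 50.0


def select_units(targets, database):
    """Return (path, total_cost) minimising target plus join costs."""
    # Candidates for each position are all database units with the same phone.
    lattice = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not cands for cands in lattice):
        raise ValueError("no candidate unit for at least one target phone")

    # best[i][j] holds (cumulative cost, backpointer) for candidate j at position i.
    best = [[(target_cost(targets[0], c), None) for c in lattice[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in lattice[i]:
            tc = target_cost(targets[i], cand)
            row.append(min(
                (best[i - 1][k][0] + join_cost(prev, cand) + tc, k)
                for k, prev in enumerate(lattice[i - 1])
            ))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

In this sketch no unit is modified after selection: a larger and more varied database simply gives the search more candidates whose labelled context already matches the target, which is the trade-off the abstract argues for.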
Copyright information
© 1997 Springer-Verlag New York, Inc.
About this chapter
Cite this chapter
Campbell, W.N. (1997). Synthesizing Spontaneous Speech. In: Sagisaka, Y., Campbell, N., Higuchi, N. (eds) Computing Prosody. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2258-3_12
DOI: https://doi.org/10.1007/978-1-4612-2258-3_12
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-7476-6
Online ISBN: 978-1-4612-2258-3
eBook Packages: Springer Book Archive