Correction of Prosodic Phrases in Large Speech Corpora

Hanzlíček, Zdeněk

doi:10.1007/978-3-319-45510-5_47

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Nowadays, very large speech corpora are used in many speech processing tasks, such as speech recognition and synthesis. These corpora usually contain several hours of speech or more. To achieve the best possible results, an appropriate annotation of the recorded utterances is often necessary. This paper focuses on problems related to the prosodic annotation of Czech speech corpora. In the Czech language, utterances are supposed to be split by pauses into so-called prosodic clauses containing one or more prosodic phrases. The types of particular phrases are linked to their last prosodic words, which correspond to various functionally involved prosodemes. The clause/phrase structure is substantially determined by the sentence composition. However, in real speech data, a different prosodeme type or even different phrase/clause borders can be present. This paper deals with 2 basic problems: the correction of an improper prosodeme/phrase type and the detection of new phrase borders. For both tasks, we propose new procedures utilizing hidden Markov models. Experiments were performed on 4 large speech corpora recorded by professional speakers for the purpose of speech synthesis. These experiments were limited to declarative sentences. The results were successfully verified by listening tests.

This research was supported by the Czech Science Foundation (GA CR), project No. GA16-04420S. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme CESNET LM2015042, is greatly appreciated.

1 Introduction

Nowadays, very large speech corpora are used in many speech processing tasks, such as speech recognition and synthesis. These corpora usually contain several hours of speech or more. To achieve the best possible results, an appropriate annotation of the recorded utterances is often necessary. With the use of large speech corpora, the automatic phonetic and prosodic annotation of speech [7] has become an important task. This paper deals with 2 basic problems: the correction of an improper prosodeme/phrase type and the detection of new phrase borders.

1.1 Prosody Model

For our purposes, we used the formal prosody model proposed by Romportl [5]. According to this model, an utterance is divided into prosodic clauses separated by short pauses. Each prosodic clause includes one or more prosodic phrases containing certain continuous intonation schemes. Furthermore, phrases are composed of prosodic words. The communication function the speaker intends the phrase to have (the type of the phrase) is supposed to be linked with the last prosodic word in the phrase. For this purpose, so-called prosodemes are defined. The last prosodic word is linked with a functionally involved prosodeme, the other words with null prosodemes. For the Czech language, the following basic classes of functionally involved prosodemes were defined [5]:

P1 – terminating satisfactorily (in declarative sentences)
P2 – terminating unsatisfactorily (in questions)
P3 – non-terminating (in non-terminal phrases of compound sentences)

Since this research is limited to the declarative sentences and neutral speech (i.e. without emphasis, expressions etc.), prosodemes P0, P1.1 and P3.1 were applied. According to the theoretical assumption, all the compound sentences consist of several phrases, where the last one is terminated with prosodeme P1.1 and the other phrases end with P3.1; see a simple example in Fig. 1.
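The theoretical labelling rule described above can be written down as a minimal Python sketch. The function name and the nested-list representation of phrases and prosodic words are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List


def assign_prosodemes(phrases: List[List[str]]) -> List[List[str]]:
    """Label every prosodic word of one declarative sentence.

    `phrases` is a list of phrases, each given as a list of prosodic words
    (a hypothetical representation). The last word of each phrase gets a
    functionally involved prosodeme, all other words get the null
    prosodeme P0.
    """
    labels = []
    for index, words in enumerate(phrases):
        if not words:
            raise ValueError("a phrase must contain at least one prosodic word")
        # The last phrase terminates the sentence (P1.1), the others do not (P3.1).
        final = "P1.1" if index == len(phrases) - 1 else "P3.1"
        labels.append(["P0"] * (len(words) - 1) + [final])
    return labels


# Two phrases with placeholder prosodic words w1..w3.
print(assign_prosodemes([["w1", "w2"], ["w3"]]))
```

This is the default annotation only; the correction procedures in Sect. 2 exist because real speech does not always follow it.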

Particular prosodemes are linked with specific speech features: P1.1 is characterized by a pitch decrease within its last syllable, whereas a pitch increase is typical for P3.1. Besides the pitch shape (which is the most relevant), spectral features, duration and energy can be important for particular prosodemes. Naturally, particular types of phrases do not vary solely within their last prosodic words. Some specific prosodic differences can be present throughout the whole utterance. However, those differences are often rather content-related (e.g. emphasis on some key words) and a more complex prosody model would be required to capture them. The utilized prosody model seems to be adequate for phrase classification [1].

1.2 Problems in Real Speech

In real speech data, a different prosodeme than expected could be present. A typical example is a compound sentence split into several independent sentences. Within the compound sentence, all phrases (except the last one) should be terminated with the prosodeme P3.1. However, when the link between particular phrases is weak, the utterance can be split into independent sentences which are naturally terminated by the prosodeme P1.1. This is illustrated in Fig. 2.

In Czech text, particular phrases are supposed to be separated by punctuation marks, usually commas (see Note 1). The corresponding segments of speech are supposed to be prosodic phrases ended by functionally involved prosodemes. However, this theoretical assumption is not always fulfilled: pauses can appear inside text phrases, especially when they are long. Conversely, several text phrases can be uttered together without any indication of a functionally involved prosodeme. Moreover, the absence of a pause does not always imply the absence of a functionally involved prosodeme; compare Figs. 3 and 4.

Badly annotated speech corpora can cause various problems. In speech synthesis (specifically, in the unit selection method), prosodeme labels are important attributes for selecting the optimal sequence of speech units from which the resulting speech is built [6]. Using units from an inappropriate prosodeme or mixing units from various prosodemes can degrade the overall speech quality.

2 Proposed Approach

To model the prosodic properties of speech, we employed an HMM framework similar to the one used in HMM-based speech synthesis [8]. Speech was described by a sequence of parameter vectors containing 40 mel cepstral coefficients obtained by the STRAIGHT analysis method [2] and the pitch extracted by using the PRAAT software (see Note 2). The speech parameter vectors were modelled by a set of multi-stream context-dependent HMMs by using the HTS toolkit (see Note 3).

In the HMM-based speech synthesis framework, the phonetic, prosodic and linguistic context is taken into account, i.e. a speech unit is given as a phone with its phonetic, prosodic and linguistic context information. In this manner, the language prosody is modelled implicitly – in various contexts different units/models can be used. In our experiments, each unit is represented by a string

where all subscripted bold letters are contextual factors defined as

Table 1. An example of splitting utterances by the punctuation: “Řekl, že přijde.” (“He said that he will come.”, its phonetic transcription: RekL Ze pQijde). Changes are underlined.

2.1 Training Stage

For our experiments, we used 4 large speech corpora recorded for the purposes of speech synthesis. At the beginning, all utterances were segmented into phrases only by detected pauses, i.e. all phrases correspond to clauses. This manner of annotation is also used in our unit selection TTS system [3], since functionally involved prosodemes are ensured at the end of all phrases.

Model Training. Model parameters were estimated from the speech data by using the maximum likelihood criterion. Three-state left-to-right MSD-HSMMs with single Gaussian output distributions were used. For a more robust model parameter estimation, context clustering based on the MDL criterion was performed. In this stage, the default prosodic annotation of particular phrases is used.

Prosodeme Correction. This procedure is a modified version of a more general method introduced in [1]. First, each individual phrase terminated by the prosodeme P3.1 is additionally transcribed by using the prosodeme P1.1, i.e. both transcriptions differ only in the prosodeme contextual factors within the last prosodic word; this is analogous to the example in Table 1, but simpler. Then the corresponding speech features are force-aligned with both transcriptions and the transcription with the best alignment score is selected for the given phrase.
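The selection step can be illustrated by a minimal Python sketch. It only shows the decision rule (keep the candidate with the best score); the scoring function `toy_align_score` is a hypothetical stand-in for HMM forced alignment, built on the assumption from Sect. 1.1 that P1.1 shows falling and P3.1 rising pitch in the final syllable. The slope targets of -1 and +1 are arbitrary illustration values.

```python
from typing import Callable, Sequence


def toy_align_score(pitch: Sequence[float], label: str) -> float:
    """Hypothetical stand-in for a forced-alignment score (higher is better).

    A real system would align the full feature sequence with the HMMs of
    each transcription; here only the pitch slope of the last syllable is
    compared with an assumed target slope.
    """
    expected_slope = {"P1.1": -1.0, "P3.1": 1.0}[label]
    slope = (pitch[-1] - pitch[0]) / (len(pitch) - 1)
    return -(slope - expected_slope) ** 2


def correct_prosodeme(
    pitch: Sequence[float],
    candidates: Sequence[str] = ("P3.1", "P1.1"),
    score: Callable[[Sequence[float], str], float] = toy_align_score,
) -> str:
    """Return the candidate prosodeme with the best score.

    The default annotation comes first, so it is kept when scores tie.
    """
    return max(candidates, key=lambda label: score(pitch, label))


print(correct_prosodeme([120.0, 118.0, 115.0, 110.0]))  # falling pitch
print(correct_prosodeme([110.0, 114.0, 119.0]))         # rising pitch
```

Passing the scorer as a parameter keeps the decision rule separate from the acoustic model, so the toy scorer could be swapped for real alignment log-likelihoods.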

When a corrected transcription of all utterances is available, the whole process can be run iteratively. The procedure works on the assumption that most utterances correspond to the theoretical prosody model, with some rare exceptions. In that case, the trained HMMs are correct and can be used to reveal those exceptions. Problems occur in the case of less consistent prosody: the inconsistencies can accumulate, a part of the models can be badly trained, and some of the performed corrections are wrong. To cope with that, an additional step is performed at the end of each iteration: the prosodeme correction procedure is also performed by using HMMs from another speaker. Only corrections performed in both cases are kept and the other changes are annulled; therefore, this step is referred to as the annulling step.
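In terms of sets, the annulling step keeps the intersection of the corrections proposed by the two model sets. The following Python sketch assumes that each correction is identified by a phrase identifier; the identifiers and the function name are hypothetical.

```python
from typing import Set, Tuple


def annulling_step(own_changes: Set[str], other_changes: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split the speaker's own corrections into kept and annulled ones.

    `own_changes` holds the phrases corrected with the speaker's own HMMs,
    `other_changes` those corrected with HMMs from another speaker.
    Only corrections confirmed by both model sets are kept.
    """
    kept = own_changes & other_changes
    annulled = own_changes - other_changes
    return kept, annulled


kept, annulled = annulling_step({"utt1_p2", "utt5_p1", "utt9_p3"}, {"utt5_p1", "utt9_p3", "utt12_p1"})
print(sorted(kept), sorted(annulled))
```

Corrections proposed only by the other speaker's models are ignored in this sketch, since the text describes the step as a filter on the performed corrections.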

Splitting Phrases by Punctuation. First, all phrases containing punctuation marks are further split into phrases terminated by P3.1 (excluding the last one, naturally). A simple example is presented in Table 1. When several commas are present in a phrase, all possible split combinations are taken into account. Again, the corresponding speech features are force-aligned with all transcriptions and the transcription with the best alignment score is selected.
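The enumeration of split combinations can be sketched in Python as follows. A phrase with n commas yields 2^n candidate segmentations, because each comma either becomes a phrase border or not. The representation of a phrase as a list of comma-separated text segments and the function name are assumptions for illustration; scoring the candidates by forced alignment is not shown.

```python
from itertools import combinations
from typing import List


def split_candidates(segments: List[str]) -> List[List[List[str]]]:
    """Enumerate all ways to split a phrase at its commas.

    `segments` are the text segments between commas. Each candidate is a
    list of phrases, and each phrase is a list of consecutive segments.
    The unsplit phrase is the first candidate.
    """
    n_commas = len(segments) - 1
    candidates = []
    for n_cuts in range(n_commas + 1):
        # Choose which commas become phrase borders.
        for cuts in combinations(range(n_commas), n_cuts):
            phrases, start = [], 0
            for cut in cuts:
                phrases.append(segments[start:cut + 1])
                start = cut + 1
            phrases.append(segments[start:])
            candidates.append(phrases)
    return candidates


# The sentence from Table 1 has one comma, hence two candidates.
for candidate in split_candidates(["Řekl", "že přijde"]):
    print(candidate)
```

Since the number of candidates grows exponentially with the number of commas, this exhaustive enumeration is only practical for phrases with few commas.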

3 Evaluation and Results

For our experiments, 4 large speech corpora recorded for the purposes of speech synthesis [4] were used: 2 male voices (denoted as AJ and JS) and 2 female voices (denoted as KI and MR). Each corpus contained about 10,000 declarative sentences. A detailed description of the experimental data is presented in Table 2.

Although all corpora are of almost equal size, some statistics differ considerably. This indicates different speaking styles of particular speakers. For example, the number of commas inside phrases corresponds to how often speakers join text segments separated by a comma into one phrase. By contrast, the number of phrases without any end punctuation tells how often speakers make pauses inside continuous text segments. To illustrate the prosody variability, we performed one iteration of the correction procedure without the annulling step. The higher the number of changes, the lower the consistency is supposed to be; see Table 3.

Table 2. Description of experimental data. Please note that the total number of phrases is given as phrases ended by a comma + ended by a dot + without any end punctuation.

Table 3. The initial number of prosodemes and the number of P3.1 → P1.1 changes.

Table 4. Changing prosodemes P3.1 → P1.1: the initial number of prosodemes and the number of changes in particular iterations. Please remember that the corrections are always performed on the default corpora (the correction procedure is not cumulative).

Table 5. Splitting utterances by the punctuation. The number of changes is the same for phrases and prosodemes, since each split produces a new phrase ended with the P3.1 prosodeme.

The iterative correction procedure with the annulling step was tested only on voices AJ and MR. The annulling step was performed by using models from JS and KI, since these voices seem to be more consistent and their models are expected to be more robust. Results are presented in Table 4. Splitting phrases by punctuation was performed for all speakers; results are presented in Table 5. Since this splitting procedure is entirely new, we performed neither iterations nor the annulling step in our experiments.

3.1 Listening Tests

The suitability of the performed corrections was verified by one overall listening test. It contained 120 utterances, each with one selected prosodic word. Listeners picked one of 5 choices: definitely P1.1, probably P1.1, definitely P3.1, probably P3.1, null prosodeme. Sentences were selected to be short and simple, like the examples in Figs. 1, 2, 3 and 4. Five participants took part in this test; all of them were speech processing experts capable of distinguishing various prosodeme types.

The test contained 40 utterances (20 for each of AJ and MR) for the evaluation of the prosodeme changing procedure: 2 × 10 utterances with P3.1 to P1.1 corrections and 2 × 10 utterances with corrections discarded in the annulling step. The remaining 80 utterances (20 for each speaker) were intended for the evaluation of the splitting procedure: 4 × 10 utterances that were additionally split by a comma (split utterances) and 4 × 10 utterances that contain a comma, but the splitting was not performed (non-split utterances).

Changed Prosodemes. The distribution of listeners’ choices is presented in Fig. 5 and Table 6. The most relevant entries are the percentages of changed prosodemes that were marked as P1.1, i.e. in agreement with the correction: 90 % and 76 % for AJ and MR, respectively. A further 6 % and 20 % were marked as P3.1 and the remaining 4 % (for both speakers) were indecisive cases. Since only 3 iterations of the correction procedure were performed and the final state was not reached, further improvement could be expected.

As explained in Sect. 2, the purpose of the annulling step is to increase the robustness within several initial iterations of the correction procedure. All changes can still be applied in a later stage without the annulling step. In any case, the more annulled cases that really do not match the desired prosodeme, the more beneficial this step is. In our case, this rate is about 82 % and 62 % (all non-P1.1 cases).

Table 6. Results of listening test on changing prosodemes P3.1 → P1.1: percentage of particular listeners’ choices. The agreement between human listeners and the proposed correction procedure is expressed mainly by the bold values.

Split Phrases. Results of the listening test are presented in Fig. 6 and Table 7. A high consistency between listeners and the proposed procedure is evident: prosodemes in split utterances were annotated as definitely or probably P3.1 in about 88 % of cases over all speakers (ranging between 84 % for KI and 91 % for AJ). Surprisingly, an appreciable number of P1.1 prosodemes appeared in listeners’ selections. This is in accordance with the experiment on changing prosodemes, so some P1.1s could be expected here, too.

The actual benefit of the splitting procedure should also be apparent from a comparison of results for the split and non-split utterances. Above all, significantly fewer P3.1s and more null prosodemes should be present in non-split sentences. This is true; nevertheless, the number of P3.1s in non-split utterances is higher than expected, reaching 72 % for JS. The reason could be the influence of the sentence structure on the listeners’ decision. Evidently, it depends on the actual speaker, too.
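The percentages reported for the listening tests can be obtained by tallying the listeners’ choices and merging the "definitely" and "probably" variants of each prosodeme. The Python sketch below shows one way to do this; the function name, the label P0 for the null prosodeme choice and the example votes are hypothetical and do not reproduce the data of Tables 6 and 7.

```python
from collections import Counter
from typing import Dict, Iterable

# Map the 5 answer options of the listening test to 3 prosodeme classes.
COLLAPSE = {
    "definitely P1.1": "P1.1",
    "probably P1.1": "P1.1",
    "definitely P3.1": "P3.1",
    "probably P3.1": "P3.1",
    "null prosodeme": "P0",
}


def choice_percentages(choices: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of votes per prosodeme class."""
    counts = Counter(COLLAPSE[choice] for choice in choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no listener choices given")
    return {label: 100.0 * counts[label] / total for label in ("P1.1", "P3.1", "P0")}


# Ten made-up votes for one group of utterances.
votes = ["definitely P3.1"] * 6 + ["probably P3.1"] * 2 + ["probably P1.1", "null prosodeme"]
print(choice_percentages(votes))
```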

Table 7. Results of listening test on splitting phrases by punctuation: percentage of particular listeners’ choices. The agreement between human listeners and the proposed splitting procedure is expressed mainly by the bold values.

4 Conclusion

This paper presented 2 procedures for the correction of the type and borders of prosodic phrases in large speech corpora. Experiments were performed on 4 corpora. The results have been verified in a listening test. The agreement between the listeners and the proposed procedures was about 83 % for changing the prosodeme type and 88 % for splitting utterances into phrases by the punctuation.

In our future work, both proposed procedures should be joined into one iterative correction process. The robustness could be improved by employing speaker-independent models and their adaptation. Other types of phrases (e.g. various types of questions) will be included, too. A big challenge is the automatic prosody annotation of speech data, especially of non-professional speakers whose prosody could be problematic due to its poor consistency.

Notes

1. This is in contrast with English, where using commas has more complex rules. However, some copulative conjunctions in Czech are also used without a comma, e.g. “a”, “nebo”, “ani”, etc. (“and”, “or”, “nor”, respectively).
2. Praat: doing phonetics by computer, www.praat.org.
3. HMM-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp.

References

1. Hanzlíček, Z.: Classification of prosodic phrases by using HMMs. In: Král, P., et al. (eds.) TSD 2015. LNCS, vol. 9302, pp. 497–505. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24033-6_56
2. Kawahara, H., Masuda-Katsuse, I., de Cheveigne, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27, 187–207 (1999)
3. Matoušek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006)
4. Matoušek, J., Tihelka, D., Romportl, J.: Building of a speech corpus optimised for unit selection TTS synthesis. In: Proceedings of LREC 2008 (2008)
5. Romportl, J., Matoušek, J., Tihelka, D.: Advanced prosody modelling. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 441–447. Springer, Heidelberg (2004)
6. Tihelka, D., Matoušek, J.: Unit selection and its relation to symbolic prosody: a new approach. In: Proceedings of Interspeech 2006, pp. 2042–2045 (2006)
7. Wightman, C., Ostendorf, M.: Automatic labeling of prosodic patterns. IEEE Trans. Speech Audio Process. 2, 469–481 (1994)
8. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)

Author information

Authors and Affiliations

NTIS – New Technology for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 22, 306 14, Plzeň, Czech Republic
Zdeněk Hanzlíček

Editor information

Editors and Affiliations

Masaryk University, Brno, Czech Republic
Petr Sojka
Masaryk University, Brno, Czech Republic
Aleš Horák
Masaryk University, Brno, Czech Republic
Ivan Kopeček
Masaryk University, Brno, Czech Republic
Karel Pala

Cite this paper

Hanzlíček, Z. (2016). Correction of Prosodic Phrases in Large Speech Corpora. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_47

DOI: https://doi.org/10.1007/978-3-319-45510-5_47
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5
eBook Packages: Computer Science, Computer Science (R0)
