1 Introduction

Nowadays, very large speech corpora are utilized in many speech processing tasks, such as speech recognition and synthesis. These corpora usually contain several hours of speech or even more. To achieve the best possible results, an appropriate annotation of the recorded utterances is often necessary. With the use of such large corpora, the automatic phonetic and prosodic annotation of speech [7] has become an important task. This paper deals with two basic problems: the correction of an improper prosodeme/phrase type and the detection of new phrase borders.

1.1 Prosody Model

For our purposes, we used the formal prosody model proposed by Romportl [5]. In this model, an utterance is divided into prosodic clauses separated by short pauses. Each prosodic clause includes one or more prosodic phrases containing certain continuous intonation schemes. Furthermore, phrases are composed of prosodic words. The communication function the speaker intends the phrase to have (the type of the phrase) is supposed to be linked with the last prosodic word in the phrase. For this purpose, so-called prosodemes are defined. The last prosodic word is linked with a functionally involved prosodeme, the other words with null prosodemes. For the Czech language, the following basic classes of functionally involved prosodemes were defined [5]:

  1. P1

    – terminating satisfactorily (in declarative sentences)

  2. P2

    – terminating unsatisfactorily (in questions)

  3. P3

    – non-terminating (in non-terminal phrases of compound sentences)

Since this research is limited to declarative sentences and neutral speech (i.e. without emphasis, expressions etc.), only the prosodemes P0, P1.1 and P3.1 were applied. According to the theoretical assumption, every compound sentence consists of several phrases, where the last one is terminated with the prosodeme P1.1 and the other phrases end with P3.1; see a simple example in Fig. 1.

Fig. 1. Declarative compound sentence “Málokdo věří, že by mohl zvítězit.” (“Few believe that he could win.”). This prosodeme combination corresponds to the prosody model: the first phrase ends with P3.1 prosodeme and the last one with P1.1.
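The default labelling implied by this theoretical assumption can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name, the list-of-lists representation and the grouping of the example sentence into prosodic words are hypothetical, and P0 is assumed to denote the null prosodeme.

```python
from typing import List

# Assumption: "P0" denotes the null prosodeme carried by non-final prosodic words.
NULL, TERMINATING, NON_TERMINATING = "P0", "P1.1", "P3.1"


def default_prosodemes(phrases: List[List[str]]) -> List[List[str]]:
    """Label the prosodic words of one declarative sentence.

    `phrases` holds one list of prosodic words per prosodic phrase.
    """
    labels = []
    for index, words in enumerate(phrases):
        if not words:
            raise ValueError("a phrase needs at least one prosodic word")
        # The functionally involved prosodeme is linked with the last
        # prosodic word: P1.1 in the final phrase, P3.1 in all others.
        is_last_phrase = index == len(phrases) - 1
        final = TERMINATING if is_last_phrase else NON_TERMINATING
        labels.append([NULL] * (len(words) - 1) + [final])
    return labels


# Hypothetical prosodic-word grouping of the sentence from Fig. 1.
sentence = [["Málokdo", "věří"], ["že by mohl", "zvítězit"]]
print(default_prosodemes(sentence))  # [['P0', 'P3.1'], ['P0', 'P1.1']]
```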

Particular prosodemes are linked with specific speech features: P1.1 is characterized by a pitch decrease within its last syllable, whereas a pitch increase is typical for P3.1. Besides the pitch shape (which is the most relevant), spectral features, duration and energy can be important for particular prosodemes. Naturally, particular types of phrases do not vary solely within their last prosodic words. Some specific prosodic differences can be present throughout the whole utterance. However, those differences are often rather content-related (e.g. emphasis on some key words) and a more complex prosody model would be required to capture them. The utilized prosody model seems to be adequate for the phrase classification [1].

1.2 Problems in Real Speech

In real speech data, a prosodeme different from the expected one may be present. A typical example is a compound sentence split into several independent sentences. Within the compound sentence, all phrases (except the last one) should be terminated with the prosodeme P3.1. However, when the link between particular phrases is weak, the utterance can be split into independent sentences which are naturally terminated by the prosodeme P1.1. This is illustrated in Fig. 2.

Fig. 2. Declarative compound sentence “My jsme ekonomické oddělení, ne detektivní kancelář.” (“We are the economic department, not a detective agency.”). The first phrase is terminated by an evident prosodeme P1.1.

In Czech text, particular phrases are supposed to be separated by punctuation marks, usually commas. The corresponding segments of speech are supposed to be prosodic phrases ended by functionally involved prosodemes. However, this theoretical assumption is not always fulfilled: pauses can appear inside text phrases, especially when they are long. Conversely, several text phrases can be uttered together without any indication of a functionally involved prosodeme. Moreover, the absence of a pause does not always lead to the absence of a functionally involved prosodeme; compare Figs. 3 and 4.

Fig. 3. Declarative compound sentence “Aby cíle dosáhl, musí mít výsledky.” (“To achieve the goal, the results are necessary.”). Though there is no pause, the prosodeme P3.1 terminating the first part is obvious.

Fig. 4. Declarative compound sentence “Nevím, kdo jiný by jim mohl pomoci.” (“I don’t know who else could help them.”). The punctuation in the text has no evident impact on the prosody realization; neither a pause nor a functionally involved prosodeme is present.

Badly annotated speech corpora can be a source of various problems. In speech synthesis (specifically, in the unit selection method), prosodeme labels are important attributes for selecting the optimal sequence of speech units for building the resulting speech [6]. Using units from an inappropriate prosodeme or mixing units from various prosodemes can degrade the overall speech quality.

2 Proposed Approach

To model the prosodic properties of speech, we employed an HMM framework similar to the one used in HMM-based speech synthesis [8]. Speech was described by a sequence of parameter vectors containing 40 mel-cepstral coefficients obtained by the STRAIGHT analysis method [2] and the pitch extracted by the PRAAT software. The speech parameter vectors were modelled by a set of multi-stream context-dependent HMMs using the HTS toolkit.

In the HMM-based speech synthesis framework, the phonetic, prosodic and linguistic context is taken into account, i.e. a speech unit is given as a phone together with its phonetic, prosodic and linguistic context information. In this manner, the language prosody is modelled implicitly: different units/models can be used in different contexts. In our experiments, each unit is represented by a string

figure a

where all subscripted bold letters are contextual factors defined as

figure b
Table 1. An example of splitting utterances by the punctuation: “Řekl, že přijde.” (“He said that he will come.”, its phonetic transcription: RekL Ze pQijde). Changes are underlined.

2.1 Training Stage

For our experiments, we used 4 large speech corpora recorded for the purposes of speech synthesis. At the beginning, all utterances were segmented into phrases only by detected pauses, i.e. all phrases correspond to clauses. This manner of phonetic annotation is also used in our unit selection TTS system [3], since functionally involved prosodemes are ensured at the end of all phrases.

Model Training. Model parameters were estimated from the speech data using the maximum likelihood criterion. Three-state left-to-right MSD-HSMMs with single Gaussian output distributions were used. For a more robust estimation of the model parameters, context clustering based on the MDL criterion was performed. In this stage, the default prosodic annotation of particular phrases is used.

Prosodeme Correction. This procedure is a modified version of a more general method introduced in [1]. First, each individual phrase terminated by the prosodeme P3.1 is additionally transcribed with the prosodeme P1.1, i.e. the two transcriptions differ only in the prosodeme contextual factors within the last prosodic word; this is analogous to the example in Table 1, but simpler. Then the corresponding speech features are force-aligned with both transcriptions and the transcription with the best alignment score is selected for the given phrase.
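The selection between the two transcriptions can be illustrated by the following minimal Python sketch. It is not the HTS-based implementation used in the experiments: `align_score` is a hypothetical callable standing in for the forced alignment, and it is assumed here that a higher score (e.g. a log-likelihood) means a better alignment.

```python
from typing import Any, Callable, Sequence


def select_best(features: Any,
                candidates: Sequence[str],
                align_score: Callable[[Any, str], float]) -> str:
    """Return the candidate transcription with the highest alignment score.

    Assumption: larger scores are better. On a tie, the earlier candidate
    wins, so the default annotation should be listed first.
    """
    if not candidates:
        raise ValueError("at least one candidate transcription is required")
    return max(candidates, key=lambda cand: align_score(features, cand))


def correct_prosodeme(features: Any,
                      align_score: Callable[[Any, str], float]) -> str:
    """Decide between the default P3.1 and the alternative P1.1."""
    return select_best(features, ("P3.1", "P1.1"), align_score)


# Toy scores standing in for real forced-alignment results (invented values).
toy_scores = {"P3.1": -1520.4, "P1.1": -1498.7}
print(correct_prosodeme(None, lambda feats, label: toy_scores[label]))  # P1.1
```

Listing the default prosodeme first keeps the original annotation whenever the two scores are equal, which seems the more conservative choice for a correction procedure.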

When a corrected transcription of all utterances is available, the whole process can be run iteratively. The procedure works on the assumption that most utterances correspond to the theoretical prosody model, with some rare exceptions. In that case, the trained HMMs are correct and can be used to reveal those exceptions. Problems occur in the case of less consistent prosody: the inconsistencies can accumulate, some of the models can be badly trained and some of the performed corrections can be wrong. To cope with that, an additional step is performed at the end of each iteration: the prosodeme correction procedure is repeated using HMMs from another speaker. Only corrections performed in both cases are kept and the other changes are annulled; therefore, this step is referred to as the annulling step.
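In essence, the annulling step is a set intersection. The sketch below assumes that each correction is identified by a phrase identifier; the function name and the identifiers are hypothetical and serve only as an illustration.

```python
from typing import Set, Tuple


def annulling_step(own_changes: Set[str],
                   other_changes: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split the corrections proposed by the speaker's own models.

    own_changes:   phrases changed when using the speaker's own HMMs
    other_changes: phrases changed when using HMMs from another speaker
    Returns (kept, annulled).
    """
    kept = own_changes & other_changes      # confirmed by both model sets
    annulled = own_changes - other_changes  # not confirmed, reverted
    return kept, annulled


# Hypothetical phrase identifiers.
kept, annulled = annulling_step({"u001_p2", "u014_p1", "u102_p3"},
                                {"u014_p1", "u102_p3", "u250_p1"})
print(sorted(kept))      # ['u014_p1', 'u102_p3']
print(sorted(annulled))  # ['u001_p2']
```

Changes proposed only by the other speaker's models (here `u250_p1`) are ignored, since the text describes keeping only corrections performed in both cases.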

Splitting Phrases by Punctuation. In this procedure, all phrases containing punctuation marks are further split into phrases terminated by P3.1 (excluding the last one, naturally). A simple example is presented in Table 1. When several commas are present in the phrase, all possible split combinations are taken into account. Again, the corresponding speech features are force-aligned with all transcriptions and the transcription with the best alignment score is selected.
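The enumeration of split combinations can be sketched as follows. For a phrase with k commas there are 2^k candidates, because each comma either becomes a phrase border or not. The function name and the text-based representation are hypothetical, and it is assumed that the unsplit phrase is also kept as a candidate.

```python
from itertools import product
from typing import List, Sequence


def split_candidates(segments: Sequence[str]) -> List[List[str]]:
    """Enumerate all ways of splitting a phrase at its commas.

    `segments` are the text segments between commas. Every returned
    candidate is a list of phrases; all but the last phrase of a
    candidate would be terminated by P3.1.
    """
    if not segments:
        raise ValueError("the phrase must contain at least one segment")
    candidates = []
    # One boolean per comma: True means "phrase border at this comma".
    for borders in product((False, True), repeat=len(segments) - 1):
        phrases = [segments[0]]
        for is_border, segment in zip(borders, segments[1:]):
            if is_border:
                phrases.append(segment)
            else:
                phrases[-1] = phrases[-1] + ", " + segment
        candidates.append(phrases)
    return candidates


print(split_candidates(["Řekl", "že přijde"]))
# [['Řekl, že přijde'], ['Řekl', 'že přijde']]
```

Each candidate would then be converted to a transcription and scored by forced alignment, in the same way as in the prosodeme correction procedure.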

3 Evaluation and Results

For our experiments, 4 large speech corpora recorded for the purposes of speech synthesis [4] were used: 2 male voices (denoted as AJ and JS) and 2 female voices (denoted as KI and MR). Each corpus contained about 10,000 declarative sentences. A detailed description of the experimental data is given in Table 2.

Although all corpora are almost equal in size, some statistics are very different. This indicates various speaking styles of particular speakers. For example, the number of commas inside phrases indicates how often speakers join text segments separated by a comma into one phrase. By contrast, the number of phrases without any end punctuation tells how often speakers make pauses inside continuous text segments. To illustrate the prosody variability, we performed one iteration of the correction procedure without the annulling step. The higher the number of changes, the lower the consistency is supposed to be; see Table 3.

Table 2. Description of experimental data. Please note that the total number of phrases is given as phrases ended by a comma + ended by a dot + without any end punctuation.
Table 3. The initial number of prosodemes and the number of P3.1 \(\rightarrow \) P1.1 changes.
Table 4. Changing prosodemes P3.1 \(\rightarrow \) P1.1: the initial number of prosodemes and the number of changes in particular iterations. Please remember that the corrections are always performed on the default corpora (the correction procedure is not cumulative).
Table 5. Splitting utterances by the punctuation. The number of changes is the same for phrases and prosodemes, since each split produces a new phrase ended with the P3.1 prosodeme.

The iterative correction procedure with the annulling step was tested only on voices AJ and MR. The annulling step was performed using models from JS and KI, since these voices seem to be more consistent and their models are expected to be more robust. Results are presented in Table 4. Splitting phrases by punctuation was performed for all speakers; the results are presented in Table 5. Since this splitting procedure is entirely new, we performed neither iterations nor the annulling step in these experiments.

3.1 Listening Tests

The suitability of the performed corrections was verified by one overall listening test. It contained 120 utterances, each with one selected prosodic word. Listeners picked one of 5 choices: definitely P1.1, probably P1.1, definitely P3.1, probably P3.1, null prosodeme. Sentences were selected to be short and simple, like the examples in Figs. 1, 2, 3 and 4. Five participants took part in this test; all of them were speech processing experts capable of distinguishing various prosodeme types.

The test contained 40 utterances (20 for each of AJ and MR) for the evaluation of the prosodeme changing procedure: \(2\times 10\) utterances with P3.1 to P1.1 corrections and \(2\times 10\) utterances with corrections discarded in the annulling step. The remaining 80 utterances (20 for each speaker) were intended for the evaluation of the splitting procedure: \(4\times 10\) utterances that were additionally split at a comma (split utterances) and \(4\times 10\) utterances that contain a comma, but where the splitting was not performed (non-split utterances).
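The composition of the test can be checked with a short Python sketch; the dictionary layout and the group names are chosen here only for illustration.

```python
# Utterances per speaker and group, as described in the text.
prosodeme_part = {spk: {"corrected": 10, "annulled": 10}
                  for spk in ("AJ", "MR")}
splitting_part = {spk: {"split": 10, "non_split": 10}
                  for spk in ("AJ", "JS", "KI", "MR")}


def total(design: dict) -> int:
    """Sum the utterance counts over all speakers and groups."""
    return sum(sum(groups.values()) for groups in design.values())


print(total(prosodeme_part))                          # 40
print(total(splitting_part))                          # 80
print(total(prosodeme_part) + total(splitting_part))  # 120
```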

Changed Prosodemes. The distribution of listeners’ choices is presented in Fig. 5 and Table 6. The most relevant entries are the percentages of changed prosodemes that were marked as P1.1, i.e. in agreement with the P3.1 \(\rightarrow \) P1.1 correction: 90 % and 76 % for AJ and MR, respectively. The other 6 % and 20 % were marked as P3.1 and the remaining 4 % (for both speakers) were indecisive cases. Since only 3 iterations of the correction procedure were performed and the final state was not reached, further improvement could be expected.

As explained in Sect. 2, the purpose of the annulling step is to increase the robustness within several initial iterations of the correction procedure. All changes can still be applied at a later stage without the annulling step. In any case, the more annulled cases that really do not match the desired prosodeme, the more beneficial this step is. In our case, this rate is about 82 % and 62 % (all non-P1.1 cases).

Fig. 5. Results of listening test on changing prosodemes P3.1 \(\rightarrow \) P1.1.

Table 6. Results of listening test on changing prosodemes P3.1 \(\rightarrow \) P1.1: percentage of particular listeners’ choices. The agreement between human listeners and the proposed correction procedure is expressed mainly by the bold values.

Split Phrases. Results of the listening test are presented in Fig. 6 and Table 7. A high consistency between the listeners and the proposed procedure is evident: prosodemes in split utterances were annotated as definitely or probably P3.1 in about 88 % of cases over all speakers (ranging between 84 % for KI and 91 % for AJ). Surprisingly, an appreciable number of P1.1 prosodemes appeared in the listeners’ selections. However, this is in accordance with the experiment on changing prosodemes, so some P1.1s could be expected here, too.

The actual benefit of the splitting procedure should also be apparent from a comparison of the results for the split and non-split utterances. Above all, significantly fewer P3.1s and more null prosodemes should be present in non-split sentences. This is true; nevertheless, the number of P3.1s in non-split utterances is higher than expected, reaching 72 % for JS. The reason could be the influence of the sentence structure on the listeners’ decision. Evidently, it depends on the actual speaker, too.

Fig. 6. Results of listening test: splitting utterances into phrases by punctuation.

Table 7. Results of listening test on splitting phrases by punctuation: percentage of particular listeners’ choices. The agreement between human listeners and the proposed splitting procedure is expressed mainly by the bold values.

4 Conclusion

This paper presented two procedures for the correction of the type and borders of prosodic phrases in large speech corpora. Experiments were performed on 4 corpora. The results were verified in a listening test. The agreement between the listeners and the proposed procedures was about 83 % for changing the prosodeme type and 88 % for splitting utterances into phrases by the punctuation.
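The figure of about 83 % is consistent with an unweighted mean of the two per-speaker agreement rates reported in Sect. 3.1 (90 % for AJ and 76 % for MR). Whether the overall value was actually computed this way is an assumption; the following Python sketch only shows that the numbers match.

```python
from statistics import mean

# Per-speaker agreement for the prosodeme changing procedure (Sect. 3.1).
agreement = {"AJ": 90.0, "MR": 76.0}

# Assumption: the overall agreement is the unweighted mean over speakers,
# which is reasonable here since both speakers contributed 10 utterances.
overall = mean(agreement.values())
print(f"{overall:.0f} %")  # 83 %
```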

In our future work, both proposed procedures should be joined into one iterative correction process. The robustness could be improved by employing speaker-independent models and their adaptation. Other types of phrases (e.g. various types of questions) will be included, too. A big challenge is the automatic prosody annotation of speech data, especially from non-professional speakers, whose prosody could be problematic due to its poor consistency.