Keywords

1 Introduction

Interactions between the segmental and prosodic levels of speech may be exemplified in several ways, but it is perhaps the temporal domain in which this interaction can most readily be observed. Temporal segmentation and the rhythm of speech have been shown to affect listeners to a great extent. Arhythmical or temporally unpredictable speech leads to longer reaction times in monitoring experiments [3, 19] and, therefore, to increased cognitive processing of the incoming speech signal. In addition, speech with unnatural durational patterns has been shown to decrease intelligibility [18], or to induce negative perceptions regarding the speakers, making the speakers sound, for instance, more nervous and anxious [30] or less competent.

In the field of text-to-speech (TTS) synthesis, the durational patterns (as well as other prosody patterns) can either be modelled explicitly, i.e. by means of an estimator, assigning a particular duration value to each speech unit to be synthesized [8, 12, 20], or by a symbolic description, defining only deep-level description (discriminative features defining what the prosody should/should not be) and expecting that the appropriate surface-level duration patterns (the particular unit durations) will emerge as a result of this description. While the former approach is used mainly in generative approaches like HMM [11], DNN [33] or single instance speech synthesis [16, 27], the latter is mostly employed in unit selection speech synthesis, where it provides more robust selection criteria when compared to the following of generated prosody contours [24]. The insidious part is, however, the definition of the discriminative features. The usual way is to describe the position of speech units within a prosodic pattern (whether a phrase, word or syllable) in the source speech recordings and to select each particular unit into the most similar position in the synthesized phrase. The side effect is the increased pressure on the selection criteria, trying to balance the trade-off for all the features used, which is in details described in Sect. 3. First, however, let us look at the specific cases where it is important to consider temporal/durational properties of speech units and why this is necessary.

2 The Special Status of Final Syllables

Concatenative speech synthesis may be regarded as one of the key stimulating factors in the research of the effects of the prosodic domain on the duration of speech segments. Klatt [13] showed how segmental duration interacts with linguistic characteristics at various levels and devised a series of rules which predicted the durations of speech sounds in American English through the multiplication of a base value by coefficients related to various segmental and prosodic attributes (see [5, 22] for similar studies).

The above-mentioned and other studies document several ways in which segmental duration is affected by the prosodic structure of utterances. In many languages, accented syllables are realized as longer than unaccented ones, or syllables tend to shorten in longer words [9]. Probably the most salient reflection of prosodic structure in the segmental strand is called phrase-final lengthening, which appears to be related to the general declination observed on a number of levels: ends of prosodic phrases are thus marked by lower speech rate and a drop in fundamental frequency [10, 14], as well as larger and longer articulatory gestures [4] and laxer phonation [17]. Phrase-final lengthening has been documented in many languages [9], including Czech [7, 29], and is considered to be universal in speech. A number of studies (see the Refs. in [9]) have agreed that the rate lowering concerns the rhyme of the final syllable (i.e., the vowel nucleus plus any consonants in the coda following the vowel).

Lengthening is thus a well-documented phenomenon at the level of prosodic phrases. In this study, we are interested in temporal adjustments at the level of individual words, which have been researched considerably less (although mentioned in literature). When comparing the duration of the schwa vowel framed in sentences / posed a problem/ and /Her pop the marriage/, Beckman and Edwards [2] found the schwa in the first sentence significantly longer. Similar results were obtained by [6], who compared pairs like /lettuce–let us/ or /inquires–in choirs/, though the effect was relatively small. More recently, word-final lengthening is discussed in [31], who, however, did not reach a definitive conclusion, and in [32], who report significantly longer durations in word-final positions. All of these studies consider English, though.

Until recently, no data on word-final lengthening have been available for Czech. A new study on the acoustic characteristics of Czech lexical stress [23] showed, however, that duration is the only parameter which systematically varies with lexical stress – note that Czech is a language with stress fixed on the first syllable of the base rhythm unit, also called the prosodic word [21]. Figure 1 shows part of the duration data taken from multi-syllabic words in spontaneous speech; only words which did not appear as phrase-final were analysed in the study. The results differ from what is typical in most languages: the vowel in the stressed syllable tends to be shortest and that in the last syllable is always longest (though not always statistically significant).

Fig. 1.
figure 1

Vowel duration in individual syllables of 3-, 4-, and 5-syllabic words; based on [23].

The account presented in this section indicates that phrase-final lengthening is essentially a ubiquitous and very salient phenomenon, but even word-final lengthening is non-negligible. The present study examines speech sounds in the last syllable of phonological words from the perspective of concatenative unit-selection synthesis of Czech. The results presented by [23] suggest that the word-final syllable may enjoy a special status in the prosodic hierarchy of Czech, and this may need to be reflected in the speech synthesis algorithm. Specifically, it is possible that units which appeared in the word-final position in the source recordings should not be selected for synthesis within words. This study therefore tests the hypothesis that the presence of word-final (source) units in word-internal positions in synthesized speech will be perceived negatively by listeners, since the lengthening will disrupt the local timing.

3 Handling of Unit Position in TTS Systems

The previous section indicates the great importance of the last syllable. We decided to check this in our TTS system ARTIC based on unit selection speech synthesis method. The highest influence of the unit position has the design of target cost (TC).

Fig. 2.
figure 2

The illustration of the correspondence of windowing functions to a prosodic word “synthesis”. Individual windows are distinguished by line style, points correspond to the values describing the candidates.

The target cost, as usually handled, is set to strive for putting the units into the same suprasegmental surroundings as they originally had in the source recordings, which is achieved when target features match. Thus, to make sure that the last-syllable position is handled properly, as described in Sect. 2, we could simply define an additional feature with onset, nucleus, coda symbolic values assigned to units recorded in the last syllable, and with not-last value for units from other syllables. Based on the match or mismatch of such feature values, an additional penalty would be added to the total TC, encouraging (in theory) the placement of units into the required position. However, in case of value mismatch, there is the same penalty no matter the value mismatch – i.e. the same penalty for e.g. unknown \(\leftrightarrow \) nucleus as for coda \(\leftrightarrow \) onset (unless a feature exchange matrix is defined somehow). An alternative way of defining the last-syllable position as a binary feature, i.e. a unit either belongs to the last syllable or not, is not going to improve the situation very much – there is better change to avoid mismatch of last/non-last syllable units, but units within the last syllable may still be interchanged between coda and onset parts (while both match the feature), which is not better either.

In our current TTS version, therefore, the computation of position within a prosodic word (hereafter referred to as p-word) is based on the assumption that there is no need to put a unit into the identical position in which the unit was originally recorded [26, 28]. This is achieved through the definition of a suitability to the required position related to the beginning/middle/end of each p-word, represented by 3 values, each corresponding to one von Hann window spanned through the p-word, as illustrated in Fig. 2. Such a position feature can avoid a completely wrong placement (i.e. unit originating at the beginning of a p-word to be placed to its end), while it still allows some kind of flexibility in units interchanging, when the unit originates in a position which is “close” to the one in which it is to be placed. From the point of view of the selection algorithm, the set of candidates suitable for the given p-word position (as defined by the three values) is wider than if the position were defined as, for example, distance from the beginning and from the end of the p-word. And the wider set to select from gives the algorithm a better chance to choose a more appropriate unit, when the other target and concatenation features are taken into account as well.

On the other hand, such a position measure (and either of similar) is not able to handle a “last-syllable” occurrence feature reliably. We observed unnatural unit lengthenings close to the end of a p-word which occurred when a unit from the last syllable was placed to the penultimate syllable, since its position was suitable to the position required, as measured by the current approach.

4 New Unit Position Reflecting Final Syllable Status

Based on the findings from Sect. 2, we thus re-defined the unit position feature in a way that instead of measuring appropriateness for the given position, we strongly penalize placements into inappropriate positions, while expecting any other placements to be equally suitable, i.e. not penalizing them in any way. Let us emphasize that this completely replaces the original position function from Sect. 3, instead of just being added as an extra measure. The reason is that adding a feature would increase the pressure on the selection algorithm, lowering the set of candidates matching the feature and thus increasing the chance that a unit not matching the required position will be used after all. And moreover, ensuring a position placement is not that significant, as suggested in Sect. 2.

Let us also note that there was no exact syllabification used to identify the last p-word syllable (neither is it possible to detect syllables exactly [15]). Instead, we simply found the last vowel (or syllabic consonant) V in each p-word, and the diphones [*-V], [V-*] as well as the remaining diphones until the end of p-word were considered to be the syllable constituents (cf. references in Sect. 2 which show that the rhyme of the final syllable is most affected by final lengthening).

Having the new positional feature, we synthesized more than a million phrases by the original (marked as \(TTS_{base}\) hereafter) and by the modified version of unit selection method (\(TTS_{syll}\)) embedded into our TTS system. The resulting sequences of units provided by \(TTS_{base}\) for each phrase were further analysed – both sequence of the phrase units and its corresponding selected units were passed through \(TTS_{syll}\) module which returned the high TC value for units with inappropriate syllable position. This number of position misplacements was used as the base score for the selection of phrases to be evaluated by listening tests.

4.1 Listening Tests Overview

To verify our presumption, we carried out a large 3-scale preference listening test, where two variants of the same phrase, synthesized by \(TTS_{base}\) and \(TTS_{syll}\), were compared. To select the phrases for the evaluation, the set of all the synthesized phrases was first limited to those having 40 characters at most, since smaller differences in long phrases are rather hard for listeners to recall [1]. Then, only the phrases with two or more misses, as described in Sect. 4, were kept. The final set of 25 phrase pairs was selected randomly, while 4 different professional synthetic voices, two male and two female, were used – in total 100 phrases were thus compared.

22 listeners participated in the test, 7 of them being speech synthesis experts, 6 being phoneticians and 9 naive listeners. We intentionally did not inform any of them about the experiment’s details, since the aim was to test the overall quality of the new TTS system. Furthermore, during the listening test, the order of the samples was randomized across the listening prompts, so the listeners did not know which one was synthesized by \(TTS_{base}\) and by \(TTS_{syll}\). The listeners were able to listen to each stimulus repeatedly, they were instructed to use earphones, and had to choose one of the following choices: sample A sounds better/samples are of the same quality/sample B sounds better. The A/B assignments were then normalized to \(A = 1\) where \(TTS_{syll}\) variant was preferred, \(A = -1\) where \(TTS_{base}\) was preferred, and \(A=0\) otherwise. The final score s of the listening test T was then computed using the Eq. 1

$$\begin{aligned} s = \frac{\sum _{A\in T} A}{\sum _{A \in T} 1} \end{aligned}$$
(1)

Thus, the positive value of s indicates the improvement of the overall quality when using the new last-syllable feature.

Let us note that for the purpose of this testing, we did not use the procedure designed in [25]. The main reason is that we do not have to rely on numbers of changed units when comparing the output sequences from the two system versions; instead we can simply detect wrong cases in \(TTS_{base}\), as described in Sect. 4. Nevertheless, we plan to use the procedure before the final deployment of \(TTS_{syll}\) to the production-ready version of our TTS system.

5 Results

The results of the listening tests are shown in the Table 1. All the score values s are positive, indicating a considerable improvement of the quality of synthesized samples.

Table 1. The results obtained from listening tests. The table contains the number of listener answers and the score values s computed by Eq. 1.

Since the score value for male speaker 1 obtained from phonetics experts’ answers seems to be low and not so conclusive, we decided to prove the statistical significance of this result. We have carried out the sign test with the null hypothesis H0:the outputs of the both systems are of the same quality, and alternative hypothesis H1:the output of one system sounds better. The computed p-value \( = 0.0193\), so we can reject the null hypothesis H0 at \(\alpha =0.05\) significance level, concluding that the quality of \(TTS_{syll}\) system is really higher.

However, there is a considerable number of “same quality” evaluations. It suggests that not all candidates from the last syllable cause a speech artefact when placed to non-last syllable position – note that, as described in Sect. 4, we know that there is a unit from the last syllable placed to non-last position in the \(TTS_{base}\).

On the other hand, the listeners sometimes preferred the \(TTS_{base}\) variant. Therefore, we inspected the problematic prompts, trying to find the cause of the preference. The main problem was a disturbing artefact of another kind (not directly related to syllable position) in \(TTS_{syl}\) variant, while none of the syllable position misses in \(TTS_{base}\) was perceived negatively, nor was there another artefact. In a few cases, moreover, even \(TTS_{syl}\) still contained unnatural inter-phrase lengthenings. These originated from the failures of p-word tokenization in source recordings, and thus the system used the last-syllable unit into inappropriate position without being able to realize it.

6 Conclusion

The results of the listening tests clearly confirm the importance of the correct handling of the last syllable of a p-word. Let us also note that the change of paradigm, when instead of “forcing” units into the expected position we try to avoid their use in positions where they are known to cause audible artefacts, follows the principle of units synonymy/homonymy established in [24, 26]. We believe that this is the right direction towards the tuning of unit selection features, as it allows the use of units in a much wider range of placements (than the one in which the unit has been placed in the source recordings), and it also avoids the definition of a penalty function which would evaluate a distance of what the unit is (where it is placed) to what it is required to be (where we try to place it).

One of the possible further improvements will now be the focus of p-word tokenization which was found to be incorrect in some of the cases examined and proved to be the clear cause of \(TTS_{base}\) preference in these. Also, the findings in Sect. 5 suggest that there may be some durational (or prosodic, in general) patterns, allowing the exchange of the last syllable and non-last syllable units under some conditions. Answering this phenomena would even more relax the pressure on the selection algorithm, thus widening the set of candidates to be used.