
1 Introduction

Today, speech synthesis systems, whether based on unit selection (Hunt and Black 1996) or on HMMs (Zen et al. 2009), are able to produce natural synthetic speech from text. Over the last decade, research has mainly focused on the modelling of speech prosody ("the music of speech": accent/phrasing, intonation/rhythm) for text-to-speech (TTS) synthesis. Gaussian mixture models (GMMs) and hidden Markov models (HMMs) are today the most popular methods used to model speech prosody. In particular, the modelling of speech prosody has gradually and durably moved from short-time, frame-by-frame representations (Yoshimura et al. 1999; Zen et al. 2004; Tokuda et al. 2003; Toda and Tokuda 2007; Yan et al. 2009) to large-time representations (Gao et al. 2008; Latorre and Akamine 2008; Qian et al. 2009; Obin et al. 2011b). Recent research also tends to introduce deep architectures, such as deep neural networks (Zen et al. 2013), to model the complexity of speech more efficiently. However, current speech synthesis systems still suffer from a number of limitations, with the consequence that the synthetic speech does not sound entirely "human". In particular, the absence of alternatives/variants in the synthetic speech is a dramatic limitation compared to the variety of human speech (see Fig. 13.1 for an illustration): for a given text, the speech synthesis system will always produce exactly the same synthetic speech.

Fig. 13.1 Illustration of speech alternatives: human vs. machine

A human speaker can use a variety of alternatives/variants to pronounce a text. This variety may induce variations in the symbolic (prosodic events: accent, phrasing) and acoustic (prosody: prosodic contours; segmental: articulation, co-articulation) speech characteristics. These alternatives depend on linguistic constraints, specific strategies of the speaker, speaking style, and pragmatic constraints. Current speech synthesis systems do not exploit this variety during statistical modelling or synthesis. During training, the symbolic and acoustic speech characteristics are usually estimated with a single normal distribution, which is assumed to correspond to a single strategy of the speaker. During synthesis, the sequence of symbolic and acoustic speech characteristics is entirely determined by the sequence of linguistic characteristics associated with the sentence: the most likely sequence.

In real-world speech synthesis applications (e.g. announcements, storytelling, or interactive speech systems), expressive speech is required (Obin et al. 2011a; Obin 2011). The use of speech alternatives may substantially improve speech synthesis (Bulyko and Ostendorf 2001) and narrow the gap between machine and human. First, alternatives can be used to provide a variety of speech candidates that may be exploited to vary the speech synthesized for a given sentence. Second, alternatives can also be advantageously used as a relaxed constraint in the determination of the sequence of speech units, improving the quality of the synthesized speech. For instance, the use of a symbolic alternative (e.g. the insertion/deletion of a pause) may lead to a significantly improved sequence of speech units.

This chapter addresses the use of speech alternatives to improve the quality and the variety of speech synthesis. The proposed speech synthesis system (ircamTTS) is based on unit selection, and uses various context-dependent parametric models (GMMs/HMMs) to represent the symbolic/acoustic characteristics of speech prosody. During synthesis, symbolic and acoustic alternatives are exploited using a generalized Viterbi algorithm (GVA) (Hashimoto 1987). First, a GVA is used to determine a set of symbolic candidates, corresponding to the \(K_\mathrm{symb.}\) most likely sequences of symbolic characteristics, in order to enrich the further selection of speech units. For each symbolic candidate, a GVA is then used to determine the \(K_\mathrm{acou.}\) most likely sequences of speech units under the joint constraint of segmental and speech prosody characteristics. Finally, the optimal sequence of speech units is determined so as to maximize the cumulative symbolic/acoustic likelihood. Alternatively, the introduction of alternatives makes it possible to vary the speech synthesis by selecting one of the K most likely speech sequences instead of the single most likely one. The proposed method can easily be extended to HMM-based speech synthesis.

The speech synthesis system used for the study is presented in Sect. 13.2. The use of speech alternatives during the synthesis, and the GVA are introduced in Sect. 13.3. The proposed method is compared to various configurations of the speech synthesis system (modelling of speech prosody, use of speech alternatives), and validated with objective and subjective experiments in Sect. 13.4.

2 Speech Synthesis System

Unit-selection speech synthesis is based on the optimal selection of a sequence of speech units that corresponds to the sequence of linguistic characteristics derived from the text to synthesize. The optimal sequence of speech units is generally determined so as to minimize an objective function usually defined in terms of concatenation and target acoustic costs. Additional information (e.g. prosodic events such as ToBI labels) can also be derived from the text to enrich the description used for unit selection.
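The cost-minimisation view above can be sketched as a dynamic-programming search over a lattice of candidate units. The following is a minimal toy illustration, not the ircamTTS implementation: the cost functions and unit identifiers are invented for the example.

```python
# Toy sketch of unit selection as a lattice search (all costs hypothetical).
# Each target position offers several candidate units; we minimise the sum
# of target costs (unit vs. linguistic target) and concatenation costs
# (join between consecutive units) by dynamic programming.

def select_units(candidates, target_cost, concat_cost):
    """candidates: list over positions, each a list of unit ids."""
    # best[i][j]: (cost, backpointer) of the best path ending in unit j at position i
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + concat_cost(v, u)
                     for k, v in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k] + target_cost(i, u), k))
        best.append(row)
    # backtrack from the cheapest final state
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(best) - 1, 1 - 1, -1):
        if i > 0:
            j = best[i][j][1]
            path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Invented toy costs: units and targets are plain integers.
units = select_units([[0, 2], [1, 3]],
                     target_cost=lambda i, u: abs(u - i),
                     concat_cost=lambda v, u: abs(u - v))
```

The Viterbi-style recursion keeps exactly one survivor per state; Sect. 13.3 relaxes precisely this point.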

The optimal sequence of speech units \(\overline{\mathbf{u}}\) can be determined by jointly maximizing the symbolic/acoustic likelihood of the sequence of speech units \(\mathbf{u}=[u_1, \dots, u_N]\) conditionally to the sequence of linguistic characteristics \(\mathbf{c}=[c_1, \dots, c_N]\):

$$\overline{\mathbf{u}} = \underset{\mathbf{u}}{\mathrm{argmax}}\ \mathrm{p}({\mathbf O}(\mathbf{u}) \mid {\mathbf c})$$
(13.1)

where \({\mathbf O}(\mathbf{u}) = [{\mathbf O}_{\mathrm{symb.}}(\mathbf{u}), {\mathbf O}_{\mathrm{acou.}}(\mathbf{u})]\) denotes the symbolic and acoustic characteristics associated with the sequence of speech units \(\mathbf{u}\).

A suboptimal solution to this equation is usually obtained by factorizing the symbolic/acoustic characteristics:

$$\begin{aligned} \overline{\mathbf{u}}_{\mathrm{symb.}} & = \underset{\mathbf{u}_{\mathrm{symb.}}}{\mathrm{argmax}} \mathrm{p}({\mathbf O}_{\mathrm{symb.}}(\mathbf{u}_{\mathrm{symb.}}) | {\mathbf c}) \end{aligned}$$
(13.2)
$$\begin{aligned} \overline{\mathbf{u}}_{\mathrm{acou.}} & = \underset{\mathbf{u}_{\mathrm{acou.}}}{\mathrm{argmax}} \mathrm{p}({\mathbf O}_{\mathrm{acou.}}(\mathbf{u}_{\mathrm{acou.}}) | {\mathbf c}, \overline{\mathbf{u}}_{\mathrm{symb.}})\end{aligned}$$
(13.3)

where \(\mathbf{u}_{\mathrm{symb.}}\) is the symbolic sequence of speech units (typically, a sequence of prosodic events, e.g. accent and phrasing), and \(\mathbf{u}_{\mathrm{acou.}}\) is the acoustic sequence of speech units (i.e. a sequence of speech units for unit selection, and a sequence of speech parameters for HMM-based speech synthesis). This acoustic sequence of speech units represents the short-term (source/filter) and long-term (prosody: F0, duration) variations of speech over various units (e.g. phone, syllable, and phrase).

In other words, the symbolic sequence of speech units \(\overline{\mathbf{u}}_{\mathrm{symb.}}\) is first determined, and then used for the selection of acoustic speech units \(\overline{\mathbf{u}}_\mathrm{acou.}\). This conventional approach suffers from the following limitations:

  1. Symbolic and acoustic modelling are processed separately during training and synthesis, which remains suboptimal and may degrade the quality of the synthesized speech.

  2. A single sequence of speech units is determined during synthesis, whereas the use of alternatives enlarges the number of speech candidates available, and can thus improve the quality of the synthesized speech.

To overcome these limitations, the ideal solution combines joint symbolic/acoustic modelling, to determine the sequence of speech units that is globally optimal (Eq. 13.1), with the exploitation of speech alternatives, to enrich the search for the optimal sequence of speech units. The present study only addresses the use of symbolic/acoustic alternatives for speech synthesis. Symbolic alternatives are used to determine a set of symbolic candidates \(\overline{\mathbf{u}}_{\mathrm{symb.}}\) so as to enrich the further selection of speech units (Eq. 13.2). For each symbolic candidate, the sequence of acoustic speech units \(\overline{\mathbf{u}}_{\mathrm{acou.}}\) is determined through a relaxed-constraint search using acoustic alternatives (Eq. 13.3). Finally, the optimal sequence of speech units \(\overline{\mathbf{u}}\) is determined so as to maximize the cumulative likelihood of the symbolic/acoustic sequences.

The use of symbolic/acoustic alternatives requires adequate statistical models that explicitly describe alternatives, and a dynamic selection algorithm that can manage these alternatives during speech synthesis. Symbolic and acoustic models used for this study are briefly introduced in Sects. 13.2.1 and 13.2.2. Then, the dynamic selection algorithm used for unit selection is described in Sect. 13.3.

2.1 Symbolic Modelling

The prosodic events (accent and phrasing) are modelled by a statistical model based on HMMs (Black and Taylor 1994; Atterer and Klein 2002; Ingulfen et al. 2005; Obin et al. 2010a, 2010b; Parlikar and Black 2012; Parlikar and Black 2013). A hierarchical HMM (HHMM) is used to assign the prosodic structure of a text: the root layer represents the text, each intermediate layer a phrase (here, intermediate phrase and phrase), and the final layer the sequence of accents. For each intermediate layer, a segmental HMM and information fusion are used to combine the linguistic and metric constraints (length of a phrase) for the segmentation of a text into phrases (Ostendorf and Veilleux 1994; Schmid and Atterer 2004; Bell et al. 2006; Obin et al. 2011c). An illustration of the HHMM for the symbolic modelling of speech prosody is presented in Fig. 13.2.

Fig. 13.2 Illustration of the HHMM symbolic modelling of speech prosody for the sentence: “Longtemps, je me suis couché de bonne heure” (“For a long time I used to go to bed early”). The intermediate layer illustrates the segmentation of the text into phrases; the terminal layer illustrates the assignment of accents

2.2 Acoustic Modelling

The acoustic (short- and long-term) models are based on context-dependent GMMs (cf. Veaux et al. 2010; Veaux and Rodet 2011, for a detailed description). Three different observation units (phone, syllable, and phrase) are considered, and separate GMMs are trained for each of these units. The model associated with the phone unit is merely a reformulation of the target and concatenation costs traditionally used in unit-selection speech synthesis (Hunt and Black 1996). The other models are used to represent the local variation of prosodic contours (F0 and durations) over the syllables and the major prosodic phrases, respectively. The use of GMMs makes it possible to capture the prosodic alternatives associated with each of the considered units (Fig. 13.3).
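How a GMM captures alternatives can be illustrated with a toy one-dimensional example: each mixture component stands for one prosodic alternative of the same symbolic unit, and the component posteriors say which alternative best explains an observation. All numerical values below are invented for illustration.

```python
import numpy as np

# Toy illustration (invented values): a 2-component GMM over a syllable's
# mean F0, where each component represents one prosodic alternative
# (e.g. a lower vs. a higher realisation of the same symbolic unit).
weights = np.array([0.6, 0.4])        # mixture weights
means   = np.array([180.0, 230.0])    # component means (Hz)
stds    = np.array([15.0, 20.0])      # component standard deviations (Hz)

def log_gauss(x, mu, sigma):
    # log-density of a univariate Gaussian
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def component_posteriors(f0):
    """Posterior probability of each alternative given an observed F0."""
    log_joint = np.log(weights) + log_gauss(f0, means, stds)
    log_joint -= np.max(log_joint)            # for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

# An observed F0 of 220 Hz is far better explained by the second alternative.
posteriors = component_posteriors(220.0)
```

A single-Gaussian model would average the two alternatives into one mode; the mixture keeps them distinct, which is what makes the alternative-aware search of Sect. 13.3 possible.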

Fig. 13.3 Illustration of acoustic alternatives for a given symbolic unit

3 Exploiting Alternatives

The main idea of this contribution is to exploit the symbolic/acoustic alternatives observed in human speech. Figure 13.4 illustrates the integration of symbolic/acoustic alternatives for speech synthesis. The remainder of this section presents the details of the generalized Viterbi search used to exploit symbolic/acoustic alternatives for TTS synthesis.

Fig. 13.4 Illustration of symbolic/acoustic alternatives for text-to-speech synthesis. The top of the figure presents three alternative symbolic sequences for a given input text. The bottom of the figure presents four acoustic alternatives for the symbolic event circled on top. Fundamentally, each text has several alternative symbolic sequences, and each symbolic sequence has several alternative acoustic sequences

Fig. 13.5 Illustration of the Viterbi search and the generalized Viterbi search. The boxes represent the lists of states among which the best paths are selected. In the Viterbi search, only one path is retained at any time, and only one survivor is retained during selection. In the generalized Viterbi search, K paths are retained at any time, and K survivors are retained during selection (alternative candidates, here K = 3). At any time, the generalized Viterbi search therefore has a larger memory than the Viterbi search

In a conventional synthesizer, the search for the optimal sequence of speech units (Eq. 13.1) is decomposed into two separate optimisation problems (Eqs. 13.2 and 13.3). These two equations are generally solved using the Viterbi algorithm, which defines a lattice whose states at each time t are the N candidate units. At each time t, the Viterbi algorithm considers N lists of competing paths, each list being associated with one of the N states. Then, for each list, only one survivor path is selected for further extension. The Viterbi algorithm can therefore be described as an N-list, 1-survivor (N, 1) algorithm. The GVA (Hashimoto 1987) consists of a twofold relaxation of this path selection:

  • First, more than one survivor path can be retained for each list.

  • Second, a list of competing paths can encompass more than one state.

An illustration of this approach is given in Fig. 13.5, which shows that the GVA can retain survivor paths that would otherwise be merged by the classical Viterbi algorithm. Thus, the GVA can keep track of several symbolic/prosodic alternatives until the final decision is made.
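The first of the two relaxations, keeping K survivors per list instead of one, can be sketched as follows. This is a toy illustration under invented log-probabilities, not the ircamTTS implementation; for simplicity each list coincides with one state, i.e. an (N, K) search.

```python
import heapq

# Sketch of an (N, K) generalised Viterbi search (after Hashimoto 1987):
# instead of the single survivor per state of the classical Viterbi
# algorithm, each state keeps its K best partial paths, so alternative
# hypotheses survive until the final decision.

def generalized_viterbi(init, trans, emit, obs, K):
    """init[s], trans[sp][s], emit[s][o]: log-probabilities; obs: sequence.
    Returns the K overall best (log-prob, state sequence) pairs."""
    n_states = len(init)
    # survivors[s]: list of (log-prob, path) of partial paths ending in state s
    survivors = [[(init[s] + emit[s][obs[0]], [s])] for s in range(n_states)]
    for o in obs[1:]:
        new = []
        for s in range(n_states):
            # extend every surviving path of every predecessor into state s
            cands = [(lp + trans[sp][s] + emit[s][o], path + [s])
                     for sp in range(n_states)
                     for lp, path in survivors[sp]]
            new.append(heapq.nlargest(K, cands))   # keep K survivors, not 1
        survivors = new
    final = [hyp for per_state in survivors for hyp in per_state]
    return heapq.nlargest(K, final)

# Invented 2-state example: the search returns the K best state sequences.
top = generalized_viterbi(init=[0.0, -1.0],
                          trans=[[-0.1, -2.0], [-2.0, -0.1]],
                          emit=[[-0.1, -2.0], [-2.0, -0.1]],
                          obs=[0, 0, 1], K=2)
```

With K = 1 the procedure reduces to the classical Viterbi algorithm; larger K trades memory and computation for a richer set of surviving alternatives.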

In this study, the GVA is first used to determine a set of symbolic candidates corresponding to the \(K_\mathrm{symb.}\) most-likely sequences of symbolic characteristics, in order to enrich the further selection of speech units. For each symbolic candidate, a GVA is then used to determine the \(K_\mathrm{acou.}\) most-likely sequences of speech units under the joint constraint of segmental characteristics (phone model) and prosody (syllable and phrase models). Finally, the optimal sequence of speech units is determined so as to maximize the cumulative symbolic/acoustic likelihood.
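The composition of the two GVA passes described above can be summarised in a short sketch. The function names and data shapes are hypothetical; each search stage is assumed to return (log-likelihood, sequence) pairs, as a GVA-style search would.

```python
# Sketch of the two-stage selection (names hypothetical): a first search
# returns K_symb symbolic candidates with their log-likelihoods; for each
# candidate, a second search returns K_acou unit sequences; the winner
# maximises the cumulative symbolic + acoustic log-likelihood.

def select_best(symbolic_candidates, acoustic_search):
    """symbolic_candidates: list of (ll_symb, symb_seq);
    acoustic_search(symb_seq) -> list of (ll_acou, unit_seq)."""
    best = None
    for ll_symb, symb_seq in symbolic_candidates:
        for ll_acou, unit_seq in acoustic_search(symb_seq):
            score = ll_symb + ll_acou          # cumulative log-likelihood
            if best is None or score > best[0]:
                best = (score, symb_seq, unit_seq)
    return best

# Invented example: a less likely symbolic candidate may still win once
# its (better) acoustic realisation is taken into account.
winner = select_best([(-1.0, "A"), (-2.0, "B")],
                     lambda s: {"A": [(-5.0, "a1")], "B": [(-3.0, "b1")]}[s])
```

This also shows why keeping symbolic alternatives matters: the most likely symbolic sequence alone ("A" above) would have led to a worse overall path.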

4 Experiments

Table 13.1 Description of TTS systems used for the evaluation. Parentheses denote the optional use of symbolic alternatives in the TTS system

Objective and subjective experiments were conducted to assess the use of speech alternatives in speech synthesis, in comparison with a baseline system (no explicit modelling of speech prosody, no use of speech alternatives) and a conventional system (explicit modelling of speech prosody, no use of speech alternatives) (Table 13.1). In addition, symbolic alternatives were optionally used with each compared method so as to assess the relevance of symbolic and acoustic alternatives separately.

4.1 Speech Material

The speech material used for the experiment is a 5-h French storytelling database interpreted by a professional actor, designed for expressive speech synthesis. The speech database comes with the following linguistic processing: orthographic transcription; surface syntactic parsing (POS and word class); manual speech segmentation into phonemes and syllables; and automatic labelling/segmentation of prosodic events/units (cf. Obin et al. 2010b for more details).

4.2 Objective Experiment

An objective experiment was conducted to assess the relative contribution of speech prosody and of symbolic/acoustic alternatives to the overall quality of the TTS system. In particular, a specific focus is placed on the use of symbolic/acoustic alternatives.

4.2.1 Procedure

The objective experiment was conducted on 173 sentences of the fairy tale “Le Petit Poucet” (“Tom Thumb”).

For this purpose, a cumulative log-likelihood is defined as a weighted combination of the partial log-likelihoods (symbolic, acoustic). First, each partial log-likelihood is averaged over the utterance to be synthesized, so as to normalize for the variable number of observations used in its computation (e.g. phonemes, syllables, and prosodic phrases). Then, the log-likelihoods are normalized to ensure a comparable contribution of each partial log-likelihood during speech synthesis. Finally, the cumulative log-likelihood of a synthesized speech utterance is defined as follows:

$$LL = w_{\text{symbolic}}\, LL_{\text{symbolic}} + w_{\text{acoustic}}\, LL_{\text{acoustic}}$$
(13.4)

where \(LL_{\text{symbolic}}\) and \(LL_{\text{acoustic}}\) denote the partial log-likelihoods associated with the sequences of symbolic and acoustic characteristics, and \(w_{\text{symbolic}}\) and \(w_{\text{acoustic}}\) the corresponding weights.

Finally, the optimal sequence of speech units is determined so as to maximize the cumulative log-likelihood of the symbolic/acoustic characteristics. In this study, the weights were heuristically chosen as \(w_{\text{symbolic}}=1\), \(w_{\text{phone}} = 1\), \(w_{\text{syllable}} = 5\), and \(w_{\text{phrase}} = 1\), the acoustic term decomposing into phone, syllable, and phrase contributions; 10 alternatives were considered for the symbolic characteristics, and 50 alternatives for the selection of speech units.
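The scoring of Eq. (13.4) can be sketched numerically. The weights below are those stated in the study; the per-unit log-likelihood values are invented for illustration, and the per-term averaging follows the normalization described above.

```python
import numpy as np

# Sketch of the cumulative score of Eq. (13.4), with the acoustic term
# split into phone, syllable, and phrase contributions; weights as in
# the study (w_symbolic = 1, w_phone = 1, w_syllable = 5, w_phrase = 1).

def cumulative_ll(ll_symb, ll_phone, ll_syll, ll_phrase,
                  w=(1.0, 1.0, 5.0, 1.0)):
    """Each argument is a list of per-unit log-likelihoods. Each partial
    term is averaged over its own units before weighting, so candidates
    with different numbers of phonemes/syllables/phrases stay comparable."""
    parts = [np.mean(x) for x in (ll_symb, ll_phone, ll_syll, ll_phrase)]
    return float(np.dot(w, parts))

# Invented example: with w_syllable = 5, a candidate with better syllable
# prosody wins despite a slightly worse segmental (phone) likelihood.
a = cumulative_ll([-1.0], [-2.0, -2.0], [-1.0], [-0.5])
b = cumulative_ll([-1.0], [-1.5, -1.5], [-2.0], [-0.5])
```

The weighting makes the trade-off explicit: the syllable term dominates the selection, which is consistent with the emphasis on prosodic contours in the study.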

4.2.2 Discussion

The cumulative likelihoods obtained for the compared methods are presented in Fig. 13.6, with and without the use of symbolic alternatives. The proposed method (modelling of prosody, use of acoustic alternatives) moderately but significantly outperforms the conventional method (modelling of prosody, no acoustic alternatives), and dramatically outperforms the baseline method. In addition, the use of symbolic alternatives leads to a significant improvement regardless of the method considered. Finally, the best synthesis is obtained by combining symbolic/acoustic alternatives with the modelling of speech prosody.

Fig. 13.6 Cumulative negative log-likelihood (mean and 95 % confidence interval) obtained for the compared TTS systems, without (left) and with (right) the use of symbolic alternatives

For further investigation, the partial likelihoods obtained for the compared methods are presented in Fig. 13.7, with and without the use of symbolic alternatives. Not surprisingly, the modelling of speech prosody (syllable/phrase) successfully constrains the selection of speech units with adequate prosody, although this improvement comes with a slight degradation of the segmental characteristics (phone). The use of acoustic alternatives leads to improved speech prosody (significant over the syllable, not significant over the phrase), at the cost of a slight, nonsignificant degradation of the segmental characteristics. This suggests that the phrase modelling (as described by Veaux and Rodet 2011) has partially failed to capture the relevant variations, and that this model remains to be improved. Finally, symbolic alternatives are advantageously used to improve the prosody of the selected speech units, without a significant change in the segmental characteristics.

Fig. 13.7 Partial negative log-likelihoods (mean and 95 % confidence intervals) for the compared methods, with and without the use of symbolic alternatives

4.3 Subjective Experiment

A subjective experiment was conducted to compare the quality of the baseline, conventional, and proposed speech synthesis systems.

4.3.1 Procedure

For this purpose, 11 sentences were randomly selected from the fairy tale and used to synthesize speech utterances with each of the considered systems. Fifteen native French speakers participated in the experiment, which was conducted as a crowdsourcing study via social networks. Pairs of synthesized speech utterances were randomly presented to the participants, who were asked to assign a preference score according to the naturalness of the speech utterances on the comparison mean opinion score (CMOS) scale. Participants were encouraged to use headphones.

4.3.2 Discussion

Figure 13.8 presents the CMOS scores obtained for the compared methods. The proposed method is substantially preferred to the other methods, which indicates that the use of symbolic/acoustic alternatives leads to a qualitative improvement of the synthesized speech over all other systems. The conventional method is in turn fairly preferred to the baseline method, which confirms that the integration of speech prosody also improves the quality of speech synthesis over the baseline system (cf. the observation partially reported in Veaux and Rodet 2011).

Fig. 13.8 CMOS (mean and 95 % confidence interval) obtained for the compared methods

5 Conclusion

In this chapter, the use of speech alternatives/variants in the unit-selection speech synthesis has been introduced. Objective and subjective experiments support the evidence that the use of speech alternatives qualitatively improves speech synthesis over conventional speech synthesis systems. The proposed method can easily be extended to HMM-based speech synthesis. In further studies, the use of speech alternatives will be integrated into a joint modelling of symbolic/acoustic characteristics so as to improve the consistency of the selected symbolic/acoustic sequence of speech units. Moreover, speech alternatives will further be used to vary the speech synthesis for a given text.