
1 Introduction

In the past few years, multiple concurrent approaches to speech synthesis have been at the centre of interest, ranging from traditional unit selection [3], through statistical parametric synthesis (SPS, [32]), to deep neural networks [17]. Each of the approaches, however, has its advantages and drawbacks – unit selection suffers from artefacts, “raw” SPS synthesis from parameter flattening and vocoder imperfections, and DNNs require powerful hardware to run on. Therefore, there is research interest in hybrid speech synthesis, which tries to combine the advantages of both: HMM- or DNN-based acoustic parameter generation drives the unit selection module, either using only natural signals to generate the speech [16, 19, 31], or combining real signals with signals built by SPS in cases where no suitable speech segments are found for the generated parameter trajectories [18, 20, 29].

The present paper describes our attempt to employ the hybrid speech synthesis approach within the Czech TTS system ARTIC [25]. As the system contains both unit selection and SPS modules, it was a natural choice to join them together, especially since a hybrid approach has been reported to achieve higher naturalness of the generated speech [1, 16, 18, 20, 31] – albeit, except for [31], on smaller speech corpora than we use. Since we have quite large speech corpora recorded by professional (or semi-professional) speakers [25], we do not mix real and artificial speech signals in this paper, due to our concerns about their different quality. Instead, we drive the unit selection exclusively by the SPS-generated trajectories, while only real speech units are concatenated.

2 Hybrid Speech Synthesis

The first experiments with hybrid synthesis started more than a decade ago with frame-based units [13], and were extended in various ways e.g. in [1, 7, 14, 18, 20, 29].

Fig. 1. Scheme of the hybrid speech synthesis system. The gray parts of the runtime are not used in this paper.

The most common approach is to replace the target cost [22] component of unit selection with a measure of similarity (or closeness) between the target features generated for the input text by an HMM model (or, more recently, a DNN) and the same features extracted from the unit selection speech corpus from which the units are taken. We have also occasionally tried to employ hybrid speech synthesis in recent years, based on some of the papers cited, but until recently our experience was that the closer we stayed to the pure unit selection path, the better the output quality was.

The key issue was that we compared the generated target parameters to the parameters extracted from natural unit candidates, although there is a significant mismatch between the two: the target trajectories are relatively smooth, while the candidate ones vary, as was illustrated for \({\mathrm {F}}_0\) in [24] and for a cepstral coefficient in [29]. However, one of the fundamental operations of statistical modelling is the averaging of model parameters during training and the generation of novel values during synthesis – we can liken this to the interpolation and extrapolation of the values found in the training data. Therefore, in the present paper, we re-generate the whole speech corpus with the same model as used to predict the target parameters, and each unit candidate in the acoustic unit inventory is tied to the parameters generated by the model for the frames belonging to the candidate – see Fig. 1. By this re-assignment, the behaviour of the parameters used to drive the selection is unified (the parameters for candidates behave as smoothly as the parameters for the target), and thus the “closeness” of the target parameters, as represented by a distance measure (see Sect. 2.2), makes much more sense. Also, it still does not violate one of the unit selection assumptions – the target cost \(=0\), selecting the natural unit sequence, when a whole phrase from the corpus is required to be synthesized.

2.1 The Generation of Target Parameters

In the SPS method, the speech signal is represented as a sequence of parameters extracted at a fixed 5 ms frame rate. In our case, these are 40 mel-generalized cepstral coefficients (MGC) extracted from STRAIGHT spectra [11], the logarithm of the fundamental frequency extracted from the glottal signal [12] by PRAAT [2], and 21 band-aperiodicity coefficients (BAP) derived from STRAIGHT aperiodicity spectra. Thus, each phrase from the source speech corpus is represented by the p-dimensional vector \({{{\mathrm {F}}_0}}^\star = [f(1),f(2),\dots ,f(p)]\), the \(p\times 40\) matrix \(\hbox {MGC}^\star = \big [\mathbf {f}(1),\mathbf {f}(2),\dots ,\mathbf {f}(p)\big ]\) and the \(p\times 21\) matrix \(\hbox {BAP}^\star \) (naturally, p here depends on the length of the phrase). The corpus represents the common speech base, also used as the source of the speech units to be concatenated, either by the baseline unit selection or by the hybrid version described here. Then, the \(\star \) parameters are used to train 5-state hidden semi-Markov models, which involves a few repetitions of 3 main stages – initialization and training of phone models (disregarding the context), training of full-context models, and model clustering [4, 5].
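The per-phrase parameter layout described above can be sketched as follows. This is a minimal illustration, not the actual ARTIC code; the `extract_frame` callback and all names are our assumptions.

```python
FRAME_SHIFT_MS = 5   # fixed 5 ms frame rate
MGC_ORDER = 40       # mel-generalized cepstral coefficients per frame
BAP_ORDER = 21       # band-aperiodicity coefficients per frame

def phrase_parameters(duration_ms, extract_frame):
    """Build the log-F0 vector and the p x 40 MGC / p x 21 BAP matrices
    for one phrase.

    `extract_frame(t_ms)` is an assumed callback returning a
    (log_f0, mgc, bap) triple for the frame starting at `t_ms`.
    """
    p = duration_ms // FRAME_SHIFT_MS          # number of frames in the phrase
    f0, mgc, bap = [], [], []
    for n in range(p):
        lf0, m, b = extract_frame(n * FRAME_SHIFT_MS)
        assert len(m) == MGC_ORDER and len(b) == BAP_ORDER
        f0.append(lf0)
        mgc.append(list(m))
        bap.append(list(b))
    return f0, mgc, bap
```

The number of rows p thus grows with phrase duration, which is why the paper later normalizes costs by unit length.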

Once the models are trained, for a given sequence of units required to be synthesized, the streams of \({{{\mathrm {F}}_0}}, \hbox {MGC}\) and \(\hbox {BAP}\) parameters are generated by a parameter generation algorithm considering global variance [30], and are passed to the vocoder, which uses them to build the output speech.

In the case of hybrid synthesis, though, instead of being passed to a vocoder, the stream is passed to the unit selection module, which then selects “close enough” real speech chunks (candidates) to be concatenated; see Sect. 2.2. Note that while all the streams must be used by the vocoder, only the \({\mathrm {F}}_0\) and cepstral coefficients were used to drive the unit selection; the aperiodicity was omitted, since it is not expected to bring any significant information to select by. Although it could be used to generate the speech of a unit as a replacement for a raw unit signal (a.k.a. “multiform” synthesis) when no candidate with a “close enough” match is found [18, 20, 21, 29], we did not pursue this option here.

2.2 Unit Selection

In the most common unit selection scheme, a number of metrics are used to define target and concatenation costs. While the former measures how well the feature values of a speech unit from the acoustic unit inventory match the prescribed (target) values of the features (what is meant to be expressed), the latter attempts to evaluate how smoothly units will be perceived when joined together (how it will sound).

Target Cost. Having the trajectories of parameters \({{{\mathrm {F}}_0}} = [f(1),f(2),\dots ,f(r)]\) and \(\hbox {MGC} = \big [\mathbf {f}(1),\mathbf {f}(2),\dots ,\mathbf {f}(r)\big ]\) generated by the SPS module for a given phrase to synthesize, we used the scripting interface of ARTIC TTS [25] to modify the unit selection in such a way that the use of symbolic features [15, 26], denoted in [23] as independent feature formulation (IFF), was replaced by a measure of mismatch between the parameters from the streams generated by the HMM model.

Contrary to SPS, where the parameters are passed through a vocoder as they are, the unit selection works with unit candidates (in [19] called “tiles”). Thus, we define \(t_j\) as a target unit within a synthesized phrase \(\mathbf {T} = t_1, t_2,\dots \). Each target unit is tied to \(N(t_j)\)-dimensional vector \({{{\mathrm {F}}_0}}(t_j) = [f(1), f(2), \dots , f\big (N(t_j)\big )]\) and \(N(t_j) \times 40\) matrix \(\hbox {MGC}(t_j) = \big [\mathbf {f}(1), \mathbf {f}(2), \dots , \mathbf {f}\big (N(t_j)\big )\big ]\) corresponding to the generated parameters for that unit in the phrase being currently synthesized; \(r = \sum _{\forall j} N(t_j)\).

Similarly, we define a unit candidate \(c_i\) as a constituent of a phrase in the speech corpus from which the unit candidates are selected. These unit candidates were tied to \({{{\mathrm {F}}_0}}(c_i) = [f(1), f(2),\dots ,f\big (N(c_i)\big )]\) and \(\hbox {MGC}(c_i) = \big [\mathbf {f}(1), \mathbf {f}(2),\dots ,\mathbf {f}\big (N(c_i)\big )\big ]\) in the same way as the target units, except that the assignment was carried out during the speech corpus re-assignment.

The target cost in the hybrid synthesis experiment described here was computed using various schemes (see Eqs. (2), (3), (4) and (5)), as we first wanted to obtain a notion of the mutual interaction between target and concatenation costs and of the relevance of the individual features in the target cost. All target cost definitions, however, used both the \({\mathrm {F}}_0\) and MGC features. Contrary to [19], these parameters were not normalized in this experiment. Instead, ad-hoc weights were assigned to them in order to balance the importance of the individual features, as described in Sect. 3.

Regarding the duration of units, it is expressed by the number of parameter vectors \(N(t_j), N(c_i)\). In [19], the authors always choose candidates with the same duration as the target. However, our initial experiment with such a setting showed worse speech quality. Therefore, we allowed \(N(t_j) \ne N(c_i), \forall i,j\), with a small penalty added to the target cost when this inequality occurred. The number of parameters the cost was computed from was set to \(N(t_j,c_i) = \min \big (N(t_j), N(c_i)\big )\), with the indexes aligned to the centres of the parameter vectors, i.e. \(N(t_j,c_i)/2 = \{N(t_j)/2, N(c_i)/2\}\).
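The centre alignment of a target and a candidate of unequal lengths can be sketched as follows. This is our reading of the rule above; the exact rounding for odd length differences is an assumption.

```python
def aligned_frame_pairs(n_target, n_cand):
    """Return the (n_t, n_c) index couples over which the sub-costs are
    summed, with the shorter sequence centred inside the longer one.

    `n_target` = N(t_j), `n_cand` = N(c_i); the number of couples is
    N(t_j, c_i) = min(N(t_j), N(c_i)).
    """
    n = min(n_target, n_cand)        # N(t_j, c_i)
    off_t = (n_target - n) // 2      # centre-alignment offset on the target
    off_c = (n_cand - n) // 2        # centre-alignment offset on the candidate
    return [(off_t + k, off_c + k) for k in range(n)]
```

For equal lengths this degenerates to the identity pairing, so a natural candidate compared against its own target is matched frame by frame.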

Concatenation Cost. The handling of the concatenation cost CC was the same as in the “raw” unit selection [15], i.e. the absolute difference of “static” \({\mathrm {F}}_0\) (as described in [27]) and of energy, and the Euclidean distance of 12 MFCC coefficients.
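A minimal sketch of this concatenation cost follows; the equal weighting of the three terms is our simplifying assumption, as [15] may combine them differently.

```python
import math

def concatenation_cost(f0_a, f0_b, energy_a, energy_b, mfcc_a, mfcc_b):
    """Concatenation cost at the join of two units: absolute difference of
    "static" F0 and of energy, plus the Euclidean distance of the 12 MFCC
    coefficients at the boundary frames."""
    d_f0 = abs(f0_a - f0_b)
    d_energy = abs(energy_a - energy_b)
    d_mfcc = math.sqrt(sum((a - b) ** 2 for a, b in zip(mfcc_a, mfcc_b)))
    return d_f0 + d_energy + d_mfcc
```

Two identical boundary frames thus yield a zero cost, matching the expectation that joining a naturally contiguous unit pair is free.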

3 Experiments

Let us emphasize that for each sequence of unit candidates \(\mathbf {C}_p = c_1, c_2,\dots , c_{N(p)}\) from the p-th phrase in the speech corpus, the sequence of units \(\mathbf {T}_p = t_1, t_2,\dots , t_{N(p)}\) was generated and \({{{\mathrm {F}}_0}}(c_j) {\mathop {=}\limits ^{def}} {{{\mathrm {F}}_0}}(t_j), \hbox {MGC}(c_j) {\mathop {=}\limits ^{def}} \hbox {MGC}(t_j), \forall j=1,2,\dots , N(p)\) was assigned. In this way, it is ensured that the continuous (i.e. natural) sequence of unit candidates is chosen from a phrase from the corpus when that phrase is to be synthesized.
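The re-assignment described above can be sketched as follows. The dictionary layout and names are our assumptions, not the ARTIC data structures.

```python
def assign_generated_parameters(candidates, targets):
    """Tie each unit candidate of a corpus phrase to the parameters the SPS
    model generated for the corresponding target unit of the same phrase,
    i.e. F0(c_j) := F0(t_j) and MGC(c_j) := MGC(t_j) for all j."""
    for cand, tgt in zip(candidates, targets):
        cand["f0"] = list(tgt["f0"])
        cand["mgc"] = [list(v) for v in tgt["mgc"]]
    return candidates

def f0_mismatch(cand, tgt):
    """Summed |log-F0| difference between a candidate and a target unit."""
    return sum(abs(a - b) for a, b in zip(cand["f0"], tgt["f0"]))
```

After the assignment, the mismatch is zero for the natural candidate sequence of a phrase, which is what guarantees a zero target cost when a corpus phrase is re-synthesized.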

The generic target cost definition is the weighted sum of \({\mathrm {F}}_0\) and MGC sub-costs with the penalty value \(\wp > 0\) set in case of \(N(t_j) \ne N(c_i)\):

$$\begin{aligned} TC_x(t_j,c_i) = w_x^\text {TC} \cdot \frac{w_x^{{{\mathrm {F}}_0}} \cdot TC_x^{{{\mathrm {F}}_0}}(t_j,c_i) + w_x^\text {MGC} \cdot TC_x^\text {MGC}(t_j,c_i) + 5 \cdot \wp (t_j,c_i)}{w_x^{{{\mathrm {F}}_0}} + w_x^\text {MGC} + 5} \end{aligned}$$
(1)

where \(c_i\) is the i-th candidate for the j-th unit \(u_j = \{c_1, c_2, \dots \}\) in the synthesized sequence \(\{t_1,u_1\}, \{t_2,u_2\}, \dots , \{t_J,u_J\}\). The x here denotes the experiment number.
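Eq. (1) can be written out as a short function; the fixed weight 5 of the penalty term is taken directly from the equation, while the function and argument names are ours.

```python
PENALTY_WEIGHT = 5.0   # fixed weight of the duration penalty in Eq. (1)

def target_cost(tc_f0, tc_mgc, w_tc, w_f0, w_mgc, penalty):
    """Generic target cost TC_x(t_j, c_i) of Eq. (1): a weighted mean of the
    F0 and MGC sub-costs plus the duration penalty term.

    `penalty` is the value of p(t_j, c_i): positive when
    N(t_j) != N(c_i), and 0 otherwise.
    """
    numerator = w_f0 * tc_f0 + w_mgc * tc_mgc + PENALTY_WEIGHT * penalty
    return w_tc * numerator / (w_f0 + w_mgc + PENALTY_WEIGHT)
```

With all weights set to 1 and a zero penalty (experiment 1), the cost is simply the sum of the two sub-costs divided by 7.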

The very first experiment was designed to simply replace the “symbolic”-features-driven target cost (the baseline) with the target cost following the behaviour of parameters prescribed by the SPS generator.

$$\begin{aligned} \begin{aligned} w_1^\text {TC}&= 1.0 \\ w_1^{{{\mathrm {F}}_0}}&= 1.0 \\ TC_1^{{{\mathrm {F}}_0}}(t_j,c_i)&= \sum _{\{n^t,n^c\}}^{N(t_j,c_i)} \Big | {{{\mathrm {F}}_0}}(t_j,n^t) - {{{\mathrm {F}}_0}}(c_i,n^c) \Big | \\ w_1^\text {MGC}&= 1.0 \\ TC_1^\text {MGC}(t_j,c_i)&= \sum _{\{n^t,n^c\}}^{N(t_j,c_i)} \sqrt{\sum _{\begin{array}{c} \mathbf {f}^t = \hbox {MGC}(t_j,n^t)\\ \mathbf {f}^c = \hbox {MGC}(c_i,n^c) \end{array}} (f^t - f^c)^2} \end{aligned} \end{aligned}$$
(2)

Thus, the \({\mathrm {F}}_0\) sub-cost was the sum of \(\log \) \({\mathrm {F}}_0\) differences and the MGC sub-cost was the sum of Euclidean distances of cepstral vectors, both through the \(\{n^t,n^c\}\) couples. Similar to [15], we distinguished between voiced and unvoiced speech units (or their parts), since the SPS module fills \({{{\mathrm {F}}_0}}(t_j,n) = -\infty \) when n corresponds to an unvoiced frame. But contrary to the baseline unit selection, where voicedness is checked only at unit beginnings and ends, we set the value 10 for each \(\{n^t,n^c\}\) couple with a voiced/unvoiced mismatch between \({{{\mathrm {F}}_0}}(t_j,n^t)\) and \({{{\mathrm {F}}_0}}(c_i,n^c)\).
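The two sub-costs of Eq. (2), including the voiced/unvoiced mismatch handling, can be sketched as follows. The treatment of frames where both sides are unvoiced (no contribution) is our assumption.

```python
import math

UNVOICED = float("-inf")       # the SPS module marks unvoiced frames this way
VOICING_MISMATCH_COST = 10.0   # fixed value for a voiced/unvoiced mismatch

def tc1_f0(f0_target, f0_cand, pairs):
    """F0 sub-cost of Eq. (2): sum of |log-F0| differences over the aligned
    frame couples `pairs` = [(n_t, n_c), ...], with a fixed cost of 10 for
    each voiced/unvoiced mismatch."""
    cost = 0.0
    for nt, nc in pairs:
        ft, fc = f0_target[nt], f0_cand[nc]
        if (ft == UNVOICED) != (fc == UNVOICED):
            cost += VOICING_MISMATCH_COST   # exactly one frame is unvoiced
        elif ft != UNVOICED:
            cost += abs(ft - fc)            # both voiced: log-F0 difference
        # both unvoiced: assumed to contribute nothing
    return cost

def tc1_mgc(mgc_target, mgc_cand, pairs):
    """MGC sub-cost of Eq. (2): sum of Euclidean distances of the aligned
    cepstral vectors."""
    return sum(
        math.sqrt(sum((a - b) ** 2
                      for a, b in zip(mgc_target[nt], mgc_cand[nc])))
        for nt, nc in pairs
    )
```

These two values then enter the weighted combination of Eq. (1) with all experiment-1 weights equal to 1.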

A rough look at the CC and TC values in the first experiment showed that the TC value was about \(10\times \) higher than the CC value. Therefore, for the second experiment, the weight \(w_2^{\text {TC}}\) was adjusted to narrow the gap between the costs:

$$\begin{aligned} w_2^{\text {TC}} = 0.1 \end{aligned}$$
(3)

In the following experiment, \(x = 3\), we tried to put greater emphasis on the \({\mathrm {F}}_0\) contour (relative to the MGC parameters). The same rough look at the data as in the previous experiment suggested that the \(TC^{{{\mathrm {F}}_0}}\) cost values were about \(10{\times }\) lower than the values of \(TC^\text {MGC}\). Therefore, the weight of the \({\mathrm {F}}_0\) match was increased:

$$\begin{aligned} w_3^{\text {TC}} = 0.1 \qquad w_3^{{{\mathrm {F}}_0}} = 50.0 \end{aligned}$$
(4)

The overall TC value, however, was kept in a range similar to the CC, as in the previous experiment.

It can be expected that the cost value depends on the number of parameters \(N(t_j,c_i)\) describing a unit, i.e. longer units may achieve higher cost values. To minimize this effect, the costs were normalized by unit lengths as:

$$\begin{aligned} \begin{aligned} TC_4^{{{\mathrm {F}}_0}}(t_j,c_i)&= \frac{TC_1^{{{\mathrm {F}}_0}}(t_j,c_i)}{N(t_j,c_i)} \\ TC_4^\text {MGC}(t_j,c_i)&= \frac{TC_1^\text {MGC}(t_j,c_i)}{N(t_j,c_i)} \\ w_4^{{{\mathrm {F}}_0}}&= 50.0 \end{aligned} \end{aligned}$$
(5)

We still emphasize the \({\mathrm {F}}_0\) feature, but we do not explicitly lower the whole target cost value, since it has been lowered implicitly by the length normalization (thus, the TC values are already in a range comparable to the CC values).

Let us also note that we have tried to use normalized \({\mathrm {F}}_0\) and MGC values in the costs \(TC_{1,\dots ,4}(t_j,c_i)\) with the weights adjusted appropriately. The values were z-score normalized per phrase, both when re-synthesizing the speech corpus and when synthesizing a text. However, the quality of speech was noticeably lower than the quality of the speech evaluated in this paper. It remains to be answered whether per-corpus normalization would improve this situation.
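The per-phrase z-score normalization we tried can be sketched for a one-dimensional track such as log-F0; the guard against constant tracks is our addition.

```python
import math

def zscore_per_phrase(values):
    """Per-phrase z-score normalization: shift/scale one phrase's parameter
    track to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance) or 1.0   # guard: leave constant tracks unscaled
    return [(v - mean) / std for v in values]
```

Applied both to the re-synthesized corpus and to the synthesized text, this makes the weights dimension-free, at the cost of discarding the absolute levels of the parameters within each phrase.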

4 Results

First, let us look at the behaviour of the individual parameter values and their mutual relations. In Sect. 3, the design of the individual costs/weights was based on a rough analysis of the values. To have a more precise view of the relations among the cost values, we have additionally analysed the 75,922 values from 1,895 synthesized phrases, with the averages summarized in Table 1 and a randomly selected subset plotted in Fig. 2. Let us note that the cost values differ even when computed by the same equation (e.g. \(TC_x^{{{\mathrm {F}}_0}}\) for \(x = 1,2,3\)), since the values are taken from the sequences of selected units, which naturally differ across the experiments as the overall cost computation changes.

Table 1. The means of the 75,922 individual costs collected from 1,895 synthesized phrases. The ratios were computed as \(w_x^\text {TC} \cdot \frac{TC_x}{CC}\) and \(w_x^{{{\mathrm {F}}_0}} \cdot \frac{TC_x^{{{\mathrm {F}}_0}}}{TC_x^\text {MGC}}\).
Fig. 2. Plot of 100 randomly selected values from the total set of 75,922 costs. All the values include their applied weights.

Fig. 3. Detailed plot of the MUSHRA test results for all 4 professional voices.

To evaluate the quality of the hybrid synthesis methods, we carried out informal listening tests. Due to the larger number of versions, we decided to use MUSHRA-like tests [8], though without anchor and reference prompts. In the test, 7 speech technology experts were instructed to evaluate 15 shorter prompts for 4 of our professional unit selection voices [25], each prompt containing 5 randomly ordered versions; the evaluation used a scale from 0 to 100 points (100 to be assigned to a naturally sounding prompt, 0 to a wreck). One of the versions was generated by the baseline unit selection [15] and four by the hybrid synthesizer with the target cost computed as described in Sect. 2.2. The test was reported to be rather demanding, with some of the versions sounding fairly similar, as illustrated by Fig. 3 and Table 2.

Table 2. Results of MUSHRA listening test for all target computation schemes \(x = 1,\dots ,4\) and the baseline, presented for the speakers independently as well as collected to the overall results.

5 Conclusions

It can be seen from these very first experiments that hybrid synthesis is able to achieve speech quality comparable to the raw unit selection using symbolic features (IFF) in the target cost. And for the very first time, this has been evaluated on the Czech language. Although it has been reported that hybrid synthesis should be able to outperform unit selection, this was not clearly shown in this paper, where large speech corpora have been used. On the other hand, we must emphasize that there is still room for improvement, and thus the method may yet show its expected potential.

Regarding future work, we aim at further experiments with the target cost computation, for example using z-score normalized coefficients, or a computation more aligned with [19] or with other approaches that reported quality improvements, e.g. [16]. Also, stressing both methods with a reduced speech unit inventory [6], or focusing on known unit selection failures [9, 10], could provide valuable insights. In addition to the target cost, the authors in [19, 31] also adjusted the computation of the concatenation cost, using cross-correlation to find the optimal join point for each unit join. This, however, can help unit selection in general, if proved beneficial.

Naturally, the hardware requirements of this method are higher than those of “classic” unit selection [28], due to both the SPS parameter generation for the target and the more complex target cost computation. However, if the method is able to outperform unit selection quality, this is a price one may be ready to pay, especially when compared with DNN-based approaches.