1 Introduction

In human speech, the fundamental frequency value varies within a sentence. The \(F_0\) contour is, in general, closely related to the position of stressed syllables and to the phrasing of the sentence. The \(F_0\) movements (rises/falls), especially in the phrase-final position, carry a communication function in the given language – a mismatch in these movements can cause the meaning of the sentence to be misunderstood [15, 24]. Therefore, a correct prosodic description of speech corpora is evidently one of the crucial issues in text-to-speech synthesis.

In general, in the unit selection method, join and target costs are computed to ensure that the optimal sequence of units is selected. These costs control the smoothness of concatenated neighbouring units, as well as a unit's suitability for the required position in the synthesized sentence. In our TTS system ARTIC [11, 20], besides concatenation smoothness, symbolic prosody features called prosodemes (Sect. 3, [17, 18]) are used in the target cost to ensure that the synthesized speech keeps the required communication function (i.e. listeners are able to distinguish declarative sentences from questions) [10]. However, due to some inaccuracies in the formal prosodic description of speech data, speech units are sometimes used in a different context than the one in which they were pronounced by the speaker and to which they belong. This may be manifested in the synthetic speech, e.g., by an unnaturally dynamic melody or by inappropriate stress perception.

The presented paper focuses on the symbolic prosodic labels in our speech corpora and, using Legendre polynomials (Sect. 2), proposes a two-phase algorithm for their correction. Initial experiments were carried out in [14] and showed that a description of an \(F_0\) contour based on Legendre polynomials is sufficient for classification-based approaches.

2 Legendre Polynomials

To describe the \(F_0\) contours, the authors used Legendre polynomials [9] – in contrast, e.g., to the Gaussian mixture models used in [7], or the HMM models used in [5, 6] for the correction of wrongly labelled formal prosodic structures in speech corpora. These polynomials are frequently encountered in physics and other technical fields.

Legendre polynomials are defined by Eq. 1,

$$\begin{aligned} L_n(x) = 2^n \sum _{k=0}^{n} x^k \binom{n}{k} \binom{\frac{n+k-1}{2}}{n}, \end{aligned}$$
(1)

and they form an orthogonal (i.e., uncorrelated) basis suitable for modelling \(F_0\) contours [4, 23]. An \(F_0\) contour is described by the coefficients of a linear combination of these polynomials. Because of the orthogonality, the coefficients can be estimated using cross-correlation at a time lag of 0 (i.e., the mutual energy of the \(F_0\) contour and the Legendre polynomial).

The first four polynomials \(L_0(x)\), \(L_1(x)\), \(L_2(x)\) and \(L_3(x)\) (see Fig. 1a) lend themselves to a linguistic interpretation: \(L_0(x)\) corresponds to the mean value of the pitch, \(L_1(x)\) to a rise or fall depending on the positive or negative sign of the coefficient (the slope is determined by its absolute value), \(L_2(x)\) to a peak or valley, and \(L_3(x)\) to a wave shape of the \(F_0\) contour.

For the purposes of the presented experiment, the authors used the mPraat toolbox for Matlab [1]; for each \(F_0\) contour, the frequency values were converted to the semitone scale, the contour was interpolated at 1,000 equidistant points, and the first four Legendre coefficients were estimated (for example, see Fig. 1b, where the coefficients are 10.7407 (mean value), \(-2.6880\) (falling slope), \(-1.5522\) (valley shape) and 0.1685 (only a slight wave curvature)).
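For illustration, the coefficient estimation can be sketched as follows. This is a minimal numpy re-implementation of the steps just described, not the mPraat code used by the authors; the function name and the 100 Hz semitone reference are assumptions.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def legendre_coeffs(f0_hz, n_coeffs=4, n_points=1000, ref_hz=100.0):
    """Estimate the first n_coeffs Legendre coefficients of an F0 contour.

    Follows the pipeline from the paper (semitone conversion, interpolation
    at equidistant points, fit on the orthogonal Legendre basis); the
    100 Hz semitone reference is an assumption, not stated in the paper.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    semitones = 12.0 * np.log2(f0_hz / ref_hz)
    # Resample the contour at n_points equidistant points on [-1, 1],
    # the interval on which the Legendre polynomials are orthogonal.
    x = np.linspace(-1.0, 1.0, n_points)
    contour = np.interp(x, np.linspace(-1.0, 1.0, len(semitones)), semitones)
    # Least-squares fit of a Legendre series of degree n_coeffs - 1;
    # by orthogonality this is equivalent to projecting the contour
    # onto each basis polynomial separately.
    return npleg.legfit(x, contour, n_coeffs - 1)
```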

Fig. 1. Setup of the experiment.

3 Symbolic Prosody Features in Speech Corpora

The authors of [17, 18] introduced a new formal prosodic model to be used in text-to-speech systems to control the appropriate usage of intonation schemes within the synthesized sentence; the original idea was based on the classical Czech phonetic view described in [15]. This grammar parses the given text sentence into a derivation tree, and each prosodic word (PW, i.e. a group of words with only one word stress) is assigned an abstract prosodic unit, a prosodeme, marked as \(P_X\). The former grammar focused mainly on differentiating phrase-final PWs from the other PWs in the sentence, since phrase-final words are, in general, characterized by a distinct increase/decrease of the \(F_0\) and thus carry a certain communication function. However, as shown in [8], the phrase-initial PWs also differ from the following words, especially by an increase of the \(F_0\) [24]. Recently, based on these observations, the grammar was extended to describe the phrase-initial PWs by a new prosodeme type (\(P_{0.1}\), see below).

In our TTS ARTIC [11, 20], we distinguish the following prosodeme types assigned to each PW (see also Fig. 2):

  • \(P_1\) – prosodeme terminating satisfactorily (the last PWs of declar. sentences)

  • \(P_2\) – prosodeme terminating unsatisfactorily (the last PWs of questions)

  • \(P_3\) – prosodeme non-terminating (the last PWs in intra-sentence phrases)

  • \(P_0\) – null prosodeme (assigned otherwise)

  • \(P_{0.1}\) – special type of null prosodeme (assigned to the first PW in phrases)

Fig. 2. The illustration of the tree built using the extended prosodic grammar [8, 18] for the Czech sentence “It will get colder and it will snow heavily, so he did not come.”

The prosodeme types are used in speech synthesis to ensure the required communication function on the phrase level of synthesized sentences [10, 22] – the usage of a correct prosodeme type is controlled by the target cost computation in the unit selection method. Unfortunately, even though professional speakers recorded the speech corpora for the purposes of TTS, the prosodic description of the recorded sentences (based on the formal prosody grammar applied to the texts of segmented sentences) sometimes does not correspond to the real \(F_0\) contours. The problems appear mainly within the null prosodeme, where “neutral” speech is expected but the speaker could have pronounced a word in a prosodically unexpected way. This inaccurate description (and thus the wrong usage of some speech units in the synthesis itself) may lead to an unnatural, excessive increase or decrease of the \(F_0\) contour in a non-phrase-final prosodic word with the null prosodeme, which could be manifested by an inappropriate stress or an unnatural melody, or it may even result in a misunderstanding because the required communication function is not kept.

In the presented paper, the experiments are carried out on two large speech corpora – AJ and MR [12, 20]. The male synthetic voice, built from the AJ corpus, is widely used in commercial products for its high naturalness. On the other hand, the female synthetic voice, built from the MR corpus, is not very consistent in prosody (the speaker's prosody is very dynamic) – with the original prosodic description as the baseline, synthesized sentences quite often contain an unnatural intonation pattern (especially in the null prosodeme). The complete statistics of the corpora are listed in Table 1.

Table 1. Number of prosodic words labelled by a specific prosodeme type.

4 Correction Process

The basic idea behind the correction process is simple. With inconsistent prosody, the speech created by the unit selection does not sound natural and is unpleasant to listen to due to speech artefacts. If we were able to correct wrongly marked prosodic words, we might achieve more fluent and consistent prosody, which would lead to a better synthesis. The correction process has two phases, and the choice of a suitable data description is a principal issue. Although prosodemes (Sect. 3) are only symbolic prosody features, each prosodeme type corresponds to specific changes in the \(F_0\) contour – these can be modelled by the Legendre polynomials (Sect. 2), whose first four coefficients are used as the only representation of our data in the presented experiment.

In the first phase, anomalies among the null prosodemes are detected (Sect. 4.1). In the second phase, the detected outliers are classified by a multi-class classifier that gives them new labels (Sect. 4.2). Both phases are described below in detail. After the correction, the evaluation by listening tests was performed (see Sect. 5).

4.1 Phase One: Anomaly Detection

Anomaly (or novelty) detection [2, 13] is a well-known approach used to find items that do not have the same or similar properties as the other items in a dataset. Our previous study [14] showed that, among other classification methods, the One-Class Support Vector Machine (OCSVM) is the most suitable for this experiment. We are using the implementation of OCSVM from scikit-learn [16], which is based on libsvm [3], with a radial basis function as the kernel and \(\gamma =0.1\). The parameter \(\nu \), which sets an upper bound on the fraction of training errors, was set to 10% – this value is the authors' estimate of the proportion of possibly wrongly labelled PWs in the corpora. Since we are looking for anomalies only in our closed dataset, we can afford to train the OCSVM model on the whole dataset to get the best decision function possible.
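A minimal sketch of this detector in scikit-learn follows; the variable p0_coeffs, a hypothetical (n, 4) array holding the four Legendre coefficients of each \(P_0\) prosodic word, is a placeholder for the corpus data.

```python
from sklearn.svm import OneClassSVM

# RBF-kernel OCSVM with the parameters reported in the paper:
# gamma = 0.1 and nu = 0.1 (the assumed 10% of wrongly labelled PWs).
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
ocsvm.fit(p0_coeffs)  # trained on the whole closed dataset

# predict() returns +1 for inliers and -1 for outliers (anomalies).
is_outlier = ocsvm.predict(p0_coeffs) == -1
outlier_coeffs = p0_coeffs[is_outlier]
```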

We trained two OCSVM models. The first one was trained using the \(35{,}781\) \(P_0\) prosodemes from the AJ corpus and the second one using the \(41{,}728\) \(P_0\) prosodemes from the MR corpus. After training the models, we tested how they react to the different types of prosodemes and also to the training data. We detected anomalies in each group of prosodemes using the OCSVM model to obtain the number of outliers in each group. Since the model was trained on \(P_0\) prosodemes, where we assumed 10% of anomalies, we expected the number of outliers to be about 10% for \(P_0\) and significantly higher for the other groups. The results shown in Table 2 confirm this assumption – most of the \(P_1\) prosodemes were correctly detected as anomalous by the OCSVM model trained on \(P_0\). All the results are described in [14].

Table 2. Number of anomalies detected by OCSVM.

4.2 Phase Two: Outliers Classification

By detecting the anomalies in \(P_0\) prosodemes, we obtained a group of outliers whose \(F_0\) does not have a “neutral” contour. These outliers can either be strongly penalized to exclude them from the speech synthesis process as described in Sect. 5.1 (see [14]), or classified into another prosodeme class – as mentioned in Sect. 3, apart from \(P_0\), we distinguish four other prosodeme types: \(P_1\), \(P_2\), \(P_3\) and \(P_{0.1}\). To perform the multi-class classification of the \(P_0\) outliers, we had to train an appropriate model for each corpus.

We collected all available prosodeme data from one corpus to cover all the prosodeme types and then trained a Support Vector Classifier (SVC) from scikit-learn as our multi-class model. The SVC uses a one-vs-all decision function, and since our data are not evenly distributed among all types of prosodemes, we set the class-weight parameter to “balanced”, which means the weight of each class is adjusted inversely proportionally to the class frequencies in the input data. As in the previous case, we were working on a closed dataset and therefore we could use all data to train the classification model.
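The classifier setup can be sketched as follows. This is an illustrative example, assuming X holds the Legendre coefficients of all prosodic words in one corpus and y the corresponding prosodeme labels; the RBF kernel is the scikit-learn default and is not stated in the paper.

```python
from sklearn.svm import SVC

# Multi-class SVC with a one-vs-rest decision function and balanced
# class weights (rare prosodeme types get proportionally larger weights).
svc = SVC(decision_function_shape="ovr", class_weight="balanced")
svc.fit(X, y)  # trained on the whole closed dataset

# Phase two: relabel the P_0 outliers detected by the OCSVM in phase one.
new_labels = svc.predict(outlier_coeffs)
```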

The classification and relabelling of \(P_0\) outliers was done again for both corpora. We classified 3,578 outliers in the AJ corpus and 4,174 outliers in the MR corpus; the classification results are listed in Table 3.

Table 3. Classification of \(P_0\) outliers.

It is obvious that most of the \(P_0\) outliers (\(76.3\%\)) from the MR corpus were labelled as a different type of prosodeme. However, \(23.7\%\) of them were given the \(P_0\) label again. These outliers were picked by the OCSVM model as anomalies because their properties were somehow different from the other \(P_0\) data. Nevertheless, the properties of these outliers are still more similar to the \(P_0\) prosodeme than to any other prosodeme type, hence the SVC labelled them as \(P_0\). The situation for the AJ corpus is analogous, with the difference that even more outliers were labelled back as \(P_0\). This is probably caused by the different prosody consistency of each corpus: the intonation of the AJ speaker was more consistent and precise compared to the MR speaker, and therefore the classifier marked the outliers back as type \(P_0\) more often than in the case of the MR corpus. The evaluation of the prosodeme corrections is further described in Sect. 5.2.

5 Evaluation

To evaluate the process proposed in Sect. 4, we carried out two listening tests (see Sects. 5.1 and 5.2) in our new listening test framework. Both tests had the same structure; both were 3-scale preference listening tests. The listeners compared sentences generated by our baseline TTS system ARTIC (with the original corpora, TTS-base) and those generated by a modified system TTS-new built on the corrected corpora (based on the detection and classification described in Sects. 4.1 and 4.2, respectively). They were instructed to use earphones and to compare the overall quality of samples A and B in each pair by selecting one of these options:

  • Sentence A sounds better.

  • I cannot decide.

  • Sentence B sounds better.

The answers were normalized for each listener and each pair of samples in the listening test to \(p=1\) where the TTS-new output was preferred, \(p=-1\) where the TTS-base output was preferred, and \(p=0\) otherwise. These values were used for the final computation of the listening test score \(s\), defined by Eq. 2,

$$\begin{aligned} s = \frac{\sum _{p\in T} p}{|T|}\;, \end{aligned}$$
(2)

where \(T\) is the set of all answers from all listeners. A positive value of \(s\) indicates an improvement of the overall quality when using TTS-new.
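As a trivial worked example of Eq. 2 (the function name and the answer encoding follow the normalization above; this is an illustrative sketch, not the authors' evaluation code):

```python
def listening_test_score(answers):
    """Eq. 2: mean preference over the set T of all answers, each
    encoded as +1 (TTS-new preferred), -1 (TTS-base preferred) or 0."""
    return sum(answers) / len(answers)

# E.g. 6 votes for TTS-new, 2 for TTS-base, 2 undecided -> s = 0.4 > 0.
print(listening_test_score([1, 1, 1, 1, 1, 1, -1, -1, 0, 0]))
```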

5.1 Evaluation of the Phase One

First, the authors evaluated phase one, the anomaly detection using OCSVM from Sect. 4.1, directly in the unit selection speech synthesis itself [14]. In this evaluation, the modified TTS-new represents a system which highly penalizes units originating from anomalous PWs (those detected by the OCSVM) during the Viterbi search [21]. This “ban” should ensure that these “strange” (anomalous) units are not used in the synthesis, and it may, hopefully, increase the naturalness of the synthesized speech. On the other hand, about 10% of all \(P_0\) units are dropped by this approach – this should, however, not be a big problem since the corpora are quite large and were carefully designed [12] to cover all the different units sufficiently. In any case, this approach results in a different sequence of units compared to that generated by TTS-base.

To select the sentences for the listening test, we synthesized 6,000 sentences by both TTS-base and TTS-new and randomly selected 20 sentences for each voice so that they fulfilled the criterion of containing 8 or more anomalous units (similarly to [19], but with the number of anomalous unit occurrences in TTS-base sentences as the selection criterion in this experiment). Thus, the whole listening test contained 40 pairs of synthesized sentences, each pair including two variants of the same sentence – one generated by TTS-base and one generated by the modified system TTS-new.

The results of the listening test, gathered from 16 listeners (5 of them being speech experts), are presented in Table 4. TTS-new was preferred for both voice corpora, and the results are statistically significant (as proved in [14]). The positive score values \(s\) indicate that penalizing the outlier speech units (those originating from the PW outliers detected by the OCSVM using Legendre polynomial coefficients) leads to more natural synthetic speech.

Table 4. Results of the first listening test.

5.2 Evaluation of the Phase Two

The results presented in the previous section indicate an improvement in the quality of speech synthesis when penalizing the units originating from \(P_0\) words detected as outliers by the OCSVM. However, in phase two, the outliers were relabelled by the multi-class SVM classifier (described in Sect. 4.2), and so they could be used in the synthesis with the newly assigned label. In this case, TTS-new uses the same penalization of a mismatch of prosodeme types in the target cost computation as the baseline TTS-base; the only difference between the two systems is the data with prosodeme labels – TTS-new uses the relabelled speech corpora, while TTS-base uses the original speech corpora presented in Sect. 3.

Again, when designing the sentences for the second listening test, we followed the methodology described in [19], with the selection criterion based on the number of relabelled unit occurrences in TTS-new sentences. By this procedure, we randomly selected 10 sentences for each non-null prosodeme type for both voices (80 sentences in total) to find out how the relabelled units performed in their new prosodic contexts.

This listening test was completed by 16 listeners, 6 of them being speech synthesis experts. The results listed in Table 5 show that the relabelled prosodemes did not cause any serious problems in the synthesized sentences; the outputs of TTS-new were sometimes even rated much better by the listeners than the TTS-base outputs.

Table 5. Results of the second listening test.

6 Conclusion and Future Work

In the presented paper, we examined the usage of Legendre polynomials for the correction of a formal prosody grammar. The corpora we have been working with contained inconsistencies in the prosody description – some prosodic words were labelled as “neutral” (\(P_0\)) in terms of prosody even though their \(F_0\) did not have a neutral contour. Therefore, we proposed a two-phase correction method to correct these wrongly labelled prosodemes. To represent our data, we took only the first four Legendre coefficients and then trained a One-Class Support Vector Machine (OCSVM) detector and a multi-class Support Vector Classifier (SVC).

In the first phase, outliers among the \(P_0\) prosodemes were detected by the OCSVM, and then, in the second phase, we classified them with the multi-class SVC to obtain new labels for the \(P_0\) outliers. Afterwards, we conducted two listening tests to evaluate the benefit of this approach. With the first test, we verified that the synthetic speech sounds better if the anomalous \(P_0\) prosodemes are not used. In the second test, we found that if we relabel the anomalies to a different prosodeme type, we can still use them and the quality of speech will not decrease. Hence, we do not need to penalize the anomalies or throw them away, which would be a waste of data. Furthermore, in some cases the synthesized speech even gets better with these relabelled prosodemes.

As future work, we would like to test this method on our other corpora (Czech, English, Russian, etc.), and we also want to compare the quality of speech synthesized without all the anomalies and with the relabelled variants of them.