1 Introduction

Although unit selection speech synthesis systems are still often preferred in the commercial sphere, it is clear, according to [5] and our own experience, that heuristics–based tuning of unit selection features basically fails. For example, papers such as [1, 6, 7, 15, 17, 19, 20, 23–25] examined various concatenation cost features, but the results are rather inconsistent and sometimes even contradictory. Therefore, instead of manual feature tuning, we have started to examine machine–learning techniques for data–driven, automatic, per–voice unit selection tuning.

One of the interesting ideas was introduced in [4], where the one–class classification (OCC) technique was used as a replacement for a classic spectral–related smoothness measure in concatenation cost computation. In [22], we tried to validate the results of the original research on our own speech database. In this paper, we present extended results, primarily focusing on parametrizations computed from various speech signal framings and their impact on the ability of OCC to detect the joins of speech units where unnatural artefacts are perceived by humans.

2 One–Class Classification in Unit Selection

One–class classification [10, 21], also known as anomaly or novelty detection, addresses the problem of finding occurrences in data that do not conform to expected behaviour. This is very advantageous, and not yet widely exploited, for unit selection speech synthesis, where large speech databases of natural recordings are usually available. However, it is common in this synthesis technique that unnatural, disturbing artefacts occur when incompatible units are concatenated. The reason is that the target and concatenation costs are generally designed to prefer units minimizing a trade-off among features evaluating similarity to the requirements, instead of reflecting whether the units will sound natural in the sequence they are used in. These artefacts, which obviously do not occur in the source speech corpus, can thus be viewed as “anomalies” or “outliers”. However, the occurrence of the artefacts can be considered a random process (if they could be predicted, they could be avoided), which makes their collection and the reliable analysis of their causes rather difficult. Therefore, the existence of natural sequences and the unavailability of unnatural anomalies lead to the idea of exploring the ability of OCC to detect, and thus to avoid, those anomalies.

2.1 Distances to Train the Classifiers on

For the initial experiment [22], we focused only on spectral continuity classification (following [4]), but used our own Czech male speech corpus [3], containing approximately 15 h of speech and designed as described in [11, 14].

To capture natural spectral transitions, for every two consecutive speech frames (with the signal pre–emphasized by the value 0.95 and Hamming–windowed) we computed the Euclidean and Mahalanobis distances between MFCC vectors, the Itakura–Saito distance between LPC coefficients, and the symmetrical Kullback–Leibler distance both between spectral envelopes obtained from the LPC and between power FFT spectra (referred to as “targets” or “references”); each distance vector thus consists of 5 values (a schematic computation is sketched after the following list). Contrary to [4, 22], however, we examined several different framings of the signal:

  • async 20/20 is the original scheme from the initial experiment, where the signal frames are 20 ms long without overlap (20 ms shift). Since we compute feature distances on rather stable vowel parts (see Fig. 1), it is supposed that the spectrum does not change very much within a particular phone. Thus, a natural transition between neighbouring frames should lead to rather small feature distances, contrary to a spectral change perceived as an artefact.

  • async 04/25 is a scheme with frames 25 ms long, shifted by 4 ms. This scheme was chosen as it provides the most accurate automatic phone segmentation for this voice. The significant signal overlap, and thus the accentuated spectral similarity of consecutive frames, was assumed to emphasize the natural and smooth signal transition pattern which the OCC is required to learn.

  • async 12/25 is a scheme with 25 ms long frames and a 12 ms shift, chosen as a compromise between a large overlap (4 ms shift) and no overlap at all, while still keeping a slight preference towards frame overlapping.

  • psync pm/25 is a pitch–synchronous framing, where 25 ms long frames are centred around pitch–marks [8, 9]. This is the framing in which the MFCCs, energy and F\(_0\) are computed for the “classic” concatenation cost in our TTS system. Contrary to the previous schemes, the shift is always one pitch period long and the overlap varies dynamically as the pitch changes. In unvoiced regions, the distances were not computed.
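
To make the distance extraction concrete, the following sketch shows one possible way of computing the 5-dimensional distance vectors for the async 20/20 framing. It is only an illustration of the description above, not the exact setup of [22]: the MFCC order, LPC order, FFT length and the use of the librosa extractor are our assumptions, and the Itakura–Saito distance is evaluated here on the LPC spectral envelopes, which is our reading of the distance “between LPC coefficients”. The signal is assumed to be a 1-D numpy array.

```python
import numpy as np
import librosa  # assumption: any MFCC/LPC extractor would do; librosa is used for brevity
from scipy.spatial.distance import euclidean, mahalanobis

def frame_features(signal, sr, frame_ms=20, shift_ms=20, n_mfcc=13, lpc_order=16, n_fft=512):
    """Per-frame MFCCs, LPC spectral envelopes and power FFT spectra for the
    async 20/20 framing (pre-emphasis 0.95, Hamming window); the orders and
    FFT length are illustrative choices, not those fixed in [22]."""
    pre = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])    # pre-emphasis 0.95
    flen, shift = int(frame_ms * sr / 1000), int(shift_ms * sr / 1000)
    mfcc = librosa.feature.mfcc(y=pre, sr=sr, n_mfcc=n_mfcc, n_fft=flen,
                                hop_length=shift, window='hamming', center=False).T
    envs, spectra = [], []
    win = np.hamming(flen)
    for i in range(0, len(pre) - flen + 1, shift):
        frame = pre[i:i + flen] * win
        a = librosa.lpc(frame, order=lpc_order)                    # LPC coefficients
        envs.append(1.0 / (np.abs(np.fft.rfft(a, n=n_fft)) ** 2 + 1e-12))  # LPC envelope
        spectra.append(np.abs(np.fft.rfft(frame, n=n_fft)) ** 2)           # power spectrum
    return mfcc, np.array(envs), np.array(spectra)

def itakura_saito(p, q, eps=1e-12):
    """Itakura-Saito divergence between two power spectra."""
    r = (p + eps) / (q + eps)
    return float(np.mean(r - np.log(r) - 1.0))

def sym_kl(p, q, eps=1e-12):
    """Symmetrical Kullback-Leibler distance between two normalized spectra."""
    p, q = p / p.sum() + eps, q / q.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))

def pair_distance(feats_a, feats_b, mfcc_cov_inv):
    """The 5-dimensional distance vector for one pair of frames, each frame given
    as a (mfcc, lpc_envelope, power_spectrum) triple.  mfcc_cov_inv is the inverse
    covariance of MFCCs, e.g. np.linalg.inv(np.cov(all_mfccs.T))."""
    (m1, e1, s1), (m2, e2, s2) = feats_a, feats_b
    return np.array([euclidean(m1, m2),
                     mahalanobis(m1, m2, mfcc_cov_inv),
                     itakura_saito(e1, e2),   # IS distance, here on LPC envelopes
                     sym_kl(e1, e2),
                     sym_kl(s1, s2)])

def consecutive_distances(mfcc, envs, spectra, mfcc_cov_inv):
    """One distance vector per pair of consecutive frames of an utterance."""
    return np.array([pair_distance((mfcc[t], envs[t], spectra[t]),
                                   (mfcc[t + 1], envs[t + 1], spectra[t + 1]),
                                   mfcc_cov_inv)
                     for t in range(len(mfcc) - 1)])
```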

Fig. 1. An example of non-overlapped framing for two illustrative variants of phone [a], with phone boundaries and centres marked by bold and dashed vertical lines, respectively. Feature vectors are outlined for each frame.

As already mentioned, we limit the experiment to vowels only, as unnatural artefacts are perceived more strongly due to their larger amplitude. Nevertheless, the extension to other voiced phones is planned as soon as reliable results are obtained.

For all the various signal framings, the target (natural) distance vectors used to train OCC were collected per–vowel from:

  • all the consecutive frames covering the signal of the vowel, except frames spanning 8 ms at the vowel’s beginning and end, i.e. the \((f_i,f_{i+1}), (f_{i+1},f_{i+2})\) and \((f_{j},f_{j+1}), (f_{j+1},f_{j+2}), (f_{j+2},f_{j+3})\) pairs from Fig. 1. Since our TTS system uses diphones, with boundaries approximately in the middle of the underlying phone, this exclusion allows us to avoid distances near phone (vowel) transitions in the training/testing set.

  • the two consecutive frames nearest to the middle of each vowel, i.e. the \((f_{i+1},f_{i+2})\) and \((f_{j+1},f_{j+2})\) pairs from Fig. 1 (marked as mid.only in Table 1). This might seem a natural choice, reflecting the fact that only the signal around the phone centre is examined for smoothness during diphone concatenation.
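
As an illustration of the two collection schemes, the following is a minimal sketch of selecting the consecutive-frame pairs for one vowel. The function name, the representation of frame times and the interpretation of the 8 ms margin (frames overlapping the first/last 8 ms of the vowel are dropped) are our assumptions, written with the non-overlapping framing in mind.

```python
import numpy as np

def target_pair_indices(frame_starts, frame_len, vowel_start, vowel_end,
                        mid_only=False, margin=0.008):
    """Indices (t, t+1) of consecutive-frame pairs collected as natural targets
    for one vowel.  Frames overlapping the first/last `margin` seconds of the
    vowel are dropped; with mid_only=True only the pair closest to the vowel
    centre is kept (the mid.only variant in Table 1).  All times in seconds."""
    frame_starts = np.asarray(frame_starts)
    keep = np.where((frame_starts >= vowel_start + margin) &
                    (frame_starts + frame_len <= vowel_end - margin))[0].tolist()
    keep_set = set(keep)
    pairs = [(t, t + 1) for t in keep if t + 1 in keep_set]
    if mid_only and pairs:
        centre = 0.5 * (vowel_start + vowel_end)
        # for non-overlapping frames, the join point of a pair is the end of its first frame
        pairs = [min(pairs, key=lambda p: abs(frame_starts[p[0]] + frame_len - centre))]
    return pairs
```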

2.2 Evaluation of Real Concatenations

When using only (smooth) distances computed on the corpus data, we do not know much about how well a trained classifier is able to detect real non-continuous spectral transitions. Therefore, we created an artificial join in the middle vowel of several words by concatenating two halves of the words taken from different parts of the corpus. Around the join, the distance was computed in such a way that when [a] from sentence m is to be concatenated with [a] from sentence n (see Fig. 1), the \((f_{i+1}, f_{j+2})\) vectors are used: \(f_{i+1}\) is the frame nearest to the middle of the initial vowel half and \(f_{j+2}\) is the one after the frame nearest to the middle of the final vowel half. Each such distance was coupled with the listeners’ evaluation of whether a concatenation discontinuity is perceived in the word or not; distances with a discontinuity perceived are further referred to as outlier distances. Since details can be found in [22], we just summarize here that only examples where at least two of three listeners agreed were taken for further processing.
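
A rough sketch of how the distance across such an artificial join could be assembled, reusing the pair_distance() helper sketched after the framing list in Sect. 2.1; the indices of the centre-nearest frames are assumed to be known from the phone segmentation, and the variable names are ours.

```python
def join_distance(feats_m, feats_n, i_mid_m, i_mid_n, mfcc_cov_inv):
    """Distance vector across an artificial join: the frame nearest to the middle
    of the initial vowel half (index i_mid_m, i.e. f_{i+1} in Fig. 1, sentence m)
    is paired with the frame following the centre-nearest frame of the final
    vowel half (index i_mid_n + 1, i.e. f_{j+2}, sentence n).  feats_* are the
    (mfcc, envs, spectra) triples returned by frame_features()."""
    mfcc_m, envs_m, spec_m = feats_m
    mfcc_n, envs_n, spec_n = feats_n
    t = i_mid_n + 1
    return pair_distance((mfcc_m[i_mid_m], envs_m[i_mid_m], spec_m[i_mid_m]),
                         (mfcc_n[t], envs_n[t], spec_n[t]),
                         mfcc_cov_inv)
```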

2.3 Classifiers Examined

Having obtained positive experience with OCC [12, 13], we examined 3 classifier types, all implemented using the Scikit-learn toolkit [16]. The first one is the multivariate Gaussian distribution (MGD), with all the distances modelled together in one go, tied through the covariance matrix. The second one is the one-class SVM (OCSVM), mapping distances into a high-dimensional feature space via a kernel function and iteratively finding the maximal-margin hyperplane which best separates the training data from the outliers [18]. The last one is Grubbs’ test [2] (GRT), modified as described in [12] to flag a multidimensional distance vector as an outlier when any of its individual features is detected as outlying.
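
As an illustration only, the first two classifier types might be set up along the following lines. The MGD is written out explicitly as a Gaussian with a log-likelihood threshold, which is our interpretation; the GRT modification of Grubbs’ test from [12] is omitted, and the hyperparameter values are placeholders, not those tuned in [22].

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.svm import OneClassSVM

class MGD:
    """Multivariate Gaussian density over the 5 distances: a vector is accepted
    as a target when its log-likelihood exceeds a threshold (to be tuned, e.g.
    via the cross-validation described below)."""
    def fit(self, X, threshold):
        self.dist = multivariate_normal(mean=X.mean(axis=0),
                                        cov=np.cov(X, rowvar=False))
        self.threshold = threshold
        return self

    def predict(self, X):
        # +1 = target (natural transition), -1 = outlier (suspected artefact)
        return np.where(self.dist.logpdf(X) >= self.threshold, 1, -1)

# One-class SVM; the RBF kernel and the nu/gamma values are illustrative only
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
# usage: ocsvm.fit(train_target_distances); ocsvm.predict(X) returns +1 / -1
```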

Prior to the training, the whole per-vowel set of target distances was reduced to 4,000 randomly selected vectors, mostly to speed up the training process, but also to prevent potential OCC overfitting (see [22] for the total number of distances in async 20/20, which is the lowest of all framings used here). This reduced set was then further randomly split into 80 % of training target distances and 20 % of distances held out for the final evaluation (see Sect. 3). From the training targets, 20 % were randomly chosen for 10-fold cross-validation. All the classifiers were trained to maximize the F1 score; the details of the parameter setup can be found in [22].

To further increase the robustness of the training, we added 50 % of the outlier distances (those with a discontinuity perceived, see Sect. 2.2) to the cross-validation process, whenever these were available for the corresponding vowel.
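
The data handling described above might then look roughly as follows; the random seeds, function names and the way the F1 score is evaluated on a mix of targets and outliers are our assumptions, not code from [22].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def prepare_vowel_set(target_distances, n_keep=4000, held_out=0.2):
    """Randomly reduce the per-vowel target set to n_keep vectors and split off
    the held-out part used only for the final evaluation (Sect. 3)."""
    idx = rng.permutation(len(target_distances))[:n_keep]
    return train_test_split(target_distances[idx], test_size=held_out, random_state=0)

def f1_on_mix(clf, targets, outliers):
    """F1 score of a trained one-class classifier on a mix of target (+1) and
    outlier (-1) distance vectors, as used during cross-validation."""
    X = np.vstack([targets, outliers])
    y = np.concatenate([np.ones(len(targets)), -np.ones(len(outliers))])
    return f1_score(y, clf.predict(X), pos_label=1)
```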

Table 1. The classification performance when the given number of target distances (for all the words without artefacts perceived) and the remaining 50 % of outlier distances (those not used for cross-validation), obtained by the evaluations described in Sect. 2.2 and computed for the given framing, were passed to the classifier trained on the corresponding data. All the values are in %.

3 Results

Once the classifiers were trained, the 20 % of target corpus distance vectors held out earlier and all the distance vectors for joins evaluated by listeners as smooth (i.e. those without an artefact perceived) were used to evaluate the ability of the classifiers to recognize target distances never seen before. Also, the remaining 50 % of outlier distances not used in cross-validation were used to quantify the reliability of the detection of probable artefacts.

In Table 1 we present results for all the framings mentioned in Sect. 2.1 and all the classifiers described in Sect. 2.3. In the table, the abbreviation TP denotes true positives (targets detected as targets) and TPR is the percentage of TP among all the targets to be classified (also called recall). Similarly, TN stands for true negatives (correct outlier classifications) and TNR is the corresponding percentage (specificity). Due to space limitations, we exclude vowels with a smaller number of examples to evaluate (both due to fewer joined words evaluated and to lower mutual agreement of listeners on artefact absence/presence, see Sect. 2.2). Also, we do not present here the classification of the 20 % of target distances held out.
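
For clarity, these measures, together with the F1 score referred to below, are the standard ones (with FN denoting targets classified as outliers and FP denoting outliers classified as targets):

\[
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \qquad
\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}.
\]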

It can clearly be seen that the results are rather mixed, with no significant preference for a framing and/or classifier type. In general, the mid.only variant behaves worse than the variant in which distances taken throughout the whole vowels are taken into account. Another surprising fact is that a larger overlap leads to worse results: although the distances to train on are computed from very similar signals, the classifiers are not able to recognise outlier distances. It can be said that distances between non-overlapping frames are better at recognising targets, while distances between frames with a large overlap recognise outliers instead. The best compromise seems to be async 12/25, for which OCSVM can reliably classify phone [a:] and rather successfully detect outliers for other phones as well.

Looking at raw F1 scores, most of the best results are obtained for the async 20/20 framing, spread across various classifiers. However, taking for example phone [a] with \(\text{F1} = 93\,\%\) (MGD), none of the 9 outliers was detected successfully. A similar situation occurs for [i] (\(\text{F1} = 88.9\,\%\), OCSVM), where only 5 out of 10 outliers were detected. From the point of view of unit selection, where the classifiers should finally be used, we would prefer reliable detection of outliers at the expense of a higher FN rate (continuous joins classified as outliers). This would ensure that no audible artefacts (or a minimum of them) appear in the synthesized speech. On the other hand, however, discarding wrongly classified smooth joins can easily lead to the inability to follow the required target specifications (since units with a better match were discarded), which is not a desirable situation either.

4 Conclusion

We hope to have shown that this alternative to feature hand-tuning may have its potential, despite the fact that there is no ultimate answer to the question of what features/classifiers to use to avoid the unnatural artefacts sometimes occurring in unit selection generated speech (nor was one given in [4]).

Regarding further research directions, it is important to start with an error analysis, i.e. to examine the causes of the classification failures. Our hypothesis is that the perceived artefacts are caused either by a mismatch of non-spectral features, or by a spectral mismatch not covered well by the chosen features and distance computation scheme. Therefore, we need to search for another set of features, not necessarily entirely spectral-related, with a better capability of capturing the causes of artefact perception; this may affect both concatenation and target cost features. And since the vowel joins evaluated by listeners (described in Sect. 2.2 and in detail in [22]) were intentionally not limited with respect to spectral features, they can be gradually extended and reused when searching for and experimenting with other mismatch-describing features.

To make our results verifiable, as well as to provide a solid springboard for prospective followers, we have put all the data required to repeat the experiment on GitHub in the ARTIC-TTS-experiments/2016_SPECOM/ repository, where more detailed results can also be found. Do not hesitate to contact us in case of any questions.