1 Introduction

Deep neural network (DNN) based technologies [1,2,3,4,5,6] have been used with promising results in a wide range of research areas, such as speech processing [7,8,9], computer vision [10, 11] and natural language processing [12,13,14]. Their strong classification and regression abilities have been explored not only in research but also in real-life applications. For example, speech synthesis is a widely deployed technology that also attracts considerable research attention. There are two typical synthesis methods in the literature. One is synthesis by unit selection [15,16,17], in which the generated waveform is concatenated from segments selected from a large speech corpus. The other is parametric speech synthesis [18,19,20], which estimates the relevant speech parameters directly from contextual features through statistical models. In this paper, a DNN is adopted as the model to predict speech parameters, an approach referred to as DNN-based speech synthesis [21].

There are three main components in a DNN-based speech synthesis system: input contextual features, network architecture and vocoder. New research is often focused on one or more of these three main components.

In the input layer of a DNN-based speech synthesis system, the contextual features include the main phoneme identity feature and other auxiliary information, such as part of speech, positional and prosodic features. Most of these contextual features are binary features well suited to constructing the decision trees in hidden Markov model (HMM)-based speech synthesis [19]; however, they may be insufficient as DNN inputs. For the phoneme identity feature, the consequence is that the relationship between two similar phonemes may not be effectively conveyed when the feature is hard-coded as a binary one-hot vector at the input of the neural network. The problem is even more critical in natural language processing (NLP), where the vocabulary size, and hence the one-hot dimension for a word, reaches tens of thousands. It has been alleviated by word embedding [22,23,24,25,26], which encodes a word as a real-valued low-dimensional vector based on the assumption that the semantic meaning of a word can be predicted from its external contexts in large-scale corpora. The same idea applies to phonemes, whose pronunciation is often influenced by the neighboring words and phonemes. In this paper we adopt a real-valued vector to parameterize phoneme pronunciation. In addition to the phoneme identity feature, part of speech (POS) and pause information are also encoded as real-valued vectors from the bottleneck layer of a prediction network. After these two substitutions, all inputs of DNN-based speech synthesis are real-valued vectors.
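To make the contrast concrete, the following minimal sketch (not the paper's code) compares a one-hot phoneme identity vector with a low-dimensional real-valued embedding lookup; the phone subset, embedding dimension and random table are hypothetical placeholders.

```python
# Illustrative sketch: one-hot phoneme identity vs. real-valued embedding.
# PHONE_SET and EMBED_DIM are hypothetical, not the paper's configuration.
import numpy as np

PHONE_SET = ["sil", "a", "o", "e", "i", "u", "b", "p", "m", "f"]  # toy subset
EMBED_DIM = 8

def one_hot(phone: str) -> np.ndarray:
    """Binary one-hot vector: similar phones share no dimensions."""
    vec = np.zeros(len(PHONE_SET), dtype=np.float32)
    vec[PHONE_SET.index(phone)] = 1.0
    return vec

# A randomly initialised embedding table; in the paper such vectors would be
# learned jointly with word embeddings from a large text corpus.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(PHONE_SET), EMBED_DIM)).astype(np.float32)

def embed(phone: str) -> np.ndarray:
    """Real-valued vector: similar phones can lie close in the space."""
    return embedding_table[PHONE_SET.index(phone)]

print(one_hot("a"))   # sparse binary vector, one dimension per phone
print(embed("a"))     # dense 8-dim real-valued vector
```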

As for the network architecture, the DNN is used to construct a mapping function from the input contextual features to the output speech parameters. For example, Kang et al. used a deep belief network (DBN) to model the joint distribution of contextual and speech features in [27]. In [28], Ling et al. replaced the Gaussian mixture model (GMM) with a DBN in HMM-based speech synthesis [19]. Zen et al. [19] proposed a deep neural network (DNN) based speech synthesis framework that maps the contextual features to the speech features directly, and Fan et al. [29] further employed a recurrent neural network with bi-directional long short term memory units (BLSTM-RNN) [5, 6] to model this direct mapping. Compared with HMM-based speech synthesis, these DNNs are learned with little discriminative information in the output layer, because the decision trees [20] used in HMM-based speech synthesis for categorizing different classes of speech parameters have been removed. Leveraging recent successes in DNN-based automatic speech recognition (ASR) [7] and DNN-based automatic speech attribute transcription (ASAT) [30, 31], a key motivation in this study is to incorporate some of the categorical information captured by decision trees into the training of DNN-based speech synthesis systems. This is realized by an auxiliary categorization framework with an extra classification layer on top of the hidden layers of the regression DNN. The classification layer is trained together with the affine-transform layer through multi-task learning (MTL) [32], which has already been applied to speech synthesis in [33]. Compared with [33], this paper explores several secondary tasks in training the DNN to determine which of them are beneficial to speech quality, and also applies MTL when incorporating the vocoder.

In the vocoder, the final speech waveform is generated from the speech parameters predicted by the estimation models. In HMM-based and feed-forward DNN-based speech synthesis [19], the predicted speech parameters include the first and second derivatives, which are used in the maximum likelihood parameter generation (MLPG) algorithm [34]. In [29], however, the MLPG algorithm could be removed from the BLSTM-RNN-based speech synthesis system, so in this paper we predict the static speech parameters directly. Other vocoder models have also been proposed recently. For example, Song et al. proposed an improved time-frequency trajectory excitation model in [35], Fan et al. proposed a phase-embedded waveform representation in [36], and Hu et al. proposed to model the results of frequency analysis directly in the complex domain in [37]. Here we propose to adopt our pitch scaled analysis (PSA) based vocoder [38] in BLSTM-RNN based speech synthesis and to train an excitation model at the phonemic level. Because LF0 and the pitch scaled spectrum (PSS) [39] exist only in voiced regions, two BLSTM-RNNs are trained in the proposed system. The first, equipped with the multi-task learning described in the last paragraph, predicts the line spectrum pair (LSP) [40] parameters and the UV decision from the contextual features. The second predicts the log fundamental frequency (LF0), PSS and aperiodicity of the voiced phonemes from the generated LSPs and the contextual features. Speech is then synthesized from the generated LSPs, LF0, PSS and aperiodicity parameters with the PSA-based vocoder.

The remainder of this paper is organized as follows. In Section 2, we introduce the real-valued parameterization of the input contextual features. In Section 3, the multi-task learning framework in DNN-based speech synthesis is described with four secondary classification tasks. In Section 4, we integrate the PSA-based vocoder into DNN-based speech synthesis with two BLSTM-RNNs. We describe our experiments in Section 5. Finally, we summarize our conclusions and propose some future work in Section 6.

2 Contextual Feature Parameterization

In conventional DNN-based speech synthesis systems, the phonemic feature is represented by a binary vector with a one-hot representation [19]. This is inefficient because encoding the current phoneme together with its neighboring phonemes produces a long, sparse vector. A vector space model (VSM) was proposed in [41, 42] to parameterize the phonemic information as continuous values; it is trained from a matrix of co-occurrence statistics, which is then factorized by singular value decomposition. Our previous paper proposed to train phonemic embedded vectors (PEV) [43] within a neural network based language model (NNLM) and to represent the phonemes together with word embedded vectors (WEV). In this paper we enhance this representation by introducing the syllable embedded vector (SEV). The rest of this section has two parts: how to train the embedded vectors, and how to combine them to describe the phonemic features.

2.1 Joint Training with Embedded Vectors

A number of methods have been proposed to train the word embedded vector (WEV), such as Global C&W [44], the continuous bag-of-words model (CBOW) [26] and Skip-Gram [26]. We take CBOW, shown in Fig. 1, as an example to describe the joint training structure.

Figure 1. A block diagram of CBOW.

Given a sentence with \(N\) training words, \(S = \{x_1, x_2, \cdots, x_N\}\), an objective function of training CBOW is to maximize the average log probability in Eq. (1).

$$ L(S)=\frac{1}{N-2K}\sum_{i=K+1}^{N-K}\log P\left(x_i \mid x_{i-K},\cdots,x_{i+K}\right) $$
(1)

where K is the size of the sliding window over the neighboring words. The probability \(P(x_i \mid x_{i-K},\cdots,x_{i+K})\) is a softmax function described in Eq. (2).

$$ P\left(x_i \mid x_{i-K},\cdots,x_{i+K}\right)=\frac{\exp\left(X_0^{\mathrm{T}} X_i\right)}{\sum_{X_j\in W}\exp\left(X_0^{\mathrm{T}} X_j\right)} $$
(2)

where W is the word vocabulary, \(X_i\) is the WEV of the target word \(x_i\), and \(X_0\) is the average of the WEVs of all neighboring context words in Eq. (3).

$$ X_0=\frac{1}{2K}\sum_{j=i-K,\cdots,i+K,\; j\ne i} X_j $$
(3)

It is difficult to train the syllable embedded vector (SEV) or phonemic embedded vector (PEV) directly from a large corpus, because a syllable or phoneme does not carry the semantic meaning of a word on its own. They can, however, be learned simultaneously with the WEV in the joint training structure described in [42, 45]. In this structure, the embedded vector for the context word \(x_i\) is changed from \(X_i\) to \(X_i^{\mathrm{new}}\) in Eq. (4).

$$ X_i^{\mathrm{new}}=X_i+\frac{1}{N_i}\sum_{m=1}^{N_i}P_m $$
(4)

where \(X_i^{\mathrm{new}}\) is the composed embedded vector, \(X_i\) is the word embedded vector (WEV), \(P_m\) is the phoneme embedded vector (PEV) or syllable embedded vector (SEV), and \(N_i\) is the number of syllable initials and finals, or syllables, in the i-th word.
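As a concrete illustration, the minimal sketch below (not the paper's code) forms the composed context vector of Eq. (4); the lookup tables, lexicon and dimensions are hypothetical placeholders.

```python
# Minimal sketch of the composed context vector in Eq. (4), assuming lookup
# tables for word (WEV) and phoneme (PEV) embeddings already exist.
import numpy as np

DIM = 8
rng = np.random.default_rng(1)
wev = {"ni3": rng.normal(size=DIM), "hao3": rng.normal(size=DIM)}
pev = {p: rng.normal(size=DIM) for p in ["n", "i3", "h", "ao3"]}
lexicon = {"ni3": ["n", "i3"], "hao3": ["h", "ao3"]}  # word -> initials/finals

def composed_vector(word: str) -> np.ndarray:
    """X_i^new = X_i + mean of the phoneme embedded vectors of the word."""
    phones = lexicon[word]
    return wev[word] + np.mean([pev[p] for p in phones], axis=0)

# During CBOW training the context word x_i would contribute
# composed_vector(x_i) instead of wev[x_i]; the gradients then also update
# the PEV/SEV entries alongside the WEV.
print(composed_vector("ni3").shape)  # (8,)
```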

2.2 Combination of Embedded Vectors

The PEV and SEV are byproducts of training the WEV and are generated at the word level. In DNN-based speech synthesis systems, however, the synthesis unit is at the frame level, so the PEV, SEV and WEV must first be converted to the frame level. Direct frame-level encoding is not straightforward; instead, these embedded vectors are encoded at the phonemic level and then combined with positional parameters at the frame level.

There are several ways to encode these embedded vectors at the phonemic level. This paper adopts two: one calculates the mean of the vectors, and the other concatenates them into a single vector. In the first approach, described in Eq. (5), the three types of embedded vectors must be trained with the same dimension. In the second approach, described in Eq. (6), they can be trained with different dimensions, but each should remain low-dimensional to avoid the curse of dimensionality. Comparative experiments evaluating these two combination methods are reported in Section 5.

$$ {X}_{\mathrm{P}\_\mathrm{new}}=\frac{1}{3}\left({X}_{\mathrm{P}}+{X}_{\mathrm{S}}+{X}_{\mathrm{W}}\right) $$
(5)

or

$$ {X}_{\mathrm{P}\_\mathrm{new}}=\left[{X}_{\mathrm{P}},{X}_{\mathrm{S}},{X}_{\mathrm{W}}\right] $$
(6)

where \(X_{\mathrm{P\_new}}\) is the encoded phonemic vector, \(X_{\mathrm{P}}\) is the PEV, \(X_{\mathrm{S}}\) is the SEV, and \(X_{\mathrm{W}}\) is the WEV.
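The two combination methods reduce to a few lines of numpy; the sketch below is only illustrative, with the 60-dimensional vectors randomly generated in place of trained embeddings.

```python
# Sketch of the two combination methods in Eqs. (5) and (6) for one phoneme,
# assuming X_p, X_s, X_w are already trained embeddings (random here).
import numpy as np

X_p = np.random.randn(60)   # phoneme embedded vector (PEV)
X_s = np.random.randn(60)   # syllable embedded vector (SEV)
X_w = np.random.randn(60)   # word embedded vector (WEV)

X_mean = (X_p + X_s + X_w) / 3.0             # Eq. (5): 60-dim averaged vector
X_concat = np.concatenate([X_p, X_s, X_w])   # Eq. (6): 180-dim concatenation

print(X_mean.shape, X_concat.shape)          # (60,) (180,)
```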

3 Multi-Task Learning

3.1 Classical DNN-Based Speech Synthesis

A typical DNN based speech synthesis system, shown on the left of Fig. 2, consists of several hidden layers and an output layer. The hidden layers can be regarded as a nonlinear feature extractor operating on the input contextual features. The output layer stacked on top of the hidden layers is an affine-transform layer that generates speech parameters from the nonlinearly transformed features. To train the DNN, the hidden layers are initialized with RBMs [1] pre-trained under the contrastive divergence (CD) criterion [46]. The input features are normalized to zero mean and unit variance, so the first pre-trained RBM is a Gaussian-Bernoulli RBM and the remaining ones are Bernoulli-Bernoulli RBMs.
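For concreteness, the snippet below is a generic sketch of one contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM; it is not the paper's KALDI-based implementation, the layer sizes, learning rate and data are hypothetical, and the first layer in the paper would use the Gaussian-Bernoulli variant instead.

```python
# Illustrative CD-1 update for a Bernoulli-Bernoulli RBM, as used to
# pre-train one hidden layer; all dimensions and hyperparameters are toy values.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_visible, n_hidden, lr = 100, 64, 0.01
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def cd1_step(v0):
    """One contrastive-divergence (CD-1) parameter update for a mini-batch v0."""
    global W, b_v, b_h
    h0_prob = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)  # sample hidden units
    v1_prob = sigmoid(h0 @ W.T + b_v)                         # reconstruct visible
    h1_prob = sigmoid(v1_prob @ W + b_h)
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)

cd1_step((rng.random((32, n_visible)) < 0.5).astype(float))   # toy binary batch
```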

Figure 2. DNN based speech synthesis. Left: restricted Boltzmann machine (RBM); right: long short term memory (LSTM) and bidirectional recurrent neural network (BRNN).

This topology is also used in RNN based speech synthesis, except that at least one recurrent layer is stacked among the hidden layers, for example a bidirectional recurrent neural network with long short term memory (BLSTM-RNN) units, as shown on the right of Fig. 2.

3.2 Proposed Classification Layer

DNN-based parameter learning for speech synthesis is often cast as a regression problem in which the DNN constructs a mapping function directly from the contextual features to the speech parameters. This regression function is therefore usually learned with little discriminative information in the output layer; to alleviate this, HMM-based speech synthesis adopts decision trees to classify the input contextual features before learning the parameters. Moreover, the regression formulation introduces an over-smoothing problem, because the generated speech parameters in the output layer are determined solely by the input contextual features. To reduce it, Zen et al. adopted the maximum likelihood parameter generation (MLPG) algorithm [34], which exploits additional dynamic speech parameters, namely the first and second order derivatives.

Both issues listed above can also be addressed by adding another output layer for categorization, learned together with the affine-transform layer. The error signal of the categorization tasks is back-propagated to update the hidden-layer parameters, so the hidden layers are learned with discriminative attributes. Moreover, this additional classification layer helps to overcome the over-smoothing problem through discriminative learning. The proposed framework for DNN-based speech synthesis is shown in Fig. 3. A detailed description of how the additional classification layer is learned is given below.

Figure 3. The framework of the output layers. Left: an affine-transform layer for generating speech parameters; right: a soft-max layer together with an affine transform for classification.

For regression, the mean square error (MSE) in Eq. (7) is minimized to fine-tune the DNN parameters:

$$ D_{\mathrm{MSE}}\left(\hat{y},y\right)=\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t-y_t\right)^2 $$
(7)

where T is the total number of frames, \(y_t\) is the target speech feature vector at frame t, and \(\hat{y}_t\) is the predicted speech feature vector, obtained as follows:

$$ \widehat{y}=\tilde{g}\left({W}_{\mathrm{A}},{b}_{\mathrm{A}},h\right) $$
(8)

where \(\tilde{g}\) is a linear function, \(W_{\mathrm{A}}\) and \(b_{\mathrm{A}}\) are the weight matrix and bias vector of the affine-transform layer, and h is the output of the hidden layers.

As for classification, a soft-max layer is trained with the cross entropy (CE) criterion [47] in Eq. (9):

$$ D_{\mathrm{CE}}\left(\hat{s},s\right)=-\sum_{n=1}^{N}\sum_{t=1}^{T} s_{n,t}\log \hat{s}_{n,t} $$
(9)

where N is the number of sentences, T is the total number of frames, \(s_{n,t}\) is the target label of the categorization task, and \(\hat{s}\) is the predicted soft-max output given by:

$$ \hat{s}=\frac{\exp\left(\tilde{g}\left(W_{\mathrm{S}},b_{\mathrm{S}},h\right)\right)}{\sum \exp\left(\tilde{g}\left(W_{\mathrm{S}},b_{\mathrm{S}},h\right)\right)} $$
(10)

A stochastic gradient descent (SGD) algorithm [48] with mini-batches is used to update the parameters as in Eq. (11).

$$ \left(W,b\right)\leftarrow \left(W,b\right)-\lambda \frac{\partial D}{\partial \left(W,b\right)} $$
(11)

where λ is the learning rate.

The two objectives in Eqs. (7) and (9) are combined with an error ratio as in Eq. (12), and the resulting error signal is back-propagated [49] to the hidden layers.

$$ D\left(\hat{y},y,\hat{s},s\right)=D_{\mathrm{MSE}}\left(\hat{y},y\right)+\alpha\, D_{\mathrm{CE}}\left(\hat{s},s\right) $$
(12)

where α is an error ratio.
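The following PyTorch sketch illustrates how such a multi-task output could be wired up: a shared hidden representation feeds an affine regression head (Eq. (8)) and a softmax classification head, and the two losses of Eqs. (7) and (9) are combined with the error ratio of Eq. (12). The layer sizes, class count and α below are illustrative, not the paper's exact settings.

```python
# Hedged sketch of the multi-task output layers of Eqs. (7)-(12).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLOutput(nn.Module):
    def __init__(self, hidden_dim=512, speech_dim=127, num_classes=2, alpha=0.6):
        super().__init__()
        self.regress = nn.Linear(hidden_dim, speech_dim)    # affine-transform layer
        self.classify = nn.Linear(hidden_dim, num_classes)  # soft-max (via CE) layer
        self.alpha = alpha

    def forward(self, h, y, s):
        y_hat = self.regress(h)                             # Eq. (8)
        d_mse = F.mse_loss(y_hat, y)                        # Eq. (7)
        d_ce = F.cross_entropy(self.classify(h), s)         # Eq. (9)
        return d_mse + self.alpha * d_ce                    # Eq. (12)

# Toy usage: h stands in for the output of the shared hidden/BLSTM layers.
h = torch.randn(8, 512)
y = torch.randn(8, 127)
s = torch.randint(0, 2, (8,))                               # e.g. voiced/unvoiced
loss = MTLOutput()(h, y, s)
loss.backward()                                             # error reaches both heads
```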

3.3 Categorization Tasks

Decision trees provide a sharable structure for every state in the HMM-based speech synthesis system. They split the categorical information into nodes by asking a number of questions about, for example, phoneme identity, left and right contextual information and voiced/unvoiced labels. These questions help the decision trees to split the space of the speech parameters into small groups so that more accurate parameters can be learned. Due to the differences between HMMs and DNNs, it is very hard to incorporate all the related questions directly into DNN training. Here we consider only four types of questions for constructing the classification layer.

The first is the voiced/unvoiced label. Depending on whether the glottis is vibrating, the spectra of speech frames can easily be split into two groups: non-zero fundamental frequency with a harmonic structure, and zero fundamental frequency with a noise-like structure. The additional classification layer therefore encourages the hidden layers to capture the differences between voiced and unvoiced frames.

The second is the phone identity. In the HMM-based speech synthesis system, decision trees are constructed for every HMM state and the phone identity questions are asked in parallel with questions about other contextual information, which means the constructed decision trees are shared across all phones. This sharing is one cause of the over-smoothing problem in HMM-based speech synthesis. To alleviate it in DNN-based speech synthesis, we stack a phone identity classification layer on top of the hidden layers to reinforce the phone identity's discrimination in the hidden layers.

The third is the phonation position. Every phone is produced at a different position in the vocal tract, so the phones can be categorized into small groups. According to phonetic knowledge, the syllable initials and finals in Mandarin can be split into 15 groups, as listed in Table 1. This layer groups the phones and learns the groups in a discriminative manner.

Table 1 Mandarin initials and finals based on phonation position.

The fourth is the HMM state. In HMM-based speech synthesis, an HMM state occupied by a number of speech frames represents a short-time stationary part of speech, so every speech frame can be assigned to an HMM state. This information can be obtained directly from the decision trees of the HMM-based speech synthesis system.

4 PSA-Based Vocoder in DNN-Based Speech Synthesis System

4.1 PSA-Based Vocoder

The pitch scaled analysis (PSA)-based vocoder [38] models the residual signal as a pitch scaled spectrum (PSS) [39] and compensates the linear prediction (LP) spectrum with the detailed harmonic structure of the residual signal. This is realized by pitch scaled analysis [39] in the frequency domain.

In the analysis stage of the PSA-based vocoder, the line spectrum pairs (LSPs) [40] are first extracted for every speech frame. The inverse filter constructed from the LSPs is then used to generate the residual signal. To reconstruct the residual signal in the frequency domain, the PSS is defined by connecting the spectral peaks at the harmonic frequencies. An easy way to extract this envelope is pitch-scaled analysis. Let \(s(k),\ k=1\cdots N\) be a residual frame of two pitch periods in length, and let \(S(n),\ n=1\cdots N\) be the corresponding discrete Fourier transform (DFT) over the same two pitch periods. The even-indexed samples of \(S(n)\), which according to Eq. (14) fall at multiples of the fundamental frequency, constitute the PSS.

$$ N=\frac{2 f_s}{f_0} $$
(13)
$$ f_k=\frac{k f_s}{N}=\frac{k f_s}{2 f_s/f_0}=\frac{k f_0}{2} $$
(14)

where \(f_0\), \(f_s\) and \(f_k\) are the fundamental frequency, the sampling frequency and the frequency of the k-th spectral sample, respectively.
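A rough numpy sketch of the analysis step, under the assumptions above, takes a DFT over a window of exactly two pitch periods and keeps the even-indexed bins, which by Eq. (14) fall on harmonics of f0; the residual signal and parameters are synthetic.

```python
# Sketch of pitch-scaled analysis on a synthetic two-pitch-period residual frame.
import numpy as np

fs, f0 = 16000, 200.0
N = int(round(2 * fs / f0))                  # Eq. (13): two pitch periods
t = np.arange(N) / fs
residual = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, 6))  # toy residual

S = np.fft.rfft(residual, n=N)               # DFT over exactly two pitch periods
pss = np.abs(S[::2])                         # even bins -> pitch-scaled spectrum
print(N, pss.shape)
```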

In the synthesis stage of the PSA-based vocoder, an inverse discrete Fourier transform (IDFT) with a zero-phase criterion is used to synthesize a one-pitch-cycle excitation signal. The periodic excitation is then concatenated from these one-pitch-cycle signals by the overlap-add (OLA) method. After mixing with the aperiodic excitation, the excitation signal is passed through an LSP vocoder to generate the speech.
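The synthesis side can be sketched in the same spirit: rebuild one pitch cycle from harmonic magnitudes with a zero-phase IDFT and overlap-add consecutive cycles. The PSS values, pitch period and hop below are placeholders, not the paper's configuration.

```python
# Sketch of zero-phase IDFT plus overlap-add for the periodic excitation.
import numpy as np

def one_pitch_cycle(pss_mags, period_len):
    """Zero-phase IDFT: real, zero-phase spectrum -> symmetric excitation pulse."""
    spec = np.zeros(period_len // 2 + 1, dtype=complex)
    n = min(len(pss_mags), len(spec))
    spec[:n] = pss_mags[:n]                  # magnitudes at harmonic bins, phase 0
    return np.fft.irfft(spec, n=period_len)

def overlap_add(cycle, n_cycles, hop):
    out = np.zeros(hop * n_cycles + len(cycle))
    for i in range(n_cycles):
        out[i * hop:i * hop + len(cycle)] += cycle
    return out

period = 80                                   # 200 Hz at 16 kHz sampling
cycle = one_pitch_cycle(np.ones(41), period)
excitation = overlap_add(cycle, n_cycles=50, hop=period)
print(excitation.shape)
```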

A detailed workflow of the PSA-based vocoder is shown in Fig. 4. The input speech is encoded into LSPs, LF0, PSS and aperiodicity for every frame, and the output speech is decoded from these coefficients.

Figure 4. The workflow of a pitch scaled analysis (PSA)-based vocoder.

4.2 Integration into DNN-Based Speech Synthesis

A BLSTM-RNN based speech synthesis system predicts the speech parameters, the unvoiced/voiced (UV) decision and LF0 directly from the input contextual features [6]. The LF0 in unvoiced regions is interpolated from the neighboring voiced regions, so the BLSTM-RNN is trained under the assumption that unvoiced regions also carry a continuous LF0 value. This assumption conflicts with the contextual input, which contains unvoiced information such as unvoiced phonemes. This paper therefore proposes to train two BLSTM-RNNs for the speech synthesis system, one of which models the excitation at the phonemic level.
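For reference, the conventional interpolation described above (and avoided by the proposed two-network design) amounts to filling unvoiced frames from their voiced neighbours; a minimal sketch with toy LF0 values:

```python
# Conventional LF0 interpolation: unvoiced frames (marked 0.0 here) are filled
# by linear interpolation between neighbouring voiced frames.
import numpy as np

lf0 = np.array([5.1, 5.2, 0.0, 0.0, 5.4, 5.5])   # toy log-F0 track
voiced = lf0 > 0
interp = np.interp(np.arange(len(lf0)), np.flatnonzero(voiced), lf0[voiced])
print(interp)   # continuous contour used as the regression target
```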

The framework of the proposed BLSTM-RNN based speech synthesis system is shown in Fig. 5. This system takes the contextual features as the input and generates the LSPs, LF0, PSS and aperiodicity where LSPs are generated by the first BLSTM-RNN and LF0, PSS and aperiodicity are generated by the second BLSTM-RNN. A detailed description is given below.

Figure 5. The framework of BLSTM-RNN based speech synthesis with the phonemic excitation.

In the first BLSTM-RNN, the LSPs and UV decisions are predicted directly from the input contextual features. This is reasonable because the LSPs take continuous values in both voiced and unvoiced regions, and the UV decision can be made correctly from the input phonemic information. After training the first BLSTM-RNN, the phonemes in the input can be classified as voiced or unvoiced, and the phonemes labeled as voiced are collected to train the second BLSTM-RNN at the phonemic level.

The second BLSTM-RNN predicts the excitation parameters of the voiced phonemes, namely LF0, PSS and aperiodicity, all of which exist only in the voiced regions. Its input is a combination of the contextual features and the generated LSPs, because the effectiveness of LSPs in predicting LF0 has already been demonstrated in our previous paper [50] and in other researchers' work [51].

In summary, the first BLSTM-RNN predicts the spectral parameters, which take continuous values everywhere, while the second BLSTM-RNN extends the first and predicts the excitation parameters only in the voiced regions. This is more reasonable than having a single BLSTM-RNN predict the spectral and excitation parameters together for every frame.
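The arrangement can be sketched at a high level as follows; this is a hedged PyTorch illustration, not the paper's implementation, and the feature dimensions, layer sizes and threshold are hypothetical.

```python
# High-level sketch of the proposed two-network arrangement: the first
# BLSTM-RNN maps contextual features to LSP + UV; the second maps
# [context, generated LSP] to LF0, PSS and aperiodicity for voiced frames.
import torch
import torch.nn as nn

class BLSTMRegressor(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=512, layers=2):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

ctx_dim, lsp_dim, exc_dim = 300, 41, 1 + 41 + 5      # context; LSP; LF0+PSS+AP
net1 = BLSTMRegressor(ctx_dim, lsp_dim + 1)           # "+1" for the UV decision
net2 = BLSTMRegressor(ctx_dim + lsp_dim, exc_dim)

ctx = torch.randn(1, 200, ctx_dim)                    # one utterance, 200 frames
out1 = net1(ctx)
lsp, uv = out1[..., :lsp_dim], torch.sigmoid(out1[..., lsp_dim:])
voiced_mask = (uv.squeeze(-1) > 0.5)                  # frames of voiced phonemes
exc = net2(torch.cat([ctx, lsp], dim=-1))             # LF0 / PSS / AP, kept
excitation_voiced = exc[voiced_mask]                  # only in voiced regions
```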

5 Experiments and Discussion

In this section, we first describe the experimental configuration in Section 5.1. The three proposed techniques are then evaluated in Sections 5.2, 5.3 and 5.4, respectively.

5.1 Experiment Setup

Two female Mandarin corpora are used in the following experiments. One, of about fifteen hours, is used in Sections 5.2 and 5.4; the other, of about seven hours, is used in Section 5.3. Two corpora are used because the experiments were conducted at two separate times. The contextual features include the phoneme identity feature, part of speech, pause types, tone flags and other positional information. The speech parameters used in these experiments are line spectrum pairs (LSPs) [40] extracted from the STRAIGHT spectrum [52], log fundamental frequency (LF0), and the pitch-scaled spectrum for the PSA-based vocoder. The KALDI toolkit [53] was used for DNN training. The DNN used in the following experiments contains four hidden layers with 3072 units each, and the RNN contains two BLSTM-RNN layers with 512 units each.

The quality of the synthesized speech was verified in two ways. The first was through two objective measures: the root mean square error (RMSE) between the generated and original speech parameters, and the log spectral distance (LSD) [54] between the generated and original waveforms. The second was a subjective measure in terms of ABX preference scores [55] for naturalness. In the preference tests, subjects were asked to listen to two versions of synthesized speech and to choose the one that sounded better, or to indicate no preference (N/P); the chosen option received a score of 1. The final scores were the mean of the scores given by 15 listeners working in speech technology.
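Since the paper does not spell out its exact formulas, the sketch below computes the two objective measures under their common definitions: RMSE between parameter trajectories and a mean log-spectral distance (in dB) between magnitude spectra; inputs here are random placeholders.

```python
# Common-definition sketch of the RMSE and LSD objective measures.
import numpy as np

def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def lsd(spec_pred, spec_ref, eps=1e-10):
    """Mean log-spectral distance (dB) over frames of magnitude spectra."""
    diff = 20.0 * np.log10((spec_pred + eps) / (spec_ref + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=-1))))

print(rmse(np.array([1.0, 2.0]), np.array([1.1, 1.9])))
print(lsd(np.abs(np.random.randn(10, 257)) + 1.0,
          np.abs(np.random.randn(10, 257)) + 1.0))
```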

5.2 Parameterization of Contextual Features

5.2.1 Replacing Binary Features of Phonemes

There are about 60 pronunciation initials and finals in Mandarin. Directly using a one-hot representation for the five phonemes in the contextual features would cause the curse of dimensionality with low efficiency, whereas the embedded vectors can easily be kept low-dimensional. In our experiments, the two combination methods described in Section 2.2 are applied to the three types of embedded vectors: PEV, SEV and WEV. When training the PEV, the dimension is kept at 60, the number of pronunciation initials and finals; to simplify the representation, the SEV and WEV are also trained with a dimension of 60. With the averaging method of Eq. (5) the phonemic vector therefore has a dimension of 60, and with the concatenation method of Eq. (6) it has a dimension of 180. A comparative experiment was carried out for these two combination methods in DNN-based speech synthesis, with the objective measures shown in Table 2. The concatenation method gives lower objective measures than the averaging method, so it is adopted in the following experiments.

Table 2 The LSD, LF0's RMSE and LSP's RMSE for different combination methods of PEV, SEV and WEV in the DNN-based speech synthesis system.

The difference between binary and real-valued vectors for the phonemic features is compared in BLSTM-RNN based speech synthesis in Table 3. The objective measures listed there demonstrate that the real-valued vector is much more powerful than the binary vector in describing the phonemic features.

Table 3 The LSD, LF0’s RMSE and LSP’s RMSE for comparison between binary vector and real-valued vector of phonemes in BLSTM-RNN based speech synthesis.

5.2.2 Comparing with Binary Features in DNN-Based Speech Synthesis Systems

Besides the phonemic features, the POS tag and pause label are usually also encoded as binary vectors at the input of DNN-based speech synthesis systems. In this paper, a prediction network was trained and the real-valued vector extracted from its bottleneck layer was used to represent the POS tag and pause label. With this replacement, all DNN inputs become real-valued vectors. The replacement was evaluated in BLSTM-RNN-based speech synthesis, and the experimental results are listed in Tables 4 and 5.

Table 4 Comparing LSD, LF0’s RMSE and LSP’s RMSE for binary-valued and real-valued vectors in BLSTM-RNN based speech synthesis.
Table 5 Preference scores with a 0.005 confidence interval between the binary-valued and real-valued vectors in the BLSTM-RNN based speech synthesis systems.

The objective measures in Table 4 improve when moving from binary-valued to real-valued vectors in BLSTM-RNN based speech synthesis. The LSD is reduced by about 1.25%, from 3.517 to 3.473, and the RMSEs of the generated LF0 and LSP are reduced by about 3.27% and 1.01%, respectively. These gains are confirmed by the listening test results in Table 5: speech generated with the real-valued vector is preferred over the binary vector by a margin of about 35% (0.543 vs. 0.194). These two sets of results demonstrate the effectiveness of the proposed parameterization method in DNN-based speech synthesis.

BLSTM-RNN based speech synthesis with binary-valued or real-valued vectors is also compared with HMM-based speech synthesis (HMM-SSS) in subjective listening tests. The results listed in Table 6 again show the superiority of the proposed technique: speech synthesized with the real-valued vector is preferred over HMM-based speech synthesis by a margin of about 42% (0.602 vs. 0.183), a larger margin than that obtained with the binary-valued vector.

Table 6 Preference scores with a 0.005 confidence interval between HMM-based and BLSTM-RNN based speech synthesis with binary-valued or real-valued vector.

5.3 Multi-Task Learning

We evaluate our proposed network architecture on two baseline systems, namely DNN-based speech synthesis (DNN-SYN) and BLSTM-RNN-based speech synthesis (BLSTM-RNN-SYN). They were trained with the same inputs to produce the desired outputs. The objective and subjective experimental results are shown in Tables 7 and 8, respectively.

Table 7 The RMSE and LSD measures for DNN-based speech synthesis (DNN-SYN) and BLSTM-RNN-based speech synthesis (BLSTM-RNN-SYN) systems.
Table 8 ABX pairwise preference scores with a 0.005 confidence interval for DNN-based speech synthesis (DNN-SYN) and BLSTM-RNN-based speech synthesis (BLSTM-RNN-SYN) systems.

In Table 7, the objective measures improve from DNN-based to BLSTM-RNN-based speech synthesis, especially the LF0 RMSE, which is reduced by about 4.8% (from 0.145 for DNN-SYN in the top row to 0.138 for BLSTM-RNN-SYN in the bottom row). In Table 8, the speech synthesized by BLSTM-RNN-SYN is also clearly preferred over that of DNN-SYN.

5.3.1 Objective Measures

The error ratio α in the objective function of Eq. (12) is crucial for properly incorporating the classification layer into DNN training. A series of preliminary experiments was carried out to decide which ratio is appropriate for each categorization task. Figure 6 plots the LF0 RMSE as a function of the error ratio for the four categorization tasks, and shows that different error ratios suit different tasks. The error ratios were set to 0.6, 0.4, 0.4 and 0.8 for classifying the voiced/unvoiced attribute, phone identity, phonation position and HMM state, respectively, and these values are used in the remaining experiments for both DNN and BLSTM-RNN based speech synthesis (Table 9).

Figure 6. LF0's RMSE values as a function of error ratios for four different categorization tasks.

Table 9 A comparison of the RMSE, LSD, LSP and V/U (Voiced/Unvoiced) error for the method in [10] (UV-R-DNN) and our proposed method (UV-C-DNN) where R indicates “Regression” and C indicates “Classification”.

In [30], the target voiced/unvoiced label is generated directly from the output of the regression layer. This differs from our proposed method, in which a classification layer is added specifically for the voiced/unvoiced label. The objective measures for the DNN-based speech synthesis systems are shown in Table 9: the voiced/unvoiced error decreases by about 1.56% from the regression-based (UV-R-DNN) to the classification-based (UV-C-DNN) method. Compared with Table 7, the RMSE and LSD measures also improve with the help of the classification layer, especially the LF0 RMSE, which is reduced by about 6%.

Table 10 LSD and RMSE measures for LF0 and LSP for DNN based speech synthesis for four classification tasks.

The three objective measures with the additional classification layer are compared for DNN and BLSTM-RNN based speech synthesis in Tables 10 and 11, respectively. These values vary little across the four categorization tasks. The results with BLSTM-RNN in Table 11 are slightly better than those with DNN in Table 10, and both are better than the baseline systems without the classification layer shown in Table 7.

Table 11 LSD and RMSE measures for LF0 and LSP for BLSTM-RNN based speech synthesis with four classification tasks.

5.3.2 Subjective Preference Scores

Listening tests were also carried out to evaluate the proposed technique. Table 12 lists the preference scores for the voiced/unvoiced label used in DNN-based speech synthesis with either a regression or a classification output layer. The classification-based method (UV-C-DNN), with a score of 37.78% in the middle column of the bottom row of Table 12, is preferred to the regression-based method (UV-R-DNN), with a score of 8.89% in the left column.

Table 12 Preference scores with a 0.05 confidence interval for DNN based speech synthesis with regression (UV-R-DNN) or classification (UV-C-DNN) for the voiced/unvoiced label.

Since the differences between the four tasks in Tables 10 and 11 are very small, we only consider the voiced/unvoiced attribute in Table 13, with the best error ratio α set to 0.6. The preference scores compare DNN based speech synthesis with (UV-C-DNN and UV-C-RNN) and without (plain DNN and BLSTM-RNN) the classification layer. The results again confirm that speech generated with the classification layer is clearly preferred to speech synthesized without it, by about 24% (from 0.227 to 0.467) for DNN-based speech synthesis in the top row of Table 13, and by about 15% (from 0.187 to 0.34) for BLSTM-RNN based speech synthesis in the bottom row.

Table 13 Preference scores at a 0.05 confidence interval for DNN and RNN based synthesis with and without classification.

From these results, it can be concluded that adding a classification layer on top of the hidden layers of the regression DNN strengthens its ability to generate better speech parameters from the contextual features. There are no sharp differences among the four categorization tasks, but considering the convenience of extracting the voiced/unvoiced attribute, it is the categorization task used in the multi-task learning structure in the following experiments.

5.4 PSA-Based Vocoder in DNN-Based Synthesis

To evaluate the proposed technique in BLSTM-RNN based speech synthesis, a series of experiments was carried out. First, the proposed PSA-based vocoder was validated in DNN-based speech synthesis with one BLSTM-RNN. Then, systems with one and two BLSTM-RNNs were compared. Finally, the combination of the PSA-based vocoder and two BLSTM-RNNs was validated in speech synthesis.

5.4.1 PSA-Based Vocoder with one BLSTM-RNN

The baseline system in this experiment is a DNN-based speech synthesis system with one BLSTM-RNN trained with a conventional LSP-based vocoder (Conv-LSP). Compared with the traditional LSP-based vocoder, the proposed PSA-based vocoder (PSA-LSP) adds two types of parameters: the pitch-scaled spectrum (PSS) and the aperiodic measure (AP), both of which exist only in the voiced regions. These two features are therefore first interpolated to continuous values in the unvoiced regions and then concatenated with the LSPs, UV decisions and LF0 of every frame to form the output of the BLSTM-RNN; the input of the BLSTM-RNN is kept the same as in the baseline system. Finally, two BLSTM-RNN-based speech synthesis systems with different output layers are constructed and two versions of synthesized speech are generated. The results are compared in Tables 14 and 15.

Table 14 The LSD, LF0’s RMSE and LSPs’ RMSE for conventional LSP-vocoder (Conv-LSP) and PSA-based vocoder (PSA-LSP) in DNN-based speech synthesis.
Table 15 Preference scores with a 0.005 confidence interval between conventional LSP-vocoder (Conv-LSP) and PSA-based vocoder (PSA-LSP) in DNN-based speech synthesis.

The objective measures for these two speech synthesis systems are shown in Table 14. The LSD measure improves slightly with the PSA-based vocoder compared with the conventional LSP-based vocoder, while the RMSEs of LF0 and LSP increase slightly, possibly because more parameters have to be predicted in the output layer. The subjective preference listening test nevertheless confirms the effectiveness of the PSA-based vocoder in DNN-based speech synthesis: the generated speech is preferred over that of the conventional LSP-based vocoder by about 12% (0.35 vs. 0.23).

5.4.2 Comparison of one and two BLSTM-RNNs

Next, two DNN-based speech synthesis systems are constructed: one with one BLSTM-RNN and the other with two BLSTM-RNNs. Both use the conventional LSP-based vocoder, and their inputs and outputs are kept the same; the only difference is whether the speech parameters are predicted by one or two BLSTM-RNNs. Figure 7 shows example pitch contours for the original sentence and for the versions generated with one and with two BLSTM-RNNs. The pitch contour generated with two BLSTM-RNNs is much closer to that of the original sentence than the one generated with one BLSTM-RNN: in regions 1 and 3 of Fig. 7 the slope of the original sentence is kept, and in region 2 the details of the original sentence are preserved.

Figure 7. The pitch contour of the original sentence and the contours generated by one BLSTM-RNN and by two BLSTM-RNNs.

The differences in the pitch contours of Fig. 7 are confirmed by the objective measures for the generated speech in Table 16. The RMSE of the generated LF0 is reduced by about 14.2%, from 0.183 with one BLSTM-RNN to 0.157 with two BLSTM-RNNs, meaning that predicting the pitch contour in two steps at the phonemic level is more effective than in one step at the sentence level. The subjective listening test results in Table 17 again confirm the superiority of the proposed structure with two BLSTM-RNNs.

Table 16 The LSD, RMSE for LF0 and RMSE for LSP for DNN based speech synthesis with one or two BLSTM-RNNs.
Table 17 Preference scores with 0.005 confidence interval for DNN-based speech synthesis with one or two BLSTM-RNNs.

5.4.3 Combination of PSA-Vocoder and two BLSTM-RNNs in DNN-Based Speech Synthesis

Finally, we evaluate DNN-based speech synthesis combining the PSA-vocoder and two BLSTM-RNNs. The synthesized speech is compared with that generated by HMM-based speech synthesis (HTS) and by DNN-based speech synthesis (DNN-SSS), both with the traditional LSP-based vocoder.

The results in Table 18 show that the objective measures improve compared with HTS and DNN-SSS: for example, the LSD is reduced by about 9% and the LF0 RMSE by about 8% relative to HTS. The results also show that HTS obtains lower objective measures than DNN-SSS; one possible reason is that the speech parameters in DNN-SSS are predicted directly from a local region of the input space, which differs from the decision-tree clustering used in HTS. The subjective listening tests in Table 19 also indicate that speech generated with two BLSTM-RNNs is preferred over HTS by about 27% (0.453 vs. 0.183 in the top row) and over DNN-SSS by about 31% (0.46 vs. 0.15).

Table 18 The LSD, LF0's RMSE and LSP's RMSE for HMM-based speech synthesis, DNN-based speech synthesis (DNN-SSS), and DNN-based speech synthesis with two BLSTM-RNNs and the PSA-vocoder.
Table 19 Preference scores with a 0.005 confidence interval comparing speech synthesis with two BLSTM-RNNs and the PSA-vocoder against HMM-based synthesis and against DNN-based synthesis (DNN-SSS).

6 Conclusion and Future Work

This paper proposed three techniques to improve the quality of synthesized speech in DNN-based synthesis. The first is to parameterize the contextual features as real-valued vectors at the input of the DNN-based speech synthesis model: the phoneme identity features are encoded as real-valued vectors trained jointly with the word embedded vectors, and the POS and pause features are extracted from the bottleneck layer of a prediction network. The second is an auxiliary categorization framework, trained through multi-task learning, for DNN-based speech synthesis systems; four types of secondary tasks were considered in constructing the output layer. The third is an improved vocoder based on pitch-scaled analysis (PSA) integrated into DNN-based speech synthesis. Three corresponding sets of experiments were conducted, and the results demonstrate the superiority of the three proposed techniques over the baseline systems.

With newly emerging DNN modeling techniques, DNN-based speech synthesis still has room for improvement. For example, parameterizing the contextual features directly from the input text still has a long way to go. And although the bidirectional recurrent neural network with long short term memory (BLSTM-RNN) units has been verified as a good architecture for speech synthesis, it is still not easy to apply it directly in real-time applications because of its computational burden. Our future work will therefore focus on these two aspects of DNN-based speech synthesis.