
1 Introduction

Voice conversion systems modify a speaker’s voice so that it is perceived as uttered by another speaker [1]. Voice conversion technology can be applied in many areas. One of the most important applications is in text-to-speech synthesis (TTS), where it promises to synthesize voices with different speaking styles without requiring a large utterance corpus. Beyond speech synthesis, voice conversion also has potential applications in areas such as entertainment, security, speaker individuality for interpreting telephony, and voice restoration.

Many conversion methods have been proposed since the problem was first formulated in 1988 by Abe et al. [2]. Generally speaking, the main methods, such as those based on the Gaussian Mixture Model (GMM) [3], the Hidden Markov Model (HMM) [4], Frequency Warping (FW) [5], and Artificial Neural Networks (ANN) [6], achieve similarity in the converted speech at the cost of a dramatic decline in quality. Although many refinements, such as introducing Global Variance and Parameter Trajectory into the GMM-based method, have been put forward over the last few years, each improving quality to some extent, the conversion results are still not satisfactory for practical applications. Therefore, new conversion methods are needed to achieve high-quality converted speech.

Information separation for speech signals is emerging as a new approach to speech processing. Among the information carried by speech, the meaning of the message being uttered is of prime importance, and the identity of the speaker also plays an important role in oral communication. We define these two types of information as the content and the style of speech, respectively. It is common knowledge that the human auditory system can identify the meaning of the message and the speaker’s identity simultaneously, i.e., it can skillfully separate the content and style factors of a speech signal.

In Popa et al.’s work [7], the bilinear model was used to model speech, and two sets of speech parameters, representing style and content respectively, were obtained by singular value decomposition (SVD) of speech observations. Given this separation, voice conversion is realized by replacing the source style with the target style while preserving the original content, i.e., the converted speech is reproduced from the source content and the target style. Building on this work, in this paper we use a new method to separate the speaker-identity information.

K-means singular value decomposition (K-SVD) is a signal decomposition method that aims at sparse representation of signals. It has been applied successfully to image denoising, feature extraction, and other tasks. Owing to its sparsity and other properties, K-SVD can decompose the vocal tract spectrum into a dictionary that conveys the speaker identity and a corresponding sparse matrix that contains the content information. Consequently, we can use the dictionary to separate the speaker-specific characteristics and perform voice conversion by substituting the dictionary.

Combining this view of speaker identity with K-SVD, we implement voice conversion using the style replacement technique proposed in Popa et al.’s work. Experimental results show that the proposed approach outperforms the traditional GMM- and ANN-based methods in terms of both similarity and naturalness, especially when the training dataset is small.

The paper is organized as follows. In the next section, we briefly introduce the K-SVD theory. In Sect. 3, we describe the voice conversion scheme based on K-SVD. In Sect. 4, we conduct experiments to evaluate the proposed method, and the results demonstrate the benefits of K-SVD compared with GMM and ANN in voice conversion. Finally, we make concluding remarks and present some potential future research directions in Sect. 5.

2 The K-SVD Algorithm

We have witnessed a growing interest in the use of sparse representations for signals in recent years. Using an overcomplete dictionary matrix \( D \in IR^{n \times K} \) that contains K atoms, \( \{ d_{j} \}_{j = 1}^{K} \), as its columns, it is assumed that a signal \( Y \in IR^{n} (n \ll K) \) can be represented as a sparse linear combination of these atoms. The representation of Y may be approximate, \( Y \approx DX \), satisfying \( ||Y - DX||_{2} \le \varepsilon \). The vector \( X \in IR^{K} \) contains the representation coefficients of the signal Y. The sparsest such representation is the solution of

$$ (P_{0,\varepsilon } )\quad \mathop {\hbox{min} }\limits_{X} ||X||_{0} \;{\text{subject to}}\;||Y - DX||_{2} \le \varepsilon $$
(3.1)

In the K-SVD algorithm [8] we solve (3.1) iteratively, using two stages, parallel to those in K-Means. In the sparse coding stage, we compute the coefficient matrix X using any pursuit method, allowing each coefficient vector to have no more than \( T_{0} \) non-zero elements. Then, we update each dictionary element sequentially, changing its content and the values of its coefficients to better represent the signals that use it. This is markedly different from previously proposed K-Means generalizations: those methods freeze X while finding a better D, whereas we change the columns of D sequentially and allow the relevant coefficients to change as well. This difference results in a Gauss–Seidel-like acceleration, since each subsequent column update is based on more relevant coefficients. We also note that the computational complexity of K-SVD is equal to that of previously reported algorithms for this task.
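For concreteness, the sparse coding stage can be sketched with Orthogonal Matching Pursuit (OMP), a standard pursuit method for this step (and the one we use in Sect. 3.3). The following minimal numpy sketch is illustrative only; it assumes the atoms of D are unit-norm, as K-SVD maintains:

```python
import numpy as np

def omp(D, y, t0):
    """Greedily approximate y with at most t0 atoms of D (columns assumed unit-norm)."""
    residual = y.copy()
    support, coef = [], np.zeros(0)
    for _ in range(t0):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:  # numerical safeguard: no atom is selected twice
            break
        support.append(k)
        # Re-fit the coefficients of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```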

We now describe the process of updating each atom \( d_{k} \) and its corresponding coefficients, which are located in the k-th row of the coefficient matrix X, denoted \( x_{k} \). We first form the matrix of residuals and restrict it to the columns corresponding to the signals that currently use the atom being improved. Let r be the set of indices of these signals, and let \( E_{k}^{r} \) denote the restricted residual matrix, which we now approximate by the product of the two updated vectors \( d_{k} \) and \( x_{k} \); i.e., the objective can be rewritten as

$$ ||Y - DX||_{F}^{2} = \left\| Y - \sum\limits_{j = 1}^{K} d_{j} x_{j} \right\|_{F}^{2} = \left\| \left( Y - \sum\limits_{j \ne k} d_{j} x_{j} \right) - d_{k} x_{k} \right\|_{F}^{2} = ||E_{k} - d_{k} x_{k} ||_{F}^{2} $$
(3.2)

We seek a rank-one approximation of \( E_{k}^{r} \). This approximation is obtained from the singular value decomposition (SVD): \( d_{k} \) is taken as the first left singular vector, and \( x_{k} \) as the first right singular vector scaled by the first singular value. In this way we obtain both the dictionary and the sparse representation of the speech signals.
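The dictionary update stage described above can be sketched as follows; this is a direct transcription of the rank-one SVD update of Eq. (3.2), with variable names chosen for illustration:

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Update atom d_k and its coefficient row x_k in place via a rank-one SVD of E_k^r."""
    omega = np.nonzero(X[k, :])[0]       # indices r of the signals that use atom k
    if omega.size == 0:
        return                           # atom unused: leave it unchanged
    # Residual over those signals with atom k's contribution removed (E_k^r).
    E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                    # first left singular vector (unit norm)
    X[k, omega] = s[0] * Vt[0, :]        # first singular value times right vector
```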

Since the dictionary reveals the principal identity and inner structure of the speech, we hypothesize that the speaker-specific characteristics are contained in the dictionary while the content is conveyed in the coding matrix. We verify this hypothesis in the experimental phase.

3 Voice Conversion Based on K-SVD

So far, we have seen that a speech signal can be decomposed into a dictionary conveying the information related to speaker identity and a corresponding sparse matrix containing the information related to content. Naturally, we can replace the source speaker’s dictionary with the target speaker’s dictionary and multiply it by the source sparse matrix to implement voice conversion. While this scheme is easy to realize, there is a subtle problem: the K-SVD decomposition is not unique, meaning that a vocal tract spectral matrix may correspond to different combinations of dictionary and sparse matrix. To settle this problem, we adopt the approach put forward by Xu [9] and introduce partial constraints on the vocal tract spectral conversion, as shown in Fig. 3.1.

Fig. 3.1 Voice conversion system based on K-SVD

There are two basic modes in the voice conversion system. In training mode, we process the data of the source and target speakers (part 1 in Fig. 3.1) and extract the dictionaries based on K-SVD (part 2 in Fig. 3.1). In converting mode, we implement voice conversion based on dictionary replacement (part 3 in Fig. 3.1). Next, we describe the processes of parts 1, 2, and 3 in detail.

3.1 Data Processing

The source and target speech used for training are denoted x and y; their linguistic contents are the same. To extract the vocal tract spectrum, we use the STRAIGHT model, which separates the excitation from the vocal tract well. The training speech is analyzed with the STRAIGHT model, and the resulting STRAIGHT spectra of the source and target speech are denoted \( S_{x} \) and \( S_{y} \), respectively.

Before training, \( S_{x} \) and \( S_{y} \) must be aligned in time to ensure the coherence of the coding matrices obtained in the next step. Since the common Dynamic Time Warping (DTW) method [2] can cause quality to decline dramatically, because it ignores inter-frame continuity when inserting or removing frames, we adopt the time-alignment method based on phoneme label information put forward in [10]; a sketch of the idea follows. \( S_{x}^{'} \) and \( S_{y}^{'} \) denote the time-aligned source and target STRAIGHT spectra, respectively.
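The details of the alignment method in [10] are beyond our scope here, but the following sketch illustrates the general idea under the assumption that per-phoneme frame boundaries are available for both utterances; the boundary format and function names are hypothetical, not taken from [10]:

```python
import numpy as np

def align_by_phonemes(S_tgt, src_bounds, tgt_bounds):
    """Resample the target spectrum segment by segment to match source segment lengths.

    src_bounds, tgt_bounds: lists of (start, end) frame indices per phoneme,
    assumed to come from the phoneme labels. S_tgt holds one frame per column.
    """
    segments = []
    for (s0, s1), (t0, t1) in zip(src_bounds, tgt_bounds):
        # Nearest-frame mapping so the aligned target segment gets s1 - s0 frames.
        idx = t0 + np.round(np.linspace(0, t1 - t0 - 1, s1 - s0)).astype(int)
        segments.append(S_tgt[:, idx])
    return np.concatenate(segments, axis=1)
```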

3.2 Dictionary Extraction

Having obtained the time-aligned STRAIGHT spectra, we now extract the dictionaries used for conversion.

As illustrated in part 2 of Fig. 3.1, decomposing \( S_{x}^{'} \) with K-SVD yields the dictionary \( D_{x} \) and the sparse matrix \( H_{x} \) of the source speech. If the target speaker’s STRAIGHT spectrum were analyzed independently in the same way, it would be difficult to ensure coherence between the target and source dictionaries. Under the hypothesis that the sparse matrix is determined by the content information, and since the aligned utterances share the same content, the sparse matrix of \( S_{y}^{'} \) should equal \( H_{x} \). Therefore, we set \( H_{y} = H_{x} \) when decomposing \( S_{y}^{'} \), obtaining the corresponding dictionary of the target speaker, denoted \( D_{y} \).
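With the coding matrix held fixed, solving for the dictionary becomes a linear problem. One natural reading, sketched below as an assumption rather than a specified procedure, is a least-squares fit via the pseudo-inverse:

```python
import numpy as np

def fit_target_dictionary(S_y_aligned, H_x):
    """Solve min_D ||S_y' - D H_x||_F^2 with the coding matrix H_x held fixed.

    A least-squares fit; assumed here as one plausible way to realize
    H_y = H_x, since the exact solver is not spelled out in the text.
    """
    return S_y_aligned @ np.linalg.pinv(H_x)
```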

3.3 Voice Conversion Based on Dictionary Replacement

As shown in part 3 of Fig. 3.1, in converting mode we first extract the STRAIGHT spectrum of the source speech. We then use the Orthogonal Matching Pursuit (OMP) algorithm to decompose this spectrum with \( D_{x} \) fixed, obtaining the corresponding sparse matrix \( H_{x}^{convert} \). The converted STRAIGHT spectrum is then computed from \( H_{x}^{convert} \) and the dictionary \( D_{y} \) obtained in the training mode, according to Eq. (3.3).

$$ S_{y}^{convert} = D_{y} \cdot H_{x}^{convert} $$
(3.3)
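Putting the pieces together, the converting mode reduces to sparse coding over \( D_{x} \) followed by reconstruction with \( D_{y} \), per Eq. (3.3). A minimal sketch using scikit-learn’s OMP solver; the function name and frame layout (one frame per column) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def convert_spectrum(S_src, D_x, D_y, t=12):
    """Convert a source STRAIGHT spectrum by dictionary replacement (Eq. 3.3)."""
    # Sparse-code each source frame over the source dictionary with sparsity t.
    # D_x comes from K-SVD, so its atoms are unit-norm as OMP expects.
    H = orthogonal_mp(D_x, S_src, n_nonzero_coefs=t)  # shape (K, n_frames)
    # Reconstruct with the target dictionary instead of the source one.
    return D_y @ H
```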

4 Experimental Evaluations

4.1 Experimental Conditions

We conducted two experimental evaluations. First, to demonstrate that the dictionary conveys the speaker-specific characteristics and the sparse matrix conveys the content information, we used K-SVD to separate the speaker-specific characteristics from the content information. Second, to demonstrate the effectiveness of the proposed conversion method, we compared it against the GMM- and ANN-based methods, which have been the most popular conversion techniques in recent years.

In our experiments, all data are selected from the CMU ARCTIC databases, which consist of four corpora, SLT, BDL, JMK, and AWB, each uttered by a different speaker.

Two parameters affect the conversion result of K-SVD: the number of atoms (K) in the dictionary and the sparsity (t) of the sparse matrix. The reconstruction error decreases as these parameters increase; however, the results do not keep improving when the parameters grow too large. Through many experiments, we have found that the converted voice is best in both quality and similarity when the two parameters satisfy Eq. (3.4):

$$ K = 40,\quad t = 12 $$
(3.4)

Therefore, we use these parameter settings in the following experiments.

4.2 Speaker-Specific Characteristics Separation

To demonstrate that the speaker-specific characteristics are contained in the dictionary, we proceed as follows.

First, we select the first 10 sentences from BDL and SLT and compute their STRAIGHT spectra; the dimensionality of each row vector is 256. Second, we obtain the dictionaries \( D_{BDL} \) and \( D_{SLT} \) through K-SVD. Then, we choose the next 5 sentences from BDL and SLT and compute their STRAIGHT spectra. Finally, we compute the reconstruction error with the dictionary fixed as \( D_{BDL} \) or \( D_{SLT} \). The reconstruction error is calculated as

$$ e = 10\log_{10} \left( \frac{\frac{1}{N_{{\mathbf{S}}} }\sum\limits_{i = 1}^{N_{{\mathbf{S}}} } \left\| {\mathbf{s}}_{i} - {\hat{\mathbf{s}}}_{i} \right\|^{2} }{\frac{1}{N_{{\mathbf{S}}} }\sum\limits_{i = 1}^{N_{{\mathbf{S}}} } \left\| {\mathbf{s}}_{i} \right\|^{2} } \right) $$
(3.5)

where \( N_{{\mathbf{S}}} \) denotes the number of rows in the STRAIGHT spectrum, \( {\mathbf{s}}_{i} \) denotes the i-th row of the source spectrum, and \( {\hat{\mathbf{s}}}_{i} \) denotes the i-th row of the reconstructed spectrum. The results are summarized in Table 3.1.
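Equation (3.5) is a relative error expressed in dB; a direct numpy transcription, with rows of S as frames to match the definition above:

```python
import numpy as np

def reconstruction_error_db(S, S_hat):
    """Relative reconstruction error of Eq. (3.5), in dB."""
    n_s = S.shape[0]
    num = np.sum(np.linalg.norm(S - S_hat, axis=1) ** 2) / n_s
    den = np.sum(np.linalg.norm(S, axis=1) ** 2) / n_s
    return 10.0 * np.log10(num / den)
```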

Table 3.1 Results of reconstruction error (dB)

It is clear that the reconstruction error is much smaller when a voice is analyzed with its own dictionary than with the other speaker’s dictionary. Therefore, we can be confident that the dictionary trained by K-SVD contains the speaker-specific characteristics and that these characteristics can be separated effectively. Accordingly, voice conversion can be implemented by replacing the dictionary.

4.3 Effectiveness of the Proposed Method

Figure 3.2 shows the source, target, and converted speech directly. To assess the conversion results, two subjective listening tests were carried out. One is the Mean Opinion Score (MOS) test, which assigns a score between 1 and 5 to evaluate speech naturalness; the other is an ABX test, which uses the speaker recognition rate to evaluate the success of individuality conversion. All tests were performed by five listeners who have been engaged in speech processing research for many years. The results are given in Table 3.2.

Fig. 3.2 Source, target and converted speech

Table 3.2 Results of subjective listening tests

From the experimental results shown in Table 3.2, it is easy to see that the proposed voice conversion method based on style and content separation clearly outperforms the traditional GMM- and ANN-based algorithms built on parameter extraction and mapping. We can therefore conclude that the proposed method works even with a small amount of training data, whereas the GMM and ANN methods do not, because their parameters cannot be estimated reliably from insufficient data.

The results also show that the proposed method implements voice conversion within the framework of style and content separation and achieves superior performance. We attribute this to the advantages of K-SVD: it preserves the speaker-specific characteristics and takes inter-frame correlation into account during separation and conversion.

5 Conclusion

This paper presents a voice conversion method based on style and content separation, realized with K-SVD. On the basis of successfully separating the speaker-specific characteristics, we propose a novel voice conversion method using style replacement, and we resolve the non-uniqueness of the K-SVD decomposition within the conversion method. Experimental results show that the proposed method outperforms the traditional GMM- and ANN-based algorithms in terms of speech quality as well as conversion similarity and naturalness.

The proposed method offers a novel way to realize voice conversion and improves conversion performance dramatically; however, it still models speech as a linear combination of style and content, and the complex relationship between the observed data and the style and content factors cannot be fully described by linearity alone. Therefore, in future studies we will extend K-SVD to the non-linear case to further improve the performance of the style and content separation technique. In addition, the style and content separation technique may also benefit other speech signal processing domains, such as speech recognition, speaker identification, and low-rate speech coding.