1 Introduction

Hidden Markov Models (HMMs) have been widely used for time series analysis (e.g., speech, text, and image processing). HMM parameters can be estimated by statistical methods whose effectiveness depends on the quality and quantity of the available data, which should be distributed according to the statistical characteristics of the intended signal space or conditions. Since there is no sure way of collecting sufficient data to cover all conditions, adaptive training of HMM parameters, i.e., moving from a set of previously obtained parameters to a new set that befits a specific environment using only a small amount of new data, is an important research issue.

In speech recognition, one approach is to view the adaptation of model parameters to new data as a transformation problem; that is, the new set of model parameters is a transformed version of the old set: \(\lambda_{n+1} = f(\lambda_{n}, \{x\}_{n})\), where \(\{x\}_{n}\) denotes the new set of data available at moment n for the existing model parameters \(\lambda_{n}\) to adapt to. Most frequently and practically, the function f is chosen to be of an affine transformation type [1, 2], \(\lambda_{n+1} = \mathbf{A} \lambda_{n} + \mathbf{b}\), when various parts of the model parameters, e.g., the mean vectors or the variances, are viewed in a vector space. The adaptation algorithm therefore involves deriving the affine map components, A and b, from the adaptation data \(\{x\}_{n}\). A number of algorithms have been proposed for this purpose (see [3, 4] for details; there are many variants of transformation types for HMMs, e.g., [5-9]). Some techniques bear the name "linear regression", and our paper also uses this name by convention. There are many other applications of adaptive training of HMMs beyond speech recognition (e.g., speech synthesis [10], speaker recognition [11], face recognition [12], and activity recognition [13]).

For automatic speech recognition, the number of Gaussian distributions, or simply Gaussians, which are used as component distributions in forming the state-dependent mixture distributions, is typically in the thousands or more. If each mean vector in the set of Gaussians is to be modified by a unique transformation matrix, the number of “adaptation parameters” can therefore be quite large. The main problem of this method is thus how to improve the “generalization capability” by avoiding over-training when the amount of adaptation data is small. To solve this problem, there are mainly two approaches: 1) model selection and 2) prior knowledge utilization.

The model selection approach was originally proposed in the context of estimating linear transformation parameters with the maximum likelihood EM algorithm, known as Maximum Likelihood Linear Regression (MLLR). MLLR shares one linear transformation across a cluster of many Gaussians in the HMM set, thereby effectively reducing the number of free parameters so that they can be trained with a small amount of adaptation data. The Gaussian clusters are usually organized in a tree structure, as shown in Fig. 1, which is pre-determined and fixed throughout adaptation.

Figure 1: Gaussian tree representation of linear regression parameters.

This tree (called a regression tree) is constructed by the centroid splitting algorithm described in [14]. The algorithm first creates two centroid vectors by randomly perturbing the global mean vector computed from the Gaussians assigned to a target leaf node. It then splits this set of Gaussians according to the Euclidean distances between the Gaussian mean vectors and the two centroid vectors. The two resulting sets of Gaussians are assigned to child nodes, and this procedure is repeated to build the tree.
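As an illustration, the splitting step can be sketched in a few lines of Python; the array layout, the perturbation scale eps, the iteration count, and the node dictionary used here are our own illustrative choices, not part of [14]:

    import numpy as np

    def split_node(means, eps=0.2, n_iter=10, rng=np.random.default_rng(0)):
        """Split the mean vectors of one node into two clusters by perturbing
        the node centroid and reassigning by Euclidean distance."""
        centroid = means.mean(axis=0)
        c_left = centroid + eps * rng.standard_normal(centroid.shape)
        c_right = centroid - eps * rng.standard_normal(centroid.shape)
        for _ in range(n_iter):
            d_left = np.linalg.norm(means - c_left, axis=1)
            d_right = np.linalg.norm(means - c_right, axis=1)
            assign = d_left <= d_right
            if assign.all() or (~assign).all():
                break  # degenerate split; stop refining
            c_left, c_right = means[assign].mean(axis=0), means[~assign].mean(axis=0)
        return np.where(assign)[0], np.where(~assign)[0]

    def build_tree(means, ids, min_size=2):
        """Recursively split until the child clusters become too small."""
        node = {"gaussians": ids}
        if len(ids) >= 2 * min_size:
            left, right = split_node(means[ids])
            if len(left) >= min_size and len(right) >= min_size:
                node["left"] = build_tree(means, ids[left], min_size)
                node["right"] = build_tree(means, ids[right], min_size)
        return node

    tree = build_tree(np.random.randn(32, 13), np.arange(32))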

The utility of the tree structure is commensurate with the amount of adaptation data; namely, with a small amount of data, only coarse clusters are used (e.g., the root node in the top layer of Fig. 1), where the number of free parameters in the linear transformation matrices is small. On the other hand, with a sufficiently large amount of data, fine clusters can be used, where the number of free parameters in the linear transformation matrices is large, potentially improving the precision of the estimated parameters. This framework must select appropriate Gaussian clusters according to the amount of data, i.e., it needs an appropriate model selection function. Usually, model selection is performed by manually setting a threshold value (e.g., on the total number of speech frames assigned to the set of Gaussians in a node).
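For reference, a minimal sketch of this threshold-based selection over the tree dictionary sketched above; the occupancy mapping and the threshold value are illustrative assumptions, not values from the paper:

    def select_regression_classes(node, occupancy, threshold=700.0):
        """Descend the tree while both children have enough assigned frames;
        otherwise keep the current (coarser) cluster as a regression class."""
        if "left" in node and "right" in node:
            occ_l = sum(occupancy[k] for k in node["left"]["gaussians"])
            occ_r = sum(occupancy[k] for k in node["right"]["gaussians"])
            if occ_l > threshold and occ_r > threshold:
                return (select_regression_classes(node["left"], occupancy, threshold)
                        + select_regression_classes(node["right"], occupancy, threshold))
        return [node]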

While the regression tree in MLLR can be considered one form of prior knowledge, i.e., knowledge of how the various Gaussian distributions are related, another approach is to explicitly construct and use prior knowledge of the regression parameters in an approximated Bayesian paradigm. For example, Maximum A Posteriori Linear Regression (MAPLR) [15] and quasi-Bayes linear regression [16] replace the ML criterion with the MAP and quasi-Bayes criteria, respectively, in the estimation of the regression parameters. With the explicit prior knowledge acting as a regularization term, MAPLR appears to be less susceptible to over-fitting. MAPLR has been extended to structural MAP (SMAP) [17] and structural MAPLR (SMAPLR) [18], both of which fully utilize the Gaussian tree structure used in the model selection approach to efficiently set the hyper-parameters of the prior distributions. In SMAP and SMAPLR, the hyper-parameters of the prior distribution in a target node are obtained from the statistics in its parent node. Since the total number of speech frames assigned to the set of Gaussians in the parent node is always larger than that in the target node, the statistics obtained in the parent node are more reliable than those in the target node, and they provide good prior knowledge for transformation parameter estimation in the target node. Another extension of MAPLR replaces the MAP approximation with a fully Bayesian treatment of latent models, called variational Bayes (VB). VB has been developed in the machine learning field based on a variational technique [19-23], and has been successfully applied to HMM training in speech recognition [24-31]. VB has also been applied to the estimation of the linear transformation parameters of HMMs [32, 33] to achieve further generalization capability.

This paper also employs VB for the linear regression problem, but we focus on model selection and efficient prior utilization at the same time, in addition to the estimation of the linear transformation parameters of HMMs proposed in previous work [32, 33]. In particular, we consistently use the variational lower bound as the optimization criterion for the model structure and the hyper-parameters, in addition to the posterior distributions of the transformation parameters and the latent variables.Footnote 1 Since this optimization theoretically drives the approximated variational posterior distributions toward the true posterior distributions, in the sense of minimizing the Kullback-Leibler divergence between them, this consistent approach improves the generalization capability [20, 22, 23]. To do this, the paper provides an analytical solution to the variational lower bound by marginalizing all possible transformation parameters and latent variables introduced in the linear regression problem. The solution is based on a variance-normalized representation of the Gaussian mean vectors, which simplifies the solution as normalized-domain MLLR. As a result of the variational calculation, we can marginalize the transformation parameters in all nodes used in the structural prior setting. This is a particular instance of the variational message passing algorithm [34], which is a general framework for variational inference in a graphical model. Furthermore, the optimization of the model topology and hyper-parameters in the proposed approach provides an additional improvement of the generalization capability. For example, the proposed approach infers the linear regression without requiring the Gaussian cluster topology and hyper-parameters to be controlled as tuning parameters. Thus, linear regression for HMM parameters is accomplished without excessive parameterization in a Bayesian sense.

This paper is organized as follows. It first introduces the conventional MLLR framework in Section 2. Then, we provide a formulation of the Bayesian linear regression framework in Section 3. Based on this formulation, Section 4 introduces a practical model selection and hyper-parameter optimization scheme in terms of optimizing the variational lower bound. Section 5 reports unsupervised speaker adaptation experiments on a large vocabulary continuous speech recognition task and confirms the effectiveness of the proposed approach. The mathematical notation used in this paper is summarized in Table 1.

Table 1 Notation list.

2 Linear Regression for Hidden Markov Models Based on Variance Normalized Representation

This section briefly explains a solution for the linear regression parameters for HMMs within a maximum likelihood EM algorithm framework. This paper uses a solution based on a variance normalized representation of Gaussian mean vectors to simplify the solution.Footnote 2 In this paper, we only focus on the transformation of Gaussian mean vectors in HMMs.

2.1 Maximum Likelihood Solution Based on EM Algorithm and Variance Normalized Representation

First, we explain the basic EM algorithm for conventional HMM parameter estimation to set the notational conventions and to align with the subsequent development of the proposed approach. Let \(\mathbf{O} \triangleq \{\mathbf{o}_{t} \in \mathbb{R}^{D} | t = 1, \cdots, T\}\) be a sequence of D-dimensional feature vectors for T speech frames. The latent variables in a continuous density HMM are composed of HMM states and mixture components of GMMs. A sequence of HMM states is represented by \(\mathbf{S} \triangleq \{s_{t} | t = 1, \cdots, T\}\), where the value of s t denotes an HMM state index at frame t. Similarly, a sequence of mixture components is represented by \(\mathbf{Z} \triangleq \{z_{t} | t = 1, \cdots, T\}\), where the value of z t denotes a mixture component index at frame t. The EM algorithm uses the following auxiliary function as the optimization function instead of directly using the model likelihood:

$$ Q(\boldsymbol{\Theta};\hat{\boldsymbol{\Theta}}) \triangleq \left< \log p(\mathbf{O}, \mathbf{S}, \mathbf{Z}|\boldsymbol{\Theta}) \right>_{p(\mathbf{S}, \mathbf{Z}|\mathbf{O}; \hat{\boldsymbol{\Theta}})}, $$
(1)

where Θ is a set of HMM parameters. The brackets 〈·〉 denote the expectation, i.e., \(\left< g(y) \right>_{p(y)} \triangleq \int g(y) p(y) dy\) for a continuous random variable y and \(\left< g(n) \right>_{p(n)} \triangleq \sum_{n} g(n) p(n)\) for a discrete random variable n. p(O, S, Z|Θ) is the complete data likelihood given Θ. p(S, Z|O; Θ̂) is the posterior distribution of the latent variables given the previously estimated HMM parameters Θ̂. Eq. 1 is an expected value, and is efficiently computed by the forward-backward algorithm as the E-step of the EM algorithm.

The M-step of the EM algorithm estimates HMM parameters, as follows:

$$ \bar{\boldsymbol{\Theta}} = \underset{\boldsymbol{\Theta}}{\arg\!\max}\ Q(\boldsymbol{\Theta}; \hat{\boldsymbol{\Theta}}). $$
(2)

The E-step and the M-step are performed iteratively until convergence, finally yielding HMM parameters that closely approximate a stationary-point solution.

Now we focus on the linear transformation parameters within the EM algorithm. We prepare a transformation parameter matrix W j for each leaf node j in a Gaussian tree. Here, we assume that the Gaussian tree has been pruned by a model selection approach into a model structure m, and the set of leaf nodes in the pruned tree is represented as \({\mathcal {J}}_{m}\). Hereinafter, we use V to denote the joint event of S and Z (i.e., V ≜ {S, Z}). This greatly simplifies the following development pertaining to the adaptation of the mean and covariance parameters. Similar to Eq. 1, the auxiliary function with respect to a set of transformation parameters \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}} = \{\mathbf {W}_{j} | j = 1, \cdots, |{\mathcal {J}}_{m}| \} \) can be represented as follows:

$$\begin{array}{rll} Q(\boldsymbol{\Lambda}_{{\mathcal{J}}_{m}}; \hat{\boldsymbol{\Lambda}}_{{\mathcal{J}}_{m}}) &=& \left< \log p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{J}}_{m}}; \boldsymbol{\Theta}) \right>_{p(\mathbf{V}|\mathbf{O}; \boldsymbol{\Theta}, \hat{\boldsymbol{\Lambda}}_{{\mathcal{J}}_{m}})} \\ &=& \sum\limits_{k = 1}^{K} \sum\limits_{t = 1}^{T} \zeta_{k, t} \log {\mathcal{N}}(\mathbf{o}_{t}|\boldsymbol{\mu}_{k}^{ad}, \mathbf{ \Sigma }_{k}), \end{array} $$
(3)

where k denotes a unique mixture component index over all Gaussians in the target HMMs (over all phoneme HMMs in the speech recognition case), and K is the total number of Gaussians. \(\zeta _{k, t} \triangleq p(v_{t} = k|\mathbf {O}; \boldsymbol {\Theta }, \hat {\boldsymbol {\Lambda }}_{{\mathcal {J}}_{m}})\) is the posterior probability of mixture component k at frame t, derived from the previously estimated transformation parameters \(\hat {\boldsymbol {\Lambda }}_{{\mathcal {J}}_{m}}\).Footnote 3 \(\boldsymbol {\mu }_{k}^{ad}\) is the mean vector transformed with \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\); its concrete form is discussed in the next paragraph. In the Q function, we disregard the state transition probabilities and the mixture weights since they do not affect the optimization with respect to \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\). \({\mathcal {N}}(\cdot | \boldsymbol {\mu }, \mathbf {\Sigma })\) denotes a Gaussian distribution with mean parameter μ and covariance matrix parameter Σ, defined as follows:

$$\begin{array}{l} {\mathcal{N}}(\mathbf{o}_{t}|\boldsymbol{\mu}_{k}^{ad}, \boldsymbol{\Sigma}_{k})\\ \triangleq g (\mathbf{ \Sigma }_{k}) \exp \left(-\frac{1}{2} \text{tr} \left[ (\mathbf{ \Sigma }_{k})^{-1} \left(\mathbf{o}_{t} \,-\, \boldsymbol{\mu}_{k}^{ad}\right) \left(\mathbf{o}_{t} \,-\, \boldsymbol{\mu}_{k}^{ad}\right)' \right] \right), \end{array} $$
(4)

where tr[·] and ′ mean the trace and transposition operations of a matrix, respectively. g(Σ k ) is a normalization factor, and is defined as follows:

$$ g (\mathbf{ \Sigma }_{k}) \triangleq (2 \pi)^{-\frac{D}{2}} |\mathbf{ \Sigma }_{k}|^{-\frac{1}{2}}. $$
(5)

In the following paragraphs, we derive Eq. 3 as a function of \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\) to optimize \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\), similar to Eq. 2.

We now consider the concrete form of the transformed mean vector μ k ad based on the variance-normalized representation. We first define the Cholesky decomposition matrix C k as follows:

$$ \mathbf{ \Sigma }_{k} \triangleq \mathbf{C}_{k} (\mathbf{C}_{k})'. $$
(6)

C k is a D×D triangular matrix. If the Gaussian k is included in a set of Gaussians \({\mathcal {K}}_{j}\) in leaf node j (i.e., \(k \in {\mathcal {K}}_{j}\)), the affine transformation of a Gaussian mean vector in a covariance normalized space (C k )−1 μ k ad is represented as follows:

$$ \begin{array}{l} (\mathbf{C}_{k})^{-1} \boldsymbol{\mu}_{k}^{ad} = \mathbf{W}_{j} \left(\begin{array}{c} 1 \\ (\mathbf{C}_{k})^{-1} \boldsymbol{\mu}_{k}^{ini} \end{array} \right). \\ \Rightarrow \boldsymbol{\mu}_{k}^{ad} = \mathbf{C}_{k} \mathbf{W}_{j} \left(\begin{array}{c} 1 \\ (\mathbf{C}_{k})^{-1} \boldsymbol{\mu}_{k}^{ini} \end{array} \right) \triangleq \mathbf{C}_{k} \mathbf{W}_{j} \boldsymbol{\xi}_{k}. \end{array} $$
(7)

ξ k is an augmented normalized vector of an initial (non-adapted) Gaussian mean vector μ k ini. W j is a D×(D+1) affine transformation matrix. j is a leaf node index that holds a set of Gaussians. Namely, transformation parameter W j is shared among a set of Gaussians \({\mathcal {K}}_{j}\). The clustered structure of the Gaussians is usually represented as a binary tree where a set of Gaussians belongs to each node.
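A minimal sketch of the transformation in Eq. 7, assuming a full covariance matrix (for a diagonal covariance, C_k is simply the diagonal matrix of standard deviations); the function name is ours:

    import numpy as np

    def transform_mean(mu_ini, Sigma, W):
        """mu_ad = C_k W_j [1, (C_k)^{-1} mu_ini]'  (Eq. 7), with Sigma_k = C_k C_k'."""
        C = np.linalg.cholesky(Sigma)                              # D x D lower-triangular factor
        xi = np.concatenate(([1.0], np.linalg.solve(C, mu_ini)))   # augmented normalized mean xi_k
        return C @ (W @ xi)                                        # back to the original space

    D = 13
    mu, Sigma = np.random.randn(D), np.eye(D)
    W = np.hstack([np.zeros((D, 1)), np.eye(D)])                   # identity transform
    assert np.allclose(transform_mean(mu, Sigma, W), mu)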

The Q function of \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\) is represented by substituting Eqs. 7 and 4 into Eq. 3 as follows:

$$\begin{array}{rll} Q({}\boldsymbol{\Lambda}_{{\mathcal{J}}_{m}}; \hat{\boldsymbol{\Lambda}}_{{\mathcal{J}}_{m}}{}){} &=& \sum\limits_{j \in {\mathcal{J}}_{m}} \sum\limits_{k \in {\mathcal{K}}_{j}} \sum\limits_{t = 1}^{T} \zeta_{k, t} \log {\mathcal{N}} (\mathbf{o}_{t} |\mathbf{C}_{k} \mathbf{W}_{j} \boldsymbol{\xi}_{k}, \mathbf{\Sigma}_{k})\\ &=& \sum\limits_{j \in {\mathcal{J}}_{m}} \left(\sum\limits_{k \in {\mathcal{K}}_{j}} \zeta_{k} \log g(\mathbf{\Sigma}_{k})\right. \\ &&\left.- \frac{1}{2} \text{tr} {} \left[ \mathbf{W}_{j} ' \mathbf{W}_{j} \boldsymbol{ \Xi }_{j} {}-{}2 \mathbf{W}_{j}' \mathbf{Z}_{j}{}+{}\sum\limits_{k\in{\mathcal{K}}_{j}} \mathbf{ \Sigma }_{k}^{-1} \mathbf{S}_{k} \right]{}\right) {}, \end{array} $$
(8)

where Ξ j and Z j are 0th and 1st order statistics of linear regression parameters defined as:

$$ \left \{ \begin{aligned} \mathbf{ \Xi }_{j} & \triangleq \sum\limits_{k \in {\mathcal{K}}_{j}} \boldsymbol{\xi}_{k} (\boldsymbol{\xi}_{k})' \zeta_{k}. \\ \mathbf{Z}_{j} & \triangleq \sum\limits_{k \in {\mathcal{K}}_{j}} (\mathbf{C}_{k})^{-1} \boldsymbol{\nu}_{k} (\boldsymbol{\xi}_{k})'. \end{aligned} \right. $$
(9)

Here Z j is a D×(D+1) matrix and Ξ j is a (D+1)×(D+1) symmetric matrix. ζ k , ν k , and S k are defined as follows:

$$ \left \{ \begin{aligned} \zeta_{k} & = \sum\limits_{t = 1}^{T} \zeta_{k, t} \\ \boldsymbol{\nu}_{k} & = \sum\limits_{t = 1}^{T} \zeta_{k, t} \mathbf{o}_{t} \\ \mathbf{S}_{k} & = \sum\limits_{t = 1}^{T} \zeta_{k, t} \mathbf{o}_{t} \mathbf{o}_{t} ' \end{aligned} \right. $$
(10)

These are the 0th, 1st, and 2nd order sufficient statistics of Gaussians in HMMs, respectively.
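These accumulations are straightforward to implement once the frame-level posteriors are available; a sketch assuming they are stored in a K x T array zeta produced by the forward-backward pass (function and variable names are ours):

    import numpy as np

    def accumulate_gaussian_stats(O, zeta):
        """0th/1st/2nd order statistics of Eq. 10 (O: T x D, zeta: K x T)."""
        zeta_k = zeta.sum(axis=1)                        # sum_t zeta_{k,t}
        nu_k = zeta @ O                                  # sum_t zeta_{k,t} o_t
        S_k = np.einsum('kt,td,te->kde', zeta, O, O)     # sum_t zeta_{k,t} o_t o_t'
        return zeta_k, nu_k, S_k

    def accumulate_regression_stats(cluster, zeta_k, nu_k, C, xi):
        """Xi_j and Z_j of Eq. 9 for the Gaussians in one cluster
        (C: K x D x D Cholesky factors, xi: K x (D+1) augmented normalized means)."""
        Xi_j = sum(zeta_k[k] * np.outer(xi[k], xi[k]) for k in cluster)
        Z_j = sum(np.outer(np.linalg.solve(C[k], nu_k[k]), xi[k]) for k in cluster)
        return Xi_j, Z_j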

Since Eq. 8 is a quadratic form with respect to W j , we can obtain the optimal W̄ j , similar to Eq. 2. By differentiating the Q function with respect to W j , we derive the following equation:Footnote 4

$$ \frac{\partial}{\partial \mathbf{W}_{j}} Q(\boldsymbol{\Lambda}_{{\mathcal{J}}_{m}}; \hat{\boldsymbol{\Lambda}}_{{\mathcal{J}}_{m}}) = 0. \Rightarrow \mathbf{Z}_{j} - \bar{\mathbf{W}}_{j} \mathbf{ \Xi }_{j} = 0. $$
(11)

Thus, we can obtain the following analytical solution:

$$ \bar{\mathbf{W}}_{j} = \mathbf{Z}_{j} \mathbf{ \Xi }_{j}^{-1}. $$
(12)

Therefore, the optimized mean vector parameter is represented as:

$$ \boldsymbol{\mu}^{ad}_{k} = \mathbf{C}_{k} \mathbf{Z}_{j} \mathbf{ \Xi }_{j}^{-1} \boldsymbol{\xi}_{k}. $$
(13)

Therefore, μ k ad is analytically obtained by using the statistics (Z j and Ξ j in Eq. 9) and initial HMM parameters (C k and ξ k ). This solution corresponds to the M-step of the EM algorithm, and the E-step is performed by the forward-backward algorithm, similarly to that of HMMs, to compute these statistics.
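Using the hypothetical helpers above, the M-step of Eqs. 12-13 amounts to one linear solve per cluster; a minimal sketch:

    import numpy as np

    def mllr_m_step(Xi_j, Z_j):
        """W_j = Z_j Xi_j^{-1}  (Eq. 12); Xi_j is symmetric, so we solve rather than invert."""
        return np.linalg.solve(Xi_j, Z_j.T).T

    def adapt_means(cluster, W_j, C, xi):
        """mu_k^{ad} = C_k W_j xi_k  (Eq. 13) for every Gaussian k in the cluster."""
        return {k: C[k] @ (W_j @ xi[k]) for k in cluster}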

3 Bayesian Linear Regression

This section provides an analytical solution for Bayesian linear regression by using a variational lower bound. While the previous section only considers regression matrices in the leaf nodes \(j \in {\mathcal {J}}_{m}\), here we also consider regression matrices in both leaf and non-leaf nodes \(i \in {\mathcal {I}}_{m}\) of the Gaussian tree given a model structure m. We then focus on the set of regression matrices in all nodes \(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}} = \{\mathbf{W}_{i} | i = 1, \cdots, |{\mathcal{I}}_{m}|\}\), instead of \(\boldsymbol {\Lambda }_{{\mathcal {J}}_{m}}\), and marginalize \(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}\) in a Bayesian manner. This extension incorporates the structural prior setting proposed in SMAP and SMAPLR [17, 18].

In this section, we mainly deal with:

  • the prior distribution of model parameters \(p(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}; m,\boldsymbol { \Psi }) \)

  • the true posterior distribution of model parameters and latent variables \(p (\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V} | \mathbf {O}; m, \mathbf {\Psi })\)

  • the variational posterior distribution of model parameters and latent variables \(q (\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V} | \mathbf {O}; m, \mathbf {\Psi })\)

  • the output distribution \(p(\mathbf {O}, \mathbf {V}|\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}; \boldsymbol {\Theta })\)

For simplicity, we omit some conditional variables in these distribution functions, as follows:

$$\begin{array}{rll}p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}; m, \mathbf{ \Psi })&\rightarrow&p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}})\\ p (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V} | \mathbf{O}; m, \mathbf{ \Psi }) &\rightarrow& p (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V} | \mathbf{O})\\ q (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V} | \mathbf{O}; m, \mathbf{ \Psi }) &\rightarrow& q (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V})\\ p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}; \boldsymbol{\Theta}) &\rightarrow& p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \end{array} $$

3.1 Variational Lower Bound

With regard to the variational Bayesian approach, we first focus on the following marginalized log likelihood p(O; Θ, m, Ψ), with a set of HMM parameters Θ, a set of hyper-parameters Ψ, and a model structure m.Footnote 5 Footnote 6

$$\begin{array}{rll} \log p(\mathbf{O}; \boldsymbol{\Theta}, m, \boldsymbol{ \Psi }) &=& \log \left(\int \sum\limits_{\mathbf{V}} p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}; \boldsymbol{\Theta}) p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}; m, \boldsymbol{\Psi}) d \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}\right), \end{array} $$
(14)

where \(p(\mathbf {O}, \mathbf {V}|\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}; \boldsymbol {\Theta })\) is the output distribution of the transformed HMM parameters with transformed mean vectors μ k ad. p(Λ I m ; m, Ψ) is a prior distribution of transformation matrices Λ I m . In the following explanation, we omit Θ, m, and Ψ in the prior distribution and output distribution for simplicity, i.e., p(Λ I m ; m, Ψ) → p (Λ I m ), and p(O, V|Λ I m ; Θ) → p(O, V|Λ I m ).

The variational Bayesian approach focuses on the lower bound of the marginalized log likelihood \({\mathcal {F}} (m, \mathbf { \Psi })\) with a set of hyper-parameters Ψ and a model structure m, as follows:

$$\begin{array}{lll} &&\log p(\mathbf{O}; \boldsymbol{\Theta}, m, \mathbf{ \Psi })\\ = \log \left(\int \sum\limits_{\mathbf{V}} \frac{p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}})}{q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V})} q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V}) d \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}} \right)\\ &&\quad \geq \underbrace{\left< \log \frac{p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}})}{q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V})} \right>_{q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V})}}_{\triangleq {\mathcal{F}} (m, \mathbf{\Psi})}. \end{array} $$
(15)

The inequality in Eq. 15 follows from Jensen's inequality: \(\log \left< X \right>_{p(X)} \geq \left< \log X \right>_{p(X)}\). \(q(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V})\) is an arbitrary distribution, and is optimized by a variational method discussed later. For simplicity, we omit m, Ψ, and O from the distributions. The variational lower bound is a better approximation of the marginalized log likelihood than the auxiliary functions of the maximum likelihood EM and maximum a posteriori EM algorithms, which point-estimate the model parameters, especially for small amounts of training data [21-23]. Therefore, variational Bayes can mitigate the sparse data problem that the conventional approaches must confront.

Variational Bayes regards the variational lower bound \({\mathcal {F}} (m, \mathbf { \Psi })\) as an objective function for the model structure and hyper-parameters, and as an objective functional for the joint posterior distribution of the transformation parameters and latent variables [22, 23]. In particular, if we consider the true posterior distribution \(p (\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V} | \mathbf {O})\) (we omit the conditional variables m and Ψ for simplicity), we obtain the following relationship:

$$\begin{array}{rll} \text{KL} \left[ q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V}) || p (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V} | \mathbf{O}) \right] &=& \log p(\mathbf{O}; \boldsymbol{\Theta}, m, \boldsymbol{\Psi}) - {\mathcal{F}} (m, \boldsymbol{\Psi}). \end{array} $$
(16)

This equation means that maximizing the variational lower bound \({\mathcal {F}} (m, \mathbf { \Psi })\) with respect to \(q(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V})\) indirectly minimizes the Kullback-Leibler (KL) divergence between \(q(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V})\) and \(p (\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V} | \mathbf {O})\). Therefore, this optimization finds a \(q(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V})\) that approaches the true posterior distribution.Footnote 7

Thus, in principle, we can straightforwardly obtain the (sub) optimal model structure, hyper-parameters, and posterior distribution, as follows:

$$\begin{array}{rll} \tilde{m} &=& \underset{m}{\arg\!\max}\ {\mathcal{F}} (m, \mathbf{ \Psi }).\\ \tilde{\mathbf{ \Psi }} &=& \underset{\mathbf{ \Psi }}{\arg\!\max}\ {\mathcal{F}} (m, \mathbf{ \Psi }).\\ \tilde{q}(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V}) &=& \underset{q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}, \mathbf{V})}{\arg\!\max}\ {\mathcal{F}} (m, \mathbf{ \Psi }). \end{array} $$
(17)

These optimization steps are performed alternately, and finally lead to locally optimal solutions, similar to the EM algorithm. However, it is difficult to deal with the joint distribution \(q(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}}, \mathbf {V})\) directly, so we propose to factorize it by utilizing the Gaussian tree structure. In addition, we set a conjugate form for the prior distribution \(p(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}})\). This procedure is a typical VB recipe for making the solution mathematically tractable, similar to the classical Bayesian adaptation approach.

3.2 Structural Prior Distribution Setting in a Binary Tree

We utilize the Gaussian tree structure to factorize the prior distribution \(p(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}})\). We consider a binary tree structure, but the formulation is applicable to general non-binary trees. We define the parent node of i as p(i), the left child node of i as l(i), and the right child node of i as r(i), as shown in Fig. 2, where a transformation matrix is prepared for each node i. Defining W 1 as the transformation matrix in the root node, we assume the following factorization for the prior distribution \(p(\boldsymbol {\Lambda }_{{\mathcal {I}}_{m}})\):

$$\begin{array}{rll} p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) &=& p(\mathbf{W}_{1}, \cdots, \mathbf{W}_{|{\mathcal{I}}_{m}|}) \\ &=& p (\mathbf{W}_{1}) p (\mathbf{W}_{{\mathsf{l}}(1)} | \mathbf{W}_{1}) p (\mathbf{W}_{{\mathsf{r}}(1)} | \mathbf{W}_{1}) \\ &&p(\mathbf{W}_{{\mathsf{l}}({\mathsf{l}}(1))} | \mathbf{W}_{{\mathsf{l}}(1)}) p(\mathbf{W}_{{\mathsf{r}}({\mathsf{l}}(1))} | \mathbf{W}_{{\mathsf{l}}(1)}) \\ &&p(\mathbf{W}_{{\mathsf{l}}({\mathsf{r}}(1))} | \mathbf{W}_{{\mathsf{r}}(1)}) p(\mathbf{W}_{{\mathsf{r}}({\mathsf{r}}(1))} | \mathbf{W}_{{\mathsf{r}}(1)}) \cdots \\ &=& \prod_{i \in {\mathcal{I}}_{m}} p (\mathbf{W}_{i} | \mathbf{W}_{{\mathsf{p}}(i)}). \end{array} $$
(18)

To make the prior distribution a product form in the last line of Eq. 18, we define p(W 1) ≜ p(W 1|W p(1)). As seen, the effect of the transformation matrix in a target node propagates to its child nodes.

Figure 2: Binary tree structure with transformation matrices. If we focus on node i, the transformation matrices in the parent node, left child node, and right child node are represented as W p(i), W l(i), and W r(i), respectively.

This prior setting is based on the intuitive assumption that the statistics in a target node are highly correlated with the statistics in its parent node. In addition, since the total number of speech frames assigned to the set of Gaussians in the parent node is always larger than that in the target node, the statistics obtained in the parent node are more reliable than those in the target node, and they provide good prior knowledge for the transformation parameter estimation in the target node.

With a Bayesian approach, we need to set a practical form for the above prior distributions. A conjugate distribution is preferable as far as obtaining an analytical solution is concerned, and we use a matrix variate normal distribution, as in Maximum A Posteriori Linear Regression (MAPLR) [15]. The matrix variate normal distribution is defined as follows:

$$\begin{array}{l} p(\mathbf{W}_{i}) = {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{i}, \boldsymbol{ \Phi }_{i}, \boldsymbol{\Omega}_{i})\\ \triangleq {}\frac{\exp\left({}-{} \frac{1}{2} \text{tr} \left [ (\mathbf{W}_{i} {}-{} \mathbf{M}_{i})' \boldsymbol{ \Phi }_{i}^{-1} (\mathbf{W}_{i} {}-{} \mathbf{M}_{i}) \boldsymbol{\Omega}_{i}^{-1}{}\right] \right)} {(2 \pi)^{D(D+1)/2} |\boldsymbol{ \Omega }_{i}|^{D/2} |\boldsymbol{ \Phi }_{i}|^{(D+1)/2}}, \end{array} $$
(19)

where M i is a D×(D+1) location matrix, Ω i is a (D+1)×(D+1) symmetric scale matrix, and Φ i is a D×D symmetric scale matrix. Ω i represents the correlation among column vectors, and Φ i represents the correlation among row vectors. These are the hyper-parameters of the matrix variate normal distribution. There are many hyper-parameters to be set, which makes the implementation complicated. In this paper, we therefore seek a conjugate distribution with fewer hyper-parameters than Eq. 19. To obtain a simple solution in the final analytical results, we use a spherical Gaussian distribution with the following constraints on Ω i and Φ i :

$$\begin{array}{rll} \boldsymbol{ \Phi }_{i} &\approx& \mathbf{I}_{D},\\ \boldsymbol{ \Omega }_{i} &\approx& \rho_{i}^{-1} \mathbf{I}_{D+1}, \end{array} $$
(20)

where I D is the D×D identity matrix. ρ i indicates a precision parameter. Then, Eq. 19 can be rewritten as follows:

$$\begin{array}{l} {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{i}, \mathbf{I}_{D}, \rho^{-1}_{i} \mathbf{I}_{D+1})\\ = h(\rho_{i}^{-1} \mathbf{I}_{D+1}) \exp \left(- \frac{1}{2} \text{tr} [ \rho_{i} (\mathbf{W}_{i} - \mathbf{M}_{i})' (\mathbf{W}_{i} - \mathbf{M}_{i}) ] \right), \end{array} $$
(21)

where h(ρ i −1 I D+1) is a normalization factor, and defined as

$$ h(\rho_{i}^{-1} \mathbf{I}_{D+1}) \triangleq \left(\frac{\rho_{i}}{2 \pi} \right)^{\frac{D(D+1)}{2}}. $$
(22)

This approximation means that the matrix elements have no correlation with each other. It yields simple solutions for Bayesian linear regression.Footnote 8
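For reference, under the constraints of Eq. 20 the density of Eqs. 21-22 is simply an isotropic Gaussian over the matrix entries; a small sketch of its log value (the function and variable names are ours):

    import numpy as np

    def log_spherical_matrix_normal(W, M, rho):
        """log N(W | M, I_D, rho^{-1} I_{D+1})  (Eqs. 21-22)."""
        D, Dp1 = W.shape
        log_h = 0.5 * D * Dp1 * np.log(rho / (2.0 * np.pi))    # normalization factor, Eq. 22
        return log_h - 0.5 * rho * np.sum((W - M) ** 2)        # tr[(W-M)'(W-M)] = squared Frobenius norm

    D = 3
    W = np.random.randn(D, D + 1)
    print(log_spherical_matrix_normal(W, np.zeros((D, D + 1)), rho=1.0))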

Based on the spherical matrix variate normal distribution, the conditional prior distribution p(W i |W p(i)) in Eq. 18 is obtained by setting the location matrix to the transformation matrix W p(i) of the parent node, with precision parameter ρ i , as follows:

$$ p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) = {\mathcal{N}} (\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \rho_{i}^{-1} \mathbf{I}_{D+1}) $$
(23)

Note that in the following sections W i and W p(i) are marginalized. In addition, we set the location matrix in the root node to the deterministic value W p(1) = [0, I D ]. Since \(\boldsymbol {\mu }_{k}^{ad} = \mathbf {C}_{k} \mathbf {W}_{{\mathsf {p}}(1)} \boldsymbol {\xi }_{k} = \boldsymbol {\mu }_{k}^{ini}\) from Eq. 7, this hyper-parameter setting means that the initial mean vectors are not changed if we only use the prior knowledge. This makes sense for a small amount of data, since it keeps the HMM parameters at their initial values; in a sense it also inherits the philosophical background of Bayesian adaptation, although the objective function has been changed from a posterior probability to a lower bound on the marginalized likelihood. Therefore, we only have \(\{\rho _{i} | i = 1, \cdots, |{\mathcal {I}}_{m}|\}\) as the set of hyper-parameters Ψ, which will also be optimized in our framework.
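A small sketch of this root-node setting and the resulting identity behavior; the function name is ours:

    import numpy as np

    def root_prior_location(D):
        """W_{p(1)} = [0, I_D]: combined with Eq. 7 it maps every initial mean onto itself."""
        return np.hstack([np.zeros((D, 1)), np.eye(D)])

    # Sanity check: with only the prior location, mu_k^{ad} equals mu_k^{ini}.
    D = 13
    mu, C = np.random.randn(D), np.eye(D)
    xi = np.concatenate(([1.0], np.linalg.solve(C, mu)))
    assert np.allclose(C @ (root_prior_location(D) @ xi), mu)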

3.3 Variational Calculus

In VB, we also assume the following factorized form for the posterior distribution \(q (\mathbf {V}, \boldsymbol {\Lambda }_{{\mathcal {I}}_{m}})\):

$$ q (\mathbf{V}, \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) = q(\mathbf{V}) q(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) = q(\mathbf{V}) \prod_{i \in {\mathcal{I}}_{m}} q (\mathbf{W}_{i}) $$
(24)

Then, from the variational calculation for \({\mathcal {F}} (m, \mathbf { \Psi })\) with respect to q(W i ), we obtain the following (sub) optimal solution for q(W i ):

$$\begin{array}{lll} && \log \tilde{q} (\mathbf{W}_{i})\\ && \quad \propto \left< \left< \log p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q(\mathbf{V})} + \log p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{\prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})}\\ && \quad \propto \sum\limits_{i'' \in {\mathcal{I}}_{m}} \left< \log p (\mathbf{W}_{i''} | \mathbf{W}_{{\mathsf{p}}(i'')}) \right>_{\prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})} + \left< \left< \log p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q(\mathbf{V})} \right>_{\prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})}, \end{array}$$
(25)

where we use Eqs. 18 and 24 to rewrite the equation. The operator \(\propto\) denotes proportionality between the left- and right-hand sides of probability distribution functions. It is a useful expression since we do not have to write normalization factors explicitly; these are disregarded in the following calculations. In Eq. 25, \(\propto\) is also used in the logarithmic domain, where normalization factors appear as constant terms.

In this expectation, we can consider the following two cases of variational posterior distributions:

3.3.1 1) Leaf Node

We first focus on the prior term of Eq. 25. If i is a leaf node, we can disregard the expectations with respect to \(\prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf {W}_{i'})\) for all nodes other than the parent node p(i) of the target leaf node. Thus, we obtain the following simple solution:

$$\begin{array}{rll} \log \tilde{q} (\mathbf{W}_{i}) &\propto& \left< \log p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) \right >_{q (\mathbf{W}_{{\mathsf{p}}(i)}) } \\ && +\ \left< \left < \log p(\mathbf{O}, \mathbf{V}| \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q(\mathbf{V})} \right>_{ \prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})} \end{array} $$
(26)

3.3.2 2) Non-Leaf Node (with child nodes)

Similarly, if i is a non-leaf node, in addition to the parent node p(i) of the target node, we also have to consider the child nodes l(i) and r(i) of the target node for the expectation, as follows:

$$ \log \tilde{q} (\mathbf{W}_{i}) \propto \left\langle \log p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) \right\rangle_{q (\mathbf{W}_{{\mathsf{p}}(i)}) } $$
((27-1))
$$ {\kern40pt}\quad+ \left\langle \log p(\mathbf{W}_{{\mathsf{l}}(i)}|\mathbf{W}_{i}) \right \rangle_{q (\mathbf{W}_{{\mathsf{l}}(i)}) } $$
((27-2))
$$ {\kern40pt}\quad+ \left\langle \log p(\mathbf{W}_{{\mathsf{r}}(i)}|\mathbf{W}_{i}) \right \rangle_{q (\mathbf{W}_{{\mathsf{r}}(i)}) } $$
((27-3))
$$ {\kern40pt}\quad+ \left \langle \left \langle \log p(\mathbf{O}, \mathbf{V}| \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right\rangle_{q(\mathbf{V})} \right\rangle_{\prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})} $$
((27-4))

In both cases, the posterior distribution of the transformation matrix in the target node depends on those in the parent and child nodes. Therefore, the posterior distributions are calculated iteratively. This inference is known as the variational message passing algorithm [34], and Eqs. 26 and 27 are its specific solutions for a binary tree structure. The next section provides the concrete form of the posterior distribution of the transformation matrix.

3.4 Posterior Distribution of Transformation Matrix

We first focus on Eq. (27), which generalizes Eq. 26 by adding terms for the child nodes. Equation (27-4) involves the expectations with respect to \(\prod_{i' \in {\mathcal{I}}_{m} \setminus i } q (\mathbf {W}_{i'})\) and q(V). The term with q(V) can be written, similarly to Eq. 8, as:

$$\begin{array}{lll} &&\left < \log p(\mathbf{O}, \mathbf{V}| \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q(\mathbf{V})} \\ = \sum\limits_{i \in {\mathcal{I}}_{m}} \left(\sum\limits_{k \in {\mathcal{K}}_{i}} \zeta_{k} \log g(\boldsymbol{ \Sigma }_{k})\right.\\ && \qquad \qquad \quad \left. - \frac{1}{2} \text{tr} \left[ \mathbf{W}_{i} ' \mathbf{W}_{i} \boldsymbol{ \Xi }_{i} - 2 \mathbf{W}_{i}' \mathbf{Z}_{i} + \sum\limits_{k \in {\mathcal{K}}_{i}} \boldsymbol{ \Sigma }_{k}^{-1} \mathbf{S}_{k} \right] \right). \end{array} $$
(28)

Here the sufficient statistics (ζ k , S k , Ξ i , and Z i in Eqs. 9 and 10) are computed by the VB E-step (e.g., ζ k,t = q(v t = k)), which is described in the next section. This equation form means that the term factorizes over nodes i. This factorization property is important for the following analytical solutions and algorithm. By taking the expectation with respect to \(\prod _{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W} _{i'})\), we can integrate out the terms that do not depend on W i , as follows:

$$\begin{array}{l} \left < \left < \log p(\mathbf{O}, \mathbf{V}| \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q(\mathbf{V})} \right>_{ \prod_{i' \in {\mathcal{I}}_{m} \setminus i} q (\mathbf{W}_{i'})} \\ \propto - \frac{1}{2} \text{tr} \left [ \mathbf{W}_{i} ' \mathbf{W}_{i} \boldsymbol{ \Xi }_{i} - 2 \mathbf{W}_{i}' \mathbf{Z}_{i} \right ]. \end{array} $$
(29)

Next, we consider Eq. (27-1). Since we use a conjugate prior distribution, q(W p(i)) is also represented by the following matrix variate normal distribution, which belongs to the same distribution family as the prior:

$$ q (\mathbf{W}_{{\mathsf{p}}(i)}) = {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \boldsymbol{\Omega}_{{\mathsf{p}}(i)}) $$
(30)

Note that the posterior distribution has the particular form in which the first covariance matrix is an identity matrix while the second is a symmetric matrix. We discuss this form together with the analytical solution later.

By substituting Eqs. 21 and 30 into Eq. (27-1), Eq. (27-1) is represented as follows:

$$\begin{array}{rll} \left< \log p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) \right >_{q (\mathbf{W}_{{\mathsf{p}}(i)}) }\\ && = \int \left(\log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{i} \mathbf{I}_{D+1}) \right) \\ && \qquad \qquad {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \boldsymbol{ \Omega }_{{\mathsf{p}}(i)}) d \mathbf{W}_{{\mathsf{p}}(i)} \end{array} $$
(31)

To solve the integral, we use the following matrix distribution formula:

$$\begin{array}{rll} \int {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \boldsymbol{ \Omega }_{{\mathsf{p}}(i)}) d \mathbf{W}_{{\mathsf{p}}(i)} &=& 1\\ \int \mathbf{W}_{{\mathsf{p}}(i)} {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \boldsymbol{ \Omega }_{{\mathsf{p}}(i)}) d \mathbf{W}_{{\mathsf{p}}(i)} &=& \mathbf{M}_{{\mathsf{p}}(i)} \end{array} $$
(32)

Then, by disregarding the terms that do not depend on W i , Eq. 31 can be solved as the logarithmic function of the matrix variate normal distribution that has the posterior distribution parameter M p(i) as a hyper-parameter.

$$\begin{array}{rll} \left< \log p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) \right >_{q (\mathbf{W}_{{\mathsf{p}}(i)}) }\\ && \propto \rho_{i} \int \text{tr} [ \mathbf{W}_{i}' \mathbf{W}_{{\mathsf{p}}(i)} ] {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \mathbf{ \Omega }_{{\mathsf{p}}(i)}) d \mathbf{W}_{{\mathsf{p}}(i)}\\ && \quad \quad - \frac{\rho_{i}}{2} \int \text{tr} [ \mathbf{W}_{i}' \mathbf{W}_{i} ] {\mathcal{N}}(\mathbf{W}_{{\mathsf{p}}(i)}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \mathbf{ \Omega }_{{\mathsf{p}}(i)}) d \mathbf{W}_{{\mathsf{p}}(i)} \\ && \propto \rho_{i} \text{tr} [ \mathbf{W}_{i}' \mathbf{M}_{{\mathsf{p}}(i)} ] - \frac{\rho_{i}}{2} \text{tr} [ \mathbf{W}_{i}' \mathbf{W}_{i} ] \\ && \propto \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{i} \mathbf{I}_{D+1}) \end{array} $$
(33)

Similarly, Eqs. (27-2) and (27-3) are solved as follows:

$$\begin{array}{rll} \left< \log p(\mathbf{W}_{{\mathsf{l}}(i)}|\mathbf{W}_{i}) \right >_{q (\mathbf{W}_{{\mathsf{l}}(i)}) }\\ && \propto \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{l}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{{\mathsf{l}}(i)} \mathbf{I}_{D+1})\\ \left< \log p(\mathbf{W}_{{\mathsf{r}}(i)}|\mathbf{W}_{i}) \right >_{q (\mathbf{W}_{{\mathsf{r}}(i)}) }\\ &&\propto \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{r}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{{\mathsf{r}}(i)} \mathbf{I}_{D+1}) \end{array} $$
(34)

Thus, the expected value terms of the three prior distributions in Eq. (27) are represented as the following matrix variate normal distribution:

$$\begin{array}{rll} \left< \log p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) \right >_{q (\mathbf{W}_{{\mathsf{p}}(i)}) } \\ && + \left< \log p(\mathbf{W}_{{\mathsf{l}}(i)}|\mathbf{W}_{i}) \right >_{q (\mathbf{W}_{{\mathsf{l}}(i)}) } \\ &&+ \left< \log p(\mathbf{W}_{{\mathsf{r}}(i)}|\mathbf{W}_{i}) \right >_{q (\mathbf{W}_{{\mathsf{r}}(i)}) } \\ \propto \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{p}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{i} \mathbf{I}_{D+1}) \\ &&+ \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{l}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{{\mathsf{l}}(i)} \mathbf{I}_{D+1}) \\ && + \log {\mathcal{N}}(\mathbf{W}_{i}|\mathbf{M}_{{\mathsf{r}}(i)}, \mathbf{I}_{D}, \rho^{-1}_{{\mathsf{r}}(i)} \mathbf{I}_{D+1}) \\ \propto \log {\mathcal{N}} \left(\mathbf{W}_{i}\left|\frac{\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)} + \rho_{{\mathsf{l}}(i)} \mathbf{M}_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)} \mathbf{M}_{{\mathsf{r}}(i)}}{\rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)}},\right.\right. \\ && \left. \qquad \qquad \qquad \mathbf{I}_{D}, (\rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)})^{-1} \mathbf{I}_{D+1} \vphantom{ \frac{\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)} + \rho_{{\mathsf{l}}(i)} \mathbf{M}_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)} \mathbf{M}_{{\mathsf{r}}(i)}}{\rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)}} } \right) \end{array} $$
(35)

It is an intuitive solution, since the location parameter of W i is represented as a linear interpolation of the location values of the posterior distributions in the parent and child nodes. The precision parameters control the linear interpolation ratio.

Similarly, we can also obtain the expected value term of the prior term in Eq. 26, and we summarize the prior terms of the non-leaf and leaf node cases as follows:

$$ \hat{q}(\mathbf{W}_{i}) = {\mathcal{N}}(\mathbf{W}_{i}|\hat{\mathbf{M}}_{i}, \mathbf{I}_{D}, \hat{\rho}^{-1}_{i} \mathbf{I}_{D+1}) $$
(36)

where

$$\begin{array}{rll} \hat{\mathbf{M}}_{i} &=& \left\{ \begin{array}{ll} \frac{\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)} + \rho_{{\mathsf{l}}(i)} \mathbf{M}_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)} \mathbf{M}_{{\mathsf{r}}(i)}}{\rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)}} &\text{Non-leaf node} \\ \mathbf{M}_{{\mathsf{p}}(i)} & \text{Leaf node} \\ \end{array} \right.\\ \hat{\rho}_{i} &=& \left\{ \begin{array}{ll} \rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)} & \text{Non-leaf node} \\ \rho_{i} & \text{Leaf node} \\ \end{array} \right. \end{array} $$
(37)

Thus, the effect of the prior distributions differs depending on whether the target node is a non-leaf node or a leaf node. This solution differs from our previous solution [37], since the previous solution does not marginalize the transformation parameters in the non-leaf nodes. In the Bayesian sense, the present solution is stricter than the previous one.

Based on Eqs. 28 and 36, we can finally derive the quadratic form of W i as follows:

$$\begin{array}{l} \log (\tilde{q}(\mathbf{W}_{i}))\\ \propto - \frac{1}{2} \text{tr} \left [ \hat{\rho}_{i} \mathbf{W}_{i}' \mathbf{W}_{i} + \mathbf{W}_{i} ' \mathbf{W}_{i} \boldsymbol{ \Xi }_{i} - 2 \hat{\rho}_{i} \mathbf{W}_{i}' \hat{\mathbf{M}}_{i} - 2 \mathbf{W}_{i}' \mathbf{Z}_{i} \right ]\\ = - \frac{1}{2} \text{tr} \left [ \mathbf{W}_{i}' \mathbf{W}_{i} (\hat{\rho}_{i} \mathbf{I}_{D+1} + \boldsymbol{ \Xi }_{i}) - 2 \mathbf{W}_{i}' (\hat{\rho}_{i} \hat{\mathbf{M}}_{i} + \mathbf{Z}_{i}) \right ], \end{array} $$
(38)

where we disregard the terms that do not depend on W i . Thus, by defining the following matrix variables

$$\begin{array}{rll} \tilde{\boldsymbol{ \Omega }}_{i} &=& (\hat{\rho}_{i} \mathbf{I}_{D+1} + \boldsymbol{ \Xi }_{i})^{-1} \\ &=& \left\{ \begin{array}{ll} ((\rho_{i} + \rho_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)}) \mathbf{I}_{D+1} + \boldsymbol{ \Xi }_{i})^{-1} & \text{Non-leaf node} \\ (\rho_{i} \mathbf{I}_{D+1} + \boldsymbol{ \Xi }_{i})^{-1} & \text{Leaf node} \end{array}\right. \\ \tilde{\mathbf{M}}_{i} &=& (\hat{\rho}_{i} \hat{\mathbf{M}}_{i} + \mathbf{Z}_{i}) \tilde{\boldsymbol{\Omega}}_{i} \\ &=& \left\{ \begin{array}{ll} (\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)} + \rho_{{\mathsf{l}}(i)} \mathbf{M}_{{\mathsf{l}}(i)} + \rho_{{\mathsf{r}}(i)} \mathbf{M}_{{\mathsf{r}}(i)} + \mathbf{Z}_{i}) \tilde{\boldsymbol{\Omega}}_{i} & \text{Non-leaf node} \\ (\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)} + \mathbf{Z}_{i}) \tilde{\boldsymbol{\Omega}}_{i} & \text{Leaf node} \end{array}\right. \end{array} $$
(39)

we can derive the posterior distribution of W i analytically. The analytical solution is expressed as

$$\begin{array}{l} \tilde{q}(\mathbf{W}_{i}) = {\mathcal{N}}(\mathbf{W}_{i}|\tilde{\mathbf{M}}_{i}, \mathbf{I}_{D}, \tilde{\boldsymbol{ \Omega }}_{i})\\ = h(\tilde{\boldsymbol{ \Omega }}_{i}) \exp \left(\,-\, \frac{1}{2} \text{tr} \left[ (\mathbf{W}_{i}\,-\, \tilde{\mathbf{M}}_{i})' (\mathbf{W}_{i} \,-\, \tilde{\mathbf{M}}_{i}) \tilde{\boldsymbol{\Omega}}_{i}^{-1} \right] \right), \end{array} $$
(40)

where

$$ h(\tilde{\mathbf{ \Omega }}_{i}) \triangleq (2 \pi)^{- \frac{D(D+1)}{2}} |\tilde{\mathbf{ \Omega }}_{i}|^{- \frac{D}{2}}. $$
(41)

The posterior distribution is also a matrix variate normal distribution since we use a conjugate prior for W i . From Eq. 39, M̃ i is obtained from the hyper-parameter M̂ i and the first-order statistics Z i of the linear regression, and ρ̂ i controls the balance between the effects of the prior distribution and the adaptation data. This solution is the M-step of the VB EM algorithm and corresponds to the M-step of the ML EM algorithm in Section 2.1.
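For a leaf node, Eqs. 39-40 reduce to a ridge-regularized version of the ML solution in Eq. 12; a minimal sketch, where M_parent stands for the posterior location of the parent node (names are ours):

    import numpy as np

    def vb_m_step_leaf(Xi_i, Z_i, M_parent, rho_i):
        """q(W_i) = N(W_i | M_tilde, I_D, Omega_tilde)  (Eqs. 39-40, leaf-node case)."""
        D_plus_1 = Xi_i.shape[0]
        Omega_tilde = np.linalg.inv(rho_i * np.eye(D_plus_1) + Xi_i)   # (D+1) x (D+1)
        M_tilde = (rho_i * M_parent + Z_i) @ Omega_tilde               # D x (D+1)
        return M_tilde, Omega_tilde

    # As rho_i -> 0 this recovers the ML estimate Z_i Xi_i^{-1} of Eq. 12;
    # as rho_i grows it shrinks toward the parent-node location M_parent.
    D = 4
    M_tilde, Omega_tilde = vb_m_step_leaf(np.eye(D + 1), np.random.randn(D, D + 1),
                                          np.zeros((D, D + 1)), rho_i=1.0)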

Compared with Eq. 21, Eq. 40 keeps the first covariance matrix as an identity matrix, while the second covariance matrix \(\tilde {\mathbf {\Omega }}_{i}\) has off-diagonal elements. This means that the posterior distribution only considers the correlation between column vectors in W. This unique property comes from the variance-normalized representation introduced in Section 2, which decorrelates the multivariate Gaussian distributions in the HMMs, and this relationship carries over to the VB solutions.

Although the solution for a non-leaf node should make the prior distribution more robust by taking the child-node hyper-parameters into account, this structure makes the dependency of the target node on the other linked nodes complex. Therefore, in the implementation, we approximate the hyper-parameters of the posterior distribution for a non-leaf node by those for a leaf node, i.e., M̂ i ← M p(i) and ρ̂ i ← ρ i in Eq. 37, to keep the algorithm simple. We will evaluate the effect of the non-leaf node solution in future work.

The next section explains the E-step of the VB EM algorithm, which computes the sufficient statistics ζ k , S k , Ξ i , and Z i in Eqs. 9 and 10. These are obtained by using \(\tilde {q}(\mathbf {W}_{i})\), whose mode M̃ i is used for the Gaussian mean vector transformation.

3.5 Posterior Distribution of Latent Variables

From the variational calculation of \({\mathcal {F}} (m, \mathbf { \Psi })\) with respect to q(V), we also obtain the following posterior distribution:

$$ \log \tilde{q}(\mathbf{V}) \propto \left< \log p(\mathbf{O}, \mathbf{V}| \boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) \right>_{q (\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}})}. $$
(42)

By using the factorized form of the variational posterior (Eq. 24), we can disregard the expectations with respect to the variational posteriors of all nodes other than the target node i. Therefore, to obtain the above VB posteriors of the latent variables, we have to consider the following integral:

$$ \int q(\mathbf{W}_{i}) \log {\mathcal{N}}(\mathbf{o}_{t}|\mathbf{C}_{k} \mathbf{W}_{i} \boldsymbol{\xi}_{k}, \mathbf{ \Sigma }_{k}) d \mathbf{W}_{i}. $$
(43)

Since the Gaussian mean vectors are only updated in the leaf nodes, node i in this section is regarded as a leaf node. By substituting Eqs. 40 and 4 into Eq. 43, the equation is represented as (see Appendix):

$$\begin{array}{l} \int q(\mathbf{W}_{i}) \log {\mathcal{N}}(\mathbf{o}_{t}|\mathbf{C}_{k} \mathbf{W}_{i} \boldsymbol{\xi}_{k}, \mathbf{ \Sigma }_{k}) d \mathbf{W}_{i}\\ = \log {\mathcal{N}}(\mathbf{o}_{t}|\tilde{\boldsymbol{\mu} }_{k}, \mathbf{ \Sigma }_{k}) - \frac{1}{2} \text{tr} [\boldsymbol{\xi}_{k} \boldsymbol{\xi} '_{k} \tilde{\mathbf{\Omega}}_{i} ]. \end{array} $$
(44)

where

$$ \tilde{\boldsymbol{\mu} }_{k} = \mathbf{C}_{k} \tilde{\mathbf{M}}_{i} \boldsymbol{\xi}_{k} $$
(45)

The analytical result is almost equivalent to the E-step of conventional MLLR, which means that the computation time is almost the same as that of the conventional MLLR E-step.

Note that the Gaussian mean vectors are updated in the leaf nodes in this result, while the posterior distributions of the transformation parameters are updated for all nodes.
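The modified per-frame score of Eqs. 44-45 differs from an ordinary Gaussian log-likelihood only by the trace correction term; a minimal sketch for one frame and one Gaussian (function and variable names are ours):

    import numpy as np

    def vb_expected_loglik(o_t, xi_k, C_k, Sigma_k, M_tilde, Omega_tilde):
        """<log N(o_t | C_k W_i xi_k, Sigma_k)>_{q(W_i)}  (Eqs. 44-45)."""
        D = o_t.shape[0]
        mu_tilde = C_k @ (M_tilde @ xi_k)                              # Eq. 45
        diff = o_t - mu_tilde
        log_gauss = (-0.5 * D * np.log(2.0 * np.pi)
                     - 0.5 * np.linalg.slogdet(Sigma_k)[1]
                     - 0.5 * diff @ np.linalg.solve(Sigma_k, diff))
        return log_gauss - 0.5 * (xi_k @ Omega_tilde @ xi_k)           # -1/2 tr[xi_k xi_k' Omega_tilde]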

3.6 Variational Lower Bound

By using the factorization form (Eq. 24) of the variational posterior distribution, the variational lower bound defined in Eq. 15 is decomposed as follows:

$$\begin{array}{rll} {\mathcal{F}}(m, \mathbf{ \Psi })\\ && \quad = \underbrace{\left< \log \frac{p(\mathbf{O}, \mathbf{V}|\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) p(\boldsymbol{\Lambda}_{{\mathcal{I}}_{m}}) }{\prod_{i \in {\mathcal{I}}_{m}} q (\mathbf{W}_{i}) } \right>_{\substack{\prod_{i \in {\mathcal{I}}_{m}} q (\mathbf{W}_{i}) \\ q(\mathbf{V})}}}_{\triangleq {\mathcal{L}} (m, \mathbf{\Psi})}\\ && \quad \quad - \left< \log q(\mathbf{V}) \right>_{q(\mathbf{V})}. \end{array} $$
(46)

The second term, which consists of q(V), is an entropy and is calculated in the E-step of the VB EM algorithm. The first term, \({\mathcal {L}} (m, \mathbf {\Psi })\), is a logarithmic evidence term for m and \(\boldsymbol {\Psi } = \{\rho _{i} | i = 1, \cdots, |{\mathcal {I}}_{m}|\}\), and we can obtain an analytical solution for ℒ(m, Ψ). Because of the factorized forms in Eqs. 24, 18, and 28, ℒ(m, Ψ) can be represented as a summation over i, as follows:

$$ {\mathcal{L}} (m, \boldsymbol{ \Psi }) = \sum\limits_{i \in {\mathcal{I}}_{m}} {\mathcal{L}}_{i} (\rho_{i}, \rho_{{\mathsf{l}}(i)}, \rho_{{\mathsf{r}}(i)}), $$
(47)

where

$$\begin{array}{lll} &&{\mathcal{L}}_{i} (\rho_{i}, \rho_{{\mathsf{l}}(i)}, \rho_{{\mathsf{r}}(i)})\\ &&\quad \triangleq \left< \log \frac{p(\mathbf{O}, \mathbf{V}|\mathbf{W}_{i}) p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) }{q(\mathbf{W}_{i})} \right>_{\substack{q(\mathbf{W}_{i}) \\ q(\mathbf{V})}} \end{array} $$
(48)

Note that this factorized form retains some dependencies on the parent- and child-node parameters through Eqs. 37 and 39. To derive an analytical solution, we first consider the expectation with respect to only q(V) for cluster i. By substituting Eqs. 8, 21, and 40 into ℒ i (ρ i , ρ l(i) , ρ r(i) ), and by using Eq. 39, the expectation can be rewritten as follows:

$$\begin{array}{rll} \left< \log \frac{p(\mathbf{O}, \mathbf{V}|\mathbf{W}_{i}) p(\mathbf{W}_{i}|\mathbf{W}_{{\mathsf{p}}(i)}) }{q(\mathbf{W}_{i})} \right>_{q(\mathbf{V})}\\ && = {\sum\limits}_{k \in {\mathcal{K}}_{i}} \zeta_{k} \log g(\boldsymbol{ \Sigma }_{k}) + \log \frac{g(\hat{\rho}_{i}^{-1} \mathbf{I}_{D+1})}{g(\tilde{\boldsymbol{\Omega}}_{i})}\\ && \quad-\frac{1}{2} \text{tr} \left[\hat{\rho}_{i} \hat{\mathbf{M}}_{i} ' \hat{\mathbf{M}}_{i} - \tilde{\mathbf{M}}_{i} ' \tilde{\mathbf{M}}_{i} \tilde{\boldsymbol{\Omega}}_{i}^{-1} + {{\sum\limits}}_{k \in {\mathcal{K}}_{i}} \boldsymbol{\Sigma}_{k}^{-1} \mathbf{S}_{k} \right]. \end{array} $$
(49)

The obtained result does not depend on W i . Therefore, the expectation with respect to q(W i ) can be disregarded in ℒ i (ρ i , ρ l (i), ρ r(i)). Consequently, we can obtain the following analytical result for the lower bound:

$$\begin{array}{rll} {\mathcal{L}}_{i} (\rho_{i}, \rho_{{\mathsf{l}}(i)}, \rho_{{\mathsf{r}}(i)})\\ &=& - \frac{D}{2} \log (2 \pi) \sum\limits_{k \in {\mathcal{K}}_{i}} \zeta_{k} - \frac{1}{2} \sum\limits_{k \in {\mathcal{K}}_{i}} \zeta_{k} \log |\mathbf{\Sigma}_{k}|\\ && \quad + \frac{D(D+1)}{2} \log \hat{\rho}_{i} + \frac{D}{2} \log |\tilde{\mathbf{ \Omega }}_{i}| \\ && \quad - \frac{1}{2} \text{tr} \left[\hat{\rho}_{i} \hat{\mathbf{M}}_{i} ' \hat{\mathbf{M}}_{i} - \tilde{\mathbf{M}}_{i} ' \tilde{\mathbf{M}}_{i} \tilde{\boldsymbol{\Omega}}_{i}^{-1} + \sum\limits_{k \in {\mathcal{K}}_{i}} \boldsymbol{\Sigma}_{k}^{-1} \mathbf{S}_{k} \right]. \end{array} $$
(50)

The first line of the result corresponds to the likelihood value given the amount of data and the covariance matrices of the Gaussians. The other terms account for the effect of the prior and posterior distributions of the model parameters. This quantity is used as the optimization criterion for the model structure m and the hyper-parameters Ψ. Note that the objective function can be represented as a summation over i because of the factorized forms of the prior and posterior distributions. This property is used for the model structure optimization in Section 4.2 over the binary tree of Gaussians used in conventional MLLR.

4 Optimization of Hyper-Parameters and Model Structure

In this section, we describe how to optimize hyper-parameters Ψ and model structure m by using the variational lower bound as an objective function. Once we obtain the variational lower bound, we can obtain an appropriate model structure and hyper-parameters at the same time that maximize the lower bound as follows:

$$ \{\widetilde{\boldsymbol{ \Psi }}, \widetilde{m}\} = \underset{m, \boldsymbol{ \Psi }}{\arg\!\max}\ {\mathcal{F}} (m, \boldsymbol{ \Psi }) $$
(51)

In this paper, we use two approximations of the variational lower bound to make the inference algorithm practical. First, we fix the latent variables V during the above optimization. Then, \(\left< \log q(\mathbf{V}) \right>_{q(\mathbf{V})}\) in Eq. 46 is also fixed with respect to m and Ψ, and can be disregarded in the objective function. Thus, we can focus only on ℒ(m, Ψ) in the optimization step, which greatly reduces the computational cost, as follows:

$$ \{\widetilde{\mathbf{ \Psi }}, \widetilde{m}\} \approx \underset{m, \mathbf{\Psi}}{\arg\!\max} {\mathcal{L}} (m, \mathbf{ \Psi }) $$
(52)

This approximation is widely used in acoustic model selection (with likelihood [38] and Bayesian [26] criteria). Second, as discussed in Section 3.4, the solution for a non-leaf node (Eq. 36) makes the dependency of the target node on the other linked nodes complex. Therefore, we approximate ℒ i (ρ i , ρ l(i) , ρ r(i) ) ≈ ℒ i (ρ i ) by setting ρ̂ i ← ρ i , and so on, where ℒ i (ρ i ) is defined in the next section. That is, in the implementation we approximate the posterior distribution for a non-leaf node by that for a leaf node to keep the algorithm simple.

4.1 Hyper-Parameter Optimization

Even though we marginalize all transformation matrices W i , we still have to set the precision hyper-parameters ρ i for all nodes. Since we can derive the variational lower bound, we can optimize the precision hyper-parameters and thus remove their manual tuning. This is an advantage of the proposed approach over SMAPLR [18], which has to hand-tune the hyper-parameters corresponding to {ρ i } i .

Based on the leaf-node approximation for the variational posterior distributions, in addition to the fixed-latent-variable approximation (\({\mathcal {F}} (m, \mathbf {\Psi }) \approx {\mathcal {L}} (m, \mathbf {\Psi })\)), the method we implement in this paper approximately optimizes the precision hyper-parameter as follows:

$$\begin{array}{rll} \tilde{\rho}_{i} &=& \mathop{\text{argmax}}_{\rho_{i}} {\mathcal{L}} (m, \boldsymbol{ \Psi }) \\ &=& \left\{ \begin{array}{ll} \mathop{\text{argmax}}_{\rho_{i}} \left( {\mathcal{L}}_{i} (\rho_{i}, \rho_{{\mathsf{l}}(i)}, \rho_{{\mathsf{r}}(i)}) + {\mathcal{L}}_{{\mathsf{p}}(i)} (\rho_{{\mathsf{p}}(i)}, \rho_{i}, \rho_{{\mathsf{r}}({\mathsf{p}}(i))}) \right) & i~\text{is a left child node of}~{\mathsf{p}}(i) \\ \mathop{\text{argmax}}_{\rho_{i}} \left( {\mathcal{L}}_{i} (\rho_{i}, \rho_{{\mathsf{l}}(i)}, \rho_{{\mathsf{r}}(i)}) + {\mathcal{L}}_{{\mathsf{p}}(i)} (\rho_{{\mathsf{p}}(i)}, \rho_{{\mathsf{l}}({\mathsf{p}}(i))}, \rho_{i}) \right) & i~\text{is a right child node of}~{\mathsf{p}}(i) \end{array} \right. \\ &\approx& \mathop{\text{argmax}}_{\rho_{i}} {\mathcal{L}}_{i} (\rho_{i}), \end{array} $$
(53)

where

$$\begin{array}{rll} {\mathcal{L}}_{i} (\rho_{i}) &\triangleq& \frac{D(D+1)}{2} \log \rho_{i} + \frac{D}{2} \log |\tilde{\boldsymbol{ \Omega }}_{i}| \\ && -\, \frac{1}{2} \text{tr} \left[\rho_{i} \mathbf{M}_{{\mathsf{p}}(i)}' \mathbf{M}_{{\mathsf{p}}(i)} - \tilde{\mathbf{M}}_{i}' \tilde{\mathbf{M}}_{i} \tilde{\boldsymbol{\Omega}}_{i}^{-1} \right]. \end{array} $$
(54)

This approximation makes the algorithm simple because we can optimize the precision hyper-parameter using only the target node and its parent node, without considering the child nodes. Since this optimization step involves only one scalar parameter, we simply use a line search algorithm to obtain the optimal precision hyper-parameter. If we consider a more complex precision structure (e.g., a precision matrix instead of a scalar precision parameter in the prior distribution setting of Eq. 20), a line search may no longer be adequate; in that case, the hyper-parameters need to be updated by some other optimization technique (e.g., gradient descent).
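Since ρ i is a single scalar, a simple bounded line search suffices. Below is a minimal sketch using SciPy's scalar minimizer; the callable eval_L_i, which is assumed to recompute the posterior statistics entering Eq. 54 for a given ρ i, and the search interval are illustrative assumptions, not part of the paper.

```python
from scipy.optimize import minimize_scalar

def optimize_precision(eval_L_i, rho_min=1e-3, rho_max=1e3):
    """Line search for the scalar precision hyper-parameter (last line of Eq. 53).

    eval_L_i : callable rho -> L_i(rho) of Eq. 54; assumed to recompute the
               posterior statistics (M_tilde_i, Omega_tilde_i) for the given rho.
    The bounds [rho_min, rho_max] are an illustrative choice.
    """
    # Maximize L_i(rho) by minimizing its negation over the bounded interval.
    result = minimize_scalar(lambda rho: -eval_L_i(rho),
                             bounds=(rho_min, rho_max), method="bounded")
    return result.x
```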

4.2 Model Selection

The remaining tuning parameter in the proposed approach is the number of clusters to prepare. This is a model selection problem, and we can also obtain the number of clusters automatically by optimizing the variational lower bound. In the binary tree structure, we focus on a subtree composed of a target non-leaf node i and its child nodes l(i) and r(i). We compute the following difference between Eq. 54 evaluated at the parent node and at its child nodes:

$$ \Delta {\mathcal{L}}_{i} (\rho_{i}) \triangleq {\mathcal{L}}_{{\mathsf{l}}(i)} (\rho_{{\mathsf{l}}(i)}) + {\mathcal{L}}_{{\mathsf{r}}(i)} (\rho_{{\mathsf{r}}(i)}) - {\mathcal{L}}_{i} (\rho_{i}). $$
(55)

This difference is used as a stopping criterion in a top-down clustering strategy. If the sign of Δℒ i is negative, the target non-leaf node is regarded as a new leaf node determined by the model selection in terms of optimizing the lower bound, and we prune its child nodes l(i) and r(i). By checking the signs of Δℒ i for all possible nodes and pruning the child nodes whenever Δℒ i is negative, we obtain a pruned tree structure that locally maximizes the variational lower bound. This optimization is efficiently accomplished by a depth-first search, as illustrated in the sketch below. This approach is similar to the tree-based triphone clustering based on VB [26].
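The following is a minimal sketch of this depth-first pruning, assuming a binary tree whose nodes carry `left`/`right` links and a mapping `L` from each node to its optimized ℒ i (ρ i) value of Eq. 54; the data layout is illustrative and not the authors' implementation.

```python
def prune(node, L):
    """Top-down, depth-first pruning of the regression tree using Eq. 55.

    node : tree node with attributes `left` and `right` (both None for a leaf)
    L    : dict mapping each node to its optimized L_i(rho_i) value (Eq. 54)
    """
    if node.left is None and node.right is None:
        return  # already a leaf
    delta = L[node.left] + L[node.right] - L[node]  # Delta L_i of Eq. 55
    if delta < 0:
        # Splitting does not improve the bound: this node becomes a new leaf.
        node.left = None
        node.right = None
    else:
        prune(node.left, L)
        prune(node.right, L)
```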

Thus, by optimizing the hyper-parameters and the model structure, we can avoid setting any tuning parameters. We summarize this optimization in Algorithms 1, 2, and 3. Algorithm 1 prepares a large Gaussian tree with a set of nodes \({\mathcal {I}}\), prunes the tree based on the model selection (Algorithm 2), and transforms the HMMs (Algorithm 3). Algorithm 2 first optimizes the precision hyper-parameters Ψ and then the model structure m. Algorithm 3 transforms the Gaussian mean vectors in the HMMs at the new leaf nodes of the pruned tree \({\mathcal {I}}_{m}\) obtained by Algorithm 2.

[Algorithm 1, Algorithm 2, and Algorithm 3: pseudocode figures]
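For orientation only, the following hypothetical driver ties the pieces of Algorithms 1 to 3 together: a per-node line search (Section 4.1), the pruning sketch above (Section 4.2), and a final transformation step. The tree layout and the callables `eval_L_i` and `transform_node` (the latter standing for the posterior mean update defined earlier in the paper) are assumptions for illustration, not the authors' implementation.

```python
def adapt_hmm(root, eval_L_i, transform_node):
    """Sketch of the overall adaptation flow (Algorithms 1-3); illustrative only.

    root           : root of the pre-built Gaussian regression tree
    eval_L_i       : callable (node, rho) -> L_i(rho) of Eq. 54
    transform_node : callable applying the posterior mean update of a node to
                     the Gaussians assigned to it (defined elsewhere in the paper)
    """
    # 1) Optimize the precision hyper-parameter at every node (Section 4.1),
    #    reusing the optimize_precision sketch shown earlier.
    L = {}
    for node in traverse(root):
        node.rho = optimize_precision(lambda rho, n=node: eval_L_i(n, rho))
        L[node] = eval_L_i(node, node.rho)
    # 2) Prune the tree by model selection (Section 4.2, Eq. 55),
    #    reusing the prune sketch shown earlier.
    prune(root, L)
    # 3) Transform the Gaussian means at the leaves of the pruned tree.
    for leaf in leaves(root):
        transform_node(leaf)

def traverse(node):
    """Pre-order traversal over all nodes of the binary tree."""
    yield node
    if node.left is not None:
        yield from traverse(node.left)
    if node.right is not None:
        yield from traverse(node.right)

def leaves(node):
    """Leaves of the (possibly pruned) tree."""
    if node.left is None and node.right is None:
        yield node
    else:
        yield from leaves(node.left)
        yield from leaves(node.right)
```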

5 Experiments

This section shows the effectiveness of the proposed approach through experiments on large vocabulary continuous speech recognition. We used the Corpus of Spontaneous Japanese (CSJ) task [39].

5.1 Experimental Condition

The training data for constructing the initial (non-adapted) acoustic model consists of 961 talks from the CSJ conference presentations (234 hours of speech data), and the training data for the language model construction consists of 2,672 talks from the complete CSJ speech data (6.8M word transcriptions). The test set consists of 10 talks (2.4 hours, 26,798 words). Table 2 provides information on the acoustic and language models used in the experiments [40]. We used context-dependent models with continuous density HMMs. The HMM parameters were estimated based on a discriminative training (Minimum Classification Error: MCE) approach [41]. Lexical and language models were also obtained by employing all the CSJ speech data. We used a 3-gram model with a Good-Turing smoothing technique. The OOV rate was 2.3 % and the test set perplexity was 82.4. The acoustic model construction, LVCSR decoding, and the following acoustic model adaptation procedures were performed with the NTT speech recognition platform SOLON [42].

Table 2 Experimental setup for CSJ.

5.2 Experimental Result

To check whether the proposed approach steadily increases the variational lower bound through each optimization in Section 4, Fig. 3 examines the values of the variational lower bound for each condition. Namely, we compare the proposed approach, which optimizes both the model structure and the hyper-parameters as discussed in Section 4, with configurations that do not optimize one or both of them, in terms of the ℒ(m, Ψ) value. Figure 3 shows that the proposed approach indeed steadily increases the ℒ(m, Ψ) value. This result indicates that the optimization works well, yielding appropriate hyper-parameters and model structure.

Figure 3: Variational lower bound for each optimization.

Next, Fig. 4 compares the proposed approach with MLLR, based on maximum likelihood estimation, and SMAPLR, based on approximate Bayesian estimation, in terms of the Word Error Rate (WER) for various amounts of adaptation data. With a small amount of adaptation data, the proposed approach outperforms the conventional approaches by about 1.0 %, while with a large amount of adaptation data the accuracies of all approaches are comparable. This behavior is theoretically reasonable, since the variational lower bound would be tighter than the EM-based objective function for a small amount of data, while it would approach it asymptotically for a large amount of data. Therefore, we conclude that this improvement comes from the optimization of the hyper-parameters and the model structure in the proposed approach, in addition to the mitigation of the sparse-data problem by the Bayesian approach.

Figure 4: Word error rates of conventional MLLR, SMAPLR, and the proposed Bayesian Linear Regression (VBLR) for various amounts (utterances) of adaptation data. The word error rate of the non-adapted (speaker-independent) model was 17.9 %.

Thus, both the lower-bound values and the recognition results demonstrate the effectiveness of the proposed approach.

6 Summary and Future Work

This paper presents a fully Bayesian treatment of linear regression for HMMs by using variational techniques. The derived lower bound of the marginalized log-likelihood can be used for optimizing the hyper-parameters and model structure, which was confirmed by speech recognition experiments. One promising extension is to apply the proposed approach to advanced adaptation techniques. Indeed, [43, 44] provide a fully Bayesian solution for standard transformation parameters (rather than the variance-normalized representation used in this paper) and apply it to both feature-space and model-parameter transformations. The model structure and hyper-parameters are also optimized automatically during adaptation. Thus, feature-space normalization and model-space adaptation are performed consistently within a variational Bayesian approach without tuning any parameters.

Another important direction for future work is the joint optimization of HMM parameters and linear regression parameters in a Bayesian framework. This paper assumes that the HMM parameters are fixed while the linear regression parameters are estimated. These parameters depend on each other, and the variational approximation can deal with this dependency (in the sense of locally optimal solutions). However, to consider model selection in this joint optimization, we have to consider many possible combinations of HMM and linear regression topologies. One promising approach to this problem is a non-parametric Bayesian approach (e.g., variational inference for Dirichlet process mixtures [45] in the VB framework), which can efficiently search for an appropriate model structure among the many possible combinations.

Finally, how to integrate Bayesian and discriminative approaches, both theoretically and practically, is also important future work. One promising direction is the marginalization of model parameters and margin variables to provide Bayesian interpretations of discriminative methods [46]. However, applying [46] to acoustic models requires extensions to deal with large-scale structured-data problems [47]. Such an extension would enable more robust regularization of discriminative approaches and allow structural learning by combining Bayesian and discriminative criteria.