1 Introduction

Hierarchical parametric models are employed for unsupervised learning in many data-mining and machine-learning applications. Statistical analysis of these models plays an important role not only in revealing their theoretical properties but also in supporting their practical applications. For example, the asymptotic forms of the generalization error and the marginal likelihood are used for model selection in the maximum-likelihood and Bayes methods, respectively (Akaike 1974; Schwarz 1978; Rissanen 1986).

Parametric models generally fall into two cases: regular and singular. The present paper focuses on models whose density functions are continuous and sufficiently smooth with respect to the parameter. In regular cases, the Fisher information matrix is positive definite, and there is a one-to-one relation between the parameter and the expression of the model as a probability density function. Otherwise, the model is singular, and the parameter space includes singularities. Due to these singularities, the Fisher information matrix is not positive definite, so the conventional analysis methods that rely on its inverse matrix are not applicable. In this case, an algebraic geometrical approach can be used to analyze the Bayes method (Watanabe 2001, 2009).

Hierarchical models have both observable and latent variables. The latent variables represent the underlying structure of the model, while the observable ones correspond to the given data. For example, unobservable labels in clustering are expressed as the latent variables of mixture models, and the system dynamics underlying time-series data are expressed as a sequence of latent variables in hidden Markov models. Hierarchical models thus have two estimation targets: the observable and the latent variables. The well-known generalization error measures the performance of the prediction of a future observable variable. Combining the two model cases and the two estimation targets, there are four estimation cases, which are summarized in Table 1. We will use the abbreviations shown in the table to specify the target variable and the model case; for example, Reg-OV estimation stands for estimation of the observable variable in the regular case.

Table 1 Estimation classification according to the target variable and the model case

                 Observable variable    Latent variable
Regular case     Reg-OV estimation      Reg-LV estimation
Singular case    Sing-OV estimation     Sing-LV estimation

In the present paper, we will investigate the asymptotic performance of the Sing-LV estimation. One of the main concerns in unsupervised learning is the estimation of unobservable parts, and in practical situations the range of the latent variable is unknown, which corresponds to the singular case. The other estimation cases have already been studied; the accuracy of the Reg-OV estimation has been clarified on the basis of the conventional analysis method, and the results have been used for model selection criteria such as AIC (Akaike 1974). The primary purpose of the algebraic geometrical method has been to analyze the Sing-OV estimation, and the asymptotic generalization error of the Bayes method has been derived for many models (Aoyagi and Watanabe 2005; Aoyagi 2010; Rusakov and Geiger 2005; Yamazaki and Watanabe 2003a, b, 2005a, b; Zwiernik 2011). Recently, an error function for the latent-variable estimation was formalized in a distribution-based manner, and its asymptotic form was determined for the Reg-LV estimation in both the maximum-likelihood and Bayes methods (Yamazaki 2014). Hereinafter, the estimation method will be assumed to be the Bayes method unless explicitly stated otherwise.

In the Bayes estimation, parameter sampling from the posterior distribution is an important process for practical applications. The behavior of posterior distributions has been studied in the statistical literature; in particular, the convergence rate of the posterior distribution has been analyzed (e.g., Ghosal et al. 2000; Le Cam 1973; Ibragimov and Has'minskii 1981). Specifically, the rate based on the Wasserstein metrics has been elucidated in finite and infinite mixture models (Nguyen 2013). To avoid singularities, identifiability conditions guaranteeing a positive-definite Fisher information matrix are necessary; Allman et al. (2009) use algebraic techniques to clarify identifiability in some hierarchical models. In the regular case, the posterior distribution has asymptotic normality, which means that it converges to a Gaussian distribution. Because the variance of this distribution goes to zero when the number of data is sufficiently large, the limit distribution is the delta distribution, and the sample sequence from the posterior distribution converges to a point. On the other hand, in the singular case, the posterior distribution does not have asymptotic normality, and the sequence converges to some area of the parameter space (Watanabe 2001). Studies on the Sing-OV estimation, such as Yamazaki and Kaji (2013), have shown that the convergence area of the limit distribution depends on the prior distribution. The behavior of the posterior distribution has not been clarified for the Sing-LV estimation. The analysis of the present paper enables us to elucidate the relation between the prior and the limit posterior distributions.

The main contributions of the present paper are summarized as follows:

  1. The algebraic geometrical method for the Sing-OV estimation is applicable to the analysis of the Sing-LV estimation.

  2. The asymptotic form of the error function is obtained, and its dominant order is larger than that of the Reg-LV estimation.

  3. There is a case where the limit posterior distribution in the Sing-LV estimation is different from that in the Sing-OV estimation.

The third result is important for practical applications: for some priors, parameter-sampling methods based on latent variables, such as Gibbs sampling in the Markov chain Monte Carlo (MCMC) method, cannot construct the proper posterior distribution, because the sample sequence of the MCMC method follows the posterior of the Sing-LV estimation, which has a different convergence area from the desired one in the Sing-OV estimation.

The rest of this paper is organized as follows. The next section formalizes the hierarchical model and the singular case, and introduces the performance of the Reg-OV and the Sing-OV estimations. Section 3 explains the asymptotic analysis of the free energy function and the convergence of the posterior distribution based on the results of the Sing-OV estimation. In Sect. 4, the latent-variable estimation and its evaluation function are formulated in a distribution-based manner. Section 5 shows the main results: the asymptotic error function of general hierarchical models, and the detailed error properties in mixture models. In Sect. 6, we discuss the limit distribution of the posterior in the Sing-LV estimation and differences from the Sing-OV estimation. Finally, Sect. 7 presents conclusions.

2 The singular case and accuracy of the observable-variable estimation

In this section, we introduce the singular case and formalize the Bayes method for the observable-variable estimation. This section is a brief summary of the results on the Reg-OV and the Sing-OV estimations.

2.1 Hierarchical models and singularities

Let a learning model be defined by

$$\begin{aligned} p(x|w) = \sum _{y=1}^K p(x,y|w) = \sum _{y=1}^K p(y|w)p(x|y,w), \end{aligned}$$

where \(x\in R^M\) is an observable variable, \(y\in \{1,\ldots ,K\}\) is a latent one, and \(w\in W \subset R^d\) is a parameter. For the discrete \(x\) such that \(x\in \{1,2,\ldots ,M\}\), all results hold by replacing \(\int dx\) with \(\sum _{x=1}^M\).

Example 1

A mixture of distributions is described by

$$\begin{aligned} p(x|w) = \sum _{k=1}^K a_k f(x|b_k), \end{aligned}$$
(1)

where \(f\) is the density function associated with a mixture component, which is identifiable for any \(b_k \in W_b \subset R^{d_c}\). The mixing ratios have the constraints \(a_k\ge 0\) and \(\sum _{k=1}^K a_k =1\). We regard \(a_1\) as a function of the other mixing ratios, \(a_1=1-\sum _{k=2}^K a_k\). The parameter \(w\) consists of \(\{a_2,\ldots ,a_K\}\) and \(\{b_1,\ldots ,b_K\}\), where \(w \in [0,1]^{K-1}\times W_b^{K}\). The latent variable \(y\) is the component label.
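A minimal sketch of this setting is given below: it draws complete data \((x_i,y_i)\) from a binomial mixture and evaluates \(p(x|w)\). The binomial component \(f\) and the specific parameter values are illustrative assumptions, not part of the formal setting.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

def mixture_density(x, a, b, M=10):
    """p(x | w) = sum_k a_k f(x | b_k) with binomial components f(x | b_k)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.sum(a * binom.pmf(x, M, b)))

def sample_complete(n, a, b, M=10):
    """Draw complete data (x_i, y_i); the component label y_i is the latent variable."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    y = rng.choice(len(a), size=n, p=a)   # latent labels
    x = rng.binomial(M, b[y])             # observable values
    return x, y

# K = 2 components with mixing ratios (0.3, 0.7); only x would be observed.
x, y = sample_complete(5, a=[0.3, 0.7], b=[0.2, 0.8])
```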

Assume that the number of data is \(n\) and the observable data \(X^n=\{x_1,\ldots ,x_n\}\) are independent and identically distributed from the true model, which is expressed as

$$\begin{aligned} q(x) = \sum _{y=1}^{K^*} q(y)q(x|y). \end{aligned}$$

Note that the range of the latent variable \(y\), described as \(\{1,\ldots ,K^*\}\), is generally unknown and can be different from the one in the learning model. In the example of the mixture model, the true model is expressed as

$$\begin{aligned} q(x) = \sum _{k=1}^{K^*} a^*_k f(x|b^*_k). \end{aligned}$$
(2)

We also assume that the true model satisfies the minimality condition:

$$\begin{aligned} k \ne j \in \{1,\ldots ,K^*\} \Rightarrow q(x|y=k)\ne q(x|y=j). \end{aligned}$$

For example, consider a three-component model such that \(q(x|y=1)\ne q(x|y=2)=q(x|y=3)\). This model does not satisfy the minimality condition. Defining a new label, we obtain the following two-component expression, which satisfies the condition;

$$\begin{aligned} q(x)&= q(y=1)q(x|y=1) + \{ q(y=2)+q(y=3)\}q(x|y=2)\\&= q(y=1)q(x|y=1) + q(y=\bar{2})q(x|y=\bar{2}), \end{aligned}$$

where \(y\in \{1,\bar{2}\}\) and \(\bar{2}=\{2,3\}\).

The present paper focuses on the case in which the true model is in the class of the learning model. More formally, there is a set of parameters expressing the true model such that

$$\begin{aligned} W_X^t = \{w^*;p(x|w^*)=q(x)\} \ne \emptyset , \end{aligned}$$

which is referred to as the true parameter set for \(x\). This means that the latent variable range satisfies \(K=K^*\) or \(K>K^*\). The former relation corresponds to the regular case and the latter one to the singular case. The true parameter set \(W_X^t\) includes \(K!\) isolated points in the regular case due to the symmetry of the parameter space. On the other hand, it consists of an analytic set in the singular case. We explain this structure using the following model settings.

Example 2

Assume that \(K=2\) and \(K^*=1\) in the mixture model. For illustrative purposes, let the learning and the true models be defined by

$$\begin{aligned} p(x|w)&= af(x|b_1) + (1-a)f(x|b_2),\\ q(x)&= f(x|b^*), \end{aligned}$$

respectively, where \(x\in R^1\) and \(w=\{a,b_1,b_2\}\) such that \(a\in [0,1]\) and \(b_1,b_2\in W_b \subset R^1\). We can confirm that the true parameter set consists of the following analytic set:

$$\begin{aligned} W_X^t&= W^t_1 \cup W^t_2 \cup W^t_3,\\ W^t_1&= \{a=1,b_1=b^*\},\\ W^t_2&= \{b_1=b_2=b^*\},\\ W^t_3&= \{a=0,b_2=b^*\}. \end{aligned}$$

As shown in Fig. 1, let \(W_1\), \(W_2\), and \(W_3\) be neighborhoods of \(W^t_1\), \(W^t_2\), and \(W^t_3\), respectively. The Fisher information matrix is not positive definite on \(W^t_X\). Moreover, the intersections of \(W^t_1\), \(W^t_2\), and \(W^t_3\) are singularities.

When \(K=K^*\), \(W^t_X\) is a set of points, which corresponds to the regular case;

Example 3

If both the learning and the true models have two components,

$$\begin{aligned} p(x|w)&= af(x|b_1) +(1-a)f(x|b_2),\\ q(x)&= a^*f(x|b^*_1) +(1-a^*)f(x|b^*_2) \end{aligned}$$

for \(a^*\ne 0,1\) and \(b^*_1\ne b^*_2\), the estimation will be in the regular case. Due to \(K!=2!=2\), the set consists of two isolated points;

$$\begin{aligned} W_X^t = \{(a=a^*,b_1=b^*_1,b_2=b^*_2),(a=1-a^*,b_1=b^*_2,b_2=b^*_1)\}, \end{aligned}$$

where the Fisher information matrix is positive definite.

Fig. 1 The true parameter set \(W_X^t\) (the left panel), and the parameter areas \(W_1\), \(W_2\), and \(W_3\) (the right panel)

2.2 The observable-variable estimation and its performance

In Bayesian statistics, estimation of the observable variables is defined by

$$\begin{aligned} p(x|X^n)&= \int p(x|w)p(w|X^n)dw ,\\ p(w|X^n)&= \frac{\prod _{i=1}^np(x_i|w)\varphi (w;\eta )}{Z(X^n)}, \end{aligned}$$

where \(\varphi (w;\eta )\) is a prior distribution with the hyperparameter \(\eta \), \(p(w|X^n)\) is the posterior distribution of the parameter, and its normalizing factor is given by

$$\begin{aligned} Z(X^n) = \int \prod _{i=1}^np(x_i|w)\varphi (w;\eta )dw. \end{aligned}$$

This formulation is available for both the Reg-OV and Sing-OV estimations. In the mixture model, the Dirichlet distribution is often used for the prior distribution of the mixing ratio;

$$\begin{aligned} \varphi (w;\eta )&= \varphi (a;\eta _1)\varphi (b;\eta _2),\end{aligned}$$
(3)
$$\begin{aligned} \varphi (a;\eta _1)&= \frac{\varGamma (K\eta _1)}{\varGamma (\eta _1)^K}\prod _{k=1}^K a_k^{\eta _1-1}, \end{aligned}$$
(4)

where \(a=\{a_1,\ldots ,a_K\}\), \(b=\{b_1,\ldots ,b_K\}\), \(\eta =\{\eta _1,\eta _2\}\in R^2_{>0}\), and \(\varGamma \) is the gamma function. Since every \(a_k\) has the same exponent \(\eta _1-1\), \(\varphi (a;\eta _1)\) is referred to as a symmetric Dirichlet distribution.
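A minimal sketch of drawing a parameter from such a prior is given below. The symmetric Dirichlet part follows Eq. (4); the Beta distribution for \(\varphi (b;\eta _2)\) is an illustrative choice of an analytic prior, not one fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(K, eta1, eta2):
    """Draw w = (a, b) from phi(w; eta) for a K-component mixture."""
    a = rng.dirichlet(np.full(K, eta1))   # symmetric Dirichlet: common exponent eta1 - 1
    b = rng.beta(eta2, eta2, size=K)      # assumed analytic prior on each b_k in [0, 1]
    return a, b

a, b = sample_prior(K=3, eta1=0.5, eta2=1.0)
```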

The estimation accuracy is measured by the average Kullback–Leibler divergence:

$$\begin{aligned} G(n) = E_X\bigg [ \int q(x) \ln \frac{q(x)}{p(x|X^n)}dx \bigg ], \end{aligned}$$

where the expectation is

$$\begin{aligned} E_X\left[ f(X^n)\right] = \int f(X^n) q(X^n)dX^n. \end{aligned}$$

Let us define the free energy as

$$\begin{aligned} F(X^n) = -\ln Z(X^n), \end{aligned}$$

which plays an important role in Bayes statistics as a criterion for selecting the optimal model. In the Reg-OV estimation, the Bayesian information criterion (BIC; Schwarz 1978) and the minimum-description-length principle (MDL; Rissanen 1986) are both based on the asymptotic form of \(F(X^n)\). Theoretical studies often analyze the average free energy given by

$$\begin{aligned} F_X(n) = -nS_X + E_X[F(X^n)], \end{aligned}$$

where the entropy function is defined by

$$\begin{aligned} S_X = -\int q(x)\ln q(x)dx. \end{aligned}$$

The model that minimizes \(F(X^n)\) is then selected as optimal from among the candidate models. The energy function \(F_X(n)\) allows us to investigate the average behavior of the selection. Note that the entropy term does not affect the selection result because it is independent of the candidate models. According to the definitions, the average free energy and the generalization error have the relation

$$\begin{aligned} G(n)&= E_{X^n}\bigg [ \int q(x_{n+1})\ln \frac{q(x_{n+1})}{p(x_{n+1}|X^n)}dx_{n+1}\bigg ]\nonumber \\&= E_{X^n,x_{n+1}}\bigg [\ln \frac{q(x_{n+1})}{p(x_{n+1}|X^n)}\bigg ]\nonumber \\&= E_{X^n,x_{n+1}}\bigg [ \ln \frac{\prod _{i=1}^{n+1} q(x_i)}{\int \prod _{i=1}^{n+1} p(x_i|w)\varphi (w;\eta )dw}\bigg ]\nonumber \\&-E_{X^n}\bigg [ \ln \frac{\prod _{i=1}^n q(x_i)}{\int \prod _{i=1}^n p(x_i|w)\varphi (w;\eta )dw}\bigg ]\nonumber \\&= F_X(n+1) - F_X(n), \end{aligned}$$
(5)

which implies that the asymptotic form of \(F_X(n)\) determines that of \(G(n)\). The rest of the paper discusses the case \(W_X^t\ne \emptyset \), although it is also important to consider the case \(W_X^t=\emptyset \), in which the learning model cannot attain the true model.

The algebraic geometrical analysis (Watanabe 2001, 2009) is applicable to both the regular and singular cases for deriving the asymptotic form of \(F_X(n)\). Its result shows that the form is expressed as

$$\begin{aligned} F_X(n) = \lambda _X\ln n -(m_X-1)\ln \ln n +O(1), \end{aligned}$$

where the coefficient \(\lambda _X\) is a positive rational number and \(m_X\) is a natural number. The reason why the free energy has this form will be explained in the next section. According to the relation in Eq. (5), the asymptotic form of the generalization error is given by

$$\begin{aligned} G(n) = \frac{\lambda _X}{n} - \frac{m_X-1}{n \ln n} + o\bigg (\frac{1}{n \ln n}\bigg ). \end{aligned}$$
(6)

Since the learning model can attain the true model, we can confirm that the generalization error converges to zero as \(n\rightarrow \infty \). The coefficients are \(\lambda _X=d/2\) and \(m_X=1\) in the regular case. It is proved that \(\lambda _X<d/2\) in the singular case (Section 7, Watanabe 2009).
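As a concrete check under the regular case, consider Example 3 with scalar \(b_k\): the parameter is \(w=(a,b_1,b_2)\), so \(d=3\) and

$$\begin{aligned} \lambda _X=\frac{3}{2},\quad m_X=1,\quad G(n) = \frac{3}{2n} + o\bigg (\frac{1}{n}\bigg ). \end{aligned}$$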

3 Asymptotic analysis of the free energy and posterior convergence

This section introduces the asymptotic analysis of \(F_X(n)\) based on algebraic geometry and explains how the prior distribution affects convergence of the posterior distribution. The topics in this section have already been elucidated in the studies on the Sing-OV estimation (e.g., Watanabe 2009).

3.1 Relation between the free energy and the zeta function

Let us define another Kullback–Leibler divergence,

$$\begin{aligned} H_X(w) = \int q(x) \ln \frac{q(x)}{p(x|w)}dx, \end{aligned}$$

which is assumed to be analytic (Fundamental Condition I, Watanabe 2009). We consider the prior distribution \(\varphi (w;\eta )=\psi _1(w;\eta )\psi _2(w;\eta )\), where \(\psi _1(w;\eta )\) is a positive function of class \(C^{\infty }\) and \(\psi _2(w;\eta )\) is a nonnegative analytic function (Fundamental Condition II, Watanabe 2009). Let the zeta function of a parametric model be given by

$$\begin{aligned} \zeta _X(z) = \int H_X(w)^z \varphi (w;\eta )dw, \end{aligned}$$

where \(z\) is a complex variable. From algebraic analysis, we know that its poles are real, negative, and rational (Atiyah 1970). Let the largest pole and its order be \(z=-\lambda _X\) and \(m_X\), respectively. The zeta function includes the term

$$\begin{aligned} \zeta _X(z) = \frac{f_c(z)}{(z+\lambda _X)^{m_X}}+\cdots , \end{aligned}$$

where \(f_c(z)\) is a holomorphic function. We define the state density function of \(t>0\) as

$$\begin{aligned} v(t) = \int \delta (t-H_X(w))\varphi (w;\eta )dw. \end{aligned}$$

The zeta function is its Mellin transform:

$$\begin{aligned} \zeta _X(z) = \mathcal {M}[v(t)]=\int _0^\infty v(t)t^z dt. \end{aligned}$$

Moreover, the Laplace transform of \(v(t)\) equals the normalized partition function, whose negative logarithm has the same asymptotic form as \(F_X(n)\);

$$\begin{aligned} \mathcal {L}[v(t)]&= \int _0^\infty v(t)e^{-nt}dt\\&= \int e^{-nH_X(w)}\varphi (w;\eta )dw, \end{aligned}$$

that is, \(F_X(n) = -\ln \mathcal {L}[v(t)] + O(1)\).

Then, there is the following relation,

$$\begin{aligned} F_X(n) \overset{\mathcal {L}}{\Longleftrightarrow } v(t) \overset{\mathcal {M}}{\Longleftrightarrow } \zeta _X(z). \end{aligned}$$

Based on the Laplace and the Mellin transforms, the asymptotic forms of all functions are available if one of them is given. Following the transforms from \(\zeta _X(z)\) to \(F_X(n)\) through \(v(t)\), we obtain the asymptotic form

$$\begin{aligned} F_X(n) = \lambda _X\ln n -(m_X-1)\ln \ln n +O(1). \end{aligned}$$
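The following heuristic sketch indicates how the largest pole produces this form. Inverting the Mellin transform around the pole \(z=-\lambda _X\) of order \(m_X\) gives, for some constant \(c>0\) and small \(t\),

$$\begin{aligned} v(t) \approx c\, t^{\lambda _X-1}(-\ln t)^{m_X-1}, \end{aligned}$$

and substituting this into the Laplace transform yields

$$\begin{aligned} \mathcal {L}[v(t)] = \int _0^\infty v(t)e^{-nt}dt \approx c'\, n^{-\lambda _X}(\ln n)^{m_X-1}, \end{aligned}$$

so that \(-\ln \mathcal {L}[v(t)] = \lambda _X\ln n -(m_X-1)\ln \ln n +O(1)\), which is the form above.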

Let us define the effective area of the parameter space, which plays an important role in the convergence analysis of the posterior distribution. According to the results on the Sing-OV estimation, it has been found that the largest pole is determined by a restricted area of the parameter space. In Example 2, the parameter space is divided into \(W_1\), \(W_2\), \(W_3\), and the rest of the support of \(\varphi (w;\eta )\). The first three sets are neighborhoods of the analytic sets \(W^t_1\), \(W^t_2\), and \(W^t_3\) constituting \(W^t_X\), respectively. Assume that a pole \(z=-\lambda _e\) of the zeta function

$$\begin{aligned} \zeta _e(z)= \int _{W_e} H_X(w)^z\varphi (w;\eta )dw \end{aligned}$$

is equal to the largest pole \(z=-\lambda _X\), where \(W_e=W_1\cap W_2\). In the present paper, we refer to \(W_e\) as the effective area. The effective area is taken to be the minimal such set, here \(W_1\cap W_2\); in other words, we do not call \(W_1\) the effective area even though \(W_1\) includes \(W_e\). If the largest pole of \(\int _{W_1\setminus (W_1\cap W_2)} H_X(w)^z\varphi (w;\eta )dw\) is also equal to \(z=-\lambda _X\), however, the effective area is \(W_1\), since \(W_1\cap W_2\) alone cannot cover the relevant area.

3.2 Phase transition

A switch of the function dominating the free energy is generally referred to as a phase transition. When the prior of the mixing-ratio parameters is the Dirichlet distribution, the phase transition is observed in \(F_X(n)\). Combining the results of Yamazaki et al. (2010) and Yamazaki and Kaji (2013), we obtain the following lemma;

Lemma 1

Suppose that \(K=2\), \(K^*=1\) in the mixture model, where the true and the learning models are given by

$$\begin{aligned} q(x)&= f(x|b^*),\\ p(x|w)&= af(x|b_1)+(1-a)f(x|b_2), \end{aligned}$$

respectively. Let the component be expressed as

$$\begin{aligned} f(x=m|b_k) = \left( {\begin{array}{c}M\\ m\end{array}}\right) b_k^m(1-b_k)^{M-m}, \end{aligned}$$

where \(x\in \{1,\ldots ,M\}\), \(M\) is an integer such that \(K<M\), and \(\binom{M}{m}\) is the binomial coefficient. We consider the case \(0<b^*<1\). Let the prior distribution for the mixing ratio be the symmetric Dirichlet distribution, and the one for \(b_k\) be analytic and positive. Then the largest pole of the zeta function \(\zeta _X(z)\) and its order are given by

$$\begin{aligned} \lambda _X&= {\left\{ \begin{array}{ll} \frac{1+\eta _1}{2} &{} \eta _1 \le 1/2,\\ \frac{3}{4} &{} \eta _1 > 1/2, \end{array}\right. }\\ m_X&= {\left\{ \begin{array}{ll} 2 &{} \eta _1 = 1/2,\\ 1 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Moreover, the effective area \(W_e\) is given by

$$\begin{aligned} W_e = {\left\{ \begin{array}{ll} W_1 \cup W_3 &{} \eta _1<1/2,\\ (W_1\cap W_2)\cup (W_3\cap W_2) &{} \eta _1=1/2,\\ W_2 &{} \eta _1>1/2. \end{array}\right. } \end{aligned}$$

The proof is in Appendix 3. Lemma 1 indicates that the free energy has the phase transition at \(\eta _1=1/2\).
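A minimal sketch that tabulates the coefficients of Lemma 1 makes the transition visible numerically; the function below simply encodes the piecewise formulas of the lemma.

```python
def lemma1_coefficients(eta1):
    """Return (lambda_X, m_X) of Lemma 1 for a hyperparameter eta1 > 0."""
    lam = (1.0 + eta1) / 2.0 if eta1 <= 0.5 else 0.75
    m = 2 if eta1 == 0.5 else 1
    return lam, m

for eta1 in (0.25, 0.5, 0.75, 1.0):
    print(eta1, lemma1_coefficients(eta1))
# lambda_X grows linearly up to eta1 = 1/2 and stays at 3/4 afterwards;
# the kink at eta1 = 1/2 is the phase transition of F_X(n).
```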

3.3 Convergence area of the posterior distribution

The asymptotic form of the free energy determines the limit structure of the posterior distribution. In this subsection, we will show that the convergence area is the effective parameter area.

The free energy \(F(X^n)\) has an asymptotic form similar to the average energy \(F_X(n)\) (Watanabe 2009, Main Formula II),

$$\begin{aligned} F(X^n) = nS(X^n)+\lambda _X \ln n -(m_X-1)\ln \ln n + O_p(1), \end{aligned}$$
(7)

where \(S(X^n)=-\frac{1}{n}\sum _{i=1}^n\ln q(x_i)\) is the empirical entropy. According to \(Z(X^n) = \exp (-F(X^n))\), the posterior distribution has the expression,

$$\begin{aligned} p\left( w|X^n\right) = \frac{\prod _{i=1}^n p(x_i|w)\varphi (w;\eta )}{\exp \left\{ -nS(X^n)-\lambda _X \ln n +o_p(\ln n)\right\} }. \end{aligned}$$

Let us divide the neighborhood of \(W_X^t\) into \(W_e \cup W_o\), where \(W_e\) is the effective area. The largest pole of the zeta function restricted to the remaining area \(W_o\) is then some \(z=-\mu _X\) with \(\mu _X>\lambda _X\), and the posterior mass of \(W_o\) is described by

$$\begin{aligned} p(W_o|X^n)&= \int _{W_o} p(w|X^n) dw\\&= \frac{\int _{W_o}\prod _{i=1}^n p(x_i|w)\varphi (w;\eta )dw}{\exp \{-nS(X^n)-\lambda _X \ln n +o_p(\ln n)\}}\\&= \frac{\exp \{-nS(X^n)-\mu _X \ln n +o_p(\ln n)\}}{\exp \{-nS(X^n)-\lambda _X \ln n +o_p(\ln n)\}}\\&= n^{-\mu _X+\lambda _X} +o_p(n^{-\mu _X+\lambda _X}). \end{aligned}$$

The posterior mass of \(W_o\) asymptotically vanishes, which means that the posterior distribution concentrates on the effective area.

According to Lemma 1, the effective area depends on the hyperparameter. Therefore, the convergence area changes at the phase transition point \(\eta _1=1/2\). The effective area also shows how the learning model realizes the true one. In \(W_1\cup W_3\), the true model is expressed by a one-component model, which means that the redundant component is eliminated. On the other hand, all components of the learning model are used in \(W_2\).

The phase transition is observed in general mixture models;

Theorem 1

Let a learning model and the true one be expressed as Eqs. (1) and (2), respectively. When the prior of the mixing ratio is the Dirichlet distribution of Eq. (4), the average free energy \(F_X(n)\) has at least two phases: the phase that eliminates all redundant components when \(\eta _1\) is small, and the one that uses them when \(\eta _1\) is sufficiently large.

The proof is in Appendix 3.

4 Formal definition of the latent-variable estimation and its accuracy

This section formulates the Bayes latent-variable estimation and an error function that measures its accuracy.

We first consider a detailed definition of a latent variable. Let \(Y^n=\{y_1,\ldots ,y_n\}\) be unobservable data, which correspond to the latent parts of the observable \(X^n\). Then, the complete form of the data is \((x_i,y_i)\), and \((X^n,Y^n)\) and \(X^n\) are referred to as complete and incomplete data, respectively. The true model generates the complete data \((X^n,Y^n)\), where the range of the latent variables is \(y_i\in \{1,\ldots ,K^*\}\). The learning model, on the other hand, has the range \(y_i\in \{1,\ldots ,K\}\). For a unified description, we define that the true model has probabilities \(q(y)=0\) and \(q(x,y)=0\) for \(y>K^*\).

We define the true parameter set for \((x,y)\) as

$$\begin{aligned} W_{XY}^t = \{w^*;p(x,y|w^*)=q(x,y)\}, \end{aligned}$$

which is a proper subset of \(W_X^t\). In Example 2,

$$\begin{aligned} W_{XY}^t = \{a=1,b_1=b^*\}=W^t_1\subset W^t_X. \end{aligned}$$

The subsets \(W^t_2=\{b_1=b_2=b^*\}\) and \(W^t_3=\{a=0,b_2=b^*\}\) of \(W_X^t\) are excluded since \(W_{XY}^t\) takes into account the representation with respect to not only \(x\) but also \(y\). Due to the assumption \(W_X^t\ne \emptyset \), \(W_{XY}^t\) is not empty. The set \(W_{XY}^t\) again consists of an analytic set in the singular case, and it is a single point in the regular case.

While latent-variable estimation falls into various types according to the target of the estimation, the present paper focuses on the Type-I estimation of Yamazaki (2014): the joint probability of \((y_1,\ldots ,y_n)\) is the target and is written as \(p(Y^n|X^n)\). The Bayes estimation has two equivalent definitions:

$$\begin{aligned} p(Y^n|X^n)&= \int \prod _{i=1}^n \frac{p(x_i,y_i|w)}{p(x_i|w)}p(w|X^n)dw \end{aligned}$$
(8)
$$\begin{aligned}&= \frac{Z(X^n,Y^n)}{Z(X^n)}, \end{aligned}$$
(9)

where the marginal likelihood for the complete data is given by

$$\begin{aligned} Z(X^n,Y^n) = \int \prod _{i=1}^n p(x_i,y_i|w)\varphi (w;\eta )dw. \end{aligned}$$

It is easily confirmed that \(Z(X^n)=\sum _{Y^n} Z(X^n,Y^n)\).
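This identity can be checked numerically for a tiny data set. The sketch below uses a two-component Bernoulli mixture with uniform priors, both illustrative assumptions; because the same prior draws are reused for every integral, the two Monte Carlo estimates coincide up to floating-point error.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

X = np.array([1, 0, 1])           # tiny observable data set
n, S = len(X), 20000              # S prior draws for the Monte Carlo integrals
a_s = rng.uniform(size=S)         # uniform prior draws for the mixing ratio a
b_s = rng.uniform(size=(S, 2))    # uniform prior draws for (b_1, b_2)

def joint(x, y):
    """p(x, y | w) for all prior draws: mixing weight times Bernoulli component."""
    weight = a_s if y == 0 else 1.0 - a_s
    return weight * b_s[:, y] ** x * (1.0 - b_s[:, y]) ** (1 - x)

def Z_complete(Y):
    """Monte Carlo estimate of Z(X^n, Y^n)."""
    return np.prod([joint(x, y) for x, y in zip(X, Y)], axis=0).mean()

Z_X = np.prod([joint(x, 0) + joint(x, 1) for x in X], axis=0).mean()
Z_X_sum = sum(Z_complete(Y) for Y in itertools.product([0, 1], repeat=n))
print(Z_X, Z_X_sum)               # Z(X^n) equals the sum of Z(X^n, Y^n) over Y^n
```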

The true probability of \(Y^n\) is uniquely given by

$$\begin{aligned} q(Y^n|X^n) = \frac{q(X^n,Y^n)}{q(X^n)} = \prod _{i=1}^n \frac{q(x_i,y_i)}{q(x_i)}. \end{aligned}$$
(10)

The accuracy of the estimation is measured by the difference between \(q(Y^n|X^n)\) and \(p(Y^n|X^n)\). Thus, we define the error function as the average Kullback–Leibler divergence,

$$\begin{aligned} D(n) = \frac{1}{n}E_{XY}\bigg [ \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n)} \bigg ], \end{aligned}$$
(11)

where the expectation is defined as

$$\begin{aligned} E_{XY}\left[ f(X^n,Y^n)\right] = \int \sum _{y_1=1}^{K}\cdots \sum _{y_n=1}^{K} f(X^n,Y^n) q(X^n,Y^n)dX^n. \end{aligned}$$

5 Asymptotic analysis of the error function

In this section, we show that the algebraic geometrical analysis is applicable to the Sing-LV estimation, and present the asymptotic form of the error function \(D(n)\).

5.1 Conditions for the analysis

Before showing the asymptotic form of the error function, we state necessary conditions.

Let us define the zeta function on the complete data \((x,y)\) as

$$\begin{aligned} \zeta _{XY}(z) = \int H_{XY}(w)^z\varphi (w;\eta )dw, \end{aligned}$$

where the Kullback–Leibler divergence \(H_{XY}(w)\) is given by

$$\begin{aligned} H_{XY}(w) = \sum _{y=1}^{K}\int q(x,y)\ln \frac{q(x,y)}{p(x,y|w)}dx. \end{aligned}$$

Let the largest pole of \(\zeta _{XY}(z)\) be \(z=-\lambda _{XY}\), and let its order be \(m_{XY}\).

We consider the following conditions:

  (A1) The divergence functions \(H_{XY}(w)\) and \(H_X(w)\) are analytic.

  (A2) The prior distribution has a compact support, which includes \(W_X^t\), and has the expression \(\varphi (w;\eta )=\psi _1(w;\eta )\psi _2(w;\eta )\), where \(\psi _1(w;\eta )>0\) is a function of class \(C^{\infty }\) and \(\psi _2(w;\eta )\ge 0\) is analytic on the support of \(\varphi (w;\eta )\).

They correspond to the Fundamental Conditions I and II in Watanabe (2009), respectively. It is known that models with discrete \(x\), such as the binomial mixture, satisfy (A1) (Yamazaki et al. 2010). On the other hand, if \(x\) is continuous, there are some models for which \(H_X(w)\) is not analytic;

Example 4

(Example 7.3 in Watanabe 2009) In the Gaussian mixture, \(H_X(w)\) is not analytic, which means that the mixture model does not satisfy (A1). Let us consider a simple case, \(K=2\) and \(K^*=1\), where the true model and the learning model are given by

$$\begin{aligned} q(x)&= f(x|0),\\ p(x|a)&= a f(x|2) +(1-a)f(x|0), \end{aligned}$$

respectively, where \(x \in R^1\),

$$\begin{aligned} f(x|b)&= \frac{1}{\sqrt{2\pi }}\exp \bigg \{-\frac{(x-b)^2}{2}\bigg \}, \end{aligned}$$

and \(b\in W^1=R^1\). Then,

$$\begin{aligned} H_X(a)&= \int q(x) \ln \frac{q(x)}{p(x|a)}dx \\&= - \int q(x) \ln \{1+a(\exp (2x-2)-1)\}dx \\&= \int \sum _{j=1}^\infty \frac{a^j}{j}(1-\exp (2x-2))^j q(x)dx, \end{aligned}$$

where the last expression is a formal expansion. Since its radius of convergence at \(a=0\) is zero, \(H_X(a)\) is not analytic. In a similar way, we can find that \(H_X(w)\) is not analytic in a general Gaussian mixture.

The following example shows a prior distribution for the mixture model satisfying (A2).

Example 5

The symmetric Dirichlet distribution satisfies condition (A2) because Eq. (4) is analytic and nonnegative on its support. Choosing an analytic distribution for \(\varphi (b;\eta _2)\), we obtain a prior \(\varphi (w;\eta )\) satisfying condition (A2).

5.2 Asymptotic form of the error function

Now, we show the main theorem on the asymptotic form of the error function:

Theorem 2

Let the true distribution of the latent variables and the estimated distribution be defined by Eqs. (10) and (9), respectively. Under conditions (A1) and (A2), the asymptotic form of \(D(n)\) is expressed as

$$\begin{aligned} D(n) = (\lambda _{XY}-\lambda _X)\frac{\ln n}{n} - (m_{XY}-m_X)\frac{\ln \ln n}{n} + o\bigg (\frac{\ln \ln n}{n}\bigg ). \end{aligned}$$

The proof is in Appendix 1. The theorem indicates that the algebraic geometrical method plays an essential role in the analysis of the Sing-LV estimation because the coefficients are determined by information from the zeta functions, namely \(\lambda _{XY}\), \(\lambda _X\), \(m_{XY}\), and \(m_X\). The order \(\ln n/n\) does not appear in the Reg-LV estimation. In the Reg-LV estimation, where \(K=K^*\), the asymptotic error function has the following form (Yamazaki 2014);

$$\begin{aligned} D(n)&= \frac{1}{n}{\mathrm {Tr}}\left[ I_{XY}(w^*)I_X(w^*)^{-1}\right] + o\bigg (\frac{1}{n}\bigg ),\\ \{I_{XY}(w)\}_{ij}&= \sum _{y=1}^K \int \frac{\partial \ln p(x,y|w)}{\partial w_i}\frac{\partial \ln p(x,y|w)}{\partial w_j} p(x,y|w)dx,\\ \{I_X(w)\}_{ij}&= \int \frac{\partial \ln p(x|w)}{\partial w_i}\frac{\partial \ln p(x|w)}{\partial w_j} p(x|w) dx, \end{aligned}$$

where \(w^*\) is the unique point constituting \(W_{XY}^t\). The dominant order is \(1/n\), and the coefficient is determined by the Fisher information matrices of \(p(x,y|w)\) and \(p(x|w)\). Theorem 2 implies that the largest possible order is \(\ln n/ n\) in the Sing-LV estimation. This change of order is adverse for the performance because the error converges to zero more slowly. In singular cases, the probability \(p(Y^n|X^n)\) is constructed over the space \(\{1,\ldots ,K\}^n\) while the true probability \(q(Y^n|X^n)\) is supported on \(\{1,\ldots ,K^*\}^n\). The size of the redundant space, \(K^n-K^{*n}\), grows exponentially with the amount of training data. For realizing \(p(x,y|w^*)\), where \(w^* \in W_{XY}^t\), we must assign zero to the probabilities on this vast redundant space. The increased order reflects the cost of assigning these values.

Let us compare the dominant order of \(D(n)\) with that of the generalization error. Both the Reg-OV and Sing-OV estimations have the same dominant order \(1/n\), as shown in Eq. (6), while the redundancy and the hyperparameter affect the coefficients. Thus, the change of the order is a phenomenon unique to the latent-variable estimation.

5.3 Asymptotic error in the mixture model

In Theorem 2, the largest possible dominant order was calculated as \(\ln n/n\). However, there is no guarantee that this order is actually attained; the order can decrease to \(1/n\) if the coefficients are zero, that is, if the zeta functions \(\zeta _{XY}(z)\) and \(\zeta _X(z)\) have their largest poles at the same position with the same order. The result of the following theorem clearly shows that the dominant order is \(\ln n/n\) in mixture models.

Theorem 3

Let the learning and the true models be mixtures defined by Eqs. (1) and (2), respectively. Assume the conditions (A1) and (A2). The Bayes estimation for the latent variables, Eq. (9), with the prior represented by Eqs. (3) and (4) has the following bound for the asymptotic error:

$$\begin{aligned} D(n) \ge \frac{(K-K^*)\eta _1}{2}\frac{\ln n}{n} + o\bigg (\frac{\ln n}{n}\bigg ). \end{aligned}$$

The proof is in Appendix 1. Due to the definition of the Dirichlet distribution, \(\eta _1\) is positive. Combining this with the assumption \(K^*<K\), we obtain that the coefficient of \((\ln n)/n\) is positive, which indicates that it is the dominant order.

The Dirichlet prior distribution for the mixing ratio is qualitatively known to control the number of available components, the so-called automatic relevance determination (ARD); a small hyperparameter tends to yield a result with few components due to the shape of the distribution. Theorem 3 quantitatively shows an effect of the Dirichlet prior. The lower bound in the theorem mathematically supports the ARD effect; the redundancy \(K-K^*\) and the hyperparameter \(\eta _1\) have a linear influence on the accuracy.
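For instance, plugging one redundant component into the bound gives

$$\begin{aligned} K=2,\ K^*=1:\qquad D(n) \ge \frac{\eta _1}{2}\frac{\ln n}{n} + o\bigg (\frac{\ln n}{n}\bigg ), \end{aligned}$$

so halving \(\eta _1\) halves the guaranteed lower bound, in line with the ARD interpretation.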

Theorem 3 holds for a wider class of mixture models since the error is evaluated through a lower bound. The following corollary shows that the Gaussian mixture has the same bound for the error even though it does not satisfy (A1), as shown in Example 4.

Corollary 1

Assume that in a mixture model, \(H_{XY}(w)\) is analytic, and the prior distribution for the mixing ratio is the symmetric Dirichlet distribution. If there is a positive constant \(C_1\) such that

$$\begin{aligned} H_X(w) \le C_1\int \bigg (\frac{p(x|w)}{q(x)}-1\bigg )^2 dx, \end{aligned}$$

the error function has the same lower bound as in Theorem 3. In the Gaussian mixture, whose components are defined by

$$\begin{aligned} f(x|b) = \frac{1}{\sqrt{2\pi }^M}\exp \bigg \{-\frac{||x-b||^2}{2}\bigg \}, \end{aligned}$$

where \(x\in R^M\) and \(b\in W^M=R^M\), \(H_{XY}(w)\) is analytic and the inequality holds.

The proof is in Appendix 1.

6 Discussion

Theorem 2 shows that the asymptotic error has the coefficient \(\lambda _{XY}-\lambda _X\), which is the difference between the largest poles of the zeta functions. Based on the free energy of the complete data, defined as \(F(X^n,Y^n)=-\ln Z(X^n,Y^n)\), we find that the error is determined by the difference in properties between \(F(X^n,Y^n)\) and \(F(X^n)\), since their asymptotic forms are expressed as

$$\begin{aligned} F(X^n,Y^n)&= nS(X^n,Y^n) + \lambda _{XY} \ln n -(m_{XY}-1)\ln \ln n +O_p(1),\\ F(X^n)&= nS(X^n) +\lambda _X \ln n -(m_X-1)\ln \ln n + O_p(1), \end{aligned}$$

where \(S(X^n,Y^n) = -\frac{1}{n}\sum _{i=1}^n \ln q(x_i,y_i)\).

In this section, we examine the properties of \(F(X^n,Y^n)\) and show that the differences from those of \(F(X^n)\) affect the behavior of the Sing-LV estimation and the parameter sampling from the posterior distribution.

6.1 The effect of eliminating redundant labels

According to Eq. (9), the MCMC sampling of the \(Y^n\)’s following \(p(Y^n|X^n)\) is essential for the Bayes estimation. The following relation indicates that we do not need to calculate \(Z(X^n)\) and that the value of \(Z(X^n,Y^n)\) determines the properties of the estimation:

$$\begin{aligned} p(X^n,Y^n)=Z(X^n,Y^n) \propto p(Y^n|X^n) = \frac{Z(X^n,Y^n)}{Z(X^n)}. \end{aligned}$$
(12)

The expression for \(p(X^n,Y^n)\) can be made tractable with a conjugate prior, which allows the parameter integral to be marginalized out in closed form (Dawid and Lauritzen 1993; Heckerman 1999).
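As a sketch of such a closed form, consider a Bernoulli mixture with the symmetric Dirichlet prior of Eq. (4) on the mixing ratios and an assumed conjugate Beta(\(\eta _2,\eta _2\)) prior on each \(b_k\); marginalizing the parameters then expresses \(Z(X^n,Y^n)\) as a product of a Dirichlet-multinomial term and Beta-Bernoulli terms.

```python
import numpy as np
from scipy.special import gammaln, betaln

def log_Z_complete(X, Y, K, eta1, eta2):
    """Closed-form log Z(X^n, Y^n) for a Bernoulli mixture with conjugate priors:
    symmetric Dirichlet(eta1) on the mixing ratios, Beta(eta2, eta2) on each b_k."""
    X, Y = np.asarray(X), np.asarray(Y)
    n = len(X)
    n_k = np.array([(Y == k).sum() for k in range(K)])        # label counts
    # Dirichlet-multinomial term from integrating out the mixing ratios.
    log_z = gammaln(K * eta1) - K * gammaln(eta1) \
        + gammaln(n_k + eta1).sum() - gammaln(n + K * eta1)
    # Beta-Bernoulli term from integrating out each b_k.
    for k in range(K):
        s_k = X[Y == k].sum()                                 # successes in component k
        log_z += betaln(s_k + eta2, n_k[k] - s_k + eta2) - betaln(eta2, eta2)
    return log_z

# By Eq. (12), p(Y^n | X^n) is proportional to exp(log_Z_complete(X, Y, K, eta1, eta2)).
```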

We determine where the estimated distribution \(p(Y^n|X^n)\) has its peak. Obviously, the label assignment \(Y^n\) minimizing \(F(X^n,Y^n)\) provides the peak due to the definition \(F(X^n,Y^n)=-\ln Z(X^n,Y^n)\) and Eq. (12). Let this assignment be described as \(\bar{Y}^n\);

$$\begin{aligned} \bar{Y}^n = \arg \max _{Y^n} p(Y^n|X^n) = \arg \min _{Y^n} F(X^n,Y^n). \end{aligned}$$

The following discussion shows that \(\bar{Y}^n\) does not include the redundant labels.

We have to consider the symmetry of the latent variable in order to discuss the peak. In latent-variable models, both the latent variable and the parameter are symmetric. In Example 2, the component \(f(x|b^*)\) of the true model can be attained by the first component \(a_1f(x|b_1)\) or the second one \((1-a_1)f(x|b_2)\) of the learning model. Because the true label \(y=1\), which the true model provides, is unobservable, there are two proper estimation results \(Y^n=\{1,\ldots ,1\}\) and \(Y^n=\{2,\ldots ,2\}\) to indicate that the true model consists of one component. This is the symmetry of the latent variable. In the parameter space, it corresponds to the symmetric structure of \(W_1\) and \(W_3\) shown in Fig. 1. The symmetry makes it difficult to interpret the estimation results, which is known as the label-switching problem.

For the purpose of the theoretical evaluation, the definition of the error function \(D(n)\) selects the true assignment of the latent variable. In the above example, only \(Y^n=\{1,\ldots ,1\}\) is accepted as the proper result. However, there is no selection of the true assignment in the estimation process; other symmetric assignments such as \(Y^n=\{2,\ldots ,2\}\) can also be the peak of \(p(Y^n|X^n)\). Then, the true parameter area \(W^t_{XY}\) is not sufficient to describe the peak. Taking account of the symmetry, we define another analytic set of parameters as

$$\begin{aligned} W_{XY}^p = \cup _{\sigma \in \varSigma }\left\{ w; a_{\sigma (k)}=a^*_k, b_{\sigma (k)}=b^*_k \quad \text {for}\quad 1\le k\le K^*\right\} , \end{aligned}$$

where \(\varSigma \) is the set of injective functions from \(\{1,\ldots ,K^*\}\) to \(\{1,\ldots ,K\}\). It is easy to confirm that \(W^t_{XY}\subset W_{XY}^p\). In Example 2, \(W^t_{XY}=W^t_1\subset W^t_1\cup W^t_3=W_{XY}^p\). Note that the redundant components are eliminated in \(p(x|w^*)\), where \(w^*\in W^p_{XY}\).

Let us analyze the location of the peak. Define

$$\begin{aligned} S'(X^n,Y^n) = -\frac{1}{n} \sum _{i=1}^n \ln p(x_i,y_i|w^*), \end{aligned}$$

where \(w^* \in W^p_{XY}\). Switching the labels based on the symmetry, we can easily prove that \(\min _{w^*,Y^n}S'(X^n,Y^n)=\min _{Y^n}S(X^n,Y^n)\). Moreover, \(-\frac{1}{n}\sum _{i=1}^n \ln p(x_i,y_i|w)\) with \(w\in W_X^t \setminus W_{XY}^p\), such as \(w\in W^t_2\) in Example 2, cannot attain this minimum, as the simple calculation in the next paragraph shows. Because the leading term of the asymptotic \(F(X^n,Y^n)\) is \(nS(X^n,Y^n)\) and \(nS'(X^n,\bar{Y}^n)\) realizes it, the peak assignment \(\bar{Y}^n\) follows a parameter \(w^* \in W^p_{XY}\). Recalling that the redundant components are eliminated when \(w\in W^p_{XY}\), we can conclude that the redundant labels are eliminated in \(\bar{Y}^n\). This elimination occurs for any prior distribution whose support includes \(W^p_{XY}\).

Let us confirm the elimination in Example 2. We consider three parameters; \(w^*_1\in W^t_1=\{a=1,b_1=b^*\}\), \(w^*_2\in W^t_2=\{b_1=b_2=b^*\}\) and \(w^*_3\in W^t_3=\{a=0,b_2=b^*\}\). The leading term of the asymptotic \(F(X^n,Y^n)\) is expressed as

$$\begin{aligned} nS'_j(X^n,Y^n) = -\sum _{i=1}^n \ln p(x_i,y_i|w^*_j) \end{aligned}$$

for \(j=1,2,3\). This is rewritten as

$$\begin{aligned} nS'_j(X^n,Y^n)&= -\sum _{i=1}^n \delta _{y_i,1}\ln a - \sum _{i=1}^n \delta _{y_i,2}\ln (1-a)\nonumber \\&\quad - \sum _{i=1}^n \delta _{y_i,1}\ln f(x_i|b_1) - \sum _{i=1}^n \delta _{y_i,2}\ln f(x_i|b_2), \end{aligned}$$
(13)

where \(\delta _{i,j}\) is the Kronecker delta. The assignment \(\bar{Y}^n\) depends on \(w^*_j\). For example, \(\bar{Y}^n=\{1,\ldots ,1\}\) for \(w^*_1\) and \(\bar{Y}^n=\{2,\ldots ,2\}\) for \(w^*_3\). Then, we obtain that

$$\begin{aligned} nS'_j(X^n,Y^n) = {\left\{ \begin{array}{ll} -\sum \nolimits _{i=1}^n \ln f(x_i|b^*) &{} j=1\\ -N_1\ln a -N_2\ln (1-a) - \sum \nolimits _{i=1}^n \ln f(x_i|b^*) &{} j=2\\ -\sum \nolimits _{i=1}^n \ln f(x_i|b^*) &{} j=3, \end{array}\right. } \end{aligned}$$

where \(N_1=\sum _{i=1}^n \delta _{y_i,1}\) and \(N_2=\sum _{i=1}^n \delta _{y_i,2}\). The cases \(j=1\) and \(j=3\) have the same value, and the value for \(j=2\) is larger than the others due to the first two terms in Eq. (13), which are positive for any \(0<a<1\) in \(W^t_2\). This means that \(W^t_2\) cannot make \(p(Y^n|X^n)\) maximum. In other words, an assignment \(Y^n\) using both labels \(1\) and \(2\) is not the peak.

6.2 Two approaches to calculate \(p(Y^n|X^n)\) and their difference

It is necessary to emphasize that the calculation of \(p(Y^n|X^n)\) based on sampling from \(p(w|X^n)\) following Eq. (8) can be inaccurate. According to Theorem 1 and Eq. (7), \(F(X^n)\) has a phase transition in mixture models due to the hyperparameter of the Dirichlet prior. This means that, when the hyperparameter \(\eta _1\) is large, the Monte Carlo samples are drawn from the area in which all the components are used, such as \(W^t_2\). In the numerical computation, the integrand of Eq. (8) will be close to \(\prod _{i=1}^n p(x_i,y_i|w^*_2)/p(x_i|w^*_2)\), where \(w^*_2\in W^t_2\). Because \(w^*_2\in W^t_2 \subset W^t_X\),

$$\begin{aligned} \prod _{i=1}^n \frac{p(x_i,y_i|w^*_2)}{p(x_i|w^*_2)}&= \exp \bigg \{ \sum _{i=1}^n \ln p(x_i,y_i|w^*_2) - \sum _{i=1}^n \ln p(x_i|w^*_2)\bigg \}\\&= \exp \big \{ -nS'_2(X^n,Y^n) + nS(X^n) \big \}. \end{aligned}$$

On the other hand, based on Eq. (9), the desired value of \(p(Y^n|X^n)\) is calculated as

$$\begin{aligned} \frac{Z(X^n,Y^n)}{Z(X^n)}&= \exp \bigg \{ F(X^n)-F(X^n,Y^n)\bigg \}\\&= \exp \{-nS(X^n,Y^n) + nS(X^n)\} +o(\exp (-n)). \end{aligned}$$

Since \(S'_2(X^n,Y^n)>S(X^n,Y^n)\), the numerically computed value of Eq. (8) is much smaller than the desired value given by Eq. (9). Therefore, the result of the numerical integration in Eq. (8) is almost zero. The parameter area providing a non-zero value of the integrand in Eq. (8) is located in the tail of the posterior distribution when \(p(w|X^n)\) converges to \(W^t_X\setminus W^p_{XY}\).

6.3 Failure of parameter sampling from the posterior distribution

The previous subsection showed that parameter sampling from the posterior distribution can have an adverse effect on the calculation of the distribution of the latent variables. Here, conversely, we show that latent-variable sampling can construct an undesired posterior distribution.

There are methods to sample a sequence of \(\{w,Y^n\}\) from \(p(w,Y^n|X^n)\). Ignoring \(Y^n\), we obtain the sequence \(\{w\}\). The Gibbs sampling in the MCMC method (Robert and Casella 2005) is one of the representative techniques.

[Gibbs Sampling for a Model with a Latent Variable]

  1. Initialize the parameter;

  2. Sample \(Y^n\) based on \(p(Y^n|w,X^n)\);

  3. Sample \(w\) based on \(p(w|Y^n,X^n)\);

  4. Iterate by alternately updating Step 2 and Step 3.

The sequence of \(\{w,Y^n\}\) obtained by this algorithm follows \(p(w,Y^n|X^n)\). The extracted parameter sequence \(\{w\}\) is assumed to consist of samples from the posterior because \(p_G(w|X^n)=\sum _{Y^n} p(w,Y^n|X^n)\) is theoretically equal to \(p(w|X^n)\). However, in mixture models, the practical value of \(p_G(w|X^n)\) based on the Monte Carlo method can be different from that of the original posterior \(p(w|X^n)\) when the hyperparameter for the mixing ratio, \(\eta _1\), is large.
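A minimal sketch of this Gibbs sampler for a Bernoulli mixture is given below; the Bernoulli components and the conjugate Beta prior on each \(b_k\) are illustrative assumptions. The labels \(Y^n\) are kept in the output so that the parameter sequence \(\{w\}\) can be inspected separately.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_mixture(X, K=2, eta1=1.0, eta2=1.0, iters=2000):
    """Gibbs sampler for (w, Y^n): Dirichlet(eta1) prior on the mixing ratios,
    Beta(eta2, eta2) priors on the Bernoulli parameters b_k."""
    X = np.asarray(X)
    n = len(X)
    a = np.full(K, 1.0 / K)                  # Step 1: initialize the parameter
    b = rng.uniform(size=K)
    samples = []
    for _ in range(iters):
        # Step 2: sample Y^n from p(Y^n | w, X^n), independently over i.
        lik = b[None, :] ** X[:, None] * (1.0 - b[None, :]) ** (1 - X[:, None])
        post = a[None, :] * lik
        post /= post.sum(axis=1, keepdims=True)
        Y = np.array([rng.choice(K, p=post[i]) for i in range(n)])
        # Step 3: sample w from p(w | Y^n, X^n) using conjugacy.
        n_k = np.array([(Y == k).sum() for k in range(K)])
        s_k = np.array([X[Y == k].sum() for k in range(K)])
        a = rng.dirichlet(n_k + eta1)
        b = rng.beta(s_k + eta2, n_k - s_k + eta2)
        samples.append((a.copy(), b.copy(), Y.copy()))
    return samples
```

In the situation discussed below, the \(\{w\}\) part of this sequence tends to stay near \(W^p_{XY}\) even when the original posterior \(p(w|X^n)\) concentrates elsewhere.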

Let us consider the expression

$$\begin{aligned} -\ln p(X^n,Y^n,w)&= -\ln \prod _{i=1}^n \prod _{k=1}^K a_k^{\delta _{y_i,k}}f(x_i|b_k)^{\delta _{y_i,k}} -\ln \varphi (w;\eta )\\&= - \sum _{i=1}^n\sum _{k=1}^K \delta _{y_i,k} \ln a_k - \sum _{i=1}^n \sum _{k=1}^K \delta _{y_i,k}\ln f(x_i|b_k) -\ln \varphi (w;\eta ). \end{aligned}$$

We determine the location of a pair \((\bar{w},\bar{Y}^n)\) that minimizes this expression in the asymptotic case \(n\rightarrow \infty \), because the relation \(p(X^n,Y^n,w)\propto p(w,Y^n|X^n)\) indicates that the sequence \(\{w,Y^n\}\) is mainly taken from the neighborhood of such a pair. The third term of the last expression does not have any asymptotic effect because it is of constant order in \(n\). The first two terms have the same expression as Eq. (13). Based on the calculation of \(S'_j(X^n,Y^n)\), \(\bar{w}\in W^p_{XY}\) and \(\bar{Y}^n=\arg \max _{Y^n} p(X^n,Y^n,\bar{w})\). Therefore, the practical value of \(p_G(w|X^n)\) is calculated from the sequence \(\{w\}\) around \(W_{XY}^p\) for any \(\eta _1\), while the convergence area of the original \(p(w|X^n)\) depends on the phase of \(F(X^n)\) controlled by \(\eta _1\).

In Example 2, the posterior \(p(w|X^n)\) converges to \(W^t_2\) when \(\eta _1\) is large. On the other hand, the sampled sequence based on \(p(X^n,Y^n,w)\) is mainly drawn from \(W_1\cup W_3\) since \(S'_2(X^n,\bar{Y}^n_2)>S'_1(X^n,\bar{Y}^n_1)=S'_3(X^n,\bar{Y}^n_3)\), where \(\bar{Y}^n_j\) stands for the assignment minimizing \(S'_j(X^n,Y^n)\). In order to construct a sequence \(\{w\}\) following \(p(w|X^n)\), we need samples \((w,Y^n)\) with \(w\in W_2\) and \(Y^n=\bar{Y}^n_2\), which are located in the tail of \(p(w,Y^n|X^n)\). In theory, the sequence \(\{w\}\) from \(p(w,Y^n|X^n)\) realizes the one from \(p(w|X^n)\). However, in practice, it is not straightforward to obtain \(\{w,Y^n\}\) from the tail of \(p(w,Y^n|X^n)\). This property of the Gibbs sampling has been reported in a Gaussian mixture model (Nagata and Watanabe 2009). The experimental results show that the obtained sequence of \(\{w\}\) is localized in the area corresponding to \(W_{XY}^p\). Note that there is no failure of the MCMC method when \(\eta _1\) is sufficiently small, in which case the peaks of \(p(w|X^n)\) and \(p(w,Y^n|X^n)\) are in the same area. Thus, to judge the reliability of the MCMC sampling, we have to know the phase transition point, such as \(\eta _1=1/2\) in Lemma 1.

7 Conclusions

The present paper clarifies the asymptotic accuracy of the Bayes latent-variable estimation. The dominant order is at most \(\ln n/n\), and its coefficient is determined by the positional relation between the largest poles of the zeta functions. The mixture-model case suggests that this order is indeed dominant and that the coefficient is affected by the redundancy of the learning model and by the hyperparameters. The accuracy of prediction can be approximated by methods such as cross-validation and the bootstrap. On the other hand, there is no such approximation for the accuracy of latent-variable estimation, which indicates that the theoretical result plays a central role in evaluating the model and the estimation method.