1 Introduction

In this paper, we focus on the problem of parameter estimation for discrete probabilistic models. A typical approach is Maximum Likelihood Estimation (MLE); the MLE is a “good” estimator in the sense that it asymptotically attains the Cramér-Rao bound and is asymptotically efficient. In general, an explicit solution for the MLE cannot be obtained, and gradient-based optimization methods are usually required. However, the gradient involves the normalization constant that makes the model a member of the probability space, and computing this constant is often intractable when the model lives in a high-dimensional space. A typical example is the Boltzmann machine on \(\mathcal {X}=\{+1,-1\}^p\),

$$\begin{aligned} \frac{\exp \left( \varvec{\theta }_1^T\varvec{x}+\frac{1}{2}\varvec{x}^T\varvec{\theta }_2\varvec{x}\right) }{\sum _{\varvec{x}\in \mathcal {X}}\exp \left( \varvec{\theta }_1^T\varvec{x}+\frac{1}{2}\varvec{x}^T\varvec{\theta }_2\varvec{x}\right) } \end{aligned}$$
(1)

and calculation of its normalization constant requires \(2^p\) summations, which becomes infeasible as p grows. Other estimators derived from minimization of divergence measures [2] suffer from the same computational problem. To tackle the problem associated with the normalization constant, various approaches have been studied. Some methods are based on Markov Chain Monte Carlo (MCMC) sampling; contrastive divergence [7] is a well-known example. Another approach approximates the target probabilistic model by a tractable model via the mean-field approximation, which assumes independence of the variables [11]. In this paper, we focus on an approach that considers an unnormalized model rather than the (normalized) probabilistic model. [8] defines information of a “neighbor” by contrasting the probability of an example with that of an example with a flipped variable, which makes it possible to omit the calculation of the normalization constant. [4] proposed generalized local scoring rules on discrete sample spaces, and [6] avoids the calculation of the normalization constant using a trick with auxiliary examples. [13] proposes an asymptotically efficient estimator free of the normalization constant, built from the concept of empirical localization and the homogeneous \(\gamma \)-divergence [5, 10]. In this paper, we extend the concept of empirical localization and propose a novel estimator that does not require calculation of the normalization constant. We investigate statistical properties of the proposed estimator and verify its validity with small experiments.
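To make the bottleneck concrete, the following brute-force computation of the log normalization constant of the Boltzmann machine (1) enumerates all \(2^p\) states. This is our own illustrative sketch (the function name is ours, not from the paper):

```python
import itertools
import numpy as np

def boltzmann_log_Z(theta1, theta2):
    """Brute-force log normalization constant of the Boltzmann machine (1).

    Enumerates all 2^p states of {+1,-1}^p, so the cost grows
    exponentially in p -- the intractability discussed in the text.
    """
    p = len(theta1)
    log_terms = []
    for x in itertools.product([+1, -1], repeat=p):
        x = np.array(x, dtype=float)
        log_terms.append(theta1 @ x + 0.5 * x @ theta2 @ x)
    log_terms = np.array(log_terms)
    # log-sum-exp for numerical stability
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))
```

For \(p\) beyond a few dozen the loop over \(2^p\) states is hopeless, which motivates the normalization-free estimators below.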

2 Settings

Let \(\varvec{x}\) be a d-dimensional vector in a discrete space \(\mathcal {X}\), such as \(\{+1,-1\}^d\) or \(\{1,2,\ldots \}^d\). For a function f on \(\mathcal {X}\), the bracket \(\left\langle f \right\rangle \) denotes the sum of f over \(\mathcal {X}\): \(\left\langle f \right\rangle =\sum _{\varvec{x}\in \mathcal {X}}f(\varvec{x})\). For a given dataset \(\mathcal {D}=\{\varvec{x}_i\}_{i=1}^n\), the empirical distribution \(\tilde{p}(\varvec{x})\) is defined as

$$\begin{aligned} \tilde{p}(\varvec{x})= {\left\{ \begin{array}{ll} \frac{n_{\varvec{x}}}{n} &{} \varvec{x} \text{ is } \text{ observed, }\\ 0 &{} \text{ otherwise, } \end{array}\right. } \end{aligned}$$
(2)

where \(n_{\varvec{x}}\) is the number of times \(\varvec{x}\) is observed in \(\mathcal {D}\). We consider a probabilistic model

$$\begin{aligned} \bar{q}_{\varvec{\theta }}(\varvec{x})=\frac{q_{\varvec{\theta }}(\varvec{x})}{Z_{\varvec{\theta }}} \end{aligned}$$
(3)

where \(q_{\varvec{\theta }}(\varvec{x})\) is an unnormalized model expressed as

$$\begin{aligned} q_{\varvec{\theta }}(\varvec{x})=\exp (\psi _{\varvec{\theta }}(\varvec{x})) \end{aligned}$$
(4)

with a function \(\psi _{\varvec{\theta }}(\varvec{x})\) parameterized by \(\varvec{\theta }\), and \(Z_{\varvec{\theta }}\) is the normalization constant \(Z_{\varvec{\theta }}=\left\langle q_{\varvec{\theta }} \right\rangle \), which makes (3) a probability function. Note that the unnormalized model (4) is not a probability function, \(\left\langle q_{\varvec{\theta }} \right\rangle =1\) does not hold in general, and calculation of the normalization constant \(Z_{\varvec{\theta }}\) often requires a high computational cost. Hence computation of the Maximum Likelihood Estimator (MLE)

$$\begin{aligned} \varvec{\hat{\theta }}_{MLE}= \mathop {\mathrm{argmax}}_{\varvec{\theta }}\sum _{i=1}^n\log \bar{q}_{\varvec{\theta }}(\varvec{x}_i) \end{aligned}$$
(5)

or maximization process of the log-likelihood using its gradient

$$\begin{aligned} \sum _{i=1}^n\left\{ \psi _{\varvec{\theta }}'(\varvec{x}_i)-\left\langle \bar{q}_{\varvec{\theta }}\psi _{\varvec{\theta }}' \right\rangle \right\} \end{aligned}$$
(6)

suffers from the computational difficulty caused by \(Z_{\varvec{\theta }}\). To overcome this difficulty, we consider a combination of the \(\gamma \)-divergence and a generalized mixture model.
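The cost hidden in the gradient (6) can be made explicit for a small Boltzmann machine without bias terms, \(\psi _{\varvec{\theta }}(\varvec{x})=\frac{1}{2}\varvec{x}^T\varvec{\theta }\varvec{x}\). The sketch below (our own code and naming, not from the paper) computes the exact gradient; note the sum over all \(2^d\) states needed for the model expectation:

```python
import itertools
import numpy as np

def mle_log_likelihood_grad(theta, data):
    """Exact gradient (6) for a bias-free Boltzmann machine with
    psi_theta(x) = 0.5 * x^T theta x (theta treated as a full matrix).

    The model expectation term requires a sum over all 2^d states,
    which is exactly the bottleneck discussed in the text.
    """
    n, d = data.shape
    states = np.array(list(itertools.product([+1, -1], repeat=d)), dtype=float)
    log_q = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    w = np.exp(log_q - log_q.max())
    w /= w.sum()                        # normalized model over all states
    model_stat = np.einsum('i,ij,ik->jk', w, states, states)
    data_stat = data.T @ data / n       # empirical second moment
    return 0.5 * n * (data_stat - model_stat)
```

At the uniform model (\(\varvec{\theta }=0\)) and with data covering all states equally, the gradient vanishes, as it should.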

2.1 \(\gamma \)-divergence

For two positive measures f, g, the \(\gamma \)-divergence [5] is defined as follows.

$$\begin{aligned} D_{\gamma }(f,g)= \frac{1}{1+\gamma }\log \left\langle f^{\gamma +1} \right\rangle + \frac{\gamma }{1+\gamma }\log \left\langle g^{\gamma +1} \right\rangle - \log \left\langle fg^{\gamma } \right\rangle \end{aligned}$$
(7)

where \(\gamma \) is a positive constant. Note that \(D_{\gamma }(f,g)\) is non-negative, and it is called a homogeneous divergence because \(D_{\gamma }(f,g)=0\) holds if and only if \(f\propto g\), rather than \(f=g\). In the limit \(\gamma \rightarrow 0\), the \(\gamma \)-divergence reduces to the usual KL-divergence,

$$\begin{aligned} {\left\langle f\log \frac{f}{g}-f+g \right\rangle .} \end{aligned}$$
(8)

Note that a combination of the \(\gamma \)-divergence and the unnormalized model does not by itself solve the computational problem, because the term \(D_{\gamma }(\tilde{p},q_{\varvec{\theta }})\) includes \(\left\langle q_{\varvec{\theta }}^{\gamma +1} \right\rangle \), whose computation has the same order of cost as the normalization constant \(Z_{\varvec{\theta }}\).
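Definition (7) on a finite support transcribes directly into code; the minimal sketch below (the function name is ours) also makes the homogeneity easy to check numerically:

```python
import numpy as np

def gamma_divergence(f, g, gamma):
    """Gamma-divergence (7) between two positive measures given as arrays
    of their values on a common finite support; <.> is a plain sum."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return (np.log(np.sum(f ** (1 + gamma))) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(np.sum(g ** (1 + gamma)))
            - np.log(np.sum(f * g ** gamma)))
```

For any positive constant c, `gamma_divergence(f, c * f, gamma)` evaluates to 0, reflecting that the divergence vanishes whenever \(f\propto g\).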

2.2 Empirical Localization

Firstly, we briefly introduce the concept of empirical localization of the (unnormalized) model \(\bar{q}_{\varvec{\theta }}(\varvec{x})\) (or \(q_{\varvec{\theta }}(\varvec{x})\)) with the empirical distribution \(\tilde{p}(\varvec{x})\) [13]. The empirical localization is interpreted as a generalized mean of \(q_{\varvec{\theta }}\) and \(\tilde{p}\), and lies in an e-flat subspace [1]:

$$\begin{aligned} \tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x}) =\frac{\tilde{p}(\varvec{x})^{\alpha }\bar{q}_{\varvec{\theta }}(\varvec{x})^{1-\alpha }}{\left\langle \tilde{p}^{\alpha }\bar{q}_{\varvec{\theta }}^{1-\alpha } \right\rangle } = \frac{\tilde{p}(\varvec{x})^{\alpha }q_{\varvec{\theta }}(\varvec{x})^{1-\alpha }}{\left\langle \tilde{p}^{\alpha }q_{\varvec{\theta }}^{1-\alpha } \right\rangle }. \end{aligned}$$
(9)

Note that the normalization constant \(Z_{\varvec{\theta }}\) in \(\bar{q}_{\varvec{\theta }}\) cancels out, so \(\tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})\) does not depend on \(Z_{\varvec{\theta }}\), except when \(\alpha =0\). Also note that the denominator \(\left\langle \tilde{p}^{\alpha }\bar{q}_{\varvec{\theta }}^{1-\alpha } \right\rangle \) (and hence \(\tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})\) itself) can be easily calculated, because the empirical distribution \(\tilde{p}(\varvec{x})\) is nonzero only on the examples observed in the dataset and vanishes on the unobserved subset of \(\mathcal {X}\). In this sense the model \(q_{\varvec{\theta }}(\varvec{x})\) is empirically localized to the subset of \(\mathcal {X}\) observed in the dataset \(\mathcal {D}\), and we can ignore the unobserved subset of \(\mathcal {X}\), which leads to a drastic reduction of computational cost. We observe \(\tilde{r}_{0,\varvec{\theta }}(\varvec{x})=\bar{q}_{\varvec{\theta }}(\varvec{x})\) and \(\tilde{r}_{1,\varvec{\theta }}(\varvec{x})=\tilde{p}(\varvec{x})\), so (9) interpolates between the empirical distribution \(\tilde{p}\) and the normalized model \(\bar{q}_{\varvec{\theta }}\) through the parameter \(\alpha \).
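As a concrete illustration, the sketch below (our own code; names are ours) builds the empirical distribution (2) from a dataset and evaluates the localization (9) on the observed support only, which is what makes the computation cheap:

```python
from collections import Counter

def empirical_distribution(data):
    """Empirical distribution (2): maps each observed state to n_x / n.
    Unobserved states are simply absent, i.e. implicitly zero."""
    n = len(data)
    return {x: c / n for x, c in Counter(map(tuple, data)).items()}

def empirical_localization(p_tilde, q, alpha):
    """Empirical localization (9) restricted to the observed support.
    `q` is a callable evaluating the unnormalized model q_theta(x).
    For alpha > 0, unobserved states contribute 0 to the normalizer,
    so the sum runs over observed states only."""
    unnorm = {x: p ** alpha * q(x) ** (1 - alpha) for x, p in p_tilde.items()}
    total = sum(unnorm.values())
    return {x: v / total for x, v in unnorm.items()}
```

The dictionaries have at most n entries regardless of \(|\mathcal {X}|\), in contrast to the \(2^p\) states a normalization constant would touch.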

2.3 Estimator by Homogeneous Divergence and Empirical Localization

In [13], an estimator that does not require calculation of the normalization constant was proposed by combining (7) and (9). The estimator is defined as

$$\begin{aligned} \varvec{\hat{\theta }}&=\mathop {\mathrm{argmin}}_{\varvec{\theta }} D_{\gamma }(\tilde{r}_{\alpha ,\varvec{\theta }}^{1/(1+\gamma )},\tilde{r}_{\alpha ',\varvec{\theta }}^{1/(1+\gamma )}) \end{aligned}$$
(10)
$$\begin{aligned} D_{\gamma }(\tilde{r}_{\alpha ,\varvec{\theta }}^{1/(1+\gamma )},\tilde{r}_{\alpha ',\varvec{\theta }}^{1/(1+\gamma )})&= \frac{1}{1+\gamma }\log \left\langle \tilde{p}^{\alpha }q_{\varvec{\theta }}^{1-\alpha } \right\rangle + \frac{\gamma }{1+\gamma }\log \left\langle \tilde{p}^{\alpha '}q_{\varvec{\theta }}^{1-\alpha '} \right\rangle - \log \left\langle \tilde{p}^{\beta }q_{\varvec{\theta }}^{1-\beta } \right\rangle \end{aligned}$$
(11)

where \(\alpha \not =\alpha '\) and \(\beta =(\alpha +\gamma \alpha ')/(1+\gamma )\). The setting \(\alpha =1,\alpha '=0\) with \(\gamma \rightarrow 0\) corresponds to the conventional MLE. Note that the empirical risk (11) does not involve the normalization constant \(Z_{\varvec{\theta }}\) and can be easily calculated.
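The empirical risk (11) transcribes directly into code. In this sketch (our own naming), `p_tilde` is a dict of observed frequencies as in (2) and `q` evaluates the unnormalized model; every bracket is a sum over observed states only, so \(Z_{\varvec{\theta }}\) never appears:

```python
import numpy as np

def localized_risk(p_tilde, q, alpha, alpha_p, gamma):
    """Empirical risk (11) of the estimator of [13].

    All three brackets are sums over observed states only (valid for
    positive exponents of p_tilde), so the normalization constant of
    the model is never computed.
    """
    beta = (alpha + gamma * alpha_p) / (1 + gamma)

    def bracket(a):
        return sum(p ** a * q(x) ** (1 - a) for x, p in p_tilde.items())

    return (np.log(bracket(alpha)) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(bracket(alpha_p))
            - np.log(bracket(beta)))
```

When the model is proportional to the empirical distribution on the support, the risk is 0, consistent with the homogeneity of the \(\gamma \)-divergence.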

The estimator (10) has the following good statistical properties.

Proposition 1

([13]). Assume that \(\psi _{\varvec{\theta }}(\varvec{x})\) is written as \(\varvec{\theta }^T\varvec{\phi }(\varvec{x})\) with a fixed vector-valued function \(\varvec{\phi }(\varvec{x})\). Then the risk function (10) is convex with respect to \(\varvec{\theta }\) when \(\beta =1\) holds.

Proposition 2

([13]). The estimator (10) is Fisher consistent and asymptotically efficient.

3 Proposed Estimator

The empirical localization (9) can be interpreted as a generalized mean of the constant 1 and the ratio \(q_{\varvec{\theta }}/\tilde{p}\), and is rewritten as

$$\begin{aligned} \tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})&\propto \tilde{p}(\varvec{x})^{\alpha }q_{\varvec{\theta }}(\varvec{x})^{1-\alpha } = \tilde{p}(\varvec{x})\left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) ^{1-\alpha } \nonumber \\&= \tilde{p}(\varvec{x})\exp \left( \alpha \log 1+(1-\alpha )\log \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) . \end{aligned}$$
(12)

We can extend the concept of (12) to the quasi-arithmetic mean, with a monotonically increasing function u and its inverse function \(\xi \), as follows.

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) =\tilde{p}(\varvec{x})u\left( \alpha \xi (1)+(1-\alpha )\xi \left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) \right) . \end{aligned}$$
(13)

By replacing u(z) with \(u(z+\xi (1))\), i.e., translating the argument of u, we can set \(\xi (1)=0\) without loss of generality. The generalized version of the empirical localization (13) is then rewritten as

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) \propto {\left\{ \begin{array}{ll} n_{\varvec{x}}u\left( (1-\alpha )\xi \left( \frac{nq_{\theta }(\varvec{x})}{n_{\varvec{x}}} \right) \right) &{} \varvec{x} \text{ is } \text{ observed } \\ 0 &{} \text{ otherwise }, \end{array}\right. } \end{aligned}$$
(14)

and the model can be easily calculated because we can omit the unobserved domain. We show examples associated with the \(\beta \)-divergence and \(\eta \)-divergence [3], which are employed for robust estimation [9, 12].

Example 1

For \(u(z)=(1+\beta z)^{1/\beta }\) and \(\xi (z)=\frac{z^\beta -1}{\beta }\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \tilde{p}(\varvec{x}) \left( (1-\alpha )\left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})}\right) ^{\beta }+\alpha \right) ^{1/\beta } \end{aligned}$$
(15)

Example 2

For \(u(z)=(1+\eta )e^z-\eta \) and \(\xi (z)=\log \frac{z+\eta }{1+\eta }\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \tilde{p}(\varvec{x}) \left\{ (1+\eta )\left( \frac{\frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})}+\eta }{1+\eta } \right) ^{1-\alpha }-\eta \right\} \end{aligned}$$
(16)

Example 3

For \(u(z)=-\frac{1}{z}\) and \(\xi (z)=-\frac{1}{z}\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \frac{\tilde{p}(\varvec{x})q_{\varvec{\theta }}(\varvec{x})}{\alpha q_{\varvec{\theta }}(\varvec{x})+(1-\alpha )\tilde{p}(\varvec{x})} \end{aligned}$$
(17)
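The generalized localization (13) and the deformation pairs of Examples 1 and 2 can be sketched as follows (our own code and naming; Example 3 is omitted here because its \(\xi (1)=-1\) does not satisfy the normalization \(\xi (1)=0\) assumed by this form):

```python
import numpy as np

def quasi_localization(p_tilde, q, alpha, u, xi):
    """Generalized empirical localization (13)/(14) on the observed support,
    for a monotone u with inverse xi normalized so that xi(1) = 0."""
    return {x: p * u((1 - alpha) * xi(q(x) / p)) for x, p in p_tilde.items()}

# Example 1 (beta-divergence type) and Example 2 (eta-divergence type);
# beta_ and eta are hyper-parameters of the deformation.
def u_beta(beta_):  return lambda z: (1 + beta_ * z) ** (1 / beta_)
def xi_beta(beta_): return lambda z: (z ** beta_ - 1) / beta_
def u_eta(eta):     return lambda z: (1 + eta) * np.exp(z) - eta
def xi_eta(eta):    return lambda z: np.log((z + eta) / (1 + eta))
```

Each pair satisfies \(u(\xi (z))=z\) and \(\xi (1)=0\), so that \(\tilde{r}_{u,1,\varvec{\theta }}\propto \tilde{p}\) as in (9).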

We propose a novel estimator for discrete probabilistic models that can be constructed without calculation of the normalization constant \(Z_{\varvec{\theta }}\). The proposed estimator is defined by combining (13) and the \(\gamma \)-divergence with two hyper-parameters \(\alpha ,\alpha '\) (\(\alpha \not =\alpha '\)), as follows.

$$\begin{aligned} \varvec{\hat{\theta }}&= \mathop {\mathrm{argmin}}_{\varvec{\theta }}D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) \end{aligned}$$
(18)

Note that when \(q_{\varvec{\theta }}(\varvec{x})\propto \tilde{p}(\varvec{x})\) holds, we have \(\tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x})\propto \tilde{r}_{u,\alpha ',\varvec{\theta }}(\varvec{x})\) and hence \(D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) =0\).
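The objective of (18) can be evaluated entirely on the observed support by combining the generalized localization (13) with the \(\gamma \)-divergence (7). A minimal sketch, assuming our own naming (`p_tilde` is a dict of observed frequencies, `q` the unnormalized model, `u`/`xi` a deformation pair with \(\xi (1)=0\)):

```python
import numpy as np

def proposed_risk(p_tilde, q, alpha, alpha_p, gamma, u, xi):
    """Objective of the proposed estimator (18): gamma-divergence between the
    1/(1+gamma)-th powers of two generalized localizations (13).

    All sums run over observed states only, so Z_theta is never computed.
    """
    def loc(a):
        return np.array([p * u((1 - a) * xi(q(x) / p))
                         for x, p in p_tilde.items()])

    f = loc(alpha) ** (1 / (1 + gamma))
    g = loc(alpha_p) ** (1 / (1 + gamma))
    return (np.log(np.sum(f ** (1 + gamma))) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(np.sum(g ** (1 + gamma)))
            - np.log(np.sum(f * g ** gamma)))
```

With \(u(z)=\exp (z)\) and \(\xi (z)=\log z\) this reduces to the risk of [13]; the risk vanishes whenever \(q_{\varvec{\theta }}\propto \tilde{p}\) on the support, as noted above.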

4 Statistical Property

In this section, we investigate statistical properties of the proposed estimator. Firstly, we show the Fisher consistency of the proposed estimator.

Proposition 3

Let \(\varvec{\theta }_0\) be a true parameter of the underlying distribution, i.e., \(p(\varvec{x})=\bar{q}_{\varvec{\theta }_0}(\varvec{x})\). Then

$$\begin{aligned} \varvec{\theta }_0=\mathop {\mathrm{argmin}}_{\varvec{\theta }}D_{\gamma }(r_{u,\alpha ,\varvec{\theta }},r_{u,\alpha ',\varvec{\theta }}) \end{aligned}$$
(19)

holds for arbitrary \(\gamma \), \(\alpha \), \(\alpha '\) (\(\alpha \not =\alpha '\)), and \(\varvec{\theta }_0\).

Proof

The proposed estimator satisfies the equilibrium equation

$$\begin{aligned} 0= \left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\hat{\theta }}} \end{aligned}$$
(20)

When the empirical distribution \(\tilde{p}\) is replaced by the underlying distribution \(p=\bar{q}_{\varvec{\theta }_0}\), the ratio \(q_{\varvec{\theta }_0}(\varvec{x})/p(\varvec{x})=Z_{\varvec{\theta }_0}\) is constant in \(\varvec{x}\), so \(r_{u,\alpha ,\varvec{\theta }_0}\propto p\propto r_{u,\alpha ',\varvec{\theta }_0}\) for any \(\alpha ,\alpha '\). By the homogeneity and non-negativity of the \(\gamma \)-divergence, the risk then attains its minimum value 0 at \(\varvec{\theta }=\varvec{\theta }_0\); hence \(\varvec{\theta }_0\) satisfies the population version of (20), implying the Fisher consistency.

Secondly, we investigate the asymptotic distribution of the proposed estimator.

Proposition 4

Let \(\varvec{\theta }_0\) be the true parameter of the underlying distribution. Then, under mild regularity conditions, the proposed estimator asymptotically satisfies

$$\begin{aligned} \sqrt{n}(\varvec{\hat{\theta }}-\varvec{\theta }_0)\sim \mathcal {N}(\varvec{0},I(\varvec{\theta }_0)^{-1}) \end{aligned}$$
(21)

where \(\mathcal {N}\) is the Normal distribution and \(I(\varvec{\theta }_0)=V_{\bar{q}_{\varvec{\theta }_0}}[\psi _{\varvec{\theta }_0}']\) is the Fisher information matrix.

Proof

Let us assume that the empirical distribution is written as \(\tilde{p}(\varvec{x})=\bar{q}_{\varvec{\theta }_0}(\varvec{x})+\epsilon (\varvec{x})\). By expanding the equilibrium condition (20) around \(\varvec{\theta }=\varvec{\theta }_0\) and \(\epsilon (\varvec{x})=0\), we have

$$\begin{aligned} 0\simeq&\left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0} \nonumber \\&+ \left. \frac{\partial ^2 D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }\partial \varvec{\theta }^T} \right| _{\varvec{\theta }=\varvec{\theta }_0}(\varvec{\hat{\theta }}-\varvec{\theta }_0). \end{aligned}$$
(22)

Using the delta method [14], we have

$$\begin{aligned}&\left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0} - \left. \frac{\partial D_\gamma ((r_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (r_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0}\end{aligned}$$
(23)
$$\begin{aligned}&\simeq C \left\langle \psi _{\varvec{\theta }_0}'\epsilon \right\rangle \end{aligned}$$
(24)

where C is a constant. From the central limit theorem, \(\sqrt{n}\left\langle \psi _{\varvec{\theta }_0}'\epsilon \right\rangle \) asymptotically follows the normal distribution with mean 0 and variance \(I(\varvec{\theta }_0)=V_{\bar{q}_{\varvec{\theta }_0}}[\psi _{\varvec{\theta }_0}']\). From the law of large numbers, the second term on the right-hand side of (22) converges to \(-CI(\varvec{\theta }_0)\) as \(n\rightarrow \infty \), which proves the proposition.

The asymptotic variance in (21) implies that the proposed estimator is asymptotically efficient and has the same asymptotic efficiency as the MLE, which attains the Cramér-Rao bound. Also note that the asymptotic variance of the proposed estimator does not depend on the choice of \(\alpha ,\alpha ',\gamma \).

5 Experiments

We numerically investigated properties of the proposed estimator on small synthetic datasets. Let \(\bar{q}_{\varvec{\theta }}(\varvec{x})\) be a 5-dimensional Boltzmann machine

$$\begin{aligned} \bar{q}_{\varvec{\theta }}(\varvec{x})= \frac{\exp \left( \frac{1}{2}\varvec{x}^T\varvec{\theta }\varvec{x}\right) }{Z_{\varvec{\theta }}} \end{aligned}$$
(25)

whose parameter \(\varvec{\theta }\) follows the normal distribution with mean 0 and variance 1. We generated 20 datasets, each containing 4000 examples, and compared the following methods.

  1. MLE: Maximum likelihood estimator
  2. gamma: The proposed estimator with \(u(z)=\exp (z)\) [13]
  3. IS: The proposed estimator with \(u(z)=-\frac{1}{z}\)
  4. eta: The proposed estimator with \(u(z)=(1+\eta )\exp (z)-\eta \)
  5. beta: The proposed estimator with \(u(z)=(1+\beta z)^{1/\beta }\)
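Since d = 5 is small, exact sampling from a Boltzmann machine of the form (25) by enumeration is feasible; the following is our own illustrative sketch of the data-generating step, not the authors' implementation:

```python
import itertools
import numpy as np

def sample_boltzmann(theta, n, rng):
    """Exact sampling from a bias-free Boltzmann machine (25) by
    enumerating all 2^d states (feasible only for small d)."""
    d = theta.shape[0]
    states = np.array(list(itertools.product([+1, -1], repeat=d)), dtype=float)
    log_q = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    prob = np.exp(log_q - log_q.max())
    prob /= prob.sum()
    idx = rng.choice(len(states), size=n, p=prob)
    return states[idx]
```

Each estimator above can then be fit to the sampled dataset by numerically minimizing its risk over \(\varvec{\theta }\).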

Figure 1(a) shows a box plot of the parameter MSEs, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), on a logarithmic scale for the various deformation functions u in (13). Figure 1(b) shows a box plot of the computational time of each estimator on a logarithmic scale. Some variants of the proposed estimator are comparable with the MLE, while their computational time is drastically reduced compared with that of the MLE.

Fig. 1. \(n=4000\). (a) Box plot of estimation errors, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), of each method. (b) Box plot of computational time of each method.

Fig. 2. \(n=16000\). (a) Box plot of estimation errors, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), of each method. (b) Box plot of computational time of each method.

One reason why the proposed estimator with some functions u is inferior to the MLE is a shortage of examples: the theoretical results in Sect. 4 rely on asymptotic assumptions and require many examples for the asymptotic efficiency to take effect. We therefore ran another experiment with the same setting except for the number n of examples. Figure 2(a), (b) show the results for \(n=16000\); the performance of the proposed estimators (IS, eta, beta) improves to the same level as the MLE, while the required computational cost remains drastically lower.

6 Conclusion

We proposed a novel estimator for discrete probabilistic models that does not require calculation of the normalization constant. The proposed estimator is constructed by combining the \(\gamma \)-divergence with a generalized empirical localization, which can be interpreted as a generalized mean of distributions. We investigated statistical properties of the proposed estimator, showed that it asymptotically attains the same efficiency as the MLE, and demonstrated the asymptotic efficiency with small experiments.