1 Introduction

In this paper, we focus on the problem of parameter estimation for discrete probabilistic models. A typical approach is Maximum Likelihood Estimation (MLE); the MLE is a “good” estimator in the sense that it asymptotically attains the Cramér-Rao bound and is asymptotically efficient. In general, an explicit solution for the MLE cannot be obtained, and gradient-based optimization methods are usually required. However, the gradient involves the normalization constant that makes the model a member of the probability space, and computing this constant is often intractable when the model lives in a high-dimensional space. A typical example is the Boltzmann machine on \(\mathcal {X}=\{+1,-1\}^p\),

$$\begin{aligned} \frac{\exp \left( \varvec{\theta }_1^T\varvec{x}+\frac{1}{2}\varvec{x}^T\varvec{\theta }_2\varvec{x}\right) }{\sum _{\varvec{x}\in \mathcal {X}}\exp \left( \varvec{\theta }_1^T\varvec{x}+\frac{1}{2}\varvec{x}^T\varvec{\theta }_2\varvec{x}\right) } \end{aligned}$$
(1)

and calculation of its normalization constant requires \(2^p\) summations, which becomes infeasible as p grows. Other estimators derived from minimization of divergence measures [2] suffer from the same computational problem. To tackle the problem associated with the normalization constant, various approaches have been studied. Some methods are based on Markov Chain Monte Carlo (MCMC) sampling; contrastive divergence [7] is a well-known example. Another approach approximates the target probabilistic model by a tractable model via the mean-field approximation, which assumes independence of the variables [11]. In this paper, we focus on an approach that considers an unnormalized model rather than the (normalized) probabilistic model. [8] defines information of a “neighbor” by contrasting the probability of an example with that of an example with a flipped variable, which makes it possible to omit the calculation of the normalization constant. [4] proposed generalized local scoring rules on discrete sample spaces, and [6] avoids the calculation of the normalization constant using a trick with auxiliary examples. [13] proposes an asymptotically efficient estimator free of the normalization constant, built from the concept of empirical localization and the homogeneous \(\gamma \)-divergence [5, 10]. In this paper, we extend the concept of empirical localization and propose a novel estimator that does not require calculation of the normalization constant. We investigate statistical properties of the proposed estimator and verify its validity with small experiments.
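To make the bottleneck concrete, the following brute-force computation of the log normalization constant of the Boltzmann machine (1) enumerates all \(2^p\) states. This is our own illustrative sketch (the function name is ours, not from the paper):

```python
import itertools
import numpy as np

def boltzmann_log_Z(theta1, theta2):
    """Brute-force log normalization constant of the Boltzmann machine (1).

    Enumerates all 2^p states of {+1,-1}^p, so the cost grows
    exponentially in p -- the intractability discussed in the text.
    """
    p = len(theta1)
    log_terms = []
    for x in itertools.product([+1, -1], repeat=p):
        x = np.array(x, dtype=float)
        log_terms.append(theta1 @ x + 0.5 * x @ theta2 @ x)
    log_terms = np.array(log_terms)
    # log-sum-exp for numerical stability
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))
```

For \(p\) beyond a few dozen the loop over \(2^p\) states is hopeless, which motivates the normalization-free estimators below.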

2 Settings

Let \(\varvec{x}\) be a d-dimensional vector in a discrete space \(\mathcal {X}\), such as \(\{+1,-1\}^d\) or \(\{1,2,\ldots \}^d\). For a function f on \(\mathcal {X}\), the bracket \(\left\langle f \right\rangle \) denotes the sum of f over \(\mathcal {X}\): \(\left\langle f \right\rangle =\sum _{\varvec{x}\in \mathcal {X}}f(\varvec{x})\). For a given dataset \(\mathcal {D}=\{\varvec{x}_i\}_{i=1}^n\), the empirical distribution \(\tilde{p}(\varvec{x})\) is defined as

$$\begin{aligned} \tilde{p}(\varvec{x})= {\left\{ \begin{array}{ll} \frac{n_{\varvec{x}}}{n} &{} \varvec{x} \text{ is } \text{ observed, }\\ 0 &{} \text{ otherwise, } \end{array}\right. } \end{aligned}$$
(2)

where \(n_{\varvec{x}}\) is the number of times \(\varvec{x}\) is observed in \(\mathcal {D}\). We consider a probabilistic model

$$\begin{aligned} \bar{q}_{\varvec{\theta }}(\varvec{x})=\frac{q_{\varvec{\theta }}(\varvec{x})}{Z_{\varvec{\theta }}} \end{aligned}$$
(3)

where \(q_{\varvec{\theta }}(\varvec{x})\) is an unnormalized model expressed as

$$\begin{aligned} q_{\varvec{\theta }}(\varvec{x})=\exp (\psi _{\varvec{\theta }}(\varvec{x})) \end{aligned}$$
(4)

with a function \(\psi _{\varvec{\theta }}(\varvec{x})\) parameterized by \(\varvec{\theta }\), and \(Z_{\varvec{\theta }}\) is the normalization constant \(Z_{\varvec{\theta }}=\left\langle q_{\varvec{\theta }} \right\rangle \), which makes (3) a probability function. Note that the unnormalized model (4) is not a probability function, \(\left\langle q_{\varvec{\theta }} \right\rangle =1\) does not hold in general, and calculation of the normalization constant \(Z_{\varvec{\theta }}\) often requires a high computational cost. Hence computation of the Maximum Likelihood Estimator (MLE)

$$\begin{aligned} \varvec{\hat{\theta }}_{MLE}= \mathop {\mathrm{argmax}}_{\varvec{\theta }}\sum _{i=1}^n\log \bar{q}_{\varvec{\theta }}(\varvec{x}_i) \end{aligned}$$
(5)

or maximization process of the log-likelihood using its gradient

$$\begin{aligned} \sum _{i=1}^n\left\{ \psi _{\varvec{\theta }}'(\varvec{x}_i)-\left\langle \bar{q}_{\varvec{\theta }}\psi _{\varvec{\theta }}' \right\rangle \right\} \end{aligned}$$
(6)

suffers from the computational difficulty caused by \(Z_{\varvec{\theta }}\). To overcome this difficulty, we consider a combination of the \(\gamma \)-divergence and a generalized mixture model.
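The cost hidden in the gradient (6) can be made explicit for a small Boltzmann machine without bias terms, \(\psi _{\varvec{\theta }}(\varvec{x})=\frac{1}{2}\varvec{x}^T\varvec{\theta }\varvec{x}\). The sketch below (our own code and naming, not from the paper) computes the exact gradient; note the sum over all \(2^d\) states needed for the model expectation:

```python
import itertools
import numpy as np

def mle_log_likelihood_grad(theta, data):
    """Exact gradient (6) for a bias-free Boltzmann machine with
    psi_theta(x) = 0.5 * x^T theta x (theta treated as a full matrix).

    The model expectation term requires a sum over all 2^d states,
    which is exactly the bottleneck discussed in the text.
    """
    n, d = data.shape
    states = np.array(list(itertools.product([+1, -1], repeat=d)), dtype=float)
    log_q = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    w = np.exp(log_q - log_q.max())
    w /= w.sum()                        # normalized model over all states
    model_stat = np.einsum('i,ij,ik->jk', w, states, states)
    data_stat = data.T @ data / n       # empirical second moment
    return 0.5 * n * (data_stat - model_stat)
```

At the uniform model (\(\varvec{\theta }=0\)) and with data covering all states equally, the gradient vanishes, as it should.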

2.1 \(\gamma \)-divergence

For two positive measures f, g, the \(\gamma \)-divergence [5] is defined as follows.

$$\begin{aligned} D_{\gamma }(f,g)= \frac{1}{1+\gamma }\log \left\langle f^{\gamma +1} \right\rangle + \frac{\gamma }{1+\gamma }\log \left\langle g^{\gamma +1} \right\rangle - \log \left\langle fg^{\gamma } \right\rangle \end{aligned}$$
(7)

where \(\gamma \) is a positive constant. Note that \(D_{\gamma }(f,g)\) is non-negative, and it is called a homogeneous divergence because \(D_{\gamma }(f,g)=0\) holds if and only if \(f\propto g\), rather than \(f=g\). In the limit \(\gamma \rightarrow 0\), the \(\gamma \)-divergence reduces to the usual KL-divergence,

$$\begin{aligned} {\left\langle f\log \frac{f}{g}-f+g \right\rangle .} \end{aligned}$$
(8)

Note that a combination of the \(\gamma \)-divergence and the unnormalized model does not by itself solve the computational problem, because the term \(D_{\gamma }(\tilde{p},q_{\varvec{\theta }})\) includes \(\left\langle q_{\varvec{\theta }}^{\gamma +1} \right\rangle \), whose computation has the same order of cost as the normalization constant \(Z_{\varvec{\theta }}\).
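Definition (7) on a finite support transcribes directly into code; the minimal sketch below (the function name is ours) also makes the homogeneity easy to check numerically:

```python
import numpy as np

def gamma_divergence(f, g, gamma):
    """Gamma-divergence (7) between two positive measures given as arrays
    of their values on a common finite support; <.> is a plain sum."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return (np.log(np.sum(f ** (1 + gamma))) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(np.sum(g ** (1 + gamma)))
            - np.log(np.sum(f * g ** gamma)))
```

For any positive constant c, `gamma_divergence(f, c * f, gamma)` evaluates to 0, reflecting that the divergence vanishes whenever \(f\propto g\).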

2.2 Empirical Localization

Firstly, we briefly introduce the concept of empirical localization of the (unnormalized) model \(\bar{q}_{\varvec{\theta }}(\varvec{x})\) (or \(q_{\varvec{\theta }}(\varvec{x})\)) with the empirical distribution \(\tilde{p}(\varvec{x})\) [13]. The empirical localization is interpreted as a generalized mean of \(q_{\varvec{\theta }}\) and \(\tilde{p}\), and lies in an e-flat subspace [1]:

$$\begin{aligned} \tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x}) =\frac{\tilde{p}(\varvec{x})^{\alpha }\bar{q}_{\varvec{\theta }}(\varvec{x})^{1-\alpha }}{\left\langle \tilde{p}^{\alpha }\bar{q}_{\varvec{\theta }}^{1-\alpha } \right\rangle } = \frac{\tilde{p}(\varvec{x})^{\alpha }q_{\varvec{\theta }}(\varvec{x})^{1-\alpha }}{\left\langle \tilde{p}^{\alpha }q_{\varvec{\theta }}^{1-\alpha } \right\rangle }. \end{aligned}$$
(9)

Note that the normalization constant \(Z_{\varvec{\theta }}\) in \(\bar{q}_{\varvec{\theta }}\) cancels out, so \(\tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})\) does not depend on \(Z_{\varvec{\theta }}\), except when \(\alpha =0\). Also note that the denominator \(\left\langle \tilde{p}^{\alpha }\bar{q}_{\varvec{\theta }}^{1-\alpha } \right\rangle \) (and hence \(\tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})\) itself) can be easily calculated, because the empirical distribution \(\tilde{p}(\varvec{x})\) is nonzero only on the examples observed in the dataset and vanishes on the unobserved subset of \(\mathcal {X}\). In this sense the model \(q_{\varvec{\theta }}(\varvec{x})\) is empirically localized to the subset of \(\mathcal {X}\) observed in the dataset \(\mathcal {D}\), and we can ignore the unobserved subset of \(\mathcal {X}\), which leads to a drastic reduction of computational cost. We observe \(\tilde{r}_{0,\varvec{\theta }}(\varvec{x})=\bar{q}_{\varvec{\theta }}(\varvec{x})\) and \(\tilde{r}_{1,\varvec{\theta }}(\varvec{x})=\tilde{p}(\varvec{x})\), so (9) interpolates between the empirical distribution \(\tilde{p}\) and the normalized model \(\bar{q}_{\varvec{\theta }}\) through the parameter \(\alpha \).
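As a concrete illustration, the sketch below (our own code; names are ours) builds the empirical distribution (2) from a dataset and evaluates the localization (9) on the observed support only, which is what makes the computation cheap:

```python
from collections import Counter

def empirical_distribution(data):
    """Empirical distribution (2): maps each observed state to n_x / n.
    Unobserved states are simply absent, i.e. implicitly zero."""
    n = len(data)
    return {x: c / n for x, c in Counter(map(tuple, data)).items()}

def empirical_localization(p_tilde, q, alpha):
    """Empirical localization (9) restricted to the observed support.
    `q` is a callable evaluating the unnormalized model q_theta(x).
    For alpha > 0, unobserved states contribute 0 to the normalizer,
    so the sum runs over observed states only."""
    unnorm = {x: p ** alpha * q(x) ** (1 - alpha) for x, p in p_tilde.items()}
    total = sum(unnorm.values())
    return {x: v / total for x, v in unnorm.items()}
```

The dictionaries have at most n entries regardless of \(|\mathcal {X}|\), in contrast to the \(2^p\) states a normalization constant would touch.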

2.3 Estimator by Homogeneous Divergence and Empirical Localization

In [13], an estimator that does not require calculation of the normalization constant was proposed by combining (7) and (9). The estimator is defined as

$$\begin{aligned} \varvec{\hat{\theta }}&=\mathop {\mathrm{argmin}}_{\varvec{\theta }} D_{\gamma }(\tilde{r}_{\alpha ,\varvec{\theta }}^{1/(1+\gamma )},\tilde{r}_{\alpha ',\varvec{\theta }}^{1/(1+\gamma )}) \end{aligned}$$
(10)
$$\begin{aligned} D_{\gamma }(\tilde{r}_{\alpha ,\varvec{\theta }}^{1/(1+\gamma )},\tilde{r}_{\alpha ',\varvec{\theta }}^{1/(1+\gamma )})&= \frac{1}{1+\gamma }\log \left\langle \tilde{p}^{\alpha }q_{\varvec{\theta }}^{1-\alpha } \right\rangle + \frac{\gamma }{1+\gamma }\log \left\langle \tilde{p}^{\alpha '}q_{\varvec{\theta }}^{1-\alpha '} \right\rangle - \log \left\langle \tilde{p}^{\beta }q_{\varvec{\theta }}^{1-\beta } \right\rangle \end{aligned}$$
(11)

where \(\alpha \not =\alpha '\) and \(\beta =(\alpha +\gamma \alpha ')/(1+\gamma )\). The setting \(\alpha =1,\alpha '=0\) with \(\gamma \rightarrow 0\) corresponds to the conventional MLE. Note that the empirical risk (11) does not involve the normalization constant \(Z_{\varvec{\theta }}\) and can be easily calculated.
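The empirical risk (11) transcribes directly into code. In this sketch (our own naming), `p_tilde` is a dict of observed frequencies as in (2) and `q` evaluates the unnormalized model; every bracket is a sum over observed states only, so \(Z_{\varvec{\theta }}\) never appears:

```python
import numpy as np

def localized_risk(p_tilde, q, alpha, alpha_p, gamma):
    """Empirical risk (11) of the estimator of [13].

    All three brackets are sums over observed states only (valid for
    positive exponents of p_tilde), so the normalization constant of
    the model is never computed.
    """
    beta = (alpha + gamma * alpha_p) / (1 + gamma)

    def bracket(a):
        return sum(p ** a * q(x) ** (1 - a) for x, p in p_tilde.items())

    return (np.log(bracket(alpha)) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(bracket(alpha_p))
            - np.log(bracket(beta)))
```

When the model is proportional to the empirical distribution on the support, the risk is 0, consistent with the homogeneity of the \(\gamma \)-divergence.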

The estimator (10) has the following good statistical properties.

Proposition 1

([13]). Assume that \(\psi _{\varvec{\theta }}(\varvec{x})\) is written as \(\varvec{\theta }^T\varvec{\phi }(\varvec{x})\) with a fixed vector-valued function \(\varvec{\phi }(\varvec{x})\). Then the risk function (10) is convex with respect to \(\varvec{\theta }\) when \(\beta =1\) holds.

Proposition 2

([13]). The estimator (10) is Fisher consistent and asymptotically efficient.

3 Proposed Estimator

The empirical localization (9) can be interpreted as a generalized mean of the constant 1 and the ratio \(q_{\varvec{\theta }}/\tilde{p}\), and is rewritten as

$$\begin{aligned} \tilde{r}_{\alpha ,\varvec{\theta }}(\varvec{x})&\propto \tilde{p}(\varvec{x})^{\alpha }q_{\varvec{\theta }}(\varvec{x})^{1-\alpha } = \tilde{p}(\varvec{x})\left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) ^{1-\alpha } \nonumber \\&= \tilde{p}(\varvec{x})\exp \left( \alpha \log 1+(1-\alpha )\log \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) . \end{aligned}$$
(12)

We can extend the concept of (12) to the quasi-arithmetic mean, with a monotonically increasing function u and its inverse function \(\xi \), as follows.

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) =\tilde{p}(\varvec{x})u\left( \alpha \xi (1)+(1-\alpha )\xi \left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})} \right) \right) . \end{aligned}$$
(13)

By replacing u(z) with \(u(z+\xi (1))\), i.e., translating the argument of u, we can set \(\xi (1)=0\) without loss of generality. The generalized version of the empirical localization (13) is then rewritten as

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) \propto {\left\{ \begin{array}{ll} n_{\varvec{x}}u\left( (1-\alpha )\xi \left( \frac{nq_{\theta }(\varvec{x})}{n_{\varvec{x}}} \right) \right) &{} \varvec{x} \text{ is } \text{ observed } \\ 0 &{} \text{ otherwise }, \end{array}\right. } \end{aligned}$$
(14)

and the model can be easily calculated because we can omit the unobserved domain. We show examples associated with the \(\beta \)-divergence and \(\eta \)-divergence [3], which are employed for robust estimation [9, 12].

Example 1

For \(u(z)=(1+\beta z)^{1/\beta }\) and \(\xi (z)=\frac{z^\beta -1}{\beta }\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \tilde{p}(\varvec{x}) \left( (1-\alpha )\left( \frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})}\right) ^{\beta }+\alpha \right) ^{1/\beta } \end{aligned}$$
(15)

Example 2

For \(u(z)=(1+\eta )e^z-\eta \) and \(\xi (z)=\log \frac{z+\eta }{1+\eta }\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \tilde{p}(\varvec{x}) \left\{ (1+\eta )\left( \frac{\frac{q_{\varvec{\theta }}(\varvec{x})}{\tilde{p}(\varvec{x})}+\eta }{1+\eta } \right) ^{1-\alpha }-\eta \right\} \end{aligned}$$
(16)

Example 3

For \(u(z)=-\frac{1}{z}\) and \(\xi (z)=-\frac{1}{z}\), we have

$$\begin{aligned} \tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x}) = \frac{\tilde{p}(\varvec{x})q_{\varvec{\theta }}(\varvec{x})}{\alpha q_{\varvec{\theta }}(\varvec{x})+(1-\alpha )\tilde{p}(\varvec{x})} \end{aligned}$$
(17)
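The generalized localization (13) and the deformation pairs of Examples 1 and 2 can be sketched as follows (our own code and naming; Example 3 is omitted here because its \(\xi (1)=-1\) does not satisfy the normalization \(\xi (1)=0\) assumed by this form):

```python
import numpy as np

def quasi_localization(p_tilde, q, alpha, u, xi):
    """Generalized empirical localization (13)/(14) on the observed support,
    for a monotone u with inverse xi normalized so that xi(1) = 0."""
    return {x: p * u((1 - alpha) * xi(q(x) / p)) for x, p in p_tilde.items()}

# Example 1 (beta-divergence type) and Example 2 (eta-divergence type);
# beta_ and eta are hyper-parameters of the deformation.
def u_beta(beta_):  return lambda z: (1 + beta_ * z) ** (1 / beta_)
def xi_beta(beta_): return lambda z: (z ** beta_ - 1) / beta_
def u_eta(eta):     return lambda z: (1 + eta) * np.exp(z) - eta
def xi_eta(eta):    return lambda z: np.log((z + eta) / (1 + eta))
```

Each pair satisfies \(u(\xi (z))=z\) and \(\xi (1)=0\), so that \(\tilde{r}_{u,1,\varvec{\theta }}\propto \tilde{p}\) as in (9).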

We propose a novel estimator for discrete probabilistic models that can be constructed without calculation of the normalization constant \(Z_{\varvec{\theta }}\). The proposed estimator is defined by combining (13) and the \(\gamma \)-divergence with two hyper-parameters \(\alpha ,\alpha '\) (\(\alpha \not =\alpha '\)), as follows.

$$\begin{aligned} \varvec{\hat{\theta }}&= \mathop {\mathrm{argmin}}_{\varvec{\theta }}D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) \end{aligned}$$
(18)

Note that when \(q_{\varvec{\theta }}(\varvec{x})\propto \tilde{p}(\varvec{x})\) holds, we have \(\tilde{r}_{u,\alpha ,\varvec{\theta }}(\varvec{x})\propto \tilde{r}_{u,\alpha ',\varvec{\theta }}(\varvec{x})\) and hence \(D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) =0\).
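The objective of (18) can be evaluated entirely on the observed support by combining the generalized localization (13) with the \(\gamma \)-divergence (7). A minimal sketch, assuming our own naming (`p_tilde` is a dict of observed frequencies, `q` the unnormalized model, `u`/`xi` a deformation pair with \(\xi (1)=0\)):

```python
import numpy as np

def proposed_risk(p_tilde, q, alpha, alpha_p, gamma, u, xi):
    """Objective of the proposed estimator (18): gamma-divergence between the
    1/(1+gamma)-th powers of two generalized localizations (13).

    All sums run over observed states only, so Z_theta is never computed.
    """
    def loc(a):
        return np.array([p * u((1 - a) * xi(q(x) / p))
                         for x, p in p_tilde.items()])

    f = loc(alpha) ** (1 / (1 + gamma))
    g = loc(alpha_p) ** (1 / (1 + gamma))
    return (np.log(np.sum(f ** (1 + gamma))) / (1 + gamma)
            + gamma / (1 + gamma) * np.log(np.sum(g ** (1 + gamma)))
            - np.log(np.sum(f * g ** gamma)))
```

With \(u(z)=\exp (z)\) and \(\xi (z)=\log z\) this reduces to the risk of [13]; the risk vanishes whenever \(q_{\varvec{\theta }}\propto \tilde{p}\) on the support, as noted above.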

4 Statistical Property

In this section, we investigate statistical properties of the proposed estimator. Firstly, we show the Fisher consistency of the proposed estimator.

Proposition 3

Let \(\varvec{\theta }_0\) be a true parameter of the underlying distribution, i.e., \(p(\varvec{x})=\bar{q}_{\varvec{\theta }_0}(\varvec{x})\). Then

$$\begin{aligned} \varvec{\theta }_0=\mathop {\mathrm{argmin}}_{\varvec{\theta }}D_{\gamma }(r_{u,\alpha ,\varvec{\theta }},r_{u,\alpha ',\varvec{\theta }}) \end{aligned}$$
(19)

holds for arbitrary \(\gamma \), \(\alpha \), \(\alpha '\) (\(\alpha \not =\alpha '\)), and \(\varvec{\theta }_0\).

Proof

The proposed estimator satisfies the equilibrium equation

$$\begin{aligned} 0= \left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\hat{\theta }}} \end{aligned}$$
(20)

When the empirical distribution \(\tilde{p}\) is replaced by the underlying distribution \(p=\bar{q}_{\varvec{\theta }_0}\), the ratio \(q_{\varvec{\theta }_0}(\varvec{x})/p(\varvec{x})=Z_{\varvec{\theta }_0}\) is constant in \(\varvec{x}\), so \(r_{u,\alpha ,\varvec{\theta }_0}\propto p\propto r_{u,\alpha ',\varvec{\theta }_0}\) for any \(\alpha ,\alpha '\). By the homogeneity and non-negativity of the \(\gamma \)-divergence, the risk then attains its minimum value 0 at \(\varvec{\theta }=\varvec{\theta }_0\); hence \(\varvec{\theta }_0\) satisfies the population version of (20), implying the Fisher consistency.

Secondly, we investigate the asymptotic distribution of the proposed estimator.

Proposition 4

Let \(\varvec{\theta }_0\) be the true parameter of the underlying distribution. Then, under mild regularity conditions, the proposed estimator asymptotically satisfies

$$\begin{aligned} \sqrt{n}(\varvec{\hat{\theta }}-\varvec{\theta }_0)\sim \mathcal {N}(\varvec{0},I(\varvec{\theta }_0)^{-1}) \end{aligned}$$
(21)

where \(\mathcal {N}\) is the Normal distribution and \(I(\varvec{\theta }_0)=V_{\bar{q}_{\varvec{\theta }_0}}[\psi _{\varvec{\theta }_0}']\) is the Fisher information matrix.

Proof

Let us assume that the empirical distribution is written as \(\tilde{p}(\varvec{x})=\bar{q}_{\varvec{\theta }_0}(\varvec{x})+\epsilon (\varvec{x})\). By expanding the equilibrium condition (20) around \(\varvec{\theta }=\varvec{\theta }_0\) and \(\epsilon (\varvec{x})=0\), we have

$$\begin{aligned} 0\simeq&\left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0} \nonumber \\&+ \left. \frac{\partial ^2 D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }\partial \varvec{\theta }^T} \right| _{\varvec{\theta }=\varvec{\theta }_0}(\varvec{\hat{\theta }}-\varvec{\theta }_0). \end{aligned}$$
(22)

Using the delta method [14], we have

$$\begin{aligned}&\left. \frac{\partial D_\gamma ((\tilde{r}_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (\tilde{r}_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0} - \left. \frac{\partial D_\gamma ((r_{u,\alpha ,\varvec{\theta }})^{1/(1+\gamma )}, (r_{u,\alpha ',\varvec{\theta }})^{1/(1+\gamma )}) }{\partial \varvec{\theta }} \right| _{\varvec{\theta }=\varvec{\theta }_0}\end{aligned}$$
(23)
$$\begin{aligned}&\simeq C \left\langle \psi _{\varvec{\theta }_0}'\epsilon \right\rangle \end{aligned}$$
(24)

where C is a constant. From the central limit theorem, \(\sqrt{n}\left\langle \psi _{\varvec{\theta }_0}'\epsilon \right\rangle \) asymptotically follows the normal distribution with mean 0 and variance \(I(\varvec{\theta }_0)=V_{\bar{q}_{\varvec{\theta }_0}}[\psi _{\varvec{\theta }_0}']\). From the law of large numbers, the second term on the right-hand side of (22) converges to \(-CI(\varvec{\theta }_0)\) as \(n\rightarrow \infty \), which proves the proposition.

The asymptotic variance in (21) implies that the proposed estimator is asymptotically efficient and has the same asymptotic efficiency as the MLE, which attains the Cramér-Rao bound. Also note that the asymptotic variance of the proposed estimator does not depend on the choice of \(\alpha ,\alpha ',\gamma \).

5 Experiments

We numerically investigated properties of the proposed estimator on small synthetic datasets. Let \(\bar{q}_{\varvec{\theta }}(\varvec{x})\) be a 5-dimensional Boltzmann machine

$$\begin{aligned} \bar{q}_{\varvec{\theta }}(\varvec{x})= \frac{\exp \left( \frac{1}{2}\varvec{x}^T\varvec{\theta }\varvec{x}\right) }{Z_{\varvec{\theta }}} \end{aligned}$$
(25)

whose parameter \(\varvec{\theta }\) follows the normal distribution with mean 0 and variance 1. We generated 20 datasets, each containing 4000 examples, and compared the following methods.

  1. MLE: Maximum likelihood estimator
  2. gamma: The proposed estimator with \(u(z)=\exp (z)\) [13]
  3. IS: The proposed estimator with \(u(z)=-\frac{1}{z}\)
  4. eta: The proposed estimator with \(u(z)=(1+\eta )\exp (z)-\eta \)
  5. beta: The proposed estimator with \(u(z)=(1+\beta z)^{1/\beta }\)
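Since d = 5 is small, exact sampling from a Boltzmann machine of the form (25) by enumeration is feasible; the following is our own illustrative sketch of the data-generating step, not the authors' implementation:

```python
import itertools
import numpy as np

def sample_boltzmann(theta, n, rng):
    """Exact sampling from a bias-free Boltzmann machine (25) by
    enumerating all 2^d states (feasible only for small d)."""
    d = theta.shape[0]
    states = np.array(list(itertools.product([+1, -1], repeat=d)), dtype=float)
    log_q = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    prob = np.exp(log_q - log_q.max())
    prob /= prob.sum()
    idx = rng.choice(len(states), size=n, p=prob)
    return states[idx]
```

Each estimator above can then be fit to the sampled dataset by numerically minimizing its risk over \(\varvec{\theta }\).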

Figure 1(a) shows a box plot of the parameter MSEs, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), on a logarithmic scale for the various deformation functions u in (13). Figure 1(b) shows a box plot of the computational time of each estimator on a logarithmic scale. Some variants of the proposed estimator are comparable with the MLE, while their computational time is drastically reduced compared with that of the MLE.

Fig. 1. \(n=4000\). (a) Box plot of estimation errors, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), of each method. (b) Box plot of computational time of each method.

Fig. 2. \(n=16000\). (a) Box plot of estimation errors, \(||\varvec{\hat{\theta }}-\varvec{\theta }||^2\), of each method. (b) Box plot of computational time of each method.

One reason why the proposed estimator with some functions u is inferior to the MLE is a shortage of examples: the theoretical results in Sect. 4 rely on asymptotic assumptions and require many examples for the asymptotic efficiency to take effect. We therefore ran another experiment with the same setting except for the number n of examples. Figure 2(a), (b) show the results for \(n=16000\); the performance of the proposed estimators (IS, eta, beta) improves to the same level as the MLE, while the required computational cost remains drastically lower.

6 Conclusion

We proposed a novel estimator for discrete probabilistic models that does not require calculation of the normalization constant. The proposed estimator is constructed by combining the \(\gamma \)-divergence with a generalized empirical localization, which can be interpreted as a generalized mean of distributions. We investigated statistical properties of the proposed estimator, showed that it asymptotically attains the same efficiency as the MLE, and demonstrated the asymptotic efficiency with small experiments.