1 Introduction

The rapid development of ad hoc networks and the emergence of the so-called big data phenomenon have brought new challenges for distributed statistical data processing. For instance, the processing often needs to be decentralized, i.e., run without any dedicated unit in the network. Instead, all agents are responsible for (i) taking measurements, (ii) processing them, and (iii) sharing the statistical knowledge about the (usually global) inferred parameters. In addition, the estimation should in many cases run online, that is, observations of a dynamic process are taken and incorporated sequentially into the shared knowledge. This often disqualifies the popular sequential Monte Carlo (MC) approach due to the associated high computational burden. Excellent surveys on distributed estimation are the recent papers by Sayed [10] (non-MC) and Hlinka et al. [5] (MC-based).

Despite the great potential of the Bayesian paradigm in this field, its adoption is still the exception rather than the rule. From the probabilistic viewpoint, the resulting “classical” (that is, non-Bayesian) algorithms often suffer from statistical inconsistencies. For instance, point estimates are often combined without reflecting the associated uncertainty, which may lead to erroneous results. The first author’s work [1] aims at partially filling this gap: it proposes a fully Bayesian approach to decentralized distributed estimation with fusion, based on minimizing the Kullback–Leibler divergence. The present contribution extends these results to the case of mixture models, covered for the static case, e.g., in [3, 8, 13].

The novelty of the proposed framework lies in a fully analytical Bayesian processing of observations and shared knowledge about the estimated parameters. To this end, the underlying theory relies on the quasi-Bayesian approach, proposed by Smith, Makov, and Titterington [11, 12] and followed by Kárný et al. [6], whose approach is adopted here. It provides analytical tractability of mixture inference by relying on point estimators where necessary. Though we focus on normal mixtures, the results are applicable to homogeneous mixtures of exponential family distributions.

2 Quasi-Bayesian Estimation of Mixture Models

Consider an observable time series \(\{Y_{t},\ t \in \mathbb{N}\}\) with \(Y_{t} \in \mathbb{R}^{n}\) following a normal mixture distribution

$$\displaystyle\begin{array}{rcl} Y _{t}\vert \phi,\theta & \sim & \phi _{1}\mathrm{N}(\mu _{1},\varSigma _{1}) +\ldots +\phi _{K}\mathrm{N}(\mu _{K},\varSigma _{K}) \\ & \sim & \phi _{1}\mathrm{N}(\theta _{1}) +\ldots +\phi _{K}\mathrm{N}(\theta _{K}), {}\end{array}$$
(3.1)

where \(\mathrm{N}(\mu _{k},\varSigma _{k})\) denotes the kth component density, namely a normal distribution with mean vector \(\mu _{k} \in \mathbb{R}^{n}\) and covariance matrix \(\varSigma _{k} \in \mathbb{R}^{n\times n}\), in the latter notation summarized by \(\theta _{k} =\{\mu _{k},\varSigma _{k}\}\). The nonnegative component probabilities \(\phi _{k}\) form a vector taking values in the unit K-simplex. The number of components K is assumed to be known a priori. Furthermore, the notation \(\theta =\{\theta _{1},\ldots,\theta _{K}\}\), \(\phi =\{\phi _{1},\ldots,\phi _{K}\}\) is used.

Let \(p_{k}(y_{t}\vert \theta _{k})\) be the probability density function of the kth component, yielding the mixture density of the form

$$\displaystyle{ p(y_{t}\vert \theta,\phi ) =\sum _{ k=1}^{K}\phi _{ k}p_{k}(y_{t}\vert \theta _{k}). }$$
(3.2)

At each time instant t the observation \(y_{t}\) is generated by the \(k_{t}\)th component \(p_{k_{t}}(y_{t}\vert \theta _{k_{t}})\), selected with probability \(\phi _{k_{t}}\),

$$\displaystyle{ p(y_{t}\vert \theta,\phi,k_{t}) =\prod _{ k=1}^{K}\left [\phi _{ k}p_{k}(y_{t}\vert \mu _{k},\varSigma _{k})\right ]^{S_{k,t} }, }$$
(3.3)

where S k, t is the indicator function of the active component

$$\displaystyle{ S_{k,t} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad \mbox{ if $k = k_{t}$},\quad \\ 0\quad \text{otherwise.} \quad \end{array} \right. }$$
(3.4)

In other words, \(S_{t} = (S_{1,t},\ldots,S_{K,t})\) can be viewed as a vector with 1 at the \(k_{t}\)th position and zeros elsewhere, and hence follows the multinomial distribution Multi(1, ϕ).
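For illustration, the following minimal Python sketch (using NumPy; all names are assumptions made for this example, not part of the original text) draws an observation from a univariate K-component normal mixture exactly as described by (3.3) and (3.4): the indicator vector \(S_t \sim \mathrm{Multi}(1,\phi)\) is sampled first, and the observation then comes from the selected component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative univariate mixture (the values match the simulation example below).
phi = np.array([1 / 3, 1 / 3, 1 / 3])            # component probabilities
mu = np.array([-2.0, 4.0, 8.0])                  # component means
sigma = np.array([1.0, 1.0, np.sqrt(2.0)])       # component standard deviations

def draw_observation(phi, mu, sigma, rng):
    """Sample S_t ~ Multi(1, phi) and then y_t from the active normal component."""
    S_t = rng.multinomial(1, phi)                # one-hot indicator vector S_t
    k_t = int(np.argmax(S_t))                    # index of the active component
    y_t = rng.normal(mu[k_t], sigma[k_t])        # y_t | k_t ~ N(mu_{k_t}, Sigma_{k_t})
    return y_t, S_t

y_t, S_t = draw_observation(phi, mu, sigma, rng)
```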

From the Bayesian viewpoint, the fact that ϕ lies in the unit K-simplex is crucial, as it allows modelling it with the Dirichlet distribution with parameters \(\kappa _{1},\ldots,\kappa _{K}\),

$$\displaystyle{ \phi = (\phi _{1},\ldots,\phi _{K}) \sim \mathrm{ Dir}(\kappa _{1},\ldots,\kappa _{K}),\qquad \kappa _{k} > 0\quad \text{for all}\quad k = 1,\ldots,K, }$$

conjugate to the multinomial distribution of S t . Sequential estimation of each component's mean and covariance can then proceed with the conjugate normal inverse-Wishart distribution (or normal inverse-gamma in the univariate case),

$$\displaystyle{ \theta _{k} =\{\mu _{k},\varSigma _{k}\} \sim \mathrm{ NiW}(m,s,a,b),\qquad m \in \mathbb{R}^{n},\ \ s \in \mathbb{R}^{n\times n},\ \ a,b > 0. }$$

Exact knowledge of S t would make the Bayesian inference of both the component parameters μ k , Σ k and the mixing probabilities ϕ easily tractable, since the product (3.3) simplifies to a single density and a single component probability. Likewise, the Bayesian inference of the mixing probabilities ϕ is easy under known components, as the detection of the active one is a relatively simple hypothesis testing problem, see, e.g., [4]. However, we focus on estimating both the component parameters μ, Σ and the mixing probabilities ϕ. To this end, we need to derive the Bayesian update

$$\displaystyle{ \pi _{\phi,\theta }(\phi,\theta \vert y_{1:t},k_{1:t}) \propto \pi _{\phi,\theta }(\phi,\theta \vert y_{1:t-1},k_{1:t-1})\prod _{k=1}^{K}\left [\phi _{ k}p_{k}(y_{t}\vert \theta _{k})\right ]^{S_{k,t} } }$$

where the joint prior distribution is assumed to be

$$\displaystyle{ \pi _{\phi,\theta }(\phi,\theta \vert y_{1:t-1},k_{1:t-1}) =\pi _{\phi }(\phi \vert y_{1:t-1},k_{1:t-1})\pi _{\theta }(\theta \vert y_{1:t-1},k_{1:t-1}). }$$

The assumed independence of ϕ and θ allows tractable computation of the posterior distribution. Note that this assumption is not specific to the quasi-Bayesian approach.

In this case, Kárný et al. [6] propose to rely on the approach of Smith, Makov, and Titterington [11, 12] and replace the latent indicators S k, t defined in Eq. (3.4) by their respective point estimates with respect to ϕ k and θ k of the form

$$\displaystyle\begin{array}{rcl} \widehat{S}_{k,t}& =& \mathbb{E}\left [S_{k,t}\vert y_{1:t},k_{1:t-1}\right ] \\ & \propto & \mathbb{E}\left [\phi _{k}\vert y_{1:t-1},k_{1:t-1}\right ]p_{k}(y_{t}\vert y_{1:t-1},k_{1:t-1}),{}\end{array}$$
(3.5)

where

$$\displaystyle{ p_{k}(y_{t}\vert y_{1:t-1},k_{1:t-1}) =\int p_{k}(y_{t}\vert \theta _{k})\pi _{\theta _{k}}(\theta _{k}\vert y_{1:t-1},k_{1:t-1})\mathrm{d}\theta _{k} }$$
(3.6)

is the predictive distribution (under a normal inverse-Wishart prior it is a Student’s t distribution). To summarize, the estimation of the indicator S k, t of the active component k is based on (i) the predictive likelihood (3.6) of the component membership, and (ii) the estimated probability \(\mathbb{E}[\phi _{k}\vert \cdot ]\) of the particular component in (3.5).
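As a concrete illustration of (3.5)–(3.6) in the univariate case, the sketch below computes the point estimates \(\widehat{S}_{k,t}\): the Dirichlet mean \(\mathbb{E}[\phi_k\vert\cdot]\) is multiplied by the Student-t predictive density of each component and the products are normalized. The normal inverse-gamma parameterization NIG(m, s, a, b) with \(\mu\vert\sigma^2 \sim \mathrm{N}(m,\sigma^2 s)\) and \(\sigma^2 \sim \mathrm{IG}(a,b)\), as well as all function names, are assumptions made for this sketch only.

```python
import numpy as np
from scipy.stats import t as student_t

def predictive_density(y, m, s, a, b):
    """Student-t predictive p_k(y | y_{1:t-1}) of a component with NIG(m, s, a, b) prior,
    assuming mu | sigma^2 ~ N(m, sigma^2 * s) and sigma^2 ~ IG(a, b)."""
    df = 2.0 * a
    scale = np.sqrt(b * (1.0 + s) / a)
    return student_t.pdf(y, df=df, loc=m, scale=scale)

def responsibilities(y, kappa, nig_params):
    """Point estimates S_hat_{k,t} of the component indicators, cf. Eq. (3.5)."""
    kappa = np.asarray(kappa, dtype=float)
    E_phi = kappa / kappa.sum()                              # E[phi_k | y_{1:t-1}, k_{1:t-1}]
    lik = np.array([predictive_density(y, *p) for p in nig_params])
    S_hat = E_phi * lik                                      # proportional to (3.5)
    return S_hat / S_hat.sum()

# Example: three components with the priors used in the simulation section.
kappa = [1 / 3, 1 / 3, 1 / 3]
nig_params = [(0.0, 1.0, 2.0, 2.0), (3.0, 1.0, 2.0, 2.0), (6.0, 1.0, 2.0, 2.0)]
S_hat = responsibilities(4.2, kappa, nig_params)
```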

The quasi-Bayesian update then takes the weighted form of the regular update under known S t ,

$$\displaystyle\begin{array}{rcl} \pi _{\phi }(\phi \vert y_{1:t},k_{1:t})& \propto & \prod _{k=1}^{K}\phi _{k}^{\widehat{S}_{k,t}}\,\pi _{\phi }(\phi \vert y_{1:t-1},k_{1:t-1}),{}\end{array}$$
(3.7)
$$\displaystyle\begin{array}{rcl} \pi _{\theta _{k}}(\theta _{k}\vert y_{1:t},k_{1:t})& \propto & \left [p_{k}(y_{t}\vert \theta _{k})\right ]^{\widehat{S}_{k,t} }\pi _{\theta _{k}}(\theta _{k}\vert y_{1:t-1},k_{1:t-1}).{}\end{array}$$
(3.8)

If the component density is rewritten as the exponential family and the prior density is conjugate, then, as shown in the Appendix, the update of the relevant hyperparameters is particularly easy.
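For the univariate normal inverse-gamma case, the weighted updates (3.7) and (3.8) reduce to the simple hyperparameter recursions sketched below. The recursions follow from raising the normal likelihood to the power \(\widehat{S}_{k,t}\) under the same illustrative NIG(m, s, a, b) parameterization as above; all names are assumptions made for this sketch.

```python
def update_component(y, w, m, s, a, b):
    """Update of a NIG(m, s, a, b) prior by one observation y with weight w = S_hat_{k,t},
    i.e. with the tempered likelihood [N(y | mu, sigma^2)]^w, cf. Eq. (3.8)."""
    denom = 1.0 + w * s
    m_new = (m + w * s * y) / denom          # posterior location hyperparameter
    s_new = s / denom                        # posterior scale of mu | sigma^2
    a_new = a + 0.5 * w                      # shape of the inverse-gamma part
    b_new = b + 0.5 * w * (y - m) ** 2 / denom
    return m_new, s_new, a_new, b_new

def update_mixture(y, S_hat, kappa, nig_params):
    """One quasi-Bayesian time step: Dirichlet counts (3.7) and component priors (3.8)."""
    kappa_new = [k + w for k, w in zip(kappa, S_hat)]
    nig_new = [update_component(y, w, *p) for w, p in zip(S_hat, nig_params)]
    return kappa_new, nig_new
```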

3 Distributed Estimation

Assume that the distributed estimation runs in a network represented by a directed or undirected connected graph G(V, E) consisting of a set of vertices V = {1, …, N} (also called nodes or agents) and a set E of edges defining the graph topology. The vertices n ∈ V are allowed to communicate with adjacent vertices. For a fixed vertex n, the vertex and its neighbors form a star subgraph (a complete bipartite subgraph in which every neighboring vertex is connected with n) with radius 1 and diameter at most 2 (unless the vertex n is of degree 1), where n is the central vertex and all other vertices are peripheral. The set of vertices of this subgraph is denoted by V n .
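In code, the closed neighborhoods V n can be built directly from the edge list; a minimal sketch, treating the graph as undirected for simplicity (all names are illustrative):

```python
def closed_neighborhoods(vertices, edges):
    """V_n = {n} together with all vertices adjacent to n, for every vertex n."""
    V = {n: {n} for n in vertices}
    for i, j in edges:
        V[i].add(j)
        V[j].add(i)
    return V

# Example: a small path graph on four vertices; V_n[2] == {1, 2, 3}.
V_n = closed_neighborhoods([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
```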

The vertices independently observe the process \(\{Y_{t},\ t \in \mathbb{N}\}\), taking observations y t (n), n ∈ V. These are shared within V n in the sense that each vertex n has access to y t (j) of the vertices j ∈ V n and incorporates them according to the quasi-Bayesian estimation theory outlined in the previous section. That is, each node n ends up with the joint posterior density

$$\displaystyle{ \pi _{\phi,\theta }^{(n)}(\phi,\theta \vert \widetilde{y}_{ 1:t},\widetilde{k}_{1:t}), }$$
(3.9)

resulting from \(\mathop{card}\nolimits (V _{n})\) updates of the form (3.7) and (3.8). Here, the tilde denotes the statistical knowledge comprising the information of V n relevant to the particular variable. This step is called adaptation, see, e.g., [10].
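The adaptation step at vertex n thus amounts to \(\mathop{card}\nolimits (V _{n})\) quasi-Bayesian updates, one per shared observation. A minimal sketch, reusing the hypothetical helpers `responsibilities` and `update_mixture` from the previous snippets:

```python
def adapt(obs, V_n, kappa, nig_params):
    """Adaptation at vertex n: incorporate y_t(j) for all j in V_n via (3.7)-(3.8),
    yielding the hyperparameters of the posterior (3.9)."""
    for j in V_n:                              # neighborhood of n, including n itself
        y = obs[j]                             # observation y_t(j) shared by vertex j
        S_hat = responsibilities(y, kappa, nig_params)
        kappa, nig_params = update_mixture(y, S_hat, kappa, nig_params)
    return kappa, nig_params
```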

3.1 Combination of Estimates

In the combination step [10], the vertices n ∈ V access V n ’s posterior distributions (3.9) resulting from the adaptation,

$$\displaystyle{ \pi _{\phi,\theta }^{(j)}(\phi,\theta \vert \widetilde{y}_{ 1:t},\widetilde{k}_{1:t}),\qquad j \in V _{n}. }$$

Now the goal is to represent (i.e., approximate) them by a single joint posterior \(\widetilde{\pi }_{\phi,\theta }^{(n)}\) parameterizing the mixture (3.1) in consideration. To this end, we adopt the Kullback–Leibler divergence [7] defined in the Appendix, and seek \(\widetilde{\pi }_{\phi,\theta }^{(n)}\) satisfying

$$\displaystyle{ \sum _{j\in V _{n}}\alpha _{nj}\mathrm{D}(\widetilde{\pi }_{\phi,\theta }^{(n)}\vert \vert \pi _{\phi,\theta }^{(j)}) \rightarrow \min, }$$
(3.10)

where \(\alpha _{nj} = 1/\mathop{card}\nolimits (V _{n})\) are uniform nonnegative weights assigned to the nodes j ∈ V n , summing to unity. Other weight choices, e.g., reflecting properties of the neighboring vertices, are possible as well.

Let us impose an additional assumption simplifying the theory: identical ordering of the component parameters and significantly overlapping densities π ϕ, θ (j) for all j ∈ V n . This means that the order of the components and their parameterization agree at all vertices in V n (and hence in V ). This assumption can be relaxed by incorporating detection of similar posterior distributions, or enforced by starting from identical initial priors.

We exploit the following general proposition proved, e.g., in [1]. Although we consider exponential family distributions (where it provides analytically tractable results), the proposition is not limited to them.

Proposition 1.

Let π ϕ,θ (j) be the posterior probability density functions of vertices j ∈ V n and α nj their weights from the unit card(V n )-simplex. Their approximation by a single density \(\widetilde{\pi }_{\phi,\theta }^{(n)}\) optimal in the Kullback–Leibler sense (3.10) has the form

$$\displaystyle{ \widetilde{\pi }_{\phi,\theta }^{(n)} \propto \prod _{j\in V _{n}}\left [\pi _{\phi,\theta }^{(j)}\right ]^{\alpha _{nj} }. }$$
(3.11)

The resulting approximate posterior density would hence virtually parameterize a much richer mixture; however, the individual densities overlap by the given assumption, and Proposition 1 then provides a reduction to the parameterization of K components,

$$\displaystyle\begin{array}{rcl} \widetilde{\pi }_{\phi }^{(n)} \propto \prod _{ j\in V _{n}}\left [\pi _{\phi }^{(j)}\right ]^{\alpha _{nj} }\qquad \text{and}\qquad \widetilde{\pi }_{\theta }^{(n)} \propto \prod _{ j\in V _{n}}\left [\pi _{\theta }^{(j)}\right ]^{\alpha _{nj} },& & {}\\ \end{array}$$

which, due to the structure of the conjugate priors (see Appendix) and component ordering yields

$$\displaystyle\begin{array}{rcl} \widetilde{\xi }_{k,t}^{(n)} =\sum _{ j\in V _{n}}\alpha _{nj}\xi _{k,t}^{(j)},\quad \widetilde{\nu }_{ k,t}^{(n)} =\sum _{ j\in V _{n}}\alpha _{nj}\nu _{k,t}^{(j)},\qquad \text{and}\qquad \widetilde{\kappa }_{ k,t}^{(n)} =\sum _{ j\in V _{n}}\alpha _{nj}\kappa _{k,t}^{(j)},& & {}\\ \end{array}$$

for the hyperparameters ξ, ν and κ of the prior distributions for θ and ϕ, respectively. The resulting KL-optimal posterior is then again conjugate to the model and can be used for the subsequent adaptation step.
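In code, the combination step is therefore a convex combination of the hyperparameters of the neighbors' posteriors. The sketch below again uses the illustrative NIG(m, s, a, b) parameterization rather than the (ξ, ν) hyperparameters of the Appendix; the NIG factors are combined through their natural exponential-family parameters and mapped back, which is equivalent to the linear combination stated above. All names are assumptions for this sketch.

```python
import numpy as np

def combine(posteriors):
    """KL-optimal combination at vertex n with uniform weights alpha_nj = 1/card(V_n).
    `posteriors` is a list of (kappa, nig_params) pairs, one per j in V_n."""
    alpha = 1.0 / len(posteriors)

    kappas = np.array([kappa for kappa, _ in posteriors], dtype=float)
    kappa_comb = alpha * kappas.sum(axis=0)                       # tilde-kappa_k

    nigs = np.array([nig for _, nig in posteriors], dtype=float)  # shape (J, K, 4)
    m, s, a, b = nigs[..., 0], nigs[..., 1], nigs[..., 2], nigs[..., 3]
    lam = 1.0 / s                                                 # precision-type hyperparameter
    lam_c = alpha * lam.sum(axis=0)                               # combined precision
    m_c = alpha * (lam * m).sum(axis=0) / lam_c                   # combined location
    a_c = alpha * a.sum(axis=0)                                   # combined shape
    b_c = alpha * (b + 0.5 * lam * m ** 2).sum(axis=0) - 0.5 * lam_c * m_c ** 2
    nig_comb = np.stack([m_c, 1.0 / lam_c, a_c, b_c], axis=-1)
    return kappa_comb, nig_comb
```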

4 Simulation Example

The simulation example deals with estimating a three-component normal mixture model, for simplicity univariate, of the form

$$\displaystyle{ Y \sim \frac{1} {3}\mathrm{N}(-2,1) + \frac{1} {3}\mathrm{N}(4,1) + \frac{1} {3}\mathrm{N}(8,2), }$$

with unknown means and variances. The graph G(V, E), whose scheme is depicted together with the components and samples in Fig. 3.1, consists of the set of vertices \(V =\{ 1,\ldots,8\}\setminus \{6\}\). The sixth vertex is disconnected and serves for comparison. The vertices \(n \in V \cup \{ 6\}\) take observations y t (n) with \(t = 1,\ldots,300\). Clearly, one would expect relatively easy identification of the leftmost component, while the other two may be problematic due to their closeness.

Fig. 3.1 Left: Layout of the graph with isolated node 6 for comparison. Right: Normalized histogram and true components of the mixture

The quasi-Bayesian estimation of the components k ∈ { 1, 2, 3} exploits the conjugate normal inverse-gamma prior \(\mathrm{NIG}(\mu _{k},\sigma _{k}^{2};m_{k},s_{k},a_{k},b_{k}) = \mathrm{N}(\mu _{k}\vert \sigma _{k}^{2};m_{k},\sigma _{k}^{2}s_{k}) \times \mathrm{IG}(\sigma _{k}^{2};a_{k},b_{k})\) with the initial hyperparameters m k set to 0, 3, and 6, respectively; the other hyperparameters are \(s_{k} = 1,a_{k} = 2,b_{k} = 2\) for all k. The prior for the component probabilities is \(\phi \sim \mathrm{ Dir}(\frac{1} {3}, \frac{1} {3}, \frac{1} {3})\). This initialization is identical across the graph.
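The whole experiment can be wired together roughly as follows. The snippet reuses the hypothetical helpers `adapt` and `combine` from the previous sketches; the neighborhood structure of Fig. 3.1 is not reproduced here, so `neighborhoods` is only a placeholder (vertex 6 must stay isolated), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 300
phi_true = np.array([1 / 3, 1 / 3, 1 / 3])
mu_true = np.array([-2.0, 4.0, 8.0])
sigma_true = np.array([1.0, 1.0, np.sqrt(2.0)])

vertices = list(range(1, 9))
# Placeholder neighborhoods: fill in according to Fig. 3.1; vertex 6 stays isolated.
neighborhoods = {n: {n} for n in vertices}

# Identical initialization across the graph, as described above.
kappa0 = [1 / 3, 1 / 3, 1 / 3]
nig0 = [(0.0, 1.0, 2.0, 2.0), (3.0, 1.0, 2.0, 2.0), (6.0, 1.0, 2.0, 2.0)]
state = {n: (list(kappa0), list(nig0)) for n in vertices}

for t in range(T):
    obs = {}
    for n in vertices:                                   # each vertex observes y_t(n)
        k = rng.choice(K, p=phi_true)                    # latent active component
        obs[n] = rng.normal(mu_true[k], sigma_true[k])
    # Adaptation: incorporate the observations shared within each neighborhood.
    adapted = {n: adapt(obs, neighborhoods[n], *state[n]) for n in vertices}
    # Combination: KL-optimal merge of the neighbors' posteriors.
    state = {n: combine([adapted[j] for j in neighborhoods[n]]) for n in vertices}
```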

The progress of the point estimates of μ k and σ k is depicted in Fig. 3.2 for the isolated vertex 6 (left) and the randomly chosen vertex 4 (right). The point estimates of μ k converge relatively well in both cases; however, the variance estimates converge well only in the case of distributed estimation (with the exception of \(\sigma _{1}^{2}\)). This is due to the much richer data available to the interconnected vertices. The mean squared errors (MSEs) of the final estimates are given in Table 3.1.

Fig. 3.2 Evolution of estimates of component means and standard deviations. Left: isolated vertex 6. Right: situation at a chosen cooperating vertex 4. Solid black lines depict true values

Table 3.1 Statistics of mean squared errors (MSEs) of the resulting estimates: distributed estimation and the isolated vertex 6

5 Conclusion

The quasi-Bayesian method for analytically tractable sequential inference of the parameters of probabilistic mixtures has been extended to the case of distributed estimation of a normal mixture model with unknown mixing probabilities and component parameters. Here, distributed means that there is a graph (network) of cooperating vertices (nodes, agents) sharing their statistical knowledge (observations and estimates) with a limited subset of other vertices. This knowledge is combined at each vertex: the observations are incorporated by means of Bayes’ theorem, and the estimates are combined via the Kullback–Leibler optimal rule.

The main advantage of the method is its simplicity and scalability. Unlike Monte Carlo approaches, it is computationally very cheap. The authors have recently shown in [2] that this method is suitable for the whole class of mixture models consisting of exponential family distributions and their conjugate prior distributions.

One difficulty associated with the method is common to most mixture estimation methods, namely initialization. In addition, merging and splitting of components after the combination of estimates would significantly enhance the suitability of the approach for dynamic cases. These topics remain for further research.