Parameterizing mixture models with generalized moments

Huang, Zhiyue; Marriott, Paul

doi:10.1007/s10463-014-0490-9

Parameterizing mixture models with generalized moments

Published: 15 October 2014

Volume 68, pages 269–297, (2016)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Annals of the Institute of Statistical Mathematics Aims and scope Submit manuscript

Parameterizing mixture models with generalized moments

Download PDF

Zhiyue Huang¹ &
Paul Marriott¹

284 Accesses
1 Citation
Explore all metrics

An Erratum to this article was published on 06 May 2015

Abstract

This paper considers a new way of parameterizing mixture models where parameters are interpreted as the generalized moments of the mixing distribution. Following a dimensionality reduction approach, approximate models have a finite-dimensional parameter with a corresponding parameter space: a moment space. The geometry of the moment space is studied and we derive the properties of the reconstructed mixing distributions. Links between the reparameterization and estimation methods for mixture models are also briefly discussed.

Mixture models: building a parameter space

Article 26 February 2016

On Mean And/or Variance Mixtures of Normal Distributions

Clustering via finite nonparametric ICA mixture models

Article 28 August 2018

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Mixture models can be found in a wide variety of statistical applications; a comprehensive introduction can be found in McLachlan and Peel (2000). This paper considers the class of non-parametric mixtures of exponential families

$$\begin{aligned} f_{\mathrm{Mix}}(x;Q) {:=} \int _a^b f(x;\theta ) \mathrm{d}Q(\theta ), \quad x \in {\mathcal{S}}, \end{aligned}$$

where for each $\theta \in [a,b]$, the component distribution $f(x;\theta )$ is in the exponential family, i.e.

$$\begin{aligned} f(x;\theta ) = h(x)\exp (\eta (\theta ) T(x) - A(\theta )), \end{aligned}$$

where $T(x)$, $h(x)$, $\eta (\theta )$ and $A(\theta )$ are known functions, and the mixing distribution $Q(\theta )$ is a probability measure over a known, compact set $[a,b]$.

Example 1

Consider a general mixture of Poisson distributions,

$$\begin{aligned} f_{\mathrm{Mix}}(x;{Q}) = \int _a^b {\mathrm{Pois}}(x;\theta ) \mathrm{d}Q(\theta ), \end{aligned}$$

where ${\mathrm{Pois}}(x;\theta )$ is the probability mass function of a Poisson distribution with mean $\theta $ and $Q(\theta )$ is a probability measure over a compact set $[a,b]$. Because the above model takes more variability into account than a single Poisson distribution, it has been used in a wide range of scientific fields for modelling non-homogeneous populations; see Schlattmann (2009).

As an example, consider a Thailand-based cohort study analyzed by Böhoning (1995). To study the health status of $602$ pre-school children, the number of times that a child who showed symptoms of fever, a cough, a runny nose, or these symptoms together, is recorded. Previous studies showed the existence of overdispersion. Therefore, a mixture of Poisson is suggested.

Example 2

Consider a general mixture of normal distributions,

$$\begin{aligned} f_{\mathrm{Mix}}(x;Q) = \int _{a}^b N(x;\theta , \sigma ^2) \mathrm{d}Q(\theta ), \end{aligned}$$

where $N(x;\theta , \sigma ^2)$ is the probability density function of a normal distribution with mean $\theta $ and variance $\sigma ^2$, and $Q(\theta )$ is a probability measure over a compact region $[a,b]$. The above model has wide applications due to its flexibility; see McLachlan and Peel (2000).

One possible application of a mixture of normals is to assess the impact of possible underlying genotypes that display continuous variation in the population. Roeder (1994) gave an example of the sodium–lithium countertransport (SLC) activity in red blood cells. The data set consists of the SLC activity from $190$ individuals. It is suspected that the SLC activity is mainly affected by a gene with two alleles. This implies that there are three possible genotypes. And thus, it is reasonable to consider a mixture of normals with at most three components; see Roeder (1994).

A general, or nonparametric, mixture model has an infinite dimensional parameter space. In frequentist statistics, the maximum likelihood estimator with an infinite dimensional parameter may not be consistent and may not be efficient in the sense that the Cramer–Rao bound is not attained even asymptotically; see Neyman and Scott (1948). In Bayesian statistics, a prior on an infinite dimensional space is not easily defined and can be highly informative even with large amounts of data; see Marriott (2007). To deal with this issue, we can use the modified likelihood (Lindsay 1980) or reduce the dimension of the parameter space (Marriott 2002, 2007). This paper follows the dimensionality reduction idea and makes several contributions.

Firstly, we give a general framework for the reparameterization of a mixture model with a complete orthonormal basis in a Hilbert space. The new parameters are interpreted as the moments of the mixing distribution $Q(\theta )$, which are induced by Chebyshev systems (defined in Sect. 2).

Definition 1

Let $\{u_i(\theta )\}_{i=0}^r$ form a Chebyshev system over $[a,b]$. For $i=0,1,\ldots ,r$, the $i$ th moment of a probability measure $Q(\theta )$ induced by $\{u_i(\theta )\}_{i=0}^r$ is defined as:

$$\begin{aligned} m_i(Q) {:=} {E}_{\theta }[u_i(\theta );Q] = \int _a^b u_i(\theta ) \mathrm{d}Q(\theta ) < \infty . \end{aligned}$$

Secondly, after approximating the reparameterized model using a dimensionality reduction technique, we introduce the moment space (defined in Sect. 3) as the parameter space of the approximated model. The moment space is natural from the point of view of the interpretation of the parameters and allows us to reconstruct the mixing distribution from the moments.

Lastly, we study the geometry of the moment space, showing important properties of the reconstructed mixing distributions, including results on existence, uniqueness, number of support points and gradient characterization.

In Sect. 2, we illustrate how to reparameterize a mixture model by a complete orthonormal basis and interpret the new parameters. In Sect. 3, we approximate the reparameterized model by a model with a finite-dimensional parameter and study the quality of the approximation. In Sect. 4, we study the geometry of the moment space in two ways: the positive representation and the gradient characterization. In Sect. 5, we show two real examples to demonstrate the application of the reparameterization method. Lastly, we discuss the links between the reparameterization and two estimation methods: the method of moments and the maximum likelihood estimator.

2 Parameterization in moments

Consider a measure space $({\mathcal{S}}, \varSigma , \mu _0)$ and the $L^2(\mathcal{S}, \mu _0)$ space induced by it. We assume $\mu _0$ is a probability measure with support $\mathcal S$. We will also denote, where appropriate, $\mu _0(x) = f_0(x)\mu (x)$ with respect to a fixed measure $\mu $, typically Lebesgue or a counting measure. The probability function $f_0(x)$ is either fully known or lies in a parametric family. Let the set $\{e_i(x)\}_{i=0}^{ \infty }$ form a complete orthonormal basis of $L^2(\mathcal{S}, \mu _0)$, i.e.

$$\begin{aligned} \langle e_i(x), e_j(x) \rangle _{L^2({\mathcal{S}},\mu _0)} = \int _{\mathcal{S}} e_i(x)e_j(x) f_0(x) \mathrm{d}x = \delta _{ij}, \end{aligned}$$

where $\delta $ is the Kronecker delta.

Assume that for each $\theta \in [a,b]$, the function $f(x;\theta )/{f_0(x)}$ belongs to $L^2(\mathcal{S}, {\mu }_0)$, i.e., $\int _\mathcal{S} f(x;\theta )^2/f_0(x) dx < \infty $. According to standard results in Hilbert spaces (Debnath and Mikusiński 1999), we have the expansion

$$\begin{aligned} f(x;\theta ) = \sum _{i=0}^{\infty } u_i(\theta ) e_i(x) {f_0(x)}, \end{aligned}$$

(1)

where for each $i \in \{ 0,1,\dots \}$,

$$\begin{aligned} u_i(\theta ) = \left\langle e_i(x), \frac{f(x;\theta )}{{f_0(x)}} \right\rangle _{L^2(\mathcal{S}, \mu _0)}. \end{aligned}$$

(2)

Taking expectations on both sides of (1), with respect to $\theta $ under the mixing distribution $Q(\theta )$, gives

$$\begin{aligned} f_{\mu _0}(x;{\varvec{m}}_{\infty }) = \sum _{i=0}^{\infty } m_i(Q) e_i(x) {f_0(x)}, \end{aligned}$$

where ${\varvec{m}}_{\infty } = (m_1,m_2,\ldots )^{\mathrm{T}} \in {R}^{\infty }$, and for each $i$,

$$\begin{aligned} m_i(Q) = \int _a^b u_i(\theta )\mathrm{d}Q(\theta ) = {E}_{\theta }[u_i(\theta );Q]. \end{aligned}$$

When $e_0(x) \equiv 1$, $x\in \mathcal S$, we have $u_0(x) = \int _\mathcal{S} 1 \times \frac{f(x;\theta ) }{f_0(x)} f_0(x) \mathrm{d}x =1$, and so gives a reparameterization

$$\begin{aligned} f_{\mu _0}(x;{\varvec{m}}_{\infty }) = f_0(x) + \sum _{i=1}^{\infty } m_i(Q) e_i(x) {f_0(x)}. \end{aligned}$$

(3)

Furthermore, by orthogonality, for $i > 0$, we have $\int _\mathcal{S} 1 \times e_i(x) f_0(x) \mathrm{d}x =0$ so that (3) integrates to one.

Definition 2

The set of functions $\{u_i(\theta )\}_{i=0}^r$ is a Chebyshev system over $\varTheta \subseteq {R}$, if we have $\det (u_i(\theta _j))_{i,j=0}^r > 0$ whenever $\theta _0 < \theta _1 < \cdots < \theta _r$ and $\theta _j \in \varTheta $, $j=0,1,\ldots ,r$.

Definition 3

If $\{u_i(\theta )\}_{i=0}^{\infty }$ in (2) forms a Chebyshev system over $[a,b]$ with $u_0(\theta )\equiv 1$, then for a given $f_0(x)$, $x \in \mathcal S$, the formula (3) is the reparameterization of the mixture model $f(x;Q)$ in the moments induced by $\{u_i(\theta )\}_{i=0}^{\infty }$.

When $e_0(x)\equiv 1$, the reparameterization of the mixture model $f_{\mathrm{Mix}}(x;Q)$ is locally defined by $\{e_i(x)\}_{i=0}^{\infty }$ at $f_0(x)$. The choice of the basis depends on the specific inference problem. Here, we give an example where the mean of the mixed distribution, rather than the whole of the mixing distribution, is of inferential interest. Other examples will be seen in Sect. 3.

2.1 The moments induced by power functions

This subsection considers inference for the mean parameter $\theta _0$ in the mixture model

$$\begin{aligned} f_{\mathrm{Mix}}(x; \theta _0, Q) = \int _a^b f(x;\theta _0 + \eta ) \mathrm{d}Q(\eta ), \end{aligned}$$

where $Q(\eta )$ is a zero mean, for identification reasons, probability measure over $[a,b]$. Furthermore, the component distributions are natural exponential models with quadratic variance functions. This class includes the normal, Poisson, gamma, binomial and negative binomial families; see Morris (1982, 1983), and has the following formal definition.

Definition 4

If $f(x;\theta )$ is a natural exponential family in the mean parameterization, then $V_f(\theta )$, defined by $ V_f(\theta ){:=} {E}_X [(X- \theta )^2], $ is called the variance function. If the variance function $V_f(\theta )$ is quadratic with the form $ V_f(\theta ) = v_0 + v_1 \theta + v_2 \theta ^2, $ then we say $f(x;\theta )$ is a natural exponential family with quadratic variance function.

In this problem $\theta _0$ is the parameter of interest and the mixing distribution $Q$ is considered as a nuisance parameter. The mixture model $f_{\mathrm{Mix}}(x;\theta _0,Q)$ could be expanded by a Laplace expansion; see Marriott (2007). Here, we describe this process within our new framework.

Following (Morris 1983), we define, for $i=0,1,2,\ldots ,$

$$\begin{aligned} P_i(x;\theta ) = \frac{V_f^i(\theta )}{f(x;\theta )} \frac{\partial ^i}{\partial \theta ^i} f(x;\theta ), \end{aligned}$$

$ a_i = i! \prod _{j=0}^{i-1} (1+jv_2) \equiv i! b_i$, and let $f_0(x) = f(x;\theta _0)$. Morris (1983) showed that

$$\begin{aligned} \langle P_i(x;\theta _0), P_j(x;\theta _0)\rangle _{L^2(\mathcal{S}, \mu _0)} = \delta _{ij} a_i V_f^i(\theta _0) \end{aligned}$$

and

$$\begin{aligned} \left\langle P_i(x;\theta _0), \frac{f(x;\theta )}{f_0(x)} \right\rangle _{L^2(\mathcal{S}, \mu _0)} = \int _\mathcal{S} P_i(x;\theta _0)f(x;\theta )\mathrm{d}x = b_i (\theta - \theta _0)^i = b_i \eta ^i. \end{aligned}$$

For a given $\theta _0 \in (a,b)$, a mixture of natural exponential families with quadratic variance functions can be reparameterized as:

$$\begin{aligned} f(x;{\varvec{m}}_{\theta _0,\infty }) = f(x;\theta _0) + \sum _{i=2}^{\infty } m_i(Q) \frac{1}{i!} \frac{P_i(x;\theta _0)}{V_f^i(\theta _0)} f(x;\theta _0) , \end{aligned}$$

(4)

where ${\varvec{m}}_{\theta _0,\infty } = (\theta _0, m_2, m_3, \ldots )^{\mathrm{T}} \in {R}^{\infty }$ and for each $i=1,2,\ldots ,$

$$\begin{aligned} m_i(Q) = \int _a^b \eta ^i \mathrm{d}Q(\eta ). \end{aligned}$$

For each $i$, $u_i(\eta ) = \eta ^i$. The set $\{ u_i(\eta )\}_{i=0}^{\infty }$ is a power set in a compact region and so forms a Chebyshev system. So, (4) is a reparameterization of mixture models in the moments induced by power functions.

3 Dimensionality reduction in parameter spaces

In the reparameterization of mixture model in Definition 3, we still have a non-finite dimensional parameter space which could be problematic from an inferential point of view; see Marriott (2007). Following Marriott (2002, 2007), we approximate (3) with a finite sum

$$\begin{aligned} f_{\mu _0}(x;{\varvec{m}}_c) {:=} f_0(x) + \sum _{i=1}^r m_i(Q) e_i(x)f_0(x), \end{aligned}$$

(5)

where ${\varvec{m}}_c = (m_1,m_2,\ldots ,m_r)^{\mathrm{T}} \in {R}^r$. The error between $f_{\mu _0}(x;{\varvec{m}}_c)/f_0(x)$ and $f_{\mu _0}(x;{\varvec{m}}_{\infty })/f_0(x)$ in $L^2(\mathcal{S}, \mu _0)$ is

$$\begin{aligned} {e}_r(Q)&:= \left\| \left( f_{\mu _0}(x;{\varvec{m}}_c) - f_{\mu _0}(x;{\varvec{m}}_{\infty })\right) /{f_0(x)} \right\| _{L^2(\mathcal{S}, \mu _0)}^2 \nonumber \\&= \int _\mathcal{S} \left( \sum _{i=r+1}^{\infty } m_i(Q) e_i(x) \right) ^2 f_0(x)\mathrm{d}x \nonumber \\&= \sum _{i=r+1}^{\infty } m_i^2(Q) < \infty . \end{aligned}$$

(6)

3.1 The moments induced by an integral operator

To find a good finite dimensional approximation, one approach is to find an orthonormal basis $\{\gamma _i(x)\}_{i=0}^{\infty }$ in $L^2(\mu _0,\mathcal{S})$ which minimizes ${e}_r(Q)$ for a given $r$. However since the error ${e}_r(Q)$ depends on the unknown mixing distribution $Q$, instead we aim to minimize an upper bound of ${e}_r(Q)$ instead. By the Cauchy–Schwarz inequality, an upper bound is

$$\begin{aligned} \int _a^b (\mathrm{d}Q(\theta )/ \mathrm{d} \theta )^2 \mathrm{d}\theta \times \sum _{i=r+1}^{\infty } \int _a^b u_i^2(\theta ) \mathrm{d}\theta . \end{aligned}$$

(7)

With the assumption that $\int _a^b (\mathrm{d}Q(\theta )/ \mathrm{d} \theta )^2 \mathrm{d}\theta $ is finite, minimizing $\sum _{i=r+1}^{\infty } \int _a^b u_i^2(\theta ) \mathrm{d}\theta $ is equivalent to minimizing the upper bound.

We can obtain a basis $\{\gamma _i(x)\}_{i=0}^{\infty }$ from the eigenfunctions $\{\tilde{\gamma }_i(x)\}_{i=0}^{\infty }$ to the integral operator

$$\begin{aligned} (Ag)(x) {:=} \int _\mathcal{S} g(y) K(x,y) \mathrm{d}y < \infty , \end{aligned}$$

(8)

with the kernel function

$$\begin{aligned} K(x,y) {:=} \int _a^b \frac{f(x;\theta )}{f_0^{1/2}(x)} \frac{f(y;\theta )}{f_0^{1/2}(y)}\mathrm{d}\theta , \quad (x,y) \in \mathcal{S} \times \mathcal{S}. \end{aligned}$$

This integral operator is positive and self-adjoint; see Debnath and Mikusiński (1999). If $A(\cdot )$ is also compact, the set $\{\gamma _i(x)\}_{i=0}^{\infty }$ forms the complete orthonormal basis in $L^2({\mu }_0,\mathcal{S})$ that minimizes $\sum _{i=r+1}^{\infty } \int _a^b u_i^2(\theta ) \mathrm{d}\theta $, where for each $i$,

$$\begin{aligned} \gamma _i(x) {:=} \tilde{\gamma }_i(x)/f_0^{1/2}(x), \quad x \in \mathcal S; \end{aligned}$$

(9)

see the results on functional principle component analysis in Horváth and Kokoszka (2012). To show the compactness of $A(\cdot )$, one sufficient condition is that

$$\begin{aligned} \int _a^b \int _\mathcal{S} f^2(x;\theta )/f_0(x) \mathrm{d}x \mathrm{d}\theta < \infty ; \end{aligned}$$

see (16) in the Appendix.

To have $\gamma _0(x) \equiv 1$ and $\tilde{\gamma }_0(x) = f_0^{1/2}(x)$, we need

$$\begin{aligned} \lambda _0 f_0^{1/2}(y)&= \int _\mathcal{S} f_0^{1/2}(x) K(x,y) \mathrm{d}x \\&= {f_0^{-1/2}(y)} \int _\mathcal{S} \int _a^b f(x;\theta ) f(y;\theta ) \mathrm{d}\theta \mathrm{d} x \\&= {f_0^{-1/2}(y)} \int _a^b f(y;\theta ) \mathrm{d} \theta . \end{aligned}$$

It follows that $\lambda _0 = b-a$ and $\gamma _0(x) \equiv 1$ when $f_0(x) = \frac{1}{b-a} \int _a^b f(x;\theta )\mathrm{d}\theta $.

Theorem 1

For each $i=0,1,\dots $, let

$$\begin{aligned} \phi _i(\theta ) {:=} \left\langle \gamma _i(x), \frac{f(x;\theta )}{f_0(x)} \right\rangle _{L^2(\mathcal{S}, \mu _0)} = \int _\mathcal{S} f(x;\theta ) \gamma _i(x) \mathrm{d}x < \infty , \quad \theta \in [a,b], \end{aligned}$$

where $\gamma _i(x)$ is defined by (9). For each $r=1,2,\ldots $, the set of functions $\{\phi _i(\theta )\}_{i=0}^{r}$ forms a Chebyshev system over $[a,b]$.

Proof

See the Appendix.

Example 1

(continued) Let $[a,b] = [0,25]$ and

$$\begin{aligned} f_0(x) = \frac{1}{25}\int _0^{25} {\mathrm{Pois}}(x;\theta )\mathrm{d}\theta . \end{aligned}$$

It can be shown that

$$\begin{aligned} \sum _{x=0}^{\infty } \left( {\mathrm{Pois}}(x;\theta )\right) ^2/f_0(x) < \infty , \end{aligned}$$

for each $\theta \in [0,25]$. Figure 1 shows the largest $10$ eigenvalues of the integral operator $A(\cdot )$ in (8), the functions $\gamma _i(x)$ and $\phi _i(\theta )$ corresponding to the largest $4$ eigenvalues.

Example 2

(continued) For each fixed $\sigma ^2\ge 0$, let $[a,b] = [0,0.7]$ and

$$\begin{aligned} f_0(x) = \frac{1}{0.7} \int _{0}^{0.7} N(x;\theta ,\sigma ^2)\mathrm{d}\theta . \end{aligned}$$

It can be shown that

$$\begin{aligned} \int _{-\infty }^{\infty } \left( N(x;\theta ,\sigma ^2)\right) ^2/f_0(x) \mathrm{d}x < \infty , \end{aligned}$$

for each $\theta \in [a,b]$. For $\sigma ^2= 0.07^2$, Fig. 2 shows the largest $10$ eigenvalues of the integral operator $A(\cdot )$ in (8), the functions $\gamma _i(x)$ and $\phi _i(\theta )$ corresponding to the largest $4$ eigenvalues.

3.2 The quality of the approximation

We need to consider the parameter space for the new parameters, ${\varvec{m}}_c$, and also reconstruct the mixing distribution $Q$ from ${\varvec{m}}_c$. Recall that the new parameters are interpreted as the moments of $Q$. Then, it is natural to consider the moment space in Definition 5 as the parameter space of (5). More nice properties of the moment space will be seen in Sect. 4.

Definition 5

Let $\{u_i(\theta )\}_{i=0}^r$ form a Chebyshev system over $[a,b]$ with $u_0(\theta ) \equiv 1$. The moment space induced by $\{u_i(\theta )\}_{i=0}^r$ is

$$\begin{aligned} \mathcal{K}_{r} = \left\{ {\varvec{m}}_c = (m_1,m_2,\ldots ,m_r)^{\mathrm{T}} \in {R}^r| {\varvec{m}}_c = \int _a^b {\varvec{u}}_c(\theta )\mathrm{d}Q(\theta ) \right\} , \end{aligned}$$

where ${\varvec{u}}_c(\theta ) = (u_1(\theta ),u_2(\theta ),\ldots , u_r(\theta ))^{\mathrm{T}} \in {R}^r$ and $Q$ is a probability measure over $[a,b]$.

Note that there is no requirement that $f_{\mu _0}(x;{\varvec{m}}_c)$ be a non-negative function for any ${\varvec{m}}_c \in \mathcal{K}_r$. To ensure that $f_{\mu _0}(x;{\varvec{m}}_c)$ behaves like a probability density (or mass) function, the non-negative condition for each $x \in \mathcal S$ also should be considered; see Marriott (2002, 2007). However, the non-negative conditions could be omitted in some cases, as we will see in Sect. 4.2.

By (6), we have, for any mixing distribution $Q$ over $[a,b]$,

$$\begin{aligned} \min _{{\varvec{m}}_c \in \mathcal{K}_r}\Vert \left( f_{\mu _0}(x; {\varvec{m}}_c) - f_{\mathrm{Mix}}(x;Q)\right) /f_0(x)\Vert _{L^2(\mathcal{S}, \mu _0)}^2&\le {e}_r(Q). \end{aligned}$$

Furthermore, if $f_{\mu _0}(x;{\varvec{m}}_c)$ is reparameterized with the moments induced by $\{\phi _i(\theta )\}_{i=0}^{\infty }$, we can have

$$\begin{aligned} {e}_r(Q)&\le \int _a^b (\mathrm{d}Q(\theta )/ \mathrm{d} \theta )^2 \mathrm{d}\theta \times \sum _{i=r+1}^{\infty } \int _a^b \phi _i^2(\theta ) \mathrm{d}\theta \\&= \int _a^b (\mathrm{d}Q(\theta )/ \mathrm{d} \theta )^2 \mathrm{d}\theta \times \sum _{i=r+1}^{\infty } \int _\mathcal{S} \tilde{\gamma }_i(x) (A \tilde{\gamma }_i)(x) \mathrm{d}x \\&= \int _a^b (\mathrm{d}Q(\theta )/ \mathrm{d} \theta )^2 \mathrm{d}\theta \times \sum _{i=r+1}^{\infty } \lambda _i. \end{aligned}$$

We return to Examples 1 and 2 to examine the non-negativeness and the quality of the approximation of $f_{\mu _0}(x;{\varvec{m}}_c)$ over the sample space. Because each $f_{\mu _0}(x;{\varvec{m}}_c)$ can be written as a convex combination of $f_{\mu _0}(x;{\varvec{u}}_c(\theta ))$, we aim to study the approximation of $f_{\mu _0}(x;{\varvec{u}}_c(\theta ))$ to the component distribution $f(x;\theta )$ at each $(x,\theta ) \in \mathcal{S} \times [a,b]$. The quality of the approximation is measured by

$$\begin{aligned} \epsilon _4(x; \theta ) = f_{\mu _0}(x;{\varvec{u}}_c(\theta )) - f(x;\theta ) \end{aligned}$$

(10)

for each $(x,\theta ) \in \mathcal{S} \times [a,b]$.

Example 1

(continued) Let ${\varvec{u}}_c(\theta ) = (u_1(\theta ), \ldots u_4(\theta ))^{\mathrm{T}} \in {R}^4$. We consider the cases that ${\varvec{u}}_c(\theta )$ is induced by $\{(\theta -12.5)^i\}_{i=0}^4$ in Sect. 2.1 and $\{\phi _i(\theta )\}_{i=0}^4$ in Sect. 3.1. Figure 3 shows the negative region of $f_{\mu _0}(x;{\varvec{u}}_c(\theta ))$ over $\mathcal{S} \times [0,25]$ in these two cases.

Figure 4 examines the quality of the approximation. Various issues of the reparameterization with power moments are seen from panel (a). Firstly, the quality of the approximation is non-uniform at each point in the sample space. Secondly, the approximation is poor when $\theta $ is away from $\theta = \theta _0$. This is due to the nature of the underlying Laplace approximation where a polynomial approximation only behaves well in a small neighbourhood of $\theta = \theta _0$. On the other hand, from the panel (b), we see that the quality of the approximation is almost uniform at each point $(x,\theta ) \in \mathcal{S} \times [a,b]$, when the moments are induced by $\{\phi _i(\theta )\}_{i=0}^4$.

Example 2

(continued) Consider a fixed $\sigma ^2 = 0.07^2$. Again consider ${\varvec{u}}_c(\theta ) = (u_1(\theta ), \ldots u_4(\theta ))^{\mathrm{T}} \in {R}^4$ and ${\varvec{u}}_c(\theta )$ is induced by either $\{(\theta -0.35)^i\}_{i=0}^4$ or $\{\phi _i(\theta )\}_{i=0}^4$. Figure 5 shows the negative regions of $f_{\mu _0}(x;{\varvec{u}}_c(\theta ))$ over $\mathcal{S} \times [0,0.7]$ under these two types of reparameterizations. Also, Fig. 6 gives the contour plots of ${\epsilon }_4(x;\theta )$ over $\mathcal{S} \times [0,0.7]$. From the panel (a), we see the non-uniform and local approximation properties of the reparameterization with the power moments. On the other hand, the quality of the approximation is more uniform when the moments are induced by $\{\phi _i(\theta )\}_{i=0}^4$.

4 The geometry of moment spaces

In this section, we study the geometry of the moment-based parameter space. We aim to link the moments ${\varvec{m}}_c \in \mathcal{K}_r$ to the probability measure $Q(\theta )$ and we do this in two ways: the positive representation and the gradient characterization.

First, we introduce the moment cone induced by a Chebyshev system and its connection to the moment space. Let $\{u_i(\theta )\}_{i=0}^r$ form a Chebyshev system over $[a,b]$ with $u_0(\theta )\equiv 1$ and assume that for each $i=0,1,\ldots ,r$, the function $u_i(\theta )$ is continuous. When $\theta $ moves from $a$ to $b$, the trace of ${\varvec{u}}(\theta ) = (u_0(\theta ), \ldots , u_r(\theta ))^{\mathrm{T}} \in {R}^{r+1}$ forms the moment curve $\varGamma _{r+1}$ in ${R}^{r+1}$.

Definition 6

The conical cone of the curve $\varGamma _{r+1}$ is called the moment cone induced by $\{u_i(\theta )\}_{i=0}^r$, that is

$$\begin{aligned} \mathcal{M}_{r+1} {:=} \left\{ {\varvec{c}} = (c_0,c_1,\ldots ,c_r)^{\mathrm{T}} \in {R}^{r+1} | {\varvec{c}} = \int _a^b {\varvec{u}}(\theta ) \mathrm{d}\sigma (\theta )\right\} , \end{aligned}$$

where $\sigma (\theta )$ is a nondecreasing right continuous function of bounded variation and $\theta \in [a,b]$.

The moment cone contains the convex hull of $\varGamma _{r+1}$, denoted by $\mathrm{conv}(\varGamma _{r+1})$, because for each ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$, the vector ${\varvec{m}} = (1, {\varvec{m}}_c^{\mathrm{T}})^{\mathrm{T}} = \int _a^b {\varvec{u}}(\theta ) dQ(\theta )$, where $Q(\theta )$ is a probability measure over $[a,b]$. Moreover, we have the following result.

Proposition 1

If $u_0(\theta )\equiv 1$ in a Chebyshev system $\{u_i(\theta )\}_{i=0}^r$ over $[a,b]$, then the boundary of ${\mathrm{conv}(\varGamma _{r+1})}$ is a subset of the boundary of the moment cone $\mathcal{M}_{r+1}$ induced by $\{u_i(\theta )\}_{i=0}^r$.

Proof

See the Appendix.

4.1 The positive representations

As will be shown, a positive representation of a vector ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$ corresponds to a mixing distribution $Q$. To illustrate the positive representation of a nonzero vector in $\mathrm{conv}(\varGamma _{r+1})$, we need to first introduce the positive representation and its index.

Definition 7

A nonzero vector $\varvec{c}$ has a positive representation in a Chebyshev system $\{u_i(\theta )\}_{i=0}^r$, if it can be written in the form of

$$\begin{aligned} {\varvec{c}} = \sum _{j=1}^J \beta _j {\varvec{u}}(\theta _j), \end{aligned}$$

(11)

where ${\varvec{u}}(\theta ) \in \varGamma _{r+1}$, $a \le \theta _1 < \theta _2 < \cdots < \theta _J \le b$ and $\beta _j >0$, $j=1,2,\ldots ,J$. If $\sum _{j=1}^J \beta _j = 1$, the positive representation (11) is called a convex representation.

Definition 8

Let

$$\begin{aligned} \mathcal{I}(\theta ) {:=} \left\{ \begin{array}{@{}l@{\quad }l} 1, &{} \quad \text{ if } \theta \in (a,b)\text{; } \\ 1/2, &{} \quad \text{ if } \theta = a \text{ or } b. \end{array}\right. \end{aligned}$$

If ${\varvec{c}}$ has the positive representation (11), the index of $\varvec{c}$, denoted by $\mathcal{I}({\varvec{c}})$, is $\sum _{j=1}^J \mathcal{I}(\theta _j)$.

According to Carathéodory’s theorem, for each vector ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$, there exists a convex representation of ${\varvec{m}}$ by $\{u_i(\theta )\}_{i=0}^r$ with $J < r+1$. We have the following:

Theorem 2

For each ${\varvec{m}}_c \in \mathcal{K}_r$, the moment space, there exists a probability measure $Q(\theta )$ such that ${\varvec{m}}_c = \int _a^b {\varvec{u}}_c(\theta )\mathrm{d}Q(\theta )$ and $Q(\theta )$ has at most $r+1$ support points over $[a,b]$.

If we further assume ${\varvec{m}}$ is on the boundary of $\mathrm{conv}(\varGamma _{r+1})$, the upper bound of the number of support points can be sharpened using the following proposition in Karlin and Studden (1966).

Proposition 2

A nonzero vector ${\varvec{c}}$ is a boundary point of $\mathcal{M}_{r+1}$, the moment cone, induced by $\{u_i(\theta )\}_{i=0}^r$ over $[a,b]$ if and only if $\mathcal{I}({\varvec{c}}) < (r+1)/2$. Moreover, its positive representation is unique with $J \le (r+2)/2$.

With Proposition 2 and the fact that $\hat{\varvec{m}}$ is on the boundary of $\mathcal{M}_{r+1}$, we have the following.

Theorem 3

If ${\varvec{m}}_c$ is on the boundary of $\mathcal{K}_{r}$, there exists one unique probability measure $\hat{Q}(\theta )$ such that ${\varvec{m}}_c =\int _a^b {\varvec{u}}_c(\theta )\mathrm{d}\hat{Q}(\theta )$ and $\hat{Q}(\theta )$ has at most $(r+2)/2$ support points.

Example 1

(continued) Figure 7 shows the moment cones $\mathcal{M}_3$ induced by $\{(\theta -12.5)^i\}_{i=0}^2$ and $\{\phi _i(\theta )\}_{i=0}^2$. In each plot, the curve $\varGamma _3$ is induced by the corresponding Chebyshev system and its convex hull $\mathrm{conv}(\varGamma _3)$.

The boundary of $\mathcal{M}_3$ contains the boundary of $\mathrm{conv}(\varGamma _3)$; see Proposition 1. The boundary vectors of $\mathrm{conv}(\varGamma _3)$ are either ${\varvec{u}}(\theta ) \in {R}^3$ or $(1-\alpha ){\varvec{u}}(0) + \alpha {\varvec{u}}(1)$, where $0< \alpha < 1$. Therefore, the index of a boundary vector is either $1$ or $3/2$; see Theorem 3. On the other hand, if the index of a vector is less than $3/2$, it must locate on the boundary. Moreover, when ${\varvec{m}}$ is on the boundary, it uniquely corresponds to a mixing distribution. For example, one point on $\varGamma _3$ is the image of ${\mathrm{Pois}}(12)$ in ${R}^3$.

4.2 The gradient characterization

The gradient characterization is useful for computational algorithms. In the literature of the non-parametric MLE for mixture models, there exists a class of computational algorithms based on the same convex structure as considered here; see Böhoning (1995) and Wang (2007). This class has more stable computational speeds than the EM algorithm, which is also commonly used for mixture models.

In this subsection, we consider the following optimization problem:

$$\begin{aligned} \min _{{\varvec{m}}_c}&\quad&{\mathcal{L}}({\varvec{m}}_c) \\ {\mathrm{s.t.}}&\quad&{\varvec{m}}_c \in \mathcal{K}_r, \nonumber \end{aligned}$$

(12)

where $\mathcal{L}({\varvec{m}}_c)$ is an arbitrary loss function and strictly convex with respect to ${\varvec{m}}_c$. Given a random sample $X_1,X_2,\ldots ,X_N$, one example of $\mathcal{L}({\varvec{m}}_c)$ is the log-likelihood type function

$$\begin{aligned} \ell ({\varvec{m}}_c){: =} -\sum _{n=1}^N \log (f_{\mu _0} (x_n; {\varvec{m}}_c)). \end{aligned}$$

(13)

Denote

$$\begin{aligned} {\varvec{s}}_n {:=} \bigtriangledown \log (f_{\mu _0} (x_n; {\varvec{m}}_c)) = \left( \frac{e_1(x)f_0(x)}{f_{\mu _0} (x_n; {\varvec{m}}_c)}, \ldots , \frac{e_r(x)f_0(x)}{f_{\mu _0} (x_n; {\varvec{m}}_c)}\right) ^{\mathrm{T}}, \end{aligned}$$

and ${\varvec{S}}^{\mathrm{T}} {: =} ({\varvec{s}}_1, \ldots , {\varvec{s}}_N)$. Note that

$$\begin{aligned} \bigtriangledown \ell ({\varvec{m}}_c) =- {\varvec{S}}^{\mathrm{T}} {\varvec{1}}, \end{aligned}$$

and

$$\begin{aligned} \bigtriangledown ^2 \ell ({\varvec{m}}_c) = {\varvec{S}}^{\mathrm{T}}{\varvec{S}}, \end{aligned}$$

where ${\varvec{1}} = (1, \ldots , 1)^{\mathrm{T}} \in {R}^N$, $\bigtriangledown $ is the gradient and $\bigtriangledown ^2$ is the Hessian. The objective function $\ell ({\varvec{m}}_c)$ is strictly convex if and only if the matrix ${\varvec{S}}^{\mathrm{T}}{\varvec{S}}$ is positive definite. Note that the non-negative condition on $f_{\mu _0}(x;{\varvec{m}}_c)$ does not need to be explicitly included here because for each observed $X$, the value of $f_{\mu _0}(x;{\varvec{m}}_c)$ must be positive to minimize the objective function.

Since the optimization problem (12) is convex, its solution $\hat{\varvec{m}}_c$ is unique and on the boundary of $\mathcal{K}_r$. So there exists a supporting hyperplane of $\mathcal{K}_r$ at $\hat{\varvec{m}}_c$ such that

$$\begin{aligned} \mathcal{H} = \left\{ {\varvec{h}} = (h_1,\ldots ,h_{r})^{\mathrm{T}} \in {R}^{r} | ( \hat{\varvec{m}}_c-{\varvec{h}})^{\mathrm{T}} \bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c) = 0 \right\} . \end{aligned}$$

The following theorem states the relationship between $\mathcal H$ and the support points of $\hat{Q}$ in Theorem 3. Recall that, we have defined ${\varvec{u}}_c(\theta ) = (u_1(\theta ),\ldots , u_r(\theta ))^{\mathrm{T}} \in {R}^r$.

Theorem 4

Let $\hat{\varTheta }$ be the set of support points of $\hat{Q}$. Then, if a point $\hat{\theta } \in [a,b]$ is an element of $\hat{\varTheta }$, then ${\varvec{u}}_c(\hat{\theta })$ is on the hyperplane $\mathcal{H}$. The converse also holds.

Proof

See Lindsay (1983a, b).

The above theorem also implies that $\hat{\varTheta }$ is the set of zeros of the gradient function of the objective function $\mathcal{L}({\varvec{m}}_c)$ which is defined as:

$$\begin{aligned} \mathcal{D}(\hat{\varvec{m}}_c, {\varvec{u}}_c(\theta ))&:= \frac{\partial }{\partial \epsilon } \mathcal{L}((1- \epsilon ) \hat{\varvec{m}}_c + \epsilon {\varvec{u}}_c(\theta ) ) \biggr |_{\epsilon =0} \nonumber \\&= ({\varvec{u}}_c(\theta ) - \hat{\varvec{m}}_c)^{\mathrm{T}} \bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c). \end{aligned}$$

Moreover, we can use the gradient function to characterize $\hat{\varvec{m}}_c$ as follows.

Theorem 5

The following three statements are equivalent:

1.
$\hat{\varvec{m}}_c$ minimizes $\mathcal{L}({\varvec{m}}_c)$.
2.
$\inf _{\theta } \mathcal{D}(\hat{\varvec{m}}_c, {\varvec{u}}_c(\theta )) = 0$.
3.
$\hat{\varvec{m}}_c$ maximizes $\inf _{\theta } \mathcal{D}({\varvec{m}}_c, {\varvec{u}}_c(\theta ))$.

Proof

See Lindsay (1983a, b).

Now, we continue Example 1 to illustrate Theorems 4 and 5.

Example 1

(continued) In each panel of Fig. 8, we see the curve $\varGamma _3$ induced by $\{\phi _1(\theta ), \phi _2(\theta )\}$ and its convex hull $\mathcal{K}_2$ in the space of $({m}_1,m_2)^{\mathrm{T}} \in {R}^2$. The contours show the identical values of the objective function

$$\begin{aligned} \mathcal{L}({\varvec{m}}_c) = ({\varvec{t}} - {\varvec{m}}_c)^{\mathrm{T}}({\varvec{t}} - {\varvec{m}}_c), \end{aligned}$$

(14)

where ${\varvec{m}}_c \in {R}^2$ and ${\varvec{t}} = (-2, 0)^{\mathrm{T}}$. This objective function is chosen for its nice visual interpretation in Fig. 8. Because ${\varvec{t}} \notin \mathcal{K}_2$, $\mathcal{L}({\varvec{m}}_c)$ is strictly convex with respect to ${\varvec{m}}_c$. The minimum value of $\mathcal{L}({\varvec{m}}_c)$ over $\mathcal{K}_2$ is $0.5517$. As we see, the contour $\mathcal{L}({\varvec{m}}_c) = 0.5517$ has a unique intersection $\hat{\varvec{m}}_c$ with $\mathcal{K}_2$. Moreover, the intersection $\hat{\varvec{m}}_c$ is on the boundary of $\mathcal{K}_2$. In Fig. 8a, the solid line represents the supporting hyperplane $\mathcal H$ of $\mathcal{K}_2$ at $\hat{\varvec{m}}_c$. Here, we have $\hat{\varvec{m}}_c = {\varvec{u}}_c(\hat{\theta }) \in \mathcal H$; see Theorem 4.

Moreover, $\bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c)$ is orthogonal to the supporting hyperplane. For any vector ${\varvec{u}}_c(\theta ) \ne \hat{\varvec{m}}_c$ on $\varGamma _3$, we have the vector ${\varvec{u}}_c(\theta ) - \hat{\varvec{m}}_c$. From Fig. 8a, it can be seen that the angle $\psi \in [0,\pi ]$ between $\bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c)$ and ${\varvec{u}}_c(\theta ) - \hat{\varvec{m}}_c$ is always acute. Therefore, we have

$$\begin{aligned} \cos (\psi ) = \frac{\mathcal{D}(\hat{\varvec{m}}_c, {\varvec{u}}_c(\theta ))}{\sqrt{({\varvec{u}}_c(\theta ) - \hat{\varvec{m}}_c)^{\mathrm{T}}({\varvec{u}}_c(\theta ) - \hat{\varvec{m}}_c)}\sqrt{(\bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c))^{\mathrm{T}} (\bigtriangledown \mathcal{L}(\hat{\varvec{m}}_c))}} > 0; \end{aligned}$$

see Theorem 5 (2). It is also obvious that $\cos (\psi ) = 0$ if and only if ${\varvec{u}}_c(\theta ) = \hat{\varvec{m}}_c$. In Fig. 8b, we see that for any ${\varvec{m}}_c' \ne \hat{\varvec{m}}_c$ in $\mathcal{K}_2$, there always exists a ${\varvec{u}}_c(\theta ') \in \varGamma _3$ such that the angle $\psi '$ between $\bigtriangledown \mathcal{L}({\varvec{m}}_c')$ and ${\varvec{u}}_c(\theta ') - {\varvec{m}}_c'$ is obtuse. It follows that $\inf _{\theta }\mathcal{D}({\varvec{m}}_c',{\varvec{u}}_c(\theta )) < 0$; see Theorem 5 (3).

5 Real examples

5.1 The Thailand cohort study data

Consider the data on morbidity in northeast Thailand which was analyzed by Böhoning (1995); see also Schlattmann (2009). We fit a mixture of Poisson with $[a,b] = [0,25]$ by minimizing $\ell ({\varvec{m}}_c)$ in (13) over $\mathcal{K}_r$, where $f_{\mu _0}(x;{\varvec{m}}_c)$ is induced by $\{\phi _i(\theta )\}_{i=0}^r$ from Example 1 in Sect. 3.1 and $r=2,3,\ldots ,24$. Let $\hat{\varvec{m}}_c^{(r)} \in {R}^r$ be the solution and $\hat{Q}^{(r)}$ be the probability measure reconstructed from $\hat{\varvec{m}}_c^{(r)}$. Also, let $\hat{Q}^{(\infty )}$ be the non-parametric MLE of the mixing distribution.

The results are summarized in Table 1. When $r \le 5$, the $\hat{\varvec{m}}_c^{(r)} \in {R}^r$ is not on the boundary of $\mathcal{K}_r$ and, thus, no unique mixing distribution could be reconstructed from the moments. For each $r \ge 6$, the probability measure $\hat{Q}^{(r)}$ is unique and reported in Table 1. The log-likelihoods of $f_{\mathrm{Mix}}(x;\hat{Q}^{(r)})$ become stable and close to the log-likelihood of $f_{\mathrm{Mix}}(x;\hat{Q}^{(\infty )})$, when $r$ is large; also see Fig. 9b. Furthermore, for large $r$, the two distributions $\hat{Q}^{(r)}$ and $\hat{Q}^{(\infty )}$ are almost the same; see Table 1. Both of the observations support that the quality of the approximation of $f_{\mu _0}(x;{\varvec{m}}_c)$ to $f_{\mathrm{Mix}}(x;Q)$ is accurate, where ${\varvec{m}}_c \in \mathcal{K}_r$ and $r$ is finite.

Table 1 Some results of the fitted models to the Thailand cohort study data

Full size table

5.2 The sodium–lithium countertransport data

Consider the SLC data analyzed by Roeder (1994). We fit a mixture of normal with same variance. Let $[a,b] = [0,0.7]$. For each $r\ge 4$, we minimize $\ell ({\varvec{m}}_c, \sigma ^2)$ over $\mathcal{K}_r \times {R}^+$, where ${\varvec{m}}_c \in {R}^r$ is the moment vector induced by $\{\phi _i(\theta )\}_{i=1}^r$ from Example 2 in Sect. 3.1. Note here $\{\phi _i(\theta )\}_{i=1}^r$ depends on the value of $\sigma ^2$. Given $\sigma ^2$, let $\hat{\varvec{m}}_c^{(r)}(\sigma ^2)$ minimize $\ell ({\varvec{m}}_c, {\sigma }^2)$. The variance is estimated as the $\hat{\sigma }^2 = \arg \min _{{R^+}} \ell ( \hat{\varvec{m}}_c^{(r)}(\sigma ^2), \sigma ^2)$. Let $\hat{Q}^{(r)}$ be the reconstructed mixing distribution from $\hat{\varvec{m}}_c(\hat{\sigma }^2) \in \mathcal{K}_r$.

The results are summarized in Table 2 and some of the fitted models are plotted in Fig. 10. From the figure, we can see that $f_{\mathrm{Mix}}(x;\hat{Q}^{(r)})$ may not give sensible fitting until $r \ge 9$. Based on the prior knowledge that the data may consist of three sub-populations, we choose $f_{\mathrm{Mix}}(x;\hat{Q}^{(9)})$ as our final fitted model. This result is close to the three component mixture of normal with same variance fitted by maximizing the log-likelihood, in which the mixing parameters are $(0.223, 0.379, 0.577)$ with the mixing proportions $(0.774, 0.202, 0.024)$ and a standard deviation of $0.058$; see Roeder (1994). Lastly, we want to point out that the non-parametric MLE is not appropriate here because it is exactly the empirical distribution of the sample.

Table 2 Some results of the fitted models to the SLC data

Full size table

6 Discussion and conclusion

The essential idea in this paper is to use a moment space in ${R}^r$ to approximate an infinite-dimensional parameter space. We end this paper with a brief discussion of the following issues.

The links to the method of moments The method of moments for mixture models has a long history; see McLachlan and Peel (2000). It can be computed easily but suffers from a lack of efficiency; see Lindsay (1989). As a result, it is typically used to find the initial value for the computational algorithms of other estimation methods. According to the discussion in Sect. 2, the moments of mixing distribution are induced locally and correspond to local approximations of non-parametric mixtures. We may expect that the quality of the approximation is related to the efficiency of the method of moments. However, the connection between the two still remains unclear.

The links to the non-parametric MLE The link between the reparameterization in moments and the non-parametric MLE is their geometry. Lindsay (1983a, b) studied the geometry in the likelihood-based embedding spaces and gave the fundamental properties of the non-parametric MLE. Because the set of mixture is a convex hull of a curve in both cases, the properties of the reparameterization in moments are similar to those given by Lindsay .

The two types of geometry are different. The dimension of the likelihood-based embedding space is the number of distinct values in a random sample; see Lindsay (1983a) and Lindsay (1983b), while the dimension of the moment space is chosen depending on the quality of the approximation. In the two examples of the moments discussed in this paper, the dimension of the moment space is usually smaller than that of the likelihood-based embedding space, especially when the mixing region $[a,b]$ is narrow and the decay of the eigenvalues of (8) is fast. This makes the mixing distribution reconstructed from the moments have a sharpened upper bound of the number of support points.

Extension to the generalized linear mixed models As pointed out by the referees, mixture models have a much wider application in the context of regression. These models are known as the generalized linear mixed models or the random effects models; see Diggle et al. (2002). It would be interesting future work to examine the performance of the reparameterization of mixture models to the generalized linear mixed models and in practical applications.

References

Anaya-Izquierdo, K. A., Marriott, P. (2007). Local mixtures of the exponential distribution. Annals of the Institute of Statistical Mathematics, 59, 111–134.
Böhoning, D. (1995). A review of reliable maximum likelihood algorithms for semiparametric mixture models. Journal of Statistical Planning and Inference, 47, 5–28.
Debnath, L., Mikusiński, P. (1999). Introduction to Hilbert Spaces with Applications (2nd ed., pp. 87–130). San Diego: Academic Press Inc.
Diggle, P., Heagerty, P., Liang, K. Y., Zeger, S. (2002). Analysis of longitudinal data. Oxford: Oxford University Press.
Horváth, L., Kokoszka, P. (2012). Inference for functional data with applications (pp. 37–39). New York: Springer.
Karlin, S., Studden, W. J. (1966). Tchebycheff systems: With applications in analysis and statistics (pp. 1–49). New York: Wiley.
Lindsay, B. G. (1980). Nuisance parameters, mixture models, and the efficiency of partial likelihood estimators. Philosophical Transactions of the Royal Society of London Series A, Mathematical and Physical Sciences, 296, 639–662.
Lindsay, B. G. (1983a). The geometry of mixture likelihoods: a general theory. The Annals of Statistics, 11, 86–94.
Lindsay, B. G. (1983b). The geometry of mixture likelihoods. II. The exponential family. The Annals of Statistics, 11, 783–792.
Lindsay, B. G. (1989). Moment matrices: applications in mixture. The Annals of Statistics, 17, 722–740.
Lidnsay, B. G., Roeder, K. (1993). Uniqueness of estimation and identifiability in mixture models. Canadian Journal of Statistics, 21, 139–147.
Marriott, P. (2002). On the local geometry of mixture models. Biometrika, 89, 77–93.
Marriott, P. (2007). Extending local mixture models. Annals of the Institute of Statistical Mathematics, 59, 95–110.
McLachlan, G. J., Peel, D. (2000). Finite Mixture Models. New York: Wiley.
Morris, C. (1982). Natural exponential families with quadratic variance functions. The Annals of Statistics, 10, 65–80.
Morris, C. (1983). Natural exponential families with quadratic variance functions: statistical theory. The Annals of Statistics, 11, 515–529.
Neyman, J., Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Pinkus, A. (1996). Spectral properties of totally positive kernels and matrices. Total positivity and its application (pp. 477–511). Dordrecht: Kluwer Academic Publishers.
Roeder, K. (1994). A graphical technique for determining the number of components in a mixture of normals. Journal of the American Statistical Association, 89, 487–495.
Schlattmann, P. (2009). Medical applications of finite mixture models. Berlin: Springer.
Wang, Y. (2007). On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. Journal of the Royal Statistical Society: Series B, 69, 185–198.

Download references

Acknowledgments

The authors thank the Editor, the Associate Editor and the two referees for their insightful comments and suggestions that lead to substantially improve the paper.

Author information

Authors and Affiliations

Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
Zhiyue Huang & Paul Marriott

Authors

Zhiyue Huang
View author publications
You can also search for this author in PubMed Google Scholar
Paul Marriott
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zhiyue Huang.

Appendix A

1.1 Strictly totally positive kernel functions

The strictly totally positive kernel functions and Chebyshev systems are defined in the following ways in (Karlin and Studden 1966).

Definition 9

A real-valued kernel function $K(s,\theta )$, $(s,\theta ) \in \mathcal{S} \times \varTheta \subseteq {R}^2$, is called strictly totally positive of order $r$, if for each $J=1,2,\ldots ,r$, we have $\det (K(s_i,\theta _j))_{i,j=0}^J > 0$, whenever $s_0 < s_1 < \cdots < s_r$, $\theta _0 < \theta _1 < \cdots < \theta _r$ and $(s_i,\theta _j) \in \mathcal{S} \times \varTheta $, $i,j=0,1,\ldots ,r$.

Consider a kernel function for $(x,y) \in \mathcal{S} \times \mathcal{S}' \subseteq {R}^2$,

$$\begin{aligned} K^*(x,y) {:=} \int _a^b L(x,\theta ) M(y, \theta ) \mathrm{d}\theta , \end{aligned}$$

(15)

where $L(x,\theta ), (x,\theta ) \in \mathcal{S} \times [a,b] \subset {R}^2$ and $M(y,\theta ), (y,\theta )\in \mathcal{S}' \times [a,b] \subset {R}^2$. The following proposition is proved in Karlin and Studden (1966).

Proposition 3

If the kernel function in (15) exists for each $(x,y) \in \mathcal{S} \times \mathcal{S}'$ and $L(x,\theta )$ and $M(x,\theta )$ are strictly totally positive, then $K(x,y)$ is strictly totally positive.

Pinkus (1996) further states that the eigenfuctions from a strictly totally positive kernel function could also form a Chebyshev system.

Proposition 4

Let

$$\begin{aligned} (A'g)(\theta ) = \int _a^b g(\theta ) K'(\theta , \theta ') \mathrm{d}\theta , \end{aligned}$$

be a compact, self-adjoint, positive integral operator in the form of (8). Moreover, the kernel function $K'(\theta , \theta ')$ is strictly totally positive over $[a,b] \times [a,b]$. Then, the integral operator $A'(\cdot )$ has the eigenvalues $\lambda _0 > \lambda _1 > \cdots > 0$ and associated eigenfunctions $\phi _0(\theta ), \phi _1(\theta ), \ldots $, which are continuous over $[a,b]$. For each $r=1,2,\ldots $, the set $\{\phi _i(\theta )\}_{i=0}^r$ forms a Chebyshev system over $[a,b]$.

It is known that a Hilbert–Schmidt operator is compact; see Debnath and Mikusiński (1999). To show $K^*(x,y)$ is Hilbert–Schmidt, one sufficient condition is $\int _a^b \int _\mathcal{S}L^2(x;\theta ) \mathrm{d}x \mathrm{d} \theta < \infty $ and $\int _a^b \int _\mathcal{S'}M^2(y;\theta ) \mathrm{d}y \mathrm{d} \theta < \infty $, because

$$\begin{aligned} \int _\mathcal{S} \int _\mathcal{S'} \left( K^*(x,y) \right) ^2 \mathrm{d}x \mathrm{d} y&\le \int _a^b \int _\mathcal{S}L^2(x;\theta ) \mathrm{d}x \mathrm{d} \theta \times \int _a^b \int _\mathcal{S'}M^2(y;\theta ) \mathrm{d}y \mathrm{d} \theta . \end{aligned}$$

(16)

1.2 Proof of Theorem 1

Firstly, we show that $\{\phi _i(\theta )\}_{i=0}^{\infty }$ is the set of eigenfunctions of the integral operator

$$\begin{aligned} (A'g)(\theta ) = \int _a^b g(\theta ) K'(\theta , \theta ') \mathrm{d}\theta , \end{aligned}$$

where

$$\begin{aligned} K'(\theta , \theta ') = \int _\mathcal{S} \frac{f(x;\theta )f(x;\theta ')}{f_0(x)}\mathrm{d}x. \end{aligned}$$

For each $i$, we have

$$\begin{aligned} \lambda _i \phi _i(\theta )&= \lambda _i \int _\mathcal{S} \frac{\tilde{\gamma }_i(x)}{f_0^{1/2}(x)} f(x;\theta ) \mathrm{d}x \\&= \int _\mathcal{S} \frac{f(x;\theta )}{f_0^{1/2}(x)} \int _\mathcal{S} \tilde{\gamma }_i(y) K(x,y) \mathrm{d}y \mathrm{d}x \\&= \int _a^b \int _\mathcal{S} \tilde{\gamma }_i(y)\frac{f(y;\theta ')}{f_0^{1/2}(y)} \mathrm{d}y \int _\mathcal{S} \frac{f(x;\theta ) f(x;\theta ') }{ f_0(x)} \mathrm{d}x \mathrm{d}\theta ' \\&= \int _a^b \phi _i(\theta ') K'(\theta , \theta ') \mathrm{d}\theta '. \end{aligned}$$

Next, note that the one-parameter exponential family $f(x;\theta )$ is strictly totally positive; see Lidnsay and Roeder (1993). According to Proposition 3, the kernel function $K'(\theta , \theta ')$ is strictly totally positive. Then, it follows from Proposition 4 that the set of eigenfuctions $\{\phi _i(\theta )\}_{i=0}^r$ forms a Chebyshev system over $[a,b]$. $\square $

1.3 Proof of Proposition 1

We want to show that for each boundary vector ${\varvec{m}}^*$ of $\mathrm{conv}(\varGamma _{r+1})$, there exists a supporting hyperplane of $\mathrm{conv}(\varGamma _{r+1})$ at ${\varvec{m}}^*$ which is also a supporting hyperplane of $\mathcal{M}_{r+1}$ at ${\varvec{m}}^*$.

Firstly, the convex hull $\mathrm{conv}(\varGamma _{r+1})$ is the intersection of $\mathcal{M}_{r+1}$ and the hyperplane $\mathcal{H}_1 = \left\{ {\varvec{h}}=(1,h_1,h_2,\ldots ,h_r)^{\mathrm{T}} \in {R}^{r+1} \right\} $. Then, in $\mathcal{H}_1$, there exists a vector $\tilde{\varvec{\beta }}_c = (\tilde{\beta }_1, \tilde{\beta }_2, \ldots , \tilde{\beta }_r)^{\mathrm{T}} \in {R}^r$ such that for each ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$, we have

$$\begin{aligned} \sum _{i = 1}^r m_i \tilde{\beta }_i \ge \sum _{i=1}^r m_i^* \tilde{\beta }_i. \end{aligned}$$

Let $\tilde{\beta }_0 = -\sum _{i=1}^r m_i^* \tilde{\beta }_i$. For each ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$, we have

$$\begin{aligned} \sum _{i=0}^r m_i \tilde{\beta }_i \ge \sum _{i=0}^r m_i^* \tilde{\beta }_i. \end{aligned}$$

Therefore, the vector $\tilde{\varvec{\beta }} = (\tilde{\beta }_0, \tilde{\beta }_c^{\mathrm{T}})^{\mathrm{T}} \in {R}^{r+1}$ determines the hyperplane

$$\begin{aligned} \left\{ {\varvec{h}}\in {R}^{r+1} | ({\varvec{h}} - {\varvec{m}}^*)^{\mathrm{T}} \tilde{\varvec{\beta }} = 0 \right\} \end{aligned}$$

(17)

as a supporting hyperplane of $\mathrm{conv}(\varGamma _{r+1})$ at ${\varvec{m}}^*$.

Note that any vector in $\mathcal{M}_{r+1}$ can be written as $\varDelta {\varvec{m}}$, where $\varDelta \ge 0$ and ${\varvec{m}} \in \mathrm{conv}(\varGamma _{r+1})$. We have the inequality:

$$\begin{aligned} \sum _{i=0}^r (\varDelta m_i - {m}^*_i) \tilde{\beta }_i&= (\varDelta - 1) \tilde{\beta }_0 + \sum _{i=1}^r (\varDelta m_i - {m}^*_i) \tilde{\beta }_i \\&= (1 - \varDelta )\sum _{i=1}^r m_i^* \tilde{\beta }_i + \sum _{i=1}^r (\varDelta m_i - {m}^*_i) \tilde{\beta }_i \\&= \varDelta \sum _{i = 1}^r (m_i - m_i^*) \tilde{\beta }_i \ge 0, \end{aligned}$$

and thus the hyperplane (17) is also a supporting hyperplane of $\mathcal{M}_{r+1}$. $\square $

About this article

Cite this article

Huang, Z., Marriott, P. Parameterizing mixture models with generalized moments. Ann Inst Stat Math 68, 269–297 (2016). https://doi.org/10.1007/s10463-014-0490-9

Download citation

Received: 23 September 2013
Revised: 12 July 2014
Published: 15 October 2014
Issue Date: April 2016
DOI: https://doi.org/10.1007/s10463-014-0490-9

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Parameterizing mixture models with generalized moments

Abstract

Similar content being viewed by others

Mixture models: building a parameter space

On Mean And/or Variance Mixtures of Normal Distributions

Clustering via finite nonparametric ICA mixture models

1 Introduction

Example 1

Example 2

Definition 1

2 Parameterization in moments

Definition 2

Definition 3

2.1 The moments induced by power functions

Definition 4

3 Dimensionality reduction in parameter spaces

3.1 The moments induced by an integral operator

Theorem 1

Proof

Example 1

Example 2

3.2 The quality of the approximation

Definition 5

Example 1

Example 2

4 The geometry of moment spaces

Definition 6

Proposition 1

Proof

4.1 The positive representations

Definition 7

Definition 8

Theorem 2

Proposition 2

Theorem 3

Example 1

4.2 The gradient characterization

Theorem 4

Proof

Theorem 5

Proof

Example 1

5 Real examples

5.1 The Thailand cohort study data

5.2 The sodium–lithium countertransport data

6 Discussion and conclusion

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix A

Appendix A

1.1 Strictly totally positive kernel functions

Definition 9

Proposition 3

Proposition 4

1.2 Proof of Theorem 1

1.3 Proof of Proposition 1

About this article

Cite this article

Share this article

Keywords

Search

Navigation