Abstract
The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML) estimation for GMMs due to the M-step existing in closed form and its desirable numerical properties, such as monotonicity. However, the EM algorithm has been criticized as being slow to converge and thus computationally expensive in some situations. In this article, we introduce the linear regression characterization (LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural parameters, and that a minorization–maximization (MM) algorithm can be constructed, which retains the desirable numerical properties of the EM algorithm, without the use of matrix operations. We prove that the ML estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts. Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs. Using numerical simulations in the R programming environment, we then demonstrate that the MM algorithm can be faster than the EM algorithm in various large data situations, where sample sizes range from the tens of thousands to the hundreds of thousands, for models with up to 16 mixture components fitted to multivariate data with up to 16 variables.
1 Introduction
The Gaussian mixture model (GMM) is a ubiquitous tool in the domain of model-based cluster analysis; for instance, see McLachlan and Basford (1988) and Chapter 3 of McLachlan and Peel (2000) for a statistical perspective, and Chapter 9 of Bishop (2006) and Chapter 10 of Duda et al. (2001) for a machine learning point of view. Furthermore, discussions of applications of GMMs and comparisons of GMMs to other cluster analysis methods can be found in Chapter 8 of Clarke et al. (2009), Section 14.3 of Hastie et al. (2009), and Section 9.3 of Ripley (1996), as well as Hartigan (1985), Jain et al. (1999), and Jain (2010). The GMM framework can be defined as follows.
Let \(Z\in \left\{ 1,\ldots ,g\right\} \) be a categorical random variable such that
for \(i=1,\ldots ,g-1\) and \(\mathbb {P}\left( Z=g\right) =1-\sum _{i=1}^{g-1}\pi _{i}=\pi _{g}\), and let \({\varvec{X}}\in \mathbb {R}^{d}\) be such that \({\varvec{X}}|Z=i\) has density \(\phi _{d}\left( {\varvec{x}}; {\varvec{\mu }}_{i},{\varvec{\varSigma }}_{i}\right) \), where
is a \(d\hbox {-dimensional}\) multivariate Gaussian density function with mean vector \({\varvec{\mu }}\in \mathbb {R}^{d}\) and positive-definite covariance matrix \({\varvec{\varSigma }}\in \mathbb {R}^{d\times d}\). Here, the superscript T represents matrix/vector transposition.
If we suppose that Z is unobserved (i.e. Z is a latent variable), then the density of \({\varvec{X}}\) can be written as
where \({\varvec{\theta }}=({\varvec{\pi }}^{T},{\varvec{\mu }}_{1}^{T}, \ldots ,{\varvec{\mu }}_{g}^{T},\hbox {vech}^{T} ({\varvec{\varSigma }}_{1}),\ldots ,\hbox {vech}^{T} ({\varvec{\varSigma }}_{g}))^{T}\) is the model parameter vector and \({\varvec{\pi }}=(\pi _{1},\ldots ,\pi _{g-1})^{T}\). Densities of form (2) are known as GMMs, and we refer to each \(\phi _{d}({\varvec{x}};\varvec{\mu }_{i},{\varvec{\varSigma }}_{i})\) as component densities.
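As a concrete illustration of density (2), the following sketch evaluates a GMM density as a weighted sum of component Gaussian densities. It is written in Python with NumPy/SciPy (the article's own experiments use R), and the function name and example parameters are ours, chosen purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate the GMM density f(x; theta) = sum_i pi_i * phi_d(x; mu_i, Sigma_i)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

# A two-component bivariate example (illustrative values only)
pis = [0.4, 0.6]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
val = gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas)
```

With a single component of weight one, the mixture reduces to an ordinary multivariate Gaussian density, which provides a quick sanity check of the implementation.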
Let \({\varvec{X}}_{1},\ldots ,{\varvec{X}}_{n}\) be a random sample from a population characterized by density \(f({\varvec{x}}; {\varvec{\theta }}^{0})\), where the parameter vector \({\varvec{\theta }}^{0}\) is unknown. In such cases, the estimation of \({\varvec{\theta }}^{0}\) is required for further inference regarding the data. Given an observed sample \({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n}\), such estimation can be conducted via maximum likelihood (ML) estimation to yield the ML estimator \(\hat{\varvec{\theta }}_{n}\), where \(\hat{\varvec{\theta }}_{n}\) is an appropriate local maximizer of the likelihood function \(\mathcal {L}_{n}({\varvec{\theta }})=\prod _{j=1}^{n}f ({\varvec{x}}_{j};\varvec{\theta })\).
Due to the summation form of (2), the computation of \(\hat{\varvec{\theta }}_{n}\) cannot be conducted in closed form. However, since its introduction by Dempster et al. (1977), the expectation–maximization (EM) algorithm has provided a stable and monotonic iterative method for computing the ML estimator; see Section 3.2 of McLachlan and Peel (2000) for details. Although effective, the EM algorithm for GMMs is not without criticisms; for example, it is known that the convergence of EM algorithms can be very slow and may be computationally expensive in some applications; see Chapters 3 and 4 of McLachlan and Krishnan (2008) for details regarding these issues.
To remedy the aforementioned issues, there have been broad developments in constructing modifications of the GMM EM algorithm, as well as suggestions of alternative methodologies for estimation. For example, Andrews and McNicholas (2013), Botev and Kroese (2004), Ingrassia (1991), and Pernkopf and Bouchaffra (2005) considered the use of metaheuristic algorithms; and Andrews and McNicholas (2013), Celeux and Govaert (1992), Ganesalingam and McLachlan (1980), and McLachlan (1982) considered alternatives to the likelihood criterion [see Chapter 12 of McLachlan and Peel (2000) for a literature review of other developments in this direction]. Each of the aforementioned methods has been shown to improve upon the performance of the EM algorithm via simulation studies, although there are no theoretical results showing that any of the methods uniformly outperforms the EM algorithm. Furthermore, each of the methods requires either randomization, which relinquishes the monotonicity of the EM algorithm, or replacement of the likelihood criterion, which abandons the statistical properties of the ML estimators.
In this article, we devise a new algorithm for the estimation of GMMs that retains the monotonicity properties of the EM algorithm whilst not utilizing matrix operations. Our approach extends from the following characterization of the multivariate Gaussian density function.
Consider the following decomposition of \({\varvec{\mu }}\) and \(\varvec{\varSigma }\), from (1), in which
where \({\varvec{\mu }}_{1:k}\) contains the first k elements of \({\varvec{\mu }}\), \(\varvec{\varSigma }_{1:k,1:k}\) is the submatrix made up of the first k rows and columns of \({\varvec{\varSigma }}\), \(\varvec{\varSigma }_{k,1:l}\) contains the first l elements from row k of \({\varvec{\varSigma }}\), and \(\varSigma _{k,l}\) is the element from the \(k\hbox {th}\) row and \(l\hbox {th}\) column of \({\varvec{\varSigma }}\), for \(k,l=1,\ldots ,d\). Furthermore, let \(\tilde{\varvec{X}}_{k}=(1,{\varvec{X}}_{1:k-1})^{T}\), where \({\varvec{X}}_{1:k}\) contains the first k elements of \({\varvec{X}}\), and let \({\varvec{\beta }}_{k}=(\beta _{k,0},\ldots ,\beta _{k,k-1})^{T}\in \mathbb {R}^{k}\) and \(\sigma _{k}^{2}>0\) be parameters. Using this notation, Ingrassia et al. (2012) showed that for every \({\varvec{\mu }}\) and \(\varvec{\varSigma }\), there exists a parametrization \({\varvec{\beta }}_{d}\), \(\sigma _{d}^{2}\), \({\varvec{\mu }}_{1:d-1}\), and \({\varvec{\varSigma }}_{1:d-1,1:d-1}\) such that the density function
is equal to \(\phi _{d}({\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }})\), for all values of \({\varvec{x}}\). This alternative parametrization allows for the \(d\hbox {-variate}\) Gaussian distribution to be considered in two parts: a linear regression (LR) component and a density estimation component. In Ingrassia et al. (2012, 2014), this parametrization was used within the cluster-weighted modeling framework for clustering data arising from LR processes.
We extend the regression decomposition of Ingrassia et al. (2012) to characterize the multivariate Gaussian distribution entirely in terms of LR components. In doing so, we are able to apply a minorization–maximization (MM) algorithm presented in Becker et al. (1997) for the estimation of LR models via an iterative scheme without matrix operations; see Hunter and Lange (2004) for further details regarding MM algorithms. Furthermore, by leveraging its MM construction, the algorithm that we present is proven to be monotonic in its iterations as well as convergent to a stationary point of the log-likelihood function. We are able to show via simulations that our algorithm compares favorably with the EM algorithm in various practical scenarios, when implemented in the R programming environment (R Core Team 2013).
Aside from the numerical properties that we derive, we also address the statistical properties of our LR characterization (LRC). We are able to establish both the consistency and asymptotic normality of the LRC ML estimators, as well as devise a simple procedure for the satisfactory handling of singularities, which can arise in the ML estimation of GMMs.
The remainder of the article is organized as follows. Firstly, we describe the LRC of the multivariate normal distribution in Sect. 2. In Sect. 3, we devise the MM algorithm for ML estimation as well as establish its numerical properties. In Sect. 4, the statistical properties of the ML estimators and the model are derived, and in Sect. 5, the performance of the MM algorithm is demonstrated via numerical simulations. Lastly, conclusions are drawn in Sect. 6.
2 Linear regression characterization
Using the same notation as in (3) and (4), the LRC of the multivariate Gaussian density is
where \({\varvec{\gamma }}=({\varvec{\beta }}_{1}^{T},\ldots , {\varvec{\beta }}_{d}^{T})^{T}\) and \({\varvec{\sigma }}^{2}=(\sigma _{1}^{2},\ldots ,\sigma _{d}^{2})\). Here, we define \(\tilde{\varvec{x}}_{1}=1\) and \({\varvec{\beta }}_{1}=\beta _{1,0}\). We now show that (5) is a \(d\hbox {-dimensional}\) multivariate Gaussian density and that there is a one-to-one correspondence between the LRC parameters \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) and the natural parameters \({\varvec{\mu }}\) and \({\varvec{\varSigma }}\) of (2). To attain such a result, we require the following lemma.
Lemma 1
Using the same notation as in (3) and (4), if \({\varvec{X}}\) has density function (2), then for each k, \({\varvec{X}}_{1:k}\) and \(X_{k}|{\varvec{X}}_{1:k-1}={\varvec{x}}_{1:k-1}\) have density functions
respectively, where
and
Lemma 1 can be seen as a special case of Theorems 2.4.1 and 2.5.1 from Anderson (2003). We can apply Lemma 1 to derive the following result.
Theorem 1
Every density function of form (2) can be expressed in form (5) via a bijective mapping between the parameters \({\varvec{\mu }}\) and \({\varvec{\varSigma }},\) and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}.\)
The proof of Theorem 1 and other major results can be found in the Appendix. As an example application of Theorem 1, consider that the \(3\hbox {-dimensional}\) Gaussian density \(\phi _{3}\left( {\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}\right) \) is equal to
for all \({\varvec{x}}\in \mathbb {R}^{3}\), where \(\beta _{1,0}=\mu _{1}\), \(\sigma _{1}^{2}=\varSigma _{1,1}\), \(\beta _{2,0}=\mu _{2}-\varSigma _{2,1}\varSigma _{1,1}^{-1}\mu _{1}\), \(\beta _{2,1}=\varSigma _{2,1}\varSigma _{1,1}^{-1}\), \(\sigma _{2}^{2}=\varSigma _{2,2}-\varSigma _{2,1}^{2}\varSigma _{1,1}^{-1}\),
and
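The mapping in this example can be verified numerically. The following Python sketch (the article works in R) implements the conditional-Gaussian formulas of Lemma 1 to map \((\mu, \Sigma)\) to the LRC parameters, and checks that the product of conditional univariate densities reproduces the multivariate Gaussian density; the function names are ours, and the test point and parameters are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def lrc_params(mu, Sigma):
    """Map (mu, Sigma) to LRC parameters (betas, sigma2s) via the
    conditional-Gaussian formulas of Lemma 1."""
    d = len(mu)
    betas, sigma2s = [np.array([mu[0]])], [Sigma[0, 0]]
    for k in range(1, d):
        b = Sigma[:k, k]
        coef = np.linalg.solve(Sigma[:k, :k], b)   # Sigma_{1:k-1,1:k-1}^{-1} Sigma_{1:k-1,k}
        beta0 = mu[k] - coef @ mu[:k]              # intercept beta_{k,0}
        betas.append(np.concatenate(([beta0], coef)))
        sigma2s.append(Sigma[k, k] - b @ coef)     # conditional variance sigma_k^2
    return betas, sigma2s

def lrc_density(x, betas, sigma2s):
    """LRC density (5): product over k of N(x_k; beta_k^T x_tilde_k, sigma_k^2)."""
    val = 1.0
    for k in range(len(x)):
        x_tilde = np.concatenate(([1.0], x[:k]))   # x_tilde_k = (1, x_1, ..., x_{k-1})
        val *= norm.pdf(x[k], loc=betas[k] @ x_tilde, scale=np.sqrt(sigma2s[k]))
    return val

rng = np.random.default_rng(0)
mu = rng.normal(size=3)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)                    # arbitrary positive-definite covariance
betas, sigma2s = lrc_params(mu, Sigma)
x = rng.normal(size=3)
lhs = lrc_density(x, betas, sigma2s)
rhs = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
```

The agreement of `lhs` and `rhs` follows from the chain rule for Gaussian conditionals, which is the content of Theorem 1.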
2.1 Ordering of variables
As noted by a reviewer, the bijective mapping between \({\varvec{\mu }}\) and \({\varvec{\varSigma }}\), and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) only holds for the unpermuted ordering of the elements of \({\varvec{X}}=\left( X_{1},\ldots ,X_{d}\right) ^{T}\). That is, if \({\varvec{\varPi }}\ne {\varvec{I}}\) is a permutation matrix, as defined in Section 8.2 of Seber (2008) [i.e. \({\varvec{\varPi }}{\varvec{X}}\) permutes the ordering of the elements of \({\varvec{X}}\); e.g. there exists a \({\varvec{\varPi }}\) such that \({\varvec{\varPi }}(X_{1},X_{2},X_{3})^{T}=(X_{3},X_{1},X_{2})^{T}\)], then there exist functions \({\varvec{\gamma }}_{\varvec{\varPi }}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{\varvec{\varPi }}^{2} ({\varvec{\mu }},{\varvec{\varSigma }})\) such that
for all \({\varvec{x}}\in \mathbb {R}^{d}\), where it is possible that \({\varvec{\gamma }}\ne {\varvec{\gamma }}_{\varvec{\varPi }}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \) or \({\varvec{\sigma }}^{2}\ne {\varvec{\sigma }}_{\varvec{\varPi }}^{2}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \). Here, \({\varvec{I}}\) is an identity matrix of appropriate dimension. This implies that if we permute the order of the elements in the data vector, then there exists an alternative LRC of any Gaussian density that we wish to represent, which may not be the same as the original LRC.
We note that although the explicit forms of \({\varvec{\gamma }}_{\varvec{\varPi }}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{\varvec{\varPi }}^{2}({\varvec{\mu }},{\varvec{\varSigma }})\) are not provided, they can be constructed in the following way. Firstly, consider that for any permutation matrix \({\varvec{\varPi }}\),
for all \({\varvec{x}}\in \mathbb {R}^{d}\). Secondly, Eqs. (22)–(25) can be used to map the elements of \({\varvec{\varPi }}{\varvec{\mu }}\) and \({\varvec{\varPi }}{\varvec{\varSigma }}{\varvec{\varPi }}^{T}\) to the elements of \({\varvec{\gamma }}_{{\varvec{\varPi }}}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{{\varvec{\varPi }}}^{2}({\varvec{\mu }}, {\varvec{\varSigma }})\). Thus, the LRC parameters of any permutation of the elements of \({\varvec{X}}\) can be obtained via knowledge of the form of the permutation and the parameters of the underlying Gaussian density function of \({\varvec{X}}\).
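This construction can be made concrete with a small numerical sketch (in Python, rather than the article's R; the helper and its name are ours): applying the Lemma 1 mapping to \(({\varvec{\varPi }}{\varvec{\mu }}, {\varvec{\varPi }}{\varvec{\varSigma }}{\varvec{\varPi }}^{T})\) yields the LRC of the permuted vector, whose parameters generally differ from those of the original ordering.

```python
import numpy as np

def lrc_params(mu, Sigma):
    """Lemma 1 mapping from (mu, Sigma) to the LRC parameters."""
    d = len(mu)
    betas, sigma2s = [np.array([mu[0]])], [Sigma[0, 0]]
    for k in range(1, d):
        coef = np.linalg.solve(Sigma[:k, :k], Sigma[:k, k])
        betas.append(np.concatenate(([mu[k] - coef @ mu[:k]], coef)))
        sigma2s.append(Sigma[k, k] - Sigma[:k, k] @ coef)
    return betas, sigma2s

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
P = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])                       # maps (X1, X2, X3) to (X3, X1, X2)
betas, sigma2s = lrc_params(mu, Sigma)          # LRC of the original ordering
betas_P, sigma2s_P = lrc_params(P @ mu, P @ Sigma @ P.T)  # LRC of the permuted ordering
```

Here, for instance, the first regression variance changes from \(\varSigma_{1,1}=4\) to \(\varSigma_{3,3}=2\) under the permutation, even though both parametrizations represent the same Gaussian density.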
2.2 Gaussian mixture model
We now consider the LRC of the GMM density function
where \({\varvec{\psi }}=(\varvec{\pi }^{T},{\varvec{\gamma }}_{1}^{T}, \ldots ,{\varvec{\gamma }}_{g}^{T},{\varvec{\sigma }}_{1}^{2}, \ldots ,{\varvec{\sigma }}_{g}^{2})^{T}\) is the vector of model parameters. Here, \({\varvec{\gamma }}_{i}=({\varvec{\beta }}_{i,1}^{T}, \ldots ,{\varvec{\beta }}_{i,d}^{T})^{T}\), \({\varvec{\sigma }}_{i}^{2}=(\sigma _{i,1}^{2},\ldots , \sigma _{i,d}^{2})\), and \({\varvec{\beta }}_{i,k}=(\beta _{i,k,0},\ldots ,\beta _{i,k,k-1})^{T}\), for each i and k. Furthermore, \(\varvec{\pi }\) is restricted in the same way as in (2). Using Theorem 1, the following result can be shown.
Corollary 1
For each natural parameter vector \(\varvec{\theta },\) there exists a mapping to an LRC parameter vector \({\varvec{\psi }},\) and vice versa, such that \(f({\varvec{x}};\varvec{\theta })=f_{R}({\varvec{x}}; {\varvec{\psi }})\) at every \({\varvec{x}}\in \mathbb {R}^{d}\).
Corollary 1 can be seen as an extension of Proposition 1 from Ingrassia et al. (2012). Unfortunately, unlike Theorem 1, the mapping between \(\varvec{\theta }\) and \({\varvec{\psi }}\) is not bijective, due to the non-identifiability of GMMs; this issue is well documented in Section 3.1 of Titterington et al. (1985). Nevertheless, Corollary 1 allows us to consider the density estimation of data generated from a GMM using an LRC of the GMM instead. We shall show that this representation permits the construction of a matrix-free algorithm for ML estimation.
3 Maximum likelihood estimation
Upon observing data \({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n}\), the likelihood and the log-likelihood that the data arise from an LRC of the GMM are \(\mathcal {L}_{R,n}({\varvec{\psi }})=\prod _{j=1}^{n}f_{R} ({\varvec{x}}_{j};{\varvec{\psi }})\) and
respectively.
As with the natural parametrization of the GMM, we generally assume that the data were generated from a process with density function \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0})\), where \({\varvec{\psi }}^{0}\) is unknown. In such cases, \({\varvec{\psi }}^{0}\) can be estimated via the ML estimator \(\hat{{\varvec{\psi }}}_{n}\), where \(\hat{{\varvec{\psi }}}_{n}\) is an appropriate local maximizer of (7).
Like \(\hat{\varvec{\theta }}_{n}\), \(\hat{{\varvec{\psi }}}_{n}\) cannot be computed in closed form. Thus, we must devise an iterative computation scheme. We now present an MM algorithm for such a purpose.
3.1 Minorization–maximization algorithms
Suppose that we wish to maximize some objective function \(\eta ({\varvec{t}})\), where \({\varvec{t}}\in S\) for some set \(S\subset \mathbb {R}^{r}\). If we cannot obtain the maximizer of \(\eta ({\varvec{t}})\) directly, then we can seek a minorizer of \(\eta ({\varvec{t}})\) over S, instead. A minorizer of \(\eta ({\varvec{t}})\) is a function \(h({\varvec{s}}; {\varvec{t}})\) such that \(\eta ({\varvec{t}})=h({\varvec{t}}; {\varvec{t}})\) and \(\eta ({\varvec{s}})\ge h({\varvec{s}}; {\varvec{t}})\), whenever \({\varvec{s}}\ne {\varvec{t}}\), and \({\varvec{s}}\in S\). Here, \(h({\varvec{s}}; {\varvec{t}})\) is said to minorize \(\eta ({\varvec{t}})\).
Upon finding an appropriate minorizer and denoting \({\varvec{t}}^{(m)}\) as the \(m\hbox {th}\) iteration of the algorithm, an MM algorithm can be defined using the update scheme
By the properties of the minorizer and by the definition of (8), iterative application of the MM update scheme yields the inequalities,
This shows that the sequence of objective function evaluations \(\eta ({\varvec{t}}^{(m)})\) is monotonically increasing in each step. Furthermore, under some regularity conditions, it can also be shown that the sequence of iterates \({\varvec{t}}^{(m)}\) converges to some stationary point \({\varvec{t}}^{*}\) of \(\eta ({\varvec{t}})\).
In this article, we consider minorizers for the functions
where \(S=\{{\varvec{t}}:t_{i}\ge 0,i=1,\ldots ,r\} \), and
where \(a\in \mathbb {R}\), \({\varvec{b}}\in \mathbb {R}^{r}\), and \(S=\mathbb {R}^{r}\). These objective functions [i.e. \(\eta _{1}({\varvec{t}})\) and \(\eta _{2}({\varvec{t}})\)] can be minorized via the functions
and
respectively, where \(\alpha _{i}=(|b_{i}|^{p}+\delta )/\sum _{i^{\prime }=1}^{r} (|b_{i^{\prime }}|^{p}+\delta )\), \(p>0\), and \(\delta >0\) is a small coefficient. The two minorizers were devised in Zhou and Lange (2010) and Becker et al. (1997), respectively; the latter was applied to perform LR without matrix operations. Here, we choose \(p=2\) in \(\alpha _{i}\) for use throughout the article, as per a suggestion from Becker et al. (1997).
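To illustrate the mechanics of the second minorizer, the following Python sketch (the article's implementation is in R) applies the coordinate-separable MM update implied by maximizing the summed surrogate to an ordinary least-squares problem, with \(p=2\); the synthetic data, the explicit update formula, and all names here are our own illustrative choices, not the article's. Each sweep updates every coordinate without any matrix inversion, and the objective increases monotonically.

```python
import numpy as np

# Synthetic LR data (assumed for illustration): responses a_j = b_j^T t_true + noise
rng = np.random.default_rng(1)
n, r = 200, 4
B = rng.normal(size=(n, r))                  # rows are the covariate vectors b_j
t_true = np.array([1.0, -2.0, 0.5, 3.0])
a = B @ t_true + 0.1 * rng.normal(size=n)

p, delta = 2, 1e-4
alpha = np.abs(B) ** p + delta
alpha /= alpha.sum(axis=1, keepdims=True)    # weights alpha_{j,i}; each row sums to 1

def objective(t):
    return -np.sum((a - B @ t) ** 2)         # sum over j of eta_2 with a = a_j, b = b_j

t = np.zeros(r)
prev = objective(t)
for _ in range(500):
    res = a - B @ t                          # current residuals
    # Coordinate-separable MM update (maximizer of the summed minorizer):
    # t_i <- t_i + (sum_j b_{j,i} res_j) / (sum_j b_{j,i}^2 / alpha_{j,i})
    t = t + (B * res[:, None]).sum(axis=0) / (B ** 2 / alpha).sum(axis=0)
    assert objective(t) >= prev - 1e-9       # ascent is monotone, as MM theory guarantees
    prev = objective(t)

# The fixed point of the iteration is the ordinary least-squares solution
ols, *_ = np.linalg.lstsq(B, a, rcond=None)
```

The stationary point of the iteration satisfies the normal equations, so the iterates converge to the least-squares solution while each update touches only one coordinate at a time.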
Let \({\varvec{\psi }}^{(m)}\) be the \(m\hbox {th}\) MM iterate. By setting \(s_{i}=\pi _{i}\lambda ({\varvec{x}}_{j};{\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})\) and \(t_{i}=\pi _{i}^{(m)}\lambda ({\varvec{x}}_{j}; {\varvec{\gamma }}_{i}^{(m)},{\varvec{\sigma }}_{i}^{(m)2})\) in (9), for each i and j, we get the minorizer for (7),
where
and
Here, \({\varvec{x}}_{j}=(x_{j,1},\ldots ,x_{j,d})^{T}\) and \(\tilde{{\varvec{x}}}_{j,k}=(1,x_{j,1},\ldots ,x_{j,k-1})^{T}\) for each j and k, and \(C({\varvec{\psi }}^{(m)})\) is a constant that does not depend on \({\varvec{\psi }}\).
Now, by setting \(a=x_{j,k}\), \({\varvec{b}}=\tilde{{\varvec{x}}}_{j,k}\), \({\varvec{s}}={\varvec{\beta }}_{i,k}\), and \({\varvec{t}}={\varvec{\beta }}_{i,k}^{(m)}\) in (10), we obtain the minorizer for (12),
where \(\alpha _{j,l}=(|x_{j,l}|^{p}+\delta )/\sum _{l^{\prime }=0}^{k-1} (|x_{j,l^{\prime }}|^{p}+\delta )\) and \(x_{j,0}=1\) by definition, for each j and \(l=0,\ldots ,k-1\).
Using (14), we can further minorize (11), and thus (7), by
We now consider the partition of \({\varvec{\psi }}\) into \({\varvec{\psi }}_{1}=(\varvec{\pi }^{T}, {\varvec{\gamma }}_{1}^{T},\ldots ,{\varvec{\gamma }}_{g}^{T})^{T}\) and \({\varvec{\psi }}_{2}=({\varvec{\sigma }}_{1}^{2},\ldots , {\varvec{\sigma }}_{g}^{2})^{T}\), where \({\varvec{\psi }}=({\varvec{\psi }}_{1}^{T},{\varvec{\psi }}_{2}^{T} )^{T}\). By fixing \({\varvec{\psi }}_{2}\) at \({\varvec{\psi }}_{2}^{(m)}\), \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is additively separable in the subsets of \({\varvec{\psi }}_{1}\); furthermore, for each i, the elements of \({\varvec{\gamma }}_{i}\) are additively separable as well. Similarly, by fixing \({\varvec{\psi }}_{1}\) at \({\varvec{\psi }}_{1}^{(m)}\), \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\) is additively separable in the elements of the subsets of \({\varvec{\psi }}_{2}\). This result suggests the following block successive MM update scheme.
Let \({\varvec{\psi }}^{(0)}\) be an initial parameter vector. At the \((m+1)\hbox {th}\) iteration, the algorithm proceeds in two steps: the min (minorization)-step and the max (maximization)-step. In the min-step, we either construct (15) if m is odd, or (11) if m is even; in either case, the min-step requires the computation of \(\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})\), for each i and j.
In the max-step, if m is odd, then we set \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)}\) and solve for the root of
to get the updates
and
for each i, k, and \(l=0,\ldots ,k-1\). Here, \(\varvec{0}\) is a zero matrix/vector of appropriate dimensionality.
Similarly, the max-step for even m proceeds by setting \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)}\) and solving for the root of
to obtain the updates
for each i and k. We now show that together, updates (16)–(18) generate a sequence of monotonically increasing log-likelihood values.
Theorem 2
If m is odd and \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)},\) then updates (16) and (17) result in the inequalities,
If m is even and \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)},\) then update (18) results in the inequalities,
In general, the MM algorithm is iterated until some convergence criterion is met. Usually, this is either the absolute convergence criterion
or the relative convergence criterion
for some small \(\epsilon >0\). In either case, upon convergence, the final iterate of the algorithm is declared the ML estimator \(\hat{{\varvec{\psi }}}_{n}\). Let \({\varvec{\psi }}^{*}\) be a limit point of the MM algorithm, such that \(\hat{{\varvec{\psi }}}_{n}\rightarrow {\varvec{\psi }}^{*}\) as \(\epsilon \rightarrow 0\) for some starting parameter \({\varvec{\psi }}^{(0)}\).
The algorithm that we have devised is an instance of the block successive lower-bound maximization algorithms of Razaviyayn et al. (2013). As such, we can apply Theorem 2 of Razaviyayn et al. (2013) to obtain the following result regarding its limit points.
Theorem 3
If \({\varvec{\psi }}^{*}\) is a limit point of the algorithm conducted via the steps \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)},\) (16), and (17), when m is odd; and \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)}\) and (18), when m is even, for some initial vector \({\varvec{\psi }}^{(0)};\) then \({\varvec{\psi }}^{*}\) is a stationary point of (7).
Theorem 3 shows that given a suitable initial parameter vector \({\varvec{\psi }}^{\left( 0\right) }\), the MM algorithm generates a sequence \({\varvec{\psi }}^{\left( m\right) }\) that converges to a stationary point of (7); this is a good result, considering the multimodality of (7).
3.2 Covariance constraints
We note that Theorem 3 requires that the limit points be finite values. This cannot always be guaranteed since \(\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)})\rightarrow \infty \) if any of the sequences \(\sigma _{i,k}^{(m)2}\rightarrow 0\). This is equivalent to the problem of component covariance matrices \({\varvec{\varSigma }}_{i}\) becoming singular in the natural parametrization [i.e. in \(\mathcal {L}_{n}(\varvec{\theta })\)]. In the natural parametrization, the usual approach is to restrict the component covariance matrices to be positive definite via conditioning on the eigenvalues of the matrices. Such approaches were pioneered in Hathaway (1985); examples of recent developments include Greselin and Ingrassia (2008), Ingrassia (2004), and Ingrassia and Rocci (2007, 2011). We proceed to provide a simple alternative to the aforementioned approaches, based upon the LRC parametrization.
In the LRC, ensuring finite limit points amounts to guaranteeing that for each i and k, \(\sigma _{i,k}^{(m)2}\rightarrow \xi _{i,k}\) for some \(\xi _{i,k}>0\). This can be implemented by adding a small \(\xi >0\) to the right-hand side of update (18) at each iteration. Through doing this, we ensure that each \(\sigma _{i,k}^{*2}\) is positive, while retaining the monotonicity of the likelihood at each update. The following result is then applicable.
Theorem 4
In (5), if \(\sigma _{k}^{2}>0\) for each k, then the corresponding covariance matrix \({\varvec{\varSigma }},\) of the natural parametrization, is positive-definite.
Theorem 4 implies that the covariance matrices in the natural parametrization will always be positive-definite if we apply the described process. As suggested by a reviewer, it is practically important to choose a \(\xi \) that does not impede the estimation of mixture components with small variances. If we assume that the marginal variances of each mixture component do not exceed a proportion \(\varXi ^{-1}\) (\(\varXi >1\)) of the corresponding marginal variances of the overall distribution [i.e. \(\hbox {var}(X_{1}),\ldots ,\hbox {var}(X_{d})\)], then we can choose
where \(\hat{\hbox {var}}(X_{k})\) is an estimate for the marginal variance of the \(k\hbox {th}\) dimension. We use \(\varXi =10^{10}\) for all numerical applications presented in this article.
4 Statistical properties
The consistency and asymptotic normality of ML estimators for GMM under the natural parametrization have been proven in many instances; see for example, Redner and Walker (1984), Hathaway (1985), and Atienza et al. (2007). We now seek the consistency of the ML estimators of the LRC of the GMM. Such a result can be obtained via Theorem 4.1.2 of Amemiya (1985).
Theorem 5
Let \({\varvec{X}}_{1},\ldots ,{\varvec{X}}_{n}\) be independent and identically distributed random samples from a distribution with density \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0}),\) and let \(\varPsi _{n}\) be the set of roots of the equation \(\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}=\varvec{0},\) where \(\varPsi _{n}=\{\varvec{0}\}\) if there are no roots. If \({\varvec{\psi }}^{0}\) is a strict local maximizer of \(\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})],\) then for any \(\epsilon >0,\)
Theorem 5 is an adequate result, considering that the log-likelihoods of GMMs are often multimodal and unbounded, and that the MM algorithm is able to locate local maximizers when started from suitable values. We note that the result similarly holds when a lower bound is enforced for each of the variance limit points \(\sigma _{i,k}^{*2}\), as in Sect. 3.2.
We now seek to establish the asymptotic normality of the ML estimators. Upon making some assumptions (see the proof in the Appendix), we are able to utilize Theorem 4.2.4 of Amemiya (1985) to get the following result.
Theorem 6
Under Assumption B4, the ML estimator \(\hat{{\varvec{\psi }}}_{n}\) (as in Theorem 5) satisfies
Theorems 5 and 6 allow for inferences to be drawn from the ML estimators and their functions. For example, if \({\varvec{x}}\) is an observation that arises from a distribution with density \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0})\), then we can compute the conditional probability of its latent component variable \(Z=i\) given \({\varvec{X}}={\varvec{x}}\), by \(\tau _{i}({\varvec{x}}; {\varvec{\psi }}^{0})\) via an application of Bayes' rule. Furthermore, if \(\hat{{\varvec{\psi }}}_{n}\overset{P}{\rightarrow }{\varvec{\psi }}^{0},\) then by continuous mapping, we have \(\tau _{i}({\varvec{x}};\hat{{\varvec{\psi }}}_{n}) \overset{P}{\rightarrow }\tau _{i}({\varvec{x}}; {\varvec{\psi }}^{0})\), for each i. Thus, the estimated allocation rule that assigns \({\varvec{x}}\) into component \(\hat{z}\in \{1,\ldots ,g\}\),
is asymptotically correct (i.e. it asymptotically assigns \({\varvec{x}}\) to the component which maximizes the a posteriori probability).
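The allocation rule can be sketched as follows in Python (the article's computations are in R); the posterior weights are computed by Bayes' rule in the natural parametrization for simplicity, and the function names and example parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_probs(x, pis, mus, Sigmas):
    """tau_i(x) = pi_i phi_d(x; mu_i, Sigma_i) / sum_i' pi_i' phi_d(x; mu_i', Sigma_i')."""
    w = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                  for pi, mu, S in zip(pis, mus, Sigmas)])
    return w / w.sum()

def allocate(x, pis, mus, Sigmas):
    """Assign x to the component maximizing the a posteriori probability."""
    return int(np.argmax(posterior_probs(x, pis, mus, Sigmas)))

# Two well-separated components; a point near the second component's mean
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
tau = posterior_probs(np.array([3.5, 4.2]), pis, mus, Sigmas)
z_hat = allocate(np.array([3.5, 4.2]), pis, mus, Sigmas)
```

In practice, the unknown parameters are replaced by the ML estimates, and the consistency of the resulting plug-in rule is exactly what the continuous mapping argument above delivers.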
5 Numerical simulations
To assess the performance of the MM algorithm proposed in Sect. 3.1, we conduct a set of three numerical simulation studies: S1, S2, and S3. In S1, we investigate the performance of the MM algorithm against the standard EM algorithm for estimating GMMs in a setting where the simulated data arise from clusters of equal sample sizes. In S2, the setup from S1 is repeated albeit with simulated data that arise from clusters with differing sample sizes. Finally, in S3, we assess whether or not the ordering of the variables (as discussed in Sect. 2.1) influences the performance of the MM algorithm. We present the setups and results of the numerical simulations below.
5.1 Numerical simulation S1
In S1, we simulated \(2^{N-G}\) observations from each of \(2^{G}\) Gaussian distributions of dimensionality \(2^{D}\), where \(D=1,\ldots ,4\), \(G=1,\ldots ,4\), and \(N=15,16,17\). These sample sizes were chosen since performance improvements are most relevant in large data sets. A sample simulated via this design approximately corresponds to a sample of \(2^{N}\) observations from a \(2^{G}\) component GMM, where \(\pi _{1}=\cdots =\pi _{2^{G}}=2^{-G}\).
In each scenario, each of the \(2^{G}\) distribution means is randomly sampled from a Gaussian distribution with mean \(\varvec{0}\) and covariance matrix \(20\times {\varvec{I}}\). The covariance matrices of the Gaussian distributions are each sampled from a Wishart distribution with scale matrix \({\varvec{I}}\) and \(2^{D}+2\) degrees of freedom. An example of the \(D=2\), \(G=3\), and \(N=15\) case is shown in Fig. 1.
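The sampling design of S1 can be sketched as follows in Python (the article's simulations are in R); the function name is ours, and a small N is used here purely to keep the illustration light.

```python
import numpy as np
from scipy.stats import wishart

def simulate_s1(D, G, N, rng):
    """Simulate one S1 trial: 2^(N-G) observations from each of 2^G
    Gaussian clusters of dimensionality 2^D."""
    d, g, n_per = 2 ** D, 2 ** G, 2 ** (N - G)
    data, labels = [], []
    for i in range(g):
        # Cluster mean ~ N(0, 20 I); cluster covariance ~ Wishart(I, 2^D + 2 df)
        mu = rng.multivariate_normal(np.zeros(d), 20 * np.eye(d))
        Sigma = np.atleast_2d(wishart.rvs(df=d + 2, scale=np.eye(d), random_state=rng))
        data.append(rng.multivariate_normal(mu, Sigma, size=n_per))
        labels.append(np.full(n_per, i))
    return np.vstack(data), np.concatenate(labels)

rng = np.random.default_rng(0)
X, z = simulate_s1(D=1, G=2, N=8, rng=rng)   # N reduced from the article's 15-17
```

Each simulated sample of \(2^{N}\) observations can then be passed to the MM and EM algorithms with common starting values, as described next.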
For each D, G, and N, 100 trials are simulated. In each trial, the MM algorithm is used to compute the ML estimates \(\hat{{\varvec{\psi }}}_{n}\) for the LRC parameters. Here, the algorithm is terminated using criterion (21) with \(\epsilon =10^{-5}\), and the computational time is recorded. The traditional EM algorithm (see Section 3.2 of McLachlan and Peel 2000) is then used to compute the ML estimates \(\hat{\varvec{\theta }}_{n}\) for the natural parameters, using the same starting values as for the MM algorithm. The EM algorithm is terminated using the criterion
with \(\epsilon =10^{-5}\), and the computational time is recorded. The \(k\hbox {-means}\) algorithm was used to initialize the parameters, as per Section 2.12 of McLachlan and Peel (2000); see MacQueen (1967) regarding the \(k\hbox {-means}\) algorithm.
The algorithms were applied via implementations in the R programming environment (version 3.0.2) on an Intel Core i7-2600 CPU running at 3.40 GHz with 16 GB internal RAM, and the timing was conducted using the proc.time function from said environment. The computational times, in seconds, for both algorithms were then averaged over the trials, for each scenario, and the results are reported in Table 1. In Fig. 2, we also plot the average ratio of EM to MM algorithm computational times, for each scenario.
Upon inspection, Table 1 suggests that both the MM and the EM algorithms behave as expected, with regard to the increases in computation times with respect to increases in D, G, and N. Furthermore, we notice that in each scenario, the MM algorithm is faster than the EM algorithm on average. In the best case, the MM algorithm is approximately 35 times faster than the EM (i.e. case \(D=1\), \(G=2\), and \(N=15\)), and in the worst case, the MM and EM are approximately at parity (i.e. case \(D=4\), \(G=1\), and \(N=17\)).
In Fig. 2, the performance gain of the MM algorithm over the EM decreases as D, G, and N increase, with D reducing this gain more severely than the other two variables. This pattern may be explained by some of the additional computational overhead of the MM algorithm. For instance, notice that the MM algorithm requires the computation and storage of \(\alpha _{j,l}\) for each j, l, and k. The number of these computations increases quadratically in D, but only linearly in G and N. Due to this effect, we cannot recommend the MM algorithm in all situations. However, it is noticeable that in the low D cases, the MM algorithm appears to be distinctly faster than the EM.
5.2 Numerical simulation S2
In S2, we repeat the simulation design of S1, except that instead of simulating \(2^{G}\) equally sized samples, we simulate \(2^{G-1}\) samples of size \((1/2)\times 2^{N-G}\) and \(2^{G-1}\) samples of size \((3/2)\times 2^{N-G}\). A sample simulated via this design approximately corresponds to a sample of \(2^{N}\) observations from a \(2^{G}\) component GMM, where \(\pi _{1}=\cdots =\pi _{2^{G-1}}=1/2^{G+1}\) and \(\pi _{2^{G-1}+1}=\cdots =\pi _{2^{G}}=3/2^{G+1}\). Using the same comparison method and termination criterion as applied in S1, we obtain the results presented in Table 2. These results are further visualized in Fig. 3.
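For concreteness, the unbalanced component sizes of this design can be sketched as follows; `s2_component_sizes` is our illustrative name, not a function from the paper's implementation.

```python
def s2_component_sizes(G, N):
    """Per-component sample sizes for design S2: 2**(G-1) components of
    size (1/2)*2**(N-G) and 2**(G-1) components of size (3/2)*2**(N-G).
    The sizes sum to 2**N, matching mixing proportions 1/2**(G+1) and
    3/2**(G+1), respectively."""
    small = 2 ** (N - G) // 2
    large = 3 * 2 ** (N - G) // 2
    return [small] * 2 ** (G - 1) + [large] * 2 ** (G - 1)
```

For example, with \(G=2\) and \(N=15\), the four components contain 4096, 4096, 12288, and 12288 observations, for \(2^{15}=32{,}768\) in total.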
From Table 2, we notice that the average computational times of both the MM and EM algorithms are greater than the times in each of the respective scenarios of S1. This indicates that estimating GMMs with differing cluster sizes may, on average, require more iterations to converge under either algorithm. As in S1, the MM algorithm is again faster than the EM algorithm in all tested scenarios. However, we do note that the difference between the two algorithms is smaller in S2. For example, in the best case, the MM algorithm is only 25.6 times faster than the EM algorithm (i.e. the case \(D=1\), \(G=2\), and \(N=15\)).
Upon inspection of Fig. 3, we again see that the performance advantage of the MM algorithm over the EM algorithm decreases as D, G, and N increase, again with increases in D eroding this gain more severely than the other two variables. Thus, although the computational gains decrease as the tested variables increase, there is a good case for using the MM algorithm instead of the EM algorithm when D is relatively small.
5.3 Numerical simulation S3
Following the design of S1, we simulate 100 samples from the \(D=2\), \(G=2\), and \(N=15,16,17\) cases. For each sample of each of the three cases, we perform ML estimation using the MM algorithm on all 24 possible permutations of the four variables. The computation time and the log-likelihood value of each permutation are then recorded, and a mean, range, and relative range (the range as a ratio to the absolute value of the mean) are then computed over the 24 permutations, for both the computation time and the log-likelihood value. The average mean values, ranges, and relative ranges over the 100 samples for all three scenarios are presented in Table 3.
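The per-sample summaries just described reduce to elementary statistics over the 24 orderings. A minimal sketch (with a hypothetical `time_fit` routine shown only in a comment) might look like:

```python
from itertools import permutations

def permutation_summary(values):
    """Mean, range, and relative range (range / |mean|) of a list of
    per-permutation measurements (computation times or log-likelihoods)."""
    mean = sum(values) / len(values)
    rng = max(values) - min(values)
    return mean, rng, rng / abs(mean)

# With D = 2 there are four variables, hence 4! = 24 orderings, e.g.:
# times = [time_fit(X[:, list(p)]) for p in permutations(range(4))]
```

The same summary is applied once to the vector of times and once to the vector of converged log-likelihood values.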
Upon inspection of Table 3, we notice that the range of computation times across the three scenarios can be 40 to 50 % of the average computational time. Thus, selecting an advantageous permutation of the variables can lead to significant improvements in algorithm performance. Unfortunately, the most advantageous permutation was not predictable in our simulations, and even in the \(D=2\) case, with four variables, the number of permutations (24) is already substantial. Fortunately, the range of log-likelihood values is only two to three percent of the mean log-likelihood value in each scenario. As such, there appears to be little variation in the optimal outcome, once the algorithm has converged. This implies that regardless of performance, the algorithm converges as expected according to Theorem 3.
We finally note that the results from all three numerical simulation studies depend on the specific performance of the subroutines, algorithms, and hardware used. Thus, the results may vary when conducting performance tests under different settings. As such, we believe that this simulation study serves to demonstrate the potential computational performance in R, rather than the performance in all settings.
6 Conclusions
In this article, we introduced the LRC of the GMM, and showed that there is a mapping between the LRC parameters and the natural parameters of a GMM. Using the LRC, we devised an MM algorithm for ML estimation that does not depend on matrix operations. We then proved that the MM algorithm monotonically increases the log-likelihood, and that the sequence of estimators obtained from the algorithm converges to a stationary point of the log-likelihood function, under regularity conditions. Through simulations, we demonstrated that the MM algorithm for computing the LRC parameter estimates was faster than the traditional EM algorithm for estimating GMMs in some large data situations, when both algorithms are implemented in the R programming environment. We also showed that although the ordering of the variables may have a significant effect on the computational times of the MM algorithm, it appears to have little effect on the ability of the algorithm to converge to an appropriate limit point.
We also proved that the ML estimators of the LRC parameters, like those of the natural parameters, are consistent and asymptotically normal. This allows for asymptotically valid statistical inference, such as when using the LRC of the GMM for clustering data. Furthermore, we showed that the LRC allows for a simple method of handling singularities in the ML estimation of GMM parameters.
To the best of our knowledge, we are the first to apply the LRC for constructing a matrix operation-free algorithm for estimating GMMs. In the future, we hope to extend our matrix operation-free approach to the ML estimation of mixtures of \(t\hbox {-distributions}\), as well as skew variants of the GMM.
References
Amemiya T (1985) Advanced econometrics. Harvard University Press, Cambridge
Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley, New York
Andrews JL, McNicholas PD (2013) Using evolutionary algorithms for model-based clustering. Pattern Recognit Lett 34:987–992
Atienza N, Garcia-Heras J, Munoz-Pichardo JM, Villa R (2007) On the consistency of MLE in finite mixture models of exponential families. J Stat Plan Inference 137:496–505
Becker MP, Yang I, Lange K (1997) EM algorithms without missing data. Stat Methods Med Res 6:38–54
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Botev Z, Kroese DP (2004) Global likelihood optimization via the cross-entropy method with an application to mixture models. In: Proceedings of the 36th conference on winter simulation
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332
Clarke B, Fokoue E, Zhang HH (2009) Principles and theory for data mining and machine learning. Springer, New York
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York
Ganesalingam S, McLachlan GJ (1980) A comparison of the mixture and classification approaches to cluster analysis. Commun Stat Theory Methods 9:923–933
Greselin F, Ingrassia S (2008) A note on constrained EM algorithms for mixtures of elliptical distributions. In: Advances in data analysis, data handling and business intelligence: proceedings of the 32nd annual conference of the German Classification Society, vol 53
Hartigan JA (1985) Statistical theory in clustering. J Classif 2:63–76
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York
Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800
Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58:30–37
Ingrassia S (1991) Mixture decomposition via the simulated annealing algorithm. Appl Stoch Models Data Anal 7:317–325
Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13:151–166
Ingrassia S, Rocci R (2007) Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput Stat Data Anal 51:5339–5351
Ingrassia S, Rocci R (2011) Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Comput Stat Data Anal 55:1714–1725
Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via a cluster-weighted approach with elliptical distributions. J Classif 29:363–401
Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31:651–666
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323
Jennrich RI (1969) Asymptotic properties of non-linear least squares estimators. Ann Math Stat 40:633–643
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, pp 281–297
McLachlan GJ (1982) The classification and mixture maximum likelihood approaches to cluster analysis. In: Krishnaiah PR, Kanal L (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam
McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions. Wiley, New York
Pernkopf F, Bouchaffra D (2005) Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 27:1344–1348
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Razaviyayn M, Hong M, Luo ZQ (2013) A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J Optim 23:1126–1153
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Seber GAF (2008) A matrix handbook for statisticians. Wiley, New York
Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York
Zhou H, Lange K (2010) MM algorithms for some discrete multivariate distributions. J Comput Graph Stat 19:645–665
Appendix
1.1 Proof of Theorem 1
We shall show the result by construction. Firstly, set
followed by
and
for each \(k=2,\ldots ,d\), in order, to get
and \(\sigma _{k}^{2}=\varSigma _{k|1:k-1}\).
Now, by Lemma 1, and by definition of conditional densities,
for all \({\varvec{x}}\in \mathbb {R}^{d}\), which implies \(\lambda ({\varvec{x}}; {\varvec{\gamma }}, {\varvec{\sigma }}^{2})=\phi _{d}({\varvec{x}}; {\varvec{\mu }}, {\varvec{\varSigma }})\) by application of the mappings (22)–(25). Note that \({\varvec{\mu }}\) and \(\hbox {vech}({\varvec{\varSigma }})\), and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) have equal numbers of elements, and (22)–(25) are unique for each k. Thus, there is an injective mapping between the LRC and the natural parameters. The inverse mapping can also be constructed by setting
followed by
and
for each \(k=2,\ldots ,d\), in order. The mappings (26)–(29) are also unique for each k, and thus constitute a surjective mapping.
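Since the explicit mappings (22)–(29) are not reproduced in this excerpt, the following NumPy sketch illustrates the correspondence the proof constructs: the LRC parameters are the intercepts, coefficients, and conditional variances \(\varSigma _{k|1:k-1}\) of the sequential regressions of \(x_{k}\) on \(x_{1},\ldots ,x_{k-1}\), and the inverse map recovers \(({\varvec{\mu }},{\varvec{\varSigma }})\). The function names are ours, and the code is a sketch of the idea rather than the paper's implementation.

```python
import numpy as np

def natural_to_lrc(mu, Sigma):
    """Sequential-regression (LRC-style) parameters from (mu, Sigma):
    for each k, regress x_k on x_1,...,x_{k-1}; sigma2[k] is the
    conditional variance Sigma_{k|1:k-1}."""
    d = len(mu)
    B = np.zeros((d, d))      # strictly lower-triangular coefficients
    b0 = np.zeros(d)          # intercepts
    sigma2 = np.zeros(d)      # conditional variances
    for k in range(d):
        S11, s12 = Sigma[:k, :k], Sigma[:k, k]
        coef = np.linalg.solve(S11, s12) if k else np.zeros(0)
        B[k, :k] = coef
        b0[k] = mu[k] - coef @ mu[:k]
        sigma2[k] = Sigma[k, k] - s12 @ coef
    return b0, B, sigma2

def lrc_to_natural(b0, B, sigma2):
    """Inverse mapping: x = b0 + B x + e with e ~ N(0, diag(sigma2)),
    so mu = (I - B)^{-1} b0 and Sigma = (I - B)^{-1} D (I - B)^{-T}."""
    d = len(b0)
    T = np.linalg.inv(np.eye(d) - B)
    return T @ b0, T @ np.diag(sigma2) @ T.T
```

The round trip natural \(\rightarrow \) LRC \(\rightarrow \) natural returns the original \(({\varvec{\mu }},{\varvec{\varSigma }})\), reflecting the bijectivity established above.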
1.2 Proof of Theorem 2
The first and last inequalities of (19) and (20) are due to the definition of minorization [i.e. (11) and (14) are of forms (9) and (10), respectively]. The middle inequality of (19) is due to the concavity of \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\). This can be shown by firstly noting that
is concave in \(\log (\pi _{i})\) since \(1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]\) is concave and \(\log \) is an increasing concave function. Secondly, note that
is negative, and thus \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is concave with respect to each \(\beta _{i,k,l}\) for each i, k and \(l=0,\ldots ,k-1\). Thus, \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is the additive composition of concave functions and is therefore concave with respect to a bijection of \({\varvec{\psi }}_{1}\). Furthermore, the system of equations
for \(i=1,\ldots ,g-1\), has a unique root that is equivalent to update (16), which always satisfies the positivity restrictions on each \(\pi _{i}\).
The middle inequality of (20) is due to the concavity of \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\). This can be shown by noting that
is concave in \(\log \sigma _{i,k}^{2}\) for each i and k, since the inverse of \(\exp (x)\) is convex. Thus, \(Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})\) is concave with respect to a bijection of \({\varvec{\psi }}_{2}.\) Furthermore, the system of equations
has a unique root that is equivalent to update (18).
1.3 Proof of Theorem 3
This result follows from part (a) of Theorem 2 from Razaviyayn et al. (2013), which assumes that \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) and \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\) both satisfy the definition of a minorizer, and are quasi-concave and have unique critical points, with respect to the parameters \({\varvec{\psi }}_{1}\) and \({\varvec{\psi }}_{2}\), respectively.
Firstly, the definition of a minorizer is satisfied via construction [i.e. (11) and (14) are of forms (9) and (10), respectively]. Secondly, from the proof of Theorem 2, both functions are concave with respect to some bijective mappings, and are therefore quasi-concave under said mappings [see Section 3.4 of Boyd and Vandenberghe (2004) regarding quasi-concavity]. Finally, since both functions are concave with respect to some bijective mapping, the critical points obtained must be unique.
1.4 Proof of Theorem 4
We show this result via induction. Firstly, using (26), we see that \(\sigma _{1}^{2}=\det (\varSigma _{1,1})\) is the first leading principal minor of \({\varvec{\varSigma }}\), and it is positive. Now, by definition of (23), \(\sigma _{2}^{2}\) is the Schur complement of \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) in \({\varvec{\varSigma }}_{1:k,1:k}\), for \(k=2\), where
Since \(\sigma _{2}^{2}\) is positive and \(\varSigma _{1,1}\) is positive definite, we have the result that
via the partitioning of the determinant. Thus, \({\varvec{\varSigma }}_{1:2,1:2}\) is also positive definite because both the first and second leading principal minors are positive.
Now, for each \(k=3,\ldots ,d\), we assume that \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) is positive definite. Since \(\sigma _{k}^{2}>0\) is the Schur complement of \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) in the partitioning (30), we have the result that
Thus, the \(k\hbox {th}\) leading principal minor is positive, for all k. The result follows by the property of positive-definite matrices; see Chapters 10 and 14 of Seber (2008) for all relevant matrix results.
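The determinant recursion underlying this induction, \(\det ({\varvec{\varSigma }}_{1:k,1:k})=\det ({\varvec{\varSigma }}_{1:k-1,1:k-1})\,\sigma _{k}^{2}\), can also be checked numerically; the function name and code below are ours, a sketch rather than part of the paper.

```python
import numpy as np

def leading_minors_from_schur(sigma2):
    """Leading principal minors of Sigma, built from the conditional
    variances sigma2[k] = Sigma_{k|1:k-1} via the recursion
    det(Sigma_{1:k,1:k}) = det(Sigma_{1:k-1,1:k-1}) * sigma2[k].
    Every minor is positive whenever every sigma2[k] > 0, which is
    exactly the positive-definiteness criterion used in the proof."""
    minors, det = [], 1.0
    for s in sigma2:
        det *= s
        minors.append(det)
    return minors
```

Comparing these products against `np.linalg.det` of the leading blocks of a positive definite \({\varvec{\varSigma }}\) confirms the identity.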
1.5 Proof of Theorem 5
Theorem 5 can be established from Theorem 4.1.2 of Amemiya (1985), which requires the validation of the assumptions,
-
A1
The parameter space \(\varPsi \) is an open subset of some Euclidean space.
-
A2
The log-likelihood \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is a measurable function for all \({\varvec{\psi }}\in \varPsi \), and \(\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/\partial {\varvec{\psi }}\) exists and is continuous in an open neighborhood \(N_{1}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\).
-
A3
There exists an open neighborhood \(N_{2}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\), where \(n^{-1}\log \mathcal {L}_{R,n}({\varvec{\psi }})\) converges to \(\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})]\) in probability uniformly in \({\varvec{\psi }}\) in any compact subset of \(N_{2}({\varvec{\psi }}^{0})\).
Assumptions A1 and A2 are fulfilled by noting that the parameter space \(\varPsi =(0,1)^{g-1}\times \mathbb {R}^{g(d^{2}+d)/2+gd}\) is an open subset of \(\mathbb {R}^{(g-1)+g(d^{2}+d)/2+gd}\), and that \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is smooth with respect to the parameters \({\varvec{\psi }}\). Using Theorem 2 of Jennrich (1969), we can show that A3 holds by verifying that
where \(\bar{N}\) is a compact subset of \(N_{2}({\varvec{\psi }}^{0})\). Since \(f_{R}({\varvec{X}}; {\varvec{\psi }})\) is smooth, this is equivalent to showing that \(\mathbb {E}|f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty \), for any fixed \({\varvec{\psi }}\in \bar{N}\). This is achieved by noting that
The inequality on line 3 of (32) is due to Lemma 1 of Atienza et al. (2007). Considering that \(\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})\) is a polynomial function of Gaussian random variables, we have \(\mathbb {E}|\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})|<\infty \) for each i and k. The result then follows.
1.6 Proof of Theorem 6
Theorem 6 can be established from Theorem 4.2.4 of Amemiya (1985), which requires the validation of the assumptions,
-
B1
The Hessian \(\partial ^{2}(\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}\) exists and is continuous in an open neighborhood of \({\varvec{\psi }}^{0}\).
-
B2
The equations
$$\begin{aligned} \int \frac{\partial \log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}}\hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$and
$$\begin{aligned} \int \frac{\partial ^{2}\log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}} \hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$hold, for any \({\varvec{\psi }}\in \varPsi \).
-
B3
The averaged Hessian satisfies
$$\begin{aligned} \frac{1}{n}\frac{\partial ^{2}\log \mathcal {L}_{R,n} ({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}}\overset{P}{\rightarrow }\mathbb {E} \left[ \frac{\partial ^{2}\log f_{R}({\varvec{X}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\right] , \end{aligned}$$uniformly in \({\varvec{\psi }}\), in all compact subsets of an open neighborhood of \({\varvec{\psi }}^{0}\).
-
B4
The Fisher information
$$\begin{aligned} -\mathbb {E}\left[ \frac{\partial ^{2}\log f_{R} ({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\bigg |_{{\varvec{\psi }}= {\varvec{\psi }}^{0}}\right] ^{-1}, \end{aligned}$$is positive-definite.
Assumption B1 is validated via the smoothness of \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\), and it is mechanical to check the validity of B2. Assumption B3 can be shown via Theorem 2 of Jennrich (1969). Unlike the others, B4 must be taken as given.
Nguyen, H.D., McLachlan, G.J. Maximum likelihood estimation of Gaussian mixture models without matrix operations. Adv Data Anal Classif 9, 371–394 (2015). https://doi.org/10.1007/s11634-015-0209-7