1 Introduction

The Gaussian mixture model (GMM) is a ubiquitous tool in the domain of model-based cluster analysis; for instance, see McLachlan and Basford (1988) and Chapter 3 of McLachlan and Peel (2000) for a statistical perspective, and Chapter 9 of Bishop (2006) and Chapter 10 of Duda et al. (2001) for a machine learning point of view. Furthermore, discussions of applications of GMMs and comparisons of GMMs to other cluster analysis methods can be found in Chapter 8 of Clarke et al. (2009), Section 14.3 of Hastie et al. (2009), and Section 9.3 of Ripley (1996), as well as Hartigan (1985), Jain et al. (1999), and Jain (2010). The GMM framework can be defined as follows.

Let \(Z\in \left\{ 1,\ldots ,g\right\} \) be a categorical random variable such that

$$\begin{aligned} \mathbb {P}\left( Z=i\right) =\pi _{i}>0 \end{aligned}$$

for \(i=1,\ldots ,g-1\) and \(\mathbb {P}\left( Z=g\right) =1-\sum _{i=1}^{g-1}\pi _{i}=\pi _{g}\), and let \({\varvec{X}}\in \mathbb {R}^{d}\) be such that \({\varvec{X}}|Z=i\) has density \(\phi _{d}\left( {\varvec{x}}; {\varvec{\mu }}_{i},{\varvec{\varSigma }}_{i}\right) \), where

$$\begin{aligned} \phi _{d}\left( {\varvec{x}}; {\varvec{ \mu }}, \varvec{\varSigma }\right) =\left( 2\pi \right) ^{-\frac{d}{2}} \det \left( {\varvec{\varSigma }}\right) ^{-\frac{1}{2}}\exp \left[ -\frac{1}{2}\left( {\varvec{x}}-\varvec{\mu } \right) ^{T}{\varvec{\varSigma }}^{-1}\left( {\varvec{x}} -\varvec{\mu }\right) \right] \end{aligned}$$
(1)

is a \(d\hbox {-dimensional}\) multivariate Gaussian density function with mean vector \({\varvec{\mu }}\in \mathbb {R}^{d}\) and positive-definite covariance matrix \({\varvec{\varSigma }}\in \mathbb {R}^{d\times d}\). Here, the superscript T represents matrix/vector transposition.

If we suppose that Z is unobserved (i.e. Z is a latent variable), then the density of \({\varvec{X}}\) can be written as

$$\begin{aligned} f({\varvec{x}}; {\varvec{\theta }})= \sum _{i=1}^{g}\pi _{i}\phi _{d}\left( {\varvec{x}}; \varvec{\mu }_{i},{\varvec{\varSigma }}_{i}\right) \!, \end{aligned}$$
(2)

where \({\varvec{\theta }}=({\varvec{\pi }}^{T},{\varvec{\mu }}_{1}^{T}, \ldots ,{\varvec{\mu }}_{g}^{T},\hbox {vech}^{T} ({\varvec{\varSigma }}_{1}),\ldots ,\hbox {vech}^{T} ({\varvec{\varSigma }}_{g}))^{T}\) is the model parameter vector and \({\varvec{\pi }}=(\pi _{1},\ldots ,\pi _{g-1})^{T}\). Densities of form (2) are known as GMMs, and we refer to each \(\phi _{d}({\varvec{x}};\varvec{\mu }_{i},{\varvec{\varSigma }}_{i})\) as a component density.
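
As a concrete illustration of (1) and (2), the following base R sketch evaluates a GMM density at a point. The function names are ours and purely illustrative; note that the covariance inversion appearing here is exactly the kind of matrix operation that the algorithm developed later avoids.

```r
# Illustrative base R evaluation of the GMM density (2); names are ours.
phi_d <- function(x, mu, Sigma) {
  d <- length(mu)
  quad <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))  # Mahalanobis quadratic form
  (2 * pi)^(-d / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2)
}

gmm_density <- function(x, pis, mus, Sigmas) {
  # pis: g mixing proportions summing to one; mus, Sigmas: lists of length g
  sum(vapply(seq_along(pis),
             function(i) pis[i] * phi_d(x, mus[[i]], Sigmas[[i]]),
             numeric(1)))
}

# Example: a two-component bivariate GMM evaluated at the origin
gmm_density(c(0, 0), pis = c(0.4, 0.6),
            mus = list(c(0, 0), c(3, 3)),
            Sigmas = list(diag(2), 2 * diag(2)))
```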

Let \({\varvec{X}}_{1},\ldots ,{\varvec{X}}_{n}\) be a random sample from a population characterized by density \(f({\varvec{x}}; {\varvec{\theta }}^{0})\), where the parameter vector \({\varvec{\theta }}^{0}\) is unknown. In such cases, the estimation of \({\varvec{\theta }}^{0}\) is required for further inference regarding the data. Given an observed sample \({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n}\), such estimation can be conducted via maximum likelihood (ML) estimation to yield the ML estimator \(\hat{\varvec{\theta }}_{n}\), where \(\hat{\varvec{\theta }}_{n}\) is an appropriate local maximizer of the likelihood function \(\mathcal {L}_{n}({\varvec{\theta }})=\prod _{j=1}^{n}f ({\varvec{x}}_{j};\varvec{\theta })\).

Due to the summation form of (2), the computation of \(\hat{\varvec{\theta }}_{n}\) cannot be conducted in closed form. However, since its introduction by Dempster et al. (1977), the expectation–maximization (EM) algorithm has provided a stable and monotonic iterative method for computing the ML estimator; see Section 3.2 of McLachlan and Peel (2000) for details. Although effective, the EM algorithm for GMMs is not without criticism; for example, it is known that the convergence of EM algorithms can be very slow, and that the algorithm may be computationally expensive in some applications; see Chapters 3 and 4 of McLachlan and Krishnan (2008) for details regarding these issues.

To remedy the aforementioned issues, there have been broad developments in constructing modifications of the GMM EM algorithm, as well as suggestions of alternative methodologies for estimation. For example, Andrews and McNicholas (2013), Botev and Kroese (2004), Ingrassia (1991), and Pernkopf and Bouchaffra (2005) considered the use of metaheuristic algorithms; and Andrews and McNicholas (2013), Celeux and Govaert (1992), Ganesalingam and McLachlan (1980), and McLachlan (1982) considered alternatives to the likelihood criterion [see Chapter 12 of McLachlan and Peel (2000) for a literature review of other developments in this direction]. Each of the aforementioned methods has been shown to improve upon the performance of the EM algorithm via simulation studies, although there are no theoretical results to show that any of the methods uniformly outperforms the EM algorithm. Furthermore, each of the methods requires either randomization, which relinquishes the monotonicity of the EM algorithm, or replacement of the likelihood criterion, which abandons the statistical properties of the ML estimators.

In this article, we devise a new algorithm for the estimation of GMMs that retains the monotonicity properties of the EM algorithm whilst not utilizing matrix operations. Our approach builds upon the following characterization of the multivariate Gaussian density function.

Consider the following decomposition of \({\varvec{\mu }}\) and \(\varvec{\varSigma }\), from (1), in which

$$\begin{aligned} {\varvec{\mu }}=\left[ \begin{array}{c} {\varvec{\mu }}_{1:d-1}\\ \mu _{d} \end{array}\right] \quad \hbox {and}\quad {\varvec{\varSigma }}=\left[ \begin{array}{c@{\quad }c} {\varvec{\varSigma }}_{1:d-1,1:d-1} &{} {\varvec{\varSigma }}_{d,1:d-1}^{T}\\ {\varvec{\varSigma }}_{d,1:d-1} &{} \varSigma _{d,d} \end{array}\right] , \end{aligned}$$
(3)

where \({\varvec{\mu }}_{1:k}\) contains the first k elements of \({\varvec{\mu }}\), \(\varvec{\varSigma }_{1:k,1:k}\) is the submatrix made up of the first k rows and columns of \({\varvec{\varSigma }}\), \(\varvec{\varSigma }_{k,1:l}\) contains the first l elements from row k of \({\varvec{\varSigma }}\), and \(\varSigma _{k,l}\) is the element from the \(k\hbox {th}\) row and \(l\hbox {th}\) column of \({\varvec{\varSigma }}\), for \(k,l=1,\ldots ,d\). Furthermore, let \(\tilde{\varvec{X}}_{k}=(1,{\varvec{X}}_{1:k-1})^{T}\), where \({\varvec{X}}_{1:k}\) contains the first k elements of \({\varvec{X}}\), and let \({\varvec{\beta }}_{k}=(\beta _{k,0},\ldots ,\beta _{k,k-1})^{T}\in \mathbb {R}^{k}\) and \(\sigma _{k}^{2}>0\) be parameters. Using this notation, Ingrassia et al. (2012) showed that for every \({\varvec{\mu }}\) and \(\varvec{\varSigma }\), there exists a parametrization \({\varvec{\beta }}_{d}\), \(\sigma _{d}^{2}\), \({\varvec{\mu }}_{1:d-1}\), and \({\varvec{\varSigma }}_{1:d-1,1:d-1}\) such that the density function

$$\begin{aligned} \phi _{d-1}\left( {\varvec{x}}_{1:d-1}; {\varvec{\mu }}_{1:d-1},{\varvec{\varSigma }}_{1:d-1,1:d-1}\right) \phi _{1}\left( x_{d}; {\varvec{\beta }}_{d}^{T}\tilde{\varvec{x}}_{d},\sigma _{d}^{2}\right) \end{aligned}$$
(4)

is equal to \(\phi _{d}({\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }})\), for all values of \({\varvec{x}}\). This alternative parametrization allows for the \(d\hbox {-variate}\) Gaussian distribution to be considered in two parts: a linear regression (LR) and density estimation component. In Ingrassia et al. (2012, 2014), this parametrization was used within the cluster-weighted modeling framework for clustering data arising from LR processes.

We extend the regression decomposition of Ingrassia et al. (2012) to characterize the multivariate Gaussian distribution entirely in terms of LR components. In doing so, we are able to apply a minorization–maximization (MM) algorithm of Becker et al. (1997), which estimates LR models iteratively without matrix operations; see Hunter and Lange (2004) for further details regarding MM algorithms. Furthermore, by leveraging its MM construction, the algorithm that we present is proven to be monotonic in its iterations as well as convergent to a stationary point of the log-likelihood function. We show via simulations that our algorithm compares favorably with the EM algorithm in various practical scenarios, when implemented in the R programming environment (R Core Team 2013).

Aside from the numerical properties that we derive, we also address the statistical properties of our LR characterization (LRC). We are able to establish both the consistency and asymptotic normality of the LRC ML estimators, as well as devise a simple procedure for the satisfactory handling of singularities, which can arise in the ML estimation of GMMs.

The remainder of the article is organized as follows. Firstly, we describe the LRC of the multivariate normal distribution in Sect. 2. In Sect. 3, we devise the MM algorithm for ML estimation as well as establish its numerical properties. In Sect. 4, the statistical properties of the ML estimators and the model are derived, and in Sect. 5, the performance of the MM algorithm is demonstrated via numerical simulations. Lastly, conclusions are drawn in Sect. 6.

2 Linear regression characterization

Using the same notation as in (3) and (4), the LRC of the multivariate Gaussian density is

$$\begin{aligned} \lambda ({\varvec{x}}; {\varvec{\gamma }},{\varvec{\sigma }^{2}}) =\prod _{k=1}^{d}\phi _{1}(x_{k};{\varvec{\beta }}_{k}^{T}\tilde{\varvec{x}}_{k},\sigma _{k}^{2}), \end{aligned}$$
(5)

where \({\varvec{\gamma }}=({\varvec{\beta }}_{1}^{T},\ldots , {\varvec{\beta }}_{d}^{T})^{T}\) and \({\varvec{\sigma }}^{2}=(\sigma _{1}^{2},\ldots ,\sigma _{d}^{2})\). Here, we define \(\tilde{\varvec{x}}_{1}=1\) and \({\varvec{\beta }}_{1}=\beta _{1,0}\). We now show that (5) is a \(d\hbox {-dimensional}\) multivariate Gaussian density and that there is a one-to-one correspondence between the LRC parameters \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) and the natural parameters \({\varvec{\mu }}\) and \({\varvec{\varSigma }}\) of (1). To attain such a result, we require the following lemma.

Lemma 1

Using the same notation as in (3) and (4), if \({\varvec{X}}\) has density function (1), then for each k, \({\varvec{X}}_{1:k}\) and \(X_{k}|{\varvec{X}}_{1:k-1}={\varvec{x}}_{1:k-1}\) have density functions

$$\begin{aligned} \phi _{k}({\varvec{x}}_{1:k};{\varvec{\mu }}_{1:k}, {\varvec{\varSigma }}_{1:k,1:k})\quad \hbox {and}\quad \phi _{1}(x_{k};\mu _{k|1:k-1}({\varvec{x}}_{1:k-1}), \varSigma _{k|1:k-1}), \end{aligned}$$

respectively, where

$$\begin{aligned} \mu _{k|1:k-1}({\varvec{x}}_{1:k-1})= \mu _{k}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} ({\varvec{x}}_{1:k-1}-{\varvec{\mu }}_{1:k-1}), \end{aligned}$$

and

$$\begin{aligned} \varSigma _{k|1:k-1}=\varSigma _{k,k}-{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}{\varvec{\varSigma }}_{k,1:k-1}^{T}. \end{aligned}$$

Lemma 1 can be seen as a special case of Theorems 2.4.1 and 2.5.1 of Anderson (2003). We can apply Lemma 1 to derive the following result.

Theorem 1

Every density function of form (1) can be expressed in form (5) via a bijective mapping between the parameters \({\varvec{\mu }}\) and \({\varvec{\varSigma }},\) and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}.\)

The proof of Theorem 1 and other major results can be found in the Appendix. As an example application of Theorem 1, consider that the \(3\hbox {-dimensional}\) Gaussian density \(\phi _{3}\left( {\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}\right) \) is equal to

$$\begin{aligned} \lambda ({\varvec{x}}; {\varvec{\gamma }}, {\varvec{\sigma }}^{2})=\phi _{1}(x_{1};\beta _{1,0}, \sigma _{1}^{2})\phi _{1}(x_{2};{\varvec{\beta }}_{2}^{T} \tilde{{\varvec{x}}}_{2},\sigma _{2}^{2})\phi _{1}(x_{3}; {\varvec{\beta }}_{3}^{T}\tilde{{\varvec{x}}}_{3},\sigma _{3}^{2}) \end{aligned}$$

for all \({\varvec{x}}\in \mathbb {R}^{3}\), where \(\beta _{1,0}=\mu _{1}\), \(\sigma _{1}^{2}=\varSigma _{1,1}\), \(\beta _{2,0}=\mu _{2}-\varSigma _{2,1}\varSigma _{1,1}^{-1}\mu _{1}\), \(\beta _{2,1}=\varSigma _{2,1}\varSigma _{1,1}^{-1}\), \(\sigma _{2}^{2}=\varSigma _{2,2}-\varSigma _{2,1}^{2}\varSigma _{1,1}^{-1}\),

and

$$\begin{aligned} \beta _{3,0}=\mu _{3}-{\varvec{\varSigma }}_{3,1:2}{\varvec{\varSigma }}_{1:2,1:2}^{-1}{\varvec{\mu }}_{1:2},\quad \left( \beta _{3,1},\beta _{3,2}\right) ={\varvec{\varSigma }}_{3,1:2}{\varvec{\varSigma }}_{1:2,1:2}^{-1},\quad \sigma _{3}^{2}=\varSigma _{3,3}-{\varvec{\varSigma }}_{3,1:2}{\varvec{\varSigma }}_{1:2,1:2}^{-1}{\varvec{\varSigma }}_{3,1:2}^{T}, \end{aligned}$$

which follow from the conditional mean and variance expressions given in Lemma 1.
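
To make the mapping concrete, the following R sketch (with our own variable names) numerically verifies the \(d=3\) case: the LRC parameters are computed from \({\varvec{\mu }}\) and \({\varvec{\varSigma }}\) via the conditional moments of Lemma 1, and the resulting product of univariate Gaussian densities is compared with \(\phi _{3}\).

```r
# Numerical check of Theorem 1 for d = 3 (illustrative only).
set.seed(1)
mu <- c(1, -1, 2)
A <- matrix(rnorm(9), 3, 3)
Sigma <- crossprod(A) + diag(3)                       # a positive-definite covariance

phi_d <- function(x, mu, Sigma) {
  d <- length(mu)
  q <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))
  (2 * pi)^(-d / 2) * det(Sigma)^(-1 / 2) * exp(-q / 2)
}

# LRC parameters via the mappings displayed above (from Lemma 1)
beta1 <- mu[1];                       s1 <- Sigma[1, 1]
b21   <- Sigma[2, 1] / Sigma[1, 1]
beta2 <- c(mu[2] - b21 * mu[1], b21); s2 <- Sigma[2, 2] - Sigma[2, 1]^2 / Sigma[1, 1]
b3    <- drop(Sigma[3, 1:2] %*% solve(Sigma[1:2, 1:2]))
beta3 <- c(mu[3] - sum(b3 * mu[1:2]), b3)
s3    <- Sigma[3, 3] - sum(b3 * Sigma[1:2, 3])

x <- c(0.3, -0.2, 1.5)
lrc <- dnorm(x[1], beta1, sqrt(s1)) *
  dnorm(x[2], sum(beta2 * c(1, x[1])), sqrt(s2)) *
  dnorm(x[3], sum(beta3 * c(1, x[1], x[2])), sqrt(s3))
all.equal(lrc, phi_d(x, mu, Sigma))                   # TRUE, up to numerical tolerance
```
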
2.1 Ordering of variables

As noted by a reviewer, the bijective mapping between \({\varvec{\mu }}\) and \({\varvec{\varSigma }}\), and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) only holds for the unpermuted ordering of the elements of \({\varvec{X}}=\left( X_{1},\ldots ,X_{d}\right) ^{T}\). That is, if \({\varvec{\varPi }}\ne {\varvec{I}}\) is a permutation matrix, as defined in Section 8.2 of Seber (2008) [i.e. \({\varvec{\varPi }}{\varvec{X}}\) permutes the ordering of the elements of \({\varvec{X}}\); e.g. there exists a \({\varvec{\varPi }}\) such that \({\varvec{\varPi }}(X_{1},X_{2},X_{3})^{T}=(X_{3},X_{1},X_{2})^{T}\)], then there exist functions \({\varvec{\gamma }}_{\varvec{\varPi }}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{\varvec{\varPi }}^{2} ({\varvec{\mu }},{\varvec{\varSigma }})\) such that

$$\begin{aligned} \lambda \left( {\varvec{\varPi }}{\varvec{x}}; {\varvec{\gamma }}_{\varvec{\varPi }}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) ,{\varvec{\sigma }}_{\varvec{\varPi }}^{2}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \right) =\phi _{d}\left( {\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}\right) \end{aligned}$$

for all \({\varvec{x}}\in \mathbb {R}^{d}\), where it is possible that \({\varvec{\gamma }}\ne {\varvec{\gamma }}_{\varvec{\varPi }}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \) or \({\varvec{\sigma }}^{2}\ne {\varvec{\sigma }}_{\varvec{\varPi }}^{2}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \). Here, \({\varvec{I}}\) is an identity matrix of appropriate dimension. This implies that if we permute the order of the elements of the data vector, then the resulting LRC of the Gaussian density need not be the same as the LRC obtained under the original ordering.

We note that although the explicit forms of \({\varvec{\gamma }}_{\varvec{\varPi }}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{\varvec{\varPi }}^{2}({\varvec{\mu }},{\varvec{\varSigma }})\) are not provided, they can be constructed in the following way. Firstly, consider that for any permutation matrix \({\varvec{\varPi }}\),

$$\begin{aligned} \phi _{d}\left( {\varvec{\varPi }}{\varvec{x}}; {\varvec{\varPi }}{\varvec{\mu }},{\varvec{\varPi }}{\varvec{\varSigma }}{\varvec{\varPi }}^{T}\right) =\phi _{d}\left( {\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}\right) \end{aligned}$$

for all \({\varvec{x}}\in \mathbb {R}^{d}\). Secondly, Eqs. (22)–(25) can be used to map the elements of \({\varvec{\varPi }}{\varvec{\mu }}\) and \({\varvec{\varPi }}{\varvec{\varSigma }}{\varvec{\varPi }}^{T}\) to the elements of \({\varvec{\gamma }}_{{\varvec{\varPi }}}({\varvec{\mu }}, {\varvec{\varSigma }})\) and \({\varvec{\sigma }}_{{\varvec{\varPi }}}^{2}({\varvec{\mu }}, {\varvec{\varSigma }})\). Thus, the LRC parameters of any permutation of the elements of \({\varvec{X}}\) can be obtained via knowledge of the form of the permutation and the parameters of the underlying Gaussian density function of \({\varvec{X}}\).
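
Continuing the \(d=3\) sketch above, the construction amounts to permuting the natural parameters first and then reapplying the same mappings; the permutation matrix below reproduces the example \({\varvec{\varPi }}(X_{1},X_{2},X_{3})^{T}=(X_{3},X_{1},X_{2})^{T}\).

```r
# Permuted-order LRC parameters (illustrative; mu and Sigma as in the previous sketch).
Pi <- matrix(0, 3, 3)
Pi[cbind(1:3, c(3, 1, 2))] <- 1          # maps (x1, x2, x3) to (x3, x1, x2)
mu_Pi    <- drop(Pi %*% mu)
Sigma_Pi <- Pi %*% Sigma %*% t(Pi)
# Feeding mu_Pi and Sigma_Pi through the Lemma 1 mappings used above yields
# gamma_Pi(mu, Sigma) and sigma^2_Pi(mu, Sigma) for the permuted ordering.
```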

2.2 Gaussian mixture model

We now consider the LRC of the GMM density function

$$\begin{aligned} f_{R}({\varvec{x}}; {\varvec{\psi }})=\sum _{i=1}^{g}\pi _{i} \lambda ({\varvec{x}}; {\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2}), \end{aligned}$$
(6)

where \({\varvec{\psi }}=(\varvec{\pi }^{T},{\varvec{\gamma }}_{1}^{T}, \ldots ,{\varvec{\gamma }}_{g}^{T},{\varvec{\sigma }}_{1}^{2}, \ldots ,{\varvec{\sigma }}_{g}^{2})^{T}\) is the vector of model parameters. Here, \({\varvec{\gamma }}_{i}=({\varvec{\beta }}_{i,1}^{T}, \ldots ,{\varvec{\beta }}_{i,d}^{T})^{T}\), \({\varvec{\sigma }}_{i}^{2}=(\sigma _{i,1}^{2},\ldots , \sigma _{i,d}^{2})\), and \({\varvec{\beta }}_{i,k}=(\beta _{i,k,0},\ldots ,\beta _{i,k,k-1})^{T}\), for each i and k. Furthermore, \(\varvec{\pi }\) is restricted in the same way as in (2). Using Theorem 1, the following result can be shown.

Corollary 1

For each natural parameter vector \(\varvec{\theta },\) there exists a mapping to an LRC parameter vector \({\varvec{\psi }},\) and vice versa,  such that \(f({\varvec{x}};\varvec{\theta })=f_{R}({\varvec{x}}; {\varvec{\psi }})\) at every \({\varvec{x}}\in \mathbb {R}^{d}\).

Corollary 1 can be seen as an extension of Proposition 1 from Ingrassia et al. (2012). Unfortunately, unlike Theorem 1, the mapping between \(\varvec{\theta }\) and \({\varvec{\psi }}\) is not bijective, due to the non-identifiability of GMMs; this issue is well documented in Section 3.1 of Titterington et al. (1985). Nevertheless, Corollary 1 allows us to consider the density estimation of data generated from a GMM using an LRC of the GMM instead. We shall show that this representation permits the construction of a matrix-free algorithm for ML estimation.
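
For concreteness, a minimal R sketch of evaluating the LRC GMM density (6) is given below; the helper names are ours, and the point is that only univariate normal densities and inner products appear, with no covariance inversions or determinants.

```r
# Matrix-free evaluation of the LRC GMM density (6); names are illustrative.
lrc_component <- function(x, beta_i, sig2_i) {
  d <- length(x)
  val <- 1
  for (k in 1:d) {
    xk_tilde <- c(1, x[seq_len(k - 1)])        # (1, x_1, ..., x_{k-1})
    mean_k   <- sum(beta_i[[k]] * xk_tilde)    # beta_{i,k}^T x~_k as a plain sum
    val      <- val * dnorm(x[k], mean_k, sqrt(sig2_i[k]))
  }
  val
}

lrc_gmm_density <- function(x, pis, beta, sig2) {
  # beta[[i]][[k]] is the length-k vector beta_{i,k}; sig2[[i]] holds sigma_{i,1}^2, ..., sigma_{i,d}^2
  sum(vapply(seq_along(pis),
             function(i) pis[i] * lrc_component(x, beta[[i]], sig2[[i]]),
             numeric(1)))
}
```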

3 Maximum likelihood estimation

Upon observing data \({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n}\), the likelihood and the log-likelihood that the data arise from an LRC of the GMM are \(\mathcal {L}_{R,n}({\varvec{\psi }})=\prod _{j=1}^{n}f_{R} ({\varvec{x}}_{j};{\varvec{\psi }})\) and

$$\begin{aligned} \log \mathcal {L}_{R,n}({\varvec{\psi }})= & {} \sum _{j=1}^{n} \log f_{R}({\varvec{x}}_{j}; {\varvec{\psi }})\nonumber \\= & {} \sum _{j=1}^{n}\log \sum _{i=1}^{g}\pi _{i}\lambda ({\varvec{x}}_{j}; {\varvec{\gamma }}_{i},{\varvec{\sigma }}_{i}^{2}), \end{aligned}$$
(7)

respectively.

As with the natural parametrization of the GMM, we generally assume that the data were generated from a process with density function \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0})\), where \({\varvec{\psi }}^{0}\) is unknown. In such cases, \({\varvec{\psi }}^{0}\) can be estimated via the ML estimator \(\hat{{\varvec{\psi }}}_{n}\), where \(\hat{{\varvec{\psi }}}_{n}\) is an appropriate local maximizer of (7).

Like \(\hat{\varvec{\theta }}_{n}\), \(\hat{{\varvec{\psi }}}_{n}\) cannot be computed in closed form. Thus, we must devise an iterative computation scheme. We now present an MM algorithm for such a purpose.

3.1 Minorization–maximization algorithms

Suppose that we wish to maximize some objective function \(\eta ({\varvec{t}})\), where \({\varvec{t}}\in S\) for some set \(S\subset \mathbb {R}^{r}\). If we cannot obtain the maximizer of \(\eta ({\varvec{t}})\) directly, then we can seek a minorizer of \(\eta ({\varvec{t}})\) over S, instead. A minorizer of \(\eta ({\varvec{t}})\) is a function \(h({\varvec{s}}; {\varvec{t}})\) such that \(\eta ({\varvec{t}})=h({\varvec{t}}; {\varvec{t}})\) and \(\eta ({\varvec{s}})\ge h({\varvec{s}}; {\varvec{t}})\), whenever \({\varvec{s}}\ne {\varvec{t}}\), and \({\varvec{s}}\in S\). Here, \(h({\varvec{s}}; {\varvec{t}})\) is said to minorize \(\eta ({\varvec{t}})\).

Upon finding an appropriate minorizer and denoting \({\varvec{t}}^{(m)}\) as the \(m\hbox {th}\) iteration of the algorithm, an MM algorithm can be defined using the update scheme

$$\begin{aligned} {\varvec{t}}^{(m+1)}=\arg \underset{{\varvec{s}}\in S}{\max } \, h({\varvec{s}}; {\varvec{t}}^{(m)}). \end{aligned}$$
(8)

By the properties of the minorizer and by the definition of (8), iterative applications of the MM update scheme yield the inequalities,

$$\begin{aligned} \eta ({\varvec{t}}^{(m+1)})\ge h({\varvec{t}}^{(m+1)};{\varvec{t}}^{(m)})\ge h({\varvec{t}}^{(m)};{\varvec{t}}^{(m)})=\eta ({\varvec{t}}^{(m)}). \end{aligned}$$

This shows that the sequence of objective function evaluations \(\eta ({\varvec{t}}^{(m)})\) is monotonically increasing in each step. Furthermore, under some regularity conditions, it can also be shown that the sequence of iterates \({\varvec{t}}^{(m)}\) converges to some stationary point \({\varvec{t}}^{*}\) of \(\eta ({\varvec{t}})\).

In this article, we consider minorizers for the functions

$$\begin{aligned} \eta _{1}({\varvec{t}})=\log \left( \sum _{i=1}^{r}t_{i}\right) , \end{aligned}$$

where \(S=\{{\varvec{t}}:t_{i}\ge 0,i=1,\ldots ,r\} \), and

$$\begin{aligned} \eta _{2}({\varvec{t}})=-(a-{\varvec{t}}^{T}{\varvec{b}})^{2}, \end{aligned}$$

where \(a\in \mathbb {R}\), \({\varvec{b}}\in \mathbb {R}^{r}\), and \(S=\mathbb {R}^{r}\). These objective functions [i.e. \(\eta _{1}({\varvec{t}})\) and \(\eta _{2}({\varvec{t}})\)] can be minorized via the functions

$$\begin{aligned} h_{1}\left( {\varvec{s}}; {\varvec{t}}\right) = \sum _{i=1}^{r}\frac{t_{i}}{\sum _{i^{\prime }=1}^{r} t_{i^{\prime }}}\left[ \log \left( \frac{\sum _{i^{\prime }=1}^{r} t_{i^{\prime }}}{t_{i}}\right) +\log \left( s_{i}\right) \right] , \end{aligned}$$
(9)

and

$$\begin{aligned} h_{2}\left( {\varvec{s}}; {\varvec{t}}\right) =- \sum _{i=1}^{r}\alpha _{i}\left[ a-\frac{b_{i}}{\alpha _{i}} \left( s_{i}-t_{i}\right) -{\varvec{t}}^{T}{\varvec{b}} \right] ^{2}, \end{aligned}$$
(10)

respectively, where \(\alpha _{i}=(|b_{i}|^{p}+\delta )/\sum _{i^{\prime }=1}^{r} (|b_{i^{\prime }}|^{p}+\delta )\), \(p>0\), and \(\delta >0\) is a small coefficient. The two minorizers were devised in Zhou and Lange (2010) and Becker et al. (1997), respectively; the latter was applied to perform LR without matrix operations. Here, we choose \(p=2\) in \(\alpha _{i}\) for use throughout the article, as per a suggestion from Becker et al. (1997).
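
The two minorizers are easy to verify numerically; the short R spot-check below (purely illustrative, with our own variable names) confirms the tangency and domination properties of (9) and (10) at randomly chosen points.

```r
# Spot-check of the minorizers (9) and (10); all names are illustrative.
set.seed(2)
r <- 4
t0 <- runif(r); s0 <- runif(r)                       # current iterate t and a trial point s

eta1 <- function(t) log(sum(t))
h1   <- function(s, t) sum((t / sum(t)) * (log(sum(t) / t) + log(s)))
c(eta1(t0) - h1(t0, t0), eta1(s0) - h1(s0, t0))      # first is ~0, second is >= 0

a <- 0.7; b <- rnorm(r); p <- 2; delta <- 1e-4
alpha <- (abs(b)^p + delta) / sum(abs(b)^p + delta)
eta2 <- function(t) -(a - sum(t * b))^2
h2   <- function(s, t) -sum(alpha * (a - (b / alpha) * (s - t) - sum(t * b))^2)
c(eta2(t0) - h2(t0, t0), eta2(s0) - h2(s0, t0))      # first is ~0, second is >= 0
```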

Let \({\varvec{\psi }}^{(m)}\) be the \(m\hbox {th}\) MM iterate. By setting \(s_{i}=\pi _{i}\lambda ({\varvec{x}}_{j};{\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})\) and \(t_{i}=\pi _{i}^{(m)}\lambda ({\varvec{x}}_{j}; {\varvec{\gamma }}_{i}^{(m)},{\varvec{\sigma }}_{i}^{(m)2})\) in (9), for each i and j, we get the minorizer for (7),

$$\begin{aligned} Q({\varvec{\psi }}; {\varvec{\psi }}^{(m)})= & {} C({\varvec{\psi }}^{(m)})+\sum _{i=1}^{g}\sum _{j=1}^{n} \tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}) \log [\pi _{i}\lambda ({\varvec{x}}_{j}; {\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})]\nonumber \\= & {} C({\varvec{\psi }}^{(m)})+Q_{1}({\varvec{\psi }}; {\varvec{\psi }}^{(m)})\nonumber \\&+\sum _{i=1}^{g}\sum _{k=1}^{d}\frac{1}{2\sigma _{i,k}^{2}} \sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)} )Q_{2,i,j,k}({\varvec{\beta }}_{i,k};{\varvec{\beta }}_{i,k}^{ (m)}), \end{aligned}$$
(11)

where

$$\begin{aligned}&Q_{1}({\varvec{\psi }}; {\varvec{\psi }}^{(m)})= \sum _{i=1}^{g}\sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})\log \pi _{i}-\frac{1}{2} \sum _{i=1}^{g}\sum _{k=1}^{d}\log \sigma _{i,k}^{2} \sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}),\nonumber \\&Q_{2,i,j,k}({\varvec{\beta }}_{i,k}; {\varvec{\beta }}_{i,k}^{(m)})=-(x_{j,k}-{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{j,k})^{2}, \end{aligned}$$
(12)

and

$$\begin{aligned} \tau _{i}({\varvec{x}}; {\varvec{\psi }})= \frac{\pi _{i}\lambda ({\varvec{x}}; {\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})}{\sum _{i^{\prime }=1}^{g} \pi _{i^{\prime }}\lambda ({\varvec{x}}; {\varvec{\gamma }}_{i^{\prime }}, {\varvec{\sigma }}_{i^{\prime }}^{2})}. \end{aligned}$$
(13)

Here, \({\varvec{x}}_{j}=(x_{j,1},\ldots ,x_{j,d})^{T}\) and \(\tilde{{\varvec{x}}}_{j,k}=(1,x_{j,1},\ldots ,x_{j,k-1})^{T}\) for each j and k, and \(C({\varvec{\psi }}^{(m)})\) is a constant that does not depend on \({\varvec{\psi }}\).

Now, by setting \(a=x_{j,k}\), \({\varvec{b}}=\tilde{{\varvec{x}}}_{j,k}\), \({\varvec{s}}={\varvec{\beta }}_{i,k}\), and \({\varvec{t}}={\varvec{\beta }}_{i,k}^{(m)}\) in (10), we obtain the minorizer for (12),

$$\begin{aligned} Q_{2,i,j,k}^{\prime }({\varvec{\beta }}_{i,k}; {\varvec{\beta }}_{i,k}^{(m)})=-\sum _{l=0}^{k-1}\alpha _{j,l} \left[ x_{j,k}-\frac{x_{j,l}}{\alpha _{j,l}}(\beta _{i,k,l}- \beta _{i,k,l}^{(m)})-{\varvec{\beta }}_{i,k}^{(m)T} \tilde{{\varvec{x}}}_{j,k}\right] ^{2},\qquad \end{aligned}$$
(14)

where \(\alpha _{j,l}=(|x_{j,l}|^{p}+\delta )/\sum _{l^{\prime }=0}^{k-1} (|x_{j,l^{\prime }}|^{p}+\delta )\) and \(x_{j,0}=1\) by definition, for each j and \(l=0,\ldots ,k-1\).

Using (14), we can further minorize (11), and thus (7), by

$$\begin{aligned} Q^{\prime }({\varvec{\psi }}; {\varvec{\psi }}^{(m)})= & {} C({\varvec{\psi }}^{(m)})+Q_{1}({\varvec{\psi }}; {\varvec{\psi }}^{(m)})\nonumber \\&+\sum _{i=1}^{g}\sum _{k=1}^{d}\frac{1}{2\sigma _{i,k}^{2}} \sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})Q_{2,i,j,k}^{\prime } ({\varvec{\beta }}_{i,k};{\varvec{\beta }}_{i,k}^{(m)}). \end{aligned}$$
(15)

We now consider the partition of \({\varvec{\psi }}\) into \({\varvec{\psi }}_{1}=(\varvec{\pi }^{T}, {\varvec{\gamma }}_{1}^{T},\ldots ,{\varvec{\gamma }}_{g}^{T})^{T}\) and \({\varvec{\psi }}_{2}=({\varvec{\sigma }}_{1}^{2},\ldots , {\varvec{\sigma }}_{g}^{2})^{T}\), where \({\varvec{\psi }}=({\varvec{\psi }}_{1}^{T},{\varvec{\psi }}_{2}^{T} )^{T}\). By fixing \({\varvec{\psi }}_{2}\) at \({\varvec{\psi }}_{2}^{(m)}\), \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is additively separable in the subsets of \({\varvec{\psi }}_{1}\); furthermore, for each i, the elements of \({\varvec{\gamma }}_{i}\) are additively separable as well. Similarly, by fixing \({\varvec{\psi }}_{1}\) at \({\varvec{\psi }}_{1}^{(m)}\), \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\) is additively separable in the elements of the subsets of \({\varvec{\psi }}_{2}\). This result suggests the following block successive MM update scheme.

Let \({\varvec{\psi }}^{(0)}\) be an initial parameter vector. At the \((m+1)\hbox {th}\) iteration, the algorithm proceeds in two steps: the min (minorization)-step and the max (maximization)-step. In the min-step, we construct (15) if m is odd, or (11) if m is even; in either case, the min-step requires the computation of \(\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})\), for each i and j.

In the max-step, if m is odd, then we set \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)}\) and solve for the root of

$$\begin{aligned} \frac{\partial Q^{\prime }({\varvec{\psi }}_{1}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})}{\partial {\varvec{\psi }}_{1}}=\varvec{0}, \end{aligned}$$

to get the updates

$$\begin{aligned} \pi _{i}^{(m+1)}=\frac{\sum _{j=1}^{n}\tau _{i} ({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})}{n} \end{aligned}$$
(16)

and

$$\begin{aligned} \beta _{i,k,l}^{(m+1)}=\beta _{i,k,l}^{(m)}+\frac{\sum _{j=1}^{n}x_{j,l} \tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}) [x_{j,k}-{\varvec{\beta }}_{i,k}^{(m)T} \tilde{{\varvec{x}}}_{j,k}]}{\sum _{j=1}^{n}[x_{j,l}^{2}\tau _{i}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})/\alpha _{j,l}]} \end{aligned}$$
(17)

for each i, k, and \(l=0,\ldots ,k-1\). Here, \(\varvec{0}\) is a zero matrix/vector of appropriate dimensionality.

Similarly, the max-step for even m proceeds by setting \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)}\) and solving for the root of

$$\begin{aligned} \frac{\partial Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})}{\partial {\varvec{\psi }}_{2}}=\varvec{0} \end{aligned}$$

to obtain the updates

$$\begin{aligned} \sigma _{i,k}^{(m+1)2}=\frac{\sum _{j=1}^{n}\tau _{i} ({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}) [x_{j,k}-{\varvec{\beta }}_{i,k}^{(m)T} \tilde{{\varvec{x}}}_{j,k}]^{2}}{\sum _{j=1}^{n}\tau _{i} ({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})} \end{aligned}$$
(18)

for each i and k. We now show that together, updates (16)–(18) generate a sequence of monotonically increasing log-likelihood values.

Theorem 2

If m is odd and \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)},\) then updates (16) and (17) result in the inequalities, 

$$\begin{aligned} \log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m+1)})\ge & {} Q^{\prime }({\varvec{\psi }}_{1}^{(m+1)}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})\nonumber \\\ge & {} Q^{\prime }({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)}) \ge \log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)}). \end{aligned}$$
(19)

If m is even and \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)},\) then update (18) results in the inequalities, 

$$\begin{aligned} \log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m+1)})\ge & {} Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2}^{(m+1)};{\varvec{\psi }}^{(m)})\nonumber \\\ge & {} Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)}) \ge \log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)}). \end{aligned}$$
(20)

In general, the MM algorithm is iterated until some convergence criterion is met. Usually, this is either the absolute convergence criterion

$$\begin{aligned} \log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m+1)}) -\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)})\le \epsilon , \end{aligned}$$

or the relative convergence criterion

$$\begin{aligned} \frac{\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m+1)}) -\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)}) }{|\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)})|} \le \epsilon , \end{aligned}$$
(21)

for some small \(\epsilon >0\). In either case, upon convergence, the final iterate of the algorithm is declared the ML estimator \(\hat{{\varvec{\psi }}}_{n}\). Let \({\varvec{\psi }}^{*}\) be a limit point of the MM algorithm, such that \(\hat{{\varvec{\psi }}}_{n}\rightarrow {\varvec{\psi }}^{*}\) as \(\epsilon \rightarrow 0\) for some starting parameter \({\varvec{\psi }}^{(0)}\).
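
To make the block update scheme concrete, a complete matrix-free R sketch is given below; the function name, the k-means-based initialization, and the default settings are our own illustrative choices, while the responsibilities, the updates, and the stopping rule follow (13), (16)–(18), and (21).

```r
## Matrix-free MM sketch for ML estimation of the LRC GMM (illustrative only).
## X is an n x d data matrix and g the number of components; only elementwise
## products, sums, and univariate normal densities appear in the updates.
mm_lrc_gmm <- function(X, g, max_iter = 2000, eps = 1e-5, p = 2, delta = 1e-4) {
  n <- nrow(X); d <- ncol(X)
  Xt <- cbind(1, X)                                   # column l + 1 holds x_{j,l}, with x_{j,0} = 1
  W  <- abs(Xt)^p + delta                             # building blocks of alpha_{j,l} in (14)
  alpha_of <- function(k) W[, 1:k, drop = FALSE] / rowSums(W[, 1:k, drop = FALSE])

  fitted_k <- function(beta_ik, k)                    # beta_{i,k}^T x~_{j,k} for all j
    rowSums(Xt[, 1:k, drop = FALSE] * matrix(beta_ik, n, k, byrow = TRUE))

  log_lambda <- function(beta_i, sig2_i) {            # log lambda(x_j; gamma_i, sigma_i^2), as in (5)
    out <- numeric(n)
    for (k in 1:d)
      out <- out + dnorm(X[, k], fitted_k(beta_i[[k]], k), sqrt(sig2_i[k]), log = TRUE)
    out
  }

  # Crude k-means initialization (an illustrative choice)
  cl   <- kmeans(X, centers = g, nstart = 5)$cluster
  pis  <- as.vector(table(cl)) / n
  beta <- lapply(1:g, function(i)
    lapply(1:d, function(k) c(mean(X[cl == i, k]), rep(0, k - 1))))
  sig2 <- lapply(1:g, function(i) apply(X[cl == i, , drop = FALSE], 2, var))

  loglik <- function() {                              # log-likelihood (7), numerically stabilized
    ll <- sapply(1:g, function(i) log(pis[i]) + log_lambda(beta[[i]], sig2[[i]]))
    mx <- apply(ll, 1, max)
    sum(mx + log(rowSums(exp(ll - mx))))
  }

  old <- loglik()
  for (m in 1:max_iter) {
    ll  <- sapply(1:g, function(i) log(pis[i]) + log_lambda(beta[[i]], sig2[[i]]))
    tau <- exp(ll - apply(ll, 1, max)); tau <- tau / rowSums(tau)   # responsibilities (13)

    if (m %% 2 == 1) {                                # odd m: updates (16) and (17)
      pis <- colSums(tau) / n
      for (i in 1:g) for (k in 1:d) {
        A   <- alpha_of(k)
        res <- X[, k] - fitted_k(beta[[i]][[k]], k)   # residuals at the current beta^(m)
        for (l in 1:k)
          beta[[i]][[k]][l] <- beta[[i]][[k]][l] +
            sum(Xt[, l] * tau[, i] * res) / sum(Xt[, l]^2 * tau[, i] / A[, l])
      }
    } else {                                          # even m: update (18)
      for (i in 1:g) for (k in 1:d) {
        res <- X[, k] - fitted_k(beta[[i]][[k]], k)
        sig2[[i]][k] <- sum(tau[, i] * res^2) / sum(tau[, i])
      }
    }

    new <- loglik()
    if (m > 1 && (new - old) / abs(old) <= eps) break # relative criterion (21)
    old <- new
  }
  list(pi = pis, beta = beta, sigma2 = sig2, loglik = new, iterations = m)
}
```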

The algorithm that we have devised is an instance of the block successive lower-bound maximization algorithms of Razaviyayn et al. (2013). As such, we can apply Theorem 2 of Razaviyayn et al. (2013) to obtain the following result regarding its limit points.

Theorem 3

If \({\varvec{\psi }}^{*}\) is a limit point of the algorithm conducted via the steps \({\varvec{\psi }}_{2}^{(m+1)}={\varvec{\psi }}_{2}^{(m)},\) (16), and (17), when m is odd;  and \({\varvec{\psi }}_{1}^{(m+1)}={\varvec{\psi }}_{1}^{(m)}\) and (18), when m is even,  for some initial vector \({\varvec{\psi }}^{(0)};\) then \({\varvec{\psi }}^{*}\) is a stationary point of (7).

Theorem 3 shows that, given a suitable initial parameter vector \({\varvec{\psi }}^{\left( 0\right) }\), the MM algorithm generates a sequence \({\varvec{\psi }}^{\left( m\right) }\) that converges to a stationary point of (7); this is a good result considering the multimodality of (7).

3.2 Covariance constraints

We note that Theorem 3 requires that the limit points be finite values. This cannot always be guaranteed since \(\log \mathcal {L}_{R,n}({\varvec{\psi }}^{(m)})\rightarrow \infty \) if any of the sequences \(\sigma _{i,k}^{(m)2}\rightarrow 0\). This is equivalent to the problem of component covariance matrices \({\varvec{\varSigma }}_{i}\) becoming singular in the natural parametrization [i.e. in \(\mathcal {L}_{n}(\varvec{\theta })\)]. In the natural parametrization, the usual approach is to restrict the component covariance matrices to be positive definite via conditioning on the eigenvalues of the matrices. Such approaches were pioneered in Hathaway (1985); examples of recent developments include Greselin and Ingrassia (2008), Ingrassia (2004), and Ingrassia and Rocci (2007, 2011). We proceed to provide a simple alternative to the aforementioned approaches, based upon the LRC parametrization.

In the LRC, ensuring finite limit points amounts to guaranteeing that for each i and k, \(\sigma _{i,k}^{(m)2}\rightarrow \xi _{i,k}\) for some \(\xi _{i,k}>0\). This can be implemented by adding a small \(\xi >0\) to the right-hand side of update (18) at each iteration. In doing so, we ensure that each \(\sigma _{i,k}^{*2}\) is positive, while retaining the monotonicity of the likelihood at each update. The following result is then applicable.

Theorem 4

In (5), if \(\sigma _{k}^{2}>0\) for each k,  then the corresponding covariance matrix \({\varvec{\varSigma }},\) of the natural parametrization,  is positive-definite.

Theorem 4 implies that the covariance matrices in the natural parametrization will always be positive-definite if we apply the described process. As suggested by a reviewer, it is practically important to choose a \(\xi \) that does not impede the estimation of mixture components with small variances. If we assume that the marginal variances of each component are no smaller than a proportion \(\varXi ^{-1}\) (\(\varXi >1\)) of the corresponding marginal variances of the overall distribution [i.e. \(\hbox {var}(X_{1}),\ldots ,\hbox {var}(X_{d})\)], then we can choose

$$\begin{aligned} \xi =\min \left\{ \frac{\hat{\hbox {var}}\left( X_{1}\right) }{\varXi },\ldots , \frac{\hat{\hbox {var}}\left( X_{d}\right) }{\varXi }\right\} \hbox {,} \end{aligned}$$

where \(\hat{\hbox {var}}(X_{k})\) is an estimate for the marginal variance of the \(k\hbox {th}\) dimension. We use \(\varXi =10^{10}\) for all numerical applications presented in this article.
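
In code, the floor can be computed once from the data and then added inside the variance update; the sketch below (with our own helper name) follows the rule above with the default \(\varXi =10^{10}\).

```r
# Variance floor of Sect. 3.2 (illustrative helper; X is an n x d data matrix).
variance_floor <- function(X, Xi = 1e10) min(apply(X, 2, var)) / Xi

# Inside the even-m step of the MM sketch above, update (18) then becomes, for each i and k:
#   sig2[[i]][k] <- sum(tau[, i] * res^2) / sum(tau[, i]) + xi
# where xi <- variance_floor(X), keeping every sigma_{i,k}^2 bounded away from zero.
```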

4 Statistical properties

The consistency and asymptotic normality of ML estimators for GMM under the natural parametrization have been proven in many instances; see for example, Redner and Walker (1984), Hathaway (1985), and Atienza et al. (2007). We now seek the consistency of the ML estimators of the LRC of the GMM. Such a result can be obtained via Theorem 4.1.2 of Amemiya (1985).

Theorem 5

Let \({\varvec{X}}_{1},\ldots ,{\varvec{X}}_{n}\) be independent and identically distributed random samples from a distribution with density \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0}),\) and let \(\varPsi _{n}\) be the set of roots of the equation \(\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}=\varvec{0},\) where \(\varPsi _{n}=\{\varvec{0}\}\) if there are no roots. If \({\varvec{\psi }}^{0}\) is a strict local maximizer of \(\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})],\) then for any \(\epsilon >0,\)

$$\begin{aligned} \underset{n\rightarrow \infty }{\lim }\mathbb {P} \left[ \underset{{\varvec{\psi }}\in \varPsi _{n}}{\inf } ({\varvec{\psi }}-{\varvec{\psi }}^{0})^{T} ({\varvec{\psi }}-{\varvec{\psi }}^{0})> \epsilon \right] =0. \end{aligned}$$

Theorem 5 is an adequate result, considering that the log-likelihoods of GMMs are often multimodal and unbounded, and that the MM algorithm is able to locate local maximizers when started from suitable values. We note that the result similarly holds when a lower bound is enforced for each of the variance limit points \(\sigma _{i,k}^{*2}\), as in Sect. 3.2.

We now seek to establish the asymptotic normality of the ML estimators. Upon making some assumptions (see the proof in the Appendix), we are able to utilize Theorem 4.2.4 of Amemiya (1985) to get the following result.

Theorem 6

Under Assumption B4, the ML estimator \(\hat{{\varvec{\psi }}}_{n}\) (as in Theorem 5) satisfies

$$\begin{aligned} \sqrt{n}(\hat{{\varvec{\psi }}}_{n}-{\varvec{\psi }}^{0}) \overset{D}{\rightarrow }N\left( \varvec{0},-\mathbb {E} \left[ \frac{\partial ^{2}\log f_{R}({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\bigg |_{{\varvec{\psi }}= {\varvec{\psi }}^{0}}\right] ^{-1}\right) . \end{aligned}$$

Theorems 5 and 6 allow for inferences to be drawn from the ML estimators and their functions. For example, if \({\varvec{x}}\) is an observation that arises from a distribution with density \(f_{R}({\varvec{x}}; {\varvec{\psi }}^{0})\), then the conditional probability that its latent component variable satisfies \(Z=i\), given \({\varvec{X}}={\varvec{x}}\), is \(\tau _{i}({\varvec{x}}; {\varvec{\psi }}^{0})\), via an application of Bayes' rule. Furthermore, if \(\hat{{\varvec{\psi }}}_{n}\overset{P}{\rightarrow }{\varvec{\psi }}^{0},\) then by the continuous mapping theorem, we have \(\tau _{i}({\varvec{x}};\hat{{\varvec{\psi }}}_{n}) \overset{P}{\rightarrow }\tau _{i}({\varvec{x}}; {\varvec{\psi }}^{0})\), for each i. Thus, the estimated allocation rule that assigns \({\varvec{x}}\) to component \(\hat{z}\in \{1,\ldots ,g\}\),

$$\begin{aligned} \hat{z}=\arg \underset{i\in \{1,\ldots ,g\}}{\max } \tau _{i}({\varvec{x}};\hat{{\varvec{\psi }}}_{n}), \end{aligned}$$

is asymptotically correct (i.e. it asymptotically assigns \({\varvec{x}}\) to the component which maximizes the a posteriori probability).
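
As a final illustration, the plug-in allocation rule can be written directly in terms of the fitted LRC parameters; the sketch below reuses lrc_component() from the sketch in Sect. 2.2 and assumes a hypothetical fitted object fit holding pi, beta, and sigma2 (as returned by the MM sketch of Sect. 3.1).

```r
# Plug-in allocation rule based on the responsibilities (13); illustrative only.
classify <- function(x, fit) {
  num <- vapply(seq_along(fit$pi),
                function(i) fit$pi[i] * lrc_component(x, fit$beta[[i]], fit$sigma2[[i]]),
                numeric(1))
  which.max(num)   # maximizing tau_i(x; psi-hat) is equivalent to maximizing its numerator
}
```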

5 Numerical simulations

To assess the performance of the MM algorithm proposed in Sect. 3.1, we conduct a set of three numerical simulation studies: S1, S2, and S3. In S1, we investigate the performance of the MM algorithm against the standard EM algorithm for estimating GMMs in a setting where the simulated data arise from clusters of equal sample sizes. In S2, the setup from S1 is repeated, albeit with simulated data that arise from clusters with differing sample sizes. Finally, in S3, we assess whether or not the ordering of the variables (as discussed in Sect. 2.1) is influential in the performance of the MM algorithm. We present the setups and results of the numerical simulations below.

5.1 Numerical simulation S1

In S1, we simulated \(2^{N-G}\) observations from each of \(2^{G}\) Gaussian distributions of dimensionality \(2^{D}\), where \(D=1,\ldots ,4\), \(G=1,\ldots ,4\), and \(N=15,16,17\). These sample sizes were chosen since performance improvements are most relevant in large data sets. A sample simulated via this design approximately corresponds to a sample of \(2^{N}\) observations from a \(2^{G}\) component GMM, where \(\pi _{1}=\cdots =\pi _{2^{G}}=2^{-G}\).

In each scenario, each of the \(2^{G}\) distribution means is randomly sampled from a Gaussian distribution with mean \(\varvec{0}\) and covariance matrix \(20\times {\varvec{I}}\). The covariance matrices of the Gaussian distributions are each sampled from a Wishart distribution with scale matrix \({\varvec{I}}\) and \(2^{D}+2\) degrees of freedom. An example of the \(D=2\), \(G=3\), and \(N=15\) case is shown in Fig. 1.
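
A minimal R sketch of this data-generating design is given below; the helper name is ours, and the Cholesky-based sampler is just one convenient way to draw from the stated Gaussian and Wishart specifications.

```r
# Illustrative generator for the S1 design: 2^(N-G) observations from each of
# 2^G Gaussian components in 2^D dimensions.
simulate_s1 <- function(D, G, N) {
  d <- 2^D; g <- 2^G; n_per <- 2^(N - G)
  mus    <- replicate(g, rnorm(d, mean = 0, sd = sqrt(20)), simplify = FALSE)
  Sigmas <- lapply(1:g, function(i) rWishart(1, df = d + 2, Sigma = diag(d))[, , 1])
  X <- do.call(rbind, lapply(1:g, function(i) {
    Z <- matrix(rnorm(n_per * d), n_per, d)
    sweep(Z %*% chol(Sigmas[[i]]), 2, mus[[i]], `+`)  # rows ~ N(mu_i, Sigma_i)
  }))
  list(X = X, label = rep(1:g, each = n_per))
}

dat <- simulate_s1(D = 2, G = 3, N = 15)              # the case shown in Fig. 1
```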

Fig. 1 Pairwise marginal plots of data generated from the \(D=2\), \(G=3\), and \(N=15\) case of the simulations. The eight colors indicate the different origins of each of the generated data points

For each D, G, and N, 100 trials are simulated. In each trial, the MM algorithm is used to compute the ML estimates \(\hat{{\varvec{\psi }}}_{n}\) for the LRC parameters. Here, the algorithm is terminated using criterion (21) with \(\epsilon =10^{-5}\), and the computational time is recorded. The traditional EM algorithm (see Section 3.2 of McLachlan and Peel 2000) is then used to compute the ML estimates \(\hat{\varvec{\theta }}_{n}\) for the natural parameters, using the same starting values as for the MM algorithm. The EM algorithm is terminated using the criterion

$$\begin{aligned} \log \mathcal {L}_{R,n}(\hat{{\varvec{\psi }}}_{n})- \log \mathcal {L}_{n}(\varvec{\theta }_{n}^{(m)})<\epsilon , \end{aligned}$$

using \(\epsilon =10^{-5}\), and the computational time is recorded. The \(k\hbox {-means}\) algorithm was used to initialize parameters, as per Section 2.12 of McLachlan and Peel (2000); see MacQueen (1967) regarding the \(k\hbox {-means}\) algorithm.

The algorithms were applied via implementations in the R programming environment (version 3.0.2) on an Intel Core i7-2600 CPU running at 3.40 GHz with 16 GB internal RAM, and the timing was conducted using the proc.time function from said environment. The computational times, in seconds, for both algorithms were then averaged over the trials, for each scenario, and the results are reported in Table 1. In Fig. 2, we also plot the average ratio of EM to MM algorithm computational times, for each scenario.

Table 1 Results of numerical simulation S1 for assessing the performance of the MM and EM algorithms
Fig. 2 Plots of the average ratio of EM to MM algorithm computational times in S1. The panels are separated into the G values of each scenario, and the N values are indicated by the line colors. Here, red, green, and blue indicate the values \(N=15,16,17\), respectively. The dotted line indicates a ratio of 1 (color figure online)

Upon inspection, Table 1 suggests that both the MM and the EM algorithms behave as expected, with computation times increasing as D, G, and N increase. Furthermore, we notice that in each scenario, the MM algorithm is faster than the EM algorithm on average. In the best case, the MM algorithm is approximately 35 times faster than the EM (i.e. case \(D=1\), \(G=2\), and \(N=15\)), and in the worst case, the MM and EM are approximately at parity (i.e. case \(D=4\), \(G=1\), and \(N=17\)).

In Fig. 2, the performance advantage of the MM algorithm over the EM decreases as D, G, and N increase, with D eroding the gain more severely than the other two variables. This pattern may be explained by some of the additional computational overhead of the MM algorithm. For instance, notice that the MM algorithm requires the computation and storage of \(\alpha _{j,l}\) for each j, l, and k. The number of these computations increases quadratically in the dimensionality \(2^{D}\), but only linearly in \(2^{G}\) and \(2^{N}\). Due to this effect, we cannot recommend the MM algorithm in all situations. However, it is noticeable that in the low D cases, the MM algorithm appears to be distinctly faster than the EM.

5.2 Numerical simulation S2

In S2, we repeat the simulation design of S1, except that instead of simulating \(2^{G}\) equally sized samples, we simulated \(2^{G-1}\) samples of size \((1/2)\times 2^{N-G}\) and \(2^{G-1}\) samples of size \((3/2)\times 2^{N-G}\). A sample simulated via this design approximately corresponds to a sample of \(2^{N}\) observations from a \(2^{G}\) component GMM, where \(\pi _{1}=\cdots =\pi _{2^{G-1}}=1/2^{G+1}\) and \(\pi _{2^{G-1}+1}=\cdots =\pi _{2^{G}}=3/2^{G+1}\). Using the same comparison method and termination criterion as applied in S1, we obtain the results presented in Table 2. These results are further visualized in Fig. 3.

Table 2 Results of numerical simulation S2 for assessing the performance of the MM and EM algorithms
Fig. 3 Plots of the average ratio of EM to MM algorithm computational times in S2. The panels are separated into the G values of each scenario, and the N values are indicated by the line colors. Here, red, green, and blue indicate the values \(N=15,16,17\), respectively. The dotted line indicates a ratio of 1 (color figure online)

From Table 2, we notice that the average computational times of both the MM and EM algorithms are greater than those of the respective scenarios in S1. This indicates that estimating GMMs with differing cluster sizes may require more iterations to converge, on average, for either algorithm. As in S1, the MM algorithm is again faster than the EM in all tested scenarios. However, we do note that the difference between the two algorithms is smaller in S2. For example, in the best case, the MM algorithm is only 25.6 times faster than the EM (i.e. case \(D=1\), \(G=2\), and \(N=15\)).

Upon inspection of Fig. 3, we again see that the performance advantage of the MM algorithm over the EM decreases as D, G, and N increase, again with D eroding the gain more severely than the other two variables. Thus, we recognize that although the computational gains decrease as the tested variables increase, there is a good case for using the MM algorithm instead of the EM when D is relatively small.

5.3 Numerical simulation S3

Following the design of S1, we simulated 100 samples from the \(D=2\), \(G=2\), and \(N=15,16,17\) cases. For each sample of each of the three cases, we perform ML estimation using the MM algorithm on all 24 possible permutations of the four variables. The computation time and the log-likelihood value of each permutation are then recorded, and the mean, range, and relative range (as a ratio to the absolute value of the mean) are then computed over the 24 permutations, for both the computation time and the log-likelihood value. The average mean values, ranges, and relative ranges over the 100 samples for all three scenarios are presented in Table 3.

Table 3 Results of numerical simulation S3 for assessing the effects of variable ordering

Upon inspection of Table 3, we notice that the range of computation times across the three scenarios can be 40 to 50 % of the average computational times. Thus, selecting an advantageous permutation of the variables can lead to significant improvements in algorithm performance. Unfortunately, the most advantageous permutation was not predictable in our simulations, and even in the \(D=2\) case with four variables there are already 24 permutations to consider, with the number of permutations growing factorially in the dimension. Fortunately, the range of log-likelihood values is only two to three percent of the mean log-likelihood value in each scenario. As such, there appears to be little variation in the optimal outcome once the algorithm has converged. This implies that, regardless of computation time, the algorithm converges as expected according to Theorem 3.

We finally note that the results from all three numerical simulation studies are dependent on the specific performances of the subroutines, algorithms, and hardware. Thus, the results may vary when conducting performance tests under different settings. As such, we believe that this simulation study serves to demonstrate the potential computational performance in R, rather than the performance in all settings.

6 Conclusions

In this article, we introduced the LRC of the GMM and showed that there is a mapping between the LRC parameters and the natural parameters of a GMM. Using the LRC, we devised an MM algorithm for ML estimation that does not depend on matrix operations. We then proved that the MM algorithm monotonically increases the log-likelihood and that, under regularity conditions, the sequence of estimators obtained from the algorithm converges to a stationary point of the log-likelihood function. Through simulations, we demonstrated that the MM algorithm for computing the LRC parameter estimates was faster than the traditional EM algorithm for estimating GMMs in some large data situations, when both algorithms are implemented in the R programming environment. We also showed that although the ordering of the variables may have a significant effect on the computational times of the MM algorithm, there appears to be little effect on the ability of the algorithm to converge to an appropriate limit point.

We also proved that the ML estimators of the LRC parameters, like those of the natural parameters, are also consistent and asymptotically normal. This allows for asymptotically valid statistical inference, such as using the LRC of the GMM for clustering data. Furthermore, we showed that the LRC allows for a simple method for handling singularities in the ML estimation of GMM parameters.

To the best of our knowledge, we are the first to apply the LRC for constructing a matrix operation-free algorithm for estimating GMMs. In the future, we hope to extend our matrix operation-free approach to the ML estimation of mixtures of \(t\hbox {-distributions}\), as well as skew variants of the GMM.