1 Introduction

The exponential family of distributions is an important class of probabilistic models with numerous applications in statistics and machine learning. The exponential family contains many of the most commonly used univariate distributions, including the Bernoulli, binomial, gamma, geometric, inverse Gaussian, log-normal, Poisson, and Rayleigh distributions, as well as multivariate distributions such as the Dirichlet, multinomial, multivariate normal, von Mises, and Wishart distributions. See Forbes et al. (2011, Ch. 18), DasGupta (2011, Ch. 18), and Amari (2016, Ch. 2).

Let \({\varvec{Y}}^{\top }=\left( Y_{1},\dots ,Y_{d}\right) \) be a random variable (with realization \({\varvec{y}}\)) on the support \(\mathbb {Y}\subseteq \mathbb {R}^{d}\) (\(d\in \mathbb {N}\)), arising from a data generating process (DGP) with probability density/mass function (PDF/PMF) \(f\left( {\varvec{y}};{\varvec{\theta }}\right) \) that is characterized by some parameter vector \({\varvec{\theta }}\in \Theta \subseteq \mathbb {R}^{p}\) (\(p\in \mathbb {N}\)). We say that the distribution that characterizes the DGP of \({\varvec{Y}}\) is in the exponential family class if the PDF/PMF can be written in the form

$$\begin{aligned} f\left( {\varvec{y}};{\varvec{\theta }}\right) =h\left( {\varvec{y}}\right) \exp \left\{ \left[ {\varvec{s}}\left( {\varvec{y}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\theta }}\right) -\psi \left( {\varvec{\theta }}\right) \right\} , \end{aligned}$$
(1)

where \({\varvec{s}}\left( \cdot \right) \) and \({\varvec{\phi }}\left( \cdot \right) \) are p-dimensional vector functions, and \(h\left( \cdot \right) \) and \(\psi \left( \cdot \right) \) are \(1\text {-dimensional}\) functions of \({\varvec{y}}\) and \({\varvec{\theta }}\), respectively. If the dimensionality of \({\varvec{s}}\left( \cdot \right) \) and \({\varvec{\phi }}\left( \cdot \right) \) is less than p, then we say that the distribution that characterizes the DGP of \({\varvec{Y}}\) is in the curved exponential class.

Let \(Z\in \left[ g\right] \) (\(\left[ g\right] =\left\{ 1,\dots ,g\right\} \); \(g\in \mathbb {N}\)) be a latent random variable, and write \({\varvec{X}}^{\top }=\left( {\varvec{Y}}^{\top },Z\right) \). Suppose that the PDF/PMF of \(\left\{ {\varvec{Y}}={\varvec{y}}|Z=z\right\} \) can be written as \(f\left( {\varvec{y}};{\varvec{\omega }}_{z}\right) \), for each \(z\in \left[ g\right] \). If we assume that \(\mathbb {P}\left( Z=z\right) =\pi _{z}>0\), such that \(\sum _{z=1}^{g}\pi _{z}=1\), then we can write the marginal PDF/PMF of \({\varvec{Y}}\) in the form

$$\begin{aligned} f\left( {\varvec{y}};{\varvec{\theta }}\right) =\sum _{z=1}^{g}\pi _{z}f\left( {\varvec{y}}; {\varvec{\omega }}_{z}\right) , \end{aligned}$$
(2)

where we put the unique elements of \(\pi _{z}\) and \({\varvec{\omega }}_{z}\) into \({\varvec{\theta }}\). We call \(f\left( {\varvec{y}};{\varvec{\theta }}\right) \) the g-component finite mixture PDF, and we call \(f\left( {\varvec{y}};{\varvec{\omega }}_{z}\right) \) the \(z\text {th}\) component PDF, characterized by the parameter vector \({\varvec{\omega }}_{z}\in \Omega \), where \(\Omega \) is some subset of a real product space. We also say that the elements \(\pi _{z}\) are prior probabilities, corresponding to the respective component.

The most common finite mixture models are mixtures of normal distributions, which were popularized by Pearson (1894) and have been used prolifically by numerous authors (cf. McLachlan et al. 2019). The g-component d-dimensional normal mixture model has PDF of the form

$$\begin{aligned} f\left( {\varvec{y}};{\varvec{\theta }}\right) =\sum _{z=1}^{g}\pi _{z}\varphi \left( {\varvec{y}}; {\varvec{\mu }}_{z},{\varvec{\Sigma }}_{z}\right) , \end{aligned}$$
(3)

where the normal PDFs

$$\begin{aligned} \varphi \left( {\varvec{y}};{\varvec{\mu }}_{z},{\varvec{\Sigma }}_{z}\right) =\left| 2\pi {\varvec{\Sigma }}_{z}\right| ^{-1/2}\exp \left[ -\frac{1}{2}\left( {\varvec{y}}-{\varvec{\mu }}_{z}\right) ^{\top }{\varvec{\Sigma }}_{z}^{-1}\left( {\varvec{y}}-{\varvec{\mu }}_{z}\right) \right] , \end{aligned}$$
(4)

replace the component densities \(f\left( {\varvec{y}};{\varvec{\omega }}_{z}\right) \), in (2). Each component PDF (4) is parameterized by a mean vector \({\varvec{\mu }}_{z}\in \mathbb {R}^{d}\) and a positive-definite symmetric covariance matrix \({\varvec{\Sigma }}_{z}\in \mathbb {R}^{d\times d}\). We then put each \(\pi _{z}\), \({\varvec{\mu }}_{z}\), and \({\varvec{\Sigma }}_{z}\) into the vector \({\varvec{\theta }}\).

As noted earlier, the normal distribution is a member of the exponential family, and thus (4) can be written in the form (1). This can be observed by putting the unique elements of \({\varvec{\mu }}_{z}\) and \({\varvec{\Sigma }}_{z}\) into \({\varvec{\omega }}_{z}\), and writing \(\varphi \left( {\varvec{y}};{\varvec{\mu }}_{z},{\varvec{\Sigma }}_{z}\right) = f\left( {\varvec{y}};{\varvec{\omega }}_{z}\right) \) in the form (1), with mappings

$$\begin{aligned} h\left( {\varvec{y}}\right)= & {} \left( 2\pi \right) ^{-d/2}, {\varvec{s}}\left( {\varvec{y}}\right) =\left[ \begin{array}{c} {\varvec{y}}\\ \text {vec}({\varvec{y}}{\varvec{y}}^{\top }) \end{array}\right] , \nonumber \\ {\varvec{\phi }}\left( {\varvec{\omega }}_{z}\right)= & {} \left[ \begin{array}{c} {\varvec{\Sigma }}_{z}^{-1}{\varvec{\mu }}_{z}\\ -\frac{1}{2}\text {vec}({\varvec{\Sigma }}_{z}^{-1}) \end{array}\right] \text {, and} \end{aligned}$$
(5)
$$\begin{aligned} \psi \left( {\varvec{\omega }}_{z}\right)= & {} \frac{1}{2}{\varvec{\mu }}_{z}^{\top } {\varvec{\Sigma }}_{z}^{-1}{\varvec{\mu }}_{z}+\frac{1}{2}\log \left| {\varvec{\Sigma }}_{z}\right| \text {.} \end{aligned}$$
(6)
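As a quick numerical sanity check of the mappings (5) and (6), the following base-R snippet (illustrative only; not taken from the repository referenced in Sect. 4) compares the normal density (4) with its exponential family representation (1) at randomly generated values of \({\varvec{y}}\), \({\varvec{\mu }}_{z}\), and \({\varvec{\Sigma }}_{z}\).

```r
# Numerical check that (1) with the mappings (5)-(6) reproduces the normal
# density (4); base R only, illustrative values.
set.seed(1)
d <- 3
y <- rnorm(d)
mu <- rnorm(d)
A <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(A) + diag(d)                  # positive-definite covariance

# Left-hand side: the normal PDF (4).
lhs <- det(2 * pi * Sigma)^(-1/2) *
  exp(-0.5 * drop(t(y - mu) %*% solve(Sigma) %*% (y - mu)))

# Right-hand side: h(y) exp{s(y)' phi(omega) - psi(omega)}, as in (1).
h   <- (2 * pi)^(-d / 2)
s   <- c(y, as.vector(tcrossprod(y)))                       # s(y)
phi <- c(solve(Sigma, mu), -0.5 * as.vector(solve(Sigma)))  # phi(omega_z)
psi <- 0.5 * drop(t(mu) %*% solve(Sigma) %*% mu) + 0.5 * log(det(Sigma))
rhs <- h * exp(sum(s * phi) - psi)

all.equal(lhs, rhs)   # TRUE
```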

When conducting data analysis using a normal mixture model, one generally observes an independent and identically distributed (IID) sequence of \(n\in \mathbb {N}\) observations \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), arising from a DGP that is hypothesized to be characterized by a PDF of the form (3), with unknown parameter vector \({\varvec{\theta }}={\varvec{\theta }}_{0}\). The inferential task is to estimate \({\varvec{\theta }}_{0}\) via some estimator that is computed from \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\). The most common computational approach to obtaining an estimator of \({\varvec{\theta }}_{0}\) is via maximum likelihood (ML) estimation, using the expectation–maximization algorithm (EM; Dempster et al. 1977). See McLachlan and Peel (2000, Ch. 3.2) for a description of the normal mixture EM algorithm. Generally, when g, d, and n are of small to moderate size, the conventional EM approach is feasible and is able to perform the task of ML estimation in a timely manner. Unfortunately, due to its high memory demands, costly matrix operations (Nguyen and McLachlan 2015), and slow convergence rates (McLachlan and Krishnan 2008, Sec. 3.9), the conventional EM algorithm is not suited to the computational demands of analyzing increasingly large data sets, such as those that could be considered as big data in volumes such as Bühlmann et al. (2016), Han et al. (2017), and Härdle et al. (2018).

Over the years, numerous algorithms have been proposed as means to alleviate the computational demands of the EM algorithm for normal mixture models. Some of such approaches include the component-wise algorithm of Celeux et al. (2001), the greedy algorithm of Vlassis and Likas (2002), the sparse and incremental kd-tree algorithm of Ng and McLachlan (2004), the subspace projection algorithm of Bouveyron et al. (2007), and the matrix operations-free algorithm of Nguyen and McLachlan (2015).

There has been a recent resurgence in stochastic approximation algorithms, of the Robbins and Monro (1951) and Kiefer and Wolfowitz (1952) type, developed for the purpose of solving computationally challenging optimization problems, such as the ML estimation of normal mixture models. A good review of the current literature can be found in Chau and Fu (2015). Naïve and direct applications of the stochastic approximation approach to mixture model estimation can be found in Liang and Zhang (2008), Zhang and Liang (2008), and Nguyen and Jones (2018).

Following a remark from Cappé and Moulines (2009) regarding possible extensions of the online EM algorithm, we propose mini-batch EM algorithms for the ML estimation of exponential family mixture models. These algorithms include a number of variants, among which are update truncation variants that have not previously been made explicit. Using the theorems from Cappé and Moulines (2009), we state results regarding the convergence of our algorithms. We then specialize our attention to the important case of normal mixture models, and demonstrate that the required assumptions for convergence are met in that scenario.

A thorough numerical study is conducted in order to assess the performance of our normal mixture mini-batch algorithms. Comparisons are drawn between our algorithms and the usual batch EM algorithm for ML estimation of normal mixture models. We show that our mini-batch algorithms can be applied to very large data sets by demonstrating their applicability to the ML estimation of normal mixture models on the famous MNIST data of LeCun et al. (1998).

References regarding mixtures of exponential family distributions and EM-type stochastic approximation algorithms, and comments regarding some recent related literature are relegated to the Supplementary Materials, in the interest of brevity. Additional remarks, numerical results, and derivations are also included in these Supplementary Materials in order to provide extra context and further demonstrate the capabilities of the described framework. These demonstrations include the derivation of mini-batch EM algorithms for mixtures of exponential and Poisson distributions. The Supplementary Materials can be found at https://github.com/hiendn/StoEMMIX/blob/master/Manuscript_files/SupplementaryMaterials.pdf.

The remainder of the paper is organized as follows. In Sect. 2, we present the general results of Cappé and Moulines (2009) and demonstrate how they can be used for mini-batch ML estimation of exponential family mixture models. In Sect. 3, we derive the mini-batch EM algorithms for the ML estimation of normal mixtures, and verify the convergence of the algorithms using the results of Cappé and Moulines (2009). Via numerical simulations, we compare the performance of our mini-batch algorithms to the usual EM algorithm for ML estimation of normal mixture models, in Sect. 4. A real-data study on a very large data set is presented in Sect. 5. Conclusions are drawn in Sect. 6. Additional material, such as mini-batch EM algorithms for exponential and Poisson mixture models, can be found in the Supplementary Materials.

2 The mini-batch EM algorithm

Suppose that we observe a single pair of random variables \({\varvec{X}}^{\top }=\left( {\varvec{Y}}^{\top },{\varvec{Z}}^{\top }\right) \), where \({\varvec{Y}}\in \mathbb {Y}\) is observed but \({\varvec{Z}}\in \mathbb {L}\) is latent, where \(\mathbb {Y}\) and \(\mathbb {L}\) are subsets of multivariate real-valued spaces. Furthermore, suppose that the marginal PDF/PMF of \({\varvec{Y}}\) is hypothesized to be of the form \(f\left( {\varvec{y}};{\varvec{\theta }}_{0}\right) \), for some unknown parameter vector \({\varvec{\theta }}_{0}\in \Theta \subseteq \mathbb {R}^{p}\). A good estimator for \({\varvec{\theta }}_{0}\) is the ML estimator \(\hat{{\varvec{\theta }}}\) that can be defined as:

$$\begin{aligned} \hat{{\varvec{\theta }}}\in \left\{ \hat{{\varvec{\theta }}}:\log f\left( {\varvec{Y}};\hat{{\varvec{\theta }}}\right) =\max _{{\varvec{\theta }}\in \Theta }\log f\left( {\varvec{Y}};{\varvec{\theta }}\right) \right\} \text {.} \end{aligned}$$
(7)

When the problem (7) cannot be solved in a simple manner (e.g., when the solution does not exist in closed form), one may seek to employ an iterative scheme in order to obtain an ML estimator. If the joint PDF/PMF of \({\varvec{X}}\) is known, then one can often construct an EM algorithm in order to solve the problem in the bracket of (7).

Start with some initial guess for \({\varvec{\theta }}_{0}\), call it the zeroth iterate \({\varvec{\theta }}^{\left( 0\right) }\) of the EM algorithm, and suppose that we can write the joint PDF/PMF of \({\varvec{X}}\) as \(f\left( {\varvec{y}},{\varvec{z}};{\varvec{\theta }}\right) \), for any \({\varvec{\theta }}\). At the \(r\text {th}\) iterate of the EM algorithm, we perform an expectation (E-) step, followed by a maximization (M-) step. The \(r\text {th}\) E-step consists of obtaining the conditional expectation of the complete-data log-likelihood (i.e., \(\log f\left( {\varvec{y}},{\varvec{z}};{\varvec{\theta }}\right) \)) given the observed data, using the current estimate of the parameter vector

$$\begin{aligned} Q\left( {\varvec{\theta }};{\varvec{\theta }}^{\left( r-1\right) }\right) =\mathbb {E}_{ {\varvec{\theta }}^{\left( r-1\right) }}\left[ \log f\left( {\varvec{y}},{\varvec{Z}};{\varvec{\theta }}\right) |{\varvec{Y}}={\varvec{y}}\right] , \end{aligned}$$

which we will call the conditional expected complete-data log-likelihood.

Upon obtaining the conditional expectation of the complete-data log-likelihood, one then conducts the \(r\text {th}\) M-step by solving the problem

$$\begin{aligned} {\varvec{\theta }}^{\left( r\right) }=\arg \underset{{\varvec{\theta }}\in \Theta }{\max }\text { }Q\left( {\varvec{\theta }};{\varvec{\theta }}^{\left( r-1\right) }\right) \text {.} \end{aligned}$$

The E- and M-steps are repeated until some stopping criterion is met. Upon termination, the final iterate of the algorithm is taken as a solution for problem (7). See McLachlan and Krishnan (2008) for a thorough exposition regarding the EM algorithm.

2.1 The online EM algorithm

Suppose that we observe a sequence of n IID replicates of the variable \({\varvec{Y}}\), \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), where each \({\varvec{Y}}_{i}\) is the visible component of the pair \({\varvec{X}}_{i}=\left( {\varvec{Y}}_{i}^{\top },{\varvec{Z}}_{i}^{\top }\right) \) (\(i\in \left[ n\right] \)). In the online learning context, each of the observations from \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\) is observed one at a time, in sequential order.

Using the sequentially obtained sequence \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), we wish to obtain an ML estimator for the parameter vector \({\varvec{\theta }}_{0}\), in the same sense as in (7). In order to construct an online EM algorithm framework with provable convergence, Cappé and Moulines (2009) assume the following restrictions regarding the nature of the hypothesized DGP of \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\).

  1. A1

    The complete-data likelihood corresponding to the pair \({\varvec{X}}\) is of exponential family form. That is,

    $$\begin{aligned} f\left( {\varvec{x}};{\varvec{\theta }}\right) =h\left( {\varvec{x}}\right) \exp \left\{ \left[ {\varvec{s}}\left( {\varvec{x}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\theta }}\right) -\psi \left( {\varvec{\theta }}\right) \right\} , \end{aligned}$$
    (8)

    where \(h\left( \cdot \right) \), \(\psi \left( \cdot \right) \), \({\varvec{s}}\left( \cdot \right) \), and \({\varvec{\phi }}\left( \cdot \right) \) are as defined for (1).

  2. A2

    The function

    $$\begin{aligned} \bar{{\varvec{s}}}\left( {\varvec{y}};{\varvec{\theta }}\right) =\mathbb {E}_{{\varvec{\theta }}} \left[ {\varvec{s}}\left( {\varvec{X}}\right) |{\varvec{Y}}={\varvec{y}}\right] \end{aligned}$$
    (9)

    is well defined for all \({\varvec{y}}\in \mathbb {Y}\) and \({\varvec{\theta }}\in \Theta \).

  3. A3

    There exists a convex open subset \(\mathbb {S}\subseteq \mathbb {R}^{p}\), which satisfies the properties that:

    1. (i)

      for all \({\varvec{s}}\in \mathbb {S}\), \({\varvec{y}}\in \mathbb {Y}\), \({\varvec{\theta }}\in \Theta \), \(\left( 1-\gamma \right) {\varvec{s}}+\gamma \bar{{\varvec{s}}}\left( {\varvec{y}};{\varvec{\theta }}\right) \in \mathbb {S}\) for any \(\gamma \in \left( 0,1\right) \), and

    2. (ii)

      for any \({\varvec{s}}\in \mathbb {S}\), the function

      $$\begin{aligned} q\left( {\varvec{s}};{\varvec{\theta }}\right) ={\varvec{s}}^{\top }{\varvec{\phi }}\left( {\varvec{\theta }}\right) -\psi \left( {\varvec{\theta }}\right) \end{aligned}$$

      has a unique global maximum over \(\Theta \), which will be denoted by

      $$\begin{aligned} \bar{{\varvec{\theta }}}\left( {\varvec{s}}\right) =\arg \underset{{\varvec{\theta }}\in \Theta }{\max }\;q\left( {\varvec{s}};{\varvec{\theta }}\right) \text {.} \end{aligned}$$

Let \(Q_{n}\left( {\varvec{\theta }};{\varvec{\theta }}^{\left( r-1\right) }\right) \) be the expected complete-data log-likelihood over data \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), at the \(r\text {th}\) E-step of an EM algorithm for solving the problem:

$$\begin{aligned} \hat{{\varvec{\theta }}}_{n}\in \left\{ \hat{{\varvec{\theta }}}:n^{-1}\sum _{i=1}^{n}\log f\left( {\varvec{Y}}_{i};\hat{{\varvec{\theta }}}\right) =\max _{{\varvec{\theta }}\in \Theta }n^{-1}\sum _{i=1}^{n}\log f\left( {\varvec{Y}}_{i};{\varvec{\theta }}\right) \right\} , \end{aligned}$$

where we say that \(\hat{{\varvec{\theta }}}_{n}\) is the ML estimator, based on the data \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\). When A1–A3 are satisfied, we can write

$$\begin{aligned} Q_{n}\left( {\varvec{\theta }};{\varvec{\theta }}^{\left( r-1\right) }\right) =nq\left( n^{-1}\sum _{i=1}^{n}\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{ \left( r-1\right) }\right) ;{\varvec{\theta }}\right) +\text {Constant}, \end{aligned}$$

which can then be maximized, with respect to \({\varvec{\theta }}\), in order to yield an M-step update of the form:

$$\begin{aligned} {\varvec{\theta }}^{\left( r\right) }=\bar{{\varvec{\theta }}}\left( n^{-1}\sum _{i=1}^{n} \bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( r-1\right) }\right) \right) , \end{aligned}$$
(10)

where the update \({\varvec{\theta }}^{\left( r\right) }\) depends on the data only through the average \(n^{-1}\sum _{i=1}^{n}\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( r-1\right) }\right) \).
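In code, the batch update (10) amounts to averaging the conditional expected sufficient statistics (9) and applying the map \(\bar{{\varvec{\theta }}}\). The following R sketch assumes hypothetical user-supplied helpers s_bar(y, theta), implementing (9), and theta_bar(s), implementing \(\bar{{\varvec{\theta }}}\); it is an illustration rather than the implementation used for the experiments.

```r
# One batch EM iteration in the form (10); s_bar and theta_bar are assumed,
# user-supplied implementations of (9) and of the map theta-bar, respectively.
batch_em_step <- function(Y, theta, s_bar, theta_bar) {
  # Average the conditional expected sufficient statistics over all rows of Y.
  s_avg <- rowMeans(apply(Y, 1, function(y) s_bar(y, theta)))
  # Map the averaged statistic back to the parameter space.
  theta_bar(s_avg)
}
```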

Now we suppose that we sample the individual observations of \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), one at a time and sequentially. Furthermore, upon observation of \({\varvec{Y}}_{i}\), we wish to compute an online estimate of \({\varvec{\theta }}_{0}\), which we denote as \({\varvec{\theta }}^{\left( i\right) }\). Based on the simplification of the EM algorithm under A1–A3, as described above, Cappé and Moulines (2009) proposed the following online EM algorithm.

Upon observation of \({\varvec{Y}}_{i}\), compute the intermediate updated sufficient statistic

$$\begin{aligned} {\varvec{s}}^{\left( i\right) }={\varvec{s}}^{\left( i-1\right) }+\gamma _{i} \left[ \bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( i-1\right) }\right) - {\varvec{s}}^{\left( i-1\right) }\right] , \end{aligned}$$
(11)

with \({\varvec{s}}^{\left( 0\right) }=\bar{{\varvec{s}}}\left( {\varvec{Y}}_{1};{\varvec{\theta }}^{\left( 0\right) } \right) \). Here, \(\gamma _{i}\) is the \(i\text {th}\) term of the learning rate sequence that we will discuss in further detail in the sequel. Observe that we can also write

$$\begin{aligned} {\varvec{s}}^{\left( i\right) }=\gamma _{i}\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i}; {\varvec{\theta }}^{\left( i-1\right) }\right) +\left( 1-\gamma _{i}\right) {\varvec{s}}^{\left( i-1\right) }, \end{aligned}$$

which makes it clear that for \(\gamma _{i}\in \left( 0,1\right) \), \({\varvec{s}}^{\left( i\right) }\) is a weighted average between \(\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( i-1\right) }\right) \) and \({\varvec{s}}^{\left( i-1\right) }\). Using \({\varvec{s}}^{\left( i\right) }\) and the function \(\bar{{\varvec{\theta }}}\), we can then express the \(i\text {th}\) iteration online EM estimate of \({\varvec{\theta }}_{0}\) as

$$\begin{aligned} {\varvec{\theta }}^{\left( i\right) }=\bar{{\varvec{\theta }}} \left( {\varvec{s}}^{\left( i\right) }\right) \text {.} \end{aligned}$$
(12)
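The recursion defined by (11) and (12) has a direct translation into code. The sketch below reuses the assumed s_bar and theta_bar helpers from the previous sketch and processes the rows of Y one at a time; the default learning rate is of the form \(\gamma _{i}=\gamma _{0}i^{-0.6}\) discussed in Sect. 2.3. It is a minimal sketch, not the authors' implementation.

```r
# A sketch of the online EM recursion (11)-(12), with assumed helpers s_bar
# and theta_bar as in the previous sketch.
online_em <- function(Y, theta0, s_bar, theta_bar,
                      gamma = function(i) 0.9 * i^(-0.6)) {
  theta <- theta0
  s <- s_bar(Y[1, ], theta0)                        # s^(0), as above
  for (i in seq_len(nrow(Y))) {
    s <- s + gamma(i) * (s_bar(Y[i, ], theta) - s)  # stochastic E-step (11)
    theta <- theta_bar(s)                           # M-step map (12)
  }
  theta
}
```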

Next, we state a consistency theorem that strongly motivates the use of the online EM algorithm, defined by (11) and (12). Suppose that the true DGP that generates each \({\varvec{Y}}_{i}\) of \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\) is characterized by the probability measure \(F_{0}\). Write the expectation operator with respect to this measure as \(\mathbb {E}_{F_{0}}\). In order to state the consistency result of Cappé and Moulines (2009), we require the following additional set of assumptions.

  1. B1

    The parameter space \(\Theta \) is a convex and open subset of a real product space, and the functions \({\varvec{\phi }}\) and \(\psi \), in (8), are both twice continuously differentiable with respect to \({\varvec{\theta }}\in \Theta \).

  2. B2

    The function \(\bar{{\varvec{\theta }}}\), as defined in (10), is a continuously differentiable function with respect to \({\varvec{s}}\in \mathbb {S}\), where \(\mathbb {S}\) is as defined in A3.

  3. B3

    For some \(p>2\), and all compact \(\mathbb {K}\subset \mathbb {S}\),

    $$\begin{aligned} \sup _{{\varvec{s}}\in \mathbb {K}}\;\mathbb {E}_{F_{0}}\left[ \left| \bar{{\varvec{s}}}\left( {\varvec{Y}}; \bar{{\varvec{\theta }}}\left( {\varvec{s}}\right) \right) \right| ^{p}\right] <\infty \text {.} \end{aligned}$$
    (13)

As the algorithm defined by (11) and (12) is of the Robbins–Monro type, establishment of convergence of the algorithm requires the definition of a mean field (see Chen 2003 and Kushner and Yin 2003 for comprehensive treatments regarding such algorithms). In the case of the online EM algorithm, we write the mean field as

$$\begin{aligned} {\varvec{h}}\left( {\varvec{s}}\right) =\mathbb {E}_{F_{0}}\left[ \bar{{\varvec{s}}} \left( {\varvec{Y}};\bar{{\varvec{\theta }}}\left( {\varvec{s}}\right) \right) \right] -{\varvec{s}} \end{aligned}$$

and define the set of its roots as \(\Gamma =\left\{ {\varvec{s}}\in \mathbb {S}:{\varvec{h}}\left( {\varvec{s}}\right) ={\varvec{0}}\right\} \).

Define the expected log-likelihood of the hypothesized PDF \(f\left( \cdot ;{\varvec{\theta }}\right) \), with respect to the measure \(F_{0}\), as

$$\begin{aligned} \ell \left( f\left( \cdot ;{\varvec{\theta }}\right) \right) =\mathbb {E}_{F_{0}}\left[ \log f\left( {\varvec{Y}};{\varvec{\theta }}\right) \right] \text {.} \end{aligned}$$

Let \(\nabla _{{\varvec{\theta }}}\) denote the gradient with respect to \({\varvec{\theta }}\), and define the sets

$$\begin{aligned} \mathbb {W}_{\Gamma }=\left\{ \ell \left( f\left( \cdot ;{\varvec{\theta }}\right) \right) :{\varvec{\theta }}=\bar{{\varvec{\theta }}} \left( {\varvec{s}}\right) \text {, }{\varvec{s}}\in \Gamma \right\} \end{aligned}$$

and

$$\begin{aligned} \mathbb {M}_{\Theta }=\left\{ \hat{{\varvec{\theta }}}\in \Theta :\nabla _{{\varvec{\theta }}} \left. \ell \left( f\left( \cdot ;{\varvec{\theta }}\right) \right) \right| _{{\varvec{\theta }} =\hat{{\varvec{\theta }}}}={\varvec{0}}\right\} \text {.} \end{aligned}$$

Note that \(\mathbb {M}_{\Theta }\) is the set of stationary points of the log-likelihood function. Further, define the distance between a real vector \({\varvec{a}}\) and a set \(\mathbb {B}\) by

$$\begin{aligned} \text {dist}\left( {\varvec{a}},\mathbb {B}\right) =\inf _{{\varvec{b}}\in \mathbb {B}}\left\| {\varvec{a}}-{\varvec{b}}\right\| , \end{aligned}$$

where \(\left\| \cdot \right\| \) is the usual Euclidean norm, and denote the complement of a subset \(\mathbb {A}\) of a real product space by \(\mathbb {A}^{c}\). Finally, we make the following assumptions.

  1. C1

    The sequence of learning rates \(\left\{ \gamma _{i}\right\} _{i=1}^{\infty }\) fulfills the conditions that \(0<\gamma _i<1\), for each i,

    $$\begin{aligned} \sum _{i=1}^{\infty }\gamma _{i}=\infty \text {, and }\sum _{i=1}^{\infty }\gamma _{i}^{2}<\infty \text {.} \end{aligned}$$
  2. C2

    At initialization \({\varvec{s}}^{\left( 0\right) }\in \mathbb {S}\) and, with probability 1,

$$\begin{aligned} \limsup _{i\rightarrow \infty }\left| {\varvec{s}}^{\left( i\right) }\right| <\infty \text {, and }\liminf _{i\rightarrow \infty } \text {dist}\left( {\varvec{s}}^{\left( i\right) },\mathbb {S}^{c}\right) >0\text {.} \end{aligned}$$
  3. C3

    The set \(\mathbb {W}_{\Gamma }\) is nowhere dense.

Theorem 1

(Cappé and Moulines 2009) Assume that A1–A3, B1–B3, and C1–C3 are satisfied, and let \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{\infty }\) be an IID sample with DGP characterized by the PDF \(f_{0}\), which is hypothesized to have the form \(f\left( \cdot ;{\varvec{\theta }}\right) \), as in (8). Further, let \(\left\{ {\varvec{s}}^{\left( i\right) }\right\} _{i=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( i\right) }\right\} _{i=1}^{\infty }\) be sequences generated by the online EM algorithm, defined by (11) and (12). Then, with probability 1,

$$\begin{aligned} \lim _{i\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( i\right) },\Gamma \right) = 0\text {, and }\lim _{i\rightarrow \infty } \mathrm{dist}\left( {\varvec{\theta }}^{\left( i\right) },\mathbb {M}_{\Theta }\right) = 0\text {.} \end{aligned}$$

Notice that this result allows for a mismatch between the true probability measure \(F_{0}\) and the assumed pseudo-true family \(f\left( \cdot ;{\varvec{\theta }}\right) \) from which \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{\infty }\) is hypothesized to arise. This therefore allows for misspecification, in the sense of White (1982), which is almost certain to occur in the modeling of any sufficiently complex data. In any case, the online EM algorithm will converge toward an estimate of the parameter vector \({\varvec{\theta }}\), which is in the set \(\mathbb {M}_{\Theta }\). When the DGP can be characterized by a density in the family of the form \(f\left( \cdot ;{\varvec{\theta }}\right) \), we observe that \(\mathbb {M}_{\Theta }\) contains not only the global maximizer of the log-likelihood function, but also local maximizers, minimizers, and saddle points. Thus, the online algorithm suffers from the same lack of strong convergence guarantees, as the batch EM algorithm (cf. Wu 1983).

In the case of misspecification, the set \(\mathbb {M}_{\Theta }\) will include the parameter vector \({\varvec{\theta }}_{0}\) that maximizes the log-likelihood function, with respect to the true probability measure \(F_{0}\). However, as in the well-specified case, it will also include stationary points of other types. We further provide characterizations of the sets \(\mathbb {W}_{\Gamma }\) and \(\mathbb {M}_{\Theta }\) in terms of the Kullback–Leibler divergence (KL; Kullback and Leibler 1951) in the Supplementary Materials.

Assumption C1 can be fulfilled by taking sequences \(\left\{ \gamma _{i}\right\} _{i=1}^{\infty }\) of the form \(\gamma _{i}=\gamma _{0}i^{-\alpha }\), for some \(\alpha \in \left( 1/2,1\right] \) and \(\gamma _{0}\in \left( 0,1\right) \). We shall discuss this point further in the sequel. Although the majority of the assumptions can be verified or are fulfilled by construction, the two limits in C2 stand out as being particularly difficult to verify. In Cappé and Moulines (2009), the authors suggest that one method for enforcing C2 is to use the method of update truncation, but they did not provide an explicit scheme for conducting such truncation.

A truncated version of the algorithm defined by (11) and (12) can be specified via the method of Delyon et al. (1999). That is, let \(\left\{ \mathbb {K}_{m}\right\} _{m=0}^{\infty }\) be a sequence of compact sets, such that

$$\begin{aligned} \mathbb {K}_{m}\subset \text {interior}\left( \mathbb {K}_{m+1}\right) \text {, and }\bigcup _{m=0}^{\infty }\mathbb {K}_{m}=\mathbb {S}. \end{aligned}$$
(14)

We then replace (11) and (12) by the following scheme. At the \(i\text {th}\) iteration, firstly compute

$$\begin{aligned} \tilde{{\varvec{s}}}^{\left( i\right) }={\varvec{s}}^{\left( i-1\right) }+ \gamma _{i}\left[ \bar{{\varvec{s}}}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( i-1\right) }\right) -{\varvec{s}}^{\left( i-1\right) }\right] \text {.} \end{aligned}$$
(15)

Secondly,

$$\begin{aligned} \text {if }\tilde{{\varvec{s}}}^{\left( i\right) }\in \mathbb {K}_{m_{i-1}}\text {, then set }{\varvec{s}}^{\left( i\right) }= & {} \tilde{{\varvec{s}}}^{\left( i\right) }, {\varvec{\theta }}^{\left( i\right) }=\bar{{\varvec{\theta }}}\left( {\varvec{s}}^{\left( i\right) }\right) ,\nonumber \\&\text { and }m_{i}=m_{i-1}, \end{aligned}$$
(16)

else

$$\begin{aligned} \text {if }\tilde{{\varvec{s}}}^{\left( i\right) }\notin \mathbb {K}_{m_{i-1}}\text {, then set }{\varvec{s}}^{\left( i\right) }= & {} {\varvec{S}}_{i}, {\varvec{\theta }}^{\left( i\right) }=\bar{{\varvec{\theta }}}\left( {\varvec{S}}_{i}\right) ,\nonumber \\&\text { and }m_{i}=m_{i-1}+1, \end{aligned}$$
(17)

where \(\left\{ {\varvec{S}}_{i}\right\} _{i=1}^{\infty }\) is an arbitrary random sequence, such that \({\varvec{S}}_{i}\in \mathbb {K}_{0}\), for each \(i\in \mathbb {N}\). We have the following result regarding the algorithm defined by (15)–(17).

Proposition 1

Assume that A1–A3, B1–B3, C1 and C3 are satisfied, and let \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{\infty }\) be an IID sample with DGP characterized by the PDF \(f_{0}\), which is hypothesized to have the form \(f\left( \cdot ;{\varvec{\theta }}\right) \), as in (8). Further, let \(\left\{ {\varvec{s}}^{\left( i\right) }\right\} _{i=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( i\right) }\right\} _{i=1}^{\infty }\) be sequences generated by the truncated online EM algorithm, defined by (15)–(17). Then, with probability 1,

$$\begin{aligned} \lim _{i\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( i\right) },\Gamma \right) =0\text {, and }\lim _{i\rightarrow \infty }\mathrm{dist}\left( {\varvec{\theta }}^{\left( i\right) },\mathbb {M}_{\Theta }\right) =0\text {.} \end{aligned}$$

The proof of Proposition 1 requires the establishment of equivalence between A1–A3, B1–B3, C1, and C3, and the many assumptions of Theorems 3 and 6 of Delyon et al. (1999). Thus, the proof is simple and mechanical, but long and tedious. We omit it for the sake of brevity.
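For concreteness, the accept/reset logic of (16) and (17) can be sketched in R as follows. Here in_K(s, m) is an assumed membership test for \(\mathbb {K}_{m}\), S0 plays the role of a draw \({\varvec{S}}_{i}\) from \(\mathbb {K}_{0}\), and theta_bar is the map \(\bar{{\varvec{\theta }}}\); all three names are our own illustrative conventions rather than objects defined in the paper.

```r
# A sketch of the truncation step (16)-(17). The helpers in_K (a membership
# test for K_m), S0 (a fallback point in K_0, playing the role of S_i), and
# theta_bar (the M-step map) are assumed to be supplied by the user.
truncation_step <- function(s_tilde, m_prev, in_K, S0, theta_bar) {
  if (in_K(s_tilde, m_prev)) {
    # (16): accept the proposed statistic and keep the truncation index.
    list(s = s_tilde, theta = theta_bar(s_tilde), m = m_prev)
  } else {
    # (17): reset to the fallback point in K_0 and increment the index.
    list(s = S0, theta = theta_bar(S0), m = m_prev + 1)
  }
}
```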

2.2 The mini-batch algorithm

At the most elementary level, a mini-batch algorithm for computation of a sequence of estimators \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{R}\) for some parameter \({\varvec{\theta }}_{0}\), from some sample \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), where \(R\in \mathbb {N}\), has the following property. The algorithm is iterative, and at the \(r\text {th}\) iteration of the algorithm, the estimator \({\varvec{\theta }}^{\left( r\right) }\) only depends on the previous iterate \({\varvec{\theta }}^{\left( r-1\right) }\) and some subsample, possibly drawn with replacement, of \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\). Typical examples of mini-batch algorithms include the many variants of the stochastic gradient descent class of algorithms; see, for example, Cotter et al. (2011), Li et al. (2014), Zhao et al. (2014), and Ghadimi et al. (2016).

Suppose that we observe a fixed size realization \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) of some IID random sample \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\). Furthermore, fix a so-called batch size \(N\le n\) and a learning rate sequence \(\left\{ \gamma _{r}\right\} _{r=1}^{R}\), and select some appropriate initial values \({\varvec{s}}^{\left( 0\right) }\) and \({\varvec{\theta }}^{\left( 0\right) }\) from which the sequences \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{R}\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{R}\) can be constructed. A mini-batch version of the online EM algorithm, defined by (11) and (12), can be specified as follows. For each \(r\in \left[ R\right] \), sample N observations from \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) uniformly, with replacement, and denote the subsample by \(\left\{ {\varvec{Y}}_{i}^{r}\right\} _{i=1}^{N}\). Then, using \(\left\{ {\varvec{Y}}_{i}^{r}\right\} _{i=1}^{N}\), compute

$$\begin{aligned} {\varvec{s}}^{\left( r\right) }= & {} {\varvec{s}}^{\left( r-1\right) }+\gamma _{r}\left[ N^{-1} \sum _{i=1}^{N}\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i}^{r}; {\varvec{\theta }}^{\left( r-1\right) }\right) -{\varvec{s}}^{\left( r-1\right) }\right] ,\nonumber \\&\quad \text { and }{\varvec{\theta }}^{\left( r\right) }=\bar{{\varvec{\theta }}}\left( {\varvec{s}}^{\left( r\right) }\right) \text {.} \end{aligned}$$
(18)
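A direct sketch of (18), again using the assumed s_bar and theta_bar helpers from the earlier sketches, is given below; the first mini-batch is used to initialize the statistic, matching the initialization described in Sect. 3.2. This is an illustration only, not the implementation used in Sect. 4.

```r
# A sketch of the mini-batch EM iteration (18); s_bar and theta_bar are the
# assumed helpers from the earlier sketches, N is the batch size, R the number
# of iterations, and gamma the learning rate sequence.
minibatch_em <- function(Y, theta0, s_bar, theta_bar, N, R,
                         gamma = function(r) 0.9 * r^(-0.6)) {
  n <- nrow(Y)
  theta <- theta0
  s <- NULL
  for (r in seq_len(R)) {
    idx <- sample.int(n, N, replace = TRUE)  # uniform subsample, with replacement
    s_new <- rowMeans(sapply(idx, function(i) s_bar(Y[i, ], theta)))
    # First iteration: initialize the statistic; afterwards, apply (18).
    s <- if (is.null(s)) s_new else s + gamma(r) * (s_new - s)
    theta <- theta_bar(s)
  }
  theta
}
```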

In order to justify the mini-batch algorithm, we make the following observation. The online EM algorithm, defined by (11) and (12), is designed to obtain a root in the set \(\mathbb {M}_{\Theta }\), which is a vector \(\hat{{\varvec{\theta }}}\in \Theta \) such that

$$\begin{aligned} \nabla _{{\varvec{\theta }}}\left. \ell \left( f\left( \cdot ;{\varvec{\theta }}\right) \right) \right| _{{\varvec{\theta }}=\hat{{\varvec{\theta }}}}={\varvec{0}}. \end{aligned}$$

If \(N=1\) (i.e., the case proposed in Cappé and Moulines 2009, Sec. 2.5), then the DGP for generating subsamples is simply a single draw from the empirical measure:

$$\begin{aligned} F_{\mathrm{Emp}}\left( {\varvec{y}}\right) =\sum _{i=1}^{n}\frac{1}{n}\delta \left( {\varvec{y}} -{\varvec{y}}_{i}\right) , \end{aligned}$$

where \(\delta \) is the Dirac delta function (see, for details, Prosperetti 2011, Ch. 2). We can write

$$\begin{aligned} \ell \left( f\left( \cdot ;{\varvec{\theta }}\right) \right)&=\mathbb {E}_{F_{0}}\left[ \log f\left( {\varvec{Y}};{\varvec{\theta }}\right) \right] \nonumber \\&=\mathbb {E}_{F_{\mathrm{Emp}}}\left[ \log f\left( {\varvec{Y}};{\varvec{\theta }}\right) \right] \nonumber \\&=\frac{1}{n}\sum _{i=1}^{n}\log f\left( {\varvec{y}}_{i};{\varvec{\theta }}\right) , \end{aligned}$$
(19)

which is the log-likelihood function, with respect to the realization \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\), under the density function of form \(f\left( \cdot ;{\varvec{\theta }}\right) \). Thus, in the \(N=1\) case, the algorithm defined by (18) solves for log-likelihood roots \(\hat{{\varvec{\theta }}}\) of the form

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\nabla _{{\varvec{\theta }}}\left. \log f\left( {\varvec{y}}_{i};{\varvec{\theta }}\right) \right| _{{\varvec{\theta }}=\hat{{\varvec{\theta }}}}=\mathbf {0}, \end{aligned}$$

or equivalently, solving for an element in the set

$$\begin{aligned} \mathbb {M}_{\Theta }^{\mathrm{Emp}}=\left\{ \hat{{\varvec{\theta }}}\in \Theta :\sum _{i=1}^{n}\nabla _{{\varvec{\theta }}}\left. \log f\left( {\varvec{y}}_{i};{\varvec{\theta }}\right) \right| _{{\varvec{\theta }}=\hat{{\varvec{\theta }}}}=\mathbf {0}\right\} \text {.} \end{aligned}$$

The \(N>1\) case follows the same argument and is described in the Supplementary Materials (Section 2.2). Let \(F_{\mathrm{Emp}}^{N}\) denote the probability measure corresponding to the DGP of N independent random samples from \(F_{\mathrm{Emp}}\). We have the following result, based on Theorem 1.

Corollary 1

For any \(N\in \mathbb {N}\), assume that A1–A3, B1–B3, and C1–C3 are satisfied (replacing i by r, and \(F_{0}\) by \(F_{\mathrm{Emp}}^{N}\), where appropriate), and let \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) be a realization of some IID random sequence \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), where each \({\varvec{Y}}_{i}\) is hypothesized to arise from a DGP having PDF of the form \(f\left( \cdot ;{\varvec{\theta }}\right) \), as in (8). Let \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{\infty }\) be sequences generated by the mini-batch EM algorithm, defined by (18). Then, with probability 1,

$$\begin{aligned} \lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( r\right) },\Gamma \right) =0\text {, and }\lim _{r\rightarrow \infty } { \mathrm dist}\left( {\varvec{\theta }}^{\left( r\right) },\mathbb {M}_{\Theta }^{\mathrm{Emp}}\right) =0\text {.} \end{aligned}$$

That is, as we take \(R\rightarrow \infty \), the algorithm defined by (18) will identify elements in the sets \(\Gamma \) and \(\mathbb {M}_{\Theta }^{\mathrm{Emp}}\), with probability 1. As with the case of Theorem 1, C2 is again difficult to verify. Let \(\left\{ \mathbb {K}_{m}\right\} _{m=0}^{\infty }\) be as per (14). Then, we replace the algorithm defined via (18) by the following truncated version.

Again, suppose that we observe a fixed size realization \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) of some IID random sample \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\). Furthermore, fix a so-called batch size \(N\le n\) and a learning rate sequence \(\left\{ \gamma _{r}\right\} _{r=1}^{R}\), and select some appropriate initial values \({\varvec{s}}^{\left( 0\right) }\) and \({\varvec{\theta }}^{\left( 0\right) }\) from which the sequences \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{R}\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{R}\) can be constructed. For each \(r\in \left[ R\right] \), sample N observations from \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) uniformly, with replacement, and denote the subsample by \(\left\{ {\varvec{Y}}_{i}^{r}\right\} _{i=1}^{N}\). Using \(\left\{ {\varvec{Y}}_{i}^{r}\right\} _{i=1}^{N}\), compute

$$\begin{aligned} \tilde{{\varvec{s}}}^{\left( r\right) }={\varvec{s}}^{\left( r-1\right) }+\gamma _{r} \left[ N^{-1}\sum _{i=1}^{N}\bar{{\varvec{s}}}\left( {\varvec{Y}}_{i}^{r}; {\varvec{\theta }}^{\left( r-1\right) }\right) -{\varvec{s}}^{\left( r-1\right) }\right] \text {.} \end{aligned}$$
(20)

Then, with i being appropriately replaced by r, use (16) and (17) to compute \({\varvec{s}}^{\left( r\right) }\) and \({\varvec{\theta }}^{\left( r\right) }\). We obtain the following result via an application of Proposition 1.

Corollary 2

For any \(N\in \mathbb {N}\), assume that A1–A3, B1–B3, and C1–C3 are satisfied (replacing i by r, and \(F_{0}\) by \(F_{\mathrm{Emp}}^{N}\), where appropriate), and let \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) be a realization of some IID random sequence \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), where each \({\varvec{Y}}_{i}\) is hypothesized to arise from a DGP having PDF of the form \(f\left( \cdot ;{\varvec{\theta }}\right) \), as in (8). Let \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{\infty }\) be sequences generated by the truncated mini-batch EM algorithm, defined by (20), (16), and (17). Then, with probability 1,

$$\begin{aligned} \lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( r\right) },\Gamma \right) =0\text {, and }\lim _{r\rightarrow \infty } \mathrm{dist}\left( {\varvec{\theta }}^{\left( r\right) },\mathbb {M}_{\Theta }^{\mathrm{Emp}}\right) =0\text {.} \end{aligned}$$

2.3 The learning rate sequence

As previously stated, a good choice for the learning rate sequence \(\left\{ \gamma _{i}\right\} _{i=1}^{\infty }\) is to take \(\gamma _{i}=\gamma _{0}i^{-\alpha }\), for each \(i\in \mathbb {N}\), such that \(\alpha \in (1/2,1]\) and \(\gamma _{0}\in \left( 0,1\right) \). Under the assumptions of Theorem 1, Cappé and Moulines (2009, Thm. 2) showed that this learning rate choice leads to the convergence of the sequence \(\gamma _{0}^{1/2}i^{\alpha /2}\left( {\varvec{\theta }}^{\left( i\right) }-{\varvec{\theta }}_{0}\right) \), in distribution, to a normal distribution with mean \({\varvec{0}}\) and covariance matrix depending on \({\varvec{\theta }}_{0}\), for some \({\varvec{\theta }}_{0}\in \mathbb {M}_{\Theta }\). Here \(\left\{ {\varvec{\theta }}^{\left( i\right) }\right\} _{i=1}^{\infty }\) is a sequence of online EM algorithm iterates, generated by (11) and (12). A similar result can be stated for the truncated online EM, mini-batch EM, and truncated mini-batch EM algorithms, by replacing the relevant indices and quantities in the previous statements by their respective counterparts.

The result above implies that the convergence rate is \({\varvec{\theta }}^{\left( i\right) }-{\varvec{\theta }}_{0}=O_{\text {p}}\left( i^{-\alpha /2}\right) \), for any valid \(\alpha \), where \(O_{\text {p}}\) is the usual order in probability notation (see White 2001, Defn. 2.33). Thus, it would be tempting to take \(\alpha =1\) in order to obtain a rate with the optimal order of \(n^{1/2}\). However, as shown in Cappé and Moulines (2009, Thm. 2), the \(\alpha =1\) case requires constraints on \(\gamma _{0}\) in order to fulfill a stability assumption that is impossible to validate, in practice.

It is, however, still possible to obtain a sequence of estimators that converges to some \({\varvec{\theta }}_{0}\) at a rate with optimal order \(n^{1/2}\). We can do this via the well-known Polyak averaging scheme of Polyak (1990) and Polyak and Juditsky (1992). In the current context, one takes as input the sequence of online EM iterates \(\left\{ {\varvec{\theta }}^{\left( i\right) }\right\} _{i=1}^{\infty }\) and outputs the running average sequence \(\left\{ {\varvec{\theta }}_{\text {A}}^{\left( i\right) }\right\} _{i=1}^{\infty }\), where

$$\begin{aligned} {\varvec{\theta }}_{\text {A}}^{\left( i\right) }=i^{-1}\sum _{j=1}^{i} {\varvec{\theta }}^{\left( j\right) }, \end{aligned}$$
(21)

for each \(i\in \mathbb {N}\). For any \(\alpha \in \left( 1/2,1\right) \), it can be shown that \({\varvec{\theta }}_{\text {A}}^{\left( i\right) }-{\varvec{\theta }}_{0}=O_{\text {p}}\left( i^{-1/2}\right) \). As before, this result also generalizes to the truncated online EM, mini-batch EM, and truncated mini-batch EM algorithms.

We note that the computation of the \(i\text {th}\) running average term (21) does not require the storage of the entire sequence of iterates \(\left\{ {\varvec{\theta }}^{\left( i\right) }\right\} _{i=1}^{\infty }\), as one might anticipate from applying (21) naïvely. One can instead write (21) in the iterative form

$$\begin{aligned} {\varvec{\theta }}_{\text {A}}^{\left( i\right) }=i^{-1}\left[ \left( i-1\right) {\varvec{\theta }}_{\text {A}}^{\left( i-1\right) }+{\varvec{\theta }}^{\left( i\right) }\right] \text {.} \end{aligned}$$
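The equivalence between this iterative form and the running average (21) is easy to check numerically; the scalar example below is purely illustrative.

```r
# Check that the recursion above reproduces the running average (21)
# without storing past iterates; scalar iterates used for illustration.
set.seed(1)
theta <- rnorm(100)                 # stand-in for a sequence of scalar iterates
avg <- numeric(100)
avg[1] <- theta[1]
for (i in 2:100) avg[i] <- ((i - 1) * avg[i - 1] + theta[i]) / i
all.equal(avg, cumsum(theta) / seq_along(theta))   # TRUE
```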

3 Normal mixture models

3.1 Finite mixtures of exponential family distributions

We recall from Sect. 1 that the random variable \({\varvec{Y}}\) is said to arise from a DGP characterized by a \(g\text {-component}\) finite mixture of component PDFs of the form \(f\left( {\varvec{y}};{\varvec{\omega }}_{z}\right) \), if it has a PDF of the form (2). Furthermore, if the component PDFs are of the exponential family form (1), then we can further write the PDF of \({\varvec{Y}}\) as

$$\begin{aligned} f\left( {\varvec{y}};{\varvec{\theta }}\right) =\sum _{z=1}^{g}\pi _{z}h\left( {\varvec{y}}\right) \exp \left\{ \left[ {\varvec{s}}\left( {\varvec{y}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\omega }}_{z}\right) -\psi \left( {\varvec{\omega }}_{z}\right) \right\} \text {.} \end{aligned}$$
(22)

From the construction of the finite mixture model, we have the fact that (22) is the marginalization of the joint density of the random variable \({\varvec{X}}^{\top }=\left( {\varvec{Y}}^{\top },Z\right) \):

$$\begin{aligned} f\left( {\varvec{x}};{\varvec{\theta }}\right) =\prod _{\zeta =1}^{g}\left[ \pi _{\zeta }h\left( {\varvec{y}}\right) \exp \left\{ \left[ {\varvec{s}}\left( {\varvec{y}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\omega }}_{\zeta }\right) -\psi \left( {\varvec{\omega }}_{\zeta }\right) \right\} \right] ^{\llbracket z=\zeta \rrbracket } \end{aligned}$$
(23)

over the random variable \(Z\in \left[ g\right] \), recalling that Z is a categorical random variable with g categories (cf. McLachlan and Peel 2000, Ch. 2). Here, \(\llbracket c\rrbracket \) is the Iverson bracket notation that takes value 1 if condition c is true, and 0 otherwise (Iverson 1967, Ch. 1). We rewrite (23) as follows:

$$\begin{aligned} f\left( {\varvec{x}};{\varvec{\theta }}\right)&=h\left( {\varvec{y}}\right) \exp \left\{ \sum _{\zeta =1}^{g}\llbracket z=\zeta \rrbracket \left[ \log \pi _{\zeta }+\left[ {\varvec{s}}\left( {\varvec{y}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\omega }}_{\zeta }\right) -\psi \left( {\varvec{\omega }}_{\zeta }\right) \right] \right\} \\&=h\left( {\varvec{x}}\right) \exp \left\{ \left[ {\varvec{s}}\left( {\varvec{x}}\right) \right] ^{\top }{\varvec{\phi }}\left( {\varvec{\theta }}\right) -\psi \left( {\varvec{\theta }}\right) \right\} , \end{aligned}$$

where \(h\left( {\varvec{x}}\right) =h\left( {\varvec{y}}\right) \), \(\psi \left( {\varvec{\theta }}\right) =0\),

$$\begin{aligned} {\varvec{s}}\left( {\varvec{x}}\right) =\left[ \begin{array}{c} \llbracket z=1\rrbracket \\ \llbracket z=1\rrbracket {\varvec{s}}\left( {\varvec{y}}\right) \\ \vdots \\ \llbracket z=g\rrbracket \\ \llbracket z=g\rrbracket {\varvec{s}}\left( {\varvec{y}}\right) \end{array}\right] \text {, and }{\varvec{\phi }}\left( {\varvec{\theta }}\right) =\left[ \begin{array}{c} \log \pi _{1}-\psi \left( {\varvec{\omega }}_{1}\right) \\ {\varvec{\phi }}\left( {\varvec{\omega }}_{1}\right) \\ \vdots \\ \log \pi _{g}-\psi \left( {\varvec{\omega }}_{g}\right) \\ {\varvec{\phi }}\left( {\varvec{\omega }}_{g}\right) \end{array}\right] , \end{aligned}$$

and thus obtain the following general result regarding finite mixtures of exponential family distributions.

Proposition 2

The complete-data likelihood of any finite mixture of exponential family distributions with PDF of the form (22) can also be written in the exponential family form (8).

With Proposition 2, we have proved that when applying the online EM or the mini-batch EM algorithm to the problem of conducting ML estimation for any finite mixture model of exponential family distributions, A1 is automatically satisfied.

3.2 Finite mixtures of normal distributions

Recall from Sect. 1 that the random variable \({\varvec{Y}}\) is said to be distributed according to a \(g\text {-component}\) finite mixture of normal distributions, if it is characterized by a PDF of the form (3). Using the exponential family decomposition from (5) and (6), we write the complete-data likelihood of \({\varvec{X}}^{\top }=\left( {\varvec{Y}}^{\top },Z\right) \) in the form (8) by setting \(h\left( {\varvec{x}}\right) =\left( 2\pi \right) ^{-d/2}\), \(\psi \left( {\varvec{\theta }}\right) =0\),

$$\begin{aligned} {\varvec{s}}\left( {\varvec{x}}\right)= & {} \left[ \begin{array}{c} \llbracket z=1\rrbracket \\ \llbracket z=1\rrbracket {\varvec{y}}\\ \llbracket z=1\rrbracket \text {vec}( {\varvec{y}}{\varvec{y}}^{\top })\\ \vdots \\ \llbracket z=g\rrbracket \\ \llbracket z=g\rrbracket {\varvec{y}}\\ \llbracket z=g\rrbracket \text {vec}({\varvec{y}}{\varvec{y}}^{\top }) \end{array}\right] \text {, and }\nonumber \\ {\varvec{\phi }}\left( {\varvec{\theta }}\right)= & {} \left[ \begin{array}{c} \log \pi _{1}-\frac{1}{2}{\varvec{\mu }}_{1}^{\top }{\varvec{\Sigma }}_{1}^{-1}{\varvec{\mu }}_{1}+\frac{1}{2}\log \left| {\varvec{\Sigma }}_{1}\right| \\ {\varvec{\Sigma }}_{1}^{-1}{\varvec{\mu }}_{1}\\ -\frac{1}{2} \text {vec}({\varvec{\Sigma }}_{1}^{-1})\\ \vdots \\ \log \pi _{g}-\frac{1}{2}{\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g} +\frac{1}{2}\log \left| {\varvec{\Sigma }}_{g}\right| \\ {\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}\\ -\frac{1}{2}\text {vec}({\varvec{\Sigma }}_{g}^{-1}) \end{array}\right] , \end{aligned}$$
(24)

where \(\text {vec}(\cdot )\) is the matrix vectorization operator.

Using the results from McLachlan and Peel (2000, Ch. 3), we write the conditional expectation (9) in the form

$$\begin{aligned} \left[ \bar{{\varvec{s}}}\left( {\varvec{y}};{\varvec{\theta }}\right) \right] ^{\top }= & {} \left( \tau _{1}\left( {\varvec{y}};{\varvec{\theta }}\right) ,\tau _{1}\left( {\varvec{y}};{\varvec{\theta }} \right) {\varvec{y}},\tau _{1}\left( {\varvec{y}};{\varvec{\theta }}\right) \text {vec}({\varvec{y}}{\varvec{y}}^{\top }),\dots ,\right. \\&\left. \tau _{g}\left( {\varvec{y}};{\varvec{\theta }}\right) , \tau _{g}\left( {\varvec{y}};{\varvec{\theta }}\right) {\varvec{y}},\tau _{g}\left( {\varvec{y}};{\varvec{\theta }} \right) \text {vec}({\varvec{y}}{\varvec{y}}^{\top })\right) , \end{aligned}$$

where

$$\begin{aligned} \tau _{z}\left( {\varvec{y}};{\varvec{\theta }}\right) =\frac{\pi _{z}\varphi \left( {\varvec{y}}; {\varvec{\mu }}_{z},{\varvec{\Sigma }}_{z}\right) }{\sum _{\zeta =1}^{g}\pi _{\zeta } \varphi \left( {\varvec{y}};{\varvec{\mu }}_{\zeta },{\varvec{\Sigma }}_{\zeta }\right) }, \end{aligned}$$

is the usual a posteriori probability that \(Z=z\) (\(z\in \left[ g\right] \)), given observation of \({\varvec{Y}}={\varvec{y}}\). Again, via the results from McLachlan and Peel (2000, Ch. 3), we write the update function \(\bar{{\varvec{\theta }}}\) in the following form. Define \(\bar{{\varvec{\theta }}}\) to have the elements \(\bar{\pi }_{z}\) and \(\bar{{\varvec{\omega }}}_{z}\), for each \(z\in \left[ g\right] \), where each \(\bar{{\varvec{\omega }}}_{z}\) subsequently has elements \(\bar{{\varvec{\mu }}}_{z}\) and \(\bar{{\varvec{\Sigma }}}_{z}\). Furthermore, for convenience, we define the following notation for \({\varvec{s}}\):

$$\begin{aligned} {\varvec{s}}^\top = (s_{11}, {\varvec{s}}_{21}, \text {vec}(\mathbf {S}_{31}),\dots , s_{1g}, {\varvec{s}}_{2g}, \text {vec}(\mathbf {S}_{3g})), \end{aligned}$$

and

$$\begin{aligned} \left[ \bar{{\varvec{s}}}\left( {\varvec{y}};{\varvec{\theta }}\right) \right] ^\top= & {} (\bar{s}_{11}\left( {\varvec{y}};{\varvec{\theta }}\right) , \bar{{\varvec{s}}}_{21}\left( {\varvec{y}};{\varvec{\theta }}\right) , \text {vec}(\bar{\mathbf {S}}_{31}\left( {\varvec{y}};{\varvec{\theta }}\right) ),\dots ,\\&\quad \bar{s}_{1g}\left( {\varvec{y}};{\varvec{\theta }}\right) , \bar{{\varvec{s}}}_{2g}\left( {\varvec{y}};{\varvec{\theta }}\right) , \text {vec}(\bar{\mathbf {S}}_{3g}\left( {\varvec{y}};{\varvec{\theta }}\right) )), \end{aligned}$$

with

$$\begin{aligned} \bar{s}_{1z}\left( {\varvec{y}};{\varvec{\theta }}\right)= & {} \tau _{z}\left( {\varvec{y}};{\varvec{\theta }}\right) \text {, }\bar{{\varvec{s}}}_{2z}\left( {\varvec{y}};{\varvec{\theta }}\right) =\tau _{z}\left( {\varvec{y}};{\varvec{\theta }}\right) {\varvec{y}}\text {, and}\\ \bar{\mathbf {S}}_{3z}\left( {\varvec{y}};{\varvec{\theta }}\right)= & {} \tau _{z}\left( {\varvec{y}};{\varvec{\theta }}\right) {\varvec{y}}{\varvec{y}}^{\top }. \end{aligned}$$

Then, applying the M-step is equivalent to applying the function \(\bar{{\varvec{\theta }}}\) to \({\varvec{s}}\), with elements given by the unique elements of \(\bar{\pi }_{z}({\varvec{s}})\), \(\bar{{\varvec{\mu }}}_{z}({\varvec{s}})\), and \(\bar{{\varvec{\Sigma }}}_{z}({\varvec{s}})\), for \(z\in [g]\), defined by

$$\begin{aligned} \bar{\pi }_{z}({\varvec{s}})= & {} \frac{s_{1z}}{\sum _{j=1}^g s_{1j}}, \quad \bar{{\varvec{\mu }}}_{z}({\varvec{s}}) = \frac{ {\varvec{s}}_{2z}}{ s_{1z}}, \quad \text{ and } \nonumber \\ \bar{{\varvec{\Sigma }}}_{z}({\varvec{s}})= & {} \frac{\mathbf {S}_{3z}}{s_{1z}}-\frac{{\varvec{s}}_{2z} {\varvec{s}}_{2z}^{\top }}{s_{1z}^{2}}. \end{aligned}$$
(25)

This implies that the mini-batch EM and truncated mini-batch EM algorithms proceed via the update rule \(\bar{{\varvec{\theta }}}\left( {\varvec{s}}^{\left( r\right) }\right) \), where \(\bar{{\varvec{\theta }}}\) and \({\varvec{s}}^{\left( r\right) }\) are as given in (18). We start from \({\varvec{\theta }}^{(0)}\) and \({\varvec{s}}^{(1)}=N^{-1} \sum _{i=1}^N \bar{{\varvec{s}}}({\varvec{Y}}_i^{1}; {\varvec{\theta }}^{(0)}) \). Then, \({\varvec{\theta }}^{(1)\top }= [\bar{{\varvec{\theta }}}({\varvec{s}}^{(1)})]^\top ,\) which has elements

$$\begin{aligned}&\bar{\pi }_{z}\left( {\varvec{s}}^{\left( 1\right) }\right) =N^{-1}\sum _{i=1}^{N}\tau _{z} \left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) ,\nonumber \\&\quad \bar{{\varvec{\mu }}}_{z}\left( {\varvec{s}}^{\left( 1\right) }\right) =\frac{\sum _{i=1}^{N} {\tau }_{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) {\varvec{Y}}_{i}}{\sum _{i=1}^{N}\tau _{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) } \right) }, \end{aligned}$$
(26)

and

$$\begin{aligned}&\bar{{\varvec{\Sigma }}}_{z}\left( {\varvec{s}}^{\left( 1\right) }\right) =\frac{\sum _{i=1}^{N}\mathbf {\tau }_{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) {\varvec{Y}}_{i}{\varvec{Y}}_{i}^{\top } }{\sum _{i=1}^{N}\tau _{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) }\nonumber \\&\quad -\frac{\left[ \sum _{i=1}^{N}\tau _{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) {\varvec{Y}}_{i}\right] \left[ \sum _{i=1}^{N}\tau _{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) {\varvec{Y}}_{i}\right] ^{\top }}{\left[ \sum _{i=1}^{N}\tau _{z}\left( {\varvec{Y}}_{i};{\varvec{\theta }}^{\left( 0\right) }\right) \right] ^{2}}. \end{aligned}$$
(27)
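The quantities above translate directly into the s_bar and theta_bar helpers assumed by the generic sketches of Sect. 2. The following R sketch (our own illustrative coding, not the C implementation referred to in Sect. 4) stores \({\varvec{\theta }}\) as a list with elements pi (length g), mu (a \(d\times g\) matrix), and Sigma (a \(d\times d\times g\) array).

```r
# s_bar for the normal mixture: the conditional expectation (9) of the
# complete-data sufficient statistic (24), stacked component by component.
s_bar_normmix <- function(y, theta) {
  g <- length(theta$pi); d <- length(y)
  logdens <- sapply(seq_len(g), function(z) {
    S <- theta$Sigma[, , z]; r <- y - theta$mu[, z]
    -0.5 * (d * log(2 * pi) + as.numeric(determinant(S)$modulus) +
              drop(t(r) %*% solve(S, r)))
  })
  w <- theta$pi * exp(logdens - max(logdens))
  tau <- w / sum(w)                               # posterior probabilities tau_z
  unlist(lapply(seq_len(g), function(z)
    c(tau[z], tau[z] * y, tau[z] * tcrossprod(y))))
}

# theta_bar for the normal mixture: the M-step map (25), applied block-wise.
theta_bar_normmix <- function(s, g, d) {
  block <- 1 + d + d^2
  pi_z <- numeric(g); mu <- matrix(0, d, g); Sigma <- array(0, dim = c(d, d, g))
  for (z in seq_len(g)) {
    b  <- s[((z - 1) * block + 1):(z * block)]
    s1 <- b[1]; s2 <- b[2:(d + 1)]; S3 <- matrix(b[-(1:(d + 1))], d, d)
    pi_z[z]      <- s1
    mu[, z]      <- s2 / s1
    Sigma[, , z] <- S3 / s1 - tcrossprod(s2) / s1^2
  }
  list(pi = pi_z / sum(pi_z), mu = mu, Sigma = Sigma)
}
```

These two functions can, for instance, be passed to the earlier mini-batch sketch as minibatch_em(Y, theta0, s_bar_normmix, function(s) theta_bar_normmix(s, g, d), N, R).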

3.3 Convergence analysis of the mini-batch algorithm

In addition to Assumptions A1–A3, B1–B3, and C1–C3, we make the following assumption.

  1. D1

The Hessian matrix, with respect to \({\varvec{\theta }}\in \Theta \), of \(\sum _{i=1}^{n}\log f\left( {\varvec{y}}_{i};{\varvec{\theta }}\right) \) is non-singular at every \({\varvec{\theta }}_{0}\in \mathbb {M}_{\Theta }^{\mathrm{Emp}}\).

Assumption D1 is generally satisfied for all but pathological samples \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\). The following result is proved in the Supplementary Materials.

Proposition 3

Let \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) be a realization of some IID random sequence \(\left\{ {\varvec{Y}}_{i}\right\} _{i=1}^{n}\), where each \({\varvec{Y}}_{i}\) is hypothesized to arise from a DGP having PDF of the form (3). If \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{\infty }\) are sequences generated by the mini-batch EM algorithm, defined by (18) and (25), then for any \(N\in \mathbb {N}\), if C1, C2, and D1 are satisfied (replacing i by r, and \(F_{0}\) by \(\prod _{j=1}^{N}F_{\mathrm{Emp}}\), where appropriate), then, with probability 1,

$$\begin{aligned} \lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( r\right) },\Gamma \right) =0\text {, and }\lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{\theta }}^{\left( r\right) },\mathbb {M}_{\Theta }^{\mathrm{Emp}}\right) =0\text {.} \end{aligned}$$

Alternatively, if \(\left\{ {\varvec{s}}^{\left( r\right) }\right\} _{r=1}^{\infty }\) and \(\left\{ {\varvec{\theta }}^{\left( r\right) }\right\} _{r=1}^{\infty }\) are sequences generated by the truncated mini-batch EM algorithm, defined by (16), (17), (20), and (25), then for any \(N\in \mathbb {N}\), if C1 and D1 are satisfied (replacing i by r, and \(F_{0}\) by \(\prod _{j=1}^{N}F_{\mathrm{Emp}}\), where appropriate), then, with probability 1,

$$\begin{aligned} \lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{s}}^{\left( r\right) },\Gamma \right) =0\text {, and }\lim _{r\rightarrow \infty }\mathrm{dist}\left( {\varvec{\theta }}^{\left( r\right) },\mathbb {M}_{\Theta }^{\mathrm{Emp}}\right) =0\text {.} \end{aligned}$$

3.4 A truncation sequence

In order to apply the truncated version of the mini-batch EM algorithm, we require an appropriate sequence \(\left\{ \mathbb {K}_{m}\right\} _{m=0}^{\infty }\) that satisfies condition (14). This can be constructed in parts. Let us write

$$\begin{aligned} \mathbb {K}_{m}=\mathbb {D}_{g-1}^{m}\times \prod _{i=1}^{g}\left( \mathbb {B}_{d}^{m}\times \mathbb {H}_{d}^{m}\right) , \end{aligned}$$
(28)

where we shall let \(c_{1},c_{2},c_{3}\ge 1\),

$$\begin{aligned} \mathbb {D}_{g-1}^{m}= & {} \left\{ \left( \pi _{1},\dots ,\pi _{g}\right) \in \mathbb {R}^{g}:\sum _{z=1}^{g}\pi _{z}=1\text {, and }\right. \\&\left. \quad \pi _{z}\ge \frac{1}{c_{1}+m}\text {, for each }z\in \left[ g\right] \right\} ,\\ \mathbb {B}_{d}^m= & {} \left[ -\left( c_{2}+m\right) ,c_{2}+m\right] ^d,\\ \end{aligned}$$

and

$$\begin{aligned} \mathbb {H}_{d}^{m}=\left\{ \mathbf {H}\in \mathbb {H}_{d}:\lambda _{1}\left( \mathbf {H}\right) \ge \frac{1}{c_{3}+m}\text { and }\lambda _{d}\left( \mathbf {H}\right) \le c_{3}+m\right\} , \end{aligned}$$

using the notation \(\lambda _{1}\left( \mathbf {H}\right) \) and \(\lambda _{d}\left( \mathbf {H}\right) \) to denote the smallest and largest eigenvalues of the matrix \(\mathbf {H}\), respectively. A justification for this truncation scheme can be found in the Supplementary Materials.

We make a final note that the construction (28) is not the only way to satisfy the conditions of (14). One can instead, for example, replace \(c_{j}+m\) by \(c_{j}\left( 1+m\right) \) (\(j\in \left[ 3\right] \)) in the definitions of the sets that constitute (28).
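By way of illustration, the following \(\texttt {R}\) sketch checks membership of a set of normal mixture parameters in \(\mathbb {K}_{m}\) of (28), as one might do to decide whether an update has left the current truncation set. The function name \(\texttt {in\_K\_m}\), the default constants, and the representation of the component matrices in \(\mathbb {H}_{d}\) as a list of covariance matrices are illustrative assumptions, not part of the algorithm as stated.

```r
## Membership check for the truncation set K_m of (28) (illustrative sketch).
## pi_z: vector of mixing proportions; mu: list of component mean vectors;
## sigma: list of component matrices in H_d (covariances, in the normal case).
in_K_m <- function(pi_z, mu, sigma, m, c1 = 1000, c2 = 1000, c3 = 1000) {
  ok_pi <- isTRUE(all.equal(sum(pi_z), 1)) &&
    all(pi_z >= 1 / (c1 + m))                       # simplex with lower bound
  ok_mu <- all(sapply(mu, function(b) all(abs(b) <= c2 + m)))  # box constraint
  ok_H  <- all(sapply(sigma, function(S) {
    ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
    min(ev) >= 1 / (c3 + m) && max(ev) <= c3 + m    # eigenvalue bounds
  }))
  ok_pi && ok_mu && ok_H
}
```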

4 Simulation studies

We present a pair of simulation studies, based upon the famous Iris data set of Fisher (1936) and the Wreath data of Fraley et al. (2005), in the main text. A further four simulation scenarios are presented in the Supplementary Materials. In each case, we utilize the initial small data sets, obtained from the base \(\texttt {R}\) package (R Core Team 2018) and the \(\texttt {mclust}\) package for \(\texttt {R}\) (Scrucca et al. 2016), respectively, and use them as templates to generate much larger data sets. All computations are conducted in the \(\texttt {R}\) programming environment, although much of the bespoke code is written in \(\texttt {C}\) and integrated into \(\texttt {R}\) via the \(\texttt {Rcpp}\) and \(\texttt {RcppArmadillo}\) packages of Eddelbuettel (2013). Furthermore, timings of programs were conducted on a MacBook Pro with a 2.2 GHz Intel Core i7 processor, 16 GB of 1600 MHz DDR3 RAM, and a 500 GB solid-state drive. We note that all of the code used to conduct the simulations and computations for this manuscript can be accessed from https://github.com/hiendn/StoEMMIX.

In the sequel, in all instances, we shall use the learning rate sequence \(\left\{ \gamma _{r}\right\} _{r=1}^{\infty }\), where \(\gamma _{r}=\left( 1-10^{-10}\right) \times r^{-6/10}\), which follows from the choice made by Cappé and Moulines (2009) in their experiments. In all computations, a fixed budget of 10 epochs (or the equivalent) is allotted to each algorithm. Here, recall that the number of epochs is equal to the number of sweeps through the data set \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) that an algorithm is allowed. Thus, drawing 10n observations from the data \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\), with replacement, is equivalent to 10 epochs. Accordingly, each iteration of the standard EM algorithm counts as a single epoch, whereas, for a mini-batch algorithm with batch size N, every n/N iterations count as an epoch.
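As a quick illustration of this accounting, the following base \(\texttt {R}\) snippet is a minimal sketch; the variable names are ours, and the sample size shown is just the value used in the first experiments below.

```r
## Epoch accounting and learning rate (illustrative sketch).
n      <- 1e6            # sample size
N      <- n / 10         # mini-batch size
epochs <- 10             # fixed budget of sweeps through the data
n_iter <- epochs * n / N # mini-batch iterations allowed under this budget

## gamma_r = (1 - 1e-10) * r^(-6/10), for r = 1, 2, ...
gamma <- function(r) (1 - 1e-10) * r^(-6/10)
head(gamma(seq_len(n_iter)))  # first few learning rates
```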

Next, in both of our studies, we consider batch sizes of \(N=n/10\) and \(N=n/5\), and we further consider Polyak averaging as well as truncation. Thus, for each study, a total of eight variants of the mini-batch EM algorithm is considered. In the truncated case, we set \(c_{1}=c_{2}=c_{3}=1000\). Finally, the variants of the mini-batch EM algorithm are compared to the standard (batch) EM algorithm for fitting finite mixtures of normal distributions. In the interest of fairness, each of the algorithms is initialized at the same starting value of \({\varvec{\theta }}^{\left( 0\right) }\), using the randomized initialization scheme suggested by McLachlan and Peel (2000, Sec. 3.9.3). That is, the same randomized starting instance is used for the EM algorithm and each of the mini-batch variants.

To the best of our knowledge, the most efficient and reliable implementation of the EM algorithm for finite mixtures of normal distributions, in \(\texttt {R}\), is the \(\text {\texttt {em}}\) function from the \(\texttt {mclust}\) package. Thus, this will be used for all of our comparisons. Data generation from the template data sets is handled using the \(\texttt {simdataset}\) function from the \(\texttt {MixSim}\) package (Melnykov et al. 2012), in the Iris study, and the \(\texttt {simVVV}\) function from \(\texttt {mclust}\) in the Wreath study. Timing was conducted using the \(\mathtt {proc.time}\) function.
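For completeness, the timing pattern is as simple as the following sketch; the placeholder comment marks where a single fit would run.

```r
## Wall-clock timing of a single fit (illustrative sketch).
t0 <- proc.time()
## ... run one EM or mini-batch fit here ...
elapsed <- (proc.time() - t0)["elapsed"]   # elapsed seconds
```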

4.1 Iris data

The Iris data (accessed in \(\texttt {R}\) via the \(\texttt {data(iris)}\) command) contain measurements of \(d=4\) dimensions from 150 iris flowers, 50 of which belong to each of the species Setosa, Versicolor, and Virginica, respectively. The four measured dimensions of each flower are petal length, petal width, sepal length, and sepal width. To each species subpopulation, we fit a single multivariate normal distribution to its 50 observations (i.e., we estimate a mean vector and covariance matrix for each species). Then, using the three mean vectors and covariance matrices, we construct a template \(g=3\) component normal mixture model with equal mixing proportions \(\pi _{z}=1/3\) (\(z\in \left[ 3\right] \)), of form (3). This template distribution is then used to generate synthetic data sets of any size n.
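The following base \(\texttt {R}\) sketch illustrates this template construction. It uses \(\texttt {MASS::mvrnorm}\) for the draws purely for illustration; in our experiments the sampling is handled by \(\texttt {simdataset}\) from \(\texttt {MixSim}\), as noted above, and the object names here are ours.

```r
## Template construction for the Iris studies (illustrative sketch).
data(iris)
species <- split(iris[, 1:4], iris$Species)

mu    <- lapply(species, colMeans)   # one mean vector per species
sigma <- lapply(species, cov)        # one covariance matrix per species
pi_z  <- rep(1 / 3, 3)               # equal mixing proportions

## Draw n synthetic observations from the g = 3 component template.
rtemplate <- function(n) {
  z <- sample(1:3, n, replace = TRUE, prob = pi_z)
  t(sapply(z, function(k) MASS::mvrnorm(1, mu[[k]], sigma[[k]])))
}
y <- rtemplate(10^4)   # a small synthetic data set, for illustration
```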

Two experiments are performed using this simulation scheme. In the first experiment, we generate \(n=10^{6}\) observations \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) from the template. We then utilize \(\left\{ {\varvec{y}}_{i}\right\} _{i=1}^{n}\) and each of the mini-batch EM algorithm variants, as well as the batch EM algorithm, to compute ML estimates. We use a number of performance measures for each algorithm variant. These include the computation time, the log-likelihood, the squared error of the parameter estimates (SE; the Euclidean distance from the generative parameter vector), and the adjusted Rand index (ARI; Hubert and Arabie 1985) between the maximum a posteriori clustering labels obtained from the fitted mixture model and the true generative data labels.

The ARI measures the degree of concordance between two sets of labels. Here, a value of 1 indicates perfect agreement, and a value near 0 indicates discordance. Since the ARI is adjusted for randomness in the labelling process, negative ARI values are possible; these are rare and also indicate discordance between the labellings. Each variant is repeated \(\text {Rep}=100\) times, and each performance measurement is recorded in order to obtain a measure of the overall performance of each algorithm. For future reference, we name this study Iris1. In the second study, which we name Iris2, we repeat the setup of Iris1 but with the number of observations increased to \(n=10^{7}\).
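As a minimal sketch of how the ARI is computed from a fitted mixture, assuming an \(n\times g\) matrix \(\texttt {tau}\) of posterior probabilities and a vector \(\texttt {z\_true}\) of generative labels (both names are ours):

```r
## ARI between MAP cluster labels and generative labels (sketch).
library(mclust)                      # provides adjustedRandIndex()
z_map <- apply(tau, 1, which.max)    # maximum a posteriori labels
ari   <- adjustedRandIndex(z_map, z_true)
```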

4.2 Wreath data

The Wreath data (accessed in \(\texttt {R}\) via the \(\texttt {data(wreath)}\) command) contain 1000 observations of \(d=2\) dimensional vectors, each belonging to one of \(g=14\) distinct but unlabelled subpopulations. We use the \(\texttt {Mclust}\) function from \(\texttt {mclust}\) to fit a 14-component mixture of normal distributions to the data. The data, along with the means of the subpopulation normal distributions, are plotted in Fig. 1. Here, each observation is colored based upon the subpopulation that maximizes its a posteriori probability.
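A minimal sketch of this template fit, assuming only the \(\texttt {mclust}\) package (the object name \(\texttt {fit}\) is ours):

```r
## Fit the 14-component normal mixture template to the Wreath data (sketch).
library(mclust)
data(wreath)
fit <- Mclust(wreath, G = 14)          # 14-component normal mixture
str(fit$parameters, max.level = 1)     # mixing proportions, means, covariances
```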

Fig. 1
figure 1

Plot of the 1000 observations of the Wreath data set, colored by subpopulation with subpopulation means indicated by crosses

As with the Iris data, using the fitted mixture model as a template, we can then simulate synthetic data sets of any size n. We perform two experiments using this scheme. In the first experiment, we simulate \(n=10^{6}\) observations and assess the different algorithms based on the computation time, the log-likelihood, the SE, and the ARI over \(\text {Rep}=100\) repetitions, as per Iris1. We refer to this experiment as Wreath1. In the second experiment, we repeat the setup of Wreath1, but with \(n=10^{7}\), instead. We refer to this case as Wreath2.

4.3 Results

Figures 2 and 3 contain box plots that summarize the results of Iris1 and Iris2, respectively. Similarly, Figs. 4 and 5 contain box plots that summarize the results of Wreath1 and Wreath2, respectively.

Fig. 2
figure 2

Results from \(\text {Rep}=100\) replications of the Iris1 simulation experiment. The ‘EM’ box plot summarizes the performance of the standard EM algorithm. The other plots are labelled by which variant of the mini-batch EM algorithm is summarized. The value of the batch size N is indicated (either \(N=n/10\) or \(N=n/5\)), and a ‘P’ or a ‘T’ designates that Polyak averaging or truncation was used, respectively

Fig. 3
figure 3

Results from \(\text {Rep}=100\) replications of the Iris2 simulation experiment. The ‘EM’ box plot summarizes the performance of the standard EM algorithm. The other plots are labelled by which variant of the mini-batch EM algorithm is summarized. The value of the batch size N is indicated (either \(N=n/10\) or \(N=n/5\)), and a ‘P’ or a ‘T’ designates that Polyak averaging or truncation was used, respectively

Fig. 4
figure 4

Results from \(\text {Rep}=100\) replications of the Wreath1 simulation experiment. The ‘EM’ box plot summarizes the performance of the standard EM algorithm. The other plots are labelled by which variant of the mini-batch EM algorithm is summarized. The value of the batch size N is indicated (either \(N=n/10\) or \(N=n/5\)), and a ‘P’ or a ‘T’ designates that Polyak averaging or truncation was used, respectively

Fig. 5
figure 5

Results from \(\text {Rep}=100\) replications of the Wreath2 simulation experiment. The ‘EM’ box plot summarizes the performance of the standard EM algorithm. The other plots are labelled by which variant of the mini-batch EM algorithm is summarized. The value of the batch size N is indicated (either \(N=n/10\) or \(N=n/5\)), and a ‘P’ or a ‘T’ designates that Polyak averaging or truncation was used, respectively

Firstly, we note that Polyak averaging requires no additional computational effort, for a given value of N. Thus, we do not require separate timing data for the Polyak averaging variants of the mini-batch EM algorithms in each of Figs. 2, 3, 4 and 5. From the timing results, we observe that the standard EM algorithm is faster than the mini-batch versions in all scenarios, even though all of the algorithms were allotted 10 epochs worth of data access. This is because the mini-batch algorithms require additional intermediate steps in each algorithm loop (e.g., random sampling from the empirical distribution), as well as a multiplicative factor of n/N more loops. From the plots, we observe that the larger value of N tends to result in smaller computing times. There appears to be no difference in timing between the use of truncation or not.

In the Iris1 and Iris2 studies, we observe that the mini-batch EM algorithms uniformly outperform the standard EM algorithm in terms of the log-likelihood, SE, and ARI. In all three measurements, we observe that \(N=n/10\) performed better than \(N=n/5\), and also that there were no differences between truncated versions of the mini-batch algorithms and equivalent variants without truncation. Polyak averaging appears to reduce the performance of the mini-batch algorithms, for a given level of N, with respect to each of the three measurements.

In the Wreath1 and Wreath2 studies, we observe that the standard and mini-batch EM algorithms perform virtually the same across the log-likelihood, SE, and ARI metrics. This is likely due to the high degree of separability of each of the \(g=14\) mixture components of the Wreath data, in comparison to the overlapping components of the Iris data. Our observation is true for all of the mini-batch EM variants, with or without truncation or Polyak averaging. Upon first impression, this may appear to be a weakness of the mini-batch EM algorithm, since it produces the same performance while requiring more computational time. However, we must also remember that the mini-batch EM algorithm does not require all of the data to be stored in memory at each iteration of the algorithm, whereas the EM algorithm does. Thus, the mini-batch algorithm is feasible in very large data situations, since it only requires a fixed memory size, of order N, regardless of sample size n, whereas the standard EM algorithm has a memory requirement that increases with n.

To expand upon the presented results, we have also included a further two simulation studies regarding the fitting of finite mixtures of normal distributions using mini-batch EM algorithms. These studies are based on the Flea data of Wickham et al. (2011), a test scenario from the ELKI project of Schubert et al. (2015), and an original data generating process. The ELKI scenario is chosen due to its separability, in order to assess whether our conclusions regarding Wreath1 and Wreath2 are correct. The Flea data share similarities with the Iris data but are of higher dimension. In all cases, we found that the mini-batch algorithms tended to outperform the EM algorithm in all but timing, on average. Detailed assessments of these studies can be found in the Supplementary Materials.

In addition, we have also investigated the use of the mini-batch EM algorithm for estimation of non-normal mixture models. Namely, we present a pair of algorithms for the estimation of exponential and Poisson mixture models. We demonstrate their performance via an additional pair of simulation studies.

To conclude, we make the following recommendations. Firstly, smaller batch sizes appear to yield higher likelihood values. Secondly, averaging appears to slow the convergence of the algorithm to higher likelihood values and is thus not recommended. Thirdly, truncation appears to have no effect on performance; this is likely because truncation was not needed in any of our experiments. In any case, it is always useful to use the truncated version of the algorithm, in case there are unforeseen instabilities in the optimization process. Finally, the standard EM algorithm may be preferred to the mini-batch EM algorithm when sample sizes are small and the data are highly separable. However, even in the face of high separability, for large n it may not be feasible to conduct estimation by the standard EM algorithm, and thus the mini-batch algorithms may be preferred on feasibility grounds.

It is interesting to observe that Polyak averaging tended to diminish the performance of the algorithms in our studied scenarios. This contradicts the theory, which suggests that Polyak averaging should in fact increase the rate of convergence to stationary solutions. We note, however, that the theory is asymptotic, and the number of epochs that we used may be too small for the advantages of Polyak averaging to manifest in practice.

5 Real data study

5.1 MNIST data

The MNIST data of LeCun et al. (1998) consist of \(n=70,000\) observations of \(d=28\times 28=784\) pixel images of handwritten digits. These handwritten digits were sampled nearly uniformly. That is, there were 6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, and 6958 observations of the digits 0–9, respectively.

Next, it is notable that not all d pixels are particularly informative. In fact, there is a great amount of redundancy among the d dimensions. Out of the d pixels, 65 are zero for every observation. Thus, approximately 8.3% of the dimensions of the data are uninformative.

We eliminate these always-zero pixels across all images to obtain a dense dimensionality of \(d_{\text {dense}}=719\). Using the \(d_{\text {dense}}\) dimensions of the data, we conduct a principal component analysis (PCA) in order to further reduce the data dimensionality; see Jolliffe (2002) for a comprehensive treatment of PCA. From the PCA, we extract the principal components (PCs) of each observation, for various numbers of PCs \(d_{\text {PC}}\in \left[ d_{\text {dense}}\right] \). We can then use the resulting data sets of n observations and dimension \(d_{\text {PC}}\) to estimate mixtures of normal distributions for various values of g.
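A minimal sketch of this preprocessing, assuming an \(n\times 784\) pixel matrix \(\texttt {mnist}\) has already been loaded (the loading step and all object names here are ours):

```r
## Drop always-zero pixels and reduce dimension by PCA (illustrative sketch).
keep    <- colSums(mnist != 0) > 0   # pixels that are non-zero for at least one image
x_dense <- mnist[, keep]             # n x d_dense matrix (d_dense = 719 in the text)

pca   <- prcomp(x_dense)             # principal component analysis
d_pc  <- 20                          # e.g., d_PC = 20
x_pca <- pca$x[, 1:d_pc]             # n x d_PC matrix of PC scores for mixture fitting
```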

5.2 Experimental setup

In the following study, we utilize only the truncated version of the mini-batch algorithm, having concluded from Sect. 4 that there appears to be no performance penalty due to truncation in practice. Again drawing upon our experience from Sect. 4, we set \(N=n/10=7000\) as the batch size in all applications. The same learning rate sequence \(\left\{ \gamma _{r}\right\} _{r=1}^{\infty }\), where \(\gamma _{r}=\left( 1-10^{-10}\right) \times r^{-6/10}\), is also used, and we set \(c_{1}=c_{2}=c_{3}=1000\).

We apply the mini-batch algorithm to data with \(d_{\text {PC}}=10,20,50,100\). Initialization of the parameter vector \({\varvec{\theta }}^{\left( 0\right) }\) was conducted via the randomization scheme of McLachlan and Peel (2000, Sec. 3.9.3). The mini-batch algorithm was run 100 times for each \(d_{\text {PC}}\), and the log-likelihood values of the fitted models were recorded for both the Polyak averaging and no averaging versions of the algorithm. The standard EM algorithm, as applied via the \(\text {\texttt {em}}\) function of the \(\texttt {mclust}\) package, is again used for comparison. In the interest of fairness, each of the algorithms, including the \(k\text {-means}\) algorithm, was initialized from the same initial randomization for each of the 100 runs. That is, a random partition of the data is generated once for each of the 100 runs, and the initial parameters for the EM, mini-batch, and \(k\text {-means}\) algorithms are all computed from that same partition. The log-likelihood values of the standard and mini-batch EM algorithms are compared, along with the ARI values. All algorithms are run for 10 epochs.

We compute the ARI values obtained when comparing the maximum a posteriori clustering labels, obtained from each of the algorithms (cf. McLachlan and Peel 2000, Sec. 1.15), with the true digit classes of each of the images. As a benchmark, we also compare the performance of the three EM algorithms with that of the \(k\text {-means}\) algorithm, as applied via the \(\texttt {kmeans}\) function in \(\texttt {R}\), which implements the algorithm of Hartigan and Wong (1979). For fairness of comparison, we also allow the \(k\text {-means}\) algorithm 10 epochs in each of the 100 runs. As in Sect. 4, we note that all code is available at https://github.com/hiendn/StoEMMIX, for the sake of reproducibility and transparency.
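A hedged sketch of this benchmark follows; here \(\texttt {x\_pca}\) is the score matrix from the preprocessing sketch above, \(\texttt {digit\_labels}\) is the vector of true digit classes, and the choice of 10 clusters and the default initialization are our illustrative simplifications of the shared random-partition initialization described above.

```r
## k-means benchmark with an iteration budget of 10 (illustrative sketch).
km <- kmeans(x_pca, centers = 10, iter.max = 10, algorithm = "Hartigan-Wong")
ari_km <- mclust::adjustedRandIndex(km$cluster, digit_labels)
```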

5.3 Results

The results from the MNIST experiment are presented in Table 1. We observe that for \(d_{\text {PC}}\in \left\{ 10,20,50\right\} \), all three EM variants provided better ARI values than the \(k\text {-means}\) algorithm. The best ARI values for all three EM algorithms occur when \(d_{\text {PC}}=20\). When \(d_{\text {PC}}=100\), the \(k\text {-means}\) algorithm provided a better ARI; indeed, the ARI of the \(k\text {-means}\) algorithm appeared to be roughly uniform across the four values of \(d_{\text {PC}}\).

Table 1 Tabulation of results from the 100 runs of the EM algorithms and the \(k\text {-means}\) algorithm, for each value of \(d_{\text {PC}}\in \left\{ 10,20,50,100\right\} \)

Among the EM algorithms, the mini-batch variants provided better ARI values. When \(d_{\text {PC}}\in \left\{ 20,50\right\} \), the two mini-batch variants did not appear to differ significantly from one another, given the standard errors of the ARI values. When \(d_{\text {PC}}=10\), we observe that no averaging yielded a better ARI, whereas, when \(d_{\text {PC}}=100\), averaging appeared to be better, on average.

Regarding the log-likelihoods, the mini-batch EM algorithm, when applied without averaging, uniformly and significantly outperformed the standard EM algorithm. In contrast, when averaging was applied, the standard EM algorithm uniformly and significantly outperformed the mini-batch algorithm, which is contrary to what was observed in Sect. 4. This is an interesting result, considering that the ARI of the mini-batch algorithm with averaging is still better than that of the EM algorithm. As in Sect. 4, we can recommend the use of the mini-batch EM algorithm without averaging, as it tends to outperform the standard EM algorithm in terms of fit and also yields better clustering outcomes, when measured via the ARI.

6 Conclusions

In Sect. 2, we reviewed the online EM algorithm framework of Cappé and Moulines (2009), and stated the key theorems that guarantee the convergence of algorithms that are constructed under the online EM framework. We then presented a novel interpretation of the online EM algorithm that yielded our framework for constructing mini-batch EM algorithms. We then utilized the theorems of Cappé and Moulines (2009) in order to produce convergence results for this new mini-batch EM algorithm framework. Extending upon some remarks of Cappé and Moulines (2009), we also made rigorous the use of truncation in combination with both the online EM and mini-batch EM algorithm frameworks, using the construction and theory of Delyon et al. (1999).

In Sect. 3, we demonstrated how the mini-batch EM algorithm framework could be applied to construct algorithms for conducting ML estimation of finite mixtures of exponential family distributions. A specific analysis was made of the particularly interesting case of normal mixture models. There, we validated the conditions that permit the use of the theorems from Sect. 2 in order to guarantee the convergence of the mini-batch EM algorithms for ML estimation of normal mixture models.

In Sect. 4, we conducted a set of four simulation studies in order to assess the performance of the mini-batch EM algorithms, implemented in eight different variants, as compared to the standard EM algorithm for ML estimation of normal mixture models. There, we found that, regardless of implementation, in many cases the mini-batch EM algorithms were able to obtain log-likelihood values that were better on average than those of the standard EM algorithm. We also found that the use of larger batch sizes and Polyak averaging tended to diminish the performance of the mini-batch algorithms, whereas the use of truncation tended to have no effect. Although the mini-batch algorithms are generally slower than the standard EM algorithm, we note that in many cases the fixed memory requirement of the mini-batch algorithms makes them feasible where the standard EM algorithm is not.

A real data study was conducted in Sect. 5. There, we explored the use of the standard EM algorithm and the truncated mini-batch EM algorithm for cluster analysis of the famous MNIST data of LeCun et al. (1998). From our study, we found that the mini-batch EM algorithm was able to obtain better log-likelihood values than the standard EM algorithm when applied without Polyak averaging, whereas, with averaging, it was worse than the standard EM algorithm, on average. Nevertheless, regardless of whether averaging was used, the mini-batch EM algorithm appeared to yield better clustering outcomes, when measured via the ARI of Hubert and Arabie (1985).

This research poses numerous interesting directions for the future. Firstly, we may extend the results to other exponential family distributions that permit the satisfaction of the theorem assumptions from Sect. 2. We make initial steps in this direction via a pair of mini-batch algorithms for exponential and Poisson distribution mixtures. Secondly, we may use the framework to construct mini-batch algorithms for large-scale mixtures of regression models (cf. Jones and McLachlan 1992), following the arguments made by Cappé and Moulines (2009) that permitted them to construct an online EM algorithm for their mixture of regressions example analysis. Thirdly, this research theme can be extended further to the construction of mini-batch algorithms for mixture of experts models (cf. Nguyen and Chamroukhi 2018), which may be facilitated via the Gaussian gating construction of Xu et al. (1995).

In addition to the three previous research questions, we may also ask questions regarding the practical application of the mini-batch algorithms. For instance, we may consider the question of optimizing learning rates and batch sizes for particular application settings. Furthermore, we may consider whether the theoretical framework still applies to algorithms where we may have adaptive batch sizes and learning rate regimes. As these directions fall vastly outside the scope of the current paper, we shall leave them for future exploration.