
1 Introduction and Motivation

Mixture models, and in particular Gaussian Mixture Models (GMM), are commonly used for density estimation and classification. In this era of Big Data, data sets are increasingly complex and enormous in size, and mixture models offer a powerful and flexible way to represent them. A comprehensive discussion of mixture models can be found in [1, 2].

When the number of mixture components is known and the component densities are assumed to belong to a specified parametric family, the popular Expectation Maximization (EM) algorithm [3], based on Maximum Likelihood Estimation (MLE), is often used to estimate the GMM parameters. However, when there is a small perturbation in one of the component densities, the MLE becomes significantly biased and very sensitive to outliers [4]. Furthermore, when the data are not Gaussian, the EM method may not cluster a set of data points to a Gaussian with a meaningful mean vector and covariance matrix. In short, the EM-based approach is not robust when the underlying probability density function of the data does not match a mixture of Gaussians (a data/model mismatch).

To overcome this limitation, Scott [5–8] introduced an alternative minimum distance estimation method based on the integrated squared error criterion (termed \(L_{2}E\)) which avoids the use of nonparametric kernel density estimators. The \(L_{2}E\) approach is a special case of a general method introduced in [9] that is based on a whole continuum of divergence estimators that begin with the MLE and interpolate to the \(L_{2}E\) estimator. Markatou [10] used the weighted likelihood estimation approach to address the effects of data/model mismatch on parameter estimates.

In this paper, the focus is on \(L_{2}E\) as an alternative to EM for parameter estimation of models with a known finite number of mixture components. A discussion of the EM and \(L_{2}E\) approaches is given. Simulation results specific to GMM are shown to illustrate the robustness of the \(L_{2}E\) method with respect to noise in the data (particularly outliers) and data/model mismatch [11–13].

The basic notation in this paper is as follows. Let \(f_{\varvec{\theta }_m}({\varvec{x}})\) denote a general mixture probability density function with m components, given by

$$\begin{aligned} f_{\varvec{\theta }_m} ({\varvec{x}}) = \sum _{i=1}^m \pi _i f({\varvec{x}}|{\varvec{\phi }_i}) \end{aligned}$$
(1)

where \(\varvec{\theta }_m = (\pi _1, \dots , \pi _{m-1}, \pi _{m}, {\varvec{\phi }_1}^T , \dots , {\varvec{\phi }_m}^T)^T\), the weights \(\pi _i >0\), \(\sum _{i=1}^m \pi _i =1\) and \(f({\varvec{x}}|{\varvec{\phi }_i})\) is a probability density function with parameter vector \({\varvec{\phi }_i}\). In theory, the \(f({\varvec{x}}|{\varvec{\phi }_i})\) could be any parametric density, although in practice they are often from the same parametric family (usually Gaussian).
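As a small illustration of Eq. (1) (not from the paper; the component values are arbitrary), the mixture density can be evaluated as a weighted sum of component densities:

```python
# Illustrative evaluation of the mixture density in Eq. (1) for Gaussian components.
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, sds):
    """f_theta(x) = sum_i pi_i * phi(x | mu_i, sigma_i) for a univariate Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, sds))

# Example values (assumptions for illustration): the mixture later used in Eq. (8).
fx = mixture_pdf([0.0, 1.0, 2.0], weights=[0.75, 0.25], means=[0.0, 1.0], sds=[1.0, 1.0])
```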

2 EM Algorithm

The Expectation-Maximization (EM) algorithm [3] is broadly based on the iterative computation of MLE. The EM method alternates between two steps:

  1. Expectation (E) step: compute an expectation of the likelihood by including the latent variables as if they were observed, and

  2. Maximization (M) step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.

The parameters found in the M step are then used to begin another E step and the process is repeated.

For finite mixture models, the observed data samples \({\varvec{X}} = \left\{ {\varvec{x}_1},{\cdots },{\varvec{x}_n}\right\} \) are viewed as incomplete. The complete data are obtained as \({\varvec{Z}} = \left\{ {\varvec{x}_i},{\varvec{y}_i}\right\} \) for \(i=1\) to n, where \({\varvec{y}_i} = ({y_{i1}},{\cdots },{y_{im}})^{T}\) is a latent (unobserved or missing) indicator vector with \(y_{ij}=1\) if \(\varvec{x}_i\) is from mixture component j and zero otherwise. The log-likelihood of \({\varvec{Z}}\) is defined by

$$\begin{aligned} L(\varvec{\theta }_m|\varvec{Z})= \sum _{i=1}^n \sum _{j=1}^m y_{ij}\log [\pi _{j} f({\varvec{x}_i}|{\varvec{\phi }_j})] \end{aligned}$$
(2)

The EM algorithm obtains a sequence of estimates \(\varvec{\theta }^{(t)},t=0,1,\cdots \) by alternating the E-Step and the M-Step until some convergence criterion is met.

  1. E-Step: Calculate the Q function, the conditional expectation of the complete log-likelihood, given \({\varvec{X}}\) and the current estimate \(\varvec{\theta }^{(t)}\).

  2. M-Step: Update the estimate of the parameters by maximizing the Q function (a minimal sketch of these two steps follows this list).
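A minimal EM sketch for a univariate GMM is given below. It is an illustrative implementation, not the paper's code: the responsibilities play the role of the expected indicators \(y_{ij}\) in the E-step, and the M-step uses the closed-form GMM updates; the iteration count, tolerance, and initialization are assumptions.

```python
# Illustrative EM sketch for a univariate m-component GMM (not the paper's code).
import numpy as np
from scipy.stats import norm

def em_gmm(x, weights, means, variances, n_iter=500, tol=1e-8):
    """Return EM estimates (weights, means, variances) for a univariate GMM."""
    x = np.asarray(x, dtype=float)
    w, mu, var = (np.array(v, dtype=float) for v in (weights, means, variances))
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. conditional expectations of y_ij given x and theta^(t).
        dens = np.array([w_j * norm.pdf(x, mu_j, np.sqrt(v_j))
                         for w_j, mu_j, v_j in zip(w, mu, var)])      # shape (m, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: closed-form updates that maximize the Q function for a GMM.
        n_j = resp.sum(axis=1)
        w_new = n_j / x.size
        mu_new = (resp @ x) / n_j
        var_new = (resp * (x[None, :] - mu_new[:, None]) ** 2).sum(axis=1) / n_j
        converged = np.max(np.abs(mu_new - mu)) < tol
        w, mu, var = w_new, mu_new, var_new
        if converged:
            break
    return w, mu, var
```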

In the case of GMM, maximizing Q provides an explicit solution. In most instances, EM has the advantages of reliable global convergence, low cost per iteration, economy of storage, ease of programming, and heuristic appeal. However, its convergence can be very slow, even in simple problems that are often encountered in practice. Also, when there is a small perturbation in one of the component densities due to noise in the data, the MLE estimates become highly unstable due to the lack of robustness to outliers. For the case of GMM [14], this can be seen easily, since maximization of the likelihood function under an assumed Gaussian distribution is equivalent to finding the least-squares solution, whose lack of robustness is well known. As a robust alternative, we discuss an approach based on the minimization of the integrated squared distance, namely \(L_{2}E\).

3 Robust \(L_{2}E\) Estimator

The integrated squared distance has been used as the goodness-of-fit criterion in nonparametric density estimation for a long time. In the classic papers of Scott [6, 7], an alternative minimum distance estimation method based on the integrated squared error criterion, termed \(L_{2}E\), was introduced and has the following attributes.

  1. The use of nonparametric kernel density estimators is avoided.

  2. The \(L_{2}E\) is especially suited for parameter-rich models such as mixture models.

  3. The \(L_{2}E\) approach, whose genesis can be traced to the pioneering work of Rudemo [15] and Bowman [16], is computationally feasible and leads to robust estimators.

  4. The \(L_{2}E\) belongs to a special class of robust estimators, like the median-based estimators, which sacrifice some asymptotic efficiency for substantial computational benefits in difficult estimation problems.

  5. The \(L_{2}E\) estimator performs much better than other robust estimators, such as minimum Hellinger distance (MHD) estimators, under severe data contamination.

The \(L_{2}E\) estimator belongs to the family of minimum density power divergence (MDPD) estimators introduced in [9] with the tuning parameter \(\alpha =1\). The tuning parameter \(\alpha \) in an MDPD estimator controls the trade-off between robustness and efficiency. It is also shown that the robustness of the \(L_{2}E\) estimator is achieved at a fairly stiff price in asymptotic efficiency [9]. For the normal, exponential and Poisson distributions with small values of \(\alpha \le 0.10\), the MDPD has strong robustness properties and retains high asymptotic relative efficiency (ARE) with respect to MLE. However, within the family of density-based power divergence measures, the \(L_{2}E\) approach has the distinct advantage that a key integral can be computed in closed form, especially for Gaussian mixtures.
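For concreteness, the MDPD criterion of [9] may be written (up to terms that do not depend on \(\varvec{\theta }_m\), in the standard density power divergence form) as

$$\begin{aligned} \hat{\varvec{\theta }}^{MDPD}_{m} = \arg \min _{\varvec{\theta }_m} \left[ \int _{-\infty }^\infty f^{1+\alpha }_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} - \left( 1+\frac{1}{\alpha }\right) n^{-1} \sum _{i=1}^n f^{\alpha }_{\varvec{\theta }_m}({\varvec{X}}_i)\right] , \qquad \alpha > 0, \end{aligned}$$

so that setting \(\alpha =1\) gives \(\int f^{2}_{\varvec{\theta }_m}({\varvec{x}}) d{\varvec{x}} - 2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\), which is exactly the \(L_{2}E\) criterion that appears later in Eq. (5).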

3.1 \(L_{2}E\) Algorithm

Given the true probability density \(g({\varvec{x}})\) and the finite mixture with m components, \(f_{\varvec{\theta }_m}({\varvec{x}})\), consider the \(L_{2}\) distance between \(f_{\varvec{\theta }_m}\) and \(g({\varvec{x}})\) as given by

$$\begin{aligned} L_{2}(f_{\varvec{\theta }_m},g({\varvec{x}})) = \int _{-\infty }^\infty [f_{\varvec{\theta }_m}({\varvec{x}}) - g({\varvec{x}})]^2 d{\varvec{x}}. \end{aligned}$$
(3)

The aim is to derive an estimate of \(\varvec{\theta }_m\) that minimizes the \(L_{2}\) distance [5–7, 11–13]. Expanding Eq. (3) gives

$$\begin{aligned} L_{2}(f_{\varvec{\theta }_m},g({\varvec{x}})) = \int _{-\infty }^\infty f_{\varvec{\theta }_m}^2({\varvec{x}}) d{\varvec{x}} - 2\int _{-\infty }^\infty f_{\varvec{\theta }_m}({\varvec{x}})g({\varvec{x}}) d{\varvec{x}} + \int _{-\infty }^\infty g({\varvec{x}})^2 d{\varvec{x}} \end{aligned}$$
(4)

where the last integral is a constant with respect to \(\varvec{\theta }_m\) and therefore may be ignored in the minimization. The first integral in Eq. (4) is often available as a closed-form expression that, for Gaussian mixtures, may be evaluated for any specified value of \(\varvec{\theta }_m\), as shown later in Eq. (7). The second integral in Eq. (4) is the expected height of the model density under the true density \(g\); estimating it by the sample average gives the term \(-2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\), where \({\varvec{X}}_i\) is a sample observation. Based on the above analysis, the \(L_{2}E\) estimator of \(\varvec{\theta }_m\) is given by

$$\begin{aligned} \hat{\varvec{\theta }}^{L_{2}E}_{m} = \arg \min _{\varvec{\theta }_m} \left[ \int _{-\infty }^\infty f^{2}_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} - 2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\right] . \end{aligned}$$
(5)

3.2 GMM Models

For multivariate Gaussian mixtures,

$$\begin{aligned} f({\varvec{x}}|{\varvec{\phi }_i}) = \phi ({\varvec{x}}|~ {\varvec{\mu }_{i}},{\varSigma _{i}}) \end{aligned}$$
(6)

where \({\varvec{\mu }_{i}}\) is the mean vector and \(\varSigma _{i}\) is the covariance matrix for component i. In this case, the problem reduces to finding the \(L_{2}E\) estimator for a Gaussian Mixture Model (GMM). Now, the first integral in Eq. (4) reduces to

$$\begin{aligned} \int _{-\infty }^\infty f^{2}_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} = \sum _{k=1}^{m}\sum _{l=1}^{m} \pi _{k} \pi _{l}~~ \phi ({\varvec{\mu }_{k}} - {\varvec{\mu }_{l}}|~0,{\varSigma _{k} +\varSigma _{l}}), \end{aligned}$$
(7)

thereby making Eq. (4) tractable for minimization and significantly reducing the computations involved in obtaining the \(L_{2}E\) estimator. Since this is a computationally feasible closed-form expression, estimation of the GMM parameters by the \(L_{2}E\) procedure may be performed by any standard nonlinear optimization algorithm [5, 6, 11–13]. In this work, we used the ‘nlminb’ nonlinear minimization routine in [17]; a sketch of the objective is given below.
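As an illustration (not the implementation used in the paper, which relies on the routine cited above), the \(L_{2}E\) objective of Eq. (5) for a univariate two-component GMM can be coded using the closed form of Eq. (7) and handed to a general-purpose optimizer; here SciPy's minimize plays the role of the nonlinear minimization routine, and the reparameterization and starting values are assumptions.

```python
# Illustrative L2E objective for a univariate two-component GMM (Eqs. (5) and (7)).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def l2e_objective(params, x):
    """L2E criterion: closed-form integral of f^2 minus 2/n * sum_i f(x_i)."""
    pi1 = 1.0 / (1.0 + np.exp(-params[0]))          # mixing weight via a logistic reparameterization
    w = np.array([pi1, 1.0 - pi1])
    mu = params[1:3]
    sd = np.exp(params[3:5])                         # positive standard deviations
    # First integral of Eq. (4) via Eq. (7): sum_{k,l} pi_k pi_l phi(mu_k - mu_l | 0, var_k + var_l).
    term1 = sum(w[k] * w[l] * norm.pdf(mu[k] - mu[l], loc=0.0,
                                       scale=np.sqrt(sd[k] ** 2 + sd[l] ** 2))
                for k in range(2) for l in range(2))
    # Second term of Eq. (5): -2 n^{-1} sum_i f_theta(X_i).
    fx = w[0] * norm.pdf(x, mu[0], sd[0]) + w[1] * norm.pdf(x, mu[1], sd[1])
    return term1 - 2.0 * fx.mean()

def fit_l2e(x, start=(1.1, 0.0, 1.0, 0.0, 0.0)):
    """Minimize the L2E criterion; 'start' is an assumed starting point."""
    x = np.asarray(x, dtype=float)
    res = minimize(l2e_objective, np.asarray(start, dtype=float), args=(x,),
                   method="Nelder-Mead")
    return res.x   # (logit pi_1, mu_1, mu_2, log sd_1, log sd_2)
```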

4 Experimental Results

4.1 Performance Due to Data Contamination (Outliers)

In this section, the EM and \(L_{2}E\) parameter estimates are compared in simulations without and with data contamination (i.e., in the absence and presence of outliers/noise).

Gaussian Mixture Model with No Outliers: A GMM \(f(x)\) with two components, each a univariate Gaussian density \(\phi (x)\), is simulated as given by

$$\begin{aligned} f(x)= 0.75\phi (x|~ \mu _{1}=0,\sigma _{1}^2=1) + 0.25\phi (x|~ \mu _{2}=1,\sigma _{2}^2=1). \end{aligned}$$
(8)

The variable \(\mu \) denotes the mean and the variable \(\sigma ^2\) denotes the variance. A total of 10000 sample points from the above Gaussian mixture (see Eq. (8)) are generated and parameter estimation is performed. A total of 100 Monte Carlo simulations are performed to evaluate consistency and efficiency.
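The following sketch shows one way to generate such data sets; it is illustrative only, and the random seed is an assumption made for reproducibility.

```python
# Illustrative simulation of the two-component mixture in Eq. (8):
# 10000 points per data set, 100 Monte Carlo replications.
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

def simulate_gmm(n, rng):
    comp1 = rng.random(n) < 0.75                 # component labels with P(component 1) = 0.75
    means = np.where(comp1, 0.0, 1.0)            # mu_1 = 0, mu_2 = 1
    return rng.normal(loc=means, scale=1.0)      # sigma_1^2 = sigma_2^2 = 1

datasets = [simulate_gmm(10000, rng) for _ in range(100)]  # 100 Monte Carlo data sets
```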

Fig. 1. Boxplots of the estimated mean for \(L_{2}E\) and EM from 100 Monte Carlo simulations of a GMM model with no outliers

The boxplots of the parameter estimates of the component means for the mixture model in Eq. (8) with no data contamination are shown in Fig. 1. The results clearly show that the two solutions are comparable and close to the true values. Note that the averages of the 100 Monte Carlo estimates of the component means are close to the true values for both \(L_{2}E\) and EM.

Gaussian Mixture Model with Outliers: The second simulation extends the study by adding outliers to illustrate the robustness of \(L_{2}E\) against them. In this case, 9900 sample points from the Gaussian mixture in Eq. (8) are contaminated by adding 100 sample points (outliers) simulated from \(\phi (x|~ \mu =5,\sigma ^2=1)\). Once again, 100 Monte Carlo simulations are performed to evaluate the performance of \(L_{2}E\) and EM for consistency and efficiency.
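A minimal sketch of this contaminated design, reusing the simulate_gmm helper from the earlier sketch (again with an assumed seed):

```python
# Illustrative 1% contamination: 9900 draws from Eq. (8) plus 100 outliers
# from phi(x | mu = 5, sigma^2 = 1).
import numpy as np

rng = np.random.default_rng(0)                      # assumed seed
clean = simulate_gmm(9900, rng)                     # helper defined in the earlier sketch
outliers = rng.normal(loc=5.0, scale=1.0, size=100)
contaminated = np.concatenate([clean, outliers])
rng.shuffle(contaminated)                           # mix the outliers into the sample
```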

Fig. 2. Boxplots of the estimated mean for \(L_{2}E\) and EM from 100 Monte Carlo simulations of a GMM model with outliers

The boxplots of the parameter estimates of the component means for the mixture model in Eq. (8) with 1 % data contamination are shown in Fig. 2. The results clearly show that the outliers strongly bias the EM estimates, whereas the \(L_{2}E\) method is inherently robust to them.

4.2 Performance Due to Data/Model Mismatch

In this section, data/model mismatch is assessed. The robustness of \(L_{2}E\) and EM is investigated when the postulated model is a mixture of Gaussians (GMM) but the data are generated from a mixture with symmetric departure from component normality. The setup as described in [12, 18] is considered for the parameter estimation. More specifically, for the simulation study, a mixture with two components given by

$$\begin{aligned} f_{\varvec{\theta }_2} (x) = \pi f_{1}(x) + (1-\pi )f_{2}(x), \end{aligned}$$
(9)

is considered. Here, \(f_{1}\) is the density associated with a random variable \(X_1 = aY_{1}\) (\(a=1\) chosen for the simulation), where \(Y_{1}\) is a Student’s t random variable with \(df = 1\) degree of freedom. Similarly, \(f_{2}\) is the density associated with a random variable \(X_2 = Y_{2} + b\) (\(b=2\) chosen for the simulation), where \(Y_{2}\) is a Student’s t random variable with \(df = 4\) degrees of freedom. A total of 100 data points were generated and 50 Monte Carlo simulations were conducted to evaluate the performance of \(L_{2}E\) and EM for consistency and efficiency by calculating the Bias and Mean Square Error (MSE).
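A minimal sketch of this sampling design follows; the mixing weight \(\pi \) is not specified in the text above, so \(\pi = 0.5\) is used purely as an illustrative assumption, as is the seed.

```python
# Illustrative data/model-mismatch design of Eq. (9): a two-component Student-t mixture
# with X1 = a*Y1 (a = 1, df = 1) and X2 = Y2 + b (b = 2, df = 4).
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

def simulate_t_mixture(n, pi=0.5, a=1.0, b=2.0, rng=rng):
    comp1 = rng.random(n) < pi
    x = np.empty(n)
    x[comp1] = a * rng.standard_t(df=1, size=comp1.sum())        # heavy-tailed component f1
    x[~comp1] = rng.standard_t(df=4, size=(~comp1).sum()) + b    # shifted component f2
    return x

datasets = [simulate_t_mixture(100) for _ in range(50)]  # 100 points, 50 Monte Carlo runs
```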

Suppose T(X) is an estimate of \(\theta \). The Bias and MSE of T are defined as

$$\begin{aligned} Bias(\theta )= E_{\theta }T -\theta \end{aligned}$$
(10)
$$\begin{aligned} MSE(\theta )= E_{\theta }(T -\theta )^2=Var_{\theta }(T) +Bias^2(\theta ) \end{aligned}$$
(11)
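In a Monte Carlo study these quantities are estimated by averaging over the simulation runs; a minimal sketch (a hypothetical helper, not from the paper):

```python
# Illustrative Monte Carlo estimates of Eqs. (10)-(11) from a vector of parameter
# estimates, one entry per simulation run.
import numpy as np

def bias_mse(estimates, theta_true):
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - theta_true                # Eq. (10): E[T] - theta
    mse = np.mean((estimates - theta_true) ** 2)        # Eq. (11): E[(T - theta)^2]
    return bias, mse
```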

Note that the general shapes of the two-component postulated (Gaussian mixture) model and the two-component t-mixture model from which the data are generated are different; furthermore, the component densities in the sampling model have much heavier tails than those in the postulated (Gaussian) mixture model. Table 1 depicts the bias and the mean square error of the mean estimates provided by the \(L_{2}E\) and EM algorithms. The results show that \(L_{2}E\) is more robust than the EM approach with respect to data/model mismatch.

Table 1. Simulation results for data/model mismatch

5 Summary and Conclusions

The \(L_{2}E\) estimation technique can be easily constructed and applied to GMM and is a viable alternative to EM. Simulation studies revealed that the \(L_{2}E\) mean estimates are robust to both outliers and data/model mismatch. The competitive performance of \(L_{2}E\) makes it an attractive alternative to EM for practical applications.