
1 Introduction and Motivation

Mixture models, and in particular Gaussian Mixture Models (GMM), are commonly used for density estimation and classification. In this era of Big Data, data sets are increasingly complex and enormous in size, and mixture models offer a powerful and flexible way to represent them. A comprehensive discussion of mixture models can be found in [1, 2].

When the number of mixture components is known and the component densities are assumed to belong to a specified parametric family, the popular Expectation Maximization (EM) algorithm [3], based on Maximum Likelihood Estimation (MLE), is often used to estimate the GMM parameters. However, when there is a small perturbation in one of the component densities, the MLE becomes significantly biased and very sensitive to outliers [4]. Furthermore, when the data are not Gaussian, the EM method may not cluster a set of data points to a Gaussian with a meaningful mean vector and covariance matrix. In short, the EM-based approach is not robust when the underlying probability density function of the data does not match a mixture of Gaussians (a data/model mismatch).

To overcome this limitation, Scott [5–8] introduced an alternative minimum distance estimation method based on the integrated squared error criterion (termed \(L_{2}E\)) which avoids the use of nonparametric kernel density estimators. The \(L_{2}E\) approach is a special case of a general method introduced in [9] that is based on a whole continuum of divergence estimators that begin with the MLE and interpolate to the \(L_{2}E\) estimator. Markatou [10] used the weighted likelihood estimation approach to address the effects of data/model mismatch on parameter estimates.

In this paper, the focus is on \(L_{2}E\) as an alternative to EM for parameter estimation of models with a known finite number of mixture components. A discussion of the EM and \(L_{2}E\) approaches is given. Simulation results specific to GMM are shown to illustrate the robustness of the \(L_{2}E\) method with respect to noise in the data (particularly outliers) and data/model mismatch [11–13].

The basic notation in this paper is as follows. Let \(f_{\varvec{\theta }_m}({\varvec{x}})\) denote a general mixture probability density function with m components, given by

$$\begin{aligned} f_{\varvec{\theta }_m} ({\varvec{x}}) = \sum _{i=1}^m \pi _i f({\varvec{x}}|{\varvec{\phi }_i}) \end{aligned}$$
(1)

where \(\varvec{\theta }_m = (\pi _1, \dots , \pi _{m-1}, \pi _{m}, {\varvec{\phi }_1}^T , \dots , {\varvec{\phi }_m}^T)^T\), the weights \(\pi _i >0\), \(\sum _{i=1}^m \pi _i =1\) and \(f({\varvec{x}}|{\varvec{\phi }_i})\) is a probability density function with parameter vector \({\varvec{\phi }_i}\). In theory, the \(f({\varvec{x}}|{\varvec{\phi }_i})\) could be any parametric density, although in practice they are often from the same parametric family (usually Gaussian).
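As a small illustration of Eq. (1) (not from the paper; the component values are arbitrary), the mixture density can be evaluated as a weighted sum of component densities:

```python
# Illustrative evaluation of the mixture density in Eq. (1) for Gaussian components.
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, sds):
    """f_theta(x) = sum_i pi_i * phi(x | mu_i, sigma_i) for a univariate Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, sds))

# Example values (assumptions for illustration): the mixture later used in Eq. (8).
fx = mixture_pdf([0.0, 1.0, 2.0], weights=[0.75, 0.25], means=[0.0, 1.0], sds=[1.0, 1.0])
```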

2 EM Algorithm

The Expectation-Maximization (EM) algorithm [3] is broadly based on the iterative computation of MLE. The EM method alternates between two steps:

  1. Expectation (E) step: compute an expectation of the likelihood by including the latent variables as if they were observed, and

  2. Maximization (M) step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.

The parameters found in the M step are then used to begin another E step and the process is repeated.

For finite mixture models, the observed data samples \({\varvec{X}} = \left\{ {\varvec{x}_1},{\cdots },{\varvec{x}_n}\right\} \) are viewed as incomplete. The complete data are obtained as \({\varvec{Z}} = \left\{ {\varvec{x}_i},{\varvec{y}_i}\right\} \) for \(i=1\) to n, where \({\varvec{y}_i} = ({y_{i1}},{\cdots },{y_{im}})^{T}\) is a latent (unobserved or missing) indicator vector with \(y_{ij}=1\) if \(\varvec{x}_i\) is from mixture component j and zero otherwise. The log-likelihood of \({\varvec{Z}}\) is defined by

$$\begin{aligned} L(\varvec{\theta }_m|\varvec{Z})= \sum _{i=1}^n \sum _{j=1}^m y_{ij}\log [\pi _{j} f({\varvec{x}_i}|{\varvec{\phi }_j})] \end{aligned}$$
(2)

The EM algorithm obtains a sequence of estimates \(\varvec{\theta }^{(t)},t=0,1,\cdots \) by alternating the E-Step and the M-Step until some convergence criterion is met.

  1. E-Step: Calculate the Q function, the conditional expectation of the complete log-likelihood, given \({\varvec{X}}\) and the current estimate \(\varvec{\theta }^{(t)}\).

  2. M-Step: Update the estimate of the parameters by maximizing the Q function (a minimal sketch of these two steps follows this list).
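A minimal EM sketch for a univariate GMM is given below. It is an illustrative implementation, not the paper's code: the responsibilities play the role of the expected indicators \(y_{ij}\) in the E-step, and the M-step uses the closed-form GMM updates; the iteration count, tolerance, and initialization are assumptions.

```python
# Illustrative EM sketch for a univariate m-component GMM (not the paper's code).
import numpy as np
from scipy.stats import norm

def em_gmm(x, weights, means, variances, n_iter=500, tol=1e-8):
    """Return EM estimates (weights, means, variances) for a univariate GMM."""
    x = np.asarray(x, dtype=float)
    w, mu, var = (np.array(v, dtype=float) for v in (weights, means, variances))
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. conditional expectations of y_ij given x and theta^(t).
        dens = np.array([w_j * norm.pdf(x, mu_j, np.sqrt(v_j))
                         for w_j, mu_j, v_j in zip(w, mu, var)])      # shape (m, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: closed-form updates that maximize the Q function for a GMM.
        n_j = resp.sum(axis=1)
        w_new = n_j / x.size
        mu_new = (resp @ x) / n_j
        var_new = (resp * (x[None, :] - mu_new[:, None]) ** 2).sum(axis=1) / n_j
        converged = np.max(np.abs(mu_new - mu)) < tol
        w, mu, var = w_new, mu_new, var_new
        if converged:
            break
    return w, mu, var
```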

In the case of GMM, maximizing Q provides an explicit solution. In most instances, EM has the advantages of reliable global convergence, low cost per iteration, economy of storage, ease of programming, and heuristic appeal. However, its convergence can be very slow, even in simple problems that are often encountered in practice. Also, when there is a small perturbation in one of the component densities due to noise in the data, the MLE estimates become highly unstable due to the lack of robustness to outliers. For the case of GMM [14], this can be seen easily, since maximization of the likelihood function under an assumed Gaussian distribution is equivalent to finding the least-squares solution, whose lack of robustness is well known. As a robust alternative, we discuss an approach based on the minimization of the integrated squared distance, namely \(L_{2}E\).

3 Robust \(L_{2}E\) Estimator

The integrated squared distance has been used as the goodness-of-fit criterion in nonparametric density estimation for a long time. In the classic papers of Scott [6, 7], an alternative minimum distance estimation method based on the integrated squared error criterion, termed \(L_{2}E\), was introduced and has the following attributes.

  1. The use of nonparametric kernel density estimators is avoided.

  2. The \(L_{2}E\) is especially suited for parameter-rich models such as mixture models.

  3. The \(L_{2}E\) approach, whose genesis can be traced to the pioneering work of Rudemo [15] and Bowman [16], is computationally feasible and leads to robust estimators.

  4. The \(L_{2}E\) belongs to a special class of robust estimators, like the median-based estimators, which sacrifice some asymptotic efficiency for substantial computational benefits in difficult estimation problems.

  5. The \(L_{2}E\) estimator performs much better than other robust estimators, such as minimum Hellinger distance (MHD) estimators, under severe data contamination.

The \(L_{2}E\) estimator belongs to the family of minimum density power divergence (MDPD) estimators introduced in [9] with the tuning parameter \(\alpha =1\). The tuning parameter \(\alpha \) in an MDPD estimator controls the trade-off between robustness and efficiency. It is also shown that the robustness of the \(L_{2}E\) estimator is achieved at a fairly stiff price in asymptotic efficiency [9]. For the normal, exponential and Poisson distributions with small values of \(\alpha \le 0.10\), the MDPD has strong robustness properties and retains high asymptotic relative efficiency (ARE) with respect to MLE. However, within the family of density-based power divergence measures, the \(L_{2}E\) approach has the distinct advantage that a key integral can be computed in closed form, especially for Gaussian mixtures.
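For concreteness, the MDPD criterion of [9] may be written (up to terms that do not depend on \(\varvec{\theta }_m\), in the standard density power divergence form) as

$$\begin{aligned} \hat{\varvec{\theta }}^{MDPD}_{m} = \arg \min _{\varvec{\theta }_m} \left[ \int _{-\infty }^\infty f^{1+\alpha }_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} - \left( 1+\frac{1}{\alpha }\right) n^{-1} \sum _{i=1}^n f^{\alpha }_{\varvec{\theta }_m}({\varvec{X}}_i)\right] , \qquad \alpha > 0, \end{aligned}$$

so that setting \(\alpha =1\) gives \(\int f^{2}_{\varvec{\theta }_m}({\varvec{x}}) d{\varvec{x}} - 2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\), which is exactly the \(L_{2}E\) criterion that appears later in Eq. (5).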

3.1 \(L_{2}E\) Algorithm

Given the true probability density \(g({\varvec{x}})\) and the finite mixture with m components, \(f_{\varvec{\theta }_m}({\varvec{x}})\), consider the \(L_{2}\) distance between \(f_{\varvec{\theta }_m}\) and \(g({\varvec{x}})\) as given by

$$\begin{aligned} L_{2}(f_{\varvec{\theta }_m},g({\varvec{x}})) = \int _{-\infty }^\infty [f_{\varvec{\theta }_m}({\varvec{x}}) - g({\varvec{x}})]^2 d{\varvec{x}}. \end{aligned}$$
(3)

The aim is to derive an estimate of \(\varvec{\theta }_m\) that minimizes the \(L_{2}\) distance [5–7, 11–13]. Expanding Eq. (3) gives

$$\begin{aligned} L_{2}(f_{\varvec{\theta }_m},g({\varvec{x}})) = \int _{-\infty }^\infty f_{\varvec{\theta }_m}^2({\varvec{x}}) d{\varvec{x}} - 2\int _{-\infty }^\infty f_{\varvec{\theta }_m}({\varvec{x}})g({\varvec{x}}) d{\varvec{x}} + \int _{-\infty }^\infty g({\varvec{x}})^2 d{\varvec{x}} \end{aligned}$$
(4)

where the last integral is a constant with respect to \(\varvec{\theta }_m\) and therefore may be ignored in the minimization. The first integral in Eq. (4) is often available as a closed-form expression that, for Gaussian mixtures, may be evaluated for any specified value of \(\varvec{\theta }_m\), as shown later in Eq. (7). The second integral in Eq. (4) is the expected height of the model density under the true density \(g\); estimating it by the sample average gives the term \(-2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\), where \({\varvec{X}}_i\) is a sample observation. Based on the above analysis, the \(L_{2}E\) estimator of \(\varvec{\theta }_m\) is given by

$$\begin{aligned} \hat{\varvec{\theta }}^{L_{2}E}_{m} = \arg \min _{\varvec{\theta }_m} \left[ \int _{-\infty }^\infty f^{2}_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} - 2n^{-1} \sum _{i=1}^n f_{\varvec{\theta }_m}({\varvec{X}}_i)\right] . \end{aligned}$$
(5)

3.2 GMM Models

For multivariate Gaussian mixtures,

$$\begin{aligned} f({\varvec{x}}|{\varvec{\phi }_i}) = \phi ({\varvec{x}}|~ {\varvec{\mu }_{i}},{\varSigma _{i}}) \end{aligned}$$
(6)

where \({\varvec{\mu }_{i}}\) is the mean vector and \(\varSigma _{i}\) is the covariance matrix for component i. In this case, the problem reduces to finding the \(L_{2}E\) estimator for a Gaussian Mixture Model (GMM). Now, the first integral in Eq. (4) reduces to

$$\begin{aligned} \int _{-\infty }^\infty f^{2}_{\varvec{\theta }_m} ({\varvec{x}}) d{\varvec{x}} = \sum _{k=1}^{m}\sum _{l=1}^{m} \pi _{k} \pi _{l}~~ \phi ({\varvec{\mu }_{k}} - {\varvec{\mu }_{l}}|~0,{\varSigma _{k} +\varSigma _{l}}), \end{aligned}$$
(7)

thereby making Eq. (4) tractable for minimization and significantly reducing the computations involved in obtaining the \(L_{2}E\) estimator. Since this is a computationally feasible closed-form expression, estimation of the GMM parameters by the \(L_{2}E\) procedure may be performed by any standard nonlinear optimization algorithm [5, 6, 11–13]. In this work, we used the ‘nlminb’ nonlinear minimization routine in [17]; a sketch of the objective is given below.
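As an illustration (not the implementation used in the paper, which relies on the routine cited above), the \(L_{2}E\) objective of Eq. (5) for a univariate two-component GMM can be coded using the closed form of Eq. (7) and handed to a general-purpose optimizer; here SciPy's minimize plays the role of the nonlinear minimization routine, and the reparameterization and starting values are assumptions.

```python
# Illustrative L2E objective for a univariate two-component GMM (Eqs. (5) and (7)).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def l2e_objective(params, x):
    """L2E criterion: closed-form integral of f^2 minus 2/n * sum_i f(x_i)."""
    pi1 = 1.0 / (1.0 + np.exp(-params[0]))          # mixing weight via a logistic reparameterization
    w = np.array([pi1, 1.0 - pi1])
    mu = params[1:3]
    sd = np.exp(params[3:5])                         # positive standard deviations
    # First integral of Eq. (4) via Eq. (7): sum_{k,l} pi_k pi_l phi(mu_k - mu_l | 0, var_k + var_l).
    term1 = sum(w[k] * w[l] * norm.pdf(mu[k] - mu[l], loc=0.0,
                                       scale=np.sqrt(sd[k] ** 2 + sd[l] ** 2))
                for k in range(2) for l in range(2))
    # Second term of Eq. (5): -2 n^{-1} sum_i f_theta(X_i).
    fx = w[0] * norm.pdf(x, mu[0], sd[0]) + w[1] * norm.pdf(x, mu[1], sd[1])
    return term1 - 2.0 * fx.mean()

def fit_l2e(x, start=(1.1, 0.0, 1.0, 0.0, 0.0)):
    """Minimize the L2E criterion; 'start' is an assumed starting point."""
    x = np.asarray(x, dtype=float)
    res = minimize(l2e_objective, np.asarray(start, dtype=float), args=(x,),
                   method="Nelder-Mead")
    return res.x   # (logit pi_1, mu_1, mu_2, log sd_1, log sd_2)
```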

4 Experimental Results

4.1 Performance Due to Data Contamination (Outliers)

In this section, the EM and \(L_{2}E\) parameter estimates are compared in simulations without and with data contamination (i.e., in the absence and presence of outliers/noise).

Gaussian Mixture Model with No Outliers: A GMM \(f(x)\) with two components, each a univariate Gaussian density \(\phi (x)\), is simulated as given by

$$\begin{aligned} f(x)= 0.75\phi (x|~ \mu _{1}=0,\sigma _{1}^2=1) + 0.25\phi (x|~ \mu _{2}=1,\sigma _{2}^2=1). \end{aligned}$$
(8)

The variable \(\mu \) denotes the mean and the variable \(\sigma ^2\) denotes the variance. A total of 10000 sample points from the above Gaussian mixture (see Eq. (8)) are generated and parameter estimation is performed. A total of 100 Monte Carlo simulations are performed to evaluate consistency and efficiency.
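The following sketch shows one way to generate such data sets; it is illustrative only, and the random seed is an assumption made for reproducibility.

```python
# Illustrative simulation of the two-component mixture in Eq. (8):
# 10000 points per data set, 100 Monte Carlo replications.
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

def simulate_gmm(n, rng):
    comp1 = rng.random(n) < 0.75                 # component labels with P(component 1) = 0.75
    means = np.where(comp1, 0.0, 1.0)            # mu_1 = 0, mu_2 = 1
    return rng.normal(loc=means, scale=1.0)      # sigma_1^2 = sigma_2^2 = 1

datasets = [simulate_gmm(10000, rng) for _ in range(100)]  # 100 Monte Carlo data sets
```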

Fig. 1. Boxplots of the estimated mean for \(L_{2}E\) and EM from 100 Monte Carlo simulations of a GMM model with no outliers

The boxplots of the parameter estimates of the component means for the mixture model in Eq. (8) with no data contamination are shown in Fig. 1. The results clearly show that the two solutions are comparable and close to the true values. Note that the averages of the 100 Monte Carlo estimates of the component means are close to the true values for both \(L_{2}E\) and EM.

Gaussian Mixture Model with Outliers: The second simulation extends the study by adding outliers to illustrate the robustness of \(L_{2}E\) against them. In this case, 9900 sample points from the Gaussian mixture in Eq. (8) are contaminated by adding 100 sample points (outliers) simulated from \(\phi (x|~ \mu =5,\sigma ^2=1)\). Once again, 100 Monte Carlo simulations are performed to evaluate the performance of \(L_{2}E\) and EM for consistency and efficiency.
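A minimal sketch of this contaminated design, reusing the simulate_gmm helper from the earlier sketch (again with an assumed seed):

```python
# Illustrative 1% contamination: 9900 draws from Eq. (8) plus 100 outliers
# from phi(x | mu = 5, sigma^2 = 1).
import numpy as np

rng = np.random.default_rng(0)                      # assumed seed
clean = simulate_gmm(9900, rng)                     # helper defined in the earlier sketch
outliers = rng.normal(loc=5.0, scale=1.0, size=100)
contaminated = np.concatenate([clean, outliers])
rng.shuffle(contaminated)                           # mix the outliers into the sample
```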

Fig. 2. Boxplots of the estimated mean for \(L_{2}E\) and EM from 100 Monte Carlo simulations of a GMM model with outliers

The boxplots of the parameter estimates of the component means for the mixture model in Eq. (8) with 1 % data contamination are shown in Fig. 2. The results clearly show that the outliers strongly bias the EM estimates, whereas the \(L_{2}E\) method is inherently robust to them.

4.2 Performance Due to Data/Model Mismatch

In this section, data/model mismatch is assessed. The robustness of \(L_{2}E\) and EM is investigated when the postulated model is a mixture of Gaussians (GMM) but the data are generated from a mixture with symmetric departure from component normality. The setup as described in [12, 18] is considered for the parameter estimation. More specifically, for the simulation study, a mixture with two components given by

$$\begin{aligned} f_{\varvec{\theta }_2} (x) = \pi f_{1}(x) + (1-\pi )f_{2}(x), \end{aligned}$$
(9)

is considered. Here, \(f_{1}\) is the density associated with a random variable \(X_1 = aY_{1}\) (\(a=1\) chosen for the simulation), where \(Y_{1}\) is a Student’s t random variable with \(df = 1\) degree of freedom. Similarly, \(f_{2}\) is the density associated with a random variable \(X_2 = Y_{2} + b\) (\(b=2\) chosen for the simulation), where \(Y_{2}\) is a Student’s t random variable with \(df = 4\) degrees of freedom. A total of 100 data points were generated and 50 Monte Carlo simulations were conducted to evaluate the performance of \(L_{2}E\) and EM for consistency and efficiency by calculating the Bias and Mean Square Error (MSE).
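A minimal sketch of this sampling design follows; the mixing weight \(\pi \) is not specified in the text above, so \(\pi = 0.5\) is used purely as an illustrative assumption, as is the seed.

```python
# Illustrative data/model-mismatch design of Eq. (9): a two-component Student-t mixture
# with X1 = a*Y1 (a = 1, df = 1) and X2 = Y2 + b (b = 2, df = 4).
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

def simulate_t_mixture(n, pi=0.5, a=1.0, b=2.0, rng=rng):
    comp1 = rng.random(n) < pi
    x = np.empty(n)
    x[comp1] = a * rng.standard_t(df=1, size=comp1.sum())        # heavy-tailed component f1
    x[~comp1] = rng.standard_t(df=4, size=(~comp1).sum()) + b    # shifted component f2
    return x

datasets = [simulate_t_mixture(100) for _ in range(50)]  # 100 points, 50 Monte Carlo runs
```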

Suppose T(X) is an estimate of \(\theta \). The Bias and MSE of T are defined as

$$\begin{aligned} Bias(\theta )= E_{\theta }T -\theta \end{aligned}$$
(10)
$$\begin{aligned} MSE(\theta )= E_{\theta }(T -\theta )^2=Var_{\theta }(T) +Bias^2(\theta ) \end{aligned}$$
(11)
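In a Monte Carlo study these quantities are estimated by averaging over the simulation runs; a minimal sketch (a hypothetical helper, not from the paper):

```python
# Illustrative Monte Carlo estimates of Eqs. (10)-(11) from a vector of parameter
# estimates, one entry per simulation run.
import numpy as np

def bias_mse(estimates, theta_true):
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - theta_true                # Eq. (10): E[T] - theta
    mse = np.mean((estimates - theta_true) ** 2)        # Eq. (11): E[(T - theta)^2]
    return bias, mse
```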

Note that the general shapes of the two-component postulated (Gaussian mixture) model and the two-component t-mixture model from which the data are generated are different; furthermore, the component densities in the sampling model have much heavier tails than those in the postulated (Gaussian) mixture model. Table 1 depicts the bias and the mean square error of the mean estimates provided by the \(L_{2}E\) and EM algorithms. The results show that \(L_{2}E\) is more robust than the EM approach with respect to data/model mismatch.

Table 1. Simulation results for data/model mismatch

5 Summary and Conclusions

The \(L_{2}E\) estimation technique can be easily constructed and applied to GMM and is a viable alternative to EM. Simulation studies revealed that the \(L_{2}E\) mean estimates are robust to both outliers and data/model mismatch. The competitive performance of \(L_{2}E\) makes it an attractive alternative to EM for practical applications.