1 Introduction

Multivariate normal mixture models represent a very popular tool for both density estimation and clustering (McLachlan and Peel 2004). The parameters of a mixture model are commonly estimated by maximum likelihood, typically via the EM algorithm (Dempster et al. 1977). Let \(y=(y_1,y_2,\ldots ,y_n)^{^\top }\) be a random sample of size n. The mixture likelihood can be expressed as

$$\begin{aligned} L(y; \tau )=\prod _{i=1}^n\sum _{k=1}^K \pi _k\phi _p(y_i; \mu _k,\varSigma _k) \ , \end{aligned}$$
(1)

where \(\tau =(\pi , \mu _1, \ldots , \mu _K, \varSigma _1, \ldots , \varSigma _K)\), \(\phi _p(\cdot ;\cdot )\) is the p-dimensional multivariate normal density, \(\pi =(\pi _1,\ldots ,\pi _K)\) denotes the vector of prior membership probabilities and \((\mu _k,\varSigma _k)\) are the mean vector and variance-covariance matrix of the \(k\mathrm{th}\) component, respectively. Rather than using the likelihood in (1), the EM algorithm works with the complete likelihood function

$$\begin{aligned} L^c(y;\tau )=\prod _{i=1}^n\prod _{k=1}^K \left[ \pi _k\phi _p(y_i; \mu _k,\varSigma _k) \right] ^{u_{ik}}\ , \end{aligned}$$
(2)

where \(u_{ik}\) is an indicator of the \(i\mathrm{th}\) unit belonging to the \(k\mathrm{th}\) component. The EM algorithm iteratively alternates between two steps: expectation (E) and maximization (M). In the E-step, the posterior expectation of (2) is evaluated by setting \(u_{ik}\) equal to the posterior probability that \(y_i\) belongs to the \(k\mathrm{th}\) component, i.e.

$$\begin{aligned} u_{ik}\propto \pi _k\phi _p(y_i; \mu _k,\varSigma _k) \ , \end{aligned}$$

whereas at the M-step \(\pi \), \(\mu _k\) and \(\varSigma _k\) are estimated conditionally on \(u_{ik}\).

An alternative strategy is given by the penalized classification EM (CEM) algorithm (Symon 1977; Bryant 1991; Celeux and Govaert 1993): the substantial difference is that the E-step is followed by a C-step (where C stands for classification) in which \(u_{ik}\) is estimated as either 0 or 1, meaning that each unit is assigned to the most likely component, conditionally on the current parameters’ values, i.e. \(k_i=\mathrm {argmax}_k u_{ik}\), \(u_{ik_i}=1\) and \(u_{ik}=0\) for \(k\ne k_i\). The classification approach is aimed at maximizing the corresponding classification likelihood (2) over both the mixture parameters and the individual components’ labels. When \(\pi _k=1/K\), the standard CEM algorithm is recovered. A detailed comparison of the EM and CEM algorithms can be found in Celeux and Govaert (1993).

When the sample data are prone to contamination and several unexpected outliers occur with respect to (w.r.t.) the assumed mixture model, maximum likelihood is likely to lead to unrealistic estimates and to fail in recovering the underlying clustering structure of the data (see Farcomeni and Greco 2015a, for a recent account). In the presence of noisy data that depart from the underlying mixture model, there is the need to replace maximum likelihood with a suitable robust procedure, leading to estimates and clustering rules that are not badly affected by contamination.

The need for robust tools in the estimation of mixture models was first addressed by Campbell (1984), who suggested replacing standard maximum likelihood with M-estimation. In a more general fashion, Farcomeni and Greco (2015b) proposed to resort to multivariate S-estimation of location and scatter in the M-step. Actually, the authors focused their attention on hidden Markov models, but their approach can be adapted from dynamic to static finite mixtures. According to such strategies, each data point is attached a weight lying in [0, 1] (a strategy commonly addressed as soft trimming). An alternative approach to robust fitting and clustering is based on hard trimming procedures, i.e. a crisp weight in \(\left\{ 0, 1\right\} \) is attached to each observation: atypical observations are expected to be trimmed, and the model is fitted by using a subset of the original data. The tclust methodology (Garcia-Escudero et al. 2008; Fritz et al. 2013) is particularly appealing: model parameters are estimated by developing a penalized CEM algorithm augmented with an impartial trimming step. Very recent extensions have been discussed in Dotto et al. (2016), who proposed a reweighted trimming procedure (rtclust), and Dotto and Farcomeni (2019), in which trimming has been introduced in parsimonious model-based clustering (mtclust). A related proposal has been presented in Neykov et al. (2007) based on the so-called trimmed likelihood methodology. Furthermore, it is worth mentioning that mixture model estimation and clustering can also be implemented by using the adaptive hard trimming strategy characterizing the Forward Search (Atkinson et al. 2013).

There are also robust proposals that are not based on soft or hard trimming procedures. Some of them are characterized by the use of flexible components in the mixture. The idea is that of embedding the Gaussian mixture in a supermodel: McLachlan and Peel (2004) introduced a mixture of Student’s t distributions, a mixture of skewed Student’s t distributions has been proposed in Lin (2010) and Lee and McLachlan (2014), whereas Fraley and Raftery (1998, 2002) considered an additional component modeled as a Poisson process to handle noisy data (the method is available from package mclust (Fraley et al. 2012) in R (R Core Team 2019)). A robust approach, named otrimle, has been proposed recently by Coretto and Hennig (2016, 2017), who considered the addition of an improper uniform mixture component to accommodate outliers.

We propose a robust version of both the EM and the penalized CEM algorithms to fit a mixture of multivariate Gaussian components based on soft trimming, in which weights are evaluated according to the weighted likelihood methodology (Markatou et al. 1998). A first attempt in this direction has been pursued by Markatou (2000). Here, that approach has been developed further and made more general leading to a newly established technique, in which weights are based on the recent results stated in Agostinelli and Greco (2018). The methodology leads to a robust fit and is aimed at providing both cluster assignment of genuine data points and outlier detection rules. Data points flagged as anomalous are not meant to be classified into any of the clusters. Furthermore, a relevant aspect of our proposal is represented by the introduction of constraints, not considered in Markatou (2000), aimed at avoiding local or spurious solutions (Fritz et al. 2013).

Some necessary preliminaries on weighted likelihood estimation are given in Sect. 2. The weighted EM and penalized CEM algorithms are introduced in Sect. 3: computational details concerning constraints, initialization and tuning are discussed, and classification and outlier detection rules are outlined. Section 4 states asymptotic results, whereas Sect. 5 is devoted to model selection. Numerical studies are presented in Sect. 6, and real data examples are discussed in Sect. 7.

2 Background

Let us assume a mixture model composed of K heterogeneous multivariate Gaussian components, where K is fixed in advance, with density function denoted by \(m(y;\tau )=\sum _{j=1}^K\pi _j\phi _p(y;\mu _j, \varSigma _j)\). Markatou (2000) suggested working with the following weighted likelihood estimating equation (WLEE) in the M-step of the EM algorithm:

$$\begin{aligned} \sum _{i=1}^n w_i \sum _{j=1}^K u_{ij}\frac{\partial }{\partial \tau }\left[ \log \pi _j +\log \phi _p(y_i;\mu _j, \varSigma _j)\right] =0 \ . \end{aligned}$$
(3)

We notice that maximum likelihood equations are replaced by weighted equations. The weights are defined as

$$\begin{aligned} w_i=w(\delta (y_i)) = \frac{\left[ A(\delta (y_i)) + 1\right] ^+}{\delta (y_i) + 1} \ , \end{aligned}$$
(4)

where \([\cdot ]^+\) denotes the positive part, \(\delta (y)\) is the Pearson residual function and \(A(\delta )\) is the residual adjustment function (RAF, Basu and Lindsay 1994). The Pearson residual gives a measure of the agreement between the assumed model \(m(y; \tau )\) and the data, summarized by a nonparametric density estimate \({\hat{m}}_n(y)=n^{-1}\sum _{i=1}^n k(y; y_i, h)\), based on a kernel \(k(y; t, h)\) indexed by a bandwidth h, that is

$$\begin{aligned} \delta (y) = \frac{{\hat{m}}_n(y) }{m(y; \tau )}-1 \ , \end{aligned}$$
(5)

with \(\delta \in [-1, \infty )\). In the construction of Pearson residuals, Markatou (2000) suggested using a smoothed model density in the continuous case, obtained with the same kernel involved in nonparametric density estimation (see Basu and Lindsay 1994; Markatou et al. 1998, for general results), i.e.

$$\begin{aligned} m^*(y; \tau )=\int k(y;t,h) m(t;\tau ) \mathrm{d}t. \end{aligned}$$

When the model is correctly specified, the Pearson residual function (5) evaluated at the true parameter value converges almost surely to zero, whereas, otherwise, for each value of the parameters, large Pearson residuals detect regions where the observation is unlikely to occur under the assumed model. The weight function (4) can be chosen to be unimodal so that it declines smoothly as the residual \(\delta (y)\) departs from zero. Hence, those observations lying in such regions are attached a weight that decreases with increasing Pearson residual. Large Pearson residuals and small weights will correspond to data points that are likely to be outliers. The RAF serves to bound the effect of large residuals on the fitting procedure, much as the Huber and Tukey bisquare functions bound large distances in M-estimation; we assume it is such that \(|A(\delta )|<|\delta |\). Here, we consider the family of RAFs based on the power divergence measure

$$\begin{aligned} A_{pdm}(\delta ) = \left\{ \begin{array}{lc} \nu \left( (\delta + 1)^{1/\nu } - 1 \right) &{}\quad \nu < \infty \\ \log (\delta + 1) &{}\quad \nu \rightarrow \infty \end{array} \right. \end{aligned}$$

Special cases are maximum likelihood (\(\nu = 1\), as the weights become all equal to one), Hellinger distance (\(\nu = 2\)), Kullback–Leibler divergence (\(\nu \rightarrow \infty \)) and Neyman’s Chi-square (\(\nu =-1\)). Another example is given by the generalized Kullback–Leibler divergence (GKL) defined as

$$\begin{aligned} A_{gkl}(\delta )=\log (\nu \delta +1)/\nu , \ 0\le \nu \le 1. \end{aligned}$$

Maximum likelihood is a special case when \(\nu \rightarrow 0\) and Kullback–Leibler divergence is obtained for \(\nu =1\).
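To fix ideas, the following R sketch (R being the language of our implementation) evaluates these RAFs and the weight function (4); the function names are ours and purely illustrative.

```r
# Residual adjustment functions: power divergence (index nu) and generalized
# Kullback-Leibler (GKL, 0 <= nu <= 1); nu = 1 (PDM) and nu -> 0 (GKL) recover
# maximum likelihood, i.e. unit weights.
A_pdm <- function(delta, nu) {
  if (is.infinite(nu)) log(delta + 1) else nu * ((delta + 1)^(1 / nu) - 1)
}
A_gkl <- function(delta, nu) {
  if (nu == 0) delta else log(nu * delta + 1) / nu
}

# Weight function (4): w(delta) = [A(delta) + 1]^+ / (delta + 1).
w_raf <- function(delta, A = A_gkl, ...) pmax(A(delta, ...) + 1, 0) / (delta + 1)

# Weights decline smoothly as the Pearson residual departs from zero.
round(w_raf(c(-0.5, 0, 1, 5, 20), A = A_gkl, nu = 0.9), 3)
```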

The shape of the kernel function has a very limited effect on weighted likelihood estimation. On the contrary, the smoothing parameter h directly affects the robustness/efficiency trade-off of the methodology in finite samples. Actually, large values of h lead to Pearson residuals all close to zero and weights all close to one and, hence, large efficiency, since the kernel density estimate is stochastically close to the postulated (smoothed) model. On the other hand, small values of h make the kernel density estimate more sensitive to the occurrence of outliers and the Pearson residuals become large for those data points that are in disagreement with the model. In other words, in finite samples more smoothing will lead to higher efficiency but larger bias under contamination.

2.1 Multivariate estimation

The computation of weights based on the Pearson residuals given in (5) becomes troublesome with growing dimensions since the data are more sparse and multivariate kernel density estimation may become unfeasible. In order to circumvent this curse of dimensionality, Agostinelli and Greco (2018) proposed a novel technique which is based on the Mahalanobis distances

$$\begin{aligned} d=d(y;\mu ,\varSigma )=[(y-\mu )^{^\top }\varSigma ^{-1}(y-\mu )]^{1/2} \ . \end{aligned}$$

Then, Pearson residuals can be evaluated by comparing a univariate kernel density estimate based on the squared distances with their underlying \(\chi ^2_p\) distribution at the assumed multivariate normal model, rather than working with multivariate data and multivariate kernel density estimates, that is

$$\begin{aligned} \delta (y)=\frac{{\hat{m}}_n\left( d^2\right) }{m_{\chi ^2_p}\left( d^2\right) }-1 \ , \end{aligned}$$
(6)

where

$$\begin{aligned} {\hat{m}}_n(t)=n^{-1}\sum _{i=1}^n k(t; d_i^2, h) \end{aligned}$$

is a univariate kernel density estimate over \((0, \infty )\), unbiased at the boundary, and \(m_{\chi ^2_p}(t)\) denotes the \(\chi ^2_p\) density function. It is worth noting that Pearson residuals can be evaluated w.r.t. the original \(\chi ^2_p\) density, so avoiding model smoothing (see also Kuchibhotla and Basu 2015, 2018). Assumptions and proofs concerning existence, convergence and asymptotic normality of the WLE of multivariate location and scatter have also been established (see the Supplementary material of Agostinelli and Greco 2018).
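As an illustration of (6), the following R sketch computes Pearson residuals for a single Gaussian component from the squared Mahalanobis distances; the boundary correction adopted here is the log-transform-and-back-transform device also mentioned in Sect. 7.2, so the bandwidth h acts on the log scale, and the helper name is ours.

```r
# Pearson residuals (6): compare a univariate kernel density estimate of the
# squared Mahalanobis distances with the chi^2_p density. Boundary correction:
# estimate the density of log(d^2) with a Gaussian kernel, then back-transform.
pearson_resid <- function(y, mu, Sigma, h) {
  p  <- ncol(y)
  d2 <- mahalanobis(y, center = mu, cov = Sigma)   # squared distances
  ld <- log(d2)
  f_log <- sapply(ld, function(t) mean(dnorm(t, mean = ld, sd = h)))
  m_hat <- f_log / d2                              # Jacobian of the back-transform
  m_hat / dchisq(d2, df = p) - 1
}

# Weights then follow from the RAF sketched in Sect. 2, e.g.
# w <- w_raf(pearson_resid(y, mu, Sigma, h = 0.3), A = A_gkl, nu = 0.9)
```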

3 Weighted likelihood mixture modeling

The technique for weighted likelihood mixture modeling proposed by Markatou (2000) exhibits the same limitations that have been highlighted in Agostinelli and Greco (2018) in the case of weighted likelihood estimation of multivariate location and scatter. The main drawbacks stem from the use of multivariate kernels.

The availability of consistent estimators of multivariate location and scatter based on the Pearson residuals (6) is the starting point to build a weighted likelihood methodology to fit the mixture model robustly that is also capable of handling situations in which the number of features is large enough. Therefore, by exploiting the approach developed in Agostinelli and Greco (2018), we propose both a weighted EM algorithm and a weighted penalized CEM algorithm whose M-steps are characterized by a WLEE based on the Pearson residuals (6).

It is worth noticing that the method is expected to work in large dimensions, even if it is still over-parameterized in high-dimensional spaces. The technique is meant for and confined to the \(n>p\) case and to dimensions that still allow evaluation of Mahalanobis distances. The development of weighted likelihood methodologies for model-based clustering in very large dimensions and in the \(n<p\) situation is beyond the scope of the present work.

The weighted EM algorithm (WEM) is structured as follows:

  1.

    Initialization

    $$\begin{aligned} \tau ^{(0)}=(\pi ^{(0)}, \mu _1^{(0)}, \ldots , \mu _K^{(0)}, \varSigma _1^{(0)}, \ldots , \varSigma _K^{(0)}) \ . \end{aligned}$$

    Details on the sensitivity of the results to different initializations and the selection of the best solution will be given in Sect. 3.5.

  2.

    E-step the standard E-step is left unchanged, with

    $$\begin{aligned} u_{ik}^{(s)}=\frac{\pi _k^{(s-1)}\phi _p\left( y_i; \mu _k^{(s-1)},\varSigma _k^{(s-1)}\right) }{\sum _{j=1}^K \pi _j^{(s-1)}\phi _p\left( y_i; \mu _j^{(s-1)},\varSigma _j^{(s-1)}\right) } \end{aligned}$$
  3.

    Weighted M-step based on current parameter estimates,

    (a)

      Soft trimming let us evaluate component-wise Mahalanobis-type distances

      $$\begin{aligned} d_{ik}^{(s)}=d\left( y_i; \mu _k^{(s-1)}, \varSigma _k^{(s-1)}\right) \ . \end{aligned}$$

      Then, for each group, compute Pearson residuals and weights as

      $$\begin{aligned} \delta _{ik}^{(s)}=\frac{{\hat{m}}_n\left( d_{ik}^{(s)^2}\right) }{m_{\chi ^2_p}\left( d_{ik}^{(s)^2}\right) }-1 \end{aligned}$$

      and

      $$\begin{aligned} w_{ik}^{(s)}=\frac{\left[ A\left( \delta _{ik}^{(s)}\right) +1\right] ^+}{\delta _{ik}^{(s)}+1} \end{aligned}$$

      respectively.

    (b)

      Update membership probabilities and component-specific parameter estimates

      $$\begin{aligned} \pi _k^{(s+1)}= & {} \frac{\sum \nolimits _{i=1}^n u_{ik}^{(s)}w_{ik}^{(s)}}{\sum _{i=1}^n \sum _{j=1}^K u_{ij}^{(s)}w_{ij}^{(s)}}\\ \mu _k^{(s+1)}= & {} \frac{\sum \nolimits _{i=1}^n y_i w_{ik}^{(s)}u_{ik}^{(s)}}{\sum _{i=1}^n w_{ik}^{(s)}u_{ik}^{(s)}}\\ \varSigma _k^{(s+1)}= & {} \frac{\sum \nolimits _{i=1}^n \left( y_i-\mu _k^{(s+1)}\right) \left( y_i-\mu _k^{(s+1)}\right) ^{^\top } w_{ik}^{(s)}u_{ik}^{(s)}}{\sum _{i=1}^n w_{ik}^{(s)}u_{ik}^{(s)}} \end{aligned}$$
    (c)

      Set \(\tau ^{(s+1)}=\left( \pi ^{(s+1)}, \mu _1^{(s+1)}, \ldots , \mu _K^{(s+1)}, \varSigma _1^{(s+1)}, \ldots , \varSigma _K^{(s+1)}\right) \).

It is worth noting that at the M-step it is proposed to solve the following WLEE

$$\begin{aligned} \sum _{i=1}^n \sum _{j=1}^K u_{ij}\frac{\partial }{\partial \tau }\left[ \log \pi _j +\log \phi _p(y_i;\mu _j, \varSigma _j)\right] w_{ij} =0, \end{aligned}$$
(7)

that is characterized by the evaluation of K component-wise sets of weights, rather than one weight for each observation, as in equation (3).
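A compact R sketch of a single WEM iteration may help fix the structure of the algorithm; it reuses the illustrative helpers pearson_resid and w_raf sketched in Sect. 2, assumes the mvtnorm package for the Gaussian densities, and omits the eigen-ratio constraint of Sect. 3.1 as well as any convergence check.

```r
# One WEM iteration: E-step, component-wise soft trimming, weighted M-step.
# `fit` is a list with elements pi (length K), mu (K x p matrix), Sigma (list of K).
wem_step <- function(y, fit, h, nu = 0.9) {
  K <- length(fit$pi)
  # E-step: posterior membership probabilities u_ik
  dens <- sapply(1:K, function(k)
    fit$pi[k] * mvtnorm::dmvnorm(y, fit$mu[k, ], fit$Sigma[[k]]))
  u <- dens / rowSums(dens)
  # Soft trimming: component-wise Pearson residuals and weights w_ik
  w <- sapply(1:K, function(k)
    w_raf(pearson_resid(y, fit$mu[k, ], fit$Sigma[[k]], h), A = A_gkl, nu = nu))
  # Weighted M-step
  uw <- u * w
  fit$pi <- colSums(uw) / sum(uw)
  for (k in 1:K) {
    fit$mu[k, ]    <- colSums(y * uw[, k]) / sum(uw[, k])
    yc             <- sweep(y, 2, fit$mu[k, ])
    fit$Sigma[[k]] <- crossprod(yc * sqrt(uw[, k])) / sum(uw[, k])
  }
  fit
}
```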

The weighted penalized CEM algorithm (WCEM) is obtained by introducing a standard C-step between the E-step and the weighted M-step. The main feature of the WCEM algorithm is that one single weight is attached to each unit, based on its current assignment after the C-step, rather than component-wise weights. Then, the resulting WLEE shows the same structure as in (3) but with the difference that \(u_{ij}=1\) or \(u_{ij}=0\). The WCEM is described as follows:

  1.

    Initialization

    $$\begin{aligned} \tau ^{(0)}=(\pi ^{(0)}, \mu _1^{(0)}, \ldots , \mu _K^{(0)}, \varSigma _1^{(0)}, \ldots , \varSigma _K^{(0)}) \ . \end{aligned}$$
  2.

    E-step

    $$\begin{aligned} u_{ik}^{(s)}=\frac{\pi _k^{(s-1)}\phi _p\left( y_i; \mu _k^{(s-1)},\varSigma _k^{(s-1)}\right) }{\sum _{j=1}^K \pi _j^{(s-1)}\phi _p\left( y_i; \mu _j^{(s-1)},\varSigma _j^{(s-1)}\right) } \end{aligned}$$
  3.

    C-step let \(k_i^{(s)}=\mathrm {argmax}_k u_{ik}^{(s)}\) identify the cluster assignment for the \(i\mathrm{th}\) unit at the \(s\mathrm{th}\) iteration. Then

    $$\begin{aligned} {\tilde{u}}_{ik}^{(s)}=\left\{ \begin{array}{cc} 1 &{}\quad \text {if} \ k=k_i,\\ 0 &{}\quad \text {if} \ k\ne k_i.\\ \end{array} \right. \end{aligned}$$
  4.

    Weighted M-step based on current parameter estimates \(\tau ^{(s)}\) and cluster assignments \(k_i\),

    (a)

      Soft trimming evaluate the Mahalanobis-type distances of each point w.r.t. the component to which it belongs

      $$\begin{aligned} d_{ik_i}^{(s)}=d\left( y_i; \mu _{k_i}^{(s-1)}, \varSigma _{k_i}^{(s-1)}\right) . \end{aligned}$$

      Then, compute the corresponding Pearson residuals and weights as

      $$\begin{aligned} \delta _{ik_i}^{(s)}=\frac{{\hat{m}}_n\left( d_{ik_i}^{(s)^2}\right) }{m_{\chi ^2_p}\left( d^{(s)^2}_{ik_i}\right) }-1 \end{aligned}$$

      and

      $$\begin{aligned} w_i^{(s)}=w_{ik_i}^{(s)}=\frac{\left[ A\left( \delta _{ik_i}^{(s)}\right) +1\right] ^+}{\delta _{ik_i}^{(s)}+1} \end{aligned}$$

      respectively, where

      $$\begin{aligned} {\hat{m}}_n(d^2)&= \frac{1}{\sum _{i=1}^n{\tilde{u}}_{ik_i}} \sum _{i=1}^n k(d^2; d^2_{ik_i}, h) \ , \\&= \frac{1}{\sum _{i=1}^n{\tilde{u}}_{ik}} \sum _{i=1}^n k(d^2; d^2_{ik}, h){\tilde{u}}_{ik} \ . \end{aligned}$$

      Hence, component-wise kernel density estimates only involve distances conditionally on cluster assignment.

    (b)

      Update membership probabilities and component-specific parameter estimates

      $$\begin{aligned} \pi _k^{(s+1)}&= \frac{\sum \nolimits _{i=1}^n {\tilde{u}}_{ik}^{(s)}w_{ik_i}^{(s)}}{\sum _{i=1}^n w_{ik_i}^{(s)}} \ , \\ \mu _k^{(s+1)}&= \frac{\sum \nolimits _{i=1}^n y_i w_{ik_i}^{(s)}{\tilde{u}}_{ik}^{(s)}}{\sum _{i=1}^n w_{ik_i}^{(s)}{\tilde{u}}_{ik}^{(s)}}, \\ \varSigma _k^{(s+1)}&= \frac{\sum \nolimits _{i=1}^n \left( y_i{-}\mu _k^{(s+1)}\right) \left( y_i{-}\mu _k^{(s+1)}\right) ^{^\top } w_{ik_i}^{(s)}{\tilde{u}}_{ik}^{(s)}}{\sum _{i=1}^n w_{ik_i}^{(s)}{\tilde{u}}_{ik}^{(s)}} . \end{aligned}$$
    (c)

      Set \(\tau ^{(s+1)}=\left( \pi ^{(s+1)}, \mu _1^{(s+1)}, \ldots , \mu _K^{(s+1)}, \varSigma _1^{(s+1)}, \ldots , \varSigma _K^{(s+1)}\right) \).

It is worth noting that both weighted algorithms return weighted estimates of covariance. The final output can be suitably modified in order to provide unbiased weighted estimates.
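As an example of such a modification, the sketch below applies a standard finite-sample correction for covariance estimates computed with reliability-type weights; it is only one possible choice, not necessarily the correction adopted in our implementation.

```r
# Possible bias correction for a weighted covariance estimate S computed as in the
# M-step, with weights w for one component (e.g. the products u * w at convergence):
# rescale by V1 / (V1 - V2 / V1), where V1 = sum(w) and V2 = sum(w^2).
unbias_wcov <- function(S, w) {
  V1 <- sum(w); V2 <- sum(w^2)
  S * V1 / (V1 - V2 / V1)
}
```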

3.1 Eigen-ratio constraint

It is well known that maximization of the mixture likelihood (1) or the classification likelihood (2) is an ill-posed problem since the objective function may be unbounded (Day 1969; Maronna and Jacovkis 1974). Therefore, in order to avoid such problems, the optimization is performed under suitable constraints. In particular, we employed the eigen-ratio constraint defined as

$$\begin{aligned}&\frac{\max _j\max _k \lambda _j(\varSigma _k)}{\min _j\min _k \lambda _j(\varSigma _k)} \le c, \qquad j=1,2 \ldots ,p, \quad \nonumber \\&\quad k=1,2,\ldots ,K \end{aligned}$$
(8)

where \(\lambda _j(\varSigma _k)\) denotes the \(j\mathrm{th}\) eigenvalue of the covariance matrix \(\varSigma _k\) and c is a fixed constant not smaller than one, aimed at tuning the strength of the constraint. For \(c=1\) spherical clusters are imposed, while as c increases clusters of varying shape are allowed. The eigen-ratio constraint (8) can be satisfied at each iteration by adjusting the eigenvalues of each \(\varSigma _k^{(s)}\). This is achieved by replacing them with a truncated version

$$\begin{aligned} \lambda _j^*(\varSigma _k)=\left\{ \begin{array}{ll} \theta _c &{}\quad \text {if} \ \lambda _j(\varSigma _k) < \theta _c \\ \lambda _j(\varSigma _k) &{}\quad \text {if} \ \theta _c\le \lambda _j(\varSigma _k) \le c\theta _c \\ c\theta _c &{}\quad \text {if} \ \lambda _j(\varSigma _k) > c\theta _c \\ \end{array} \right. \end{aligned}$$

where \(\theta _c\) is an unknown bound depending on c. The reader is pointed to Fritz et al. (2013) and Garcia-Escudero et al. (2015) for a feasible solution to the problem of finding \(\theta _c\).
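A rough R sketch of this truncation step follows; it pools the eigenvalues of all components, selects \(\theta _c\) by a crude grid search on a Gaussian-likelihood-type criterion, and is only a simplified stand-in (with equal cluster sizes and \(p\ge 2\) implicitly assumed) for the exact algorithm of Fritz et al. (2013); the function name is ours.

```r
# Enforce the eigen-ratio constraint (8): pool the eigenvalues of all component
# covariance matrices, truncate them to [theta, c * theta], and rebuild each
# Sigma_k from its own eigenvectors. theta is chosen on a grid by minimizing
# sum(log m + lambda / m) over the truncated values m.
restrict_eigen <- function(Sigmas, c) {
  eig    <- lapply(Sigmas, eigen, symmetric = TRUE)
  lambda <- unlist(lapply(eig, `[[`, "values"))
  if (max(lambda) / min(lambda) <= c) return(Sigmas)   # already feasible
  clip   <- function(l, theta) pmin(pmax(l, theta), c * theta)
  crit   <- function(theta) { m <- clip(lambda, theta); sum(log(m) + lambda / m) }
  grid   <- seq(min(lambda), max(lambda), length.out = 500)
  theta  <- grid[which.min(sapply(grid, crit))]
  lapply(eig, function(e)
    e$vectors %*% diag(clip(e$values, theta)) %*% t(e$vectors))
}
```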

3.2 Classification and outlier detection

The WCEM automatically provides a classification of the sample units, since the value of \({\tilde{u}}_{ik}\) at convergence is either zero or one. With the WEM, by paralleling a common approach, a maximum a posteriori criterion can be used for cluster assignment, that is, a C-step is applied after the last E-step. Such criteria lead to classifying all the observations, both genuine and contaminated, meaning that outliers are also assigned to a cluster. Actually, we are not interested in classifying outliers: for purely clustering purposes they have to be discarded.

We distinguish two main approaches to outlier detection. According to the first, outlier detection should be based on the robust fitted model and performed separately by using formal rules. The key ingredients in multivariate outlier detection are the robust distances (Rousseeuw and Van Zomeren 1990; Cerioli 2010). The reader is pointed to Cerioli and Farcomeni (2011) for a recent account on outlier detection. An observation is flagged as an outlier when its squared robust distance exceeds a fixed threshold, corresponding to the \((1-\alpha )\)-level quantile of the reference (asymptotic) distribution of the squared robust distances. A common solution is represented by the use of the \(\chi ^2_p\), and popular choices are \(\alpha =0.025\) and \(\alpha =0.01\). In the case of finite mixtures, the main idea is that the outlyingness of each data point should be measured conditionally on the final assignment. Hence, according to a proper testing strategy, an observation is declared as outlying when

$$\begin{aligned} d_{ik_i}^2 > \chi ^2_{p;1-\alpha } \ , \ d_{ik_i}^2=(y_i-{\hat{\mu }}_{k_i})^{^\top } {\hat{\varSigma }}_{k_i}^{-1}(y_i-{\hat{\mu }}_{k_i}) \ . \end{aligned}$$
(9)

The second approach stems from hard trimming procedures, such as tclust, rtclust and otrimle. These techniques are not meant to provide simultaneous robust fit and outlier detection based on formal testing rules, but outliers are identified with those data points falling in the trimmed set or assigned to the improper density component, respectively. Therefore, by paralleling what happens with hard trimming, one could flag as outliers those data points whose weight, conditionally on the final cluster assignment, is below a fixed (small) threshold. Values such as 0.10 or 0.20 seem reasonable choices. Furthermore, the empirical downweighting level represents a natural upper bound for the cutoff value, giving an indication of the largest tolerable swamping and of the minimum feasible masking for the given level of smoothing. This approach is motivated by the fact that the multivariate WLE shares important features with hard trimming procedures, even if it is based on soft trimming, as claimed in Agostinelli and Greco (2017).
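Both rules are immediate to apply once the fit is available; in the sketch below d2 and w denote the unit-specific squared robust distances and weights, conditional on the final cluster assignment.

```r
# Rule 1 (formal testing, eq. (9)): flag units whose squared distance to the
# assigned component exceeds the chi^2_p quantile.
flag_chi2 <- function(d2, p, alpha = 0.01) d2 > qchisq(1 - alpha, df = p)

# Rule 2 (trimming-like): flag units whose weight falls below a small threshold.
flag_weight <- function(w, cutoff = 0.2) w < cutoff
```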

The process of outlier detection may result in type I and type II errors. In the former case, a genuine observation is wrongly flagged as outlier (swamping); in the latter case, a true outlier is not identified (masking). Swamped genuine observations are false positives, whereas masked outliers are false negatives. According to the first strategy, the larger \(\alpha \) the more swamping and the less masking. In a similar fashion, the higher the threshold the more swamping and the less masking will characterize the second approach to outlier detection.

In the following, both approaches to outlier detection will be taken into account and critically compared.

3.3 The selection of h

The selection of h is a crucial task. According to the authors’ experience (see Agostinelli and Greco 2018, 2017; Greco 2017, for instance), but also as already suggested by Markatou et al. (1998), a safe selection of h can be achieved by monitoring the empirical downweighting level \((1-\hat{{\bar{\omega }}})\) as h varies, with \(\hat{{\bar{\omega }}}=n^{-1}\sum _{i=1}^n {\hat{w}}_i\), where the weights at convergence \({\hat{w}}_i={\hat{w}}_{ik_i}\) are evaluated at the fitted parameter value and conditionally on the final cluster assignment, both for WEM and WCEM, along the lines outlined in Sect. 3.2. The monitoring of WLE analyses has been applied successfully in Agostinelli and Greco (2017) to the case of robust estimation of multivariate location and scatter. The reader is pointed to Cerioli et al. (2017) for an account on the benefits of monitoring. A good strategy in the tuning of the smoothing parameter would be to monitor several quantities of interest stemming from the fitted mixture model in addition to the empirical downweighting level. One could monitor the weighted log-likelihood at convergence, unit-specific robust distances conditionally on the final cluster assignment, unit-specific weights, or a misclassification error if a training set with known labels is available. For instance, an abrupt change in the monitored empirical downweighting level or in the robust distances may indicate the transition from a robust to a non-robust fit and aid in the selection of a value of h that gives an appropriate compromise between efficiency and robustness. Values beyond this threshold would lead to at least one arbitrarily biased fitted component that can compromise the accuracy of clustering. It is worth noting that the trimming level in tclust and the improper density constant in otrimle are selected in a monitoring fashion as well.
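In practice the monitoring can be automated along the following lines, where wem() is a placeholder for our fitting routine (it is not part of any released package) and is assumed to return the unit-specific weights at convergence.

```r
# Monitor the empirical downweighting level 1 - mean(weights) over a grid of h.
h_grid   <- seq(0.05, 1, by = 0.05)
dw_level <- sapply(h_grid, function(h) {
  fit <- wem(y, K = 3, h = h)      # hypothetical fitting function
  1 - mean(fit$weights)            # empirical downweighting level
})
plot(h_grid, dw_level, type = "b", xlab = "h",
     ylab = "empirical downweighting level")
```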

3.4 Synthetic data

Let us consider a three component mixture model with \(\pi =(0.2,0.3,0.5)\), \(\mu _1=(-5,0)^{^\top }\), \(\mu _2=(0,-5)^{^\top }\), \(\mu _3=(5,0)^{^\top }\) and

$$\begin{aligned} \varSigma _1= & {} \left( \begin{array}{ll} 1&{}\quad -0.5\\ -0.5 &{}\quad 1\\ \end{array} \right) , \quad \varSigma _2=\left( \begin{array}{ll} 2&{}\quad 1.25\\ 1.25 &{}\quad 2\\ \end{array} \right) , \quad \\ \varSigma _3= & {} \left( \begin{array}{ll} 3&{}\quad -1.75\\ -1.75 &{}\quad 3\\ \end{array} \right) . \end{aligned}$$
and a simulated sample of size \(n=1000\), with \(40\%\) of background noise. Outliers have been generated uniformly within a hypercube whose dimensions include the range of the data and are such that the distance to the closest component is larger than the 0.99-level quantile of a \(\chi ^2_2\) distribution. WEM and WCEM have been run by setting the eigen-ratio restriction constant to \(c=15\). (The true value is 9.5.) The weights are based on the generalized Kullback–Leibler divergence and a folded normal kernel. Initialization has been provided by running tclust with a \(50\%\) level of trimming. The smoothing parameter h has been selected by monitoring the empirical downweighting level and unit-specific clustering-conditioned distances over a grid of h values (Agostinelli and Greco 2017). Figure 1 displays the monitoring analyses of the empirical downweighting level, the robust distances and the misclassification error for the WEM. In all panels an abrupt change is detected, meaning that for h values on the right side of the vertical line the procedure is no longer able to identify the outliers, hence being non-robust w.r.t. the presence of contamination. Similar trajectories are observed for the WCEM and not reported here. In the monitoring of robust distances, a color map has been used that goes from light gray to dark gray in order to highlight those trajectories corresponding to observations that are flagged as outlying for most of the monitoring. Figure 2 displays the result of applying both the WEM and WCEM algorithms to the sample at hand with an outlier detection rule based on the 0.99-level quantile of the \(\chi ^2_2\) distribution and on a threshold for weights set at 0.2. Component-specific tolerance ellipses are based on the 0.95-level quantile of the \(\chi ^2_2\) distribution. We notice that both methods succeed in recovering the underlying structure of the clean data despite the challenging contamination rate and that the outlier detection rules provide quite similar and satisfactory outcomes. The entries in Table 1 give the rate of detected outliers \(\epsilon \), swamping and masking stemming from the alternative strategies.

Fig. 1 Simulated data. Monitoring the empirical downweighting level (left), robust distances (middle), misclassification error (right) based on WEM. The vertical lines give the selected h. The horizontal line in the middle panel gives the \(\chi ^2_{2;0.99}\) quantile
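For completeness, the data configuration of this example can be generated along the following lines (a sketch, assuming the MASS package, interpreting the \(40\%\) of noise as 400 out of the \(n=1000\) points, and with an arbitrary enlargement of the bounding box).

```r
library(MASS)
set.seed(1)
n <- 1000; n_out <- 0.4 * n; n_good <- n - n_out
pi_k  <- c(0.2, 0.3, 0.5)
mu    <- list(c(-5, 0), c(0, -5), c(5, 0))
Sigma <- list(matrix(c(1, -0.5, -0.5, 1), 2),
              matrix(c(2, 1.25, 1.25, 2), 2),
              matrix(c(3, -1.75, -1.75, 3), 2))
lab <- sample(1:3, n_good, replace = TRUE, prob = pi_k)
y   <- t(sapply(lab, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
# Background noise: uniform on an enlarged bounding box, kept only if the squared
# distance to the closest component exceeds the 0.99-level chi^2_2 quantile.
rng <- apply(y, 2, range); out <- matrix(numeric(0), 0, 2)
while (nrow(out) < n_out) {
  cand  <- cbind(runif(500, rng[1, 1] - 2, rng[2, 1] + 2),
                 runif(500, rng[1, 2] - 2, rng[2, 2] + 2))
  d2min <- apply(sapply(1:3, function(k)
    mahalanobis(cand, mu[[k]], Sigma[[k]])), 1, min)
  out   <- rbind(out, cand[d2min > qchisq(0.99, 2), , drop = FALSE])
}
y <- rbind(y, out[1:n_out, ]); lab <- c(lab, rep(0, n_out))   # 0 marks the noise
```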

Table 1 Simulated data
Fig. 2 Simulated data. Fitted components, cluster assignments and outlier detection by WEM (left) and WCEM (right). Top row: outlier detection based on \(d^2_{k_i}>\chi ^2_{p;0.99}\). Bottom row: outlier detection based on \(w_{k_i}<0.2\), \(95\%\) tolerance ellipses superimposed

3.5 Sensitivity to initialization and root selection

In order to initialize WEM and WCEM, tclust based on a large rate of trimming is an appealing solution. Other candidate solutions can be used to initialize the algorithm. For instance, one approach has been discussed in Coretto and Hennig (2017) that is based on a combination of nearest neighbor denoising and agglomerative hierarchical clustering. Given the initial partition, starting values for component-specific parameters are obtained by the sample mean and covariance matrix of the points belonging to each cluster. Here, we also consider a safer strategy that is based on the evaluation of clusterwise robust estimates (for instance, by using the OGK estimator), since the initial denoising may still include dangerous outliers, especially with a large rate of contamination.

In order to check the extent to which results change by varying the initialization, we ran a numerical study based on 500 Monte Carlo trials according to the data configuration of the example in Sect. 3.4. In each trial, for a fixed h, the WEM starts iterating from tclust (with \(50\%\) trimming, tclust50), the initial values from Coretto and Hennig (2017) (InitClust) and its robust counterpart described above (InitClustOGK). The same numerical study has also been performed when the level of contamination is set to null, \(10\%\) and \(20\%\). In the former scenario, with \(40\%\) of noisy points, InitClust leads to one inflated estimate of covariance and a smaller empirical downweighting level. On the contrary, the other two starting values give solutions with negligible differences in parameter estimates, depending on the chosen stopping rule and tolerance (here the stopping rule is based on the absolute value of the maximum difference between consecutive estimates of the centroid matrix and the tolerance is \(10^{-4}\)), and the same final classification and detected outliers. In the latter cases, the algorithm was less dependent on the initial values, in the sense that all three alternatives led to practically indistinguishable fitted models.

When, for several initial values and a fixed value of the bandwidth h, the algorithm ends with different estimates, but still characterized by close empirical downweighting levels, by paralleling the classical approach one could select the solution leading to the largest value of the weighted likelihood. The likelihood evaluated at the WLE, in general, could be misleading, as also discussed in Sect. 5. In the numerical studies, the solutions characterized by a smaller empirical downweighting level for \(\epsilon =40\%\) showed lower weighted likelihood values at convergence but larger likelihoods. At least in this example, tclust50 provides the largest value of the weighted likelihood on average, but the three initializations lead to the largest weighted likelihood value with almost equal frequencies, except for \(\epsilon =40\%\), where tclust50 gives the selected root slightly more often than InitClustOGK. Figure 3 gives the distributions of the weighted likelihood values at convergence for the case \(\epsilon =20\%\). We do not observe significant differences. These findings give evidence supporting the convergence of the proposed algorithm. Similar results are also valid for the WCEM.

Fig. 3 Distribution of weighted likelihood values at convergence when the WEM algorithm is initialized by tclust50, InitClustOGK and InitClust, with \(\epsilon =20\%\) and the data configuration of the example in Sect. 3.4

A simple and common strategy to check the stability of the results is to run the algorithm for a number (say, 20 to 50) of starting values. For instance, different initial solutions can be obtained by randomly perturbing the deterministic starting solution and/or the final one obtained from it (Farcomeni and Greco 2015b).

The empirical downweighting level provides guidance in assessing the reliability of the fitted model. Actually, if the mean of the estimated weights is approximately 1, then the WLE is close to the MLE, whereas if it is too small, then the corresponding WLE is a degenerate solution, indicating that it only represents a small subset of the data. In the case of an excess of downweighting, the criterion based on the weighted likelihood can fail.

A strategy for root selection in the WLE framework has been introduced by Agostinelli (2006) and extended to the multivariate framework in Agostinelli and Greco (2018). The main idea is that the probability of observing a very small value of the Pearson residual is expected to be smaller the closer the fitted model is to the model underlying the majority of the data. Then, the selected root is the one with the lowest fitted probability

$$\begin{aligned} \Pr _{{\hat{\tau }}}\left[ \delta ({\hat{d}}^2; {\hat{\tau }}, {\hat{M}}_n)<-q\right] \ , \end{aligned}$$
(10)

with \(q=0.9, 0.95\). The probability in (10) is obtained by drawing a large number of instances (say \(N=10,000\)) from the fitted model. In the example of Sect. 3.4, the criterion in (10) correctly discards the biased root with the lowest weighted likelihood value, whereas it is not able to discriminate between the other two sets of estimates, since they are very close and essentially lead to the same results.
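For a single fitted component, the probability in (10) can be approximated by Monte Carlo along the following lines; this is our own simplified reading of the criterion (under the fitted parameters the squared distances follow a \(\chi ^2_p\) law, and the kernel density estimate is the one built on the observed squared distances, again via the log-transform device), and the function name is ours.

```r
# Monte Carlo approximation of (10): draw squared distances from the chi^2_p law
# implied by the fitted parameters, compute their Pearson residuals against the
# data-based kernel density estimate, and record the proportion below -q.
root_prob <- function(d2_obs, p, h, q = 0.9, N = 10000) {
  sim   <- rchisq(N, df = p)
  f_log <- sapply(log(sim), function(t) mean(dnorm(t, mean = log(d2_obs), sd = h)))
  delta <- (f_log / sim) / dchisq(sim, df = p) - 1
  mean(delta < -q)
}
```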

4 Properties

The WEM and WCEM algorithms have been obtained by introducing a different set of estimating equations that defines a WLEE as in (7) or (3), in place of the likelihood equations. In particular, in the M-step K separate WLEs of multivariate location and scatter are obtained. The proposed algorithms are a special case of the algorithm first introduced by Elashoff and Ryan (2004), where an EM-type algorithm has been established for very general estimating equations. Here, solving the WLEE, for each separate problem, corresponds to solving a complete data estimating equation of the form

$$\begin{aligned} \varPsi (y; \tau )=(\varPsi _\pi (y; \tau ), \varPsi _\mu (y; \tau ), \varPsi _\varSigma (y; \tau ) )^{^\top }=0 \end{aligned}$$
(11)

Very general conditions for consistency and asymptotic normality of the solution to (11) are given in Elashoff and Ryan (2004). The main requirements are that

  1.

    \(\varPsi (y; \tau )\) defines an unbiased estimating equation at the assumed model, i.e. \(E_\tau [\varPsi (Y; x,\tau )]=0\);

  2.

    \(E_\tau [\varPsi (Y;x, \tau )\varPsi (Y;x, \tau )^{^\top }]\) exists and is positive definite;

  3.

    \(E_\tau [\partial \varPsi (Y;x, \tau )/\partial \tau ]\) exists and is negative definite, \(\forall \tau \).

These conditions are satisfied by the proposed WLEEs, which are characterized by weighted score functions stemming from (2). The reader is pointed to the Supplementary material in Agostinelli and Greco (2018) for detailed assumptions and proofs.

5 Model selection

In model-based clustering, formal approaches to choose the number of components are based on the value of the log-likelihood function at convergence. Criteria such as the BIC or the AIC are commonly used to select K when running the classical EM algorithm. In a robust setting, in tclust the number of clusters is chosen by monitoring the changes in the trimmed classification likelihood over different values of K and contamination levels. A formal approach has not been investigated yet in the case of the otrimle, even if the authors conjecture that a monitoring approach or the development of information criteria can be pursued as well.

Fig. 4 Example 1. Monitoring the weighted BIC (left) and the classical BIC evaluated at the WLE (right). The vertical line in the left panel gives the selected h

Here, when the robust fit is achieved by the WEM algorithm, we suggest resorting to weighted counterparts of the classical AIC and BIC criteria. The proposed strategy is then based on minimizing

$$\begin{aligned} Q^w(K)=-2\ell ^w(y; {\hat{\tau }})+m(K) \end{aligned}$$
(12)

where \(\ell ^w(y; {\hat{\tau }})=\sum _{i=1}^n {\hat{w}}_{ik_i}\ell (y_i;{\hat{\tau }})\) and m(K) is a penalty term reflecting model complexity. The rationale behind the use of a weighted criterion is that we want to implement a model selection device leading to results close to those one would obtain by using the standard criteria on the genuine part of the data only. Actually, if one uses the standard criteria based on the log-likelihood function evaluated at the WLE, outliers still contribute to its value and, even if these individual contributions are the smallest, the overall behavior of the corresponding BIC and AIC may be badly affected. Let us consider the set of simulated data of Sect. 3.4. The two panels in Figure 4 display, respectively, the behavior of the weighted BIC and the classical BIC evaluated at the WLE for different choices of K over a grid of values for the smoothing parameter h. The unpleasant behavior of the BIC is evident from the inspection of the right panel of Figure 4. On the contrary, the weighted BIC, shown in the left panel of Figure 4, allows selection of the correct clustering complexity. We notice that a similar trajectory is observed for both \(K=2\) and \(K=3\). The abrupt change is detected at the same value of h but the choice \(K=3\) is preferred since it leads to a smaller weighted BIC before the robust fit turns into a non-robust one.
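In R, the criterion (12) can be coded as below, with loglik_i the individual log-likelihood contributions \(\ell (y_i;{\hat{\tau }})\) and w the weights at convergence, conditional on the final assignment; as an assumption of this sketch, m(K) is taken to be the usual BIC penalty for a K-component Gaussian mixture with unrestricted covariances.

```r
# Weighted BIC (12): weighted log-likelihood plus a BIC-type penalty.
weighted_bic <- function(loglik_i, w, K, p, n = length(w)) {
  npar <- (K - 1) + K * p + K * p * (p + 1) / 2   # mixture weights, means, covariances
  -2 * sum(w * loglik_i) + npar * log(n)
}
```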

It is well known that the BIC approximates twice the log-Bayes factor for model comparison. One could extend the same relationship to the weighted BIC (12) and the weighted Bayes factor defined in Agostinelli and Greco (2013). Furthermore, as far as the WCEM algorithm is concerned, one could mimic the approach used in tclust and monitor the weighted classification likelihood at convergence for varying K and h. Then, the number of clusters should be set equal to the minimum K for which there is no substantial improvement in the objective function when adding one more group.

It can be proved that the robust criterion in (12) is asymptotically equivalent to its classical counterpart at the assumed model, i.e. when the data are not prone to contamination. The proof is based on some regularity conditions about the kernel and the model that are required to assess the asymptotic behavior of the WLE (Agostinelli 2002; Agostinelli and Greco 2013, 2018). In the case of finite mixture models, it is assumed further that an ideal clustering setting holds under the postulated mixture model, that is, data are assumed to be well clustered. The following result holds.

Proposition

Let \({\mathcal {Y}}_j\) be the set of points belonging to the \(j\mathrm{th}\) component, whose cardinality is \(n_j\). The full data is defined as \(\cup _{j=1}^K{\mathcal {Y}}_j\) with \(\sum \nolimits _{j=1}^K n_j=n\) and \(\lim \nolimits _{n\rightarrow \infty }\frac{n_j}{n}=\pi _j>0\). Assume that (i) the model is correctly specified, (ii) the WLE \({\hat{\tau }}\) is a consistent estimator of \(\tau \), (iii) \(\sup \nolimits _{y\in {\mathcal {Y}}_j} \left| w(\delta (y)) - 1 \right| {\mathop {\longrightarrow }\limits ^{p}} 0\). Then, \(|Q^w(K)-Q(K)|{\mathop {\rightarrow }\limits ^{p}}0\).

Proof

Let \({\tilde{\tau }}\) denote the maximum likelihood estimate.

$$\begin{aligned} \frac{1}{2}|Q^w(K)-Q(K)|= & {} \left| \sum _iw_i\ell (y_i;{\hat{\tau }})-\sum _i\ell (y_i;{\tilde{\tau }})\right| \\\le & {} \left| \sum _i (w_i-1)\ell (y_i;{\hat{\tau }})\right| +\left| \sum _i\left[ \ell (y_i;{\hat{\tau }})-\ell (y_i;{\tilde{\tau }})\right] \right| \\\le & {} \sup _{y}\left| w(\delta (y))-1\right| \sum _i\left| \ell (y_i;{\hat{\tau }})\right| +\left| \sum _i\left[ \ell (y_i;{\hat{\tau }})-\ell (y_i;{\tilde{\tau }})\right] \right| \ , \end{aligned}$$

where the first term converges to zero in probability by assumption (iii) and the second by the consistency of \({\hat{\tau }}\) and \({\tilde{\tau }}\) under (i)–(ii), as \(n_j\rightarrow \infty \).

\(\square \)

Table 2 Average measures of fitting accuracy for WEM, WCEM, EM, CEM and otrimle with \(p=2,5,10,25\), \(\epsilon =0\), \(\beta =10,8,6\)

6 Numerical studies

We investigate the finite sample behavior of the proposed WEM and WCEM algorithms. Both algorithms are still based on non-optimized R code. Nevertheless, the results that follow are satisfactory and computational time always lies in a feasible range. We set \(n=1000\), \(K=3\) and simulate data according to the M5 scheme introduced in Garcia-Escudero et al. (2008). Clusters have been generated by p-variate Gaussian distributions with parameters

$$\begin{aligned} \mu _1= & {} (-\beta ,-\beta ,0,\ldots ,0), \\ \mu _2= & {} (0,\beta ,0,\ldots ,0),\\ \mu _3= & {} (\beta ,0,0,\ldots ,0) \end{aligned}$$

and

$$\begin{aligned} \varSigma _1= & {} \left( \begin{array}{rrr} 15&{}\quad -10&{}\quad 0_{p-2}\\ -10 &{}\quad 15&{}\quad 0_{p-2}\\ 0_{p-2}^{^\top }&{}\quad 0_{p-2}^{^\top }&{}\quad {\mathrm {I}}_{p-2}\\ \end{array} \right) , \\ \varSigma _2= & {} {\mathrm {I}}_p, \ \varSigma _3=\left( \begin{array}{rrr} 45&{}\quad 0&{}\quad 0_{p-2}\\ 0 &{}\quad 30&{}\quad 0_{p-2}\\ 0_{p-2}^{^\top }&{}\quad 0_{p-2}^{^\top }&{}\quad {\mathrm {I}}_{p-2}\\ \end{array} \right) , \end{aligned}$$

where \(0_d\) is a null row vector of dimension d and \({\mathrm {I}}_d\) is the \(d\times d\) identity matrix. Dimensions \(p=2,5,10, 25\) have been taken into account. The parameter \(\beta \) regulates the degree of overlapping among clusters: smaller values yield severe overlapping whereas larger values give a better separation. Here, we set \(\beta =6,8,10\). Theoretical cluster weights are fixed as \(\pi =(0.2,0.4,0.4)\). Outliers have been generated uniformly within a hypercube whose dimensions include the range of the data and are such that the distance to the closest component is larger than the 0.99-level quantile of a \(\chi ^2_p\) distribution. When \(p=25\), outliers only occur in the first ten dimensions. This setting is more challenging and allows us to assess the quality of the proposed model-based clustering techniques in larger-dimensional problems. The rate of contamination has been set to \(\epsilon =0.10,0.20\). The case \(\epsilon =0\) has been used to evaluate the efficiency of the proposed techniques when applied to clean data. The numerical studies are based on 500 Monte Carlo trials. The weighted likelihood algorithms are both based on a folded normal kernel and a GKL RAF (with \(\nu =0.9\)), whereas we set \(c=50\) as eigen-ratio constraint. The smoothing parameter h has been selected in such a way that the empirical downweighting level lies in the range (0.2, 0.35) under contamination, whereas it is about \(10\%\) when no outliers occur. The algorithm is assumed to reach convergence when \(\max |{\hat{\mu }}^{(s+1)}-{\hat{\mu }}^{(s)}| < tol\), with a tolerance tol set to \(10^{-4}\), where \({\hat{\mu }}^{(s)}\) is the matrix of centroid estimates at the \(s\mathrm{th}\) iteration and the differences are elementwise.

Table 3 Swamping rate for WEM, WCEM and otrimle with \(p=2,5,10,25\), \(\epsilon =0\), \(\beta =10,8,6\)
Table 4 Average measures of classification accuracy for WEM, WCEM, EM, CEM and otrimle with \(p=2,5,10,25\), \(\epsilon =0\), \(\beta =10,8,6\)

Fitting accuracy has been evaluated according to the following measures:

  1.

    \(||{\hat{\mu }}-\mu ||\), where \({\hat{\mu }}\) and \(\mu \) are \(3\times p\) matrices with \({\hat{\mu }}_j\) and \(\mu _j\) in each row, respectively, for \(j=1,2,3\);

  2.

    \({\mathrm {ave}}_{\mathrm {j}}\log {\mathrm {cond}}\left( {\hat{\varSigma }}_j\varSigma _j^{-1} \right) \), where \({\mathrm {cond}}(A)\) denotes the condition number of the matrix A;

  3.

    \(||{\hat{\pi }}-\pi ||\).

As far as outlier detection is concerned, several strategies have been compared: we considered a detection rule based on the 0.99-level quantile of the \(\chi ^2_p\) distribution, according to (9), but also rules based on the fitted weights, with thresholds set at 0.1, 0.2 and \(1- \hat{{\bar{\omega }}}\). For each decision rule, for the contaminated scenario, we report (a) the rate of detected outliers \(\epsilon \); (b) the swamping rate; (c) the masking rate. The first is a measure of the fitted contamination level, whereas the others give insights on the level and power of the outlier detection procedure. The comparisons across the different methods should be considered for close values of \(\epsilon \). Actually, the more outliers are detected, the more likely it is that a genuine observation is misclassified but, at the same time, the more true outliers are correctly flagged. For \(\epsilon =0\), only swamping is taken into account.
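These summaries are straightforward to compute; in the sketch below truth and flag are logical vectors marking the simulated and the detected outliers, respectively.

```r
# Rate of detected outliers, swamping (genuine points wrongly flagged) and
# masking (true outliers not identified).
detect_rates <- function(truth, flag) {
  c(eps_hat  = mean(flag),
    swamping = mean(flag[!truth]),
    masking  = mean(!flag[truth]))
}
```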

Table 5 Average measures of fitting accuracy for WEM, WCEM, tclust and otrimle with \(p=2, 5, 10, 25\), \(\epsilon =0.10,0.20\), \(\beta =10\) (well-separated clusters)
Table 6 Average measures of fitting accuracy for WEM, WCEM, tclust and otrimle with \(p=2,5,10,25\), \(\epsilon =0.10,0.20\), \(\beta =8\) (moderate overlapping)

Classification accuracy has been measured by (i) the adjusted Rand index and (ii) the misclassification error rate (MCE), both evaluated over true negatives for the robust techniques. The results are based on the testing decision rule (9). In order to avoid problems due to label switching issues, cluster labels have been sorted according to the first entry of the fitted location vectors.

Under the assumed model, WEM and WCEM have been initialized by tclust with \(20\%\) of trimming and their behavior has been compared with the EM and CEM algorithms and the otrimle, for the same eigen-ratio constraint and the same initial values. In the presence of contamination, we do not report the results concerning the non-robust EM and CEM but only those regarding the WEM, WCEM, otrimle and oracle tclust, i.e. with trimming level equal to the actual contamination level (tclust10 and tclust20, respectively). Under this scenario, starting values have been driven by tclust with \(50\%\) of trimming.

It is worth stressing here that the comparison in terms of outlier detection reliability between weighted likelihood estimation, tclust and otrimle can be considered fair only by looking at the rate of weights below the fixed threshold for the former methodology and at the trimmed observations or those assigned to the improper density group for the latter techniques, since formal testing rules have not been considered for either tclust or otrimle.

First, let us consider the behavior of WEM and WCEM at the assumed model. The entries in Table 2 give the considered average measures of fitting accuracy; Table 3 gives the level of swamping according to the different strategies for WEM and WCEM that are based on the \(\chi ^2_p\) distribution and the inspection of weights; Table 4 reports classification accuracy. The overall behavior of WEM and WCEM is appreciable: we observe a tolerable efficiency loss, a negligible swamping effect and a reliable classification accuracy compared with the non-robust procedures. Furthermore, the results are quite similar to those stemming from otrimle, and quite often the inspection of the weights from WEM and WCEM leads to a smaller number of false positives, on average.

Table 7 Average measures of fitting accuracy for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =6\) (severe overlapping)
Table 8 Outlier detection for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =10\) (well-separated clusters)

The performance of WEM and WCEM under contamination is explored next. The fitting accuracy provided by the proposed weighted likelihood based strategies is illustrated in Tables 5, 6 and 7. In all considered scenarios, the behavior of WEM and WCEM is satisfactory and they both compare well with the oracle tclust and otrimle. In particular, the good performance of WEM and WCEM in the challenging situation of severe overlapping is worth remarking. Furthermore, for all data configurations, we notice the ability of WEM to combine accurate estimates of component-specific parameters with those of the cluster weights. The entries in Tables 8, 9 and 10 show the behavior of the testing procedure based on the \(\chi ^2_p\) distribution and the inspection of weights for all considered scenarios. The empirical level of contamination is always larger than the nominal one, but it is acceptable and stable as p and \(\beta \) change. Masking is always negligible, hence highlighting the appreciable power of the testing procedure. We remark that one could also consider multiple testing adjustments in outlier detection as outlined in Cerioli and Farcomeni (2011). To conclude the analysis, Tables 11, 12 and 13 give the considered measures of classification accuracy as \(\beta \) varies. The results are quite stable across the four methods and all dimensions. As before, WEM and WCEM lead to a satisfactory classification, even in the challenging case of severe overlapping. The results obtained for \(p=25\) deserve some special remarks. Actually, fitting and classification accuracies deteriorate, particularly in the presence of moderate-to-severe overlapping. Nevertheless, the proposed WEM and WCEM still behave in a fashion not dissimilar from the other well-established techniques.

6.1 Computational burden

The numerical studies are enriched by evaluating the computational demand of the proposed methodology for increasing sample size and dimension. Time needed for convergence with non-optimized R code on a 3.4 GHz Intel Core i5 processor is given in Table 14. We report the sample times needed for convergence of the algorithm on a single dataset with \(K=3\), \(\epsilon =20\%\), \(\beta =10\), \(c=50\). The smoothing parameter has been chosen in order to achieve an empirical downweighting level about equal to 0.25. Initialization has been included in time evaluation. It can be seen that computing time increases both with the sample size and the dimension but always at a reasonably slow rate. We underline that the speed of convergence also depends on the choice of h: values of the smoothing parameter leading to an excess of downweighting can slow it down.

Table 9 Outlier detection for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =8\) (moderate overlapping)
Table 10 Outlier detection for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =6\) (severe overlapping)
Table 11 Average measures of classification accuracy for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =10\) (well-separated clusters)
Table 12 Average measures of classification accuracy for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =8\) (moderate overlapping)
Table 13 Average measures of classification accuracy for WEM, WCEM, tclust and otrimle with \(p=2,5,10\), \(\epsilon =0.10,0.20\), \(\beta =6\) (severe overlapping)

7 Real data examples

7.1 Swiss bank note data

Let us consider the well-known Swiss banknote dataset concerning \(p=6\) measurements of \(n=200\) old Swiss 1000-franc banknotes, half of which are counterfeit. The weighted likelihood strategy is based on a gamma kernel and a symmetric Chi-square RAF. Our first task is to choose the number of clusters. To this end, we look at the weighted BIC (12) on a fixed grid of h values for \(K=1,2,3,4\) and a restriction factor \(c=12\). The inspection of Figure 5 clearly suggests a two-group structure for all considered values of the smoothing parameter h. The empirical downweighting level is fairly stable for a wide range of h values. We decided to set \(h=0.05\), leading to an empirical downweighting level equal to 0.10. The WEM algorithm based on the testing rule (9) with \(\alpha =0.01\) identifies 21 outliers that include 15 forged and 6 genuine bills.

In contrast, there are 19 data points whose weight is lower than \(1-\hat{{\bar{w}}}\), which include 14 forged and 5 genuine bills. The cluster assignments stemming from the latter approach are displayed in Figure 6. It is worth noting that the outlying forged bills coincide with the group that has been recognized to follow a different forgery pattern and is characterized by a peculiar length of the diagonal (see García-Escudero et al. 2011; Dotto et al. 2016, and references therein). On the other hand, the outlying genuine bills all exhibit some extreme measurements. For the same value of the eigen-ratio constraint, the otrimle assigns 19 bills to the improper component density, leading to the same classification as the WEM, whereas rtclust includes one more counterfeit bill in the trimmed set.

A visual comparison between the three results is possible from Figure 7, whose panels show a scatterplot of the fourth against the sixth variable with the classification resulting from WEM (with both outlier detection rules), rtclust and otrimle, respectively. The WCEM has been tuned to achieve the same empirical downweighting level and leads to the same results.

7.2 2018 world happiness report data

In this section the weighted likelihood methodology is applied to a dataset from the 2018 World Happiness Report by the United Nations Sustainable Development Solutions Network (Helliwell et al. 2018) (hereafter denoted by WHR18). The data give measures about six key variables used to explain the variation of subjective well-being across countries: per capita Gross Domestic Product (on log scale), Social Support, i.e. the national average of the binary responses to the Gallup World Poll (GWP) question If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?, Healthy Life Expectancy at birth, Freedom to make life choices, i.e. the national average of binary responses to the GWP question Are you satisfied or not with your freedom to choose what you do with your life?, Generosity, measured by the residual of regressing the national average of GWP responses to the question Have you donated money to a charity in the past month? on GDP per capita, and perception of Corruption, i.e. the national average of binary responses to the GWP questions Is corruption widespread throughout the government or not? and Is corruption widespread within businesses or not?. The dataset consists of 142 rows, after the removal of some countries characterized by missing values. The objective is to obtain groups of countries with a similar behavior, to identify possible countries with anomalous and unexpected traits and to highlight those features that are the main source of separation among clusters.

In this example, a GKL RAF has been chosen, the kernel density estimate that is unbiased at the boundary has been obtained by first evaluating a kernel density estimate of the log-transformed squared distances over the whole real line and then back-transforming the fitted density to \((0, \infty )\) (Agostinelli and Greco 2018), and we set \(c=50\).
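A minimal sketch of this back-transformation step, assuming a Gaussian kernel on the log scale (the kernel choice is an assumption of the sketch, not necessarily the one used in the paper), is the following.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_on_positive_support(d2, grid):
    """Density estimate on (0, inf) without boundary bias: estimate the density
    of log(d2) on the whole real line, then map it back to the original scale
    via the change of variables f_X(x) = f_log(log x) / x."""
    kde_log = gaussian_kde(np.log(d2))
    grid = np.asarray(grid, dtype=float)
    return kde_log(np.log(grid)) / grid
```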

In order to select the number of clusters, we monitored the weighted BIC stemming from WEM and the classification log-likelihood at convergence from WCEM for different values of K and h. The corresponding monitoring plots are given in Figure 8. Based on the WEM algorithm, \(K=3\) is to be preferred, even if the gap with \(K=4\) is very small for all considered values of the smoothing parameter. On the other hand, the inspection of the weighted classification log-likelihood driven by the WCEM suggests \(K=4\). Therefore, we applied our WEM and WCEM algorithms with both \(K=3\) and \(K=4\). Since with \(K=4\) two of the groups are poorly separated, we preferred \(K=3\) and, for reasons of space, report only those results. Moreover, the results stemming from WEM and WCEM were very similar in terms of fitted parameters, cluster assignments and detected outliers; in the following we therefore only give the results driven by WEM. The empirical downweighting level was found not to depend in a remarkable fashion on the number of groups. In particular, for \(K=3\), in the monitoring of \(1-\hat{{\bar{w}}}\) we did not observe any abrupt change but rather a smooth decline until the level of contamination stabilized. We therefore decided to use an h value leading to \((1-\hat{{\bar{w}}})\approx 0.10\). Figure 9 displays the distance plot stemming from WEM. According to (9), for a level \(\alpha =0.01\), 12 outliers are detected. A closer inspection of the plot reveals that some of these points are close to the cutoff value; they are therefore not considered as outliers but correctly assigned to the corresponding cluster. Furthermore, we notice that all the points with the largest distances are attached a very small weight (\(<0.01\)). The weight corresponding to Myanmar is about 0.40, while the other countries near the threshold line all show weights of about 0.80.
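A sketch of this monitoring of the empirical downweighting level over a grid of h values is given below; fit_wem is a hypothetical placeholder for a WEM fitting routine returning the final weights, and is not part of any published package.

```python
import numpy as np

def monitor_downweighting(y, K, h_grid, fit_wem):
    """Track the empirical downweighting level 1 - w_bar over a grid of h values.
    `fit_wem` is a hypothetical placeholder for a WEM fitting routine that
    returns the final unit-specific weights in [0, 1]."""
    levels = []
    for h in h_grid:
        weights = fit_wem(y, K=K, h=h)
        levels.append(1.0 - np.mean(weights))
    return np.array(levels)
```

One would then pick an h value beyond which the monitored level stabilizes, here around 0.10.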

Table 14 Time (in s) for convergence of WEM on simulated data sets of increasing size \(n=1000, 2000, 5000, 10,000\) and dimensions \(p=2, 5, 10, 25\), with \(\epsilon =0.20\), \(\beta =10\), \(c=50\)
Fig. 5 Swiss banknote data. Monitoring the weighted BIC for WEM, \(K=1, 2, 3, 4\)

Fig. 6 Swiss banknote data. Cluster assignments by WEM. Observations whose weight is lower than \(1-{\bar{w}}\) are considered outliers. Genuine bills are denoted by a green \(+\), forged bills by a red \(\triangle \). Outliers are denoted by a black filled circle. (Color figure online)

Fig. 7 Swiss banknote data. Fourth against the sixth variable with cluster assignments by WEM (\(\alpha =0.01\)), WEM (\(w_{k_i}<1-{\bar{w}}\)), rtclust and otrimle, in clockwise fashion. Genuine bills are denoted by a green \(+\), forged bills by a red \(\triangle \). Outliers are denoted by a black filled circle. (Color figure online)

Fig. 8 WHR18 data. Monitoring the weighted BIC for WEM (left) and the weighted classification log-likelihood for WCEM (right), \(K=2, 3, 4, 5, 6\)

The cluster profiles and raw measurements for the detected outlying countries are reported in Table 15. The three clusters are well separated in terms of all of the considered items, even if small differences are seen in terms of perception of Corruption between clusters 1 and 2. Cluster 3 includes all the countries characterized by the highest level of subjective well-being; the differences with the other two clusters concerning GDP, HLE and Corruption are outstanding. At the opposite end, cluster 1 gathers the countries with the hardest economic and social conditions. As for the explanation of the outliers, we notice that the subgroup of African countries composed of the Central African Republic, Chad, Ivory Coast, Lesotho, Nigeria and Sierra Leone might belong to group 1 but is characterized by the six lowest HLE values. Myanmar, in turn, exhibits extremely large Freedom and Generosity indexes and a surprisingly small value for Corruption: it should belong to cluster 1, but according to the last three measurements it is in fact closer to cluster 3. It may be supposed that such measurements are not completely reliable because of problems with the questionnaires and the sampling, or they may reveal a surprisingly positive attitude despite the difficult economic and life conditions.

A spatial map of cluster assignments is given in Figure 10, which confirms the goodness and coherence of the results and the ability of the six considered features to find reasonable clusters. Cluster 1 is mainly localized in Africa, cluster 2 is composed of developing countries, whereas cluster 3 includes the world's leading countries, among which are the USA, Canada, Australia and the countries of the European Union.

7.3 Anuran calls

This example concerns the problem of recognition and classification of anuran (frogs and toads) families through their calls. The classification task is a very complex problem due to the large anuran diversity (Colonna et al. 2016). The data give \(p=22\) normalized Mel-frequency cepstral coefficients (MFCCs), measures that are commonly used as features in sound processing and speech recognition, for each of \(n=7195\) syllables, extracted after the segmentation of 60 audio records belonging to \(K=4\) different families. The data are publicly available at the UCI Machine Learning Repository. Despite the challenging nature of the problem, in particular for model-based clustering techniques, we want to assess the reliability of the proposed methodology on such a high-dimensional example and its behavior as an unsupervised learning device. To this end, we split the data in half into a training set and a test set.

The entries in Table 16 give the adjusted Rand index evaluated on both sets, averaged over 100 replicated splits, in order to obtain an honest estimate of the classification accuracy. We compare the results from mclust, tclust10, tclust20, WEM and WCEM; WEM and WCEM are characterized by a GKL RAF and a gamma kernel. On the training set, the adjusted Rand index is evaluated only on those observations not flagged as outliers. By looking at the results, one can state that the robust methods are feasible also in this challenging setting and indeed lead to improved classification accuracy w.r.t. the non-robust mclust.
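The following sketch shows one way such an evaluation could be organised, using scikit-learn's adjusted Rand index; fit_and_classify is a hypothetical placeholder for any of the compared methods, and the 50/50 split and the removal of flagged training outliers mirror the description above.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_over_splits(X, y, fit_and_classify, n_splits=100, seed=0):
    """Average adjusted Rand index over repeated 50/50 splits.
    `fit_and_classify` is a hypothetical placeholder returning, for a robust
    clustering method, the training labels, a boolean mask of flagged training
    outliers, and the test labels."""
    rng = np.random.default_rng(seed)
    ari_train, ari_test = [], []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        tr, te = idx[: len(y) // 2], idx[len(y) // 2:]
        lab_tr, out_tr, lab_te = fit_and_classify(X[tr], X[te])
        keep = ~out_tr                  # training ARI only on units not flagged as outliers
        ari_train.append(adjusted_rand_score(y[tr][keep], lab_tr[keep]))
        ari_test.append(adjusted_rand_score(y[te], lab_te))
    return np.mean(ari_train), np.mean(ari_test)
```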

On the other hand, the robust techniques succeed only to a limited extent in recovering the actual classification on the test set, because of the complexity of the problem, which comes from the large dimensionality and sample size but also from the severe overlap of the groups corresponding to the anuran families. Nevertheless, the behavior of the weighted likelihood methodologies is satisfactory when compared with the other competing techniques. The behavior of tclust10 on the training set is a consequence of the smaller number of detected outliers.

8 Conclusions

We have proposed a robust technique for fitting a finite mixture of multivariate Gaussian components based on recent developments in weighted likelihood estimation. The proposed methodology is meant to provide a step forward with respect to the original proposal in Markatou (2000). The method rests on the idea of computing the weights from a univariate kernel density estimate of the robust distances rather than from a multivariate kernel density estimate of the data. Furthermore, the proposed technique is characterized by the introduction of an eigen-ratio constraint aimed at avoiding the problems connected with an unbounded likelihood and spurious solutions.
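For illustration only, a much simplified version of an eigen-ratio restriction, which clips the component eigenvalues so that their overall ratio does not exceed c, is sketched below; the optimal truncation of Fritz et al. (2013) solves a minimisation problem and is more refined than this sketch.

```python
import numpy as np

def constrain_eigen_ratio(covs, c):
    """Crude eigen-ratio restriction: clip the eigenvalues of all component
    scatter matrices so that the largest-to-smallest ratio across components
    does not exceed c, then rebuild each matrix from its eigendecomposition."""
    decomps = [np.linalg.eigh(S) for S in covs]
    all_vals = np.concatenate([vals for vals, _ in decomps])
    upper = all_vals.max()
    lower = upper / c
    return [(vecs * np.clip(vals, lower, upper)) @ vecs.T for vals, vecs in decomps]
```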

Fig. 9 WHR18 data. Distance plot for WEM with \(K=3\). The horizontal line gives the \(\chi ^2_{6;0.99}\) quantile

Table 15 WHR18 data: cluster profiles and raw measurements for the detected outlying countries
Fig. 10 WHR18 data. Spatial classification from WEM with \(K=3\)

Based on the robustly fitted mixture model, a model-based clustering strategy can be built in a standard fashion by looking at the values of the posterior membership probabilities. At the same time, formal rules for outlier detection can be derived as well. Then, one can assign units to clusters provided that the corresponding outlyingness test is not significant, which means that detected outliers are discarded and not assigned to any group. The numerical studies and the real data examples showed the satisfactory reliability of the proposed methodology.
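A minimal sketch of this assignment step, assuming the fitted mixture parameters are available and that units already flagged as outliers are set aside rather than assigned, is the following.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_memberships(y, pis, means, covs):
    """Posterior membership probabilities proportional to the component prior
    times the multivariate normal density, normalised over components."""
    dens = np.column_stack([
        pi_k * multivariate_normal.pdf(y, mean=mu_k, cov=S_k)
        for pi_k, mu_k, S_k in zip(pis, means, covs)
    ])
    u = dens / dens.sum(axis=1, keepdims=True)
    return u, u.argmax(axis=1)      # probabilities and MAP cluster labels
```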

Table 16 Anuran calls

There is still room for further work, along a path shared with tclust, rtclust, mtclust and otrimle. Actually, the proposed method works for a given smoothing parameter h and a fixed number of clusters K. In addition, outlier detection depends upon a fixed threshold. At the moment, the selection of h stemming from the monitoring of several quantities, such as the empirical downweighting level, the unit-specific robust distances or even the fitted parameters, provides an acceptable adaptive solution. Such a procedure is not different from the implementation of a sequence of refinement steps of an initial robust partition stemming from a sequence of decreasing values of h. The selection of K also remains a difficult problem, despite the satisfactory behavior of the proposed criteria, i.e. the weighted BIC and the weighted classification log-likelihood. Outlier detection is a novel aspect in the framework of robust mixture modeling and model-based clustering. In the specific context, the outlyingness of each unit is tested conditionally on the final cluster assignment. The number of outliers clearly depends on the chosen level \(\alpha \) or on the selected threshold for the final weights. A fair choice of the level of the test is still an open problem in outlier detection. However, the suggested testing strategies work satisfactorily, at least in the scenarios considered, and provide a good compromise between swamping and masking, which could be further improved by using multiplicity adjustments (Cerioli and Farcomeni 2011). The extent to which the proposed methodology can deal with very high-dimensional problems remains limited, as it does for the other robust model-based clustering techniques considered in this paper. Nevertheless, the weighted likelihood methodology looks promising if one is willing to develop robust procedures specifically suited for high-dimensional problems. For instance, multivariate weighted likelihood estimation could be considered in model-based subspace clustering methods and in particular in the framework of mixtures of factor analyzers (McLachlan et al. 2003). The reader is pointed to Bouveyron and Brunet-Saumard (2014) for a recent account of high-dimensional clustering.