1 Introduction

First of all, let us commend the authors of Cerioli et al. (2018) for a very clear and thought-provoking paper.

For the past several years the authors have made substantial contributions to the area of robust statistics. Their extensive work on the forward search has led them to realize that monitoring robust procedures has an intrinsic informative power. In this paper they develop this idea and illustrate it with several interesting examples. We believe that monitoring might indeed be very informative, and might also help several innovative and sound robust methods become more widely used in practice, ultimately making them the standard in the applied sciences.

It is apparent, in our opinion, that monitoring is closely related to sensitivity analysis. The idea of tilting tuning parameters is certainly not new, and is often the simplest route to follow when no optimal choice is available (e.g., Farcomeni and Greco 2015; Farcomeni 2009). The welcome and very innovative point in Cerioli et al. (2018) is that the informative content obtained by monitoring goes beyond merely aiding the choice of tuning parameters.

Monitoring also has an intuitive appeal which might make robust estimators more widely accepted, as it immediately allows the user to understand to what extent the data are contaminated and classical estimators are (not) reliable (Farcomeni and Ventura 2012).

We believe that one should not necessarily restrict monitoring to a single parameter. Several aspects can be varied simultaneously (generating a 3-D, or in general a p-D, movie). The biggest challenge is computational, as in most cases robust estimators must be computed from scratch for each new set of tuning parameters. In this sense, monitoring opens the issue of obtaining simple and quick updating rules when tuning parameters are tilted (e.g., increased) only slightly. Another issue opened by the idea of monitoring is that, as the authors are well aware, it becomes much more difficult to study the theoretical properties of the monitored procedures (Cerioli et al. 2014).

In the remainder of this contribution to the discussion we focus, through examples, on the power of monitoring in robust clustering.

We extend monitoring to clustering in a completely natural way: we compute the minimal Mahalanobis distance as the minimal Mahalanobis distance of trimmed observations from their closest centroid, and the maximal Mahalanobis distance as the maximal distance of untrimmed observations from their assigned centroid. Other indicators, like the average silhouette width, can be used directly. In the following we will mostly focus on monitoring with respect to the trimming level.
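To fix ideas, the following R sketch shows one way the two monitored distances just defined could be computed from a fitted trimmed partition. The object names are ours and purely illustrative: cl is a vector of cluster labels in 0, 1, ..., k (0 marking trimmed observations), and mu and Sigma are lists of cluster means and covariance matrices.

## Minimal sketch (ours): monitored distances for trimmed clustering.
## X is the n x p data matrix; cl, mu and Sigma are as described above.
monitor_distances <- function(X, cl, mu, Sigma) {
  k <- length(mu)
  ## n x k matrix of Mahalanobis distances from each centroid
  D <- sapply(seq_len(k), function(j)
    sqrt(mahalanobis(X, center = mu[[j]], cov = Sigma[[j]])))
  D <- matrix(D, nrow = nrow(X))
  ## minD: minimal distance of trimmed observations from their closest centroid
  minD <- if (any(cl == 0)) min(D[cl == 0, , drop = FALSE]) else NA
  ## maxD: maximal distance of untrimmed observations from their assigned centroid
  maxD <- if (any(cl > 0)) max(D[cbind(which(cl > 0), cl[cl > 0])]) else NA
  c(minD = minD, maxD = maxD)
}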

2 Monitoring and reweighting

Reweighting is a general principle in robust statistics according to which, after an initial very robust but possibly inefficient estimator has been obtained, discarded or downweighted observations are used to improve efficiency. In most cases reweighting is a two-step procedure, but it can also be iterative (Cerioli 2010; Dotto et al. 2017).

Dotto et al. (2017) introduced rtclust, a robust model-based clustering method based on reweighting. The procedure is initialized with a robust clustering partition based on a very high trimming level, and is then iterated through a sequence of decreasing trimming levels. At each step the parameters are updated and clean observations, initially flagged as outlying, are possibly reinserted in the active set. An observation is moved to the active set if its Mahalanobis distance is lower than a threshold, obtained by fixing an appropriate quantile \((1-\alpha _L)\) of the target distribution of the Mahalanobis distance. The parameter \(\alpha _L\) establishes how far outliers are supposed to be from the bulk of the data. This choice is therefore rather subjective and strongly depends on the context of application. Code is available from the authors upon request.
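As an illustration only (the code is available upon request, so the following is our own sketch rather than the authors' implementation), the reinsertion step can be thought of as follows, under the assumption that squared Mahalanobis distances are compared with a chi-square quantile, a natural target distribution under normality.

## Sketch (ours, not the rtclust code): one reinsertion step. An
## observation enters the active set when its squared Mahalanobis
## distance from the closest centroid is below the (1 - alpha_L)
## quantile of a chi-square distribution with p degrees of freedom
## (assumed here as the target distribution).
reinsert <- function(X, mu, Sigma, alpha_L = 0.001) {
  p <- ncol(X)
  d2 <- sapply(seq_along(mu), function(j)
    mahalanobis(X, center = mu[[j]], cov = Sigma[[j]]))  # squared distances
  d2min <- apply(matrix(d2, nrow = nrow(X)), 1, min)     # closest centroid
  d2min <= qchisq(1 - alpha_L, df = p)                   # TRUE = in active set
}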

In this section we apply the philosophy of Cerioli et al. (2018) to explore features of the data by monitoring the outcomes of rtclust with respect to \(\alpha _L\). We use the well-known Swiss banknotes dataset. The data describe six features (length, left and right height, lower and upper distance of the inner frame to the closest border, diagonal length) of 200 Swiss 1000-franc bank notes. It is known a priori that there are two balanced groups, made of genuine and counterfeit notes, respectively. Additionally, Flury (1988) pointed out that the group of forged bills is not homogeneous, as 15 observations arise from a different pattern and are, for that reason, outliers. Dotto et al. (2017) applied rtclust to the Swiss banknotes dataset, fixing \(\alpha _L=0.001\).

Fig. 1 Swiss banknote data: monitoring estimated proportion of outliers \(\pi _{K+1}\) and individual Mahalanobis distances as obtained through rtclust

In Fig. 1a, we plot on the x-axis the values of the parameter \(\alpha _L\) and, on the y-axis, the final estimated proportion of outliers (namely \(\pi _{K+1}\)). It is straightforward to see that the estimated contamination level slowly decreases (roughly from 0.17 to 0.10) as \(\alpha _L\) decreases within the range [0.02, 0.005). As we move within the range [0.005, 0.001) an even slower decline is seen. Finally, for \(\alpha _L\le 0.001\) the estimated contamination level suddenly drops towards 0. This clearly suggests that (i) there is contamination, and (ii) there might be two or even three separated groups of outliers, with one group of approximately 2% of the observations (the difference between the y value at \(\alpha _L=0.005\), about 0.12, and the y value at \(\alpha _L=0.001\), about 0.10) being soft outliers, and the remaining 20 observations (10%) being more clearly placed far from the bulk of the data. Note that, as rtclust (and reweighting procedures in general) is not a direct outlier detection device, we make no claim about the outlyingness of the discarded observations, but only that the resulting estimates are robust.

In Fig. 1b, we monitor the Mahalanobis distance of each observation from its assigned/closest centroid as a function of \(\alpha _L\). Blue bullets mark observations flagged as clean, while red crosses mark observations discarded from the estimation set. This plot is a more detailed, but equally clear, account of what we have seen when discussing Fig. 1a. There are two gross outliers, a group of clearly separated outliers, and a third group whose Mahalanobis distances are large but not clearly separated from the bulk of the data. Note that as soon as \(\alpha _L\) is decreased too much, the estimates become unstable (which is a clear sign of contamination).

3 Monitoring and high-dimensional clustering: snipping

Farcomeni (2014a, b) developed snipping, a technique for parsimoniously trimming sparse entries of a data matrix, rather than entire rows, while clustering. This is done in the spirit of dealing with entry-wise outliers, based on the seminal paper of Alqallaf et al. (2009), and is particularly useful for moderately to high dimensional data. Snipped k-means and sclust (its model-based version) require the user to specify in advance the number of groups k and a trimming level \(\alpha \). A total of \(np\alpha \) entries, where n is the sample size and p the number of variables, will be discarded; and each of the n rows (without exceptions, unless all p measurements for one or more subjects are discarded) will be assigned to one of the k groups.

R functions to implement the methods, also for the case of robust location and scatter estimation (\(k=1\)), are available in the package snipEM.
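As a purely illustrative sketch (ours, not part of snipEM), a snipping solution can conveniently be represented by a binary n x p matrix V, with 0 marking snipped entries. The code below builds such an indicator with exactly \(\lfloor np\alpha \rfloor \) zeros, flagging the entries with the largest column-standardized absolute deviations; this is one simple way to initialize a snipping algorithm, nothing more.

## Sketch (ours): build a binary snipping indicator V with floor(n*p*alpha)
## zeros, marking the entries with the largest column-standardized
## absolute deviations. This is only a crude initialization device,
## not the snipping algorithm itself.
init_snip_indicator <- function(X, alpha) {
  n <- nrow(X); p <- ncol(X)
  Z <- abs(scale(X))                 # column-wise standardized deviations
  V <- matrix(1L, n, p)
  n_snip <- floor(n * p * alpha)
  if (n_snip > 0) V[order(Z, decreasing = TRUE)[seq_len(n_snip)]] <- 0L
  V
}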

We here revisit an example based on a data set of \(n=380\) daily measurements of \(p=38\) indicators taken at an urban waste water treatment plant. These include levels of zinc, pH, chemical and biological oxygen demand, suspended and volatile solids, and sediments, measured at each of four different locations (input, first settler, second settler, output). As in previous analyses, we fix \(k=3\) and monitor the outcome of sclust as a function of the removal of \(np\alpha \) entries of the data matrix, for several values of \(\alpha \).

In order to work with snipping procedures we can simply define the monitored measures at the entry level. Hence, the k-th Mahalanobis distance of entry \(X_{ij}\) corresponds to the univariate distance \(|X_{ij}-m_{kj}|/s_{kj}\), where \(m_{kj}\) is the j-th entry of the k-th centroid and \(s_{kj}\) the j-th standard deviation for the k-th cluster. In Fig. 2 we report, as a function of \(\alpha \), the unweighted standard deviation of the estimated centroids (separation, left upper panel), the average silhouette width (silhouette, right upper panel), the minimal Mahalanobis distance of trimmed entries from their closest centroid (minD, left lower panel) and the maximal Mahalanobis distance of untrimmed entries from their assigned centroid (maxD, right lower panel).
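The entry-level measures just defined can be computed directly from the output of any snipping procedure; the following R sketch (ours, with purely illustrative object names) assumes the centroids and scales are stored as k x p matrices m and s, the row assignments as a vector cl, and the snipped entries as zeros of a binary matrix V.

## Sketch (ours): entry-wise monitored distances for snipping.
## X is n x p; m and s are k x p matrices of centroids and standard
## deviations; cl gives the row assignments in 1,...,k; V is the binary
## snipping indicator (0 = snipped entry).
monitor_snip <- function(X, cl, m, s, V) {
  k <- nrow(m)
  ## distance of each entry from the corresponding coordinate of its
  ## assigned centroid
  d_assigned <- abs(X - m[cl, , drop = FALSE]) / s[cl, , drop = FALSE]
  ## distance of each entry from the closest centroid coordinate
  d_closest <- matrix(Inf, nrow(X), ncol(X))
  for (g in seq_len(k)) {
    dg <- sweep(abs(sweep(X, 2, m[g, ])), 2, s[g, ], "/")
    d_closest <- pmin(d_closest, dg)
  }
  c(minD = min(d_closest[V == 0]),   # snipped entries, closest centroid
    maxD = max(d_assigned[V == 1]))  # unsnipped entries, assigned centroid
}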

Fig. 2 Water treatment data: monitoring of separation, average silhouette width, minimal and maximal Mahalanobis distance as obtained through sclust

It can be seen from Fig. 2 that there is a very small fraction of gross outliers (say, less than 1%) which form a cluster if not trimmed and make the monitored statistics explode for \(\alpha <1\%\). After the drop, separation and silhouette are slightly increasing (up to about \(\alpha =10\%\)), indicating that about another 8–9% of the entries can be considered mild outliers and/or bridge points. Note that the silhouette decreases again for \(\alpha >16\%\), a clearly too large snipping level. Interestingly enough, as reported by Farcomeni (2014b), for \(5\%\le \alpha \le 15\%\) the estimates are fairly stable. The Mahalanobis distances in the lower panels are a little more unstable. The maximal Mahalanobis distances are in substantial agreement with our previous discussion, showing a very steep drop for \(\alpha <1\%\), a quick but less steep decrease for \(1\%\le \alpha <10\%\), and a slower decrease for larger values. The sudden drop at \(\alpha \cong 17.5\%\) might indicate that for larger values of \(\alpha \) the groups are re-arranged, that is, some observations are moved from one group to another after removal of the appropriate entries.

4 Monitoring two tuning parameters: the case of robust double clustering

Farcomeni (2009) developed trimmed double k-means. Double clustering aims at simultaneously grouping rows and columns of a data matrix. The robust approach of Farcomeni (2009) is based on separate impartial trimming of rows and columns. This was then extended to the case of snipping in Farcomeni and Greco (2015), which describes snipped double k-means.

R functions to implement the methods are freely available as web based supplementary material of Farcomeni and Greco (2015), on the publisher’s website.

Robust double clustering gives a clear motivation for simultaneous monitoring of more than one tuning parameter, as both the number of trimmed rows and the number of trimmed columns can change. Monitoring shall proceed through 3-D plots, 2-D contour plots (as in our example), or by appropriately tabulating results.

We here monitor the analysis of a microarray data matrix for \(n=200\) genes measured under \(p=20\) conditions. The data are freely available in the R package biclust. We fix \(k_1=3\) row groups and \(k_2=2\) column groups, and consider row trimming levels \(\alpha _1=0,1/200,2/200,\ldots ,40/200\) and column trimming levels \(\alpha _2=0,1/20,2/20,3/20,4/20\). For each combination of \(\alpha _1\) and \(\alpha _2\) we run the trimmed double k-means algorithm and compute the separation as above (corresponding to the standard deviation of the estimated centroid matrix), the within sum of squares (withinSS), which is the sum of the squared distances of each untrimmed entry from its assigned centroid, and as usual the minimal and maximal Euclidean distances (minD and maxD). In double clustering the minimal and maximal Euclidean distances shall be defined entry-wise, as for instance different columns of the same row can belong to different clusters. Let \(m_{rc}\) denote the estimated centroid for an entry in row cluster r and column cluster c. Suppose the i-th row is assigned to cluster \(r_i\) and the j-th column to cluster \(c_j\), with \(r_i=0,\ldots ,k_1\) and \(c_j=0,\ldots ,k_2\), where \(r_i=0\) (\(c_j=0\)) indicates a trimmed row (column). Then, the minimal Euclidean distance shall be defined as

$$\begin{aligned} \min _{ij:\ r_i=0\ \text{ or }\ c_j=0}\ \min _{r=1,\ldots ,k_1}\ \min _{c=1,\ldots ,k_2} |X_{ij}-m_{rc}| \end{aligned}$$

and the maximal Euclidean distance as

$$\begin{aligned} \max _{ij:\ r_i\ne 0\ \text{ and }\ c_j\ne 0} |X_{ij}-m_{r_ic_j}|. \end{aligned}$$
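In R, these two entry-wise quantities could be computed along the following lines (a sketch under our own naming conventions, with rr and cc the row and column assignments, 0 denoting trimming, and M the \(k_1\times k_2\) centroid matrix).

## Sketch (ours): entry-wise minimal/maximal Euclidean distances for
## robust double clustering. X is n x p; rr (length n) and cc (length p)
## are the row and column assignments, with 0 for trimmed rows/columns;
## M is the k1 x k2 matrix of estimated centroids.
monitor_double <- function(X, rr, cc, M) {
  trimmed <- outer(rr == 0, cc == 0, "|")   # entries in a trimmed row or column
  ## minD: trimmed entries, distance from the closest centroid
  minD <- if (any(trimmed))
    min(vapply(as.vector(M), function(m) min(abs(X[trimmed] - m)), numeric(1)))
  else NA
  ## maxD: untrimmed entries, distance from the assigned centroid
  keep <- !trimmed
  maxD <- if (any(keep))
    max(abs(X[keep] - M[cbind(rr[row(X)[keep]], cc[col(X)[keep]])]))
  else NA
  c(minD = minD, maxD = maxD)
}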

In Fig. 3 we report, as a function of \(n\alpha _1\) and \(p\alpha _2\), the separation (right lower panel), the within sum of squares (left lower panel), the minimal Euclidean distance of trimmed entries from their closest centroid (left upper panel) and the maximal Euclidean distance of untrimmed entries from their assigned centroid (right upper panel).

Fig. 3 E. Coli data: monitoring of minimal and maximal Euclidean distance (minD, maxD), within sum of squares (withinSS) and standard deviation of the centroids (separation)

It can be seen from the figure that the minimal Euclidean distance (which is not defined for \(\alpha _1=\alpha _2=0\)) rapidly increases, with minimal values for \(\alpha _2=1/20\). About 8 row outliers are clearly identified when \(\alpha _2=0\), while removal of one or more entire columns makes row trimming less important. This is a clear indication that the 8 gross outliers are probably generated by a single condition (that is, during one of the \(p=20\) experiments). The maximal Euclidean distance is much more difficult to interpret, and this points to an obvious limitation of monitoring, which does not necessarily give interpretable information. The within sum of squares and the separation are well in agreement with the minimal Euclidean distances, even if the information is less apparent.

5 Monitoring three tuning parameters: the case of robust fuzzy regression models

Dotto et al. (2016) introduced a robust fuzzy regression clustering model based on trimming. Linear clustering models are based on identifying k groups of units, each forming a separate linear structure. Each unit is assigned to the group minimizing its regression error (i.e., its squared residual from the estimated regression line). The methodology proposed in the aforementioned paper aims at maximizing the objective function given by

$$\begin{aligned} \sum _{i=1}^{n}\sum _{j=1}^{k}u_{ij}^m\log \big (f(y_i;\varvec{x}_i'\varvec{b}_j+b_j^0,s_j^2)\big ) \end{aligned}$$
(1)

where \(f(\cdot ;\mu ,\sigma ^2)\) is the p.d.f. of a normal distribution with mean \(\mu \) and variance \(\sigma ^2\), and m is a fuzzification parameter. In fuzzy clustering each observation can potentially be a member, with varying degrees of membership, of each cluster. This is controlled by the parameter m and measured by \(u_{ij}\in [0,1]\), the membership value of observation i in cluster j. The parameter m takes values in the range \([1,+\infty )\). Letting \(m\rightarrow \infty \) implies equal membership values \(u_{ij}=1/k\) regardless of the data, while for \(m = 1\) crisp weights in \(\{0,1\}\) are always obtained and all (untrimmed) observations are hard-assigned to one and only one cluster. The weights \(u_{ij}\) satisfy the following equalities:

$$\begin{aligned} \sum _{j=1}^k u_{ij}=1\quad \text{ if }\quad i\in \mathcal {I}\quad \text{ and }\quad \sum _{j=1}^k u_{ij}=0\quad \text{ if }\quad i\notin \mathcal {I}, \end{aligned}$$

for a subset \( \mathcal {I}\subset \{1,2,\ldots ,n\}\) whose cardinality is \(n(1-\alpha )\). The \(n\alpha \) observations which are not included in the subset \(\mathcal {I}\) are the ones flagged as outlying and, as a consequence, receive membership \(u_{ij}=0\) for each \(j=1,2,\ldots ,k\). Equation (1) contains two tuning parameters: m and \(\alpha \). Additionally, the number of clusters k can be seen as a tuning parameter; in our experience in fuzzy regression clustering it depends strongly on m, and more on \(\alpha \) than is usually the case in other contexts. These parameters are intertwined, and a long discussion of their tuning is given in Dotto et al. (2016). For these reasons, we will compare simultaneous monitoring of \(\alpha \) and m for different values of k.
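For completeness, the following sketch (ours; only the criterion, not the estimation algorithm of Dotto et al. 2016) evaluates the objective function (1) for given parameter values and memberships.

## Sketch (ours): evaluate the trimmed fuzzy regression objective (1).
## y is the response (length n); X the n x q design matrix (no intercept);
## B a q x k matrix of slopes, b0 a length-k vector of intercepts, s2 a
## length-k vector of variances; u the n x k membership matrix, with
## all-zero rows for trimmed observations; m the fuzzification parameter.
fuzzy_objective <- function(y, X, B, b0, s2, u, m) {
  k <- length(b0)
  ll <- sapply(seq_len(k), function(j)
    dnorm(y, mean = X %*% B[, j] + b0[j], sd = sqrt(s2[j]), log = TRUE))
  sum(u^m * ll)
}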

We illustrate monitoring with the real data analyzed in García-Escudero et al. (2010) and Dotto et al. (2016). The dataset consists of 362 measurements of height and diameter of Pinus Nigra trees located in the north of Palencia (Spain), and the aim is to explore the linear relationship between these two quantities.

In our context we need a monitoring quantity tailored to the special case of fuzzy clustering. We have chosen the Average Within-Cluster Distance (AWCD), which is basically a weighted average of the distance of each observation from each cluster (Campello and Hruschka 2006). We here generalize the AWCD to fuzzy linear clustering by using squared residuals as distances.
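The following sketch (ours) shows one simple way to compute such a generalized AWCD; the exact weighting of Campello and Hruschka (2006) may differ in inessential details, so this should be read as an illustration of the idea rather than as their definition.

## Sketch (ours): AWCD for fuzzy linear clustering, with squared
## residuals playing the role of distances. res2 is the n x k matrix of
## squared residuals of each observation from each estimated regression
## line; u is the n x k membership matrix (all-zero rows for trimmed
## observations). The weighting below is one simple choice.
awcd <- function(res2, u) {
  sum(u * res2) / sum(u)
}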

We fixed a grid of values for the parameters m and \(\alpha \) and evaluated the results by means of a heat map reporting the value of the AWCD for each combination. Additionally, we fixed three candidate values for k (namely \(\{2,3,4\}\)) and plotted a heat map for each candidate value.
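Schematically, such a monitoring loop could look as follows. This is only a sketch of ours: fit_frc is a hypothetical fitting routine standing in for whatever implementation of the method of Dotto et al. (2016) is used, and the grid values are purely illustrative.

## Sketch (ours): monitoring AWCD over a grid of (m, alpha) for each k.
## fit_frc() is a HYPOTHETICAL fitting function (assumed to return a list
## with components res2 and u); y and x hold the tree heights and diameters.
m_grid     <- seq(1, 2, by = 0.1)
alpha_grid <- seq(0, 0.10, by = 0.01)
for (k in 2:4) {
  A <- matrix(NA, length(m_grid), length(alpha_grid))
  for (i in seq_along(m_grid))
    for (j in seq_along(alpha_grid)) {
      fit <- fit_frc(y, x, k = k, alpha = alpha_grid[j], m = m_grid[i])
      A[i, j] <- awcd(fit$res2, fit$u)   # awcd() as sketched above
    }
  image(m_grid, alpha_grid, A, xlab = "m", ylab = expression(alpha),
        main = paste("AWCD, k =", k))    # one heat map per candidate k
}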

Fig. 4 Pinus Nigra data: monitoring the AWCD with respect to m, \(\alpha \) and k. In a, AWCD for different combinations of m and \(\alpha \) when \(k=2\); in b, \(k=3\); in c, \(k=4\). In d, AWCD as a function of \(\alpha \) for \(k=2,3,4\) when \(m=1.3\)

In Fig. 4a–c, we show the AWCD for \(k=2,3,4\), respectively. It is quite clear that reasonable values for m lie in the range [1, 1.3], as when \(m>1.3\) very large AWCD values are generally obtained (especially for small \(\alpha \)). When \(m \in [1, 1.3]\), there are clear differences between Fig. 4a and the other two panels. In Fig. 4a, the AWCD is generally larger than in the other two panels for any fixed m and \(\alpha \), indicating that the right number of clusters is \(k=3\) (as there is no advantage in increasing k further). The very high AWCD values obtained for \(m>1.3\) make it hard to distinguish further. We could here restrict to \(m \in [1, 1.3]\) and repeat the monitoring, but we prefer simply monitoring with respect to \(\alpha \) for fixed \(m=1.3\) and \(k=2,3,4\). This is done in Fig. 4d, which clearly shows that very low values of the AWCD index are reached as soon as the proportion of trimmed observations is \(\ge 0.04\), and that there are basically no differences in performance between \(k=3\) and \(k=4\). It shall be noted that Fig. 4d is very similar to classification trimmed likelihood curves (García-Escudero et al. 2011), even if a quantity different from the likelihood at convergence is monitored. As a final consideration, we point out that there is substantial agreement between the monitoring used for tuning parameter choice in Dotto et al. (2016) and the (hierarchical) monitoring adopted here, even if different quantities are monitored.

6 Conclusions

In this paper we applied the philosophy of monitoring, presented in Cerioli et al. (2018), to four recent methodological contributions related to robust cluster analysis. We mostly focused on monitoring the trimming level. Even though different trimming principles were adopted, in all cases the trimming level is directly related to both breakdown point and efficiency.