1 Introduction

Clustering individuals is an essential task in data science. Clustering aims at partitioning a sample of individuals into several groups (clusters) so that individuals in the same cluster are similar from a multidimensional point of view, while individuals in separate clusters are different. \(k\)-means clustering (Forgy 1965), partitioning around medoids (Kaufman and Rousseeuw 1990), clustering by mixture models (McLachlan and Basford 1988) and hierarchical clustering (Ward 1963) are popular methods for building such a partition.

However, data are often incomplete and clustering algorithms cannot be directly applied to incomplete data. A common strategy to deal with missing values in data analysis consists in using multiple imputation (Rubin 1976, 1987; Schafer 1997). Multiple imputation (MI) consists of three steps: (1) imputation of the data set several times according to an imputation model, (2) analysis of each imputed data set according to a substantive model, (3) pooling of the analysis results according to Rubin's rules. Such methods are mainly used for inference in linear models (see Marshall et al. (2009) for other uses), but not for clustering. For instance, to apply a regression model on incomplete quantitative data: (1) data can be imputed according to a Gaussian model (Schafer 1997), (2) the regression model is fit on each imputed data set, leading to several estimates of the regression coefficients and their associated variances, (3) regression estimates are averaged and the associated variance is computed. Thus, despite missing values, MI yields a unique estimate of the substantive model parameters and an uncertainty measure, expressed as a variance estimate. One major issue for applying clustering after MI is how to apply an equivalent of Rubin's rules in this context, i.e. how to apply step (3)? Basagana et al. (2013), Faucheux et al. (2020) and Bruckers et al. (2017) brought some answers in terms of partitions pooling, but the question of the uncertainty measure has not been discussed.
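To make the three steps concrete, here is a minimal R sketch of MI followed by regression pooling, using the mice package and its built-in nhanes toy data set; it only illustrates Rubin's rules for a regression model, not the clustering procedure developed in this paper.

## Minimal illustration of the three MI steps for a regression model (not the clustering setting)
library(mice)

data(nhanes)                                    # small toy data set with missing values, shipped with mice
imp  <- mice(nhanes, m = 20, seed = 123,
             printFlag = FALSE)                 # step 1: generate M = 20 imputed data sets
fits <- with(imp, lm(bmi ~ age + chl))          # step 2: fit the substantive model on each imputed data set
pool(fits)                                      # step 3: pool estimates and variances with Rubin's rules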

Uncertainty in clustering refers to (in)stability and results in various Voronoi tessellations of the metric space. These variations can cover many aspects: the clustering algorithm used, its initialization, the chosen number of clusters, etc. Here, we focus on another aspect, namely the stability related to sampling. Note that we distinguish it from the probability of assigning an individual to a cluster (easily obtained for model-based clustering methods), which assesses the uncertainty in cluster assignment but would remain even if the sample were fixed. See Dudoit and Fridly (2003) or Bruckers et al. (2017) for related work.

When data are complete, several resampling techniques have been proposed to assess clustering stability (Hennig 2007). One main advantage of these methods is that they are relevant for both distance-based and model-based clustering methods. They are generally motivated by the determination of the number of clusters. The rationale is that a “too” large number of clusters should lead to a significant increase in instability. Jain and Moreau (1987), Wang (2010), Fang and Wang (2012) and Mourer et al. (2020) proposed several approaches along this line. In particular, Wang (2010) proposed a measure of stability based on cross-validation and demonstrated the asymptotic selection consistency of the procedure. However, the expected value of this measure is related to the number of individuals. Consequently, since cross-validation reduces the sample size by data splitting, the stability estimate could be biased. For this reason, Fang and Wang (2012) proposed an insightful bootstrap technique avoiding data splitting.

In this paper, we focus mainly on the pooling step, both in terms of partitions pooling and in terms of sample variability with incomplete data. The rest of the paper is organized as follows. Based on the literature on consensus clustering and stability assessment in the context of complete data, we explain in Sect. 2 how to apply Rubin’s rules after MI. Then, in Sect. 3, our methodology is assessed by simulation. Finally, an application to a real data set is proposed to determine the number of clusters when data are incomplete.

2 Method

2.1 Notations

Following standard notations for incomplete data analysis, we denote \({{\textbf {X}}}=\left( x_{{i}{\ell }}\right) _{1\le {i}\le {n},1\le {\ell }\le {p}}\) the full data set and \({{\textbf {R}}}=\left( r_{{i}{\ell }}\right) _{1\le {i}\le {n},1\le {\ell }\le {p}}\) the missing data pattern, so that \(r_{{i}{\ell }}=0\) if \(x_{{i}{\ell }}\) is missing and 1 if it is observed. X and R are the associated random variables. For a given observation \({i}\), the set of observed values is denoted \(x_{i}^{obs}\), while the set of missing values is denoted \(x_{i}^{miss}\), so that \(x_{i}=\left( x_{i}^{obs}, x_{i}^{miss}\right) \). Note that this partition of variables is specific to each observation. Similarly, we denote \(X^{obs}\) and \(X^{miss}\) the observed and missing parts of X, so that \(X=\left( X^{obs},X^{miss}\right) \). The distribution of the random variable X is denoted F. In what follows, we place ourselves under the standard missing at random (MAR) assumption, meaning that R and \(X^{miss}\) are independent conditionally on \(X^{obs}\) (Rubin 1976).

2.2 Rubin’s rules

From a frequentist point of view, MI aims at estimating the expected mean and the expected variance of a statistic \({Q}\) over the realizations of \(\left( X^{obs},R\right) \). For instance, \({Q}\) could be the least squares estimator of the regression coefficients in a linear model, the estimator of the correlation between variables, etc. Under the MAR assumption, consistent estimators can be obtained by ignoring the distribution of R. The associated estimates are obtained in three steps:

  1.

    \({M}\) values of \(X^{miss}\) are drawn from their predictive distribution \(F_{X^{miss}\vert X^{obs}}\), leading to \({M}\) imputed data sets \(\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) _{1\le {m}\le {M}}\).

  2.

    \({Q}\) is evaluated on each one, providing a set of point estimates \(\left( \hat{{Q}}_{m}\right) _{1\le {m}\le {M}}\) with \(\hat{{Q}}_{m}=Q\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) \), as well as a set of estimated variances \(\left( U_{m}\right) _{1\le {m}\le {M}}\).

  3.

    These values are aggregated according to Rubin’s rules (Rubin 1976), leading to a unique point estimate \(\bar{{Q}}\) and a unique variance estimate T as follows:

    $$\begin{aligned}&\bar{{Q}}=\frac{1}{{M}}\sum _{{m}=1}^{M}\widehat{{Q}}_{m}\end{aligned}$$
    (1)
    $$\begin{aligned}&T=\underbrace{\frac{1}{{M}}\sum _{{m}=1}^{M}U_{m}}_{\bar{U}} +\underbrace{\frac{1}{{M}-1}\sum _{{m}=1}^{M}\left( \widehat{{Q}}_{m}-\bar{{Q}}\right) ^2}_B \end{aligned}$$
    (2)

Through the second step, we obtain \({M}\) independent realizations of \({Q}\) given \(X^{obs}\). These realizations are centered (in expectation) around \({Q}\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}\right) \), i.e. around the value of the statistic that would be observed if data were fully observed. Consequently, their average \(\bar{{Q}}\), computed in the third step, is an unbiased estimate of the expected value of \({Q}\) over all observed samples. Thus, \(\bar{Q}\) estimates

$$\begin{aligned} {\mathbb {E}}_{X^{obs}}\left[ {\mathbb {E}}_{X}\left[ {Q}\left( X^{obs}, X^{miss}\right) \vert X^{obs}\right] \right] \end{aligned}$$
(3)

Similarly, \(\bar{U}=\frac{1}{{M}}\sum _{{m}=1}^{M}U_{m}\) estimates the variance of \({Q}\) over observed samples and \(B=\frac{1}{{M}-1}\sum _{{m}=1}^{M}\left( \widehat{{Q}}_{m}- \bar{{Q}}\right) ^2\) the additional variance related to missing values. Following an ANOVA decomposition, the total variance T is expressed as the sum of a within imputation variance (\(\bar{U}\)) and a between imputation variance (B), so that T estimates

$$\begin{aligned} \underbrace{{\mathbb {E}}_{X^{obs}}\left[ Var_{X}\left[ {Q}\left( X^{obs}, X^{miss}\right) \vert X^{obs}\right] \right] }_{within} + \underbrace{Var_{X^{obs}}\left[ {\mathbb {E}}_{X}\left[ {Q}\left( X^{obs}, X^{miss}\right) \vert X^{obs}\right] \right] }_{between}\nonumber \\ \end{aligned}$$
(4)

Note that according to the values of \(\left( {{\textbf {X}}}^{miss}_{m}\right) _{1\le {m}\le {M}}\), \(\bar{{Q}}\) randomly varies around its expectation given \(X^{obs}\). When \({M}\) is small, this additional variability cannot be ignored. For this reason, B is generally corrected to account for it. Such a correction is obtained by multiplying B by \((1+1/{M})\) (Schafer 1997).
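For a scalar statistic, Eqs. (1) and (2) together with the \((1+1/M)\) correction translate directly into a few lines of R; the sketch below assumes the per-imputation estimates and variances have already been computed.

## Rubin's rules for a scalar statistic: Qm = point estimates, Um = within-imputation variances
pool_rubin <- function(Qm, Um) {
  M    <- length(Qm)
  Qbar <- mean(Qm)                        # pooled point estimate (Eq. 1)
  Ubar <- mean(Um)                        # within-imputation variance
  B    <- sum((Qm - Qbar)^2) / (M - 1)    # between-imputation variance
  Ttot <- Ubar + (1 + 1 / M) * B          # total variance (Eq. 2 with the small-M correction)
  c(Qbar = Qbar, Ubar = Ubar, B = B, T = Ttot)
}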

Compared to single imputation, MI accounts for the variability due to missing values thanks to the between variance B. The ratio B/T is helpful for interpretation since it assesses the robustness of the final analysis results to the imputation model (van Buuren 2018).

Challenges and motivations Rubin’s rules (Eqs. 1 and 2) have been developed for a statistic \({Q}\) in \({\mathbb {R}}\) (Marshall et al. 2009). However, a clustering algorithm does not fall within this scope, since its output can be expressed as a categorical variable \(\varPsi \) with values in the set of partitions of \({n}\) observations into at most \({K}\) clusters. Thus, this paper aims to develop new rules in the context of cluster analysis: a first rule to pool the \({M}\) realizations of \(\varPsi \) obtained from the imputed data sets, and a second rule to compute a unique associated uncertainty measure accounting for sample variability and missing values. Such a methodology would offer a new way to apply any cluster analysis method on incomplete data.

2.3 Partitions pooling

In the context of clustering, partitions pooling refers to consensus clustering. Based on the literature in this research field when data are complete, we propose an equivalent of the first rule (Eq. 1).

2.3.1 Consensus with complete data

Consensus clustering has been addressed by several authors over the past decades (see e.g. Day (1986), Vega-Pons and Ruiz-Shulcloper (2011) for a survey). The main idea of consensus clustering is to agglomerate the separate partitions \(\left( \varPsi _{j}\right) _{1\le {j}\le {J}}\) (called contributory partitions) into a global partition that is as similar as possible to the contributory partitions according to an index, such as the Rand index, defined as the proportion of agreements between partitions, i.e. \(\frac{1}{{n}({n}-1)}\sum _{\left( {i},{i}'\right) }\delta _{{i}{i}'}\), where \(\delta _{{i}{i}'}\) equals 0 if individuals \({i}\) and \({i}'\) belong to the same cluster in one partition but not in the other, and 1 otherwise. These contributory partitions can result from various algorithms, several tuning parameters, several sets of features, etc. Recently, Jain (2017) brought a strong theoretical framework to consensus clustering by viewing a consensus algorithm as an estimator of the partition minimizing the expected dissimilarity \(\delta \) with all separate partitions (over all partitions). More precisely, the expected partition is defined as follows

$$\begin{aligned} argmin_{\varPsi \in {\mathcal {P}}_{{n},{K}}} \int _{{\mathcal {P}}_{{n},{K}}}\delta ^{\alpha } (\varPsi ^\star ,\varPsi )d\pi (\varPsi ^\star ) \end{aligned}$$
(5)

where \(\pi \) is a probability distribution on the partition space \({\mathcal {P}}_{{n},{K}}\) of \({n}\) observations into at most \({K}\) clusters, \(\delta \) a dissimilarity function and \(\alpha \) a positive real number. An estimator of (5) can be defined as the partition minimizing the loss function computed from the observed contributory partitions \(\left( \varPsi _{j}\right) _{1\le {j}\le {J}}\):

$$\begin{aligned} L(\varPsi )=\sum _{{j}=1}^{J}\delta ^{\alpha }(\varPsi ,\varPsi _{j}) \end{aligned}$$
(6)

Through this theoretical framework, Jain (2017) extends the notion of mean (defined for real-valued statistics) to the context of partitions.
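As an illustration, the Rand index introduced above can be computed directly from the pairwise agreement indicator \(\delta _{{i}{i}'}\); a minimal R sketch for two partitions coded as label vectors:

## Rand index between two partitions (integer label vectors of the same length)
rand_index <- function(part1, part2) {
  n     <- length(part1)
  same1 <- outer(part1, part1, "==")   # TRUE if i and i' share a cluster in partition 1
  same2 <- outer(part2, part2, "==")   # same for partition 2
  delta <- (same1 == same2)            # agreement indicator delta_{ii'}
  (sum(delta) - n) / (n * (n - 1))     # average over the n(n-1) ordered pairs with i != i'
}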

However, because of the huge number of partitions, minimizing (6) is highly challenging. The literature mainly deals with \(\alpha \) set to 1 or 2, referred to as the median partition problem. Vega-Pons and Ruiz-Shulcloper (2011) distinguished four families of methods for solving it: non-negative matrix factorization (NMF) based methods, Mirkin distance-based methods, kernel methods and genetic algorithms. The first two have interesting theoretical properties.

NMF-based methods consist in rewriting the median partition problem as

$$\begin{aligned} arg min_{{{\textbf {H}}}}\parallel {{\textbf {M}}}-{{\textbf {H}}}\parallel ^2 \end{aligned}$$
(7)

where \(\parallel . \parallel \) denotes the Frobenius norm, \({{\textbf {H}}}\) \(\left( {n}\times {n}\right) \) denotes a connectivity matrix (meaning \({{\textbf {H}}}=(h_{{i}{i}'})_{1\le {i},{i}'\le {n}}\) with \(h_{{i}{i}'}=1\) if individuals \({i}\) and \({i}'\) are in the same cluster and \(h_{{i}{i}'}=0\) otherwise) and \({{\textbf {M}}}=\frac{1}{{J}}\sum _{{j}=1}^{J}{{\textbf {H}}}_{j}\) denotes the mean of the connectivity matrices \(\left( {{\textbf {H}}}_{j}\right) _{1\le {j}\le {J}}\) associated with each contributory partition \(\varPsi _{j}\) (\(1\le {j}\le {J}\)). The constraint that \({{\textbf {H}}}\) must be a connectivity matrix can be expressed as an optimization over the set of orthogonal matrices (Li et al. 2007). The solution of this problem under the orthogonality constraint corresponds to the partition minimizing the loss function given in Eq. (6). NMF is a powerful method widely used for solving many optimization problems (beyond the clustering framework) in several fields. One of its main theoretical strengths is the monotone convergence of the optimization algorithm used.
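The objects involved in Eq. (7) are straightforward to construct; the following R sketch builds the connectivity matrix of a partition, the average matrix \({{\textbf {M}}}\) over contributory partitions, and evaluates the Frobenius criterion for a candidate consensus partition.

## Connectivity matrix H of a partition: h_{ii'} = 1 if i and i' share a cluster, 0 otherwise
connectivity <- function(part) {
  outer(part, part, "==") * 1
}

## Average connectivity matrix M over a list of contributory partitions (label vectors)
mean_connectivity <- function(partitions) {
  Reduce(`+`, lapply(partitions, connectivity)) / length(partitions)
}

## Frobenius criterion || M - H ||^2 of Eq. (7) for a candidate partition
consensus_loss <- function(candidate, M) {
  sum((M - connectivity(candidate))^2)
}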

Mirkin distance-based methods focus on minimizing the loss function (6) for \(\delta \) chosen as the number of disagreements between partitions (called the Mirkin distance). The Mirkin distance does not make the problem less complex, but it has been widely studied and benefits from theoretical results. For example, when the solution is restricted to the set of contributory partitions, the error cannot exceed twice that of the global optimum (Filkov and Skiena 2004).

Note that many other methods have been proposed to perform consensus clustering, like the popular Cluster-based Similarity Partitioning Algorithm (CSPA), which consists in re-clustering the individuals from the average (\({{\textbf {M}}}\)) of the connectivity matrices associated with the contributory partitions. However, those methods cannot be expressed as a median partition problem and consequently cannot be justified from an inferential point of view. See Vega-Pons and Ruiz-Shulcloper (2011), Strehl et al. (2002) for a review.
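For comparison, a CSPA-like consensus simply re-clusters the individuals from \({{\textbf {M}}}\), for instance by hierarchical clustering on \(1-{{\textbf {M}}}\) viewed as a dissimilarity; a minimal sketch reusing mean_connectivity defined above:

## CSPA-like consensus: hierarchical clustering on the average connectivity matrix
cspa <- function(partitions, K) {
  M <- mean_connectivity(partitions)       # co-clustering frequencies over the contributory partitions
  d <- as.dist(1 - M)                      # individuals often clustered together are close
  cutree(hclust(d, method = "average"), k = K)
}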

2.3.2 Partitions pooling after MI

Partitions pooling after MI aims at aggregating several partitions varying only by the imputed values. Since the first Rubin’s rule is motivated by inferential arguments, consensus clustering based on the median partition problem is theoretically appealing: it directly extends the notion of expected mean to the clustering framework, providing a straightforward application of the first Rubin’s rule to clustering. Among the consensus methods based on the median partition problem, genetic algorithms do not offer theoretical guarantees, while kernel-based methods seem irrelevant in the context of MI. Indeed, since imputed values are generated independently from their predictive distribution, we assume all contributory partitions should have the same weight. Thus, only NMF-based methods and Mirkin distance-based methods provide a suitable way to pool partitions after MI. Formally, our equivalent of the first Rubin’s rule for clustering after MI is as follows:

$$\begin{aligned} \bar{\varPsi }=argmin_{\varPsi } \sum _{{m}=1}^{M}\delta (\varPsi ,\varPsi _{m}) \end{aligned}$$
(8)

where \(\varPsi _{m}\) is the partition obtained from \(\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) \) and \(\delta \) the Mirkin distance, or equivalently

$$\begin{aligned} arg min_{{{\textbf {H}}}}\parallel {{\textbf {M}}}-{{\textbf {H}}}\parallel ^2 \end{aligned}$$
(9)

with \({{\textbf {M}}}=\frac{1}{{M}}\sum _{{m}=1}^{M}{{\textbf {H}}}_{m}\) and \({{\textbf {H}}}_{m}\) the connectivity matrix associated with \(\varPsi _{m}\).

Following Jain (2017), the obtained partition estimates the theoretical partition minimizing the expected sum of the dissimilarity \(\delta \) with all separate partitions given \(X^{obs}\).
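The paper relies on NMF (Li et al. 2007) or SAOM to solve Eq. (8); as a crude, easily checked stand-in, one can restrict the search to the contributory partitions themselves (the 2-approximation recalled in Sect. 2.3.1). A sketch with \(\delta \) taken as the normalized Mirkin distance:

## Normalized Mirkin distance: proportion of pairs on which two partitions disagree
mirkin <- function(part1, part2) {
  H1 <- outer(part1, part1, "==")
  H2 <- outer(part2, part2, "==")
  mean(H1 != H2)
}

## Consensus restricted to the contributory partitions (2-approximation of Eq. (8))
consensus_among_contributory <- function(partitions) {
  losses <- sapply(partitions, function(p)
    sum(sapply(partitions, function(q) mirkin(p, q))))
  partitions[[which.min(losses)]]
}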

Clustering after MI has previously been investigated based on other techniques. Some authors have proposed stacking centroids or stacking imputed data sets to deal with several imputed data sets (Plaehn 2019); note that stacking ignores the differences between imputed data sets. Among other methods, Basagana et al. (2013) proposed a general framework for consensus by first investigating the choice of the number of partitions and the choice of the subset of variables retained for clustering. For a given number of clusters and a set of variables, consensus is performed by majority vote over the contributory partitions. More recently, Bruckers et al. (2017) proposed a consensus for functional data. To achieve this goal, they first identify the indicator matrices associated with each contributory partition and then look for the fuzzy matrix minimizing the Euclidean distance to the indicator matrices. The consensus partition is then obtained by majority vote (Dimitriadou et al. 2002). Like CSPA, this two-step approach cannot be considered as based on the median partition problem (Vega-Pons and Ruiz-Shulcloper 2011). Lately, Faucheux et al. (2020) proposed a consensus based on the MultiCons algorithm (Al-Najdi et al. 2016). The algorithm presents many advantages, in particular it allows a visualization of the hidden cluster structure in the data set, but it does not aim at solving the median partition problem (Al-Najdi et al. 2016, p. 16).

While these other methods yield a unique partition from several imputed data sets, this partition cannot be expressed as an estimator of a theoretical partition, contrary to those based on the median partition problem.

2.4 Instability pooling

Based on the literature on cluster stability when data are complete, we now propose an equivalent of the second rule (Eq. 2).

2.4.1 Instability with complete data

Assessing the instability in clustering is important for data analysis. To achieve this goal, resampling methods are appealing, especially when the clustering algorithm is distance-based, like k-means, k-medoids or hierarchical clustering. Wang (2010) and Fang and Wang (2012) proposed two ways of computing an instability measure for any clustering algorithm. The first one is based on cross-validation, while the second is based on the bootstrap. Since the cross-validation method tends to underestimate the instability (Wang 2010; Fang and Wang 2012), the bootstrap method appears more relevant. The main idea consists in defining a theoretical distance (instability) \(\delta \) between clusterings based on the sample distribution \(F_X\). Then, this distribution is mimicked by bootstrap and the distance is evaluated from each bootstrap replicate. Finally, distances are aggregated by averaging. More precisely, the theoretical distance \(\delta _{F_X}\) between two clusterings \(\varPsi _{j}\) and \(\varPsi _{{j}'}\) is defined by

$$\begin{aligned} \delta _{F_X}(\varPsi _{j},\varPsi _{{j}'})={\mathbb {P}}_{F_X} \{I\left( V_{\varPsi _{{j}}}\left( X\right) =V_{\varPsi _{{j}}} \left( X'\right) \right) +I\left( V_{\varPsi _{{j}'}}\left( X\right) =V_{\varPsi _{{j}'}}\left( X'\right) \right) =1\} \end{aligned}$$
(10)

where X and \(X'\) are independently drawn from the distribution \(F_X\) and \(V_{\varPsi _{{j}}}(X)\) is the Voronoi cell of X according to the partition given by \(\varPsi _{{j}}\). This distance measures the probability of disagreement between the two clusterings.

Based on this definition, the instability of \(\varPsi \) is defined as

$$\begin{aligned} {\mathbb {E}}_{X^{n}\sim F_X^{n},\tilde{X}^{n}\sim F_X^{n}} \left[ \delta _{F_X}\left( \varPsi \left( X^{n}\right) , \varPsi \left( \tilde{X}^{n}\right) \right) \right] \end{aligned}$$
(11)

i.e. as the expectation over all random samples of size \({n}\) of the distances between partitions given by clustering trained on them.

Fang and Wang (2012) propose an estimate of (11) by bootstrap: \({C}\) bootstrap pairs \(\left( {{\textbf {X}}}_{{c}},\tilde{{{\textbf {X}}}}_{{c}}\right) _{1\le {c}\le {C}}\) are drawn from the empirical distribution \(\widehat{F}^{n}\). From each sample, \(\varPsi \) is evaluated. Both resulting partitions are then used to classify the individuals of \({{\textbf {X}}}\) according to the Voronoi cells they define (e.g. by considering the closest centroid). Finally, the instability of the clustering is estimated by

$$\begin{aligned} U^{boot}= \frac{1}{{C}}\sum _{{c}=1}^{C}\delta _{\hat{F}^{n}} \left( \varPsi \left( {{\textbf {X}}}_{c}\right) ,\varPsi \left( \tilde{{{\textbf {X}}}}_{c}\right) \right) \end{aligned}$$
(12)

with

$$\begin{aligned} \delta _{\hat{F}^{n}}\left( \varPsi \left( {{\textbf {X}}}_{c}\right) , \varPsi \left( \tilde{{{\textbf {X}}}}_{c}\right) \right) = \frac{1}{{n}^2} \sum _{{i}=1}^{n}\sum _{{i}'=1}^{n}\left| I\left( V_{\varPsi \left( {{\textbf {X}}}_{c}\right) }\left( x_{i}\right) =V_{\varPsi \left( {{\textbf {X}}}_{c}\right) }\left( x_{{i}'}\right) \right) -I\left( V_{\varPsi \left( \tilde{{{\textbf {X}}}}_{c}\right) }\left( x_{i}\right) =V_{\varPsi \left( \tilde{{{\textbf {X}}}}_{c}\right) }\left( x_{{i}'}\right) \right) \right| \end{aligned}$$
(13)

Note that \(\delta _{\hat{F}^{n}}\) corresponds to the proportion of disagreements between the two partitions of \({{\textbf {X}}}\) (up to the scalar factor \(\frac{{n}}{{n}-1}\)) and can be viewed as a normalized Mirkin distance. Fang and Wang (2012) indicate that moderate values of \({C}\) (20 or 50) are sufficient for a precise instability assessment. Furthermore, this instability can be assessed for distance-based as well as non-distance-based clustering algorithms.
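A minimal R sketch of the bootstrap instability estimate (Eqs. 12 and 13) for k-means, where individuals are assigned to the Voronoi cell of the closest centroid; the mirkin helper from the sketch in Sect. 2.3.2 is reused.

## Assign each row of X to the closest centroid (Voronoi cell)
assign_to_centroids <- function(X, centers) {
  K <- nrow(centers)
  d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K, drop = FALSE]
  max.col(-d)                                   # index of the closest centroid for each individual
}

## Bootstrap instability of k-means (Fang and Wang 2012), Eqs. (12)-(13)
kmeans_instability <- function(X, K, C = 20) {
  n <- nrow(X)
  mean(replicate(C, {
    X1 <- X[sample.int(n, replace = TRUE), , drop = FALSE]    # first bootstrap sample
    X2 <- X[sample.int(n, replace = TRUE), , drop = FALSE]    # second bootstrap sample
    p1 <- assign_to_centroids(X, kmeans(X1, centers = K, nstart = 20)$centers)
    p2 <- assign_to_centroids(X, kmeans(X2, centers = K, nstart = 20)$centers)
    mirkin(p1, p2)                                             # proportion of pairwise disagreements
  }))
}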

2.4.2 Instability with incomplete data

Following previous developments for complete data, the instability with missing data can be defined as the expectation given in (11) over observed data. Following the second Rubin’s rule, such an instability can be decomposed as the sum of a within instability (Eq. 14) and a between instability (Eq. 15):

$$\begin{aligned}&{\mathbb {E}}_{X^{obs},\tilde{X}^{obs}}\left[ {\mathbb {E}}_{X,\tilde{X}} \left[ \delta _{F_{X\vert X^{obs}}}\left( \varPsi \left( X^{obs}, X^{miss}\right) ,\varPsi \left( \tilde{X}^{obs}, \tilde{X}^{miss}\right) \right) \vert X^{obs}\right] \right] \end{aligned}$$
(14)
$$\begin{aligned}&{\mathbb {E}}_{X^{obs}, \tilde{X}^{obs}} \left[ \delta _{F_{X}}\left( \varPsi \left( X^{obs}, X^{miss}\right) ,\varPsi \left( \tilde{X}^{obs}, \tilde{X}^{miss}\right) \right) \vert X^{obs}\right] \end{aligned}$$
(15)

Following Eq. (2), the within instability (Eq. 14) can be estimated from the \({M}\) imputed data sets \(\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) _{1\le {m}\le {M}}\) by:

$$\begin{aligned} \bar{U}=\frac{1}{{M}}\sum _{{m}=1}^{M}U_{m}^{boot} \end{aligned}$$
(16)

where \(U_{m}^{boot}\) is the instability estimated from \(\left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) \) according to Eq. (12), and the between instability (Eq. 15) can be estimated by:

$$\begin{aligned} B=\frac{1}{{M}^2} \sum _{{m}=1}^{M}\sum _{{m}'=1}^{M}\delta _{\hat{F}_{X\vert X^{obs}}}\left( \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) , \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{{m}'}\right) \right) \end{aligned}$$
(17)

where \(\delta _{\hat{F}_{X\vert X^{obs}}}\left( \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) , \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{{m}'}\right) \right) \) corresponds to the proportion of disagreements between the partitions obtained from imputed data sets \({m}\) and \({m}'\), as in Eq. (13). Note that this expression does not depend on the mean partition, whereas the rule in Eq. (2) does. For this reason, no correction for small values of \({M}\) is required here.

The total instability is given by the sum of Eqs. (16) and (17):

$$\begin{aligned} T=\frac{1}{{M}}\sum _{{m}=1}^{M}U_{m}^{boot} +\frac{1}{{M}^2} \sum _{{m}=1}^{M}\sum _{{m}'=1}^{M}\delta _{\hat{F}_{X\vert X^{obs}}}\left( \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{m}\right) , \varPsi \left( {{\textbf {X}}}^{obs},{{\textbf {X}}}^{miss}_{{m}'}\right) \right) \end{aligned}$$
(18)

Note that without missing values, Eq. (18) is equivalent to the instability proposed by Fang and Wang (2012). More generally, T is positive and bounded by 2. The value 0 is reached if \({K}=1\) or if the clustering is constant whatever the incomplete sample. The less stable the clustering, the larger T. A large value of the between instability compared to the total one indicates a strong dependence of the clustering on the imputation model.
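Given a list of M imputed data sets, Eqs. (16)–(18) can be computed as in the sketch below, reusing kmeans_instability and mirkin defined above; k-means is used here, but any clustering method could be plugged in.

## Pooled instability after MI (Eqs. 16-18), with k-means as the clustering method
total_instability <- function(imputed, K, C = 20) {
  M     <- length(imputed)                                       # imputed = list of M imputed data matrices
  parts <- lapply(imputed, function(Xm) kmeans(Xm, centers = K, nstart = 100)$cluster)
  Ubar  <- mean(sapply(imputed, kmeans_instability, K = K, C = C))          # within instability (Eq. 16)
  B     <- mean(outer(seq_len(M), seq_len(M),
                      Vectorize(function(m, mp) mirkin(parts[[m]], parts[[mp]]))))  # between instability (Eq. 17)
  c(within = Ubar, between = B, total = Ubar + B)                            # total instability (Eq. 18)
}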

2.5 Summary

Based on the above developments, the full procedure to perform cluster analysis after multiple imputation can be summarized as follows:

  • Imputation From an incomplete data set, generate \({M}\) imputed data sets according to a predefined multiple imputation method

  • Analysis For \({m}\) in \(\{ 1\ldots {M}\}\):

    1.

      build a partition \(\varPsi _{m}\) from the \({m}^{th}\) imputed data set

    2.

      compute \(U_{m}^{boot}\) the associated instability:

      (a)

        by resampling individuals with replacement, generate \({C}\) bootstrap pairs \(\left( {{\textbf {X}}}_{{c}},\tilde{{{\textbf {X}}}}_{{c}}\right) _{1\le {c}\le {C}}\) from the \({m}^{th}\) imputed data set

      (b)

        for each bootstrap pair \({c}\) in \(\{ 1\ldots {C}\}\)

        • perform cluster analysis on \({{\textbf {X}}}_{{c}}\) and \(\tilde{{{\textbf {X}}}}_{{c}}\) to obtain a pair of partitions \(\left( \varPsi _{c},\tilde{\varPsi }_{c}\right) \)

        • classify the \({n}\) individuals of the \({m}^{th}\) imputed data set according to \(\varPsi _{c}\) and \(\tilde{\varPsi }_{c}\) to obtain a pair of partitions \(\left( \varPsi '_{c},\tilde{\varPsi }'_{c}\right) \)

        • compute the proportion of disagreements between \(\varPsi '_{c}\) and \(\tilde{\varPsi }'_{c}\) as \(U_{m}^{c}=\frac{1}{{n}^2}\sum _{\left( {i}, {i}'\right) }\delta _{{i}{i}'}\) where \(\delta _{{i}{i}'}\) is equal to 1 if individuals \({i}\) and \({i}'\) belong to the same cluster in one partition but not in the other, and \(\delta _{{i}{i}'}\) is equal to 0 otherwise

      (c)

        compute the instability associated to \(\varPsi _{m}\) by averaging \(U_{m}^{boot}=\frac{1}{{C}}\sum _{{c}=1}^{{C}}U_{m}^{c}\)

  • Pooling The set of partitions \(\left( \varPsi _{m}\right) _{1\le {m}\le {M}}\) and the set of associated instability estimates \(\left( U_{m}^{boot}\right) _{1\le {m}\le {M}}\) are aggregated as follows:

    • First rule (partitions pooling) using NMF or Mirkin based methods, compute the consensus partition as

      $$\begin{aligned} \bar{\varPsi }=argmin_{\varPsi } \sum _{{m}=1}^{M}\delta (\varPsi ,\varPsi _{m}) \end{aligned}$$

      where \(\delta (\varPsi ,\varPsi _{m})\) denotes the number of disagreements between partitions \(\varPsi \) and \(\varPsi _{m}\) (Mirkin distance)

    • Second rule (instability pooling) compute the total instability as:

      $$\begin{aligned} T=\frac{1}{{M}}\sum _{{m}=1}^{M}U_{m}^{boot} +\frac{1}{{M}^2} \sum _{{m}=1}^{M}\sum _{{m}'=1}^{M}\delta \left( {\varPsi _{{m}}},{\varPsi _{{m}'}}\right) /{n}^2 \end{aligned}$$

Note that an implementation is provided through the R package clusterMI, which is available on the web page of the first author.

3 Simulations

Having proposed rules for pooling clusterings after MI, we now highlight how the pooled results vary according to the data structure. Furthermore, we investigate their robustness to the number of imputed data sets \({M}\). The R code used for the simulations is available on request.

3.1 Simulation design

3.1.1 Data generation

Full data are simulated according to a \({p}\)-dimensional multivariate Gaussian mixture model with two mixture components:

$$\begin{aligned} X\sim \pi _1{\mathcal {N}}_{p}\left( \mu _1,\varSigma \left( \rho \right) \right) +\pi _2{\mathcal {N}}_{p}\left( \mu _2,\varSigma \left( \rho \right) \right) \end{aligned}$$

where \({p}=10\), \(\pi _1=\pi _2=1/2\), \(\mu _1=(0,0,0,0,0,0,0,0,0,0)\), \(\mu _2=(0,0,0,0,0,2,2,2,2,2)\) and \(\varSigma \left( \rho \right) \) is a covariance matrix parameterized by the correlation \(\rho \) between variables. Several configurations are investigated by varying the number of individuals \({n}\in \{25,50,100,200\}\) and the correlation between variables \(\rho \in \{0.3, 0.6\}\). For each configuration, \({S}=200\) data sets are generated. For each data set, several missing data patterns are considered, varying by the percentage of missing values \(\tau \in \{0.1, 0.3, 0.5\}\) and their distribution: \(Prob(r_{{i}{\ell }}=0)=\tau \) for all \(1\le {i}\le {n}\) and \(1\le {\ell }\le {p}\) (MCAR mechanism), or \(Prob(r_{{i}{\ell }}=0)=\varPhi (a_\tau +x_{{i}1})\) for all \(1\le {i}\le {n}\) and \(2\le {\ell }\le {p}\), with \(\varPhi \) the cumulative distribution function of the standard normal distribution and \(a_\tau \) a constant controlling the expected percentage of missing values (MAR mechanism). Thus, 9600 incomplete data sets are investigated. Note that the computational cost of investigating each one does not allow a larger number of replications.
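A sketch of the data generation is given below. The exact covariance matrix \(\varSigma (\rho )\) used in the paper is not reproduced; an exchangeable correlation structure with parameter \(\rho \) is assumed here purely for illustration, and \(a_\tau \) is a user-chosen constant.

## Two-component Gaussian mixture with MCAR or MAR missingness (illustrative Sigma(rho))
library(MASS)

generate_data <- function(n, p = 10, rho = 0.3) {
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1          # illustrative exchangeable correlation
  mu1 <- rep(0, p)
  mu2 <- c(rep(0, 5), rep(2, 5))
  z   <- rbinom(n, 1, 0.5)                              # mixture component, pi1 = pi2 = 1/2
  X   <- t(sapply(z, function(k) mvrnorm(1, if (k == 1) mu2 else mu1, Sigma)))
  list(X = X, z = z)
}

## MCAR: each cell missing with probability tau
add_mcar <- function(X, tau) {
  X[matrix(runif(length(X)) < tau, nrow(X))] <- NA
  X
}

## MAR: missingness of columns 2..p driven by the (always observed) first variable
add_mar <- function(X, a_tau) {
  for (l in 2:ncol(X)) X[runif(nrow(X)) < pnorm(a_tau + X[, 1]), l] <- NA
  X
}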

3.1.2 Methods

Each incomplete data set is imputed according to two MI methods accounting for the structure of individuals (Schafer 2003):

  • JM-DP: MI using a non-parametric extension of the mixture model, namely the Dirichlet process mixture of products of multivariate normal distributions (Kim et al. 2014). The number of components is bounded by 5. The number of iterations for the burn-in period is set to 500, and the number of iterations skipped between two retained imputed data sets after the burn-in period is set to 100.

  • FCS-RF: MI by random forests (Doove et al. 2014). The number of iterations of the multivariate imputation by chained equations algorithm is set to 10.
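The FCS-RF imputation corresponds to the random-forest method of the mice package; a minimal sketch is given below (the JM-DP imputation relies on a dedicated nonparametric Bayesian sampler and is not sketched here).

## FCS-RF: multiple imputation by random forests (Doove et al. 2014) via mice
library(mice)

impute_rf <- function(X_incomplete, M = 50) {
  imp <- mice(as.data.frame(X_incomplete), m = M, method = "rf",
              maxit = 10, printFlag = FALSE)                     # 10 cycles of the FCS algorithm
  lapply(seq_len(M), function(m) as.matrix(complete(imp, m)))    # list of M completed data matrices
}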

For each method, the number of imputed data sets \({M}\) varies in \(\{1,5,10,20,50\}\). Note that the case \({M}=1\) corresponds to single imputation. Then, k-means clustering is performed on each imputed data set (using 2 clusters, standardization of variables and 100 initializations) and partitions are pooled according to the proposed rules. More precisely, the mean partition is estimated using the NMF-based method (Eq. (9)) proposed in Li et al. (2007) and also using a Mirkin distance-based method (Eq. (8)) called Simulated-Annealing-One-element-Move (SAOM) (Filkov and Skiena 2004). Furthermore, the total instability is computed according to Eq. (18).

As a benchmark, k-means clustering is also performed on the full data and on the complete cases. In addition, k-means through a bagging procedure based on the bootstrap (Dudoit and Fridly 2003) is investigated on full data, while k-means through the k-pod algorithm (Chi et al. 2016) is investigated on incomplete data. This latter algorithm handles missing values in k-means by optimizing the k-means criterion over observed values only. To achieve this, a majorization-minimization algorithm is used, alternating between clustering of individuals (by k-means) and imputation of incomplete observations by the coordinates of their associated centroid.

3.1.3 Criteria

The accuracy of the consensus partition is assessed according to the mean (over the \({S}\) generated data sets) of the adjusted Rand index (ARI) (Hubert and Arabie 1985) between the consensus partition and the reference partition known by simulation. The mean (over the \({S}\) generated data sets) of the within instability \(\bar{U}\), the mean of the between instability B, and the mean of the total instability T are reported for each configuration.

3.2 Results

3.2.1 Partitions pooling

Figures 1 and 2 summarize the results over the \({S}=200\) simulations for both mechanisms (averages and interquartile ranges are available in “Tables 3 and 4 in Appendix”). Performances for MI methods consider \({M}=50\) imputed data sets; the influence of \({M}\) is discussed in Sect. 3.2.3. Note that because complete-case analysis does not cluster all individuals (contrary to MI methods or clustering on full data), the following process is applied for a fair comparison: first, complete cases are clustered using k-means clustering. Then, based on its observed profile, each incomplete case is classified according to the closest centroid. If the data set did not contain any complete case, a cluster would be assigned at random to each incomplete case. Thus, the resulting partition concerns all observations.
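For this benchmark, the classification of an incomplete case to the closest centroid, based on its observed coordinates only, can be sketched as follows (a hypothetical helper written for illustration, not the code used for the reported results).

## Assign one (possibly incomplete) observation x to the closest centroid, using observed coordinates only
classify_observed <- function(x, centers) {
  obs <- !is.na(x)
  if (!any(obs)) return(sample.int(nrow(centers), 1))     # nothing observed: assign a cluster at random
  d <- apply(centers[, obs, drop = FALSE], 1,
             function(ck) sum((x[obs] - ck)^2))            # squared Euclidean distance on observed coordinates
  which.min(d)
}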

Fig. 1

Accuracy of the clustering procedure under a MCAR mechanism: distribution of the adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), for various imputation methods (JM-DP or FCS-RF) and various consensus methods (NMF or SAOM). For each case, clustering is performed using k-means. As benchmarks, the ARI obtained by applying k-means on complete cases (CCA), using the k-pod algorithm (kpod), on full data (Full) or using a bagging procedure (Full-boot) are also reported

Fig. 2

Accuracy of the clustering procedure under a MAR mechanism: distribution of the adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), for various imputation methods (JM-DP or FCS-RF) and various consensus methods (NMF or SAOM). For each case, clustering is performed using k-means. As benchmarks, the ARI obtained by applying k-means on complete cases (CCA), using the k-pod algorithm (kpod), on full data (Full) or using a bagging procedure (Full-boot) are also reported

For both mechanisms, the ARI values obtained over the \({S}\) data sets when the contributory partitions are pooled using NMF and when they are pooled using SAOM remain generally close. However, both MI methods show higher ARI values with NMF pooling when the number of individuals and the proportion of missing values increase.

As expected, the ARI obtained by MI is close to the one obtained without missing values (Full or Full-boot) when the proportion of missing values \(\tau \) equals 0.1, and the gap widens when this proportion increases. Furthermore, clustering after MI outperforms complete-case analysis even if the proportion of missing values is small. Compared to the direct application of k-means using the k-pod algorithm, similar ARI values are observed under a MCAR mechanism, but MI outperforms k-pod under the MAR mechanism for a moderate (\(\tau =0.3\)) or large (\(\tau =0.5\)) proportion of missing values.

A large number of individuals slightly increases the ARI and decreases the interquartile range in MI, while it only decreases the interquartile range when data are full.

Finally, the ARI is usually higher when data are imputed using JM-DP than using FCS-RF. This could be expected since JM-DP is based on an imputation model close to the model used for data generation, while FCS-RF is based on a non-parametric model.

3.2.2 Instability pooling

Table 1 gathers the averages of the within instability, the between instability and the total instability over the \({S}=200\) data sets for configurations under a MCAR mechanism with \({n}=50\) or \({n}=200\) individuals (see “Table 6 in Appendix” for a MAR mechanism and Table 5 for \({n}\in \{25,100\}\)).

Table 1 Instability of the clustering procedure under a MCAR mechanism: average within instability (\(\bar{U}\)), average between instability (B) and average total instability (T) over the \({S}=200\) generated data sets for various numbers of individuals (\({n}\)), correlations between variables (\(\rho \)) and proportions of missing values (\(\tau \))

As expected, the total instability is always higher using MI (\({M}=50\)) than SI (\({M}=1\)) for all configurations. Furthermore, clustering instability is generally smaller using MI than complete-case analysis, and higher than the clustering instability when data are full. We note that the average between instability (B) tends to increase with the proportion of missing values, as expected. More surprisingly, the within instability (\(\bar{U}\)) also increases with the proportion of missing values, even if this increase remains comparatively small. This is particularly visible for small values of \({n}\). Such a behavior can be explained by overfitting of the imputation models. Indeed, FCS-RF, as a non-parametric imputation method, requires a large number of observations, while JM-DP, as a complex model, also requires a large number of observations to accurately fit the data structure. For this reason, when the number of individuals is small, the imputed values are highly variable, yielding an increase of the within instability. This behavior is more severe when the proportion of missing values is large. Note that instability using the k-pod algorithm is not considered since the method only returns a partition from incomplete data, not an instability measure.

3.2.3 Influence of \({M}\)

As underlined in Sect. 2.4.2, the instability given by the second rule does not depend (in expectation) on the number of imputed data sets (if \({M}\ge 2\)). Regarding partition accuracy, a large number of imputed data sets should bring the consensus partition closer to the partition obtained from full data. Figure 3 reports the influence of \({M}\) on the ARI for MI using JM-DP and NMF consensus.

Fig. 3

Accuracy of the clustering procedure according to \({M}\): adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), generated under a MCAR mechanism. Data sets are imputed by JM-DP with varying numbers of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means and consensus clustering is performed using NMF

For all configurations, a large value of \({M}\) tends to increase the ARI, meaning that a large number of imputed data sets tends to increase clustering accuracy. The increase is all the more marked as the proportion of missing values is large or the number of individuals is small.

Similar results have been observed when data are imputed by FCS-RF (“Fig. 8 in Appendix”) or when a MAR mechanism is considered (see “Fig. 9 in Appendix” for imputation by FCS-RF and Fig. 7 for imputation by JM-DP).

Figure 4 reports the influence of \({M}\) on the instability for MI using JM-DP and NMF consensus.

Fig. 4

Instability according to \({M}\): total instability (T) over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), generated under a MCAR mechanism. Data sets are imputed by JM-DP with varying numbers of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means and consensus clustering is performed using NMF

As expected, for \({M}\ge 2\) the instability remains stable with respect to \({M}\), whatever the proportion of missing values, the number of individuals or the correlation between variables. Similar results are observed when data are generated under a MAR mechanism or when data are imputed by FCS-RF (see “Figs. 10, 11, 12 in Appendix”).

3.3 Complement: number of clusters

As written in the introduction, having an instability measure with missing values provides a way to estimate the number of clusters from incomplete data. To assess the second rule with regard to this goal, a complementary simulation study is conducted by considering a grid for the number of clusters. More precisely, after multiple imputation, k-means clustering is applied for \({K}\) in \(\{ 2, \dots , 5\}\). By applying the second rule, an instability value \(T_{K}\) is obtained for each value of the grid. The estimated number of clusters is given by

$$\begin{aligned} argmin_{{K}\in \{2,\ldots ,{K}_{max}\}}\ T_{K}. \end{aligned}$$
(19)
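In code, the grid search amounts to computing \(T_{K}\) for each candidate K and returning the minimizer; a sketch reusing the impute_rf and total_instability helpers defined above (any MI method and clustering algorithm could be substituted).

## Estimate the number of clusters by minimizing the pooled instability T_K (Eq. 19)
choose_K <- function(X_incomplete, K_grid = 2:5, M = 50, C = 20) {
  imputed <- impute_rf(X_incomplete, M = M)        # or any other multiple imputation method
  T_K <- sapply(K_grid, function(K) total_instability(imputed, K = K, C = C)["total"])
  K_grid[which.min(T_K)]
}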

Results are reported in Table 2. The number of clusters is accurately estimated when performing imputation by JM-DP, but a small number of observations or a large proportion of missing values leads to an upward bias. This behavior is similar to complete-case analysis (even if that method cannot always be applied). Using FCS-RF leads to \({K}=4\) or \({K}=5\) in most cases. Better performances of JM-DP compared to FCS-RF were expected since JM-DP is well tailored to account for the clustered structure of observations (Audigier et al. 2021).

Table 2 Estimated number of clusters based on the second rule: frequencies of the values estimated within a grid between 2 and 5 clusters over the \({S}=200\) generated data sets for various numbers of individuals (\({n}\)), correlations between variables (\(\rho \)), missing data mechanisms (MCAR and MAR) and proportions of missing values (\(\tau \))

4 Application

In addition to estimating the instability in clustering, the second rule can be used for tuning the number of clusters with missing values. The animals data set from the cluster R package (Maechler et al. 2019) is used as an example (Kaufman and Rousseeuw 1990). It describes 20 animals by six binary variables (warm-blooded, can fly, vertebrate, endangered, live in groups, have hair). Five individuals are incomplete (see “Table 7 in Appendix”).

We propose to perform hierarchical clustering using the flexible UPGMA method (Belbin et al. 1992). Flexible UPGMA can be seen as a generalization of the average linkage method that ensures the desirable monotonicity property of the algorithm. To achieve this, data are imputed \({M}=50\) times according to a log-linear model (Schafer 1997), which is considered as the gold standard for binary variables. Then, hierarchical clustering is applied on each imputed data set for a given number of clusters \({K}\). Finally, the clustering instability T is assessed using the second rule. This process is repeated for \({K}\) varying in \(\{2,3,4,5\}\). Results are presented in Fig. 5, and the instability using only complete cases is also reported.

Fig. 5

Instability estimation in hierarchical clustering according to the number of clusters. Data are imputed using a log-linear model (MI-cat) with \({M}=50\) imputed data sets. The instability with complete-case analysis (CCA) is also reported

Using MI, the instability is smallest for \({K}=4\) clusters, suggesting a consensus clustering in 4 clusters. Results for complete-case analysis are less clear-cut, but a partition in two clusters could be suggested; a larger number could be inappropriate given the small number of complete cases (15).

The comparison between the consensus partition in four clusters obtained by NMF and the partition in two clusters obtained by hierarchical clustering on complete cases is presented through a principal factor map (Fig. 6).

Fig. 6

Animals data set: visualization of partitions through the principal factor map obtained using the iterative MCA algorithm (Josse et al. 2012)

The partition obtained by complete-case analysis gathers mammals with birds in the first cluster, while insects and fishes are gathered in the second. By contrast, consensus clustering after MI in four clusters isolates mammals in the first cluster and insects in the fourth. Furthermore, it allows a suitable clustering of incomplete individuals: lion among mammals, spider among insects, salmon and frog with herring, lobster with ant (both have the same observed profile, cf. “Table 7 in Appendix”).

5 Discussion

Multiple imputation is a widely used method for dealing with missing values. However, how to apply Rubin’s rules with clustering remained unclear: (1) how to pool the partitions obtained from each imputed data set? In this paper, we argue for using median partition-based methods to pool partitions. In particular, NMF methods are theoretically and computationally attractive for this purpose. (2) How to assess the instability of the clustering with missing values? Based on Fang and Wang (2012), we propose a new rule for assessing the stability with missing values. An associated R package entitled clusterMI is available on the web page of the first author.

From a practical point of view, the first rule provides accurate clustering with missing values. Indeed, even without missing data, Dudoit and Fridly (2003) have shown that bagging procedures based on the bootstrap improve clustering accuracy. By generating the imputed values \({M}\) times and aggregating the partitions obtained from each imputed data set, a similar improvement is observed with multiple imputation. It has been highlighted that the accuracy is sensitive to \({M}\), particularly when the number of individuals is small or the proportion of missing values is large. Simulations show that \({M}=50\) is generally enough, but \({M}\) can be tuned by investigating the evolution of the pooled partition according to the number of imputed data sets.

The second rule allows the calculation of an additional between instability B related to missing values. This instability has the advantage of being robust to the choice of \({M}\). The availability of a between instability is valuable in practice for several uses. Firstly, it provides a new way of dealing with the number of clusters when data are incomplete, which is particularly useful for distance-based clustering methods like k-means or k-medoids. Secondly, the ratio B/T provides a new way to assess how robust the partition is to the missing values (van Buuren 2018).

In this work, we assumed data were already imputed. It would also be interesting to investigate more deeply which imputation method is most suitable according to the clustering method applied to each imputed data set. This topic is commonly discussed under the term congeniality (Meng 1994; Schafer 2003). Simulation results suggest a sensitivity to the imputation method, both in terms of accuracy and instability, which is substantiated by recent research (Audigier et al. 2021). We also assumed data were missing at random; beyond the difficulty of imputing not-missing-at-random data, the proposed rules could be directly applied in that setting.

Furthermore, we assumed all contributory partitions have the same number of clusters \({K}\), but the methodology can be directly applied with varying numbers of clusters, as in hierarchical clustering where the number is generally unknown in advance. Indeed, NMF consensus clustering is essentially based on the average of the connectivity matrices associated with each contributory partition. Such an average can be obtained whatever the number of clusters, since the dimensions of each connectivity matrix depend only on the number of individuals, which is constant across partitions. The only requirement is that the clustering method allows the classification of new individuals. Even if this classification is always possible, certain classification methods are connected to certain clustering methods. For instance, classification using the closest centroid is suitable for k-means or k-medoids, while quadratic discriminant analysis could be more reliable for Gaussian mixtures. For other clustering methods, any of these classification methods could be used, but their impact on the stability measurement would require more research.

Finally, several NMF-based methods are available for partitions pooling. In this paper, we focus on the multiplicative update method proposed in Li et al. (2007), which is the most common but not necessarily the most efficient. Among the alternatives, alternating least squares algorithms are notably recommended for large-scale data (Cichocki et al. 2009).