1 Introduction

In data analytics, clustering is a fundamental technique concerned with partitioning a given dataset into useful groups (called clusters) according to the relative similarity among the data instances. Clustering algorithms attempt to partition a set of data instances (characterized by some features) into different clusters such that the member instances of any given cluster are akin to each other and are different from the members of the other clusters. The greater the similarity within a group and the dissimilarity between groups, the better is the clustering obtained by a suitable algorithm.

Clustering techniques are extensively used and hence constantly investigated in statistics, machine learning, and pattern recognition. Clustering algorithms find applications in various fields such as economics, marketing, electronic design, space research, etc. For example, clustering has been used to group related documents for web browsing (Broder et al. 1997; Haveliwala et al. 2000), by banks to cluster the previous transactions of clients to identify suspicious (possibly fraudulent) behaviour (Sabau 2012), for formulating effective marketing strategies by clustering customers with similar behaviour (Chaturvedi et al. 1997), in earthquake studies for identifying dangerous zones based on previous epicentre locations (Weatherill and Burton 2009; Shelly et al. 2009; Lei 2010), and so on. However, when we analyze such real-world data, we may encounter incomplete data where some features of some of the data instances are missing. For example, web documents may have some expired hyper-links. Such missingness may be due to a variety of reasons, such as data input errors, inaccurate measurement, equipment malfunction or limitations, and measurement noise or data corruption. This is known as unstructured missingness (Chan and Dunn 1972; Rubin 1976). Alternatively, not all the features may be defined for all the data instances in the dataset. This is termed structural missingness or absence of features (Chechik et al. 2008). For example, credit-card details may not be defined for non-credit card clients of a bank.

Missing features have always been a challenge for researchers because traditional learning methods (which assume all data instances to be fully observed, i.e. all the features are observed) cannot be directly applied to such incomplete data without suitable preprocessing. When the rate of missingness is low, the data instances with missing values may simply be ignored; this approach is known as marginalization. Marginalization cannot be applied to data having a sizable number of missing values, as it may lead to the loss of a substantial amount of information. Therefore, sophisticated methods are required to fill in the vacancies in the data, so that traditional learning methods can be applied subsequently. This approach of filling in the missing values is called imputation. However, inferences drawn from data having a large fraction of missing values may be severely warped, despite the use of such sophisticated imputation methods (Acuña and Rodriguez 2004).

1.1 Literature

The initial models for feature missingness are due to Rubin (1976); Little and Rubin (1987). They proposed a three-fold classification of missing data mechanisms, viz. Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). MCAR refers to the case where missingness is entirely haphazard, i.e. the likelihood of a feature being unobserved for a certain data instance depends neither on the observed nor on the unobserved characteristics of the instance. For example, in an annual income survey, a citizen is unable to participate due to unrelated reasons such as traffic or schedule problems. MAR refers to the cases where the missingness is conditional on the observed features of an instance, but is independent of the unobserved features. Suppose college-goers are less likely to report their income than office-goers. However, whether a college-goer will report his or her income is independent of the actual income. MNAR is characterized by the dependence of the missingness on the unobserved features. For example, people who earn less are less likely to report their incomes in the annual income survey. Datta et al. (2016b) further classified MNAR into two sub-types, namely MNAR-I when the missingness only depends on the unobserved features and MNAR-II when the missingness is governed by both observed as well as unobserved features. Schafer and Graham (2002) and Zhang et al. (2012) have observed that MCAR is a special case of MAR and that MNAR can also be converted to MAR by appending a sufficient number of additional features. Therefore, most learning techniques are based on the validity of the MAR assumption.

A lot of research on the problem of learning with missing or absent features has been conducted over the past few decades, mostly focussing on imputation methods. Several works such as Little and Rubin (1987) and Schafer (1997) provide elaborate theories and analyses of missing data. Common imputation methods (Donders et al. 2006) involve filling the missing features of data instances with zeros [Zero Imputation (ZI)], or the means of the corresponding features over the entire dataset [Mean Imputation (MI)]. Class Mean Imputation or Concept Mean Imputation (CMI) is a slight modification of MI that involves filling the missing features with the average of all observations having the same label as the instance being filled. Yet another common imputation method is k-Nearest Neighbor Imputation (kNNI) (Dixon 1979), where the missing features of a data instance are filled in by the means of the corresponding features over its k-Nearest Neighbors (kNN), on the observed subspace. Grzymala-Busse and Hu (2001) suggested various novel imputation schemes such as treating missing attribute values as special values. Rubin (1987) proposed a technique called Multiple Imputation (MtI) to model the uncertainty inherent in imputation. In MtI, the missing values are imputed by a typically small (e.g. 5–10) number of simulated versions, depending on the percentage of missing data (Chen 2013; Horton and Lipsitz 2001). Some more sophisticated imputation techniques have been developed, especially by the bioinformatics community, to impute the missing values by exploiting the correlations between data. A prominent example is the Singular Value Decomposition based Imputation (SVDI) technique (Troyanskaya et al. 2001), which performs regression based estimation of the missing values using the k most significant eigenvectors of the dataset. Other examples include Least Squares Imputation (LSI) (Bo et al. 2004), Non-Negative LSI (NNLSI) and Collateral Missing Value Estimation (CMVE) (Sehgal et al. 2005). Model-based methods are related to yet distinct from imputation techniques; these methods attempt to model the distributions of the missing values instead of filling them in (Dempster and Rubin 1983; Ahmad and Tresp 1993; Wang and Rao 2002a, b).
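To make these standard baselines concrete, the following minimal sketch (ours, not taken from any of the cited works) shows how ZI, MI, and a simple kNNI could be implemented for a numeric data matrix in which missing entries are encoded as NaN; the function names are illustrative choices.

```python
import numpy as np

def zero_impute(X):
    """Zero Imputation (ZI): replace every missing entry with 0."""
    Xf = X.copy()
    Xf[np.isnan(Xf)] = 0.0
    return Xf

def mean_impute(X):
    """Mean Imputation (MI): replace missing entries with the observed mean of their feature."""
    Xf = X.copy()
    col_means = np.nanmean(Xf, axis=0)           # per-feature mean over observed values
    idx = np.where(np.isnan(Xf))
    Xf[idx] = np.take(col_means, idx[1])
    return Xf

def knn_impute(X, k=1):
    """k-Nearest Neighbor Imputation (kNNI): fill missing entries of an instance with the
    mean of the corresponding feature over its k nearest neighbors, where neighbors are
    found using distances computed on the commonly observed features."""
    Xf = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            common = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if common.any():
                dists[j] = np.sqrt(np.sum((X[i, common] - X[j, common]) ** 2))
        neighbors = np.argsort(dists)[:k]
        for l in np.where(miss)[0]:
            vals = X[neighbors, l]
            vals = vals[~np.isnan(vals)]
            # fall back to the feature's observed mean if no neighbor observes feature l
            Xf[i, l] = vals.mean() if vals.size else np.nanmean(X[:, l])
    return Xf
```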

However, most of these techniques assume the pattern of missingness to be MCAR or MAR because this allows the use of simpler models of missingness (Heitjan and Basu 1996). Such simple models are not likely to perform well in case of MNAR as the pattern of missingness also holds information. Hence, other methods have to be developed to tackle incomplete data due to MNAR (Marlin 2008). Moreover, imputation may often lead to the introduction of noise and uncertainty in the data (Dempster and Rubin 1983; Little and Rubin 1987; Barceló 2008; Myrtveit et al. 2001).

In light of the observations made in the preceding paragraph, some learning methods avoid the inexact methods of imputation (as well as marginalization) altogether while dealing with missingness. A common paradigm is random subspace learning, where an ensemble of learners is trained on projections of the data in random subspaces and an inference is drawn based on the consensus among the ensemble (Krause and Polikar 2003; Juszczak and Duin 2004; Nanni et al. 2012). Chechik et al. (2008) used the geometrical insight of max-margin classification to formulate an objective function which was optimized to directly classify the incomplete data. This was extended to the max-margin regression case for software effort prediction with absent features in Zhang et al. (2012). Wagstaff (2004); Wagstaff and Laidler (2005) suggested a k-means algorithm with Soft Constraints (KSC) where soft constraints determined by fully observed objects are introduced to facilitate the grouping of instances with missing features. Himmelspach and Conrad (2010) provided a good review of partitional clustering techniques for incomplete datasets, which mentions some other techniques that do not make use of imputation.

The idea to modify the distance between the data instances to directly tackle missingness (without having to resort to imputation) was first put forth by Dixon (1979). The Partial Distance Strategy (PDS) proposed in Dixon (1979) scales up the observed distance, i.e. the distance between two data instances in their common observed subspace (the subspace consisting of the observed features common to both data instances) by the ratio of the total number of features (observed as well as unobserved) and the number of common observed features between them to obtain an estimate of their distance in the fully observed space. Hathaway and Bezdek (2001) used the PDS to extend the Fuzzy C-Means (FCM) clustering algorithm to cases with missing features. Furthermore, Millán-Giraldo et al. (2010) and Porro-Muñoz et al. (2013) generalized the idea of the PDS by proposing to scale the observed distance by factors other than the fraction of observed features. However, neither the PDS nor its extensions can always provide a good estimate of the actual distance as the observed distance between two instances may be unrelated to the distance between them in the unobserved subspace.

1.2 Motivation

As observed earlier, one possible way to adapt supervised as well as unsupervised learning methods to problems with missingness is to modify the distance or dissimilarity measure underlying the learning method. The idea is that the modified dissimilarity measure should use the common observed features to provide approximations of the distances between the data instances if they were to be fully observed. PDS is one such measure. Such approaches neither require marginalization nor imputation and are likely to yield better results than either of the two. For example, let \(X_{full}=\{{\mathbf {x}}_1=(1,2),{\mathbf {x}}_2=(1.8,1),{\mathbf {x}}_3=(2,2.5)\}\) be a dataset consisting of three points in \(\mathbb {R}^2\). Then, we have \(d_{E}({\mathbf {x}}_1,{\mathbf {x}}_2)=1.28\) and \(d_{E}({\mathbf {x}}_1,{\mathbf {x}}_3)=1.12\), \(d_{E}({\mathbf {x}}_i,{\mathbf {x}}_j)\) being the Euclidean distance between any two fully observed points \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) in \(X_{full}\). Suppose that the first coordinate of the point (1, 2) is unobserved, resulting in the incomplete dataset \(X=\{\widetilde{{\mathbf {x}}}_1=(*,2),{\mathbf {x}}_2=(1.8,1),{\mathbf {x}}_3=(2,2.5)\}\) (‘*’ denotes a missing value), on which learning must be carried out. Notice that this is a case of unstructured missingness (because the unobserved value is known to exist), as opposed to the structural missingness of Chechik et al. (2008). Using ZI, MI and 1NNI respectively, we obtain the following filled in datasets:

$$\begin{aligned} \begin{aligned} X_{ZI}&= \{\widehat{{\mathbf {x}}_1}=(0,2),{\mathbf {x}}_2=(1.8,1),{\mathbf {x}}_3=(2,2.5)\},\\ X_{MI}&= \{\widehat{{\mathbf {x}}_1}=(1.9,2),{\mathbf {x}}_2=(1.8,1),{\mathbf {x}}_3=(2,2.5)\},\\ \text {and } X_{1NNI}&= \{\widehat{{\mathbf {x}}_1}=(2,2),{\mathbf {x}}_2=(1.8,1),{\mathbf {x}}_3=(2,2.5)\}, \end{aligned} \end{aligned}$$

where \(\widehat{{\mathbf {x}}_1}\) denotes an estimate of \({\mathbf {x}}_1\). If PDS is used to estimate the corresponding distances in X, then the distance \(d_{PDS}({\mathbf {x}}_1,{\mathbf {x}}_i)\) between the implicit estimate of \({\mathbf {x}}_1\) and some other instance \({\mathbf {x}}_i \in X\) is obtained by

$$\begin{aligned} d_{PDS}({\mathbf {x}}_1,{\mathbf {x}}_i) = \sqrt{\frac{2}{1}(x_{1,2}-x_{i,2})^2}, \end{aligned}$$

where \(x_{1,2}\) and \(x_{i,2}\) respectively denote the 2nd features of \({\mathbf {x}}_1\) and \({\mathbf {x}}_i\), and 2 is the numerator of the multiplying factor due to the fact that \({\mathbf {x}}_i \in \mathbb {R}^2\) and 1 is the denominator owing to the fact that only the 2nd feature is observed for both \({\mathbf {x}}_1\) and \({\mathbf {x}}_i\). Then, we get

$$\begin{aligned} \begin{aligned}&d_{PDS}({\mathbf {x}}_1,{\mathbf {x}}_2) = \sqrt{\frac{2}{1}(2-1)^2} = 1.41,\\ \text {and }&d_{PDS}({\mathbf {x}}_1,{\mathbf {x}}_3) = \sqrt{\frac{2}{1}(2-2.5)^2} = 0.71. \end{aligned} \end{aligned}$$

The improper estimates obtained by PDS are due to the fact that the distance in the common observed subspace does not reflect the distance in the unobserved subspace. This is the principal drawback of the PDS method, as discussed earlier. Since the observed distance between two data instances is essentially a lower bound on the Euclidean distance between them (if they were to be fully observed), adding a suitable penalty to this lower bound can result in a reasonable approximation of the actual distance. This approach (Datta et al. 2016b), called the Penalized Dissimilarity Measure (PDM), may be able to overcome the drawback which plagues PDS. Let the penalty between \({\mathbf {x}}_1\) and \({\mathbf {x}}_i\) be given by the ratio of the number of features which are unobserved for at least one of the two data instances to the total number of features in the entire dataset. Then, the dissimilarity \(\delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_i)\) between the implicit estimate of \({\mathbf {x}}_1\) and some other \({\mathbf {x}}_i \in X\) is

$$\begin{aligned} \delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_i) = \sqrt{(x_{1,2}-x_{i,2})^2} + \frac{1}{2}, \end{aligned}$$

where the 1 in the numerator of the penalty term is due to the fact that the 1st feature of \({\mathbf {x}}_1\) is unobserved. Therefore, the dissimilarities \(\delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_2)\) and \(\delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_3)\) are

$$\begin{aligned} \begin{aligned}&\delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_2) = \sqrt{(2-1)^2} + \frac{1}{2} = 1.5,\\ \text {and }&\delta _{PDM}({\mathbf {x}}_1,{\mathbf {x}}_3) = \sqrt{(2-2.5)^2} + \frac{1}{2} = 1. \end{aligned} \end{aligned}$$
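The above computations are easy to reproduce programmatically. The following short sketch (ours, with NaN denoting the missing value) computes the PDS estimates and the simple penalized dissimilarity used in this example for the toy dataset X.

```python
import numpy as np

# Toy dataset: the first feature of x1 is missing (NaN).
x1 = np.array([np.nan, 2.0])
x2 = np.array([1.8, 1.0])
x3 = np.array([2.0, 2.5])
m = 2  # total number of features

def pds(a, b):
    """Partial Distance Strategy: scale the observed distance by m / (#common observed features)."""
    common = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt((m / common.sum()) * np.sum((a[common] - b[common]) ** 2))

def pdm(a, b):
    """Simple penalized dissimilarity: observed distance plus the fraction of features
    unobserved for at least one of the two points."""
    common = ~np.isnan(a) & ~np.isnan(b)
    penalty = (m - common.sum()) / m
    return np.sqrt(np.sum((a[common] - b[common]) ** 2)) + penalty

print(pds(x1, x2), pds(x1, x3))   # approx. 1.41 and 0.71
print(pdm(x1, x2), pdm(x1, x3))   # 1.5 and 1.0
```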
Fig. 1 Comparison of various techniques for handling missing features. a Comparison for \(\mathbf {x}_1\). b Comparison for \(\mathbf {x'}_1\)

The situation is illustrated in Fig. 1a. The reader should note that the points estimated using ZI, MI and 1NNI exist in the same 2-D Cartesian space to which \(X_{full}\) is native. On the other hand, the points estimated by both PDS and PDM exist in their individual abstract spaces (likely distinct from the native 2-D space). Therefore, for the sake of easy comparison, we illustrate all the estimates together by superimposing both these abstract spaces on the native 2-D space so as to coincide at the points \({\mathbf {x}}_2\) and \({\mathbf {x}}_3\). It can be seen that the approach based on the PDM does not suffer from the drawback of PDS and is better able to preserve the relationship between the points. Moreover, it should be noted that there are two possible images for each of the estimates obtained by both PDS and PDM. Therefore, had the partially observed point instead been \(\mathbf {x'}_1 = (3,2)\) with the first feature missing (giving rise to the same incomplete dataset X; \(\widetilde{\mathbf {x'}}_1\) replacing the identical incomplete point \(\widetilde{{\mathbf {x}}}_1\)), PDS and PDM would still find reasonably good estimates (PDM still being better than PDS). This situation is also illustrated in Fig. 1b. In general,

1. ZI works well only for missing values in the vicinity of the origin and is also origin dependent;

2. MI works well only when the missing value is near the observed mean of the missing feature;

3. kNNI is reliant on the assumption that neighbors have similar features, but suffers from the drawbacks that missingness may give rise to erroneous neighbor selection and that the estimates are restricted to the range of observed values of the feature in question;

4. PDS suffers from the assumption that the common observed distances reflect the unobserved distances; and

5. none of these methods differentiate between identical incomplete points, i.e. \(\widetilde{{\mathbf {x}}}_1\) and \(\widetilde{{\mathbf {x}}}'_1\) are not differentiated between.

However, a PDM successfully steers clear of all these drawbacks (notice that \(\delta ({\mathbf {x}}_1,{\mathbf {x}}'_1) = \frac{1}{2}\)). Furthermore, such a PDM can also be easily applied to the case of absent features, by slightly modifying the penalty term (see “Appendix A”). This knowledge motivates us to use a PDM to adapt traditional clustering methods to problems with missing features.

1.3 Contribution

The Feature Weighted Penalty based Dissimilarity (FWPD) measure is a PDM used in Datta et al. (2016b) for kNN classification of datasets with missing features. The FWPD between two data instances is a weighted sum of two terms; the first term being the observed distance between the instances and the second being a penalty term. The penalty term is a sum of the penalties corresponding to each of the features which are missing from at least one of the data instances; each penalty being directly proportional to the probability of its corresponding feature being observed. Such a weighting scheme imposes a greater penalty if a feature which is observed for a large fraction of the data is missing for a particular instance. On the other hand, if the missing feature is unobserved for a large fraction of the data, then a smaller penalty is imposed.

The contributions of the current article are as follows:

1. In the current article, we formulate the k-means clustering problem for datasets with missing features based on the proposed FWPD and develop an algorithm to solve the new formulation.

2. We prove that the proposed algorithm is guaranteed to converge to a locally optimal solution of the modified k-means optimization problem formulated with the FWPD measure.

3. We also propose Single Linkage, Average Linkage, and Complete Linkage based HAC methods for datasets plagued by missingness, based on the proposed FWPD.

4. We provide an extensive discussion on the properties of the FWPD measure, which is more thorough than the corresponding discussion in Datta et al. (2016b).

5. We further provide a detailed algorithm for simulating the four types of missingness enumerated in Datta et al. (2016b), namely MCAR, MAR, MNAR-I (missingness only depends on the unobserved features) and MNAR-II (missingness depends on both observed as well as unobserved features).

6. Moreover, since this work presents an alternative to imputation and can be useful in scenarios where imputation is not practical (such as structural missingness), we append an extension of the proposed FWPD to the case of absent features (where the absent features are known to be undefined or non-existent). We also show that the FWPD becomes a semi-metric in the case of structural missingness.

Experiments are reported on diverse datasets and cover all four types of missingness. The results are compared with those of popularly used imputation techniques, and indicate that our approach generally achieves better performance than the common imputation approaches used to handle incomplete data.

1.4 Organization

The rest of this paper is organized in the following way. In Sect. 2, we elaborate on the properties of the FWPD measure. The next section (Sect. 3) presents a formulation of the k-means clustering problem which is directly applicable to datasets with missing features, based on the FWPD discussed in Sect. 2. This section also puts forth an algorithm to solve the optimization problem posed by this new formulation. The subsequent section (Sect. 4) covers the HAC algorithm formulated using FWPD to be directly applicable to incomplete datasets. Experimental results (based on the missingness simulating mechanism discussed in the same section) are presented in Sect. 5. Relevant conclusions are drawn in Sect. 6. Subsequently, “Appendix A” deals with the extension of the proposed FWPD to the case of absent features (structural missingness).

2 Feature weighted penalty based dissimilarity measure for datasets with missing features

Let the dataset \(X \subset \mathbb {R}^m\), i.e. the data instances in X are each characterized by m feature values in \(\mathbb {R}\). Further, let X consist of n instances \({\mathbf {x}}_i\) (\(i \in \{1, 2, \cdots , n\}\)), some of which have missing features. Let \(\gamma _{{\mathbf {x}}_i}\) denote the set of observed features for the data point \({\mathbf {x}}_i\). Then, the set of all features \(S=\bigcup _{i=1}^{n}\gamma _{{\mathbf {x}}_i}\) and \(|S|=m\). The set of features which are observed for all data instances in X is defined as \(\gamma _{obs}=\bigcap _{i=1}^{n}\gamma _{{\mathbf {x}}_i}\). \(|\gamma _{obs}|\) may or may not be non-zero. \(\gamma _{miss}=S\backslash \gamma _{obs}\) is the set of features which are unobserved for at least one data point in X. The important notations used in this section (and beyond) are summarized in Table 1.

Table 1 Some important notations used in Sect. 2 and beyond
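For concreteness (and purely as an illustration, not part of the original formulation), the notation above maps naturally onto a matrix representation in which missing entries are encoded as NaN. The snippet below extracts the sets \(\gamma _{{\mathbf {x}}_i}\), \(\gamma _{obs}\), \(\gamma _{miss}\) and the per-feature observation counts used later, for the five-point dataset that appears in Example 1 below (with 0-based feature indices).

```python
import numpy as np

# A toy incomplete dataset: rows are instances, columns are the m features; NaN marks a missing value.
X = np.array([
    [np.nan, 3.0, 2.0],
    [1.2, np.nan, 4.0],
    [np.nan, 0.0, 0.5],
    [2.1, 3.0, 1.0],
    [-2.0, np.nan, np.nan],
])

observed = ~np.isnan(X)                                   # observed[i, l] is True iff feature l is observed for x_i
gamma = [set(map(int, np.where(row)[0])) for row in observed]  # gamma[i] corresponds to gamma_{x_i}
S = set(range(X.shape[1]))                                # the set of all features
gamma_obs = set.intersection(*gamma)                      # features observed for every instance
gamma_miss = S - gamma_obs                                # features missing for at least one instance
w = observed.sum(axis=0)                                  # w_l: number of instances observing feature l

print(gamma)        # [{1, 2}, {0, 2}, {1, 2}, {0, 1, 2}, {0}]
print(gamma_obs)    # set(): no feature is observed for every instance in this toy dataset
print(w)            # [3 3 4]
```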

Definition 1

Let the distance between any two data instances \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\) in a subspace defined by \(\gamma \) be denoted as \(d_{\gamma }({\mathbf {x}}_i,{\mathbf {x}}_j)\). Then the observed distance (distance in the common observed subspace) between these two points can be defined as

$$\begin{aligned} d_{\gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j}} ({\mathbf {x}}_i,{\mathbf {x}}_j)=\sqrt{\sum \nolimits _{l \in \gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j}}(x_{i,l}-x_{j,l})^2}, \end{aligned}$$
(1)

where \(x_{i,l}\) denotes the l-th feature of the data instance \({\mathbf {x}}_i\). For the sake of convenience, \(d_{\gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j}}({\mathbf {x}}_i,{\mathbf {x}}_j)\) is simplified to \(d({\mathbf {x}}_i,{\mathbf {x}}_j)\) in the rest of this paper.

Definition 2

If both \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) were to be fully observed, the Euclidean distance \(d_{E}({\mathbf {x}}_i,{\mathbf {x}}_j)\) between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) would be defined as

$$\begin{aligned} d_{E}({\mathbf {x}}_i,{\mathbf {x}}_j)=\sqrt{\sum \nolimits _{l \in S}(x_{i,l}-x_{j,l})^2}. \end{aligned}$$

Now, since \((\gamma _{{\mathbf {x}}_i}\cap \gamma _{{\mathbf {x}}_j})\subseteq S\), and \((x_{i,l}-x_{j,l})^2 \ge 0\) \(\forall \) \(l \in S\), it follows that

$$\begin{aligned} d({\mathbf {x}}_i,{\mathbf {x}}_j) \le d_{E}({\mathbf {x}}_i,{\mathbf {x}}_j) \text { } \forall \text { } {\mathbf {x}}_i,{\mathbf {x}}_j \in X. \end{aligned}$$

Therefore, to compensate for the distance in the unobserved subspace, we add a Feature Weighted Penalty (FWP) \(p({\mathbf {x}}_i,{\mathbf {x}}_j)\) (defined below) to \(d({\mathbf {x}}_i,{\mathbf {x}}_j)\).

Definition 3

The FWP between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) is defined as

$$\begin{aligned} p({\mathbf {x}}_i,{\mathbf {x}}_j)=\frac{\mathop {\sum }\nolimits _{{l \in S \backslash (\gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j})}}\;w_l}{{\mathop \sum \nolimits }_{{l' \in S}}\, w_{l'}}, \end{aligned}$$
(2)

where \(w_l \in (0,n]\) is the number of instances in X having observed values of the feature l. It should be noted that FWP exacts greater penalty for unobserved occurrences of those features which are observed for a large fraction of the data instances. Moreover, since the value of the FWP solely depends on the taxable subspace \(S \backslash (\gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j})\), we define an alternative notation for the FWP, viz. \(p_{\gamma }={\mathop \sum \nolimits _{l \in \gamma }\;w_l}/{\mathop \sum \nolimits _{l' \in S}\;w_{l'}}\). Hence, \(p({\mathbf {x}}_i,{\mathbf {x}}_j)\) can also be written as \(p_{S \backslash (\gamma _{{\mathbf {x}}_i}\bigcap \gamma _{{\mathbf {x}}_j})}\).

Then, the definition of the proposed FWPD follows.

Definition 4

The FWPD between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) is

$$\begin{aligned} \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=(1-\alpha )\times \frac{d({\mathbf {x}}_i,{\mathbf {x}}_j)}{d_{max}} + \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j), \end{aligned}$$
(3)

where \(\alpha \in (0,1]\) is a parameter which determines the relative importance between the two terms and \(d_{max}\) is the maximum observed distance between any two points in X in their respective common observed subspaces.
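As a minimal illustrative sketch (assuming that missing values are encoded as NaN and that the observation counts \(w_l\) and \(d_{max}\) have been precomputed over the dataset), the FWPD of Eq. (3) can be computed as follows; the function name and encoding are our own choices rather than a reference implementation.

```python
import numpy as np

def fwpd(a, b, w, d_max, alpha=0.5):
    """Feature Weighted Penalty based Dissimilarity (Eq. 3) between two possibly
    incomplete instances a and b (NaN = missing).

    w      : array of observation counts w_l for every feature, computed over the whole dataset
    d_max  : maximum pairwise observed distance in the dataset (used for normalization)
    alpha  : relative weight of the penalty term, alpha in (0, 1]
    """
    common = ~np.isnan(a) & ~np.isnan(b)                   # gamma_a intersection gamma_b
    d_obs = np.sqrt(np.sum((a[common] - b[common]) ** 2))  # observed distance, Eq. (1)
    penalty = w[~common].sum() / w.sum()                   # FWP over S \ (gamma_a ∩ gamma_b), Eq. (2)
    return (1.0 - alpha) * d_obs / d_max + alpha * penalty
```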

2.1 Properties of the proposed FWPD

In this subsection, we discuss some of the important properties of the proposed FWPD measure. The following theorem establishes these properties, and the subsequent discussion is concerned with the triangle inequality in the context of FWPD.

Theorem 1

The FWPD measure satisfies the following important properties:

1. \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\),

2. \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\) \(\forall \) \({\mathbf {x}}_i \in X\),

3. \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = 0\) iff \(\gamma _{{\mathbf {x}}_i}=S\), and

4. \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\).

Proof

1. From Eqs. (1) and (3), it follows that

    $$\begin{aligned} \delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_i). \end{aligned}$$
    (4)

    It also follows from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i) \le p({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\). Therefore, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j)\). Moreover, since the observed distance term in Eq. (3) is non-negative, \(\alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\). Hence, we get \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\).

2. It can be seen from Eq. (3) that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_i)\). Moreover, it follows from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\). Hence, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\) \(\forall \) \({\mathbf {x}}_i \in X\).

3. It is easy to see from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i)=0\) iff \(\gamma _{{\mathbf {x}}_i}=S\). Hence, it directly follows from Eq. (4) that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = 0\) iff \(\gamma _{{\mathbf {x}}_i}=S\).

4. From Eq. (3) we have

    $$\begin{aligned} \begin{aligned}&\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=(1-\alpha )\times \frac{d({\mathbf {x}}_i,{\mathbf {x}}_j)}{d_{max}} + \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j),\\ \text {and }&\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)=(1-\alpha )\times \frac{d({\mathbf {x}}_j,{\mathbf {x}}_i)}{d_{max}} + \alpha \times p({\mathbf {x}}_j,{\mathbf {x}}_i). \end{aligned} \end{aligned}$$

    However, \(d({\mathbf {x}}_i,{\mathbf {x}}_j)=d({\mathbf {x}}_j,{\mathbf {x}}_i)\) and \(p({\mathbf {x}}_i,{\mathbf {x}}_j)=p({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\) (by definition). Therefore, it can be easily seen that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\).

\(\square \)

The triangle inequality is an important criterion which lends some useful properties to the space induced by a dissimilarity measure. Therefore, the conditions under which FWPD satisfies the said criterion are investigated below. However, it should be stressed that the satisfaction of the said criterion is not essential for the functioning of the clustering techniques proposed in the subsequent text.

Definition 5

For any three data instances \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\), the triangle inequality with respect to (w.r.t.) the FWPD measure is defined as

$$\begin{aligned} \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i). \end{aligned}$$
(5)

The following three lemmas deal with the conditions under which Inequality (5) holds.

Lemma 1

For any three data instances \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\) let \(\rho _{i,j,k} = p({\mathbf {x}}_i,{\mathbf {x}}_j) + p({\mathbf {x}}_j,{\mathbf {x}}_k) - p({\mathbf {x}}_k,{\mathbf {x}}_i)\). Then \(\rho _{i,j,k} \ge 0\) \(\forall \) \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\).

Proof

Let us rewrite the penalty term \(p({\mathbf {x}}_i,{\mathbf {x}}_j)\) in terms of the spanned subspaces as \(p({\mathbf {x}}_i,{\mathbf {x}}_j) = p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} + p_{\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {x}}_j}} + p_{\gamma _{{\mathbf {x}}_j} \backslash \gamma _{{\mathbf {x}}_i}}\). Now, accounting for the subspaces overlapping with the observed subspace of \({\mathbf {x}}_k\), we get

$$\begin{aligned} \begin{aligned} p({\mathbf {x}}_i,{\mathbf {x}}_j) =&p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_k} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} + p_{\gamma _{{\mathbf {x}}_i} \backslash (\gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} \\&+ p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_j}} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{(\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_i}}.\\ \text {Similarly, } p({\mathbf {x}}_j,{\mathbf {x}}_k) =&p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_i} \backslash (\gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} \\&+ p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}) \backslash \gamma _{{\mathbf {x}}_k}} + p_{\gamma _{{\mathbf {x}}_k} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_j}},\\ \text {and } p({\mathbf {x}}_k,{\mathbf {x}}_i) =&p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_k} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} \\&+ p_{(\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_i}} + p_{\gamma _{{\mathbf {x}}_i} \backslash (\gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}) \backslash \gamma _{{\mathbf {x}}_k}}.\\ \end{aligned} \end{aligned}$$

Hence, after canceling out appropriate terms, we get

$$\begin{aligned} \rho _{i,j,k}= & {} p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_k} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} + 2 p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_j}}\\&+ p_{\gamma _{{\mathbf {x}}_i} \backslash (\gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})}. \end{aligned}$$

Now, since

$$\begin{aligned} p_{\gamma _{{\mathbf {x}}_i} \backslash (\gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{\gamma _{{\mathbf {x}}_k} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j})} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_j}} = p_{(\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k}) \backslash \gamma _{{\mathbf {x}}_j}}, \end{aligned}$$

we can further simplify to

$$\begin{aligned} \rho _{i,j,k} = p_{(\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})}. \end{aligned}$$
(6)

Since all the terms in Expression (6) must be either zero or positive, this proves that \(\rho _{i,j,k} \ge 0\) \(\forall \) \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\). \(\square \)

Lemma 2

For any three data points \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\), Inequality (5) is satisfied when \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j})=(\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k})=(\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i})\).

Proof

From Eq. (3) the Inequality (5) can be rewritten as

$$\begin{aligned} \begin{aligned}&(1-\alpha ) \times \frac{d({\mathbf {x}}_i,{\mathbf {x}}_j)}{d_{max}} + \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j) \\&\qquad + (1-\alpha ) \times \frac{d({\mathbf {x}}_j,{\mathbf {x}}_k)}{d_{max}} + \alpha \times p({\mathbf {x}}_j,{\mathbf {x}}_k) \\&\quad \ge (1-\alpha ) \times \frac{d({\mathbf {x}}_k,{\mathbf {x}}_i)}{d_{max}} + \alpha \times p({\mathbf {x}}_k,{\mathbf {x}}_i). \end{aligned} \end{aligned}$$
(7)

Further simplifying (7) by moving the penalty terms to the Left Hand Side (LHS) and the observed distance terms to the Right Hand Side (RHS), we get

$$\begin{aligned} \alpha \times \rho _{i,j,k} \ge \frac{(1-\alpha )}{d_{max}} \times (d({\mathbf {x}}_k,{\mathbf {x}}_i) - (d({\mathbf {x}}_i,{\mathbf {x}}_j)+d({\mathbf {x}}_j,{\mathbf {x}}_k))). \end{aligned}$$
(8)

When \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j})=(\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k})=(\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i})\), as \(d({\mathbf {x}}_i,{\mathbf {x}}_j)+d({\mathbf {x}}_j,{\mathbf {x}}_k) \ge d({\mathbf {x}}_k,{\mathbf {x}}_i)\), the RHS of Inequality (8) is less than or equal to zero. Now, it follows from Lemma 1 that the LHS of Inequality (8) is always greater than or equal to zero as \(\rho _{i,j,k} \ge 0\) and \(\alpha \in (0,1]\). Hence, LHS \(\ge \) RHS, which completes the proof. \(\square \)

Lemma 3

If \(|\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}| \rightarrow 0\), \(|\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}| \rightarrow 0\) and \(|\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i}| \rightarrow 0\), then Inequality (8) tends to be satisfied.

Proof

When \(|\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}| \rightarrow 0\), \(|\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}| \rightarrow 0\) and \(|\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i}| \rightarrow 0\), then LHS \(\rightarrow \alpha ^{+}\) and RHS \(\rightarrow 0\) for the Inequality (8). As \(\alpha \in (0,1]\), Inequality (8) tends to be satisfied. \(\square \)

The following lemma deals with the value of the parameter \(\alpha \in (0,1]\) for which a relaxed form of the triangle inequality is satisfied for any three data instances in a dataset X.

Lemma 4

Let \({\mathscr {P}} = min \{\rho _{i,j,k}:{\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X, \rho _{i,j,k} > 0\}.\) Then, for any arbitrary constant \(\epsilon \) satisfying \(0 \le \epsilon \le {\mathscr {P}}\), if \(\alpha \ge (1-\epsilon )\), then the following relaxed form of the triangle inequality

$$\begin{aligned} \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) - {\epsilon }^2, \end{aligned}$$
(9)

is satisfied for any \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\).

Proof

1. If \({\mathbf {x}}_i\), \({\mathbf {x}}_j\), and \({\mathbf {x}}_k\) are all fully observed, then Inequality (5) holds. Now, since \(\epsilon \ge 0\), therefore \(\delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) - {\epsilon }^2\). This implies \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) - {\epsilon }^2\). Hence, Inequality (9) must hold.

2. If \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \ne S\) i.e. at least one of the data instances is not fully observed, and \(\rho _{i,j,k} = 0\), then \((\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j} = \phi \), \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j} = \phi \), \(S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k}) = \phi \), and \(\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k}) = \phi \). This implies that \(\gamma _{{\mathbf {x}}_j} = S\), and \(\gamma _{{\mathbf {x}}_k} \bigcup \gamma _{{\mathbf {x}}_i} = \gamma _{{\mathbf {x}}_j}\). Moreover, since \(\rho _{i,j,k} = 0\), we have \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) = d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k) - d({\mathbf {x}}_k,{\mathbf {x}}_i)\). Now, \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_i}\), \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_k}\) and \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_j}\) as \(\gamma _{{\mathbf {x}}_k} \bigcup \gamma _{{\mathbf {x}}_i} = \gamma _{{\mathbf {x}}_j} = S\). Therefore, \(d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k) - d({\mathbf {x}}_k,{\mathbf {x}}_i) \ge d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_i,{\mathbf {x}}_j) + d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_j,{\mathbf {x}}_k) - d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_k,{\mathbf {x}}_i)\). Now, by the triangle inequality in subspace \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}\), \(d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_i,{\mathbf {x}}_j) + d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_j,{\mathbf {x}}_k) - d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_k,{\mathbf {x}}_i) \ge 0\). Hence, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge 0\), i.e. Inequalities (5) and (9) are satisfied.

3. If \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \ne S\) and \(\rho _{i,j,k} \ne 0\), as \(\alpha \ge (1-\epsilon )\), LHS of Inequality (8) \(\ge (1-\epsilon ) \times (p_{(\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})})\). Since \(\epsilon \le {\mathscr {P}}\), we further get that LHS \(\ge (1-\epsilon )\epsilon \). Moreover, as \(\frac{1}{d_{max}}(d({\mathbf {x}}_k,{\mathbf {x}}_i) - (d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k))) \le 1\), we get RHS of Inequality (8) \(\le \epsilon \). Therefore, LHS - RHS \(\ge (1-\epsilon )\epsilon - \epsilon = -{\epsilon }^2\). Now, as Inequality (8) is obtained from Inequality (5) after some algebraic manipulation, it must hold that (LHS - RHS) of Inequality (8) = (LHS - RHS) of Inequality (5). Hence, we get \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge -{\epsilon }^2\) which can be simplified to obtain Inequality (9). This completes the proof.

\(\square \)

Let us now elucidate the proposed FWP (and consequently the proposed FWPD measure) by using the following example.

Example 1

Let \(X \subset \mathbb {R}^3\) be a dataset consisting of \(n=5\) data points, each having three features (\(S=\{1,2,3\}\)), some of which (marked by ’*’) are unobserved. The dataset is presented below (along with the feature observation counts and the observed feature sets for each of the instances).

Data Point | \(x_{i,1}\) | \(x_{i,2}\) | \(x_{i,3}\) | \(\gamma _{{\mathbf {x}}_i}\)
\({\mathbf {x}}_1\) | * | 3 | 2 | \(\{2,3\}\)
\({\mathbf {x}}_2\) | 1.2 | * | 4 | \(\{1,3\}\)
\({\mathbf {x}}_3\) | * | 0 | 0.5 | \(\{2,3\}\)
\({\mathbf {x}}_4\) | 2.1 | 3 | 1 | \(\{1,2,3\}\)
\({\mathbf {x}}_5\) | \(-2\) | * | * | \(\{1\}\)
Obs. Count | \(w_1=3\) | \(w_2=3\) | \(w_3=4\) |

The pairwise observed distance matrix \(A_{d}\) and the pairwise penalty matrix \(A_{p}\), are as follows:

$$\begin{aligned} A_{d} = \left[ \begin{matrix} 0, &{} 2, &{} 3.35, &{} 1, &{} 0 \\ 2, &{} 0, &{} 3.5, &{} 3.13, &{} 3.2 \\ 3.35, &{} 3.5, &{} 0, &{} 3.04, &{} 0 \\ 1, &{} 3.13, &{} 3.04, &{} 0, &{} 4.1 \\ 0, &{} 3.2, &{} 0, &{} 4.1, &{} 0 \end{matrix} \right] \text { and } A_{p} = \left[ \begin{matrix} 0.3, &{} 0.6, &{} 0.3, &{} 0.3, &{} 1 \\ 0.6, &{} 0.3, &{} 0.6, &{} 0.3, &{} 0.7 \\ 0.3, &{} 0.6, &{} 0.3, &{} 0.3, &{} 1 \\ 0.3, &{} 0.3, &{} 0.3, &{} 0, &{} 0.7 \\ 1, &{} 0.7, &{} 1, &{} 0.7, &{} 0.7 \end{matrix} \right] . \end{aligned}$$

From \(A_{d}\) it is observed that the maximum pairwise observed distance \(d_{max}=4.1\). Then, the normalized observed distance matrix \(A_{{\bar{d}}}\) is

$$\begin{aligned} A_{{\bar{d}}} = \left[ \begin{matrix} 0, &{} 0.49, &{} 0.82, &{} 0.24, &{} 0 \\ 0.49, &{} 0, &{} 0.85, &{} 0.76, &{} 0.78 \\ 0.82, &{} 0.85, &{} 0, &{} 0.74, &{} 0 \\ 0.24, &{} 0.76, &{} 0.74, &{} 0, &{} 1 \\ 0, &{} 0.78, &{} 0, &{} 1, &{} 0 \end{matrix} \right] . \end{aligned}$$

Here \({\mathscr {P}} = 0.3\). While it is not necessary for the clustering methods proposed later, let us choose \(\alpha =0.7\), which satisfies the premise of Lemma 4 with \(\epsilon = {\mathscr {P}}\). Using Eq. (3) to calculate the FWPD matrix \(A_{\delta }\), we get:

$$\begin{aligned} A_{\delta } = 0.3 \times A_{{\bar{d}}} + 0.7 \times A_{p} = \left[ \begin{matrix} 0.21, &{} 0.57, &{} 0.46, &{} 0.28, &{} 0.7 \\ 0.57, &{} 0.21, &{} 0.68, &{} 0.44, &{} 0.72 \\ 0.46, &{} 0.68, &{} 0.21, &{} 0.43, &{} 0.7 \\ 0.28, &{} 0.44, &{} 0.43, &{} 0, &{} 0.79 \\ 0.7, &{} 0.72, &{} 0.7, &{} 0.79, &{} 0.49 \end{matrix} \right] . \end{aligned}$$

It should be noted that in keeping with the properties of the FWPD described in Sect. 2.1, \(A_{\delta }\) is a symmetric matrix with the diagonal elements being the smallest entries in their corresponding rows (and columns) and the diagonal element corresponding to the fully observed point \({\mathbf {x}}_4\) being the only zero element. Moreover, it can be easily checked that the relaxed form of the triangle inequality, as given in Inequality (9), is always satisfied.
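The matrices of Example 1, as well as the relaxed triangle inequality check mentioned above, can be reproduced with a short script; the sketch below (ours) recomputes \(A_{d}\), \(A_{p}\) and \(A_{\delta }\) directly from Definitions 1, 3 and 4 under the NaN encoding used earlier.

```python
import numpy as np

X = np.array([
    [np.nan, 3.0, 2.0],     # x1
    [1.2, np.nan, 4.0],     # x2
    [np.nan, 0.0, 0.5],     # x3
    [2.1, 3.0, 1.0],        # x4
    [-2.0, np.nan, np.nan], # x5
])
n, m = X.shape
obs = ~np.isnan(X)
w = obs.sum(axis=0)                       # observation counts: [3, 3, 4]

A_d = np.zeros((n, n))                    # pairwise observed distances
A_p = np.zeros((n, n))                    # pairwise FWP penalties
for i in range(n):
    for j in range(n):
        common = obs[i] & obs[j]
        A_d[i, j] = np.sqrt(np.sum((X[i, common] - X[j, common]) ** 2))
        A_p[i, j] = w[~common].sum() / w.sum()

d_max = A_d.max()                         # 4.1
alpha = 0.7
A_delta = (1 - alpha) * A_d / d_max + alpha * A_p
print(np.round(A_delta, 2))

# Numerical check of the relaxed triangle inequality (Inequality (9)) with eps = P = 0.3:
eps = 0.3
ok = all(A_delta[i, j] + A_delta[j, k] >= A_delta[k, i] - eps**2
         for i in range(n) for j in range(n) for k in range(n))
print(ok)                                 # True
```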

Table 2 Some important notations used in Sect. 3 and beyond

3 k-means clustering for datasets with missing features using the proposed FWPD

This section presents a reformulation of the k-means clustering problem for datasets with missing features, using the FWPD measure proposed in Sect. 2. The important notations used in this section (and beyond) are summarized in Table 2. The k-means problem (a term coined by MacQueen (1967)) deals with the partitioning of a set of n data instances into \(k(< n)\) clusters so as to minimize the sum of within-cluster dissimilarities. The standard heuristic algorithm to solve the k-means problem, referred to as the k-means algorithm, was first proposed by Lloyd in 1957 (Lloyd 1982), and rediscovered by Forgy (1965). Starting with random assignments of each of the data instances to one of the k clusters, the k-means algorithm functions by iteratively recalculating the k cluster centroids and reassigning the data instances to the nearest cluster (the cluster corresponding to the nearest cluster centroid), in an alternating manner. Selim and Ismail (1984) showed that the k-means algorithm converges to a local optimum of the non-convex optimization problem posed by the k-means problem, when the dissimilarity used is the Euclidean distance between data points.

The proposed formulation of the k-means problem for datasets with missing features using the proposed FWPD measure, referred to as the k-means-FWPD problem hereafter, differs from the standard k-means problem not only in that the underlying dissimilarity measure used is FWPD (instead of Euclidean distance), but also in the addition of a new constraint which ensures that a cluster centroid has observable values for exactly those features which are observed for at least one of the points in its corresponding cluster. Therefore, the k-means-FWPD problem to partition the dataset X into k clusters \((2 \le k < n)\), can be formulated in the following way:

$$\begin{aligned} \text {P: minimize } f(U,&Z)=\sum _{i=1}^{n} \sum _{j=1}^{k} u_{i,j}((1-\alpha )\times \frac{d({\mathbf {x}}_{i},{\mathbf {z}}_{j})}{d_{max}} + \alpha \times p({\mathbf {x}}_{i},{\mathbf {z}}_{j})), \end{aligned}$$
(10a)
$$\begin{aligned} \text {subject to }&\sum _{j=1}^{k} u_{i,j}=1 \text { } \forall \text { } i \in \{1, 2, \cdots , n\}, \end{aligned}$$
(10b)
$$\begin{aligned}&u_{i,j} \in \{0,1\} \text { } \forall \text { } i \in \{1, 2, \cdots , n\},j \in \{1, 2, \cdots , k\}, \end{aligned}$$
(10c)
$$\begin{aligned} \text {and }&\gamma _{{\mathbf {z}}_j} = \bigcup _{{\mathbf {x}}_{i} \in C_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\}, \end{aligned}$$
(10d)

where \(U=[u_{i,j}]\) is the \(n \times k\) real matrix of memberships, \(d_{max}\) denotes the maximum observed distance between any two data points \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\), \(\gamma _{{\mathbf {z}}_j}\) denotes the set of observed features for \({\mathbf {z}}_j\) \((j \in \{1, 2, \cdots , k\})\), \(C_{j}\) denotes the j-th cluster (corresponding to the centroid \({\mathbf {z}}_j\)), \(Z=\{{\mathbf {z}}_1, \cdots , {\mathbf {z}}_k\}\), and it is said that \({\mathbf {x}}_i \in C_{j}\) when \(u_{i,j}=1\).
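As an illustration of the objective (10a) (again assuming the NaN encoding for missing values in both the data and the centroids, and with hypothetical function and variable names), the value of f(U, Z) for a given membership matrix and centroid set could be evaluated as follows.

```python
import numpy as np

def kmeans_fwpd_objective(X, U, Z, alpha):
    """Value of the objective f(U, Z) in (10a); X and Z use NaN for missing/unobserved
    entries, and U is an n-by-k binary membership matrix satisfying (10b)-(10c)."""
    obs_X, obs_Z = ~np.isnan(X), ~np.isnan(Z)
    w = obs_X.sum(axis=0).astype(float)          # observation counts w_l over the dataset
    n, k = U.shape
    # Maximum pairwise observed distance between data points.
    d_max = max(np.sqrt(np.sum((X[i, obs_X[i] & obs_X[j]] - X[j, obs_X[i] & obs_X[j]]) ** 2))
                for i in range(n) for j in range(n))
    total = 0.0
    for i in range(n):
        for j in range(k):
            if U[i, j]:
                c = obs_X[i] & obs_Z[j]          # common observed subspace of x_i and z_j
                d = np.sqrt(np.sum((X[i, c] - Z[j, c]) ** 2))
                p = w[~c].sum() / w.sum()
                total += (1 - alpha) * d / d_max + alpha * p
    return total
```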

3.1 The k-means-FWPD algorithm

To find a solution to the problem P, which is a non-convex program, we propose a Lloyd’s heuristic-like algorithm based on the FWPD (referred to as k-means-FWPD algorithm), as follows:

1. Start with a random initial set of cluster assignments U such that \(\sum _{j=1}^{k} u_{i,j}=1\). Set \(t=1\) and specify the maximum number of iterations MaxIter.

2. For each cluster \(C_{j}^{t}\) \((j = 1, 2, \cdots , k)\), calculate the observed features of the cluster centroid \({\mathbf {z}}_{j}^{t}\). The value for the l-th feature of a centroid \({\mathbf {z}}_{j}^{t}\) should be the average of the corresponding feature values for all the data instances in the cluster \(C_{j}^{t}\) having observed values for the l-th feature. If none of the data instances in \(C_{j}^{t}\) have observed values for the feature in question, then the value \(z_{j,l}^{t-1}\) of the feature from the previous iteration should be retained. Therefore, the feature values are calculated as follows:

    $$\begin{aligned} z_{j,l}^{t}=\left\{ \begin{array}{ll} \left( \underset{{\mathbf {x}}_i \in X_l}{\sum }\; u_{i,j}^{t} \times x_{i,l}\right) \bigg / \left( \underset{{\mathbf {x}}_i \in X_l}{\sum }\; u_{i,j}^{t}\right) \text { }, &{} \forall \text { } l \in \mathop \bigcup \nolimits _{{\mathbf {x}}_i \in C_{j}^{t}} \gamma _{{\mathbf {x}}_i},\\ z_{j,l}^{t-1}, &{} \forall \text { } l \in \gamma _{{\mathbf {z}}_j^{t-1}} \backslash \mathop \bigcup \nolimits _{{\mathbf {x}}_i \in C_{j}^{t}} \gamma _{{\mathbf {x}}_i},\\ \end{array} \right. \end{aligned}$$
    (11)

    where \(X_l\) denotes the set of all \({\mathbf {x}}_i \in X\) having observed values for the feature l.

3. Assign each data point \({\mathbf {x}}_i\) \((i=1, 2, \cdots , n)\) to the cluster corresponding to its nearest (in terms of FWPD) centroid, i.e.

    $$\begin{aligned} u_{i,j}^{t+1} = \left\{ \begin{array}{ll} 1, &{} \text{ if } {\mathbf {z}}_{j}^{t}=\mathop {\mathrm{arg\, min}}\limits _{{\mathbf {z}} \in Z^t} \; \delta ({\mathbf {x}}_i,{\mathbf {z}}),\\ 0, &{} \text{ otherwise }.\\ \end{array} \right. \end{aligned}$$

    Set \(t=t+1\). If \(U^{t}=U^{t-1}\) or \(t = MaxIter\), then go to Step 4; otherwise go to Step 2.

4. Calculate the final cluster centroid set \(Z^*\) as:

    $$\begin{aligned} z_{j,l}^{*}=\frac{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l}\; u_{i,j}^{t+1} \times x_{i,l}}{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l}\; u_{i,j}^{t+1}} \text { } \forall \text { } l \in \bigcup _{{\mathbf {x}}_i \in C_{j}^{t+1}} \gamma _{{\mathbf {x}}_i}. \end{aligned}$$
    (12)

    Set \(U^* = U^{t+1}\).
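The following compact sketch (ours, not the reference implementation) follows the four steps above, assuming missing values are encoded as NaN and storing each centroid as a full-length vector whose unobserved features are NaN.

```python
import numpy as np

def kmeans_fwpd(X, k, alpha=0.5, max_iter=100, rng=None):
    """Sketch of the k-means-FWPD algorithm; X is an n-by-m array with NaN for missing values."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    obs = ~np.isnan(X)
    w = obs.sum(axis=0).astype(float)          # feature observation counts w_l

    # Maximum pairwise observed distance d_max over the dataset.
    d_max = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            c = obs[i] & obs[j]
            d_max = max(d_max, np.sqrt(np.sum((X[i, c] - X[j, c]) ** 2)))

    def fwpd(x, z):
        c = ~np.isnan(x) & ~np.isnan(z)
        d = np.sqrt(np.sum((x[c] - z[c]) ** 2))
        p = w[~c].sum() / w.sum()
        return (1 - alpha) * d / d_max + alpha * p

    # Step 1: random initial assignments.
    labels = rng.integers(k, size=n)
    Z = np.full((k, m), np.nan)

    for _ in range(max_iter):
        # Step 2: centroid update (Eq. 11); features unobserved in the whole
        # cluster retain their previous value (possibly still NaN).
        for j in range(k):
            members = labels == j
            for l in range(m):
                vals = X[members & obs[:, l], l]
                if vals.size:
                    Z[j, l] = vals.mean()
        # Step 3: reassign each point to its nearest centroid w.r.t. FWPD.
        new_labels = np.array([np.argmin([fwpd(X[i], Z[j]) for j in range(k)])
                               for i in range(n)])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    # Step 4: final centroids are defined only over the features observed in each cluster (Eq. 12).
    Z_star = np.full((k, m), np.nan)
    for j in range(k):
        members = labels == j
        for l in range(m):
            vals = X[members & obs[:, l], l]
            if vals.size:
                Z_star[j, l] = vals.mean()
    return labels, Z_star
```

For instance, a call such as kmeans_fwpd(X, k=2, alpha=0.7, rng=0) on an incomplete data matrix X returns the final cluster labels and the centroid set \(Z^*\) of Step 4.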

Fig. 2 Comparison of convergences of traditional k-means and k-means-FWPD algorithms. a Traditional k-means. b k-means-FWPD

Fig. 3 Simplified representation of how the k-means-FWPD algorithm traverses the \({\mathscr {U}} \times {\mathscr {Z}}\) space (\({\mathscr {U}}\) and \({\mathscr {Z}}\) are shown to be unidimensional for the sake of visualizability)

Remark 1

Each iteration of the traditional k-means algorithm is known to result in a decrease in the value of the objective function f (Selim and Ismail 1984), as shown in Fig. 2a. However, for the k-means-FWPD algorithm, the \(Z^t\) calculations for some of the iterations may result in a finite increase in f, as shown in Fig. 2b. We show in Theorem 3 that only a finite number of such increments may occur during a given run of the algorithm, thus ushering in ultimate convergence. Moreover, the final feasible, locally-optimal solution is obtained using Step 4 (denoted by the dotted line), which does not result in any further change to the objective function value.

3.2 Notions of feasibility in problem P

Let \({\mathscr {U}}\) and \({\mathscr {Z}}\) respectively denote the sets of all possible U and Z. Unlike the traditional k-means problem, the entire \({\mathscr {U}} \times {\mathscr {Z}}\) space is not feasible for the Problem P. There exists a set of feasible U for a given Z. Similarly, there exist sets of feasible and super-feasible Z (a super-set of the set of feasible Z) for a given U. In this subsection, we formally define these notions.

Definition 6

Given a cluster centroid set Z, the set \({\mathscr {F}}(Z)\) of feasible membership matrices is given by

$$\begin{aligned} {\mathscr {F}}(Z) = \{U : u_{i,j} = 0 \text { } \forall \text { } i,j \text { such that } \gamma _{{\mathbf {x}}_i} \nsubseteq \gamma _{{\mathbf {z}}_j} \}, \end{aligned}$$

i.e. \({\mathscr {F}}(Z)\) is the set of all such membership matrices which do not assign any \({\mathbf {x}}_i \in X\) to a centroid in Z missing some feature \(l \in \gamma _{{\mathbf {x}}_i}\).

Definition 7

Given a membership matrix U, the set \({\mathscr {F}}(U)\) of feasible cluster centroid sets can be defined as

$$\begin{aligned} {\mathscr {F}}(U) = \{Z : Z \text { satisfies constraint (10d)} \}. \end{aligned}$$

Definition 8

Given a membership matrix U, the set \({\mathscr {S}}(U)\) of super-feasible cluster centroid sets is

$$\begin{aligned} {\mathscr {S}}(U) = \{Z : \gamma _{{\mathbf {z}}_j} \supseteq \bigcup _{{\mathbf {x}}_{i} \in C_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\} \}, \end{aligned}$$
(13)

i.e. \({\mathscr {S}}(U)\) is the set of all such centroid sets which ensure that any centroid has observed values at least for those features which are observed for any of the points assigned to its corresponding cluster in U.

Remark 2

The k-means-FWPD problem differs from traditional k-means in that not all \(U \in {\mathscr {U}}\) are feasible for a given Z. Additionally, for a given U, there exists a set \({\mathscr {S}}(U)\) of super-feasible Z; \({\mathscr {F}}(U)\) a subset of \({\mathscr {S}}(U)\) being the set of feasible Z. The traversal of the k-means-FWPD algorithm is illustrated in Fig. 3 where the grey solid straight lines denote the set of feasible Z for the current \(U^t\) while the rest of the super-feasible region is marked by the corresponding grey dotted straight line. Furthermore, the grey jagged lines denote the feasible set of U for the current \(Z^t\). Starting with a random \(U^1 \in {\mathscr {U}}\) (Step 1), the algorithm finds \(Z^1 \in {\mathscr {S}}(U^1)\) (Step 2), \(U^2 \in {\mathscr {F}}(Z^1)\) (Step 3), and \(Z^2 \in {\mathscr {F}}(U^2)\) (Step 2). However, it subsequently finds \(U^3 \not \in {\mathscr {F}}(Z^2)\) (Step 3), necessitating a feasibility adjustment (see Sect. 3.4) while calculating \(Z^3\) (Step 2). Subsequently, the algorithm converges to \((U^5,Z^4)\). For the convergent \((U^{T+1},Z^T)\), \(U^{T+1} \in {\mathscr {F}}(Z^T)\) but it is possible that \(Z^T \in {\mathscr {S}}(U^{T+1})\backslash {\mathscr {F}}(U^T)\) (as seen in the case of Fig. 3). However, the final \((U^*,Z^*)\) (obtained by the dotted black line transition denoting Step 4) is seen to be feasible in both respects and is shown (in Theorem 5) to be locally-optimal in the corresponding feasible region.

3.3 Partial optimal solutions

This subsection deals with the concept of partial optimal solutions of the problem P, to one of which the k-means-FWPD algorithm is shown to converge (prior to Step 4). The following definition formally presents the concept of a partial optimal solution.

Definition 9

A partial optimal solution \(({\tilde{U}},{\tilde{Z}})\) of problem P, satisfies the following conditions (Wendel and Hurter Jr 1976):

$$\begin{aligned} \begin{aligned}&f({\tilde{U}},{\tilde{Z}}) \le f(U,{\tilde{Z}}) \text { } \forall \text { } U \in {\mathscr {F}}({\tilde{Z}}) \text { where } {\tilde{U}} \in {\mathscr {F}}({\tilde{Z}}),\\ \text {and }&f({\tilde{U}},{\tilde{Z}}) \le f({\tilde{U}},Z) \text { } \forall \text { } Z \in {\mathscr {S}}({\tilde{U}}) \text { where } {\tilde{Z}} \in {\mathscr {S}}({\tilde{U}}). \end{aligned} \end{aligned}$$

To obtain a partial optimal solution of P, the two following subproblems are defined:

$$\begin{aligned} \begin{aligned}&\text {P1: Given } U \in {\mathscr {U}}, \text { minimize } f(U,Z) \text { over } Z \in {\mathscr {S}}(U).\\&\text {P2: Given } Z \text { satisfying (10d), minimize } f(U,Z) \text { over } U \in {\mathscr {U}}. \end{aligned} \end{aligned}$$

The following lemmas establish that Steps 2 and 3 of the k-means-FWPD algorithm respectively solve the problems P1 and P2 for a given iterate. The subsequent theorem shows that the k-means-FWPD algorithm converges to a partial optimal solution of P.

Lemma 5

Given a \(U^t\), the centroid matrix \(Z^t\) calculated using Eq. (11) is an optimal solution of the Problem P1.

Proof

For a fixed \(U^t \in {\mathscr {U}}\), the objective function is minimized when \(\frac{\partial f}{\partial z_{j,l}^t}=0 \text { } \forall j \in \{1, \cdots , k\}, l \in \gamma _{{\mathbf {z}}^t_j}\). For a particular \({\mathbf {z}}^t_j\), it follows from Definition 3 that \(\{p({\mathbf {x}}_{i},{\mathbf {z}}^t_{j}): {\mathbf {x}}_i \in C^t_j\}\) is independent of the values of the features of \({\mathbf {z}}^t_j\), as \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {z}}^t_j} = \gamma _{{\mathbf {x}}_i}\ \text { } \forall {\mathbf {x}}_i \in C^t_j\). Since an observed feature \(l \in \gamma _{{\mathbf {z}}^t_j} \backslash (\bigcup _{{\mathbf {x}}_{i} \in C^t_{j}} \gamma _{{\mathbf {x}}_i})\) of \({\mathbf {z}}^t_j\) has no contribution to the observed distances, \(\frac{\partial f}{\partial z_{j,l}^t}=0 \text { } \forall l \in \gamma _{{\mathbf {z}}^t_j} \backslash (\bigcup _{{\mathbf {x}}_{i} \in C^t_{j}} \gamma _{{\mathbf {x}}_i})\). For an observed feature \(l\in \bigcup _{{\mathbf {x}}_{i} \in C^t_{j}} \gamma _{{\mathbf {x}}_i}\) of \({\mathbf {z}}^t_j\), differentiating \(f(U^t,Z^t)\) w.r.t. \(z_{j,l}^t\) we get

$$\begin{aligned} \frac{\partial f}{\partial z_{j,l}^t} = -\frac{(1-\alpha )}{d_{max}} \times \underset{{\mathbf {x}}_i \in X_l}{\sum } u_{i,j}^t\left( \frac{x_{i,l}-z_{j,l}^t}{d({\mathbf {x}}_i,{\mathbf {z}}^t_j)}\right) . \end{aligned}$$

Setting \(\frac{\partial f}{\partial z_{j,l}^t} = 0\) and solving for \(z_{j,l}^t\), we obtain

$$\begin{aligned} z_{j,l}^t=\frac{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l} u_{i,j}^t \times x_{i,l}}{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l} u_{i,j}^t}. \end{aligned}$$

Since Eq. (11) is in keeping with this criterion and ensures that constraint (13) is satisfied, the centroid matrix \(Z^t\) calculated using Eq. (11) is an optimal solution of P1. \(\square \)
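
To make this update rule concrete, the following minimal Python sketch computes each centroid feature as the mean over the cluster members that observe that feature, in line with the weighted-mean criterion derived above. It is an illustrative sketch rather than the authors' implementation; the data matrix X (with np.nan marking unobserved entries) and the hard membership matrix U are assumed inputs.

```python
import numpy as np

def update_centroids(X, U):
    """Weighted-mean update of Lemma 5: for cluster j and feature l, average
    x_{i,l} over the members of C_j that observe feature l.
    X is n x m with np.nan for unobserved entries; U is the n x k hard
    membership matrix, so the weights u_{i,j} are 0 or 1."""
    n, m = X.shape
    k = U.shape[1]
    observed = ~np.isnan(X)
    Z = np.full((k, m), np.nan)        # features observed by no member are left undefined here
    for j in range(k):
        members = U[:, j].astype(bool)
        for l in range(m):
            sel = members & observed[:, l]   # the set X_l restricted to C_j
            if sel.any():
                Z[j, l] = X[sel, l].mean()
    return Z
```

Features of \({\mathbf {z}}^t_j\) that no member observes do not contribute to the observed distances, so the sketch simply leaves them unobserved.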

Lemma 6

For a given \(Z^t\), problem P2 is solved by setting, for each \(i \in \{1, \cdots , n\}\), \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) for all \(j^{'} \ne j\), where j is such that \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j) \le \delta ({\mathbf {x}}_i,{\mathbf {z}}^t_{j^{'}})\) for all \(j^{'} \ne j\).

Proof

It is clear that the contribution of \({\mathbf {x}}_i\) to the total objective function is \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j)\) when \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) \(\forall \) \(j^{'} \ne j\). Since any alternative solution is an extreme point of \({\mathscr {U}}\) (Selim and Ismail 1984), it must satisfy (10c). Therefore, the contribution of \({\mathbf {x}}_i\) to the objective function for an alternative solution will be some \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_{j^{'}}) \ge \delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j)\). Hence, the contribution of \({\mathbf {x}}_i\) is minimized by assigning \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) \(\forall \) \(j^{'} \ne j\). This argument holds true for all \({\mathbf {x}}_i \in X\), i.e. \(\forall \) \(i \in \{1, \cdots , n\}\). This completes the proof. \(\square \)

Theorem 2

The k-means-FWPD algorithm finds a partial optimal solution of P.

Proof

Let T denote the terminal iteration. Since Step 2 and Step 3 of the k-means-FWPD algorithm respectively solve P1 and P2, the algorithm terminates only when the obtained iterate \((U^{T+1},Z^{T})\) solves both P1 and P2. Therefore, \(f(U^{T+1},Z^{T}) \le f(U^{T+1},Z) \text { } \forall Z \in {\mathscr {S}}(U^{T+1})\). Since Step 2 ensures that \(Z^{T} \in {\mathscr {S}}(U^{T})\) and \(U^{T+1} = U^{T}\), we must have \(Z^{T} \in {\mathscr {S}}(U^{T+1})\). Moreover, \(f(U^{T+1},Z^{T}) \le f(U,Z^T)\) \(\forall U \in {\mathscr {U}}\) which implies \(f(U^{T+1},Z^{T}) \le f(U,Z^T) \text { } \forall U \in {\mathscr {F}}(Z^T)\). Now, Step 2 ensures that \(\gamma _{{\mathbf {z}}^{T}_j} \supseteq \bigcup _{{\mathbf {x}}_{i} \in C^T_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\}\). Since we must have \(U^{T+1} = U^{T}\) for convergence to occur, it follows that \(\gamma _{{\mathbf {z}}^{T}_j} \supseteq \bigcup _{{\mathbf {x}}_{i} \in C^{T+1}_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\}\), hence \(u^{T+1}_{i,j} = 1\) implies \(\gamma _{{\mathbf {z}}^T_j} \supseteq \gamma _{{\mathbf {x}}_i}\). Therefore, \(U^{T+1} \in {\mathscr {F}}(Z^T)\). Consequently, the terminal iterate of Step 3 of the k-means-FWPD algorithm must be a partial optimal solution of P. \(\square \)

3.4 Feasibility adjustments

The number of observed features of a cluster centroid may need to increase over the iterations so as to maintain feasibility w.r.t. constraint (10d). We refer to this phenomenon as a feasibility adjustment and discuss its consequences in this subsection.

Definition 10

A feasibility adjustment for cluster j (\(j \in \{1,2,\cdots ,k\}\)) is said to occur in iteration t if \(\gamma _{{\mathbf {z}}^t_j} \supset \gamma _{{\mathbf {z}}^{t-1}_j}\) (equivalently, \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \), since a centroid never loses observed features), i.e. if the centroid \({\mathbf {z}}^t_j\) acquires an observed value for at least one feature which was unobserved for its counterpart \({\mathbf {z}}^{t-1}_j\) in the previous iteration.
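
In code, checking Definition 10 reduces to a set difference between the observed-feature index sets of consecutive centroids. A minimal sketch follows, carrying over the np.nan convention for unobserved centroid features assumed in the earlier sketch:

```python
import numpy as np

def observed_features(z):
    """Indices of the observed features of a centroid row (np.nan = unobserved)."""
    return set(np.flatnonzero(~np.isnan(z)))

def feasibility_adjustment(z_prev, z_curr):
    """True iff z_curr acquires at least one feature that was unobserved for
    z_prev, i.e. the set difference gamma(z^t_j) \ gamma(z^{t-1}_j) is non-empty."""
    return bool(observed_features(z_curr) - observed_features(z_prev))
```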

The following lemma shows that feasibility adjustment can only occur for a cluster as a result of the addition of a new data point previously unassigned to it.

Lemma 7

Feasibility adjustment occurs for a cluster \(C_j\) in iteration t iff at least one data point \({\mathbf {x}}_i\), such that \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\), which was previously unassigned to \(C_j\) (i.e. \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\)) is assigned to it in iteration t.

Proof

Due to Eq. (11), all features defined for \({\mathbf {z}}^{t-1}_j\) are also retained for \({\mathbf {z}}^t_j\). Therefore, for \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \) to hold, there must exist some \({\mathbf {x}}_i\) such that \(u^t_{i,j} = 1\), \(u^{t-1}_{i,j} = 0\), and \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \). Since the set of defined features of any cluster centroid is a monotonically growing set, we have \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\). It then follows from constraint (10d) that \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\). To prove the converse, let us assume the existence of some \({\mathbf {x}}_i\) with \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\) and \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\), which is assigned to \(C_j\) in iteration t (i.e. \(u^t_{i,j} = 1\)). Then Eq. (11) ensures that \(\gamma _{{\mathbf {z}}^t_j} \supseteq \gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {z}}^{t-1}_j}\), and since \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \), it follows that \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \). \(\square \)

The following theorem deals with the consequences of the feasibility adjustment phenomenon.

Theorem 3

During a single run of the k-means-FWPD algorithm, the objective function f may increase by a finite amount in a finite number of iterations, owing to the occurrence of feasibility adjustments.

Proof

It follows from Lemma 5 that \(f(U^t,Z^t) \le f(U^t,Z)\) \(\forall Z \in {\mathscr {S}}(U^t)\). If there is no feasibility adjustment in iteration t, \({\mathscr {S}}(U^{t-1}) = {\mathscr {S}}(U^t)\). Hence, \(f(U^t,Z^t) \le f(U^t,Z^{t-1})\). However, if a feasibility adjustment occurs in iteration t, then \(\gamma _{{\mathbf {z}}^t_j} \supset \gamma _{{\mathbf {z}}^{t-1}_j}\) for at least one \(j \in \{1,2,\cdots ,k\}\). Hence, \(Z^{t-1} \in {\mathscr {Z}} \backslash {\mathscr {S}}(U^t)\) and we may have \(f(U^t,Z^t) > f(U^t,Z^{t-1})\). Since both \(f(U^t,Z^t)\) and \(f(U^t,Z^{t-1})\) are finite, \((f(U^t,Z^t) - f(U^t,Z^{t-1}))\) must also be finite. Now, the maximum number of feasibility adjustments occurs in the worst-case scenario where each data point, having a unique set of observed features (which are unobserved for all other data points), traverses all the clusters before convergence. Therefore, the maximum number of possible feasibility adjustments during a single run of the k-means-FWPD algorithm is \(n(k-1)\), which is finite. \(\square \)

3.5 Convergence of the k-means-FWPD algorithm

We now show that the k-means-FWPD algorithm converges to a partial optimal solution within a finite number of iterations. The following lemma and the subsequent theorem are concerned with this.

Lemma 8

Starting with a given iterate \((U^t,Z^t)\), the k-means-FWPD algorithm either reaches convergence or encounters a feasibility adjustment, within a finite number of iterations.

Proof

Let us first note that there are a finite number of extreme points of \({\mathscr {U}}\). Then, we observe that an extreme point of \({\mathscr {U}}\) is visited at most once by the algorithm before either convergence or the next feasibility adjustment. Suppose this is not true, and let \(U^{t_1}=U^{t_2}\) for distinct iterations \(t_1\) and \(t_2\) \((t_1 \ge t, t_1 < t_2)\) of the algorithm. Applying Step 2 of the algorithm, we get \(Z^{t_1}\) and \(Z^{t_2}\) as optimal centroid sets for \(U^{t_1}\) and \(U^{t_2}\), respectively. Then, \(f(U^{t_1},Z^{t_1}) = f(U^{t_2},Z^{t_2})\) since \(U^{t_1}=U^{t_2}\). However, it is clear from Lemmas 5 and 6 and Theorem 3 that f strictly decreases subsequent to the iterate \((U^t,Z^t)\) and prior to either the next feasibility adjustment (in which case the value of f may increase) or convergence (in which case f remains unchanged). Hence, \(U^{t_1} \ne U^{t_2}\), contradicting the supposition. Therefore, it is clear from the above argument that the k-means-FWPD algorithm either converges or encounters a feasibility adjustment within a finite number of iterations. \(\square \)

Theorem 4

The k-means-FWPD algorithm converges to a partial optimal solution within a finite number of iterations.

Proof

It follows from Lemma 8 that the first feasibility adjustment is encountered within a finite number of iterations of initialization and that each subsequent feasibility adjustment occurs within a finite number of iterations of the previous one. Moreover, we know from Theorem 3 that there can only be a finite number of feasibility adjustments during a single run of the algorithm. Therefore, the final feasibility adjustment must occur within a finite number of iterations. Finally, it follows from Lemma 8 that the algorithm converges within a finite number of subsequent iterations. Hence, the k-means-FWPD algorithm must converge within a finite number of iterations. \(\square \)

3.6 Local optimality of the final solution

In this subsection, we establish the local optimality of the final solution obtained in Step 4 of the k-means-FWPD algorithm, subsequent to convergence in Step 3.

Lemma 9

\(Z^*\) is the unique optimal feasible cluster centroid set for \(U^*\), i.e. \(Z^* \in {\mathscr {F}}(U^*)\) and \(f(U^*,Z^*) \le f(U^*,Z) \text { } \forall Z \in {\mathscr {F}}(U^*)\).

Proof

Since \(Z^*\) satisfies constraint (10d) for \(U^*\), \(Z^* \in {\mathscr {F}}(U^*)\). We know from Lemma 5 that for \(f(U^*,Z^*) \le f(U^*,Z) \text { } \forall Z \in {\mathscr {F}}(U^*)\), we must have

$$\begin{aligned} z^*_{j,l}=\frac{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l} u^*_{i,j} \times x_{i,l}}{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l} u^*_{i,j}}. \end{aligned}$$

As this is ensured by Step 4, \(Z^*\) must be the unique optimal feasible cluster centroid set for \(U^*\). \(\square \)

Lemma 10

If \(Z^*\) is the unique optimal feasible cluster centroid set for \(U^*\), then \(f(U^*,Z^*) \le f(U,Z^*) \text { } \forall U \in {\mathscr {F}}(Z^*)\).

Proof

We know from Theorem 2 that \(f(U^{*},Z^{T}) \le f(U,Z^T) \text { } \forall U \in {\mathscr {F}}(Z^T)\) (noting that \(U^* = U^{T+1}\)). Now, \(\gamma _{{\mathbf {z}}^*_j} \subseteq \gamma _{{\mathbf {z}}^T_j} \text { } \forall j \in \{1,2,\cdots ,k\}\). Therefore, \({\mathscr {F}}(Z^*) \subseteq {\mathscr {F}}(Z^T)\) must hold. It then follows that \(f(U^*,Z^*) \le f(U,Z^*) \text { } \forall U \in {\mathscr {F}}(Z^*)\). \(\square \)

Now, the following theorem shows that the final solution obtained by Step 4 of the k-means-FWPD algorithm is locally optimal.

Theorem 5

The final solution \((U^*,Z^*)\) obtained by Step 4 of the k-means-FWPD algorithm is a local optimal solution of P.

Proof

We have from Lemma 10 that \(f(U^*,Z^*) \le f(U,Z^*) \text { } \forall U \in {\mathscr {F}}(Z^*)\). Therefore, \(f(U^*,Z^*) \le \min _{U} \{f(U,Z^*):U \in {\mathscr {F}}(Z^*)\}\), which implies that, for every feasible direction D at \(U^*\), the one-sided directional derivative (Lasdon 2013) satisfies

$$\begin{aligned} trace(\nabla _{U}f(U^*,Z^*)^{\mathrm {T}}D) \ge 0. \end{aligned}$$
(14)

Now, since \(Z^*\) is the unique optimal feasible cluster centroid set for \(U^*\) (Lemma 9), \((U^*,Z^*)\) is a local optimum of problem P. \(\square \)

3.7 Time complexity of the k-means-FWPD algorithm

In this subsection, we present a brief discussion on the time complexity of the k-means-FWPD algorithm. The k-means-FWPD algorithm consists of four basic steps, which are repeated iteratively. These steps are:

1. Centroid Calculation: As a maximum of m features of each centroid must be calculated, the complexity of centroid calculation is at most \({\mathscr {O}}(kmn)\).

2. Distance Calculation: As each distance calculation involves at most m features, the observed distance calculation between n data instances and k cluster centroids is at most \({\mathscr {O}}(kmn)\).

3. Penalty Calculation: The penalty calculation between a data point and a cluster centroid involves at most m summations. Hence, penalty calculation over all possible pairings is at most \({\mathscr {O}}(kmn)\).

4. Cluster Assignment: The assignment of n data points to k clusters consists of the comparisons of the dissimilarities of each point with k clusters, which is \({\mathscr {O}}(nk)\).

Therefore, if the algorithm runs for T iterations, the total computational complexity is \({\mathscr {O}}(kmnT)\) which is the same as that of the standard k-means algorithm.
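
A direct implementation of the distance, penalty, and assignment steps makes the per-iteration \({\mathscr {O}}(kmn)\) cost visible. The sketch below is illustrative only: the callable penalty(x, z) stands in for the FWPD penalty term defined earlier in the paper (not reproduced here), d_max for the maximum pairwise observed distance, and the \((1-\alpha )/d_{max}\) weighting follows the factor appearing in the proof of Lemma 5; feasibility handling w.r.t. constraint (10d) and the adjustment of Sect. 3.4 are omitted.

```python
import numpy as np

def assign_clusters(X, Z, penalty, d_max, alpha):
    """Distance + penalty calculation and hard cluster assignment for one
    iteration; X is n x m with np.nan for missing entries, Z is k x m."""
    n, k = X.shape[0], Z.shape[0]
    delta = np.empty((n, k))
    for i in range(n):                                   # n points
        for j in range(k):                               # k centroids
            common = ~np.isnan(X[i]) & ~np.isnan(Z[j])   # O(m) work per pair
            d_obs = np.sqrt(np.sum((X[i, common] - Z[j, common]) ** 2))
            delta[i, j] = (1 - alpha) * d_obs / d_max + alpha * penalty(X[i], Z[j])
    labels = delta.argmin(axis=1)                        # O(nk) comparisons
    U = np.zeros((n, k), dtype=int)
    U[np.arange(n), labels] = 1                          # hard memberships
    return U
```

Together with the centroid update sketched after Lemma 5, one pass over these routines costs \({\mathscr {O}}(kmn)\), matching the count above.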

4 Hierarchical agglomerative clustering for datasets with missing features using the proposed FWPD

In this section, we present HAC methods based on the proposed FWPD measure, which can be directly applied to data with missing features. The important notations used in this section (and beyond) are summarized in Table 3.

Table 3 Some important notations used in Sect. 4 and beyond

Hierarchical agglomerative schemes for data clustering seek to build a multi-level hierarchy of clusters, starting with each data point as a single cluster, by combining two (or more) of the most proximal clusters at one level to obtain a lower number of clusters at the next (higher) level. A survey of HAC methods can be found in Murtagh and Contreras (2012). However, these methods cannot be directly applied to datasets with missing features. Therefore, in this section, we develop variants of HAC methods based on the proposed FWPD measure. Various proximity measures may be used to merge the clusters in an agglomerative clustering method. Modifications of the three most popular of such proximity measures [Single Linkage (SL), Complete Linkage (CL) and Average Linkage (AL)], with FWPD as the underlying dissimilarity measure, are as follows:

1. Single Linkage with FWPD (SL-FWPD): The SL between two clusters \(C_i\) and \(C_j\) is \(\min \{\delta ({\mathbf {x}}_i,{\mathbf {x}}_j):{\mathbf {x}}_i \in C_i, {\mathbf {x}}_j \in C_j\}\).

2. Complete Linkage with FWPD (CL-FWPD): The CL between two clusters \(C_i\) and \(C_j\) is \(\max \{\delta ({\mathbf {x}}_i,{\mathbf {x}}_j):{\mathbf {x}}_i \in C_i, {\mathbf {x}}_j \in C_j\}\).

3. Average Linkage with FWPD (AL-FWPD): The AL between two clusters \(C_i\) and \(C_j\) is \(\frac{1}{|C_i| \times |C_j|} \underset{{\mathbf {x}}_i \in C_i}{\sum } \underset{{\mathbf {x}}_j \in C_j}{\sum } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\), where \(|C_i|\) and \(|C_j|\) are respectively the number of instances in \(C_i\) and \(C_j\).

4.1 The HAC-FWPD algorithm

To achieve hierarchical clusterings in the presence of unstructured missingness, the HAC method based on SL-FWPD, CL-FWPD, or AL-FWPD (referred to as the HAC-FWPD algorithm hereafter) is as follows:

1. Set \(B^0 = X\). Compute pairwise dissimilarities \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\), \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\) and construct the dissimilarity matrix \(Q^0\) so that \(q^0(i,j) = \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\). Set \(t=0\).

2. Search \(Q^t\) to identify the set \(M = \{(i_1,j_1), (i_2,j_2), \cdots , (i_k,j_k)\}\) containing all the pairs of indexes such that \(q^t(i_r,j_r) = q^t_{min}\) \(\forall \) \(r \in \{1, 2, \cdots , k\}\), \(q^t_{min}\) being the smallest non-zero element in \(Q^t\).

3. Merge the elements corresponding to any one pair in M, say \(\beta ^t_{i_r}\) and \(\beta ^t_{j_r}\) corresponding to the pair \((i_r,j_r)\), into a single group \(G = \{\beta ^t_{i_r}, \beta ^t_{j_r}\}\). Construct \(B^{t+1}\) by removing \(\beta ^t_{i_r}\) and \(\beta ^t_{j_r}\) from \(B^t\) and inserting G.

4. Define \(Q^{t+1}\) on \(B^{t+1} \times B^{t+1}\) as \(q^{t+1}(i,j) = q^t(i,j)\) \(\forall \) \(i, j \text { such that } \beta ^t_i, \beta ^t_j \ne G\) and \(q^{t+1}(i,i_G) = q^{t+1}(i_G,i) = L(G,\beta ^t_i)\), where \(i_G\) denotes the location of G in \(B^{t+1}\) and

$$\begin{aligned} L(G,\beta ) = \left\{ \begin{array}{ll} \underset{{\mathbf {x}}_i \in G,{\mathbf {x}}_j \in \beta }{\min } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ SL-FWPD },\\ \underset{{\mathbf {x}}_i \in G,{\mathbf {x}}_j \in \beta }{\max } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ CL-FWPD },\\ \frac{1}{|G| \times |\beta |} \underset{{\mathbf {x}}_i \in G}{\sum } \underset{{\mathbf {x}}_j \in \beta }{\sum } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ AL-FWPD }.\\ \end{array}\right. \end{aligned}$$

Set \(t=t+1\).

5. Repeat Steps 2-4 until \(B^t\) contains a single element.

FWPD being the underlying dissimilarity measure (instead of other metrics such as the Euclidean distance), the HAC-FWPD algorithm can be directly applied to obtain SL, CL, or AL based hierarchical clustering of datasets with missing feature values.
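
Because the SL, CL and AL update rules in Step 4 coincide with the standard single-, complete- and average-linkage rules, the merge schedule can be reproduced by feeding a precomputed FWPD dissimilarity matrix to an off-the-shelf agglomerative routine. The sketch below uses SciPy; fwpd(x_i, x_j) is an assumed callable implementing the pairwise FWPD defined earlier in the paper, and the use of SciPy here is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def hac_fwpd(X, fwpd, method="average", n_clusters=2):
    """HAC on the FWPD dissimilarity matrix Q^0; method 'single', 'complete'
    or 'average' corresponds to SL-, CL- or AL-FWPD respectively."""
    n = len(X)
    Q = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            Q[i, j] = Q[j, i] = fwpd(X[i], X[j])       # pairwise FWPD values
    merges = linkage(squareform(Q, checks=False), method=method)
    return fcluster(merges, t=n_clusters, criterion="maxclust")
```

Tie-breaking among equally proximal pairs (Step 3) may differ between implementations.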

4.2 Time complexity of the HAC-FWPD algorithm

Irrespective of whether SL-FWPD, CL-FWPD or AL-FWPD is used as the proximity measure, the HAC-FWPD algorithm consists of the following three basic steps:

1. Distance Calculation: As each distance calculation involves at most m features, the calculation of all pairwise observed distances among n data instances is at most \({\mathscr {O}}(n^{2}m)\).

2. Penalty Calculation: The penalty calculation between a pair of data instances involves at most m summations. Hence, penalty calculation over all possible pairings is at most \({\mathscr {O}}(n^{2}m)\).

3. Cluster Merging: The merging of two clusters takes place in each of the \(n-1\) steps of the algorithm, and each merge has a time complexity of at most \({\mathscr {O}}(n^2)\).

Therefore, the total computational complexity of the HAC-FWPD algorithm is \({\mathscr {O}}(n^{2}m)\) which is the same as that of the standard HAC algorithm based on SL, CL or AL.

5 Experimental results

In this section, we report the results of several experiments carried out to validate the merit of the proposed k-means-FWPD and HAC-FWPD clustering algorithms. In the following subsections, we describe the experimental setup used to validate the proposed techniques. The results of the experiments for the k-means-FWPD algorithm and the HAC-FWPD algorithm are respectively presented thereafter.

5.1 Experiment setup

Adjusted Rand Index (ARI) (Hubert and Arabie 1985) is a popular validity index used to judge the merit of clustering algorithms. When the true class labels are known, ARI provides a measure of the similarity between the cluster partition obtained by a clustering technique and the true class labels. Therefore, a high value of ARI is taken to indicate a better clustering. However, the class labels may not always be in keeping with the natural cluster structure of the dataset. In such cases, good clusterings are likely to achieve lower values of this index compared to possibly erroneous partitions (which are more akin to the class labels). The purpose of our experiments, though, is to find out how close the clusterings obtained by the proposed methods (and the contending techniques) are to the clusterings obtained by the standard algorithms (the k-means algorithm and the HAC algorithm); the proposed methods (and their contenders) are run on the datasets with missingness, while the standard methods are run on the corresponding fully observed datasets. Hence, the clusterings obtained by the standard algorithms are used as the ground-truths against which the ARI values are calculated for the proposed methods (and their contenders). ZI, MI, kNNI (with \(k \in \{3,5,10,20\}\)) and SVDI (using the most significant 10% of the eigenvectors) are used for comparison with the proposed methods. The variant of MI used in these experiments differs from the traditional technique in that we impute with the average of the class-wise averages, instead of the overall average. This is done to minimize the effect of severe class imbalances that may exist in the datasets. We also conduct Wilcoxon's signed rank test (Wilcoxon 1945) to evaluate the statistical significance of the observed results.
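
The evaluation protocol described above can be sketched as follows; adjusted_rand_score and wilcoxon are the standard scikit-learn and SciPy routines, while the run lists and the variables in the commented usage are illustrative names, not part of the original experimental code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from scipy.stats import wilcoxon

def mean_ari(reference_runs, incomplete_runs):
    """Average ARI of a method's partitions against the reference partitions
    obtained on the fully observed data (50 runs for k-means, 20 for HAC)."""
    scores = [adjusted_rand_score(ref, got)
              for ref, got in zip(reference_runs, incomplete_runs)]
    return float(np.mean(scores)), float(np.std(scores))

# Signed rank test between the per-dataset ARIs of the proposed method and a
# contender; a small p-value rejects H_0 (identical distributions, same median).
# stat, p_value = wilcoxon(ari_proposed, ari_contender)
```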

The performance of k-means depends on the initial cluster assignment. Therefore, to ensure fairness, we use the same set of random initial cluster assignments for the standard k-means algorithm on the fully observed dataset as well as for the proposed k-means-FWPD method (and its contenders). The maximum number of iterations of the k-means variants is set as \(MaxIter = 500\). Results are recorded in terms of average ARI values over 50 different runs on each dataset. The number of clusters is assumed to be the same as the number of classes.

For the HAC experiments, results are reported as average ARI values obtained over 20 independent runs. AL is chosen over SL and CL as it is observed to generally achieve higher ARI values. The number of clusters is again assumed to be the same as the number of classes.

Table 4 Details of the 20 real-world datasets

5.1.1 Datasets

We take 20 real-world datasets from the University of California at Irvine (UCI) repository (Dheeru and Karra Taniskidou 2017) and the Jin Genomics Dataset (JGD) repository (Jin 2017). Each feature of each dataset is normalized so as to have zero mean and unit standard deviation. The details of these 20 datasets are listed in Table 4.
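
The feature-wise normalization is plain z-scoring; a minimal sketch is given below (constant features, which would give a zero denominator, are assumed absent):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```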

5.1.2 Simulating missingness mechanisms

Experiments are conducted by removing features from each of the datasets according to the four missingness mechanisms, namely MCAR, MAR, MNAR-I and MNAR-II (Datta et al. 2016b). The detailed algorithm for simulating the four missingness mechanisms is as follows:

1. Specify the number of entries MissCount to be removed from the dataset. Select the missingness mechanism as one out of MCAR, MAR, MNAR-I or MNAR-II.

2. If the mechanism is MCAR, go directly to Step 5. If the mechanism is MAR or MNAR-II, select a random subset \(\gamma _{miss} \subset S\) containing half of the features in S (i.e. \(|\gamma _{miss}| = \frac{m}{2}\) if |S| is even or \(\frac{m+1}{2}\) if |S| is odd). If the mechanism is MNAR-I, set \(\gamma _{miss} = S\). Identify \(\gamma _{obs} = S \backslash \gamma _{miss}\).

3. If the mechanism is MAR or MNAR-II, for each feature \(l \in \gamma _{miss}\), randomly select a feature \(l_c \in \gamma _{obs}\) on which the missingness of feature l may depend.

4. For each feature \(l \in \gamma _{miss}\), randomly choose a type of missingness \(MissType_l\) as one out of CENTRAL, INTERMEDIATE or EXTREMAL.

5. Randomly select a non-missing entry \(x_{i,l}\) from the data matrix. If the mechanism is MCAR, mark the entry as missing, decrement \(MissCount = MissCount - 1\), and go to Step 11.

6. If the mechanism is MAR, set \(\lambda = x_{i,l_c}\), \(\mu = \mu _{l_c}\) and \(\sigma = \sigma _{l_c}\), where \(\mu _{l_c}\) and \(\sigma _{l_c}\) are the mean and standard deviation of the \(l_c\)-th feature over the dataset. If the mechanism is MNAR-I, set \(\lambda = x_{i,l}\), \(\mu = \mu _{l}\) and \(\sigma = \sigma _{l}\). If the mechanism is MNAR-II, randomly set either \(\lambda = x_{i,l}\), \(\mu = \mu _{l}\) and \(\sigma = \sigma _{l}\) or \(\lambda = x_{i,l_c}\), \(\mu = \mu _{l_c}\) and \(\sigma = \sigma _{l_c}\).

7. Calculate \(z = \frac{|\lambda - \mu |}{\sigma }\).

8. If \(MissType_l = \text {CENTRAL}\), set \(\mu _z = 0\). If \(MissType_l = \text {INTERMEDIATE}\), set \(\mu _z = 1\). If \(MissType_l = \text {EXTREMAL}\), set \(\mu _z = 2\). Set \(\sigma _z = 0.35\).

9. Calculate \(pval = \frac{1}{\sqrt{2 \pi \sigma _z}} \exp \left( - \frac{(z - \mu _z)^2}{2 \sigma _z^2}\right) \).

10. Randomly generate a value qval in the interval [0, 1]. If \(pval \ge qval\), then mark the entry \(x_{i,l}\) as missing and decrement \(MissCount = MissCount - 1\).

11. If \(MissCount > 0\), then go to Step 5.

In the above algorithm, the dependence of the missingness on feature values for MAR, MNAR-I and MNAR-II is achieved by removing entries based on the values of control features for their corresponding data points. The control feature may be the feature itself (for MNAR-I and MNAR-II) or may be another feature for the same data point (as in the case of MAR and MNAR-II). The dependence can be of three types, namely CENTRAL, INTERMEDIATE or EXTREMAL. CENTRAL dependence removes a feature when its corresponding control feature has a value close to the mean of the control feature over the dataset. EXTREMAL dependence removes a feature when the value of its control feature lies near the extremes. INTERMEDIATE dependence removes a feature when the value of the control lies between the mean and the extremes.

For our experiments, we set \(MissCount = \frac{nm}{4}\), thus removing 25% of the feature values from each dataset. Hence, on average, \(\frac{m}{4}\) of the features are missing from each data instance.
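
A compact Python rendering of the simulation procedure above is given below. It is a sketch under stated assumptions: entries eligible for removal under MAR and MNAR-II are drawn only from \(\gamma _{miss}\) (since control features and missingness types are defined only for those features), feature means and standard deviations are computed once on the fully observed data, and np.nan marks removed entries.

```python
import numpy as np

def inject_missingness(X, mechanism="MCAR", frac=0.25, seed=None):
    """Remove a fraction `frac` of the entries of X following the MCAR, MAR,
    MNAR-I or MNAR-II procedure described above; returns a copy of X with
    np.nan marking the removed entries."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n, m = X.shape
    miss_count = int(round(frac * n * m))
    mu, sigma = X.mean(axis=0), X.std(axis=0)

    if mechanism in ("MAR", "MNAR-II"):
        gamma_miss = rng.choice(m, size=(m + 1) // 2, replace=False)
        gamma_obs = np.setdiff1d(np.arange(m), gamma_miss)
        control = {int(l): int(rng.choice(gamma_obs)) for l in gamma_miss}
    else:                                   # MCAR and MNAR-I use all features
        gamma_miss, control = np.arange(m), {}
    removable = set(int(l) for l in gamma_miss)

    # CENTRAL / INTERMEDIATE / EXTREMAL dependence, drawn per removable feature
    mu_z = {l: float(rng.choice([0.0, 1.0, 2.0])) for l in removable}
    sigma_z = 0.35

    while miss_count > 0:
        i, l = int(rng.integers(n)), int(rng.integers(m))
        if np.isnan(X[i, l]):
            continue
        if mechanism == "MCAR":
            X[i, l] = np.nan
            miss_count -= 1
            continue
        if l not in removable:
            continue
        if mechanism == "MAR":
            lc = control[l]                 # depends on another, observed feature
        elif mechanism == "MNAR-I":
            lc = l                          # depends on the feature itself
        else:                               # MNAR-II: the feature itself or its control
            lc = l if rng.random() < 0.5 else control[l]
        lam = X[i, lc]
        if np.isnan(lam):
            continue
        z = abs(lam - mu[lc]) / sigma[lc]
        pval = np.exp(-(z - mu_z[l]) ** 2 / (2 * sigma_z ** 2)) / np.sqrt(2 * np.pi * sigma_z)
        if pval >= rng.random():            # Steps 9-10: probabilistic removal
            X[i, l] = np.nan
            miss_count -= 1
    return X
```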

5.1.3 Selecting the parameter \(\alpha \)

In order to conduct experiments using the FWPD measure, we need to select a value of the parameter \(\alpha \). Proper selection of \(\alpha \) may help to boost the performance of the proposed k-means-FWPD and HAC-FWPD algorithms. Therefore, in this subsection, we study the effects of \(\alpha \) on the performance of FWPD. Experiments are conducted using \(\alpha \in \{0.1, 0.25, 0.5, 0.75, 0.9\}\) on the datasets listed in Table 4, using the experimental setup detailed above. The summary of the results of this study is shown in Table 5 in terms of average ARI values.

A choice of \(\alpha = 0.25\) performs best overall as well as individually, except for MAR missingness (where \(\alpha =0.1\) proves to be a better choice). This seems to indicate some correlation between the extent of missingness and the optimal value of \(\alpha \) (25% of the feature values are missing in our experiments, as mentioned in Sect. 5.1.2). However, the correlation is rather weak for k-means-FWPD, where all values of \(\alpha \) seem to have competitive performance. On the other hand, the correlation is seen to be much stronger for HAC-FWPD. This indicates that the optimal \(\alpha \) varies considerably with the pattern of missingness for k-means-FWPD but not as much for HAC-FWPD. Another interesting observation is that the performance of HAC-FWPD deteriorates considerably for high values of \(\alpha \), implying that the distance term in FWPD must be given greater importance for HAC methods. As \(\alpha =0.25\) has the best performance overall, we report the detailed experimental results in the subsequent sections for this choice of \(\alpha \).

Table 5 Summary of results for different choices of \(\alpha \) in terms of average ARI values
Table 6 Means and standard deviations of ARI values for the k-means-FWPD algorithm against MCAR
Table 7 Means and standard deviations of ARI values for the k-means-FWPD algorithm against MAR
Table 8 Means and standard deviations of ARI values for the k-means-FWPD algorithm against MNAR-I
Table 9 Means and standard deviations of ARI values for the k-means-FWPD algorithm against MNAR-II

5.2 Experiments with the k-means-FWPD algorithm

We compare the proposed k-means-FWPD algorithm to the standard k-means algorithm run on the datasets obtained after performing ZI, MI, SVDI and kNNI. All runs of k-means-FWPD were found to converge within the stipulated budget of \(MaxIter = 500\). The results of the experiments are listed in terms of the means and standard deviations of the obtained ARI values in Tables 6, 7, 8 and 9. Only the best results for kNNI are reported, along with the corresponding best k values. The statistical significance of the listed results is summarized at the bottom of each table in terms of average ranks as well as signed rank test hypotheses and p values (\(H_0\) signifying that the ARI values achieved by the proposed method and the contending method originate from identical distributions having the same medians; \(H_1\) implying that the ARI values achieved by the proposed method and the contender originate from different distributions).

We know from Theorem 3 that the maximum number of feasibility adjustments that can occur during a single run of k-means-FWPD is \(n(k-1)\). This raises the question of whether one should choose \(MaxIter \ge n(k-1)\). However, k-means-FWPD was observed to converge within the stipulated \(MaxIter = 500\) iterations even for datasets like Isolet 5, Pendigits, Sensorless, etc. which have relatively large values of \(n(k-1)\). This indicates that the number of feasibility adjustments occurring during a run is much lower in practice. Therefore, we conclude that it is not necessary to set \(MaxIter \ge n(k-1)\) for practical problems.

It is seen from Tables 6, 7, 8 and 9 that the k-means-FWPD algorithm performs best, as indicated by the consistently lowest average ranks for all types of missingness. The proposed method performs best on the majority of datasets for all kinds of missingness. kNNI is overall seen to be the second best performer (being statistically comparable to k-means-FWPD in the case of MAR). It is also interesting to observe that the performance of MI improves in the case of MAR and MNAR-II, indicating that MI tends to be useful for partitional clustering when the missingness depends on the observed features. Moreover, SVDI is generally observed to perform poorly irrespective of the type of missingness, implying that the linear model assumed by SVDI is unable to preserve the convexity of the clusters (which is essential for good performance in the case of partitional clustering).

5.3 Experiments with the HAC-FWPD algorithm

The experimental setup described in Sect. 5.1 is also used to compare the HAC-FWPD algorithm (with AL-FWPD as the proximity measure) to the standard HAC algorithm (with AL as the proximity measure) in conjunction with ZI, MI, SVDI and kNNI. Results are reported as means and standard deviations of the ARI values obtained over the 20 independent runs. AL is preferred here over SL and CL as it is observed to generally achieve higher ARI values. The results of the experiments are listed in Tables 10, 11, 12 and 13. The statistical significance of the listed results is also summarized at the bottom of the respective tables in terms of average ranks as well as signed rank test hypotheses and p values (\(H_0\) signifying that the ARI values achieved by the proposed method and the contending method originate from identical distributions having the same medians; \(H_1\) implying that the ARI values achieved by the proposed method and the contender originate from different distributions).

It is seen from Tables 10, 11, 12 and 13 that the HAC-FWPD algorithm performs best on all types of missingness, as is evident from the consistently lowest average ranks. The proposed method performs best on the majority of datasets for all types of missingness. Moreover, the performance of HAC-FWPD is observed to be significantly better than that of kNNI, which performs poorly overall, indicating that kNNI may not be useful for hierarchical clustering applications with missingness. Interestingly, in the case of MAR and MNAR-II, both of which are characterized by the dependence of the missingness on the observed features, ZI, MI as well as SVDI show improved performance. This indicates that the dependence of the missingness on the observed features aids these imputation methods in the case of hierarchical clustering. Another intriguing observation is that all the contending HAC methods consistently achieve the best possible performance on the high-dimensional datasets Lung and Prostate. This may indicate that, while the convexity of the cluster structures may be harmed by missingness, the local proximity among the points is preserved owing to the sparse nature of such high-dimensional datasets.

Table 10 Means and standard deviations of ARI values for the HAC-FWPD algorithm against MCAR
Table 11 Means and standard deviations of ARI values for the HAC-FWPD algorithm against MAR
Table 12 Means and standard deviations of ARI values for the HAC-FWPD algorithm against MNAR-I
Table 13 Means and standard deviations of ARI values for the HAC-FWPD algorithm against MNAR-II

6 Conclusions

In this paper, we propose to use the FWPD measure as a viable alternative to imputation and marginalization approaches for handling the problem of missing features in data clustering. The proposed measure attempts to estimate the original distances between the data points by adding a penalty term to those pairwise distances which cannot be calculated over the entire feature space due to missing features. Therefore, unlike existing methods for handling missing features, FWPD is also able to distinguish between distinct data points which look identical due to missing features. Yet, FWPD ensures that the dissimilarity of any data instance from itself is never greater than its dissimilarity from any other point in the dataset. Intuitively, these advantages of FWPD should help us better model the original data space, which may help in achieving better clustering performance on the incomplete data.

Therefore, we use the proposed FWPD measure to put forth the k-means-FWPD and the HAC-FWPD clustering algorithms, which are directly applicable to datasets with missing features. We conduct extensive experimentation on the new techniques using various benchmark datasets and find the new approach to produce generally better results (for both partitional as well as hierarchical clustering) compared to some of the popular imputation methods which are generally used to handle the missing feature problem. In fact, it is observed from the experiments that the performance of the imputation schemes varies with the type of missingness and/or the clustering algorithm being used (for example, kNNI is useful for k-means clustering but not for HAC clustering; SVDI is useful for HAC clustering but not for k-means clustering; MI is effective when the missingness depends on the observed features). The proposed approach, on the other hand, exhibits good performance across all types of missingness as well as both partitional and hierarchical clustering paradigms. The experimental results attest to the ability of FWPD to better model the original data space, compared to existing methods.

However, it must be stressed that the performance of all these methods, including the FWPD based ones, can vary depending on the structure of the dataset concerned, the choice of the proximity measure used (for HAC), and the pattern and extent of missingness plaguing the data. Fortunately, the \(\alpha \) parameter embedded in FWPD can be varied in accordance with the extent of missingness to achieve the desired results. The results in Sect. 5.1.3 indicate that it may be useful to choose a high value of \(\alpha \) when a large fraction of the features are unobserved, and to choose a smaller value when only a few of the features are missing. However, in the presence of a sizable amount of missingness and the absence of ground-truths to validate the merit of the achieved clusterings, it is safest to choose a value of \(\alpha \) proportional to the percentage of missing features, restricted to the range [0.1, 0.25]. We also present an appendix dealing with an extension of the FWPD measure to problems with absent features and show that this modified form of FWPD is a semi-metric.

An obvious follow-up to this work is the application of the proposed PDM variant to practical clustering problems which are characterized by large fractions of unobserved data that arise in various fields such as economics, psychiatry, web-mining, etc. Studies can be undertaken to better understand the effects that the choice of \(\alpha \) has on the clustering results. Another rewarding topic of research is the investigation of the abilities of the FWPD variant for absent features (see “Appendix A”) by conducting proper experiments using benchmark applications characterized by this rare form of missingness (structural missingness).