
1 Introduction

Clustering analysis is a common technique in machine learning and data mining with a wide range of applications. Traditional clustering methods only handle “complete” data sets, in which no feature values are missing. In practice, however, missing data occur frequently for various reasons, such as unanswered questionnaire items, the high cost of acquiring some feature values and improper data collection procedures.

To address incomplete data clustering problems, a direct method is the two-step approach, which first estimates the missing feature values by imputation [16] and then applies a traditional clustering method for complete data. Besides imputation-based approaches, four strategies are proposed in [3] to tailor the FCM algorithm to incomplete data: the whole data strategy (WDS), the partial distance strategy (PDS), the optimal completion strategy (OCS) and the nearest prototype strategy (NPS). In addition to these strategies, principal component analysis (PCA) [9] and local PCA [4] have also been used to capture the structure of incomplete data.

The limitations of these direct and iterative imputation methods are two-fold. First, they require an accurate estimation of the missing feature values, which is difficult to obtain in practice. To address this issue, interval representations of the missing feature values have been developed in [6, 7, 12, 17]. Specifically, [6] defines a new distance function for interval data and extends the classical FCM to handle missing data. [7] also expresses missing values as interval data, but uses a genetic algorithm to search for proper imputations of the missing feature values. [12] estimates the interval data by an enhanced back-propagation neural network. [17] designs an improved interval construction approach using pre-classified cluster results and searches for the optimal clustering by particle swarm optimization. Second, the clustering results of existing methods for missing data are usually sensitive to the estimation of the missing feature values, especially for small-size and highly corrupted data sets [8].

The aim of this paper is to present a robust FCM (RFCM) clustering algorithm for incomplete data based on interval representations. To guarantee clustering performance, we introduce the concept of a robust cluster objective function, defined as the maximum of the traditional cluster objective function as the missing feature values vary within the considered intervals. Different from existing algorithms based on interval distance functions or optimal imputation [6, 7, 17], we formulate the clustering problem as a min-max optimization problem based on the idea of robust optimization, which has been successfully applied in operations research [18, 19] and machine learning, including the minimax probability machine [5, 10, 13], robust support vector machines [11, 15] and robust quadratic regression [14].

To solve the proposed min-max optimization problem, we design an efficient iterative solution method. We first give an equivalent reformulation of the robust optimization problem and analyze its relationship with the classical FCM. Then, we show how to update the cluster prototype and membership matrices separately by solving convex optimization problems. Since both subproblems are non-smooth, we propose a smoothing method to speed up the solution process. To tackle the constrained optimization problem arising in the membership update, we present an improved gradient projection method. Numerical experiments on UCI data sets validate the effectiveness of the proposed RFCM algorithm in comparison with other algorithms.

The remainder of this paper is organized as follows. Section 2 presents the RFCM model. In Sect. 3, we give an equivalent reformulation and the solution method. In Sect. 4, numerical experiments are presented and discussed. Finally, we conclude this research in Sect. 5.

2 RFCM Clustering Algorithm

Consider a set of n objects \(I=\{1, \cdots , n\}\), each described by m features \(\{x_{ij}: j \in J\}\), where \(J=\{1,\cdots , m\}\) and \(x_{ij}\) quantitatively describes the j-th feature of the i-th object. Let \(x_i=(x_{i1},\cdots , x_{im})^\mathrm {T}\) be the feature vector of the i-th object and \(X=(x_1,\cdots , x_{n})\) be the feature matrix or data set. The task of clustering is to assign these objects to K clusters.

In practice, due to various reasons, the data set X may contain missing components. A data set X is referred to as an incomplete data set if it contains at least one missing feature value, that is, there exist \(i\in I\) and \(j \in J\) such that \(x_{ij}=?\). For an incomplete data set X, we further partition the feature index set of the i-th object into two subsets: \(J_i^0=\{j \in J: x_{ij}=?\}\) and \(J_i^1=J\setminus J_i^0\).

Since it is usually difficult to obtain accurate estimates of the missing feature values, this paper represents them by intervals. Specifically, for any \(i\in I\), we use an interval \([x^-_{ij},x^+_{ij}]\) to represent the unknown missing feature value for \(j \in J_i^0\), and \(\bar{x}_{ij}\) to denote the known feature value for \(j\in J^1_i\). To simplify notation, in the following let \(\bar{x}_{ij}=\frac{x^-_{ij}+x^+_{ij}}{2}\) and \(\delta _{ij}=\frac{x^+_{ij}-x^-_{ij}}{2}\) for any \(j \in J_i^0\), and \(\delta _{ij}=0\) for any \(j \in J_i^1\). For details on how to construct these intervals for missing feature values, see [6, 17].
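The following Python sketch illustrates one possible way to construct such intervals for a data set whose missing entries are encoded as NaN: the interval for a missing \(x_{ij}\) is taken as the range of attribute j over the q nearest neighbours under a partial distance. This follows the spirit of the constructions in [6, 17] but is not their exact procedure; the function name build_intervals and the parameter q are illustrative.

```python
import numpy as np

def build_intervals(X, q=5):
    """Builds interval midpoints xbar and half-widths delta for an
    incomplete data set X whose missing entries are encoded as np.nan."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    obs = ~np.isnan(X)                              # observed-entry mask
    xbar = np.where(obs, X, 0.0)
    delta = np.zeros_like(X)
    for i in range(n):
        miss = np.where(~obs[i])[0]
        if miss.size == 0:
            continue
        # partial (mean squared) distance to the other objects over
        # the commonly observed features
        d = np.full(n, np.inf)
        for l in range(n):
            common = obs[i] & obs[l]
            if l != i and common.any():
                d[l] = np.mean((X[i, common] - X[l, common]) ** 2)
        nbrs = np.argsort(d)[:q]
        for j in miss:
            vals = X[nbrs, j]
            vals = vals[~np.isnan(vals)]
            if vals.size == 0:                      # fall back to the column range
                vals = X[obs[:, j], j]
            lo, hi = vals.min(), vals.max()
            xbar[i, j] = (lo + hi) / 2.0            # interval midpoint
            delta[i, j] = (hi - lo) / 2.0           # interval half-width
    return xbar, delta
```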

To reduce the impact of inaccurate estimation of the missing feature values, this paper considers the following robust cluster objective function:

$$\begin{aligned} \begin{aligned}&\min _{U,V}\ \max \left\{ \sum _{k=1}^K\sum _{i\in I}u_{ik}^p\Vert \bar{x}_i+y_i-v_k\Vert ^2: y_{i} \in [-\delta _{i},\delta _{i}], \forall i\in I\right\} \\&\quad \text {s.t.}\ \sum _{k=1}^K u_{ik}=1, \ u_{ik}\in [0,1],\ \forall \ i\in I, k=1,\cdots , K, \end{aligned} \end{aligned}$$
(1)

Here \(u_{ik}\) denotes the membership of the i-th object in the k-th cluster, \(v_k=(v_{k1},\cdots ,v_{km})^\mathrm {T}\) is the k-th cluster prototype, and \(p\ge 1\) is the fuzzification parameter. Problem (1) is a difficult nonlinear optimization problem and is hard to solve directly. However, by exploiting the structural properties of (1), we design an efficient algorithm in the next section.

3 Solution Method

3.1 Equivalent Reformulation of (1)

The following Proposition 1 simplifies the two-level min-max optimization problem (1) into a single-level minimization problem and provides the basis for designing effective solution methods for (1).

Proposition 1

Problem (1) is equivalent to the following problem:

$$\begin{aligned} \begin{aligned}&\displaystyle \min _{U,V} \text {} \displaystyle \sum _{i\in I}\sum _{j\in J} \left\{ \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})^2+2\delta _{ij}\left| \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right| +\sum _{k=1}^K u^p_{ik} \delta _{ij}^2\right\} \\&\text { } \text {s.t.} \text {} \displaystyle \sum _{k=1}^K u_{ik}=1, \ u_{ik}\in [0,1],\quad \ \ \forall \ i\in I, k=1,\cdots , K. \end{aligned} \end{aligned}$$
(2)

Proof

We simplify the robust cluster objective function for given U and V as follows:

$$\begin{aligned} \begin{array}{rl} J^R(U,V)=&{}\displaystyle \sum _{i\in I}\sum _{j\in J} \max _{y_{ij}\in [-\delta _{ij}, \delta _{ij}] }\ \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}+y_{ij}-v_{kj})^2\\ =&{}\displaystyle \sum _{i\in I}\sum _{j\in J} \max \left\{ \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}+\delta _{ij}-v_{kj})^2 , \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}-\delta _{ij}-v_{kj})^2\right\} \\ \end{array} \end{aligned}$$

where the last equality uses the fact that \(f(y_{ij})=(\bar{x}_{ij}+y_{ij}-v_{kj})^2\) is a convex function of \(y_{ij}\) and attains its maximum over \([-\delta _{ij}, \delta _{ij}] \) at an endpoint. For given \(i\in I\) and \(j \in J\), we have

$$\begin{aligned} \begin{array}{rl} &{}\displaystyle \max \left\{ \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}+\delta _{ij}-v_{kj})^2 , \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}-\delta _{ij}-v_{kj})^2\right\} \\ =&{}\displaystyle \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})^2+2\delta _{ij}\left| \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right| +\sum _{k=1}^K u^p_{ik} \delta _{ij}^2, \end{array} \end{aligned}$$

which follows from expanding the squares and using \(\max \{c+b,\,c-b\}=c+|b|\) with \(c=\sum _{k=1}^K u^p_{ik}\left( (\bar{x}_{ij}-v_{kj})^2+\delta _{ij}^2\right) \) and \(b=2\delta _{ij}\sum _{k=1}^K u^p_{ik}(\bar{x}_{ij}-v_{kj})\). This completes the proof.

Proposition 1 also shows that when \(\delta _{ij}=0\) for any \(i\in I\) and \(j \in J\), our RFCM reduces to the classical FCM.
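For concreteness, the reformulated objective, denoted \(h(U,V)\) below, can be evaluated directly from (2). The following Python sketch (assuming NumPy; the function name robust_objective is illustrative) computes it from a membership matrix U, a prototype matrix V, and the interval midpoints and half-widths; with all \(\delta _{ij}=0\) it returns the classical FCM objective.

```python
import numpy as np

def robust_objective(U, V, xbar, delta, p=2.0):
    """Evaluates h(U, V) from reformulation (2).

    U: (n, K) memberships, V: (K, m) prototypes, xbar: (n, m) interval
    midpoints, delta: (n, m) half-widths (0 for observed features)."""
    Up = U ** p                                   # u_{ik}^p
    r = xbar[:, None, :] - V[None, :, :]          # xbar_ij - v_kj, shape (n, K, m)
    quad = np.einsum('ik,ikj->ij', Up, r ** 2)    # sum_k u^p (xbar - v)^2
    lin = np.einsum('ik,ikj->ij', Up, r)          # sum_k u^p (xbar - v)
    const = Up.sum(axis=1, keepdims=True) * delta ** 2
    return float((quad + 2.0 * delta * np.abs(lin) + const).sum())
```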

Let \(h(U,V)\) be the objective function of (2). The following proposition analyzes the properties of \(h(U,V)\).

Proposition 2

For any given \(U \ge 0\), \(h(U,V)\) is convex in V, and for any given V, \(h(U,V)\) is convex in U.

Proof

First, for any \(U \ge 0\), it is easy to see that the first term of \(h(U,V)\) is convex in V. Note that the absolute value function is convex; thus, the second term of \(h(U,V)\) is also convex in V. Since the last term of \(h(U,V)\) is constant for a given value of U, it follows that \(h(U,V)\) is convex in V.

Second, from the proof of Proposition 1, we have that

$$\begin{aligned} h(U,V)= \sum _{i\in I}\sum _{j\in J} \max \left\{ \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}+\delta _{ij}-v_{kj})^2 , \sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}-\delta _{ij}-v_{kj})^2\right\} . \end{aligned}$$

Since \(f(x)=ax^p\) is convex in x when \(a\ge 0\) and \(p \ge 1\), both \(\sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}+\delta _{ij}-v_{kj})^2\) and \(\sum _{k=1}^K u_{ik}^p (\bar{x}_{ij}-\delta _{ij}-v_{kj})^2\) are convex in U for any given V. Since the pointwise maximum of two convex functions is still convex, the proof is complete.

Proposition 2 shows that although (2) is a complex non-convex minimization problem, once either U or V is fixed, we only need to solve convex minimization subproblems. The next two subsections focus on designing efficient solution methods for these two subproblems.

3.2 Optimizing V for a Fixed Value of U

For a fixed value of U, we need to solve the following convex piecewise quadratic problem:

$$\begin{aligned} \min _{V}\ \sum _{j\in J}\sum _{i\in I} \left\{ \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})^2+2\delta _{ij}\left| \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right| \right\} . \end{aligned}$$
(3)

Note that (3) is decomposable in index j. Let \(v_j=\{v_{1j},\cdots , v_{Kj}\}\). Therefore, it is sufficient to solve the following subproblem separately for each \(j\in J\):

$$\begin{aligned} \min _{v_j}\ \sum _{i\in I}\sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})^2+2\sum _{i\in I}\delta _{ij}\left| \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right| . \end{aligned}$$
(4)

Due to the non-differentiability of the absolute value function \(g(x)=|x|\), (4) is a non-smooth convex minimization problem. Although the sub-gradient method can be used to solve such non-smooth optimization problems, its search direction may oscillate around non-smooth points. To improve computational efficiency, we introduce the following smoothing upper bound estimation (SUBE) function of g:

$$\begin{aligned} {g}_\epsilon (x)=\left\{ \begin{array}{cl} x,&{}\text {if } x \ge \epsilon ,\\ \frac{x^2}{2\epsilon }+\frac{\epsilon }{2}, &{}\text {if } |x| < \epsilon ,\\ -x,&{}\text {if } x \le -\epsilon , \end{array} \right. \end{aligned}$$

where \(\epsilon >0\). It is easy to see that for any \(\epsilon >0\) , \({g}_\epsilon (x)\) is a smooth function and \(0 \le {g}_\epsilon (x)- g(x)\le \epsilon /2\), and

$$\begin{aligned} {g}_\epsilon '(x)=\left\{ \begin{array}{cl} 1,&{}\text {if } x \ge \epsilon ,\\ \frac{x}{\epsilon }, &{}\text {if } |x| < \epsilon ,\\ -1,&{}\text {if } x \le -\epsilon . \end{array} \right. \end{aligned}$$
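A small Python sketch of the SUBE function and its derivative (assuming NumPy; the names g_eps and g_eps_prime are illustrative) makes the construction concrete; for example, g_eps(0.0, 0.1) returns 0.05, the maximal gap \(\epsilon /2\) over \(|x|\).

```python
import numpy as np

def g_eps(x, eps):
    """SUBE smoothing of |x|: quadratic on (-eps, eps), equal to |x| outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < eps, x ** 2 / (2 * eps) + eps / 2, np.abs(x))

def g_eps_prime(x, eps):
    """Derivative of g_eps: x / eps on (-eps, eps), sign(x) outside."""
    x = np.asarray(x, dtype=float)
    return np.clip(x / eps, -1.0, 1.0)
```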

By introducing the SUBE function \({g}_\epsilon (x)\), instead of directly solving the non-smooth optimization problem (4), we only need to consider the following smooth optimization problem:

$$\begin{aligned} \displaystyle \min _{v_j}\ \phi (v_j)= \sum _{i\in I}\sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})^2+2\sum _{i\in I}\delta _{ij}{g}_\epsilon \left( \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right) \end{aligned}$$
(5)

The gradient of the objective function \(\phi \) is given as follows:

$$\begin{aligned} \frac{\partial \phi (v_j)}{\partial v_{kj}}=2 \sum _{i\in I} u^p_{ik} (v_{kj}-\bar{x}_{ij})-2\sum _{i\in I}\delta _{ij}\,u^p_{ik}\,{g}_\epsilon '\left( \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right) . \end{aligned}$$

Therefore, for a fixed value of U, we design the following smoothing gradient descent method to solve (5).

[Algorithm: Smoothing Gradient Descent Algorithm (SGDA) for subproblem (5)]

During the solution process, SGDA repeatedly decreases the approximation error \(\epsilon \). In SGDA, the step size can either be set as \(s_n=a/(b+n)\) with \(a,b >0\) or be selected by the Armijo rule [1].
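The following Python sketch (assuming NumPy and the step-size rule \(s_n=a/(b+n)\)) shows one possible realization of the SGDA update of \(v_j\) for subproblem (5); the fixed iteration budget and the parameter defaults are illustrative rather than the authors' settings.

```python
import numpy as np

def sgda_update_vj(xbar_j, delta_j, U, p=2.0, v_j=None,
                   eps=1e-2, steps=200, a=1.0, b=10.0):
    """Smoothing gradient descent for subproblem (5): updates the j-th
    prototype coordinates v_j = (v_1j, ..., v_Kj) for a fixed U."""
    Up = U ** p                                        # u_{ik}^p, shape (n, K)
    if v_j is None:                                    # weighted mean as a start point
        v_j = (Up * xbar_j[:, None]).sum(axis=0) / Up.sum(axis=0)
    v_j = np.asarray(v_j, dtype=float).copy()
    for it in range(steps):
        r = xbar_j[:, None] - v_j[None, :]             # xbar_ij - v_kj, shape (n, K)
        s = (Up * r).sum(axis=1)                       # sum_k u_ik^p (xbar_ij - v_kj)
        gp = np.clip(s / eps, -1.0, 1.0)               # g_eps'(s_i)
        grad = -2.0 * (Up * r).sum(axis=0) \
               - 2.0 * (Up * (delta_j * gp)[:, None]).sum(axis=0)
        v_j -= a / (b + it) * grad                     # diminishing step size
    return v_j
```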

3.3 Optimizing U for a Fixed Value of V

For a fixed value of V, we need to solve the following constrained convex minimization problem:

$$\begin{aligned} \begin{aligned}&\min _{U} \text {} \displaystyle \sum _{i\in I}\sum _{j\in J} \left\{ \sum _{k=1}^K u^p_{ik} \left( (\bar{x}_{ij}-v_{kj})^2+\delta _{ij}^2\right) +2\delta _{ij}\left| \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right| \right\} \\&\text { } \text {s.t.} \text { } \displaystyle \sum _{k=1}^K u_{ik}=1, \ u_{ik}\in [0,1],\quad \ \ \forall \ i\in I, k=1,\cdots , K. \end{aligned} \end{aligned}$$
(6)

Note that (6) is decomposable in index i. Let \(u_i=\{u_{i1},\cdots , u_{iK}\}\). Using the SUBE function as in the last subsection, we consider the following smooth subproblem separately for each \(i\in I\).

$$\begin{aligned} \begin{aligned}&\displaystyle \min _{u_i} \text {} \displaystyle \psi (u_i)= \sum _{j\in J} \left\{ \sum _{k=1}^K u^p_{ik}\left( (\bar{x}_{ij}-v_{kj})^2+\delta _{ij}^2\right) +2\delta _{ij}{g}_\epsilon \left( \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right) \right\} \\&\text { } \text {s.t.} \text { } \displaystyle \sum _{k=1}^K u_{ik}=1, \ u_{ik}\in [0,1],\quad \ \ \forall \ k=1,\cdots , K. \end{aligned} \end{aligned}$$
(7)

The gradient of the objective function \(\psi \) is given as follows:

$$\begin{aligned} \frac{\partial \psi (u_i)}{\partial u_{ik}}=p \sum _{j\in J} u^{p-1}_{ik}\left( (\bar{x}_{ij}-v_{kj})^2+\delta _{ij}^2\right) +2\sum _{j\in J}\delta _{ij}\, p\, u^{p-1}_{ik}(\bar{x}_{ij}-v_{kj})\,{g}_\epsilon '\left( \sum _{k=1}^K u^p_{ik} (\bar{x}_{ij}-v_{kj})\right) . \end{aligned}$$

To solve (7), we design the following smoothing gradient projection descent algorithm.

[Algorithm: Smoothing Gradient Projection Descent Algorithm (SGPDA) for subproblem (7)]

In SGPDA, the projection operator \(\text {Proj}_{\varDelta }\) denotes the projection onto the simplex \(\varDelta =\{x \in R^K: \sum _{k=1}^K x_k=1, x_k\ge 0, \ \forall k=1,\cdots , K\}\). [2] has shown that \(\text {Proj}_{\varDelta }\) can be computed in \(\mathcal {O}(K\log K)\) time.
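The following Python sketch (assuming NumPy) shows the two ingredients of SGPDA: the sort-based \(\mathcal {O}(K\log K)\) simplex projection of [2] and one possible gradient projection update of \(u_i\) for subproblem (7). The routine names and the parameter defaults are illustrative.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto {x in R^K : sum_k x_k = 1, x_k >= 0},
    computed by sorting in O(K log K) time [2]."""
    K = x.shape[0]
    u = np.sort(x)[::-1]                               # descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, K + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(x + theta, 0.0)

def sgpda_update_ui(xbar_i, delta_i, V, p=2.0, u_i=None,
                    eps=1e-2, steps=200, a=0.1, b=10.0):
    """Smoothing gradient projection descent for subproblem (7):
    updates the membership row u_i for a fixed prototype matrix V."""
    K, m = V.shape
    u_i = np.full(K, 1.0 / K) if u_i is None else np.asarray(u_i, float).copy()
    r = xbar_i[None, :] - V                            # xbar_ij - v_kj, shape (K, m)
    for it in range(steps):
        s = (u_i[:, None] ** p * r).sum(axis=0)        # sum_k u_ik^p (xbar_ij - v_kj)
        gp = np.clip(s / eps, -1.0, 1.0)               # g_eps'
        grad = (p * u_i[:, None] ** (p - 1)
                * (r ** 2 + delta_i[None, :] ** 2
                   + 2.0 * delta_i[None, :] * r * gp[None, :])).sum(axis=1)
        u_i = project_simplex(u_i - a / (b + it) * grad)  # step, then project
    return u_i
```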

Based on the solution methods for the subproblems (5) and (7), our RFCM algorithm can be summarized as follows.

[Algorithm: RFCM clustering algorithm, alternating SGDA updates of V and SGPDA updates of U]
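The outer loop of RFCM alternates the two updates until the membership matrix stabilizes. The sketch below is structural only: it relies on the illustrative sgda_update_vj and sgpda_update_ui routines from the previous subsections, and its random initialization and stopping rule are assumptions rather than the authors' exact choices.

```python
import numpy as np

def rfcm(xbar, delta, K, p=2.0, outer_iters=30, tol=1e-4, seed=0):
    """Alternating minimization for problem (2): update V column-wise by
    SGDA for a fixed U, then U row-wise by SGPDA for a fixed V."""
    rng = np.random.default_rng(seed)
    n, m = xbar.shape
    U = rng.random((n, K))
    U /= U.sum(axis=1, keepdims=True)               # feasible initial memberships
    V = xbar[rng.choice(n, size=K, replace=False)].astype(float)
    for _ in range(outer_iters):
        for j in range(m):                           # Sect. 3.2: update prototypes
            V[:, j] = sgda_update_vj(xbar[:, j], delta[:, j], U, p, V[:, j])
        U_new = np.vstack([sgpda_update_ui(xbar[i], delta[i], V, p, U[i])
                           for i in range(n)])       # Sect. 3.3: update memberships
        if np.abs(U_new - U).max() < tol:            # stop when U stabilizes
            return U_new, V
        U = U_new
    return U, V
```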

4 Numerical Experiments

4.1 Data Sets

We conduct experiments on two data sets from the UCI database: Wine and Seeds. Both data sets are originally complete. To generate incomplete data sets, we adopt the method used in [3, 6]. Specifically, we use the missing completely at random (MCAR) mechanism: we randomly select a specified percentage of components and designate them as missing, while making sure that the following constraints are satisfied (a sketch of the generation procedure is given after the list):

  1. Each object retains at least one feature;

  2. Each feature has at least one value present in the incomplete data set.
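A minimal Python sketch of this MCAR generation protocol is given below; the rejection loop used to enforce the two constraints is an illustrative implementation choice rather than the exact procedure of [3, 6].

```python
import numpy as np

def make_mcar(X, missing_rate, seed=None):
    """Masks a given fraction of entries completely at random, rejecting
    draws that leave an object or a feature fully unobserved."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    n_missing = int(round(missing_rate * n * m))
    while True:
        mask = np.zeros(n * m, dtype=bool)
        mask[rng.choice(n * m, size=n_missing, replace=False)] = True
        mask = mask.reshape(n, m)
        # every object and every feature must keep at least one observed value
        if (~mask).any(axis=1).all() and (~mask).any(axis=0).all():
            break
    X_incomplete = X.copy()
    X_incomplete[mask] = np.nan                     # '?' entries encoded as NaN
    return X_incomplete, mask
```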

4.2 Experimental Results and Discussion

We compare the proposed RFCM algorithm with the classical FCM using the WDS, PDS and NPS strategies on the Wine and Seeds data sets. The missing rates of these data sets vary from 0% to 20%. We report the number of misclassifications and the misclassification rate of each algorithm on the test data sets.

We generate 100 incomplete data instances for each of the Wine and Seeds data sets and report the averaged performance of the different algorithms in Tables 1 and 2, respectively. Results given by the proposed RFCM and by the classical FCM using the WDS, PDS and NPS strategies are labeled “RFCM”, “WDS”, “PDS” and “NPS”, respectively.

Table 1. Performance of different FCM algorithms on the incomplete Wine data set
Table 2. Performance of different FCM algorithms on the incomplete Seeds data set

From Tables 1 and 2, we have the following observations.

  1. For all the test data sets, the proposed RFCM algorithm provides the best clustering performance in terms of misclassification rate. For example, when the missing rate of the Wine data set is \(20\%\), the misclassification rate of RFCM is only \(12.58\%\), while the misclassification rates of WDS, PDS and NPS are \(14.28\%\), \(18.33\%\) and \(14.12\%\), respectively.

  2. When the missing rate is small, missing data have little adverse effect on the performance of the proposed RFCM and may even lead to slightly better performance.

  3. The cluster prototypes given by RFCM also have smaller prototype errors than those given by WDS, PDS and NPS.

5 Conclusion

This paper proposes a robust FCM algorithm for clustering data sets with missing feature values. The RFCM algorithm represents the missing feature values by intervals. Different from existing interval-based clustering algorithms, which only replace the traditional Euclidean distance with a modified distance function for interval data, the RFCM algorithm adopts the idea of robust optimization and aims to find an optimal clustering with the minimum worst-case cluster objective function value. Therefore, it guarantees the worst-case performance of the resulting clustering and minimizes the adverse effect of missing feature values. This paper formulates the robust clustering problem as a two-level min-max optimization problem and provides an equivalent reformulation. An efficient solution method is further designed based on smoothing and gradient projection techniques. Experiments on UCI data sets validate the effectiveness and robustness of the RFCM algorithm in comparison with existing clustering algorithms.