
8.1 Introduction

In Chap. 4, we formulated different optimization models of the clustering problem. Using these models, one can apply various heuristics or optimization methods to solve clustering problems. Algorithms considered in this chapter are based on the NSO model of these problems. More specifically, we consider incremental clustering algorithms where at each iteration the clustering and the auxiliary clustering problems are solved by applying either heuristics like the k-means or NSO algorithms [19, 22, 24, 26, 29, 33, 36, 170, 171].

We start with the description of the modified global k-means algorithm in Sect. 8.2. This algorithm is an improvement of the GKM. The main difference between these two algorithms is that the GKM uses only data points to find starting cluster centers whereas the modified global k-means algorithm solves the auxiliary clustering problem to compute them. In Sect. 8.3, we describe a further improvement of the modified global k-means algorithm called the fast modified global k-means algorithm. In addition, we discuss various procedures to reduce the computational complexity of the modified global k-means algorithm.

Then, we introduce three incremental clustering algorithms where the LMBM, the DGM, and the HSM are used to solve the clustering and the auxiliary clustering problems. More precisely, the limited memory bundle method for clustering is described in Sect. 8.4; the discrete gradient clustering algorithm is presented in Sect. 8.5; and the smooth incremental clustering algorithm is given in Sect. 8.6.

8.2 Modified Global k-Means Algorithm

In this section, we present the modified global k-means algorithm (MGKM) to solve the clustering problem (7.2) where the similarity measure d 2 is used [19, 21]. The algorithm is the modified version of the GKM, and it is based on the incremental approach. The flowchart of the MGKM is illustrated in Fig. 8.1.

Fig. 8.1 Modified global k-means algorithm (MGKM)

The MGKM starts with the computation of the centroid of the whole data set. Then a new cluster center is added at each iteration. More precisely, the auxiliary clustering problem (7.4) is solved to compute a starting point for the lth center. The new center together with the previous l − 1 cluster centers is taken as a starting point for solving the lth partition problem. The k-means Algorithm 5.1 is applied starting from this point to find the lth partition of the data set.

Let (x 1, …, x l−1), l ≥ 2, be a solution to the (l − 1)th clustering problem. Let p = 2. Recall the sets \(\bar {S}_1\) and \(\bar {S}_2\) defined in (7.7) and (7.8), respectively. Take any \(\mathbf {y} \in \bar {S}_2\) and consider the sets \(\bar {B}_{12}(\mathbf {y})\) and \(\bar {B}_3(\mathbf {y})\) given in (7.9). The algorithm for finding a starting point for the lth cluster center involves Algorithm 5.1 and proceeds as follows.

Algorithm 8.1 Finding a starting point

In Steps 1 and 2 of Algorithm 8.1, a starting point is found to minimize the auxiliary cluster function \(\bar {f}_l\), given in (7.5). This point is chosen among all data points that can attract at least one data point (see Step ??). For each such data point a, the set \(\bar {B}_3(\mathbf {a})\) and its center are computed. Then the function \(\bar {f}_l\) is evaluated at these centers, and the center that provides the lowest value of the function \(\bar {f}_l\) is selected as a starting point to minimize the function \(\bar {f}_l\).

In Step 4 of Algorithm 8.1, we apply Algorithm 5.1 to minimize the auxiliary cluster function \(\bar {f}_l\). In this case the first l − 1 cluster centers are fixed and only the lth cluster center is updated at each iteration.
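
To make the procedure concrete, the following minimal Python sketch implements the idea behind Algorithm 8.1 for the squared Euclidean distance d 2. It is an illustration only: the function names and the stopping rule of the k-means-type refinement are assumptions, not taken from the book.

```python
import numpy as np

def d2(x, a):
    """Squared Euclidean distance d_2(x, a)."""
    return np.sum((x - a) ** 2)

def aux_value(y, A, r):
    """Auxiliary cluster function (7.5): mean of min{r_a, d_2(y, a)} over the data set."""
    return np.mean([min(ra, d2(y, a)) for a, ra in zip(A, r)])

def find_starting_point(A, r, max_iter=100):
    """Sketch of Algorithm 8.1.  A is the (m x n) data matrix and r[i] is the squared
    distance of A[i] to its closest center in the (l-1)th partition."""
    # Steps 1-2: among data points that attract at least one data point, take the
    # centroid of the attracted set B3_bar(a) giving the smallest value of f_bar_l.
    best_val, best_center = np.inf, None
    for a in A:
        attracted = np.array([b for b, rb in zip(A, r) if rb > d2(a, b)])
        if attracted.size == 0:          # a attracts no data point: skip it
            continue
        c = attracted.mean(axis=0)       # centroid of B3_bar(a)
        val = aux_value(c, A, r)
        if val < best_val:
            best_val, best_center = val, c
    # Step 4: k-means-type iterations in which only the l-th center is updated,
    # the first l-1 centers being fixed through r.
    y = best_center
    for _ in range(max_iter):
        attracted = np.array([a for a, ra in zip(A, r) if d2(y, a) < ra])
        if attracted.size == 0:
            break
        y_new = attracted.mean(axis=0)
        if np.allclose(y_new, y):        # illustrative convergence test
            break
        y = y_new
    return y
```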

Remark 8.1

Algorithm 8.1 is a special case of Algorithm 7.2 when we select γ 1 = 0 and γ 2 = 1.

Proposition 8.1

Let \(\bar {\mathbf {y}}\) be a cluster center generated by Algorithm 8.1. Then this point is a local minimizer of the auxiliary cluster function \(\bar {f}_l\).

Proof

Recall the sets B i(y), i = 1, 2, 3 defined in (4.30). Since \(\bar {\mathbf {y}}\) is a cluster center we have \(B_2(\bar {\mathbf { y}})=\emptyset \). This is due to the fact that in the hard clustering problem, each data point belongs to only one cluster. Then the function (7.5) can be rewritten as

$$\displaystyle \begin{aligned}\bar{f}_l(\bar{\mathbf{y}}) = \frac{1}{m} \left( \sum_{\mathbf{a} \in B_1(\bar{\mathbf{y}})} r^{\mathbf{a}}_{l-1} + \sum_{\mathbf{a} \in B_3(\bar{\mathbf{y}})} d_2(\bar{\mathbf{y}},\mathbf{a}) \right). \end{aligned}$$

It is clear that \(\bar {\mathbf {y}}\) is a minimum point of the convex function

$$\displaystyle \begin{aligned}\varphi(\mathbf{x}) = \frac{1}{m} \sum_{\mathbf{a} \in B_3(\bar{\mathbf{y}})} d_2(\mathbf{x},\mathbf{a}), \end{aligned}$$

that is \(\varphi (\bar {\mathbf {y}}) \leq \varphi (\mathbf {x})\) for all \(\mathbf {x} \in \mathbb {R}^n\). There exists ε > 0 such that

$$\displaystyle \begin{aligned} &d_2(\mathbf{y},\mathbf{a}) > r^{\mathbf{a}}_{l-1} \quad \mbox{for all} \quad \mathbf{a} \in B_1(\bar{\mathbf{y}}) \quad \mbox{and} \quad \mbox{for all} \quad \mathbf{y} \in B(\bar{\mathbf{y}};\varepsilon), \quad \mbox{and}\\ &d_2(\mathbf{y},\mathbf{a}) < r^{\mathbf{a}}_{l-1} \quad \mbox{for all} \quad \mathbf{a} \in B_3(\bar{\mathbf{y}}) \quad \mbox{and} \quad \mbox{for all} \quad \mathbf{y} \in B(\bar{\mathbf{y}};\varepsilon). \end{aligned} $$

Then for any \(\mathbf {y} \in B(\bar {\mathbf {y}};\varepsilon )\) we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \bar{f}_l(\mathbf{y}) &\displaystyle =&\displaystyle \frac{1}{m} \left( \sum_{\mathbf{a} \in B_1(\bar{\mathbf{y}})} r^{\mathbf{a}}_{l-1} + \sum_{\mathbf{a} \in B_3(\bar{\mathbf{y}})} d_2(\mathbf{y},\mathbf{a})\right) \\ &\displaystyle =&\displaystyle \frac{1}{m} \sum_{\mathbf{a} \in B_1(\bar{\mathbf{y}})} r^{\mathbf{a}}_{l-1} + \varphi (\mathbf{y})\\ &\displaystyle \geq&\displaystyle \frac{1}{m} \sum_{\mathbf{a} \in B_1(\bar{\mathbf{y}})} r^{\mathbf{a}}_{l-1} + \varphi(\bar{\mathbf{y}}) \\ &\displaystyle =&\displaystyle \bar{f}_l(\bar{\mathbf{y}}). \end{array} \end{aligned} $$

Therefore, \(\bar {f}_l(\mathbf {y}) \geq \bar {f}_l(\bar {\mathbf {y}})\) for all \(\mathbf {y} \in B(\bar {\mathbf {y}};\varepsilon )\). This completes the proof. □

Next, we give the step by step form of the MGKM.

Algorithm 8.2 Modified global k-means algorithm (MGKM)

Algorithm 8.2 has two stopping criteria. The algorithm stops when either it solves all l-partition problems, l = 1, …, k, or the stopping criterion in Step 5 is satisfied. Note that \(f_l^* =\inf \{f_l(\mathbf {x}),~ \mathbf {x}\in \mathbb {R}^{nl}\} \geq 0\) for all l ≥ 1 and the sequence \(\{f_l^*\}\) is decreasing, that is,

$$\displaystyle \begin{aligned}f_{l+1}^* \leq f_l^* \quad \mbox{for all} \quad l \geq 1. \end{aligned}$$

This means that the stopping criterion in Step 5 will be satisfied after a finite number of iterations and therefore, Algorithm 8.2 computes as many clusters as the data set A contains with respect to the tolerance ε > 0. Note that the choice of this tolerance is crucial for Algorithm 8.2: large values of ε can result in the appearance of large clusters, whereas small values can lead to small and artificial clusters.

In Step 3 of Algorithm 8.2, a starting point for the lth cluster center is computed. This is done by applying Algorithm 8.1 and minimizing the auxiliary cluster function. This algorithm requires the calculation of the distance or affinity matrix of the data set A. For small and medium-sized data sets, this matrix can be computed before Algorithm 8.1 is applied. However, this is not possible for large data sets, as such a matrix is too big to be stored in the memory of a computer. This means that in the latter case, the affinity matrix must be recomputed at each iteration of the MGKM.
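
The overall incremental loop of the MGKM can be sketched as follows. This is an illustration only: `kmeans` stands in for Algorithm 5.1, `find_start` for Algorithm 8.1, and the relative-decrease test is an assumed stand-in for the actual stopping criterion of Step 5.

```python
import numpy as np

def cluster_value(centers, A):
    """Cluster function f_l: mean squared distance of each data point to its closest center."""
    dists = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def mgkm(A, k, eps=1e-2, kmeans=None, find_start=None):
    """Sketch of the MGKM loop (Algorithm 8.2); `kmeans` and `find_start` are
    assumed to be supplied by the caller."""
    centers = A.mean(axis=0, keepdims=True)      # centroid of the whole data set
    f_prev = f1 = cluster_value(centers, A)
    for l in range(2, k + 1):
        # squared distance of each point to its closest center in the (l-1)th partition
        r = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        y = find_start(A, r)                          # starting point for the l-th center
        centers = kmeans(np.vstack([centers, y]), A)  # solve the l-partition problem
        f_l = cluster_value(centers, A)
        # Illustrative stopping test: stop when the relative decrease of the
        # optimal value falls below the tolerance eps.
        if (f_prev - f_l) / f1 < eps:
            break
        f_prev = f_l
    return centers
```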

8.3 Fast Modified Global k-Means Algorithm

Algorithm 8.2 is time-consuming for large data sets as it requires the computation of the affinity matrix at each iteration. The fast modified global k-means algorithm (FMGKM) [29] is an improved version of Algorithm 8.2 that does not rely on the affinity matrix to compute starting points. Instead, the FMGKM uses weights within the auxiliary cluster function to generate starting points from different parts of the data set. This eliminates the need to compute or sort the whole affinity matrix and therefore reduces both the computational effort and the memory usage. The flowchart of the FMGKM is similar to that of the MGKM given in Fig. 8.1.

Next, we describe the FMGKM. Let

$$\displaystyle \begin{aligned}U = \{u_1,\ldots,u_s\}\end{aligned}$$

be a finite set of positive numbers. For u ∈ U, the auxiliary cluster function \(\bar {f}_l\), given in (7.5), is modified as follows:

$$\displaystyle \begin{aligned} \bar{f}_l^u(\mathbf{y}) = \frac{1}{m} \sum_{\mathbf{a} \in A} \min \Big\{r^{\mathbf{a}}_{l-1}, u ~d_2(\mathbf{y},\mathbf{a}) \Big\}. \end{aligned} $$
(8.1)

If u = 1, then \(\bar {f}_l^u(\mathbf {y}) = \bar {f}_l(\mathbf {y})\) for all \(\mathbf {y} \in \mathbb {R}^n\). Take u ∈ U and define the set

$$\displaystyle \begin{aligned}\tilde{S}_2^u = \big\{\mathbf{y} \in \mathbb{R}^n: r^{\mathbf{a}}_{l-1} > u ~d_2(\mathbf{y},\mathbf{a}) ~~\mbox{for some}~~\mathbf{a} \in A \big\}, \end{aligned}$$

and for any \(\mathbf {y} \in \tilde {S}_2^u\) consider the set

$$\displaystyle \begin{aligned} \tilde{B}_3^u(\mathbf{y}) = \big\{\mathbf{a} \in A: r^{\mathbf{a}}_{l-1} > u ~ d_2(\mathbf{y},\mathbf{a})\big\}. \end{aligned}$$

The set \(\tilde {S}_2^u\) is similar to the set \(\bar {S}_2\) defined in (7.8) and the set \(\tilde {B}_3^u(\mathbf {y})\) is similar to the set \(\bar {B}_3(\mathbf {y})\) described in (7.9). The set \(\tilde {B}_3^u(\mathbf {y})\) contains all data points attracted by the point \(\mathbf {y} \in \tilde {S}_2^u\) with a given weight u > 0.
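
A direct (unoptimized) evaluation of the weighted function (8.1) and of the sets \(\tilde{S}_2^u \cap A\) and \(\tilde{B}_3^u(\mathbf{y})\) might look as follows for d 2; the function names are illustrative.

```python
import numpy as np

def d2(x, a):
    return np.sum((x - a) ** 2)

def aux_value_u(y, A, r, u):
    """Weighted auxiliary cluster function (8.1)."""
    return np.mean([min(ra, u * d2(y, a)) for a, ra in zip(A, r)])

def attracted_set_u(y, A, r, u):
    """B3_u(y): data points attracted by y with weight u."""
    return np.array([a for a, ra in zip(A, r) if ra > u * d2(y, a)])

def candidate_points(A, r, u):
    """Data points of S2_u intersected with A, i.e. points attracting at least
    one data point with weight u."""
    return [a for a in A if attracted_set_u(a, A, r, u).size > 0]
```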

The following algorithm is a modified version of Algorithm 8.1 and computes a starting point for the lth cluster center.

Algorithm 8.3 Finding a starting point

In order to solve the auxiliary clustering problem (7.4) in Step 5 of Algorithm 8.3, we apply Algorithm 5.1. Here, only one cluster center is updated; the other cluster centers are known from previous iterations and remain fixed. Since Algorithm 5.1 is only able to find local solutions to this problem, more than one starting point is used to compute high quality solutions.

Starting points are computed using the function (8.1) with different values of u. If u is sufficiently small, then the starting point will be close to other cluster centers, most likely near the center of the largest cluster. If u = 1, then we get the same starting point as in the case of Algorithm 8.1. As the value of u increases, the starting points move toward more isolated data points. This allows starting points to be found in different parts of the data set.

The FMGKM is presented in step by step form as follows.

Algorithm 8.4 Fast modified global k-means algorithm (FMGKM)

The most time-consuming step in Algorithm 8.4 is Step 3, where Algorithm 8.3 is applied to minimize the auxiliary cluster function (8.1) for different u ∈ U and to find the starting point for the lth cluster center. In turn, Step 2 of Algorithm 8.3 is time-consuming, as in this step clusters are computed for each data point \(\mathbf {a} \in \tilde {S}_2^{u_t} \cap A\). This requires the partial computation of the affinity matrix. In addition, the centers of those clusters and the values of the function \(\bar {f}_l^u\) at these centers need to be computed. Since only one center is obtained for each data point, the complexity of the computation of the function \(\bar {f}_l^u\) is the same as the complexity of the computation of the affinity matrix.

In [29], two different approaches are introduced to reduce the computational complexity of Step 2 in Algorithm 8.3. Both approaches exploit the incremental nature of the algorithm. In these approaches, a matrix consisting of the distances between data points and cluster centers is used instead of the affinity matrix. Since the number of clusters is significantly smaller than the number of data points, the former matrix is much smaller than the latter. More precisely, in these approaches data points which are close to the cluster centers from the (l − 1)th partition are excluded. Therefore, these data points are removed from the list of points which can attract large clusters and also from the list of points which can be attracted by non-excluded data points.

Let x 1, …, x l−1, l ≥ 2 be known cluster centers. Assume v iq is the squared Euclidean distance between the data point a i, i = 1, …, m and the cluster center x q, q = 1, …, l − 1, that is

$$\displaystyle \begin{aligned}v_{iq} = d_2({\mathbf{a}}_i,{\mathbf{x}}_q). \end{aligned}$$

Let \({\mathbf {v}}_q = \big (v_{1q},\ldots ,v_{mq} \big ) \in \mathbb {R}^m,~q=1,\ldots ,l-1\). Consider a matrix V  of the size m × (l − 1), whose columns are vectors v q, q = 1, …, l − 1:

$$\displaystyle \begin{aligned}V=[v_{iq}],\quad i=1,\ldots,m,~q=1,\ldots,l-1. \end{aligned}$$

Let also \(\mathbf {r} = (r^1_{l-1},\ldots ,r^m_{l-1})\) be a vector of m components where \(r^i_{l-1}\) is the squared Euclidean distance between the data point a i and its cluster center in the (l − 1)th partition (see (7.6)). Note that the matrix V  and the vector r are available after the (l − 1)th iteration of the incremental clustering algorithm.
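
The matrix V and the vector r are cheap to form once the (l − 1)th partition is known. A possible sketch for d 2, with illustrative names, is given below; it also records the index of the closest center, which is needed by the first reduction scheme.

```python
import numpy as np

def distance_data(A, centers):
    """V[i, q] = d_2(a_i, x_q), r[i] = min_q V[i, q], and q_of[i], the index of
    the closest center of a_i in the (l-1)th partition."""
    V = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # m x (l-1)
    q_of = V.argmin(axis=1)
    r = V[np.arange(len(A)), q_of]
    return V, r, q_of
```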

Now, we are ready to describe the following two approaches to reduce the computational complexity of Step 2 of Algorithm 8.3.

  1.

    Reduction of the number of pairwise distance computations (see the sketch after this list). Let a data point a j ∈ A be given and let x q(j) be its cluster center, where q(j) ∈{1, …, l − 1}. For a given u ∈ U and a data point a i, if

    $$\displaystyle \begin{aligned}\Big(1+\frac{1}{\sqrt u}\Big)^2 r^j_{l-1} \leq v_{iq(j)}, \end{aligned}$$

    then \({\mathbf {a}}_j \not \in \tilde {B}_3^u({\mathbf {a}}_i)\). Indeed, we have

    $$\displaystyle \begin{aligned}\|{\mathbf{a}}_i-{\mathbf{a}}_j\| \geq \|{\mathbf{a}}_i-{\mathbf{x}}_{q(j)}\| - \|{\mathbf{a}}_j-{\mathbf{x}}_{q(j)}\| \geq (1/\sqrt u)\|\mathbf{ a}_j-{\mathbf{x}}_{q(j)}\|. \end{aligned}$$

    Thus, we get \(u d_2({\mathbf {a}}_i , {\mathbf {a}}_j) \geq r_{l-1}^j\), and therefore \({\mathbf {a}}_j \not \in \tilde {B}_3^u({\mathbf {a}}_i)\). This condition allows us to reduce the number of pairwise distance computations. This reduction becomes substantial as the number of clusters increases. Define the set

    $$\displaystyle \begin{aligned}\tilde{R}^u({\mathbf{a}}_i) = \Big\{{\mathbf{a}}_j \in A: ~ \Big(1+\frac{1}{\sqrt u}\Big)^2 r^j_{l-1} > v_{iq(j)} \Big\}. \end{aligned}$$

    It is clear that

    $$\displaystyle \begin{aligned}\tilde{B}_3^u({\mathbf{a}}_i) \subseteq \tilde{R}^u({\mathbf{a}}_i). \end{aligned}$$

    The set \(\tilde {R}^u({\mathbf {a}}_i)\) can be used instead of the set A to compute the value of the function \(\bar {f}_l^u\) in Step 2 of Algorithm 8.3. In this case, one may not get the exact value of this function; however, it gives a good approximation to the exact value. Furthermore, take

    $$\displaystyle \begin{aligned}w \in \Big(1,\Big(1+\frac{1}{\sqrt u}\Big)^2\Big], \end{aligned}$$

    and consider the set

    $$\displaystyle \begin{aligned}\tilde{R}^u_w({\mathbf{a}}_i) = \Big\{{\mathbf{a}}_j \in A: ~ w r^j_{l-1} > d_2({\mathbf{a}}_i , {\mathbf{a}}_j) \Big\}. \end{aligned}$$

    Then A is replaced by \(\tilde {R}^u_w({\mathbf {a}}_i)\) for the computation of the function \(\bar {f}_l^u\). This further reduces the amount of computations in Step 2 of Algorithm 8.3.

  2.

    Reduction of the number of starting cluster centers (see the sketch after this list). This approach is similar to that considered in Algorithm 7.4. More specifically, data points which are very close to the previous cluster centers are not considered as starting points for minimizing the auxiliary cluster function (8.1). At the (l − 1)th iteration, a squared averaged radius

    $$\displaystyle \begin{aligned}\bar{d}_{av}^q = \frac{1}{|A^q|}\sum_{\mathbf{a} \in A^q} d_2({\mathbf{x}}_q,\mathbf{a}), \end{aligned}$$

    and a squared maximum radius

    $$\displaystyle \begin{aligned}\bar{d}_{\max}^q = \max_{\mathbf{a} \in A^q} d_2({\mathbf{x}}_q,\mathbf{a}) \end{aligned}$$

    of each cluster A q, q = 1, …, l − 1 are computed. Consider the numbers

    $$\displaystyle \begin{aligned}\alpha_q = \frac{\bar{d}_{\max}^q}{\bar{d}_{av}^q } \geq 1 \quad \mbox{and} \quad \beta_q = \varepsilon (\alpha_q - 1), \end{aligned}$$

    where ε > 0 is a sufficiently small number. Let

    $$\displaystyle \begin{aligned}\gamma_{ql} = 1+\beta_q (l-1),\quad q=1,\ldots,l-1. \end{aligned}$$

    It is clear that γ ql ≥ 1, q = 1, …, l − 1. Define the following subset of the cluster A q:

    $$\displaystyle \begin{aligned}\bar{A}^q = \big\{\mathbf{a} \in A^q: ~\gamma_{ql} \bar{d}_{av}^q \leq d_2({\mathbf{x}}_q,\mathbf{a})\big\}. \end{aligned}$$

    In other words, the set \(\bar {A}^q\) is obtained from the cluster A q by removing all points for which \(d_2({\mathbf {x}}_q,\mathbf {a}) < \gamma _{ql} \bar {d}_{av}^q\). Since in the incremental approach the clusters become more stable as l increases, the numbers γ ql also increase with l. Define the set

    $$\displaystyle \begin{aligned}\bar{A} = \bigcup_{q=1}^{l-1} \bar{A}^q, \end{aligned}$$

    and consider only data points \(\mathbf {a} \in \bar {A}\) as the candidates to be starting points for minimizing the auxiliary cluster function \(\bar {f}_l\).
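
The following sketch illustrates both reduction schemes for d 2, using the matrix V, the vector r, and the indices q(j) of the closest centers from the previous iteration. The function names, the default tolerance, and the handling of degenerate clusters are assumptions made for illustration.

```python
import numpy as np

def candidate_attracted(i, A, V, r, q_of, u, w=None):
    """Reduction 1: indices j of data points that may belong to B3_u(a_i).
    The test (1 + 1/sqrt(u))^2 * r_j <= V[i, q(j)] excludes a_j without computing
    d_2(a_i, a_j); the optional threshold w yields the smaller set R_u_w(a_i)."""
    factor = (1.0 + 1.0 / np.sqrt(u)) ** 2
    idx = np.where(factor * r > V[i, q_of])[0]     # the set R_u(a_i)
    if w is not None:                              # further restriction R_u_w(a_i)
        d = ((A[idx] - A[i]) ** 2).sum(axis=1)
        idx = idx[w * r[idx] > d]
    return idx

def reduced_candidates(A, labels, centers, l, eps=1e-2):
    """Reduction 2: the union of the sets A_bar^q, i.e. data points that are
    not too close to their own cluster center in the (l-1)th partition."""
    keep = np.zeros(len(A), dtype=bool)
    for q, xq in enumerate(centers):
        in_q = labels == q
        if not in_q.any():
            continue
        d = ((A[in_q] - xq) ** 2).sum(axis=1)      # d_2(x_q, a) for a in A^q
        d_av, d_max = d.mean(), d.max()
        if d_av == 0.0:                            # degenerate cluster: exclude it
            continue
        beta = eps * (d_max / d_av - 1.0)          # beta_q = eps * (alpha_q - 1)
        gamma = 1.0 + beta * (l - 1)               # gamma_ql
        keep[in_q] = d >= gamma * d_av
    return np.where(keep)[0]
```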

Summarizing, Steps 2 and 3 of Algorithm 8.3 can be rewritten as follows:

  2’:

    for each \(\mathbf {a} \in \tilde {S}_2^{u_t} \cap \bar {A}\) compute the set \(\tilde {B}_3^{u_t}(\mathbf {a})\), its center c, and the value \(\bar {f}^{u_t}_{l, \mathbf {a}} = \bar {f}^{u_t}_l(\mathbf {c})\) of the function \(\bar {f}^{u_t}_l\) at the point c over the set \(\tilde {R}^u_w(\mathbf {a})\).

  3’:

    compute

    $$\displaystyle \begin{aligned}\bar{f}^{u_t}_{l, \min} = \min_{\mathbf{a} \in \tilde{S}_2^{u_t} \cap \bar{A}} \bar{f}^{u_t}_{l, \mathbf{a}} \quad \mbox{and} \quad \bar{\mathbf{a}} = \operatorname*{\mathrm{argmin}}_{\mathbf{a} \in \tilde{S}_2^{u_t} \cap \bar{A}} \bar{f}^{u_t}_{l, \mathbf{a}}, \end{aligned}$$

    and the corresponding center \(\bar {\mathbf {c}}\).

The use of these two schemes allows us to significantly reduce the computational complexity of Algorithm 8.4 and accelerate its convergence.

8.4 Limited Memory Bundle Method for Clustering

In this section, we present the limited memory bundle method for clustering (LMB-Clust) [171]. The LMB-Clust has been developed specifically to solve clustering problems in large data sets. The algorithm combines two different approaches to solve the clustering problem when the squared Euclidean distance d 2 is used as a similarity measure. The MSInc-Clust is used to solve the clustering problem globally, and the LMBM is applied at each iteration of the algorithm to solve both the clustering problem (7.2) and the auxiliary clustering problem (7.4). The flowchart of the LMB-Clust is given in Fig. 8.2.

Fig. 8.2 Limited memory bundle method for clustering (LMB-Clust)

The LMBM, given in Fig. 3.5, was originally developed for solving general large-scale nonconvex NSO problems. Here, this method is slightly modified to be better suited for solving the clustering and the auxiliary clustering problems. In particular, a nonmonotone line search is used to find the step sizes \(t_L^h\) and \(t_R^h\). In addition, different stopping tolerances are utilized for different problems. That is, the tolerance ε is set to be relatively large for the auxiliary clustering problem (7.4), since this problem need not be solved very accurately, and smaller for the clustering problem (7.2).

Next, we give the modified version of the LMBM in its step by step form. We use x 1 for the starting point; ε c > 0 for the stopping tolerance; ε L and ε R for line search parameters; γ for the distance measure parameter; \(\hat {m}_c\) for the maximum number of stored correction vectors used to form the limited memory matrix updates; \(t_{\max }\) for an upper bound for serious steps; and C for a control parameter for the length of the direction vector. We also use i type to indicate the type of the problem, that is:

  • i type = 0: the auxiliary clustering problem (7.4);  and

  • i type = 1: the clustering problem (7.2).

In both cases, the objective function is denoted by f and the number of variables in the optimization problem is denoted by n. Hence \(f = \bar {f}_l\) and n = n for the auxiliary clustering problem and f = f l and n = nl for the lth clustering problem.
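
As an illustration of this convention, the dispatch between the two problems inside an implementation of Algorithm 8.5 could look as follows. The relaxation factor used for the auxiliary tolerance is an assumption, not a value from the book.

```python
def select_problem(i_type, l, n, f_cluster, f_aux):
    """i_type = 0: auxiliary clustering problem (7.4) with n variables;
    i_type = 1: clustering problem (7.2) with n*l variables."""
    if i_type == 0:
        return f_aux, n          # f = f_bar_l, only the l-th center is optimized
    return f_cluster, n * l      # f = f_l, all l centers stacked in one vector

def stopping_tolerance(i_type, eps_c, relax=100.0):
    """The auxiliary problem need not be solved accurately, so its tolerance is
    relaxed; the factor `relax` is purely illustrative."""
    return eps_c * relax if i_type == 0 else eps_c
```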

Algorithm 8.5 Modified limited memory bundle algorithm

The convergence properties of the LMBM are given in Sect. 3.4. Here, we recall the most important results in light of the clustering problem. Note that Assumptions 3.1–3.3, needed to prove the global convergence of the LMBM, are trivially satisfied for both the clustering and the auxiliary clustering problems.

Proposition 8.2

Assume that ε c = 0. If the LMBM terminates after a finite number of iterations, say at the iteration h, then the point x h is a Clarke stationary point of the (auxiliary) clustering problem.

Proposition 8.3

Assume that ε c = 0. Every accumulation point \(\bar {\mathbf {x}}\) generated by the LMBM is a Clarke stationary point of the (auxiliary) clustering problem.

Remark 8.2

The LMBM terminates in a finite number of steps if we choose ε c > 0.

Next, we describe the LMB-Clust and give its step by step algorithm. Since the problem (7.2) is nonconvex, it is important to select favorable starting points before applying a local search method like the LMBM to solve it. The LMB-Clust uses the MSInc-Clust for solving the clustering problem globally, and the LMBM is applied at each iteration of the MSInc-Clust to solve both problems (7.2) and (7.4).

Algorithm 8.6 Limited memory bundle method for clustering (LMB-Clust)

8.5 Discrete Gradient Clustering Algorithm

In this section, we describe the discrete gradient clustering algorithm (DG-Clust) to solve the clustering problem (7.2). As mentioned in Sect. 3.8, the underlying optimization solver, the DGM, is a semi-derivative free method for solving nonconvex NSO problems. The DGM does not use subgradients or their approximations, except at the end of the solution process, and thus it can be used to solve problems which are not subdifferentially regular. Therefore, the clustering algorithm based on the DGM can be used to solve clustering problems with the similarity measures d 1 and d ∞, in addition to d 2 based clustering problems.

The flowchart of the DG-Clust is given in Fig. 8.3. Similar to other optimization based clustering algorithms, the DG-Clust uses the MSInc-Clust for solving the clustering problem globally and the DGM is applied at each iteration of the MSInc-Clust to solve both the problems (7.2) and (7.4).

Fig. 8.3 Discrete gradient clustering algorithm (DG-Clust)

The flowchart of the DGM with a more detailed description is given in Sect. 3.8. Here, we give this method in its step by step form. Note that we use x 1 for the starting point; ε > 0 for the stopping tolerance; and ε L and ε R for line search parameters.

As before, we use the following notations: the objective function is denoted by f and n stands for the size of the optimization problem. That is, \(f= \bar {f}_l\) and n = n for the auxiliary clustering problem and f = f l and n = nl for the l-partition problem.

Algorithm 8.7 Discrete gradient method

The global convergence of Algorithm 8.7 has been studied in Sect. 3.8. Note that the assumptions needed for its convergence are satisfied for both the cluster and the auxiliary cluster functions. Next, we present the step by step description of the DG-Clust.

Algorithm 8.8 Discrete gradient clustering algorithm (DG-Clust)

Note that the DGM uses only discrete gradients to find an approximate solution to both the clustering and the auxiliary clustering problems. The calculation of discrete gradients can be simplified by exploiting the special structure of a problem, such as the piecewise separability or piecewise partial separability of the objective function (see Sect. 2.6.3).

It is proved in Propositions 4.5 and 4.12 that both the cluster function (7.3) and the auxiliary cluster function (7.5) are piecewise separable with the similarity measures d 1, d 2, and d ∞. Therefore, we can simplify the calculations of discrete gradients for both the cluster and the auxiliary cluster functions.

First, we consider the computation of discrete gradients of the cluster function f k. This function is a special case of the function f defined in (2.21) as

$$\displaystyle \begin{aligned} f(\mathbf{x}) = \sum_{i=1}^m \max_{h \in \mathcal{H}_i} ~\min_{j \in \mathcal{J}_h} f_{ihj}(\mathbf{x}). \end{aligned}$$

The cluster function f k does not depend on the index h and the sets \( \mathcal {H}_i,~i=1,\ldots ,m\) are all singletons. Therefore, for all \(h \in \mathcal {H}_i\) we have \(\mathcal {J}_h = \{1,\ldots ,k\}\) and

$$\displaystyle \begin{aligned} f_k(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^m \min_{l=1,\ldots,k} f_{il}({\mathbf{x}}_l), \end{aligned}$$

where f il(x l) = d p(x l, a i), l = 1, …, k.

Then the term functions are

$$\displaystyle \begin{aligned} &(x_{lt} - a_{it})^2 \quad \mbox{for}\quad p=2, \quad \mathrm{and}\\ &|x_{lt} - a_{it}| \quad ~~ \mbox{for}\quad p=1,\infty. \end{aligned} $$

Here, t = 1, …, n, l = 1, …, k, i = 1, …, m, and therefore, the total number of such term functions is mnk. Since the function f k has nk variables, one needs nk + 1 evaluations of this function to compute one of its discrete gradients. Then the total number of evaluations of term functions to compute one discrete gradient of f k is N t = mnk(nk + 1).

According to the definition of the discrete gradients for a given i ∈{1, …, nk} we compute values of the function f k at the following nk + 1 points:

$$\displaystyle \begin{aligned}\mathbf{x}, {\mathbf{x}}^0,{\mathbf{x}}^1,\ldots,{\mathbf{x}}^{i-1},{\mathbf{x}}^{i+1},\ldots,{\mathbf{x}}^{nk}. \end{aligned}$$

We need the full evaluation of the function f k only at two points, x and x 0, which requires 2mnk calculations of the term functions. The other points in this sequence are obtained from the previous point by changing only one coordinate, which is a coordinate of only one cluster center. This means that we need to update only m term functions at the points x 1, …, x i−1, x i+1, …, x nk, and the number of evaluations of the term functions at these points is m(nk − 1). Therefore, the total number of evaluations of term functions for the computation of one discrete gradient is \(\bar {N}_t=m(3nk-1).\)

Thus, in order to calculate one discrete gradient of the function f k at the point x, the following simplified scheme can be used. We compute the values of the function f k at the points x and x 0. Then we store the values of all term functions calculated at x 0. In order to calculate the value of f k at x 1, we update only those term functions which contain the first coordinate and keep all other term functions as they are. We repeat this scheme for all other coordinates. Note that we compute the function f k at the point x when we compute the first discrete gradient at this point. The use of this scheme allows us to reduce the number of term function evaluations for the computation of the first discrete gradient

$$\displaystyle \begin{aligned}\frac{N_t}{\bar{N}_t}=\frac{mnk(nk+1)}{m(3nk-1)} \approx \frac{nk+1}{3} \end{aligned}$$

times and approximately (nk + 1)∕2 times for the computation of all other discrete gradients at x. This reduction becomes very significant as the number of clusters k increases.
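
The caching idea behind this scheme can be sketched as follows for d 2: the per-center squared distances are stored, and changing one coordinate of one center only touches the m affected terms. The class and method names are illustrative.

```python
import numpy as np

class ClusterEval:
    """Cached evaluation of the cluster function f_k with d_2: term values are
    stored, and only the m terms containing a changed coordinate are recomputed."""

    def __init__(self, A, centers):
        self.A = A                                   # m x n data matrix
        self.centers = centers.copy()                # k x n cluster centers
        # D[i, l] = d_2(x_l, a_i): sums of the cached term values
        self.D = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    def value(self):
        """f_k(x) = mean over data points of the minimum cached distance."""
        return self.D.min(axis=1).mean()

    def perturb(self, l, t, new_value):
        """Change coordinate t of center l; only the m affected terms are updated."""
        old = self.centers[l, t]
        self.D[:, l] += (new_value - self.A[:, t]) ** 2 - (old - self.A[:, t]) ** 2
        self.centers[l, t] = new_value
        return self.value()
```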

A similar scheme can be designed to compute discrete gradients of the auxiliary cluster function \(\bar {f}_k\). Here, the total number of term functions is mn. The function \(\bar {f}_k\) has n variables and therefore, one needs n + 1 evaluations of this function to compute one of its discrete gradients. This means that the total number of evaluations of term functions to compute one discrete gradient of \(\bar {f}_k\) is N t = mn(n + 1).

For a given i ∈{1, …, n}, we compute values of the function \(\bar {f}_k\) at the following n + 1 points:

$$\displaystyle \begin{aligned}\mathbf{x}, {\mathbf{x}}^0,{\mathbf{x}}^1,\ldots,{\mathbf{x}}^{i-1},{\mathbf{x}}^{i+1},\ldots,{\mathbf{x}}^n. \end{aligned}$$

The full evaluation of the function \(\bar {f}_k\) at the points x and x 0 requires 2mn calculations of the term functions. The other points in this sequence are obtained from the previous point by changing only one coordinate. This means that we need to update only m term functions at the points x 1, …, x i−1, x i+1, …, x n, and therefore, the total number of evaluations of the term functions for calculating \(\bar {f}_k\) at these points is m(n − 1). The total number of evaluations of term functions for the computation of one discrete gradient is \(\bar {N}_t=m(3n-1)\).

Therefore, we can apply the following simplified scheme to compute one discrete gradient of \(\bar {f}_k\) at the point x. We compute the function \(\bar {f}_k\) at the points x and x 0 and store the values of all term functions calculated at x 0. In order to calculate the value of \(\bar {f}_k\) at x 1, for each data point we update only the first term function and keep all other term functions as they are. This scheme is repeated for all other coordinates. Applying this scheme reduces the number of term function evaluations needed to compute the first discrete gradient

$$\displaystyle \begin{aligned}\frac{N_t}{\bar{N}_t}=\frac{mn(n+1)}{m(3n-1)} \approx \frac{n+1}{3} \end{aligned}$$

times and approximately (n + 1)∕2 times for the computation of all other discrete gradients at x.
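
For concreteness, plugging in some arbitrary illustrative sizes, say m = 10 000, n = 10, and k = 25, shows the scale of these savings:

```python
m, n, k = 10_000, 10, 25

full_fk    = m * n * k * (n * k + 1)   # N_t      = mnk(nk + 1)
cached_fk  = m * (3 * n * k - 1)       # N_t_bar  = m(3nk - 1)
full_aux   = m * n * (n + 1)           # N_t for the auxiliary function f_bar_k
cached_aux = m * (3 * n - 1)           # N_t_bar for f_bar_k

print(full_fk / cached_fk)     # about 83.8, close to (nk + 1)/3
print(full_aux / cached_aux)   # about 3.79, close to (n + 1)/3
```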

8.6 Smooth Incremental Clustering Algorithm

In this section, we describe the smooth incremental clustering algorithm (IS-Clust) where the objective functions in both the clustering and the auxiliary clustering problems are approximated by smooth functions [33]. To approximate objective functions, we apply the HSM, described in Sect. 3.9. The hyperbolic smoothings of the cluster function f k and the auxiliary cluster function \(\bar {f}_k\) are given in Sects. 4.7.2 and 4.7.3, respectively. For convenience, we recall these smooth functions for any l = 2, …, k:

$$\displaystyle \begin{aligned} \varPhi_{l,\tau}(\mathbf{x},\mathbf{t}) & =-\frac{1}{m} \sum_{i=1}^m \left( t_i+\sum_{j=1}^l \frac{-d_p({\mathbf{x}}_j,\mathbf{ a}_i)-t_i+\sqrt{(d_p({\mathbf{x}}_j,{\mathbf{a}}_i)+t_i)^2+\tau^2}}{2} \right) \\ & =\frac{1}{m} \sum_{i=1}^m \left(-t_i+\sum_{j=1}^l ~\frac{t_i + d_p({\mathbf{x}}_j,\mathbf{ a}_i)-\sqrt{(d_p({\mathbf{x}}_j,{\mathbf{a}}_i)+t_i)^2 + \tau^2}}{2} \right), \end{aligned} $$

and

$$\displaystyle \begin{aligned} \bar{\varPhi}_{l,\tau}(\mathbf{y}) = \frac{1}{m} \sum_{i=1}^m \frac{r^{{\mathbf{a}}_i}_{l-1} + d_p(\mathbf{y},{\mathbf{a}}_i) - \sqrt{\big(r^{{\mathbf{a}}_i}_{l-1} - d_p(\mathbf{y},{\mathbf{a}}_i)\big)^2+\tau^2}}{2}, \end{aligned} $$

where \(\mathbf {x}=({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_l) \in \mathbb {R}^{nl},~\mathbf {y} \in \mathbb {R}^n\) and t = (t 1, …, t m), such that

$$\displaystyle \begin{aligned}t_i = -\min\limits_{j=1,\ldots,l} d_p({\mathbf{x}}_j,{\mathbf{a}}_i),\quad i=1,\ldots,m. \end{aligned}$$

As mentioned before, if the function d p is defined using the squared Euclidean norm, then the functions Φ l,τ and \(\bar {\varPhi }_{l,\tau }\) are both smooth since d 2 is differentiable. However, the other two functions d 1 and d ∞ are nonsmooth, and we need to reapply the hyperbolic smoothing technique to these functions to approximate them with smooth functions. These results are presented in Sect. 4.7.

Take any sequence {τ h} such that τ h → 0 as h →∞. Then the clustering and the auxiliary clustering problems (7.2) and (7.4) can be replaced by the sequences of the following smooth problems, respectively:

$$\displaystyle \begin{aligned} &\begin{cases} \text{minimize}\quad & \varPhi_{l,\tau_h}(\mathbf{x},\mathbf{t}) \\ \text{subject to} & \mathbf{x}=({\mathbf{x}}_1,\ldots,{\mathbf{x}}_l) \in \mathbb{R}^{nl}, \end{cases} \end{aligned} $$
(8.2)

and

$$\displaystyle \begin{aligned} &\begin{cases} \text{minimize}\quad & \bar{\varPhi}_{l,\tau_h}(\mathbf{y})\\ \text{subject to} & \mathbf{y} \in \mathbb{R}^n. \end{cases} \end{aligned} $$
(8.3)

The IS-Clust solves the clustering problem by combining the MSInc-Clust and an optimization method. The IS-Clust applies the MSInc-Clust to solve the clustering problem globally. Since the clustering and the auxiliary clustering problems (8.2) and (8.3) are smooth, the IS-Clust can utilize any smooth optimization method to solve them. The flowchart of the IS-Clust is presented in Fig. 8.4.
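
As an illustration, the smoothed auxiliary problem (8.3) for d 2 can be handed to any standard smooth solver. The sketch below uses the standard hyperbolic smoothing of the minimum of two numbers together with SciPy's BFGS; the function names and the sequence of smoothing parameters τ are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_aux(y, A, r, tau):
    """Hyperbolic smoothing of the auxiliary cluster function for d_2:
    min{r, d} is replaced by (r + d - sqrt((r - d)^2 + tau^2)) / 2."""
    d = ((A - y) ** 2).sum(axis=1)
    return np.mean((r + d - np.sqrt((r - d) ** 2 + tau ** 2)) / 2.0)

def solve_smoothed_aux(A, r, y0, taus=(1.0, 0.1, 0.01, 0.001)):
    """Solve the sequence of smooth problems (8.3) for a decreasing sequence of tau,
    warm-starting each problem from the previous solution."""
    y = np.asarray(y0, dtype=float)
    for tau in taus:
        res = minimize(smoothed_aux, y, args=(A, r, tau), method="BFGS")
        y = res.x
    return y
```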

Fig. 8.4 Smooth incremental clustering algorithm (IS-Clust)

The IS-Clust is given in its step by step description as follows.

Algorithm 8.9 Smooth incremental clustering algorithm (IS-Clust)

Note that in Step 3 of this algorithm when we apply Algorithm 7.2 to compute the set of starting points \(\bar {A}_5\), the auxiliary cluster function \(\bar {f}_l\) is approximated by the smooth function \(\bar {\varPhi }_{l,\tau }\).