
1 Introduction

Spectral clustering (SC) represents the most popular class of algorithms based on graph theory [11]. It makes use of the Laplacian's spectrum to partition a graph into weakly connected subgraphs. Moreover, if the graph is constructed from any kind of data (vectors, images, etc.), data clustering can be performed. SC gained popularity when Shi and Malik introduced the Normalized Cut criterion for image segmentation [59]. Afterwards, Ng and Jordan [51], in a theoretical work based on matrix perturbation theory, established conditions under which the algorithm is expected to perform well. Finally, the tutorial by Von Luxburg exhaustively summarizes the main literature related to SC [63]. Although very successful in a number of applications, SC has some limitations. For instance, it cannot handle big data without using approximation methods like the Nyström algorithm [19, 64], the power iteration method [37], or linear algebra-based methods [15, 20, 52]. Furthermore, the generalization to out-of-sample data is only approximate.

These issues have recently been tackled by means of a spectral clustering algorithm formulated as weighted kernel PCA [2]. The technique, named kernel spectral clustering (KSC), is based on solving a constrained optimization problem in a primal-dual setting. In other words, KSC is a Least-Squares Support Vector Machine (LS-SVM [61]) model used for clustering instead of classification. By casting SC in a learning framework, KSC allows a rigorous selection of tuning parameters such as the natural number of clusters present in the data. Moreover, the cluster memberships of unseen points can be accurately predicted by projecting test data onto the embedding eigenspace learned during training. Furthermore, the algorithm can be tailored to a given application by using the most appropriate kernel function. Beyond that, by using sparse formulations and a fixed-size approach [12, 61], big data can be readily handled. Finally, suitable adaptations of the core algorithm yield hierarchical clustering and a soft clustering approach.

The idea behind KSC is similar to the earlier works introduced in [16, 17]. In these papers the authors showed that a general weighted kernel k-means objective is mathematically equivalent to weighted graph partitioning objectives such as ratio cut, normalized cut, and ratio association. This equivalence makes it possible, for instance, to optimize the graph partitioning objectives directly with the weighted kernel k-means algorithm, thereby avoiding eigenvector computations when these are prohibitive. Although quite appealing and mathematically sound, the algorithm presents some drawbacks. The main issues concern the sensitivity of the final clustering results to the initialization technique, the choice of the shift parameter, and the model selection (i.e., how to choose the number of clusters). Furthermore, the out-of-sample extension problem is not discussed. As we will see later, these issues are not present in the KSC algorithm.

The remainder of this chapter is organized as follows. After presenting the basic KSC method, the soft KSC algorithm will be summarized. Next, two possible ways to accomplish hierarchical clustering will be explained. Afterwards, some sparse formulations based on the Incomplete Cholesky Decomposition (ICD) and on \(L_{0}\), \(L_{1}\), \(L_{0} + L_{1}\), and Group Lasso regularization will be described. Lastly, various interesting applications in different domains such as computer vision, power-load consumer profiling, information retrieval, and big data clustering will be illustrated. All these examples assume a static setting. Concerning applications in a dynamic scenario, the interested reader can refer to [29, 33] for fault detection, [32] for incremental time-series clustering, [25, 28, 31] for community detection in evolving networks, and [54] for human motion tracking.

2 Notation

Table 1

3 Kernel Spectral Clustering (KSC)

3.1 Mathematical Formulation

3.1.1 Training Problem

The KSC formulation for k clusters is stated as a combination of k − 1 binary problems [2]. In particular, given a set of training data \(\mathcal{D}_{\text{tr}} =\{ x_{i}\}_{i=1}^{N_{\text{tr}}}\), the primal problem is:

$$\displaystyle{ \begin{array}{l@{\quad }l} \mathop{\text{min}}\limits_{w^{(l)},e^{(l)},b_{l}}\quad &\frac{1} {2}\sum _{l=1}^{k-1}w^{(l)^{T} }w^{(l)} -\frac{1} {2}\sum _{l=1}^{k-1}\gamma _{ l}e^{(l)^{T} }V e^{(l)} \\ \text{subject to} \quad &e^{(l)} =\varPhi w^{(l)} + b_{l}1_{N_{ \text{tr}}},l = 1,\mathop{\ldots },k - 1. \end{array} }$$
(1)

The \(e^{(l)} = [e_{1}^{(l)},\mathop{\ldots },e_{i}^{(l)},\mathop{\ldots },e_{N_{\text{tr}}}^{(l)}]^{T}\) are the projections of the training data mapped into the feature space along the direction \(w^{(l)}\). For a given point \(x_{i}\), the model in the primal form is:

$$\displaystyle{ e_{i}^{(l)} = w^{(l)^{T} }\varphi (x_{i}) + b_{l}. }$$
(2)

The primal problem (1) expresses the maximization of the weighted variances of the data, given by \(e^{(l)^{T}}V e^{(l)}\), and the contextual minimization of the squared norm of the vector \(w^{(l)}\), \(\forall l\). The regularization constants \(\gamma _{l} \in \mathbb{R}^{+}\) trade off the model complexity expressed by \(w^{(l)}\) against the correct representation of the training data. \(V \in \mathbb{R}^{N_{\text{tr}}\times N_{\text{tr}}}\) is the weighting matrix and \(\varPhi\) is the \(N_{\text{tr}} \times d_{h}\) feature matrix \(\varPhi = [\varphi (x_{1})^{T};\mathop{\ldots };\varphi (x_{N_{\text{tr}}})^{T}]\), where \(\varphi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_{h}}\) denotes the mapping to a high-dimensional feature space, and \(b_{l}\) are bias terms.

The dual problem corresponding to the primal formulation (1), obtained by setting \(V = D^{-1}\), becomes:

$$\displaystyle{ D^{-1}M_{ D}\varOmega \alpha ^{(l)} =\lambda _{ l}\alpha ^{(l)} }$$
(3)

where \(\varOmega\) is the kernel matrix with ijth entry \(\varOmega _{ij} = K(x_{i},x_{j}) =\varphi (x_{i})^{T}\varphi (x_{j})\), and \(K: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}\) denotes the kernel function. The type of kernel function to use is application-dependent, as outlined in Table 1. The matrix D is the graph degree matrix, which is diagonal with positive elements \(D_{ii} =\sum _{j}\varOmega _{ij}\), \(M_{D}\) is a centering matrix defined as \(M_{D} = I_{N_{ \text{tr}}} - \frac{1} {1_{N_{ \text{tr}}}^{T}D^{-1}1_{N_{ \text{tr}}}}1_{N_{\text{tr}}}1_{N_{\text{tr}}}^{T}D^{-1}\), the \(\alpha ^{(l)}\) are vectors of dual variables, and \(\lambda _{l} = \frac{N_{\text{tr}}} {\gamma _{l}}\). The dual clustering model for the ith point can be expressed as follows:

$$\displaystyle{ e_{i}^{(l)} =\sum _{j=1}^{N_{\text{tr}}}\alpha _{j}^{(l)}K(x_{j},x_{i}) + b_{l},\quad i = 1,\mathop{\ldots },N_{\text{tr}},\;l = 1,\mathop{\ldots },k - 1. }$$
(4)

The cluster prototypes can be obtained by binarizing the projections \(e_{i}^{(l)}\) as \(\text{sign}(e_{i}^{(l)})\). This step is straightforward because, thanks to the presence of the bias term \(b_{l}\), both the \(e^{(l)}\) and the \(\alpha ^{(l)}\) variables get automatically centered around zero. The set of the most frequent binary indicators forms a code-book \(\mathcal{C}\mathcal{B} =\{ c_{p}\}_{p=1}^{k}\), where each codeword of length k − 1 represents a cluster.
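To make the training stage concrete, the following sketch (in Python/NumPy, with illustrative names such as rbf_kernel and ksc_train that are not part of the official KSC toolbox) builds the degree and centering matrices, solves the dual eigenvalue problem (3) for the k − 1 leading eigenvectors, and forms the code-book from the most frequent sign patterns of the training projections. An RBF kernel is assumed, and the bias computation is an assumption chosen to reproduce the centering enforced by \(M_{D}\).

```python
# Minimal sketch of the KSC training problem (3); names are illustrative.
import numpy as np
from scipy.linalg import eig
from collections import Counter

def rbf_kernel(X, Y, sigma2):
    """RBF kernel matrix K(x, y) = exp(-||x - y||^2 / sigma2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def ksc_train(Xtr, k, sigma2):
    """Solve D^{-1} M_D Omega alpha = lambda alpha for the k-1 leading eigenvectors."""
    N = Xtr.shape[0]
    Omega = rbf_kernel(Xtr, Xtr, sigma2)
    d = Omega.sum(axis=1)                          # degrees D_ii
    Dinv = np.diag(1.0 / d)
    ones = np.ones((N, 1))
    # Centering matrix M_D = I - (1 1^T D^{-1}) / (1^T D^{-1} 1)
    MD = np.eye(N) - (ones @ ones.T @ Dinv) / (ones.T @ Dinv @ ones)
    vals, vecs = eig(Dinv @ MD @ Omega)
    order = np.argsort(-vals.real)                 # keep the k-1 leading eigenvectors
    alpha = vecs[:, order[:k - 1]].real
    # Bias terms: assumption, chosen so that the weighted projections are centered,
    # consistent with the centering enforced by M_D in the dual problem.
    b = -(ones.T @ Dinv @ Omega @ alpha) / (ones.T @ Dinv @ ones)
    e = Omega @ alpha + b                          # training projections, Eq. (4)
    # Code-book: the k most frequent sign patterns among the training projections
    signs = [tuple(row) for row in np.sign(e).astype(int)]
    codebook = [np.array(c) for c, _ in Counter(signs).most_common(k)]
    return alpha, b, codebook
```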

Table 1 Types of kernel functions for different applications

Interestingly, problem (3) has a close connection with SC based on the random walk Laplacian. In this respect, the kernel matrix can be considered as a weighted graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\) with the nodes \(v_{i} \in \mathcal{V}\) represented by the data points \(x_{i}\). This graph has a corresponding random walk in which the probability of leaving a vertex is distributed among the outgoing edges according to their weight: \(p_{t+1} = Pp_{t}\), where \(P = D^{-1}\varOmega\) indicates the transition matrix whose ijth entry denotes the probability of moving from node i to node j in one time-step. Moreover, the stationary distribution of the Markov chain describes the scenario where the random walker stays mostly within the same cluster and seldom moves to the other clusters [14, 46, 47].

3.1.2 Generalization

Given the dual model parameters α (l) and b l , it is possible to assign a membership to unseen points by calculating their projections onto the eigenvectors computed in the training phase:

$$\displaystyle{ e_{\text{test}}^{(l)} =\varOmega _{\text{test}}\alpha ^{(l)} + b_{ l}1_{N_{\text{test}}} }$$
(5)

where \(\varOmega _{\text{test}}\) is the \(N_{\text{test}} \times N_{\text{tr}}\) kernel matrix evaluated on the test points, with entries \(\varOmega _{\text{test},ri} = K(x_{r}^{\text{test}},x_{i})\), \(r = 1,\mathop{\ldots },N_{\text{test}}\), \(i = 1,\mathop{\ldots },N_{\text{tr}}\). The cluster indicator for a given test point can be obtained by using an Error Correcting Output Codes (ECOC) decoding procedure:

  • the score variable is binarized

  • the binarized indicator is compared with the codewords in the training code-book \(\mathcal{C}\mathcal{B}\) (see the previous section), and the point is assigned to the cluster of the nearest codeword in terms of Hamming distance.

The KSC method, comprising the training and test stages, is summarized in Algorithm 1, and the related Matlab package is freely available on the Web.

Algorithm 1 KSC algorithm [2]
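A minimal sketch of the out-of-sample extension (5) together with the ECOC decoding step, reusing the hypothetical rbf_kernel and the alpha, b, codebook produced by the training sketch above:

```python
# Sketch of the out-of-sample extension (5) with ECOC decoding; reuses the
# illustrative rbf_kernel and the outputs of ksc_train defined earlier.
import numpy as np

def ksc_predict(Xtest, Xtr, alpha, b, codebook, sigma2):
    """Assign each test point to the cluster with the closest codeword (Hamming distance)."""
    Omega_test = rbf_kernel(Xtest, Xtr, sigma2)      # N_test x N_tr kernel matrix
    e_test = Omega_test @ alpha + b                  # test projections, Eq. (5)
    s = np.sign(e_test).astype(int)                  # binarize the score variables
    # Hamming distance of each binarized indicator to every codeword in the code-book
    ham = np.array([[np.sum(row != c) for c in codebook] for row in s])
    return ham.argmin(axis=1)                        # index of the nearest prototype
```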

3.1.3 Model Selection

In order to select tuning parameters such as the number of clusters k and, possibly, the kernel parameters, a model selection procedure based on grid search is adopted. First, a validation set \(\mathcal{D}_{\text{val}} =\{ x_{i}\}_{i=1}^{N_{\text{val}}}\) is sampled from the whole dataset. Then, a grid of possible values of the tuning parameters is constructed. Afterwards, a KSC model is trained for each combination of parameters, and the chosen criterion is evaluated on the partitioning predicted for the validation data. Finally, the parameters yielding the maximum value of the criterion are selected. Depending on the kind of data, a variety of model selection criteria have been proposed:

  • Balanced Line Fit (BLF). It indicates the amount of collinearity between validation points belonging to the same cluster, in the space of the projections. It reaches its maximum value 1 in case of well-separated clusters, represented as lines in the space of the \(e_{\text{val}}^{(l)}\) (see, for instance, the bottom left side of Fig. 1).

    Fig. 1

    KSC partitioning on a toy dataset. (Top) Original dataset consisting of three clusters (left) and obtained clustering results (right). (Bottom) Points represented in the space of the projections \([e^{(1)},e^{(2)}]\) (left), for an optimal choice of k (and \(\sigma ^{2} = 4.36 \cdot 10^{-3}\)) suggested by the BLF criterion (right). We can notice how the points belonging to one cluster tend to lie on the same line. A perfect line structure is not attained due to a certain amount of overlap between the clusters

  • Balanced Angular Fit (BAF) [39]. For every cluster, the sum of the cosine similarities between the validation points and the cluster prototype is computed and divided by the cardinality of that cluster. These per-cluster values are then summed and divided by the total number of clusters.

  • Average Membership Strength (AMS) [30]. The mean membership per cluster, i.e., the mean degree of belonging of the validation points to that cluster, is computed. These mean cluster memberships are then averaged over the number of clusters.

  • Modularity [49]. This quality function is well suited for network data. In the model selection scheme, the Modularity of the validation subgraph corresponding to a given partitioning is computed, and the parameters related to the highest Modularity are selected [26, 27].

  • Fisher Criterion. The classical Fisher criterion [8] used in classification has been adapted to select the number of clusters k and the kernel parameters in the KSC framework [4]. The criterion maximizes the distance between the cluster means while minimizing the variance within each cluster, in the space of the projections \(e_{\text{val}}^{(l)}\).

In Fig. 1 an example of clustering obtained by KSC on a synthetic dataset is shown. The BLF model selection criterion has been used to tune the bandwidth of the RBF kernel and the number of clusters. The results are quite accurate, despite the fact that the cluster boundaries are highly nonlinear.
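The grid-search scheme described above can be sketched as follows; criterion is a placeholder for any of the criteria listed (BLF, BAF, AMS, Modularity, Fisher), and ksc_train/ksc_predict refer to the earlier illustrative sketches:

```python
# Sketch of the grid-search model selection: train a KSC model for every (k, sigma2)
# pair and score the predicted validation partition with the chosen criterion.
import numpy as np

def select_model(Xtr, Xval, k_grid, sigma2_grid, criterion):
    best, best_score = None, -np.inf
    for k in k_grid:
        for sigma2 in sigma2_grid:
            alpha, b, codebook = ksc_train(Xtr, k, sigma2)
            q_val = ksc_predict(Xval, Xtr, alpha, b, codebook, sigma2)
            score = criterion(Xval, q_val, k)        # e.g. BLF/BAF/AMS on D_val
            if score > best_score:
                best, best_score = (k, sigma2), score
    return best, best_score
```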

3.2 Soft Kernel Spectral Clustering

Soft kernel spectral clustering (SKSC) makes use of Algorithm 1 in order to compute a first hard partitioning of the training data. Next, soft cluster assignments are performed by computing the cosine distance between each point and the cluster prototypes in the space of the projections \(e^{(l)}\). In particular, given the projections for the training points \(e_{i} = [e_{i}^{(1)},\mathop{\ldots },e_{i}^{(k-1)}]\), \(i = 1,\mathop{\ldots },N_{\text{tr}}\), and the corresponding hard assignments \(q_{i}^{p}\), we can calculate for each cluster the cluster prototypes \(s_{1},\mathop{\ldots },s_{p},\mathop{\ldots },s_{k}\), \(s_{p} \in \mathbb{R}^{k-1}\), as:

$$\displaystyle{ s_{p} = \frac{1} {n_{p}}\sum _{i=1}^{n_{p} }e_{i} }$$
(6)

where \(n_{p}\) is the number of points assigned to cluster p during the initialization step by KSC. Then the cosine distance between the ith point in the projection space and a prototype \(s_{p}\) is calculated by means of the following formula:

$$\displaystyle{ d_{ip}^{\text{cos}} = 1 - e_{ i}^{T}s_{ p}/(\vert \vert e_{i}\vert \vert _{2}\vert \vert s_{p}\vert \vert _{2}). }$$
(7)

The soft membership of point i to cluster q can be finally expressed as:

$$\displaystyle{ \mathrm{sm}_{i}^{(q)} = \frac{\prod _{j\neq q}d_{ij}^{\text{cos}}} {\sum _{p=1}^{k}\prod _{j\neq p}d_{ij}^{\text{cos}}} }$$
(8)

with \(\sum _{p=1}^{k}\mathrm{sm}_{i}^{(p)} = 1\). As pointed out in [7], this membership represents a subjective probability expressing the belief in the clustering assignment.

The out-of-sample extension on unseen data consists simply of calculating Eq. (5) and assigning the test projections to the closest centroid.
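A compact sketch of the soft assignment step, Eqs. (6)–(8), assuming the hard labels and the projection matrix are available from the KSC initialization (function and variable names are illustrative):

```python
# Sketch of the SKSC soft assignment: prototypes are the per-cluster means of the
# projections (Eq. 6); soft memberships follow from cosine distances (Eqs. 7-8).
import numpy as np

def soft_memberships(E, hard_labels, k):
    """E: N x (k-1) projections; hard_labels: initial KSC assignments in {0,...,k-1}."""
    prototypes = np.vstack([E[hard_labels == p].mean(axis=0) for p in range(k)])   # Eq. (6)
    # Cosine distances d_ip = 1 - e_i^T s_p / (||e_i|| ||s_p||), Eq. (7)
    d = 1.0 - (E @ prototypes.T) / (
        np.linalg.norm(E, axis=1, keepdims=True) * np.linalg.norm(prototypes, axis=1))
    # Soft membership of point i to cluster q, Eq. (8): product of the distances to
    # the other clusters, normalized so that the memberships of each point sum to one.
    sm = np.empty((E.shape[0], k))
    for q in range(k):
        sm[:, q] = np.prod(np.delete(d, q, axis=1), axis=1)
    return sm / sm.sum(axis=1, keepdims=True)
```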

An example of soft clustering performed by SKSC on a synthetic dataset is depicted in Fig. 2. The AMS model selection criterion has been used to select the bandwidth of the RBF kernel and the optimal number of clusters. The reader can appreciate how SKSC provides more interpretable outcomes compared to KSC.

Fig. 2

SKSC partitioning on a synthetic dataset. (Top) Original dataset consisting of two clusters (left) and obtained soft clustering results (right). (Bottom) Points represented in the space of the projection e (1) (left), for an optimal choice of k (and \(\sigma ^{2} = 1.53 \cdot 10^{-3}\)) as detected by the AMS criterion (right)

The SKSC method is summarized in Algorithm 2, and a Matlab implementation is freely available for download.

Algorithm 2 SKSC algorithm [30]

3.3 Hierarchical Clustering

In many cases, clusters are formed by sub-clusters which in turn might have substructures. As a consequence, an algorithm able to discover a hierarchical organization of the clusters provides a more informative result, incorporating several scales in the analysis. The flat KSC algorithm has been extended in two ways in order to deal with hierarchical clustering.

3.3.1 Approach 1

This approach, named hierarchical kernel spectral clustering (HKSC), was proposed in [4] and exploits the information about the multi-scale structure of the data provided by the Fisher criterion (see the end of Sect. 3.1.3). A grid search over different values of k and \(\sigma ^{2}\) is performed to find tuning parameter pairs for which the criterion exceeds a specified threshold value. The KSC model is then trained for each pair and evaluated on the test set using the out-of-sample extension. A specialized linkage criterion determines which clusters are merged, based on the evolution of the cluster memberships as the hierarchy goes up. The whole procedure is summarized in Algorithm 3.

Algorithm 3 HKSC algorithm [4]

3.3.2 Approach 2

In [42] and [41] an alternative hierarchical extension of the basic KSC algorithm was introduced, for network and vector data, respectively. In this method, called agglomerative hierarchical kernel spectral clustering (AH-KSC), the structure of the projections in the eigenspace is used to automatically determine a set of increasing distance thresholds. At the beginning, the validation point with the maximum number of similar points within the first threshold value is selected. The indices of all these points form the first cluster at level 0 of the hierarchy. These points are then removed from the validation data matrix, and the process is repeated iteratively until the matrix becomes empty. This yields the first level of the hierarchy, corresponding to the first distance threshold. To obtain the clusters at the next level, the clusters at the previous level are treated as data points and the whole procedure is repeated with the next threshold value; this step takes inspiration from [9]. The algorithm stops when only one cluster remains. The same procedure is applied in the test stage, where the distance thresholds computed in the validation phase are reused. An overview of all the steps involved in the algorithm is depicted in Fig. 3, and a sketch of the greedy grouping step is given below. In Fig. 4 an example of hierarchical clustering performed by this algorithm on a toy dataset is shown.
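As a rough illustration, the greedy grouping performed at a single level of the hierarchy can be sketched as follows; the threshold selection and the aggregation across levels follow [42] and are not shown, and all names are hypothetical:

```python
# Sketch of the greedy grouping used at one AH-KSC level: repeatedly pick the point
# with the most neighbours within the distance threshold t, turn that neighbourhood
# into a cluster, remove it, and continue until no points remain.
import numpy as np

def group_at_threshold(D, t):
    """D: symmetric matrix of pairwise distances in the projection space; t: threshold."""
    remaining = list(range(D.shape[0]))
    clusters = []
    while remaining:
        sub = D[np.ix_(remaining, remaining)]
        within = (sub <= t)                         # neighbourhoods within the threshold
        seed = int(np.argmax(within.sum(axis=1)))   # point with the most similar points
        members = [remaining[j] for j in np.flatnonzero(within[seed])]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters
```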

Fig. 3

AH-KSC algorithm. Steps of the AH-KSC method as described in [42], with the addition of the step where the optimal \(\sigma\) and k are estimated

Fig. 4

AH-KSC partitioning on a toy dataset. Cluster memberships for a toy dataset at different hierarchical levels obtained by the AH-KSC method

3.4 Sparse Clustering Models

The computational complexity of the KSC algorithm depends on solving the eigenvalue problem (3) in the training stage and on computing Eq. (5), which gives the cluster memberships of the remaining points. Assuming that we have \(N_{\text{tot}}\) data points and we use \(N_{\text{tr}}\) points for training and \(N_{\text{test}} = N_{\text{tot}} - N_{\text{tr}}\) as test set, the runtime of Algorithm 1 is \(O(N_{\text{tr}}^{2}) + O(N_{\text{tr}}N_{\text{test}})\). In order to reduce the computational complexity, it is then necessary to find a reduced set of training points without losing accuracy. In the next sections two different methods to obtain a sparse KSC model, based on the Incomplete Cholesky Decomposition (ICD) and on \(L_{1}\) and \(L_{0}\) penalties, respectively, are discussed. In particular, thanks to the ICD, the KSC computational complexity for the training problem is decreased to \(O(R^{2}N_{\text{tr}})\) [53], where R indicates the size of the reduced set.

3.4.1 Incomplete Cholesky Decomposition

One of the KKT optimality conditions characterizing the Lagrangian of problem (1) is:

$$\displaystyle{ w^{(l)} =\varPhi ^{T}\alpha ^{(l)} =\sum _{ i=1}^{N_{\text{tr}}}\alpha _{ i}^{(l)}\varphi (x_{ i}). }$$
(9)

From Eq. (9) it is evident that every training data point contributes to the primal variable \(w^{(l)}\), resulting in a non-sparse model. In order to obtain a parsimonious model, a reduced set method based on the Incomplete Cholesky Decomposition (ICD) was proposed in [3, 53]. The technique is based on finding a small number \(R \ll N_{\text{tr}}\) of points \(\mathcal{R} =\{\hat{ x_{r}}\}_{r=1}^{R}\) and related coefficients \(\zeta ^{(l)}\) with the aim of approximating \(w^{(l)}\) as:

$$\displaystyle{ w^{(l)} \approx \hat{ w}^{(l)} =\sum _{ r=1}^{R}\zeta _{ r}^{(l)}\varphi (\hat{x_{ r}}). }$$
(10)

As a consequence, the projection of an arbitrary data point x into the training embedding is given by:

$$\displaystyle{ e^{(l)} \approx \hat{ e}^{(l)} =\sum _{ r=1}^{R}\zeta _{ r}^{(l)}K(x,\widehat{x_{ r}}) +\widehat{ b_{l}}. }$$
(11)

The set \(\mathcal{R}\) of points can be obtained by considering the pivots of the ICD performed on the kernel matrix \(\varOmega\). In particular, by assuming that \(\varOmega\) has a small numerical rank, the kernel matrix can be approximated as \(\varOmega \approx \hat{\varOmega }= GG^{T}\), with \(G \in \mathbb{R}^{N_{\text{tr}}\times R}\). If we plug this approximated kernel matrix into problem (3), the KSC eigenvalue problem can be written as:

$$\displaystyle{ \hat{D}^{-1}M_{\hat{ D}}U\varPsi ^{2}U^{T}\hat{\alpha }^{(l)} =\widehat{\lambda _{ l}}\hat{\alpha }^{(l)},l = 1,\mathop{\ldots },k }$$
(12)

where \(U \in \mathbb{R}^{N_{\text{tr}}\times R}\) and \(V \in \mathbb{R}^{R\times R}\) denote the left and right singular vectors deriving from the (thin) singular value decomposition (SVD) of G, and \(\varPsi \in \mathbb{R}^{R\times R}\) is the diagonal matrix of the singular values. If we now pre-multiply both sides of Eq. (12) by \(U^{T}\) and substitute \(\hat{\delta }^{(l)} = U^{T}\hat{\alpha }^{(l)}\), only the following eigenvalue problem of size R × R must be solved:

$$\displaystyle{ U^{T}\hat{D}^{-1}M_{\hat{ D}}U\varPsi ^{2}\hat{\delta }^{(l)} =\widehat{\lambda _{ l}}\hat{\delta }^{(l)},l = 1,\mathop{\ldots },k. }$$
(13)

The approximated eigenvectors of the original problem (3) can be computed as \(\hat{\alpha }^{(l)} = U\hat{\delta }^{(l)}\), and the sparse parameter vector can be found by solving the following optimization problem:

$$\displaystyle{ \min _{\zeta ^{(l)}} \parallel w^{(l)} -\hat{ w}^{(l)} \parallel _{ 2}^{2} =\min _{\zeta ^{ (l)}} \parallel \varPhi ^{T}\alpha ^{(l)} -\chi ^{T}\zeta ^{(l)} \parallel _{ 2}^{2}. }$$
(14)

The corresponding dual problem can be written as follows:

$$\displaystyle{ \varOmega ^{\chi \chi }\zeta ^{(l)} =\varOmega ^{\chi \phi }\alpha ^{(l)}, }$$
(15)

where \(\varOmega _{rs}^{\chi \chi } = K(\hat{x}_{r},\hat{x}_{s})\), \(\varOmega _{ri}^{\chi \phi } = K(\hat{x}_{r},x_{i})\), \(r,s = 1,\ldots,R\), \(i = 1,\ldots,N_{\mathrm{tr}}\) and \(l = 1,\ldots,k - 1\). Since the size R of problem (13) can be much smaller than the size \(N_{\text{tr}}\) of the original problem, the sparse KSC method is suitable for big data analytics.
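The following sketch illustrates the ICD-based reduction: a pivoted incomplete Cholesky factorization \(\varOmega \approx GG^{T}\) selects the reduced set, and the R × R problem (13) is solved instead of the full problem (3). The factorization is written for clarity rather than speed, and the function names are illustrative:

```python
# Sketch of the ICD-based sparse KSC training step.
import numpy as np
from scipy.linalg import svd, eig

def incomplete_cholesky(Omega, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky factorization Omega ~ G G^T."""
    N = Omega.shape[0]
    R = max_rank if max_rank is not None else N
    G = np.zeros((N, R))
    diag = Omega.diagonal().astype(float).copy()
    pivots = []
    for j in range(R):
        p = int(np.argmax(diag))                    # next pivot: largest residual diagonal
        if diag[p] < tol:
            G = G[:, :j]
            break
        pivots.append(p)
        G[:, j] = (Omega[:, p] - G @ G[p, :]) / np.sqrt(diag[p])
        diag -= G[:, j] ** 2
    return G, pivots

def sparse_ksc_eigs(Omega, k, tol=1e-6, max_rank=None):
    """Approximate the k-1 leading eigenvectors of (3) via the reduced problem (13)."""
    N = Omega.shape[0]
    G, pivots = incomplete_cholesky(Omega, tol, max_rank)
    U, s, _ = svd(G, full_matrices=False)           # thin SVD: G = U Psi V^T
    d = G @ G.sum(axis=0)                           # row sums of Omega_hat = G G^T
    dinv = 1.0 / d                                  # diagonal of D_hat^{-1}
    ones = np.ones(N)
    # M_{D_hat} U, computed without forming the N x N centering matrix explicitly
    MD_U = U - np.outer(ones, dinv @ U) / (ones @ dinv)
    A = (U.T * dinv) @ MD_U @ np.diag(s ** 2)       # Eq. (13): U^T D^{-1} M_D U Psi^2
    vals, deltas = eig(A)
    order = np.argsort(-vals.real)
    delta = deltas[:, order[:k - 1]].real
    alpha_hat = U @ delta                           # approximate eigenvectors of (3)
    # The reduced-set coefficients zeta^{(l)} then follow from Eq. (15),
    # using the pivot points as the reduced set R.
    return alpha_hat, pivots
```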

3.4.2 Using Additional Penalty Terms

In this part we explore sparsity in the KSC technique by adding a penalty term to the objective function (14). In [3], the authors used an \(L_{1}\) penalization term in combination with the reconstruction error term to introduce sparsity. It is well known that \(L_{1}\) regularization induces sparsity, as shown in [66]. However, the resulting reduced set is neither the sparsest nor the most optimal w.r.t. the quality of clustering for the entire dataset. In [43], we introduced alternative penalization techniques, namely Group Lasso [65, 21] and \(L_{0}\) and \(L_{1} + L_{0}\) penalizations. The Group Lasso penalty is well suited for clustering as it results in groups of relevant data points. The \(L_{0}\) regularization counts the number of non-zero terms in the vector, and the \(L_{0}\)-norm leads to a non-convex and NP-hard optimization problem. We therefore adopt a convex relaxation of the \(L_{0}\)-norm based on the iteratively re-weighted \(L_{1}\) formulation introduced in [10, 22] and apply it to obtain optimal reduced sets for sparse kernel spectral clustering. Below we provide the Group Lasso penalized objective (16) and the re-weighted \(L_{1}\)-norm penalized objective (17).

The Group Lasso [65] based formulation for our optimization problem is:

$$\displaystyle{ \mathop{\text{min}}\limits_{\beta \in \mathbb{R}^{N_{\mathrm{tr}}\times (k-1)}}\quad \|\varPhi ^{\intercal }\alpha -\varPhi ^{\intercal }\beta \|_{ 2}^{2} +\lambda \sum _{ l=1}^{N_{\mathrm{tr}} }\sqrt{\rho _{l}}\|\beta _{l}\|_{2}, }$$
(16)

where \(\varPhi = [\phi (x_{1}),\ldots,\phi (x_{N_{\mathrm{tr}}})]\), \(\alpha = [\alpha ^{(1)},\ldots,\alpha ^{(k-1)}]\), \(\alpha \in \mathbb{R}^{N_{\mathrm{tr}}\times (k-1)}\) and \(\beta = [\beta _{1},\ldots,\beta _{N_{\mathrm{tr}}}]\), \(\beta \in \mathbb{R}^{N_{\mathrm{tr}}\times (k-1)}\). Here \(\alpha ^{(i)} \in \mathbb{R}^{N_{\mathrm{tr}}}\) while \(\beta _{j} \in \mathbb{R}^{k-1}\), and we set \(\sqrt{\rho _{l}}\) equal to the fraction of training points belonging to the cluster to which the lth training point belongs. The value of \(\lambda\), which acts as a regularization parameter, controls the amount of sparsity introduced in the model. In [21], the authors show that, given initial solutions \(\hat{\beta }_{1},\hat{\beta }_{2},\ldots,\hat{\beta }_{N_{\mathrm{tr}}}\), if \(\|X_{l}^{\intercal }(y -\sum _{i\neq l}X_{i}\hat{\beta }_{i})\| <\lambda\) then \(\hat{\beta }_{l}\) is zero; otherwise it satisfies \(\hat{\beta }_{l} = (X_{l}^{\intercal }X_{l} +\lambda /\|\hat{\beta }_{l}\|)^{-1}X_{l}^{\intercal }r_{l}\), where \(r_{l} = y -\sum _{i\neq l}X_{i}\hat{\beta }_{i}\).

Analogously, the solution to the Group Lasso penalization for our problem can be characterized as follows: if \(\|\phi (x_{l})^{\intercal }(\varPhi ^{\intercal }\alpha -\sum _{i\neq l}\phi (x_{i})\hat{\beta }_{i})\| <\lambda\) then \(\hat{\beta }_{l}\) is zero; otherwise it satisfies \(\hat{\beta }_{l} = (\phi (x_{l})^{\intercal }\phi (x_{l}) +\lambda /\|\hat{\beta }_{l}\|)^{-1}\phi (x_{l})^{\intercal }r_{l}\), where \(r_{l} =\varPhi ^{\intercal }\alpha -\sum _{i\neq l}\phi (x_{i})\hat{\beta }_{i}\). The Group Lasso penalization can be solved by a blockwise coordinate descent procedure, as shown in [65]. The time complexity of the approach is \(O(\text{maxiter} \cdot k^{2}N_{\mathrm{tr}}^{2})\), where maxiter is the maximum number of iterations specified for the coordinate descent procedure and k is the number of clusters obtained via KSC. From our experiments we observed that, on average, ten iterations suffice for convergence.
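A sketch of this blockwise coordinate descent in kernel coordinates (so that \(\phi\) is never formed explicitly) is given below. It assumes the usual 1/2 convention on the quadratic term, applies the threshold group-wise with the weight \(\sqrt{\rho _{l}}\) of Eq. (16), and uses illustrative names throughout:

```python
# Sketch of blockwise coordinate descent for the Group Lasso objective (16),
# expressed through kernel entries only (phi(x_l)^T phi(x_i) = Omega[l, i]).
import numpy as np

def group_lasso_reduced_set(Omega, alpha, labels, lam, max_iter=10):
    N = Omega.shape[0]
    k = alpha.shape[1] + 1
    # sqrt(rho_l): fraction of training points in the cluster of point l
    counts = np.bincount(labels, minlength=k) / N
    sqrt_rho = np.sqrt(counts[labels])
    beta = np.zeros_like(alpha)
    target = Omega @ alpha                           # phi(x_l)^T Phi^T alpha, row by row
    for _ in range(max_iter):
        for l in range(N):
            # s_l = phi(x_l)^T (Phi^T alpha - sum_{i != l} phi(x_i) beta_i)
            s = target[l] - Omega[l] @ beta + Omega[l, l] * beta[l]
            norm_s = np.linalg.norm(s)
            if norm_s <= lam * sqrt_rho[l]:
                beta[l] = 0.0                        # group l is dropped from the reduced set
            else:
                beta[l] = s * (1.0 - lam * sqrt_rho[l] / norm_s) / Omega[l, l]
    reduced_set = np.flatnonzero(np.linalg.norm(beta, axis=1) > 0)
    return beta, reduced_set
```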

Concerning the re-weighted \(L_{1}\) procedure, we adapt the algorithm proposed for classification in [22] and use it to obtain the reduced set in our clustering setting:

$$\displaystyle{ \begin{array}{ll} \mathop{\text{min}}\limits_{\beta \in \mathbb{R}^{N_{\mathrm{tr}}\times (k-1)}} & \|\varPhi ^{\intercal }\alpha -\varPhi ^{\intercal }\beta \|_{2}^{2} +\rho \sum _{ i=1}^{N_{\mathrm{tr}}}\epsilon _{i} +\|\varLambda \beta \|_{ 2}^{2} \\ \text{such that} &\|\beta _{i}\|_{2}^{2} \leq \epsilon _{i},i = 1,\ldots,N_{\mathrm{tr}} \\ & \epsilon _{i} \geq 0,\end{array} }$$
(17)

where \(\varLambda\) is a matrix of the same size as β, i.e. \(\varLambda \in \mathbb{R}^{N_{\mathrm{tr}}\times (k-1)}\). The term \(\|\varLambda \beta \|_{2}^{2}\), together with the constraints \(\|\beta _{i}\|_{2}^{2} \leq \epsilon _{i}\), corresponds to the \(L_{0}\)-norm penalty on the β matrix. \(\varLambda\) is initialized as a matrix of ones, so that each element of β is given an equal chance to shrink to zero. The constraints of the optimization problem force all the components of a row \(\beta _{i} \in \mathbb{R}^{k-1}\) to shrink to zero together, which helps to overcome the problem of sparsity per component explained in [3]. The variable ρ is a regularizer which controls the amount of sparsity introduced by solving this optimization problem.

In Fig. 5 an example of clustering obtained using the Group Lasso formulation (16) on a toy dataset is depicted. We can notice how the sparse KSC model is able to obtain high quality generalization using only four points in the training set.

Fig. 5

Sparse KSC on toy dataset. (Top) Gaussian mixture with three highly overlapping components. (Center) clustering results, where the reduced set points are indicated with red circles. (Bottom) generalization boundaries

4 Applications

The KSC algorithm has been successfully used in a variety of applications in different domains. In the next sections we will illustrate various results obtained in different fields such as computer vision, information retrieval and power load consumer segmentation.

4.1 Image Segmentation

Image segmentation consists in partitioning a digital image into multiple regions, such that pixels in the same group share a certain visual content. In the experiments performed using KSC, only the color information is exploited in order to segment the given images. More precisely, a local color histogram is computed over a 5 × 5 pixel window around each pixel, using minimum variance color quantization with 8 levels. Then, in order to compare two histograms h (i) and h (j), the positive definite χ 2 kernel \(K(h^{(i)},h^{(j)}) =\exp (-\dfrac{\chi _{ij}^{2}} {\sigma _{\chi }^{2}} )\) has been adopted [19]. The symbol \(\chi _{ij}^{2}\) denotes the \(\chi ^{2}\) statistic used to compare two probability distributions [55], and \(\sigma _{\chi }\), as usual, indicates the bandwidth of the kernel. In Fig. 6 an example of segmentation obtained using the basic KSC algorithm is given.
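A minimal sketch of the \(\chi ^{2}\) kernel computation between precomputed local color histograms (one common convention for the \(\chi ^{2}\) statistic, including the factor 1/2, is assumed; names are illustrative):

```python
# Sketch of the chi-square kernel K(h_i, h_j) = exp(-chi2_ij / sigma_chi^2) between
# local colour histograms; histogram extraction (5x5 window, 8 quantization levels)
# is assumed to be done beforehand. Written for clarity, not memory efficiency.
import numpy as np

def chi2_kernel(H, sigma_chi2, eps=1e-12):
    """H: N x B matrix of colour histograms (one row per pixel)."""
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + eps       # eps guards empty histogram bins
    chi2 = 0.5 * (num / den).sum(axis=-1)           # chi^2 statistic (one common convention)
    return np.exp(-chi2 / sigma_chi2)
```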

Fig. 6

Image segmentation. (Left) original image. (Right) KSC segmentation

4.2 Scientific Journal Clustering

We present here an integrated approach for clustering scientific journals using KSC. Textual information is combined with cross-citation information in order to obtain a coherent grouping of the scientific journals and to improve over existing journal categorizations. The number of clusters k is fixed to 22 in this scenario, since we want to compare the results with the 22 essential science indicators (ESI) fields shown in Table 2.

Table 2 The 22 science fields according to the essential science indicators (ESI)

The data correspond to more than six million scientific papers indexed by the Web of Science (WoS) in the period 2002–2006. The manuscript types considered are article, letter, note, and review. Textual information has been extracted from the title, abstract, and keywords of each paper, together with citation information. From these data, the resulting number of journals under consideration is 8305.

The two resulting datasets contain textual and cross-citation information and are described as follows:

  • Term/Concept by Journal dataset: The textual information was processed using the term frequency–inverse document frequency (TF-IDF) weighting procedure [6]. Terms occurring in only one document, as well as stop words, were excluded from the analysis. The Porter stemmer was applied to the remaining terms in the abstract, title, and keyword fields. This processing leads to a term-by-document matrix of around six million papers and 669,860 terms. The final journal-by-term dataset is an 8305 × 669,860 matrix. Additionally, latent semantic indexing (LSI) [13] was performed on this dataset to reduce the term dimensionality to 200 factors.

  • Journal cross-citation dataset: A different way of analyzing cluster information at the journal level is through a cross-citation graph. This graph contains the aggregated citations between papers, forming a journal-by-journal cross-citation matrix. The direction of the citations is not taken into account, which leads to an undirected graph and a symmetric cross-citation matrix.

The cross-citation and the text/concept datasets are integrated at the kernel level by considering the following linear combination of kernel matrices:

$$\displaystyle{\varOmega ^{\text{integr}} =\rho \varOmega ^{\text{cross-cit}} + (1-\rho )\varOmega ^{\text{text}}}$$

where 0 ≤ ρ ≤ 1 is a user-defined integration weight, whose value can be obtained from internal validation measures for cluster distortion, \(\varOmega ^{\text{cross-cit}}\) is the cross-citation kernel matrix with ijth entry \(\varOmega _{ij}^{\text{cross-cit}} = K(x_{i}^{\text{cross-cit}},x_{j}^{\text{cross-cit}})\), \(x_{i}^{\text{cross-cit}}\) is the ith journal represented in terms of cross-citation variables, \(\varOmega ^{\text{text}}\) is the textual kernel matrix with ijth entry \(\varOmega _{ij}^{\text{text}} = K(x_{i}^{\text{text}},x_{j}^{\text{text}})\), \(x_{i}^{\text{text}}\) is the ith journal represented in terms of textual variables, and \(i,j = 1,\mathop{\ldots },N\).
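A sketch of the kernel-level integration; the choice of a cosine kernel for the TF-IDF/LSI representation is an assumption for illustration and is not prescribed above:

```python
# Sketch of the kernel-level fusion of the cross-citation and textual views.
import numpy as np

def cosine_kernel(X, eps=1e-12):
    """Cosine similarity kernel, a natural choice for TF-IDF / LSI journal vectors."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    return Xn @ Xn.T

def integrated_kernel(Omega_crosscit, Omega_text, rho):
    """Omega_integr = rho * Omega_crosscit + (1 - rho) * Omega_text, with 0 <= rho <= 1."""
    return rho * Omega_crosscit + (1.0 - rho) * Omega_text
```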

The KSC outcomes are reported in Tables 3 and 4. In particular, Table 3 shows the results in terms of internal validation of cluster quality, namely the mean silhouette value (MSV) [57] and Modularity [49, 50], and in terms of agreement with existing categorizations, namely the adjusted Rand index (ARI) [23] and the normalized mutual information (NMI) [60]. Finally, Table 4 shows the top 20 terms per cluster, which indicate a coherent structure and illustrate that KSC is able to detect the text categories present in the corpus.

Table 3 Text clustering quality
Table 4 Text clustering results

4.3 Power Load Clustering

Accurate power load forecasts are essential in electrical grids and markets, particularly for planning and control operations [5]. In this scenario, we apply KSC to group smart meter power load series that are similar, in order to aggregate them and improve the forecasting accuracy of the global consumption signal. The idea is to fit a forecasting model on the aggregated load of each cluster (aggregator). The k predictions are then summed to form the final disaggregated prediction. The number of clusters and the time series used for each aggregator are determined via KSC [1]. The forecasting model used is a periodic autoregressive model with exogenous variables (PARX) [18]. Table 5 (taken from [1]) shows the model selection and disaggregation results. Several kernels appropriate for time series were tried, including a Vector Autoregressive (VAR) kernel (Cuturi, Autoregressive kernels for time series, arXiv), a Triangular Global Alignment (TGA) kernel (Cuturi, Fast Global Alignment Kernels, ICML 2011), and an RBF kernel with Spearman's distance. The results show an improvement of 20.55% in forecasting accuracy with the similarity based on Spearman's correlation, compared to not using clustering at all (i.e., aggregating all smart meters). The BLF was also able to detect the number of clusters that maximizes the improvement (six clusters in this case), as shown in Fig. 7.

Table 5 Kernel comparisons for power load clustering data
Fig. 7

Power load clustering results. Visualization of the six clusters obtained by KSC. (Top) aggregated load in summer. (Bottom) aggregated load in winter. The daily cycles are clearly visible and the clusters capture different characteristics of the consumption pattern. This clustering result improves the forecasting accuracy by 20.55%

4.4 Big Data

KSC has been shown to be effective in handling big data at a desktop PC scale. In particular, in [39], we focused on community detection in big networks containing millions of nodes and several million edges, and we explained how to scale our method by means of three steps. First, we select a smaller subgraph that preserves the overall community structure by using the FURS algorithm [38], where hubs in dense regions of the original graph are selected via a greedy activation–deactivation procedure. In this way the kernel matrix related to the extracted subgraph fits in main memory and the KSC model can be quickly trained by solving a smaller eigenvalue problem. Then the BAF criterion described in Sect. 3.1.3, which is memory- and computationally efficient, is used for model selection. Finally, the out-of-sample extension is used to infer the cluster memberships of the remaining nodes forming the test set (which is divided into chunks due to memory constraints).
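The chunked out-of-sample extension mentioned above can be sketched as follows, reusing the illustrative rbf_kernel and code-book decoding from the earlier sketches (for network data the kernel would be replaced by a suitable community kernel):

```python
# Sketch of the chunked out-of-sample extension: test points are processed in blocks
# so that only a (chunk x N_tr) kernel slice is ever held in memory.
import numpy as np

def predict_in_chunks(Xtest, Xtr, alpha, b, codebook, sigma2, chunk=10000):
    labels = np.empty(Xtest.shape[0], dtype=int)
    for start in range(0, Xtest.shape[0], chunk):
        block = Xtest[start:start + chunk]
        Omega_block = rbf_kernel(block, Xtr, sigma2)      # chunk x N_tr slice only
        e_block = Omega_block @ alpha + b                 # projections, Eq. (5)
        s = np.sign(e_block).astype(int)
        ham = np.array([[np.sum(row != c) for c in codebook] for row in s])
        labels[start:start + chunk] = ham.argmin(axis=1)  # nearest codeword per point
    return labels
```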

In [42] the hierarchical clustering technique summarized in Sect. 3.3.2 has been used to perform community detection in real-life networks at different resolutions. In an experiment conducted on seven networks from the Stanford SNAP datasets (http://snap.stanford.edu/data/index.html), the method has been shown to detect complex structures at various hierarchical levels, without suffering from any resolution limit. This is not the case for other state-of-the-art algorithms like Infomap [56], Louvain [9], and OSLOM [24]. In particular, we have observed that the Louvain method is often not able to detect high-quality clusters at finer levels of granularity ( < 1000 clusters). On the other hand, OSLOM cannot identify good-quality coarser clusters (i.e., the number of clusters detected is always > 1000), and the Infomap method produces only two levels of hierarchy. Moreover, while in general the Louvain method works best in terms of the Modularity criterion, it always performs worse than hierarchical KSC w.r.t. cut-conductance [35]. Regarding Infomap, in most cases the clusters at a given level of the hierarchy perform well w.r.t. only one quality metric. Concerning OSLOM, in the majority of the datasets this algorithm performs worse than KSC in terms of both Modularity and cut-conductance.

An illustration of the community structure obtained on the Cond-Mat network of collaborations between authors of papers submitted to the Condensed Matter category of arXiv [34] is shown in Fig. 8. This network is formed by 23,133 nodes and 186,936 edges. For the analysis and visualization of bigger networks, and for a detailed comparison of KSC with other community detection methods in terms of computational efficiency and quality of the detected communities, the reader can refer to [42].

Fig. 8

Large-scale community detection. Community structure detected at one particular hierarchical level by the AH-KSC method summarized in Sect. 3.3.2, related to the Cond-Mat collaboration network

Finally, in [44], we propose a deterministic method to obtain subsets of big vector data which are good representatives of the inherent clustering structure. We first convert the large-scale dataset into a sparse undirected k-NN graph using a Map-Reduce framework. Then, the FURS method is used to select a few representative nodes from this graph, corresponding to certain data points in the original dataset. These points are then used to quickly train the KSC model, while the generalization property of the method is exploited to compute the cluster memberships for the remainder of the dataset. In Fig. 9 a summary of all these steps is sketched.

Fig. 9

Big data clustering. (Top) illustration of the steps involved in clustering big vector data using KSC. (Bottom) map-reduce procedure used to obtain a representative training subset by constructing a k-NN graph

5 Conclusions

In this chapter we have discussed the kernel spectral clustering (KSC) method, which is cast in an LS-SVM learning framework. We have explained that, as in the classification case, the clustering model can be trained on a subset of the data with optimal tuning parameters found during the validation stage. The model is then able to generalize to unseen test data thanks to its out-of-sample extension property. Beyond the core algorithm, some extensions of KSC that produce probabilistic and hierarchical outputs have been illustrated. Furthermore, two different approaches to sparsify the model, based on the Incomplete Cholesky Decomposition (ICD) and on \(L_{1}\) and \(L_{0}\) penalties, have been described; these allow large-scale data to be handled at a desktop scale. Finally, a number of applications in various fields ranging from computer vision to text mining have been examined.