1 Introduction

Semi-supervised clustering based on kernel learning is a popular research topic in machine learning, since one can incorporate the information of a limited number of labeled data or a set of pairwise constraints into the kernel learning framework [1]. For clustering, the pairwise constraints provide useful information about which data pairs belong to the same category and which do not. To learn such kernel matrices, Kulis et al. [2] proposed to construct a graph-based kernel matrix that unifies vector-based and graph-based semi-supervised clustering. A further refinement on learning kernel matrices for clustering was investigated by Li et al. [3]: data are implicitly projected onto a feature space that is a unit hyperball, subject to a collection of pairwise constraints. However, these clustering algorithms via kernel matrices either cannot scale well with the increasing number of pairwise constraints and the amount of data, or lack a theoretical guarantee for the positive semi-definiteness of the kernel matrices. From another perspective, Yeung et al. [4] proposed an efficient kernel learning algorithm through low-rank matrix approximation. In their algorithm, however, the kernel matrix is assumed to be a linear combination of several base kernel matrices. Since this may reduce the dimension of the hypothesis kernel space, we call such algorithms parametric kernel learning. In addition, Cortes et al. [5] proposed a kernel learning algorithm based on non-linear combinations of kernels, which generalizes the linear-combination case but still lies in the framework of parametric kernel learning. Addressing these two limitations is the major purpose of this paper.

On the other hand, we note that many algorithms based on the graph embedding framework achieve enhanced discriminant ability by utilizing marginal information, e.g., pushing dissimilar data points near the margin as far apart as possible while compacting the points in the same class [6, 7]. It is therefore worthwhile to generalize the graph embedding framework to kernel learning.

Based on the aforementioned goals, in this paper we propose a new scalable kernel learning framework, NPKL (Non-Parametric Kernel Learning with robust pairwise constraints), and apply it to semi-supervised clustering. First, we generalize the graph embedding framework to a feature space assumed to be a (possibly infinite-dimensional) subspace of the l 2 Hilbert space with unit norm, similar to [3]. The unknown feature projection function ϕ is then learned implicitly by transforming the criterion of graph embedding (i.e., maximizing the sum of distances of between-class data pairs while minimizing that of within-class data pairs) into an SDP problem. To obtain a smoother predictive function, a smoothing term based on a Laplacian regularizer is introduced. Ideally, data from the same class would then be projected to the same location in the feature space, while the distances between the locations of different classes are made as large as possible, as illustrated in Fig. 1. We propose two algorithms to solve this problem. The first optimizes the objective function directly by solving an SDP problem; we call it TRAnsductive Graph Embedding Kernel (TRAGEK) learning, since the SDP problem is derived from a transductive graph embedding formulation. In TRAGEK, it is not necessary to explicitly specify which pairs of data points should lie close, and the running time is much less sensitive to the number of pairwise constraints than Li et al.'s work [3]. However, the SDP problem in TRAGEK limits the application of the algorithm to large-scale datasets. To alleviate this problem, we propose to solve the SDP problem via a constrained gradient descent algorithm that iteratively projects the unconstrained solutions onto the cone formed by the constraints. Furthermore, we divide the whole dataset into groups of sub-datasets, and the corresponding sub-kernels are learned for these sub-datasets separately. Finally, the global kernel matrix is obtained by combining these sub-kernels. In this way, not only is the positive semi-definite property of the kernel matrix preserved, but the computational complexity also scales at most linearly with the size of the dataset, which is very efficient. We call this algorithm Efficient Non-PArametric Kernel Learning (ENPAKL).

Fig. 1

With a feature projection function ϕ, data from two classes are projected from two-dimensional data space into three-dimensional feature space in which each point corresponds to one class

The remainder of this paper is organized as follows. Section 2 reviews related work on clustering using kernel learning. Section 3 formulates our problem and presents the TRAGEK algorithm. Section 4 elaborates the efficient algorithm ENPAKL. Experimental results are presented in Sect. 5, and Sect. 6 concludes the paper.

2 Clustering and kernel learning

Learning with kernels [8] is a popular research topic in machine learning. In this section, rather than reviewing the theoretical aspects of kernel learning algorithms such as [9, 10] or online kernel learning such as [11], we put our emphasis on applications of kernel learning algorithms. More specifically, we discuss work on kernel learning for semi-supervised clustering. To a certain extent, kernel learning for clustering can be viewed as metric learning, because the kernel matrix implicitly defines a distance between data points under a specific metric.

Among traditional clustering algorithms, the most frequently used ones include k-means and fuzzy k-means [12]. Recently, Yang et al. [13] proposed an improved fuzzy k-means algorithm that assigns a value of 1 to data pairs with a defined cluster score. Moreover, Trappey et al. [14] presented a fuzzy ontology schema for hierarchical clustering of documents, which addresses the inconsistency and ineffectiveness of traditional keyword-based methods. For the traditional k-means clustering algorithm, Xiong et al. [15] proposed to use the coefficient of variation (CV) as a criterion to analyze the performance of k-means under skewed data distributions.

To utilize label information to enhance clustering performance, semi-supervised clustering approaches have been proposed, which can be roughly categorized into constraint-based and metric-based ones. The former utilize either labeled data or pairwise constraints to improve the performance of clustering [16, 17], while the latter learn a more rational metric that fits the constraints by utilizing the provided label information [18, 19]. Semi-supervised kernel learning is therefore closely related to these clustering algorithms.

More specifically, Wagstaff et al. [20] modified the k-means algorithm by considering pairwise similarities and dissimilarities, known as the constrained k-means. To boost the performance of constrained k-means, Hong et al. [21] refined the assignment order of this algorithm by ranking all instances in the dataset according to their clustering uncertainty. Based on the hidden Markov random field (HMRF), Basu et al. [16] used the provided pairwise constraints in the objective function for semi-supervised clustering, while Lu and Leen [22] proposed a clustering algorithm with the must-link and cannot-link constraints using the Gaussian mixture model (GMM). Xing et al. [19] learned a distance matrix for clustering by explicitly minimizing the distances between similar samples and maximizing those between dissimilar ones. Bar-Hillel et al. [23] proposed a simpler but more efficient algorithm using the relevant component analysis (RCA). Furthermore, they proposed another metric learning algorithm [24] that learns a non-parametric distance function, but without the guarantee that the function is actually a metric.

Although kernel methods have been studied for decades, not much work has focused on learning a non-parametric kernel using only the training data. On the contrary, most work focuses on learning the kernel from some predefined base kernels [25]. Recently, kernel learning for clustering has attracted more and more attention because one can easily incorporate useful side information into the kernel learning framework [2, 26]. Earlier kernel learning algorithms mainly focus on linear or non-linear combinations of base kernels [27]. For example, Yeung et al. [4] proposed a scalable kernel learning algorithm in which low-rank kernel matrices obtained from the eigenvectors of the initial kernel matrix are used as base kernels, and a collection of optimal weights for the base kernels is learned. Cortes et al. [5] proposed to learn the kernel matrix by defining non-linear terms of the base kernels and using a projection-based gradient descent algorithm to learn the weights. One disadvantage of this algorithm is that only the must-link constraint information is incorporated. To employ both must-link and cannot-link constraints, Hoi et al. [28] proposed a kernel learning algorithm formulated as a semi-definite programming (SDP) problem. To the best of our knowledge, this is the first non-parametric kernel learning algorithm that does not need to explicitly take base kernels into consideration. Later, Zhuang et al. [29] proposed an efficient algorithm for solving the above SDP problem by introducing an extra low-rank constraint on the objective function. Furthermore, assuming that the feature space is a unit hyperball in which must-link data pairs are constrained to coincide and cannot-link pairs should be orthogonal to each other, Li et al. [3] proposed another SDP-based algorithm for kernel learning. One problem shared by the SDP-related algorithms introduced above is their computational complexity; how to avoid this complexity is one of the major concerns of this paper.

3 TRAGEK: transductive non-parametric kernel learning

Our non-parametric kernel learning problem assumes that the feature space of the data is a subspace of the l 2 Hilbert space with unit norm. More specifically, given a data point x, there exists a projection ϕ from the data space to the feature space endowed with a unit norm, i.e., \(\|\phi(x)\| = 1\). In this way, data are mapped onto the surface of a unit hyperball in which data of different classes can be separated easily. To find such a mapping function ϕ, we generalize the graph embedding [6] to the kernel space and transform the problem into semi-definite programming with pairwise constraints. We further introduce regularization terms, such as the Laplacian regularizer, to smooth the prediction function. Finally, the kernel k-means clustering algorithm is performed on the learned kernel matrix for clustering.

3.1 Transductive kernel learning with graph embedding

One goal of graph embedding is to learn a low-dimensional discriminant subspace in which the sum of within-class distances is minimized while the sum of between-class distances is maximized [6]. This idea is formulated in Eq. 1:

$$F_g=\mathop{\min}\limits_Y\frac{tr\{YLY^T\}}{tr\{YL^pY^T\}}, $$
(1)

where \(Y = (y_1, y_2, \ldots, y_M)\), \(y_i\in\Re^m\), is the data representation in the feature space and M is the number of samples. \(L = D - W\) and \(L^p = D^p - W^p\) are two graph Laplacian matrices, where \(D(i,i) = \sum_j W(i,j)\) and \(D^p(i,i) = \sum_j W^p(i,j)\) are diagonal matrices. The two similarity matrices W and \(W^p\) represent the within-class and between-class relationships of the data points, respectively, and are defined according to the pairwise constraints as:

$$ W(i, j) = \left\{ \begin{array}{ll} 1,& i \hbox{ and } j \ \hbox{belong to the same class}\\ 0,& \hbox{otherwise,} \end{array} \right.\qquad W^p(i, j) = \left\{ \begin{array}{ll} 1,& i \hbox{ and } j \ \hbox{belong to different classes}\\ 0,& \hbox{otherwise.} \end{array} \right. $$
(2)
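
The construction of W, W^p and the corresponding Laplacians from a set of must-link and cannot-link pairs is mechanical; the following minimal NumPy sketch (function and variable names are ours) makes it concrete.

```python
import numpy as np

def constraint_graphs(n, must_link, cannot_link):
    """Build the within-class (W) and between-class (W^p) similarity
    matrices of Eq. 2 and their graph Laplacians L = D - W, L^p = D^p - W^p.

    must_link / cannot_link are lists of index pairs (i, j)."""
    W = np.zeros((n, n))
    Wp = np.zeros((n, n))
    for i, j in must_link:              # same-class pairs
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_link:            # different-class pairs
        Wp[i, j] = Wp[j, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W      # D - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp   # D^p - W^p
    return W, Wp, L, Lp
```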

For better generalization, the two quadratic forms in Eq. 1 are lifted into linear forms in the kernel space as:

$$F_1 = tr\{YLY^T\}= tr\{LY^TY\}= \langle l, K_v\rangle, $$
(3)

where l = vec(L) is the vectorization of the matrix L, and \(K_v\) is the vectorization of the kernel matrix K defined by \(K(i, j) = \langle y_i, y_j\rangle\).
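
The identity tr{YLY^T} = ⟨l, K_v⟩ used above is just a rewriting of the trace as an inner product of vectorized matrices; a quick NumPy check (with arbitrary made-up Y and a symmetric L) illustrates it.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 5))            # m-dimensional embedding of M = 5 points
L = rng.standard_normal((5, 5))
L = L + L.T                                # any symmetric matrix serves here
K = Y.T @ Y                                # kernel matrix K(i, j) = <y_i, y_j>

lhs = np.trace(Y @ L @ Y.T)                # tr{Y L Y^T} = tr{L Y^T Y}
rhs = np.dot(L.ravel(), K.ravel())         # <l, K_v> with element-wise vectorization
assert np.allclose(lhs, rhs)
```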

In contrast to graph embedding, the goal of TRAGEK is to cluster data in the feature space defined on the unit-norm subspace of the l 2 Hilbert space. The inner product of two vectors in the feature space is easily obtained using the reproducing property of the reproducing kernel Hilbert space:

$$K(\phi(x), \phi(y)) = k_{\phi(y)}(\phi(x)) = \langle\phi(x), \phi(y)\rangle, $$
(4)

where \(k_{\phi(y)}(\cdot)\in {\mathcal{H}}\), x and y are two data points in the data space, and ϕ is the mapping function to be learned. Furthermore, in our transductive inference, the matrices W and \(W^p\) are defined using the pairwise constraints as in Eq. 2. Based on this, the objective function of TRAGEK kernelizes the criterion of graph embedding [6] as:

$$ \begin{aligned} &F_t = \min_K\frac{\langle l, K_v\rangle}{\langle l^p, K_v\rangle}\\ &\hbox{s.t. } l^TK_v > 0,\quad {l^p}^TK_v > 0, \\ &K(i, i) = 1, \quad \hbox{for all } i,\\ & K(i, j)\leq 1,\quad \hbox{for all } i, j. \end{aligned} $$
(5)

This objective function means that the learned kernel K should place within-class data as close as possible and between-class data as far apart as possible in the feature space. The first pair of constraints is required by the positive semi-definite property of the Laplacian matrices, and the second and third constraints stem from the assumption that data in the feature space lie on the surface of a hyperball with unit radius.

3.2 Conic optimization programming relaxation

To transform this problem into a convex optimization problem which has a unique optimal solution and polynomial time computational complexity, we first prove Theorem 1 as follows.

Theorem 1

The nonlinear optimization problem defined in Eq. 5 can be relaxed to a second order cone program as:

$$ \begin{aligned} &\min\,:\,\{t\}\\ &s.t.\ t + \langle l^p, K_v\rangle \geq \left\|\begin{array}{l}\langle l, K_v\rangle \\ t - \langle l^p, K_v\rangle \end{array} \right\|,\\ &l^TK_v > 0,\quad {l^p}^TK_v > 0,\\ & K(i, i) = 1, \quad for \ all\ i,\\ & K(i, j) \leq 1,\quad for\ all\ i, j.\\ \end{aligned} $$
(6)

Proof

We first relax \(F_t\) in Eq. 5 as:

$$ F_t^{\prime} = \mathop{\min}\limits_f\frac{(\langle l, K_v\rangle)^2}{4\cdot\langle l^p, K_v\rangle} $$
(7)

This does not significantly alter the optimization problem, since both \(\langle l, K_v\rangle\) and \((\langle l, K_v\rangle)^2\) are monotonically increasing when \(l^TK_v > 0\). Now let us introduce an extra variable t such that \(t\geq\frac{(\langle l, K_v\rangle )^2}{4\cdot\langle l^p, K_v\rangle }\). As a consequence, the optimization problem in Eq. 7 is equivalent to:

$$ \begin{aligned} & \min\, :\, \{t\}\\ & \hbox{s.t. } t \geq \frac{(\langle l, K_v\rangle )^2}{4\cdot\langle l^p, K_v\rangle }.\\ \end{aligned} $$
(8)

Furthermore, the constraint in Eq. 8 can be further transformed as:

$$ \begin{aligned} &4t\cdot(\langle l^p, K_v\rangle ) \geq (\langle l, K_v\rangle) ^2\\ &\Leftrightarrow (t + \langle l^p, K_v\rangle)^2 \geq (\langle l, K_v\rangle) ^2 + (t - \langle l^p, K_v\rangle )^2, \end{aligned} $$

which is a second order cone constraint:

$$t+\langle l^p, K_v\rangle \geq \left\|\begin{array}{l} \langle l, K_v\rangle \\ t - \langle l^p, K_v \rangle \end{array} \right\|. $$
(9)

Combining Eqs. 7, 8 and 9 we get the conclusion in Theorem 1.

To further simplify Eq. 5, we here introduce Theorem 2.

Theorem 2

If K is a positive semi-definite matrix of size n × n, then

$$ K(i, j) \leq \max\{K(k, k)\},\quad 1 \leq i, j, k \leq n. $$
(10)

Proof

Since matrix K is positive semi-definite, we can decompose K as:

$$ K = (k_1, k_2, \ldots, k_n)^T(k_1, k_2, \ldots, k_n), $$
(11)

where \(k_i\) is a vector of arbitrary dimension. So we have:

$$ \begin{aligned} K(i, j) &= \langle k_i, k_j\rangle \\ &\leq\sqrt{\langle k_i, k_i\rangle\cdot\langle k_j, k_j\rangle}\\ &\leq \max\{K(i, i), K(j, j)\}\\ &\leq \max\{K(k, k)\}, 1 \leq k \leq n. \end{aligned} $$
(12)

By Theorem 2, we can introduce an extra positive semi-definite constraint to replace the last \(n^2\) inequality constraints in Eq. 6, resulting in:

$$ \begin{aligned} &\min \, :\, \{t\}\\ & \hbox{s.t. } t + \langle l^p, K_v\rangle \geq \left\|\begin{array}{l} \langle l, K_v\rangle\\ t - \langle l^p, K_v\rangle\end{array} \right\|,\\ & K \succcurlyeq 0,\\ & K(i, i) = 1,\quad \hbox{for all } i,\\ & l^TK_v > 0,\quad {l^p}^TK_v > 0. \end{aligned} $$
(13)

3.3 Smoothness controlling

While the number of constraints has been greatly reduced, Eq. 13 still needs refinement, since the resulting prediction function is not smooth enough in the Hilbert space. Note that if all the elements of K in Eq. 13 were identical, we would have \(\langle l, K_v\rangle = 0\) and \(\langle l^p, K_v\rangle = 0\). Consequently, the optimization problem would be infeasible, since there would be no interior points in the second order cone. We regard this case as over-smoothness. Therefore, we propose two strategies to smooth the predicted function while avoiding over-smoothness.

Note that the "smoothness" of a manifold, which is measured by the Laplace-Beltrami operator on Riemannian manifolds, can be substituted by a discrete analogue defined as the graph Laplacian on a graph [30]. Therefore, we can employ a Laplacian regularizer, denoted as S, to smooth the prediction function in its Hilbert space. Similar to Li et al.'s work [3], we introduce a refined regularizer by incorporating a globally normalized graph Laplacian into our optimization framework, which is defined as:

$$ \begin{aligned} S &= \sum_{i, j = 1}^nW^{\prime}(i, j)\left\|\frac{\phi(x_i)}{\sqrt{D^{\prime}(i, i)}} - \frac{\phi(x_j)}{\sqrt{D^{\prime}(j, j)}}\right\|^2\\ &= tr\{(I -(D^{\prime})^{-\frac{1}{2}}W^{\prime}(D^{\prime})^{-\frac{1}{2}})K\}\\ &= \langle \bar{l}, K_v\rangle\end{aligned} $$
(14)

where K and \(K_v\) are defined as before, I is the identity matrix, and W′ is defined as:

$$ W^{\prime}(i, j) = \left\{ \begin{array}{ll} e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}, & i \neq j\\ 0, &i = j, \end{array} \right. $$
(15)

where σ is a scale factor, \((I -(D^{\prime})^{-\frac{1}{2}}W^{\prime}(D^{\prime})^{-\frac{1}{2}})\) is the normalized Laplacian matrix corresponding to W′, and \({\bar{l}} = vec(I - (D^{\prime})^{-\frac{1}{2}}W^{\prime}(D^{\prime})^{-\frac{1}{2}})\) is its vectorization. This formulation enforces smoothness over all the data globally. Finally, adding this term as a regularizer to the optimization problem in Eq. 13 results in the following optimization problem:

$$ \begin{aligned} &\min\, :\,\{t + \lambda S\}\\ & \hbox{s.t. } t +\langle l^p, K_v\rangle \geq \left\|\begin{array}{l} \langle l, K_v\rangle \\ t - \langle l^p, K_v\rangle \end{array} \right\|,\\ &K \succcurlyeq 0,\\ & K(i, i) = 1,\quad \hbox{for all } i\\ &\langle l, K_v\rangle > 0,\quad \langle l^p, K_v\rangle >0, \end{aligned} $$
(16)

where λ is a parameter controlling the degree of smoothness of the predicted function.
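
The smoother S of Eqs. 14-15 only requires the data, the scale factor σ, and the normalized graph Laplacian; a minimal NumPy sketch of its construction (the function name and the dense pairwise-distance computation are our choices) is given below.

```python
import numpy as np

def laplacian_smoother(X, sigma):
    """Build the normalized Laplacian of Eqs. 14-15, whose inner product
    with the kernel matrix gives the global smoothness term S = <l_bar, K_v>."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    W_prime = np.exp(-sq / (2.0 * sigma ** 2))                  # heat-kernel affinity, Eq. 15
    np.fill_diagonal(W_prime, 0.0)                              # W'(i, i) = 0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W_prime.sum(axis=1)))
    L_bar = np.eye(n) - d_inv_sqrt @ W_prime @ d_inv_sqrt       # I - D'^{-1/2} W' D'^{-1/2}
    return L_bar                # S = np.sum(L_bar * K) for a candidate kernel K
```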

As stated before, over-smoothness is likely to occur. In addition, dropping the last two constraints \(l^{T} K_{v} > 0\) and \({l^p}^T K_v > 0\) is prone to numerical problems. We solve this by adding two extra terms to the optimization framework.

[Algorithm 1: the TRAGEK algorithm]

Recall that our goal is to make within-class samples as close as possible and between-class samples as far apart as possible; this can be formulated via the following two terms:

$$f_1 = \min\ tr\{YLY^T\}\equiv \min\{-\langle w, K_v\rangle\}, $$
(17)
$$ f_2 = \max\ tr\{YL^pY^T\}\equiv \min\{\langle w^p, K_v\rangle \}, $$
(18)

where w and \(w^p\) are the vectorizations of the two similarity matrices W and \(W^p\) defined in Eq. 2. Denoting these two regularizers as \(S_1 = -\langle w, K_v\rangle\) and \(S_2 = \langle w^p, K_v\rangle\), we incorporate them into the objective function to obtain the final optimization formula:

$$ \begin{aligned} &\min \,: \,\{t + \lambda S + \lambda_1S_1 + \lambda_2S_2\}\\ & \hbox{s.t. } t + \langle l^p, K_v\rangle \geq \left\|\begin{array}{l} \langle l, K_v\rangle \\ t - \langle l^p, K_v\rangle \end{array} \right\|,\\ &K \succcurlyeq 0,\\ &K(i, i) = 1, \quad\hbox{for all } i, \end{aligned} $$
(19)

where the two parameters λ1 and λ2 control the weights of the two graphs. With the two regularization terms introduced, the two terms \(\langle l, K_v\rangle\) and \(\langle l^p, K_v\rangle\) are enforced to be larger than zero. Therefore, in practice we can drop the last two constraints in Eq. 16 without causing any problem. Furthermore, a remarkable advantage of Eq. 19 is that it is a conic optimization program (also an SDP problem), which can be solved using popular conic optimization software such as SeDuMi; it has polynomial time complexity and a theoretically proven \(O(\sqrt{n}\log(\frac{1}{\epsilon}))\) worst-case iteration bound [31]. TRAGEK is summarized in Algorithm 1.
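
For readers who prefer a modern modelling language to SeDuMi, the program of Eq. 19 can be written down almost verbatim. The following CVXPY sketch is our reformulation (the solver, variable names and the use of trace inner products are our choices, not part of the paper); the last two constraints of Eq. 16 are dropped, as discussed above.

```python
import cvxpy as cp

def tragek_kernel(L, Lp, L_bar, W, Wp, lam=1.0, lam1=1.0, lam2=1.0):
    """Conic program of Eq. 19 (sketch).  L, Lp, L_bar are the Laplacians
    built earlier; W, Wp the similarity matrices of Eq. 2.  For symmetric
    matrices, <l, K_v> = tr{L K}, which is what we use below."""
    n = L.shape[0]
    K = cp.Variable((n, n), PSD=True)            # K >= 0 (positive semi-definite)
    t = cp.Variable()
    S = cp.trace(L_bar @ K)                      # Laplacian smoother, Eq. 14
    S1 = -cp.trace(W @ K)                        # within-class term, Eq. 17
    S2 = cp.trace(Wp @ K)                        # between-class term, Eq. 18
    soc = cp.norm(cp.hstack([cp.trace(L @ K), t - cp.trace(Lp @ K)]))
    constraints = [t + cp.trace(Lp @ K) >= soc,  # second order cone, Eq. 9
                   cp.diag(K) == 1]              # unit norm in the feature space
    prob = cp.Problem(cp.Minimize(t + lam * S + lam1 * S1 + lam2 * S2), constraints)
    prob.solve()                                 # any SDP-capable solver works
    return K.value
```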

4 ENPAKL: an efficient kernel learning algorithm

Although TRAGEK is a convex optimization problem, which means that a global optimal solution exists, its computational cost is often relatively high and it often yields unstable solutions on large datasets; furthermore, it is sensitive to parameter settings such as the choices of λ, λ1, λ2, etc. In this section, we address these problems with two strategies. First, we propose a constrained gradient descent based algorithm to make the learning procedure more stable and efficient. Second, we reduce the large-scale kernel learning problem to sub-kernel learning and combine the sub-kernels to approximate the global kernel matrix; this makes the computational complexity depend linearly on the number of data points. Experimental results in Sect. 5 show that the proposed strategies approximate the true kernels well.

4.1 Constrained gradient descent

The original optimization problem of Eq. 19 is not efficient enough. By taking advantage of the iterative projection algorithm [19], we propose a constrained gradient descent based algorithm for training. The algorithm iteratively projects the solution obtained by gradient descent onto the cones formed by the constraints.

Specifically, to avoid the SDP formulation above and reduce the computational complexity, we reformulate the original kernel learning problem of Eq. 19 in the following form:

$$ \begin{aligned} &F^{\prime} = \min_K\frac{\langle l, K_v\rangle}{\langle l^p, K_v\rangle }\\ & \hbox{s.t. } K \succcurlyeq 0, \\ &K(i, i) = 1, \quad\hbox{for all } i. \end{aligned} $$
(20)

Taking the logarithm of F′, and using the Laplacian smoother S in Eq. 19 as a regularization term, the objective function of ENPAKL is:

$$\begin{aligned} & F = \min_K\{\log(\langle l, K_v\rangle)- \log(\langle l^p, K_v\rangle) + \lambda\log(\langle \overline{l}, K_v\rangle)\}\\ &\hbox{s.t. } K \succcurlyeq 0, \\ &K(i, i) = 1, \quad \hbox{for all } i. \end{aligned} $$
(21)

It is straightforward to derive the gradient of F in Eq. 21:

$$ \frac{\partial}{\partial K_v}F = \frac{l}{\langle l, K_v\rangle} - \frac{l^p}{\langle l^p, K_v\rangle} + \lambda\frac{\overline{l}}{\langle \overline{l}, K_v\rangle}. $$
(22)

Thus, we can update the kernel matrix K v by constrained gradient descent as:

$$ \begin{aligned} & K_v^{t} = K_v^{t-1} - \omega\cdot\frac{\partial}{\partial K_v^{t-1}}F\\ & \hbox{s.t. } K^t \succcurlyeq 0, \\ &K^t(i, i) = 1,\quad \hbox{for all } i, \end{aligned} $$
(23)

where ω is the step size for the current update and t is the iteration index. Eq. 23 can actually be regarded as an optimization problem on manifolds [32]; however, instead of solving it directly on manifolds, we use the projection method, iteratively projecting the updated values onto the cones formed by the constraints until convergence. The constrained gradient descent algorithm is described in Algorithm 2.

Note that in Algorithm 2, the solutions of Eqs. 24 and 25 can be calculated based on the following theorems:

Theorem 3

The solution to Eq. 24 is obtained by setting all the negative eigenvalues of \(K_v\) to 0, that is, \(K = V\max(D, 0)V^T\), where V and D are the eigenvectors and eigenvalues of \(K_v\).

[Algorithm 2: constrained gradient descent]

Theorem 4

The solution to Eq. 25 is equal to K′, except that all the diagonal elements of K′ are set to 1.

A similar problem and its proof for Theorem 3 can be found in [33]. Here we prove Theorem 4.

Proof of Theorem 4

Unfolding the norm and omitting the terms independent of K, we can rewrite Eq. 25 as:

$$ \begin{aligned} & K^{\prime\prime} = {\arg\min}_{K}tr\{KK^T - 2K{K^{\prime}}^T\}\\ & \hbox{s.t. } tr\{KE_i\} = 1. \end{aligned} $$
(26)

where \(E_i\) is a matrix of the same size as K, with all elements being 0 except the ith diagonal element, which is 1. The corresponding Lagrangian function with Lagrangian multipliers \(\lambda_i\) is:

$$g(K, \lambda) = tr\left\{KK^T - 2KK^{\prime T} - \sum\limits_i\lambda_i(tr\{KE_i\} - 1)\right\}.$$
(27)

Taking the derivative of g(K, λ) with respect to K, we have:

$$ \begin{aligned} \frac{\partial}{\partial K}g(K, \lambda) &= 2K - 2K^{\prime} -\sum_i\lambda_iE_i\\ &= 2K - 2K^{\prime} - diag(\lambda). \end{aligned} $$
(28)

Setting Eq. 28 to zero leads to:

$$ K = K^{\prime} + \frac{1}{2}diag(\lambda). $$
(29)

Recall that the constraint is \(diag(K)=1\), where 1 is a vector with all elements being 1. Noting the form of the solution of K in Eq. 29, we obtain the solution K by setting the diagonal elements of K′ to 1. This completes the proof.
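
Putting Eqs. 21-23 and Theorems 3-4 together, the constrained gradient descent of Algorithm 2 reduces to a gradient step followed by two projections. The sketch below is our NumPy rendering; the step size, iteration count and the small constant added to the denominators are our choices, and the two projections are applied once per iteration rather than until convergence.

```python
import numpy as np

def enpakl_gradient_descent(L, Lp, L_bar, K0, lam=1.0, step=0.01, n_iter=200, eps=1e-8):
    """Constrained gradient descent sketch for Eqs. 21-23: take a gradient
    step on F, project onto the PSD cone (Theorem 3), reset the diagonal
    to 1 (Theorem 4), and repeat."""
    K = K0.copy()
    for _ in range(n_iter):
        grad = (L / (np.sum(L * K) + eps)           # gradient of Eq. 22,
                - Lp / (np.sum(Lp * K) + eps)       # written with matrices:
                + lam * L_bar / (np.sum(L_bar * K) + eps))   # <l, K_v> = sum(L * K)
        K = K - step * grad                         # gradient step, Eq. 23
        # Theorem 3: clip negative eigenvalues to project onto the PSD cone
        vals, vecs = np.linalg.eigh((K + K.T) / 2.0)
        K = (vecs * np.maximum(vals, 0.0)) @ vecs.T
        # Theorem 4: enforce K(i, i) = 1
        np.fill_diagonal(K, 1.0)
    return K
```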

4.2 Learning the global kernel from sub-kernels

In this section, we present the second strategy to improve the efficiency of the proposed algorithm. Note that directly using Algorithm 2 is neither scalable nor efficient enough for large datasets, since solving Eq. 24 requires an eigen-decomposition of the current kernel matrix, which is time consuming when the number of data points is large. Therefore, we propose to approximate the global kernel matrix using local kernel matrices (or sub-kernel matrices) formed by subsets of data points.

Suppose we start with a small subset of data (namely, m data points) denoted as \(D = \{x_1, x_2, \ldots, x_m\}\), and the corresponding sub-kernel matrix \(K_D\) has been learned using the constrained gradient descent algorithm described in Algorithm 2. The idea is to approximate the other elements of the global kernel matrix using this sub-kernel matrix. Because data of the same class in the feature space \({\mathcal{H}}\) are assumed to be flat (ideally they are clustered into one point in the feature space), it is reasonable to approximate every other data point \(\phi(x_i)\) by a linear combination of this subset \(\phi(D)\), that is, \(\phi(x_i) = \sum_j w_{ij}\phi(x_j)\), where the \(w_{ij}\) are weights to be learned. There are two situations for \(x_i\not\in D\):

  1.

    If \(x_i\) has at least one must-link constraint with some point \(x_j\) in D, then according to our assumption, \(\phi(x_i) = \phi(x_j)\) in the feature space for \(x_j\in D\). Taking all such points into consideration, we relax \(\phi(x_i)\) to be a linear combination of these points in the feature space, i.e., \(\phi(x_i) = \sum_j \xi\phi(x_j)\), where \(w_{ij} = \xi\) is equal for all such \(x_j\).

  2.

    If \(x_i\) has no must-link constraints with the points in D, then we approximate \(\phi(x_i)\) by a weighted combination of \(\phi(D)\) in the feature space. We assume these weights are approximately the same as those learned by minimizing the reconstruction error in the original data space, which distinguishes our approximation from the one proposed by Yeung et al. [4]. While the objective function for \(w_{ij}\) is similar to locally linear embedding (LLE) [34], the definition of the neighborhood is different. The objective function for \(w_{ij}\) is:

    $$ E = \mathop{\min}\limits_{w_{ij}}\sum\limits_i \left\|x_i - \sum_{j\in {\mathcal{N}}(i)}w_{ij}x_j\right\|^2, $$
    (30)

    where \({\mathcal{N}}(i)\) is defined to be the k nearest data points of \(x_i\), excluding those having cannot-link constraints with \(x_i\); a least-squares sketch for these weights is given after this list.
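
The weights of Eq. 30 can be obtained by ordinary least squares over the neighborhood N(i); the sketch below (the helper name and the unconstrained solver are our choices) illustrates this for a single point \(x_i\not\in D\).

```python
import numpy as np

def reconstruction_weights(X, i, neighbors):
    """LLE-style weights of Eq. 30 for one point x_i outside D (sketch):
    solve  min_w || x_i - sum_j w_j x_j ||^2  over the neighborhood N(i).
    `neighbors` is the index list N(i), i.e. the k nearest points of x_i
    excluding its cannot-link partners."""
    A = X[neighbors].T                            # columns are the neighbor points
    w, *_ = np.linalg.lstsq(A, X[i], rcond=None)  # unconstrained least squares
    return w                                      # w[j] multiplies X[neighbors[j]]
```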

To sum up, the weights \(w_{ij}\) are defined as:

$$ \begin{aligned} w_{ij} = \left\{\begin{array}{ll} 1 & x_i, x_j \in D \hbox{ and } i = j,\\ \frac{1}{T} & x_i\not\in D,\ x_j\in D \hbox{ has a must-link constraint with } x_i,\\ \hbox{Eq}.\,(30) & x_i\not\in D,\ x_j\in D \hbox{ is a neighboring point of } x_i \hbox{ with no cannot-link constraint},\\0 & \hbox{otherwise}, \end{array}\right. \end{aligned} $$

where T is the number of must-link constraints between \(x_i\not\in D\) and the points in D. Thus, the whole dataset in the feature space can be written in matrix form as:

$$ \phi(X) = \phi(D)W^T, $$
(31)

where X is the whole dataset, \(W = (w_1^T, w_2^T, \ldots, w_n^T)^T\), and \(w_i = (w_{i1}, w_{i2}, \ldots, w_{im})^T\). Then, the whole kernel matrix can be approximated using Eq. 31 as:

$$ K_X = \phi(X)^T\phi(X) = WK_DW^T. $$
(32)
[Algorithm 3: sub-dataset picking schema]

From Eq. 32 it can be seen that the kernel matrix \(K_X\) of the whole dataset can be approximated by a sub-kernel matrix \(K_D\). However, given arbitrary must-link and cannot-link constraints, a single sub-kernel matrix may not approximate the whole kernel matrix well, because all the pairwise constraints may not be included in one sub-dataset. To solve this problem, we propose a sub-dataset picking schema, which scales at most linearly with the size of the dataset, to partition the whole dataset into several sub-datasets; we then use the corresponding sub-kernels to approximate the whole kernel. It can be shown that the computational complexity of this strategy is at most O(n) times larger than that of using only one sub-kernel.

In this schema, we use the number of constraints (the degrees of the nodes in the constraint graph) in a sub-dataset as a measure of the prior information it contains. The larger the degree of a data point, the more prior information it carries, and thus the higher the probability that the data point should be used to learn a sub-kernel. This sub-dataset picking schema is described in Algorithm 3.

As seen from Algorithm 3, the number of sub-datasets scales at most linearly with the number of data points, which is very efficient. Suppose we finally divide the whole dataset into L sub-datasets, and for each sub-dataset a sub-kernel is learned by a kernel learning algorithm such as Algorithm 2. Denoting by \(K_1, K_2, \ldots, K_L\) the approximated kernel matrices calculated using Eq. 32, the final kernel matrix for the whole dataset is set to:

$$ K = \sum_i\alpha_iK_i,\quad \hbox{s.t. } \sum_i\alpha_i = 1, $$
(33)

where α i is the weight for the ith kernel. In the experiments, we set α i proportional to the total degree of the data points in the ith sub-dataset.
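
The combination step of Eqs. 32-33 is then a weighted sum of the approximated kernels; a minimal sketch (function and argument names are ours) follows.

```python
import numpy as np

def combine_subkernels(sub_kernels, sub_weight_mats, degrees):
    """Global kernel of Eqs. 32-33 (sketch): each sub-kernel K_{D_i} is
    expanded to the full dataset as W_i K_{D_i} W_i^T, and the expansions
    are averaged with weights alpha_i proportional to the total constraint
    degree of each sub-dataset."""
    Ks = [W @ KD @ W.T for KD, W in zip(sub_kernels, sub_weight_mats)]
    alpha = np.asarray(degrees, dtype=float)
    alpha /= alpha.sum()                          # sum_i alpha_i = 1
    return sum(a * Ki for a, Ki in zip(alpha, Ks))
```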

Note that this algorithm is efficient because it solves the original SDP problem of TRAGEK using a constrained gradient descent based algorithm; we compare the two algorithms with respect to their efficiency and clustering accuracy in the experiments.

5 Experiments

To test the proposed kernel learning algorithms TRAGEK and ENPAKL, we employed them for clustering and also used ENPAKL for image segmentation. We carried out the evaluations on two simulated datasets and ten datasets from the UCI machine learning repository [35]. The details of these datasets are tabulated in Table 1; the first nine datasets have often been used in evaluating the performance of semi-supervised clustering algorithms [3, 28]. We compared ENPAKL with the PCP algorithm [3], the SSKK algorithm [2], and the traditional k-means algorithm. We also investigated the influence of the number of pairwise constraints on the clustering performance of ENPAKL. To measure the clustering performance, we adopted the metric defined in [28]:

$$ acc = \sum_{i > j}\frac{2\cdot {\mathcal{I}}({\mathcal{I}}(c_i, c_j), {\mathcal{I}}({\hat{c}}_i, {\hat{c}}_j))}{n(n-1)}, $$
(34)

where \({\mathcal{I}}(a, b)\) is an indicator function returning 1 if a = b and 0 otherwise, c denotes the true cluster membership, \({\hat{c}}\) denotes the predicted cluster membership, and n is the number of samples. Without loss of generality, we set the parameters λ, λ1 and λ2 in Eq. 19 to 1, and the scale factor σ in Eq. 15 to the average pairwise distance of the dataset.
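
The accuracy of Eq. 34 is simply the fraction of point pairs on which the true and predicted partitions agree about co-membership; a small NumPy implementation is given below.

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Pairwise clustering accuracy of Eq. 34: the fraction of pairs (i, j),
    i > j, on which the true and predicted partitions agree about whether
    the two points share a cluster."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    agree = (same_true == same_pred)
    iu = np.triu_indices(len(c_true), k=1)     # each unordered pair counted once
    return agree[iu].mean()
```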

Table 1 Twelve datasets used in our experiments

In addition, we applied the proposed ENPAKL together with the k-means and N-cut algorithms [36] to the MSRC Object Category Image Database (v2) [37] for image segmentation. Details are described in Sect. 5.5.

5.1 An illustrative experiment

To show that the proposed algorithms can propagate label information through the datasets, so that data in different classes are separated as far as possible and those in the same class are clustered as closely as possible, we ran TRAGEK on the two synthetic datasets used in [28], i.e., the chessboard and double-spiral datasets in Fig. 2. For better illustration, we rearranged the order of the data such that the first part of the data matrix belongs to one class and the last part to the other. It can be seen in Fig. 2 that the two classes are well separated. More specifically, we observed that the elements of the learned kernel K corresponding to the same class tend to 1 (black), while those corresponding to different classes tend to −1 (white). This means that data from different classes are projected onto opposite points on the hyperball of the feature space.

Fig. 2

Clustering results on the chessboard and double-spiral datasets. In c and d, black indicates that the corresponding value in the kernel matrix is 1, and white indicates −1

5.2 On small datasets

To compare the proposed kernel learning algorithms with related algorithms for clustering, we tested them on the nine small-scale datasets in Table 1, ranging from glass to wine. For ENPAKL, we set B = 20 and R = 10 in Algorithm 3. Pairwise constraints are required for TRAGEK, ENPAKL, PCP, and SSKK, so we randomly generated k must-link constraints within each class and k cannot-link constraints between each pair of classes, where k ranges from 10 to 100 with an interval of 10. This gives a total of \(\frac{c(c+1)k}{2}\) pairwise constraints for each experiment with a dataset of c classes. For each k, we randomly generated 20 different sets of pairwise constraints, resulting in 20 realizations. The reported results are the average over the 20 realizations, together with 10 repetitions of the kernel k-means clustering step in Algorithm 1 for each realization. The results are illustrated in Fig. 3.
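
The constraint generation described above can be sketched as follows; the sampling details, such as drawing pairs uniformly without replacement, are our assumptions.

```python
import numpy as np

def sample_constraints(labels, k, rng=None):
    """Sample k must-link pairs within each class and k cannot-link pairs
    for each pair of classes, as in the setup of Sect. 5.2."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes = np.unique(labels)
    must, cannot = [], []
    for c in classes:                              # k must-link pairs per class
        idx = np.where(labels == c)[0]             # assumes each class has >= 2 points
        for _ in range(k):
            i, j = rng.choice(idx, size=2, replace=False)
            must.append((int(i), int(j)))
    for a in range(len(classes)):                  # k cannot-link pairs per class pair
        for b in range(a + 1, len(classes)):
            ia = np.where(labels == classes[a])[0]
            ib = np.where(labels == classes[b])[0]
            for _ in range(k):
                cannot.append((int(rng.choice(ia)), int(rng.choice(ib))))
    return must, cannot
```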

Fig. 3

Clustering performance on nine small UCI datasets

We can observe from Fig. 3 that:

  • TRAGEK outperforms the other three algorithms on all datasets except Iris and Wine when the number of pairwise constraints is below 30.

  • PCP is worse than TRAGEK in clustering accuracy in most cases, and it runs into numerical problems when the number of pairwise constraints is large or when noisy constraints (constraints that are wrongly labeled) are added. Furthermore, we observed in the experiments that the running time of TRAGEK fluctuates little when the number of pairwise constraints is varied, as shown in Sect. 5.4.

  • ENPAKL approximates the original kernel learning problem TRAGEK remarkably well, sometimes even achieving better performance. Another merit of ENPAKL is that it is much faster than TRAGEK and PCP; we give examples below.

5.3 On larger datasets

Note that the datasets used in Sect. 5.2 are small, though often used in evaluating semi-supervised clustering algorithms [3, 28]. TRAGEK and other algorithms such as PCP cannot scale well to large datasets with robust pairwise constraints. To evaluate the scalability of the proposed ENPAKL algorithm, we performed experiments on the three larger datasets described in the last three rows of Table 1. The optdigits dataset is a subset of a larger digits dataset; it contains the digits from 0 to 4, with each class containing 200 instances (Digital04). Waveform also comes from a larger dataset and has 1,800 instances.

The intrinsic disadvantage of PCP prevents it from being applied to such large data with robust constraints. To enable it to work, we reduced the number of constraints by sampling. Specifically, the number of sampled constraints was set to the final number of constraints after the reduction in TRAGEK. We repeated the experiments ten times with random sampling for the PCP algorithm, and report the best result as the performance of PCP, denoted as PCPSample. In this experiment, we varied the number of pairwise constraints relative to the total number of data samples, and the parameters in Algorithm 3 were set to B = 100 and R = 100. ENPAKL1 is a variant of ENPAKL obtained by replacing the base sub-kernel learning algorithm in Algorithm 2 with PCP. The results for these algorithms are shown in Fig. 4.

Fig. 4

Clustering performance on three large UCI datasets

It is found in the experiments that:

  • ENPAKL is a little faster than ENPAKL1, and both are much faster than PCP and SSKK.

  • TRAGEK is clearly superior to PCP and SSKK; these two algorithms even fail to compete with the traditional k-means algorithm.

  • The performances of ENPAKL and ENPAKL1 are comparable, and they are the best algorithms in terms of effectiveness and efficiency.

5.4 Running time

This section reports the running time of several related algorithms; we claim that the running time of TRAGEK is not sensitive to the number of pairwise constraints. To test this, we performed experiments on the Heart and Chessboard datasets, increasing the number of pairwise constraints from 10 to 100 with an interval of 10. The results are shown in Fig. 5a and b. As the number of pairwise constraints increases, the running time of TRAGEK varies little, whereas that of PCP increases dramatically.

Fig. 5

Running time comparison. The x-axis is the number of constraints, the y-axis represents the running time in seconds

Next we examined the efficiency of ENPAKL. We used the Wisconsin dataset and recorded the corresponding running time. Since PCP is too time consuming when the number of constraints is large, we do not show its running time here. We compared ENPAKL with TRAGEK; the results are shown in Fig. 5c. Obviously, ENPAKL is much more efficient than TRAGEK in terms of computational complexity.

5.5 Image segmentation

In this section we applied ENPAKL to image segmentation by clustering image patches. We tested our algorithm on the MSRC Object Category Image Database (v2) [37], which contains 791 images of size approximately 320 × 210 and includes different scenes such as grasses, forests, and streets. In this experiment, we do not focus on the choice of features; instead, we want to test the effectiveness and robustness of the proposed algorithm against popular clustering algorithms such as k-means and N-cut. We therefore simply used histogram features (richer features for image segmentation are left for future work). Specifically, we divided each image into 5 × 5 patches, extracted the color histogram of each patch as its feature, and used these features for segmentation. We set the number of clusters to the ground truth, and for ENPAKL we randomly generated 50 must-link and cannot-link constraints for each cluster in the images. For simplicity, we compared the proposed ENPAKL algorithm with the k-means and N-cut algorithms, which are popularly used in image segmentation, also because the above experiments have shown the superiority of the k-means algorithm over PCP and SSKK. Some examples of the images and their segmentation results are shown in Fig. 6. From these results we can see the superiority of ENPAKL over the k-means and N-cut algorithms in terms of segmentation accuracy, although it runs much more slowly, which is a typical problem for kernel-based algorithms. We used Eq. 34 as the segmentation accuracy criterion, and the corresponding accuracies are also shown in the figure. ENPAKL performs best, while N-cut and k-means are comparable. Also note that for some images the segmentations learned by ENPAKL are very close to the ground truth, while those learned by k-means and N-cut are much worse; this indicates that supervisory information can help image segmentation considerably, and it is worthwhile to use such information to boost segmentation accuracy. We believe better segmentation results can be obtained by choosing the constraints carefully, by using other kinds of features such as SIFT features [39] and rich textural features [40], and by taking spatial information into consideration.
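
As a rough illustration of the feature extraction step, the sketch below tiles an image into 5 × 5 pixel patches and computes per-channel color histograms; the interpretation of "5 × 5 patches" as 5 × 5 pixel blocks, the number of bins, and the assumption of 8-bit RGB input are ours.

```python
import numpy as np

def patch_histograms(image, patch=5, bins=8):
    """Color-histogram features for segmentation (sketch): tile the image
    into patch x patch pixel blocks and compute a normalized per-channel
    histogram for each block.  `image` is assumed to be H x W x 3, uint8."""
    H, W, C = image.shape
    feats = []
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            block = image[r:r + patch, c:c + patch]
            hist = [np.histogram(block[..., ch], bins=bins, range=(0, 256))[0]
                    for ch in range(C)]
            feats.append(np.concatenate(hist) / (patch * patch))
    return np.asarray(feats)        # one feature vector per patch
```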

Fig. 6

Image segmentation using ENPAKL, k-means and N-cut. Here acc denotes the segmentation accuracy evaluated using Eq. 34

6 Conclusion

In this paper, we proposed a non-parametric kernel learning framework. It generalizes the graph embedding framework [6] to kernel space and is reformulated as a conic optimization program. A global Laplacian regularizer is used to smooth the functional space. Two algorithms are proposed for the corresponding kernel learning problem: one solves the original optimization problem through semi-definite programming, while the other relaxes the SDP problem and solves it with a constrained gradient descent based algorithm. To further reduce the computational complexity, the whole dataset is divided into groups, sub-kernels for these groups are learned separately, and the global kernel is constructed by combining these sub-kernels. Experiments are performed on nine datasets for clustering and one image dataset for image segmentation. Experimental results show that the proposed ENPAKL algorithm is superior to the recently developed algorithms [2, 3] in terms of computational efficiency and clustering accuracy, and often achieves better image segmentation.

In future work we will study the parameter setting problem. For example, the regularizer S in Eq. 19 may be replaced by a more sophisticated regularizer such as the s-weighted Laplacian operator [41]. The algorithms should also be evaluated with different settings of B and R in Algorithm 3, the k used in graph construction, etc. Furthermore, ENPAKL could be incorporated into other kernel methods, such as the kernelization of dimensionality reduction algorithms. In addition, we will apply the proposed algorithm to more real applications and explore more efficient algorithms for this problem, since the current methods are not fast enough for very large-scale datasets.