1 Introduction

Graph-based learning algorithms have received considerable attention in the machine learning community. For example, label propagation (e.g., Blum and Chawla 2001; Szummer and Jaakkola 2001; Joachims 2003; Zhu et al. 2003; Zhou et al. 2004; Herbster et al. 2005; Sindhwani et al. 2005; Belkin et al. 2006; Bengio et al. 2006) is widely accepted as a state-of-the-art approach for semi-supervised learning, in which node labels are estimated through the input graph structure. Spectral clustering (e.g., Shi and Malik 2000; Ng et al. 2001; Meila and Shi 2001; von Luxburg 2007) is another well-known graph-based algorithm, in which cluster partitions are determined according to the minimum cut of the given graph. An important property common to these graph-based approaches is that the manifold structure of the input data can be captured by the graph. Their practical performance advantage has been demonstrated in various application areas (e.g., Patwari and Hero 2004; Lee and Kriegman 2005; Zhang and Zha 2005; Fergus et al. 2009; Aljabar et al. 2012).

On the other hand, it is well known that the accuracy of graph-based methods depends heavily on the quality of the input graph (e.g., Zhu et al. 2003; Kapoor et al. 2006; Zhang and Lee 2007; von Luxburg 2007; Wang and Zhang 2008), which is typically generated from a set of numerical input vectors (i.e., feature vectors). A general framework of graph-based learning can be represented as the following three-step procedure:

  1. Step 1: Generating graph edges from given data, where nodes of the generated graph correspond to the instances of input data.

  2. Step 2: Giving weights to the graph edges.

  3. Step 3: Estimating node labels based on the generated graph, which is often represented as an adjacency matrix.

This framework is employed by many graph-based algorithms including label propagation and spectral clustering.

In this paper, we focus on the second step in the three-step procedure: estimating edge weights for the subsequent label estimation. Optimizing edge weights is difficult in semi- or un-supervised learning, because there are only a small number of (or no) labeled instances. This problem is also important because edge weights heavily affect the final prediction accuracy of graph-based methods, yet in practice rather simple heuristic strategies have been employed.

There are two standard approaches for estimating edge weights: similarity-function-based and locally linear embedding (LLE) (Roweis and Saul 2000) based approaches. Each of these two approaches has its own disadvantage. The similarity-based approaches use similarity functions, such as the Gaussian kernel, but most similarity functions have scale parameters (such as the width parameter of the Gaussian kernel) that are in general difficult to tune. On the other hand, in LLE, the true underlying manifold can be approximated by a graph by minimizing a local reconstruction error. LLE is more sophisticated than the similarity-based approach, and LLE based graphs have been applied to semi-supervised learning and clustering (Wang and Zhang 2008; Daitch et al. 2009; Cheng et al. 2009; Liu et al. 2010). However, LLE is noise-sensitive (Chen and Liu 2011). In addition, to avoid a kind of degeneracy problem (Saul and Roweis 2003), LLE requires additional tuning parameters. Yet another practical approach is to optimize weights by regarding them as hyper-parameters of learning methods (e.g., Zhang and Lee 2007). General model selection criteria can also be used, although the reliability of such criteria is unclear for graphs with a small number of labeled instances. We discuss these related approaches in Sect. 5.

Our approach is a similarity-based method, yet it also captures the manifold structure of the input data; we refer to it as adaptive edge weighting (AEW) because graph edges are determined in a data-adaptive manner in terms of both similarity and manifold structure. The objective function in AEW is based on local reconstruction, by which the estimated weights capture the manifold structure. However, unlike conventional LLE based approaches, we introduce an additional constraint that edges represent the similarity of two nodes. Due to this constraint, AEW has the following advantages compared to LLE based approaches:

  • Our formulation alleviates the problem of over-fitting due to the parameterization of weights. We observed that AEW is more robust against noise in the input data and against changes in the number of graph edges.

  • Since edge weights are defined through a parameterized similarity function, the resulting weights still represent the similarity of each node pair. This is very reasonable for many graph-based algorithms.

We provide further justifications for our approach based on the ideas of feature propagation and local linear approximation. Our objective function can be seen as a cross-validation error of a propagation model for feature vectors, which we call feature propagation. This allows us to interpret AEW as optimizing graph weights through cross-validation (for prediction) in the feature vector space instead of the given labels, under the assumption that input feature vectors and given labels share the same local structure. Another interpretation is provided through local linear approximation, by which we can analyze the error of local reconstruction in the output (label) space under a low-dimensional manifold model.

The rest of this paper is organized as follows: In Sect. 2, we briefly review the standard graph-based methods on which we focus in this paper. Section 3 introduces our proposed method for adaptively optimizing graph edge weights. Section 4 describes analytical considerations for our approach, which provide interpretations and an error analysis of AEW. In Sect. 5, we discuss relationships to other existing topics. Section 6 presents experimental results on a variety of datasets, including synthetic and real-world datasets, demonstrating the performance advantage of the proposed approach. Finally, Sect. 7 concludes the paper.

This paper is an extended version of our preliminary conference paper presented at NIPS 2013 (Karasuyama and Mamitsuka 2013). In this paper, we describe our framework in a more general way by using three well-known graph-based learning methods [the harmonic Gaussian field (HGF) model, the local and global consistency (LLGC) method, and spectral clustering], while the preliminary version only deals with HGF. Furthermore, we have conducted a more thorough experimental evaluation covering three points: in the semi-supervised setting, (1) comparison with other state-of-the-art semi-supervised methods and (2) comparison with hyper-parameter optimization methods, and, as an additional problem setting, (3) comparison on clustering tasks.

2 Graph-based semi-supervised learning and clustering

In this paper we consider label propagation and spectral clustering as the methods for the third step in the three-step procedure. Both are state-of-the-art graph-based learning algorithms, and the labels of graph nodes (or clusters) are estimated by using a given adjacency matrix.

Suppose that we have n feature vectors \(\mathcal{X}= \{ \varvec{x}_1, \ldots , \varvec{x}_n \}\), where \(\varvec{x}\in \mathbb {R}^p\). An undirected graph \(\mathcal{G}\) is generated from \(\mathcal{X}\), where each node (or vertex) corresponds to a data point \(\varvec{x}_i\). The graph \(\mathcal{G}\) can be represented by the adjacency matrix \(\varvec{W}\in \mathbb {R}^{n \times n}\), whose (i, j)-element \(W_{ij}\) is the weight of the edge between \(\varvec{x}_i\) and \(\varvec{x}_j\). The key idea of graph-based classification is that instances connected by large weights \(W_{ij}\) on the graph tend to have the same labels (meaning that labels remain the same in strongly connected regions of the graph).

Let \(\varvec{F}\in \mathbb {R}^{n \times c}\) be a label score matrix which gives an estimation of the labels, where c is the number of classes or clusters. To realize the key idea described above, graph-based approaches force the label scores to have similar values for strongly connected nodes. (This corresponds to what is commonly called the “smoothness” penalty in the semi-supervised learning literature, though “continuity” would be a more appropriate term.) A penalty for the variation of the score \(\varvec{F}\) on the graph can be represented as

$$\begin{aligned} \sum _{k=1}^c \sum _{ij} W_{ij} (F_{ik} - F_{jk})^2. \end{aligned}$$
(1)

For the adjacency matrix \(\varvec{W}\), the following weighted k-nearest neighbor (k-NN) graph is commonly used in graph-based learning algorithms:

$$\begin{aligned} W_{ij}&= \left\{ \begin{array}{ll} \exp \left( - \frac{\Vert \varvec{x}_i - \varvec{x}_j \Vert ^2}{2 \sigma ^2} \right) , &{} \quad j \in \mathcal{N}_i \text { or } i \in \mathcal{N}_j, \\ 0, &{}\quad \text {otherwise}, \end{array} \right. \end{aligned}$$

where \(\mathcal{N}_i\) is a set of indices of the k-NN of \(\varvec{x}_i\) (Note that the adjacency matrix \(\varvec{W}\) is not necessarily positive definite).
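As a concrete illustration of this graph construction, the following sketch builds such a symmetrized weighted k-NN adjacency matrix using NumPy and SciPy. It is our own minimal example, not code from the paper; the function name and the dense-matrix representation are ours, and a sparse matrix would be preferable for large n.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_gaussian_graph(X, k=10, sigma=1.0):
    """Weighted k-NN graph with a Gaussian kernel (illustrative sketch).

    X     : (n, p) array of feature vectors.
    k     : number of nearest neighbors.
    sigma : Gaussian width parameter.
    Returns the symmetric adjacency matrix W of shape (n, n).
    """
    n = X.shape[0]
    D2 = cdist(X, X, metric="sqeuclidean")        # pairwise squared distances
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]       # k nearest neighbors (self excluded)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nn[i]] = np.exp(-D2[i, nn[i]] / (2.0 * sigma ** 2))
    # keep an edge whenever j is a neighbor of i OR i is a neighbor of j
    W = np.maximum(W, W.T)
    return W
```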

From this adjacency matrix, the graph Laplacian (e.g., Chung 1997) can be defined by

$$\begin{aligned} \varvec{L}&= \varvec{D}- \varvec{W}, \end{aligned}$$

where \(\varvec{D}\) is a diagonal matrix with the diagonal entry \(D_{ii} = \sum _{j} W_{ij}\). Instead of \(\varvec{L}\), normalized variants of the Laplacian such as \(\varvec{L}= \varvec{I}- \varvec{D}^{-1} \varvec{W}\) or \(\varvec{L}= \varvec{I}- \varvec{D}^{-1/2} \varvec{W}\varvec{D}^{-1/2}\) are also used, where \(\varvec{I}\in \mathbb {R}^{n \times n}\) is the identity matrix. Using the graph Laplacian, the score variation penalty (1) can also be written as

$$\begin{aligned} \text {trace}\left( \varvec{F}^\top \varvec{L}\varvec{F}\right) , \end{aligned}$$
(2)

where \(\text {trace}(\cdot )\) is defined as the sum of the diagonal entries of a given matrix.
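The correspondence between (1) and (2) is easy to verify numerically. The sketch below is our own illustration (the helper `laplacians` is ours, not from the paper); it builds the unnormalized and symmetric normalized Laplacians and checks that the double sum in (1), which visits each unordered edge twice, equals twice the trace form (2).

```python
import numpy as np

def laplacians(W):
    """Unnormalized and symmetric normalized graph Laplacians (sketch)."""
    d = W.sum(axis=1)                              # degrees D_ii
    L = np.diag(d) - W                             # L = D - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    return L, L_sym

# numerical check: sum_k sum_{ij} W_ij (F_ik - F_jk)^2 == 2 * trace(F^T L F),
# the constant 2 arising because the double sum counts each edge twice
rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
F = rng.random((6, 3))
L, _ = laplacians(W)
penalty = sum(W[i, j] * np.sum((F[i] - F[j]) ** 2)
              for i in range(6) for j in range(6))
assert np.isclose(penalty, 2.0 * np.trace(F.T @ L @ F))
```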

2.1 Label propagation

Label propagation is a widely accepted graph-based semi-supervised learning algorithm. Among the many methods which have been proposed so far, we focus on the formulations derived by Zhu et al. (2003) and Zhou et al. (2004), which are the current standard formulations of graph-based semi-supervised learning.

Suppose that the first \(\ell \) data points in \(\mathcal{X}\) are labeled by \(\mathcal{Y}= \{ y_1, \ldots , y_\ell \}\), where \(y_i \in \{1, \ldots , c\}\) and c is the number of classes. The goal of label propagation is to predict the labels of the unlabeled nodes \(\{ \varvec{x}_{\ell +1},\ldots ,\varvec{x}_n \}\). The scoring matrix \(\varvec{F}\) gives an estimation of the label of \(\varvec{x}_i\) by \(\mathop {\mathrm{argmax}}_{ j \le c} F_{ij}\). Label propagation estimates \(\varvec{F}\) in such a way that the score varies little across neighboring nodes while accurately predicting the given labeled points. The following is a regularization formulation of label propagation (Zhou et al. 2004):

$$\begin{aligned} \min _{\varvec{F}}&\ \text {trace}\left( \varvec{F}^\top \varvec{L}\varvec{F}\right) + \lambda \Vert \varvec{F}- \varvec{Y}\Vert _F^2, \end{aligned}$$
(3)

where \(\varvec{Y}\in \mathbb {R}^{n \times c}\) is the label matrix with \(Y_{ij} = 1\) if \(\varvec{x}_i\) is labeled as \(y_i = j\) and \(Y_{ij} = 0\) otherwise, \(\lambda \) is the regularization parameter, and \(\Vert \cdot \Vert _F^2\) denotes the matrix Frobenius norm defined as \(\Vert \varvec{M}\Vert _F^2 = \sum _{ij} M_{ij}^2\). This formulation is called the local and global consistency (LLGC) method. The first term of LLGC (3) represents the penalty for the score differences on neighboring nodes, and the second term represents the deviation from the initial labeling \(\varvec{Y}\). The following is another standard formulation of label propagation, called the harmonic Gaussian field (HGF) model (Zhu et al. 2003):

$$\begin{aligned} \min _{\varvec{F}}&\ \text {trace}\left( \varvec{F}^\top \varvec{L}\varvec{F}\right) \\ \text {subject to}&\ F_{ij} = Y_{ij},\quad \text { for }\;\; i = 1,\ldots , \ell . \end{aligned}$$

In this formulation, the scores for labeled nodes are fixed as constants. Both formulations reduce to linear systems, which can be solved efficiently, especially when the Laplacian \(\varvec{L}\) has a sparse structure (Spielman and Teng 2004).
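To make this concrete, the sketch below writes both solutions as linear systems. It is our own illustration with dense NumPy arrays and hypothetical function names; in practice sparse solvers would be used. LLGC has the closed form \(\varvec{F} = \lambda (\varvec{L} + \lambda \varvec{I})^{-1} \varvec{Y}\), obtained by setting the gradient of (3) to zero, and HGF solves for the unlabeled block of the Laplacian with the labeled scores fixed.

```python
import numpy as np

def llgc(L, Y, lam):
    """Closed-form LLGC solution of (3): F = lam * (L + lam * I)^{-1} Y."""
    n = L.shape[0]
    return lam * np.linalg.solve(L + lam * np.eye(n), Y)

def hgf(L, Y, n_labeled):
    """HGF solution: the first `n_labeled` rows of F are fixed to Y and the
    unlabeled block is obtained from L_uu F_u = -L_ul Y_l (Zhu et al. 2003)."""
    l = n_labeled
    F = Y.astype(float).copy()
    F[l:] = np.linalg.solve(L[l:, l:], -L[l:, :l] @ Y[:l])
    return F
```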

2.2 Spectral clustering

Spectral clustering exploits the manifold structure of input data through a given graph. The objective function can be represented as follows (von Luxburg 2007):

$$\begin{aligned} \min _{\varvec{F}}&\ \text {trace}\left( \varvec{F}^\top \varvec{L}\varvec{F}\right) \\ \text {subject to}&\ \varvec{F}^\top \varvec{F}= \varvec{I}. \end{aligned}$$

Here c, the number of columns of \(\varvec{F}\), corresponds to the number of clusters, which must be specified beforehand. We can solve this optimization problem through the eigenvalue decomposition of the graph Laplacian matrix, and the partitions can then be obtained by applying k-means clustering to the obtained eigenvectors. Although the formulation was originally derived as an approximation of the graph mincut problem, it can also be interpreted as minimizing the label score variation on the graph.
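A minimal sketch of this procedure, using NumPy for the eigenvalue decomposition and scikit-learn's KMeans for the final partitioning (our own illustration; the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(L, c, random_state=0):
    """Cluster nodes using the c eigenvectors of the graph Laplacian that
    correspond to the c smallest eigenvalues, followed by k-means."""
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    F = eigvecs[:, :c]                       # relaxed cluster indicator matrix
    km = KMeans(n_clusters=c, n_init=10, random_state=random_state)
    return km.fit_predict(F)
```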

3 Basic framework

The performance of the graph-based algorithms described in the previous section heavily depends on the quality of the input graph. Our proposed approach, adaptive edge weighting (AEW), optimizes the edge weights for the graph-based learning algorithms. In the three-step procedure, AEW corresponds to the second step and is independent of the first and third steps. In this paper we assume the input graph is a k-NN graph (i.e., the first step is based on k-NN), but AEW can be applied to any type of graph.

First of all, graph edges should satisfy the following conditions:

  • Capturing the manifold structure of the input space.

  • Representing similarity between two nodes.

These two conditions are closely related to the manifold assumption of graph-based learning algorithms, in which label scores should have similar values if two data points are close on the input manifold. Since the manifold structure of the input data is unknown beforehand, the graph is used to approximate the manifold (the first condition). Subsequent predictions are performed in such a way that label scores are consistent with the similarity structure provided by the graph (the second condition). Our algorithm simultaneously pursues these two important aspects of the graph for the graph-based learning algorithms.

3.1 Formulation

We first define \(W_{ij}\) as a similarity function of two nodes (i, j), and we employ the following standard kernel function as a similarity measure (note that other similarity functions can also be used):

$$\begin{aligned} W_{ij}&= \exp \left( - \sum _{d = 1}^p \frac{(x_{id} - x_{jd})^2}{\sigma _d^2} \right) \quad \text { for }\; i \in \mathcal{N}_j \text { or } j \in \mathcal{N}_i, \end{aligned}$$
(4)

where \(x_{id}\) is the dth element of \(\varvec{x}_i\), and \(\{ \sigma _d \}_{d=1}^p\) is a set of parameters. This edge weighting is commonly used in many graph-based algorithms, and it can also be interpreted as the solution of the heat equation on the graph (Zhu et al. 2003). To adapt to differences in scaling around each data point, we can also use the following local scaling kernel (Zelnik-Manor and Perona 2004):

$$\begin{aligned} W_{ij}&= \exp \left( - \sum _{d = 1}^p \frac{(x_{id} - x_{jd})^2}{s_i s_j \sigma _d^2 } \right) \quad \text { for }\; i \in \mathcal{N}_j\quad \text { or }\;j \in \mathcal{N}_i, \end{aligned}$$
(5)

where the constant \(s_i \ge 0\) is defined as the distance to the K-th neighbor from \(\varvec{x}_i\) (e.g., \(K = 7\) in Zelnik-Manor and Perona 2004).
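The sketch below computes the weights (4), and optionally the locally scaled weights (5), given a precomputed k-NN connectivity matrix. It is our own illustration; the function name, the boolean `neighbors` matrix, and the double loop are ours and chosen for clarity rather than efficiency.

```python
import numpy as np

def edge_weights(X, neighbors, sigma, scale=None):
    """Similarity weights (4), or local scaling weights (5) when `scale` is given.

    X         : (n, p) feature matrix.
    neighbors : boolean (n, n) matrix, True where i and j are k-NN connected.
    sigma     : (p,) vector of per-dimension width parameters sigma_d.
    scale     : optional (n,) vector of local scaling factors s_i,
                e.g. the distance to the K-th neighbor of each point.
    """
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and neighbors[i, j]:
                d2 = np.sum((X[i] - X[j]) ** 2 / sigma ** 2)
                if scale is not None:
                    d2 /= scale[i] * scale[j]
                W[i, j] = np.exp(-d2)
    return W
```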

To optimize the parameters \(\sigma _d\), we evaluate how well the graph fits the input features (manifold) by the following objective function:

$$\begin{aligned} \min _{ \{ \sigma _d \}_{d=1}^p }&\ \sum _{i = 1}^n \left\| \varvec{x}_i - \frac{1}{D_{ii}} \sum _{j \sim i} W_{ij} \varvec{x}_j \right\| _2^2, \end{aligned}$$
(6)

where \(j \sim i\) means that j is connected to i. This objective function represents the local reconstruction error by local linear patches, which captures the manifold structure of the input (Roweis and Saul 2000; Saul and Roweis 2003). Figure 1 illustrates the idea of this approach. Similar objective functions have been used in locally linear embedding (LLE) (Roweis and Saul 2000) and graph construction (Jebara et al. 2009; Daitch et al. 2009; Talukdar 2009; Liu et al. 2010); the difference in our approach is that the parameters of the similarity function are optimized. The LLE type objective function assumes that a flat model provides a good local approximation of the underlying manifold (regarding higher-order behaviors of the manifold as locally not dominant). The parameters of a set of locally fitted flat models \(\varvec{W}\) are expected to reflect the intrinsic geometrical relation, in the sense that the same parameters \(\varvec{W}\) can also approximately reconstruct each data point locally from its neighbors in the lower dimensional manifold. This property is also advantageous for the graph-based semi-supervised setting, in which labels are propagated through the connectivity of the adjacency matrix \(\varvec{W}\). We discuss the relation between the adjacency matrix \(\varvec{W}\) and the underlying manifold in Sect. 4.2.2.
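In matrix form, the objective (6) is simply the squared distance between each point and its degree-normalized weighted neighborhood average. A short sketch of this computation (ours; it assumes every node has at least one edge so that the degrees are nonzero):

```python
import numpy as np

def reconstruction_error(X, W):
    """Local reconstruction error (6): sum_i || x_i - (1/D_ii) sum_j W_ij x_j ||^2."""
    D = W.sum(axis=1)                 # degrees D_ii (assumed nonzero)
    X_hat = (W @ X) / D[:, None]      # locally reconstructed points \hat{x}_i
    return np.sum((X - X_hat) ** 2)
```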

Fig. 1 An illustration of local linear approximation of a manifold. The underlying manifold, which cannot be observed, is approximated by a set of local linear models (also called local linear patches)

3.2 Optimization

To optimize the problem (6), we can use any gradient-based algorithm (such as steepest descent or conjugate gradient) using the following gradient of the objective function with respect to \(\sigma _d\):

$$\begin{aligned} -2 \sum _{i = 1}^n \frac{1}{D_{ii}} ( \varvec{x}_i - \hat{\varvec{x}}_i )^\top \left( \sum _{j} {\frac{\partial W_{ij}}{\partial \sigma _d}} \varvec{x}_j - {\frac{\partial D_{ii}}{\partial \sigma _d}} \hat{\varvec{x}}_i \right) , \end{aligned}$$

where

$$\begin{aligned} \hat{\varvec{x}}_i&= \frac{1}{D_{ii}} \sum _{j} W_{ij} \varvec{x}_j. \end{aligned}$$

The derivatives of \(W_{ij}\) and \(D_{ii}\) are

$$\begin{aligned} {\frac{\partial W_{ij}}{\partial \sigma _d}}&= 2 W_{ij} (x_{id} - x_{jd})^2 \sigma _d^{-3},\\ {\frac{\partial D_{ii}}{\partial \sigma _d}}&= \sum _{j} {\frac{\partial W_{ij}}{\partial \sigma _d}}. \end{aligned}$$

Due to the non-convexity of the objective function, we cannot guarantee convergence to the global optimum, which means that the solution depends on the initial \(\sigma _d\). In our experiments, we employed the well-known median heuristic (e.g., Gretton et al. 2007) for setting the initial values of \(\sigma _d\) (Sect. 6). Another possible strategy is to use a number of different initial values for \(\sigma _d\), at the expense of a higher computational cost.

The gradient can be computed efficiently due to the sparsity of the adjacency matrix. Since the number of edges of a k-NN graph is O(nk), the derivative of the adjacency matrix \(\varvec{W}\) can be calculated in O(nkp), and the entire gradient of the objective function can then be calculated in \(O(nkp^2)\). Note that k often takes a small value such as \(k = 10\).
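Putting the pieces of this section together, the following sketch runs plain gradient descent on (6) with respect to the widths \(\sigma_d\). It is our own illustration under simplifying assumptions: a fixed step size instead of a line search, dense \(O(n^2 p)\) intermediate arrays instead of the sparse computation described above, and initial widths supplied by the caller (e.g., from the median heuristic). In practice a conjugate gradient or quasi-Newton routine could replace the fixed-step loop without changing the gradient computation.

```python
import numpy as np

def aew_fit(X, neighbors, sigma0, lr=1e-2, n_iter=100):
    """Gradient descent on the reconstruction error (6) w.r.t. the widths sigma_d.

    A minimal dense sketch: `neighbors` is a boolean (n, n) k-NN connectivity
    matrix and `sigma0` the initial widths.  In practice one would constrain
    sigma to stay positive (e.g. optimize log sigma) and exploit sparsity.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    sigma = np.asarray(sigma0, dtype=float).copy()
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2        # (n, n, p): (x_id - x_jd)^2
    mask = neighbors.astype(float)

    for _ in range(n_iter):
        W = np.exp(-np.sum(diff2 / sigma ** 2, axis=2)) * mask
        D = W.sum(axis=1)
        X_hat = (W @ X) / D[:, None]
        R = X - X_hat                                    # reconstruction residuals
        # dW_ij/dsigma_d = 2 W_ij (x_id - x_jd)^2 / sigma_d^3
        dW = 2.0 * W[:, :, None] * diff2 / sigma ** 3    # (n, n, p)
        dD = dW.sum(axis=1)                              # (n, p): dD_ii/dsigma_d
        grad = np.zeros(p)
        for d in range(p):
            # d x_hat_i / d sigma_d = (sum_j dW_ij x_j - dD_ii x_hat_i) / D_ii
            dXhat = (dW[:, :, d] @ X - dD[:, d:d + 1] * X_hat) / D[:, None]
            grad[d] = -2.0 * np.sum(R * dXhat)           # chain rule on sum_i ||R_i||^2
        sigma -= lr * grad
    return sigma
```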

3.3 Normalization

The standard similarity function (4) cannot adapt to differences in the local scaling around each data point. These differences may cause a highly imbalanced adjacency matrix, in which \(W_{ij}\) has larger values around high density regions of the input space and much smaller values around low density regions. As a result, labels in the high density regions will be propagated dominantly.

To overcome this problem we use the symmetric normalized graph Laplacian and the local scaling kernel. The symmetric normalized graph Laplacian (hereinafter, normalized Laplacian for short) is defined as

$$\begin{aligned} \varvec{L}= \varvec{I}- \varvec{D}^{-\frac{1}{2}} \varvec{W}\varvec{D}^{-\frac{1}{2}}. \end{aligned}$$

The off-diagonal elements of this matrix are \(- W_{ij} / \sqrt{D_{ii} D_{jj}}\), in which each similarity value is divided by the degrees of the two connected nodes. Another approach is to use the local scaling kernel (Zelnik-Manor and Perona 2004) defined in (5), which can also be seen as a symmetric normalization. Let \(\varvec{S}\) be a diagonal matrix whose diagonal entries are defined as \(S_{ii} = s_i^2\) (\(s_i\) is the local scaling parameter in (5)), and let \(\varvec{\Delta }\) be the scaled distance matrix for \(\varvec{x}\), whose elements are \(\sum _{d} (x_{id} - x_{jd})^2/\sigma _d^2\) if (i, j) has an edge, and 0 otherwise. Then, the local scaling kernel can be represented as

$$\begin{aligned} \varvec{W}= \exp \left( - \varvec{S}^{-\frac{1}{2}} \varvec{\Delta }\varvec{S}^{-\frac{1}{2}} \right) \end{aligned}$$

where \(\exp \) is applied elementwise (note that it does not denote the matrix exponential here). This means that the local scaling kernel normalizes the distances in the exponent of the Gaussian kernel in a manner similar to the symmetric normalized Laplacian. Instead of using the distance to the Kth neighbor to define \(s_i\), the average distance to the nearest neighbors is also used (Jegou et al. 2007).

4 Analytical considerations

In Sect. 3, we defined our approach as the minimization of the local reconstruction error of input features. We here describe several interesting properties and interpretations of this definition.

4.1 Interpretation as feature propagation

First, we show that our objective function can be interpreted as a cross-validation error of the HGF model for the feature vector \(\varvec{x}\) on the graph. Let us divide the set of node indices \(\{1,\ldots ,n\}\) into a training set \(\mathcal{T}\) and a validation set \(\mathcal{V}\). Suppose that we try to predict \(\varvec{x}\) in the validation set \(\{ \varvec{x}_i \}_{i \in \mathcal{V}}\) from the given training set \(\{ \varvec{x}_i \}_{i \in \mathcal{T}}\) and the adjacency matrix \(\varvec{W}\). For this prediction problem, we consider the HGF model for \(\varvec{x}\):

$$\begin{aligned} \min _{ \{ \hat{\varvec{x}}_i \}_{i=1}^n }&\ \sum _{ij} W_{ij} \Vert \hat{\varvec{x}}_i - \hat{\varvec{x}}_j \Vert _2^2 \\ \text {subject to}&\ \hat{\varvec{x}}_i = \varvec{x}_i,\quad \text { for }\; i \in \mathcal{T}, \end{aligned}$$
(7)

where \(\hat{\varvec{x}}_i\) is a prediction for \(\varvec{x}_i\). Here, only the \(\hat{\varvec{x}}_i\) in the validation set \(\mathcal{V}\) are regarded as free variables in the optimization problem, because the others \(\{ \hat{\varvec{x}}_i \}_{i \in \mathcal{T}}\) are fixed at the observed values by the constraint. This can be interpreted as propagating \(\{ \varvec{x}_i \}_{i \in \mathcal{T}}\) to predict \(\{ \varvec{x}_i \}_{i \in \mathcal{V}}\). We call this process feature propagation. The following matrix representation shows that the objective function of feature propagation has the same form as the basic objective function of the graph-based learning methods (2):

$$\begin{aligned} \min _{\hat{\varvec{X}}}&\ \text {trace}\left( \hat{\varvec{X}}^\top \varvec{L}\hat{\varvec{X}}\right) \\ \text {subject to}&\ \hat{x}_{ij} = x_{ij}, \text { for } i \in \mathcal{T}, \end{aligned}$$

where \(\varvec{X}= (\varvec{x}_1, \varvec{x}_2, \ldots \varvec{x}_n)^\top , \hat{\varvec{X}} = (\hat{\varvec{x}}_1, \hat{\varvec{x}}_2, \ldots \hat{\varvec{x}}_n)^\top \), and \(x_{ij}\) and \(\hat{x}_{ij}\) indicate (ij)th entries of \(\varvec{X}\) and \(\hat{\varvec{X}}\) respectively.

When we employ leave-one-out as the cross-validation of the feature propagation model, we obtain

$$\begin{aligned} \sum _{i = 1}^n \Vert \varvec{x}_i - \hat{\varvec{x}}_{-i} \Vert _2^2, \end{aligned}$$
(8)

where \(\hat{\varvec{x}}_{-i}\) is the prediction for \(\varvec{x}_i\) with \(\mathcal{T}= \{ 1, \ldots , i-1, i+1, \ldots , n\}\) and \(\mathcal{V}= \{ i \}\). Due to the local averaging property of HGF (Zhu et al. 2003), we see that \(\hat{\varvec{x}}_{-i} = \sum _{j} W_{ij} \varvec{x}_j / D_{ii}\), and then (8) is equivalent to our objective function (6). From this equivalence, AEW can be interpreted as optimizing the parameters in the graph weights of the HGF model for feature vectors through leave-one-out cross-validation. This also means that our framework estimates labels using the adjacency matrix \(\varvec{W}\) optimized in the feature space instead of the output (label) space. Thus, if input features and labels share the same adjacency matrix (i.e., the same local structure), the minimization of the objective function (6) should yield an adjacency matrix that accurately propagates the labels of the graph nodes.
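The equivalence between the leave-one-out error (8) and the objective (6) can be checked directly: when a single node is held out, the harmonic solution of (7) for that node is its degree-normalized weighted neighborhood average. The sketch below (ours) makes this explicit.

```python
import numpy as np

def loo_feature_propagation_error(X, W):
    """Leave-one-out error (8) of the feature propagation model.

    With a single free node i, minimizing sum_j W_ij ||x_hat_i - x_j||^2 in (7)
    gives x_hat_{-i} = sum_j W_ij x_j / D_ii, so the leave-one-out error
    coincides with the local reconstruction error (6).
    """
    D = W.sum(axis=1)
    err = 0.0
    for i in range(X.shape[0]):
        x_hat = W[i] @ X / D[i]        # prediction of x_i from its neighbors
        err += np.sum((X[i] - x_hat) ** 2)
    return err
```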

4.2 Local linear approximation

The feature propagation model provides an interpretation of our approach as the optimization of the adjacency matrix under the assumption that \(\varvec{x}\) and y can be reconstructed by the same adjacency matrix. Here we justify our approach more formally, from the viewpoint of local reconstruction with a lower dimensional manifold model.

4.2.1 Graph-based learning as local linear reconstruction

First, we show that all of the graph-based methods we consider can be characterized by local reconstruction of the outputs on the graph. The following proposition shows that the three methods reviewed in Sect. 2 have the local reconstruction property:

Proposition 1

Assuming that we use unnormalized Laplacian \(\varvec{L}= \varvec{D}- \varvec{W}\), the optimal solutions of HGF, LLGC and spectral clustering can be represented as the following local averaging forms:

  • HGF:

    $$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii}}\quad \text { for }\;\; i = \ell +1, \ldots , n. \end{aligned}$$
  • LLGC:

    $$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii} + \lambda } + \frac{\lambda Y_{ik}}{D_{ii} + \lambda }\quad \text { for }\; i = 1, \ldots , n. \end{aligned}$$
  • Spectral clustering:

    $$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii} - \rho _k}\quad \text { for }\; i = 1, \ldots , n, \end{aligned}$$

    where \(\rho _k\) is the kth smallest eigenvalue of \(\varvec{L}\).

Proof

For HGF, the same equation was shown in Zhu et al. (2003). We here derive the reconstruction equations for LLGC and spectral clustering.

For LLGC, setting the derivatives to zero, we obtain

$$\begin{aligned} (\varvec{D}- \varvec{W}) \varvec{F}+ \lambda ( \varvec{F}- \varvec{Y}) = {\varvec{0}}. \end{aligned}$$

From this equation, the reconstruction form of LLGC is obtained.

As is well known, the score \(\varvec{F}\) of spectral clustering can be calculated as the eigenvectors of the graph Laplacian \(\varvec{L}\) (von Luxburg 2007) corresponding to the c smallest eigenvalues. We thereby obtain the following equation:

$$\begin{aligned} \varvec{L}\varvec{F}= \varvec{F}\varvec{\mathrm {P}}, \end{aligned}$$

where \(\varvec{\mathrm {P}}\in \mathbb {R}^{c \times c}\) is a diagonal matrix with the diagonal entry \(\mathrm {P}_{ii} = \rho _i\). Substituting \(\varvec{L}= \varvec{D}- \varvec{W}\) into the above equation, we obtain

$$\begin{aligned} \varvec{D}\varvec{F}- \varvec{F}\varvec{\mathrm {P}}= \varvec{W}\varvec{F}. \end{aligned}$$

Since \(\varvec{D}\) and \(\varvec{\mathrm {P}}\) are diagonal matrices, this equation derives the reconstruction form of spectral clustering. \(\square \)

Regarding the optimization problems of the three methods as the minimization of the same score penalty term \(\text {trace}(\varvec{F}^\top \varvec{L}\varvec{F})\) under different regularization strategies, which prevent the trivial solution \(\varvec{F}= {\varvec{0}}\), it is natural that a similar reconstruction form is shared by the three methods. Among the three methods, HGF has the most standard form of local averaging: the output of the ith node is the weighted average over its neighbors connected by the graph edges. LLGC can be interpreted as a regularized variant of this local averaging. The averaging score \(\varvec{W}\varvec{F}\) is regularized by the initial labeling \(\varvec{Y}\), and the balance of the regularization is controlled by the parameter \(\lambda \). Spectral clustering also has a form similar to the local reconstruction; the only difference is that the denominator is modified by an eigenvalue of the graph Laplacian. The eigenvalue \(\rho _k\) of the graph Laplacian has a smaller value when the score matrix \(\varvec{F}\) has a smaller amount of variation on neighboring nodes. Spectral clustering thus has the same local reconstruction form, in particular when the optimal scores have close values for neighboring nodes.
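The local averaging forms in Proposition 1 are easy to verify numerically. The sketch below (our illustration on a small random graph) computes the closed-form LLGC solution of (3) and checks its reconstruction form; an analogous check applies to HGF.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, lam = 8, 3, 0.5
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
L = np.diag(d) - W                              # unnormalized Laplacian
Y = np.eye(c)[rng.integers(0, c, size=n)]       # random one-hot label matrix

# closed-form LLGC solution of (3)
F = lam * np.linalg.solve(L + lam * np.eye(n), Y)

# Proposition 1 (LLGC): F_ik = (sum_j W_ij F_jk + lam * Y_ik) / (D_ii + lam)
F_avg = (W @ F + lam * Y) / (d[:, None] + lam)
assert np.allclose(F, F_avg)
```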

4.2.2 Error analysis

Proposition 1 shows that the graph-based learning algorithms can be regarded as local reconstruction methods. We next show the relationship between the local reconstruction error in the feature space, described by our objective function (6), and that in the output space. For simplicity we consider the vector form of the score function \(\varvec{f}~\in ~\mathbb {R}^n\), which can be considered as a special case of the score matrix \(\varvec{F}\); the discussion here applies to \(\varvec{F}\) as well.

We assume the following manifold model for the input feature space, in which \(\varvec{x}\) is generated from some corresponding lower dimensional variable \(\varvec{\tau }\in \mathbb {R}^q\):

$$\begin{aligned} \varvec{x}= g(\varvec{\tau }) + \varepsilon _x, \end{aligned}$$

where \(g\,{:}\,\mathbb {R}^q \rightarrow \mathbb {R}^p\) is a smooth function, and \(\varepsilon _x \in \mathbb {R}^p\) represents noise. In this model, y is also represented by some function form of \(\varvec{\tau }\):

$$\begin{aligned} y = h(\varvec{\tau }) + \varepsilon _y, \end{aligned}$$

where \(h\,{:}\,\mathbb {R}^q \rightarrow \mathbb {R}\) is a smooth function, and \(\varepsilon _y \in \mathbb {R}\) represents noise. For simplicity, we here consider a continuous output \(y \in \mathbb {R}\) rather than discrete labels, and thus the latent function \(h(\varvec{\tau })\) is also defined as a continuous function. Our subsequent analysis is applicable to the discrete label case by introducing an additional intermediate score variable, as we will see at the end of this section. For this model, the following theorem shows the relationship between the reconstruction error of the feature vector \(\varvec{x}\) and that of the output y:

Theorem 1

Suppose \(\varvec{x}_i\) can be approximated by its neighbors as follows

$$\begin{aligned} \varvec{x}_i&= \frac{1}{D_{ii}} \sum _{j \sim i} W_{ij} \varvec{x}_j + \varvec{e}_i, \end{aligned}$$
(9)

where \(\varvec{e}_i \in \mathbb {R}^p\) represents an approximation error. Then, the same adjacency matrix reconstructs the output \(y_i \in \mathbb {R}\) with the following error:

$$\begin{aligned} y_i - \frac{1}{D_{ii}} \sum _{j \sim i} W_{ij} y_j&= \varvec{J}\varvec{e}_i + O( \delta \varvec{\tau }_i ) + O(\varepsilon _x + \varepsilon _y), \end{aligned}$$
(10)

where

$$\begin{aligned} \varvec{J}= {\frac{\partial h(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \left( {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \right) ^{+} \end{aligned}$$

where the superscript \(^+\) indicates the pseudoinverse, and \(\delta \varvec{\tau }_i = \max _{j} (\Vert \varvec{\tau }_i -\varvec{\tau }_j \Vert _2^2)\).

Proof

Let \(\beta _j = W_{ij} / D_{ii}\) (note that then \(\sum _{j \sim i} \beta _j = 1\)). Assuming that g is smooth enough, we obtain the following first-order Taylor expansion at \(\varvec{\tau }_i\) for the right hand side of (9):

$$\begin{aligned} \varvec{x}_i =&\sum _{j \sim i} \beta _j \left( g(\varvec{\tau }_i) + {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} (\varvec{\tau }_j - \varvec{\tau }_i) + O(\Vert \varvec{\tau }_j - \varvec{\tau }_i \Vert _2^2) \right) \\&+ \varvec{e}_i + O(\varepsilon _x), \end{aligned}$$

Rearranging this equation, we obtain

$$\begin{aligned} {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \sum _{j \sim i} \beta _j (\varvec{\tau }_j - \varvec{\tau }_i)&= -\varvec{e}_i + O( \delta \varvec{\tau }_i ) + O(\varepsilon _x). \end{aligned}$$

If the Jacobian matrix \({\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }}\) has full column rank, we obtain

$$\begin{aligned} \sum _{j \sim i} \beta _j (\varvec{\tau }_j - \varvec{\tau }_i) =&- \left( {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \right) ^{+} \varvec{e}_i + O( \delta \varvec{\tau }_i ) + O(\varepsilon _x). \end{aligned}$$
(11)

On the other hand, we can see

$$\begin{aligned} \sum _{j \sim i} \beta _j y_j =&\sum _{j \sim i} \beta _j \Bigl ( h(\varvec{\tau }_i) + {\frac{\partial h(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} (\varvec{\tau }_j - \varvec{\tau }_i) + O(\Vert \varvec{\tau }_j - \varvec{\tau }_i \Vert _2^2) \Bigr ) \nonumber \\&+ O(\varepsilon _y) \nonumber \\ =&y_i + {\frac{\partial h(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \sum _{j \sim i} \beta _j (\varvec{\tau }_j - \varvec{\tau }_i) + O( \delta \varvec{\tau }_i ) + O(\varepsilon _y) \end{aligned}$$
(12)

Substituting (11) into (12), we obtain

$$\begin{aligned} y_i - \sum _{j \sim i} \beta _j y_j =&{\frac{\partial h(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \left( {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \right) ^{+} \varvec{e}_i \\&+ O( \delta \varvec{\tau }_i ) + O(\varepsilon _x + \varepsilon _y). \end{aligned}$$

\(\square \)

From (10), we can see that the reconstruction error of \(y_i\) consists of three terms. The first term includes the reconstruction error for \(\varvec{x}_i\), represented by \(\varvec{e}_i\), and the second term is the distance between \(\varvec{\tau }_i\) and \(\{ \varvec{\tau }_j \}_{j \sim i}\). Minimizing our objective function corresponds to reducing the first term, which means that reconstruction weights estimated in the input space provide an approximation of reconstruction weights in the output space. The first two terms in (10) have a trade-off relationship: we can reduce \(\varvec{e}_i\) by using more data points \(\varvec{x}_j\), but then \( \delta \varvec{\tau }_i\) would increase. The third term is the intrinsic noise, which we cannot directly control.

A simple approach to exploit this theorem would be a regularization formulation, which minimizes a combination of the reconstruction error for \(\varvec{x}\) and a penalty on the distances between data points connected by the edges. Regularized LLE (Wang et al. 2008; Cheng et al. 2009; Elhamifar and Vidal 2011; Kong et al. 2012) can be interpreted as one realization of such an approach. However, in the context of semi-supervised and un-supervised learning, selecting an appropriate value of the regularization parameter for such a term is difficult. We therefore optimize edge weights through the parameters of a similarity function, especially the bandwidth parameter of the Gaussian similarity function \(\sigma \). In this approach, a very large bandwidth (giving large weights to distant data points) may cause a large reconstruction error, while an extremely small bandwidth does not give enough weight to the neighbors needed for reconstruction.

For the symmetric normalized graph Laplacian, we cannot apply Theorem 1 to our algorithm directly. For example, in HGF, the local averaging relation for the normalized Laplacian is \(f_i = \sum _{j \sim i} W_{ij} f_j / \sqrt{D_{ii} D_{jj}}\). The following theorem is the normalized counterpart of Theorem 1:

Theorem 2

Suppose that \(\varvec{x}_i\) can be approximated by its neighbors as follows

$$\begin{aligned} \varvec{x}_i&= \sum _{j \sim i} \frac{W_{ij}}{\sqrt{D_{ii} D_{jj}}} \varvec{x}_j + \varvec{e}_i, \end{aligned}$$
(13)

where \(\varvec{e}_i \in \mathbb {R}^p\) represents an approximation error. Then, the same adjacency matrix reconstructs the output \(y_i \in \mathbb {R}\) with the following error:

$$\begin{aligned} y_i - \sum _{j \sim i} \frac{ W_{ij}}{\sqrt{D_{ii} D_{jj}}} y_j&= \left( 1 - \sum _{j \sim i} \gamma _j\right) (h(\varvec{\tau }_i) + \varvec{J}g(\varvec{\tau }_i)) \nonumber \\&\quad +\varvec{J}\varvec{e}_i + O( \delta \varvec{\tau }_i ) + O(\varepsilon _x \! + \! \varepsilon _y), \end{aligned}$$
(14)

where

$$\begin{aligned} \gamma _j = \frac{W_{ij}}{\sqrt{D_{ii} D_{jj}}}. \end{aligned}$$

Proof

The proof is almost the same as that of Theorem 1. However, the sum of the coefficients \(\gamma _j\) (corresponding to \(\beta _j\) in Theorem 1) is not necessarily 1. Applying the same Taylor expansion to the right hand side of (13), we obtain

$$\begin{aligned} {\frac{\partial g(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \sum _{j \sim i} \gamma _j (\varvec{\tau }_j - \varvec{\tau }_i) =&-\varvec{e}_i + (1 - \sum _{j \sim i} \gamma _j) g(\varvec{\tau }_i)\\&+ O( \delta \varvec{\tau }_i ) + O(\varepsilon _x). \end{aligned}$$

On the other hand, applying Taylor expansion to \(y_i - \sum _{j \sim i} \gamma _j y_j\), we obtain

$$\begin{aligned} y_i - \sum _{j \sim i} \gamma _j y_j =&(1 \! - \! \sum _{j \sim i} \gamma _j) h(\varvec{\tau }_i) - {\frac{\partial h(\varvec{\tau }_i)}{\partial \varvec{\tau }^\top }} \sum _{j \sim i} \gamma _j (\varvec{\tau }_j - \varvec{\tau }_i)\\&+ O(\delta \varvec{\tau }_i ) + O(\varepsilon _y) \end{aligned}$$

Using the above two equations, we obtain (14). \(\square \)

Although Theorem 2 has a form similar to Theorem 1, we prefer Theorem 1 to Theorem 2 for the following reasons:

  • Since the sum of the reconstruction coefficients \(W_{ij}/\sqrt{D_{ii}D_{jj}}\) is no longer 1, the interpretation of local linear patch cannot be applied to (13).

  • The reconstruction error (objective function) led by Theorem 2 (i.e., \(\varvec{x}_i - \sum _{j \sim i} {W_{ij} \varvec{x}_j}/{\sqrt{D_{ii} D_{jj}}}\)) results in a more complicated optimization.

  • The error Eq. (14) in Theorem 2 has the additional term \((1 - \sum _{j \sim i} \gamma _j ) (h(\varvec{\tau }_i) + \varvec{J}g(\varvec{\tau }_i))\) compared to Theorem 1.

However, the symmetric normalized graph Laplacian has often been preferred to the unnormalized one in many papers due to its practical performance advantage. For example, in semi-supervised learning, balancing the degrees of the nodes prevents a small fraction of nodes from having a dominant effect on label propagation. Another example is in the context of spectral clustering, for which better convergence conditions were proved for the symmetric normalized graph Laplacian than for the unnormalized one by von Luxburg et al. (2008). We thus also use the symmetric normalized graph Laplacian in our experiments, though we do not change the reconstruction objective function (6). Note that the normalization using the local scaling kernel does not affect Theorem 1.

We have considered continuous outputs for simplicity, but our analysis can be extended to discrete outputs by redefining our output model to incorporate an intermediate label score variable f:

$$\begin{aligned} f = h(\varvec{\tau }) + \varepsilon _y. \end{aligned}$$

The discrete label can be determined through f, following standard machine learning classification rules (for example, in binary classification: \(y = 1\) if \(f > 0.5\), and \(y = 0\) if \(f < 0.5\)). According to the manifold assumption on this model, the label score f should have similar values for close data points on the manifold because the labels do not change drastically within a local region.

5 Related topics

We here describe relations of our approach with other related topics.

5.1 Relation to LLE

The objective function (6) is similar to the local reconstruction error of LLE (Roweis and Saul 2000), in which \(\varvec{W}\) is directly optimized as a real valued matrix. This approach has been used in many methods for graph-based semi-supervised learning and clustering (Wang and Zhang 2008; Daitch et al. 2009; Cheng et al. 2009; Liu et al. 2010), but LLE is very noise-sensitive (Chen and Liu 2011), and the resulting weights \(W_{ij}\) do not necessarily represent the similarity between the corresponding nodes (i, j). For example, for two nearly identical points \(\varvec{x}_{j_1}\) and \(\varvec{x}_{j_2}\), both connected to \(\varvec{x}_i\), it is not guaranteed that \(W_{ij_1}\) and \(W_{ij_2}\) have similar values. To solve this problem, a regularization term can be introduced (Saul and Roweis 2003), but it is not easy to optimize the regularization parameter for this term. We instead optimize the parameters of the similarity (kernel) function. This parameterized form of edge weights can alleviate the over-fitting problem, and the optimized weights obviously still represent node similarity.

5.2 Other hyper-parameter optimization strategies

AEW optimizes the parameters of graph edge weights without labeled instances. This property is powerful, especially when only a few (or no) labeled instances are available. Although several methods have been proposed for optimizing graph edge weights with standard model selection approaches (such as cross-validation and marginal likelihood maximization) by regarding them as usual hyper-parameters in supervised learning (Zhu et al. 2005; Kapoor et al. 2006; Zhang and Lee 2007; Muandet et al. 2009), most of those methods need labeled instances and become unreliable when few labels are available. Another approach is optimizing a criterion designed specifically for each graph-based algorithm (e.g., Ng et al. 2001; Zhu et al. 2003; Bach and Jordan 2004). Some of these criteria, however, have degenerate (trivial) solutions; heuristics have been proposed to prevent such solutions, but their validity is unclear. Compared to these approaches, ours is more general and flexible with respect to problem settings, because AEW is independent of the number of classes (clusters), the number of labels, and the subsequent learning algorithm (the third step). In addition, model selection based approaches basically target the third step in the three-step procedure, so AEW can be combined with them; for example, the graph optimized by AEW can be used as their input graph.

5.3 Graph construction

Besides k-NN, several methods have been proposed for generating a graph (edges) from feature vectors (e.g., Talukdar 2009; Jebara et al. 2009; Liu et al. 2010). Our approach can also be applied to those graphs because AEW only optimizes the weights of edges. In our experiments, we used the edges of the k-NN graph as the initial graph of AEW. We then observed that AEW is not sensitive to the choice of k, compared with usual k-NN graphs. This is because, in minimizing the reconstruction error (6), the Gaussian similarity value becomes small if \(\varvec{x}_i\) and \(\varvec{x}_j\) are not close to each other. In other words, redundant weights can be reduced drastically, because the Gaussian kernel weights decay exponentially with the squared distance.

In the context of spectral clustering, connections to statistical properties of graph construction have been analyzed (Maier et al. 2009, 2013). For example, Maier et al. (2013) have shown conditions under which the optimal convergence rate is achieved for a few types of graphs. These studies also indicate that the quality of the input graph affects the final performance of learning methods.

5.4 Discussion on manifold assumption

A key assumption of our approach is the manifold assumption, which has been widely accepted in semi-supervised learning (e.g., see Chapelle et al. 2010). Under the manifold assumption, the input data are assumed to lie on a lower-dimensional manifold compared to the original input space. Although accurately verifying the manifold assumption itself would be difficult (because it is equivalent to estimating the intrinsic dimensionality of the input data), the graph-based approach is known as a practical approximation of the underlying manifold that is applicable without knowing that dimensionality. Many empirical evaluations have revealed that manifold assumption-based approaches (most of them graph-based) achieve high accuracy in various applications, particularly for image and text data (see e.g., Patwari and Hero 2004; Lee and Kriegman 2005; Zhang and Zha 2005; Fergus et al. 2009; Chapelle et al. 2010; Aljabar et al. 2012). In these applications, the manifold assumption is reasonable, as implied by prior knowledge of the given data (e.g., in face image classification, images of each person lie on a different low-dimensional manifold in pixel space).

Another important assumption in most graph-based semi-supervised learning is the cluster assumption (or low density separation), in which different classes are assumed to be separated by a low-density region. Graph-based approaches assume that nodes in the same class are densely connected while nodes in different classes are not. If different classes are not separated by a low-density region, a nearest-neighbor graph may connect different classes, which can cause misclassification by propagating wrong class information. Several papers have addressed this problem (Wang et al. 2011; Gong et al. 2012) by considering singular points at which manifolds of different classes intersect. Their approach measures the similarity of two instances through the similarity of their tangent spaces, but this requires accurately modeling both the local tangent spaces and their similarity measure, which introduces additional parameters and estimation errors. We perform an experimental evaluation of this approach in Sect. 6.2.1.

6 Experiments

We evaluate the performance of our approach using synthetic and real-world datasets. AEW is applicable to all of the graph-based learning methods reviewed in Sect. 2. We investigated the performance of AEW using the harmonic Gaussian field (HGF) model and the local and global consistency (LLGC) model in semi-supervised learning, and using spectral clustering (SC) in un-supervised learning. For comparison in semi-supervised learning, we used linear neighborhood propagation (LNP) (Wang and Zhang 2008), which generates a graph using an LLE based objective function. LNP has two regularization parameters, one for the LLE process (the first and second steps in the three-step procedure) and the other for the label estimation process (the third step). For the parameter in the LLE process, we used the heuristics suggested by Saul and Roweis (2003), and for the label propagation process, we chose the best parameter value in terms of test accuracy. LLGC also has a regularization parameter in the propagation process (3), and we again chose the best one. These choices remove the effect of model selection and allow us to compare the quality of the graphs directly. HGF does not have such hyper-parameters. All experimental results were averaged over 30 runs with randomly sampled data points.

6.1 Synthetic datasets

Using the simple synthetic datasets in Fig. 2, we illustrate the advantage of AEW by comparing prediction performance in the semi-supervised learning scenario. The two datasets in Fig. 2 have the same form, but Fig. 2b has several noisy data points which may become bridge points (points that connect different classes, as defined by Wang and Zhang 2008). In both cases, the number of classes is 4 and each class has 100 data points (thus, \(n = 400\)).

Table 1 shows the error rates for the unlabeled nodes of HGF and LNP under 0–1 loss. For HGF, we used the median heuristic to choose the parameter \(\sigma _d\) in the similarity function (4), meaning that a common \(\sigma \) (\(= \sigma _1 = \cdots = \sigma _p\)) is set as the median distance between all connected pairs of \(\varvec{x}_i\); as the normalization of the graph Laplacian, the symmetric normalization was used. The optimization of AEW started from this median \(\sigma _d\). The results by AEW are shown in the column ‘AEW + HGF’ of Table 1. The number of labeled nodes was 10 in each class (\(\ell = 40\), i.e., 10% of the entire datasets), and the number of neighbors in the graphs was set as \(k = 10\) or 20.

In Table 1, we see that HGF with AEW achieved better prediction accuracy than the median heuristic and LNP in all cases. Moreover, for both datasets (a) and (b), AEW was the most robust against changes in the number of neighbors k. This is because \(\sigma _d\) is automatically adjusted in such a way that the local reconstruction error is minimized, and weights for connections between different manifolds are thereby reduced. Although LNP also minimizes the local reconstruction error, LNP may connect data points far from each other if doing so reduces the reconstruction error.

Fig. 2 Synthetic datasets

Table 1 Test error comparison for synthetic datasets

Figure 3 shows the graphs generated by (a) k-NN, (b) AEW, and (c) LNP, under \(k = 20\) for the dataset of Fig. 2a. In this figure, the k-NN graph connects many nodes in different classes, while AEW favorably eliminates those undesirable edges. LNP also has fewer edges between different classes than k-NN, but it still connects different classes. We can see that AEW reveals the class structure more clearly, which can lead to better prediction performance of the subsequent learning algorithms.

Fig. 3 Resultant graphs for the synthetic dataset of Fig. 2a \((k = 20)\)

6.2 Real-world datasets

We here examine the performance of our proposed approach on the eight popular datasets shown in Table 2, namely COIL (COIL-20) (Nene et al. 1996), USPS (a preprocessed version from Hastie et al. 2001), MNIST (LeCun et al. 1998), ORL (Samaria and Harter 1994), Vowel (Asuncion and Newman 2007), Yale (Yale Face Database B) (Georghiades et al. 2001), optdigit (Asuncion and Newman 2007), and UMIST (Graham and Allinson 1998), for both semi-supervised learning and clustering tasks.

Table 2 List of datasets

6.2.1 Semi-supervised learning

First, we show the results in the semi-supervised learning scenario using HGF. We evaluated three variants of the HGF model (three different normalizations), which are listed in the following:

  • N-HGF. ‘N’ indicates normalized Laplacian, and the similarity function was (4) (not locally scaled).

  • L-HGF. ‘L’ indicates local scaling kernel (5), and the graph Laplacian was not normalized.

  • NL-HGF. ‘NL’ indicates that both the normalized Laplacian and the local scaling kernel were used.

For all three variants, the median heuristic was used to set \(\sigma _d\) (for the local scaling kernel, we used the median of the scaled distance \(\Vert \varvec{x}_i - \varvec{x}_j\Vert ^2_2 / s_i s_j\)).

Fig. 4 Performance comparison using HGF on real-world datasets. For N-HGF, L-HGF, and NL-HGF, shown as dashed lines, ‘N’ indicates normalized Laplacian, ‘L’ indicates local scaling kernel, and ‘NL’ means that both normalized Laplacian and local scaling kernel are applied. HGFs with AEW are shown as solid lines with markers, while LNP is shown as a solid line without markers

Figure 4 shows the test error for unlabeled nodes. In this figure, the three dashed lines with different markers are N-HGF, L-HGF and NL-HGF, while the three solid lines with the same markers are the corresponding HGFs with AEW. The performance difference among the variants of HGF was not large compared to the effect of AEW, particularly in COIL, ORL, Vowel, Yale, and UMIST. We can rather see that AEW substantially improved the prediction accuracy of HGF in most cases. LNP is shown as the solid line without markers. LNP outperformed HGF (without AEW, shown as the dashed lines) in COIL, ORL, Vowel, Yale and UMIST, while HGF with AEW (at least one of the three variants) achieved better performance than LNP in all these datasets except Yale (in Yale, LNP and HGF with AEW attained similar accuracy). We further compared AEW with the following two baselines of hyper-parameter optimization approaches (both employing the normalized Laplacian (NL) with the local scaling kernel):

  1. Minimization of entropy (Zhu et al. 2003) with normalized Laplacian (NL-MinEnt): NL-MinEnt is represented by solid lines with the star-shaped marker. We set the smoothing factor parameter \(\varepsilon \) in NL-MinEnt, which prevents the Gaussian width parameter from converging to 0, as \(\varepsilon = 0.01\), and if the solution still converges to such a trivial solution, we increase \(\varepsilon \) by multiplying it by 10. The results of NL-MinEnt were unstable. Although MinEnt stably improved the accuracy for all numbers of labels in the COIL dataset, NL-MinEnt sometimes largely deteriorated the accuracy, for example, for several numbers of labeled instances in the ORL dataset.

  2. Cross-validation with normalized Laplacian (NL-CV): We employed 5-fold cross-validation and the method proposed by Zhang and Lee (2007), which imposes an additional penalty on the CV error, defined as the deviation of the Gaussian width parameter from a pre-defined value \(\tilde{\sigma }\). We used the median heuristic for \(\tilde{\sigma }\), and selected an additional regularization parameter for hyper-parameter optimization from \(\{ 0.001, 0.01, 0.1 \}\) by 5-fold CV with different data partitioning (note that the CV approach cannot be applied when there is only one labeled instance for each class).

Compared to NL-HGF, Fig. 4 shows no drastic change for NL-CV under this setting with few labeled instances. For example, in the Yale dataset, although CV improved the accuracy for all numbers of labeled instances, the change was small compared to AEW.

Overall, AEW-NL-HGF had the best prediction accuracy; typical examples are USPS and MNIST. Although Theorem 1 holds exactly only for AEW-L-HGF among all methods, we can see that AEW-NL-HGF, which is scaled for both the Euclidean distance in the similarity and the degrees of the graph nodes, had highly stable performance. This result suggests that balancing node degrees (by the normalized Laplacian) is practically advantageous for stabilizing the propagation of label information.

For further performance evaluation, we compared our approach with other state-of-the-art methods available for semi-supervised learning. We used the following two methods:

  1. Spectral multi-manifold clustering (SMMC) (Wang et al. 2011): Wang et al. (2011) proposed to generate a similarity matrix using the similarity of tangent spaces of two instances. One of their main claims is that SMMC can handle intersections of manifolds, which are difficult to deal with using standard nearest-neighbor graph methods, as mentioned in Sect. 5.4. Although SMMC is proposed for graph construction in spectral clustering, the resulting graph can be used for semi-supervised learning directly. We used the implementation by the authors, and a recommended parameter setting in which the number of local probabilistic principal components analysis (PPCA) models is \(M = \lceil n/(10d) \rceil \), the number of neighbors is \(k = 2 \lceil \log (n) \rceil \), and the parameter of the exponential function in the similarity is \(o = 8\), where d is the reduced dimension of PPCA, set as \(d = 10\) except for the Vowel dataset (\(d = 5\) was used for the Vowel dataset because its original dimension is 10). Due to the high computational complexity of SMMC, which is at least \(O(n^3 + n^2 d p^2)\) (see the paper for details), we used those default settings instead of performing cross-validation.

  2. Measure propagation (MP) (Subramanya and Bilmes 2011): MP estimates class assignment probabilities using a Kullback–Leibler (KL) divergence based criterion. We used the implementation provided by the authors. Subramanya and Bilmes (2011) proposed an efficient alternating minimization procedure, which enables cross-validation based model selection in our experiments. We tuned three parameters of MP: two regularization parameters \(\mu \) and \(\nu \), and a kernel width parameter. The regularization parameters were selected from \(\mu \in \{ 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 1 \}\) and \(\nu \in \{ 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 1 \}\). For the kernel function, we employed the local scaling kernel, in which the width parameters were set as \(\sigma _d = 10^a \sigma \), where \(\sigma \) is the value determined by the median heuristic and the parameter a was selected by cross-validation from 5 values taken uniformly from \([-1,1]\). When we could not perform cross-validation because of the lack of labeled instances, we used \(\mu = 10^{-2}, \nu = 10^{-2}\), and \(a = 0\).

We here employed AEW-NL-HGF as a representative of our approach (the results of AEW-NL-HGF are the same as in Fig. 4). Figure 5 shows the comparison with the above two semi-supervised learning methods. We can see that AEW-NL-HGF had clearly better performance except for the optdigit dataset, in which MP had prediction accuracy similar to AEW-NL-HGF. Our approach also outperformed SMMC, which considers possible intersections of manifolds.

Fig. 5 Performance comparison with other semi-supervised methods

Table 3 Test error rate comparison using LLGC (averages on 30 runs and their standard deviation)

Next, we used the LLGC model for comparison. Table 3 shows the test error rates for the eight datasets in Table 2. Here again, we can see that AEW improved the test error rates of LLGC, and AEW-L-LLGC showed the best performance in all cases except one case on MNIST. In this table, the symbol ‘\(^*\)’ means that AEW improved the accuracy over the corresponding method without AEW (shown in one of the left three columns) in terms of a t-test at the 5% significance level. AEW improved the prediction performance of LLGC except for Vowel in one setting, and all three methods with AEW outperformed LNP in all 24 (\(= 3 \times 8\)) cases with only one exception.

In Table 3, AEW-L-LLGC or AEW-NL-LLGC (i.e., AEW with the local scaling kernel) achieved the lowest test error rates in 22 out of all 24 cases. This result suggests that incorporating differences in local scaling is important for real-world datasets. We can also see that AEW-NL-LLGC, for which Theorem 1 does not hold exactly, was the best in 14 cases (highlighted by boldface) and the second best in 9 cases out of all 24 cases.

We further examined the effect of k. Figure 6 shows the test error rates for \(k = 20\) and 10, using NL-HGF, AEW-NL-HGF, and LNP. The number of labeled instances in each dataset is the middle value on the horizontal axis of Fig. 4. We can see that AEW was the most robust among the three methods, meaning that its test error rate was not sensitive to k. In all cases, the median performance (the horizontal line in each box) of NL-HGF with \(k = 20\) was worse than that with \(k = 10\). On the other hand, AEW-NL-HGF with \(k = 20\) had a median similar to (or clearly better than) that with \(k = 10\) in each case. The performance of LNP also deteriorated substantially when \(k=20\) in some cases, such as Vowel and optdigit.

Fig. 6 Comparison of test error rates for \(k = 20\) and \(k = 10\). The two boxplots of each method correspond to \(k = 10\) on the left (with a smaller width) and \(k = 20\) on the right (with a larger width). The number of labeled instances in each class is denoted by \(\ell /c\)

Table 4 Clustering performance comparison using ARI (averages on 30 runs and their standard deviation)

6.2.2 Clustering

We also demonstrate the effectiveness of AEW in an un-supervised learning scenario, specifically clustering. Table 4 shows the adjusted Rand index (ARI) (Hubert and Arabie 1985) of each of the nine compared methods. ARI extends the Rand index (Rand 1971), which evaluates the accuracy of clustering results based on pairwise comparisons, in such a way that the expected value of ARI under random labeling is 0 (and the maximum is 1). We employed spectral clustering (SC) as the graph-based clustering method to which AEW is applied, and for comparison, we used sparse manifold clustering and embedding (SMCE) (Elhamifar and Vidal 2011), k-means clustering, and kernel k-means clustering. SMCE generates a graph using an LLE based objective function. The difference from LLE is that SMCE prevents connections between different manifolds by penalizing edge weights with a weighted \(L_1\) norm according to the distances between node pairs. However, as is often the case with LLE, there is no reliable way to select the regularization parameter. For comparison, we further used SMMC again here, because a graph created by this method is also applicable to spectral clustering; the same parameter settings as in the semi-supervised case were used.

In Table 4, we can see that either AEW-N-SC or AEW-NL-SC achieved the best ARI for all datasets except Vowel. These two methods (AEW-N-SC and AEW-NL-SC) significantly improved the performance of SC with k-NN graphs on four datasets, i.e. COIL, USPS, Yale, and UMIST. L-SC and AEW-L-SC, i.e. SC and SC with AEW, both using the unnormalized graph Laplacian, were not comparable to their counterparts with the normalized graph Laplacian. AEW-L-SC significantly improved the performance of L-SC on three datasets, while AEW-L-SC was not comparable to AEW-N-SC and AEW-NL-SC. These results suggest that the normalization of the Laplacian is more important for SC than for label propagation. As a result, AEW-NL-SC showed the most stable performance among the three methods with AEW. On Vowel, k-means and kernel k-means achieved the best performance, suggesting that the lower-dimensional manifold model assumed in Theorem 1 is not suitable for this dataset. SMMC achieved performance comparable to the competing methods only on MNIST; the difficulty of this method would be the parameter tuning. Overall, however, we emphasize that spectral clustering with AEW achieved the best performance for all datasets except Vowel.

7 Conclusions

We have proposed the adaptive edge weighting (AEW) method for graph-based learning algorithms such as label propagation and spectral clustering. AEW is based on the minimization of the local reconstruction error under the constraint that each edge weight has the functional form of a similarity between the corresponding pair of nodes. Due to this constraint, AEW has several advantages over LLE based approaches, which have a similar objective function: for example, the noise sensitivity of LLE can be alleviated by the parameterized form of the edge weights, and the similarity form of the edge weights is very reasonable for many graph-based methods. We also provided several properties of AEW by which our objective function can be motivated analytically. Experimental results demonstrated that AEW can improve the performance of graph-based algorithms substantially, and we also saw that AEW outperformed LLE based approaches in almost all cases.