1 Introduction

There is nowadays an increasing interest in the study of graph-based data, either because the information is directly available as a network or a graph, or because the data points are assumed to be sampled from a low-dimensional manifold whose structure is estimated by constructing a weighted graph with the data points as vertices. Moreover, fitting a function on the nodes of a graph, as a regression or a classification problem, can be a useful tool, for example, to cluster the nodes by using some partial knowledge about the partition and the structure of the graph itself.

In this paper, given some labelled data points and several other unlabelled ones, we consider the problem of predicting the class label of the latter. Following the manifold learning framework, the data are supposed to lie on a manifold embedded in a high dimensional space, or to constitute a graph by themselves. In the first case, the usual assumption is that the classes are separated by low density regions, whereas in the second case it is that the connectivity is weaker between classes than inside each of them (Chapelle et al. 2010). On the other hand, the robustness of semi-supervised learning methods and their behaviour in the presence of noise, in this case simply wrongly labelled data, has been recently discussed in Gleich and Mahoney (2015), where a robustification method was introduced. Notice that, outside the manifold learning framework, the literature regarding label noise is extensive, e.g. in Natarajan et al. (2013) the loss functions of classification models are modified to deal with label noise, whereas in Liu and Tao (2016) a reweighting scheme is proposed. More recently, in Vahdat (2017) the label noise is modelled as part of a graphical model in a semi-supervised context. There also exist a number of deep learning approaches for graph-based semi-supervised learning [see e.g. Yang et al. (2016)].

We propose here a different optimization problem, based on a concave error function, which is especially well-suited when the number of available labels is small and which can deal with flipped labels naturally. The major contributions of our work are as follows:

  1. (i)

    We propose a manifold learning method phrased as an optimization problem which is robust to label noise. While many other graph-based methods involve a regression-like loss function, our loss function intuitively corresponds to a classification loss akin to the well-known hinge loss used in Support Vector Machines.

  2. (ii)

    We prove that, although the loss function is concave, the optimization problem remains convex provided that the positive trade-off parameter is smaller than the second smallest eigenvalue of the normalized Laplacian of the graph.

  3. (iii)

    Computationally, the solution of the classification problem is simply given by solving a linear system, whose conditioning is described.

  4. (iv)

    In the case of Euclidean data, we present an out-of-sample extension of this method, which allows the prediction to be extended to unseen data points.

  5. (v)

    We present a heuristic method to automatically set the single hyper-parameter, yielding a parameter-free approach.

Let us also emphasize that the method proposed in this paper can be naturally phrased in the framework of kernel methods, as a function estimation in a Reproducing Kernel Hilbert Space. Indeed, the corresponding kernel is then given by the Moore-Penrose pseudo-inverse of the normalized Laplacian. In this sense, this work can be seen as a continuation of Alaíz et al. (2018).

The paper is structured as follows. Section 2 introduces the context of the classification task and it reviews two state-of-the-art methods for solving it. In Sect. 3 we introduce our proposed robust approach, which is numerically compared with the others in Sect. 4. The paper ends with some conclusions in Sect. 5.

2 Classification of graph-based data

2.1 Preliminaries

The datasets analysed in this paper constitute the nodes \(\mathcal {V}\) of a connected graph \(\mathcal {G}= \left( \mathcal {V},\mathcal {E} \right) \), where the undirected edges \(\mathcal {E}\) are given as a symmetric weight matrix W with non-negative entries. This graph can be obtained in different settings, e.g.:

  • Given a set of data points \(\left\{ x_i \right\} _{i = 1}^N\), with \(x_i \in \mathbb {R}^d\) and a positive kernel \(k\left( x, y \right) \ge 0\), the graph weights can be defined as \(w_{ij} = k\left( x_i,x_j \right) \).

  • Given a set of data points \(\left\{ x_i \right\} _{i = 1}^N\), with \(x_i \in \mathbb {R}^d\), the weights are constructed as follows: \(w_{ij} = 1\) if j is among the k nearest neighbours of i for the \(\ell _2\)-norm, and \(w_{ij} = 0\) otherwise. Then, the weight matrix W is symmetrized as \(\left( W + W^\intercal \right) / 2\) (a sketch of these two Euclidean constructions is given after this list).

  • The dataset is already given as a weighted undirected graph.
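For concreteness, the following sketch (in Python with NumPy/SciPy; the function names are ours, not part of the paper) illustrates the first two Euclidean constructions, namely the Gaussian-kernel weights and the symmetrized k-nearest-neighbour graph.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_weights(X, sigma=1.0):
    """w_ij = k(x_i, x_j) with a Gaussian kernel of bandwidth sigma."""
    D2 = cdist(X, X, metric="sqeuclidean")
    return np.exp(-D2 / (2.0 * sigma ** 2))

def knn_weights(X, k=10):
    """w_ij = 1 if j is among the k nearest neighbours of i (l2-norm), then symmetrized."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)              # a node is not its own neighbour
    W = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, :k]       # k nearest neighbours of each node
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, idx.ravel()] = 1.0
    return (W + W.T) / 2.0                   # symmetrization (W + W^T) / 2
```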

Some nodes are labelled by \(\pm 1\) and we denote by \(\mathcal {V}_\mathrm {L}\subset \mathcal {V}\) the set of labelled nodes. For simplicity, we identify \(\mathcal {V}_\mathrm {L}\) with \(\left\{ 1,\ldots , s \right\} \) and \(\mathcal {V}\) with \(\left\{ 1,\ldots , N \right\} \), with \(s < N\) the number of available labels. Any labelled node \(i \in \mathcal {V}_\mathrm {L}\) has a class label \(c_i = \pm 1\). We denote by y the label vector defined as follows

$$\begin{aligned} y_i = {\left\{ \begin{array}{ll} c_i &{} \text { if } i \in \mathcal {V}_\mathrm {L}, \\ 0 &{} \text { if } i \in \mathcal {V}{\setminus } \mathcal {V}_\mathrm {L}. \end{array}\right. } \end{aligned}$$

The methods discussed in this paper are formulated in the framework of manifold learning. Indeed, the classification of unlabelled data points relies on the definition of a Laplacian matrix, which can be seen as a discrete Laplace-Beltrami operator on a manifold (Coifman and Lafon 2006).

Let \(L = D - W\) be the matrix of the combinatorial Laplacian, where \(D = {{\mathrm{diag}}}\left( d \right) \), d is the degree vector \(d = W 1\), and 1 is the vector of ones, i.e., \(d_i = \sum _{j = 1}^N w_{ij}\). We will write \(i \sim j\) iff \(w_{ij} \ne 0\). The normalized Laplacian, defined by \(L^\mathrm {N}= D^{-1/2}L D^{-1/2}= \mathbb {I}- D^{-1/2}W D^{-1/2}\) (where \(\mathbb {I}\in \mathbb {R}^{N \times N}\) is the identity matrix), accounts for a non-trivial sampling distribution of the data points on the manifold. The normalized Laplacian has an orthonormal basis of eigenvectors \(\left\{ v_{\ell } \right\} _{\ell = 0}^{N - 1}\), with \(v_{k}^\intercal v_{\ell }= \delta _{k \ell }\) (the Kronecker delta), associated to non-negative eigenvalues \(0 = \lambda _{0} \le \lambda _{1} \le \cdots \le \lambda _{N-1} \le 2\). Noticeably, the zero eigenvector of \(L^\mathrm {N}\) is simply specified by the node degrees, i.e., we have \(v_{0,i} \propto \sqrt{d_i}\) for all \(i = 1, \ldots , N\). Notice that the Laplacian can be expressed in this basis according to the lemma below.
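As an illustration, a minimal sketch of the construction of \(L^\mathrm {N}\) and of its eigendecomposition is given below (the function name is ours); for a connected graph, the first eigenvector returned is proportional to \(\sqrt{d}\), as stated above.

```python
import numpy as np

def normalized_laplacian(W):
    """L^N = I - D^{-1/2} W D^{-1/2} for a symmetric non-negative weight matrix W."""
    d = W.sum(axis=1)                        # degree vector d = W 1
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt

# Example usage, for any connected graph with weight matrix W:
#   L_N = normalized_laplacian(W)
#   lam, V = np.linalg.eigh(L_N)   # lam[0] = 0 and V[:, 0] is proportional to sqrt(d)
```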

Lemma 1

The normalized Laplacian admits the following spectral decomposition, which also gives a resolution of the identity matrix \(\mathbb {I}\in \mathbb {R}^{N \times N}\):

$$\begin{aligned} L^\mathrm {N}= \sum _{\ell = 1}^{N-1} \lambda _{\ell }v_{\ell }v_{\ell }^\intercal , \quad \mathbb {I}= \sum _{\ell = 0}^{N-1} v_{\ell }v_{\ell }^\intercal . \end{aligned}$$

Proof

See Chung (1997). \(\square \)

For simplicity, we assume here that each eigenvalue is associated with a one-dimensional eigenspace. The general case can be handled in a straightforward manner.

Following Belkin and Niyogi (2004), we introduce the smoothing functional associated to the normalized Laplacian:

$$\begin{aligned} S_\mathcal {G}\left( f \right) = \frac{1}{2} f^\intercal L^\mathrm {N}f = \frac{1}{2} \sum _{i ,j | i \sim j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right) ^2 , \end{aligned}$$
(1)

where \(f_i\) denotes the i-th component of f.
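The equality in 1 can be checked numerically as in the following sketch, where the sum over \(i \sim j\) is interpreted as running over unordered pairs (edges), so that the full double sum over i and j carries an extra factor 1/2; the random dense graph is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((30, 30)); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_N = np.eye(30) - D_inv_sqrt @ W @ D_inv_sqrt

f = rng.standard_normal(30)
quad = 0.5 * f @ L_N @ f                              # (1/2) f^T L^N f
g = f / np.sqrt(d)
# the double sum over all i, j counts every edge twice, hence the extra 1/2
pairwise = 0.5 * 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)
print(np.isclose(quad, pairwise))                     # True
```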

Remark 1

The smoothest vector according to the smoothness functional 1 is the eigenvector \(v_{0}\), which attains the minimum value \(S_\mathcal {G}\left( v_{0} \right) = 0\).

The following sections contain a brief review of the state of the art for semi-supervised graph-based classification methods.

2.2 Belkin–Niyogi approach

In Belkin and Niyogi (2004), a semi-supervised classification problem is phrased as the estimation of a (discrete) function written as a sum of the first p smoothest functions, that is, the first p eigenvectors of the combinatorial Laplacian. The classification problem is defined by

$$\begin{aligned} \min _{a \in \mathbb {R}^p}~{\sum _{i=1}^s \left( c_i - \sum _{\ell = 0}^{p - 1} a_\ell v_{\ell ,i} \right) ^2} , \end{aligned}$$
(2)

where \(a_0, \ldots , a_{p - 1}\) are real coefficients. The solution of Problem 2, \(a^{\star }\), is obtained by solving a linear system. The predicted vector is then

$$\begin{aligned} f^{\star }= \sum _{\ell = 0}^{p - 1} a^{\star }_\ell v_{\ell } . \end{aligned}$$

Finally, the classification of an unlabelled node \(i \in \mathcal {V}{\setminus } \mathcal {V}_\mathrm {L}\) is given by \({{\mathrm{sign}}}\left( f^{\star }_i \right) \). Indeed, Problem 2 is minimizing a sum of errors of a regression-like problem involving only the labelled data points. The information known about the position of the unlabelled data points is included in the eigenvectors \(v_{\ell }\) of the Laplacian (Fourier modes), which is the Laplacian of the full graph, including the unlabelled nodes. Only a small number p of eigenvectors is used in order to approximate the label function. This number p is a tuning parameter of the model.

We will denote this model as Belkin–Niyogi Graph Classification (BelkGC).
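A minimal sketch of BelkGC is given below (the function name is ours); it solves the least-squares fit of Problem 2 restricted to the labelled nodes and returns the predicted vector, which is then thresholded with the sign.

```python
import numpy as np

def belk_gc(L, labels, labelled_idx, p):
    """L: graph Laplacian (combinatorial or normalized); labels: c_i = +-1 for the
    labelled nodes; labelled_idx: their indices; p: number of eigenvectors used."""
    lam, V = np.linalg.eigh(L)
    Vp = V[:, :p]                                        # p smoothest eigenvectors
    A = Vp[labelled_idx, :]                              # restricted to labelled nodes
    a_star, *_ = np.linalg.lstsq(A, labels, rcond=None)  # least-squares fit of Problem 2
    return Vp @ a_star                                   # f*; classify with sign(f*)
```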

2.3 Zhou et al. approach

In Zhou et al. (2004), the following regularized semi-supervised classification problem is proposed:

$$\begin{aligned} \min _{f \in \mathbb {R}^N}~{\frac{1}{2} f^\intercal L^\mathrm {N}f + \frac{\gamma }{2} \left\| f - y \right\| _2^2} , \end{aligned}$$
(3)

where \(\gamma > 0\) is a regularization parameter which has to be selected. We notice that the second term in the objective function of Problem 3, involving the \(\ell _2\)-norm of the label vector, can be interpreted as the error term of a least-squares regression problem. Intuitively, Problem 3 will have a solution \(f^{\star }\in \mathbb {R}^N\) such that \(f^{\star }_i \approx 0\) if \(i \in \mathcal {V}{\setminus } \mathcal {V}_\mathrm {L}\) (unlabelled nodes), that is, it will try to fit zeroes. Naturally, we will have \(f^{\star }_i \approx c_i\) for all the labelled nodes \(i \in \mathcal {V}_\mathrm {L}\). Finally, the prediction of the unlabelled node class is given by calculating \({{\mathrm{sign}}}\left( f^{\star }_i \right) \) for \(i \in \mathcal {V}{\setminus } \mathcal {V}_\mathrm {L}\). The key ingredient is the regularization term (based on the Laplacian) which will make the solution smoother by increasing the bias.

Notice that the original algorithm solves Problem 3 once per class, using as target the indicator vector of the nodes labelled with that class, and then classifying the unlabelled nodes according to the maximum prediction among all the classes. Although both formulations (using two binary target vectors and predicting with the maximum, or using a single target vector with \(\pm 1\) and zero values and predicting with the sign) are slightly different, we will use Problem 3 since in this work we consider only binary problems. We will denote this model as Zhou et al. Graph Classification (ZhouGC).
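Setting the gradient of Problem 3 to zero gives the linear system \(\left( L^\mathrm {N}+ \gamma \mathbb {I} \right) f = \gamma y\), which the following sketch solves directly (the function name is ours).

```python
import numpy as np

def zhou_gc(L_N, y, gamma):
    """L_N: normalized Laplacian; y: label vector (+-1 on labelled nodes, 0 elsewhere)."""
    N = L_N.shape[0]
    f_star = np.linalg.solve(L_N + gamma * np.eye(N), gamma * y)
    return f_star               # unlabelled nodes are classified with sign(f_star)
```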

In the recent work of Gleich and Mahoney (2015), it is emphasized that this method is implicitly robust in the presence of graph noise, since the prediction decays towards zero, preventing errors in far regions of the network from propagating to other areas. Moreover, a modification of this algorithm is proposed to add an additional \(\ell _1\) penalization, so that the prediction decays faster according to an additional regularization parameter. However, the resulting method is still qualitatively similar to ZhouGC, since the loss term is still that of a regression problem, with the additional disadvantage of having an extra tuning parameter.

2.4 Related methods

There are other semi-supervised learning methods that impose the label values as constraints (Zhu et al. 2003; Joachims 2003). The main drawback is that, as discussed in Gleich and Mahoney (2015), the rigid way of including the labelled information makes them more sensitive to noise, especially in the case of mislabelled nodes.

On the other hand, there are techniques with completely different approaches, such as Laplacian SVM (Belkin et al. 2006), a manifold learning model for semi-supervised learning based on an ordinary Support Vector Machine (SVM) classifier supplemented with an additional manifold regularization term. This method was originally designed for Euclidean data, hence its scope is different from that of the previous models. The straightforward way to apply it to graph data is to embed the graph, which in principle requires the computation of the inverse of a dense Gram matrix entering the definition of an SVM problem. Hence, the training involves both a matrix inversion of the size of the labelled and unlabelled training data set and a quadratic problem of the same size. In order to reduce the computational cost, a training procedure in the primal was proposed in Melacci and Belkin (2011), where the use of a preconditioned conjugate gradient algorithm with an early stopping criterion is suggested. However, these methods still require the choice of two regularization parameters besides the kernel bandwidth. This selection requires a cross-validation procedure, which is especially difficult when the number of known labels is small.

3 Robust method

The two methods presented in Sects. 2.2 and 2.3 can be interpreted as regression problems, which intuitively estimate a smooth function \(f^{\star }\) such that its value is approximately the class label, i.e., \(f^{\star }_i \approx c_i\) for all the labelled nodes \(i\in \mathcal {V}_\mathrm {L}\). We will propose in this section a new method based on a concave loss function and a convex regularization term, which is best suited for classification tasks. Moreover, with the proper constraints, the resulting problem is convex and can be solved using a dual formulation.

We keep as a main ingredient the first term of Problem 3, \(\frac{1}{2} f^\intercal L^\mathrm {N}f\), which is a well-known regularization term requiring a maximal smoothness of the solution on the (sampled) manifold. However, if the smooth solution is \(f^{\star }\), we emphasize that we have to favour \({{\mathrm{sign}}}\left( f^{\star }_i \right) = c_i\) instead of imposing \(f^{\star }_i \approx c_i\) for all \(i \in \mathcal {V}_\mathrm {L}\). Hence, for \(\gamma > 0\), we propose the minimization problem

$$\begin{aligned} \left\{ \begin{array}{l}\displaystyle \min _{f \in \mathbb {R}^N}~{\frac{1}{2} f^\intercal L^\mathrm {N}f - \frac{\gamma }{2} \sum _{i=1}^{N}{\left( y_i + f_i \right) ^2}}\\ \displaystyle \text {s.t. }f^\intercal v_{0} = 0 ,\end{array} \right. \end{aligned}$$
(4)

where \(\gamma \) has to be bounded from above as stated in Theorem 1. The constraint means that we do not want the solution to have a component directed along the vector \(v_{0}\), since its components all have the same sign (an additional justification is given in Remark 2). We will denote our model as Robust Graph Classification (RobustGC).

Notice that Problem 4, corresponding to RobustGC, can be written as Problem 3, corresponding to ZhouGC, by making the following changes: \(\gamma \rightarrow -\gamma \), \(y \rightarrow -y\), and by supplementing the problem with the constraint \(f^\intercal v_{0} = 0\). Both problems can be compared by analysing the error term in both formulations. In ZhouGC this term simply corresponds to the Squared Error (SE), namely \(\left( f_i - y_i \right) ^2\). In RobustGC, a Concave Error (CE) is used instead, \(- \left( f_i + y_i \right) ^2\). As illustrated in Fig. 1, this means that ZhouGC tries to fit the target, both when it is a known label \(\pm 1\) and when it is zero. On the other hand, RobustGC tries to have predictions far from 0 (somehow minimizing the entropy of the prediction), biased towards the direction marked by the label for labelled points. Nevertheless, as shown in Fig. 1a, the model is also able to minimize the CE in the direction opposite to the one indicated by the label, which provides robustness with respect to label noise. Finally, if the label is unknown, the CE only favours predictions that are large in absolute value. As an additional remark, let us stress that the interplay of the Laplacian-based regularization and the error term, which are both quadratic functions, is nevertheless of fundamental importance. As a matter of fact, in the absence of the regularization term, the minimization of the unbounded error term is meaningless.

Fig. 1

Comparison of the squared error (SE) and the proposed concave error (CE), both for a labelled node with \(c_i = 1\) (the case \(c_i = -1\) is just a reflection of this one) and for an unlabelled point.  SE;  CE. a Positive label and b unknown label

RobustGC can be further studied by splitting the error term to get the following equivalent problem:

$$\begin{aligned} \left\{ \begin{array}{l}\displaystyle \min _{f \in \mathbb {R}^N}~{\frac{1}{2} f^\intercal L^\mathrm {N}f + \gamma \sum _{i=1}^{N}{\left( -y_i f_i \right) } + \gamma \sum _{i=1}^{N}{\left( - \frac{f_i^2}{2} \right) }}\\ \displaystyle \text {s.t. }f^\intercal v_{0} = 0 ,\end{array} \right. \end{aligned}$$

where the two error terms have the following meaning:

  • The first error term is a penalization term involving a sum of loss functions \({\text {L}}\left( f_i \right) = - y_i f_i\). This unbounded loss function term is reminiscent of the hinge loss in Support Vector Machines: \(\max \left( 0, 1 - y_i f_i \right) \). Indeed, for each labelled node \(i \in \mathcal {V}_\mathrm {L}\), this term favours values of \(f_i\) which have the sign of \(y_i\). However, for each unlabelled node \(i \in \mathcal {V}{\setminus }\mathcal {V}_\mathrm {L}\), the corresponding term \({\text {L}}\left( f_i \right) = 0\) vanishes. This motivates the presence of the other error term.

  • The second error term is a penalization term forcing the value \(f_i\) to take a non-zero value in order to minimize \(- f_i^2 / 2\). In particular, if i is unlabelled, this term favours \(f_i\) to take a non-zero value which will be dictated by the neighbours of i in the graph.

The connection between our method and kernel methods based on a function estimation problem in a Reproducing Kernel Hilbert Space (RKHS) is explained in the following remark.

Remark 2

The additional condition \(f^\intercal v_{0} = 0\) in Problem 4 can also be justified as follows. The Hilbert space \(H_K = \left\{ f \in \mathbb {R}^N \text { s.t. } f^\intercal v_{0} = 0 \right\} \) is an RKHS endowed with the inner product \(\langle f | f' \rangle _K = f^\intercal L^\mathrm {N}f'\) and with the reproducing kernel given by the Moore–Penrose pseudo-inverse \(K = \left( L^\mathrm {N} \right) ^\dagger \). More explicitly, we can define \(K_i = \left( L^\mathrm {N} \right) ^\dagger e_i \in \mathbb {R}^N\), where \(e_i\) is the canonical basis element given by a vector of zeros with a 1 at the i-th component. Furthermore, the kernel evaluated at any nodes i and j is given by \(K\left( i,j \right) = e_i^\intercal \left( L^\mathrm {N} \right) ^\dagger e_j\). As a consequence, the reproducing property is merely (Zhou et al. 2011)

$$\begin{aligned} \langle K_i | f \rangle _K = \left( \left( L^\mathrm {N} \right) ^\dagger e_i \right) ^\intercal L^\mathrm {N}f = f_i , \end{aligned}$$

for any \(f\in H_K\). As a result, the first term of Problem  4 is equal to \(\left\| f \right\| _{K}^2 / 2\) and the problem becomes a function estimation problem in an RKHS.
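The reproducing property can be verified numerically as in the following sketch, where a small random dense graph (illustrative only) is used and f is first projected onto \(H_K\).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((20, 20)); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_N = np.eye(20) - D_inv_sqrt @ W @ D_inv_sqrt

lam, V = np.linalg.eigh(L_N)
v0 = V[:, 0]
K = np.linalg.pinv(L_N)                     # reproducing kernel K = (L^N)^+
f = rng.standard_normal(20)
f -= (v0 @ f) * v0                          # project onto H_K = {f : f^T v_0 = 0}
i = 3
print(np.isclose(K[:, i] @ L_N @ f, f[i]))  # <K_i | f>_K = f_i, so this prints True
```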

Notice that the objective function involves the difference of two convex functions and, therefore, it is not always bounded from below. The following theorem states the values of the regularization parameter such that the objective is bounded from below on the feasible set and so that the optimization problem is convex.

Theorem 1

Let \(\gamma > 0\) be a regularization parameter. The optimization problem

$$\begin{aligned} \min _{f \in \mathbb {R}^N}~{\frac{1}{2} f^\intercal L^\mathrm {N}f - \frac{\gamma }{2} \left\| f+y \right\| _2^2}\quad \text {s.t. }f^\intercal v_{0} = 0 \end{aligned}$$

has a strongly convex objective function on the feasible space if and only if \(\gamma < \lambda _{1}\), where \(\lambda _{1}\) is the second smallest eigenvalue of \(L^\mathrm {N}\). In that case, the unique solution is given by the vector:

$$\begin{aligned} f^{\star }= \left( \frac{L^\mathrm {N}}{\gamma } - \mathbb {I} \right) ^{-1} P_0 y , \end{aligned}$$

with \(P_0 = \mathbb {I}- v_{0} v_{0}^\intercal \).

Proof

Using Lemma 1, any vector \(f \in v_{0}^\perp \), i.e., satisfying the constraint \(f^\intercal v_{0} = 0\), can be written as \(f = \sum _{\ell = 1}^{N-1} \tilde{f}_\ell v_{\ell }\), where \(\tilde{f}_\ell = v_{\ell }^\intercal f \in \mathbb {R}\) is the projection of f over \(v_{\ell }\). Furthermore, we also expand the label vector in the basis of eigenvectors \(y = \sum _{\ell = 0}^{N-1} \tilde{y}_\ell v_{\ell }\), with \(\tilde{y}_\ell = v_{\ell }^\intercal y\). Then, the objective function is the finite sum

$$\begin{aligned} F\left( \tilde{f}_1, \ldots , \tilde{f}_{N-1} \right) = \sum _{\ell = 1}^{N-1}\left( \frac{\lambda _{\ell }- \gamma }{2} \tilde{f}^2_\ell - \gamma \tilde{y}_\ell \tilde{f}_\ell \right) - \frac{\gamma }{2} \left\| y \right\| _2^2 , \end{aligned}$$

where we emphasize that the term \(\ell = 0\) is missing. As a result, F is clearly a strongly convex function of \(\left( \tilde{f}_1, \ldots , \tilde{f}_{N-1} \right) \) if and only if \(\gamma < \lambda _{\ell }\) for all \(\ell = 1, \ldots , N-1\), that is, iff \(\gamma < \lambda _{1}\). Since the objective F is quadratic, its minimum is merely given by \(\left( \tilde{f}^{\star }_1, \ldots , \tilde{f}^{\star }_{N-1} \right) \), with

$$\begin{aligned} \tilde{f}^{\star }_\ell = \frac{\tilde{y}_\ell }{\frac{\lambda _{\ell }}{\gamma } - 1} , \end{aligned}$$
(5)

for \(\ell = 1, \ldots , N-1\). Then, the solution of the minimization problem is given by

$$\begin{aligned} f^{\star }= \sum _{\ell = 1}^{N-1} \tilde{f}^{\star }_\ell v_{\ell }&= \sum _{\ell = 1}^{N-1} \frac{\tilde{y}_\ell }{\frac{\lambda _{\ell }}{\gamma } - 1} v_{\ell }\\&= \left( \frac{L^\mathrm {N}}{\gamma } - \mathbb {I} \right) ^{-1} \left( y - v_{0} \left( v_{0}^\intercal y \right) \right) , \end{aligned}$$

which is obtained by using the identity \(y - v_{0} \left( v_{0}^\intercal y \right) = \sum _{\ell = 1}^{N-1} \tilde{y}_\ell v_{\ell }\). This completes the proof. \(\square \)

By examining the form of the solution of Problem 4 given in 5 as a function of the regularization constant \(0< \gamma < \lambda _{1}\), we see that taking \(\gamma \) close to the second eigenvalue \(\lambda _{1}\) will give more weight to the second eigenvector, while the importance of the subsequent eigenvectors decreases as \(1 / \lambda _{\ell }\). Regarding the selection of \(\gamma \) in practice, as shown experimentally, simply fixing \(\gamma ={0.9}\lambda _{1}\) leads to a parameter-free version of RobustGC (denoted PF-RobustGC) that retains a considerable accuracy.

The complete procedure to apply this robust approach is summarized in Algorithm 1, where \(\gamma \) is set as a percentage \(\eta \) of \(\lambda _{1}\) to make it problem independent. Notice that, apart from building the needed matrices and vectors, the algorithm only requires computing the largest eigenvalue of a matrix and solving a well-posed linear system.

[Algorithm 1: Robust Graph Classification (RobustGC)]
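A minimal sketch of this procedure is given below (in Python; the function name and the use of a full eigendecomposition to obtain \(\lambda _{1}\) are our own choices, not the paper's pseudocode): it builds \(L^\mathrm {N}\), sets \(\gamma = \eta \lambda _{1}\) and solves the linear system of Theorem 1.

```python
import numpy as np

def robust_gc(W, y, eta=0.9):
    """W: symmetric non-negative weight matrix of a connected graph; y: label vector
    with entries in {-1, 0, +1}; gamma is set to eta * lambda_1 (PF-RobustGC: eta = 0.9)."""
    N = W.shape[0]
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_N = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, V = np.linalg.eigh(L_N)             # lam[0] = 0, lam[1] = lambda_1
    v0 = V[:, 0]
    gamma = eta * lam[1]
    P0_y = y - v0 * (v0 @ y)                 # P_0 y = (I - v_0 v_0^T) y
    f_star = np.linalg.solve(L_N / gamma - np.eye(N), P0_y)
    return f_star                            # unlabelled nodes classified with sign(f_star)
```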

3.1 Illustrative example

A comparison of ZhouGC, BelkGC and RobustGC is shown in Fig. 2, where the three methods are applied over a very simple graph: a chain with strong links between the first ten nodes, strong links between the last ten nodes, and a weak link connecting the tenth and the eleventh nodes (with a weight ten times smaller). This structure clearly suggests to split the graph in two halves.

Fig. 2

Comparison of the different methods over a chain with two clearly separable clusters, where the link between the two middle nodes is ten times smaller than the other links.  ZhouGC;  BelkGC;  RobustGC. a Example with two correct labels and b example with four correct labels and a flipped one

In Fig. 2a one node of each cluster receives a label, whereas in Fig. 2b one node of the positive class and four of the negative one are labelled, with one of the negative labels flipped. The predicted values of \(f^{\star }\) show that ZhouGC (with \(\gamma = {1}\)) is truly a regression model, fitting the known labels (even the flipped one) and pushing the unknown ones towards zero. BelkGC (with two eigenvectors, \(p = {2}\)) fits the unknown labels much better for nodes far from the labelled ones, although the flipped label pushes the prediction towards zero in the second example for the negative class. Finally, RobustGC (with \(\eta ={0.5}\)) clearly splits the graph in two for the first example, where the prediction is almost a step function, and it is only slightly affected by the flipped label of the second example. Of course, this experiment is only illustrative, since tuning the parameters of the different models could significantly affect the results.

3.2 Conditioning of the linear system

As shown in Theorem 1, the RobustGC model is trained by solving the following linear system:

$$\begin{aligned} \left( \frac{L^\mathrm {N}}{\gamma } - \mathbb {I} \right) f^{\star }= P_0 y . \end{aligned}$$

It is therefore interesting to describe the condition number of this system in order to estimate the stability of its numerical solution. In particular, we will use the following lemma characterizing the maximum eigenvalue of \(L^\mathrm {N}\).

Lemma 2

If the weight matrix is positive semi-definite, \(W \succeq 0\), then \(\lambda _{\max }\left( L^\mathrm {N} \right) \le 1\). If W is indefinite, then \(\lambda _{\max }\left( L^\mathrm {N} \right) \le 2\).

Proof

The argument is classic. Let us write \(L^\mathrm {N}= \mathbb {I}- S\), with \(S = D^{-1/2}W D^{-1/2}\). Clearly, S is related by conjugation to a stochastic matrix \(\Sigma = D^{-1} W = D^{-1/2}S D^{1/2}\). Hence, \(\Sigma \) and S have the same spectrum \(\left\{ \lambda _{\ell } \right\} _{\ell = 0}^{N-1}\). Therefore, since \(\Sigma \) is stochastic, it holds that \(\left| \lambda _{\ell } \right| \le 1\) for all \(\ell = 0, \ldots , N-1\). Then, in general, \(\lambda _{\max }\left( L^\mathrm {N} \right) = 1 - \lambda _{\min }\left( S \right) \le 2\), which proves the second part of the Lemma. Furthermore, if \(W \succeq 0\), then \(S \succeq 0\), which means that \(\lambda _{\min }\left( S \right) \ge 0\) and we have \(\lambda _{\max }\left( L^\mathrm {N} \right) = 1 - \lambda _{\min }\left( S \right) \le 1\), which shows the first part of the statement. \(\square \)

Furthermore, in the feasible space (i.e., for all \(f \in \mathbb {R}^N\) such that \(f^\intercal v_{0} = 0\)), we have \(\lambda _{\min }\left( L^\mathrm {N} \right) = \lambda _{1}\). Then, we can deduce the condition number of the system:

$$\begin{aligned} \kappa = \frac{\left| \lambda _{\max }\left( L^\mathrm {N}/ \gamma - \mathbb {I} \right) \right| }{\left| \lambda _{\min }\left( L^\mathrm {N}/ \gamma - \mathbb {I} \right) \right| } \le \frac{c - \gamma }{\lambda _{1} - \gamma } , \end{aligned}$$

where \(c = 1\) if the weight matrix is positive semi-definite and \(c = 2\) if the weight matrix is indefinite.

The upshot is that the problem is better conditioned a priori if the weight matrix is positive semi-definite. Furthermore, in order to have a reasonable condition number, \(\gamma \) should not be too close to \(\lambda _{1}\).
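The following sketch (function name ours) computes both the actual condition number on the feasible space and the bound above, which can be used to check how close \(\gamma \) can be taken to \(\lambda _{1}\) in practice.

```python
import numpy as np

def condition_number(W, gamma):
    """Condition number of L^N / gamma - I on v_0^perp, and the bound (c - gamma) / (lambda_1 - gamma)."""
    N = W.shape[0]
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_N = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    lam = np.linalg.eigvalsh(L_N)
    mu = np.abs(lam[1:] / gamma - 1.0)       # spectrum on the feasible space
    c = 1.0 if np.linalg.eigvalsh(W).min() >= -1e-10 else 2.0
    return mu.max() / mu.min(), (c - gamma) / (lam[1] - gamma)
```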

3.3 Out-of-sample extension

In the particular case of a graph obtained using a Mercer kernel over a set of data points \(\left\{ x_i \right\} _{i = 1}^N\), with \(x_i \in \mathbb {R}^d\), an out-of-sample extension allows predictions to be made on unseen points.

In order to pose an out-of-sample problem, let \(f^{\star }= \left( L^\mathrm {N}/ \gamma - \mathbb {I} \right) ^{-1} P_0 y\) be the solution of the RobustGC problem. Then, if we are given an additional point \(x \in \mathbb {R}^d\), we want to obtain the value of the classifier \(f_x\) such that \({{\mathrm{sign}}}\left( f_x \right) \) predicts the label of x. In particular, recall that the normalized Laplacian \(L^\mathrm {N}= \mathbb {I}- S\) is built from the kernel matrix

$$\begin{aligned} S_{ij} = \frac{k\left( x_i, x_j \right) }{\sqrt{d_i d_j}} , \text { with } d_i = \sum _{j=1}^N k\left( x_i, x_j \right) . \end{aligned}$$

This kernel can be extended to the new point x as follows:

$$\begin{aligned} S_{xj} = \frac{k\left( x, x_j \right) }{\sqrt{d_x d_j}} , \text { with } d_x = \sum _{j=1}^N k\left( x, x_j \right) , \end{aligned}$$

and \(S_{xx} = k\left( x, x \right) / d_x\) (notice that in many of the most common kernels, such as the Gaussian kernel, \(k\left( x, x \right) = 1\)). We consider

$$\begin{aligned} \tilde{f} = \begin{pmatrix} f^{\star }\\ f_x \end{pmatrix} \text { and } \tilde{y} = \begin{pmatrix} y \\ 0 \end{pmatrix} , \end{aligned}$$

with \(\tilde{f}, \tilde{y} \in \mathbb {R}^{N+1}\) and \(f, y \in \mathbb {R}^N\). The extension of the Laplacian is defined as follows:

$$\begin{aligned} \tilde{L}^\mathrm {N}= \begin{pmatrix} L^\mathrm {N}&{} l \\ l^\intercal &{} \tilde{L}^\mathrm {N}_{xx} \end{pmatrix}, \end{aligned}$$

with \(l_i = - S_{xi}\) and \(\tilde{L}^\mathrm {N}_{xx} = 1 - k\left( x, x \right) / d_x\). Notice that \(\tilde{L}^\mathrm {N}\) is not necessarily positive semi-definite.

In order to obtain \(f_x\), we propose the minimization problem

$$\begin{aligned} \min _{\tilde{f} \in \mathbb {R}^{N+1}}~{\frac{1}{2} \tilde{f}^\intercal \tilde{L}^\mathrm {N}\tilde{f} -\frac{\gamma }{2} \sum _{i=1}^{N}{\left( \tilde{y}_i + \tilde{f}_i \right) ^2}}\quad \text {s.t. }\tilde{f} = \begin{pmatrix} f^{\star }\\ f_x \end{pmatrix} , \end{aligned}$$

which is equivalent to solving

$$\begin{aligned} \min _{f_x \in \mathbb {R}}~{\frac{\tilde{L}^\mathrm {N}_{xx} - \gamma }{2} f_x^2 + \left( l^\intercal f^{\star } \right) f_x} , \end{aligned}$$

where \(l^\intercal f^{\star }= - \sum _{i=1}^N S_{xi} f^{\star }_i\). This quadratic problem has a solution provided that \(\tilde{L}^\mathrm {N}_{xx} - \gamma > 0\), that is, only if the degree of this new point x is large enough:

$$\begin{aligned} d_x > \frac{k\left( x, x \right) }{1 - \gamma } . \end{aligned}$$

This means that x has to be close enough to the initial set \(\left\{ x_i \right\} _{i = 1}^N\) in order to be able to extend the classifier given by \(f^{\star }\) (notice that, in this case, \(\gamma< \lambda _{1} < 1\), and hence the inequality involving \(d_x\) is well defined). Under this assumption, the solution reads

$$\begin{aligned} f_x = \frac{1}{1 - \gamma - k\left( x, x \right) d_x^{-1}} \sum _{i=1}^N S_{xi} f^{\star }_i , \end{aligned}$$
(6)

namely, it is a Nyström-like extension of the solution \(f^{\star }\) with respect to the Mercer kernel \(S\left( x, y \right) = k\left( x, y \right) / \sqrt{d_x d_y}\).
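A sketch of the out-of-sample rule 6 for a Gaussian kernel (so that \(k\left( x, x \right) = 1\)) is given below; the function name and signature are ours, and f_star denotes a previously computed RobustGC solution on the training points.

```python
import numpy as np

def predict_out_of_sample(x, X, f_star, gamma, sigma=1.0):
    """x: new point; X: (N, d) training points; f_star: RobustGC solution on X;
    gamma < lambda_1 < 1; sigma: bandwidth of the Gaussian kernel (so k(x, x) = 1)."""
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))        # k(x_i, x_j)
    d = K.sum(axis=1)                                                  # degrees d_i
    k_x = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))   # k(x, x_j)
    d_x = k_x.sum()
    if d_x <= 1.0 / (1.0 - gamma):          # x is too far from the initial set
        return None
    S_x = k_x / np.sqrt(d_x * d)            # extended kernel S_{xj}
    return (S_x @ f_star) / (1.0 - gamma - 1.0 / d_x)                  # rule 6
```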

3.3.1 Example of the out-of-sample extension

Figure 3 includes an example of the out-of-sample extension over the moons dataset, with 50 patterns (5 of them labelled) for each class. The resulting model is then extended over a \({100} \times {100}\) regular grid using 6. As the bandwidth of the kernel is increased, the prediction can be extended to a broader area, but the classification becomes less precise at the border between the two classes.

Fig. 3

Out-of-sample extension over the moons dataset using different bandwidths. Unlabelled point of class \(-1\)/\(+1\); labelled point of class \(-1\)/\(+1\); area predicted as class \(-1\)/\(+1\); area out of prediction range. a Bandwidth \(\sigma = 0.15\), b bandwidth \(\sigma = 0.3\), c bandwidth \(\sigma = 0.6\) and d bandwidth \(\sigma = 1.2\)

4 Experiments

In this section we will illustrate the robustness of the proposed method RobustGC with respect to labelling noise, we will show empirically how it can be successfully applied to the problem of classifying nodes over different graphs, and we will also include an example of its out-of-sample extension.

For the first two sets of experiments, the following four models will be compared:

ZhouGC :

It corresponds to Problem 3, where the parameter \(\gamma \) is selected from a grid of 51 points in logarithmic scale in the interval \(\left[ {10^{-5}}, 10^{{5}} \right] \).

BelkGC :

It corresponds to Problem 2. The number p of eigenvectors used is chosen between 1 and 51.

RobustGC :

It corresponds to Problem 4, where the parameter \(\gamma \) is selected from a grid of 51 points in linear scale between 0 and \(\lambda _{1}\).

PF-RobustGC :

It corresponds to Problem 4, where \(\gamma \) is fixed as \(\gamma = {0.9} \lambda _{1}\), so it is a parameter-free method. As shown in Fig. 6, the stability of the prediction with respect to \(\gamma \) suggests using such a fixed value, where the solution is mainly dominated by the second eigenvector \(v_{1}\) but without ignoring the next eigenvectors. Moreover, a value of \(\gamma \) closer to \(\lambda _{1}\) could affect the conditioning of the linear system, as explained in Sect. 3.2.

Regarding the selection of the tuning parameters, these models are divided in two groups:

  • For ZhouGC, BelkGC and RobustGC, a perfect validation criterion is assumed, so that the best parameter is selected according to the test error. Although this approach prevents us from estimating the true generalization error, it is applied to the three models so that the comparison between them remains fair, and in this way we avoid the crucial selection of the parameter, which can be particularly difficult for the small labelled-set sizes considered here. Obviously, any validation procedure will give results at best as good as these.

  • PF-RobustGC does not require setting any tuning parameter, hence its results are more realistic than those of the previous group, and it is at a disadvantage with respect to them. This means that, if this model outperforms the others in the experiments, it can be expected to do so in a real setting, where the parameters of the previous methods have to be set without using test information.

4.1 Robustness of the classification with respect to label noise

The first set of experiments aims to test the robustness of the classification of the different models with respect to label noise. In particular, we propose to generate a Stochastic Block Model as follows: a very simple graph of 200 nodes with two clusters is generated with an intra-cluster connectivity of \({70}\%\), whereas the connectivity between clusters is either \({30}\%\) (a well-separated problem) or \({50}\%\) (a more difficult problem); an example of the resulting weight matrices is shown in Fig. 4. For each of these two datasets, the performance of the models is compared for different numbers of labels and different levels of noise, which correspond to the percentage of flipped labels. Each configuration is repeated 50 times by varying the labelled nodes to average the accuracies.
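The following sketch (function name ours) generates such a Stochastic Block Model: each undirected edge is sampled once with probability 0.7 inside a cluster and 0.3 or 0.5 between clusters.

```python
import numpy as np

def stochastic_block_model(n_per_block=100, p_in=0.7, p_out=0.3, seed=0):
    """Two blocks of n_per_block nodes; returns a binary symmetric weight matrix and labels."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_block
    labels = np.repeat([1, -1], n_per_block)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)  # edge probabilities
    A = np.triu((rng.random((n, n)) < P).astype(float), k=1)       # each edge sampled once
    return A + A.T, labels                                         # W, ground-truth classes
```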

Fig. 4

Binary weight matrices for the Stochastic Block Model with low and high inter-cluster connectivity (the connections are marked in yellow). a Low inter-cluster connectivity and b high inter-cluster connectivity

The results are included in Fig. 5, where the solid lines represent the average accuracy, and the striped regions the areas between the minimum and maximum accuracies. In the case of the low inter-cluster connectivity dataset (left column of Fig. 5), RobustGC is able to almost perfectly classify all the points independently of the noise level (hence the striped region only appears when the number of labels is small and the noise is maximum). Moreover, PF-RobustGC is almost as good as RobustGC, and only slightly worse when the noise is the highest and the number of labels is small. These two models outperform BelkGC, and also ZhouGC, which is clearly the worst of the four approaches. Regarding the high inter-cluster connectivity dataset (right column of Fig. 5), for this more difficult problem RobustGC still gets a perfect classification except when the noise level is very high, where the accuracy drops a little when the number of labels is small. BelkGC is again worse than RobustGC, and the difference is more noticeable when the noise increases. On the other hand, the heuristic PF-RobustGC is in this case worse than BelkGC (the selection of \(\gamma \) is clearly not optimal), but it still outperforms ZhouGC.

Fig. 5

Robust comparison for the low inter-cluster connectivity graph (left column) and the high inter-cluster connectivity graph (right column).  ZhouGC;  BelkGC;  RobustGC;  PF-RobustGC

4.2 Accuracy of the classification

The second set of experiments consists of predicting the labels of the nodes in the following six supervised datasets:

digits49-s and digits49-w:

The task is to distinguish between the handwritten digits 4 and 9 from the USPS dataset (Hull 1994); the suffix -s denotes that the weight matrix is binary and sparse, corresponding to the symmetrized 20-Nearest Neighbours graph, whereas the suffix -w corresponds to a non-sparse weight matrix built upon a Gaussian kernel with \(\sigma = {1.25}\). The total number of nodes is 250 (125 of each class).

karate :

This dataset corresponds to a social network of 34 people of a karate club, with two communities of sizes 18 and 16 (Zachary 1977).

polblogs :

A symmetrized network of hyperlinks between weblogs on US politics from 2005 (Adamic and Glance 2005); there are 1222 nodes, with two clusters of 636 and 586 elements.

polbooks :

A network of books about US politics around the 2004 presidential election, with 92 nodes and two classes of 49 and 43 elements.

synth :

This dataset is generated by a Stochastic Block Model composed of three clusters of 100 points, with a connectivity of \(30\%\) inside each cluster and \(5\%\) between clusters; the positive class is composed of one cluster and the negative one of the other two.

For each dataset, 6 different training sizes (or number of labelled nodes) are considered, corresponding to 1%, 2%, 5%, 10%, 20% and 50% the total number of nodes, provided that this number is larger than two, since at least one sample of each class is randomly selected. Moreover, each experiment is repeated 20 times by varying the labelled nodes in order to average the result and check if the differences between them are significant. In order to compare the models we use the accuracy over the unlabelled samples.

Table 1 Accuracy of the classification

The results are included in Table 1, where the first column shows the percentage of labelled data and the corresponding number of labels, and the other four columns show the accuracy (mean and standard deviation) of the four models and the corresponding ranking (also given by the colour; the darker, the better). The same rank value is repeated if two models are not significantly different (Footnote 1). We can see that the proposed RobustGC method outperforms both ZhouGC and BelkGC at least for the smallest training sizes, and for all the sizes in the cases of karate, polblogs (the largest one) and polbooks. In the case of digits49-s and digits49-w, RobustGC beats the other methods for the first three sizes, being then beaten by BelkGC in the former and by ZhouGC in the latter. Finally, for synth, RobustGC is the best model for the smallest training size, but it is then outperformed by BelkGC until the largest training size, where both of them solve the problem perfectly. Notice that this dataset is fairly simple, and a spectral clustering approach over the graph (without any labels) could already be near a correct partition; BelkGC can benefit from this partition by simply regressing over the first eigenvectors to get a perfect classification with a very small number of labels. Turning our attention to the parameter-free heuristic approach PF-RobustGC, it is comparable to RobustGC, the approach with perfect parameter selection, in 3 out of the 6 datasets. In digits49-s, digits49-w and synth, PF-RobustGC is comparable to RobustGC for the experiments with a small number of labels, although it works slightly worse when the number of labels is increased. Nevertheless, the results show that the proposed heuristic performs quite well in practice.

4.2.1 Dependence on the tuning parameter

As mentioned before, for the smallest training sets used here, some of them comprising only two labelled nodes, it is impossible to perform a validation procedure. To analyse the dependence of ZhouGC, BelkGC and RobustGC on their tuning parameters, Fig. 6 shows the evolution of the average test accuracy, both for the smallest and the largest training sizes. The proposed RobustGC has the most stable behaviour, although, as expected, it sometimes drops near the critical value \(\gamma = \lambda _{1}\). Nevertheless, this should be the easiest model to tune. ZhouGC also shows a quite smooth dependence, but with a sigmoid shape, where the maximum tends to be located in a narrow region in the middle. Finally, BelkGC (the model comparable to RobustGC in terms of accuracy) presents the sharpest plot, with large changes in the first steps, and hence it is expected to be more difficult to tune.

Fig. 6

Comparison of the accuracy with respect to the different tuning parameters, for the smallest and largest training sets, and for the six datasets.  ZhouGC;  BelkGC;  RobustGC

Fig. 7

Comparison of the accuracy with respect to the size of the subgraph of an out-of-sample extended model and a model built using the complete graph.  PF-RobustGC model (complete);  PF-RobustGC model (out-of-sample)

4.3 Out-of-sample extension

This experiment illustrates the out-of-sample extension by comparing the accuracy of a model built using the whole available graph with that of a model built on a smaller subgraph and then extended to the remaining nodes.

In particular, the dataset used is based on digits49-w, that is, a weighted graph representing the handwritten digits 4 and 9, but in this case the Gaussian kernel has a broader bandwidth of \(\sigma = {5}\), so that the resulting model can be extended to all the patterns. Moreover, the total number of nodes is increased to 1000 (500 of each class). The number of labelled nodes is fixed to 10 (5 of each class), whereas the size of the subgraph used to build the PF-RobustGC model is varied from 20 (10 labelled and 10 unlabelled nodes) to 1000 (the whole graph, 10 labelled and 990 unlabelled nodes). Once the PF-RobustGC model is built, the prediction is extended to the unlabelled nodes (both those used to build the model and, thanks to the out-of-sample extension, those outside the initial subgraph), and the accuracy is measured. The experiment is repeated 20 times to average the results, shuffling the patterns so that both the labelled nodes and the subgraph change.

The results are depicted in Fig. 7, where the accuracy of PF-RobustGC is shown as a function of the size of the subgraph used to build the model. As a baseline, a PF-RobustGC model using the whole graph is also plotted. The solid lines represent the average accuracy, and the striped regions the areas between the minimum and maximum. We can see that with a relatively small subgraph (of around 300 nodes) the subgraph model is comparable to the complete model (indeed, there is no statistically significant difference (Footnote 2) from iteration 179 on), so we can conclude that, in this particular case, the out-of-sample extension works quite well.

5 Conclusions

Starting from basic spectral graph theory, a novel classification method applicable to both semi-supervised classification and graph data classification has been derived in the framework of manifold learning, namely Robust Graph Classification (RobustGC). The method has a clear interpretation in terms of loss functions and regularization. Noticeably, even though the loss function is concave, we have stated the conditions under which the optimization problem is convex. A simple algorithm to solve this problem has been proposed, which only requires solving a linear system. The results of the method on artificial and real data show that RobustGC is indeed more robust to the presence of wrongly labelled data points, and it is also particularly well-suited when the number of available labels is small. Moreover, an out-of-sample extension for this model is proposed, which allows the initial model to be extended to points outside the graph.

As further work, we intend to study in more detail the possibilities of concave loss functions in supervised problems, bounding the solutions using either regularization terms or other alternative mechanisms. Regarding the selection of \(\gamma \), according to our results the predictions of RobustGC are quite stable with respect to changes in \(\gamma \) in an interval containing the best parameter value. Hence, it seems that a stability criterion could be useful to tune \(\gamma \). On the other hand, RobustGC could be applied to large-scale graph data, where the out-of-sample extension could play a crucial role in making the optimization problem affordable. Moreover, the out-of-sample extension potentially allows the proposed RobustGC method to be used in completely supervised problems, and compared with other classification methods.