1 Introduction

Partitioning data into sensible groups is a fundamental problem in machine learning and science in general. One of the most popular approaches is to find the best (balanced) cut of a graph representing the data, such as the normalized cut of Shi and Malik [24] or the Cheeger ratio cut [9]. However, solving balanced/ratio cut problems is NP-hard, which has led people to compute approximate solutions. The most well-known approach for approximating the solution of a ratio cut is the spectral clustering method, which is based on an ℓ2 relaxation of the original ratio cut. This ℓ2 relaxation reduces to solving a generalized eigenvalue problem for the graph Laplacian, selecting the eigenvector associated with the second smallest eigenvalue, and finally partitioning into two groups by thresholding (which requires testing multiple thresholds). Different normalizations of the graph Laplacian lead to different spectral clustering methods. These methods often provide good solutions but can fail on rather benign problems; see, for example, the two-moons example in Fig. 1. In this case, the relaxation leading to the spectral clustering methods is too weak. A stronger relaxation was introduced by Bühler and Hein in [7]. They described the p-spectral clustering method, which considers the ℓp relaxation of the Cheeger ratio cut instead of the ℓ2 relaxation. They showed that the relaxed solution of the p-spectral clustering problem converges to the solution of the Cheeger cut problem as p→1. In [10, 26] (see also [25]), it was proved that the relaxation for p=1 is actually exact, i.e. the solution of the ℓ1 relaxation problem provides an exact solution of the Cheeger cut problem. Unfortunately, there is no algorithm that is guaranteed to find global minimizers of the ℓ1 relaxation problem (recall that the problem is NP-hard). However, the experiments in [7, 26] showed that good results can be obtained with these stronger relaxations; the works [3, 15, 16] have further strengthened the case for ℓ1 relaxation methods and related ideas, and have charted a new and promising research direction for improving spectral clustering methods.

In this work, we propose to extend [26]. In particular, we are interested in extending it to the challenging multi-class ratio cut problem and in adding label information to obtain a transductive problem. Standard approaches to the unsupervised learning problem usually proceed by recursive two-class clustering. In this paper, we use results recently introduced in imaging science to solve the multi-class learning problem. The papers [1, 6, 8, 19, 20, 28] have proposed tight approximations of the solution of the multi-phase image segmentation problem based on ℓ1 relaxation techniques. The main contribution of this paper is to develop efficient multi-class algorithms for the transductive learning problem. We introduce two multi-class algorithms based on the ℓ1 relaxation of the Cheeger cut and on the piecewise constant Mumford-Shah or Potts models [22, 23]. Experiments show that these new multi-class transductive learning algorithms improve the classification results compared to spectral clustering algorithms, particularly when very few labels are available.

2 Unsupervised Data Classification with ℓ1 Relaxation of the Cheeger Cut

2.1 The Model

In this section, we recall the main result of [26] and propose a modified and improved version of the algorithm introduced there. Let G=(V,E) be a graph where V is the set of nodes and E is the set of edges, weighted by a function \(w_{ij}\), ∀(ij)∈E. A classical method for clustering is to consider the Cheeger minimization problem [9]:

$$\begin{aligned} \min_{\varOmega\subset V} \quad \frac{\operatorname{Cut}(\varOmega,\varOmega^c)}{\min(|\varOmega |,|\varOmega^c|)} \end{aligned}$$
(1)

which partitions the set V of points into two sets Ω and \(\varOmega^{c}\) (the complement of Ω in V). The cut is defined as \(\operatorname{Cut}(\varOmega,\varOmega^{c}):=\sum_{i\in\varOmega,j\in\varOmega ^{c}}w_{ij}\) and |⋅| denotes the number of points in a given set. The Cheeger problem is NP-hard. However, it was shown in [10], and by the authors of this paper using a different argument in [26], that there exists an exact continuous relaxation of (1), as follows. Let us consider the minimization problem w.r.t. a function u:V→[0,1]:

$$\begin{aligned} \min_{u\in[0,1]}\quad \frac{\Vert Du\Vert _1}{\Vert u-m(u)\Vert _1} \end{aligned}$$
(2)

where \(\Vert Du\Vert _1:=\sum_{ij} w_{ij}|u_i-u_j|\) is the graph-based total variation of the function u, m(u) is the median of u, and \(\Vert u-m(u)\Vert _1=\sum_i |u_i-m(u)|\). If a global minimizer \(u^{\star}\) of (2) can be computed, then it can be shown that this minimizer is the indicator function of a set \(\varOmega^{\star}\) (i.e. \(u^{\star}=1_{\varOmega^{\star}}\)) corresponding to a solution of the NP-hard problem (1). But there is no algorithm that guarantees to compute global minimizers of (2), as the problem is non-convex. However, experiments show that the minimization algorithm proposed in [26], which we review below, produces good approximations of the solution.
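To make the objects in (2) concrete, the following is a minimal Python sketch of the graph total variation and of the relaxed Cheeger energy. It assumes the graph is given as a symmetric SciPy sparse weight matrix and adopts the convention that each undirected edge contributes one term (summing over ordered pairs would only rescale the numerator by two); the function names and the toy graph are ours.

```python
import numpy as np
import scipy.sparse as sp

def graph_total_variation(W, u):
    """Graph TV  ||Du||_1 = sum_ij w_ij |u_i - u_j|, with one term per
    undirected edge (i < j) of the symmetric sparse weight matrix W, so that
    the TV of a set indicator equals the cut of that set."""
    Wu = sp.triu(sp.coo_matrix(W), k=1)
    return np.sum(Wu.data * np.abs(u[Wu.row] - u[Wu.col]))

def cheeger_energy(W, u):
    """Relaxed Cheeger energy (2):  ||Du||_1 / ||u - m(u)||_1."""
    return graph_total_variation(W, u) / np.sum(np.abs(u - np.median(u)))

# Toy usage: two 3-node cliques joined by one weak edge; the indicator of one
# clique has a small Cheeger energy (0.1 / 3.0), equal to the ratio in (1).
W = sp.lil_matrix((6, 6))
W[:3, :3] = 1.0; W[3:, 3:] = 1.0; W[2, 3] = W[3, 2] = 0.1
W.setdiag(0.0)
u = np.array([1., 1., 1., 0., 0., 0.])
print(cheeger_energy(W.tocsr(), u))
```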

Recent advances in ℓ1 optimization offer powerful tools to design a fast and accurate algorithm to solve the minimization problem (2). First, observe that minimizing (2) is equivalent to:

$$\begin{aligned} \min_{u\in[0,1]} \quad \frac{\Vert Du\Vert _1}{\Vert u\Vert _1} \quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
(3)

Indeed, the energy is not changed if a constant is added to u, so it is possible to restrict the minimization problem to functions u with zero median. The ratio minimization problem (3) can then be solved using the method of Dinkelbach [11] (also used in imaging problems such as [17, 18]), which introduces the minimax problem:

$$\begin{aligned} \min_{u\in[0,1]}\max_{\lambda\in\mathbb{R}} \quad \Vert Du\Vert _1 - \lambda \Vert u\Vert _1\quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
(4)

Then, we consider a standard two-step iterative algorithm:

  1. (i)

    Fix λ, compute the solution of the constrained minimization problem:

    $$\begin{aligned} u^{n+1}=\operatorname*{argmin}_{u\in[0,1]}\ \Vert Du\Vert _1 - \lambda^n \Vert u\Vert _1\quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
    (5)
  2. (ii)

    Fix u, compute the solution of the maximization problem:

    $$\begin{aligned} \lambda^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}} \ \Vert Du^{n+1}\Vert _1 - \lambda \Vert u^{n+1}\Vert _1 \end{aligned}$$
    (6)

For the minimization problem (5), observe that the zero-median constraint is not linear, but it can be replaced by the approximate linear constraint \(\sum_i u_i \leq |V|/2\). Indeed, if \(u_i\in\{0,1\}\), then the median of u is zero if \(\sum_i u_i \leq \sum_i (1-u_i)\), which yields \(\sum_i u_i \leq |V|/2\). We will use the notation \(\mathbf{1}.u:=\sum_i u_i\) in the rest of the paper.

In order to deal efficiently with the non-differentiability of the ℓ1 norm in (5), a splitting approach combined with an augmented Lagrangian method and the Alternating Direction Method of Multipliers [13] can be used along the same lines as [4, 14]. Hence, we consider the constrained minimization problem:

$$ \begin{aligned} &\min_{u,v\in[0,1],d} \quad \Vert d\Vert _1 - \lambda\Vert v\Vert _1 \\ &\textrm{s.t.} \quad d=Du,\qquad v=u, \qquad \mathbf{1}.v\leq|V|/2 \end{aligned} $$
(7)

whose linear constraints can be enforced with an augmented Lagrangian method as:

$$\begin{aligned} \left \{ \begin{array}{l} (u^{n+1},v^{n+1},d^{n+1})\\ \quad {}= \operatorname*{argmin}_{u,v\in[0,1],d} \Vert d\Vert _1 - \lambda \Vert v\Vert _1 \\ \qquad {}+ \alpha_d.(d-Du)+\frac{r_d}{2}|d-Du|^2\\ \qquad {}+ \alpha_v.(v-u)+\frac{r_v}{2}(v-u)^2 + \alpha_m.(\mathbf{1}.v-|V|/2)\\ \alpha_d^{n+1}=\alpha_d^n+r_d.(d^{n+1}-Du^{n+1})\\ \alpha_v^{n+1}=\alpha_v^n+r_v.(v^{n+1}-u^{n+1})\\ \alpha_m^{n+1}=\max(0,\alpha_m^n+r_m.(\mathbf{1}.v^{n+1}-|V|/2)) \end{array} \right . \end{aligned}$$
(8)

Three sub-minimizations need to be solved. The minimization problem w.r.t. u:

$$\begin{aligned} \min_{u} \frac{r_d}{2} \biggl|Du-\biggl(d+\frac{\alpha_d}{r_d} \biggr) \biggr|^2 + \frac {r_v}{2} \biggl(u-\biggl(v+\frac{\alpha_v}{r_v} \biggr) \biggr)^2 \end{aligned}$$

whose solution u is given by a Poisson problem:

$$\begin{aligned} \bigl(r_v+r_dD^TD\bigr)u=r_dD^T \biggl(d+\frac{\alpha_d}{r_d} \biggr) + r_v \biggl(v+\frac {\alpha_v}{r_v} \biggr) \end{aligned}$$
(9)

The solution of (9) can be approximated with a few conjugate gradient iterations, since D is extremely sparse. The minimization problem w.r.t. v:

$$\begin{aligned} \min_{v\in[0,1]} - \lambda\Vert v\Vert _1 +\frac{r_v}{2} \biggl(v-\biggl(u-\frac{\alpha _v}{r_v}\biggr) \biggr)^2 + \alpha_m.(\mathbf{1}.v-|V|/2) \end{aligned}$$

has an analytical solution given by unshrinkage [26], truncated to [0,1]:

$$\begin{aligned} &v^\star=\varPi_{[0,1]} \biggl(f_v+ \frac{\lambda}{r_v} \frac{f_v}{|f_v|} \biggr), \\ &\quad \textrm{with}\ f_v:=u-\frac{\alpha_v}{r_v}- \frac{\alpha _m}{r_v} \end{aligned}$$
(10)

To avoid the trivial constant solution, we also apply the “renormalization” step: \(v^{\star}\leftarrow\frac{v^{\star}-\min(v^{\star})}{\max(v^{\star})-\min(v^{\star})}\). The minimization problem w.r.t. d:

$$\begin{aligned} \min_{d} \Vert d\Vert _1 +\frac{r_d}{2} \biggl|d- \biggl(Du-\frac{\alpha_d}{r_d}\biggr) \biggr|^2 \end{aligned}$$

also has an analytical solution, given by shrinkage [12]:

$$\begin{aligned} &d^\star=\max \biggl(|f_d|-\frac{1}{r_d},0 \biggr) \frac{f_d}{|f_d|}, \\ &\quad \textrm{with}\ f_d:=Du-\frac{\alpha_d}{r_d} \end{aligned}$$
(11)
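The three sub-minimizations can be sketched as follows, assuming D is assembled as a weighted edge-node incidence matrix so that \(\Vert Du\Vert_1\) is the graph total variation defined above. The function names, the default parameters and the use of np.sign in place of f/|f| (which returns 0 when f=0) are our choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def incidence(W):
    """Weighted edge-node incidence matrix D with (Du)_e = w_ij (u_i - u_j),
    one row per undirected edge (i < j), so that ||Du||_1 is the graph TV."""
    Wu = sp.triu(sp.coo_matrix(W), k=1)
    e = np.arange(Wu.nnz)
    return sp.coo_matrix((np.r_[Wu.data, -Wu.data],
                          (np.r_[e, e], np.r_[Wu.row, Wu.col])),
                         shape=(Wu.nnz, W.shape[0])).tocsr()

def solve_u(D, d, a_d, v, a_v, u0, r_d=10.0, r_v=100.0, n_cg=10):
    """u-subproblem: a few conjugate gradient steps on the linear system (9).
    (The matrix A could be precomputed once outside the iterations.)"""
    A = (r_v * sp.identity(D.shape[1]) + r_d * (D.T @ D)).tocsr()
    b = r_d * D.T @ (d + a_d / r_d) + r_v * (v + a_v / r_v)
    u, _ = cg(A, b, x0=u0, maxiter=n_cg)
    return u

def update_v(u, a_v, a_m, lam, r_v=100.0):
    """v-subproblem: unshrinkage (10), truncation to [0,1], then the
    renormalization step that rescales v to span [0,1]."""
    f = u - a_v / r_v - a_m / r_v
    v = np.clip(f + (lam / r_v) * np.sign(f), 0.0, 1.0)
    if v.max() > v.min():                 # avoid division by zero for constant v
        v = (v - v.min()) / (v.max() - v.min())
    return v

def update_d(Du, a_d, r_d=10.0):
    """d-subproblem: shrinkage (soft-thresholding) (11) with threshold 1/r_d."""
    f = Du - a_d / r_d
    return np.maximum(np.abs(f) - 1.0 / r_d, 0.0) * np.sign(f)
```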

For the maximization problem (6), the solution is as follows:

$$\begin{aligned} \lambda^{n+1}=\frac{\Vert Du^{n+1}\Vert _1}{\Vert u^{n+1}\Vert _1} \end{aligned}$$
(12)

We will consider a steepest gradient descent method instead of (12) to get a smoother evolution of \(\lambda^{n+1}\):

$$\begin{aligned} \lambda^{n+1}=\lambda^n-\delta_\lambda. \biggl( \lambda^n-\frac {\Vert Du^{n+1}\Vert _1}{\Vert u^{n+1}\Vert _1} \biggr) \end{aligned}$$
(13)

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 1.

Algorithm 1

Unsupervised learning with ℓ1 relaxation of the Cheeger cut
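For concreteness, the following is a minimal sketch of how the pieces above could be assembled into Algorithm 1, reusing incidence, solve_u, update_v and update_d from the previous sketch. The loop lengths, the random initialization and the final thresholding of the relaxed solution are our own choices and are not prescribed by the pseudo-code.

```python
import numpy as np

def cheeger_clustering(W, n_outer=100, n_inner=5, r_d=10.0, r_v=100.0,
                       r_m=None, delta_lam=0.4, seed=0):
    """Sketch of Algorithm 1: Dinkelbach outer loop (13) around the augmented
    Lagrangian iterations (8)-(11).  Returns a two-class partition."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    r_m = 6.0 * 2 / N if r_m is None else r_m        # r_m = 6K/N with K = 2
    D = incidence(W)
    u = rng.random(N)                                # random initialization
    v, d = u.copy(), D @ u
    a_d, a_v, a_m = np.zeros(D.shape[0]), np.zeros(N), 0.0
    lam = np.abs(D @ u).sum() / np.abs(u).sum()
    for _ in range(n_outer):
        for _ in range(n_inner):
            u = solve_u(D, d, a_d, v, a_v, u, r_d, r_v)
            v = update_v(u, a_v, a_m, lam, r_v)
            d = update_d(D @ u, a_d, r_d)
            # multiplier updates (8); the median constraint is 1.v <= |V|/2
            a_d = a_d + r_d * (d - D @ u)
            a_v = a_v + r_v * (v - u)
            a_m = max(0.0, a_m + r_m * (v.sum() - N / 2.0))
        # relaxed Dinkelbach update (13) of lambda
        lam = lam - delta_lam * (lam - np.abs(D @ u).sum() / max(np.abs(u).sum(), 1e-12))
    return (v > 0.5).astype(int)   # simple thresholding of the relaxed solution
```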

2.2 Experiments

In this section, we demonstrate results obtained with the unsupervised classification Algorithm 1. For each experiment, we build the weight matrix using the self-tuning construction of [29]. We use ten nearest neighbors, and the tenth neighbor determines the local scale. The universal scaling parameter is set to 1. For Algorithm 1, we set \(r_d=10\), \(r_v=100\), \(r_m=6K/N\), where N is the number of data points and K is the number of classes, and \(\delta_\lambda=0.4\). Figure 1 presents the well-known two-moon dataset [7]. Each moon has 1,000 data points in \(\mathbb{R}^{100}\). This example shows that the solution of the ℓ1 relaxation is tighter than the solution of the ℓ2 relaxation (see caption for more details). In Table 1, we compare our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15], which is available at http://www.ml.uni-saarland.de/code/oneSpectralClustering/oneSpectralClustering.html ([16] is not yet available for comparison). Our method and [15] outperform the spectral clustering method.
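A possible sketch of the self-tuning weight construction described above (ten nearest neighbors, local scale given by the tenth neighbor, universal scaling parameter equal to 1) is given below; the exact neighborhood and symmetrization conventions of [29] may differ slightly, and the helper is our own.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial import cKDTree

def self_tuning_weights(X, k=10):
    """k-NN weight matrix with self-tuning local scales in the spirit of [29]:
    w_ij = exp(-||x_i - x_j||^2 / (sigma_i sigma_j)),  sigma_i = distance from
    x_i to its k-th nearest neighbour; the graph is then symmetrized."""
    N = X.shape[0]
    dist, idx = cKDTree(X).query(X, k=k + 1)   # k+1: the first hit is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]
    sigma = dist[:, -1]                        # local scale: k-th neighbour distance
    rows = np.repeat(np.arange(N), k)
    cols = idx.ravel()
    vals = np.exp(-dist.ravel() ** 2 / (sigma[rows] * sigma[cols]))
    W = sp.coo_matrix((vals, (rows, cols)), shape=(N, N)).tocsr()
    return W.maximum(W.T)                      # keep an edge if either point selects the other
```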

Fig. 1

Unsupervised classification of the two-moon dataset. Each moon has 1,000 data points in \(\mathbb{R}^{100}\). Figure (b) is the result given by the spectral clustering method of Shi and Malik [24]; it fails to produce the correct result because the ℓ2 relaxation is too weak. Figure (d) is the result of the ℓ1 relaxation algorithm and figure (c) is the random initialization. The proposed algorithm succeeds in computing the correct result, which also shows that the solution of the ℓ1 relaxation is tighter than that of the ℓ2 relaxation (color figure)

Table 1 Unsupervised learning results for the two-moon dataset. Column (a) reports the minimum energy value (1) obtained over 100 random initializations, and the error is the percentage of misclassified data for the minimum-energy solution. Column (b) reports the average energy value (1) and the misclassification error over the 100 random initializations. Note that the same random initializations were used for Algorithm 1 and [15]

In Fig. 2, we apply the standard recursive two-class partitioning approach to deal with more than two classes. Figure 2(b) shows the result by spectral clustering and Fig. 2(c) presents the result with our algorithm (see caption for more details).

Fig. 2

Unsupervised classification of the four-moon dataset, using the standard recursive two-class partitioning approach. Figure (b) shows the result of spectral clustering [24] and figure (c) the result of Algorithm 1. Although our algorithm produces a better result than spectral clustering, it still fails to compute the correct partition. When more than two classes are considered, the quality of the results given by the recursive algorithm strongly depends on the choice of the initialization; for most initializations, the standard recursive two-class partitioning approach will not find the correct solution (color figure)

On the right hand side of Fig. 3, we display a projection of the MNIST benchmark dataset, available at http://yann.lecun.com/exdb/mnist/, to 3 dimensions via PCA. This data set consists of 70,000 28×28 images of handwritten digits, 0 through 9, usually broken into a 60,000-point training set and a 10,000-point test set; thus the data is presented as 70,000 points in \(\mathbb{R}^{784}\). The data was preprocessed by projecting onto 50 principal components. Table 2 compares our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15]. Our method and [15] outperform the spectral clustering method.

Fig. 3

Projection into a 3D space (via PCA) of the MNIST benchmark dataset. This data set consists of 60,000 training images and 10,000 test images of size 28×28 (each image is a data point in \(\mathbb{R}^{784}\)) of handwritten digits, 0 through 9 (color figure)

Table 2 Unsupervised learning results for the MNIST dataset. Column (a) reports the minimum energy value (1) obtained over 10 random initializations, and the error is the percentage of misclassified data for the minimum-energy solution. Column (b) reports the average energy value (1) and the misclassification error over the 10 random initializations. Note that the same random initializations were used for Algorithm 1 and [15]

3 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Cheeger Cut

In this section, we extend the unsupervised two-phase Cheeger learning algorithm of Sect. 2 to a transductive multi-class Cheeger learning algorithm. The most natural extension of (1) to K classes is as follows:

$$\begin{aligned} &\min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum _{k=1}^K \frac{\operatorname{Cut}(\varOmega_k,\varOmega _k^c)}{\min(|\varOmega_k|,|\varOmega_k^c|)} \\ &\textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V \quad \textrm{and}\quad \varOmega_i \cap \varOmega_j=\emptyset\quad \forall i\not=j \end{aligned}$$

The previous minimization problem is equivalent to the following problem:

$$ \begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum_{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1} \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V\end{aligned} $$
(14)

The constraint set of the above minimization problem is not convex because the set of binary functions is not convex. We therefore consider the following relaxation:

$$ \begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]} \quad \sum_{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1} \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \end{aligned} $$
(15)

Recall from Sect. 2 that the continuous ℓ1 relaxation of the two-phase Cheeger minimization problem is exact, meaning that the (continuous) solution of (2) provides a (discrete) solution of the original Cheeger problem (1). We do not know whether the ℓ1 relaxation remains exact when multiple classes are considered, i.e. whether the (continuous) solution of (15) provides a (discrete) solution of the original multi-class Cheeger problem (14). For the multi-class Cheeger-based learning problem considered in this paper, experiments show that the solutions \(\{u_{k}\}_{k=1}^{K}\) are close to binary functions, but there is no theoretical guarantee for this observation.

Since the transductive learning problem is considered here, a (small) set \(l_k\) of labeled points is given for each class \(\varOmega_k\) (i.e. \(l_k\subset\varOmega_k\), see Fig. 4), and the following minimization problem is considered:

$$\begin{aligned} \begin{aligned} & \min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum_{k=1}^K \frac{\operatorname{Cut}(\varOmega_k,\varOmega _k^c)}{\min(|\varOmega_k|,|\varOmega_k^c|)}\\ & \textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V \quad \textrm{and}\quad \varOmega_i \cap\varOmega_j=\emptyset \quad \forall i\not=j \\ &\quad \textrm{and given}\ \{l_k\}_{k=1}^K \end{aligned} \end{aligned}$$
(16)

which is equivalent to:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum _{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1}\\ & \textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p\ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

and which is relaxed to:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]}\quad \sum _{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1}\\ & \textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$
Fig. 4

Illustration of the multi-class transductive problem. Given a set of labeled data points \(\{l_{k}\}_{k=1}^{K}\) (the colored points), the objective is to find the data classes \(\{\varOmega_{k}\}_{k=1}^{K}\) that minimize the Cheeger energy (16)

We now extend the two-phase algorithm designed in Sect. 2 to the multi-phase case:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]}\max _{\{\lambda_k\}_{k=1}^K\in\mathbb {R}}\quad \sum_{k=1}^K \Vert Du_k\Vert _1 - \lambda_k \Vert u_k\Vert _1\\ & \textrm{s.t.}\quad m(u_k)=0,\qquad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

The median constraint is relaxed to \(\mathbf{1}.u_k\leq|V|/K\). We again consider a standard two-step iterative algorithm:

  1. (i)

    Fix λ k , compute the solution for the K minimization problems:

    $$\begin{aligned} &u_k^{n+1}= \operatorname*{argmin}_{u_k\in[0,1]}\ \Vert Du_k\Vert _1 - \lambda_k^n \Vert u_k\Vert _1\\ &\textrm{s.t.}\quad m(u_k)=0,\qquad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$
  2. (ii)

    Fix u k , compute the solution of the K maximization problems:

    $$\begin{aligned} \lambda_k^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}} \bigl\Vert Du_k^{n+1}\bigr\Vert _1 - \lambda \bigl\Vert u_k^{n+1}\bigr\Vert _1 \end{aligned}$$
    (17)

The K minimization problems in step (i) are solved as follows:

$$\begin{aligned} \left\{ \begin{array}{l} (u_k^{n+1},v_k^{n+1},d_k^{n+1})\\ \quad {}= \operatorname*{argmin}_{u_k,v_k\in[0,1],d_k} \Vert d_k\Vert _1 \\ \qquad {}- \lambda_k\Vert v_k\Vert _1 + {\alpha_d}_k.(d_k-Du_k)+\frac {r_d}{2}|d_k-Du_k|^2\\ \qquad {}+ {\alpha_v}_k.(v_k-u_k)+\frac{r_v}{2}(v_k-u_k)^2\\ \qquad {} + {\alpha_m}_k.(\mathbf{1}.v_k-|V|/K)\\ \textrm{s.t.}\quad \sum_{k=1}^K v_k=1\ \textrm{and}\\ v_k(i)= \left\{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right. \\ {\alpha_d}_k^{n+1}={\alpha_d}_k^n+r_d. \bigl(d_k^{n+1}-Du_k^{n+1}\bigr) \\ {\alpha_v}_k^{n+1}={\alpha_v}_k^n+r_v. \bigl(v_k^{n+1}-u_k^{n+1}\bigr) \\ {\alpha_m}_k^{n+1}=\max\bigl(0,{\alpha_m}_k^n+r_m.\bigl(\mathbf{1}.v_k^{n+1}-|V|/K\bigr)\bigr) \end{array} \right. \end{aligned}$$
(18)

The solutions of the sub-minimization problems w.r.t. \(u_k, v_k, d_k\) are the same as in the previous section. Finally, the projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}= 1\) is given by [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) of (18) is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. The most natural conversion is as follows:

$$\begin{aligned} \hat{u}^\star_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ k=\arg\max_{p\in\{1,\ldots,K\}} u^\star_p(i) \\ 0& \textrm{otherwise} \end{array} \right .\quad \forall i\in V \end{aligned}$$
(19)

where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) are binary functions satisfying \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).
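The simplex projection of [21, 28] and the binary conversion (19) can be sketched as follows; the sorting-based projection below is a standard construction and is not necessarily the exact algorithm used in [21].

```python
import numpy as np

def project_simplex(V):
    """Row-wise Euclidean projection of V (shape N x K) onto the probability
    simplex {v >= 0, sum_k v_k = 1}, via the standard sorting-based algorithm."""
    N, K = V.shape
    U = -np.sort(-V, axis=1)                    # each row sorted in decreasing order
    css = np.cumsum(U, axis=1) - 1.0
    j = np.arange(1, K + 1)
    rho = K - 1 - np.argmax((U - css / j > 0)[:, ::-1], axis=1)   # last valid index
    theta = css[np.arange(N), rho] / (rho + 1.0)
    return np.maximum(V - theta[:, None], 0.0)

def to_binary(U):
    """Binary conversion (19): one-hot encoding of the row-wise argmax."""
    B = np.zeros_like(U)
    B[np.arange(U.shape[0]), np.argmax(U, axis=1)] = 1.0
    return B
```

In the transductive sketches below, the labeled entries of v are simply re-imposed (set to their one-hot labels) after the projection so that the label constraints hold exactly; this is one natural way to handle them, and an assumption on our part.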

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 2. Finally, Fig. 5 presents a simple illustration of the proposed multi-class Cheeger transductive model.

Algorithm 2

Transductive learning with ℓ1 relaxation of the multi-class Cheeger cut
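A compact sketch of how Algorithm 2 could be organized is given below, reusing incidence, solve_u, update_v, update_d and project_simplex from the earlier sketches. The way the per-class two-class sub-steps, the simplex projection and the label enforcement are interleaved reflects our assumptions about the loop structure, not a transcription of the algorithm box.

```python
import numpy as np

def transductive_cheeger(W, labels, K, n_outer=100, n_inner=5, r_d=10.0,
                         r_v=100.0, r_m=None, delta_lam=0.4, seed=0):
    """Sketch of Algorithm 2.  `labels` maps a labeled node index to its class
    in {0, ..., K-1}.  Returns a class index for every node."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    r_m = 6.0 * K / N if r_m is None else r_m
    D = incidence(W)
    lab_idx = np.array(sorted(labels))
    lab_cls = np.array([labels[i] for i in lab_idx])
    U = rng.random((N, K)); U /= U.sum(axis=1, keepdims=True)
    V, Dd = U.copy(), np.zeros((D.shape[0], K))
    a_d, a_v, a_m = np.zeros((D.shape[0], K)), np.zeros((N, K)), np.zeros(K)
    lam = np.array([np.abs(D @ U[:, k]).sum() / np.abs(U[:, k]).sum() for k in range(K)])
    for _ in range(n_outer):
        for _ in range(n_inner):
            for k in range(K):                 # per-class sub-problems of (18)
                U[:, k] = solve_u(D, Dd[:, k], a_d[:, k], V[:, k], a_v[:, k], U[:, k], r_d, r_v)
                V[:, k] = update_v(U[:, k], a_v[:, k], a_m[k], lam[k], r_v)
                Dd[:, k] = update_d(D @ U[:, k], a_d[:, k], r_d)
            # couple the classes: simplex projection of v, then re-impose the labels
            V = project_simplex(V)
            V[lab_idx, :] = 0.0
            V[lab_idx, lab_cls] = 1.0
            # multiplier updates (18); the median constraint is 1.v_k <= |V|/K
            a_d += r_d * (Dd - D @ U)
            a_v += r_v * (V - U)
            a_m = np.maximum(0.0, a_m + r_m * (V.sum(axis=0) - N / K))
        for k in range(K):                     # per-class relaxed Dinkelbach step, cf. (13)
            lam[k] -= delta_lam * (lam[k] - np.abs(D @ U[:, k]).sum() / max(np.abs(U[:, k]).sum(), 1e-12))
    return np.argmax(U, axis=1)                # binary conversion (19)
```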

Fig. 5

Transductive classification of the four-moon dataset. The objective is to classify the four moons using 3 labels per moon. Figure (b) presents the result of the spectral method (ℓ2 relaxation) and figure (c) shows the result of the ℓ1 relaxation of the multi-class Cheeger cut (Algorithm 2). The ℓ1 relaxation produces a better classification result than the ℓ2 relaxation (color figure)

4 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Mumford-Shah-Potts Model

In this section, we develop an alternative to the multi-class Cheeger transductive classification algorithm introduced in the previous section. A successful multi-phase segmentation algorithm in imaging is the multi-phase piecewise constant Mumford-Shah method [22] (continuous setting) or the Potts method [23] (discrete setting). These methods are well suited to the image segmentation problem, and the idea in this section is to extend them to the transductive learning problem. Note that the piecewise constant Mumford-Shah/Potts models were first implemented with the level set method [27, 30] and the graph cut method [5]. However, these methods are either too slow, not optimal, not accurate enough, or too demanding in memory. Recent advances in ℓ1 optimization algorithms provide efficient tools to solve the piecewise constant Mumford-Shah/Potts models [1, 6, 8, 19, 20, 28]. These recent improvements will be used to develop an efficient algorithm for the transductive Potts model:

$$ \begin{aligned} &\min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum_{k=1}^K \underbrace{\operatorname{Cut}\bigl(\varOmega _k,\varOmega_k^c \bigr)}_{\simeq \operatorname{Per}(\varOmega_k)}\\ & \textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V\quad \textrm{and}\quad \varOmega_i \cap\varOmega_j=\emptyset \quad \forall i\not=j \\ &\quad \textrm{and given}\ \{l_k\}_{k=1}^K \end{aligned} $$
(20)

where \(\operatorname{Per}\) stands for perimeter. The relationship between cut and perimeter comes from the coarea formula [25]. In the continuous setting, we have \(\operatorname{Per}(\varOmega)=\int_{\mathcal{M}\subset\mathbb{R}^{d}}|\nabla f|\) when \(f=1_{\varOmega}(x)\) is the indicator function of a geometric set Ω, defined as f(x)=1 ∀x∈Ω and 0 otherwise. Then, discretizing the total variation energy leads to \(\int_{\mathcal {M}\subset\mathbb{R}^{d}}|\nabla f|\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|\), and plugging in the indicator function \(f_{i}=1_{\varOmega}(i)\) finally gives \(\operatorname{Per}(\varOmega)\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|=\sum_{i\in \varOmega, j\in\varOmega^{c}}w_{i,j}=\operatorname{Cut}(\varOmega,\varOmega^{c})\). The minimization problem (20) is equivalent to the following problem:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum _{k=1}^K \Vert Du_k\Vert _1 \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

The constraint set of the above minimization problem is not convex because the set of binary functions is not convex. We therefore consider the following relaxation:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]} \quad \sum _{k=1}^K \Vert Du_k\Vert _1 \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \textrm{ and } k\not=p \end{array} \right . \end{aligned}$$

The previous minimization problem is solved as:

$$\begin{aligned} \left\{ \begin{array}{l} (u_k^{n+1},v_k^{n+1},d_k^{n+1})\\ \quad {}=\operatorname*{argmin}_{u_k,v_k\in[0,1],d_k} \Vert d_k\Vert _1 \\ \qquad {}+ {\alpha_d}_k.(d_k-Du_k)+\frac{r_d}{2}|d_k-Du_k|^2\\ \qquad {}+ {\alpha_v}_k.(v_k-u_k)+\frac{r_v}{2}(v_k-u_k)^2 \\ \textrm{s.t.}\quad \sum_{k=1}^K v_k=1\quad \textrm{and}\\ v_k(i)= \left\{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right. \\ {\alpha_d}_k^{n+1}={\alpha_d}_k^n+r_d. \bigl(d_k^{n+1}-Du_k^{n+1}\bigr) \\ {\alpha_v}_k^{n+1}={\alpha_v}_k^n+r_v. \bigl(v_k^{n+1}-u_k^{n+1}\bigr) \end{array} \right. \end{aligned}$$

The solutions of the minimization problems w.r.t. \(u_k, d_k\) are the same as in Sect. 2. The minimization w.r.t. \(v_k\) is simply given by:

$$\begin{aligned} v_k^\star=\varPi_{[0,1]}({f_v}_k) \quad \textrm{with}\ {f_v}_k:=u_k- \frac {{\alpha_v}_k}{r_v} \end{aligned}$$
(21)

followed by a projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}=1\) using [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. As in the previous section, the binary conversion is as follows:

$$\begin{aligned} \hat{u}^\star_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ k=\arg\max_{p\in\{1,\ldots,K\}} u^\star_p(i)\\ 0& \textrm{otherwise} \end{array} \right .\quad \forall i\in V \end{aligned}$$
(22)

where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) satisfy \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 3.

Algorithm 3

Transductive learning with ℓ1 relaxation of the multi-class Mumford-Shah-Potts model
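A corresponding sketch of Algorithm 3 is given below, again reusing the earlier helper functions (incidence, solve_u, update_d, project_simplex). Compared with the Algorithm 2 sketch it drops the λ_k terms and the median constraint, so the v-update follows (21) before the simplex projection and label enforcement; the loop structure is our assumption.

```python
import numpy as np

def transductive_potts(W, labels, K, n_iter=200, r_d=10.0, r_v=100.0, seed=0):
    """Sketch of Algorithm 3: same splitting as Algorithm 2, without the
    lambda_k terms and the median multiplier."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    D = incidence(W)
    lab_idx = np.array(sorted(labels))
    lab_cls = np.array([labels[i] for i in lab_idx])
    U = rng.random((N, K)); U /= U.sum(axis=1, keepdims=True)
    V, Dd = U.copy(), np.zeros((D.shape[0], K))
    a_d, a_v = np.zeros((D.shape[0], K)), np.zeros((N, K))
    for _ in range(n_iter):
        for k in range(K):
            # u_k-update: a few CG steps on (9); d_k-update: shrinkage (11)
            U[:, k] = solve_u(D, Dd[:, k], a_d[:, k], V[:, k], a_v[:, k], U[:, k], r_d, r_v)
            Dd[:, k] = update_d(D @ U[:, k], a_d[:, k], r_d)
        # v-update (21): clip to [0,1], project onto the simplex, re-impose labels
        V = project_simplex(np.clip(U - a_v / r_v, 0.0, 1.0))
        V[lab_idx, :] = 0.0
        V[lab_idx, lab_cls] = 1.0
        # multiplier updates
        a_d += r_d * (Dd - D @ U)
        a_v += r_v * (V - U)
    return np.argmax(U, axis=1)                # binary conversion (22)
```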

5 Experiments

In this section, we show classification results using the transductive algorithms developed in Sects. 3 and 4. We work on the four-moon and MNIST datasets described above. For both datasets, we build the weight matrix using the self-tuning construction of [29]. We use ten nearest neighbors, and the tenth neighbor determines the local scale. The universal scaling parameter is set to 1. For Algorithm 2, we set \(r_d=10\), \(r_v=100\), \(r_m=6K/N\), where N is the number of data points and K is the number of classes, and \(\delta_\lambda=0.4\). For Algorithm 3, we set \(r_d=10\) and \(r_v=100\). We choose the labeled points randomly, and fix the number of labeled points drawn from each class.

We compare Algorithm 2 and Algorithm 3 with a spectral transductive learning method from [2], which uses linear least squares on the eigenvectors of the normalized Laplacian to estimate the classes. That is, given the weight matrix W as before, we set \(\mathcal{L}=I-S^{-1/2}WS^{-1/2}\), where S is the diagonal matrix of row sums, that is, \(S_{ii}=\sum_j W_{ij}\). We compute the l+1 eigenvectors \(\phi_0,\ldots,\phi_l\) of \(\mathcal{L}\) associated with the lowest eigenvalues, and form the N×l matrix \(\varPhi=[\phi_1\cdots\phi_l]\); note that, as usual, we have omitted the density vector \(\phi_0\). Each row of Φ corresponds to a data point. Next, we form the matrix \(\varPhi_{\text{lab}}\) by extracting the rows of Φ corresponding to the labeled data points. Let L denote the number of classes and p the number of labeled data points. Given the p×L binary label matrix Y, we compute

$$A= \bigl(\varPhi_{\text{lab}}^T\varPhi_{\text{lab}} \bigr)^{-1}\varPhi_{\text {lab}}^T Y $$

which minimizes the least square energy:

$$ \|Y-\varPhi_{\text{lab}}A\|^2_2 $$
(23)

To compute the class labels of the unlabeled points, we set R=ΦA, and let

$$y_j=\operatorname*{argmax}_i R_{ji} $$
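A minimal sketch of this spectral baseline is given below, assuming a sparse symmetric weight matrix and a dictionary mapping labeled nodes to their classes; the eigensolver settings, the default number of eigenvectors and the handling of \(\phi_0\) are our own choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_transductive(W, labels, n_classes, l=20):
    """Least-squares baseline of [2] as described above: regress the one-hot
    labels on the l lowest nontrivial eigenvectors of the normalized Laplacian."""
    N = W.shape[0]
    s = np.asarray(W.sum(axis=1)).ravel()
    S_inv_sqrt = sp.diags(1.0 / np.sqrt(s))
    Lap = sp.identity(N) - S_inv_sqrt @ W @ S_inv_sqrt
    # l+1 smallest eigenpairs; drop the first (density) eigenvector phi_0
    vals, vecs = eigsh(Lap, k=l + 1, which='SM')
    Phi = vecs[:, np.argsort(vals)][:, 1:]
    lab_idx = np.array(sorted(labels))
    Y = np.zeros((len(lab_idx), n_classes))
    Y[np.arange(len(lab_idx)), [labels[i] for i in lab_idx]] = 1.0
    # A = (Phi_lab^T Phi_lab)^{-1} Phi_lab^T Y, i.e. the least-squares problem (23)
    A, *_ = np.linalg.lstsq(Phi[lab_idx], Y, rcond=None)
    R = Phi @ A
    return np.argmax(R, axis=1)      # class of each (labeled or unlabeled) point
```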

Tables 3 and 4 compare the proposed ℓ1 relaxations of the multi-class Cheeger cut (Algorithm 2) and of the Mumford-Shah-Potts model (Algorithm 3) with the competitive spectral method of [2] (selecting the number l of eigenvectors that minimizes the error). We tested different numbers of labels (\(n_l\) is the number of labeled data points per class), selected randomly. We repeat each experiment 10 times. For each experiment, the labeled points were chosen randomly, and the same labeled points were used for the multi-class Cheeger cut model, the Mumford-Shah-Potts model and the spectral method. The ℓ1 relaxations of the multi-class Cheeger cut and the Mumford-Shah-Potts model outperform the spectral method in all cases, significantly so when a very small number of points is labeled.

Table 3 Transductive learning results for the four-moon dataset. Column (I) reports the minimum energy value over 10 random tests, with the associated misclassification error and computational time. Column (II) reports the average energy value, misclassification error and computational time over the 10 tests. The energy considered is (16) for the Cheeger model, (20) for the Mumford-Shah-Potts model and (23) for the spectral method. Finally, \(n_l\) is the number of (randomly selected) labeled data points per class
Table 4 Transductive learning results for the MNIST dataset. Column (I) reports the minimum energy value over 10 random tests, with the associated misclassification error and computational time. Column (II) reports the average energy value, misclassification error and computational time over the 10 tests. The energy considered is (16) for the Cheeger model, (20) for the Mumford-Shah-Potts model and (23) for the spectral method. Finally, 60,000 unlabeled data points are used together with \(n_l\times 10\) labeled data points, where \(n_l\) is the number of (randomly selected) labeled data points per class; the total number of data points in each experiment is therefore \(60{,}000+n_l\times10\)

6 Conclusion

This paper introduces new ℓ1 relaxation methods for the multi-class transductive learning problem. These relaxation methods are inspired by recent advances in imaging science, which offer fast, accurate and robust ℓ1 optimization tools that make it possible to go beyond standard ℓ2 relaxation methods, i.e. spectral clustering methods. Experiments demonstrate that the ℓ1 relaxations of the multi-class Cheeger cut and of the Mumford-Shah-Potts model outperform the spectral clustering method, all the more so when a very small number of labels is considered.

Reproducible Research

The code is available at http://www.cs.cityu.edu.hk/~xbresson/codes.html#learningmulti.