1 Introduction

Partitioning data into sensible groups is a fundamental problem in machine learning and science in general. One of the most popular approaches is to find the best (balanced) cut of a graph representing the data, such as the normalized cut of Shi and Malik [24] or the Cheeger ratio cut [9]. However, solving balanced/ratio cut problems is NP-hard, which has led people to compute approximate solutions. The most well-known approach for approximating the solution of a ratio cut is the spectral clustering method, which is based on an ℓ2 relaxation of the original ratio cut. This ℓ2 relaxation reduces to solving a generalized eigenvalue problem for the graph Laplacian, selecting the eigenvector associated with the second smallest eigenvalue, and finally partitioning into two groups by thresholding (which requires testing multiple thresholds). Different normalizations of the graph Laplacian lead to different spectral clustering methods. These methods often provide good solutions but can fail on rather benign problems; see, for example, the two-moons example in Fig. 1. In this case, the relaxation leading to the spectral clustering methods is too weak. A stronger relaxation was introduced by Bühler and Hein in [7]. They described the p-spectral clustering method, which considers the ℓp relaxation of the Cheeger ratio cut instead of the ℓ2 relaxation. They showed that the relaxed solution of the p-spectral clustering problem converges to the solution of the Cheeger cut problem as p→1. In [10, 26] (see also [25]), it was proved that the relaxation for p=1 is actually exact, i.e. the solution of the ℓ1 relaxation problem provides an exact solution of the Cheeger cut problem. Unfortunately, there is no algorithm that is guaranteed to find global minimizers of the ℓ1 relaxation problem (recall that the problem is NP-hard). However, the experiments in [7, 26] showed that good results can be obtained with these stronger relaxations; the works [3, 15, 16] have further strengthened the case for ℓ1 relaxation methods and related ideas, and have charted a new and promising research direction for improving spectral clustering methods.

In this work, we propose to extend [26]. In particular, we are interested in extending it to the challenging multi-class ratio cut problem and in adding label information to obtain a transductive problem. Standard approaches to the unsupervised learning problem usually proceed by recursive two-class clustering. In this paper, we use results recently introduced in imaging science to solve the multi-class learning problem. The papers [1, 6, 8, 19, 20, 28] have proposed tight approximations of the solution of the multi-phase image segmentation problem based on ℓ1 relaxation techniques. The main contribution of this paper is to develop efficient multi-class algorithms for the transductive learning problem. We introduce two multi-class algorithms based on the ℓ1 relaxation of the Cheeger cut and on the piecewise constant Mumford-Shah or Potts models [22, 23]. Experiments show that these new multi-class transductive learning algorithms improve the classification results compared to spectral clustering algorithms, particularly when very few labels are available.

2 Unsupervised Data Classification with ℓ1 Relaxation of the Cheeger Cut

2.1 The Model

In this section, we recall the main result of [26] and propose a modified and improved version of the algorithm introduced there. Let G=(V,E) be a graph where V is the set of nodes and E is the set of edges, weighted by a function \(w_{ij}\), ∀(ij)∈E. A classical method for clustering is to consider the Cheeger minimization problem [9]:

$$\begin{aligned} \min_{\varOmega\subset V} \quad \frac{\operatorname{Cut}(\varOmega,\varOmega^c)}{\min(|\varOmega |,|\varOmega^c|)} \end{aligned}$$
(1)

which partitions the set V of points into two sets Ω and \(\varOmega^{c}\) (the complement of Ω in V). The cut is defined as \(\operatorname{Cut}(\varOmega,\varOmega^{c}):=\sum_{i\in\varOmega,j\in\varOmega ^{c}}w_{ij}\) and |⋅| denotes the number of points in a given set. The Cheeger problem is NP-hard. However, it was shown in [10], and by the authors of this paper using a different argument in [26], that there exists an exact continuous relaxation of (1), as follows. Let us consider the minimization problem w.r.t. a function u:V→[0,1]:

$$\begin{aligned} \min_{u\in[0,1]}\quad \frac{\Vert Du\Vert _1}{\Vert u-m(u)\Vert _1} \end{aligned}$$
(2)

where \(\Vert Du\Vert _1:=\sum_{ij} w_{ij}|u_i-u_j|\) is the graph-based total variation of the function u, m(u) is the median of u, and \(\Vert u-m(u)\Vert _1=\sum_i |u_i-m(u)|\). If a global minimizer \(u^{\star}\) of (2) can be computed, then it can be shown that this minimizer is the indicator function of a set \(\varOmega^{\star}\) (i.e. \(u^{\star}=1_{\varOmega^{\star}}\)) corresponding to a solution of the NP-hard problem (1). But there is no algorithm that guarantees to compute global minimizers of (2), as the problem is non-convex. However, experiments show that the minimization algorithm proposed in [26], which we review below, produces good approximations of the solution.
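To make the objects in (2) concrete, the following is a minimal Python sketch of the graph total variation and of the relaxed Cheeger energy. It assumes the graph is given as a symmetric SciPy sparse weight matrix and adopts the convention that each undirected edge contributes one term (summing over ordered pairs would only rescale the numerator by two); the function names and the toy graph are ours.

```python
import numpy as np
import scipy.sparse as sp

def graph_total_variation(W, u):
    """Graph TV  ||Du||_1 = sum_ij w_ij |u_i - u_j|, with one term per
    undirected edge (i < j) of the symmetric sparse weight matrix W, so that
    the TV of a set indicator equals the cut of that set."""
    Wu = sp.triu(sp.coo_matrix(W), k=1)
    return np.sum(Wu.data * np.abs(u[Wu.row] - u[Wu.col]))

def cheeger_energy(W, u):
    """Relaxed Cheeger energy (2):  ||Du||_1 / ||u - m(u)||_1."""
    return graph_total_variation(W, u) / np.sum(np.abs(u - np.median(u)))

# Toy usage: two 3-node cliques joined by one weak edge; the indicator of one
# clique has a small Cheeger energy (0.1 / 3.0), equal to the ratio in (1).
W = sp.lil_matrix((6, 6))
W[:3, :3] = 1.0; W[3:, 3:] = 1.0; W[2, 3] = W[3, 2] = 0.1
W.setdiag(0.0)
u = np.array([1., 1., 1., 0., 0., 0.])
print(cheeger_energy(W.tocsr(), u))
```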

Recent advances in ℓ1 optimization offer powerful tools to design a fast and accurate algorithm to solve the minimization problem (2). First, observe that minimizing (2) is equivalent to:

$$\begin{aligned} \min_{u\in[0,1]} \quad \frac{\Vert Du\Vert _1}{\Vert u\Vert _1} \quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
(3)

Indeed, the energy is not changed if a constant is added to u, so it is possible to restrict the minimization problem to functions u with zero median. The ratio minimization problem (3) can then be solved using the method of Dinkelbach [11] (also used in imaging problems such as [17, 18]), which introduces the minimax problem:

$$\begin{aligned} \min_{u\in[0,1]}\max_{\lambda\in\mathbb{R}} \quad \Vert Du\Vert _1 - \lambda \Vert u\Vert _1\quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
(4)

Then, we consider a standard two-step iterative algorithm:

  1. (i)

    Fix λ, compute the solution of the constrained minimization problem:

    $$\begin{aligned} u^{n+1}=\operatorname*{argmin}_{u\in[0,1]}\ \Vert Du\Vert _1 - \lambda^n \Vert u\Vert _1\quad \textrm{s.t.}\quad m(u)=0 \end{aligned}$$
    (5)
  2. (ii)

    Fix u, compute the solution of the maximization problem:

    $$\begin{aligned} \lambda^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}} \ \Vert Du^{n+1}\Vert _1 - \lambda \Vert u^{n+1}\Vert _1 \end{aligned}$$
    (6)

For the minimization problem (5), observe that the zero-median constraint is not linear, but it can be replaced by the approximate linear constraint \(\sum_i u_i \leq |V|/2\). Indeed, if \(u_i\in\{0,1\}\), then the median of u is zero if \(\sum_i u_i \leq \sum_i (1-u_i)\), which yields \(\sum_i u_i \leq |V|/2\). We will use the notation \(\mathbf{1}.u:=\sum_i u_i\) in the rest of the paper.

In order to deal efficiently with the non-differentiability of the ℓ1 norm in (5), a splitting approach combined with an augmented Lagrangian method and the Alternating Direction Method of Multipliers [13] can be used along the same lines as [4, 14]. Hence, we consider the constrained minimization problem:

$$ \begin{aligned} &\min_{u,v\in[0,1],d} \quad \Vert d\Vert _1 - \lambda\Vert v\Vert _1 \\ &\textrm{s.t.} \quad d=Du,\qquad v=u, \qquad \mathbf{1}.v\leq|V|/2 \end{aligned} $$
(7)

whose linear constraints can be enforced with an augmented Lagrangian method as:

$$\begin{aligned} \left \{ \begin{array}{l} (u^{n+1},v^{n+1},d^{n+1})\\ \quad {}= \operatorname*{argmin}_{u,v\in[0,1],d} \Vert d\Vert _1 - \lambda \Vert v\Vert _1 \\ \qquad {}+ \alpha_d.(d-Du)+\frac{r_d}{2}|d-Du|^2\\ \qquad {}+ \alpha_v.(v-u)+\frac{r_v}{2}(v-u)^2 + \alpha_m.(\mathbf{1}.v-|V|/2)\\ \alpha_d^{n+1}=\alpha_d^n+r_d.(d^{n+1}-Du^{n+1})\\ \alpha_v^{n+1}=\alpha_v^n+r_v.(v^{n+1}-u^{n+1})\\ \alpha_m^{n+1}=\max(0,\alpha_m^n+r_m.(\mathbf{1}.v^{n+1}-|V|/2)) \end{array} \right . \end{aligned}$$
(8)

Three sub-minimizations need to be solved. The minimization problem w.r.t. u:

$$\begin{aligned} \min_{u} \frac{r_d}{2} \biggl|Du-\biggl(d+\frac{\alpha_d}{r_d} \biggr) \biggr|^2 + \frac {r_v}{2} \biggl(u-\biggl(v+\frac{\alpha_v}{r_v} \biggr) \biggr)^2 \end{aligned}$$

whose solution u is given by a Poisson problem:

$$\begin{aligned} \bigl(r_v+r_dD^TD\bigr)u=r_dD^T \biggl(d+\frac{\alpha_d}{r_d} \biggr) + r_v \biggl(v+\frac {\alpha_v}{r_v} \biggr) \end{aligned}$$
(9)

The solution of (9) can be approximated with a few conjugate gradient iterations, since D is extremely sparse. The minimization problem w.r.t. v:

$$\begin{aligned} \min_{v\in[0,1]} - \lambda\Vert v\Vert _1 +\frac{r_v}{2} \biggl(v-\biggl(u-\frac{\alpha _v}{r_v}\biggr) \biggr)^2 + \alpha_m.(\mathbf{1}.v-|V|/2) \end{aligned}$$

has an analytical solution given by unshrinkage [26], truncated to [0,1]:

$$\begin{aligned} &v^\star=\varPi_{[0,1]} \biggl(f_v+ \frac{\lambda}{r_v} \frac{f_v}{|f_v|} \biggr), \\ &\quad \textrm{with}\ f_v:=u-\frac{\alpha_v}{r_v}- \frac{\alpha _m}{r_v} \end{aligned}$$
(10)

To avoid the trivial constant solution, we also apply the “renormalization” step: \(v^{\star}\leftarrow\frac{v^{\star}-\min(v^{\star})}{\max(v^{\star})-\min(v^{\star})}\). The minimization problem w.r.t. d:

$$\begin{aligned} \min_{d} \Vert d\Vert _1 +\frac{r_d}{2} \biggl|d- \biggl(Du-\frac{\alpha_d}{r_d}\biggr) \biggr|^2 \end{aligned}$$

also has an analytical solution, given by shrinkage [12]:

$$\begin{aligned} &d^\star=\max \biggl(|f_d|-\frac{1}{r_d},0 \biggr) \frac{f_d}{|f_d|}, \\ &\quad \textrm{with}\ f_d:=Du-\frac{\alpha_d}{r_d} \end{aligned}$$
(11)
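The three sub-minimizations can be sketched as follows, assuming D is assembled as a weighted edge-node incidence matrix so that \(\Vert Du\Vert_1\) is the graph total variation defined above. The function names, the default parameters and the use of np.sign in place of f/|f| (which returns 0 when f=0) are our choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def incidence(W):
    """Weighted edge-node incidence matrix D with (Du)_e = w_ij (u_i - u_j),
    one row per undirected edge (i < j), so that ||Du||_1 is the graph TV."""
    Wu = sp.triu(sp.coo_matrix(W), k=1)
    e = np.arange(Wu.nnz)
    return sp.coo_matrix((np.r_[Wu.data, -Wu.data],
                          (np.r_[e, e], np.r_[Wu.row, Wu.col])),
                         shape=(Wu.nnz, W.shape[0])).tocsr()

def solve_u(D, d, a_d, v, a_v, u0, r_d=10.0, r_v=100.0, n_cg=10):
    """u-subproblem: a few conjugate gradient steps on the linear system (9).
    (The matrix A could be precomputed once outside the iterations.)"""
    A = (r_v * sp.identity(D.shape[1]) + r_d * (D.T @ D)).tocsr()
    b = r_d * D.T @ (d + a_d / r_d) + r_v * (v + a_v / r_v)
    u, _ = cg(A, b, x0=u0, maxiter=n_cg)
    return u

def update_v(u, a_v, a_m, lam, r_v=100.0):
    """v-subproblem: unshrinkage (10), truncation to [0,1], then the
    renormalization step that rescales v to span [0,1]."""
    f = u - a_v / r_v - a_m / r_v
    v = np.clip(f + (lam / r_v) * np.sign(f), 0.0, 1.0)
    if v.max() > v.min():                 # avoid division by zero for constant v
        v = (v - v.min()) / (v.max() - v.min())
    return v

def update_d(Du, a_d, r_d=10.0):
    """d-subproblem: shrinkage (soft-thresholding) (11) with threshold 1/r_d."""
    f = Du - a_d / r_d
    return np.maximum(np.abs(f) - 1.0 / r_d, 0.0) * np.sign(f)
```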

For the maximization problem (6), the solution is as follows:

$$\begin{aligned} \lambda^{n+1}=\frac{\Vert Du^{n+1}\Vert _1}{\Vert u^{n+1}\Vert _1} \end{aligned}$$
(12)

We will consider a steepest gradient descent method instead of (12) to get a smoother evolution of \(\lambda^{n+1}\):

$$\begin{aligned} \lambda^{n+1}=\lambda^n-\delta_\lambda. \biggl( \lambda^n-\frac {\Vert Du^{n+1}\Vert _1}{\Vert u^{n+1}\Vert _1} \biggr) \end{aligned}$$
(13)

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 1.

Algorithm 1

Unsupervised learning with ℓ1 relaxation of the Cheeger cut
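For concreteness, the following is a minimal sketch of how the pieces above could be assembled into Algorithm 1, reusing incidence, solve_u, update_v and update_d from the previous sketch. The loop lengths, the random initialization and the final thresholding of the relaxed solution are our own choices and are not prescribed by the pseudo-code.

```python
import numpy as np

def cheeger_clustering(W, n_outer=100, n_inner=5, r_d=10.0, r_v=100.0,
                       r_m=None, delta_lam=0.4, seed=0):
    """Sketch of Algorithm 1: Dinkelbach outer loop (13) around the augmented
    Lagrangian iterations (8)-(11).  Returns a two-class partition."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    r_m = 6.0 * 2 / N if r_m is None else r_m        # r_m = 6K/N with K = 2
    D = incidence(W)
    u = rng.random(N)                                # random initialization
    v, d = u.copy(), D @ u
    a_d, a_v, a_m = np.zeros(D.shape[0]), np.zeros(N), 0.0
    lam = np.abs(D @ u).sum() / np.abs(u).sum()
    for _ in range(n_outer):
        for _ in range(n_inner):
            u = solve_u(D, d, a_d, v, a_v, u, r_d, r_v)
            v = update_v(u, a_v, a_m, lam, r_v)
            d = update_d(D @ u, a_d, r_d)
            # multiplier updates (8); the median constraint is 1.v <= |V|/2
            a_d = a_d + r_d * (d - D @ u)
            a_v = a_v + r_v * (v - u)
            a_m = max(0.0, a_m + r_m * (v.sum() - N / 2.0))
        # relaxed Dinkelbach update (13) of lambda
        lam = lam - delta_lam * (lam - np.abs(D @ u).sum() / max(np.abs(u).sum(), 1e-12))
    return (v > 0.5).astype(int)   # simple thresholding of the relaxed solution
```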

2.2 Experiments

In this section, we demonstrate results obtained with the unsupervised classification Algorithm 1. For each experiment, we build the weight matrix using the self-tuning construction of [29]. We use ten nearest neighbors, and the tenth neighbor determines the local scale. The universal scaling parameter is set to 1. For Algorithm 1, we set \(r_d=10\), \(r_v=100\), \(r_m=6K/N\), where N is the number of data points and K is the number of classes, and \(\delta_\lambda=0.4\). Figure 1 presents the well-known two-moon dataset [7]. Each moon has 1,000 data points in \(\mathbb{R}^{100}\). This example shows that the solution of the ℓ1 relaxation is tighter than the solution of the ℓ2 relaxation (see caption for more details). In Table 1, we compare our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15], which is available at http://www.ml.uni-saarland.de/code/oneSpectralClustering/oneSpectralClustering.html ([16] is not yet available for comparison). Our method and [15] outperform the spectral clustering method.
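A possible sketch of the self-tuning weight construction described above (ten nearest neighbors, local scale given by the tenth neighbor, universal scaling parameter equal to 1) is given below; the exact neighborhood and symmetrization conventions of [29] may differ slightly, and the helper is our own.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial import cKDTree

def self_tuning_weights(X, k=10):
    """k-NN weight matrix with self-tuning local scales in the spirit of [29]:
    w_ij = exp(-||x_i - x_j||^2 / (sigma_i sigma_j)),  sigma_i = distance from
    x_i to its k-th nearest neighbour; the graph is then symmetrized."""
    N = X.shape[0]
    dist, idx = cKDTree(X).query(X, k=k + 1)   # k+1: the first hit is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]
    sigma = dist[:, -1]                        # local scale: k-th neighbour distance
    rows = np.repeat(np.arange(N), k)
    cols = idx.ravel()
    vals = np.exp(-dist.ravel() ** 2 / (sigma[rows] * sigma[cols]))
    W = sp.coo_matrix((vals, (rows, cols)), shape=(N, N)).tocsr()
    return W.maximum(W.T)                      # keep an edge if either point selects the other
```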

Fig. 1

Unsupervised classification of the two-moon dataset. Each moon has 1,000 data points in \(\mathbb{R}^{100}\). Figure (b) is the result given by the spectral clustering method of Shi and Malik [24]; it fails to produce the correct result because the ℓ2 relaxation is too weak. Figure (d) is the result of the ℓ1 relaxation algorithm and figure (c) is the random initialization. The proposed algorithm succeeds in computing the correct result, which also shows that the solution of the ℓ1 relaxation is tighter than that of the ℓ2 relaxation (color figure)

Table 1 Unsupervised learning results for the two-moon dataset. Column (a) reports the minimum energy value (1) obtained over 100 random initializations, and the error is the percentage of misclassified data for the minimum-energy solution. Column (b) reports the average energy value (1) and the misclassification error over the 100 random initializations. Note that the same random initializations were used for Algorithm 1 and [15]

In Fig. 2, we apply the standard recursive two-class partitioning approach to deal with more than two classes. Figure 2(b) shows the result by spectral clustering and Fig. 2(c) presents the result with our algorithm (see caption for more details).

Fig. 2

Unsupervised classification of the four-moon dataset, using the standard recursive two-class partitioning approach. Figure (b) shows the result of spectral clustering [24] and figure (c) the result of Algorithm 1. Although our algorithm produces a better result than spectral clustering, it still fails to compute the correct partition. When more than two classes are considered, the quality of the results given by the recursive algorithm strongly depends on the choice of the initialization; for most initializations, the standard recursive two-class partitioning approach will not find the correct solution (color figure)

On the right hand side of Fig. 3, we display a projection of the MNIST benchmark dataset, available at http://yann.lecun.com/exdb/mnist/, to 3 dimensions via PCA. This data set consists of 70,000 28×28 images of handwritten digits, 0 through 9, usually broken into a 60,000-point training set and a 10,000-point test set; thus the data is presented as 70,000 points in \(\mathbb{R}^{784}\). The data was preprocessed by projecting onto 50 principal components. Table 2 compares our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15]. Our method and [15] outperform the spectral clustering method.

Fig. 3

Projection into a 3D space (via PCA) of the MNIST benchmark dataset. This data set consists of 60,000 training images and 10,000 test images of size 28×28 (each image is a data point in \(\mathbb{R}^{784}\)) of handwritten digits, 0 through 9 (color figure)

Table 2 Unsupervised learning results for the MNIST dataset. Column (a) reports the minimum energy value (1) obtained over 10 random initializations, and the error is the percentage of misclassified data for the minimum-energy solution. Column (b) reports the average energy value (1) and the misclassification error over the 10 random initializations. Note that the same random initializations were used for Algorithm 1 and [15]

3 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Cheeger Cut

In this section, we extend the unsupervised two-phase Cheeger learning algorithm of Sect. 2 to a transductive multi-class Cheeger learning algorithm. The most natural extension of (1) to K classes is as follows:

$$\begin{aligned} &\min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum _{k=1}^K \frac{\operatorname{Cut}(\varOmega_k,\varOmega _k^c)}{\min(|\varOmega_k|,|\varOmega_k^c|)} \\ &\textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V \quad \textrm{and}\quad \varOmega_i \cap \varOmega_j=\emptyset\quad \forall i\not=j \end{aligned}$$

The previous minimization problem is equivalent to the following problem:

$$ \begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum_{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1} \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V\end{aligned} $$
(14)

The constraint set of the above minimization problem is not convex because the set of binary functions is not convex. We therefore consider the following relaxation:

$$ \begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]} \quad \sum_{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1} \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \end{aligned} $$
(15)

Recall from Sect. 2 that the continuous ℓ1 relaxation of the two-phase Cheeger minimization problem is exact, meaning that the (continuous) solution of (2) provides a (discrete) solution of the original Cheeger problem (1). We do not know whether the ℓ1 relaxation remains exact when multiple classes are considered, i.e. whether the (continuous) solution of (15) provides a (discrete) solution of the original multi-class Cheeger problem (14). For the multi-class Cheeger-based learning problem considered in this paper, experiments show that the solutions \(\{u_{k}\}_{k=1}^{K}\) are close to binary functions, but there is no theoretical guarantee for this observation.

Since the transductive learning problem is considered here, a (small) set \(l_k\) of labeled points is given for each class \(\varOmega_k\) (i.e. \(l_k\subset\varOmega_k\), see Fig. 4), and the following minimization problem is considered:

$$\begin{aligned} \begin{aligned} & \min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum_{k=1}^K \frac{\operatorname{Cut}(\varOmega_k,\varOmega _k^c)}{\min(|\varOmega_k|,|\varOmega_k^c|)}\\ & \textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V \quad \textrm{and}\quad \varOmega_i \cap\varOmega_j=\emptyset \quad \forall i\not=j \\ &\quad \textrm{and given}\ \{l_k\}_{k=1}^K \end{aligned} \end{aligned}$$
(16)

which is equivalent to:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum _{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1}\\ & \textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p\ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

and which is relaxed to:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]}\quad \sum _{k=1}^K \frac {\Vert Du_k\Vert _1}{\Vert u_k-m(u_k)\Vert _1}\\ & \textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1\quad \forall i\in V \quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$
Fig. 4

Illustration of the multi-class transductive problem. Given a set of labeled data points \(\{l_{k}\}_{k=1}^{K}\) (the colored points), the objective is to find the data classes \(\{\varOmega_{k}\}_{k=1}^{K}\) that minimize the Cheeger energy (16)

We now extend the two-phase algorithm designed in Sect. 2 to the multi-phase case:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]}\max _{\{\lambda_k\}_{k=1}^K\in\mathbb {R}}\quad \sum_{k=1}^K \Vert Du_k\Vert _1 - \lambda_k \Vert u_k\Vert _1\\ & \textrm{s.t.}\quad m(u_k)=0,\qquad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

The median constraint is relaxed to \(\mathbf{1}.u_k\leq|V|/K\). We again consider a standard two-step iterative algorithm:

  1. (i)

    Fix λ k , compute the solution for the K minimization problems:

    $$\begin{aligned} &u_k^{n+1}= \operatorname*{argmin}_{u_k\in[0,1]}\ \Vert Du_k\Vert _1 - \lambda_k^n \Vert u_k\Vert _1\\ &\textrm{s.t.}\quad m(u_k)=0,\qquad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$
  2. (ii)

    Fix u k , compute the solution of the K maximization problems:

    $$\begin{aligned} \lambda_k^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}} \bigl\Vert Du_k^{n+1}\bigr\Vert _1 - \lambda \bigl\Vert u_k^{n+1}\bigr\Vert _1 \end{aligned}$$
    (17)

The K minimization problems in step (i) are solved as follows:

$$\begin{aligned} \left\{ \begin{array}{l} (u_k^{n+1},v_k^{n+1},d_k^{n+1})\\ \quad {}= \operatorname*{argmin}_{u_k,v_k\in[0,1],d_k} \Vert d_k\Vert _1 \\ \qquad {}- \lambda_k\Vert v_k\Vert _1 + {\alpha_d}_k.(d_k-Du_k)+\frac {r_d}{2}|d_k-Du_k|^2\\ \qquad {}+ {\alpha_v}_k.(v_k-u_k)+\frac{r_v}{2}(v_k-u_k)^2\\ \qquad {} + {\alpha_m}_k.(\mathbf{1}.v_k-|V|/K)\\ \textrm{s.t.}\quad \sum_{k=1}^K v_k=1\ \textrm{and}\\ v_k(i)= \left\{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right. \\ {\alpha_d}_k^{n+1}={\alpha_d}_k^n+r_d. \bigl(d_k^{n+1}-Du_k^{n+1}\bigr) \\ {\alpha_v}_k^{n+1}={\alpha_v}_k^n+r_v. \bigl(v_k^{n+1}-u_k^{n+1}\bigr) \\ {\alpha_m}_k^{n+1}=\max\bigl(0,{\alpha_m}_k^n+r_m.\bigl(\mathbf{1}.v_k^{n+1}-|V|/K\bigr)\bigr) \end{array} \right. \end{aligned}$$
(18)

The solutions of the sub-minimization problems w.r.t. \(u_k, v_k, d_k\) are the same as in the previous section. Finally, the projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}= 1\) is given by [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) of (18) is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. The most natural conversion is as follows:

$$\begin{aligned} \hat{u}^\star_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ k=\arg\max_{p\in\{1,\ldots,K\}} u^\star_p(i) \\ 0& \textrm{otherwise} \end{array} \right .\quad \forall i\in V \end{aligned}$$
(19)

where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) are binary functions satisfying \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).
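The simplex projection of [21, 28] and the binary conversion (19) can be sketched as follows; the sorting-based projection below is a standard construction and is not necessarily the exact algorithm used in [21].

```python
import numpy as np

def project_simplex(V):
    """Row-wise Euclidean projection of V (shape N x K) onto the probability
    simplex {v >= 0, sum_k v_k = 1}, via the standard sorting-based algorithm."""
    N, K = V.shape
    U = -np.sort(-V, axis=1)                    # each row sorted in decreasing order
    css = np.cumsum(U, axis=1) - 1.0
    j = np.arange(1, K + 1)
    rho = K - 1 - np.argmax((U - css / j > 0)[:, ::-1], axis=1)   # last valid index
    theta = css[np.arange(N), rho] / (rho + 1.0)
    return np.maximum(V - theta[:, None], 0.0)

def to_binary(U):
    """Binary conversion (19): one-hot encoding of the row-wise argmax."""
    B = np.zeros_like(U)
    B[np.arange(U.shape[0]), np.argmax(U, axis=1)] = 1.0
    return B
```

In the transductive sketches below, the labeled entries of v are simply re-imposed (set to their one-hot labels) after the projection so that the label constraints hold exactly; this is one natural way to handle them, and an assumption on our part.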

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 2. Finally, Fig. 5 presents a simple illustration of the proposed multi-class Cheeger transductive model.

Algorithm 2

Transductive learning with ℓ1 relaxation of the multi-class Cheeger cut
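A compact sketch of how Algorithm 2 could be organized is given below, reusing incidence, solve_u, update_v, update_d and project_simplex from the earlier sketches. The way the per-class two-class sub-steps, the simplex projection and the label enforcement are interleaved reflects our assumptions about the loop structure, not a transcription of the algorithm box.

```python
import numpy as np

def transductive_cheeger(W, labels, K, n_outer=100, n_inner=5, r_d=10.0,
                         r_v=100.0, r_m=None, delta_lam=0.4, seed=0):
    """Sketch of Algorithm 2.  `labels` maps a labeled node index to its class
    in {0, ..., K-1}.  Returns a class index for every node."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    r_m = 6.0 * K / N if r_m is None else r_m
    D = incidence(W)
    lab_idx = np.array(sorted(labels))
    lab_cls = np.array([labels[i] for i in lab_idx])
    U = rng.random((N, K)); U /= U.sum(axis=1, keepdims=True)
    V, Dd = U.copy(), np.zeros((D.shape[0], K))
    a_d, a_v, a_m = np.zeros((D.shape[0], K)), np.zeros((N, K)), np.zeros(K)
    lam = np.array([np.abs(D @ U[:, k]).sum() / np.abs(U[:, k]).sum() for k in range(K)])
    for _ in range(n_outer):
        for _ in range(n_inner):
            for k in range(K):                 # per-class sub-problems of (18)
                U[:, k] = solve_u(D, Dd[:, k], a_d[:, k], V[:, k], a_v[:, k], U[:, k], r_d, r_v)
                V[:, k] = update_v(U[:, k], a_v[:, k], a_m[k], lam[k], r_v)
                Dd[:, k] = update_d(D @ U[:, k], a_d[:, k], r_d)
            # couple the classes: simplex projection of v, then re-impose the labels
            V = project_simplex(V)
            V[lab_idx, :] = 0.0
            V[lab_idx, lab_cls] = 1.0
            # multiplier updates (18); the median constraint is 1.v_k <= |V|/K
            a_d += r_d * (Dd - D @ U)
            a_v += r_v * (V - U)
            a_m = np.maximum(0.0, a_m + r_m * (V.sum(axis=0) - N / K))
        for k in range(K):                     # per-class relaxed Dinkelbach step, cf. (13)
            lam[k] -= delta_lam * (lam[k] - np.abs(D @ U[:, k]).sum() / max(np.abs(U[:, k]).sum(), 1e-12))
    return np.argmax(U, axis=1)                # binary conversion (19)
```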

Fig. 5

Transductive classification of the four-moon dataset. The objective is to classify the four moons using 3 labels per moon. Figure (b) presents the result of the spectral method (ℓ2 relaxation) and figure (c) shows the result of the ℓ1 relaxation of the multi-class Cheeger cut (Algorithm 2). The ℓ1 relaxation produces a better classification result than the ℓ2 relaxation (color figure)

4 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Mumford-Shah-Potts Model

In this section, we develop an alternative to the multi-class Cheeger transductive classification algorithm introduced in the previous section. A successful multi-phase segmentation algorithm in imaging is the multi-phase piecewise constant Mumford-Shah method [22] (continuous setting) or the Potts method [23] (discrete setting). These methods are well suited to the image segmentation problem, and the idea in this section is to extend them to the transductive learning problem. Note that the piecewise constant Mumford-Shah/Potts models were first implemented with the level set method [27, 30] and the graph cut method [5]. However, these methods are either too slow, not optimal, not accurate enough, or too demanding in memory. Recent advances in ℓ1 optimization algorithms provide efficient tools to solve the piecewise constant Mumford-Shah/Potts models [1, 6, 8, 19, 20, 28]. These recent improvements will be used to develop an efficient algorithm for the transductive Potts model:

$$ \begin{aligned} &\min_{\varOmega_1,\ldots,\varOmega_K} \quad \sum_{k=1}^K \underbrace{\operatorname{Cut}\bigl(\varOmega _k,\varOmega_k^c \bigr)}_{\simeq \operatorname{Per}(\varOmega_k)}\\ & \textrm{s.t.}\quad \bigcup_{k=1}^K \varOmega_k=V\quad \textrm{and}\quad \varOmega_i \cap\varOmega_j=\emptyset \quad \forall i\not=j \\ &\quad \textrm{and given}\ \{l_k\}_{k=1}^K \end{aligned} $$
(20)

where \(\operatorname{Per}\) stands for perimeter. The relationship between cut and perimeter comes from the coarea formula [25]. In the continuous setting, we have \(\operatorname{Per}(\varOmega)=\int_{\mathcal{M}\subset\mathbb{R}^{d}}|\nabla f|\) when \(f=1_{\varOmega}(x)\) is the indicator function of a geometric set Ω, defined as f(x)=1 ∀x∈Ω and 0 otherwise. Then, discretizing the total variation energy leads to \(\int_{\mathcal {M}\subset\mathbb{R}^{d}}|\nabla f|\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|\), and plugging in the indicator function \(f_{i}=1_{\varOmega}(i)\) finally gives \(\operatorname{Per}(\varOmega)\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|=\sum_{i\in \varOmega, j\in\varOmega^{c}}w_{i,j}=\operatorname{Cut}(\varOmega,\varOmega^{c})\). The minimization problem (20) is equivalent to the following problem:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in\{0,1\}} \quad \sum _{k=1}^K \Vert Du_k\Vert _1 \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p\ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right . \end{aligned}$$

The constraint set of the above minimization problem is not convex because the set of binary functions is not convex. We therefore consider the following relaxation:

$$\begin{aligned} &\min_{\{u_k\}_{k=1}^K\in[0,1]} \quad \sum _{k=1}^K \Vert Du_k\Vert _1 \\ &\textrm{s.t.}\quad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \textrm{and}\\ & u_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \textrm{ and } k\not=p \end{array} \right . \end{aligned}$$

The previous minimization problem is solved as:

$$\begin{aligned} \left\{ \begin{array}{l} (u_k^{n+1},v_k^{n+1},d_k^{n+1})\\ \quad {}=\operatorname*{argmin}_{u_k,v_k\in[0,1],d_k} \Vert d_k\Vert _1 \\ \qquad {}+ {\alpha_d}_k.(d_k-Du_k)+\frac{r_d}{2}|d_k-Du_k|^2\\ \qquad {}+ {\alpha_v}_k.(v_k-u_k)+\frac{r_v}{2}(v_k-u_k)^2 \\ \textrm{s.t.}\quad \sum_{k=1}^K v_k=1\quad \textrm{and}\\ v_k(i)= \left\{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ i\in l_p \ \textrm{and}\ k=p\\ 0& \textrm{if}\ i\in l_p \ \textrm{and}\ k\not=p \end{array} \right. \\ {\alpha_d}_k^{n+1}={\alpha_d}_k^n+r_d. \bigl(d_k^{n+1}-Du_k^{n+1}\bigr) \\ {\alpha_v}_k^{n+1}={\alpha_v}_k^n+r_v. \bigl(v_k^{n+1}-u_k^{n+1}\bigr) \end{array} \right. \end{aligned}$$

The solutions of the minimization problems w.r.t. \(u_k, d_k\) are the same as in Sect. 2. The minimization w.r.t. \(v_k\) is simply given by:

$$\begin{aligned} v_k^\star=\varPi_{[0,1]}({f_v}_k) \quad \textrm{with}\ {f_v}_k:=u_k- \frac {{\alpha_v}_k}{r_v} \end{aligned}$$
(21)

followed by a projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}=1\) using [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. As in the previous section, the binary conversion is as follows:

$$\begin{aligned} \hat{u}^\star_k(i)= \left \{ \begin{array}{l@{\quad }l} 1& \textrm{if}\ k=\arg\max_{p\in\{1,\ldots,K\}} u^\star_p(i)\\ 0& \textrm{otherwise} \end{array} \right .\quad \forall i\in V \end{aligned}$$
(22)

where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) satisfy \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).

To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 3.

Algorithm 3

Transductive learning with ℓ1 relaxation of the multi-class Mumford-Shah-Potts model
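A corresponding sketch of Algorithm 3 is given below, again reusing the earlier helper functions (incidence, solve_u, update_d, project_simplex). Compared with the Algorithm 2 sketch it drops the λ_k terms and the median constraint, so the v-update follows (21) before the simplex projection and label enforcement; the loop structure is our assumption.

```python
import numpy as np

def transductive_potts(W, labels, K, n_iter=200, r_d=10.0, r_v=100.0, seed=0):
    """Sketch of Algorithm 3: same splitting as Algorithm 2, without the
    lambda_k terms and the median multiplier."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    D = incidence(W)
    lab_idx = np.array(sorted(labels))
    lab_cls = np.array([labels[i] for i in lab_idx])
    U = rng.random((N, K)); U /= U.sum(axis=1, keepdims=True)
    V, Dd = U.copy(), np.zeros((D.shape[0], K))
    a_d, a_v = np.zeros((D.shape[0], K)), np.zeros((N, K))
    for _ in range(n_iter):
        for k in range(K):
            # u_k-update: a few CG steps on (9); d_k-update: shrinkage (11)
            U[:, k] = solve_u(D, Dd[:, k], a_d[:, k], V[:, k], a_v[:, k], U[:, k], r_d, r_v)
            Dd[:, k] = update_d(D @ U[:, k], a_d[:, k], r_d)
        # v-update (21): clip to [0,1], project onto the simplex, re-impose labels
        V = project_simplex(np.clip(U - a_v / r_v, 0.0, 1.0))
        V[lab_idx, :] = 0.0
        V[lab_idx, lab_cls] = 1.0
        # multiplier updates
        a_d += r_d * (Dd - D @ U)
        a_v += r_v * (V - U)
    return np.argmax(U, axis=1)                # binary conversion (22)
```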

5 Experiments

In this section, we show classification results using the transductive algorithms developed in Sects. 3 and 4. We work on the four-moon and MNIST datasets described above. For both datasets, we build the weight matrix using the self-tuning construction of [29]. We use ten nearest neighbors, and the tenth neighbor determines the local scale. The universal scaling parameter is set to 1. For Algorithm 2, we set \(r_d=10\), \(r_v=100\), \(r_m=6K/N\), where N is the number of data points and K is the number of classes, and \(\delta_\lambda=0.4\). For Algorithm 3, we set \(r_d=10\) and \(r_v=100\). We choose the labeled points randomly, and fix the number of labeled points drawn from each class.

We compare Algorithm 2 and Algorithm 3 with a spectral transductive learning method from [2], which uses linear least squares on the eigenvectors of the normalized Laplacian to estimate the classes. That is, given the weight matrix W as before, we set \(\mathcal{L}=I-S^{-1/2}WS^{-1/2}\), where S is the diagonal matrix of row sums, that is, \(S_{ii}=\sum_j W_{ij}\). We compute the l+1 eigenvectors \(\phi_0,\ldots,\phi_l\) of \(\mathcal{L}\) associated with the lowest eigenvalues, and form the N×l matrix \(\varPhi=[\phi_1\cdots\phi_l]\); note that, as usual, we have omitted the density vector \(\phi_0\). Each row of Φ corresponds to a data point. Next, we form the matrix \(\varPhi_{\text{lab}}\) by extracting the rows of Φ corresponding to the labeled data points. Let L denote the number of classes and p the number of labeled data points. Given the p×L binary label matrix Y, we compute

$$A= \bigl(\varPhi_{\text{lab}}^T\varPhi_{\text{lab}} \bigr)^{-1}\varPhi_{\text {lab}}^T Y $$

which minimizes the least square energy:

$$ \|Y-\varPhi_{\text{lab}}A\|^2_2 $$
(23)

To compute the class labels of the unlabeled points, we set R=ΦA, and let

$$y_j=\operatorname*{argmax}_i R_{ji} $$
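A minimal sketch of this spectral baseline is given below, assuming a sparse symmetric weight matrix and a dictionary mapping labeled nodes to their classes; the eigensolver settings, the default number of eigenvectors and the handling of \(\phi_0\) are our own choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_transductive(W, labels, n_classes, l=20):
    """Least-squares baseline of [2] as described above: regress the one-hot
    labels on the l lowest nontrivial eigenvectors of the normalized Laplacian."""
    N = W.shape[0]
    s = np.asarray(W.sum(axis=1)).ravel()
    S_inv_sqrt = sp.diags(1.0 / np.sqrt(s))
    Lap = sp.identity(N) - S_inv_sqrt @ W @ S_inv_sqrt
    # l+1 smallest eigenpairs; drop the first (density) eigenvector phi_0
    vals, vecs = eigsh(Lap, k=l + 1, which='SM')
    Phi = vecs[:, np.argsort(vals)][:, 1:]
    lab_idx = np.array(sorted(labels))
    Y = np.zeros((len(lab_idx), n_classes))
    Y[np.arange(len(lab_idx)), [labels[i] for i in lab_idx]] = 1.0
    # A = (Phi_lab^T Phi_lab)^{-1} Phi_lab^T Y, i.e. the least-squares problem (23)
    A, *_ = np.linalg.lstsq(Phi[lab_idx], Y, rcond=None)
    R = Phi @ A
    return np.argmax(R, axis=1)      # class of each (labeled or unlabeled) point
```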

Tables 3 and 4 compare the proposed ℓ1 relaxations of the multi-class Cheeger cut (Algorithm 2) and of the Mumford-Shah-Potts model (Algorithm 3) with the competitive spectral method of [2] (selecting the number l of eigenvectors that minimizes the error). We tested different numbers of labels (\(n_l\) is the number of labeled data points per class), selected randomly. We repeat each experiment 10 times. For each experiment, the labeled points were chosen randomly, and the same labeled points were used for the multi-class Cheeger cut model, the Mumford-Shah-Potts model and the spectral method. The ℓ1 relaxations of the multi-class Cheeger cut and the Mumford-Shah-Potts model outperform the spectral method in all cases, significantly so when a very small number of points is labeled.

Table 3 Transductive learning results for the four-moon dataset. Column (I) reports the minimum energy value over 10 random tests, with the associated misclassification error and computational time. Column (II) reports the average energy value, misclassification error and computational time over the 10 tests. The energy considered is (16) for the Cheeger model, (20) for the Mumford-Shah-Potts model and (23) for the spectral method. Finally, \(n_l\) is the number of (randomly selected) labeled data points per class
Table 4 Transductive learning results for the MNIST dataset. Column (I) reports the minimum energy value over 10 random tests, with the associated misclassification error and computational time. Column (II) reports the average energy value, misclassification error and computational time over the 10 tests. The energy considered is (16) for the Cheeger model, (20) for the Mumford-Shah-Potts model and (23) for the spectral method. Finally, 60,000 unlabeled data points are used together with \(n_l\times 10\) labeled data points, where \(n_l\) is the number of (randomly selected) labeled data points per class; the total number of data points in each experiment is therefore \(60{,}000+n_l\times10\)

6 Conclusion

This paper introduces new ℓ1 relaxation methods for the multi-class transductive learning problem. These relaxation methods are inspired by recent advances in imaging science, which offer fast, accurate and robust ℓ1 optimization tools that make it possible to go beyond standard ℓ2 relaxation methods, i.e. spectral clustering methods. Experiments demonstrate that the ℓ1 relaxations of the multi-class Cheeger cut and of the Mumford-Shah-Potts model outperform the spectral clustering method, all the more so when a very small number of labels is considered.

Reproducible Research

The code is available at http://www.cs.cityu.edu.hk/~xbresson/codes.html#learningmulti.