Abstract
Recent advances in ℓ1 optimization for imaging problems provide promising tools for the fundamental problem of high-dimensional data classification in machine learning. In this paper, we extend the main result of Szlam and Bresson (Proceedings of the 27th International Conference on Machine Learning, pp. 1039–1046, 2010), which introduced an exact ℓ1 relaxation of the Cheeger ratio cut problem for unsupervised data classification. The proposed extension deals with the multi-class transductive learning problem, which consists in learning several classes given a set of labels for each class. Learning several classes (i.e. more than two) simultaneously is generally a challenging problem, but the proposed method builds on strong results introduced in imaging to overcome the multi-class issue. Besides, the proposed multi-class transductive learning algorithms also benefit from recent fast ℓ1 solvers, specifically designed for the total variation norm, which plays a central role in our approach. Finally, experiments demonstrate that the proposed ℓ1 relaxation algorithms are more accurate and robust than standard ℓ2 relaxation methods such as spectral clustering, particularly when very few labels are available for each class. For instance, the mean classification error on the benchmark MNIST dataset of 60,000 points in \(\mathbb{R}^{784}\) is 2.4 % with the proposed ℓ1 relaxation of the multi-class Cheeger cut when only one label per class is given, while the error of the ℓ2 relaxation method of spectral clustering is 24.7 %.
1 Introduction
Partitioning data into sensible groups is a fundamental problem in machine learning and in science in general. One of the most popular approaches is to find the best (balanced) cut of a graph representing the data, such as the normalized cut of Shi and Malik [24] or the Cheeger ratio cut [9]. However, solving balanced/ratio cut problems is NP-hard, which has led people to compute approximate solutions. The best-known approach to approximating the solution of a ratio cut is the spectral clustering method, which is based on an ℓ2 relaxation of the original ratio cut. This ℓ2 relaxation reduces to solving a generalized eigenvector problem for the graph Laplacian; one then selects the eigenvector associated with the second smallest eigenvalue and partitions into two groups by thresholding it (which requires testing multiple thresholds). Different normalizations of the graph Laplacian lead to different spectral clustering methods. These methods often provide good solutions, but can fail on somewhat benign problems; see for example the two-moons example in Fig. 1. In this case, the relaxation leading to the spectral clustering methods is too weak. A stronger relaxation was introduced by Bühler and Hein in [7]. They described the p-spectral clustering method, which considers the ℓp relaxation of the Cheeger ratio cut instead of the ℓ2 relaxation, and showed that the relaxed solution of the p-spectral clustering problem tends asymptotically to the solution of the Cheeger cut problem as p→1. In [10, 26] (see also [25]), it was proved that the relaxation for p=1 is actually exact, i.e. the solution of the ℓ1 relaxation problem provides an exact solution of the Cheeger cut problem. Unfortunately, no algorithm is guaranteed to find global minimizers of the ℓ1 relaxation problem (recall that the problem is NP-hard).
However, the experiments in [7, 26] showed that good results can be obtained with these stronger relaxations; the works [3, 15, 16] have further strengthened the case for ℓ1 relaxation methods and related ideas, and have charted a new and promising research direction for improving spectral clustering methods.
In this work, we propose to extend [26]. In particular, we are interested in extending it to the challenging multi-class ratio cut problem, and in adding label information to obtain a transductive problem. Standard approaches to the unsupervised learning problem usually proceed by recursive two-class clustering. In this paper, we will instead use results recently introduced in imaging science to solve the multi-class learning problem directly. The papers [1, 6, 8, 19, 20, 28] have proposed tight approximations of the solution of the multi-phase image segmentation problem based on ℓ1 relaxation techniques. The main contribution of this paper is to develop efficient multi-class algorithms for the transductive learning problem. We will introduce two multi-class algorithms based on the ℓ1 relaxation of the Cheeger cut and on the piecewise constant Mumford-Shah or Potts models [22, 23]. Experiments show that these new multi-class transductive learning algorithms improve classification results compared to spectral clustering algorithms, particularly when very few labels are available.
2 Unsupervised Data Classification with ℓ1 Relaxation of the Cheeger Cut
2.1 The Model
In this section, we recall the main result of [26] and propose a modified and improved version of the algorithm introduced there. Let G=(V,E) be a graph where V is the set of nodes and E is the set of edges, weighted by a function \(w_{ij}\), ∀(ij)∈E. A classical method for clustering is to consider the Cheeger minimization problem [9]:

$$\min_{\varOmega\subset V}\ \frac{\operatorname{Cut}(\varOmega,\varOmega^{c})}{\min(|\varOmega|,|\varOmega^{c}|)} $$(1)
which partitions the set V of points into two sets Ω and Ω c (the complement of Ω in V). The cut is defined as \(\operatorname{Cut}(\varOmega,\varOmega^{c}):=\sum_{i\in\varOmega,j\in\varOmega ^{c}}w_{ij}\) and |·| denotes the number of points in a given set. The Cheeger problem is NP-hard. However, it was shown in [10], and by the authors of this paper using a different argument in [26], that there exists an exact continuous relaxation of (1), as follows. Consider the minimization problem w.r.t. a function u:V→[0,1]:

$$\min_{u:V\to[0,1]}\ \frac{\Vert Du\Vert_1}{\Vert u-m(u)\Vert_1} $$(2)

where \(\Vert Du\Vert_1:=\sum_{ij} w_{ij}|u_i-u_j|\) is the graph-based total variation of the function u, m(u) is the median of u, and \(\Vert u-m(u)\Vert_1=\sum_i |u_i-m(u)|\). If a global minimizer u ⋆ of (2) can be computed, then it can be shown that this minimizer is the indicator of a set Ω ⋆ (i.e. \(u^{\star}=1_{\varOmega^{\star}}\)) corresponding to a solution of the NP-hard problem (1). However, no algorithm guarantees to compute global minimizers of (2), as the problem is non-convex. Still, experiments show that the minimization algorithm proposed in [26], which we review below, produces good approximations of the solution.
Recent advances in ℓ1 optimization offer powerful tools to design a fast and accurate algorithm for the minimization problem (2). First, observe that minimizing (2) is equivalent to:

$$\min_{u}\ \frac{\Vert Du\Vert_1}{\Vert u\Vert_1}\quad \text{s.t.}\quad m(u)=0 $$(3)

Indeed, the energy is unchanged if a constant is added to u, so it is possible to restrict the minimization to functions u with zero median. The ratio minimization problem (3) can then be solved using the method of Dinkelbach [11] (also used in imaging problems such as [17, 18]), which introduces the minimax problem:

$$\min_{u}\ \max_{\lambda\in\mathbb{R}}\ \Vert Du\Vert_1 - \lambda \Vert u\Vert_1\quad \text{s.t.}\quad m(u)=0 $$(4)
Then, we consider a standard two-step iterative algorithm:

(i) Fix λ, compute the solution of the constrained minimization problem:
$$u^{n+1}=\operatorname*{argmin}_{u\in[0,1]}\ \Vert Du\Vert _1 - \lambda^n \Vert u\Vert _1\quad \text{s.t.}\quad m(u)=0 $$(5)

(ii) Fix u, compute the solution of the maximization problem:
$$\lambda^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}} \ \Vert Du^{n+1}\Vert _1 - \lambda \Vert u^{n+1}\Vert _1 $$(6)
For the minimization problem (5), observe that the zero-median constraint is not linear, but it can be replaced by the approximate linear constraint ∑ i u i ≤|V|/2. Indeed, suppose that u i ∈{0,1}; then the median of u is zero if ∑ i u i ≤∑ i (1−u i ), which yields ∑ i u i ≤|V|/2. We will use the notation 1·u:=∑ i u i in the rest of the paper.
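As a quick sanity check of this linearization, the snippet below (a small illustrative test, not part of the paper's code) verifies on binary vectors that the median is zero exactly when the ones are in the minority:

```python
import numpy as np

# For a binary vector u in {0,1}^N, median(u) = 0 when the ones are a
# strict minority, i.e. sum(u) < N/2; the linear surrogate sum(u) <= N/2
# used in the paper is the same condition up to the boundary case.
u_minor = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # 3 ones out of 10
u_major = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # 7 ones out of 10

print(np.median(u_minor))  # 0.0: zero median, and sum(u) = 3 <= 10/2
print(np.median(u_major))  # 1.0: median is one, and sum(u) = 7 > 10/2
```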
In order to deal efficiently with the non-differentiability of the ℓ1 norm in (5), a splitting approach combined with an augmented Lagrangian method and the Alternating Direction Method of Multipliers [13] can be used along the same lines as [4, 14]. Hence, we consider the constrained minimization problem:
whose linear constraints can be enforced with an augmented Lagrangian method as:
Three sub-minimizations need to be solved. The minimization problem w.r.t. u:
whose solution u^⋆ is given by a Poisson problem:
The solution of (9) can be estimated by a few iterations of the conjugate gradient method, as D is extremely sparse. The minimization problem w.r.t. v:
has an analytical solution given by unshrinkage [26], truncated to [0,1]:
To avoid the trivial constant solution, we also apply the “renormalization” step: \(v^{\star}\leftarrow\frac{v^{\star}-\min(v^{\star})}{\max(v^{\star})-\min(v^{\star})}\). The minimization problem w.r.t. d:
also has an analytical solution, given by shrinkage (soft-thresholding) [12]:
For the maximization problem (6), the solution is given by the Dinkelbach ratio update:

$$\lambda^{n+1}=\frac{\Vert Du^{n+1}\Vert_1}{\Vert u^{n+1}\Vert_1} $$(12)

To obtain a smoother evolution of λ n+1, we use a gradient ascent step toward this value instead of applying (12) directly:
To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 1.
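Two of Algorithm 1's ingredients have closed forms that are easy to illustrate: the shrinkage operator of the d-subproblem, and the Dinkelbach alternation (5)-(6). The snippet below (variable names and the scalar toy problem are ours, chosen only to show the mechanics of the λ update, not the graph problem itself) demonstrates both:

```python
import numpy as np

def shrink(x, t):
    """Soft-thresholding, the closed-form solution of the d-subproblem:
    shrink(x, t) = sign(x) * max(|x| - t, 0), applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Dinkelbach alternation on the scalar fractional program
# min_{x>0} F(x)/G(x) with F(x) = x^2 + 1 and G(x) = x, where
# step (i) has the closed-form minimizer x = lam/2 (from 2x - lam = 0)
# and step (ii) is the ratio update lam <- F(x)/G(x), cf. (6) and (12).
lam = 10.0                      # any positive initial guess
for _ in range(50):
    x = max(lam / 2.0, 1e-12)   # step (i): inner minimization
    lam = (x * x + 1.0) / x     # step (ii): ratio update

print(shrink(np.array([3.0, -0.5, 1.0]), 1.0))  # components below t vanish
print(lam)                                      # converges to 2.0, at x = 1
```

The minimum ratio is 2, attained at x = 1, and the λ sequence converges to it rapidly, which is the behavior the outer loop of Algorithm 1 relies on.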
2.2 Experiments
In this section, we demonstrate results using the unsupervised classification Algorithm 1. For each experiment, we build the weight matrix using the self-tuning construction of [29]: we use ten nearest neighbors, and the tenth neighbor determines the local scale; the universal scaling parameter is set to 1. For Algorithm 1, we set r d =10, r v =100, r m =6K/N, where N is the number of data points and K is the number of classes, and δ λ =0.4. Figure 1 presents the well-known two-moons dataset [7]. Each moon has 1,000 data points in \(\mathbb{R}^{100}\). This example shows that the solution of the ℓ1 relaxation is tighter than the solution of the ℓ2 relaxation (see caption for more details). In Table 1, we compare our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15], available at http://www.ml.uni-saarland.de/code/oneSpectralClustering/oneSpectralClustering.html ([16] is not yet available for comparison). Our method and [15] outperform the spectral clustering method.
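The self-tuning construction of [29] can be sketched as follows. This is a simplified dense NumPy version with variable names of our choosing (a real implementation would use a sparse k-NN structure for large datasets):

```python
import numpy as np

def self_tuning_weights(X, k=10):
    """W_ij = exp(-d(i,j)^2 / (sigma_i * sigma_j)) on a symmetrized
    k-nearest-neighbor graph, where sigma_i is the distance from point i
    to its k-th neighbor (the 'local scale' of Zelnik-Manor and Perona)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    d = np.sqrt(d2)
    order = np.argsort(d, axis=1)
    sigma = d[np.arange(len(X)), order[:, k]]             # column 0 is the point itself
    W = np.exp(-d2 / (sigma[:, None] * sigma[None, :]))
    knn = np.zeros_like(W, dtype=bool)
    np.put_along_axis(knn, order[:, 1:k + 1], True, axis=1)
    W *= (knn | knn.T)                                    # keep symmetrized k-NN edges only
    np.fill_diagonal(W, 0.0)
    return W

X = np.random.RandomState(0).randn(40, 3)
W = self_tuning_weights(X, k=10)
print(W.shape, np.allclose(W, W.T))
```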
In Fig. 2, we apply the standard recursive two-class partitioning approach to deal with more than two classes. Figure 2(b) shows the result by spectral clustering and Fig. 2(c) presents the result with our algorithm (see caption for more details).
On the right-hand side of Fig. 3, we display a projection of the MNIST benchmark dataset, available at http://yann.lecun.com/exdb/mnist/, onto 3 dimensions via PCA. This dataset consists of 70,000 28×28 images of handwritten digits, 0 through 9, usually split into a 60,000-point training set and a 10,000-point test set; thus the data is presented as 70,000 points in \(\mathbb{R}^{784}\). The data was preprocessed by projecting onto 50 principal components. Table 2 compares our algorithm quantitatively with the spectral clustering method of Shi and Malik [24] and the related method of Hein and Bühler [15]. Our method and [15] outperform the spectral clustering method.
3 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Cheeger Cut
In this section, we extend the unsupervised two-phase Cheeger learning algorithm of Sect. 2 to a transductive multi-class Cheeger learning algorithm. The most natural extension of (1) to K classes is as follows:

$$\min_{\{\varOmega_k\}_{k=1}^K}\ \sum_{k=1}^K \frac{\operatorname{Cut}(\varOmega_k,\varOmega_k^{c})}{\min(|\varOmega_k|,|\varOmega_k^{c}|)}\quad \text{s.t.}\quad \bigcup_{k=1}^K\varOmega_k=V,\qquad \varOmega_k\cap\varOmega_{k'}=\emptyset\ \ \forall k\neq k' $$(14)
The previous minimization problem is equivalent to the following problem:
The feasible set of the above minimization problem is not convex, because binary functions do not form a convex set. We thus consider the following relaxation:
In Sect. 2, we recalled that the continuous ℓ1 relaxation of the two-phase Cheeger minimization problem is exact, meaning that the (continuous) solution of (2) provides a (discrete) solution of the original Cheeger problem (1). We do not know whether the ℓ1 relaxation is still exact when multiple classes are considered, i.e. whether the (continuous) solution of (15) provides a (discrete) solution of the original multi-class Cheeger problem (14). For the multi-class Cheeger-based learning problem considered in this paper, experiments show that the solutions \(\{u_{k}\}_{k=1}^{K}\) are close to binary functions, but there is no theoretical guarantee for this observation.
Since the transductive learning problem is considered here, a (small) set l k of labels is given for each class Ω k (i.e. l k ⊂Ω k , see Fig. 4), and the following minimization problem is considered:
which is equivalent to:
and which is relaxed to:
We now extend the two-phase algorithm designed in Sect. 2 to the multi-phase case:
The median constraint is relaxed to 1·u k ≤|V|/K. We again consider a standard two-step iterative algorithm:

(i) Fix λ k , compute the solution of the K minimization problems:
$$\begin{aligned} &u_k^{n+1}= \operatorname*{argmin}_{u_k\in[0,1]}\ \Vert Du_k\Vert _1 - \lambda_k^n \Vert u_k\Vert _1\\ &\text{s.t.}\quad m(u_k)=0,\qquad \sum_{k=1}^K u_k(i)=1 \quad \forall i\in V,\quad \text{and}\\ & u_k(i)= \begin{cases} 1& \text{if}\ i\in l_p \ \text{and}\ k=p\\ 0& \text{if}\ i\in l_p \ \text{and}\ k\not=p \end{cases} \end{aligned} $$(16)

(ii) Fix u k , compute the solution of the K maximization problems:
$$\lambda_k^{n+1}=\operatorname*{argmax}_{\lambda\in\mathbb{R}}\ \bigl\Vert Du_k^{n+1}\bigr\Vert _1 - \lambda \bigl\Vert u_k^{n+1}\bigr\Vert _1 $$(17)
The minimization problems (16) are solved with the same splitting and augmented Lagrangian approach as in Sect. 2: the solutions of the sub-problems w.r.t. u k ,v k ,d k are the same as those given in the previous section. Finally, the projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}= 1\) is computed following [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) of (16) is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. The most natural conversion assigns each point to its largest component:

$$\hat{u}^{\star}_{k}(i)= \begin{cases} 1 & \text{if}\ k=\operatorname*{argmax}_{1\le k'\le K}\ u^{\star}_{k'}(i)\\ 0 & \text{otherwise} \end{cases} $$

where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) are binary functions satisfying \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).
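Both building blocks of this step have short implementations. Below is a standard sort-based Euclidean projection onto the simplex (an alternative to, and with the same output as, the finite algorithm of [21]) together with the argmax binary conversion; all names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    via the sort-and-threshold characterization of the projection."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def to_binary(U):
    """Argmax conversion: U is K x N (one row per class); returns
    one-hot columns hat_u_k with sum_k hat_u_k = 1 at every point."""
    H = np.zeros_like(U)
    H[np.argmax(U, axis=0), np.arange(U.shape[1])] = 1.0
    return H

p = project_simplex(np.array([0.5, 0.5, 0.5]))
print(p)                       # [1/3, 1/3, 1/3]
U = np.array([[0.7, 0.2], [0.2, 0.1], [0.1, 0.7]])
print(to_binary(U))            # one-hot columns: class 0, then class 2
```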
To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 2. Eventually, Fig. 5 presents a simple illustration of the proposed multi-class Cheeger transductive model.
4 Transductive Data Classification with ℓ1 Relaxation of the Multi-class Mumford-Shah-Potts Model
In this section, we develop an alternative to the multi-class Cheeger transductive classification algorithm introduced in the previous section. A successful multi-phase segmentation algorithm in imaging is the multiphase piecewise constant Mumford-Shah method [22] (continuous setting), or the Potts method [23] (discrete setting). These methods are well suited to the image segmentation problem, and the idea in this section is to extend them to the transductive learning problem. Note that the piecewise constant Mumford-Shah/Potts models were first implemented with the level set method [27, 30] and the graph cut method [5]. However, these methods are either too slow, not optimal, not accurate enough, or memory-demanding. Recent advances in ℓ1 optimization provide efficient tools to solve the piecewise constant Mumford-Shah/Potts models [1, 6, 8, 19, 20, 28]. These recent improvements will be used to develop an efficient algorithm for the transductive Potts model:
where \(\operatorname{Per}\) stands for perimeter. The relationship between cut and perimeter comes from the coarea formula [25]. In the continuous setting, we have \(\operatorname{Per}(\varOmega)=\int_{\mathcal{M}\subset\mathbb{R}^{d}}|\nabla f|\) when \(f=1_{\varOmega}\) is the indicator function of a geometric set Ω, defined as f(x)=1 ∀x∈Ω and 0 otherwise. Discretizing the total variation energy leads to \(\int_{\mathcal{M}\subset\mathbb{R}^{d}}|\nabla f|\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|\), and plugging in the indicator function \(f_{i}=1_{\varOmega}(i)\) finally gives \(\operatorname{Per}(\varOmega)\simeq\sum_{i,j\in V}w_{i,j}|f_{i}-f_{j}|=\sum_{i\in \varOmega, j\in\varOmega^{c}}w_{i,j}=\operatorname{Cut}(\varOmega,\varOmega^{c})\). The minimization problem (20) is equivalent to the following problem:
The feasible set of the above minimization problem is not convex, because binary functions do not form a convex set. We thus consider the following relaxation:
The previous minimization problem is solved as:
The solutions of the minimization problems w.r.t. u k ,d k are the same as those given in Sect. 2. The minimization w.r.t. v k is simply given by:
followed by a projection onto the convex simplex set \(\sum_{k=1}^{K} v_{k}=1\) using [21, 28]. Observe that the final solution \(\{u^{\star}_{k}\}_{k=1}^{K}\) of the relaxed problem is not guaranteed to be binary. Hence, a conversion step is required to make \(\{u^{\star}_{k}\}_{k=1}^{K}\) binary. As in the previous section, the binary conversion is as follows:
where \(\{\hat{u}^{\star}_{k}\}_{k=1}^{K}\) satisfy \(\sum_{k=1}^{K} \hat{u}^{\star}_{k}=1\).
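The cut-perimeter identity used above (the graph total variation of an indicator function equals the cut) can be checked numerically. In the snippet below the graph and the subset are arbitrary illustrative choices, and the pairwise sum runs over unordered pairs so that each edge is counted once:

```python
import numpy as np

rng = np.random.RandomState(1)
n = 8
W = rng.rand(n, n)
W = np.triu(W, 1) + np.triu(W, 1).T          # symmetric weights, zero diagonal

omega = np.array([0, 2, 5])                   # an arbitrary subset of V
f = np.zeros(n)
f[omega] = 1.0                                # indicator function f = 1_Omega

# Graph TV of the indicator, summing each unordered pair {i, j} once
tv = sum(W[i, j] * abs(f[i] - f[j]) for i in range(n) for j in range(i + 1, n))

# Cut(Omega, Omega^c) = total weight of edges crossing the partition
comp = np.setdiff1d(np.arange(n), omega)
cut = W[np.ix_(omega, comp)].sum()

print(abs(tv - cut) < 1e-12)  # the two quantities coincide
```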
To summarize the algorithm introduced in this section, we write down the pseudo-code Algorithm 3.
5 Experiments
In this section, we show classification results using the transductive algorithms developed in Sects. 3 and 4. We work on the four-moons and MNIST datasets described above. For both datasets, we build the weight matrix using the self-tuning construction of [29]: we use ten nearest neighbors, and the tenth neighbor determines the local scale; the universal scaling parameter is set to 1. For Algorithm 2, we set r d =10, r v =100, r m =6K/N, where N is the number of data points and K is the number of classes, and δ λ =0.4. For Algorithm 3, we set r d =10 and r v =100. We choose the labeled points randomly, and fix the number of labeled points drawn from each class.
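The label-sampling protocol just described (a fixed number of random labels per class) can be written in a few lines; the function name and seed below are our own illustrative choices:

```python
import numpy as np

def sample_labels(y, n_l, seed=0):
    """Pick n_l labeled points uniformly at random from each class.
    y: array of ground-truth class ids; returns indices of labeled points."""
    rng = np.random.RandomState(seed)
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        idx.extend(rng.choice(members, size=n_l, replace=False))
    return np.array(idx)

y = np.repeat([0, 1, 2, 3], 25)          # e.g. four classes of 25 points each
lab = sample_labels(y, n_l=3)
print(len(lab), np.bincount(y[lab]))     # 12 labeled points, 3 per class
```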
We compare Algorithm 2 and Algorithm 3 with a spectral transductive learning method from [2], which uses linear least squares on the eigenvectors of the normalized Laplacian to estimate the classes. That is, given the weight matrix W as before, we set \(\mathcal{L}=I-S^{-1/2}WS^{-1/2}\), where S is the diagonal matrix of row sums, i.e. S ii =∑ j W ij . We compute the eigenvectors \(\phi_0,\ldots,\phi_l\) of \(\mathcal{L}\) associated with the l+1 smallest eigenvalues, and form the N×l matrix Φ=[ϕ 1…ϕ l ]; note that, as usual, we have omitted the density vector ϕ 0. Each row of Φ corresponds to a data point. Next we form the matrix Φ lab by extracting the rows of Φ corresponding to the labeled data points. Let L denote the number of classes and p the number of labeled data points. Given the p×L binary label matrix Y, we compute

$$A=\bigl(\varPhi_{\mathrm{lab}}^{T}\varPhi_{\mathrm{lab}}\bigr)^{-1}\varPhi_{\mathrm{lab}}^{T}Y $$

which minimizes the least squares energy:

$$\min_{A}\ \bigl\Vert \varPhi_{\mathrm{lab}}A-Y\bigr\Vert^{2} $$

To compute the class labels of the unlabeled points, we set R=ΦA, and assign point i to the class \(\operatorname*{argmax}_{1\le c\le L} R_{ic}\).
Tables 3 and 4 compare the proposed ℓ1 relaxations of the multi-class Cheeger cut (Algorithm 2) and the Mumford-Shah-Potts model (Algorithm 3) with the competitive spectral method of [2] (selecting the number l of eigenvectors that minimizes the error). We tested different numbers of labels (n l denotes the number of labeled data points per class), selected randomly. We repeat each experiment 10 times. For each experiment, the labeled points were chosen randomly, and the same labeled points were used for the multi-class Cheeger cut model, the Mumford-Shah-Potts model and the spectral method. The ℓ1 relaxations of the multi-class Cheeger cut and the Mumford-Shah-Potts model outperform the spectral method in all cases, significantly so when a very small number of points is labeled.
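The spectral baseline of [2] described above can be sketched compactly. The snippet below uses a dense eigendecomposition on a toy two-cluster graph of our own construction (two triangles joined by one weak edge), with one label per class:

```python
import numpy as np

# Toy graph: two triangles (weight 1 inside) joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.01

S_isqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(6) - S_isqrt @ W @ S_isqrt        # normalized Laplacian

l = 1                                        # number of eigenvectors kept
vals, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
Phi = vecs[:, 1:l + 1]                       # skip the 'density' vector phi_0

labeled = [0, 3]                             # one labeled point per class
Y = np.array([[1.0, 0.0], [0.0, 1.0]])       # binary label matrix
A, *_ = np.linalg.lstsq(Phi[labeled], Y, rcond=None)  # least-squares fit
R = Phi @ A
pred = R.argmax(axis=1)                      # class label per data point
print(pred)                                  # the two triangles are recovered
```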
6 Conclusion
This paper introduces new ℓ1 relaxation methods for the multi-class transductive learning problem. These methods are inspired by recent advances in imaging science, which offer fast, accurate and robust ℓ1 optimization tools that make it possible to go beyond standard ℓ2 relaxation methods, i.e. spectral clustering methods. Experiments demonstrate that the ℓ1 relaxations of the multi-class Cheeger cut and the Mumford-Shah-Potts model outperform the spectral clustering method, all the more significantly when a very small number of labels is considered.
Reproducible Research
The code is available at http://www.cs.cityu.edu.hk/~xbresson/codes.html#learningmulti.
References
Bae, E., Yuan, J., Tai, X.-C.: Global minimization for continuous multiphase partitioning problems using a dual approach. Int. J. Comput. Vis. 92(1), 112–129 (2009)
Belkin, M.: Problems of learning on manifolds. PhD thesis, University of Chicago (2003)
Bertozzi, A., Flenner, A.: Diffuse interface models on graphs for classification of high dimensional data. UCLA CAM Report 11-27 (2011)
Bioucas-Dias, J.M., Figueiredo, M.A.: A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans. Image Process. 16(12), 2992–3004 (2007)
Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004)
Brown, E.S., Chan, T.F., Bresson, X.: A convex relaxation method for a class of vector-valued minimization problems with applications to Mumford-Shah segmentation. UCLA CAM Report 10-43 (2010)
Bühler, T., Hein, M.: Spectral clustering based on the graph p-Laplacian. In: International Conference on Machine Learning, pp. 81–88 (2009)
Chambolle, A., Cremers, D., Pock, T.: A convex approach for computing minimal partitions. Technical Report TR-2008-05, Dept. of Computer Science, University of Bonn, Bonn (2008)
Cheeger, J.: A lower bound for the smallest eigenvalue of the Laplacian. Problems in Analysis, 195–199 (1970)
Chung, F.R.K.: Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, vol. 92 (1997). Published for the Conference Board of the Mathematical Sciences, Washington, DC
Dinkelbach, W.: On nonlinear fractional programming. Manag. Sci. 13, 492–498 (1967)
Donoho, D.: De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(33), 613–627 (1995)
Glowinski, R., Le Tallec, P.: Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM, Philadelphia (1989)
Goldstein, T., Osher, S.: The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci. 2(2), 323–343 (2009)
Hein, M., Bühler, T.: An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In: Advances in Neural Information Processing Systems (NIPS), pp. 847–855 (2010)
Hein, M., Setzer, S.: Beyond spectral clustering—tight relaxations of balanced graph cuts. In: Advances in Neural Information Processing Systems (NIPS) (2011)
Kolev, K., Cremers, D.: Continuous ratio optimization via convex relaxation with applications to multiview 3D reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Kolmogorov, V., Boykov, Y., Rother, C.: Applications of parametric maxflow in computer vision. In: International Conference on Computer Vision, pp. 1–8 (2007)
Lellmann, J., Kappes, J., Yuan, J., Becker, F., Schnörr, C.: Convex multi-class image labeling by simplex-constrained total variation. In: International Conference on Scale Space and Variational Methods in Computer Vision, pp. 150–162 (2009)
Lellmann, J., Schnörr, C.: Continuous multiclass labeling approaches and algorithms. Univ. of Heidelberg, Tech. Rep. (2010)
Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of \(\mathbb{R}^n\). J. Optim. Theory Appl. 50(1), 195–200 (1986)
Mumford, D., Shah, J.: Optimal approximations of piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 42, 577–685 (1989)
Potts, R.B., Domb, C.: Some generalized order-disorder transformations. Math. Proc. Camb. Philos. Soc. 48, 106–109 (1952)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Strang, G.: Maximal flow through a domain. Math. Program. 26, 123–143 (1983)
Szlam, A., Bresson, X.: Total variation and Cheeger cuts. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1039–1046 (2010)
Vese, L.A., Chan, T.F.: A multiphase level set framework for image segmentation using the Mumford and Shah model. Int. J. Comput. Vis. 50(3), 271–293 (2002)
Zach, C., Gallup, D., Frahm, J.M., Niethammer, M.: Fast global labeling for real-time stereo using multiple plane sweeps. In: Vision, Modeling, and Visualization, pp. 243–252 (2008)
Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems 17 (NIPS 2004) (2004)
Zhao, H.K., Chan, T.F., Merriman, B., Osher, S.: A variational level set approach to multiphase motion. J. Comput. Phys. 127, 179–195 (1996)
Acknowledgement
Xavier Bresson is supported by the Hong Kong RGC under Grant GRF110311.
Bresson, X., Tai, XC., Chan, T.F. et al. Multi-class Transductive Learning Based on ℓ1 Relaxations of Cheeger Cut and Mumford-Shah-Potts Model. J Math Imaging Vis 49, 191–201 (2014). https://doi.org/10.1007/s10851-013-0452-5