1 Introduction

Many data mining and machine learning applications involve high-dimensional data such as images, videos and documents (Wang et al. 2019; Zhang et al. 2019; Tang and Wei 2019; Ding et al. 2018; Fan et al. 2016; Du et al. 2017, 2018). In practice, such high-dimensional datasets can often be well approximated by multiple low-dimensional subspaces corresponding to multiple classes or categories (Fahmi et al. 2018; Wang et al. 2018; Amin et al. 2019; Tang et al. 2019). Over the past decades, subspace clustering has become one of the most important techniques for uncovering this structure and has been widely applied in computer vision (Wang et al. 2016; Zhou et al. 2013) and machine learning (Bian et al. 2017; Du et al. 2017, 2016; Jia et al. 2017; Cheng et al. 2013). For instance, low-rank representation (LRR) (Liu et al. 2013) and sparse subspace clustering (SSC) (Elhamifar and Vidal 2013) recover a low-rank or sparse representation that reveals the multiple subspaces underlying high-dimensional data. However, both LRR and SSC require \({\mathcal {O}}(n^3)\) time, so they cannot scale to problems with large n (the number of samples). Moreover, since LRR involves a nuclear norm penalty on an n-by-n matrix, it requires computing a singular value decomposition (SVD) (Wall et al. 2003) of an n-by-n matrix in every iteration.

To scale up to larger datasets, several subspace clustering methods have been proposed to mitigate the computation and memory cost. Non-convex low-rank representation (NLRR) (Shen and Li 2016) was proposed to alleviate the computational cost of LRR; its main idea is to use a non-convex reformulation of the nuclear norm that avoids the repeated SVD computations required by nuclear norm solvers. However, solving NLRR directly is still inefficient because of the auxiliary variable it introduces, which leads to solving an \(n \times n\) linear system at each step. To further address this issue, online low-rank subspace clustering (OLRSC) (Shen et al. 2016) and online robust PCA (ORPCA) (Feng et al. 2013) were proposed. Although their time and memory cost per iteration are independent of n, OLRSC and ORPCA converge much more slowly than batch NLRR algorithms.

Instead of using online updates, we propose in this paper a scalable NLRR algorithm, called NLRR++, which reduces the time complexity from \({\mathcal {O}}(n^3)\) to \({\mathcal {O}}(mnd)\) (usually \(d\ll m\ll n\)). NLRR++ is based on the observation that, by formulating \({\mathbf {U}}\mathbf {V}^{\mathrm{T}}\) in Eq. (3) (the NLRR problem) as a sum of rank-one matrices and updating the variables in a block coordinate descent manner, the NLRR problem can be solved without introducing any auxiliary variable. A naive update for \({\mathbf {u}}_i\) (a column of \({\mathbf {U}}\)) still requires solving a large linear system; even after reducing it with the Woodbury identity, an m-by-m matrix must be inverted, which costs \({\mathcal {O}}(m^3)\). By exploiting the structure of the linear system and pre-computing the SVD of the matrix to be inverted, NLRR++ solves this linear system in \({\mathcal {O}} (mn)\) time. We show that NLRR++ converges to stationary points. Moreover, to address the high memory and time demand of the final spectral clustering phase, we also propose an efficient clustering scheme that further boosts the performance of subspace clustering. Our experimental results on synthetic and real-world datasets show that the proposed algorithm is much faster than state-of-the-art subspace clustering approaches such as LRR (Liu et al. 2013), SSC (Elhamifar and Vidal 2013), NLRR (Shen and Li 2016), OLRSC (Shen et al. 2016) and ORPCA (Feng et al. 2013).

Spectral clustering has been widely used for subspace clustering (Von Luxburg 2007) and usually follows two steps. In the first step, a symmetric affinity matrix \(\mathbf {W}\) is constructed. In the second step, a weighted undirected graph is built from \(\mathbf {W}\), and the segmentation of the data is found by clustering the eigenvectors of the graph Laplacian with a traditional clustering technique such as K-means. One challenge is to build an affinity matrix that captures the relationships among data points. Most recent methods for constructing the affinity matrix rely on sparse or low-rank representations of the data points: the low-rank representation of the data matrix is converted into a symmetric, nonnegative affinity matrix, from which the segmentation is obtained by spectral clustering. However, spectral clustering-based methods cannot scale to large datasets because of their high memory and computation cost, as the affinity matrix \(\mathbf {W} \in {\mathbb {R}}^{n \times n}\) is usually dense and demands huge memory when n is large.
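To make the two-step pipeline concrete, the following is a minimal NumPy/scikit-learn sketch of generic spectral clustering (an illustration only, not tied to any particular method discussed in this paper; the normalized-Laplacian variant and the numerical safeguards are our own choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Generic two-step pipeline: build the normalized graph Laplacian from a
    symmetric affinity matrix W, embed the points with its k bottom
    eigenvectors, then cluster the embedding with K-means."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                            # eigenvalues in ascending order
    embedding = vecs[:, :k]                                # k smallest eigenvectors
    embedding /= np.maximum(np.linalg.norm(embedding, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```

Note that the dense \(n \times n\) affinity matrix and the dense eigendecomposition in this sketch are precisely the memory and time bottlenecks mentioned above.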

In this paper, we focus on LRR, which comes with guarantees of robust multi-subspace segmentation. LRR solves the following optimization problem:

$$\begin{aligned} \begin{aligned}&\mathop {\min }\limits _{{\mathbf {X}},{\mathbf {E}}} \frac{1}{2} \Vert \mathbf {Z} - \mathbf {AX} - {\mathbf {E}}\Vert _F^2 +\beta \Vert {\mathbf {X}}\Vert _* +\lambda \Vert {\mathbf {E}}\Vert _{2,1}. \end{aligned} \end{aligned}$$
(1)

Here, \( \mathbf {Z} =[\mathbf {z}_1, \mathbf {z}_2, \ldots , \mathbf {z}_n] \in {\mathbb {R}}^{m \times n}\) is the observation matrix with n samples and m features (\( m \ll n\)), and \({\mathbf {A}} \in {\mathbb {R}}^{m \times n}\) is a given dictionary matrix. Without any prior knowledge of the dictionary, it is often assumed that \({\mathbf {A}}=\mathbf {Z}\). LRR estimates two matrices: \({\mathbf {E}}\), a sparse matrix capturing the corruptions of the observations, and \({\mathbf {X}} \in {\mathbb {R}}^{n \times n}\), which captures the low-rank structure of the observed data. The regularization parameters \(\beta , \lambda \) control the sparsity and low-rankness of the solution.

If all the samples lie exactly in a small number of subspaces, under certain conditions the solution \({\mathbf {X}}\) of Eq. (1) will be a block-diagonal matrix up to permutation. The results can thus be used for subspace clustering. In practice, the representation matrix \({\mathbf {X}} = {\mathbf {U}}_x\varvec{\varSigma }_x\mathbf {V}_x^{\mathrm{T}}\) can be used to construct an affinity matrix \(\mathbf {W}= \mathbf {V}_x\mathbf {V}_x^{\mathrm{T}}\), which is then fed into spectral clustering to get the final clustering result.

Despite these desirable theoretical properties, LRR cannot scale to large datasets due to its high memory and computational cost. To solve Eq. (1), we need to store the matrix \({\mathbf {X}}\), which requires \({\mathcal {O}}(n^2)\) memory; a dense double-precision matrix for \(n=10^6\) samples already occupies about 8 TB, so storing \({\mathbf {X}}\) is infeasible for large-scale datasets with millions of samples. Furthermore, solving an optimization problem with the nuclear norm penalty \(\Vert {\mathbf {X}} \Vert _*\) requires a singular value decomposition in every iteration, which is also computationally expensive (Recht et al. 2010).

Recently, Shen and Li (2016) mitigated the computation issue by introducing a non-convex reformulation of Eq. (1). Assuming the rank of \({\mathbf {X}}\) is at most d, Fazel et al. (2001) showed that:

$$\begin{aligned} \Vert {\mathbf {X}} \Vert _* = \mathop {\min } \limits _{{\mathbf {U}},\mathbf {V}:\,{\mathbf {X}}=\mathbf {U}\mathbf {V}^{\mathrm{T}}} \frac{1}{2}(\Vert {\mathbf {U}}\Vert _F^2 + \Vert \mathbf {V}\Vert _F^2), \end{aligned}$$
(2)

where \({\mathbf {U}} \in {\mathbb {R}}^{n\times d}\) and \(\mathbf {V} \in {\mathbb {R}}^{n \times d}\). Therefore, LRR in Eq. (1) has an equivalent but non-convex formulation (NLRR) as follows:

$$\begin{aligned}&\mathop {\min }\limits _{{\mathbf {U}},\mathbf {V} ,{\mathbf {E}}} \frac{1}{2}\Vert \mathbf {Z} {-} {\mathbf {A}}{\mathbf {U}}\mathbf {V}^{\mathrm{T}}-{\mathbf {E}}\Vert _F^2 {+} \frac{\beta }{2} (\Vert {\mathbf {U}}\Vert _F^2 + \Vert \mathbf {V}\Vert _F^2)+ \lambda \Vert {\mathbf {E}} \Vert _{2,1}. \end{aligned}$$
(3)

In order to solve this NLRR problem, a straightforward way is to update the three variables \({\mathbf {U}},\mathbf {V},{\mathbf {E}}\) alternately until convergence. However, Shen and Li (2016) pointed out that this alternating minimization scheme may be intractable. Assume \(\mathbf {V,E}\) are fixed, solving the subproblem with respect to \({\mathbf {U}}\) can be computationally intensive as the optimal \({\mathbf {U}}\) is:

$$\begin{aligned} \begin{aligned} \text {vec}({\mathbf {U}})=\left( \mathbf {V}^{\mathrm{T}}\mathbf {V}\otimes ({\mathbf {A}}^{\mathrm{T}}{\mathbf {A}})+\beta \mathbf {I}_{nd}\right) ^{-1}\text {vec}({\mathbf {A}}^{\mathrm{T}}(\mathbf {Z}-{\mathbf {E}})\mathbf {V}), \end{aligned} \end{aligned}$$
(4)

where \(\otimes \) denotes the Kronecker product. In Eq. (4), \(\mathbf {V}^{\mathrm{T}}\mathbf {V}\otimes ({\mathbf {A}}^{\mathrm{T}}{\mathbf {A}})\) is an nd-by-nd matrix, so solving Eq. (4) takes \({\mathcal {O}}(n^3d^3)\) time and \({\mathcal {O}}(n^2d^2)\) memory, which is impractical even in small-scale applications. To update \({\mathbf {U}}\) efficiently, Shen and Li (2016) proposed to add an auxiliary variable \(\mathbf {D}={\mathbf {A}}{\mathbf {U}}\) to Eq. (3); updating \({\mathbf {U}}\) then becomes an ordinary least-squares problem obtained by differentiating with respect to \({\mathbf {U}}\) as follows (details can be found in Shen and Li 2016):

$$\begin{aligned} \begin{aligned} \text {(NLRR-1)}\;\;{\mathbf {U}}=(\mu {\mathbf {A}}^{\mathrm{T}}{\mathbf {A}}+\mathbf {I}_n)^{-1} {\mathbf {A}}^{\mathrm{T}}(\mathbf {W}+\mu \mathbf {D}). \end{aligned} \end{aligned}$$
(5)

Here, \(\mathbf {W}\) and \(\mu \) denote, respectively, the Lagrange multiplier and penalty parameter of the augmented Lagrangian for the constraint \(\mathbf {D}={\mathbf {A}}{\mathbf {U}}\) (see Shen and Li 2016 for details). As we can see from the above equation, introducing the auxiliary variable reduces the computational and memory cost of computing \({\mathbf {U}}\) to \({\mathcal {O}}(n^3)\) and \({\mathcal {O}}(n^2)\), respectively.
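For concreteness, the NLRR-1 update in Eq. (5) can be sketched in a few lines of NumPy (a sketch only, not the authors' implementation). It makes the bottleneck explicit: an \(n \times n\) system is formed and solved, hence the \({\mathcal {O}}(n^3)\) time and \({\mathcal {O}}(n^2)\) memory cost quoted above.

```python
import numpy as np

def nlrr1_update_U(A, W, D, mu):
    """NLRR-1 update of Eq. (5): U = (mu * A^T A + I_n)^{-1} A^T (W + mu * D).
    Forming the n x n matrix costs O(n^2) memory, solving it O(n^3) time."""
    n = A.shape[1]
    return np.linalg.solve(mu * (A.T @ A) + np.eye(n), A.T @ (W + mu * D))
```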

2 Reformulated non-convex low-rank representation model

In this section, we introduce our proposed algorithm, NLRR++. By reformulating the NLRR problem in Eq. (3), we show that a block coordinate descent algorithm can solve NLRR efficiently without introducing any auxiliary variable. In Sect. 2.1, we describe the one-variable update rules in detail. In Sect. 2.2, we discuss how to group variables into blocks so that block coordinate descent runs efficiently, and the rest of this section presents several further improvements and a theoretical guarantee.

2.1 Solving strategy via block coordinate descent method

We reformulate the NLRR problem in Eq. (3) in order to mitigate the computational cost and memory requirement. Note that the solution \({\mathbf {U}}\mathbf {V}^{\mathrm{T}}\) can be represented as a sum of d outer products:

$$\begin{aligned} \begin{aligned} {\mathbf {X}} = {\mathbf {U}}\mathbf {V}^{\mathrm{T}} = \sum \limits _{t=1}^d {\mathbf {u}}_t{\mathbf {v}}^{\mathrm{T}}_t \end{aligned} \end{aligned}$$
(6)

where \( {\mathbf {u}}_t,{\mathbf {v}}_t, (t=1,\ldots , d)\) are the column vectors of \({\mathbf {U}},\mathbf {V}\) respectively. From the perspective of latent feature space, \({\mathbf {u}}_t\) and \({\mathbf {v}}_t\) correspond to the tth latent feature. Therefore, the objective function of Eq. (3) can be written as follows:

$$\begin{aligned} \begin{aligned}&g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d,{\mathbf {E}}) =\frac{1}{2}\Vert \mathbf {Z} - {\mathbf {A}}\sum \limits _{t=1}^d {\mathbf {u}}_t{\mathbf {v}}_t^{\mathrm{T}}-{\mathbf {E}}\Vert _F^2 \\&\quad +\, \frac{\beta }{2}\left( \sum \limits _{t=1}^d \Vert {\mathbf {u}}_t\Vert _2^2 + \sum \limits _{t=1}^d \Vert {\mathbf {v}}_t\Vert _2^2\right) +\lambda \Vert {\mathbf {E}}\Vert _{2,1}. \end{aligned} \end{aligned}$$
(7)

In large-scale optimization problems, it is often impractical to update all variables simultaneously in each iteration because of the sheer number of data points and variables. The block coordinate descent (BCD) approach addresses this by dividing the variables into smaller blocks and updating only one block at a time while the remaining variables are fixed, which greatly simplifies each subproblem. Hence, the key to applying BCD is to design the single-block subproblems and the order in which the blocks are updated.

In the following, we propose a BCD algorithm for solving Eq. (7). In the outer loop, we alternately update \({\mathbf {U}},\mathbf {V}\) with \({\mathbf {E}}\) fixed, and update \({\mathbf {E}}\) with \({\mathbf {U}},\mathbf {V}\) fixed. In the inner loop, the columns of \({\mathbf {U}}\) and \(\mathbf {V}\) are updated in turn by BCD. First, we discuss how to update \({\mathbf {U}}, \mathbf {V}\) when \({\mathbf {E}}\) is fixed. Shen and Li (2016) discussed this approach and concluded that the whole \({\mathbf {U}}\) or \(\mathbf {V}\) cannot be updated efficiently. Here, we show that by splitting these two matrices into columns, each column admits a closed-form solution that can be computed efficiently. With \({\mathbf {E}}\) fixed, minimizing Eq. (7) with respect to \({\mathbf {u}}_t,{\mathbf {v}}_t (t=1, \ldots , d)\) is equivalent to minimizing:

$$\begin{aligned} \begin{aligned} g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d)&= \frac{1}{2}\Vert \mathbf {Z} - {\mathbf {A}}\sum \limits _{t=1}^d {\mathbf {u}}_t{\mathbf {v}}_t^{\mathrm{T}}-{\mathbf {E}}\Vert _F^2 \\&\quad +\,\frac{\beta }{2}\left( \sum \limits _{t=1}^d \Vert {\mathbf {u}}_t\Vert _2^2 + \sum \limits _{t=1}^d \Vert {\mathbf {v}}_t\Vert _2^2\right) . \end{aligned} \end{aligned}$$
(8)

In the following, we show that the subproblem in Eq. (8) can be solved efficiently by a column-wise coordinate descent, where at each step, we update the whole column of \({\mathbf {u}}_t\) or \({\mathbf {v}}_t\) together.

  • Update \({\mathbf {u}}_t\) (\(t=1, \ldots , d\)): When \(\mathbf {E},\mathbf {V}\) and \({\mathbf {u}}_j\;(j \ne t)\) are fixed, differentiating the objective function in Eq. (8) with respect to \({\mathbf {u}}_t\) and setting the gradient to zero yields

    $$\begin{aligned} \begin{aligned} {\mathbf {u}}_t = (c_t{\mathbf {A}}^{\mathrm{T}}{\mathbf {A}}+ \beta \mathbf {I}_n)^{-1}{\mathbf {A}}^{\mathrm{T}}\mathbf {R}_t{\mathbf {v}}_t \end{aligned} \end{aligned}$$
    (9)

    where \( \mathbf {R}_t= \mathbf {Z}-{\mathbf {A}}\sum \nolimits _{j \ne t} {\mathbf {u}}_j{\mathbf {v}}_j^{\mathrm{T}}-{\mathbf {E}}\) is the tth residual matrix and \(c_t =\Vert {\mathbf {v}}_t\Vert _2^2\). However, computing \( \mathbf {Q} {\mathbf {v}} = (c_t{\mathbf {A}}^{\mathrm{T}}{\mathbf {A}}+ \beta \mathbf {I}_n)^{-1}{\mathbf {v}}\) (where \({\mathbf {v}}={\mathbf {A}}^{\mathrm{T}}\mathbf {R}_t{\mathbf {v}}_t\) is a vector and \(\mathbf {Q} \in {\mathbb {R}}^{n\times n}\)) requires inverting an n-by-n matrix or solving an n-dimensional linear system, which takes \({\mathcal {O}}(n^3)\) time.

To resolve this problem, we exploit the structure of the matrix \(\mathbf {Q}\). Note that \(\mathbf {Q}=\frac{1}{c_t}({\mathbf {A}}^{\mathrm{T}} {\mathbf {A}} + \frac{\beta }{c_t}\mathbf {I}_n)^{-1}\), so the \({\mathbf {A}}^{\mathrm{T}}{\mathbf {A}}\) part remains unchanged throughout the whole optimization procedure and only the scalar \(c_t\) changes. Therefore, by caching some information about \({\mathbf {A}}^{\mathrm{T}}{\mathbf {A}} \), we can apply the inverse efficiently. To do so, we first use the Woodbury matrix identity to transform the inverse operation from an n-by-n to an m-by-m matrix (\(m \ll n\)), i.e.,

$$\begin{aligned} \mathbf {Q}=\frac{1}{\beta }\mathbf {I}_n-\frac{c_t}{\beta ^2}{\mathbf {A}}^{\mathrm{T}} \left( \frac{c_t}{\beta }{\mathbf {A}}{\mathbf {A}}^{\mathrm{T}}+\mathbf {I}_m\right) ^{-1}{\mathbf {A}}. \end{aligned}$$

Then, Eq. (9) can be written as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {u}}_t= \frac{1}{\beta }{\mathbf {A}}^{\mathrm{T}}\mathbf {R}_t{\mathbf {v}}_t-\frac{c_t}{\beta ^2}{\mathbf {A}}^{\mathrm{T}}\mathbf {Q}_m{\mathbf {A}}{\mathbf {A}}^{\mathrm{T}}\mathbf {R}_t{\mathbf {v}}_t \end{aligned} \end{aligned}$$
(10)

where \(\mathbf {Q}_m = (\frac{c_t}{\beta }{\mathbf {A}}{\mathbf {A}}^{\mathrm{T}}+\mathbf {I}_m)^{-1}\). To update \({\mathbf {u}}_t\), we would still have to compute the matrix inverse \((\frac{c_t}{\beta }{\mathbf {A}}{\mathbf {A}}^{\mathrm{T}}+\mathbf {I}_m)^{-1}\) in every iteration, which requires \({\mathcal {O}}(m^3)\) computation.

Now, we take the structure of \(\mathbf {Q}_m\) into account. Before the outer loop of the algorithm starts, we pre-compute the SVD \( {\mathbf {A}}{\mathbf {A}}^{\mathrm{T}} = {\mathbf {U}}_*\varvec{\varSigma }_*\mathbf {V}_*^{\mathrm{T}}\) once; \(\mathbf {Q}_m\) can then be written as

$$\begin{aligned} \begin{aligned} \mathbf {Q}_m = {\mathbf {U}}_*\text {diag} \left( \frac{1}{1+\frac{c_t}{\beta }\text {diag}({{\varvec{\varSigma }}_*})}\right) \mathbf {V}_*^{\mathrm{T}}. \end{aligned} \end{aligned}$$
(11)

It can be seen from the above that, in each iteration, only \(c_t\) changes while \(U_*\), \(\varSigma _*\) and \(V_*\) remain fixed. Therefore, computing \({\mathbf {u}}_t\) involves only matrix-vector products, whose computational complexity is \({\mathcal {O}}(mn)\) (see the code sketch at the end of this subsection).

  • Update \({\mathbf {v}}_t\) (\(t=1, \ldots , d\)): When \(\mathbf {E},{\mathbf {U}}\) and \({\mathbf {v}}_j\;(j \ne t)\) are fixed, taking the gradient of Eq. (8) with respect to \({\mathbf {v}}_t\) and setting it to zero gives the update rule for \({\mathbf {v}}_t\):

    $$\begin{aligned}&{\mathbf {v}}_t^{\mathrm{T}} = \frac{1}{d_t+\beta } ({\mathbf {A}}{\mathbf {u}}_t)^{\mathrm{T}}\mathbf {R}_t, (d_t={\mathbf {u}}_t^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {A}}{\mathbf {u}}_t). \end{aligned}$$
    (12)
  • Update \({\mathbf {E}}\): Following Hale et al. (2008), when \({\mathbf {U}},\mathbf {V}\) are fixed, the minimizer of \({\mathbf {E}}\) is obtained by the following column-wise soft-thresholding operator:

    $$\begin{aligned} \begin{aligned} \mathbf {[E]_{:,i}} = S_{\lambda }(\mathbf {[R]_{:,i}}), \end{aligned} \end{aligned}$$
    (13)

    where \([\mathbf {E}]_{:,i}\) is the ith column of \({\mathbf {E}}\), \(\mathbf {R}= \mathbf {Z}-{\mathbf {A}}{\mathbf {U}}\mathbf {V}^{\mathrm{T}}\) (i.e., the residual \(\mathbf {R}_t - {\mathbf {A}}{\mathbf {u}}_t{\mathbf {v}}_t^{\mathrm{T}}\) with \({\mathbf {E}}\) added back), and the soft-thresholding operator is defined by

    $$\begin{aligned} S_{\lambda } ([\mathbf {R}]_{:,i}) = \max \;\{\Vert [\mathbf {R}]_{:,i}\Vert _2-\lambda ,0\}\frac{[\mathbf {R}]_{:,i}}{\Vert [\mathbf {R}]_{:,i}\Vert _2} \quad (i=1, \ldots , n). \end{aligned}$$
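The following NumPy sketch summarizes the three column-wise updates of this subsection (a minimal illustration, assuming \({\mathbf {A}} \in {\mathbb {R}}^{m\times n}\), \({\mathbf {U}},\mathbf {V} \in {\mathbb {R}}^{n\times d}\) and \(\mathbf {Z},{\mathbf {E}} \in {\mathbb {R}}^{m\times n}\); the function names are ours and this is not the authors' implementation):

```python
import numpy as np

def precompute(A):
    """Cache A A^T and its eigendecomposition; since A A^T is symmetric PSD,
    its SVD coincides with its eigendecomposition (Sect. 2.1)."""
    G = A @ A.T                       # m x m
    sigma, Ustar = np.linalg.eigh(G)  # G = Ustar diag(sigma) Ustar^T
    return G, sigma, Ustar

def update_u(A, G, sigma, Ustar, R_t, v_t, beta):
    """Closed-form u_t of Eq. (9), evaluated via the Woodbury identity
    (Eq. (10)) and the cached eigenpairs (Eq. (11)): only matrix-vector
    products are needed, costing O(mn)."""
    c_t = v_t @ v_t
    p = R_t @ v_t                     # R_t v_t        (m-vector)
    q = G @ p                         # A A^T R_t v_t  (m-vector)
    y = Ustar @ ((Ustar.T @ q) / (1.0 + (c_t / beta) * sigma))  # Q_m q
    return A.T @ (p / beta - (c_t / beta**2) * y)

def update_v(A, R_t, u_t, beta):
    """Closed-form v_t of Eq. (12)."""
    s = A @ u_t                       # m-vector
    d_t = s @ s
    return (R_t.T @ s) / (d_t + beta)

def update_E(M, lam):
    """Column-wise soft-thresholding of Eq. (13), applied to M = Z - A U V^T."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - lam, 0.0) / np.maximum(norms, 1e-12)
    return M * scale
```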

2.2 Strategy for updating \({\mathbf {u}}_t,{\mathbf {v}}_t\)

To update the columns of \({\mathbf {U}}, \mathbf {V}\), there are multiple choices of the update sequence. In this subsection, we investigate two main inner updating sequences.

The first choice is to update all the columns of \({\mathbf {U}}\) and then all the columns of \(\mathbf {V}\):

$$\begin{aligned} \begin{aligned} (\mathrm{I})\;&\underbrace{\mathbf {R}_1,{\mathbf {u}}_1, \mathbf {R},\mathbf {R}_2,{\mathbf {u}}_2, \mathbf {R}, \ldots , \mathbf {R}_d, {\mathbf {u}}_d,\mathbf {R}}_{{\mathbf {U}}},\\&\underbrace{\mathbf {R}_1, {\mathbf {v}}_1, \mathbf {R}, \mathbf {R}_2, {\mathbf {v}}_2, \mathbf {R}, \ldots , \mathbf {R}_d, {\mathbf {v}}_d,\mathbf {R}}_{\mathbf {V}} \end{aligned} \end{aligned}$$
(14)

Note that before updating \({\mathbf {u}}_t\) by Eq. (9) or \({\mathbf {v}}_t\) by Eq. (12), the tth residual matrix \( \mathbf {R}_t= \mathbf {Z}-{\mathbf {A}}\sum \nolimits _{j \ne t} {\mathbf {u}}_j{\mathbf {v}}_j^{\mathrm{T}}-{\mathbf {E}}\) must be formed first, and once \({\mathbf {u}}_t\) or \({\mathbf {v}}_t\) has been updated, the full residual matrix \(\mathbf {R}= \mathbf {Z}-{\mathbf {A}}\sum \nolimits _{i=1}^d {\mathbf {u}}_i{\mathbf {v}}_i^{\mathrm{T}}-{\mathbf {E}}\) must be refreshed by subtracting the new rank-one term \({\mathbf {A}}{\mathbf {u}}_t {\mathbf {v}}_t^{\mathrm{T}}\) from \(\mathbf {R}_t\). The updates of \(\mathbf {R}, \mathbf {R}_t\) are therefore included in the update sequence of Eq. (14). In this approach, each update of \({\mathbf {u}}_t\) and \({\mathbf {v}}_t\) triggers an update of \(\mathbf {R}_t\) and \(\mathbf {R}\).

An alternative “feature-wise” approach is to update \({\mathbf {u}}_t\) together with \({\mathbf {v}}_t\), which halves the update frequency of \(\mathbf {R}\) and \(\mathbf {R}_t\). The detailed update sequence is as follows:

$$\begin{aligned} \begin{aligned} ({\mathrm{II}})\;\;\underbrace{\mathbf {R}_1, {\mathbf {u}}_1, {\mathbf {v}}_1, \mathbf {R}}_{({\mathbf {U}}^1,\mathbf {V}^1)}, \underbrace{\mathbf {R}_2, {\mathbf {u}}_2, {\mathbf {v}}_2, \mathbf {R}}_{({\mathbf {U}}^2,\mathbf {V}^2)}, \ldots , \underbrace{\mathbf {R}_d, {\mathbf {u}}_d, {\mathbf {v}}_d, \mathbf {R}}_{({\mathbf {U}}^d,\mathbf {V}^d)}. \end{aligned} \end{aligned}$$
(15)

Moreover, between each update of \(\mathbf {R}_t\) and \(\mathbf {R}\), we can update \({\mathbf {u}}_t, {\mathbf {v}}_t\) T times, and within these T inner updates the residual matrices need not be updated, as explained in detail in Sect. 2.3.
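As an illustration of update sequence \(({\mathrm{II}})\) with delayed residual updates, the sketch below reuses the helper functions from the sketch in Sect. 2.1 (the initialization and iteration counts are arbitrary choices for illustration, not the authors' settings):

```python
import numpy as np  # helpers precompute/update_u/update_v/update_E as in the Sect. 2.1 sketch

def nlrr_pp(Z, A, d, beta, lam, T=4, n_outer=5):
    """Sketch of update sequence (II) in Eq. (15): for each feature t, form R_t,
    run T delayed-residual inner updates of (u_t, v_t), then refresh R."""
    m, n = Z.shape
    rng = np.random.default_rng(0)
    U = 1e-3 * rng.standard_normal((n, d))
    V = 1e-3 * rng.standard_normal((n, d))
    E = np.zeros((m, n))
    G, sigma, Ustar = precompute(A)
    AU = A @ U                                   # m x d, kept in sync with U
    R = Z - AU @ V.T - E                         # full residual
    for _ in range(n_outer):
        for t in range(d):
            R_t = R + np.outer(AU[:, t], V[:, t])      # add back the t-th rank-one term
            for _ in range(T):                         # delayed residual update (Sect. 2.3)
                U[:, t] = update_u(A, G, sigma, Ustar, R_t, V[:, t], beta)
                AU[:, t] = A @ U[:, t]
                V[:, t] = update_v(A, R_t, U[:, t], beta)
            R = R_t - np.outer(AU[:, t], V[:, t])
        E_new = update_E(R + E, lam)                   # Z - A U V^T = R + E
        R += E - E_new
        E = E_new
    return U, V, E
```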

Our experiments show that approach \(({\mathrm{II}})\) achieves better clustering performance and higher efficiency than approach \(({\mathrm{I}})\). The outline of approach \(({\mathrm{II}})\) is presented in Algorithm 1.

Algorithm 1 NLRR++ via column-wise block coordinate descent

Fig. 1
The objective function value versus computation time, comparing NLRR++ (\(T=1\)), NLRR++ (\(T=4\)), NLRR++b (via the update approach \(({\mathrm{I}})\) in Eq. (14)) and NLRR-1 using ImageNet features (\(N=62{,}700\)). Results show that our proposed update sequence has faster and more stable convergence

2.3 Accelerating NLRR++

We accelerate NLRR++ based on approach \(({\mathrm{II}})\) (Eq. (15)) by controlling the number T of inner iterations used to update \({\mathbf {u}}_t, {\mathbf {v}}_t\) (Yu et al. 2012). NLRR++ can be slightly more efficient when \(T > 1\) thanks to this “delayed residual update”: \(\mathbf {R}\) and \(\mathbf {R}_t\) are kept fixed during the inner iterations of each rank-one approximation problem in Eq. (8), so they need to be updated only when we switch to the next subproblem corresponding to another feature. Moreover, the more inner iterations we use, the better we approximate the solution of the subproblem in Eq. (8). Hence, a direct way to accelerate NLRR++ is to increase T, although a large fixed T may incur overhead on a single subproblem.

We close this section with a comparison between NLRR and NLRR++ in Fig. 1. Here, we compare four settings on the ImageNet dataset: NLRR-1 and three NLRR++ variants, namely fixed \(T = 1\) (NLRR++T1, approach \(({\mathrm{II}})\)), fixed \(T = 4\) (NLRR++T4, approach \(({\mathrm{II}})\)) and NLRR++b (approach \(({\mathrm{I}})\)). As shown in Fig. 1, the “feature-wise” update strategy of NLRR++ with approach \(({\mathrm{II}})\), even when \(T = 1\), converges faster than NLRR. We also observe that a larger T further improves the convergence of NLRR++.

2.4 Time complexity and memory cost

Our NLRR++ via column-wise BCD is summarized in Algorithm 1. To analyze the computational and memory complexity, we assume \(d\ll m < n\): the feature dimension m is fixed, the estimated subspace rank d is generally much smaller than m (usually below 100), and only the number of samples n grows with the size of the dataset.

In the inner iteration \((t=1,\ldots ,d)\), the updates of \(\mathbf {R}_t\) and \(\mathbf {R}\) in steps 7 and 12 cost \({\mathcal {O}}(mn)\). The updates of \({\mathbf {u}}_t\) and \({\mathbf {v}}_t\) in steps 9 and 10 cost \({\mathcal {O}}{(3Tmn+2Tm^2)}\) and \({\mathcal {O}}{(2Tmn)}\), respectively, where T is a small fixed constant. Therefore, the time complexity of the inner iteration of Algorithm 1 is \({\mathcal {O}}(dmn)\), and the total time complexity of NLRR++ is \({\mathcal {O}}(T_1mnd)\), where \(T_1\) is the number of outer iterations, which scales as \({\mathcal {O}}(\frac{1}{\varepsilon }) \) (Richtárik and Takáč 2014). In all the experiments shown in Sect. 4, \(T_1\) is no more than 5. In terms of memory, each step of Algorithm 1 needs \({\mathcal {O}}(mn)\) memory.

We also compare the time complexity and memory cost of NLRR++, NLRR (Shen and Li 2016), OLRSC (Shen et al. 2016) and ORPCA (Feng et al. 2013) in Table 1. Note that OLRSC and ORPCA are online methods with the lowest memory cost, which comes at the price of computational efficiency.

Table 1 Comparison of time complexity and memory cost of NLRR, NLRR++, ORPCA and OLRSC

2.5 Theoretical convergence analysis

For NLRR++, \(g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d, {\mathbf {E}})\) is convex in each \({\mathbf {u}}_t\), each \({\mathbf {v}}_t\) and \({\mathbf {E}}\) separately. Since we follow the BCD update rule with \(2d+1\) blocks (\(\{{\mathbf {u}}_t\}_{t=1}^d\), \(\{{\mathbf {v}}_t\}_{t=1}^d\) and \({\mathbf {E}}\)), the corresponding subproblems can be written as

$$\begin{aligned}&\mathop {\min }\limits _{\{{\mathbf {u}}_t\}_{t=1}^d} g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d ,{\mathbf {E}}), \end{aligned}$$
(16)
$$\begin{aligned}&\mathop {\min }\limits _{ \{{\mathbf {v}}_t\}_{t=1}^d } g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d ,{\mathbf {E}}), \end{aligned}$$
(17)
$$\begin{aligned}&\mathop {\min }\limits _{{\mathbf {E}}} \ g(\{{\mathbf {u}}_t\}_{t=1}^d, \{{\mathbf {v}}_t\}_{t=1}^d,{\mathbf {E}}). \end{aligned}$$
(18)

According to Proposition 2.7.1 in Bertsekas (1999), if the minimum of each subproblem in Eqs. (16), (17) and (18) is uniquely attained, then every limit point of the sequence generated by BCD is a stationary point. For Eqs. (16) and (17), the L2-regularization makes the subproblems strongly convex, so they clearly satisfy this condition. For Eq. (18), the solution given by the soft-thresholding is also uniquely attained. Therefore, Algorithm 1 converges to stationary points of the problem in Eq. (7).

However, it is difficult to guarantee that a stationary point of a non-convex optimization problem is globally optimal (Bertsekas 1999). Burer and Monteiro (2005) show that a local minimum is also a global minimum when the data contain no noise and the given rank is large enough. In practice, the algorithm may stop at a stationary point rather than a local minimum, but the experiments below show that the solutions obtained by Algorithm 1 perform well for subspace clustering.

3 Clustering pipeline

The original clustering pipeline first uses the representation matrix \({\mathbf {X}}\approx {\mathbf {U}}\mathbf {V}^{\mathrm{T}}\) to construct the n-by-n affinity matrix \(\mathbf {W}= |\mathbf {V}||\mathbf {V}^{\mathrm{T}}|\), where \({\mathbf {U}}, \mathbf {V}\) are produced by the non-convex LRR approaches, and then feeds the affinity matrix to a spectral clustering algorithm (Ng et al. 2001) to get the final clustering result. However, spectral clustering is not practical on an n-by-n dense affinity matrix, because its time complexity and memory cost are \(\mathcal {O}{(n^3)}\) and \(\mathcal {O}{(n^2)}\), respectively. Therefore, the spectral clustering algorithm is not suitable for large datasets.

In this section, the K-means clustering method is instead applied directly to the n-by-d (d being the estimated rank of the dataset) matrices \(\mathbf {U},\mathbf {V}\) obtained from the non-convex low-rank representation model. Note that \({\mathbf {X}} \approx {\mathbf {U}}\mathbf {V}^{\mathrm{T}}\) is obtained from Eq. (1) or Eq. (3); the columns of \({\mathbf {U}} \in {\mathbb {R}}^{n \times d}\) form a basis spanning the column space of \({\mathbf {X}}\), and \(\mathbf {V}\) is the corresponding coefficient matrix under the basis \({\mathbf {U}}\). Rather than directly feeding \(\mathbf {V}\) to the K-means algorithm as Shen and Li (2016) do, we introduce a dimension reduction step to avoid the curse of dimensionality, especially when \(d>100\). We select the k columns of \(\mathbf {V}\) (assuming the cluster number k is known; usually \(k \ll d\)) that correspond to the k principal components of \({\mathbf {U}}\) (measured by column vector length) to construct a new dimension-reduced representation matrix \(\mathbf {V}_k\). Then, we normalize the rows of \(\mathbf {V}_k\) and feed it to K-means clustering to obtain the final clustering result.
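A minimal sketch of this post-processing step is given below, under our reading of the column-selection rule (columns of \({\mathbf {U}}\) are ranked by their Euclidean length); the helper name and K-means settings are arbitrary choices, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_from_UV(U, V, k):
    """Pick the k columns of V matching the k longest columns of U,
    row-normalize the reduced matrix V_k, then run K-means (Sect. 3)."""
    lengths = np.linalg.norm(U, axis=0)              # column lengths of U
    idx = np.argsort(lengths)[::-1][:k]              # k principal components
    Vk = V[:, idx]                                   # n x k reduced representation
    Vk = Vk / np.maximum(np.linalg.norm(Vk, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Vk)
```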

A comparison of the running time and clustering accuracy of spectral clustering, plain K-means and our proposed scheme is shown in Tables 2 and 3, where the \(\mathbf {V}\) obtained by NLRR++ is fed to each post-processing method. Spectral clustering computes the SVD of an n-by-n similarity matrix, which is quite slow and always dominates the running time of the whole task. Plain K-means is neither stable nor efficient, especially on large datasets.

Table 2 Time cost (seconds) of various post-processing schemes for subspace clustering
Table 3 Clustering accuracy (%) of various post-processing schemes for subspace clustering

4 Experiments

Parameter settings and environment. We set \(\beta = 0.8\) for both NLRR++ and NLRR (Shen and Li 2016), and the number of inner iterations \(T=4\) for NLRR++. For the competing methods, i.e., OLRSC (Shen et al. 2016), ORPCA (Feng et al. 2013), LRR (Liu et al. 2013), SSC (Elhamifar and Vidal 2013) and CASS (Lu et al. 2014), we follow the default parameter settings specified in their papers.

Fig. 2
Clustering accuracy and running time (in seconds). The first two subfigures show synthetic data with \(m=1000, k=4, n_k =1000\) and \(\rho =0.02\); the x-axis is Rank/Dimension. Note that NLRR takes less time but stays at bad stationary points with much lower clustering accuracy, especially when Rank/Dimension is larger than 0.2. The last two subfigures show synthetic data with \(m=1000, k=4\), \(d_k =40\) and \(\rho =0.02\); the x-axis is the number of samples in each subspace. NLRR++ achieves good clustering performance while requiring the lowest computational time

All the experiments were executed on an Intel Xeon E5-2640 2.4 GHz CPU with 64 GB RAM running Linux. Note that our algorithm can be easily parallelized since all its operations are matrix multiplications, and this is done automatically by MATLAB if more than one thread is assigned. Since all the other algorithms can also be parallelized by MATLAB, we compare the algorithms under both single-thread and multi-thread settings.

4.1 Subspace clustering on simulation data

We use four disjoint subspaces \(\{\mathbf {S}_k\}_{k=1}^4 \subset {\mathbb {R}}^m\), whose bases are denoted by \(\{\mathbf {L}_k\}_{k=1}^4 \in {\mathbb {R}}^{m\times d_k}\). The data matrix \(\mathbf {Z}_k\), whose columns lie in \(S_k\), is produced by \(\mathbf {Z}_k = \mathbf {L}_k\mathbf {R}_k^{\mathrm{T}}\), where \(\mathbf {R}_k \in {\mathbb {R}}^{n_k \times d_k}\). The entries of the \(\mathbf {L}_k\)’s and \(\mathbf {R}_k\)’s are sampled from the normal distribution.
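A minimal NumPy sketch of this data-generation procedure follows (the corruption step is our assumption of sparse additive Gaussian noise at rate \(\rho \), since the exact corruption model is not spelled out here):

```python
import numpy as np

def make_synthetic(m=1000, k=4, d_k=40, n_k=1000, rho=0.02, seed=0):
    """k disjoint subspaces with Gaussian bases L_k (m x d_k) and
    coefficients R_k (n_k x d_k); Z_k = L_k R_k^T (Sect. 4.1)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for j in range(k):
        L = rng.standard_normal((m, d_k))
        R = rng.standard_normal((n_k, d_k))
        blocks.append(L @ R.T)               # m x n_k samples from subspace j
        labels += [j] * n_k
    Z = np.concatenate(blocks, axis=1)
    mask = rng.random(Z.shape) < rho         # corrupt a rho-fraction of entries
    Z[mask] += rng.standard_normal(int(mask.sum()))
    return Z, np.array(labels)
```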

We generate the data with fixed dimension \(m=1000\) and corruption rate \(\rho =0.02\), and set \(\lambda = \frac{1}{\sqrt{m}}\) for both NLRR and NLRR++. In the first two subfigures of Fig. 2, we fix the number of samples in each subspace and show how Rank/Dimension, the ratio between the number of basis vectors and the ambient dimension, affects the clustering accuracy (ACC) and the computational time (in seconds). As seen from these subfigures, the accuracy of NLRR++, ORPCA and OLRSC is similar, and far better than that of NLRR and LRR. In terms of speed, NLRR and NLRR++ are faster than the other methods; although NLRR is sometimes faster than NLRR++, its accuracy is poor because it converges prematurely. In the last two subfigures of Fig. 2, we show the effect of increasing the sample number on the clustering accuracy and the running time, with the remaining parameters fixed at \(m = 1000\), \(k = 4\), \(d_k = 40\) and \(\rho = 0.02\). NLRR++ achieves higher clustering accuracy and the fastest running time. Increasing the sample size has a relatively small impact on the efficiency of NLRR++, OLRSC and ORPCA, whereas the running time of LRR and NLRR grows markedly as the number of samples increases.

In Fig. 3, we plot the training curves of NLRR and NLRR++ in terms of the objective function, since they solve the same optimization problem in Eq. (3). We set Rank/Dimension to \(16 \%\) and the number of samples in each subspace to 2800. The results clearly show that NLRR++ is much faster than NLRR.

Fig. 3
Comparison of NLRR and NLRR++ on synthetic data (with \(n_k = 2800, \text {Rank/Dimension}=16 \%, \rho =0.02\)). The x-axis is training time in seconds. We compare the objective function value of Eq. (3) in the left figure and the clustering accuracy in the right figure

4.2 Subspace clustering on real datasets

We examine subspace clustering performance on five real databases. The first four datasets can be downloaded from the LibSVM website. For MNIST and SVHN, we randomly select 22,700 samples and 9001 samples to form MNIST-20K and SVHN-9K, respectively. We also evaluate our method on a real-world dataset, ImageNet. ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each class contains hundreds or thousands of images. In this experiment, we use features extracted by a convolutional neural network (Krizhevsky et al. 2012) to construct the sample sets ImageNet-30K, ImageNet-60K and ImageNet-120K with 30K, 60K and 120K samples, respectively.

Table 4 Clustering F-measure with rank \(d = \{50,95,90,150,250,550\}\) for USPS, Protein, MNIST-20K, SVHN-9K, ImageNet-30K, ImageNet-60K
Table 5 Clustering ACC (%) with rank \(d = \{50,95,90,150,250,550\}\) for USPS, Protein, MNIST-20K, SVHN-9K, ImageNet-30K, ImageNet-60K
Table 6 Computational time (seconds) for each method, automatically parallelized by MATLAB with ten threads versus a single thread

To evaluate the clustering quality of the proposed method, we use two evaluation metrics: accuracy (ACC) (Elhamifar and Vidal 2013) and F-measure (Larsen and Aone 1999). For NLRR and NLRR++, we fix the convergence tolerance \(\varepsilon = 0.005\), \(\beta =0.8\) and the regularization parameter \(\lambda = 40\) for all the datasets. We set the rank to \(d = \{50, 95, 90, 150, 250, 550, 700\} \) for USPS, ImageNet-120K, MNIST-20K, SVHN-9K, ImageNet-30K, ImageNet-60K and Protein, respectively; the same rank is used for all the methods. Also note that, following Shen and Li (2016), Shen et al. (2016) and Feng et al. (2013), we use the dataset itself as the dictionary \({\mathbf {A}}\). For all the methods except SSC, we use the clustering pipeline proposed in Sect. 3, which yields appealing clustering results on large-scale datasets. SSC is the exception because the representation matrix it produces is too sparse to perform singular value decomposition on, so the clustering strategy introduced in the previous section cannot be applied directly and only the spectral clustering method can be used.

We report the F-measure and accuracy of NLRR, NLRR++, OLRSC, ORPCA, LRR and SSC in Tables 4 and 5, and the computational time with multiple threads and a single thread in Table 6. Our algorithm NLRR++ significantly outperforms the other state-of-the-art methods in both accuracy and efficiency on most of the datasets. Although SSC is slightly better than OLRSC on the Protein dataset, it takes almost 2 h, while NLRR++ needs only 16.25 s to obtain a good solution. As mentioned earlier, since SSC requires spectral clustering for the final result, it is not applicable to large-scale clustering; we therefore do not report the accuracy of SSC on ImageNet-30K and ImageNet-60K, because the affinity matrix runs out of memory in the spectral clustering step.

Compared with NLRR, OLRSC and ORPCA, NLRR++ consistently achieves higher clustering accuracy with less running time. For example, on the USPS dataset, NLRR++ achieves an accuracy of 66.37%, while the second best algorithm achieves only 51.76%. In terms of computational time, NLRR++ uses only 6 s, which is much faster than all the other methods. For the ImageNet, MNIST and USPS datasets, NLRR++ achieves the best performance in both efficiency and accuracy. It is, however, no better than NLRR on the SVHN dataset, owing to that dataset's relatively small size. Overall, NLRR++ clearly outperforms the other methods on datasets with a large number of samples.

Finally, for ImageNet-120K, NLRR, SSC and LRR run out of memory in their first iteration. The online methods OLRSC and ORPCA take more than 10 h, while NLRR++ needs only about 1.2 h. NLRR++ is thus the only method that scales to 120,000 samples.

5 Conclusions

In this paper, we propose a novel algorithm, NLRR++, for scalable subspace clustering. It dramatically reduces the memory cost of NLRR from \(\mathcal {O}(n^2)\) to \(\mathcal {O}(mn)\) and the time complexity per iteration from \(\mathcal {O}(n^3)\) to \(\mathcal {O}(dmn)\). The two key techniques behind NLRR++ are the rank-one reformulation of the non-convex subspace recovery problem and the column-wise block coordinate descent that enables fast variable updates. We have also analyzed the time complexity and empirically demonstrated that our algorithm is computationally much more efficient than competing baselines. Our extensive experimental study on synthetic and real-world datasets further illustrates the robustness of NLRR++. The results show that NLRR++ is the only method that can solve an ImageNet problem with 120K samples, in about 2 h with a single thread, while the other methods either run out of memory or demand more than 10 h of training time.