1 Introduction

In various machine learning and data analysis areas, such as collaborative filtering [1, 2], dimensionality reduction [3], subspace clustering [4], multiple labels learning [5, 6], and image processing [7, 8], one needs to consider the following matrix completion problem:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times n}} rank(X) \quad {\hbox {s.t.}}\ {\mathscr {P}}_{\varOmega }(X) = {\mathscr {P}}_{\varOmega }(M), \end{aligned}$$
(1)

where \(M \in {\mathbb {R}}^{m \times n}\) is the incomplete low-rank matrix to be reconstructed, X is the low-rank matrix variable in \({\mathbb {R}}^{m \times n}\), rank(X) is the rank of X, \(\varOmega \) is the set of locations of the observed entries, and \({\mathscr {P}}_{\varOmega }\) is the orthogonal projection onto the span of matrices vanishing outside of \(\varOmega \). The goal of problem (1) is to find the lowest-rank solution \(X^{*}\) of \({\mathscr {P}}_{\varOmega }(X) = {\mathscr {P}}_{\varOmega }(M)\). In practice, one often considers the following regularized formulation instead:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times n}} \frac{1}{2}\| {\mathscr {P}}_{\varOmega }(X - M)\| _{F}^{2} + \lambda rank(X), \end{aligned}$$
(2)

where \(\| \cdot \| _{F}\) is the Frobenius norm and \(\lambda \) is a positive regularization parameter.

However, owing to the nonconvexity and discontinuity of rank(X), problems (1) and (2) are NP-hard [9] and cannot be solved in polynomial time. To alleviate this difficulty, a widely used strategy is to adopt the nuclear norm as a convex surrogate of \({ {rank}}(\cdot )\). Theoretical studies show that the nuclear norm is the tightest convex lower bound of the rank [9]. Candès and Recht [10] proved that the missing values can be perfectly recovered by solving the nuclear norm regularized minimization problem if the incomplete matrix M satisfies certain assumptions, e.g., \(|\varOmega | \ge {\mathscr {O}}(N^{1.2}r{\text {log}}(N))\)\((N = { {max}}(m, n), r = { {rank}}(M))\). Generally speaking, the nuclear norm regularized minimization problem can be treated as a Semidefinite Program (SDP) [11]. However, SDP solvers are only suitable for \(m \times n\) matrices with \(m, n \le 100\). To overcome this drawback, various methods have been proposed to tackle the nuclear norm regularized minimization problem. Examples of first-order methods include the singular value thresholding (SVT) algorithm [12] and the accelerated proximal gradient with linesearch (APGL) algorithm [13]. Although these two sophisticated algorithms attain promising results with strong theoretical guarantees, both involve expensive SVD operations in each iteration and are not suitable for large-scale matrices. To alleviate this shortcoming, fixed point continuation with approximate SVD (FPCA) [14] addresses the same problem as APGL while utilizing a fast Monte Carlo algorithm for SVD calculations. Another state-of-the-art algorithm is the Soft-Impute algorithm [15], which utilizes a special “sparse plus low-rank” structure associated with the SVT to allow an efficient SVD in each iteration. Very recently, the Soft-Impute algorithm has been accelerated by using Nesterov’s rule [16].

Although the nuclear norm regularized minimization problem can be solved efficiently and has been accepted as a powerful tool for matrix completion, Fan [17] pointed out that the nuclear norm penalty shrinks all singular values equally, which over-penalizes large singular values. In other words, the nuclear norm may make the solution deviate from the original solution. With the aim of penalizing the larger singular values less, there has been significant interest in nonconvex surrogates of \(rank(\cdot )\), such as the capped-\(l_{1}\) penalty, the log-sum penalty (LSP), the truncated nuclear norm (TNN), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). In [18], learning formulations with capped-\(l_{1}\) objective functions are solved by means of a multi-stage convex relaxation scheme. In [19], sparse signal recovery problems are solved by a sequence of weighted \(l_{1}\)-minimization problems. In [20], a matrix completion algorithm based on the TNN is employed to achieve a better approximation to the rank of a matrix. In [17], penalized likelihood methods are proposed for nonparametric regression that select variables and estimate coefficients simultaneously. In [21], a fast, continuous, nearly unbiased and accurate method is proposed for solving high-dimensional linear regression. Empirically, these nonconvex regularization methods achieve better recovery performance than the convex nuclear norm regularization methods.

Another line of nonconvex surrogates of \({ {rank}}(\cdot )\) is the \(l_{q} \; (0< q < 1 )\) or Schatten-q quasi-norm, which has received significant interest in the area of matrix completion. The key idea is that it allows a less biased and/or lower rank solution to be found than the nuclear norm does. However, the resulting nonconvex \(l_{q}\) regularization problems are much more difficult to solve. A popular method for optimizing nonconvex \(l_{q}\) regularization problems is the \(l_{q}\)-Proximal-Gradient (\(l_{q}PG\)) algorithm [22]. Since the objective function is nonconvex (and discontinuous when \(q = 0\)), most existing convergence theory developed for the convex case cannot be applied. By employing Zangwill’s global convergence theory, a non-trivial convergence result for \(l_{q}PG\) is obtained. However, each \(l_{q}PG\) iteration involves computing approximate solutions [23,24,25,26] and SVD steps. Thus, this method is not very accurate and may become slow when addressing large-scale matrices. To improve the accuracy and speed of the \(l_{q}PG\) algorithm, improved methods based on the Schatten 1/2 quasi-norm [27] and the Schatten 2/3 quasi-norm [28] have recently been proposed. Peng et al. [29] proposed a fixed point iterative scheme with the singular value half thresholding operator. The convergence analysis of this method reveals that it is faster and more efficient than the \(l_{q}PG\) algorithm. Most recently, observing that the Schatten 2/3 quasi-norm regularization model has shown better performance than Schatten 1/2 quasi-norm minimization as it requires fewer measurements, Wang et al. [30] built an \(L_{2/3}\)-PA algorithm for matrix completion. Although closed-form thresholding formulas for the Schatten 1/2 and Schatten 2/3 quasi-norm regularization problems are available, a computationally expensive SVD is still required in each iteration.

To remedy the drawbacks discussed above, this paper proposes a novel, faster and more accurate algorithm for matrix completion. The proposed method extends the HFPA and \(L_{2/3}\)-PA algorithms by using an inexact proximal operator and Nesterov’s accelerated gradient method. More precisely, our algorithm builds on the HFPA and \(L_{2/3}\)-PA algorithms, but achieves a faster convergence rate and better recovery performance than both. Besides, our proposed algorithm is simple and easy to use for large-scale matrix completion problems. The main contributions of our paper include the following:

  1. We first propose an efficient and fast algorithm to solve the matrix completion problem, which is based on an inexact proximal operator and Nesterov’s accelerated gradient method;

  2. In addition, we study the convergence of our proposed method, and show that our proposed algorithm is guaranteed to converge to a critical point of the nonconvex objective. Furthermore, our proposed algorithm is simple and easy to use;

  3. Finally, we apply the proposed algorithm to synthetic data, image recovery and large-scale recommendation problems, and achieve a faster convergence rate and better recovery performance than most state-of-the-art algorithms, which demonstrates that our proposed algorithm has great potential in matrix completion applications.

The remainder of this paper is organized as follows. Section 2 describes related work; Sect. 3 introduces the proposed algorithm; Sect. 4 reports and analyzes the experimental results in terms of both speed and quality; finally, Sect. 5 concludes the paper.

Notation We summarize the notation that will be used in this paper. For \(X \in {\mathbb {R}}^{m \times n}\), \(\sigma (X) = (\sigma _{1}(X), \sigma _{2}(X), \ldots , \sigma _{r}(X))^{T}\) denotes the vector of singular values of X arranged in nonincreasing order; \(Diag(\sigma (X))\) denotes a diagonal matrix whose diagonal vector is \(\sigma (X)\). The Frobenius norm and Schatten-q quasi-norm of X are defined as \(\| X \| _{F} = (\sum _{i,j}X_{ij}^{2})^{1/2} = (\sum _{i = 1}^{r}\sigma _{i}(X)^{2})^{1/2}\) and \(\| X \| _{q} = (\sum _{i = 1}^{r}\sigma _{i}(X)^{q})^{1/q}\), respectively. For \(X, Y \in {\mathbb {R}}^{m \times n}\), \(\langle X, Y\rangle = tr(Y^{T}X)\) denotes their inner product.

2 Related work

2.1 Proximal algorithms

The APGL algorithm pioneered the use of the proximal operator [31] for solving matrix completion problems. It considers the following low-rank matrix completion problem:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times n}} F(X) \equiv f(X) + g(X), \end{aligned}$$
(3)

where f and g are convex, f is smooth, and g is possibly nonsmooth. Besides, f is differentiable with an L-Lipschitz continuous gradient, i.e., \(\| \bigtriangledown f(X_{1}) - \bigtriangledown f(X_{2})\| \le L\| X_{1} - X_{2}\| \). At the tth iteration, the APGL algorithm generates \(X_{t + 1}\) as

$$\begin{aligned} X_{t + 1} = \mathrm{prox}_{\mu g}(Y) = \mathop {\mathrm{arg}\min }_{X \in {\mathbb {R}}^{m \times n}}\frac{1}{2}\| X - Y\| _{F}^{2} + \mu g(X), \end{aligned}$$
(4)

where \(\mu < 1/L\), and \(\hbox {prox}_{\mu g}(\cdot )\) denotes the proximal operator. By using Nesterov’s accelerated gradient method, Y can be obtained as follows:

$$\begin{aligned} G_{t} = (1 + \theta _{t})X_{t} - \theta _{t}X_{t - 1}, \end{aligned}$$
(5)
$$\begin{aligned} Y = G_{t} - \mu \bigtriangledown f(G_{t}), \end{aligned}$$
(6)
$$\begin{aligned} \theta _{t + 1} = \frac{1 + \sqrt{1 + 4(\theta _{t})^{2}}}{2}. \end{aligned}$$
(7)

The convergence analysis of the APGL algorithm reveals that it converges at a rate of \({\mathscr {O}}(1/T^{2})\), where T is the iteration number. This is also known to be the best possible rate for problem (3) [32].
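To make the iteration concrete, the following is a minimal NumPy sketch of such an accelerated proximal gradient loop. It is only an illustration: the function names (`apg`, `grad_f`, `prox_g`) and the fixed iteration count are our own choices, and the extrapolation weight \((\theta _{t-1} - 1)/\theta _{t}\) is the standard FISTA instantiation of the momentum term in (5) rather than a literal transcription of the notation above.

```python
import numpy as np

def apg(grad_f, prox_g, X0, mu, n_iter=100):
    """Sketch of an APGL-style accelerated proximal gradient loop (cf. Eqs. (4)-(7)).

    grad_f(Y)    : gradient of the smooth part f at Y
    prox_g(Y, t) : argmin_X 0.5 * ||X - Y||_F^2 + t * g(X), cf. Eq. (4)
    mu           : step size, assumed to satisfy mu <= 1/L
    """
    X_prev, X = X0.copy(), X0.copy()
    theta_prev, theta = 1.0, 1.0
    for _ in range(n_iter):
        # extrapolation (momentum) step, cf. Eq. (5)
        G = X + ((theta_prev - 1.0) / theta) * (X - X_prev)
        # gradient step on the smooth part, cf. Eq. (6)
        Y = G - mu * grad_f(G)
        # proximal step, cf. Eq. (4)
        X_prev, X = X, prox_g(Y, mu)
        # momentum parameter update, cf. Eq. (7)
        theta_prev, theta = theta, (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
    return X
```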

Given the considerable attention paid to nonconvex and nonsmooth problems in machine learning, it is natural to ask whether the accelerated proximal algorithm can be extended to problems where f and/or g may be nonconvex. The answer is positive. In [33], Li and Lin extended the accelerated proximal algorithm to solve nonconvex and nonsmooth problems. The key idea is to adopt a monitor that satisfies a sufficient descent property. Based on this, they proposed a nonmonotone accelerated proximal gradient algorithm (nmAPG). Although the nmAPG algorithm is much faster than most state-of-the-art algorithms, its convergence rate is still unknown.

2.2 Existing completion methods with \(l_{q}\) regularization

Using the Schatten-q quasi-norm minimization instead of the \(rank(\cdot )\) minimization is one of the most successful approaches in the area of matrix completion. The state-of-the-art matrix completion algorithms based on \(l_{q}\) regularization are \(l_{q}PG\) [22], HFPA [29], and \(L_{2/3}\)-PA [30].

(1) \(l_{q}PG\): Consider the \(l_{q}\)-penalized model:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times n}} F(X) = \frac{1}{2}\| {\mathscr {P}}_{\varOmega }(X - M)\| _{F}^{2} + \lambda \| X\| _{q}^{q}, \end{aligned}$$
(8)

where \(q \in [0, 1]\). It should be noted that \(q = 0\) corresponds to problem (2) and \(q = 1\) corresponds to the nuclear norm penalized problem. Solving (8) requires the following non-trivial scalar optimization problem:

$$\begin{aligned} \min _{x}\phi (x) = \frac{1}{2}(z - x)^{2} + \lambda |x|^{q}, \end{aligned}$$
(9)

where z is a constant and \(q \in (0, 1)\). This problem has the solution characterized by the following theorem.

Theorem 1

[22] Let \(q \in (0, 1)\), \(b = [2\lambda (1 - q)]^{1/(2 - q)}\), and \(c = b + \lambda q b^{q - 1}\). Then the solutions \(x^{*} = \tau _{\lambda }(z)\) to the problem (9) are:

$$\begin{aligned} \tau _{\lambda }(z) = \left\{ \begin{array}{ll} 0, &{} {\mathrm{if}}~|z| < c;\\ \{0, { {sgn}}(z)b\}, &{} \mathrm{if}~|z| = c;\\ { {sgn}}(z){\hat{x}}, &{} \mathrm{if}~|z| > c. \end{array} \right. \end{aligned}$$
(10)

where for \(|z| > c\), \({\hat{x}} \in (b, |z|)\) solves:

$$\begin{aligned} x + \lambda q x^{q - 1} = |z| ~~where~~ x > 0. \end{aligned}$$
(11)

When \(|z| > c\) there are two solutions to (11) and \({\hat{x}}\) is the larger one, which can be computed from the iteration:

$$\begin{aligned} x_{t + 1} = \rho (x_{t}) ~~{\hbox {where}} ~~ \rho (x) = |z| - \lambda q x^{q - 1} \end{aligned}$$
(12)

with the initial condition \(x_{(0)} \in [b, |z|]\).

Using the above theorem, Marjanovic et al. proposed an MM-based algorithm, namely \(l_{q}PG\), that iteratively reduces the objective function F. Moreover, experiments on matrix completion problems show that the \(l_{q}PG\) algorithm is superior to the state-of-the-art. Although the \(l_{q}PG\) algorithm is further accelerated using warm-starting and Nesterov’s method, it still suffers from heavy SVDs, each of which takes \({\mathscr {O}}(mn^{2})\) time.
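As an illustration of Theorem 1, the scalar operator \(\tau _{\lambda }(z)\) can be evaluated with a short fixed-point loop as sketched below; the function name and the iteration cap `n_fixed_point` are our own illustrative choices, not part of [22], and the boundary case \(|z| = c\) is mapped to the nonzero candidate.

```python
import numpy as np

def q_threshold(z, lam, q, n_fixed_point=50):
    """Scalar q-thresholding operator tau_lambda(z) of Theorem 1, for 0 < q < 1."""
    b = (2.0 * lam * (1.0 - q)) ** (1.0 / (2.0 - q))
    c = b + lam * q * b ** (q - 1.0)
    if abs(z) < c:
        return 0.0
    x = abs(z)                                   # initial point in [b, |z|]
    for _ in range(n_fixed_point):
        x = abs(z) - lam * q * x ** (q - 1.0)    # fixed-point iteration (12)
    return np.sign(z) * x                        # nonzero solution of Eq. (11)
```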

(2) HFPA: Recently, Peng et al. proposed the HFPA algorithm to improve the FPCA and \(l_{q}PG\) algorithms by using the Schatten 1/2 quasi-norm. The key idea is that when \(q = 1/2\) the solution of problem (9) has an analytical expression [27], that is:

$$\begin{aligned} \tau _{\lambda }(z) = \left\{ \begin{array}{ll} 0, &{} {\mathrm{if}}~|z| < \frac{\root 3 \of {54}}{4}\lambda ^{2/3};\\ \{0, \frac{2}{3}z(1 + \cos (\frac{2\pi }{3} - \frac{2}{3}\phi _{\lambda }(z)))\}, &{} \mathrm{if}~|z| = \frac{\root 3 \of {54}}{4}\lambda ^{2/3};\\ \frac{2}{3}z\left(1 + \cos \left(\frac{2\pi }{3} - \frac{2}{3}\phi _{\lambda }(z)\right)\right), &{} \mathrm{if}~|z| > \frac{\root 3 \of {54}}{4}\lambda ^{2/3}. \end{array} \right. \end{aligned}$$
(13)

with \(\phi _{\lambda }(z) = {\text {arccos}}(\frac{\lambda }{8}(\frac{|z|}{3})^{-3/2})\). The HFPA algorithm consists of two loops. In the inner iterations, a special gradient descent method and the matrix half thresholding operator are employed to obtain an approximate solution. In the outer loop, the continuation technique is used to accelerate the convergence speed of HFPA. Besides, the authors further provide the global necessary optimality condition for the \(L_{1/2}\) regularization problem and an exact mathematical convergence analysis of HFPA. In essence, HFPA is a fixed point method applied to a nonconvex optimization problem. Although a fast Monte Carlo algorithm is adopted in HFPA to approximate the SVD procedure, it is still time-consuming on large matrices.
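For concreteness, a minimal sketch of this scalar half-thresholding rule is given below, following the convention of Eq. (13); the function name is our own, and the boundary case \(|z| = \frac{\root 3 \of {54}}{4}\lambda ^{2/3}\) is mapped to zero here, whereas (13) lists two candidate values.

```python
import numpy as np

def half_threshold(z, lam):
    """Scalar half-thresholding operator (q = 1/2), following Eq. (13)."""
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(z) <= thresh:
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(z) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * z * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
```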

figure a

(3) \(L_{2/3}\)-PA: Another state-of-the-art algorithm, \(L_{2/3}\)-PA, combines a gradient descent method in the inner iterations, which approximates the solutions via the 2/3-thresholding operator, with a continuation technique that accelerates the convergence speed. The main idea of \(L_{2/3}\)-PA is to employ the following analytical expression for the solutions of problem (9) when \(q = 2/3\):

$$\begin{aligned} \tau _{\lambda }(z) = \left\{ \begin{array}{ll} {{sgn}}(z)\left( \frac{(|\varphi _{\lambda }(z)| + \sqrt{\frac{2|z|}{|\varphi _{\lambda }(z)|} - |\varphi _{\lambda }(z)|^{2}})}{2}\right) ^{3}, &{} \mathrm{if}\,|z| > \frac{\root 4 \of {48}}{3}\lambda ^{3/4};\\ 0, &{} \mathrm{otherwise}; \end{array} \right. \end{aligned}$$
(14)

where \(\varphi _{\lambda }(z) = (2/\sqrt{3})\lambda ^{1/4}({\text {cosh}}(\phi _{\lambda }(z)/3))^{1/2}\), with \(\phi _{\lambda }(z) = {\hbox {arccosh}}(27z^{2}\lambda ^{-3/2}/16)\). Based on our empirical studies, we found that although \(L_{2/3}\)-PA improves the recovery performance of HFPA, it is still not very accurate. Moreover, it is not efficient enough, as it still involves time-consuming SVD steps and is not suitable for large-scale matrix completion.
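Analogously, a minimal sketch of the scalar 2/3-thresholding rule of Eq. (14) is given below; again the function name is our own and the boundary case is mapped to zero.

```python
import numpy as np

def two_thirds_threshold(z, lam):
    """Scalar 2/3-thresholding operator (q = 2/3), following Eq. (14)."""
    thresh = (48.0 ** 0.25 / 3.0) * lam ** 0.75
    if abs(z) <= thresh:
        return 0.0
    phi = np.arccosh(27.0 * z ** 2 / (16.0 * lam ** 1.5))
    a = (2.0 / np.sqrt(3.0)) * lam ** 0.25 * np.sqrt(np.cosh(phi / 3.0))
    return np.sign(z) * ((a + np.sqrt(2.0 * abs(z) / a - a ** 2)) / 2.0) ** 3
```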

2.3 Existing completion methods with inexact proximal operator

In machine learning research, proximal gradient methods are popular for solving various optimization problems with non-smooth regularization. However, for low-rank matrix problems they require a full SVD to solve the proximal operator, which may be time-consuming. Thus, inexact proximal gradient methods are extremely important.

Along this line, Yao et al. [34] use the power method to approximate the SVD and propose the Accelerated and Inexact Soft-Impute (AIS-Impute) algorithm for matrix completion. The convergence analysis of the AIS-Impute algorithm shows that it still converges at a rate of \({\mathscr {O}}(1/T^{2})\). For nonconvex problems, the Fast NonConvex Low-rank (FaNCL) [35] algorithm has been proposed for matrix completion and robust principal component analysis (RPCA). To improve the convergence speed, the inexact proximal gradient method is also employed in FaNCL. Besides, Gu et al. [36] proposed the nonmonotone accelerated inexact proximal gradient method (nmAIPG), which extends nmAPG from the exact to the inexact case. They also point out that the inexact version shares the same convergence rate as nmAPG. However, nmAIPG may require two proximal steps in each iteration and can be inefficient for large-scale matrices. To alleviate this shortcoming, the nonconvex inexact APG (niAPG) algorithm was proposed in [37]; it requires only one inexact proximal step in each iteration. Thus, the niAPG algorithm is much faster, while achieving performance comparable to the state-of-the-art.

3 Proposed method

In this section, we introduce our proposed algorithm and discuss some of its basic properties.

3.1 Motivation

In this section, we illustrate how the proximal gradient algorithm can be used to solve the following problem:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times n}} F(X) = f(X) + \lambda \| X\| _{q}^{q}, \end{aligned}$$
(15)

where \(f(X) = \frac{1}{2}\| {\mathscr {P}}_{\varOmega }(X - M)\| _{F}^{2}\) and \(q \in \{\frac{1}{2}, \frac{2}{3}\}\).

First, we give the following definition.

Definition 1

(q-thresholding operator) Suppose \(x = (x_{1},x_{2},\ldots ,x_{n})^{T}\), for any \(\lambda > 0\), the q-thresholding operator \({\mathscr {T}}_{\lambda }(\cdot )\) is defined as

$$\begin{aligned} {\mathscr {T}}_{\lambda }(x) = (\tau _{\lambda }(x_{1}),\tau _{\lambda }(x_{2}),\ldots ,\tau _{\lambda }(x_{n}))^{T}. \end{aligned}$$
(16)

Now, we consider the following quadratic approximation model of the objective function (15) at Y:

$$\begin{aligned} Q_{\lambda , \mu }(X, Y)& {} = \frac{1}{2}\| {\mathscr {P}}_{\varOmega }(Y - M)\| _{F}^{2} + <X - Y, {\mathscr {P}}_{\varOmega }(Y - M)> \nonumber \\&\quad + \frac{1}{2\mu }\| X - Y \| _{F}^{2} + \lambda \| X \| _{q}^{q}, \end{aligned}$$
(17)

where \(\mu \in (0, 1/L)\). It should be noted that \(Q_{\lambda , \mu }(X, X) = F(X)\) and \(Q_{\lambda , \mu }(X, Y) \ge F(X)\) for any \(X, Y \in {\mathbb {R}}^{m \times n}\). By using simple algebra, Eq. (17) can be recast as:

$$\begin{aligned} Q_{\lambda , \mu }(X, Y)& {} = \frac{1}{2\mu }\| X - (Y - \mu {\mathscr {P}}_{\varOmega }(Y - M))\| _{F}^{2} + \lambda \| X \| _{q}^{q}\nonumber \\&\quad+ \frac{1}{2}\| {\mathscr {P}}_{\varOmega }(Y - M)\| _{F}^{2} - \frac{\mu }{2}\| {\mathscr {P}}_{\varOmega }(Y - M) \| _{F}^{2}. \end{aligned}$$
(18)

Ignoring the constant terms in (18) and letting \(Y = X_{t - 1}\), the minimizer \(X_{t}\) of \(Q_{\lambda , \mu }(X, Y)\) can be obtained by

$$\begin{aligned} X_{t} = \mathop {arg \min }_{X \in {\mathbb {R}}^{m \times n}}\left\{ \frac{1}{2\mu }\| X - (X_{t - 1} - \mu {\mathscr {P}}_{\varOmega }(X_{t - 1} - M))\| _{F}^{2} + \lambda \| X \| _{q}^{q}\right\} . \end{aligned}$$
(19)

Thus, the above \(l_{q}\) regularization problem can be solved by the proximal operator as shown in the following lemma.

Lemma 2

[29] Let \(G = X_{t - 1} - \mu {\mathscr {P}}_{\varOmega }(X_{t - 1} - M)\) and let the SVD of G be \(U\varSigma V^{T}\). Then

$$\begin{aligned} X_{t} = { {prox}}_{\lambda \mu , q}(G) = \mathop {{\text {argmin}}}_{X \in {\mathbb {R}}^{m \times n}}\frac{1}{2}\| X - G \| _{F}^{2} + \lambda \mu \| X\| _{q}^{q}, \end{aligned}$$
(20)

where \(q \in \{\frac{1}{2}, \frac{2}{3}\}\). Specifically, \({ {prox}}_{\lambda \mu , q}(G) = U { {Diag}}({\mathscr {T}}_{\lambda \mu }(\sigma (G)))V^{T}\).
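For reference, a direct (exact) implementation of Lemma 2 can be sketched as follows; here `scalar_threshold` stands for either of the scalar thresholding rules for q = 1/2 or q = 2/3 (such as the sketches above), and the helper name is our own.

```python
import numpy as np

def prox_lq(G, lam_mu, scalar_threshold):
    """Exact proximal operator of Lemma 2: full SVD plus singular value thresholding."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_thr = np.array([scalar_threshold(si, lam_mu) for si in s])
    return (U * s_thr) @ Vt   # equals U @ Diag(T_{lam*mu}(sigma(G))) @ V^T
```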

3.2 Method for computing inexact proximal operator

As shown in Lemma 2, evaluating the proximal operator \({ {prox}}_{\lambda \mu , q}(\cdot )\) only needs the singular values/vectors whose singular values are greater than \( \frac{\root 3 \of {54}}{4}(\lambda \mu )^{2/3}\) (for \(q = 1/2\)) or \(\frac{\root 4 \of {48}}{3}(\lambda \mu )^{3/4}\) (for \(q = 2/3\)). This means that the full SVD in Lemma 2 is time-consuming and unnecessary. To overcome this drawback, it is natural to ask whether there is a faster and more effective way to solve this problem. Fortunately, some researchers have focused on this problem [35, 38]. They suggest applying the SVD to a small matrix instead of the original large matrix, so that the complexity can be dramatically reduced. The main idea of their methods is to extract a small core matrix by finding an orthonormal basis and exploiting the unitary invariance property. For problem (20), we can obtain a similar result.

Proposition 1

Suppose G has \({\hat{k}}\) singular values larger than \( \frac{\root 3 \of {54}}{4}(\lambda \mu )^{2/3}\) or \(\frac{\root 4 \of {48}}{3}(\lambda \mu )^{3/4}\), and \(G = QQ^{T}G\), where \(Q \in {\mathbb {R}}^{m \times k}, k > {\hat{k}}\), has orthonormal columns. Then

  (i) range(G) \(\subseteq \) range(Q);

  (ii) \({ {prox}}_{\lambda \mu , q}(G) = Q{ {prox}}_{\lambda \mu , q}(Q^{T}G)\).

Proof

  (i) Since \(G = Q(Q^{T}G)\), the first part of Proposition 1 follows immediately.

  (ii) Consider an arbitrary matrix \(Z \in {\mathbb {R}}^{k \times n}\) with SVD \(Z = U\varSigma V^{T}\); then \(QZ = (QU)\varSigma V^{T}\). Since Q has orthonormal columns, so does QU, and by the unitary invariance of the Schatten quasi-norm we have \(\| Z \| _{q}^{q} = \| QZ \| _{q}^{q}\). Therefore,

    $$\begin{aligned} Q{ {prox}}_{\lambda \mu , q}(Q^{T}G)& {} = Q\mathop {{\text {argmin}}}_{Z \in {\mathbb {R}}^{k \times n}}\frac{1}{2}\| Z - Q^{T}G \| _{F}^{2} + \lambda \mu \| Z\| _{q}^{q}\nonumber \\& {} = \mathop {{\text {argmin}}}_{X \in {\mathbb {R}}^{m \times n}}\frac{1}{2}\| QZ - QQ^{T}G \| _{F}^{2} + \lambda \mu \| Q^{T}X\| _{q}^{q}, \end{aligned}$$
    (21)

    the second equality follows from \(X =QZ\).

Since \(\| Q^{T}X \| _{q}^{q} = \| Z \| _{q}^{q} = \| QZ \| _{q}^{q} = \| X \| _{q}^{q}\), \(Q^{T}Q = I\), and \(QQ^{T}G = G\), we get

$$\begin{aligned} Q{ {prox}}_{\lambda \mu , q}(Q^{T}G)& {} = \mathop {{\text {argmin}}}_{X \in {\mathbb {R}}^{m \times n}}\frac{1}{2}\| X - G \| _{F}^{2} + \lambda \mu \| X\| _{q}^{q}\nonumber \\ & {}= { {prox}}_{\lambda \mu , q}(G), \end{aligned}$$
(22)

which is the second part of Proposition 1. We complete the proof. \(\square \)

Traditionally, we need a full SVD of the original large matrix, whose complexity is \({\mathscr {O}}(mn^{2})\). From Proposition 1, we can easily see that the complexity can be dramatically reduced by performing the SVD on a smaller matrix. Specifically, we perform the SVD on the matrix \(Q^{T}G\) and the complexity becomes \({\mathscr {O}}(mk^{2})\). Thus, when \(k \ll n\), the computation speed can be significantly improved.

It should be noted that the proof of Proposition 1 partially follows [38] but differs from it, since we deal with the \(l_{q}\) regularization problem instead of the nuclear norm regularization problem.

Next, we address how to determine the matrix Q with orthonormal columns. Indeed, Q can be approximated by the power method [39], which is widely used to approximate the SVD in nuclear norm and nonconvex regularized minimization problems [35, 40]. More precisely, we can use Algorithm 2 to obtain Q.

figure b

Similar to [35, 40], we fix the number of iterations of PowerMethod to 3. The matrix R is used for warm-starting, which is particularly useful for proximal algorithms. As pointed out in [35], the PROPACK algorithm [41] can also be used to obtain Q with the same complexity as PowerMethod; empirically, however, PROPACK is much less efficient than PowerMethod. Besides, a fast Monte Carlo algorithm [14, 29, 30] can be used to compute an approximate SVD. Although this method greatly reduces the computational effort, it has some tunable parameters that need to be set, which makes it less easy to use.
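Under the assumptions above, the power-method step can be sketched as follows; the function name and signature are illustrative and need not match the exact interface of Algorithm 2.

```python
import numpy as np

def power_method(G, R, n_iter=3):
    """Sketch of the power method used to approximate an orthonormal basis Q of range(G).

    G : m x n matrix; R : n x k warm-start matrix (e.g. the previous V_t or a Gaussian matrix)
    """
    Q, _ = np.linalg.qr(G @ R)                    # m x k orthonormal basis
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(G @ (G.T @ Q))        # one power iteration plus re-orthogonalization
    return Q
```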

3.3 Inexact proximal operator

Based on the PowerMethod algorithm, we can compute \({ {prox}}_{\lambda \mu , q}(\cdot )\) more efficiently. Specifically, as in Algorithm 3, we first adopt the PowerMethod algorithm to obtain the matrix Q with orthonormal columns. Then, we perform the SVD on \(Q^{T}G\), which is much smaller than the original matrix G since \(k \ll n\), so the complexity is reduced from \({\mathscr {O}}(mn^{2})\) to \({\mathscr {O}}(mk^{2})\). The remaining steps obtain the approximate \({ {prox}}_{\lambda \mu , q}(\cdot )\) by solving problem (9) with q fixed to 1/2 or 2/3.
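Putting Proposition 1 and the power method together, this inexact proximal step can be sketched as follows, reusing the `power_method` and scalar thresholding sketches above (names are illustrative):

```python
import numpy as np

def approx_prox_lq(G, R, lam_mu, scalar_threshold, n_power=3):
    """Sketch of the inexact proximal operator: only the small k x n matrix Q^T G is decomposed."""
    Q = power_method(G, R, n_power)                               # m x k orthonormal basis
    U_s, s, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)      # SVD of a k x n matrix
    s_thr = np.array([scalar_threshold(si, lam_mu) for si in s])
    return Q @ ((U_s * s_thr) @ Vt)                               # = Q prox(Q^T G), cf. Prop. 1
```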

According to Algorithm 3, the \({ {prox}}_{\lambda \mu , q}(\cdot )\) step will be inexact. Thus, we should monitor the progress of F to ensure that the proposed algorithm converges to a critical point. Motivated by [42], Yao et al. [35, 37] suggest employing the following condition to control the inexactness:

$$\begin{aligned} F(X_{t + 1}) \le F(X_{t}) - \frac{\delta }{2}\| X_{t + 1} - X_{t} \| _{F}^{2}, \end{aligned}$$
(23)

where \(\delta > 0\) is a constant. Obviously, this condition guarantees that the objective function F always decreases. However, in real applications, this monotone decrease may make the algorithm fall into a narrow curved valley. Hence, a nonmonotone condition is also used. Specifically, we accept \(X_{t + 1}\) if it makes the value of F smaller than the maximum over the previous m \((m > 1)\) iterations, that is,

$$\begin{aligned} F(X_{t + 1}) \le \max _{j = \max (1, t - m), \ldots , t}F(X_{j}) - \frac{\delta }{2}\| X_{t + 1} - X_{t} \| _{F}^{2}. \end{aligned}$$
(24)
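A small sketch of this acceptance test is given below; the default window length `m = 5` is our own illustrative choice, not a value prescribed here.

```python
def accept_nonmonotone(F_new, F_hist, diff_norm_sq, delta, m=5):
    """Nonmonotone acceptance test of Eq. (24).

    F_hist       : objective values F(X_1), ..., F(X_t) computed so far
    diff_norm_sq : ||X_{t+1} - X_t||_F^2
    """
    ref = max(F_hist[-(m + 1):])          # max over the last (at most) m + 1 objective values
    return F_new <= ref - 0.5 * delta * diff_norm_sq
```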

Based on the above analysis, we propose Algorithm 4 to obtain the approximate solution at iteration t.

figure c
figure d

3.4 The proposed algorithm

In this section, we will introduce our proposed \(l_{q}\) inexact APG (\(l_{q}\)iAPG) algorithm and discuss three techniques to accelerate the convergence of our proposed algorithm.

First, we use Nesterov’s accelerated gradient method. In convex optimization, Nesterov’s accelerated gradient method is a widely used technique for speeding up the convergence of many machine learning algorithms, e.g., the proximal algorithm. Recently, this method has also been employed to accelerate the convergence of nonconvex optimization; nmAPG [33], FaNCL-acc [35] and niAPG [37] are state-of-the-art examples. Thus, similar to nmAPG, we use the following steps in the \(l_{q}\)iAPG algorithm:

$$ \theta _{t + 1}= \frac{1 + \sqrt{1 + 4\theta _{t}^{2}}}{2},$$
(25)
$$\begin{aligned} Y_{t + 1}& {} = X_{t} + \left( \frac{\theta _{t}}{\theta _{t + 1}}\right) (Z_{t} - X_{t}) \nonumber \\&\quad + \left( \frac{\theta _{t} - 1}{\theta _{t + 1}}\right) (X_{t} - X_{t - 1}). \end{aligned}$$
(26)

Owing to strategies (25) and (26), extensive experiments show that the convergence rate of the \(l_{q}\)iAPG algorithm is significantly improved.

Since the regularization parameter \(\lambda \) plays an important role in our proposed algorithm, its tuning becomes a subtle issue. Fortunately, this problem has already been considered in [29, 30], and a reliable choice of the regularization parameter for (15) is

$$\begin{aligned} \lambda ^{*} = \left\{ \begin{array}{ll} \frac{\sqrt{96}}{9\mu }(\sigma _{r_{t}}(X_{t + 1}))^{3/2}, &{} {\mathrm{if}}~q = \frac{1}{2};\\ \frac{\root 3 \of {108}}{4\mu }(\sigma _{r_{t}}(X_{t + 1}))^{4/3}, &{} {\mathrm{if}}~q = \frac{2}{3}. \end{array} \right. \end{aligned}$$
(27)

Besides, the continuation technique is used in our proposed algorithm. As shown in [14, 30, 35], the continuation technique is a commonly used method to improve the convergence speed of machine learning algorithms. The key idea is to choose a decreasing sequence \(\lambda _{t}: \lambda _{1}> \lambda _{2}> \ldots> {\bar{\lambda }} > 0\) and, at the tth iteration, use \(\lambda = \lambda _{t}\). Therefore, based on the continuation technique, we suggest using the following regularization parameter at the \((t + 1)\)th iteration:

$$\begin{aligned} \lambda _{t + 1} = \left\{ \begin{array}{ll} \max \left\{{\bar{\lambda }}, \min \left\{\eta \lambda _{t}, \frac{\sqrt{96}}{9\mu }(\sigma _{r_{t + 1}}(X_{t + 1}))^{3/2}\right\}\right\}, &{} {\mathrm{if}}~q = \frac{1}{2};\\ \max \{{\bar{\lambda }}, \min \{\eta \lambda _{t}, \frac{\root 3 \of {108}}{4\mu }(\sigma _{r_{t + 1}}(X_{t + 1}))^{4/3}\}\}, &{} {\mathrm{if}}~q = \frac{2}{3}. \end{array} \right. \end{aligned}$$
(28)

where \(\eta \in (0, 1)\) is a constant, \(r_{t + 1}\) is the rank of \(X_{t + 1}\), and \({\bar{\lambda }}\) is a sufficiently small but positive real number, e.g., \(10^{-4}\).
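A sketch of this continuation update is given below; the defaults `eta = 0.75` and `lam_bar = 1e-4` follow the experimental settings reported in Sect. 4, and the function name is our own.

```python
def update_lambda(lam_t, sigma_r, mu, q, eta=0.75, lam_bar=1e-4):
    """Continuation update (28); sigma_r is the r_{t+1}-th singular value of X_{t+1}."""
    if q == 0.5:
        target = (96.0 ** 0.5 / (9.0 * mu)) * sigma_r ** 1.5
    else:  # q = 2/3
        target = (108.0 ** (1.0 / 3.0) / (4.0 * mu)) * sigma_r ** (4.0 / 3.0)
    return max(lam_bar, min(eta * lam_t, target))
```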

Now, we outline our proposed algorithm as follows,

figure e

The third technique to accelerate the convergence of our proposed algorithm is to use the “sparse plus low-rank” structure [15]. Step 4 of the \(l_{q}\)iAPG algorithm performs an inexact proximal step. Obviously, for any \(t > 1\), two cases can occur. When \(X_{t}\) is equal to \(Z_{t}\), \(Y_{t + 1} = X_{t} + \left( \frac{\theta _{t} - 1}{\theta _{t + 1}}\right) (X_{t} - X_{t - 1})\). Therefore, step 1 of Algorithm 4 gives

$$\begin{aligned} G& {} = Y_{t + 1} - \mu {\mathscr {P}}_{\varOmega }(Y_{t + 1} - M) = \left( 1 + \frac{\theta _{t} - 1}{\theta _{t + 1}}\right) X_{t} \nonumber \\& \quad - \left( \frac{\theta _{t} - 1}{\theta _{t + 1}}\right) X_{t - 1} - \mu {\mathscr {P}}_{\varOmega }(Y_{t + 1} - M). \end{aligned}$$
(29)

The first two terms are low-rank matrices, while the last term is a sparse matrix. This kind of “sparse plus low-rank” structure can speed up matrix multiplications. More precisely, for any \(V \in {\mathbb {R}}^{n \times k}\), GV can be obtained as

$$\begin{aligned} GV& {} = (1 + \alpha _{t})U_{t}\varSigma _{t}(V_{t}^{T}V) - \alpha _{t}U_{t - 1}\varSigma _{t - 1}(V_{t - 1}^{T}V) \nonumber \\&\quad + \mu {\mathscr {P}}_{\varOmega }(M - Y_{t + 1})V, \end{aligned}$$
(30)

where \(\alpha _{t} = \frac{\theta _{t} - 1}{\theta _{t + 1}}\). Similarly, for any \(U \in {\mathbb {R}}^{m \times k}\), \(U^{T}G\) can be obtained as

$$\begin{aligned} U^{T}G & {} = (1 + \alpha _{t})(U^{T}U_{t})\varSigma _{t}V_{t}^{T} - \alpha _{t}(U^{T}U_{t - 1})\varSigma _{t - 1}V_{t - 1}^{T}\nonumber \\& \quad + \mu U^{T}{\mathscr {P}}_{\varOmega }(M - Y_{t + 1}). \end{aligned}$$
(31)

When \(X_{t} = X_{t - 1}\), \(Y_{t + 1}\) becomes \(X_{t} + \left( \frac{\theta _{t}}{\theta _{t + 1}}\right) (Z_{t} - X_{t})\). Thus, step 1 of Algorithm 4 gives

$$\begin{aligned} G & {} = Y_{t + 1} - \mu {\mathscr {P}}_{\varOmega }(Y_{t + 1} - M) = (1 - \alpha _{t})X_{t} \nonumber \\& \quad + \alpha _{t}Z_{t} - \mu {\mathscr {P}}_{\varOmega }(Y_{t + 1} - M), \end{aligned}$$
(32)

where \(\alpha _{t} = \frac{\theta _{t}}{\theta _{t + 1}}\). In the same way as above, we have

$$\begin{aligned} GV & {} = (1 - \alpha _{t})U_{t}\varSigma _{t}(V_{t}^{T}V) + \alpha _{t}U_{Z_{t}}\varSigma _{Z_{t}}(V_{Z_{t}}^{T}V)\nonumber \\& \quad + \mu {\mathscr {P}}_{\varOmega }(M - Y_{t + 1})V, \end{aligned}$$
(33)

and

$$\begin{aligned} U^{T}G& {} = (1 - \alpha _{t})(U^{T}U_{t})\varSigma _{t}V_{t}^{T} + \alpha _{t}(U^{T}U_{Z_{t}})\varSigma _{Z_{t}}V_{Z_{t}}^{T} \nonumber \\&\quad + \mu U^{T}{\mathscr {P}}_{\varOmega }(M - Y_{t + 1}). \end{aligned}$$
(34)
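To illustrate how this structure is exploited, the product GV of Eq. (30) can be computed without ever forming G, as in the sketch below; the argument names are our own, and the sparse term \(\mu {\mathscr {P}}_{\varOmega }(M - Y_{t + 1})\) is assumed to be stored in a sparse format (e.g. a scipy.sparse matrix).

```python
import numpy as np

def fast_GV(U_t, s_t, V_t, U_p, s_p, V_p, S_sparse, alpha, V):
    """Fast product GV for G = (1 + alpha) U_t diag(s_t) V_t^T - alpha U_p diag(s_p) V_p^T + S_sparse."""
    out = (1.0 + alpha) * ((U_t * s_t) @ (V_t.T @ V))   # low-rank term from X_t
    out -= alpha * ((U_p * s_p) @ (V_p.T @ V))          # low-rank term from X_{t-1}
    out += S_sparse @ V                                 # sparse term, O(|Omega| k) work
    return out
```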

It should be noted that the \(l_{q}\)iAPG algorithm is different from the \(l_{q}\)PG algorithm. First, the \(l_{q}\)iAPG algorithm employs closed-form thresholding formulas, whereas the \(l_{q}\)PG algorithm uses the general q-thresholding function, which is typically evaluated by numerical methods and yields only approximate solutions. Besides, the \(l_{q}\)iAPG algorithm allows an inexact proximal step (Step 4), which makes the algorithm more efficient and faster. Moreover, the \(l_{q}\)iAPG algorithm uses a more robust acceleration scheme, which involves \(X_{t}\), \(X_{t - 1}\) and \(Z_{t}\). Finally, the \(l_{q}\)iAPG algorithm exploits the “sparse plus low-rank” structure to speed up matrix multiplications.

The convergence of F(X) is shown as follows:

Theorem 3

Let \(\{X_{t}\}\) be the sequence generated by the \(l_{q}\)iAPG algorithm. Then

  (i) \(\{X_{t}\}\) is a minimization sequence and bounded, and has at least one accumulation point;

  (ii) \(F(X_{t})\) converges to \(F(X_{*})\), where \(X_{*}\) is any accumulation point of \(\{X_{t}\}\).

Proof

  (i) According to the control condition (23) in the \(l_{q}\)iAPG algorithm, for any \(t = 1, 2, \ldots \), we have

    $$\begin{aligned} F(X_{t}) \le F(X_{t - 1}) \le \cdots \le F(X_{0}). \end{aligned}$$
    (35)

    Thus, \(\{X_{t}\}\) is a minimization sequence of F(X), and \(\{F(X_{t})\}\) is bounded. Since \(\{X_{t}\} \subset \{X: F(X) \le F(X_{0})\}\), \(\{X_{t}\}\) is bounded and has at least one accumulation point.

  (ii) From the above, \(F(X_{t})\) converges to some constant \(F_{*}\). Let \(X_{*}\) be an accumulation point of \(\{X_{t}\}\). Using the continuity of F(X), we obtain \(F(X_{t}) \rightarrow F_{*} = F(X_{*})\) as \(t \rightarrow \infty \). We complete the proof.□

4 Numerical experiments

In this section, we validate our proposed \(l_{q}\)iAPG algorithm for matrix completion problems by conducting a series of experiments. We compare our proposed method with the following state-of-the-art matrix completion algorithms.

  (1) Three nuclear norm minimization algorithms: APGL [13], AIS-Impute [34], and Active [40];

  (2) Two low-rank matrix decomposition-based methods: low-rank matrix fitting (LMaFit) [43] and the alternating steepest descent algorithm (ASD) [45];

  (3) Two methods for solving models with Schatten-q regularizers: HFPA [29] and the multi-Schatten-q norm surrogate (MSS) [46];

  (4) Three methods for solving models with nonconvex low-rank regularizers: the iterative reweighted nuclear norm (IRNN) [47] algorithm, FaNCL [35], and niAPG [37].

We also tested the singular value projection (SVP) [48], rank-one matrix pursuit (R1MP) [44], iteratively reweighted least squares (IRucLp) [49], and Soft-AIS [50] methods. However, in the following tests these methods are either slow or require large memory, so their results are not reported here.

In the following experiments, we follow the recommended parameter settings for these algorithms. For our proposed algorithm, we set \({\bar{\lambda }} = 10^{-4}\), \(\mu = 1.99\), and \(\eta = 0.75\). To demonstrate the effectiveness of our proposed algorithm, we consider three cases: synthetic data, image recovery and recommendation problems. Besides, all the algorithms are stopped when the difference in objective values between consecutive iterations becomes smaller than \(10^{-5}\). All the algorithms are implemented in MATLAB R2014a on a Windows Server 2008 system with an Intel Xeon E5-2680-v4 CPU (3 cores, 2.4 GHz) and 256 GB memory.

4.1 Synthetic data

The test matrix \(M \in {\mathbb {R}}^{m \times n}\) with rank r is generated as \(M = M_{L}M_{R} + N\), where the entries of the random matrices \(M_{L} \in {\mathbb {R}}^{m \times r}\) and \(M_{R} \in {\mathbb {R}}^{r \times n}\) are sampled i.i.d. from the standard normal distribution \({\mathscr {N}}(0, 1)\), and the entries of N are sampled from \({\mathscr {N}}(0, 0.1)\). Without loss of generality, we set \(m = n\) and \(r = 5\). We then sample a subset \(\varOmega \) of p entries uniformly at random as the observations, where \(p = 2mr\log (m)\).

Similar to [35], we evaluate the recovery performance of the algorithms based on: i) the normalized mean squared error NMSE = \(\| {\mathscr {P}}_{\varOmega ^{\perp }}(X - UV)\| _{F}/\| {\mathscr {P}}_{\varOmega ^{\perp }}(UV)\| _{F}\), where X is the recovered matrix and \(\varOmega ^{\perp }\) stands for the unobserved positions; ii) the rank of X; and iii) the running time. We vary m in the range \(\{1000, 2000, 3000, 5000\}\). For each algorithm, we report its average NMSE, rank and running time over 10 runs.
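A sketch of this synthetic setup and of the NMSE computation is given below; the random seed, the helper names, and the reading of \({\mathscr {N}}(0, 0.1)\) as a standard deviation of 0.1 are our own assumptions.

```python
import numpy as np

def make_synthetic(m, n, r, noise_std=0.1, seed=0):
    """Generate M = M_L M_R + N and a uniform observation mask with p = 2*m*r*log(m) entries."""
    rng = np.random.default_rng(seed)
    M_L = rng.standard_normal((m, r))
    M_R = rng.standard_normal((r, n))
    ground_truth = M_L @ M_R
    M = ground_truth + noise_std * rng.standard_normal((m, n))
    p = int(2 * m * r * np.log(m))
    mask = np.zeros(m * n, dtype=bool)
    mask[rng.choice(m * n, size=p, replace=False)] = True   # observed positions Omega
    return M, mask.reshape(m, n), ground_truth

def nmse(X, ground_truth, mask):
    """NMSE over the unobserved positions, as defined in the text."""
    unobs = ~mask
    return np.linalg.norm((X - ground_truth)[unobs]) / np.linalg.norm(ground_truth[unobs])
```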

Table 1 Comparison of different algorithms on synthetic data, NMSE is scaled by \(10^{-2}\). CPU time is in seconds, and “sr” denotes the sample ratio

The average NMSE, rank, and running time are reported in Table 1. The results in Table 1 demonstrate that our proposed \(l_{q}\)iAPG is a competitive algorithm. More precisely, the \(l_{q}\)iAPG algorithm runs the fastest among these algorithms. In terms of accuracy, \(l_{q}\)iAPG attains satisfactory performance; from Table 1 we can see that the \(l_{q}\)iAPG algorithm achieves the most accurate solutions in nearly all problems. We also observe that as the size of the matrix increases, the speed advantage of the \(l_{q}\)iAPG algorithm over the other algorithms becomes more pronounced. Moreover, the \(l_{q}\)iAPG algorithm is able to solve large-scale random matrix completion problems. Specifically, the \(l_{q}\)iAPG algorithm solves the problem with \(m = 10^{5}\) and \(sr = 0.12\% \) within 1159.2 s (with NMSE smaller than 0.0141), while most other algorithms cannot obtain satisfactory results within this time. Therefore, taking both accuracy and convergence speed into consideration, our proposed \(l_{q}\)iAPG algorithm has the best recovery performance among these algorithms.

4.2 Image recovery

In this section, we apply the \(l_{q}\)iAPG algorithm to image inpainting problems, in which the values of some of the pixels of an image are missing and the task is to recover them. If the image is low rank or numerically low rank, we can solve the image inpainting problem as a matrix completion problem. In these tests, we use the following benchmark images: Barbara, Bridge, Clown, Couple, Crowd, Fingerprint, Girlface, Houses, Kiel, Lighthouse, Tank, Truck, Trucks, and Zelda. The size of each image is \(512 \times 512\). We deal directly with the original images. As the image matrix is not guaranteed to be low rank, we set the rank of the estimated matrix to 50. We randomly exclude \(50\%\) of the pixels in the images, and the remaining ones are used as the observations. We use the peak signal-to-noise ratio (PSNR) [51] and the running time to evaluate the recovery performance of the algorithms, and report average results over 5 runs.

Table 2 Comparison of different algorithms on benchmark images. CPU time is in seconds
Fig. 1
figure 1

Results of image recovery by using different algorithms

We list the results in terms of PSNR in Table 2 and exhibit the recovery results of the different algorithms for Zelda in Fig. 1. The results in Fig. 1 demonstrate that our proposed \(l_{q}\)iAPG algorithm performs best among the 12 algorithms. We can also easily see from Table 2 that the \(l_{q}\)iAPG, niAPG, HFPA, and ASD algorithms achieve the best results. Although the results obtained by the HFPA and ASD algorithms are slightly better than those of \(l_{q}\)iAPG, our algorithm is much faster than HFPA and ASD. In addition, although the niAPG algorithm is slightly faster than our algorithm, the solutions obtained by our algorithm are more accurate. More precisely, the \(l_{q}\)iAPG algorithm achieves the most accurate solution on 7 of the images, while niAPG does so on only 4. Again, taking both accuracy and convergence speed into consideration, our proposed \(l_{q}\)iAPG algorithm is a competitive algorithm in the field of image recovery.

4.3 Recommendation

Table 3 Characteristics of the recommendation datasets
Table 4 Comparison of different algorithms on Jester and MovieLens datasets. CPU time is in seconds

To further demonstrate the effectiveness of our proposed method, in this section we apply the \(l_{q}\)iAPG algorithm to the Jester and MovieLens datasets. We consider six datasets: Jester1, Jester2, Jester3, Jester-all, MovieLens-100K, and MovieLens-1M. The characteristics of these datasets are shown in Table 3. The Jester datasets are collected from a joke recommendation system. The whole data is stored in three Excel files with the following characteristics [13]:

  (1) jester-1: 24,983 users who have rated 36 or more jokes;

  (2) jester-2: 23,500 users who have rated 36 or more jokes;

  (3) jester-3: 24,938 users who have rated between 15 and 35 jokes.

The MovieLens datasets are collected from the MovieLens website. The characteristics of these datasets are listed as follows [13]:

  (1) movie-100K: 100,000 ratings for 1682 movies by 943 users;

  (2) movie-1M: 1 million ratings for 3900 movies by 6040 users.

The Jester-all dataset is obtained by combining the Jester1, Jester2, and Jester3 datasets. In the following tests, we follow the setup in [35] and randomly pick \(50\%\) of the observed entries for training, using the remaining \(50\%\) for testing. We use the root mean squared error (RMSE) and the running time to evaluate the recovery performance of the algorithms. The RMSE is defined as \({\hbox {RMSE}} = \sqrt{\| {\mathscr {P}}_{{\bar{\varOmega }}}(X - M)\| _{F}^{2}/|{\bar{\varOmega }}|}\), where \({\bar{\varOmega }}\) is the test set and X is the recovered matrix. Each test is repeated 5 times.

Fig. 2
figure 2

Matrix completion results on Jester and MovieLens datasets by using different algorithms. In the first and second rows, we apply different algorithms on Jester datasets and depict the RMSE along the CPU times. In the last row, we apply different algorithms on MovieLens datasets and depict the RMSE along the CPU times

The reconstruction results in terms of RMSE and running time are listed in Table 4. We also depict the RMSE against CPU time in Fig. 2. As can be seen from Table 4, our proposed algorithm, APGL, and niAPG achieve the lowest RMSE in nearly all problems. We also observe that the results obtained by APGL are slightly better than ours, but our method is much faster. Besides, our proposed algorithm runs on all six datasets, while many algorithms can only handle part of them. Furthermore, Fig. 2 demonstrates that our proposed algorithm decreases the RMSE much faster than the others. This again shows that our proposed algorithm is a competitive algorithm in the field of matrix completion.

5 Conclusion

In this paper, we focused on large-scale low-rank matrix completion problems with \(l_{q}\) regularizers and proposed an efficient and fast inexact thresholding algorithm, called the \(l_{q}\)iAPG algorithm, to handle such problems. The key idea is to employ the \(l_{q}\) quasi-norm with closed-form q-thresholding operators to approximate the rank of a matrix and the power method to approximate the SVD procedure. At the same time, our proposed algorithm inherits the efficiency advantages of first-order gradient-based methods and is simple and easy to use, which makes it well suited for large-scale matrix completion problems. In addition, we adopted three techniques to accelerate the convergence of our proposed algorithm: Nesterov’s accelerated gradient method, the continuation technique, and the “sparse plus low-rank” structure. Furthermore, a convergence analysis of the \(l_{q}\)iAPG algorithm showed that the sequence \(\{X_{t}\}\) generated by the \(l_{q}\)iAPG algorithm is bounded and has at least one accumulation point. More importantly, we also showed that the objective function F(X) converges to \(F(X_{*})\), where \(X_{*}\) is any accumulation point of \(\{X_{t}\}\). Finally, extensive experiments on matrix completion problems validated that our proposed algorithm is more efficient and faster. Specifically, we compared our proposed algorithm with state-of-the-art algorithms in a series of scenarios, including synthetic data, image recovery and recommendation problems. All results demonstrated that our proposed algorithm is able to achieve comparable recovery performance, while being faster and more efficient than state-of-the-art methods.