1 Introduction

The problem of matrix completion refers to finding a low-rank matrix when only a small fraction of its elements is known.

The ‘textbook’ application of this problem lies in the field of recommender systems [1]; however, matrix completion algorithms have recently found many other applications, including machine learning [2, 3], signal processing [4, 5] and genomic data integration [6].

Mathematically, such a problem can be formulated in different ways, resulting in problems and algorithms with different properties.

Consider an operator \(\mathcal{A}_{\Omega }: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) of the form

$$\begin{aligned} \mathcal{A}_{\Omega }(X)_{ij} = {\left\{ \begin{array}{ll} \frac{X_{ij}}{\sqrt{\rho }}, \, (i,j) \in \Omega \\ 0, \, (i,j) \notin \Omega \\ \end{array}\right. } \end{aligned}$$

where \(\Omega \subset {\mathbb {N}}_n \times {\mathbb {N}}_n\) is the subset of indices that defines the mask of matrix elements given as input, and \(\rho = \frac{\# \Omega }{n^2}\) is the fraction of known elements.

Using this definition, one possible mathematical formulation of the matrix completion problem as an optimization problem is the following:

$$\begin{aligned} rank(X)&\rightarrow \min , \\ \mathcal{A}(X)&= \mathcal{A}(X_*). \end{aligned}$$

This problem is NP-hard in general [7], and a common technique is to replace the functional to be minimized by its convex envelope, the nuclear norm. The functional can then be optimized in polynomial time, and furthermore, this approach allows one to bound the number of matrix elements sufficient for completion [8]. This number grows with the matrix size n as \(O(n \log ^2 n)\), and the bound requires certain assumptions on the rank and the non-sparsity of the unknown matrix.

However, the computational complexity of such an approach remains high, as convex optimization is applied with the whole matrix unknown, resulting in \(O(n^2)\) variables to be optimized. In order to cope with this problem, alternative ‘fixed-rank’ approaches have also been studied, which consider the following optimization problem:

$$\begin{aligned} \Vert \mathcal{A}(X) - \mathcal{A}(X_*)\Vert _F^2&\rightarrow \min , \nonumber \\ rank(X)&\le r. \end{aligned}$$
(1)

The fixed-rank approach allows one either to search for the unknown matrix in a factorized form using gradient-based optimization [9, 10], or to use techniques based on optimization on algebraic manifolds [11,12,13,14]. The latter approach admits optimization methods with faster than linear (e.g. second order) convergence, but has initial point restrictions and commonly lacks a direct estimate of the number of matrix elements required; the former approach allows one to build geometrically convergent methods that do not have strong initial point requirements, and keeps the same estimate of \(O(n \log ^2 n)\) matrix elements being sufficient for successful completion. Furthermore, recent results [15, 16] have shown that it is possible to substantially reduce the computational complexity of such methods while maintaining provable geometric convergence.

In this paper, one of the factorized-form iterative gradient-based optimization algorithms, called ‘Singular Value Projection’ (SVP) [9], will be considered in a special setting with erroneous input data: it will be assumed that a small fraction of the given element values may be subject to errors. The theoretical bounds developed for the original SVP method require the unknown matrix and the residual at each iteration to be non-sparse (incoherent). Thus, the original SVP method, if applied directly to a mask containing erroneous elements, will generally not converge to the unaffected matrix.

The main result of this paper is an algorithm that is able to solve the completion problem with sparse errors. An alternative procedure, which introduces two independent completion ‘masks’ and searches for the most ‘aligned’ parts of the resulting matrices, is proposed and analyzed theoretically. The proposed procedure can also be formulated as a ‘low-rank plus sparse’ approximation algorithm with provable theoretical bounds under the assumptions commonly used in the analysis of matrix completion algorithms.

The organization of the paper is as follows. The second section contains a brief discussion of the ‘Restricted Isometry Property’ (RIP), canonical angles between subspaces, and SVD perturbations, which are used widely throughout the paper. In the third section, the matrix completion SVP method is discussed. Theoretical convergence results, which are close to those from [9, 15] but provide a direct dependence on the problem condition number, are given in order to form a basis for the new algorithm. These results guarantee geometric convergence of the method in the case when \(\mathcal{A}\) is a general operator that satisfies the RIP. In the fourth section, a novel ‘Twin Completion’ approach, which is able to solve the matrix completion problem in the setting of sparse measurement errors, is proposed and analyzed. In the fifth section, numerical results are provided that show the effectiveness of the proposed method.

2 Required concepts

In this section, we discuss some linear algebra and matrix completion concepts that will be used throughout the rest of the paper.

2.1 Restricted isometry property

As matrix completion problems, if posed directly, are generally NP-hard [7], additional constraints are commonly imposed on the functional (1) and/or its solutions. One way to do so is to suppose that the linear operator \(\mathcal{A}\) fulfills the so-called restricted isometry property (RIP), which means that it approximately preserves the norm of the input.

Definition 2.1

(‘Restricted isometry property’, RIP) A linear operator \(\mathcal{A}: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) is said to satisfy the restricted isometry property with constant \(\delta _r\) if

$$\begin{aligned} (1 - \delta _r) \Vert X \Vert _F^2 \le \Vert \mathcal{A}(X) \Vert _F^2 \le (1 + \delta _r) \Vert X \Vert _F^2, \; rank(X) \le r \end{aligned}$$
(2)

Approximate relative norm preservation necessarily implies approximate absolute (additive) preservation of scalar products, which is shown in the following Lemma using the standard expression of the scalar product in terms of the corresponding norm:

Lemma 2.1

If a linear operator \(\mathcal{A}: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) satisfies the restricted isometry property (2) for all matrices \(Z: rank(Z) \le 2r\), then

$$\begin{aligned}&(X, Y)_F - \delta _{2r} \Vert X\Vert _F \Vert Y\Vert _F \le (\mathcal{A}(X), \mathcal{A}(Y))_F \le (X, Y)_F + \delta _{2r} \Vert X\Vert _F \Vert Y\Vert _F, \\&X, Y \in {\mathbb {R}}^{n \times n}: rank(X) \le r, rank(Y) \le r. \end{aligned}$$

Proof

As the Frobenius scalar product is induced by the Frobenius norm, by the polarization identity

$$\begin{aligned} (X, Y)_F&= \frac{1}{4} (\Vert X + Y \Vert _F^2 - \Vert X - Y \Vert _F^2), \\ (\mathcal{A}(X), \mathcal{A}(Y))_F&= \frac{1}{4} (\Vert \mathcal{A}(X + Y) \Vert _F^2 - \Vert \mathcal{A}(X - Y) \Vert _F^2). \end{aligned}$$

Now, if \(\Vert X\Vert _F = \Vert Y\Vert _F = 1\), then

$$\begin{aligned} |(\mathcal{A}(X), \mathcal{A}(Y))_F - (X, Y)_F |&\le \frac{1}{4} |\Vert \mathcal{A}(X + Y)\Vert _F^2 - \Vert X + Y\Vert _F^2 |+ \nonumber \\&\quad + \frac{1}{4} |\Vert \mathcal{A}(X - Y)\Vert _F^2 - \Vert X - Y\Vert _F^2 |\nonumber \\&\le \frac{\delta _{2r}}{4} \Vert X + Y \Vert _F^2 + \frac{\delta _{2r}}{4} \Vert X - Y \Vert _F^2 \nonumber \\&= \frac{\delta _{2r}}{4} \left( 2 \Vert X \Vert _F^2 + 2 \Vert Y \Vert _F^2 \right) = \delta _{2r}. \end{aligned}$$
(3)

where the last equality follows from the parallelogram law \(\Vert X + Y\Vert _F^2 + \Vert X - Y\Vert _F^2 = 2\Vert X\Vert _F^2 + 2\Vert Y\Vert _F^2\) and \(\Vert X\Vert _F = \Vert Y\Vert _F = 1\). In the general case with arbitrary values of \(\Vert X\Vert _F, \Vert Y\Vert _F\), the proof follows from applying the considerations above to \(\frac{X}{\Vert X \Vert _F}, \frac{Y}{\Vert Y \Vert _F}\) and scaling the result (3) by \(\Vert X \Vert _F \Vert Y \Vert _F\). \(\square \)

While it is possible to provide an example of such an operator \(\mathcal{A}: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) with an image dimensionality of \(O(n^2)\), it is unclear whether such operators with smaller image dimensionalities \(m \ll n^2\) exist at all. However, certain additional limitations on the set of low-rank matrices on which the RIP constraint must hold allow one to build examples of such operators \(\mathcal{A}\) with low image dimensionality [9, 15]. An important example is the so-called ‘matrix completion’ operator.

An operator \(\mathcal{A}_{\Omega } \in L({\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n})\) is called a random ‘matrix completion’ operator if it has the form

$$\begin{aligned} \mathcal{A}_{\Omega }(X)_{ij} = {\left\{ \begin{array}{ll} X_{ij} / \sqrt{\rho }, (i, j) \in \Omega \\ 0, (i, j) \notin \Omega \\ \end{array}\right. } \end{aligned}$$

where \(\Omega \) is a set of indices corresponding to the known elements of a matrix, and \(\rho = \frac{|\Omega |}{n^2}\) is the sparsity parameter. Such an operator, under a minimal sparsity requirement, is known to satisfy the RIP on a subset of low-rank matrices.
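For concreteness, a minimal NumPy sketch of such a masked-and-scaled operator is given below (the function and variable names are illustrative and are not taken from any reference implementation):

```python
import numpy as np

def completion_op(X, mask, rho):
    """A_Omega: keep the masked entries of X scaled by 1/sqrt(rho), zero out the rest.
    With respect to the Frobenius inner product this operator is self-adjoint, so
    A_Omega^* A_Omega (X) equals X / rho on the mask and 0 elsewhere."""
    return np.where(mask, X / np.sqrt(rho), 0.0)

# Example: a random boolean mask whose expected fraction of known entries is rho.
rng = np.random.default_rng(0)
n, rho = 200, 0.3
mask = rng.random((n, n)) < rho
X = rng.standard_normal((n, n))
AX = completion_op(X, mask, rho)
```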

The subset of low-rank matrices on which the RIP holds is defined using the so-called matrix \(\mu \)-incoherence condition, essentially a ‘non-sparsity’ condition of the following form:

Definition 2.2

(\(\mu \)-incoherent matrix) A matrix \(X, \; rank(X) = r\) is called \(\mu \)-incoherent, if the SVD factors of this matrix satisfy

$$\begin{aligned}&|U_{ij} |\le \frac{\sqrt{\mu }}{\sqrt{n}}, |V_{ij} |\le \frac{\sqrt{\mu }}{\sqrt{n}}, \nonumber \\&X = U \Sigma V^*, U,V \in {\mathbb {R}}^{n \times r}. \end{aligned}$$
(4)
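As a quick illustration, the smallest \(\mu \) for which a given rank-r matrix satisfies (4) can be read off from its thin SVD; the NumPy sketch below does exactly this (the helper name and the random test matrix are ours, for illustration only):

```python
import numpy as np

def incoherence(X, r):
    """Smallest mu such that the rank-r SVD factors of X satisfy
    |U_ij| <= sqrt(mu / n) and |V_ij| <= sqrt(mu / n), cf. Definition 2.2."""
    n = X.shape[0]
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    return n * max(np.abs(U).max(), np.abs(V).max()) ** 2

rng = np.random.default_rng(1)
n, r = 300, 5
# Gaussian factors give well-spread (incoherent) singular vectors with high probability.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
print(incoherence(X, r))   # typically a small multiple of log(n)
```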

Theorem 2.1

(Theorem 4.2, [9]) There exists a constant \(C_{RIP} \ge 0\) such that for any \(0< \delta _r < 1,\) any \(\mu \ge 1, n \ge 3,\) and a sparsity parameter

$$\begin{aligned} \rho \ge C_{RIP} \frac{\mu ^2 r^2}{\delta _r^2} \frac{\log (n)}{n}, \end{aligned}$$

a matrix completion operator \(\mathcal{A}_{\Omega }\) with \(\Omega \) selected randomly using uniform index distribution

$$\begin{aligned} {\mathbb {P}} \{ (i, j) \in \Omega \} = \rho , \forall i,j, \end{aligned}$$

with a probability not less than \(1 - \exp (-n \log (n))\) satisfies RIP with parameter \(\delta _r\) on all \(\mu \)-incoherent matrices of rank not larger than r.
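The statement can be illustrated numerically: for a random matrix with incoherent factors and a sufficiently dense random mask, the ratio \(\Vert \mathcal{A}_{\Omega }(X)\Vert _F^2 / \Vert X\Vert _F^2\) concentrates around 1. A small self-contained check (the sizes and constants below are arbitrary and are not the constants of the theorem):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, rho = 500, 3, 0.2

# A random matrix with Gaussian factors is mu-incoherent with high probability.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

ratios = []
for _ in range(20):
    mask = rng.random((n, n)) < rho              # P{(i,j) in Omega} = rho
    AX = np.where(mask, X / np.sqrt(rho), 0.0)   # A_Omega(X)
    ratios.append(np.linalg.norm(AX) ** 2 / np.linalg.norm(X) ** 2)

print(min(ratios), max(ratios))   # both close to 1, i.e. a small empirical delta_r
```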

Here and throughout the paper the logarithm is assumed to be natural (with base e), although this is mostly insignificant for the asymptotic analysis. It is notable that the incoherence property can be translated into an elementwise bound on the matrix using the following Lemma.

Lemma 2.2

If X is a rank-r matrix that satisfies the incoherence properties (4) with constant \(\mu \), then its elementwise maximum (Chebyshev) norm \(\Vert X \Vert _C := \max _{i,j} |X_{ij} |\) satisfies

$$\begin{aligned} \Vert X \Vert _C \le \frac{\mu \sqrt{r} \Vert X \Vert _F}{n}. \end{aligned}$$

Proof

The proof is given in [9], Lemma 4.7. \(\square \)

2.2 Angles between subspaces

In this paper, we are going to analyze the convergence of matrix completion methods using the idea of singular vector subspace stability under almost orthogonal perturbations. The analysis will require the following common linear algebra concepts of canonical angles and vectors between linear subspaces. Consider two linear subspaces \(\mathcal{L} \subset {\mathbb {R}}^n\) and \(\mathcal{M} \subset {\mathbb {R}}^n\) such that \(\dim (\mathcal{L}) = \dim (\mathcal{M}) = r\). Let \(P_\mathcal{L}, P_\mathcal{M}\) denote the corresponding orthogonal projectors. Then [17, 18],

  • The singular values \(\sigma _k\) of the product \(P_\mathcal{L} P_\mathcal{M}\) belong to the interval [0, 1]. The left and right sorted singular vector pairs of \(P_\mathcal{L} P_\mathcal{M}\) are called the ‘canonical vectors’ between subspaces \(\mathcal{L}, \mathcal{M}\). The sorted values \(\phi _k \in [0, \frac{\pi }{2}]\), such that \(\cos \phi _k = \sigma _k(P_\mathcal{L} P_\mathcal{M})\), are known as the canonical angles between subspaces \(\mathcal{L}, \mathcal{M}\).

  • The first canonical vector pair between subspaces has the meaning of the pair of most correlated unit vectors of these subspaces:

    $$\begin{aligned} \{\vec {x}_1, \vec {y}_1\} = \arg \max _{\vec {x} \in \mathcal{L}, \vec {y} \in \mathcal{M}} (\vec {x}, \vec {y})_2. \end{aligned}$$

    All the other canonical vector pairs can be expressed iteratively with

    $$\begin{aligned} \{\vec {x}_k, \vec {y}_k\} = \arg \max _{\vec {x} \in \mathcal{L}, x \perp x_1 \dots x_{k-1}; \vec {y} \in \mathcal{M}, y \perp y_1 \dots y_{k - 1}} (\vec {x}, \vec {y})_2, k \le r. \end{aligned}$$


  • The eigenvalues of the symmetric matrix \(P_\mathcal{L} - P_\mathcal{M}\) are equal to \(\pm \sin \phi _k\) (at most 2r nonzero values in total), where \(\phi _k\) denote the canonical angles between \(\mathcal{L}, \mathcal{M}\).
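These quantities are straightforward to compute in practice: if \(Q_\mathcal{L}, Q_\mathcal{M}\) are orthonormal bases of the two subspaces, the nonzero singular values of \(P_\mathcal{L} P_\mathcal{M}\) coincide with the singular values of \(Q_\mathcal{L}^* Q_\mathcal{M}\). A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def canonical_angles(A, B):
    """Canonical angles between span(A) and span(B); A, B are n x r with full column rank."""
    QA, _ = np.linalg.qr(A)
    QB, _ = np.linalg.qr(B)
    # Singular values of QA^* QB are the cosines of the canonical angles (sorted descending).
    cosines = np.linalg.svd(QA.T @ QB, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(3)
n, r = 50, 4
L = rng.standard_normal((n, r))
M = L + 0.1 * rng.standard_normal((n, r))   # a small perturbation of span(L)
print(canonical_angles(L, M))               # all angles are small
```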

2.3 SVD perturbations

Now, let us recall a singular basis additive perturbation bound proved by Wedin, which is close to the well-known Davis–Kahan bound. The theorem essentially guarantees that if the perturbation is almost orthogonal to both the left and the right singular subspaces of the original matrix, these subspaces do not change much after the perturbation.

Throughout the paper, \(P_U := UU^*\) denotes the orthogonal projection onto the subspace spanned by the orthogonal columns of U, and \(P_r(X) \in {\mathbb {R}}^{n \times n}\) denotes the optimal SVD-projection of a matrix \(X \in {\mathbb {R}}^{n \times n}\) onto the set of matrices with rank not larger than r.

Lemma 2.3

(Wedin, [19, 20]) Let \(X \in {\mathbb {R}}^{n \times n}\) be a rank-r matrix, and \(X = U \Sigma _r V^*\) be its short SVD with \(U, V \in {\mathbb {R}}^{n \times r}\). Let \(E \in {\mathbb {R}}^{n \times n}\) be an additive error matrix that fulfills \(\max \{\Vert P_U E \Vert _F, \Vert E P_V \Vert _F \} \le \eta \). Let \(\gamma \in {\mathbb {R}}\) be a positive constant.

  • If \(\gamma \le \sigma _r(X) - \Vert E \Vert _2\), and \(P_r(X + E) = {\hat{U}}_r \hat{\Sigma }_r {\hat{V}}_r^*\), then

    $$\begin{aligned} \max \{\Vert P_U - P_{{\hat{U}}_r} \Vert _F, \Vert P_V - P_{{\hat{V}}_r} \Vert _F \} \le 2 \frac{\eta }{\gamma }. \end{aligned}$$
  • More generally, if there exist size-r singular vector bases \({\hat{U}}, {\hat{V}}\) of \(X + E\) (which do not necessarily correspond to the top-r singular values), and \(\gamma \le \min _{i \le r,j} |\sigma _i(X) - \sigma _j((I - P_{{\hat{U}}}) (X + E) (I - P_{{\hat{V}}})) |\), then

    $$\begin{aligned} \max \{\Vert P_U - P_{{\hat{U}}} \Vert _F, \Vert P_V - P_{{\hat{V}}} \Vert _F \} \le 2 \frac{\eta }{\gamma }. \end{aligned}$$

Proof

The statements are proved in [20] and also discussed in [19]. The claims are obtained by plugging into Theorem 4, [19] the notations \(\Sigma _r \leftrightarrow \tilde{\Sigma }_1\), \(\sigma ((I - P_{{\hat{U}}}) (X + E) (I - P_{{\hat{V}}})) \leftrightarrow \sigma (\Sigma _2)\), \(U_r^* E \leftrightarrow S, E V_r \leftrightarrow R\). \(\square \)

Now, let us prove an additional Lemma that uses the above theorem for the particular case of a low-rank matrix perturbation.

Lemma 2.4

Let \(X \in {\mathbb {R}}^{n \times n}\) be a rank-r matrix, and \(U \Sigma V^*\) be its short SVD such that \(U, V \in {\mathbb {R}}^{n \times r}\). Let \(E \in {\mathbb {R}}^{n \times n}\) be an additive error matrix that fulfills \(\max \{\Vert P_U E \Vert _F, \Vert E P_V \Vert _F \} \le \eta \), \(\Vert E \Vert _2 \le \frac{\sigma _r(X)}{2}\).

Let \(\kappa := \frac{\sigma _1(X)}{\sigma _r(X)}\) be the ‘problem condition number’. Then,

$$\begin{aligned} \Vert P_r(X + E) - X \Vert _F \le \eta (8 \kappa + 3). \end{aligned}$$

Proof

Defining \(P_r(X + E) := {\tilde{U}} \tilde{\Sigma }{\tilde{V}}^*\), \(\Delta P_U := P_{{\tilde{U}}} - P_{U}\), \(\Delta P_{V} := P_{{\tilde{V}}} - P_{V}\), where \({\tilde{U}}, {\tilde{V}} \in {\mathbb {R}}^{n \times r}\), we can use the previous Lemma to expand the product as

$$\begin{aligned} \Vert P_r(X + E) - X \Vert _F&= \Vert P_{{\tilde{U}}} (X + E) P_{{\tilde{V}}} - P_U X P_V \Vert _F \nonumber \\&= \Vert (P_U + \Delta P_U) (X + E) (P_V + \Delta P_V) - P_U X P_V \Vert _F \nonumber \\&\le \Vert (P_U + \Delta P_U) X (P_V + \Delta P_V) - P_U X P_V \Vert _F \end{aligned}$$
(5)
$$\begin{aligned}&\quad + \Vert (P_U + \Delta P_U) E (P_V + \Delta P_V) \Vert _F. \end{aligned}$$
(6)

Let us define \(\gamma := \sigma _r(X) - \Vert E \Vert _2\); as \(\Vert E \Vert _2 \le \frac{\sigma _r(X)}{2}\), we have \(\gamma \ge \frac{\sigma _r(X)}{2}\). Then, Lemma 2.3 gives the bounds \(\Vert \Delta P_U \Vert _F \le \frac{2 \eta }{\gamma }, \Vert \Delta P_V \Vert _F \le \frac{2 \eta }{\gamma }\), and we can bound the two additive terms (5), (6) by

$$\begin{aligned} \Vert (P_U + \Delta P_U) (X + E) (P_V&+ \Delta P_V) - P_U X P_V \Vert _F\\&\le \Vert P_U X \Delta P_V \Vert _F + \Vert \Delta P_U X P_{{\tilde{V}}} \Vert _F \\&\le (\Vert \Delta P_V \Vert _F + \Vert \Delta P_U \Vert _F) \Vert X\Vert _2 \\&\le \frac{4 \eta }{\gamma } \Vert X\Vert _2 \le 8 \eta \frac{\sigma _1(X)}{\sigma _r(X)} = 8 \eta \kappa \end{aligned}$$

and

$$\begin{aligned} \Vert (P_U + \Delta P_U) E (P_V + \Delta P_V)\Vert _F&\le \Vert P_U E P_{{\tilde{V}}} \Vert _F + \Vert \Delta P_U E P_{{\tilde{V}}} \Vert _F \\&\le \Vert P_U E \Vert _F + \Vert \Delta P_U \Vert _F \Vert E\Vert _2 \\&\le \eta + \frac{2 \eta }{\gamma } \Vert E\Vert _2 \le 3 \eta , \end{aligned}$$

where we used \(\Vert AB \Vert _F \le \Vert A \Vert _F \Vert B \Vert _2\) multiple times, and the last inequality follows from \(\gamma \ge \frac{\sigma _r(X)}{2} \ge \Vert E \Vert _2\). Summing the two additive terms finishes the proof. \(\square \)
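The bound of Lemma 2.4 is easy to check numerically; the sketch below (with arbitrarily chosen sizes and a noise level kept small enough that \(\Vert E\Vert _2 \le \sigma _r(X)/2\) holds) compares both sides of the inequality:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 200, 5

U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = np.linspace(10.0, 1.0, r)            # condition number kappa = 10
X = U @ np.diag(sigma) @ V.T

E = 0.01 * rng.standard_normal((n, n))       # ||E||_2 is well below sigma_r(X) / 2 here

def svd_project(Y, r):
    """P_r: best rank-r approximation via truncated SVD."""
    Uy, sy, Vyt = np.linalg.svd(Y, full_matrices=False)
    return Uy[:, :r] @ np.diag(sy[:r]) @ Vyt[:r, :]

eta = max(np.linalg.norm(U @ (U.T @ E)), np.linalg.norm((E @ V) @ V.T))
kappa = sigma[0] / sigma[-1]
lhs = np.linalg.norm(svd_project(X + E, r) - X)
print(lhs, eta * (8 * kappa + 3))            # the left value stays below the bound
```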

3 Exact restricted isometry SVP

The so-called ‘Singular Value Projection’ [9] method is essentially an iterative projected gradient method applied to the functional of the form

$$\begin{aligned} \Psi (Y)&= \Vert \mathcal{A}(Y) - \mathcal{A}(X)\Vert _F^2 \rightarrow \min , \nonumber \\ rank(Y)&\le r. \end{aligned}$$
(7)

One iteration of the method can be described with equations

$$\begin{aligned} W_k&:= X - X_k, \\ X_{k + 1}&:= P_r(X_k - \alpha \nabla \Psi (X_k)) \\&= P_r(X_k + \alpha \mathcal{A}^* \mathcal{A} (W_k)) \\&= P_r(X - W_k + \alpha \mathcal{A}^* \mathcal{A} (W_k)) \\&= P_r(X + (\alpha \mathcal{A}^* \mathcal{A} - \mathcal{I}) (W_k)); \\ X_{k + 1}&= P_r(X + E_k), \\ E_k&:= (\alpha \mathcal{A}^* \mathcal{A} - \mathcal{I}) (W_k). \end{aligned}$$

where k is the iteration number, \(X_k\) denotes the current approximation of the desired unknown matrix X, \(W_k\) denotes the current error, \(\alpha > 0\) is the step parameter, and \(\mathcal{I}: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) denotes the identity operator on the space of matrices.
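Written out for the completion operator, one SVP sweep with \(\alpha = 1\) and an exact truncated SVD takes the form sketched below; this is a minimal illustration with names chosen by us, not the authors' reference implementation:

```python
import numpy as np

def svd_project(Y, r):
    """P_r: best rank-r approximation of Y via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def svp_completion(observed, mask, r, n_iters=200):
    """SVP with step alpha = 1 for the completion operator:
    X_{k+1} = P_r(X_k + A^* A(X - X_k)), where A^* A divides the masked residual by rho."""
    rho = mask.mean()
    Xk = np.zeros_like(observed)
    for _ in range(n_iters):
        grad_step = np.where(mask, (observed - Xk) / rho, 0.0)   # A^* A (W_k)
        Xk = svd_project(Xk + grad_step, r)
    return Xk

# Example: recover a random rank-3 matrix from roughly 30% of its entries.
rng = np.random.default_rng(5)
n, r, rho = 100, 3, 0.3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < rho
X_hat = svp_completion(np.where(mask, X, 0.0), mask, r)
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))   # typically a small relative error
```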

In [9], it is proven that if the operator \(\mathcal{A}: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) satisfies the RIP property (2) with a constant \(0< \delta _{2r} < 1\) for all matrices up to rank 2r, then the SVP method with \(\alpha = \frac{1}{1 + \delta _{2r}} \approx 1\) attains geometric convergence in terms of the optimization functional (7), regardless of the initial point \(X_0\). Here, we will provide alternative convergence theorems and proofs that are based on the analysis of a low-rank SVD perturbation. In our paper, we will consider a fixed step size of \(\alpha = 1\). Our arguments are based on the following Lemma.

Lemma 3.1

(Projected \(\mathcal{A}^* \mathcal{A} - \mathcal{I}\) residual) Let \(U, V \in {\mathbb {R}}^{n \times r}\) be arbitrary bases with orthogonal columns, and \(Y \in {\mathbb {R}}^{n \times n}\), \(rank(Y) \le r\). Let \(\mathcal{A}\) be an operator that satisfies RIP with a constant \(\delta _{2r}\) on all matrices of rank up to 2r. Then

$$\begin{aligned} \max \{ \Vert P_U ((\mathcal{A}^* \mathcal{A} - \mathcal{I})Y)\Vert _F, \Vert ((\mathcal{A}^* \mathcal{A} - \mathcal{I})Y) P_V\Vert _F \} \le \delta _{2r} \Vert Y\Vert _F, \end{aligned}$$

where \(P_U, P_V\) denote orthogonal projectors \(UU^*, VV^*\) respectively.

Proof

Using that \(rank(P_U (\mathcal{A}^* \mathcal{A} - \mathcal{I})Y) \le r\), and rewriting the Frobenius norm with a scalar product maximization, we have

$$\begin{aligned} \Vert P_U (\mathcal{A}^* \mathcal{A} - \mathcal{I})Y \Vert _F&= \max _{L, R \in {\mathbb {R}}^{n \times r}, \Vert LR^* \Vert _F \le 1} (P_U (\mathcal{A}^* \mathcal{A} - \mathcal{I})Y, LR^*)_F \\&= \max _{\Vert LR^* \Vert _F \le 1} ((\mathcal{A}(Y), \mathcal{A}(P_U LR^*))_F - (Y, P_U LR^*)_F) \\&\le \delta _{2r} \Vert Y\Vert _F \max _{\Vert LR^*\Vert _F \le 1}\Vert P_U LR^*\Vert _F \\&= \delta _{2r} \Vert Y\Vert _F. \end{aligned}$$

where the inequality follows from a scalar product isometry Lemma 2.1. \(\square \)

Lemma 3.1 suggests the following idea: although the SVP iteration additive term \(E_k = (\mathcal{A}^* \mathcal{A} - \mathcal{I}) W_k\) itself may have a relatively ‘large’ norm compared to the current error \(W_k\), it is almost orthogonal to the bases of the solution X. Thus, by Lemma 2.3, the addition of \(E_k\) should have a limited impact on the top singular bases of X, and \(\Vert P_r(X + E_k) - X \Vert \ll \Vert W_k \Vert \).

Recent studies [15, 16] have shown that the cost of the most numerically complex operation of the SVP algorithm, the SVD (which takes \(O(n^3)\) operations in general), can be reduced by using an approximate partial SVD. Such an approximate decomposition can be obtained using any algorithm \({\hat{P}}_r\) that satisfies an approximation condition: for any matrix Y, \({\hat{P}}_r(Y)\) should be a rank-r matrix that in some sense approximates the actual projection \(P_r(Y)\). In [15], the following definition of approximation is introduced:

Definition 3.1

(Approximate projector, [15]) The operator \({\hat{P}}_r\) is called an \(\epsilon \)-approximate SVD projection operator if

$$\begin{aligned} \Vert {\hat{P}}_r(Y) - Y \Vert _F^2 \le (1 + \epsilon ) \Vert Y - P_r(Y) \Vert _F^2, \; Y \in {\mathbb {R}}^{n \times n}. \end{aligned}$$
(8)

The same condition, in expectation form, is used in [16]. In this paper, we use a stronger version of this condition than in [15, 16]:

Definition 3.2

(Approximate projector) The operator \({\hat{P}}_r\) is called an \(\epsilon \)-approximate SVD projection operator if

$$\begin{aligned} \Vert {\hat{P}}_r(Y) - P_r(Y) \Vert _F \le \epsilon \Vert Y - P_r(Y) \Vert _F, \; Y \in {\mathbb {R}}^{n \times n}. \end{aligned}$$
(9)

Definitions 3.1 and 3.2 are not equivalent: if (9) holds, then

$$\begin{aligned} \Vert {\hat{P}}_r(Y) - Y \Vert _F&\le \Vert {\hat{P}}_r(Y) - P_r(Y) \Vert _F + \Vert P_r(Y) - Y \Vert _F \\&\le (1 + \epsilon ) \Vert P_r(Y) - Y \Vert _F, \end{aligned}$$

and (8) is fulfilled with \(\sqrt{1 + \tilde{\epsilon }} = 1 + \epsilon \). If (8) holds instead, Definition 3.2 may not be fulfilled in general, as can be seen from the following example. Let

$$\begin{aligned} Y = \begin{bmatrix} 1.001 &{} 0 \\ 0 &{} 1 \\ \end{bmatrix}, P_r(Y) = \begin{bmatrix} 1.001 &{} 0 \\ 0 &{} 0 \\ \end{bmatrix}, {\hat{P}}_r(Y) := \begin{bmatrix} 0 &{} 0 \\ 0 &{} 1 \\ \end{bmatrix}; \end{aligned}$$

Then, taking \(\epsilon = 0.001\) we have

$$\begin{aligned} \Vert Y - {\hat{P}}_r(Y) \Vert _F = (1 + \epsilon ) \Vert Y - P_r(Y) \Vert _F, \; \Vert P_r(Y) - {\hat{P}}_r(Y) \Vert _F = \Vert Y \Vert _F. \end{aligned}$$
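A short numerical check of this example (the values are exactly those above):

```python
import numpy as np

Y      = np.diag([1.001, 1.0])
P_rY   = np.diag([1.001, 0.0])    # the optimal rank-1 projection of Y
P_hatY = np.diag([0.0,   1.0])    # the 'approximate' rank-1 projection from the example
eps    = 0.001

# ||Y - P_hat_r(Y)||_F equals (1 + eps) * ||Y - P_r(Y)||_F ...
print(np.linalg.norm(Y - P_hatY), (1 + eps) * np.linalg.norm(Y - P_rY))
# ... while ||P_r(Y) - P_hat_r(Y)||_F equals ||Y||_F, so Definition 3.2 fails badly.
print(np.linalg.norm(P_rY - P_hatY), np.linalg.norm(Y))
```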

In the context of this paper, however, the matrices Y of interest are subject to additional restrictions. Lemma 2.4, for example, requires a gap in the singular values of \(Y = X + E\) of the form \(\sigma _r(Y) - \sigma _{r + 1}(Y) > const\). We will now show that imposing additional constraints on the singular value decay of the considered matrix Y makes Definitions 3.1 and 3.2 close to equivalent.

First, observe that if a rank-r matrix \({\hat{P}}_r(Y)\) has an orthogonal column basis \({\hat{U}} \in {\mathbb {R}}^{n \times r}\), i.e. \({\hat{P}}_r(Y) = {\hat{U}} Z\), then \(\Vert {\hat{P}}_r(Y) - Y \Vert _F \ge \Vert P_{{\hat{U}}} Y - Y \Vert _F\), where \(P_{{\hat{U}}} = {{\hat{U}}} {{\hat{U}}}^*\) is the orthogonal projector. Since computing \({{\hat{U}}}^* Y\) is a low-complexity operation (relative to a full SVD), it can be assumed that the approximate projector \({\hat{P}}_r(Y)\) always has the form \(P_{{\hat{U}}(Y)} Y\), where \({\hat{U}}(Y)\) is some approximation to the top-r left singular vectors of Y.
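One common way to obtain a low-complexity projector of the form \(P_{{\hat{U}}(Y)} Y\) in practice is a randomized range finder with a few power iterations; the sketch below is a generic example of this construction (it is not the specific approximate projector analyzed in [15, 16]):

```python
import numpy as np

def approx_project(Y, r, n_power=2, oversample=5, rng=None):
    """Approximate rank-r projection of the form P_Uhat @ Y, with Uhat produced by a
    randomized range finder plus a few power iterations (Halko et al. style)."""
    rng = np.random.default_rng() if rng is None else rng
    Q = Y @ rng.standard_normal((Y.shape[1], r + oversample))
    for _ in range(n_power):
        Q, _ = np.linalg.qr(Q)
        Q, _ = np.linalg.qr(Y @ (Y.T @ Q))
    Q, _ = np.linalg.qr(Q)
    Ub, _, _ = np.linalg.svd(Q.T @ Y, full_matrices=False)
    Uhat = Q @ Ub[:, :r]                      # approximate top-r left singular vectors of Y
    return Uhat @ (Uhat.T @ Y)                # P_Uhat applied to Y

# Compare with the exact projection on a noisy low-rank matrix.
rng = np.random.default_rng(6)
n, r = 300, 10
Y = rng.standard_normal((n, r)) @ rng.standard_normal((r, n)) + 0.01 * rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
P_rY = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
# Empirical epsilon in the sense of Definition 3.2; it is small when the spectral gap is clear.
print(np.linalg.norm(approx_project(Y, r, rng=rng) - P_rY) / np.linalg.norm(Y - P_rY))
```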

Lemma 3.2

(Approximate projector equivalence) Let \(Y \in {\mathbb {R}}^{n \times n}\), and let \(P_r(Y) = P_U Y = U \Sigma V^*\), \(U, V \in {\mathbb {R}}^{n \times r}\), be its optimal SVD-projection onto the set of rank-r matrices. Let \(P_{{\hat{U}}} Y\) be an approximate projection which satisfies

$$\begin{aligned} \Vert Y - P_{{\hat{U}}} Y\Vert _F^2 \le (1 + \epsilon ) \Vert Y - P_U Y\Vert _F^2. \end{aligned}$$
(10)

Let the singular values of Y satisfy

$$\begin{aligned} \sum \limits _{i = 1}^{r} \sigma _i^2(Y)&\le 2 \epsilon \sum \limits _{i = r+1}^{n} \sigma _i^2(Y), \end{aligned}$$
(11)
$$\begin{aligned} \sum \limits _{i = r + 1}^{2r} \sigma _i^2(Y)&\le \epsilon \sum \limits _{i = r+1}^{n} \sigma _i^2(Y). \end{aligned}$$
(12)

Then,

$$\begin{aligned} \Vert P_{{\hat{U}}} Y - P_U Y \Vert _F \le (\sqrt{2} + \sqrt{6}) \sqrt{\epsilon } \Vert Y - P_U Y \Vert _F. \end{aligned}$$

Proof

By the approximation property (10), it can be seen that

$$\begin{aligned} (1 + \epsilon ) \sum \limits _{i = r+1}^{n} \sigma _i^2(Y) \ge \Vert (I&- P_{{\hat{U}}}) Y \Vert _F^2 = \Vert P_U (I - P_{{\hat{U}}}) Y \Vert _F^2 + \Vert (I - P_U) (I - P_{{\hat{U}}}) Y \Vert _F^2, \\ \Vert (I - P_U) (I - P_{{\hat{U}}}) Y \Vert _F^2&= \Vert (I - P_U) Y - (I - P_U) P_{{\hat{U}}} Y \Vert _F^2 \ge \sum \limits _{i = 2r + 1}^{n} \sigma _i^2(Y), \end{aligned}$$

where the last inequality follows from the bound on the optimal Frobenius-norm residual between the matrix \((I - P_U) Y\) and a rank-r matrix. By subtraction, we have

$$\begin{aligned} \Vert P_U (I - P_{{\hat{U}}}) Y \Vert _F^2&\le \sum \limits _{i = r + 1}^{2r} \sigma _i^2(Y) + \epsilon \sum \limits _{i = r + 1}^{n} \sigma _i^2(Y) \\&= \beta ^2(Y) \Vert P_U Y\Vert _F^2, \\ \beta ^2(Y)&:= \frac{\sum \nolimits _{i = r + 1}^{2r} \sigma _i^2(Y) + \epsilon \sum \nolimits _{i = r + 1}^{n} \sigma _i^2(Y)}{\sum \nolimits _{i = 1}^{r} \sigma _i^2(Y)}. \end{aligned}$$

On the other hand, by the orthogonal projection properties and optimality of singular subspace U, we have

$$\begin{aligned} \Vert (I - P_{{\hat{U}}}) Y \Vert _F^2 = \Vert Y \Vert _F^2 - \Vert P_{{\hat{U}}} Y \Vert _F^2&= \Vert Y \Vert _F^2 - \Vert (I - P_U) P_{{\hat{U}}} Y \Vert _F^2 - \Vert P_U P_{{\hat{U}}} Y \Vert _F^2, \\ \Vert (I - P_{{\hat{U}}}) Y \Vert _F^2&\ge \Vert (I - P_{U}) Y \Vert _F^2. \end{aligned}$$

By rearranging the terms, we can bound

$$\begin{aligned} \Vert (I - P_U) P_{{\hat{U}}} Y \Vert _F^2 \le \Vert Y \Vert _F^2 - \Vert (I - P_{U}) Y \Vert _F^2 - \Vert P_U P_{{\hat{U}}} Y \Vert _F^2 = \Vert P_{U} Y \Vert _F^2 - \Vert P_U P_{{\hat{U}}} Y \Vert _F^2. \end{aligned}$$

Now, we need to bound the value \(\Vert P_{U} Y \Vert _F^2 - \Vert P_U P_{{\hat{U}}} Y \Vert _F^2\) using the previously obtained \(\Vert P_U Y - P_U P_{{\hat{U}}} Y \Vert _F^2 \le \beta ^2 (Y) \Vert P_U Y \Vert _F^2\). The bound would be straightforward in the absence of squares; in order to handle the squares, let us make an arithmetic substitution \(a \leftrightarrow \Vert P_U P_{{\hat{U}}} Y \Vert _F, b \leftrightarrow \Vert P_U Y \Vert _F, b \ge a \ge 0, (b-a)^2 \le \beta ^2 b^2\). Then,

$$\begin{aligned} b^2 - a^2 = b^2 - 2ab + a^2 + 2ab - 2a^2&= (b - a)^2 + 2a(b - a) \le (\beta ^2 + 2 \beta ) b^2. \\ \Vert (I - P_U) P_{{\hat{U}}} Y \Vert _F^2&\le (\beta ^2 + 2 \beta ) \Vert P_U Y \Vert _F^2. \end{aligned}$$

Now we can use the obtained bounds to obtain

$$\begin{aligned} \Vert P_{{\hat{U}}} Y - P_U Y \Vert _F&\le \Vert P_U Y - P_U P_{{\hat{U}}} Y \Vert _F + \Vert P_{{\hat{U}}} Y - P_U P_{{\hat{U}}} Y \Vert _F \\&= \Vert P_U (I - P_{{\hat{U}}}) Y \Vert _F + \Vert P_{{\hat{U}}} (I - P_U) Y \Vert _F \\&\le (\beta + \sqrt{\beta ^2 + 2 \beta }) \Vert P_U Y \Vert _F. \end{aligned}$$

In order to finalize the proof, we need to use conditions (11), (12) to bound \(\beta (Y)\). Inequality (12) gives

$$\begin{aligned} \sum \limits _{i = r + 1}^{2r} \sigma _i^2(Y) + \epsilon \sum \limits _{i = r + 1}^{n} \sigma _i^2(Y) \le 2 \epsilon \sum \limits _{i = r + 1}^{n} \sigma _i^2(Y), \end{aligned}$$

thus by (11) \(\beta (Y) > 1\). Then,

$$\begin{aligned} \Vert P_{{\hat{U}}} Y - P_U Y \Vert _F&\le (1 + \sqrt{3}) \beta \Vert P_U Y \Vert _F = (1 + \sqrt{3}) \sqrt{\sum \nolimits _{i = r + 1}^{2r} \sigma _i^2(Y) + \epsilon \sum \nolimits _{i = r + 1}^{n} \sigma _i^2(Y)} \\&\le (1 + \sqrt{3}) \sqrt{2\epsilon \sum \nolimits _{i = r + 1}^{n} \sigma _i^2(Y)} = (\sqrt{2} + \sqrt{6}) \sqrt{\epsilon } \Vert Y - P_U Y \Vert _F. \end{aligned}$$

\(\square \)

We now summarize the SVP convergence results, similar to those in [9, 15], in a theorem. The theorem uses Definition 3.2 and requires the RIP property (2) to hold for all matrices of rank up to 4r, compared to 2r in [9, 15]. Under these stricter conditions it is possible to give a short and transparent proof of convergence based on SVD perturbation analysis.

Theorem 3.1

(SVP convergence) Let X be a rank-r matrix and \(\kappa = \frac{\sigma _1(X)}{\sigma _r(X)}\). Let \(\mathcal{A}\) be an operator that satisfies RIP with a constant \(\delta _{4r}\) on all matrices of rank up to 4r. Then the SVP algorithm attains local linear (geometric) convergence with constant \(\delta _{4r} (8 \kappa + 3)\), provided this constant is less than 1.

Let \(\Vert \mathcal{A}^*\Vert _F = \max _{X \ne 0}\frac{\Vert \mathcal{A}^*(X)\Vert _F}{\Vert X\Vert _F}\). Additionally, let \({\hat{P}}_r\) be an \(\epsilon \)-approximate SVD projection operator in the sense of Definition 3.2, and let \(\epsilon \Vert \mathcal{A}^* \Vert _{F} (1 + \delta _{4r}) < 1 - \delta _{4r} (8 \kappa + 3)\). Then approximate SVP, with \(P_r\) replaced by \({\hat{P}}_r\), attains local linear (geometric) convergence.

Proof

In the case of exact SVP projection, taking into account that \(rank(W_k) \le 2r\), and assuming \(\Vert E_k \Vert _2 \le \frac{\sigma _r(X)}{2}\), Lemmas 2.4, 3.1 give estimates

$$\begin{aligned} \Vert P_U E_k \Vert _F&\le \delta _{4r} \Vert W_k \Vert _F, \Vert E_k P_V \Vert _F \le \delta _{4r} \Vert W_k \Vert _F \\ \Vert W_{k+1} \Vert _F&= \Vert P_r(X + E_k) - X \Vert _F \le \delta _{4r} (8 \kappa + 3) \Vert W_k \Vert _F. \end{aligned}$$

The assumption \(\Vert E_k \Vert _2 \le \frac{\sigma _r(X)}{2}\) holds true locally when the error value of the current iterate is small enough: as

$$\begin{aligned} \Vert E_k\Vert _2 = \Vert (\mathcal{I} - \mathcal{A}^* \mathcal{A}) (W_k) \Vert _2&\le \Vert \mathcal{I} \Vert _F \Vert W_k\Vert _F + \Vert \mathcal{A^*}\Vert _F \Vert \mathcal{A} (W_k) \Vert _F \\&\le (1 + (1 + \delta _{4r}) \Vert \mathcal{A}^*\Vert _F) \Vert W_k\Vert _F, \end{aligned}$$

it is sufficient to assume

$$\begin{aligned} \Vert W_k \Vert _2 \le \frac{1}{2 \Vert \mathcal{A}^*\Vert _F} \sigma _r(X). \end{aligned}$$

In the case of inexact SVP projection, it can be seen that

$$\begin{aligned} W_{k+1}&= X - {\hat{P}}_r(X + E_k) \\&= X - P_r(X + E_k) + P_r(X + E_k) - {\hat{P}}_r(X + E_k), \end{aligned}$$

and it suffices to bound \(\Vert P_r(X + E_k) - {\hat{P}}_r(X + E_k) \Vert _F\) with

$$\begin{aligned} \Vert P_r(X + E_k) - {\hat{P}}_r(X + E_k) \Vert _F&\le \epsilon \Vert P_r(X + E_k) - (X + E_k) \Vert _F \\&= \epsilon \Vert P_r(X_k + \mathcal{A}^*\mathcal{A}(W_k)) - (X_k + \mathcal{A}^*\mathcal{A}(W_k)) \Vert _F \\&\le \epsilon \Vert X_k - (X_k + \mathcal{A}^* \mathcal{A}(W_k)) \Vert _F \\&= \epsilon \Vert \mathcal{A}^* \mathcal{A}(W_k) \Vert _F \le (1 + \delta _{4r}) \epsilon \Vert \mathcal{A}^* \Vert _F \Vert W_k \Vert _F. \end{aligned}$$

where we replaced an optimal approximation with a suboptimal approximation \(X_k\) of rank r in order to obtain an upper bound. \(\square \)

Compared to the results of [9], Theorem 3.1 gives a convergence bound that depends on the optimization condition number \(\kappa \), a dependence that is also observed in numerical experiments. For the purposes of this paper, let us generalize this theorem to the case of a nonzero additive error occurring at each gradient step. This results in the following Lemma.

Lemma 3.3

(SVP with errors) Consider the same conditions as in the previous theorem, but assume that each SVP iteration is carried out with an additive error matrix \({\hat{S}}_k\) present:

$$\begin{aligned} X_{k + 1}&:= P_r(X_k + \mathcal{A}^* \mathcal{A} (W_k + {\hat{S}}_k)) \\&= P_r(X + E_k + S_k), \\ S_k&:= \mathcal{A}^* \mathcal{A} ({\hat{S}}_k). \end{aligned}$$

Let the additive error matrices satisfy the following bounds:

$$\begin{aligned} \Vert S_k\Vert _F&\le \Vert \mathcal{A}^*\Vert _F \Vert W_k\Vert _F, \forall k \end{aligned}$$
(13)
$$\begin{aligned} \Vert P_U S_k\Vert _F&\le \delta _{4r} \Vert W_k\Vert _F, \nonumber \\ \Vert S_k P_V\Vert _F&\le \delta _{4r} \Vert W_k\Vert _F, \forall k. \end{aligned}$$
(14)

Then,

  • If \(2 \delta _{4r} (8 \kappa + 3) < 1\), the SVP algorithm with additive errors and exact SVD projection attains local geometric convergence with constant \(2 \delta _{4r} (8 \kappa + 3)\).

  • If \(\epsilon \Vert \mathcal{A}^* \Vert _F (2 + \delta _{4r}) < 1 - 2 \delta _{4r} (8 \kappa + 3)\), then the SVP algorithm with additive errors and \(\epsilon \)-approximate SVD projection attains local geometric convergence.

Proof

Again, in the case of exact SVP projection, Lemma 2.4 gives an estimate

$$\begin{aligned}{} & {} \Vert W_{k+1} \Vert _F = \Vert P_r(X + E_k + S_k) - X \Vert _F \\{} & {} \le \max \{ \Vert P_U (E_k + S_k) \Vert _F, \Vert (E_k + S_k) P_V \Vert _F\} (8 \kappa + 3). \end{aligned}$$

By Lemma 3.1 and the theorem assumptions we can then bound

$$\begin{aligned} \Vert P_U (E_k + S_k) \Vert _F&\le \Vert P_U E_k \Vert _F + \Vert P_U S_k \Vert _F \le 2 \delta _{4r} \Vert W_k \Vert _F, \\ \Vert (E_k + S_k) P_V \Vert _F&\le \Vert E_k P_V \Vert _F + \Vert S_k P_V \Vert _F \le 2 \delta _{4r} \Vert W_k \Vert _F. \end{aligned}$$

These bounds hold under the locality assumption \(\Vert E_k + S_k \Vert _2 \le \sigma _r(X)\), for which, taking into account the condition (13), it is sufficient to assume that the current error is bounded by

$$\begin{aligned} \Vert W_k \Vert _2 \le \frac{1}{4 \Vert \mathcal{A}^* \Vert _{F}} \sigma _r(X). \end{aligned}$$

In the case of approximate SVD projection and gradient additive error,

$$\begin{aligned} W_{k+1}&= X - {\hat{P}}_r(X + E_k + S_k) \\&= X - P_r(X + E_k + S_k) + P_r(X + E_k + S_k) - {\hat{P}}_r(X + E_k + S_k), \end{aligned}$$

and it suffices to bound \(\Vert P_r(X + E_k + S_k) - {\hat{P}}_r(X + E_k + S_k)\Vert _F\) using the \(\epsilon \)-approximation property (9)

$$\begin{aligned} \Vert P_r(X + E_k + S_k)&- {\hat{P}}_r(X + E_k + S_k) \Vert _F \le \\&\le \epsilon \Vert P_r(X + E_k + S_k) - (X + E_k + S_k) \Vert _F \\&= \epsilon \Vert P_r(X_k + \mathcal{A}^*\mathcal{A}(W_k) + S_k) - (X_k + \mathcal{A}^*\mathcal{A}(W_k) + S_k) \Vert _F \\&\le \epsilon \Vert X_k - (X_k + \mathcal{A}^* \mathcal{A}(W_k) + S_k) \Vert _F = \epsilon (\Vert \mathcal{A}^* \mathcal{A}(W_k) \Vert _F + \Vert S_k \Vert _F) \\&\le \epsilon (2 + \delta _{4r}) \Vert \mathcal{A}^* \Vert _F \Vert W_k \Vert _F. \end{aligned}$$

\(\square \)

4 Twin completion

4.1 Low-rank plus sparse problem setting

Assume a matrix \(Y \in {\mathbb {R}}^{n \times n}\) can be represented as a sum

$$\begin{aligned} Y = X + {\hat{S}}, \end{aligned}$$
(15)

where the first term is a low-rank matrix, \(rank(X) = r \ll n\), and the second term \({\hat{S}}\) is sparse. The problem considered in this section is to find both X and \({\hat{S}}\) given Y. In other words, a two-part representation is computed for the input matrix, where the two parts are known to have different structure types (‘low-rank’ and ‘sparse’). If one of these parts is known and the model decomposition (15) is exact, the second part can be obtained by subtraction.

An algorithm that is based on the matrix completion concept will be proposed as a solution. For the convergence analysis of the algorithm, it will be assumed that nonzero elements of \({\hat{S}}\) are distributed uniformly among the set of all possible indices, with a constant probability of order \(\frac{\beta }{n}, \beta = const\), which results in an average of \(\beta n\) nonzero elements in \({\hat{S}}\).

Referring to Sect. 2 and Theorem 2.1, assume that the low-rank part X is \(\mu \)-incoherent. Algorithms for finding both the exact low-rank part X and the sparse part \({\hat{S}}\) for a given Y based on convex relaxations are available under certain tangent-space conditions for X [21], but involve numerically complex procedures because X is handled as a vector of \(O(n^2)\) unknowns. In our work, we provide an algorithm that iteratively maintains the low-parametric structure of the approximations of both X (factorized low-rank) and \({\hat{S}}\) (sparse). To achieve this, we consider a problem of the following form:

$$\begin{aligned} \Vert (\mathcal{I} - \mathcal{P}_{\Lambda })(Y - X) \Vert _F \rightarrow _{X, \Lambda } \inf , \quad rank(X) \le r, \; |\Lambda |\le C \beta n, \; C > 1. \end{aligned}$$
(16)

Here, \(\Lambda \subset \{1 \dots n\} \times \{1 \dots n\}\), and \(\mathcal{P}_{\Lambda }: {\mathbb {R}}^{n \times n} \rightarrow {\mathbb {R}}^{n \times n}\) denotes the operator that sets to zero all matrix elements with indices that do not belong to \(\Lambda \). The functional (16) means that an approximation of \(Y = X + {\hat{S}}\) is again sought in the form of a low-rank plus sparse matrix, but the sparse part may have a constant factor more elements than the original \({\hat{S}}\); thus, we are looking for a possibly suboptimal approximation in the original low-rank plus sparse format.

Then, consider the following iterative algorithm, starting with \(k = 0, X_0 = 0, W_0 = X\), an empty set \(\Lambda \) and a random mask \(\Omega _0\) selected uniformly with sparsity \(\rho \ge {C_{RIP}} \frac{\mu ^2 r^2}{\delta _{4r}^2} \frac{\log (n)}{n}\), as in Theorem 2.1. The algorithm iterations are then described by:

Algorithm 1 (pseudocode shown in figure a)
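Since the pseudocode is given only in the figure, the NumPy sketch below records one plausible reading of the Algorithm 1 iteration, reconstructed from the surrounding discussion; the size of the excluded set, the way \(\Upsilon _k\) is chosen, and the stopping rule based on the threshold \(\varepsilon \) discussed just below are our assumptions rather than the exact specification:

```python
import numpy as np

def svd_project(Y, r):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def completion_with_exclusion(Yobs, mask0, r, n_excl, eps=1e-6, max_iter=100):
    """Hypothetical reading of Algorithm 1: SVP steps over a shrinking mask.
    After each step, the n_excl observed entries with the largest residuals
    (candidates for the sparse part S_hat) are moved from the mask Omega into
    the excluded set Lambda; masks are boolean arrays, Yobs is zero off mask0."""
    rho = mask0.mean()                 # fixed scaling parameter, as in Lemma 4.3
    mask = mask0.copy()
    Lambda = np.zeros_like(mask0)
    Xk = np.zeros_like(Yobs)
    for _ in range(max_iter):
        resid = np.where(mask, Yobs - Xk, 0.0)
        Xk = svd_project(Xk + resid / rho, r)            # SVP gradient step, alpha = 1
        # Assumed exclusion set Upsilon_k: the largest remaining observed residuals.
        resid = np.abs(np.where(mask, Yobs - Xk, 0.0))
        flat = np.argpartition(resid.ravel(), -n_excl)[-n_excl:]
        upsilon = np.zeros_like(mask)
        upsilon.ravel()[flat] = True
        upsilon &= mask
        mask &= ~upsilon
        Lambda |= upsilon
        if np.linalg.norm(np.where(mask, Yobs - Xk, 0.0)) <= eps:
            break
    return Xk, Lambda
```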

The value \(\varepsilon \) is a predefined threshold which the low-rank part of the approximation should meet on the matrix entries that do not belong to the sparse part of the approximation. The logic behind this algorithm is essentially that, by construction, the additive error matrices \(S_k := \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} ({\hat{S}})\) that arise in the iterative process should fall under the assumptions of Lemma 3.3, making the algorithm convergent. In order to prove this, we need the following Lemmas characterizing \(S_k\). First, \(S_k\) lies in the intersection of two random sparse index sets, defined by \({\hat{S}}\) and by \(\Omega _k\); this means that \(S_k\) must be a very sparse matrix with high probability:

Lemma 4.1

(Sparse mask intersections) Let \(C \in [e^2, \frac{n}{\beta })\) be a constant, and let each index (i, j) have the same probability \(\frac{\beta }{n}\) of belonging to the set \(supp({\hat{S}})\) of nonzero elements of \({\hat{S}}\). Then:

  • With probability not less than \(1 - e^{-(C + 1)\beta n}\), \(|supp({\hat{S}}) |\le C \beta n\).

  • With probability not less than \(1 - e^{-(C + 1)\beta \rho n}\), \(|supp(S_k) |\le C \rho \beta n, \rho \in (0,1)\).

Proof

Note that \(|supp(S_k) |\le |supp(S_0) |\), and both \(|supp({\hat{S}}) |\) and \(|supp(S_0) |\) can be considered random Binomial-distributed variables that correspond to \(n^2\) experiments with ‘success’ chances \(\frac{\beta }{n}\) and \(\frac{\rho \beta }{n}\), respectively. The Lemma then requires a bound on two Binomial distributions

$$\begin{aligned} {\mathbb {P}}(|supp({\hat{S}}) |> C \beta n)&= F(n^2 - C \beta n; n^2, 1 - \frac{\beta }{n}), \\ {\mathbb {P}}(|supp(S_0) |> C \beta \rho n)&= F(n^2 - C \beta \rho n; n^2, 1 - \frac{\beta \rho }{n}), \end{aligned}$$

where \(F(s; m, p) = {\mathbb {P}}(X_{m,p} \le s)\) denotes a cumulative distribution function of a Binomial random variable \(X_{m,p}\) with m experiments and ‘success’ chance p. Thus, we need a bound for the CDF of a Binomial distribution: such bound can be taken, for example, from [22], where it is established that

$$\begin{aligned} F(s; m, p) \le e^{- m D(\frac{s}{m} \Vert p)}; \\ D(q \Vert p) := q \log \frac{q}{p} + (1 - q) \log \frac{1 - q}{1 - p}; q,p \in (0, 1), \end{aligned}$$

where the notation \(D(q \Vert p)\) denotes the so-called Kullback–Leibler divergence, expressed directly for the particular case of two binary coins with probabilities q, p. Then, the bound we need follows from

$$\begin{aligned} F(n^2 - C \beta n; n^2, 1 - \frac{\beta }{n})&\le e^{-n^2( \frac{n^2 - C \beta n}{n^2} \log \frac{\frac{n^2 - C \beta n}{n^2}}{\frac{n^2 - \beta n}{n^2}} + \frac{C \beta n}{n^2} \log \frac{\frac{C \beta n}{n^2}}{\frac{\beta n}{n^2}}) } \\&= e^{-(n^2 - C \beta n) \log \frac{n^2 - C \beta n}{n^2 - \beta n}} e^{-C \log C \beta n} \\&= e^{(n^2 - C \beta n) \log \frac{n^2 - \beta n}{n^2 - C \beta n}} e^{-C \log C \beta n} \\&= e^{(n^2 - C \beta n) \log (1 + \frac{(C - 1) \beta n}{n^2 - C \beta n})} e^{-C \log C \beta n} \\&\le e^{(C - 1) \beta n} e^{-C \log C \beta n} \le e^{- (C + 1) \beta n}. \end{aligned}$$

where we used \(\log (1 + x) \le x\) and \(\log C \ge 2\). The second bound is obtained in the same way by replacing \(\beta \leftrightarrow \beta \rho \) in the derivation. The condition \(\log C \ge 2\) here can be relaxed; it is used in order to obtain simpler probability bound formulas. \(\square \)

4.2 Sparse error bounds

Now we are going to establish bounds on \(\Vert S_k\Vert _F, \Vert P_U S_k\Vert _F, \Vert S_k P_V\Vert _F\), where \(U, V \in {\mathbb {R}}^{n \times r}\) are the \(\mu \)-incoherent singular bases of the low-rank part X of the solution. The bounds are based on the idea that the \(\Upsilon _k\) exclusion step should remove all ‘sharp’ sparse-part elements except those that do not exceed the current residual in absolute value. This is done in the following Lemma:

Lemma 4.2

(Sparse error exclusion) Let \(C \in [e^2, \frac{n}{\beta })\). Assume \(W_{k - 1}\) is \(\mu \)-incoherent. Then, with probability no less than \(1 - e^{-C \beta \rho n}\),

$$\begin{aligned} \Vert S_k\Vert _F&\le 2\mu \sqrt{\frac{C\beta r}{\rho n}} \Vert W_{k-1}\Vert _F, \\ \max \{ \Vert P_U S_k\Vert _F, \Vert S_k P_V\Vert _F \}&\le 2 C \beta \mu r \sqrt{\frac{\mu }{n}} \Vert W_{k - 1}\Vert _F. \end{aligned}$$

Proof

Assume \(|supp({\hat{S}}) |\le C \beta n\) and \(|supp(S_k) |\le C \beta \rho n\); by previous Lemma, the probability that both events hold is no less than \(1 - e^{-C \beta \rho n}\). Then, recalling \(S_k = \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} ({\hat{S}})\), we have

$$\begin{aligned} \Vert S_k\Vert _C = \Vert \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} ({\hat{S}})\Vert _C&= \Vert \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} (W_{k-1} + {\hat{S}}) - \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} (W_{k-1})\Vert _C \le \\&\le \Vert \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} (W_{k-1} + {\hat{S}})\Vert _C + \Vert \mathcal{A}_{\Omega _k}^* \mathcal{A}_{\Omega _k} (W_{k-1})\Vert _C\\&= \max _{(i,j) \in \Omega _k} \frac{|W_{k - 1} + {\hat{S}} |_{i,j}}{\rho } + \max _{(i,j) \in \Omega _k} \frac{|W_{k - 1} |_{i,j}}{\rho } \\&\le \min _{(i,j) \in \Upsilon _k} \frac{|W_{k - 1} + {\hat{S}} |_{i,j}}{\rho } + \max _{(i,j) \in \Omega _{k-1}} \frac{|W_{k - 1} |_{i,j}}{\rho } \\&\le 2 \max _{(i,j) \in \Omega _{k-1}} \frac{|W_{k - 1} |_{i,j}}{\rho }, \end{aligned}$$

where the last inequality follows from the fact that the set \(\Upsilon _k\) has more elements than the support of \({\hat{S}}\), so for at least one \((i,j) \in \Upsilon _k\) it holds that \((W_{k-1} + {\hat{S}})_{i,j} = (W_{k-1})_{i,j}\). Using the incoherence assumption and Lemma 2.2, the bound then continues as

$$\begin{aligned} \Vert S_k\Vert _C \le 2 \frac{\Vert W_{k-1}\Vert _C}{\rho } \le \frac{2 \mu \sqrt{r} \Vert W_{k-1}\Vert _F}{\rho n}. \end{aligned}$$

Now, let us use \(|supp(S_k) |\le C \beta \rho n\) and the obtained elementwise bound on \(S_k\) in order to establish Frobenius norm bounds on \(S_k, P_U S_k, S_k P_V\):

$$\begin{aligned} \Vert S_k \Vert _F \le \sqrt{C \beta \rho n} \Vert S_k\Vert _C&\le 2\mu \sqrt{\frac{C\beta r}{\rho n}} \Vert W_{k-1}\Vert _F. \\ \Vert P_U S_k\Vert ^2_F = \Vert U^* S_k\Vert ^2_F&= \Vert \sum _{(i,j) \in supp(S_k)} s_{i,j} U^* \vec {e}_i \vec {e}_j^T\Vert ^2_F \\&\le |supp(S_k) |\sum _{(i,j) \in supp(S_k)} \Vert s_{i,j} U^* \vec {e}_i \vec {e}_j^T \Vert _F^2 \\&\le |supp(S_k) |^2 \Vert S_k\Vert _C^2 \max _i \Vert U^* \vec {e}_i\Vert ^2_2 \le \\&\le |supp(S_k) |^2 \Vert S_k\Vert _C^2 \frac{\mu r}{n} \\&\le \frac{(C^2 \beta ^2 \rho ^2 n^2) (4 r \mu ^2) (\mu r)}{(\rho ^2 n^2) (n)} \Vert W_{k - 1}\Vert _F^2 \\&= \frac{4 C^2 \beta ^2 \mu ^3 r^2}{n} \Vert W_{k - 1}\Vert _F^2; \\ \Vert P_U S_k\Vert _F&\le 2 C \beta \mu r \sqrt{\frac{\mu }{n}} \Vert W_{k - 1}\Vert _F. \end{aligned}$$

The derivation of the bound on \(\Vert S_k P_V \Vert _F\) is based on the same considerations as those for \(\Vert P_U S_k\Vert _F\). \(\square \)

Lemma 4.2 combined with the proof of Lemma 3.3 now gives the following insight: the proposed Algorithm 1 converges geometrically in terms of \(\Vert (\mathcal{I} - \mathcal{P}_{\Lambda })(Y - X) \Vert _F\) if the following conditions hold at each iteration k:

  • Both the current iterate \(X_k\) and residual \(W_k\) are \(\mu \)-incoherent.

  • The operator \(\mathcal{A}_{\Omega _k}\) satisfies the ‘Restricted Isometry Property’ (2) on all \(\mu \)-incoherent matrices with high probability.

The former condition is a complicated issue to prove: while \(X_k\) is guaranteed to be incoherent when \(\Vert W_k \Vert _2 \ll \sigma _r(X)\), the residual \(W_k\) is known to lose incoherence properties in practice, in some cases even for the common SVP algorithm (without additive errors) applied to the completion operator [15] (this can, however, be resolved in practice using heuristic rank-increasing techniques [15]). The latter condition, however, can be proved using the following idea: although \(\Omega _k\) loses its randomness/uniformity (because the excluded sets \(\Upsilon _j\) are not random and depend on \(X, {\hat{S}}\)), \(\Omega _k\) is obtained from the initial set \(\Omega _0\) by excluding a relatively small number of elements from the mask, and the operator \(\mathcal{A}_{\Omega _0}\) does satisfy RIP w.h.p. by Theorem 2.1. We will finalize the analysis of Algorithm 1 with the following Lemma.

Lemma 4.3

(Restricted isometry preservation) Let the initial set \(\Omega _0\) of Algorithm 1 be selected randomly using uniform index distribution

$$\begin{aligned} {\mathbb {P}} \{ (i, j) \in \Omega _0 \} = \rho , \forall i,j, \end{aligned}$$

and let \(\rho \ge {C_{RIP}} \frac{\mu ^2 r^2}{{\delta _r}^2} \frac{\log (n)}{n}\). Then, with probability not less than \(1 - \exp (-n \log (n))\), the operator \(\mathcal{A}_{\Omega _k}\) (with the same scaling parameter \(\rho \)) that appears at iteration k of Algorithm 1 satisfies the RIP with a constant \({\delta _{r,k}}\) on all \(\mu \)-incoherent matrices with rank not larger than r, and

$$\begin{aligned} \delta _{r,k} = \delta _r + \frac{C \beta k}{C_{RIP} r \log n}. \end{aligned}$$

Proof

By Theorem 2.1, the operator \(\mathcal{A}_{\Omega _0}\) with probability not less than \(1 - \exp (-n \log (n))\) satisfies the RIP on all \(\mu \)-incoherent matrices with rank at most r. Now consider an arbitrary \(\mu \)-incoherent matrix \(Y \in {\mathbb {R}}^{n \times n}, rank(Y) \le r\). Note that \(\Vert \mathcal{A}_{\Omega _0}(Y)\Vert _F^2 = \Vert \mathcal{A}_{\Omega _k}(Y)\Vert _F^2 + \frac{\Vert P_{\Omega _0 / \Omega _k}(Y) \Vert _F^2}{\rho } \), and \(|\Omega _0 / \Omega _k |\le C k \beta n\). Then,

$$\begin{aligned} \Vert \mathcal{A}_{\Omega _k}(Y)\Vert _F^2&\le \Vert \mathcal{A}_{\Omega _0}(Y)\Vert _F^2 \le (1 + \delta _r) \Vert Y\Vert _F^2 \le (1 + \delta _{r,k}) \Vert Y\Vert _F^2; \\ \Vert \mathcal{A}_{\Omega _k}(Y)\Vert _F^2&= \Vert \mathcal{A}_{\Omega _0}(Y)\Vert _F^2 - \frac{ \Vert P_{\Omega _0 / \Omega _k}(Y) \Vert _F^2}{\rho } \\&\ge (1 - \delta _r) \Vert Y\Vert _F^2 - \frac{C\beta k n \Vert Y\Vert _C^2}{\rho } \\&\ge (1 - \delta _r) \Vert Y\Vert _F^2 - \frac{C\beta k \mu ^2 r}{\rho n} \Vert Y\Vert _F^2 \\&\ge (1 - \delta _r) \Vert Y\Vert _F^2 - \frac{C \beta k}{C_{RIP} r \log n} \Vert Y\Vert _F^2 \\&= (1 - \delta _{r,k}) \Vert Y\Vert _F^2. \end{aligned}$$

\(\square \)

As we are assuming geometric convergence, \(k = O(\log \frac{1}{\varepsilon })\), where \(\varepsilon \) is the desired solution residual threshold, and thus k is not large. A better RIP constant \(\delta _{r,k}\) can be obtained by increasing \(\rho \) above the theoretical minimum, as seen from the proof.

4.3 Twin completion

The proposed Algorithm 1 has the following notable drawbacks:

  1. The number of elements in the sparse part of an approximation returned by Algorithm 1 is not smaller than the true number of erroneous elements multiplied by the number of iterations. Both the convergence analysis (Lemma 3.3) and practical results (see the next section) suggest that although the convergence is geometric, large values of the condition number \(\kappa := \frac{\sigma _1(X)}{\sigma _r(X)}\), where X is the unknown low-rank matrix, can greatly increase the required number of iterations, inflating the number of elements in the sparse part of the approximation.

  2. In practice, if \(\kappa \gg 1\), at an early (e.g. the first) iteration k of Algorithm 1 it is possible that \(\Vert S_k\Vert _2\) is below \(\sigma _1(X)\) but above \(\sigma _r(X)\) (which contradicts the locality assumption of the convergence Lemma 3.3). In that case, the matrix \(X + E_k + S_k\) may have a set of singular vectors similar to those of X, but not all of them would occupy positions among the top-r singular values. The matrix \(W_{k + 1} = P_r(X + E_k + S_k) - X\) then commonly loses its incoherence properties, because one of the singular vectors of \(P_r(X + E_k + S_k)\) is largely affected by \(S_k\), and Algorithm 1 diverges. The same problem arises in practice for ill-conditioned matrices X even for the common SVP algorithm [15].

To deal with the discussed problems, a novel ‘Twin Completion’ approach is proposed. The approach is based on the following ideas:

  • If a random sparse mask \(\Omega \) projection \(P_{\Omega }\) is applied to an already sparse matrix \({\hat{S}}\), the result \(P_{\Omega } ({\hat{S}})\) can be considered a low-rank matrix, because it is very sparse.

  • If two random mask projections \(P_{\Omega _1}, P_{\Omega _2}\) are applied to an already sparse matrix, both results \(P_{\Omega _1} ({\hat{S}}), P_{\Omega _2} ({\hat{S}})\) will be very sparse low-rank matrices, and they are likely to have orthogonal column and row subspaces.

  • Sparse subspaces are almost orthogonal to incoherent subspaces. A sum of an incoherent matrix and a sparse matrix should have a set of (almost) incoherent singular vectors and a set of (almost) sparse singular vectors.

  • If two SVP steps, corresponding to two random masks, are carried out, the two results should have a pair of close rank-r subspaces (close to the solution X subspaces), and two sets of mutually orthogonal singular vectors.

Based on these considerations, it is proposed to generalize Algorithm 1 so that it uses two masks. At each iteration, two gradient steps along the two masks are carried out and SVP-projected onto the set of matrices with rank not larger than \(r + p\), where p is a parameter that should estimate the number of elements in the two presumably orthogonal very sparse matrices.

The algorithm is initialized with \(k = 0, X_0 = 0, W_0 = X\), an empty set \(\Lambda \) and two random masks \(\Omega _{a,0}, \Omega _{b,0}\), both selected randomly and uniformly with equal sparsity \(\rho \ge {C_{RIP}} \frac{\mu ^2 r^2}{\delta _r^2} \frac{\log (n)}{n}\), as in Theorem 2.1. The iterations are then described in Algorithm 2.

Algorithm 2 (pseudocode shown in figure b)

The proposed algorithm no longer relies directly on the construct \(P_r(X + E_k + S_k)\), which considers only the top-r singular vectors, making it more stable in the case of ill-conditioned solutions. Furthermore, in practice it is possible to relax the requirements on the norm \(\Vert S_k\Vert _C\), which is controlled by the \(\Upsilon _k\) exclusion. This means that exclusions can be carried out more rarely (or fewer elements can be excluded at each iteration), resulting in a smaller set \(\Lambda \) at the end of the algorithm, which corresponds to a smaller number of elements in the sparse part of the final approximation. In the remainder of the paper, we provide the theoretical convergence background for Algorithm 2.
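As with Algorithm 1, the pseudocode appears only in the figure; the sketch below records one hypothetical reading of a single ‘twin’ iteration based on the ideas listed above (the use of the top-r canonical vectors to extract the aligned part, the averaging of the two gradient steps, and all names are our assumptions, not the authors' specification):

```python
import numpy as np

def top_left_vectors(Y, k):
    """Top-k left singular vectors of Y."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k]

def twin_step(Yobs, Xk, mask_a, mask_b, r, p):
    """One hypothetical 'twin completion' iteration: two SVP-style gradient steps along
    two independent masks, projected onto rank r + p, followed by extraction of the
    rank-r part on which the two results are most aligned (top-r canonical vectors)."""
    rho_a, rho_b = mask_a.mean(), mask_b.mean()
    Ya = Xk + np.where(mask_a, Yobs - Xk, 0.0) / rho_a   # X + E_{a,k} + S_{a,k}
    Yb = Xk + np.where(mask_b, Yobs - Xk, 0.0) / rho_b   # X + E_{b,k} + S_{b,k}
    Ua = top_left_vectors(Ya, r + p)
    Ub = top_left_vectors(Yb, r + p)
    # Top-r canonical vectors between span(Ua) and span(Ub): the 'aligned' directions.
    W, cosines, Zt = np.linalg.svd(Ua.T @ Ub)
    Ubar = Ua @ W[:, :r]
    # Next low-rank iterate: project the averaged gradient steps onto the aligned column space.
    return Ubar @ (Ubar.T @ (Ya + Yb) / 2.0)

# Usage sketch (Yobs holds the observed data, zero off the union of the two masks):
# Xk = np.zeros_like(Yobs); Xk = twin_step(Yobs, Xk, mask_a, mask_b, r=5, p=20)
```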

First, let us prove the following technical Lemma: if two column subspaces \(span(U_a), span(U_b)\) contain a pair of close and known rank-r subspaces \(span(U_{a,r}), span(U_{b,r})\), and the ‘remainders’ are almost orthogonal, then the top-r canonical subspaces between \(span(U_a), span(U_b)\) cannot differ much from \(span(U_{a,r}), span(U_{b,r})\), respectively.

Lemma 4.4

(Canonical base distance bound) Let \(U_a, U_b \in {\mathbb {R}}^{n \times p}\) be matrices with orthogonal columns that are decomposable into the following blocks:

$$\begin{aligned} U_a = \begin{bmatrix} U_{a,r}&U_{a,p-r} \end{bmatrix}, U_b = \begin{bmatrix} U_{b,r}&U_{b, p-r} \end{bmatrix}, \end{aligned}$$

where \(U_{a,r}, U_{b,r} \in {\mathbb {R}}^{n \times r}\). Let \(\Vert P_{U_{a,r}} - P_{U_{b,r}}\Vert _F \le \alpha , \Vert U_{a,p-r}^* U_{b,p-r} \Vert _F \le \beta \). Then, if the matrices \({\bar{U}}_{a}, \bar{U}_{b} \in {\mathbb {R}}^{n \times r}\) correspond to the top-r canonical vector pairs between the subspaces \(span(U_a), span(U_b)\),

$$\begin{aligned} \Vert P_{{\bar{U}}_{a}} - P_{U_{a,r}} \Vert _F&\le 12 \alpha + 4 \beta , \\ \Vert P_{{\bar{U}}_{b}} - P_{U_{b,r}} \Vert _F&\le 12 \alpha + 4 \beta . \end{aligned}$$

Proof

Consider the following block-product:

$$\begin{aligned} U_b^* U_a = \begin{bmatrix} U_{b,r}^* \\ U_{b, p-r}^* \\ \end{bmatrix} \begin{bmatrix} U_{a,r} &{} U_{a,p-r} \\ \end{bmatrix} = \begin{bmatrix} U_{b,r}^* U_{a,r} &{} U_{b,r}^* U_{a,p - r} \\ U_{b,p-r}^* U_{a,r} &{} U_{b,p-r}^* U_{a,p - r} \end{bmatrix}. \end{aligned}$$

Under the assumptions of the Lemma, and considering the values of \(\alpha , \beta \) to be small, this block product has only one block with a large norm. The norm of the top-left block is bounded from below by

$$\begin{aligned} \Vert U_{b,r}^* U_{a,r}\Vert _F = \Vert P_{U_{b,r}} P_{U_{a,r}}\Vert _F&= \Vert P_{U_{a,r}} + (P_{U_{b,r}} - P_{U_{a,r}}) P_{U_{a,r}}\Vert _F \ge \\&\ge \Vert P_{U_{a,r}}\Vert _F - \Vert P_{U_{b,r}} - P_{U_{a,r}}\Vert _F \Vert P_{U_{a,r}}\Vert _2 \ge \sqrt{r} - \alpha , \end{aligned}$$

while the remaining block norms can be estimated with

$$\begin{aligned} \Vert U_{b,p - r}^* U_{a,r}\Vert _F&\le \Vert (I - P_{U_{b,r}}) P_{U_{a,r}}\Vert _F = \Vert P_{U_{a,r}} - P_{U_{b,r}} P_{U_{a,r}}\Vert _F = \\&= \Vert P_{U_{a,r}}^2 - P_{U_{b,r}} P_{U_{a,r}}\Vert _F \le \Vert P_{U_{a,r}}\Vert _2 \Vert P_{U_{a,r}} - P_{U_{b,r}}\Vert _F = \alpha ; \\ \Vert U_{a,p - r}^* U_{b,p - r}&\Vert _F \le \beta . \end{aligned}$$

The structure of the matrix \(U_b^* U_a\) then implies that its top left block defines a close-to-optimal rank-r approximation of the whole matrix. It can be seen that

$$\begin{aligned} \Vert P_r(U_b^* U_a) - \begin{bmatrix} U_{b,r}^* U_{a,r} &{} 0 \\ 0 &{} 0 \\ \end{bmatrix}\Vert _F&\le \Vert P_r(U_b^* U_a) - U_b^* U_a\Vert _F + \Vert U_b^* U_a - \begin{bmatrix} U_{b,r}^* U_{a,r} &{} 0 \\ 0 &{} 0 \\ \end{bmatrix}\Vert _F \\&\le 2 \Vert U_b^* U_a - \begin{bmatrix} U_{b,r}^* U_{a,r} &{} 0 \\ 0 &{} 0 \\ \end{bmatrix} \Vert _F \le 4 \alpha + 2 \beta . \end{aligned}$$

Now, writing the optimal SVD-projection \(P_r(U_b^* U_a)\) as \(U_{col} \Sigma _{col} V^*_{col}\), \(U_{col}, V_{col} \in {\mathbb {R}}^{p \times r}\), the canonical vector column matrices \({\bar{U}}_{a}, {\bar{U}}_{b}\) can be expressed as

$$\begin{aligned} {\bar{U}}_{a} = U_a V_{col}, \quad {\bar{U}}_{b} = U_b U_{col}. \end{aligned}$$

Using that, we can bound

$$\begin{aligned} \Vert P_r(U_b^* U_a) - \begin{bmatrix} U_{a, r}^* U_{b, r} &{} 0 \\ 0 &{} 0 \\ \end{bmatrix}\Vert _F&= \Vert U_b(P_r(U_b^* U_a) - \begin{bmatrix} U_{b, r}^* U_{a,r} &{} 0 \\ 0 &{} 0 \\ \end{bmatrix})U_a^*\Vert _F \\&= \Vert U_b U_{col} \Sigma _{col} V_{col}^* U_a^* - U_{b,r} U_{b,r}^* U_{a,r} U_{a,r}^* \Vert _F\\&\ge \Vert {\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{a,r}} P_{U_{b,r}}\Vert _F - \Vert \Sigma _{col} - I\Vert _F. \end{aligned}$$

Furthermore, \(\Sigma _{col}\) is a diagonal matrix with values corresponding to the cosines of the r smallest canonical angles between \(U_a\), \(U_b\). By the structure of \(U_a, U_b\), these canonical angles cannot be larger than those between \(U_{a,r}, U_{b,r}\). Considering the expression, which uses \(\cos \phi \in [0,1]\)

$$\begin{aligned} (1 - \cos \phi )^2 = 1 - 2 \cos \phi + \cos ^2 \phi \le 1 - 2 \cos ^2 \phi + \cos ^2 \phi = \sin ^2 \phi , \end{aligned}$$

and taking into account that

$$\begin{aligned} \Vert P_{U_{a,r}} - P_{U_{b,r}}\Vert _F^2 = \sum _k \sin ^2 \phi _k \end{aligned}$$

where \(\phi _k\) denote the canonical angles between \(U_{a,r}, U_{b,r}\), we obtain \(\Vert \Sigma _{col} - I\Vert _F \le \alpha \), and thus

$$\begin{aligned} \Vert {\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{b,r}} P_{U_{a,r}}\Vert _F \le 5 \alpha + 2 \beta . \end{aligned}$$

Using that \(\Vert P_{U_{a,r}} - P_{U_{b,r}}\Vert _F \le \alpha \), again we have

$$\begin{aligned} \Vert {\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{a,r}}\Vert _F \le 6 \alpha + 2 \beta , \end{aligned}$$
(17)
$$\begin{aligned} \Vert {\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{b,r}}\Vert _F \le 6 \alpha + 2 \beta . \end{aligned}$$
(18)

Then, it suffices to bound

$$\begin{aligned} \Vert {\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{b,r}}\Vert _F&\ge \Vert P_{{\bar{U}}_{b}^{\perp }} ({\bar{U}}_{b} {\bar{U}}_{a}^* - P_{U_{b,r}})\Vert _F = \Vert P_{{\bar{U}}_{b}^{\perp }} P_{U_{b,r}} \Vert _F; \\ \Vert P_{{\bar{U}}_{b}} P_{U_{b,r}}\Vert _F^2&= \Vert P_{U_{b,r}}\Vert _F^2 - \Vert P_{{\bar{U}}_{b}^{\perp }} P_{U_{b,r}} \Vert _F^2 \\&\ge r - (6 \alpha + 2 \beta )^2; \\ \Vert P_{{\bar{U}}_{b}} P_{U_{b,r}}\Vert _F&= \Vert (P_{{\bar{U}}_{b}} P_{U_{b,r}} - P_{U_{b,r}}^2) + P_{U_{b,r}}^2\Vert _F \\&\ge - \Vert P_{{\bar{U}}_{b}} - P_{U_{b,r}}\Vert _F + \Vert P_{U_{b,r}} \Vert _F; \\ \Vert P_{{\bar{U}}_{b}} - P_{U_{b,r}}\Vert _F&\le \sqrt{r} - \sqrt{r - (6 \alpha + 2 \beta )^2}. \end{aligned}$$

Bounding the square root difference, along with \(\Vert P_{{\bar{U}}_{a}} - P_{U_{a,r}}\Vert _F \le \sqrt{r} - \sqrt{r - (6 \alpha + 2 \beta )^2}\) obtained in a similar way, finalizes the proof. \(\square \)
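For readers who wish to numerically check the construction used above, the following numpy sketch builds two nearby r-dimensional subspaces with nearly orthogonal \((p-r)\)-dimensional tails, computes the top-r canonical vector pairs from the SVD of \(U_b^* U_a\), and evaluates the projector distances bounded by the lemma. The setup and all names are illustrative assumptions, not part of the paper's code.

```python
# Illustrative numerical check of the lemma on synthetic subspaces; not the paper's code.
import numpy as np

def orth_cols(A):
    # Orthonormal basis of the column span of A (reduced QR).
    Q, _ = np.linalg.qr(A)
    return Q

def proj(U):
    # Orthogonal projector onto span(U).
    return U @ U.T

rng = np.random.default_rng(0)
n, r, p = 200, 5, 8

# Top-r parts: two close r-dimensional subspaces (small alpha).
U_ar = orth_cols(rng.standard_normal((n, r)))
U_br = orth_cols(U_ar + 1e-3 * rng.standard_normal((n, r)))

# Tails: random (p - r)-dimensional parts inside the respective orthogonal complements.
G = rng.standard_normal((n, p - r))
U_ap = orth_cols(G - U_ar @ (U_ar.T @ G))
H = rng.standard_normal((n, p - r))
U_bp = orth_cols(H - U_br @ (U_br.T @ H))

U_a, U_b = np.hstack([U_ar, U_ap]), np.hstack([U_br, U_bp])
alpha = np.linalg.norm(proj(U_ar) - proj(U_br))   # Frobenius norm by default
beta = np.linalg.norm(U_ap.T @ U_bp)

# Top-r canonical vector pairs from the SVD of U_b^* U_a.
W, s, Zt = np.linalg.svd(U_b.T @ U_a)
Ub_bar, Ua_bar = U_b @ W[:, :r], U_a @ Zt[:r, :].T

print(np.linalg.norm(proj(Ua_bar) - proj(U_ar)), 12 * alpha + 4 * beta)
print(np.linalg.norm(proj(Ub_bar) - proj(U_br)), 12 * alpha + 4 * beta)
```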

We now introduce two assumptions under which the convergence of Algorithm 2 will be analysed.

Assumption 4.1

By Lemmas 3.1 and 2.3, the matrices \(X + E_{a,k}\), \(X + E_{b,k}\) should both have singular subspaces (not necessarily the top-r ones) close to those of the rank-r solution matrix X. Let us denote the corresponding pairs of column bases by \(\{ U_{a,r}, U_{b,r} \}, \{ V_{a,r}, V_{b,r} \}\). Then, with high probability, the top \(p-r\) singular vectors of the remainder matrices

$$\begin{aligned} P_{U_{a,r}^{\perp }} (X + E_{a,k}) P_{V_{a,r}^{\perp }}, P_{U_{b,r}^{\perp }} (X + E_{b,k}) P_{V_{b,r}^{\perp }} \end{aligned}$$

are almost orthogonal to each other.

The assumption is motivated by the observation that

$$\begin{aligned} P_{U_{a(b),r}^{\perp }} (X + E_{a(b),k}) P_{V_{a(b),r}^{\perp }} \approx P_{U^{\perp }} (X + E_{a(b),k}) P_{V^{\perp }} = P_{U^{\perp }} E_{a(b),k} P_{V^{\perp }}, \end{aligned}$$

and, by definition, \(E_{a(b),k} = (\mathcal{I} - \mathcal{A}_{\Omega _{a(b),k}}^* \mathcal{A}_{\Omega _{a(b),k}}) W_k\). Now assume that \(W_k\) is a residual of the original SVP algorithm. Then, expanding the brackets as in the proof of Lemma 2.4, we can write

$$\begin{aligned}&P_{U^{\perp }} W_k P_{V^{\perp }} = P_{U^{\perp }} \Delta P_U (X + E_{a(b),k}) \Delta P_V P_{V^{\perp }} \approx O(\Vert \Delta P_{U(V)}\Vert ^2) = O(\Vert W_{k-1}\Vert ^2). \end{aligned}$$

The top \(p-r\) singular vectors of \(P_{U^{\perp }} E_{a(b),k} P_{V^{\perp }}\) can be characterized via the optimization problem

$$\begin{aligned}&{\arg \max }_{Z, Y \in {\mathbb {H}}} (P_{U^{\perp }} E_{a(b),k} P_{V^{\perp }}, ZY^*)_F\\&\quad = {\arg \max }_{Z, Y \in {\mathbb {H}}} (E_{a(b),k}, P_{U^{\perp }} Z Y^* P_{V^{\perp }})_F \\&\quad = {\arg \max }_{Z, Y \in {\mathbb {H}}, P_U Z = 0, Y P_V = 0} (E_{a(b),k}, ZY^*)_F, \end{aligned}$$

where \({\mathbb {H}} = \{ B \in {\mathbb {R}}^{n \times n}, \; rank(B) \le p - r, \; \Vert B\Vert _2 = 1 \}\). Taking into account that \((E_{a(b),k}, ZY^*)_F = (\mathcal{A}_{\Omega _{a(b),k}} W_k, \mathcal{A}_{\Omega _{a(b),k}} ZY^*)_F - (W_k, ZY^*)_F\), the singular vectors of the matrix \(P_{U^{\perp }} E_{a(b),k} P_{V^{\perp }}\) are thus the vectors that produce the largest discrepancy between the \(\mathcal{A}\)-projected and non-projected versions of a scalar product of two almost mutually orthogonal matrices. The assumption rests on the intuition that, since the scalar product \((W_k, ZY^*)_F\) is relatively small, the corresponding vectors are determined mainly by the operator \(\mathcal{A}\), and are therefore essentially random among all vectors orthogonal to \(U\) and \(V\), respectively.

Assumption 4.2

The addition of random sparse matrices \(S_{a,k}, S_{b,k}\) with not more than \(p-r\) nonzero elements does not damage the properties of Assumption 4.1. It is supposed that, with high probability, the top \(p-r\) singular vectors of the remainder matrices

$$\begin{aligned} P_{U_{a,r}^{\perp }} (X + E_{a,k} + S_{a,k}) P_{V_{a,r}^{\perp }}, P_{U_{b,r}^{\perp }} (X + E_{b,k} + S_{b,k}) P_{V_{b,r}^{\perp }} \end{aligned}$$

are also almost orthogonal to each other.

The matrices \(S_{a,k}, S_{b,k}\) are assumed to be very sparse: if \(\rho = O(\frac{\log n}{n})\), then by Lemma 4.2 they both have \(O(\log n)\) nonzero elements. Thus, \(S_{a,k}, S_{b,k}\) can be viewed as low-rank matrices whose singular vectors are close to one-element unit vectors with random positions of the nonzero element. Since each support occupies at most \(p-r\) of the n possible positions, the probability that the left (right) singular vectors of \(S_{a,k}\) and \(S_{b,k}\) share a nonzero index can be roughly estimated by a union bound as

$$\begin{aligned} O\left( n \left( \frac{p-r}{n}\right) ^2\right) = O \left( \frac{\log ^2(n)}{n}\right) \rightarrow 0, n \rightarrow \infty . \end{aligned}$$

As the singular vectors of \(S_{a(b),k}\) are sparse, and the bases \(U, V\) are incoherent, each singular vector of \(X + E_{a(b),k} + S_{a(b),k}\) should be either close to a singular vector of X, or close to a singular vector of \(S_{a(b),k}\), or essentially random (if \(\Vert E_{a(b),k}\Vert \gg \Vert S_{a(b),k}\Vert \)), which explains the assumption.

Now let us finalize the analysis of the ‘Twin Completion’ Algorithm 2 with the following theorem.

Theorem 4.1

(Twin completion convergence) Let \(X_{k-1}\) be the current iterate of the proposed ‘twin completion’ algorithm, and let X be the unknown matrix. Using the notation introduced in Algorithm 2, assume that both \(X_{k-1}, W_{k-1}\) are \(\mu \)-incoherent. Assume that Lemma 2.1 holds for all scalar products of \(\mu \)-incoherent matrices with the operators \(\mathcal{A}_{\Omega _a}, \mathcal{A}_{\Omega _b}\), and that Lemma 3.1 holds for \(\mu \)-incoherent \(U, V\). Let Assumptions 4.1 and 4.2 above hold with an upper bound of \(\frac{\epsilon }{\Vert X\Vert _2}\) on the scalar products. Let the second assumption variant of Lemma 2.3 hold for the matrix X and the two perturbations \((E_{a,k} + S_{a,k}), (E_{b,k} + S_{b,k})\) with \(\gamma \ge C_{\gamma } \Vert X\Vert _2\). Then,

$$\begin{aligned} \Vert W_k\Vert _F \le {\hat{f}}(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + {\hat{g}}(\mu ,r) \epsilon , \end{aligned}$$

where \({\hat{f}}, {\hat{g}}\) are polynomial functions.

Proof

By Lemma 2.3, the two projected gradient steps of the proposed ‘twin completion’ algorithm can be expressed as

$$\begin{aligned} P_p(X + E_{a,k} + S_{a,k})&= P_{U_{a,r}} (X + E_{a,k} + S_{a,k}) P_{V_{a,r}} + \\&\quad + P_{U_{a,p - r}} (X + E_{a,k} + S_{a,k}) P_{V_{a,p - r}}, \\ P_p(X + E_{b,k} + S_{b,k})&= P_{U_{b,r}} (X + E_{b,k} + S_{b,k}) P_{V_{b,r}} + \\&\quad + P_{U_{b,p - r}} (X + E_{b,k} + S_{b,k}) P_{V_{b,p - r}}, \end{aligned}$$

where \(\Vert P_{U_{a(b),r}} - P_U\Vert _F \le \frac{2 \eta }{\gamma }\), and

$$\begin{aligned} \eta = \max _{a,b}\{\Vert U^* (E_{a(b),k} + S_{a(b),k}) \Vert _F, \Vert (E_{a(b),k} + S_{a(b),k}) V \Vert _F \}. \end{aligned}$$

Using Lemmas 2.2, 3.1, it is seen that

$$\begin{aligned} \eta \le \left( \delta _{4r} + 2C\beta \mu r \sqrt{\frac{\mu }{n}}\right) \Vert W_{k-1}\Vert _F. \end{aligned}$$

Since, by Theorem 2.1, it can be assumed that \(1 \ge \frac{C_{RIP} \mu ^2 r^2 \log n}{\delta _{4r}^2 n}\), it follows that

$$\begin{aligned} \eta \le D \mu \delta _{4r} \Vert W_{k-1}\Vert _F, D = const. \end{aligned}$$

Now, we can apply Lemma 4.4 with

$$\begin{aligned} \Vert P_{U_{a,r}} - P_{U_{b,r}}\Vert _F&\le 2 {\hat{D}} \mu r \delta _{4r} \frac{2 \Vert W_{k-1}\Vert _F}{\Vert X\Vert _2}, {\hat{D}} = D C_{\gamma }, \\ \Vert U_{b,p - r}^* U_{a,p - r}\Vert _F&\le \frac{\epsilon }{\Vert X\Vert _2}, \end{aligned}$$

and obtain

$$\begin{aligned} \Vert P_{{\bar{U}}_{a(b)}} - P_{U_{a(b),r}} \Vert _F&\le \frac{f(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + g(\mu , r) \epsilon }{\Vert X\Vert _2}, \\ \Vert P_{{\bar{V}}_{a(b)}} - P_{V_{a(b),r}} \Vert _F&\le \frac{f(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + g(\mu , r) \epsilon }{\Vert X\Vert _2}, \end{aligned}$$

where \(f, g\) denote some polynomial functions. The theorem then follows by observing that

$$\begin{aligned} \Vert P_{{\bar{U}}_{a(b)}} (X_{a(b),k}) P_{{\bar{V}}_{a(b)}} - X\Vert _F&= \Vert P_{{\bar{U}}_{a(b)}}(X + E_{a(b), k} + S_{a(b),k}) P_{{\bar{V}}_{a(b)}} - X\Vert _F \\&\le 3 \frac{f(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + g(\mu , r) \epsilon }{\Vert X\Vert _2} \Vert X + E_{a(b),k} + S_{a(b),k}\Vert _2 \\&\quad + \Vert P_U (E_{a(b),k} + S_{a(b),k}) P_V\Vert _F \\&\le {\hat{f}}(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + {\hat{g}}(\mu ,r) \epsilon ; \\ \Vert W_k\Vert _F&= \Vert P_r(\frac{P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} + P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}}}{2}) - X\Vert _F \\&\le \Vert P_r(\frac{P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} + P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}}}{2}) - \frac{P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} + P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}}}{2} \Vert _F \\&\quad + \Vert \frac{P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} + P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}}}{2} - X\Vert _F \\&\le 2 \Vert \frac{P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} + P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}}}{2} - X\Vert _F \\&\le \Vert P_{{\bar{U}}_{a}} (X_{a,k}) P_{{\bar{V}}_{a}} - X\Vert _F + \Vert P_{{\bar{U}}_{b}} (X_{b,k}) P_{{\bar{V}}_{b}} - X\Vert _F \\&\le 2({\hat{f}}(\mu , r) \delta _{4r} \Vert W_{k-1}\Vert _F + {\hat{g}}(\mu ,r) \epsilon ). \end{aligned}$$

\(\square \)
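As a remark on interpreting the bound (not part of the original proof): assuming that \({\hat{f}}(\mu , r) \delta _{4r} < 1\), unrolling the recurrence over k iterations gives

$$\begin{aligned} \Vert W_k\Vert _F \le \left( {\hat{f}}(\mu , r) \delta _{4r}\right) ^k \Vert W_0\Vert _F + \frac{{\hat{g}}(\mu , r)}{1 - {\hat{f}}(\mu , r) \delta _{4r}} \, \epsilon , \end{aligned}$$

so the iterates converge geometrically to a neighborhood of the unknown matrix X whose radius is proportional to \(\epsilon \).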

5 Numerical experiments

5.1 Artificial experiments

In order to numerically verify the convergence theory and to compare the proposed one-mask and two-mask procedures, Algorithms 1 and 2 were tested on artificial data matrices that fit the required low-rank plus sparse structure exactly. The test matrices were constructed in the following way (a generation sketch in code is given after the list):

  • Two matrices with orthonormal columns \(U, V \in {\mathbb {R}}^{n \times r}\) were generated as the top-r singular bases of a random matrix filled with independent zero-mean Gaussian entries. Such columns empirically have logarithmic incoherence and are therefore suitable for the original SVP completion algorithm.

  • The solutions X were built as products of the form \(U \Sigma V^*\), where \(\Sigma \) is a diagonal real matrix with positive entries such that the singular values of X are controlled and decay with one of the following relations:

    • \(\sigma _k = \frac{1}{k}\), which models a well-conditioned problem;

    • \(\sigma _k = \frac{1}{3^k}\), which models an ill-conditioned problem.

  • The sparse part \({\hat{S}}\) was generated by making each element of the matrix nonzero independently with probability \(\frac{\beta }{n}\); the nonzero values were drawn independently from a Gaussian distribution and scaled in such a way that \(\frac{\Vert {\hat{S}} \Vert _F}{\Vert X \Vert _F} \approx 0.3\).
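The following short numpy sketch illustrates this construction; the function name and its defaults are ours and are given only for illustration, they are not taken from the paper's experimental code.

```python
# A minimal sketch of the artificial test-matrix generation described above.
import numpy as np

def make_test_matrix(n=1024, r=10, beta=10.0, decay="well", rng=None):
    rng = np.random.default_rng() if rng is None else rng

    # Orthonormal factors U, V: top-r singular bases of a Gaussian random matrix.
    U, _, Vt = np.linalg.svd(rng.standard_normal((n, n)), full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T

    # Controlled singular value decay: 1/k (well-conditioned) or 1/3^k (ill-conditioned).
    k = np.arange(1, r + 1)
    sigma = 1.0 / k if decay == "well" else 1.0 / 3.0**k
    X = (U * sigma) @ V.T

    # Sparse part: each entry nonzero with probability beta / n, Gaussian values,
    # rescaled so that ||S||_F / ||X||_F is approximately 0.3.
    mask = rng.random((n, n)) < beta / n
    S = np.where(mask, rng.standard_normal((n, n)), 0.0)
    S *= 0.3 * np.linalg.norm(X) / np.linalg.norm(S)

    return X, S
```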

For improved stability, heuristic sequential rank-increase and step-size control techniques, similar to those discussed in [15], were implemented in both the one-mask and the two-mask algorithms; that is, the approximation starts from a rank-one iterate \(X_0\), and the rank of the iterate \(X_k\) is gradually increased up to r as the iteration number k grows.

One such heuristic is residual tracking [15]: if the relative residual ratio \(\frac{\Vert W_k + S_k \Vert _F}{\Vert W_{k - 1} + S_{k - 1} \Vert _F}\) is smaller than a threshold \(\lambda = 0.99 < 1\), the rank is kept unchanged and the step size is slowly increased; otherwise the rank is increased and the step size is reduced.

In our experiments, similar heuristics are used for handling the sparse part in order to control the size of the ‘excluded’ set \(\Lambda \); the set \(\Upsilon _k\) is reset to the empty set on each iteration that fulfills the steady convergence condition \(\frac{\Vert W_k + S_k \Vert _F}{\Vert W_{k - 1} + S_{k - 1} \Vert _F} \le \lambda \). If the relative residual ratio exceeds \(\lambda \), a decision has to be made whether to increase the rank of the approximation or the size of the sparse exclusion set \(\Lambda \). The two-mask algorithm admits a simple decision rule (a sketch is given below): if the first ‘tail’ canonical angle \(\phi _{r + 1}\) is smaller than a predefined constant (\(\frac{\pi }{4}\) was used in the experiments), then the rank is increased; otherwise the normally computed \(\Upsilon _k\) is added to the sparse exclusion set \(\Lambda \).
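A minimal sketch of this decision rule, assuming the residual norms and the tail canonical angle are already available from the iteration; the function and variable names are illustrative and are not part of the paper's implementation.

```python
# Sketch of the rank/sparsity decision heuristic described above (illustrative only).
import numpy as np

def decide_step(res_k, res_prev, phi_tail, lam=0.99, angle_threshold=np.pi / 4):
    """Return which quantity to enlarge on this iteration.

    res_k, res_prev : Frobenius norms of W_k + S_k and W_{k-1} + S_{k-1}
    phi_tail        : first 'tail' canonical angle phi_{r+1} between the two masks' factors
    """
    if res_k <= lam * res_prev:
        # Steady convergence: keep rank and exclusion set, clear Upsilon_k.
        return "keep"
    if phi_tail < angle_threshold:
        # Tail subspaces still agree: the low-rank part is under-resolved.
        return "increase_rank"
    # Tail subspaces disagree: attribute the stagnation to sparse errors.
    return "enlarge_excluded_set"
```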

With this heuristic sparse-part enlargement procedure, the algorithm is quite flexible with respect to the choice of the parameter C, which defines the size of the set \(\Upsilon _k\) of indices added to the mask of excluded elements \(\Lambda \) at once. While the theoretical analysis suggests \(C > 1\), smaller values of 0.2–0.5 can be used in practice without drawbacks (though, by the definition of C, the set \(\Lambda \) then has to be enlarged over at least \(O(\frac{1}{C})\) iterations, so exceedingly small values of C should be avoided).

As the one-mask algorithm cannot use canonical angles, thresholding the incoherence of the current approximation \(X_k\) was used as the analogous decision criterion in that algorithm.

The following list describes the parameters of the numerical experiments in detail.

  • Common scenario. \(n = 1024, r = 10, \sigma _k = \frac{1}{1 + k}, p = 3, \rho = 0.25, \beta = 10.0, C = 1\). The low-rank part is of rank 10, the sparse part contains about 10 nonzero elements per row on average, and the sparse-part enlargement coefficient C is set to one, which means that the true number of elements in the unknown sparse matrix \({\hat{S}}\) corresponds directly to the size of the index set \(\Upsilon _k\) added to the mask of locked matrix elements \(\Lambda \) at a time. The resulting graphs are provided in Fig. 1. The graphs include the residual and the sines of the column canonical angles: the canonical angles numbered from 1 to r converge at the same rate as the residual, while the next angle, \(r + 1\), stays high throughout all the iterations, as intended by the algorithm design. The ‘teeth-shaped’ sharp peaks in the \(\sin (\phi _r(U_{a,k}, U_{b, k}))\) graph at early iterations are caused by the heuristic rank-increase procedure (the rank is initialized at 1 and is increased step by step up to the true rank r). The convergence properties of the one-mask and two-mask Algorithms 1 and 2 are indistinguishable in this scenario. The row canonical angles behave in the same way as the column canonical angles, here and in all further experiments.

  • Smoothed scenario. \(n = 1024, r = 10, \sigma _k = \frac{1}{1 + k}, p = 3, \rho = 0.25, \beta = 10.0, C = 0.01\). The previous scenario shows that the residual graph is piecewise smooth, with the smooth intervals corresponding to the iteration intervals during which the set \(\Lambda \) is constant: each time a new sparse exclusion is made, the residual falls rapidly for a few iterations and then settles at a lower value again. This behavior can be controlled by the parameter C, which defines the number of indices added to the sparse approximation mask \(\Lambda \) at once. Fig. 2 shows the results for \(C = 0.01\): the residual graph is smoother, but the overall convergence is slower.

  • Ill-conditioned scenario. \(n = 1024, r = 10, \sigma _k = \frac{1}{3^k}, p = 3, \rho = 0.25, \beta = 10.0, C = 1\). In this scenario, the low-rank unknown matrix is designed to have rapidly decaying singular values, which causes differences between the one-mask and two-mask algorithm performance. In this case, the low-rank column and row factors obtained by the one-mask Algorithm 1 commonly lose the incoherence properties, which causes the algorithm to stagnate: new elements are added to the mask of excluded elements until there are not enough elements left for completion. The performance graphs are provided in Fig. 3; the factor incoherence is measured as a percentage of the maximum value possible by definition, and the ‘Matrix locked’ graph shows the size of the set \(\Lambda \) compared to \(n^2\). The two-mask Algorithm 2 manages to keep the low-rank factor incoherence low and maintain convergence; the corresponding graphs are provided in Fig. 4.

  • High-rank scenario. \(n = 1024, r = 30, \sigma _k = \frac{1}{1 + k}, p = 7, \rho = 0.25, \beta = 10.0, C = 1\). In the case of a well-conditioned matrix of higher rank, the convergence (for both the one-mask and the two-mask algorithm) is stable, yet the sequential rank-increase procedure may take a significant number of iterations. The corresponding graph is provided in Fig. 5.

Fig. 1 Common scenario algorithm performance

Fig. 2 Smoothed scenario algorithm performance

Fig. 3 Ill-conditioned scenario one-mask algorithm performance

Fig. 4 Ill-conditioned scenario two-mask algorithm performance

Fig. 5 High-rank scenario two-mask algorithm performance

5.2 Application to integral equations

The proposed algorithms were applied to the problem of compressing a structured large-scale dense linear system that stems from a finite-element discretization of integral equations arising in scattering problems on metasurfaces. A metasurface consists of conductor and dielectric parts, arranged in the form of blocks of the same shape; an example is shown in Fig. 6.

Fig. 6 Metasurface with a finite-element grid

The considered linear system is based on a discretization of Maxwell's equations using RWG-type basis functions; thus, two potential operators are involved, which correspond to the electric \(({\mathcal {E}}(p))(x)\) and the magnetic \(({\mathcal {M}}(p))(x)\) fields of the metasurface:

$$\begin{aligned} ({\mathcal {E}}(p))(x)&= ik \int \limits _{\Gamma } p(y) G_k(x, y) dy - \dfrac{1}{ik} \nabla _x \int \limits _{\Gamma } \nabla _{\Gamma } \cdot p(y) G_k(x, y) dy, \\ ({\mathcal {M}}(p))(x)&= \nabla _x \times \int \limits _{\Gamma } p(y) G_k(x, y) dy, \end{aligned}$$

where \(G_k(x, y)\) is the Green’s function for the Helmholtz equation with wavenumber k

$$\begin{aligned} G_k(x, y) = G_k(r) = \dfrac{-e^{-ikr}}{4\pi r}, ~~~ r = \Vert x - y\Vert _2. \end{aligned}$$
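As a purely illustrative helper (not part of the paper's solver), the Green's function above can be evaluated pointwise as follows; the function name is ours.

```python
# Illustrative pointwise evaluation of the Helmholtz Green's function G_k defined above.
import numpy as np

def green_helmholtz(x, y, k):
    """G_k(x, y) = -exp(-i k r) / (4 pi r), with r = ||x - y||_2."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return -np.exp(-1j * k * r) / (4.0 * np.pi * r)
```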

A notable property of the considered equations is their invariance under parallel translation, which, together with the metasurface structure, gives (up to a small number of ‘block edge’ related equations) a block Toeplitz–Toeplitz structure to the considered linear system, where each matrix block characterizes the physical relations between two corresponding metasurface parts. This means that if the metasurface is based on an \(M_1 \times M_2\) two-dimensional grid of similar blocks, and n is the block size, it is sufficient to store \((2 M_1 - 1) (2 M_2 - 1)\) blocks of size \({\mathbb {C}}^{n \times n}\) that fully define the original matrix of size \({\mathbb {C}}^{M_1 M_2 n \times M_1 M_2 n}\).
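As a concrete illustration for the \(4 \times 4\) configuration used in the experiments below, the translation invariance alone reduces the number of stored \(n \times n\) blocks from \((M_1 M_2)^2\) to \((2 M_1 - 1)(2 M_2 - 1)\):

$$\begin{aligned} M_1 = M_2 = 4: \quad (2 \cdot 4 - 1)(2 \cdot 4 - 1) = 49 \ll (4 \cdot 4)^2 = 256, \end{aligned}$$

i.e. roughly a fivefold storage reduction before any per-block low-rank plus sparse compression is applied.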

In this work, it was further attempted to compress the linear system matrix by constructing low-rank plus sparse approximations of the underlying \(n \times n\) blocks, namely those that correspond to the relations between non-neighbor metasurface parts.

The experiments were carried out with a \(4 \times 4\) metasurface discretization and a block size of \(n = 2542\), which corresponds to a complex dense linear system of size 40672. Table 1 summarizes the compression results relative to storing the matrix as dense. The first line corresponds to the plain Toeplitz–Toeplitz structured matrix storage, taking the system symmetry into account. The next two lines correspond to further compression of the underlying dense blocks with low-rank plus sparse structures, using the ‘two-mask’ algorithm proposed in the paper, with the rank and sparsity parameters adapted to achieve a predefined Frobenius-norm approximation accuracy of the full matrix.

Table 1 Large-scale dense matrix compression quality table

6 Discussion

As suggested by the theory, the provided numerical results show that the proposed two-mask ‘twin completion’ Algorithm 2 performs stably in ill-conditioned cases where the simpler ‘one-mask’ Algorithm 1 commonly loses the required incoherence properties and thus either diverges or greatly enlarges the sparse ‘excluded’ part of the output low-rank plus sparse approximation.

It is notable that in all numerical experiments the ‘twin completion’ algorithm offered ‘exact’ convergence: if an iterate \(X_k\) was taken after a sufficient number of iterations had passed and the residual \(X - X_k\) was computed, the locations of the largest residual elements \(|X - X_k |_{ij}\) fully matched the locations of the nonzero elements of the true \({\hat{S}}\).

However, in practice, the elements that are nonzero in \({\hat{S}}\) are generally not the first to enter the set of ‘excluded’ elements \(\Lambda \) as the iterations proceed. This effect is most clearly seen when the norm of \({\hat{S}}\) is much smaller than the norm of X, since the first few iterations of both Algorithms 1 and 2 then proceed as if no sparse errors were present, and the set \(\Lambda \) is initialized with index pairs that are independent of \({\hat{S}}\).