1 Introduction

Matrix completion and affine rank minimization are important research problems arising from numerous applications in various fields including compressive sensing, signal processing, machine learning, computer vision and control [6, 7, 18]. A simple and efficient first-order method for solving these problems is the singular value thresholding (SVT) algorithm introduced in [5].

Let \(\mathcal {A}\) be a linear transformation mapping \(n_1\times n_2\) matrices to \(\mathbb {R}^m\) and \(b\in \mathbb {R}^m\). SVT aims to find a low-rank solution to the linear system \(\mathcal {A}(X)=b\) by iteratively producing a sequence of matrix pairs \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) as

$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{k+1} = Y^k+\delta _k\mathcal {A}^*(b-\mathcal {A}(X^k)),&{} \\ X^{k+1} = \mathcal {D}_\tau (Y^{k+1}), &{} \end{array}\right. } \end{aligned}$$
(1)

where \(\mathcal {A}^*\) denotes the adjoint of \(\mathcal {A}\), \(X^1=Y^1\) is the zero matrix in \(\mathbb {R}^{n_1\times n_2}\) and \(\{\delta _k\}_{k\in \mathbb {N}} \) is a sequence of positive step sizes. Here \(\mathcal {D}_\tau \) is the soft-thresholding (singular value shrinkage) operator at level \(\tau >0\), to be defined in (4) below, which acts on the matrix \(Y^{k+1}\) to produce a low-rank approximation \(X^{k+1} = \mathcal {D}_\tau (Y^{k+1})\). Owing to its ability to produce low-rank iterates via the soft-thresholding operator, SVT was shown to be extremely efficient at addressing problems with low-rank optimal solutions such as recommender systems [5]. It was shown in [5] that SVT is equivalent to the gradient descent algorithm applied to the dual problem of

$$\begin{aligned} \min _{X\in \mathbb {R}^{n_1\times n_2}}\ \Big [\Psi (X):=\tau \Vert X\Vert _*+\frac{1}{2}\Vert X\Vert _F^2\Big ]\quad \text {subject to } \ \mathcal {A}(X)=b, \end{aligned}$$
(2)

where \(\Vert X\Vert _*=\Vert \sigma (X)\Vert _1\) and \(\Vert X\Vert _F=\Vert \sigma (X)\Vert _2\) are the nuclear norm and Frobenius norm of X, respectively. Here \(\sigma (X)\) denotes the vector of all singular values of X in nonincreasing order and \(\Vert x\Vert _p=[\sum _{i=1}^d|x_i|^p]^{\frac{1}{p}}\) denotes the \(\ell _p\)-norm of \(x=(x_i)^d_{i=1}\in \mathbb {R}^d\). Based on this interpretation, it was further shown that the sequence \(\{X^k\}\) converges to the unique solution \(X^\star \) of the optimization problem (2) with the error satisfying \(\sum _{k=1}^{\infty }\Vert X^k-X^\star \Vert _F^2<\infty \), provided that the linear system \(\mathcal {A}(X)=b\) is consistent and that the step sizes are bounded away from zero and from above, satisfying \(0<\inf _k\delta _k\le \sup _k\delta _k<\frac{2}{\Vert \mathcal {A}\Vert ^2}\), where \(\Vert \mathcal {A}\Vert \) is the operator norm of \(\mathcal {A}\) defined by \(\Vert \mathcal {A}\Vert =\sup \limits _{X\in \mathbb {R}^{n_1\times n_2}:\Vert X\Vert _F\le 1}\Vert \mathcal {A}(X)\Vert _2.\)
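As an illustration, the following Python sketch runs the iteration (1) on a synthetic matrix completion instance with a hypothetical entry-sampling operator \(\mathcal {A}\), for which \(\Vert \mathcal {A}\Vert =1\); the problem sizes, \(\tau \) and the constant step size are illustrative choices rather than values prescribed here, and \(\mathcal {D}_\tau \) is implemented via the singular value decomposition as in (4) below.

```python
# Minimal sketch of the SVT iteration (1) for an entry-sampling operator A
# (matrix completion). All sizes and parameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 60, 50, 3
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # low-rank target
mask = rng.random((n1, n2)) < 0.4                                # observed entries
b = M[mask]                                                      # b = A(M)

def A(X):            # sampling operator: R^{n1 x n2} -> R^m
    return X[mask]

def A_star(y):       # its adjoint: place observed entries into a zero matrix
    Z = np.zeros((n1, n2))
    Z[mask] = y
    return Z

def shrink(Y, tau):  # singular value shrinkage D_tau, cf. (4) below
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

tau, delta = 5.0, 1.0        # ||A|| = 1 here, so delta < 2 is admissible
X = np.zeros((n1, n2))       # X^1 = 0
Y = np.zeros((n1, n2))       # Y^1 = 0
for k in range(300):
    Y = Y + delta * A_star(b - A(X))   # Y^{k+1} = Y^k + delta_k A^*(b - A(X^k))
    X = shrink(Y, tau)                 # X^{k+1} = D_tau(Y^{k+1})

print("relative residual:", np.linalg.norm(A(X) - b) / np.linalg.norm(b))
print("rank of the final iterate:", np.linalg.matrix_rank(X, tol=1e-6))
```

For a sampling operator the adjoint simply places the observed entries back into a zero matrix, so the dominant cost of each iteration is the singular value decomposition inside \(\mathcal {D}_\tau \).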

In this paper, we refine the existing convergence analysis of SVT in terms of both convergence conditions and convergence rates. We shall show that \(\{X^k\}\) converges to the unique solution \(X^\star \) of the optimization problem

$$\begin{aligned} \min _{X\in \mathbb {R}^{n_1\times n_2}}\Psi (X)\quad \text {subject to }\ \mathcal {A}(X)=b_0, \end{aligned}$$
(3)

with respect to the Bregman distance if and only if the step size sequence \(\{\delta _k\}_{k\in \mathbb {N}}\) satisfies \(\sum _{k=1}^{\infty }\delta _k=\infty \), under the mild assumption that the orthogonal projection \(b_0\) of b onto the range of \(\mathcal {A}\) is nonzero. This gives a precise characterization of the convergence of SVT, whereas only sufficient conditions for the convergence of SVT were considered in the literature. We then establish the convergence rate \(\Vert X^{T+1}-X^\star \Vert _F^2=O(\frac{1}{\sum _{k=1}^{T}\delta _k})\), which gives the order \(O(\frac{1}{T})\) in the general case \(0<\inf _k\delta _k\le \sup _k\delta _k<\frac{2}{\Vert \mathcal {A}\Vert ^2}\). This improves the previous convergence result \(\sum _{k=1}^{\infty }\Vert X^k-X^{\star }\Vert _F^2<\infty \), obtained under the same condition with no explicit convergence rate [5]. Our convergence rate analysis is based on a key identity relating the Bregman distance between \(X^T\) and \(X^\star \) to the excess objective value of the dual problem of (3) in gradient descent at step T. Our derivation of the necessary condition \(\sum _{k=1}^{\infty }\delta _k=\infty \) is based on a novel error decomposition for the excess Bregman distance after interpreting SVT as a specific mirror descent algorithm with a non-differentiable mirror map. The basic idea behind this error decomposition is to bound the Bregman distance between \(X^k\) and \(X^\star \) from below by making full use of the smoothness of the objective function. The new interpretation of SVT also opens the door to studying SVT in the mirror descent framework [2, 12]. Notice that the above definition of \(b_0\) also allows us to remove the assumption on the consistency of the linear system \(\mathcal {A}(X)=b\) considered in the literature.

2 Main Results

Before stating our main results, we define the operator \(\mathcal {D}_\tau \). Let \(Y=U\Sigma V^*\) be a singular value decomposition of a matrix \(Y\in \mathbb {R}^{n_1\times n_2}\) of rank r, where U and V are \(n_1\times r\) and \(n_2\times r\) matrices with orthonormal columns, respectively, and \(\Sigma =\text {diag}(\{\sigma _1,\ldots ,\sigma _r\})\) is the \(r\times r\) diagonal matrix with the main diagonal entries \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _r>0\) being the positive singular values of Y. The singular value shrinkage operator \(\mathcal {D}_\tau \) at level \(\tau \) is defined [5] by

$$\begin{aligned} \mathcal {D}_\tau (Y)=U\mathcal {D}_\tau (\Sigma )V^*, \end{aligned}$$
(4)

where

$$\begin{aligned} \mathcal {D}_\tau (\Sigma )=\text {diag}\big (\{(\sigma _1-\tau )_+,\ldots ,(\sigma _r-\tau )_+\}\big ) \end{aligned}$$

and \((t)_+=\max (0,t)\).
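As a quick numerical sanity check (with illustrative sizes and \(\tau \)), the following snippet verifies that the singular values of \(\mathcal {D}_\tau (Y)\) are exactly the soft-thresholded values \((\sigma _i-\tau )_+\).

```python
# Check that D_tau(Y) in (4) soft-thresholds the singular values of Y.
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 6))
tau = 0.7

U, s, Vt = np.linalg.svd(Y, full_matrices=False)       # s is nonincreasing
D_tau_Y = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

s_new = np.linalg.svd(D_tau_Y, compute_uv=False)
assert np.allclose(s_new, np.maximum(s - tau, 0.0), atol=1e-10)
print("singular values of Y:        ", np.round(s, 3))
print("singular values of D_tau(Y): ", np.round(s_new, 3))
```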

Observe from the definition (3) of \(X^\star \) that \(X^\star =0\) is equivalent to \(b_0=0\). Since \(b_0\) is the projection of b onto the range of \(\mathcal {A}\), we know that \(b-b_0\) is orthogonal to the range of \(\mathcal {A}\) and therefore \(\mathcal {A}^*(b -b_0)=0\). So from the definition (1) of SVT, we see that in this special case, for any choice of the step size sequence, \(X^k=0\) and \(Y^k =0\) for all \(k\in \mathbb {N}\), and the convergence holds trivially.

Our first main result provides a necessary and sufficient condition for the convergence of \(\{X^k\}\) to \(X^\star \) with respect to the Bregman distance when the trivial case \(b_0 =0\) is excluded. We denote by \(\left\langle X,Y\right\rangle = \sum _{i=1}^{n_1}\sum _{j=1}^{n_2}X_{ij}Y_{ij}\) the standard inner product between the matrices \(X=(X_{ij}) \in \mathbb {R}^{n_1\times n_2}\) and \(Y=(Y_{ij}) \in \mathbb {R}^{n_1\times n_2}\), and define the subdifferential of a function \(f:\mathbb {R}^{n_1\times n_2}\rightarrow \mathbb {R}\) at \(X \in \mathbb {R}^{n_1\times n_2}\) as

$$\begin{aligned} \partial f(X)=\{Y\in \mathbb {R}^{n_1\times n_2}: f(\widetilde{X})\ge f(X)+\left\langle \widetilde{X}-X,Y\right\rangle , \ \forall \widetilde{X}\in \mathbb {R}^{n_1\times n_2}\}. \end{aligned}$$

If f is convex, the Bregman distance between X and \(\widetilde{X}\) under f and \(\widetilde{Y}\in \partial f(\widetilde{X})\) is defined as

$$\begin{aligned} D_f^{\widetilde{Y}}(X,\widetilde{X})=f(X)-f(\widetilde{X})-\left\langle X-\widetilde{X},\widetilde{Y}\right\rangle . \end{aligned}$$

If f is differentiable, then \(\partial f(X)\) consists of the single element \(\nabla f(X)\), the gradient of f at X.

Now we can state our first main result as follows.

Theorem 1

Let \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) be produced by (1) and \(b_0 \not =0\). Then the following statements hold.

  1. (a)

    If \(\sup _k\delta _k<\frac{1}{2\Vert \mathcal {A}\Vert ^2}\), then

    $$\begin{aligned} \lim _{T\rightarrow \infty } D_\Psi ^{Y^{T}}(X^\star ,X^{T}) =0 \text { if and only if } \sum _{k=1}^{\infty }\delta _k =\infty . \end{aligned}$$
  2. (b)

    If \(\sup _k\delta _k<\frac{2}{\Vert \mathcal {A}\Vert ^2}\), then

    $$\begin{aligned} \left\| X^{T+1}- X^\star \right\| _F^2 \le \widetilde{C} \Big [\sum _{k=1}^{T}\delta _k\Big ]^{-1}, \quad \forall T\in \mathbb {N}, \end{aligned}$$

    where \(\widetilde{C}\) is a constant independent of T.

The necessity part of (a) of Theorem 1 will be proved by Proposition 5 in Sect. 3 while the sufficiency part of (a) and (b) follows from Proposition 9 in Sect. 4. We see from Theorem 1 that when \(0<\inf _k\delta _k\le \sup _k\delta _k<\frac{2}{\Vert \mathcal {A}\Vert ^2}\), there holds \(\left\| X^{T+1}- X^\star \right\| _F^2 =O(1/T)\). Theorem 1 also applies to the linearized Bregman iteration for compressive sensing [4, 22].

Our second main result, to be proved in Sect. 3, is a monotonicity property of the sequence \(\{X^k\}\) in terms of the least squares error F(X), often used in learning theory and defined for \(X\in \mathbb {R}^{n_1\times n_2}\) by \(F(X)=\frac{1}{2}\Vert \mathcal {A}(X)-b\Vert _2^2\).

Theorem 2

Let \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) be produced by (1) with the step-size sequence \(\{\delta _k\}_{k\in \mathbb {N}}\) satisfying \(0<\delta _k\le \frac{1}{\Vert \mathcal {A}\Vert ^{2}}\) for every \(k\in \mathbb {N}\). Then the following statements hold.

  1. (a)

    \(F(X^{k+1})\le F(X^k)\) for \(k\in \mathbb {N}\).

  2. (b)

    \(X^\star \) is a minimizer of F over \(\mathbb {R}^{n_1 \times n_2}\).

  3. (c)

    The following inequality holds for all \(T\in \mathbb {N}\)

    $$\begin{aligned} F(X^{T+1})-F(X^\star ) = \frac{1}{2} \Vert \mathcal {A}(X^{T+1}-X^\star )\Vert _2^2 \le \Psi (X^\star )\Big [\sum _{k=1}^{T}\delta _k\Big ]^{-1}. \end{aligned}$$
    (5)

Some of the ideas behind the above results can be used to analyze other thresholding algorithms such as those derived from spectral algorithms [1, 8, 9]. It would be interesting to carry out a learning theory analysis [14, 15, 20, 21] of SVT algorithms in a noisy setting.
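As a rough numerical illustration of Theorem 2 (not part of the analysis), the following Python sketch runs SVT on a synthetic matrix completion instance with a hypothetical sampling operator, for which \(\Vert \mathcal {A}\Vert =1\) and the system \(\mathcal {A}(X)=b\) is consistent. Since \(X^\star \) is not available in closed form, the script checks the weaker consequence \(F(X^{T+1})\le \Psi (M)[\sum _{k=1}^T\delta _k]^{-1}\) of (5), which follows because \(F(X^\star )=0\) and \(\Psi (X^\star )\le \Psi (M)\) for the feasible ground-truth matrix M; all sizes and parameters are illustrative.

```python
# Numerical illustration of Theorem 2: monotonicity of F and the bound (5).
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 40, 30, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = rng.random((n1, n2)) < 0.5
b = M[mask]                              # consistent: b = A(M), so b0 = b

def A(X):
    return X[mask]

def A_star(y):
    Z = np.zeros((n1, n2)); Z[mask] = y; return Z

def shrink(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def F(X):                                # least squares error
    return 0.5 * np.sum((A(X) - b) ** 2)

def Psi(X, tau):                         # Psi(X) = tau ||X||_* + 0.5 ||X||_F^2
    return tau * np.sum(np.linalg.svd(X, compute_uv=False)) + 0.5 * np.sum(X ** 2)

tau, delta, T = 2.0, 1.0, 200            # delta <= 1/||A||^2, as required in Theorem 2
X = np.zeros((n1, n2)); Y = np.zeros((n1, n2))
F_vals = [F(X)]                          # F(X^1), F(X^2), ..., F(X^{T+1})
for k in range(T):
    Y = Y + delta * A_star(b - A(X))
    X = shrink(Y, tau)
    F_vals.append(F(X))

print("F nonincreasing:", all(F_vals[i + 1] <= F_vals[i] + 1e-12 for i in range(T)))
print("F(X^{T+1}) =", F_vals[-1], " bound Psi(M)/sum(delta_k) =", Psi(M, tau) / (T * delta))
```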

3 Necessity of Convergence

Our proof of the necessity part of (a) of Theorem 1 is based on interpreting SVT as a specific instantiation of mirror descent algorithms, a class of algorithms performing gradient descent in the dual space, which is mapped from the primal space by the subgradient of the mirror map [2, 16]. This interpretation enables us to use arguments for mirror descent algorithms to analyze the convergence of SVT. However, the standard analysis for mirror descent algorithms requires the mirror map to be differentiable, which is not the case for SVT, whose mirror map \(\Psi \) is non-differentiable. We use Bregman distances to overcome this difficulty. Our analysis can be extended to study SVT in the online setting [11, 13].

Our analysis needs some basic facts about convex functions. A function \(f:\mathbb {R}^{n_1\times n_2}\rightarrow \mathbb {R}\) is said to be \(\sigma \)-strongly convex with \(\sigma >0\) if \(D_f^{\widetilde{Y}}(X,\widetilde{X})\ge \frac{\sigma }{2}\Vert X-\widetilde{X}\Vert _F^2\) for any \(X,\widetilde{X} \in \mathbb {R}^{n_1\times n_2}\) and \(\widetilde{Y}\in \partial f(\widetilde{X})\). It is said to be L-strongly smooth if it is differentiable and \(D_f^{\nabla f(\widetilde{X})}(X,\widetilde{X})\le \frac{L}{2}\Vert X-\widetilde{X}\Vert _F^2\) for any \(X,\widetilde{X}\in \mathbb {R}^{n_1\times n_2}\). We denote by \(f^*(Y)=\sup \limits _{X\in \mathbb {R}^{n_1\times n_2}}\big [\left\langle X,Y\right\rangle -f(X)\big ]\) the Fenchel (convex) conjugate of f.

Lemma 3

For a convex function \(f:\mathbb {R}^{n_1\times n_2}\rightarrow \mathbb {R}\), the following statements hold.

  1. (a)

    \(f^{**}=f\) and

    $$\begin{aligned} \partial f^*(Y) = \{X:Y\in \partial f(X)\}, \quad \forall Y\in \mathbb {R}^{n_1\times n_2}. \end{aligned}$$
  2. (b)

    For \(\beta >0\), the function f is \(\beta \)-strongly convex if and only if \(f^*\) is \(\frac{1}{\beta }\)-strongly smooth.

  3. (c)

    If there exists a constant \(L>0\) such that

    $$\begin{aligned} \Vert \nabla f(X)-\nabla f(\widetilde{X})\Vert _F\le L\Vert X-\widetilde{X}\Vert _F \end{aligned}$$
    (6)

    for all \(X, \widetilde{X}\in \mathbb {R}^{n_1\times n_2}\), then we have

    $$\begin{aligned} f(X)\le f(\widetilde{X})+\left\langle X-\widetilde{X},\nabla f(\widetilde{X})\right\rangle +\frac{L}{2}\Vert X-\widetilde{X}\Vert _F^2. \end{aligned}$$

Part (a) of Lemma 3 on the duality between f and its Fenchel conjugate \(f^*\) can be found in [3]. Part (b) on the duality between strong convexity and strong smoothness can be found in [10]. Part (c) is a standard result relating the Lipschitz continuity of \(\nabla f\) to the strong smoothness of f (see, e.g., [17, 23]).

The idea of applying Bregman distances to SVT has been introduced in the literature. For example, it can be found in [5] that

$$\begin{aligned} D_\Psi ^{\widetilde{Y}}(X,\widetilde{X})\ge \frac{1}{2}\Vert X-\widetilde{X}\Vert _F^2 \end{aligned}$$
(7)

for all \(X,\widetilde{X}\in \mathbb {R}^{n_1\times n_2},\widetilde{Y}\in \partial \Psi (\widetilde{X})\).
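The inequality (7) is easy to check numerically. The snippet below uses the subgradient \(\widetilde{Y}=\widetilde{X}+\tau UV^*\in \partial \Psi (\widetilde{X})\), where U and V come from the compact singular value decomposition of \(\widetilde{X}\); the sizes and \(\tau \) are illustrative.

```python
# Numerical check of the strong convexity inequality (7) for Psi.
import numpy as np

rng = np.random.default_rng(3)
n1, n2, tau = 12, 9, 1.5

def Psi(X):
    return tau * np.sum(np.linalg.svd(X, compute_uv=False)) + 0.5 * np.sum(X ** 2)

def bregman(X, Xt, Yt):                  # D_Psi^{Yt}(X, Xt)
    return Psi(X) - Psi(Xt) - np.sum((X - Xt) * Yt)

for _ in range(100):
    X = rng.standard_normal((n1, n2))
    Xt = rng.standard_normal((n1, n2))
    U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
    Yt = Xt + tau * (U @ Vt)             # a subgradient of Psi at Xt
    assert bregman(X, Xt, Yt) >= 0.5 * np.sum((X - Xt) ** 2) - 1e-9
print("inequality (7) verified on 100 random pairs")
```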

A novelty of our analysis is the observation of the relation \(X^{k}=\nabla \Psi ^*(Y^{k})\) for SVT, stated in the following lemma.

Lemma 4

The sequence \(\{(X^k,Y^k)\}_{k}\) produced by (1) satisfies \(Y^{k}\in \partial \Psi (X^{k})\) and \(X^{k}=\nabla \Psi ^*(Y^{k})\), and \(\Psi ^*\) is differentiable. Hence from \(\nabla F(X)=\mathcal {A}^*(\mathcal {A}(X)-b)\), we have

$$\begin{aligned} Y^{k+1}=Y^k-\delta _k\nabla F(X^k) =Y^k-\delta _k\mathcal {A}^*\big (\mathcal {A}(\nabla \Psi ^*(Y^k))-b\big ). \end{aligned}$$
(8)

Proof

The gradient of F reads directly as \(\nabla F(X)=\mathcal {A}^*(\mathcal {A}(X)-b)\). It was shown in [5] that for each \(\tau >0\) and \(Y\in \mathbb {R}^{n_1\times n_2}\), the singular value shrinkage operator obeys \(\mathcal {D}_\tau (Y)=\arg \min _X\frac{1}{2}\Vert X-Y\Vert _F^2+\tau \Vert X\Vert _*\). It follows that the second equation in (1), \(X^{k}=\mathcal {D}_\tau (Y^{k})\), is equivalent to

$$\begin{aligned} X^{k}=\arg \min _{X\in \mathbb {R}^{n_1\times n_2}}\frac{1}{2}\Vert X-Y^{k}\Vert _F^2+\tau \Vert X\Vert _*. \end{aligned}$$

Combining this with the optimality condition implies \(0\in X^{k}-Y^{k}+\tau \partial \Vert X^{k}\Vert _*\). That is, \(Y^{k}\in \partial \Psi (X^{k})\). By Part (a) of Lemma 3, this implies \(X^{k}\in \partial \Psi ^*(Y^{k})\). But (7) shows that \(\Psi \) is 1-strongly convex, which implies that \(\Psi ^*\) is differentiable according to Part (b) of Lemma 3. Therefore, \(X^{k}=\nabla \Psi ^*(Y^{k})\). This proves the desired statement. \(\square \)
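As a brief numerical sanity check of Lemma 4 (with illustrative sizes and \(\tau \)), the snippet below verifies the subgradient inequality \(\Psi (\widetilde{X})\ge \Psi (X)+\langle \widetilde{X}-X,Y\rangle \) for \(X=\mathcal {D}_\tau (Y)\) over random test matrices \(\widetilde{X}\), which is exactly the statement \(Y\in \partial \Psi (X)\).

```python
# Check of Lemma 4: Y is a subgradient of Psi at X = D_tau(Y).
import numpy as np

rng = np.random.default_rng(4)
n1, n2, tau = 10, 8, 1.0

def Psi(X):
    return tau * np.sum(np.linalg.svd(X, compute_uv=False)) + 0.5 * np.sum(X ** 2)

def shrink(Y):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

Y = 2.0 * rng.standard_normal((n1, n2))
X = shrink(Y)                            # X = D_tau(Y) = grad Psi^*(Y)
for _ in range(200):
    Xtest = 3.0 * rng.standard_normal((n1, n2))
    assert Psi(Xtest) >= Psi(X) + np.sum((Xtest - X) * Y) - 1e-9
print("subgradient inequality Y in dPsi(D_tau(Y)) holds on 200 random test matrices")
```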

Now we can carry out the novel analysis stated in the following proposition, which proves the necessity part of Theorem 1.

Proposition 5

Let \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) be produced by (1). If \(b_0 \not =0\) and for some \(\kappa >0\), the step-size sequence \(\{\delta _k\}_{k\in \mathbb {N}}\) satisfies \(0<\delta _k\le \frac{1}{(2+\kappa )\Vert \mathcal {A}\Vert ^2}\) for every \(k\in \mathbb {N}\), then \(D_\Psi ^{Y^T}(X^\star ,X^T)>0\) for \(T\in \mathbb {N}\) and

$$\begin{aligned} \sum _{k=1}^{T}\delta _k\ge \frac{\log \frac{\Psi (X^\star )}{D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1})}}{(2+\kappa )\Vert \mathcal {A}\Vert ^2\log \frac{2+\kappa }{\kappa }}. \end{aligned}$$
(9)

In particular, \(\lim _{T\rightarrow \infty } D_\Psi ^{Y^{T}}(X^\star ,X^{T}) =0\) implies \(\sum _{k=1}^{\infty }\delta _k =\infty \).

Proof

Let us first analyze how the Bregman distance is reduced in one iteration of SVT.

Let \(k\in \mathbb {N}\). By Lemma 4 and the definition of the Bregman distance, for \(X\in \mathbb {R}^{n_1\times n_2}\), we have

$$\begin{aligned} \Delta _k (X)&:=D_\Psi ^{Y^k}(X,X^k) - D_\Psi ^{Y^{k+1}}(X,X^{k+1}) \nonumber \\&=\left[ \Psi (X)-\Psi (X^k)-\left\langle X-X^k,Y^k\right\rangle \right] -\left[ \Psi (X)-\Psi (X^{k+1})-\left\langle X-X^{k+1},Y^{k+1}\right\rangle \right] \nonumber \\&=\Psi (X^{k+1}) - \Psi (X^k) + \left\langle X-X^{k+1},Y^{k+1}\right\rangle - \left\langle X-X^k,Y^k\right\rangle . \end{aligned}$$
(10)

Notice that

$$\begin{aligned} D_\Psi ^{Y^{k+1}}(X^{k},X^{k+1}) = \Psi (X^k) - \Psi (X^{k+1}) - \left\langle X^{k}-X^{k+1},Y^{k+1}\right\rangle . \end{aligned}$$

Hence, by Lemma 4

$$\begin{aligned} \Delta _k (X)&=\left\langle X-X^{k},Y^{k+1}-Y^k\right\rangle - D_\Psi ^{Y^{k+1}}(X^{k},X^{k+1})\\&=-\delta _k\left\langle X-X^{k},\nabla F(X^k)\right\rangle - D_\Psi ^{Y^{k+1}}(X^{k},X^{k+1}). \end{aligned}$$

Setting \(X= X^\star \), we have

$$\begin{aligned} \Delta _k (X^\star ) = -\delta _k\left\langle X^\star -X^{k},\nabla F(X^k)\right\rangle - D_\Psi ^{Y^{k+1}}(X^{k},X^{k+1}). \end{aligned}$$
(11)

To estimate the inner product in (11), we apply Part (c) of Lemma 3 to the function F, whose gradient satisfies the Lipschitz condition (6) because

$$\begin{aligned} \Vert \nabla F(X)-\nabla F(\widetilde{X})\Vert _F&= \Vert \mathcal {A}^*(\mathcal {A}(X)-b)-\mathcal {A}^*(\mathcal {A}(\widetilde{X})-b)\Vert _F \nonumber \\&= \Vert \mathcal {A}^*(\mathcal {A}(X-\widetilde{X}))\Vert _F\le \Vert \mathcal {A}\Vert ^2\Vert X-\widetilde{X}\Vert _F. \end{aligned}$$
(12)

Setting \(X=X^\star \), \(\widetilde{X}=X^{k}\) yields

$$\begin{aligned} F(X^\star ) - F(X^k) \le \left\langle X^\star -X^{k},\nabla F(X^k)\right\rangle + \frac{\Vert \mathcal {A}\Vert ^2}{2}\Vert X^k-X^\star \Vert _F^2, \end{aligned}$$
(13)

while the choice of \(X=X^{k}\), \(\widetilde{X}=X^\star \) gives

$$\begin{aligned} F(X^k) - F(X^\star )\le & {} \left\langle X^{k} -X^\star ,\nabla F(X^\star )\right\rangle \\&+ \frac{\Vert \mathcal {A}\Vert ^2}{2}\Vert X^k-X^\star \Vert _F^2. \end{aligned}$$

Recall that \(\mathcal {A}^*(b -b_0)=0\). It follows that

$$\begin{aligned} \nabla F(X^\star )=\mathcal {A}^*(\mathcal {A}(X^\star )-b)=\mathcal {A}^*(b_0-b)=0, \end{aligned}$$
(14)

and

$$\begin{aligned} F(X^k) - F(X^\star ) \le \frac{\Vert \mathcal {A}\Vert ^2}{2}\Vert X^k-X^\star \Vert _F^2. \end{aligned}$$

Combining this with (11) and (13) tells us that

$$\begin{aligned} \Delta _k (X^\star ) \le \delta _k\Vert \mathcal {A}\Vert ^2\Vert X^k-X^\star \Vert _F^2 - D_\Psi ^{Y^{k+1}}(X^{k},X^{k+1}) \le \delta _k\Vert \mathcal {A}\Vert ^2\Vert X^k-X^\star \Vert _F^2. \end{aligned}$$

But \(\Vert X^k-X^\star \Vert _F^2 \le 2 D_\Psi ^{Y^k}(X^\star ,X^k)\) according to (7). Then we have

$$\begin{aligned} D_\Psi ^{Y^{k+1}}(X^\star ,X^{k+1})\ge (1-2\delta _k\Vert \mathcal {A}\Vert ^2)D_\Psi ^{Y^k}(X^\star ,X^k). \end{aligned}$$
(15)

Now we need the restriction \(0<\delta _k\le \frac{1}{(2+\kappa )\Vert \mathcal {A}\Vert ^2}\) with \(\kappa >0\) on the step size sequence. Denote \(a=\frac{2+\kappa }{2}\log \frac{2+\kappa }{\kappa }\) and apply the elementary inequality

$$\begin{aligned} 1-x\ge \exp (-ax), \quad \forall 0< x \le \frac{2}{2+\kappa }. \end{aligned}$$

Then we see from (15) that

$$\begin{aligned} D_\Psi ^{Y^{k+1}}(X^\star ,X^{k+1})\ge \exp \left\{ -2a\delta _k\Vert \mathcal {A}\Vert ^2\right\} D_\Psi ^{Y^k}(X^\star ,X^k). \end{aligned}$$

Applying this inequality iteratively for \(k=1,\ldots ,T\) yields

$$\begin{aligned} D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1}) \ge \prod _{k=1}^{T}\exp \left\{ -2a\delta _k\Vert \mathcal {A}\Vert ^2\right\} D_\Psi ^{Y^1}(X^\star ,X^1). \end{aligned}$$

Since \(Y^1=X^1=0\), we have \(D_\Psi ^{Y^1}(X^\star ,X^1)= \Psi (X^\star )>0\) by our assumption that \(b_0\not =0\). So \(D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1})>0\) and

$$\begin{aligned} 2a\Vert \mathcal {A}\Vert ^2\sum _{k=1}^{T}\delta _k \ge \log \Psi (X^\star )-\log D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1}). \end{aligned}$$

This verifies the desired lower bound on \(\sum _{k=1}^{T}\delta _k\). The proof is complete. \(\square \)
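As a quick numerical check of the elementary inequality used in the proof above, the snippet below verifies \(1-x\ge \exp (-ax)\) with \(a=\frac{2+\kappa }{2}\log \frac{2+\kappa }{\kappa }\) on the admissible range \(0<x\le \frac{2}{2+\kappa }\) for a few arbitrary values of \(\kappa \).

```python
# Check of the inequality 1 - x >= exp(-a x) on (0, 2/(2+kappa)].
import numpy as np

for kappa in [0.1, 0.5, 1.0, 3.0, 10.0]:
    a = (2 + kappa) / 2 * np.log((2 + kappa) / kappa)
    x = np.linspace(1e-6, 2 / (2 + kappa), 10000)
    assert np.all(1 - x >= np.exp(-a * x) - 1e-12)
print("1 - x >= exp(-a x) verified on the admissible range for several kappa")
```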

We are in a position to prove our second main result.

Proof of Theorem 2

We follow (10), but decompose \(\Delta _k (X)\) in a different way by means of \(D_\Psi ^{Y^k}(X^{k+1},X^k) =\Psi (X^{k+1}) - \Psi (X^k) - \left\langle X^{k+1}-X^k,Y^k\right\rangle \) to get

$$\begin{aligned} \Delta _k (X)=\left\langle X-X^{k+1},Y^{k+1}-Y^k\right\rangle + D_\Psi ^{Y^k}(X^{k+1},X^k). \end{aligned}$$

By (8), \(Y^{k+1}-Y^k=-\delta _k\nabla F(X^k)\). To be consistent with the gradient at \(X^k\), we separate \(X-X^{k+1}\) into \(X-X^{k} + X^{k} -X^{k+1}\) and decompose \(\Delta _k (X)\) as

$$\begin{aligned} \Delta _k (X)= & {} -\delta _k\left\langle X-X^{k},\nabla F(X^k)\right\rangle \nonumber \\&+ \left\{ \delta _k\left\langle X^{k+1}-X^{k},\nabla F(X^k)\right\rangle + D_\Psi ^{Y^k}(X^{k+1},X^k)\right\} . \end{aligned}$$
(16)

The inner product in the last term above can be estimated by applying Part (c) of Lemma 3 to the function F, which satisfies (12), as

$$\begin{aligned} \langle X^{k+1}-X^{k},\nabla F(X^k)\rangle \ge F(X^{k+1}) - F(X^k) - \frac{\Vert \mathcal {A}\Vert ^2}{2} \Vert X^k-X^{k+1}\Vert _F^2. \end{aligned}$$

But

$$\begin{aligned} D_\Psi ^{Y^k}(X^{k+1},X^k) \ge \frac{1}{2}\Vert X^{k+1} - X^k\Vert _F^2 \end{aligned}$$

according to (7). Putting these lower bounds into the last term of (16) and applying the bound \(\left\langle X-X^{k},\nabla F(X^k)\right\rangle \le F(X) - F(X^k)\) derived from the convexity of F, we find

$$\begin{aligned} \Delta _k (X)&\ge -\delta _k \left[ F(X)-F(X^{k})\right] \\&\quad + \Big \{\delta _k\left[ F(X^{k+1}) - F(X^k)\right] - \frac{\delta _k\Vert \mathcal {A}\Vert ^2}{2} \Vert X^k-X^{k+1}\Vert _F^2 \\&\quad + \frac{1}{2}\Vert X^{k+1} - X^k\Vert _F^2\Big \} \\&= \delta _k \left[ F(X^{k+1}) - F(X)\right] + \frac{1-\delta _k\Vert \mathcal {A}\Vert ^2}{2}\Vert X^{k+1}-X^k\Vert _F^2. \end{aligned}$$

By the assumption on the step size, \(\delta _k\Vert \mathcal {A}\Vert ^2 \le 1\). Therefore, the following inequality holds for all \(X\in \mathbb {R}^{n_1\times n_2}\)

$$\begin{aligned} \delta _k[F(X^{k+1})-F(X)] \le D_\Psi ^{Y^{k}}(X,X^k)-D_\Psi ^{Y^{k+1}}(X,X^{k+1}). \end{aligned}$$
(17)

Then the property \(F(X^{k+1})\le F(X^k)\) stated in Part (a) follows by setting \(X=X^k\) in (17) because \(D_\Psi ^{Y^{k}}(X^k,X^k)=0\) and \(D_\Psi ^{Y^{k+1}}(X^k,X^{k+1}) \ge 0\).

The statement in Part (b) follows immediately from (14). In fact, from the orthogonality of \(b-b_0\) to the range of \(\mathcal {A}\) and the identity \(\mathcal {A}(X^\star ) =b_0\), we see the following well-known relation in learning theory

$$\begin{aligned} F(X)&= \frac{1}{2} \Vert \mathcal {A}(X-X^\star )+ \mathcal {A}(X^\star ) -b\Vert _2^2 = \frac{1}{2} \Vert \mathcal {A}(X-X^\star )\Vert _2^2 + \frac{1}{2} \Vert \mathcal {A}(X^\star ) -b\Vert _2^2 \nonumber \\&= \frac{1}{2} \Vert \mathcal {A}(X-X^\star )\Vert _2^2 + F(X^\star ). \end{aligned}$$
(18)

To prove the statement in Part (c), we apply the monotonicity \(F(X^{k+1})\le F(X^k)\) derived in Part (a) and find

$$\begin{aligned} F(X^{T+1})-F(X)\le \frac{\sum _{k=1}^{T}\delta _k[F(X^{k+1})-F(X)]}{\sum _{\tilde{k}=1}^{T}\delta _{\tilde{k}}}. \end{aligned}$$

Taking the summation of (17) from \(k=1\) to T gives

$$\begin{aligned} \sum _{k=1}^{T}\delta _k[F(X^{k+1})-F(X)]&\le \sum _{k=1}^{T}\big [D_\Psi ^{Y^{k}}(X,X^k)-D_\Psi ^{Y^{k+1}}(X,X^{k+1})\big ] \\&= D_\Psi ^{Y^1}(X,X^1)-D_\Psi ^{Y^{T+1}}(X,X^{T+1}). \end{aligned}$$

But \(-D_\Psi ^{Y^{T+1}}(X,X^{T+1})\le 0\) and \(D_\Psi ^{Y^1}(X,X^1) = \Psi (X)\) since \(X^1=Y^1=0\). Hence

$$\begin{aligned} F(X^{T+1})-F(X) \le \frac{\Psi (X)}{\sum _{k=1}^{T}\delta _k},\quad \forall X\in \mathbb {R}^{n_1\times n_2}. \end{aligned}$$

In particular, taking \(X=X^\star \) and applying (18), we get (5). This completes the proof of Theorem 2. \(\square \)

4 Sufficiency of Convergence

This section presents the proof for the sufficiency part of (a) and (b) of Theorem 1. Our analysis is based on the observation that SVT can be viewed as a gradient descent algorithm applied to the dual problem of (3), hence results on gradient descent algorithms can be applied. Here we apply the following standard estimates for the convergence of the gradient descent method applied to smooth optimization problems. The proof is given in the appendix for completeness.

Lemma 6

Suppose \(f:\mathbb {R}^{m}\rightarrow \mathbb {R}\) is convex and L-strongly smooth with \(\lambda ^\star \) being a minimizer. Let \(\{\lambda ^k\}_{k\in \mathbb {N}}\) be the following sequence produced by the gradient descent algorithm

$$\begin{aligned} \lambda ^1 =0, \quad \lambda ^{k+1}=\lambda ^k-\delta _k\nabla f(\lambda ^k), \qquad k\in \mathbb {N}\end{aligned}$$
(19)

with a step size sequence \(\{\delta _k>0\}_{k\in \mathbb {N}}\). Then the following statements hold.

  1. (a)

    If \(\sup _k\delta _k<2/L\), then there exists a constant \(\widetilde{C}\) independent of T such that

    $$\begin{aligned} f(\lambda ^{T+1})-f(\lambda ^\star )\le \widetilde{C}\Big [\sum _{k=1}^{T}\delta _k\Big ]^{-1}. \end{aligned}$$
    (20)
  2. (b)

    If \(\sup _k\delta _k\le 1/L\), then (20) holds with \(\widetilde{C}=\Vert \lambda ^\star \Vert _2^2/2\).
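As a minimal numerical illustration of Lemma 6 (outside the matrix setting), the following snippet runs gradient descent with step size \(1/L\) on a strongly smooth quadratic and compares the excess objective value with the bound (20) for \(\widetilde{C}=\Vert \lambda ^\star \Vert _2^2/2\); the quadratic and the horizons are arbitrary illustrative choices.

```python
# Gradient descent on a smooth quadratic: excess value versus the bound (20).
import numpy as np

rng = np.random.default_rng(5)
m = 20
B = rng.standard_normal((m, m))
Q = B @ B.T / m + np.eye(m)              # symmetric positive definite
c = rng.standard_normal(m)
L = np.linalg.eigvalsh(Q).max()          # smoothness constant of f

def f(x):
    return 0.5 * x @ Q @ x - c @ x

def grad(x):
    return Q @ x - c

x_star = np.linalg.solve(Q, c)           # minimizer of f
delta = 1.0 / L                          # constant step size <= 1/L, as in part (b)
for T in [10, 100, 1000]:
    x = np.zeros(m)                      # lambda^1 = 0
    for k in range(T):
        x = x - delta * grad(x)
    excess = f(x) - f(x_star)
    bound = 0.5 * np.dot(x_star, x_star) / (T * delta)
    print(f"T = {T:4d}   excess = {excess:.3e}   bound (20) = {bound:.3e}")
```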

The following lemma shows how SVT can be viewed as a gradient descent algorithm applied to the dual of (3). Part (a) establishes the dual problem of the optimization problem (3), and Part (b) shows that the sequence \(\{Y^k\}\) coincides with \(\{\mathcal {A}^*(\lambda ^k)\}_{k\in \mathbb {N}}\), where \(\{\lambda ^k\}\) is produced by applying the gradient descent algorithm (19) to the function G given in Part (a). This lemma was presented in [5] when \(\mathcal {A}\) is an orthogonal projector and the system \(\mathcal {A}(X)=b\) is consistent. It is extended here to a general linear transformation \(\mathcal {A}\), allowing for inconsistent systems, with b replaced by its orthogonal projection onto the range of \(\mathcal {A}\).

Lemma 7

  1. (a)

    The Lagrangian dual problem of (3) is

    $$\begin{aligned} \min _{\lambda \in \mathbb {R}^m}G(\lambda ), \ \hbox {where} \ G(\lambda ):=\Psi ^*(\mathcal {A}^*(\lambda ))-\left\langle \lambda ,b_0\right\rangle . \end{aligned}$$
    (21)
  2. (b)

    If \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) is produced by (1), and \(\{\lambda ^k\}_{k\in \mathbb {N}}\) is produced by applying the gradient descent algorithm (19) to the function G, then we have \(Y^k=\mathcal {A}^*(\lambda ^k)\) for \(k\in \mathbb {N}\).

Proof

The Lagrangian dual problem of (3) is

$$\begin{aligned}&\max _{\lambda \in \mathbb {R}^m}\min _{X\in \mathbb {R}^{n_1\times n_2}}\big [\Psi (X)-\left\langle \lambda ,\mathcal {A}(X)\right\rangle +\left\langle \lambda ,b_0\right\rangle \big ] \\&= \max _{\lambda \in \mathbb {R}^m}\Big [-\max _{X\in \mathbb {R}^{n_1\times n_2}}\big [\left\langle X,\mathcal {A}^*(\lambda )\right\rangle -\Psi (X)\big ]+\left\langle \lambda ,b_0\right\rangle \Big ]\\&=\max _{\lambda \in \mathbb {R}^m}\Big [-\Psi ^*(\mathcal {A}^*(\lambda ))+\left\langle \lambda ,b_0\right\rangle \Big ] = -\min _{\lambda \in \mathbb {R}^m}\big [\Psi ^*(\mathcal {A}^*(\lambda ))-\left\langle \lambda ,b_0\right\rangle \big ]\\&= -\min _{\lambda \in \mathbb {R}^m}G(\lambda ), \end{aligned}$$

where in the second identity we have used the definition of the Fenchel conjugate. This proves (21).

When the gradient descent algorithm (19) is applied to the function G defined in (21), we see by the chain rule that the gradient equals

$$\begin{aligned} \nabla G(\lambda ) = \nabla \big (\Psi ^*(\mathcal {A}^*(\lambda ))-\left\langle \lambda ,b_0\right\rangle \big ) =\mathcal {A}\big ((\nabla \Psi ^*)(\mathcal {A}^*(\lambda ))\big )-b_0. \end{aligned}$$
(22)

So the sequence \(\{\lambda ^k\}_{k\in \mathbb {N}}\) produced by (19) translates to

$$\begin{aligned} \lambda ^{k+1}=\lambda ^k-\delta _k[\mathcal {A}((\nabla \Psi ^*)(\mathcal {A}^*(\lambda ^k)))-b_0]. \end{aligned}$$
(23)

Applying the transformation \(\mathcal {A}^*\) to both sides and noticing \(\mathcal {A}^*(b_0) = \mathcal {A}^*(b)\) yields the following identity for all \(k\in \mathbb {N}\)

$$\begin{aligned} \mathcal {A}^*(\lambda ^{k+1})&= \mathcal {A}^*(\lambda ^k) - \delta _k\mathcal {A}^*\big (\mathcal {A}((\nabla \Psi ^*)(\mathcal {A}^*(\lambda ^k))) - b_0\big )\\&=\mathcal {A}^*(\lambda ^k)-\delta _k\mathcal {A}^*\big (\mathcal {A}((\nabla \Psi ^*)(\mathcal {A}^*(\lambda ^k)))-b\big ). \end{aligned}$$

This iteration relation for the sequence \(\{\mathcal {A}^*(\lambda ^k)\}_{k\in \mathbb {N}}\) is exactly the same as (8) in Lemma 4 for the sequence \(\{Y^k\}_{k\in \mathbb {N}}\). This together with the initial conditions \(Y^1=0, \mathcal {A}^*(\lambda ^{1})=0\) tells us that \(Y^k=\mathcal {A}^*(\lambda ^k)\) for \(k\in \mathbb {N}\). The proof of the lemma is complete. \(\square \)
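As a numerical check of Lemma 7(b) (with an illustrative sampling operator, \(\tau \), step size and horizon), the snippet below runs the dual gradient descent (23), using \(\nabla \Psi ^*=\mathcal {D}_\tau \) from Lemma 4, alongside the SVT iteration (1) and confirms that \(Y^k=\mathcal {A}^*(\lambda ^k)\) at every step. Here b lies in the range of \(\mathcal {A}\), so \(b_0=b\).

```python
# Check of Lemma 7(b): the SVT variable Y^k equals A^*(lambda^k).
import numpy as np

rng = np.random.default_rng(6)
n1, n2, r = 25, 20, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = rng.random((n1, n2)) < 0.5
b = M[mask]                              # b = A(M) is in the range of A, so b0 = b

def A(X):
    return X[mask]

def A_star(y):
    Z = np.zeros((n1, n2)); Z[mask] = y; return Z

def shrink(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

tau, delta, T = 2.0, 1.0, 50

# SVT iteration (1), started from X^1 = Y^1 = 0
X = np.zeros((n1, n2)); Y = np.zeros((n1, n2))
Ys = []
for k in range(T):
    Y = Y + delta * A_star(b - A(X))
    X = shrink(Y, tau)
    Ys.append(Y.copy())

# dual gradient descent (23) on G, started from lambda^1 = 0
lam = np.zeros_like(b)
for k in range(T):
    lam = lam - delta * (A(shrink(A_star(lam), tau)) - b)
    assert np.allclose(A_star(lam), Ys[k], atol=1e-8)
print("Y^k = A^*(lambda^k) verified for the first", T, "iterations")
```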

Combining Lemmas 6 and 7 enables us to bound the excess dual objective value \(G(\lambda ^{T+1})-G(\lambda ^\star )\) in terms of \(\sum _{k=1}^T \delta _k\). To prove the sufficiency part of Theorem 1, it remains to relate the excess dual objective value \(G(\lambda ^{T+1})-G(\lambda ^\star )\) to the Bregman distance \(D_{\Psi }^{Y^{T+1}}(X^\star ,X^{T+1})\). This relation is given in the following key identity, which provides an elegant scheme for transferring decay rates of excess dual objective values to the Bregman distance between primal variables.

Lemma 8

If \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) is produced by (1), and \(\{\lambda ^k\}_{k\in \mathbb {N}}\) is produced by applying the gradient descent algorithm (19) to the function G, then there exists some \(\lambda ^\star \in \mathbb {R}^m\) such that \(\mathcal {A}^*(\lambda ^\star )\in \partial \Psi (X^\star )\) and

$$\begin{aligned} D_{\Psi }^{Y^k}(X^\star ,X^k)=G(\lambda ^k)-G(\lambda ^\star ). \end{aligned}$$

Proof

Since \(X^\star \) is an optimal point of the problem (3) with only linear constraints, the existence of Lagrange multipliers (e.g., Corollary 28.2.2 in [19]) and the first-order optimality condition imply the existence of \(\lambda ^\star \in \mathbb {R}^m\) satisfying

$$\begin{aligned} Y^\star :=\mathcal {A}^*(\lambda ^\star )\in \partial \Psi (X^\star ). \end{aligned}$$
(24)

Together with Part (a) of Lemma 3 and Lemma 4, this implies that

$$\begin{aligned} X^\star =\nabla \Psi ^*(\mathcal {A}^*(\lambda ^\star )) = \nabla \Psi ^*(Y^\star ). \end{aligned}$$
(25)

Since \(\Psi \) is convex, we know (see, e.g., Proposition 3.3.4 in [3]) that for any \(X\in \mathbb {R}^{n_1 \times n_2}\),

$$\begin{aligned} Y\in \partial \Psi (X) \Longrightarrow \Psi ^*(Y) = \left\langle X,Y\right\rangle -\Psi (X). \end{aligned}$$

Applying this implication to the pairs \((X^\star , Y^\star )\) in (24) and \((X^{k}, Y^{k})\) in Lemma 4 satisfying \(Y^{k}\in \partial \Psi (X^{k})\), we know that

$$\begin{aligned} D_{\Psi }^{Y^k}(X^\star ,X^k)&= \Psi (X^\star )-\Psi (X^k)-\left\langle X^\star - X^k, Y^k\right\rangle \\&=\Psi (X^\star )-\left\langle X^\star ,Y^\star \right\rangle +\left\langle X^\star ,Y^\star \right\rangle -\Psi (X^k)-\left\langle X^\star - X^k, Y^k\right\rangle \\&=-\Psi ^*(Y^\star )+\left\langle X^\star ,Y^\star - Y^k\right\rangle +\Psi ^*(Y^k) \\&=\Psi ^*(Y^k)-\Psi ^*(Y^\star )-\left\langle Y^k-Y^\star ,\nabla \Psi ^*(Y^\star )\right\rangle , \end{aligned}$$

where we have used (25) in the last equality. But \(Y^k=\mathcal {A}^*(\lambda ^k)\) according to Part (b) of Lemma 7. Then we see from the definition of the function G that \(D_{\Psi }^{Y^k}(X^\star ,X^k)\) equals

$$\begin{aligned}&\Psi ^*(\mathcal {A}^*(\lambda ^k))-\Psi ^*(\mathcal {A}^*(\lambda ^\star )) - \left\langle \mathcal {A}^*(\lambda ^k-\lambda ^\star ),\nabla \Psi ^*(Y^\star )\right\rangle \\&= \Psi ^*(\mathcal {A}^*(\lambda ^k)) - \Psi ^*(\mathcal {A}^*(\lambda ^\star )) -\left\langle \lambda ^k-\lambda ^\star ,b_0\right\rangle +\left\langle \lambda ^k-\lambda ^\star ,b_0\right\rangle \\&\quad -\left\langle \mathcal {A}^*(\lambda ^k-\lambda ^\star ), \nabla \Psi ^*(Y^\star )\right\rangle \\&= G(\lambda ^k)-G(\lambda ^\star )+\left\langle \lambda ^k-\lambda ^\star ,b_0-\mathcal {A}(\nabla \Psi ^*(Y^\star ))\right\rangle . \end{aligned}$$

This together with the identities (25) and \(\mathcal {A}(X^\star )=b_0\) implies

$$\begin{aligned} D_{\Psi }^{Y^k}(X^\star ,X^k)&= G(\lambda ^k)-G(\lambda ^\star )+\left\langle \lambda ^k-\lambda ^\star ,b_0-\mathcal {A}(X^\star )\right\rangle \\&= G(\lambda ^k)-G(\lambda ^\star ). \end{aligned}$$

The proof of the lemma is complete. \(\square \)

Now we can prove the sufficiency part of (a) and (b) of Theorem 1 by presenting the following more general estimate; combined with the strong convexity bound (7), it also yields the Frobenius norm bound in Part (b) of Theorem 1.

Proposition 9

Let \(\{(X^k,Y^k)\}_{k\in \mathbb {N}}\) be produced by (1) with a positive step-size sequence \(\{\delta _k\}\) satisfying \(\sup _k\delta _k<\frac{2}{\Vert \mathcal {A}\Vert ^2}\). Then we have

$$\begin{aligned} D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1})\le \widetilde{C}\Big [\sum _{k=1}^{T}\delta _k\Big ]^{-1}, \end{aligned}$$
(26)

where \(\widetilde{C}\) is a constant independent of T. Furthermore, if \(\sup _k\delta _k\le \frac{1}{\Vert \mathcal {A}\Vert ^2}\), then (26) holds with \(\widetilde{C}=\Vert \lambda ^\star \Vert _2^2/2\), where \(\lambda ^\star \) is an element in \(\mathbb {R}^m\) satisfying \(\mathcal {A}^*(\lambda ^\star )\in \partial \Psi (X^\star )\).

Proof

Recall the expression (22) for the gradient of G. Take the vector \(\lambda ^\star \) given in Lemma 8. The identity (25) implies

$$\begin{aligned} \nabla G(\lambda ^\star )=\mathcal {A}(\nabla \Psi ^*(\mathcal {A}^*(\lambda ^\star )))-b_0=\mathcal {A}(X^\star )-b_0=0 \end{aligned}$$

and therefore \(\lambda ^\star \) minimizes G.

By (7), \(\Psi \) is 1-strongly convex. So its Fenchel conjugate \(\Psi ^*\) is 1-strongly smooth according to Part (b) of Lemma 3. It follows that for \(\lambda , \tilde{\lambda } \in \mathbb {R}^m\),

$$\begin{aligned} G(\lambda )-G(\tilde{\lambda })&= \Psi ^*(\mathcal {A}^*(\lambda ))-\left\langle \lambda ,b_0\right\rangle -\Psi ^*(\mathcal {A}^*(\tilde{\lambda }))+\left\langle \tilde{\lambda },b_0\right\rangle \\&\le \left\langle \nabla \Psi ^*(\mathcal {A}^*(\tilde{\lambda })),\mathcal {A}^*(\lambda -\tilde{\lambda })\right\rangle +\frac{1}{2}\Vert \mathcal {A}^*(\lambda -\tilde{\lambda })\Vert _F^2 - \left\langle \lambda -\tilde{\lambda },b_0\right\rangle \\&= \left\langle \lambda -\tilde{\lambda },\mathcal {A}((\nabla \Psi ^*)(\mathcal {A}^*(\tilde{\lambda })))-b_0\right\rangle +\frac{1}{2}\Vert \mathcal {A}^*(\lambda -\tilde{\lambda })\Vert _F^2\\&\le \left\langle \lambda -\tilde{\lambda },\nabla G(\tilde{\lambda })\right\rangle +\frac{\Vert \mathcal {A}\Vert ^2}{2}\Vert \lambda -\tilde{\lambda }\Vert _2^2, \end{aligned}$$

where in the last step we have used (22) and the definition of the operator norm. This tells us that the function \(G(\lambda )\) is \(\Vert \mathcal {A}\Vert ^2\)-strongly smooth. So we apply Lemmas 6(a) and 8 and know that when \(\{\delta _k\}_k\) satisfies \(\sup _k\delta _k < \frac{2}{\Vert \mathcal {A}\Vert ^2}\), the following inequality holds with a constant \(\widetilde{C}\) independent of T

$$\begin{aligned} D_\Psi ^{Y^{T+1}}(X^\star ,X^{T+1}) = G(\lambda ^{T+1})-G(\lambda ^\star ) \le \widetilde{C}\Big [\sum _{k=1}^{T}\delta _k\Big ]^{-1}. \end{aligned}$$

According to Lemmas 6(b) and 8, the constant \(\widetilde{C}\) can be chosen to be \(\Vert \lambda ^\star \Vert _2^2/2\) if \(\delta _k\le \frac{1}{\Vert \mathcal {A}\Vert ^2}\) for every \(k\in \mathbb {N}\). The proof is complete. \(\square \)