1 Introduction

Let \(\mathfrak {R}^{p\times q}\) be the space of all \(p\times q\) matrices equipped with the standard trace inner product \(\langle X, Y \rangle = \text{ Tr }(X^TY)\) and its induced Frobenius norm \(\Vert \cdot \Vert \). Without loss of generality, we assume \(p\le q\) throughout this paper. For a given \(X\in \mathfrak {R}^{p\times q}\), its nuclear norm \(\Vert X\Vert _{*}\) is defined as the sum of all its singular values and its operator norm \(\Vert X\Vert _2\) is the largest singular value. Let \(\mathcal{S}^n\) be the space of all \(n\times n\) symmetric matrices and \(\mathcal{S}^n_+\) be the cone of symmetric positive semidefinite matrices. We use the notation \(X\succeq 0\) to denote that \(X\) is a symmetric positive semidefinite matrix.

In this paper, we are interested in solving the following nuclear norm regularized matrix least squares problem with linear equality and inequality constraints:

$$\begin{aligned} \min _{X \in \mathfrak {R}^{p\times q}} \left\{ \frac{1}{2}\Vert \mathcal {A} (X) -b\Vert ^2 + \rho \Vert X\Vert _{*} + \langle C, X \rangle \;:\; \mathcal {B}(X) \, \in \, d + \mathcal {Q}\right\} , \end{aligned}$$
(1)

where \(\mathcal {A}:\mathfrak {R}^{p\times q}\rightarrow \mathfrak {R}^m\) and \(\mathcal {B} :\mathfrak {R}^{p\times q}\rightarrow \mathfrak {R}^s\) are linear maps, \(C\in \mathfrak {R}^{p\times q}, b\in \mathfrak {R}^m, d\in \mathfrak {R}^{s}\), \(\rho \) is a given positive parameter, and \(\mathcal {Q} = {\{0\}}^{s_1}\times \mathfrak {R}_{+}^{s_2}\) is a polyhedral convex cone with \(s= s_1 + s_2\). In many applications, we need to find a low rank approximation of a given target matrix while preserving certain structures. The nuclear norm function has been widely used as a regularizer that favors low rank solutions of (1). Chu et al. [12] addressed some theoretical and numerical issues concerning structured low rank approximation problems. In many data analysis problems, the collected empirical data, possibly contaminated by noise, usually do not have the specified structure or the desired low rank, so it is important to find the nearest low rank approximation of the given matrix while maintaining the underlying structure of the original system. In practice, the data to be analyzed are very often nonnegative, for example when they represent concentrations or intensity values, and it is preferable to take such structural constraints into account.

Our nuclear norm regularized matrix least squares problem (1) arises from the recent intensive studies of the following affine rank minimization problem:

$$\begin{aligned} \min \left\{ \text{ rank }(X) : \mathcal {A}(X) = b, \; X\in \mathfrak {R}^{p\times q} \right\} . \end{aligned}$$
(2)

The problem (2) has many applications in diverse fields, see, e.g., [1, 2, 10, 47]. However, this affine rank minimization problem is generally an NP-hard nonconvex optimization problem. A tractable heuristic introduced in [20, 21] is to minimize the nuclear norm over the same constraints as in (2):

$$\begin{aligned} \min \left\{ \Vert X\Vert _{*} : \mathcal {A}(X) = b, \; X\in \mathfrak {R}^{p\times q} \right\} . \end{aligned}$$
(3)

The nuclear norm is the greatest convex function majorized by the rank function on the set of matrices with operator norm at most one (i.e., it is the convex envelope of the rank function on that set), and it has been widely used as a surrogate for promoting low rank structure. A frequently used alternative to (3) for accommodating problems with noisy data is the following matrix least squares problem with nuclear norm regularization (see [38, 55]):

$$\begin{aligned} \min \left\{ \frac{1}{2}\Vert \mathcal {A} (X) -b\Vert ^2 + \rho \Vert X\Vert _* : X\in \mathfrak {R}^{p\times q}\right\} . \end{aligned}$$
(4)

It is known that (3) can be equivalently reformulated as a semidefinite programming (SDP) problem (see [47]), which has one \((p+q)\times (p+q)\) semidefinite constraint and \(m\) linear equality constraints. One can use standard interior-point method based semidefinite programming solvers such as SeDuMi [52] and SDPT\(3\) [54] to solve this SDP problem. However, these solvers are not suitable for problems with large \(p+q\) or \(m\) since in each iteration of these solvers, a large and dense Schur complement equation must be solved even when the data is sparse.
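
For reference, one standard form of this reformulation (consistent with the constraint dimensions just mentioned; the auxiliary variables \(W_1, W_2\) are introduced only for this display) is

$$\begin{aligned} \min _{X\in \mathfrak {R}^{p\times q},\; W_1 \in \mathcal {S}^p,\; W_2\in \mathcal {S}^q} \left\{ \frac{1}{2}\left( \mathrm{Tr}(W_1) + \mathrm{Tr}(W_2)\right) \; : \; \mathcal {A}(X) = b, \; \left[ \begin{array}{cc} W_1 & X \\ X^T & W_2 \end{array} \right] \succeq 0 \right\} , \end{aligned}$$

whose single semidefinite constraint is of order \(p+q\) and whose equality constraints number \(m\).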

To overcome the difficulties faced by interior-point methods, several algorithms have been proposed to solve (3) or (4) directly. In [9], a singular value thresholding (SVT) algorithm is proposed for solving the following Tikhonov regularized version of (3):

$$\begin{aligned} \min \left\{ \tau \Vert X\Vert _{*} + \frac{1}{2}\Vert X\Vert ^2 : \mathcal {A}(X) = b, \; X\in \mathfrak {R}^{p\times q}\right\} , \end{aligned}$$
(5)

where \(\tau \) is a given positive regularization parameter. The SVT algorithm is a gradient method applied to the dual problem of (5). In [38], a fixed point continuation (FPC) algorithm is proposed for solving (4), together with a Bregman iterative algorithm for solving (3). Their numerical results on randomly generated matrix completion problems demonstrated that the FPC algorithm is much more efficient than the semidefinite programming solver SDPT3 when low accuracy solutions are sought. Toh and Yun [55] proposed an accelerated proximal gradient algorithm (APG), which terminates in \(O(1/\sqrt{\varepsilon })\) iterations for achieving \(\varepsilon \)-optimality, to solve the unconstrained matrix least squares problem (4). Their numerical results show that the APG algorithm is highly efficient and robust in solving large-scale random matrix completion problems. Liu et al. [35] considered the following nuclear norm minimization problem with linear and second order cone constraints:

$$\begin{aligned} \min \left\{ \Vert X\Vert _{*} : \mathcal {A}(X) \in b+ \mathcal {K}, \; X\in \mathfrak {R}^{p\times q} \right\} , \end{aligned}$$
(6)

where \(\mathcal {K} = \{0\}^{m_1} \times \mathcal {K}^{m_2}\), and \(\mathcal {K}^{m_2}\) stands for the \(m_2\)-dimensional second order cone. They developed three inexact proximal point algorithms (PPA) in primal, dual and primal-dual forms with a comprehensive convergence analysis, built upon the classic results of the general PPA established by Rockafellar [49]. Their numerical results demonstrated the efficiency and robustness of these PPAs in solving randomly generated and real matrix completion problems. Moreover, they showed that the SVT algorithm [9] is just one outer iteration of the exact primal PPA, and the Bregman iterative method [38] is a special case of the exact dual PPA. However, none of the models and algorithms mentioned above can address the following goal: given an observed data matrix (possibly contaminated by noise), find the nearest low rank approximation of the target matrix while maintaining certain prescribed structures. In particular, the APG method considered in [55] cannot be applied directly to solve the problem (1).

A strong motivation for proposing the model (1) arises from finding the nearest low rank approximation of transition matrices. For a given transition matrix \(P\), which describes the full distribution of a random walk over the data set, the problem of finding a low rank approximation of \(P\) can be stated as follows:

$$\begin{aligned} \min \left\{ \frac{1}{2} \Vert X - P\Vert ^2 + \rho \Vert X\Vert _{*}\,:\, Xe = e, \; X \ge 0, \; \; X \in \mathfrak {R}^{n\times n} \right\} , \end{aligned}$$
(7)

where \(e\in \mathfrak {R}^n\) is the vector of all ones and \(X\ge 0\) denotes the condition that all entries of \(X\) are nonnegative. Lin [34] proposed the Latent Markov Analysis (LMA) approach for finding the reduced rank approximations of transition matrices. The LMA is applied to clustering such that the inferred cluster relationships can be described probabilistically by the reduced-rank transition matrix. Chennubhotla [11] exploited the spectral properties of the Markov transition matrix to obtain low rank approximation of the original transition matrix in order to develop a fast eigen-solver for spectral clustering. Furthermore, in many applications, since only partial information of the original transition matrix is available, it is also important to estimate the missing entries of \(P\). For example, transition probabilities between different credit ratings play a crucial role in credit portfolio management. If our primary interest is in a specific group, then the number of observed rating transitions might be very small. Due to the lack of rating data, it is important to estimate the rating transition matrix in the presence of missing data [4].
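
To make the connection with the general model (1) explicit, (7) is the instance of (1) obtained (with \(m = n^2\), \(s_1 = n\), \(s_2 = n^2\) and \(C = 0\)) by one natural identification, where \(\mathrm{vec}(\cdot )\) stacks the columns of a matrix:

$$\begin{aligned} \mathcal {A}(X) = \mathrm{vec}(X), \quad b = \mathrm{vec}(P), \quad \mathcal {B}(X) = \left[ \begin{array}{c} Xe \\ \mathrm{vec}(X) \end{array} \right] , \quad d = \left[ \begin{array}{c} e \\ 0 \end{array} \right] , \quad \mathcal {Q} = \{0\}^{n} \times \mathfrak {R}_{+}^{n^2}, \end{aligned}$$

so that \(\frac{1}{2}\Vert \mathcal {A}(X) - b\Vert ^2 = \frac{1}{2}\Vert X - P\Vert ^2\), the equality block of \(\mathcal {B}\) enforces \(Xe = e\), and the inequality block enforces \(X \ge 0\).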

In this paper, we study a partial proximal point algorithm (PPA) proposed by Ha [30] for solving (1), in which only some of the variables appear in the quadratic proximal term. The partial PPA was further analyzed by Bertsekas and Tseng [6], in which the close relation between the partial PPA and some parallel algorithms in convex programming was revealed. Given a sequence of parameters \(\sigma _k\) such that

$$\begin{aligned} 0< \sigma _k \uparrow \sigma _{\infty } \le + \infty , \end{aligned}$$
(8)

and an initial point \(X^0\in \mathfrak {R}^{p\times q}\), a sequence \(\{(u^k, X^k)\}\subseteq \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\) is generated by the partial PPA for solving (1) via the following scheme:

$$\begin{aligned} (u^{k\!+\!1}, X^{k\!+\!1}) \!\approx \mathrm{arg}\min \left\{ f_\rho (u, X)\! +\! \frac{1}{2\sigma _k}\Vert X\!-\! X^k\Vert ^2: \mathcal {A}(X) \!+\! u \!=\! b,\, \mathcal {B}(X) \in d \!+\! \mathcal {Q}\right\} , \end{aligned}$$
(9)

where \(\displaystyle f_\rho (u, X) := \frac{1}{2}\Vert u\Vert ^2 + \rho \Vert X\Vert _{*} + \langle C, X \rangle \). Since the nuclear norm \(\Vert \cdot \Vert _{*}\) is a nonsmooth function, an important issue we must address is how to solve (9) efficiently. The strong convexity of the objective function suggests applying an indirect method for solving (9) based on the duality theory for convex programming. We note that the proposed partial PPA requires solving an inner subproblem with linear inequality constraints at each iteration. To handle the inequality constraints, Gao and Sun [24] recently designed a quadratically convergent inexact smoothing Newton method to solve semidefinite least squares problems with equality and inequality constraints but with simple quadratic objective functions of the form \(\frac{1}{2}\Vert X-G\Vert ^2\). Their numerical results demonstrated the high efficiency of the inexact smoothing Newton method. This strongly motivated us to use an inexact smoothing Newton method to solve our inner subproblems for achieving fast convergence. For the inner subproblem, due to the presence of inequality constraints, we reformulate the problem as a system of semismooth equations. By defining a smoothing function of the soft thresholding operator, we introduce an inexact smoothing Newton method to solve the semismooth system. The quadratic convergence of the inexact smoothing Newton method is proved under a constraint nondegeneracy condition, together with the strong semismoothness property of the soft thresholding operator.

The remaining parts of this paper are organized as follows. In Sect. 2, we present some preliminaries about semismooth functions. We show that the soft thresholding operator is strongly semismooth everywhere, and define a smoothing function of the soft thresholding operator. In Sect. 3, we design a partial PPA for solving the nuclear norm regularized matrix least squares problem (1), and establish its global and local convergence. Section 4 follows with the design of an inexact smoothing Newton method for solving the inner subproblems. The quadratic convergence of the method is established under a constraint nondegeneracy condition, which we also characterize. We report the numerical implementation and performance of our algorithm in Sect. 5 and give the conclusion in Sect. 6.

2 Preliminaries

In this section, we give a brief introduction on some basic concepts such as semismooth functions, the B-subdifferential and Clarke’s generalized Jacobian of Lipschitz functions. We shall show that the soft thresholding operator is strongly semismooth everywhere. These concepts and properties will be critical for us to develop an inexact smoothing Newton method for solving the inner subproblems in our partial PPA.

Let \(F:\mathfrak {R}^m \longrightarrow \mathfrak {R}^l\) be a locally Lipschitz function. By Rademacher’s theorem, \(F\) is Fréchet differentiable almost everywhere. Let \(D_F\) denote the set of points in \(\mathfrak {R}^m\) where \(F\) is differentiable. The B-subdifferential of \(F\) at \(x\in \mathfrak {R}^m\) is defined by

$$\begin{aligned} \partial _B F(x) := \{V: \ V=\lim _{k\rightarrow \infty } F^{\prime }(x^k),\ x^k\longrightarrow x,\ x^k \in D_F\}, \end{aligned}$$

where \(F^{\prime }(x)\) denotes the Jacobian of \(F\) at \(x\in D_F\). Then Clarke’s [14] generalized Jacobian of \(F\) at \(x\in \mathfrak {R}^m\) is defined as the convex hull of \(\partial _B F(x)\), i.e.,

$$\begin{aligned} \partial F(x) = \mathrm {conv} \{\partial _B F(x)\}. \end{aligned}$$

From [46, Lemma 2.2 ], we know that if \(F\) is directionally differentiable in a neighborhood of \(x\in \mathfrak {R}^m\), then for any \(h\in \mathfrak {R}^m\), there exists \(\mathcal {V} \in \partial F(x)\) such that \(F^{\prime }(x; h) = \mathcal {V}h\). The following concept of semismoothness was first introduced by Mifflin [40] for functionals and was extended by Qi and Sun [46] to vector-valued functions.

Definition 2.1

We say that a locally Lipschitz function \(F:\mathfrak {R}^m \longrightarrow \mathfrak {R}^l\) is semismooth at \(x\) if

  1. \(F\) is directionally differentiable at \(x\); and

  2. for any \(h\in \mathfrak {R}^m\) and \(V\in \partial F(x+h)\) with \(h\rightarrow 0\), \(F(x+h)- F(x) - Vh = o(\Vert h\Vert ).\)

Furthermore, \(F\) is said to be strongly semismooth at \(x\) if \(F\) is semismooth at \(x\) and for any \(h\in \mathfrak {R}^m\) and \(V\in \partial F(x+h)\) with \(h\rightarrow 0\), \( F(x+h)- F(x) - Vh = O(\Vert h\Vert ^2). \)

Let \(K\) be a closed convex cone in a finite dimensional real Euclidean space \(\mathcal {X}\) equipped with a scalar inner product \(\langle \cdot , \cdot \rangle \) and its induced norm \(\Vert \cdot \Vert \). Let \(\Pi _K: \mathcal {X}\rightarrow \mathcal {X}\) denote the metric projector over \(K\), i.e., for any \(y\in \mathcal {X}\), \(\Pi _K(y)\) is the unique optimal solution to the following convex optimization problem:

$$\begin{aligned} \displaystyle \min \left\{ \frac{1}{2}\langle x-y, x-y \rangle \;:\; x \in K \right\} . \end{aligned}$$

It is well known [60] that the metric projector \(\Pi _K(\cdot )\) is Lipschitz continuous with modulus \(1\) and \(\Vert \Pi _K(\cdot )\Vert ^2\) is continuously differentiable. Hence, by Rademacher's theorem, \(\Pi _K(\cdot )\) is almost everywhere Fréchet differentiable in \(\mathcal {X}\) and for every \(y\in \mathcal {X}\), \(\partial \Pi _K(y)\) is well defined. For any \(X \in \mathcal{S}^n\), let \(X_+ = \Pi _{\mathcal{S}_{+}^n}(X)\) be the metric projection of \(X\) onto \(\mathcal{S}_+^n\) under the standard trace inner product. Assume that \(X\) has the following spectral decomposition

$$\begin{aligned} X = P \Lambda P^T, \end{aligned}$$

where \(\Lambda \) is the diagonal matrix of eigenvalues and \(P\) is a corresponding orthogonal matrix of eigenvectors. Then

$$\begin{aligned} X_{+} = P \Lambda _{+} P^T, \end{aligned}$$

where \(\Lambda _{+}\) is a diagonal matrix whose diagonal entries are the nonnegative parts of the respective diagonal entries of \(\Lambda \). The strong semismoothness of \(\Pi _{\mathcal{S}_{+}^n}(\cdot )\) has been proved by Sun and Sun in [53].
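
As a small numerical illustration of this projection (not used anywhere in the algorithmic development below), the following numpy sketch clips the eigenvalues of a symmetric matrix at zero:

```python
import numpy as np

def proj_psd(X):
    # Metric projection of a symmetric X onto S^n_+ under the trace inner product:
    # keep the eigenvectors and clip the eigenvalues at zero.
    w, P = np.linalg.eigh(X)
    return (P * np.maximum(w, 0.0)) @ P.T

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
X = (A + A.T) / 2                                # a random symmetric matrix
Xp = proj_psd(X)
print(np.linalg.eigvalsh(Xp).min() >= -1e-12)    # X_+ is positive semidefinite
print(np.linalg.norm(Xp - proj_psd(Xp)))         # projecting twice changes nothing
```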

Next, we shall show that the soft thresholding operator [9, 35] is strongly semismooth everywhere. Let \(Y\in \mathfrak {R}^{p\times q}\) admit the following singular value decomposition (SVD):

$$\begin{aligned} Y =U [\Sigma \ 0] V^T, \end{aligned}$$
(10)

where \(U\in \mathfrak {R}^{p\times p}\) and \(V \in \mathfrak {R}^{q\times q}\) are orthogonal matrices, \(\Sigma =\mathrm{Diag} (\sigma _1, \ldots , \sigma _p)\) is the diagonal matrix of singular values of \(Y\), with \(\sigma _1 \ge \sigma _2 \ge \cdots \ge \sigma _p\ge 0\). Define \(s: \mathfrak {R}\rightarrow \mathfrak {R}\) by

$$\begin{aligned} s(t):=(t-\rho )_+ - (-t-\rho )_+. \end{aligned}$$
(11)

For each threshold \(\rho >0\), the soft thresholding operator \(D_\rho \) is defined as follows:

$$\begin{aligned} D_\rho (Y) =U [\Sigma _\rho \ 0] V^T , \end{aligned}$$
(12)

where \(\Sigma _\rho =\mathrm{Diag} (s(\sigma _1), \ldots , s(\sigma _p))\). Decompose \(V\in \mathfrak {R}^{q\times q}\) as \( V \, =\, [V_1 \ \ V_2 ]\, , \) where \(V_1\in \mathfrak {R}^{q\times p}\) and \(V_2\in {\mathfrak {R}}^{q\times (q-p)}\). Let the orthogonal matrix \(Q\in {\mathfrak {R}}^{(p+q)\times (p+q)}\) be defined by

$$\begin{aligned} Q : \, =\, \frac{1}{\sqrt{2}} \left[ \begin{array}{c@{\quad }c@{\quad }c} U &{} U &{} 0 \\ V_1 &{} -V_1 &{} \sqrt{2} V_2 \end{array} \right] \, , \end{aligned}$$
(13)

and \(\Xi : {\mathfrak {R}}^{p\times q} \rightarrow \mathcal{S}^{p+q}\) be defined by

$$\begin{aligned} \Xi (Y) :\, = \, \left[ \begin{array}{c@{\quad }c} 0 &{} Y \\ Y^T &{} 0 \end{array} \right] \, , \ \ Y \in \mathfrak {R}^{p\times q}\, . \end{aligned}$$
(14)

By [28, Section 8.6], we know that \(\Xi (Y)\) has the following spectral decomposition:

$$\begin{aligned} \Xi (Y) \, =\, Q \left[ \begin{array}{c@{\quad }c@{\quad }c} \Sigma &{} 0 &{} 0 \\ 0 &{} -\Sigma &{} 0 \\ 0 &{} 0 &{} 0 \end{array} \right] Q^T \, , \end{aligned}$$
(15)

i.e., the eigenvalues of \(\Xi (Y)\) are \(\pm \sigma _i , i=1,\ldots , p,\) and \(0\) of multiplicity \(q-p.\)
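
This decomposition is easy to verify numerically. The following numpy sketch (purely illustrative) forms \(Q\) as in (13) from a full SVD of a random \(Y\) and confirms (15):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 6
Y = rng.standard_normal((p, q))

U, s, Vt = np.linalg.svd(Y)            # full SVD: U is p-by-p, Vt is q-by-q
V = Vt.T
V1, V2 = V[:, :p], V[:, p:]

# Orthogonal matrix Q from (13) and the eigenvalue matrix from (15).
Q = np.block([[U / np.sqrt(2),  U / np.sqrt(2),  np.zeros((p, q - p))],
              [V1 / np.sqrt(2), -V1 / np.sqrt(2), V2]])
Lam = np.diag(np.concatenate([s, -s, np.zeros(q - p)]))

Xi = np.block([[np.zeros((p, p)), Y], [Y.T, np.zeros((q, q))]])
print(np.linalg.norm(Q @ Lam @ Q.T - Xi))        # ~1e-14: Xi(Y) = Q Diag(Sigma,-Sigma,0) Q^T
print(np.linalg.norm(Q @ Q.T - np.eye(p + q)))   # Q is orthogonal
```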

For any \(W= P\mathrm{Diag}(\lambda _1, \ldots , \lambda _{p+q})P^T\in \mathcal{S}^{p+q}\), define \(\widehat{S}: \mathcal{S}^{p+q} \rightarrow \mathcal{S}^{p+q}\) by

$$\begin{aligned} \widehat{S}(W) := P {\mathrm{Diag}}(s(\lambda _1), \ldots , s(\lambda _{p+q}))P^T = (W-\rho I)_+ - (-W-\rho I)_+\, . \end{aligned}$$

Then, by the strong semismoothness property of \((\cdot )_+ : \mathcal{S}^{p+q} \rightarrow \mathcal{S}^{p+q}\), we have that \(\widehat{S}(\cdot )\) is strongly semismooth everywhere in \(\mathcal{S}^{p+q}\). By direct calculations, we have

$$\begin{aligned} S(Y) := \widehat{S}(\Xi (Y)) \ = \, Q \left[ \begin{array}{c@{\quad }c@{\quad }c} \Sigma _\rho &{} 0 &{} 0 \\ 0 &{} -\Sigma _\rho &{} 0 \\ 0 &{} 0 &{} 0 \end{array} \right] Q^T = \left[ \begin{array}{c@{\quad }c} 0 &{} D_{\rho }(Y) \\ D_{\rho }(Y)^T &{} 0 \end{array} \right] \, . \end{aligned}$$
(16)

Thus we have the following theorem.

Theorem 2.1

The function \(D_{\rho }(\cdot )\) is strongly semismooth everywhere in \(\mathfrak {R}^{p\times q}\).

Proof

Let \(Y\in \mathfrak {R}^{p\times q}\) admit the SVD as in (10). Notice that \(\widehat{S}(\cdot )\) is strongly semismooth in \(\mathcal{S}^{p+q}\). This, together with (16), proves that \(D_{\rho }(\cdot )\) is strongly semismooth at \(Y\). Since \(Y\) is arbitrarily chosen, we have that \(D_{\rho }(\cdot )\) is strongly semismooth everywhere in \(\mathfrak {R}^{p\times q}\).

\(\square \)

For the convenience of later discussion, we define the following three index sets:

$$\begin{aligned} \alpha :=\{1,\ldots ,p\}, \quad \gamma :=\{p+1, \ldots , 2p\}, \quad \beta :=\{2p+1, \ldots , p+q\}. \end{aligned}$$
(17)

For each \(\rho >0\), the index set \(\alpha \) is further decomposed into the following subindex sets:

$$\begin{aligned} \alpha _1 \!:=\!\{i \, | \sigma _i(Y) > \rho , i\in \alpha \}, \alpha _2 \!:=\!\{i \, | \sigma _i(Y) \!=\!\rho , i\in \alpha \}, \alpha _3 \!:=\!\{i \, | \sigma _i(Y) < \rho , i \in \alpha \}. \end{aligned}$$
(18)

In the following, we show that all the elements of the generalized Jacobian \(\partial D_\rho (\cdot )\) are self-adjoint and positive semidefinite. First we prove the following lemma.

Lemma 2.1

Let \(Y \in \mathfrak {R}^{p\times q}\) admit the SVD as in (10). Then the unique minimizer of the following problem

$$\begin{aligned} \displaystyle \min \left\{ \Vert X-Y\Vert ^2 : X\in B_\rho := \{Z \in \mathfrak {R}^{p\times q}: \Vert Z\Vert _2\le \rho \} \right\} \end{aligned}$$
(19)

is \(X^{*}=\Pi _{B_\rho }(Y) = U [\min (\Sigma , \rho ) \ 0] V^T\), where \(\min (\Sigma , \rho ) = \mathrm{Diag}(\min (\sigma _1,\rho ),\ldots ,\min (\sigma _p,\rho ))\).

Proof

Obviously problem (19) has a unique optimal solution, which is equal to \(\Pi _{B_\rho }(Y)\). For any feasible \(Z\), whose SVD takes the same form as in (10), we have \(\sigma _i(Z)\le \rho ,\ i=1, \ldots , p\). Since \(\Vert \cdot \Vert \) is unitarily invariant, by [7, Exercise IV.3.5], we have that

$$\begin{aligned} \Vert Y-Z\Vert ^2\ge \sum _{i\in \alpha _1} (\sigma _{i}(Y)-\sigma _{i}(Z))^2 + \sum _{i\in \alpha _2 \cup \alpha _3} (\sigma _{i}(Y)-\sigma _{i}(Z))^2\ge \sum _{i\in \alpha _1} (\sigma _{i}(Y)-\rho )^2. \end{aligned}$$

Since \(\Vert Y-X^{*}\Vert ^2 = \sum _{i \in \alpha _1} (\sigma _{i}(Y)-\rho )^2,\) we have \(\Vert Y-Z\Vert ^2\ge \Vert Y-X^{*}\Vert ^2\) for any \(Z \in B_\rho \). Thus \(X^{*}= U [\min (\Sigma , \rho ) \ 0] V^T\) is the unique optimal solution. \(\square \)

Note that the above lemma has also been proved in [44] by a different argument. From the above lemma, we have that \(D_\rho (Y) = Y -\Pi _{B_\rho }(Y)\), which implies that \(\Pi _{B_\rho }(\cdot )\) is also strongly semismooth everywhere in \(\mathfrak {R}^{p\times q}\).
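
As an illustration of the two operators just discussed (again, not part of the algorithm developed later), the following numpy sketch computes \(D_\rho (Y)\) and \(\Pi _{B_\rho }(Y)\) from the SVD and checks the identity \(D_\rho (Y) = Y - \Pi _{B_\rho }(Y)\):

```python
import numpy as np

def soft_threshold(Y, rho):
    # D_rho(Y): shrink each singular value by rho and truncate at zero, per (11)-(12).
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - rho, 0.0)) @ Vt

def proj_spectral_ball(Y, rho):
    # Pi_{B_rho}(Y): clip each singular value at rho (Lemma 2.1).
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.minimum(s, rho)) @ Vt

rng = np.random.default_rng(2)
Y = rng.standard_normal((5, 8))
rho = 0.7
print(np.linalg.norm(soft_threshold(Y, rho) - (Y - proj_spectral_ball(Y, rho))))  # ~1e-15
```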

Now, we shall show that even though the soft thresholding operator \(D_\rho (\cdot )\) is not differentiable everywhere, \(\Vert D_\rho (\cdot )\Vert ^2\) is continuously differentiable. First we summarize some well-known properties of the Moreau–Yosida [42, 59] regularization. Assume that \(\mathcal {Y}\) is a finite-dimensional real Euclidean space. Let \(f: \mathcal {Y} \rightarrow (-\infty , +\infty ]\) be a proper lower semicontinuous convex function. For a given \(\sigma >0\), the Moreau–Yosida regularization of \(f\) is defined by

$$\begin{aligned} \displaystyle F_{\sigma }(y)= \min \left\{ f(x) +\frac{1}{2\sigma }\Vert x-y\Vert ^2 : x \;\in \; \mathcal {Y} \right\} . \end{aligned}$$
(20)

It is well known that \(F_{\sigma }\) is a continuously differentiable convex function on \(\mathcal {Y}\) and

$$\begin{aligned} \nabla F_{\sigma }(y) = \frac{1}{\sigma } (y - x(y)), \end{aligned}$$

where \(x(y)\) denotes the unique optimal solution of (20). It is well known that \(x(\cdot )\) is globally Lipschitz continuous with modulus \(1\) and \(\nabla F_{\sigma }\) is globally Lipschitz continuous with modulus \(1/\sigma \).

Proposition 2.1

Let \(\displaystyle \Theta (Y) = \frac{1}{2}\Vert D_\rho (Y)\Vert ^2\), where \(Y\in \mathfrak {R}^{p\times q}\). Then \(\Theta (Y)\) is continuously differentiable and

$$\begin{aligned} \nabla \Theta (Y) = D_\rho (Y). \end{aligned}$$
(21)

Proof

It is already known that the following minimization problem

$$\begin{aligned} F(Y) \; = \; \min \left\{ \rho \Vert X\Vert _{*} + \frac{1}{2}\Vert X-Y\Vert ^2 : X\in \mathfrak {R}^{p\times q}\right\} , \end{aligned}$$

has the unique optimal solution \(X= D_\rho (Y)\) (see, [9, 38]). From the properties of the Moreau–Yosida regularization, we know that \(D_\rho (\cdot )\) is globally Lipschitz continuous with modulus \(1\) and \(F(Y)\) is continuously differentiable with

$$\begin{aligned} \nabla F(Y) = Y - D_\rho (Y). \end{aligned}$$
(22)

Since \(D_\rho (Y)\) is the unique optimal solution, we have that

$$\begin{aligned} F(Y) = \rho \Vert D_\rho (Y)\Vert _{*} + \frac{1}{2}\Vert D_\rho (Y)-Y\Vert ^2 = \frac{1}{2}\Vert Y\Vert ^2 - \frac{1}{2}\Vert D_\rho (Y)\Vert ^2. \end{aligned}$$
(23)

This, together with (22), implies that \( \Theta (\cdot )\) is continuously differentiable with \(\nabla \Theta (Y) = D_\rho (Y)\). \(\square \)
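
Proposition 2.1 can also be checked numerically. The sketch below (illustrative only) compares \(D_\rho (Y)\) with a central-difference approximation of \(\nabla \Theta (Y)\) at a random \(Y\):

```python
import numpy as np

def soft_threshold(Y, rho):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - rho, 0.0)) @ Vt

def Theta(Y, rho):
    # Theta(Y) = 0.5 * ||D_rho(Y)||_F^2; only the singular values are needed.
    s = np.linalg.svd(Y, compute_uv=False)
    return 0.5 * np.sum(np.maximum(s - rho, 0.0) ** 2)

rng = np.random.default_rng(3)
Y = rng.standard_normal((4, 6))
rho, h = 0.5, 1e-6
G = np.zeros_like(Y)
for i in range(Y.shape[0]):
    for j in range(Y.shape[1]):
        E = np.zeros_like(Y)
        E[i, j] = h
        G[i, j] = (Theta(Y + E, rho) - Theta(Y - E, rho)) / (2 * h)  # central difference
print(np.linalg.norm(G - soft_threshold(Y, rho)))   # small: grad Theta(Y) = D_rho(Y)
```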

Next, we shall discuss the smoothing counterpart of the soft thresholding operator \(D_{\rho }(\cdot )\). Let \(\chi (\varepsilon , t): \mathfrak {R}\times \mathfrak {R}\rightarrow \mathfrak {R}\) be the following Huber smoothing function for \((t)_{+}\):

$$\begin{aligned} \chi (\varepsilon , t)= \left\{ \begin{array}{l@{\quad }l} t &{} \mathrm{if}\ t\ge \frac{|\varepsilon |}{2}, \\ \displaystyle \frac{1}{2|\varepsilon |} \left( t + \frac{|\varepsilon |}{2}\right) ^2 &{} \mathrm{if} \ -\frac{|\varepsilon |}{2} < t < \frac{|\varepsilon |}{2}, \\ 0 &{} \mathrm{if}\ t \le -\frac{|\varepsilon |}{2}, \end{array} \right. \; (\varepsilon , t) \in \mathfrak {R}\times \mathfrak {R}. \end{aligned}$$
(24)

Then the smoothing function for \(s(\cdot )\) in (11) is defined as follows:

$$\begin{aligned} \overline{s}(\varepsilon , t) = \chi (\varepsilon , t-\rho ) - \chi (\varepsilon , -t-\rho ), \quad (\varepsilon , t) \in \mathfrak {R}\times \mathfrak {R}. \end{aligned}$$
(25)

Note that \(\overline{s}(\varepsilon , \cdot )\) is an odd function with respect to \(t\in \mathfrak {R}\). Let \(Y\in \mathfrak {R}^{p\times q}\) admit the SVD as in (10). For any \(\varepsilon \in \mathfrak {R}\), the smoothing function for \(S(Y)\) in (16) is defined as follows:

$$\begin{aligned} \overline{S}(\varepsilon , Y):= Q\left[ \begin{array}{c@{\quad }c@{\quad }c} \overline{\Sigma }_{\rho } &{} 0 &{} 0 \\ 0 &{} -\overline{\Sigma }_{\rho } &{} 0 \\ 0 &{} 0 &{} 0 \end{array} \right] Q^T, \end{aligned}$$
(26)

where \(\overline{\Sigma }_{\rho } = \mathrm{Diag}(\overline{s}(\varepsilon , \sigma _1), \ldots , \overline{s}(\varepsilon , \sigma _p))\). By direct calculations, we have

$$\begin{aligned} \overline{S}(\varepsilon , Y) = \begin{bmatrix} 0&\quad \overline{D}_{\rho }(\varepsilon , Y)\\ (\overline{D}_{\rho }(\varepsilon , Y))^T&\quad 0 \end{bmatrix} = \Xi \left( \overline{D}_{\rho }(\varepsilon , Y) \right) , \end{aligned}$$

where

$$\begin{aligned} \overline{D}_{\rho }(\varepsilon , Y) = U \big [\overline{\Sigma }_\rho \;\; 0 \big ]V^T \end{aligned}$$
(27)

is a smoothing function for \(D_{\rho }(Y)\). Note that when \(\varepsilon = 0\), \(\overline{S}(0, Y) = S(Y)\) and \(\overline{D}_{\rho }(0, Y)= D_{\rho }(Y)\). For any \(\lambda = (\lambda _1, \ldots , \lambda _{p+q})^T \in \mathfrak {R}^{p+q}\), let \(\lambda _i = \sigma _i\) for \(i\in \alpha \), \(\lambda _i = -\sigma _{i-p}\) for \(i\in \gamma \), and \(\lambda _i = 0\) for \(i\in \beta \). When \(\varepsilon \ne 0\) or \(\sigma _i \ne \rho , i= 1, \ldots , p\), we use \(\Lambda (\varepsilon , \lambda )\in \mathcal {S}^{p+q}\) to denote the following first divided difference symmetric matrix for \(\overline{s}(\varepsilon , \cdot )\) at \(\lambda \)

$$\begin{aligned} \Lambda (\varepsilon , \lambda ) = \left[ \begin{array}{c@{\quad }c@{\quad }c} \Lambda _{\alpha \alpha } &{} \Lambda _{\alpha \gamma } &{} \Lambda _{\alpha \beta } \\ \Lambda _{\alpha \gamma }^T &{} \Lambda _{\gamma \gamma } &{} \Lambda _{\gamma \beta } \\ \Lambda _{\alpha \beta } ^T &{} \Lambda _{\gamma \beta }^T &{} \Lambda _{\beta \beta } \end{array} \right] , \end{aligned}$$
(28)

whose \((i,j)\)-th entry is given by

$$\begin{aligned}{}[ \Lambda (\varepsilon , \lambda ) ]_{ij}= \left\{ \begin{array}{l@{\quad }l} \displaystyle \frac{\overline{s}(\varepsilon , \lambda _i) - \overline{s}(\varepsilon , \lambda _j)}{\lambda _i - \lambda _j} \in [0, 1] &{} \mathrm{if}\ \lambda _i \ne \lambda _j \\ \displaystyle (\overline{s})^{\prime }_{\lambda _i}(\varepsilon , \lambda _i) \in [0, 1] &{} \mathrm{if} \ \lambda _i = \lambda _j \end{array} \right. \! , \ i,j = 1,\ldots p+q. \end{aligned}$$

Since \(\overline{s}(\varepsilon , \cdot )\) is an odd function, we have the following results:

$$\begin{aligned} \Lambda _{\alpha \alpha } = \Lambda _{\gamma \gamma }, \quad \Lambda _{\alpha \gamma } = (\Lambda _{\alpha \gamma })^T, \quad \Lambda _{\gamma \beta } = \Lambda _{\alpha \beta }, \end{aligned}$$

and \((\Lambda (\varepsilon , \lambda ))_{ij}\in [0, 1]\) for all \(i,j= 1,\ldots , p+q\).
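
A direct numpy transcription of (24), (25) and (27) may be helpful at this point; the sketch below (an illustration under the definitions above, not the implementation used in the numerical experiments) builds \(\chi \), \(\overline{s}\) and \(\overline{D}_\rho (\varepsilon , Y)\), and checks that \(\overline{D}_\rho (0, Y)\) recovers \(D_\rho (Y)\):

```python
import numpy as np

def chi(eps, t):
    # Huber smoothing of (t)_+ from (24); vectorized in t.
    a = abs(eps)
    if a == 0.0:
        return np.maximum(t, 0.0)
    return np.where(t >= a / 2, t,
                    np.where(t <= -a / 2, 0.0, (t + a / 2) ** 2 / (2 * a)))

def s_bar(eps, t, rho):
    # Smoothing of s(t) = (t - rho)_+ - (-t - rho)_+ from (25).
    return chi(eps, t - rho) - chi(eps, -t - rho)

def D_bar(eps, Y, rho):
    # Smoothed soft thresholding operator from (27): apply s_bar to the singular values.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(s_bar(eps, s, rho)) @ Vt

rng = np.random.default_rng(4)
Y = rng.standard_normal((5, 7))
rho = 0.6
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
D_exact = U @ np.diag(np.maximum(s - rho, 0.0)) @ Vt       # D_rho(Y)
print(np.linalg.norm(D_bar(0.0, Y, rho) - D_exact))        # ~0: D_bar(0, Y) = D_rho(Y)
print(np.linalg.norm(D_bar(1e-3, Y, rho) - D_exact))       # small smoothing perturbation
```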

Proposition 2.2

Let \(Y\) admit the SVD in (10). If \(\varepsilon \ne 0\) or \(\sigma _i \ne \rho , i= 1, \ldots , p\), then \(\overline{D}_{\rho }(\cdot ,\cdot )\) is continuously differentiable around \((\varepsilon ,Y)\), and for any \((\tau , H)\in \mathfrak {R}\times \mathfrak {R}^{p\times q}\), it holds that

$$\begin{aligned} (\overline{D}_{\rho })^{\prime }(\varepsilon , Y) (\tau , H)&= U\left( \Lambda _{\alpha \alpha }\circ H_1^s \!+\! \Lambda _{\alpha \gamma }\circ H_1^a \!+\! \tau D(\varepsilon , \Sigma )\right) V_1^T \nonumber \\&+ U(\Lambda _{\alpha \beta } \ \circ H_2) V_2^T, \end{aligned}$$
(29)

where \(H_1=U^T H V_1\), \(H_2=U^T H V_2\), with \(H_1^s = (H_1 + H_1^T)/{2}\), \(H_1^a = (H_1 - H_1^T)/{2}\), and

$$\begin{aligned} D(\varepsilon , \Sigma ) = \mathrm{Diag}((\overline{s})^{\prime }_{\varepsilon }(\varepsilon , \sigma _1), \ldots , (\overline{s})^{\prime }_{\varepsilon }(\varepsilon , \sigma _p)). \end{aligned}$$
(30)

Furthermore, by [37], \(\overline{D}_{\rho }(\cdot , \cdot )\) is globally Lipschitz continuous and strongly semismooth at any \((0, Y)\in \mathfrak {R}\times \mathfrak {R}^{p\times q}\).

Proof

By the well known result of Löwner [36], we know that for any \(H\in \mathfrak {R}^{p\times q}\),

$$\begin{aligned} (\overline{S})^{\prime }_{Y}(\varepsilon , Y) H = Q[\Lambda (\varepsilon , \lambda )\circ (Q^T \Xi (H)Q)]Q^T, \end{aligned}$$
(31)

where “\(\circ \)” denotes the Hadamard product. Since

$$\begin{aligned} Q^T\Xi (H) Q&= \frac{1}{2}\begin{bmatrix} H_1+H_1^T&\quad H_1^T- H_1&\quad \sqrt{2}H_2\\ H_1-H_1^T&\quad -(H_1+H_1^T)&\quad \sqrt{2}H_2 \\ \sqrt{2}H_2^T&\quad \sqrt{2}H_2^T&\quad 0 \\ \end{bmatrix}, \end{aligned}$$
(32)

by simple algebraic calculations, we have that

$$\begin{aligned} (\overline{S})^{\prime }_{Y}(\varepsilon , Y) H = Q[\Lambda (\varepsilon , \lambda )\circ (Q^T \Xi (H)Q)]Q^T = \begin{bmatrix} 0&\quad A_{12} \\ A_{12}^T&\quad 0 \\ \end{bmatrix}, \end{aligned}$$

where \( A_{12} = U\left( \Lambda _{\alpha \alpha }\circ H_1^s + \Lambda _{\alpha \gamma }\circ H_1^a \right) V_1^T + U(\Lambda _{\alpha \beta } \ \circ H_2) V_2^T. \) When \(\varepsilon \ne 0\) or \(\sigma _i \ne \rho , i= 1, \ldots , p\), the partial derivative of \(\overline{S}(\cdot , \cdot )\) with respect to \(\varepsilon \) can be computed by

$$\begin{aligned} (\overline{S})^{\prime }_{\varepsilon }(\varepsilon , Y) = Q\left[ \begin{array}{c@{\quad }c@{\quad }c} D(\varepsilon , \Sigma ) &{} 0 &{} 0 \\ 0&{} -D(\varepsilon , \Sigma ) &{} 0 \\ 0 &{} 0 &{} 0 \end{array} \right] Q^T \!=\! \left[ \begin{array}{c@{\quad }c} 0 &{} U D(\varepsilon , \Sigma ) V_1^T \\ V_1 D(\varepsilon , \Sigma ) U^T &{} 0 \end{array} \right] . \end{aligned}$$
(33)

Since for any \((\tau , H) \in \mathfrak {R}\times \mathfrak {R}^{p\times q}\),

$$\begin{aligned} (\overline{S})^{\prime }(\varepsilon , Y)(\tau , H) = \left[ \begin{array}{c@{\quad }c} 0 &{} (\overline{D}_{\rho })^{\prime }(\varepsilon , Y) (\tau , H) \\ ((\overline{D}_{\rho })^{\prime }(\varepsilon , Y)(\tau ,H))^T &{} 0 \end{array} \right] , \end{aligned}$$

we have the required result in (29). Thus, \(\overline{D}_{\rho }(\cdot , \cdot )\) is continuously differentiable around \((\varepsilon , Y)\in \mathfrak {R}\times \mathfrak {R}^{p\times q}\) if \(\varepsilon \ne 0\) or \(\sigma _i \ne \rho , i= 1, \ldots , p\). \(\square \)

Next, we will give a characterization of the generalized Jacobian \(\partial \overline{D}_\rho (0, Y)\) for \((0, Y)\in \mathfrak {R}\times \mathfrak {R}^{p\times q}\). Let \(\mathcal {D}\) be the set of points in \(\mathfrak {R}\times \mathfrak {R}^{p\times q}\) at which \(\overline{D}_\rho \) is differentiable. Suppose that \(\mathcal {N}\) is any set of Lebesgue measure zero in \(\mathfrak {R}\times \mathfrak {R}^{p\times q}\). Then

$$\begin{aligned} \partial \overline{D}_\rho (0, Y) = \mathrm{conv}\left\{ \lim _{(\varepsilon ^k, Y^k) \rightarrow (0, Y)} (\overline{D}_\rho )^{\prime }(\varepsilon ^k, Y^k) : (\varepsilon ^k, Y^k)\in \mathcal {D}, \ (\varepsilon ^k, Y^k) \notin \mathcal {N} \right\} . \end{aligned}$$
(34)

Note that \(\partial \overline{D}_\rho (0, Y)\) does not depend on the choice of the null set \(\mathcal {N}\) [57, Theorem 4].

Before characterizing \(\partial \overline{D}_\rho (0, Y)\) in the next proposition, we need to introduce a “submap” of \(\overline{D}_\rho \) as follows. Define \((\overline{D}_{\rho })_{|\alpha _2|}: \mathfrak {R}\times \mathfrak {R}^{|\alpha _2|\times |\alpha _2|} \rightarrow \mathfrak {R}^{|\alpha _2|\times |\alpha _2|}\) by replacing both dimensions \(p\) and \(q\) in the definition of \(\overline{D}_\rho : \mathfrak {R}\times \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^{p\times q}\) with \(|\alpha _2|\), where the index set \(\alpha _2\) is defined as in (18). As in the case of \(\overline{D}_\rho (\cdot , \cdot )\), the mapping \((\overline{D}_{\rho })_{|\alpha _2|}(\cdot , \cdot )\) is also Lipschitz continuous. Thus Clarke’s generalized Jacobian \(\partial (\overline{D}_\rho )_{|\alpha _2|}(0, Z)\) for \((0,Z) \in \mathfrak {R}\times \mathfrak {R}^{|\alpha _2|\times |\alpha _2|}\) is well defined. We let \(I_{|\alpha _2|}\) be the identity matrix of size \(|\alpha _2|\).

Proposition 2.3

Let \(Y\in \mathfrak {R}^{p\times q}\) admit the SVD as in (10). Then, for any \(\mathcal {V} \in \partial \overline{D}_\rho (0, Y)\), there exists \(\mathcal {V}_{|\alpha _2|}\in \partial (\overline{D}_\rho )_{|\alpha _2|}(0, \rho I_{|\alpha _2|})\) such that

$$\begin{aligned} \mathcal {V}(\tau , H)&= U\left[ \begin{array}{c@{\quad }c@{\quad }c} (H_1^s)_{\alpha _1 \alpha _1} &{} (H_1^s)_{\alpha _1 \alpha _2} &{} \Omega _{\alpha _1 \alpha _3}\circ (H_1^s)_{\alpha _1 \alpha _3} \\ (H_1^s)_{\alpha _1 \alpha _2}^T &{} \mathcal {V}_{|\alpha _2|}(\tau ,(H_1^s)_{\alpha _2 \alpha _2}) &{} 0 \\ \Omega _{\alpha _1 \alpha _3}^T\circ (H_1^s)_{\alpha _1 \alpha _3}^T &{} 0 &{} 0 \end{array} \right] V_1^T \nonumber \\&\qquad + \;U \left[ (\Gamma _{\alpha \gamma }\circ H_1^a)V_1^T + (\Gamma _{\alpha \beta } \circ H_2)V_2^T \right] \end{aligned}$$
(35)

for all \((\tau , H)\in \mathfrak {R}\times \mathfrak {R}^{p\times q}\), where \(\Omega _{\alpha _1 \alpha _3}, \Gamma _{\alpha \gamma }\) and \(\Gamma _{\alpha \beta }\) are defined as follows,

$$\begin{aligned} (\Omega _{\alpha _1 \alpha _3})_{ij}&:= \frac{\sigma _i - \rho }{\sigma _i - \sigma _j}, \quad \mathrm{for} \; i\in \alpha _1,\; j \in \alpha _3, \end{aligned}$$
(36)
$$\begin{aligned} \Gamma _{\alpha \gamma }&:= \left[ \begin{array}{c@{\quad }c@{\quad }c} \omega _{\alpha _1 \alpha _1} &{} \omega _{\alpha _1\alpha _2} &{} \omega _{\alpha _1\alpha _3} \\ \omega _{\alpha _1\alpha _2}^T &{} 0 &{} 0 \\ \omega _{\alpha _1\alpha _3}^T &{} 0 &{} 0 \end{array}\right] , \nonumber \\ \omega _{ij}&:= \frac{(\sigma _i - \rho )_{+} + (\sigma _j -\rho )_{+}}{\sigma _i + \sigma _j}, \; \mathrm{for}\; i \in \alpha _1, j \in \alpha ,\end{aligned}$$
(37)
$$\begin{aligned} \Gamma _{\alpha \beta }&:= \left[ \begin{array}{c} \mu _{\alpha _1 \bar{\beta }}\\ 0 \end{array}\right] , \quad \bar{\beta } :=\beta -2p = \{1,\ldots , q-p\}, \nonumber \\ \mu _{ij}&:= \frac{\sigma _i -\rho }{\sigma _i}, \quad \mathrm{for} \; i\in \alpha _1,\; j \in \bar{\beta }, \end{aligned}$$
(38)

\(H_1 = U^T H V_1, H_2 = U^T H V_2\), and \(H_1^s = (H_1 + H_1^T)/2\), \(H_1^a = (H_1-H_1^T)/2\).

Proof

Please refer to [33, Proposition 2.9]. \(\square \)

3 A partial proximal point algorithm for matrix least squares problems

In this section, we will describe how to design a partial proximal point algorithm (PPA) to solve the problem (1).

Let \(\mathcal {Z}\) be a finite dimensional real Euclidean space with an inner product \(\langle \cdot , \cdot \rangle \) and its induced norm \(\Vert \cdot \Vert \). Let \(\mathcal {T}: \mathcal {Z} \rightarrow \mathcal {Z}\) be a maximal monotone operator. We define its domain and image, respectively, as follows: \(\mathrm{Dom}\, (\mathcal {T}) : = \{z\in \mathcal {Z} \;|\; \mathcal {T}(z) \ne \emptyset \}\) and \(\mathrm{Im} \, (\mathcal {T}) := \bigcup _{z\in \mathcal {Z}} \mathcal {T}(z).\) Let \(\mathcal {Z} = \mathcal {X}\times \mathcal {Y}\), where \(\mathcal {X}\) and \(\mathcal {Y}\) are two finite dimensional real Euclidean spaces each equipped with a scalar product \(\langle \cdot , \, \cdot \rangle \) and its induced norm \(\Vert \cdot \Vert \). Suppose now that \(z\in \mathcal {Z}\) is partitioned into two components \(z = (x,y)\), where \(x\in \mathcal {X}\) and \(y \in \mathcal {Y}\). Ha [30] proposed a partial PPA to solve the inclusion problem in two variables \(0 \in \mathcal {T}(x,y)\), in which only one of the variables is involved in the proposed iterative procedure. Below we give a brief review of the partial PPA proposed by Ha [30]. Let \(\Pi : \mathcal {X}\times \mathcal {Y} \rightarrow \mathcal {X}\times \mathcal {Y}\) be the orthogonal projection of \( \mathcal {X}\times \mathcal {Y}\) onto \(\{0\}\times \mathcal {Y}\), i.e., \(\Pi (x,y) = (0,y)\). To solve the inclusion problem \(0\in \mathcal {T}(x,y)\), from a given initial point \((x^0, y^0)\in \mathcal {X}\times \mathcal {Y}\), the exact partial PPA generates a sequence \(\{(x^k, y^k)\}\) by the following scheme:

$$\begin{aligned} (x^{k+1}, y^{k+1}) \in P_{\sigma _k}(x^k,y^k), \end{aligned}$$
(39)

where \(P_{\sigma _k} := (\Pi + \sigma _k \mathcal {T})^{-1}\Pi \) and the sequence \(\{\sigma _k\}\) satisfies (8). In general, the mapping \(P_{\sigma _k}\) is neither single-valued nor nonexpansive. However, by [30, Proposition 2], we know that the second component of \(P_{\sigma _k}(x^k, y^k)\) is uniquely determined and nonexpansive. For practical purposes, the following general approximation criteria were introduced in [30]:

$$\begin{aligned} \Vert (x^{k+1}, y^{k+1}) - (u^{k+1}, v^{k+1})\Vert&\le \varepsilon _k,\quad \varepsilon _k >0, \; \sum _{k=0}^{\infty }\varepsilon _k < \infty , \end{aligned}$$
(40a)
$$\begin{aligned} \Vert (x^{k+1}, y^{k+1}) - (u^{k+1}, v^{k+1})\Vert&\le \delta _k\Vert (x^{k+1}, y^{k+1}) - (x^k, y^k)\Vert , \end{aligned}$$
(40b)
$$\begin{aligned} \Vert y^{k+1} - v^{k+1}\Vert&\le \delta _k \Vert y^{k+1} - y^k\Vert , \quad \delta _k>0, \; \sum _{k=0}^{\infty } \delta _k <\infty , \nonumber \\ \end{aligned}$$
(40c)

where \((u^{k+1}, v^{k+1}) \in P_{\sigma _k}(x^k, y^k)\). In [30], Ha showed that under mild assumptions, any cluster point of the sequence \(\{(x^k, y^k)\}\) generated by the partial PPA under criterion (40a) is a solution to \(0 \in \mathcal {T}(x,y)\). Moreover, the sequence \(\{y^k\}\) converges weakly to \(\bar{y}\), which is the second component of a solution to \(0 \in \mathcal {T}(x,y)\). If, in addition, (40b) and (40c) are also satisfied and \(\mathcal {T}^{-1}\) is Lipschitz continuous at the origin, then the sequence \(\{(x^k, y^k)\}\) converges locally at least at a linear rate whose ratio tends to zero as \(\sigma _k \rightarrow +\infty \). For a more detailed convergence analysis of the partial PPA, see [30, Theorems 1 and 2].

Next, we shall show how to use the partial PPA to solve (1). It is easy to see that (1) can be rewritten as follows:

$$\begin{aligned}&\min _{u\in \mathfrak {R}^m, X \in \mathfrak {R}^{p\times q}} \left\{ f_\rho (u, X):= \frac{1}{2}\Vert u\Vert ^2 + \rho \Vert X\Vert _{*} + \langle C, X \rangle : \mathcal {A} (X) + u = b,\right. \nonumber \\&\qquad \qquad \qquad \qquad \left. \mathcal {B}(X) \in d + \mathcal {Q} \right\} . \end{aligned}$$
(41)

Note that the objective function \(f_\rho (u,X)\) is strongly convex in \(u\) for all \(X\in \mathfrak {R}^{p\times q}\). For the convergence analysis, we assume that the following Slater condition holds:

$$\begin{aligned} \displaystyle \left\{ \begin{array}{l} \{\mathcal {B}_i\}_{i=1}^{s_1} \; \text {are linearly independent and}\; \exists \; X^0 \in \mathfrak {R}^{p\times q} \\ \text {such that}\; \mathcal {B}_i(X^0) = d_i, i= 1, \ldots , s_1 \; \mathrm{and}\; \mathcal {B}_i(X^0)> d_i, i= s_1+1, \ldots , s. \end{array} \right. \end{aligned}$$
(42)

Let \(l(u,X;\zeta ,\xi ): \mathfrak {R}^m \times \mathfrak {R}^{p\times q} \times \mathfrak {R}^m \times \mathfrak {R}^s \rightarrow \mathfrak {R}\) be the ordinary Lagrangian function for (41) in the extended form:

$$\begin{aligned} l(u,X; \zeta ,\xi ):= \left\{ \begin{array}{l@{\quad }l} \displaystyle f_\rho (u, X)+ \langle \zeta , b-\mathcal {A}(X)-u \rangle + \langle \xi , d-\mathcal {B}(X)\rangle &{} \mathrm{if}\ \xi \in \mathcal {Q}^{*},\\ -\infty &{} \mathrm{if} \ \xi \notin \mathcal {Q}^{*}, \end{array} \right. \end{aligned}$$
(43)

where \(\mathcal {Q}^{*} = \mathfrak {R}^{s_1} \times \mathfrak {R}_{+}^{s_2}\) is the dual cone of \(\mathcal {Q}\). The essential objective function in (41) is

$$\begin{aligned} f(u,X) \; := \; \sup _{\zeta \in \mathfrak {R}^m,\ \xi \in \mathfrak {R}^s} l(u,X;\zeta ,\xi ) = \left\{ \begin{array}{l@{\quad }l}\displaystyle f_{\rho }(u,X) &{} \mathrm{if}\ (u,X) \in \mathcal {F}_P, \\ +\infty &{} \mathrm{if} \ (u,X) \notin \mathcal {F}_P, \end{array} \right. \end{aligned}$$
(44)

where \(\mathcal {F}_P=\{(u, X)\in \mathfrak {R}^m \times \mathfrak {R}^{p\times q} \ |\ \mathcal {A} (X) + u = b, \mathcal {B}(X) \in d + \mathcal {Q}\}\) is the feasible set of (41). The dual problem of (41) is given by:

$$\begin{aligned} \max \left\{ g_\rho (\zeta ,\xi ) \!:\! \mathcal {A}^*(\zeta ) \!+\! \mathcal {B}^*(\xi ) \!+\! Z \!=\! C, \, \Vert Z\Vert _2 \le \rho , \, \zeta \!\in \! \mathfrak {R}^m,\, \xi \in \mathcal {Q}^{*}, \, Z\in \mathfrak {R}^{p\times q} \right\} ,\nonumber \\ \end{aligned}$$
(45)

where \(g_\rho (\zeta ,\xi ) := - \frac{1}{2}\Vert \zeta \Vert ^2 + \langle b, \zeta \rangle + \langle d, \xi \rangle \). As in Rockafellar [50], we define the following two maximal monotone operators

$$\begin{aligned} \displaystyle \left\{ \begin{array}{l} \mathcal {T}_{f}(u,X) = \{(v,Y) \in \mathfrak {R}^m \times \mathfrak {R}^{p\times q}: \ (v,Y) \in \partial f(u,X)\},\\ \mathcal {T}_{l}(u,X;\zeta ,\xi ) = \{ (v,Y, y,z) \in \mathfrak {R}^{m}\times \mathfrak {R}^{p\times q} \times \mathfrak {R}^{m} \times \mathfrak {R}^{s}: (v,Y, -y, -z)\\ \quad \quad \quad \in \partial l(u,X;\zeta ,\xi )\}, \end{array} \right. \end{aligned}$$

where \(u\in \mathfrak {R}^m, X\in \mathfrak {R}^{p\times q}, \zeta \in \mathfrak {R}^m,\) and \(\xi \in \mathfrak {R}^s\). Note that since \(f(u,X)\) is strongly convex in \(u\) with modulus \(1\) for all \(X\in \mathfrak {R}^{p\times q}\), \(\mathcal {T}_f\) is strongly monotone with modulus \(1\) with respect to the variable \(u\) [49, Proposition 6], i.e.,

$$\begin{aligned} \langle (u, X) - (u^{\prime }, X^{\prime }), \, (v, Y) - (v^{\prime }, Y^{\prime })\rangle \ge \Vert u-u^{\prime }\Vert ^2 \end{aligned}$$
(46)

for all \((v, Y)\in \mathcal {T}_f(u,X)\) and \((v^{\prime }, Y^{\prime }) \in \mathcal {T}_f(u^{\prime }, X^{\prime })\). From the definition of \(\mathcal {T}_f\), we know that for any \((v,Y)\in \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\),

$$\begin{aligned} \mathcal {T}_f^{-1}(v,Y) = \mathrm{arg}\min _{u\in \mathfrak {R}^m, X\in \mathfrak {R}^{p\times q}} \left\{ f(u,X) - \langle v, u \rangle -\langle Y, X \rangle \right\} . \end{aligned}$$

Similarly, for any \((v, Y, y, z) \in \mathfrak {R}^{m}\times \mathfrak {R}^{p\times q} \times \mathfrak {R}^{m} \times \mathfrak {R}^{s}\),

$$\begin{aligned} \mathcal {T}_{l}^{-1}(v, Y, y, z) = \mathrm{arg}\min _{ \begin{array}{c} u\in \mathfrak {R}^m \\ X\in \mathfrak {R}^{p\times q} \end{array}} \max _{ \begin{array}{c} \zeta \in \mathfrak {R}^m \\ \xi \in \mathfrak {R}^s \end{array}} \left\{ l(u,X;\zeta ,\xi ) \!-\! \langle v, u \rangle \!-\!\langle Y, X \rangle \!+\! \langle y, \zeta \rangle \!+\! \langle z, \xi \rangle \right\} . \end{aligned}$$

Since \(f(u,X)\) is strongly convex in \(u\) for all \(X\in \mathfrak {R}^{p\times q}\), we apply the partial PPA proposed by Ha [30] to the maximal monotone operator \(\mathcal {T}_f\), in which only the variable \(X\) appears in the quadratic proximal term. Given a starting point \((u^0, X^0) \in \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\), the inexact partial PPA generates a sequence \(\{(u^k, X^k)\}\) by approximately solving the following problem

$$\begin{aligned} \min _{u\in \mathfrak {R}^m, X\in \mathfrak {R}^{p\times q}} \left\{ f(u, X) + \frac{1}{2\sigma _k}\Vert X - X^k\Vert ^2 \right\} . \end{aligned}$$
(47)

We can easily see that any minimizer \((u,X)\) of problem (47) satisfies

$$\begin{aligned} (0, X^k) \in (0, X) + \sigma _k \mathcal {T}_f (u, X). \end{aligned}$$
(48)

Let \(\Pi : \mathfrak {R}^m \times \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\) be the orthogonal projector of \( \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\) onto \(\{0\}\times \mathfrak {R}^{p\times q}\), i.e., \(\Pi (u,X) = (0,X)\). Then (48) can be equivalently written as

$$\begin{aligned} (u,X) \in (\Pi + \sigma _k \mathcal {T}_f)^{-1} \Pi (u^k,X^k). \end{aligned}$$

It follows that the set of minimizers of (47) can be expressed as \((\Pi + \sigma _k \mathcal {T}_f)^{-1} \Pi (u^k,X^k)\).

Next, for any parameter \(\sigma >0\), we show some properties of the mappings \(Q_\sigma := (\Pi + \sigma \mathcal {T}_f)^{-1}\) and \(P_\sigma := Q_\sigma \Pi = (\Pi + \sigma \mathcal {T}_f)^{-1} \Pi \). The proofs essentially follow the ideas in [32, Propositions 2 and 3] together with [16, Theorem 2.7], so we omit them.

Proposition 3.1

For any given parameter \(\sigma >0\), let \(Q_\sigma = (\Pi + \sigma \mathcal {T}_f)^{-1}\), and \(P_\sigma = Q_\sigma \Pi = (\Pi + \sigma \mathcal {T}_f)^{-1} \Pi \). Suppose that \(\mathrm{Dom}\, (\mathcal {T}_f) \ne \emptyset \). Then we have the following properties:

  (i) The mappings \(Q_\sigma \) and \(P_\sigma \) are single-valued in \(\mathfrak {R}^m \times \mathfrak {R}^{p\times q}\).

  (ii) Let \(\beta = \min \{1, \sigma \}\). For any \((u,X), (u^{\prime }, X^{\prime }) \in \mathfrak {R}^m \times \mathfrak {R}^{p\times q}\),

    $$\begin{aligned} \Vert Q_\sigma (u,X) - Q_\sigma (u^{\prime }, X^{\prime })\Vert&\le \frac{1}{\beta } \Vert (u,X) - (u^{\prime }, X^{\prime })\Vert , \\ \Vert P_\sigma (u,X) - P_\sigma (u^{\prime }, X^{\prime })\Vert&\le \frac{1}{\beta } \Vert X - X^{\prime }\Vert . \end{aligned}$$

Since the operator \(P_{\sigma _k}\) is single-valued, the approximate rule of the partial PPA for solving problem (41) can be expressed as

$$\begin{aligned} \displaystyle (u^{k+1}, X^{k+1})\approx P_{\sigma _k}(u^k,X^k) := (\Pi + \sigma _k \mathcal {T}_f)^{-1}\Pi (u^k,X^k), \end{aligned}$$
(49)

where \(P_{\sigma _k}(u^k, X^k)\) is defined by

$$\begin{aligned} \displaystyle P_{\sigma _k}(u^k, X^k) = \text{ argmin }_{u\in \mathfrak {R}^m, X\in \mathfrak {R}^{p\times q}} \left\{ f(u,X) + \frac{1}{2\sigma _k}\Vert X-X^k\Vert ^2\right\} . \end{aligned}$$
(50)

Now we calculate the partial quadratic regularization of \(f\) in (50), which plays a key role in the study of the partial PPA for solving (41). For a given parameter \(\sigma >0\), the partial quadratic regularization of \(f\) in (44) associated with \(\sigma \) is given by

$$\begin{aligned} \displaystyle F_{\sigma }(X) = \min _{u\in \mathfrak {R}^m, Y \in \mathfrak {R}^{p\times q}} \left\{ f(u, Y)+ \frac{1}{2\sigma }\Vert Y-X\Vert ^2 \right\} . \end{aligned}$$
(51)

From (44), we have

$$\begin{aligned} \displaystyle F_{\sigma } (X)&= \min _{\begin{array}{c} u\in \mathfrak {R}^m \\ Y \in \mathfrak {R}^{p\times q} \end{array}} \sup _{\begin{array}{c} \zeta \in \mathfrak {R}^m \\ \xi \in \mathfrak {R}^s \end{array}} l(u,Y;\zeta ,\xi ) + \frac{1}{2\sigma }\Vert Y-X\Vert ^2\nonumber \\&= \sup _{\begin{array}{c} \zeta \in \mathfrak {R}^m \\ \xi \in \mathfrak {R}^s \end{array}} \min _{\begin{array}{c} u\in \mathfrak {R}^m \\ Y \in \mathfrak {R}^{p\times q} \end{array}} l(u,Y;\zeta ,\xi ) + \frac{1}{2\sigma }\Vert Y-X\Vert ^2 \nonumber \\&= \sup _{\begin{array}{c} \zeta \in \mathfrak {R}^m \\ \xi \in \mathcal {Q}^{*} \end{array}} \min _{\begin{array}{c} u\in \mathfrak {R}^m \\ Y \in \mathfrak {R}^{p\times q} \end{array}} \left\{ f_\rho (u, Y) + \langle \zeta , b-\mathcal {A}(Y)-u \rangle + \langle \xi , d-\mathcal {B}(Y)\rangle \right. \nonumber \\&\qquad \qquad \qquad \quad \left. + \frac{1}{2\sigma }\Vert Y-X\Vert ^2 \right\} , \end{aligned}$$
(52)

where the interchange of \(\min _{u, Y}\) and \(\sup _{\zeta , \xi }\) follows from the growth properties in \((u, Y)\) [48, Theorem 37.3] and the third equality follows from (43). Thus we have

$$\begin{aligned} F_{\sigma } (X) \; = \; \sup _{\zeta \in \mathfrak {R}^m, \ \xi \in \mathcal {Q}^{*}} \Theta _\sigma (\zeta ,\xi ; X), \end{aligned}$$

where

$$\begin{aligned} \Theta _\sigma (\zeta ,\xi ; X)&:= \min _{u\in \mathfrak {R}^m,\, Y \in \mathfrak {R}^{p\times q}} \left\{ f_\rho (u, Y) +\langle \zeta , b-\mathcal {A}(Y)-u \rangle + \langle \xi , d-\mathcal {B}(Y)\rangle + \frac{1}{2\sigma }\Vert Y-X\Vert ^2 \right\} \nonumber \\&\;= -\frac{1}{2}\Vert \zeta \Vert ^2 + \langle b, \zeta \rangle +\langle d, \xi \rangle +\min _{Y \in \mathfrak {R}^{p\times q}} \left\{ \rho \Vert Y\Vert _{*} + \frac{1}{2\sigma }\Vert Y- W(\zeta ,\xi ; X)\Vert ^2 \right\} \nonumber \\&\qquad + \frac{1}{2\sigma }\left( \Vert X\Vert ^2 - \Vert W(\zeta ,\xi ; X)\Vert ^2\right) \nonumber \\&\;= -\frac{1}{2}\Vert \zeta \Vert ^2 + \langle b, \zeta \rangle +\langle d, \xi \rangle + \frac{1}{2\sigma }\Vert X\Vert ^2 -\frac{1}{2\sigma }\Vert D_{\rho \sigma }(W(\zeta ,\xi ; X))\Vert ^2. \end{aligned}$$
(53)

In the above,

$$\begin{aligned} W(\zeta ,\xi ; X) = X- \sigma (C-\mathcal {A}^*\zeta - \mathcal {B}^*\xi ) \end{aligned}$$
(54)

and the last equality follows from (23). By the saddle point theorem [48, Theorem 28.3] and (52), we have that for any \((\zeta (X), \xi (X))\) such that

$$\begin{aligned} (\zeta (X), \xi (X)) \in \mathrm{argsup}_{\zeta \in \mathfrak {R}^m, \ \xi \in \mathcal {Q}^{*}} \Theta _\sigma (\zeta ,\xi ; X), \end{aligned}$$

\((\zeta (X), D_{\rho \sigma }(W(\zeta (X),\xi (X); X)))\) is the unique solution to (51).
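
To make (53) and (54) concrete, the following Python sketch evaluates \(\Theta _\sigma (\zeta ,\xi ; X)\) and recovers the corresponding primal pair; the callables `At` and `Bt`, standing for the adjoints \(\mathcal {A}^*\) and \(\mathcal {B}^*\), are assumed to be supplied by the user, and the sketch only mirrors the formulas rather than the smoothing Newton solver developed in Sect. 4.

```python
import numpy as np

def soft_threshold(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def W_map(zeta, xi, X, sigma, C, At, Bt):
    # W(zeta, xi; X) = X - sigma * (C - A^* zeta - B^* xi), see (54).
    return X - sigma * (C - At(zeta) - Bt(xi))

def Theta_sigma(zeta, xi, X, sigma, rho, b, d, C, At, Bt):
    # Theta_sigma(zeta, xi; X) as in (53).
    W = W_map(zeta, xi, X, sigma, C, At, Bt)
    DW = soft_threshold(W, rho * sigma)
    return (-0.5 * zeta @ zeta + b @ zeta + d @ xi
            + (np.sum(X * X) - np.sum(DW * DW)) / (2.0 * sigma))

def recover_primal(zeta, xi, X, sigma, rho, C, At, Bt):
    # Given a (near-)maximizer (zeta, xi), the unique solution of (51) is
    # (u, X_new) = (zeta, D_{rho*sigma}(W(zeta, xi; X))).
    W = W_map(zeta, xi, X, sigma, C, At, Bt)
    return zeta, soft_threshold(W, rho * sigma)
```

A full outer iteration of the partial PPA then maximizes `Theta_sigma` over \(\zeta \in \mathfrak {R}^m\) and \(\xi \in \mathcal {Q}^{*}\) and calls `recover_primal` to obtain \((u^{k+1}, X^{k+1})\).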

Now we formally present the partial PPA for solving (41).

The partial PPA for solving (41). Given \(X^0 \in \mathfrak {R}^{p\times q}\) and a sequence \(\{\sigma _k\}\) satisfying (8), for \(k = 0, 1, 2, \ldots \), approximately solve the inner subproblem

$$\begin{aligned} \max \left\{ \theta _{k}(\zeta ,\xi ) := \Theta _{\sigma _k}(\zeta ,\xi ; X^k) \; : \; \zeta \in \mathfrak {R}^m, \ \xi \in \mathcal {Q}^{*} \right\} \end{aligned}$$

(55)

to obtain \((\zeta ^{k+1}, \xi ^{k+1})\); then set \(u^{k+1} = \zeta ^{k+1}\) and \(X^{k+1} = D_{\rho \sigma _k}(W(\zeta ^{k+1}, \xi ^{k+1}; X^k))\).

Suppose that \((\bar{\zeta }(X^k), \bar{\xi }(X^k))\) is an optimal solution of the inner subproblem (55) for each \(X^k\). Let \(P_{\sigma _k}\) be defined as in (50). In order to terminate (55) in the above PPA, we introduce the following stopping criteria:

$$\begin{aligned} \sup \theta _{k}(\zeta ,\xi ) - \theta _{k}(\zeta ^{k+1},\xi ^{k+1})&\le \frac{\varepsilon _k^2}{4\sigma _k}, \end{aligned}$$
(56a)
$$\begin{aligned} \Vert \zeta ^{k+1} - \bar{\zeta }(X^k)\Vert ^2&\le \frac{1}{2}\varepsilon _k^2, \quad \varepsilon _k > 0, \ \sum _{k=0}^{\infty } \varepsilon _k <\infty , \end{aligned}$$
(56b)
$$\begin{aligned} \sup \theta _{k}(\zeta ,\xi ) - \theta _{k}(\zeta ^{k+1},\xi ^{k+1})&\le \frac{\delta _k^2}{2\sigma _k} \Vert X^{k+1} - X^k\Vert ^2, \end{aligned}$$
(56c)
$$\begin{aligned} \Vert \zeta ^{k+1} - \bar{\zeta }(X^k)\Vert ^2&\le \delta _k^2 \Vert \zeta ^{k+1} -\zeta ^k\Vert ^2, \quad \delta _k>0, \ \sum _{k=0}^{\infty } \delta _k <\infty , \end{aligned}$$
(56d)
$$\begin{aligned} \mathrm{dist}(0,\partial \theta _{k}(\zeta ^{k+1},\xi ^{k+1}))&\le \frac{\delta _k^{'}}{\sigma _k}\Vert X^{k+1}-X^k\Vert , \quad 0\le \delta _k^{'} \rightarrow 0. \end{aligned}$$
(56e)

Note that \(F_{\sigma _k}(X^k) = \sup \theta _{k}(\zeta ,\xi )\) and \( \theta _{k}(\zeta ^{k+1},\xi ^{k+1}) = \Theta _{\sigma _k}(\zeta ^{k+1},\xi ^{k+1}; X^k)\). The following result reveals the relation between the criteria (56) and the criteria (40), which enables us to apply the convergence results of the partial PPA in [30, Theorems 1 and 2] to our partial PPA. The proof essentially follows the idea in [50, Proposition 6].

Proposition 3.2

Suppose that \((\bar{\zeta }(X^k), \bar{\xi }(X^k))\) is an optimal solution of (55). Let \((\bar{u}^{k+1}, \overline{X}^{k+1}) = (\bar{\zeta }(X^k), D_{\rho \sigma _k}(W(\bar{\zeta }(X^k),\bar{\xi }(X^k); X^k)))\) and \(X^{k+1}= D_{\rho \sigma _k}(W(\zeta ^{k+1}, \xi ^{k+1}; X^k))\). Then one has

$$\begin{aligned} \frac{1}{2\sigma _k}\Vert X^{k+1} - \overline{X}^{k+1}\Vert ^2 \le \sup \theta _{k}(\zeta ,\xi ) - \theta _{k}(\zeta ^{k+1},\xi ^{k+1}). \end{aligned}$$
(57)

Proof

Since \(\Theta _{\sigma }(\zeta ,\xi ; X)\) is convex in \(X\) and \(\nabla _X \Theta _{\sigma _k}(\zeta ^{k+1},\xi ^{k+1}; X^k) = (X^k - X^{k+1})/{\sigma _k}, \) the following inequality holds for any \(Y \in \mathfrak {R}^{p\times q}\):

$$\begin{aligned}&\Theta _{\sigma _k}(\zeta ^{k+1},\xi ^{k+1}; X^k) + \langle \sigma _k^{-1}(X^k - X^{k+1}), Y- X^k \rangle \nonumber \\&\quad \le \Theta _{\sigma _k}(\zeta ^{k+1},\xi ^{k+1}; Y) \le \sup \left\{ \Theta _{\sigma _k}(\zeta ,\xi ; Y) \mid {\zeta \in \mathfrak {R}^m, \xi \in \mathcal {Q}^{*}} \right\} = F_{\sigma _k} (Y) \nonumber \\&\quad = \min _{\begin{array}{c} u\in \mathfrak {R}^m \\ X\in \mathfrak {R}^{p\times q} \end{array}} \left\{ f(u, X) + \frac{1}{2\sigma _k}\Vert X-Y\Vert ^2\right\} \le f\left( \bar{u}^{k+1}, \overline{X}^{k+1}\right) + \frac{1}{2\sigma _k}\Vert \overline{X}^{k+1}- Y\Vert ^2.\nonumber \\ \end{aligned}$$
(58)

We also know that

$$\begin{aligned} \sup \theta _{k}(\zeta ,\xi )&= F_{\sigma _k}(X^k) = \min _{u\in \mathfrak {R}^m, X\in \mathfrak {R}^{p\times q}} \left\{ f(u, X) + \frac{1}{2\sigma _k}\Vert X-X^k\Vert ^2\right\} \nonumber \\&= f\left( \bar{u}^{k+1}, \overline{X}^{k+1}\right) + \frac{1}{2\sigma _k}\Vert \overline{X}^{k+1}- X^k\Vert ^2, \end{aligned}$$
(59)

which together with (58) and the fact that \( \theta _{k}(\zeta ^{k+1},\xi ^{k+1}) = \Theta _{\sigma _k}(\zeta ^{k+1},\xi ^{k+1}; X^k)\), implies that

$$\begin{aligned}&\sup \theta _{k}(\zeta ,\xi ) - \theta _{k}(\zeta ^{k+1},\xi ^{k+1}) \nonumber \\&\quad \ge \frac{1}{2\sigma _k}\left[ \Vert \overline{X}^{k+1}- X^k\Vert ^2 - \Vert \overline{X}^{k+1}- Y\Vert ^2 - 2\langle X^{k+1} - X^k, Y- X^k \rangle \right] \nonumber \\&\quad = \frac{1}{2\sigma _k} \left[ -\Vert (\overline{X}^{k+1} + X^k- X^{k+1}) -Y\Vert ^2 + \Vert \overline{X}^{k+1} - X^{k+1}\Vert ^2 \right] . \end{aligned}$$
(60)

Since this inequality holds for all \(Y \in \mathfrak {R}^{p\times q}\), by taking the maximum of (60) in \(Y\), we have

$$\begin{aligned} \sup \theta _{k}(\zeta ,\xi ) - \theta _{k}(\zeta ^{k+1},\xi ^{k+1}) \;\ge \; \frac{1}{2\sigma _k} \Vert \overline{X}^{k+1} - X^{k+1}\Vert ^2, \end{aligned}$$

which proves our assertion. \(\square \)

3.1 Convergence analysis of the partial PPA

In this subsection, we establish the global and local convergence of the partial PPA for solving (41), mainly based upon the convergence results of Ha [30, Theorems 1 and 2], which require the condition (62) in the following proposition. The purpose of this proposition is to give a sufficient condition for (62) to hold.

Proposition 3.3

Consider the function \(f(u,X)\) defined in (44). Suppose that for some \(\lambda >0\), the following parameterized problem perturbed by \((v, Y)\in \mathfrak {R}^m\times \mathfrak {R}^{p\times q}\)

$$\begin{aligned} \min _{u\in \mathfrak {R}^m, X \in \mathfrak {R}^{p\times q}}\left\{ f(u, X) - \langle u, \, v\rangle - \langle X, \, Y\rangle \right\} \end{aligned}$$
(61)

has an optimal solution whenever \(\max \{\Vert v\Vert , \Vert Y\Vert \}\le \lambda \). Then we have

$$\begin{aligned} 0 \in \mathrm{int \, Im}\, (\mathcal {T}_f). \end{aligned}$$
(62)

Proof

Since for each \((v, Y)\in \mathfrak {R}^m\times \mathfrak {R}^{p\times q}\) such that \(\max \{\Vert v\Vert , \Vert Y\Vert \}\le \lambda \), the parameterized problem (61) has an optimal solution \((\bar{u}, \overline{X})\), we have that

$$\begin{aligned} 0 \in \partial f(\bar{u}, \overline{X}) - (v, Y), \end{aligned}$$

which implies that \((v, Y) \in \partial f(\bar{u}, \overline{X})\subseteq \; \mathrm{Im}\, (\mathcal {T}_f)\). Therefore, we have \(0 \in \mathrm{int \, Im}\, (\mathcal {T}_f)\).

\(\square \)

Remark 3.1

In many applications, we have \(C=0\) in the objective function \(f_{\rho }(u, X)\) (see the examples in Sect. 5). In this case, the function \(f_\rho (u, X)\) perturbed by \((v, Y)\in \mathfrak {R}^m\times \mathfrak {R}^{p\times q}\) with \(\Vert Y\Vert _2 < \rho \) is coercive. This is because we have

$$\begin{aligned} f_\rho (u, X) - \langle u, \, v\rangle - \langle X, \, Y\rangle&= \left( \frac{1}{2}\Vert u\Vert ^2 - \langle u, \, v\rangle \right) + (\rho \Vert X\Vert _{*} - \langle X, \, Y\rangle ) \\&\ge \frac{1}{2}\Vert u-v\Vert ^2-\frac{1}{2}\Vert v\Vert ^2 + (\rho -\Vert Y\Vert _2)\Vert X\Vert _*. \end{aligned}$$

Therefore, if \(C = 0\), the parameterized problem (61) has an optimal solution for any \((v, Y)\in \mathfrak {R}^m\times \mathfrak {R}^{p\times q}\) such that \(\max \{\Vert v\Vert , \Vert Y\Vert \} \le \lambda \) for some \(0 < \lambda < \rho \).

Theorem 3.1

(Global convergence) Suppose that the condition in Proposition 3.3 is satisfied. Let the partial PPA be executed with the stopping criterion (56a) and (56b). Then the generated sequence \(\{(u^k, X^k)\}\) is bounded and converges to an optimal solution \((\bar{u}, \overline{X})\) of (41), and \(\{(\zeta ^k, \xi ^k)\}\) is asymptotically minimizing for the dual problem (45) with

$$\begin{aligned} \Vert C - \mathcal {A}^{*}(\zeta ^{k+1}) - \mathcal {B}^{*}(\xi ^{k+1}) - Z^{k+1}\Vert&= \frac{1}{\sigma _k}\Vert X^{k+1} - X^k\Vert \rightarrow 0, \end{aligned}$$
(63)
$$\begin{aligned} \mathrm{asym}\; \sup (D) - g_\rho (\zeta ^k, \xi ^k)&\le \frac{1}{2\sigma _k}\left[ \, \frac{1}{2}\varepsilon _k^2 + \Vert X^k\Vert ^2 - \Vert X^{k+1}\Vert ^2 \, \right] , \qquad \end{aligned}$$
(64)

where \(\mathrm{asym}\, \sup (D)\) is the asymptotic supremum of (45). If the problem (41) satisfies the Slater condition (42), then the sequence \(\{(\zeta ^k, \xi ^k)\}\) is also bounded, and all of its accumulation points are optimal solutions for the problem (45).

Proof

Under the given assumption, we have from Proposition 3.3 that \(0 \in \mathrm{int \, Im}\, (\mathcal {T}_f)\). Moreover, we know from Proposition 3.2 that (56a) and (56b) imply the general stopping criterion (40a) for \(\mathcal {T}_f\). It follows from [30, Theorem 1] that the sequence \(\{(u^k, X^k)\}\) is bounded, any of its cluster points is an optimal solution to (41), and \(X^k \rightarrow \overline{X}\). Since \(f_{\rho }(u, X)\) is strongly convex with respect to \(u\), the \(u\)-component of the optimal solution is uniquely determined, which implies that \(u^k \rightarrow \bar{u}\). Thus the sequence \(\{(u^k, X^k)\}\) converges to an optimal solution \((\bar{u}, \overline{X})\) of (41). The rest of the proof follows the arguments in [49, Theorem 4], and we omit it here. \(\square \)

Theorem 3.2

(Local convergence) Suppose that the hypotheses in Proposition 3.3 are satisfied. Let the partial PPA be executed with the stopping criterion (56a), (56b), (56c) and (56d). If \(\mathcal {T}_f^{-1}\) is Lipschitz continuous at the origin with modulus \(a_f\), then \(\{(u^k, X^k)\}\) converges to the unique optimal solution \((\bar{u}, \overline{X})\) of (41), and

$$\begin{aligned} \Vert X^{k+1} - \overline{X}\Vert \le \eta _k \Vert X^k - \overline{X}\Vert \quad \text {for all}\; k \;\text {sufficiently large}, \end{aligned}$$
(65)

where

$$\begin{aligned} \eta _k = \left[ a_f(a_f^2 + \sigma _k^2)^{-1/2} + \delta _k\right] (1-\delta _k)^{-1} \rightarrow \eta _{\infty } = a_f(a_f^2 + \sigma _{\infty }^2)^{-1/2} < 1. \end{aligned}$$

Moreover, the conclusions of Theorem 3.1 about \(\{(\zeta ^k, \xi ^k)\}\) are valid.

If in addition to (56c), (56d) and the condition on \(\mathcal {T}_{f}^{-1}\), one has (56e) and \(\mathcal {T}_{l}^{-1}\) is Lipschitz continuous at the origin with modulus \(a_{l}\, (\ge a_f)\), then \((\zeta ^k, \xi ^k) \rightarrow (\bar{\zeta }, \bar{\xi })\), where \((\bar{\zeta }, \bar{\xi })\) is the unique optimal solution for (45), and one has

$$\begin{aligned} \Vert (\zeta ^{k+1}, \xi ^{k+1}) - (\bar{\zeta }, \bar{\xi })\Vert \le \eta ^{\prime }_k \Vert X^{k+1} - X^k\Vert , \quad \text {for all} \ k \ \text {sufficiently large}, \end{aligned}$$
(66)

where \(\eta ^{\prime }_k = a_l (1+ \delta _k^{\prime })/\sigma _k \rightarrow \eta ^{\prime }_{\infty } = a_l/\sigma _\infty \).

Proof

Since it follows from Proposition 3.2 that (56c) and (56d) imply the general stopping criteria (40b) and (40c), the first part of the theorem follows readily from Theorem 3.1 and the general results in [30, Theorem 2]. The second part of the theorem can be obtained similarly by following the arguments in [50, Theorem 5], and we omit the details here. \(\square \)

4 An inexact smoothing Newton method for solving inner subproblems

In this section, we design an inexact smoothing Newton method for solving the inner subproblem (55), which is the most expensive step in each iteration of the partial PPA. Here we assume that there are inequality constraints in (41). If there are only equality constraints, then (55) is an unconstrained problem, and a semismooth Newton-CG method similar to the one proposed in [61] can be designed to solve (55); we refer the reader to [33] for the details.

For later convenience, we let

$$\begin{aligned} \widehat{\mathcal {A}} = \left( {\begin{array}{c}\mathcal {A}\\ \mathcal {B}\end{array}}\right) ,\, \hat{b}=(b;d) \in \mathfrak {R}^{m+s}, \, \mathcal {K} = \mathfrak {R}^m \times \mathcal {Q}^{*} \subseteq \mathfrak {R}^m \times \mathfrak {R}^s, \quad \mathrm{and} \quad y=(\zeta ;\xi ) \in \mathcal {K}. \end{aligned}$$
(67)

In our partial PPA, for some fixed \(X\in \mathfrak {R}^{p\times q}\) and \(\sigma >0\), we need to solve the following form of the inner subproblem:

$$\begin{aligned} \min _{y\in \mathcal {K}} \left\{ \varphi (y) :=\frac{1}{2}\langle y, Ty \rangle + \frac{1}{2\sigma }\Vert D_{\rho \sigma }(W(y;X))\Vert ^2 - \langle \hat{b}, y \rangle -\frac{1}{2\sigma }\Vert X\Vert ^2\right\} , \end{aligned}$$
(68)

where \(T = [I_m, 0; 0, 0] \in \mathfrak {R}^{(m+s)\times (m+s)}, W(y;X) = X-\sigma (C- \widehat{\mathcal {A}}^{*}y)\) and \(\widehat{\mathcal {A}}^{*} = (\mathcal {A}^{*},\, \mathcal {B}^{*})\) is the adjoint of \(\widehat{\mathcal {A}}\). Note that \(-\varphi (\cdot )\) is the objective function of the inner subproblem (55). The function \(\varphi (\cdot )\) in (68) is continuously differentiable with

$$\begin{aligned} \nabla \varphi (y) = Ty + \widehat{\mathcal {A}}D_{\rho \sigma }(W(y;X)) - \widehat{b}, \quad y \in \mathfrak {R}^{m+s}. \end{aligned}$$

Since \(\varphi (\cdot )\) is a convex function, \(\bar{y}= (\bar{\zeta }; \bar{\xi })\in \mathcal {K}\) solves problem (68) if and only if it satisfies the following variational inequality

$$\begin{aligned} \langle y-\bar{y}, \nabla \varphi (\bar{y}) \rangle \ge 0 \quad \forall y \in \mathcal {K}. \end{aligned}$$
(69)

Define \(F: \mathfrak {R}^{m+s} \rightarrow \mathfrak {R}^{m+s}\) by

$$\begin{aligned} F(y) \;:=\; y - \Pi _{\mathcal {K}}(y - \nabla \varphi (y)), \quad y \in \mathfrak {R}^{m+s}. \end{aligned}$$
(70)

Then one can easily prove that \(\bar{y}\in \mathcal {K}\) solves (69) if and only if \(F(\bar{y}) = 0\) [17]. Thus, solving the inner problem (68) is equivalent to solving the following equation

$$\begin{aligned} F(y) = 0, \quad y \in \mathfrak {R}^{m+s}. \end{aligned}$$
(71)

Since both \(\Pi _{\mathcal {K}}(\cdot )\) and \(D_{\rho \sigma }(\cdot )\) are globally Lipschitz continuous, \(F\) is globally Lipschitz continuous. For the purpose of introducing an inexact smoothing Newton method, we need to define a smoothing function for \(F(\cdot )\).
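To make the roles of \(\Pi _{\mathcal {K}}\), \(D_{\rho \sigma }\) and \(F\) concrete, the following Python/NumPy sketch evaluates \(\nabla \varphi (y)\) and the natural residual \(F(y)\) of (70)–(71). It is only an illustration of the formulas above, not the authors' MATLAB implementation: the stacked map \(\widehat{\mathcal {A}}\) and its adjoint are passed in as callables `Aop` and `Aadj`, and `svt` is assumed to be the singular value soft-thresholding operator (the proximal mapping of \(\tau \Vert \cdot \Vert _*\)) playing the role of \(D_{\tau }\).

```python
import numpy as np

def svt(W, tau):
    # Singular value soft-thresholding (assumed form of D_tau): shrink each
    # singular value of W by tau and truncate at zero.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def proj_K(y, m, s1):
    # Projection onto K = R^m x Q^*, with Q^* = R^{s1} x R_+^{s2}: identity on
    # the first m + s1 components, componentwise max(0, .) on the remaining s2.
    z = y.copy()
    z[m + s1:] = np.maximum(z[m + s1:], 0.0)
    return z

def grad_phi(y, X, C, Aop, Aadj, bhat, m, rho, sigma):
    # grad phi(y) = T y + Ahat(D_{rho*sigma}(W(y;X))) - bhat, with T = [I_m 0; 0 0]
    # and W(y;X) = X - sigma*(C - Ahat^* y).
    W = X - sigma * (C - Aadj(y))
    Ty = np.concatenate([y[:m], np.zeros(y.size - m)])
    return Ty + Aop(svt(W, rho * sigma)) - bhat

def natural_residual(y, X, C, Aop, Aadj, bhat, m, s1, rho, sigma):
    # F(y) = y - Pi_K(y - grad phi(y)); y solves (68) iff F(y) = 0.
    g = grad_phi(y, X, C, Aop, Aadj, bhat, m, rho, sigma)
    return y - proj_K(y - g, m, s1)
```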

The smoothing function for the soft thresholding operator \(D_{\rho \sigma }(\cdot )\) has been defined by (27) where the threshold value is equal to \(\rho \sigma \). Next, we need to define the smoothing function for \(\Pi _\mathcal {K}(\cdot )\). For simplicity, we shall use the Huber smoothing function \(\chi \) defined in (24). Let \(\pi :\mathfrak {R}\times \mathfrak {R}^{m+s} \rightarrow \mathfrak {R}^{m+s}\) be defined by

$$\begin{aligned} \pi _i(\varepsilon , z)= \left\{ \begin{array}{l@{\quad }l} \displaystyle z_i &{} \mathrm{if}\ 1\le i\le m+s_1 \\ \chi (\varepsilon , z_i) &{} \mathrm{if} \ m+s_1+1\le i\le m+s \end{array} \right. , \quad (\varepsilon , z)\in \mathfrak {R}\times \mathfrak {R}^{m+s}. \end{aligned}$$
(72)

The function \(\pi \) is obviously continuously differentiable around any \((\varepsilon , z)\in \mathfrak {R}\times \mathfrak {R}^{m+s}\) as long as \(\varepsilon \ne 0\) and is strongly semismooth everywhere.
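As a small illustration of (72), the componentwise structure of \(\pi \) might be coded as below. The Huber function \(\chi \) of (24) is not reproduced here and is passed in as a callable, so the sketch only reflects which components are smoothed; all names are ours.

```python
import numpy as np

def smoothed_pi(eps, z, m, s1, chi):
    # pi_i(eps, z) = z_i for the first m + s1 components (no projection needed there),
    # and chi(eps, z_i) for the last s2 components, smoothing the projection onto R_+.
    # chi is assumed to act elementwise on arrays.
    out = np.asarray(z, dtype=float).copy()
    out[m + s1:] = chi(eps, out[m + s1:])
    return out
```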

Now, we are ready to define a smoothing function for \(F(\cdot )\). Let

$$\begin{aligned} \Upsilon (\varepsilon , y):= y - \pi (\varepsilon , y - (Ty + \widehat{\mathcal {A}}\overline{D}_{\rho \sigma }(\varepsilon , W(y;X))-\hat{b})), \quad (\varepsilon , y)\in \mathfrak {R}\times \mathfrak {R}^{m+s}.\qquad \end{aligned}$$
(73)

From the definitions of \(\Upsilon , \pi ,\) and \(\overline{D}_{\rho \sigma }\), we have that \(F(y)= \Upsilon (0,y)\) for any \(y\in \mathfrak {R}^{m+s}\).

Proposition 4.1

The mapping \(\Upsilon \) defined by (73) has the following properties:

  1. (i)

    \(\Upsilon \) is globally Lipschitz continuous on \(\mathfrak {R}\times \mathfrak {R}^{m+s}\).

  2. (ii)

    \(\Upsilon \) is continuously differentiable around \((\varepsilon , y)\) when \(\varepsilon \ne 0\). For any fixed \(\varepsilon \in \mathfrak {R}\), \(\Upsilon (\varepsilon , \cdot )\) is a \(P_0\)-function, i.e., for any \((y,z)\in \mathfrak {R}^{m+s}\times \mathfrak {R}^{m+s}\) with \(y\ne z\),

    $$\begin{aligned} \max _{y_i\ne z_i} (y_i - z_i)(\Upsilon _i(\varepsilon , y) - \Upsilon _i(\varepsilon , z))\ge 0, \end{aligned}$$
    (74)

    and thus for any fixed \(\varepsilon \ne 0\), \(\Upsilon _y^{\prime }(\varepsilon , y)\) is a \(P_0\)-matrix (i.e., all its principal minors are nonnegative).

  3. (iii)

    \(\Upsilon \) is strongly semismooth at \((0,y)\). In particular, for any \(\varepsilon \downarrow 0\) and \(\mathfrak {R}^{m+s} \ni h \rightarrow 0\) we have

    $$\begin{aligned} \Upsilon (\varepsilon , y+h) - \Upsilon (0, y) - \Upsilon ^{\prime }(\varepsilon , y+h)(\varepsilon , h) = O(\Vert (\varepsilon , h)\Vert ^2). \end{aligned}$$
  4. (iv)

    For any \(h\in \mathfrak {R}^{m+s}\),

    $$\begin{aligned} \partial \Upsilon (0, y)(0,h)&\subseteq h - \partial \pi (0, y- \nabla \varphi (y)) (0, h \\&\quad -(Th+ \sigma \widehat{\mathcal {A}} \partial \overline{D}_{\rho \sigma }(0,W(y;X))(0, \widehat{\mathcal {A}}^{*}h))). \end{aligned}$$

Proof

Please refer to [33, Proposition 3.9]. \(\square \)

Now we are ready to introduce the inexact smoothing Newton method, developed by Gao and Sun in [24], for solving the nonsmooth equation (71). Let \(\kappa \in (0,\infty )\) be a constant. Define \(G:\mathfrak {R}\times \mathfrak {R}^{m+s}\rightarrow \mathfrak {R}^{m+s}\) by

$$\begin{aligned} G(\varepsilon , y) : = \Upsilon (\varepsilon , y) + \kappa |\varepsilon | y, \quad (\varepsilon , y) \in \mathfrak {R}\times \mathfrak {R}^{m+s}, \end{aligned}$$
(75)

where \(\Upsilon : \mathfrak {R}\times \mathfrak {R}^{m+s}\rightarrow \mathfrak {R}^{m+s}\) is defined by (73). Note that \(G_y^{\prime }(\varepsilon , y)\) is a \(P\)-matrix (i.e., all its principal minors are positive) for any \((\varepsilon \ne 0, y) \in \mathfrak {R}\times \mathfrak {R}^{m+s}\), and hence it is nonsingular, while by part \((ii)\) of Proposition 4.1, \(\Upsilon _y^{\prime }(\varepsilon , y)\) is only a \(P_0\)-matrix which may be singular. Define \(E: \mathfrak {R}\times \mathfrak {R}^{m+s} \rightarrow \mathfrak {R}\times \mathfrak {R}^{m+s}\) by

$$\begin{aligned} E(\varepsilon , y) : = \left[ \begin{array}{c} \varepsilon \\ G(\varepsilon , y) \end{array} \right] = \left[ \begin{array}{c} \varepsilon \\ \Upsilon (\varepsilon , y) + \kappa |\varepsilon | y \end{array} \right] , \quad (\varepsilon , y) \in \mathfrak {R}\times \mathfrak {R}^{m+s}. \end{aligned}$$

For any \((\varepsilon \ne 0, y) \in \mathfrak {R}\times \mathfrak {R}^{m+s}\), \(E^{\prime }(\varepsilon , y)\) is a \(P\)-matrix, and hence it is nonsingular. Then solving the nonsmooth equation \(F(y) = 0\) is equivalent to solving the following smoothed equation

$$\begin{aligned} E(\varepsilon , y) = 0. \end{aligned}$$
(76)

The inexact smoothing Newton method that is used to solve the above equation is described in Fig. 1.

Fig. 1 Algorithm 2 (An inexact smoothing Newton method)

Let

$$\begin{aligned} \mathcal {N} : = \left\{ (\varepsilon , y) \in \mathfrak {R}\times \mathfrak {R}^{m+s}\ | \ \varepsilon \ge {\varsigma (\varepsilon , y)} \hat{\varepsilon }\right\} , \end{aligned}$$

where \({\varsigma (\varepsilon , y)} = r\min \{1,\Vert E(\varepsilon ,y)\Vert ^2\}\) and \(\hat{\varepsilon } > 0\) is given. From [24, Theorem 4.1 and Theorem 3.6], we have the following convergence results for the inexact smoothing Newton method. For more details on the inexact smoothing Newton method, see [24].

Theorem 4.1

Algorithm \(2\) is well defined and generates an infinite sequence \(\{(\varepsilon ^k ,y^k)\}\subseteq \mathcal {N}\) with the property that any accumulation point \((\bar{\varepsilon }, \bar{y})\) of \(\{(\varepsilon ^k ,y^k)\}\) is a solution of \(E(\varepsilon , y) = 0\) and \(\lim _{k \rightarrow \infty } \Vert E(\varepsilon ^k, y^k)\Vert ^2= 0\). Additionally, if the Slater condition (42) holds, then \(\{(\varepsilon ^k ,y^k)\}\) is bounded.

Theorem 4.2

Let \((\bar{\varepsilon }, \bar{y})\) be an accumulation point of the infinite sequence \(\{(\varepsilon ^k ,y^k)\}\) generated by Algorithm \(2\). Suppose that \(E\) is strongly semismooth at \((\bar{\varepsilon }, \bar{y})\) and that all \(\mathcal {V} \in \partial E(\bar{\varepsilon }, \bar{y})\) are nonsingular. Then the sequence \(\{(\varepsilon ^k ,y^k)\}\) converges to \((\bar{\varepsilon }, \bar{y})\) quadratically, i.e., for \(k\) sufficiently large

$$\begin{aligned} \Vert (\varepsilon ^{k+1} - \bar{\varepsilon }, y^{k+1} - \bar{y})\Vert = O \left( \Vert (\varepsilon ^k - \bar{\varepsilon }, y^k - \bar{y})\Vert ^2\right) . \end{aligned}$$

Suppose that the Slater condition (42) holds. Let \((\bar{\varepsilon }, \bar{y})\) be an accumulation point of the sequence \(\{(\varepsilon ^k, y^k)\}\) generated by Algorithm \(2\). Then, we know that \(\bar{\varepsilon } = 0\) and \(F(\bar{y}) = 0,\) which means that \(\bar{y} = (\bar{\zeta } ; \bar{\xi }) \in \mathcal {K}\) is an optimal solution to the inner subproblem (68). Let \(\overline{X} := D_{\rho \sigma }(W(\bar{y}; X))\). Then \((\bar{\zeta }, \overline{X})\) is the unique optimal solution to the problem (51).

For the quadratic convergence of Algorithm \(2\), we need the concept of constraint nondegeneracy. For a given closed set \(K \subseteq \mathcal {X}\), we let \(T_K(x)\) be the tangent cone of \(K\) at \(x\in K\) as in convex analysis [48]. The largest linear space contained in \(T_K(x)\) is denoted by \(\mathrm{lin}(T_K(x))\), which is equal to \((-T_K(x))\cap T_K(x)\). Define \(g: \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}\) by \(g(X) = \Vert X\Vert _{*}\). Let \(K_{p,q}\) be the epigraph of \(g\), i.e.,

$$\begin{aligned} K_{p,q}\; :=\; \mathrm{epi}(g) = \{(X, t) \in \mathfrak {R}^{p\times q}\times \mathfrak {R}\; |\; g(X)\le t \}, \end{aligned}$$

which is a closed convex cone. Let \(\widehat{\mathcal {B}} := (\mathcal {B}, \, 0)\). Then (1) can be rewritten in the following form:

$$\begin{aligned} \displaystyle \min \left\{ \frac{1}{2}\Vert \mathcal {A} (X) -b\Vert ^2 + \rho \ t + \langle C, X \rangle \;:\; \widehat{\mathcal {B}}(X, t) \in d + \mathcal {Q}, \; (X, t) \in K_{p,q} \right\} . \end{aligned}$$
(79)

It is easy to see that \(\overline{X}\) is an optimal solution for (1) if and only if \((\overline{X}, \bar{t})\) is an optimal solution to (79) with \(\bar{t} = \Vert \overline{X}\Vert _{*}\). Let \(\mathcal {I}\) be the identity map from \(\mathfrak {R}^{p\times q} \times \mathfrak {R}\) to \(\mathfrak {R}^{p\times q} \times \mathfrak {R}\). Then the constraint nondegeneracy condition is said to hold at \((\overline{X}, \bar{t})\) if

$$\begin{aligned} \left( \begin{array}{c} \widehat{\mathcal {B}}\\ \mathcal {I} \end{array}\right) (\mathfrak {R}^{p\times q} \times \mathfrak {R}) + \left( \begin{array}{c} \mathrm{lin}\left( T_{\mathcal {Q}}(\widehat{\mathcal {B}}(\overline{X}, \bar{t}) - d)\right) \\ \mathrm{lin}( T_{K_{p,q}}(\overline{X}, \bar{t})) \end{array} \right) = \left( \begin{array}{c} \mathfrak {R}^{s}\\ \mathfrak {R}^{p\times q} \times \mathfrak {R}\end{array} \right) . \end{aligned}$$
(80)

Note that \(\mathrm{lin}(T_{\mathcal {Q}}(\widehat{\mathcal {B}}(\overline{X}, \bar{t}) - d)) = \mathrm{lin}(T_{\mathcal {Q}}(\mathcal {B}(\overline{X}) - d))\). Let \(\mathcal {E}(\overline{X})\) be the index set of active constraints at \(\overline{X}\):

$$\begin{aligned} \mathcal {E}(\overline{X}) :=\{i \ | \ \langle \mathcal {B}_i, \overline{X}\rangle = d_i, \ i = s_1 +1, \ldots , s\}, \end{aligned}$$

and \(l : =|\mathcal {E}(\overline{X})|\). Without loss of generality, we assume that

$$\begin{aligned} \mathcal {E}(\overline{X}) =\{s_1+1, \ldots , s_1+l\}. \end{aligned}$$

Define \(\widetilde{\mathcal {B}}: \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^{s_1+l}\) by

$$\begin{aligned} \widetilde{\mathcal {B}}(X) := \left[ \langle \mathcal {B}_1, X \rangle , \ldots , \langle \mathcal {B}_{s_1+l}, X \rangle \right] ^T, \quad X \in \mathfrak {R}^{p\times q}. \end{aligned}$$

Let \(\overline{\mathcal {B}} = (\widetilde{\mathcal {B}}, \, 0)\). Since \(\mathrm{lin}(T_{\mathcal {Q}}(\mathcal {B}(\overline{X}) - d))\) can be computed directly as follows

$$\begin{aligned} \mathrm{lin}(T_{\mathcal {Q}}(\mathcal {B}(\overline{X}) - d)) = \{ h\in \mathfrak {R}^s \ | \ h_i = 0, i\in \{1,\ldots , s_1\}\cup \mathcal {E}(\overline{X})\}, \end{aligned}$$

(80) can be reduced to

$$\begin{aligned} \left( \begin{array}{c} \overline{\mathcal {B}} \\ \mathcal {I} \end{array}\right) (\mathfrak {R}^{p\times q} \times \mathfrak {R}) + \left( \begin{array}{c} \{0\}^{s_1 + l} \\ \mathrm{lin}( T_{K_{p,q}}(\overline{X}, \bar{t})) \end{array} \right) = \left( \begin{array}{c} \mathfrak {R}^{s_1 + l}\\ \mathfrak {R}^{p\times q} \times \mathfrak {R}\end{array} \right) , \end{aligned}$$

which is equivalent to

$$\begin{aligned} \overline{\mathcal {B}} \left( \mathrm{lin}( T_{K_{p,q}}(\overline{X}, \bar{t}))\right) = \mathfrak {R}^{s_1 + l}. \end{aligned}$$
(81)

Next, we shall characterize the linear space \(\mathrm{lin}(T_{K_{p,q}}(\overline{X}, g(\overline{X})))\). Let \(W(\bar{y};X)\) admit the SVD as in (10). Decompose the index set \(\alpha =\{1,\ldots , p\}\) into the following three subsets:

$$\begin{aligned} \alpha _1&:= \{i \,|\,\sigma _i(W) > \rho \sigma , \ i\in \alpha \},\\ \alpha _2&:= \{i \, |\, \sigma _i(W) = \rho \sigma , \ i\in \alpha \},\\ \alpha _3&:= \{i \, |\, \sigma _i(W) < \rho \sigma , \ i \in \alpha \}. \end{aligned}$$

Then \(U = [U_{\alpha _1}\ U_{\alpha _2} \ U_{\alpha _3}], V=[V_{\alpha _{1}} \ V_{\alpha _2}\ V_{\alpha _3} \; V_2]\), and \(\overline{X} = D_{\rho \sigma }(W(\bar{y};X))\) is of rank \(|\alpha _1|\). For any \(H \in \mathfrak {R}^{p\times q}\), by the results of Watson [58, Theorem 1], we have

$$\begin{aligned} g^{\prime }(\overline{X}; H) = \left\{ \begin{array}{l@{\quad }l} \displaystyle \Vert H\Vert _{*} &{} \mathrm{if}\ |\alpha _1| = 0, \\ \langle UV_1^T, H \rangle &{} \mathrm{if} \ |\alpha _1| = p, \\ \langle U_{\alpha _1} V_{\alpha _1}^T, H \rangle + \Vert [U_{\alpha _2} \; U_{\alpha _3} ]^T H [V_{\alpha _2}\; V_{\alpha _3} \; V_2] \Vert _{*} &{} \mathrm{if } \ 0 < |\alpha _1| < p. \end{array} \right. \end{aligned}$$

From [14, Proposition 2.3.6 and Theorem 2.4.9], we have

$$\begin{aligned} T_{K_{p,q}}(\overline{X}, g(\overline{X})) = \mathrm{epi} (g^{\prime }(\overline{X}; \cdot )), \end{aligned}$$

from which we can readily get

$$\begin{aligned} T_{K_{p,q}}(\overline{X}, g(\overline{X}))&= \{(H, t) \in \mathfrak {R}^{p\times q} \times \mathfrak {R}\ | \ \langle U_{\alpha _1} V_{\alpha _1}^T, H \rangle \\&\quad + \Vert [U_{\alpha _2} \; U_{\alpha _3} ]^T H [V_{\alpha _2}\; V_{\alpha _3} \; V_2] \Vert _{*} \le t \}. \end{aligned}$$

Thus its linearity space is as follows:

$$\begin{aligned} \mathrm{lin}(T_{K_{p,q}}(\overline{X}, g(\overline{X})))&= \{ (H, t) \in \mathfrak {R}^{p\times q} \times \mathfrak {R}\, | \, [U_{\alpha _2} \; U_{\alpha _3} ]^T H [V_{\alpha _2}\; V_{\alpha _3} \; V_2]\\ =&0,\, t = \langle U_{\alpha _1} V_{\alpha _1}^T, H \rangle \}. \end{aligned}$$

Let

$$\begin{aligned} \mathcal {T}(\overline{X}) : = \{H \in \mathfrak {R}^{p\times q} \ | \ [U_{\alpha _2} \; U_{\alpha _3} ]^T H [V_{\alpha _2}\; V_{\alpha _3} \; V_2] = 0\}, \end{aligned}$$

which is a subspace of \(\mathfrak {R}^{p\times q}\). The orthogonal complement of \(\mathcal {T}(\overline{X})\) is given by

$$\begin{aligned} \mathcal {T}(\overline{X})^{\bot } = \{H \in \mathfrak {R}^{p\times q} \ | \ U_{\alpha _1}^T H =0,\; H V_{\alpha _1} = 0\}. \end{aligned}$$

Since \(\overline{\mathcal {B}} = (\widetilde{\mathcal {B}}, \, 0)\), the constraint nondegeneracy condition (81) can be further reduced to

$$\begin{aligned} \widetilde{\mathcal {B}} (\mathcal {T}({\overline{X}}) ) = \mathfrak {R}^{s_1 + l}. \end{aligned}$$
(82)

Lemma 4.1

Let \(W(\bar{y}; X)= X-\sigma (C-\widehat{\mathcal {A}}^{*}\bar{y})\) admit the SVD as in (10). Then the constraint nondegeneracy condition (82) holds at \(\overline{X}=D_{\rho \sigma }(W(\bar{y}; X))\) if and only if for any \(h\in \mathfrak {R}^{s_1+l}\),

$$\begin{aligned} U_{\alpha _1}^T (\widetilde{\mathcal {B}}^{*} h) =0 \quad \mathrm{and} \; (\widetilde{\mathcal {B}}^{*} h)V_{\alpha _1} =0 \; \Longrightarrow \ h=0. \end{aligned}$$
(83)

Proof

Please refer to [33, Lemma 3.12]. \(\square \)

Lemma 4.2

Let \(\widetilde{\mathcal {A}} = (\mathcal {A}; \widetilde{\mathcal {B}})\) and \(\widetilde{\mathcal {A}}^{*} = (\mathcal {A}^{*}, \, \widetilde{\mathcal {B}}^{*})\) be the adjoint of \(\widetilde{\mathcal {A}}\). Let \(\overline{D}_{\rho \sigma }:\mathfrak {R}\times \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^{p\times q}\) be defined by (27). Assume that the constraint nondegeneracy condition (82) holds at \(\overline{X}\). Then for any \(\mathcal {V} \in \partial \overline{D}_{\rho \sigma }(0, W(\bar{y}; X))\), we have

$$\begin{aligned} \langle h, \widetilde{T}h+ \sigma \widetilde{\mathcal {A}} \mathcal {V} (0, \widetilde{\mathcal {A}}^{*}h) \rangle >0 \quad \forall \ 0\ne h \in \mathfrak {R}^{m+s_1+l}, \end{aligned}$$
(84)

where \(\widetilde{T} = [I_m, 0; 0, 0]\) is a matrix of size \(m+s_1+l\).

Proof

Please refer to [33, Lemma 3.13]. \(\square \)

Proposition 4.2

Let \(\Upsilon : \mathfrak {R}\times \mathfrak {R}^{m+s} \rightarrow \mathfrak {R}^{m+s}\) be defined by (73). Assume that the constraint nondegeneracy condition (82) holds at \(\overline{X}\). Then for any \(\mathcal {W}\in \partial \Upsilon (0, \bar{y})\), \(\mathcal {W}\) is a \(P\)-matrix, i.e.,

$$\begin{aligned} \max _i h_i(\mathcal {W}(0,h))_i >0 \quad \forall \ 0\ne h \in \mathfrak {R}^{m+s}. \end{aligned}$$
(85)

Proof

Let \(\mathcal {W}\in \partial \Upsilon (0, \bar{y})\). Suppose that there exists \(0\ne h \in \mathfrak {R}^{m+s}\) such that (85) does not hold, i.e.,

$$\begin{aligned} \max _i h_i(\mathcal {W}(0,h))_i \le 0. \end{aligned}$$
(86)

Then from part \((iv)\) of Proposition 4.1, we know that there exist \(\mathcal {D}\in \partial \pi (0, \bar{z})\) and \(\mathcal {V}\in \partial \overline{D}_{\rho \sigma }(0,W(\bar{y};X))\) such that

$$\begin{aligned} \mathcal {W}(0,h)&= h - \mathcal {D}\left( 0, h- (Th + \sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h))\right) \\&= h - \mathcal {D}(0,h) + \mathcal {D}(0, Th+\sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h)), \end{aligned}$$

where \(\bar{z} = \bar{y}- (T\bar{y} + \widehat{\mathcal {A}}\overline{D}_{\rho \sigma }(0, W(\bar{y};X))-\hat{b})\). By simple calculations, we can find a nonnegative vector \(d\in \mathfrak {R}^{m+s}\) satisfying

$$\begin{aligned} d_i= \left\{ \begin{array}{l@{\quad }l} \displaystyle 1 &{} \mathrm{if}\ 1\le i\le m+s_1, \\ \in [0,1] &{} \mathrm{if} \ m+s_1+1\le i\le m+s_1+l, \\ 0 &{} \mathrm{if}\ m+s_1+l+1 \le i \le m+s, \end{array} \right. \end{aligned}$$

such that for any \(y\in \mathfrak {R}^{m+s}\),

$$\begin{aligned} (\mathcal {D}(0,y))_i = d_i y_i, \; i=1,\ldots , m+s. \end{aligned}$$

Thus we have \( h_i(\mathcal {W}(0,h))_i = h_i\left[ h_i -d_ih_i + d_i\left( Th + \sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h)\right) _i\right] , \; i=1,\ldots , m+s. \) This, together with (86), implies that

$$\begin{aligned} \left\{ \begin{array}{l@{\quad }l} \displaystyle h_i(Th + \sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h))_i \le 0 &{} \mathrm{if}\ 1\le i\le m+s_1, \\ h_i(Th + \sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h))_i \le 0 \; \mathrm{or} \; h_i = 0 &{} \mathrm{if} \ m+s_1+1\le i\le m+s_1+l, \\ h_i = 0 &{} \mathrm{if}\ m+s_1+l+1 \le i \le m+s. \end{array} \right. \end{aligned}$$

Hence, \( \langle h, Th + \sigma \widehat{\mathcal {A}} \mathcal {V}(0, \widehat{\mathcal {A}}^{*}h) \rangle = \langle \tilde{h}, \widetilde{T}\tilde{h} + \sigma \widetilde{\mathcal {A}} \mathcal {V}(0, \widetilde{\mathcal {A}}^{*}\tilde{h})\rangle \le 0, \) where \(0\ne \tilde{h}\in \mathfrak {R}^{m+s_1+l}\) is defined by \(\tilde{h}_i = h_i, i= 1,\ldots , m+s_1+l\). However, the above inequality contradicts (84) in Lemma 4.2. Hence, (85) holds. \(\square \)

Theorem 4.3

Let \((\bar{\varepsilon }, \bar{y})\) be an accumulation point of the infinite sequence \(\{(\varepsilon ^k ,y^k)\}\) generated by Algorithm \(2\). Assume that the constraint nondegeneracy condition (82) holds at \(\overline{X}\). Then the sequence \(\{(\varepsilon ^k ,y^k)\}\) converges to \((\bar{\varepsilon }, \bar{y})\) quadratically, i.e., for \(k\) sufficiently large

$$\begin{aligned} \Vert (\varepsilon ^{k+1} - \bar{\varepsilon }, y^{k+1} - \bar{y})\Vert = O \left( \Vert (\varepsilon ^k - \bar{\varepsilon }, y^k - \bar{y})\Vert ^2\right) . \end{aligned}$$

Proof

To prove the quadratic convergence of \(\{(\varepsilon ^k ,y^k)\}\), by Theorem 4.2, it is enough to show that \(E\) is strongly semismooth at \((\bar{\varepsilon }, \bar{y})\) and all \(\mathcal {V}\in \partial E(\bar{\varepsilon }, \bar{y})\) are nonsingular. The strong semismoothness of \(E\) at \((\bar{\varepsilon }, \bar{y})\) follows from part \((iii)\) of Proposition 4.1 and the fact that the modulus function \(|\cdot |\) is strongly semismooth everywhere on \(\mathfrak {R}\).

Next, we show the nonsingularity of all elements in \(\partial E(\bar{\varepsilon }, \bar{y})\). For any \(\mathcal {V}\in \partial E(\bar{\varepsilon }, \bar{y})\), from Proposition 4.2 and the definition of \(E\), we have that for any \(0\ne h \in \mathfrak {R}^{m+s+1}\), \(\max _i h_i (\mathcal {V}h)_i >0\), which implies that \(\mathcal {V}\) is a \(P\)-matrix, and thus nonsingular [15, Theorem 3.3.4]. \(\square \)

4.1 Efficient implementation of the inexact smoothing Newton method

When applying Algorithm \(2\) to solve the inner subproblem (68), the most expensive step is solving the linear system (77). In our numerical implementation, we first obtain \(\Delta \varepsilon ^k = - \varepsilon ^k + \varsigma _k \hat{\varepsilon }\), and then apply the BiCGStab iterative solver of van der Vorst [56] to the following linear system

$$\begin{aligned} G_y^{\prime }(\varepsilon ^k, y^k) \Delta y^k = -G(\varepsilon ^k, y^k) - G_{\varepsilon }^{\prime }(\varepsilon ^k, y^k) \Delta \varepsilon ^k \end{aligned}$$
(87)

to obtain a \(\Delta y^k\) satisfying condition (78). For convenience, we suppress the superscript \(k\) in our subsequent analysis. By noting that \(G(\varepsilon , y)\) and \(\Upsilon (\varepsilon , y)\) are defined by (75) and (73), respectively, we have that

$$\begin{aligned} G_y^{\prime }(\varepsilon , y) \Delta y&= (1+\kappa \varepsilon )\Delta y + \pi _z^{\prime }(\varepsilon , z)\left( T \Delta y + \sigma \widehat{\mathcal {A}} (\overline{D}_{\rho \sigma })_W^{\prime }(\varepsilon , W) (H)-\Delta y\right) , \end{aligned}$$
(88)

where \(H = \widehat{\mathcal {A}}^{*}\Delta y\), \(z: = y - (Ty + \widehat{\mathcal {A}}\overline{D}_{\rho \sigma }(\varepsilon , W)-\hat{b})\) and \(W: = X-\sigma (C-\widehat{\mathcal {A}}^{*}y)\). Let \(W\) have the SVD as in (10). Then, by (29), we have

$$\begin{aligned} (\overline{D}_{\rho \sigma })_W^{\prime }(\varepsilon , W) (H) = U (\Lambda _{\alpha \alpha }\circ H_1^s + \Lambda _{\alpha \gamma }\circ H_1^a)V_1^T + U(\Lambda _{\alpha \beta } \ \circ H_2) V_2^T, \end{aligned}$$
(89)

where \(\Lambda _{\alpha \alpha }, \Lambda _{\alpha \gamma }\) and \(\Lambda _{\alpha \beta }\) are given by (28), \(H_1 = U^THV_1\), and \(H_2 = U^THV_2\). When implementing the BiCGStab iterative method, one needs to repeatedly compute the matrix-vector multiplication \(G_y^{\prime }(\varepsilon , y) \Delta y\). From (89), it seems that a full SVD of \(W\) is required. But as we shall explain next, it is not necessary to compute \(V_2\) explicitly. Observe that \(\Lambda _{i,j}=d_i\) for all \(i\in \alpha , j\in \beta \) for some \(d\in \mathfrak {R}^p\). Thus

$$\begin{aligned} U(\Lambda _{\alpha \beta }\circ H_2)V_2^T = U\mathrm{diag}(d) U^T H V_2V_2^T = U\mathrm{diag}(d) U^T(H- (HV_1)V_1^T), \end{aligned}$$
(90)

which shows that the computation in (89) can be done without explicitly using \(V_2\).
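The following small NumPy check illustrates the identity (90) on random data. It is not part of the algorithm; it simply shows why only the economical SVD factors \(U, V_1\) and the vector \(d\) are needed when forming the matrix-vector products, with all variable names being ours.

```python
import numpy as np

# Check of identity (90): with the full SVD W = U Sigma [V1 V2]^T and
# (Lambda_{alpha,beta})_{ij} = d_i constant along each row,
#   U (Lambda_{alpha,beta} o H_2) V_2^T = U diag(d) U^T (H - (H V1) V1^T),
# so V2 never has to be formed in the repeated matrix-vector products.
rng = np.random.default_rng(0)
p, q = 5, 8
W = rng.standard_normal((p, q))
U, s, Vt = np.linalg.svd(W, full_matrices=True)   # full SVD only for this check
V = Vt.T
V1, V2 = V[:, :p], V[:, p:]
d = rng.random(p)                                  # row-constant entries of Lambda_{alpha,beta}
H = rng.standard_normal((p, q))

H2 = U.T @ H @ V2
lhs = U @ (d[:, None] * H2) @ V2.T                 # uses V2 explicitly
rhs = U @ np.diag(d) @ U.T @ (H - (H @ V1) @ V1.T) # V2-free form used in the text
assert np.allclose(lhs, rhs)
```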

To achieve fast convergence for the BiCGStab method, we introduce an easy-to-compute diagonal preconditioner for the linear system (87). Since both \(\pi _z^{\prime }(\varepsilon , z)\) and \(T\) are diagonal matrices, we know from (88) that it is enough to find a good diagonal approximation of \(\widehat{\mathcal {A}} (\overline{D}_{\rho \sigma })_W^{\prime }(\varepsilon , W) \widehat{\mathcal {A}}^{*}\). Let

$$\begin{aligned} M: = \mathbf {\widehat{A}S\widehat{A}}^{T}, \end{aligned}$$

where \(\mathbf {\widehat{A}}\) and \(\mathbf {S}\) denote the matrix representations of the linear maps \(\widehat{\mathcal {A}}\) and \((\overline{D}_{\rho \sigma })_W^{\prime }(\varepsilon , W) \) with respect to the standard bases in \(\mathfrak {R}^{p\times q}\) and \(\mathfrak {R}^{m+s}\), respectively. Let the standard basis in \(\mathfrak {R}^{p\times q}\) be \(\{E^{ij}\in \mathfrak {R}^{p\times q}: 1\le i\le p, 1\le j \le q\}\), where for each \(E^{ij}\), its \((i,j)\)-th entry is one and all the others are zero. Then the diagonal element of \(\mathbf {S}\) with respect to the basis element \(E^{ij}\) is given by

$$\begin{aligned} \mathbf {S}_{(i,j),(i,j)} = ((U\circ U) \widetilde{\Lambda }(V\circ V)^T)_{ij} + \frac{1}{2} \langle H_1^{ij}\circ (H_1^{ij})^T, \Lambda _{\alpha \alpha } - \Lambda _{\alpha \gamma }\rangle , \end{aligned}$$

where \( \widetilde{\Lambda } := \left[ \frac{1}{2}(\Lambda _{\alpha \alpha } +\Lambda _{\alpha \gamma }), \Lambda _{\alpha \beta }\right] \) and \(H_1^{ij} = U^T E^{ij}V_1\). Based on the above expression, the total cost of computing all the diagonal elements of \(\mathbf {S}\) is equal to \(2(p+q)pq + 3p^3q \) flops, which is too expensive if \(p^2\gg p+q\). Fortunately, the first term

$$\begin{aligned} \mathbf {d}_{(ij)} = ((U\circ U) \widetilde{\Lambda }(V\circ V)^T)_{ij} \end{aligned}$$

is usually a very good approximation of \(\mathbf {S}_{(i,j),(i,j)}\), and the cost of computing all the elements \(\mathbf {d}_{(ij)}\), for \(1\le i\le p, 1\le j\le q\), is \(2(p+q)pq\) flops. Thus we propose the following diagonal preconditioner for the coefficient matrix \(G_y^{\prime }(\varepsilon , y)\):

$$\begin{aligned} M_G : = (1+\kappa \varepsilon ) I + \pi _z^{\prime }(\varepsilon , z)\left( T +\sigma \mathrm{diag}(\mathbf {\widehat{A}\mathrm{diag}(\mathbf {d})\widehat{A}}^{T})- I\right) . \end{aligned}$$
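A sketch of how this diagonal preconditioner might be assembled is given below. It assumes that \(\mathbf {\widehat{A}}\) is available as an explicit \((m+s)\times pq\) matrix whose columns follow a column-major vectorization of \(\mathfrak {R}^{p\times q}\), that \(\widetilde{\Lambda }\) is supplied as a \(p\times q\) array, and that the diagonal matrices \(\pi _z^{\prime }(\varepsilon ,z)\) and \(T\) are stored as vectors; all of these conventions and names are ours.

```python
import numpy as np

def diag_preconditioner(Ahat, U, V, Lam_tilde, pi_z_diag, T_diag, eps, kappa, sigma):
    # d_(ij) = ((U o U) Lam_tilde (V o V)^T)_(ij): first-order surrogate for the
    # diagonal of S, costing about 2(p+q)pq flops.
    D = (U * U) @ Lam_tilde @ (V * V).T
    d = D.ravel(order='F')             # column-major vectorization of R^{p x q} (assumed)
    # diag(Ahat diag(d) Ahat^T)_i = sum_j Ahat[i, j]^2 * d[j].
    diagA = (Ahat ** 2) @ d
    # Diagonal of M_G = (1 + kappa*eps) I + pi_z'(eps, z) (T + sigma*diag(...) - I).
    return (1.0 + kappa * eps) + pi_z_diag * (T_diag + sigma * diagA - 1.0)
```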

5 Numerical experiments

In this section, we report some numerical results to demonstrate the efficiency of our smoothing-Newton partial PPA. We implemented our algorithm in MATLAB 2011a (version 7.12), and the numerical experiments were run in MATLAB under the Windows 7 operating system on an Intel Xeon (quad-core) 2.80GHz CPU with 24GB of memory.

In our numerical implementation, we apply the well-known alternating direction method of multipliers (ADMM), proposed by Gabay and Mercier [23], Glowinski and Marrocco [26], and Glowinski [27], to generate a good starting point for our PPA. To use the ADMM, we introduce two auxiliary variables \(Y\) and \(v\), and consider the following equivalent form of (1):

$$\begin{aligned} \displaystyle \min _{X\in \mathfrak {R}^{p\times q}, Y\in \mathfrak {R}^{p\times q}, v \in \mathcal {Q}} \left\{ \frac{1}{2}\Vert \mathcal {A}(X) -b\Vert ^2 + \rho \Vert Y\Vert _{*} + \langle C, X \rangle : Y = X, \, \mathcal {B}(X) - v \!=\! d \right\} . \end{aligned}$$
(91)

Consider the following augmented Lagrangian function for the problem (91):

$$\begin{aligned} L_{\beta }(X,Y,v; Z, \lambda )&= \frac{1}{2}\Vert \mathcal {A}(X) -b\Vert ^2 + \rho \Vert Y\Vert _{*} + \langle C, X \rangle + \langle Z, X-Y \rangle \\&+\, \langle \lambda , d-\mathcal {B}(X) + v\rangle \\&+\frac{\beta }{2}\Vert X- Y\Vert ^2 + \frac{\beta }{2} \Vert d-\mathcal {B}(X) + v\Vert ^2, \end{aligned}$$

where \(Z\in \mathfrak {R}^{p\times q}\) and \(\lambda \in \mathfrak {R}^s\) are the Lagrange multipliers for the linear equality constraints, and \(\beta >0\) is a penalty parameter. Given a starting point \((X^0, Y^0, v^0,Z^0,\lambda ^0)\), the ADMM generates new iterates according to the following procedure:

$$\begin{aligned} X^{k+1}&:= \mathrm{argmin}_{X\in \mathfrak {R}^{p\times q}} L_{\beta }(X,Y^k,v^k; Z^k, \lambda ^k), \\ (Y^{k+1},v^{k+1})&:= \mathrm{argmin}_{Y\in \mathfrak {R}^{p\times q},\, v\in \mathcal {Q}} L_{\beta }(X^{k+1},Y,v; Z^k, \lambda ^k), \\&= \left( D_{\rho /\beta }(X^{k+1}+\beta ^{-1}Z^k), \, \Pi _{\mathcal {Q}}( \mathcal {B}(X^{k+1})-d-\beta ^{-1}\lambda ^k)\right) , \\ Z^{k+1}&:= Z^k + \gamma \beta (X^{k+1} - Y^{k+1}), \quad \lambda ^{k+1} := \lambda ^k \!+\! \gamma \beta (d\!-\!\mathcal {B}(X^{k+1}) \!+\! v^{k+1}), \end{aligned}$$

where \(\gamma \in (0,(1+\sqrt{5})/2)\) is a given constant. Note that the theoretical convergence of the above procedure is guaranteed; see [27]. However, in our numerical implementation, we observe that the performance of the ADMM is very sensitive to the choice of the penalty parameter \(\beta \). In order to achieve faster convergence, it is essential to adjust \(\beta \) adaptively so as to balance the progress of the primal and dual infeasibilities defined in (92). Let \(\beta _k, R_P^k, R_D^k\) be the values of \(\beta \) and of the primal and dual infeasibilities at the \(k\)th iteration of the ADMM. In our implementation, we adjust \(\beta \) according to the following rule:

$$\begin{aligned} \beta _{k+1} = \left\{ \begin{array}{l@{\quad }l} 1.25 \beta _{k} &{} \mathrm{if} \ R_P^k > 10 R_D^k \quad \hbox { and } \quad \hbox {rem}(k, N_k) = 0,\\ \frac{1}{1.25} \beta _{k} &{}\hbox {if } R_P^k < 0.1 R_D^k \quad \hbox { and }\quad \hbox {rem}(k, N_k) = 0,\\ \beta _{k} &{} \hbox {otherwise}, \end{array} \right. \end{aligned}$$

where \(N_k = 3\) if \(k\le 30\); \(N_k=6\) if \(30<k\le 60\); \(N_k=12\) if \(60<k\le 120\); \(N_k = 25\) if \(120<k\le 250\); \(N_k = 50\) if \(250 < k\). Observe that when the parameter \(\beta \) is modified, we are actually restarting a new cycle of ADMM using the iterate from the last cycle as the starting point for the new cycle.
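The rule above translates directly into code; the sketch below is only a restatement of the update, with the iteration-dependent period \(N_k\) hard-coded as described and the function name being ours.

```python
def update_beta(beta, k, RP, RD):
    # Adaptive penalty update used while warm-starting the PPA with the ADMM:
    # enlarge beta when the primal infeasibility lags the dual one, and shrink it
    # in the opposite case, but only every N_k iterations so each ADMM cycle can settle.
    if k <= 30:
        Nk = 3
    elif k <= 60:
        Nk = 6
    elif k <= 120:
        Nk = 12
    elif k <= 250:
        Nk = 25
    else:
        Nk = 50
    if k % Nk == 0:
        if RP > 10.0 * RD:
            return 1.25 * beta
        if RP < 0.1 * RD:
            return beta / 1.25
    return beta
```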

In order to measure the infeasibilities of the primal problem (41), we define two linear operators \(\mathcal {B}_e:\mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^{s_1}\) and \(\overline{\mathcal {B}}_e: \mathfrak {R}^{p\times q} \rightarrow \mathfrak {R}^{s_2}\), respectively, as follows:

$$\begin{aligned} \left\{ \begin{array}{l} (\mathcal {B}_e (X))_i := \langle \mathcal {B}_i, X \rangle , \ \mathrm{for}\ i=1,\dots , s_1, \\ (\overline{\mathcal {B}}_e (X))_i := \langle \mathcal {B}_i, X \rangle , \ \mathrm{for}\ i=s_1+1,\dots , s. \end{array} \right. \end{aligned}$$

Let \(d = (d_{s_1}; d_{s_2})\) where \(d_{s_1}\in \mathfrak {R}^{s_1}\) and \(d_{s_2}\in \mathfrak {R}^{s_2}\). We measure the infeasibilities and optimality for the primal problem (41) and the dual problem (45) as follows:

$$\begin{aligned} \begin{array}{rl} \displaystyle R_P &{} \displaystyle = \; \frac{\Vert [b-u-\mathcal {A}(X);\, d_{s_1} - \mathcal {B}_e(X);\, \max (0, d_{s_2} - \overline{\mathcal {B}}_e (X)) ]\Vert }{1+\Vert \hat{b}\Vert }, \\ \displaystyle R_D &{} \displaystyle = \; \frac{\Vert C- \mathcal{A}^{*}\zeta -\mathcal{B}^*\xi - Z \Vert }{1+\Vert [\mathcal{A}^*,\mathcal{B}^*]\Vert }, \quad \mathrm{relgap} = \frac{f_\rho (u,X) - g_\rho (\zeta ,\xi )}{1+|f_\rho (u,X)|+|g_\rho (\zeta ,\xi )|}, \end{array} \end{aligned}$$
(92)

where \(y = (\zeta ; \xi ), Z = (D_{\rho \sigma }(W) - W)/\sigma \) with \(W = X-\sigma (C-\mathcal{A}^*\zeta -\mathcal{B}^*\xi )\), and \(f_\rho (u,X)\) and \(g_\rho (\zeta ,\xi )\) are the objective functions of (41) and (45), respectively. The infeasibility of the condition \(\Vert Z\Vert _2 \le \rho \) is not checked since it is satisfied up to machine precision throughout the algorithm. In our numerical experiments, we stop the partial PPA when

$$\begin{aligned} \max \{R_P, R_D\} \le 10^{-6} \quad \mathrm{and} \quad |\mathrm{relgap}|\le 10^{-5}. \end{aligned}$$

We choose the initial iterate \(X^0 =0, y^0 = 0\), and \(\sigma _0 = 1\). The parameter \(\rho \) in (1) is set to be \(\rho = 10^{-3}\Vert \mathcal {A}^{*}b\Vert _2\) if the data is not contaminated by noise; otherwise, the parameter \(\rho \) is set to be \(\rho = 5\times 10^{-3}\Vert \mathcal {A}^{*}b\Vert _2\).
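For concreteness, the parameter and stopping choices just described can be summarized as follows; the function names are ours, and \(\Vert \cdot \Vert _2\) of the matrix \(\mathcal {A}^{*}b\) is its operator norm.

```python
import numpy as np

def choose_rho(Astar_b, noisy):
    # rho = 1e-3 * ||A^*(b)||_2 for clean data and 5e-3 * ||A^*(b)||_2 for noisy data,
    # where the norm is the largest singular value of the p-by-q matrix A^*(b).
    return (5e-3 if noisy else 1e-3) * np.linalg.norm(Astar_b, 2)

def ppa_converged(RP, RD, relgap):
    # Stopping test for the partial PPA.
    return max(RP, RD) <= 1e-6 and abs(relgap) <= 1e-5
```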

In our numerical experiments, we run the ADMM for at most 30 iterations to generate a good starting point for our proposed PPA. In addition, we switch from the ADMM to the PPA as soon as \(\max \{R_P,R_D\}\) is less than \(10^{-3}\).

5.1 Example 1

We consider the nearest matrix approximation problem discussed by Golub, Hoffman and Stewart in [29], where the classical Eckart–Young–Mirsky theorem [18, 41] was extended to obtain the nearest lower-rank approximation while certain specified columns of the matrix are kept fixed. The Eckart–Young–Mirsky theorem has the drawback that the approximation generally differs from the original matrix in all its entries. Thus it is not suitable for applications where some columns of the original matrix must be fixed. For example, in statistics the regression matrix for the multiple regression model with a constant term has a column of all ones, and this column should not be perturbed.

For each triplet \((p,q,r)\), where \(r\) is the predetermined rank, we generate a random matrix \(M \in \mathfrak {R}^{p\times q}\) of rank \(r\) by setting \(M = M_1 M_2^T\), where both \(M_1 \in \mathfrak {R}^{p\times r}\) and \(M_2 \in \mathfrak {R}^{q\times r}\) have i.i.d. standard uniform entries in \((0, 1)\). As observed entries in practice are rarely exact, we corrupt the entries of \(M\) by Gaussian noise to simulate noisy observations as follows. First we generate a random matrix \(N \in \mathfrak {R}^{p\times q}\) with i.i.d. Gaussian entries. Then we assume that the observed data is given by \(\widetilde{M} = M + \tau N \Vert M\Vert /\Vert N\Vert \), where \(\tau \) is the noise factor. In our numerical experiments, we set \(\tau = 0.1\). We assume that the first column of \(M\) should be fixed, and consider the following minimization problem:

$$\begin{aligned} \min _{X\in \mathfrak {R}^{p\times q}} \left\{ \frac{1}{2}\Vert X - \widetilde{M} \Vert ^2 + \rho \Vert X\Vert _{*} \, : \, X e_1 \, = \, Me_1, \, X \ge 0 \right\} , \end{aligned}$$
(93)

where \(e_1\) is the first column of the \(q\times q\) identity matrix. Here we impose the extra constraint \(X\ge 0\) since the original matrix \(M\) is nonnegative. Note that the approximation derived in [29] is generally not nonnegative.
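As an illustration of the data generation just described (the names are ours, not those of the paper's MATLAB code), one instance of Example 1 might be set up as follows.

```python
import numpy as np

def example1_data(p, q, r, tau=0.1, seed=0):
    # Rank-r nonnegative target M = M1 M2^T with i.i.d. U(0,1) factors, observed as
    # Mtil = M + tau * N * ||M||_F / ||N||_F with i.i.d. Gaussian N (Frobenius norms).
    rng = np.random.default_rng(seed)
    M = rng.random((p, r)) @ rng.random((q, r)).T
    N = rng.standard_normal((p, q))
    Mtil = M + tau * N * np.linalg.norm(M) / np.linalg.norm(N)
    # Problem (93) then fixes the first column, X e_1 = M e_1, and imposes X >= 0.
    return M, Mtil, M[:, 0]
```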

For each \((p,q,r)\) and \(\tau \), we generate \(5\) random instances. In Table 1, we report the total number of constraints \((m+s)\) in (41), the average number of outer iterations, the average total number of inner iterations, the average number of BiCGStab steps taken to solve (87), the average infeasibilities, the average relative gap, the average relative mean square error \(\mathrm{MSE} := \Vert X-M\Vert /\Vert M\Vert \) (where \(M\) is the original matrix), the mean value of the rank (\(\#\)sv) of \(X\), and the average CPU time taken (in the format hours:minutes:seconds). Note that the CPU time includes the cost of running the ADMM to generate a good starting point, and the time spent in the ADMM initialization is given in parentheses. We may observe from the table that the partial PPA is very efficient for solving (93). For problems where \(p\) is moderate but \(q\) is large, e.g., \(p=100\) and \(q= 20,000\), we only compute the economical form of the SVD and use the formula in (90) to avoid the explicit computation of \(V_2\). It takes less than a minute to solve the last instance to the tolerance of \(10^{-6}\), while the MSE is reasonably small.

Table 1 Numerical performance of the partial PPA on (93)

In the numerical experiments, we observe that when the generated matrix \(M\) is of small rank, e.g., \(r=10\), the singular values of the computed solution \(X\) are separated into two clusters, with the first cluster having a much larger mean value than the second (see, e.g., Fig. 2). We may view the number of singular values in the first cluster as a good estimate of the rank of the true solution, while the smaller positive singular values in the second cluster may be attributed to the presence of noise in the given data. When the matrix \(M\) is of high rank, e.g., \(r=50\), the singular values of \(X\), apart from the largest one, are usually not well separated into two clusters (see, e.g., Fig. 2). In Table 1, when the singular values of \(X\) are well separated into two clusters, we also report the number of singular values in the first cluster in parentheses next to \(\#\)sv. In the table, "\(\mathrm{NA}\)" means that the singular values of \(X\) are not well separated into two clusters.

Fig. 2 Distribution of singular values of \(X\) and \(M\)

To give the reader an idea of the relative performance of our proposed PPA (without using the ADMM to generate a good starting point) and the ADMM, we also repeat the experiments on some of the instances in Table 1 using the PPA without ADMM initialization and using the ADMM alone. We observe that the PPA with ADMM initialization is more efficient than the PPA without it, and both are more efficient than the ADMM.

5.2 Example 2

In [34], Lin proposed the Latent Markov Analysis (LMA) approach for finding the reduced rank approximations of transition matrices. The LMA is applied to clustering based on pairwise similarities such that the inferred cluster relationships can be described probabilistically by the reduced-rank transition matrix. Benczúr et al. [5] considered the problem of finding the low rank approximation of the transition matrix for computing the personalized PageRank, which describes the backlink-based page quality around user-selected pages.

In this example, we evaluate the performance of our partial PPA for finding the nearest transition matrix of lower rank. Consider the set of \(n\) web pages as a directed graph whose nodes are the web pages and whose edges are all the links between pages. Let \(\mathrm{deg}(i)\) be the outdegree of page \(i\), i.e., the number of pages which can be reached by a direct link from page \(i\). Note that all self-referential links are excluded. Let \(P \in \mathfrak {R}^{n\times n}\) be the matrix which describes the transition probability between pages \(i\) and \(j\), where \(P_{ij} = 1/\mathrm{deg}(i)\) if \(\mathrm{deg}(i)>0\) and there is a link from \(i\) to \(j\). For a page \(i\) having no outlinks (a dangling page), we set \(P_{ij} = 1/n\) for \(j=1,\ldots , n\), i.e., the user makes a random choice with uniform probability \(1/n\). Since the matrix \(P\) for the web graph is generally reducible, \(P\) may have several eigenvalues on the unit circle, which could cause convergence problems for the power method used to compute the PageRank [43]. The standard way of ensuring irreducibility is to replace \(P\) by the matrix

$$\begin{aligned} P_c = c P + (1- c) ev^T, \end{aligned}$$

where \(c \in (0,1)\), \(e\in \mathfrak {R}^n\) is the vector of all ones, and \(v\in \mathfrak {R}^n\) is a vector such that \(v\ge 0\) and \(e^T v = 1\). We generate a random matrix \(N \in \mathfrak {R}^{n\times n}\) with i.i.d. Gaussian entries. Then we assume that the observed data is given by \(\widetilde{P}_c = P_c + \tau N \Vert P_c\Vert /\Vert N\Vert \), where \(\tau \) is the noise factor. In our numerical experiments, we set \(\tau = 0.1\), \(c = 0.85\) (a typical value used by Google), and \(v_i = 1/n\;\forall \, i\). The minimization problem which we solve is as follows:

$$\begin{aligned} \min _{X \in \mathfrak {R}^{n\times n}}\left\{ \frac{1}{2} \Vert X - \widetilde{P}_c\Vert ^2 + \rho \Vert X\Vert _{*}: Xe = e, \; X \ge 0 \right\} . \end{aligned}$$
(94)
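A sketch of how the noisy matrix \(\widetilde{P}_c\) appearing in (94) may be assembled from a 0/1 adjacency matrix is given below; the function name and arguments are ours, and the construction simply follows the description above.

```python
import numpy as np

def noisy_google_matrix(A, c=0.85, tau=0.1, seed=0):
    # A[i, j] = 1 if page i links to page j (self-links already removed).
    # Rows with outlinks get P[i, j] = 1/deg(i); dangling pages get the uniform row 1/n.
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = np.full((n, n), 1.0 / n)
    linked = deg > 0
    P[linked] = A[linked] / deg[linked][:, None]
    # P_c = c P + (1 - c) e v^T with v = e/n, then add scaled Gaussian noise.
    Pc = c * P + (1.0 - c) / n
    rng = np.random.default_rng(seed)
    N = rng.standard_normal((n, n))
    return Pc, Pc + tau * N * np.linalg.norm(Pc) / np.linalg.norm(N)
```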

We use the data Harvard500.mat generated by Cleve Moler's MATLAB program surfer.m (see Footnote 1) to evaluate the performance of our algorithm. We also use surfer.m to generate three adjacency graphs of a portion of web pages starting at the root page "http://www.nus.edu.sg". In addition, we apply our algorithm to the data sets (see Footnote 2) collected by Tsaparas on querying the Google search engine about four topics: automobile industries, computational complexity, computational geometry, and randomized algorithms. Table 2 reports the average numerical results of the PPA for solving (94) over \(5\) runs, where \(r\) denotes the rank of \(P_c\). We can observe from the table that the partial PPA (with ADMM initialization) is quite efficient for solving (94) when applied to the real web graph data sets. Surprisingly, the ADMM is even more efficient for two of the tested instances, automobile industries and computational geometry.

Table 2 Numerical performance of the partial PPA on (94)

5.3 Example 3

We consider the problem of finding a low rank doubly stochastic matrix with a prescribed entry. A matrix \(M\in \mathfrak {R}^{n\times n}\) is called doubly stochastic if it is nonnegative and all its row and column sums are equal to one. This problem arises from numerical simulation of large circuit networks. In order to reduce the complexity of simulating the whole system, Padé approximation via a Krylov subspace method, such as the Lanczos algorithm, is used to generate a low order approximation to the linear system matrix describing the large network [3]. The tridiagonal matrix \(M \in \mathfrak {R}^{n\times n}\) produced by the Lanczos algorithm is generally not doubly stochastic. But if the original matrix is doubly stochastic, then we need to find a low rank approximation of \(M\), which is doubly stochastic and matches the maximal moments. In our numerical experiments, we will not restrict the matrix \(M\) to be tridiagonal.

For each pair \((n,r)\), we generate a positive matrix \(\overline{M}\in \mathfrak {R}^{n\times n}\) with rank \(r\) by the same method as in Example 1. Then we use the Sinkhorn–Knopp algorithm [51] to find two positive definite diagonal matrices \(D_1, D_2 \in \mathfrak {R}^{n\times n}\) such that \(M = D_1 \overline{M} D_2\) is a doubly stochastic matrix of rank \(r\). We sample a subset \(\mathcal {E}\) of \(m\) entries of \(M\) uniformly at random, and generate a random matrix \(N_{\mathcal {E}} \in \mathfrak {R}^{n\times n}\) with sparsity pattern \(\mathcal {E}\) and i.i.d. standard Gaussian entries. Then we assume that the observed data is given by \(\widetilde{M}_{\mathcal {E}} = M_{\mathcal {E}} + \tau N_{\mathcal {E}} \Vert M_{\mathcal {E}}\Vert /\Vert N_{\mathcal {E}}\Vert \), where \(\tau \) is the noise factor. The problem for matching the first moment of \(M\) can be stated as follows:

$$\begin{aligned} \min _{X\in \mathfrak {R}^{n\times n}} \left\{ \frac{1}{2}\Vert X_{\mathcal {E}} - \widetilde{M}_{\mathcal {E}} \Vert ^2 + \rho \Vert X\Vert _{*} \, : \, X e = e,\, X^T e = e,\, X_{11} = M_{11}, \, X \ge 0 \right\} . \end{aligned}$$
(95)
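A basic Sinkhorn–Knopp balancing loop, as used above to produce the rank-\(r\) doubly stochastic matrix \(M = D_1\overline{M}D_2\), could look as follows; this is our own minimal sketch for a positive input matrix, not the implementation of [51].

```python
import numpy as np

def sinkhorn_knopp(Mbar, tol=1e-12, max_iter=10000):
    # Alternate row/column scaling of a positive matrix Mbar until M = D1 Mbar D2
    # is (numerically) doubly stochastic; D1, D2 are positive diagonal matrices,
    # so the rank of M equals the rank of Mbar.
    n = Mbar.shape[0]
    d1 = np.ones(n)
    d2 = np.ones(n)
    for _ in range(max_iter):
        d1 = 1.0 / (Mbar @ d2)        # make row sums equal to one
        d2 = 1.0 / (Mbar.T @ d1)      # make column sums equal to one
        M = d1[:, None] * Mbar * d2[None, :]
        if max(np.abs(M.sum(axis=1) - 1).max(), np.abs(M.sum(axis=0) - 1).max()) < tol:
            break
    return M, np.diag(d1), np.diag(d2)
```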

In [13], the following relatively simpler version of (95) was studied:

$$\begin{aligned} \min _{X\in \mathfrak {R}^{n\times n}} \left\{ \frac{1}{2}\Vert X - \widetilde{M}\Vert ^2 \, : \, X e = e,\, X^T e = e,\, X_{11} = M_{11}, \, X \ge 0 \right\} , \end{aligned}$$
(96)

i.e., without the nuclear norm regularization term and with all the entries of \(\widetilde{M}\) assumed to be available. Because of its simplicity, the problem (96) was solved in [13] via its dual by a semismooth Newton-CG method analogous to that in [45].

In our numerical experiments, we set \(\tau = 0, 0.1\), and the number of sampled entries to be \(m = 10dr\), where \(dr = r(2n-r)\) is the number of degrees of freedom of an \(n\times n\) matrix of rank \(r\). In Table 3, we report the average numerical results for solving (95) on randomly generated matrices over \(5\) runs, where \(m\) is the average number of sampled entries, and \(m+s\) is the average total number of constraints in (41). We can observe from the table that the partial PPA can solve (95) very efficiently for all the instances. Note that the partial PPA with ADMM initialization is at least twice as efficient as the one without ADMM initialization. The former is also more efficient than the ADMM, except for the instance with (\(n/\tau = 1{,}000/0.1\), \(r=10\)).

Table 3 Numerical performance of the partial PPA on (95)

We may also consider the following generalized version of (95), where we want to find a low rank doubly stochastic matrix with \(k\) prescribed entries of \(M\):

$$\begin{aligned} \min _{X\in \mathfrak {R}^{n\times n}} \left\{ \frac{1}{2}\Vert X_{\mathcal {E}} - \widetilde{M}_{\mathcal {E}} \Vert ^2 + \rho \Vert X\Vert _{*} \, : \, X e = e,\, X^T e = e,\, X_{i_1 j_1} = M_{i_1 j_1},\, \ldots ,\, X_{i_k j_k} = M_{i_k j_k}, \, X \ge 0 \right\} , \end{aligned}$$
(97)

where \((i_1,j_1), \ldots , (i_k,j_k)\) are distinct pairs. In our numerical experiments, we set \(k = \lceil 10^{-3}n^2 \rceil \), with the prescribed entries selected uniformly at random. Table 4 presents the average numerical results for solving (97) on randomly generated matrices over \(5\) runs. We can observe that for this problem, the partial PPA with ADMM initialization is much more efficient than the ADMM.

Table 4 Same as Table 3 but for the problem (97)

5.4 Example 4

We consider the problem of finding a low rank nonnegative approximation which preserves the left and right principal eigenvectors of a square positive matrix. This problem was considered by Ho and van Dooren in [31]. Let \(M\in \mathfrak {R}^{n \times n}\) be a positive matrix. By the Perron–Frobenius theorem, \(M\) has a simple positive eigenvalue \(\lambda \) with the largest magnitude. Moreover, there exist two positive eigenvectors \(v,w\in \mathfrak {R}^n\) such that \( M v = \lambda v\) and \( M^T w = \lambda w\). As suggested by Bonacich [8], the principal eigenvector can be used to measure network centrality, where the \(i\)-th component of the eigenvector gives the centrality of the \(i\)-th node in the network. For example, Google's PageRank [43] is a variant of eigenvector centrality for ranking web pages.

For each pair \((n,r)\), we generate a positive matrix \(M\in \mathfrak {R}^{n\times n}\) of rank \(r\) by the same method as in Example 1. We sample a subset \(\mathcal {E}\) of \(m\) entries of \(M\) that are possibly corrupted by Gaussian noise as in Example 3, and the resulting partially observed matrix is denoted by \(\widetilde{M}_\mathcal{E}\). Given the largest positive eigenvalue \(\lambda \) and the corresponding right and left eigenvectors \(v\) and \(w\) of \(M\), the problem of finding a low rank approximation of \(M\) while preserving these eigenvectors can be stated as follows:

$$\begin{aligned} \min _{X\in \mathfrak {R}^{n\times n}} \left\{ \frac{1}{2}\Vert X_{\mathcal {E}} - \widetilde{M}_{\mathcal {E}}\Vert ^2 + \rho \Vert X\Vert _{*} \, : \, X v = \lambda v, X^T w = \lambda w, X \ge 0 \right\} . \end{aligned}$$
(98)
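The eigen-data \((\lambda , v, w)\) entering the constraints of (98) can be obtained, for example, by power iteration on \(M\) and \(M^T\); the sketch below, with names of our own choosing, assumes \(M\) is entrywise positive so that the Perron pair is well defined and the iteration converges.

```python
import numpy as np

def perron_pair(M, tol=1e-12, max_iter=10000):
    # Power iteration on M and M^T: for a positive matrix the Perron value lambda is
    # simple, and the iterates converge to positive vectors with M v = lambda v and
    # M^T w = lambda w (v and w normalized to unit 2-norm).
    n = M.shape[0]
    v = np.full(n, 1.0 / np.sqrt(n))
    w = np.full(n, 1.0 / np.sqrt(n))
    lam = 0.0
    for _ in range(max_iter):
        Mv = M @ v
        lam_new = np.linalg.norm(Mv)
        v_new = Mv / lam_new
        w_new = M.T @ w
        w_new /= np.linalg.norm(w_new)
        done = abs(lam_new - lam) < tol * max(1.0, lam_new)
        v, w, lam = v_new, w_new, lam_new
        if done:
            break
    return lam, v, w
```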

Table 5 reports the average numerical results of the partial PPA for solving (98) over \(5\) runs. Again, we observe that our partial PPA (with ADMM initialization) is very efficient in solving the problem (98).

Table 5 Numerical performance of the partial PPA on (98)

6 Conclusion

In this paper, we introduced a partial PPA for solving nuclear norm regularized matrix least squares problems with equality and inequality constraints, and presented global and local convergence results based on the classical results for a general partial PPA. The inner subproblems, reformulated as a system of semismooth equations, were solved by an inexact smoothing Newton method, which was shown to be quadratically convergent under a constraint nondegeneracy condition that we also characterized. Extensive numerical experiments on a variety of large scale nuclear norm regularized matrix least squares problems demonstrated that our proposed algorithm is efficient and robust.

We view our current work as providing an important tool for solving the very difficult problem of structured approximation with a prescribed rank. Although it is popular to regularize a structured approximation problem by the nuclear norm to promote a low rank solution, it cannot always deliver a solution with a prescribed rank. Thus it would be desirable to design an algorithm (along the line discussed in [25]) to find the best approximation with a prescribed rank efficiently while preserving certain desired structures.