1 Introduction

Quaternion matrices have been widely used in color image processing, including color image denoising and inpainting, color face recognition and color image classification [4, 5, 9, 11, 17, 18, 19, 21, 23, 26]. Quaternions have also been widely used in other engineering areas [20]. The core mathematical models in most of these works can be formulated as optimization problems with quaternion matrices as decision variables, such as the low rank quaternion decomposition (LRQD) model proposed in [4], the robust quaternion matrix completion (RQMC) model proposed in [9], and the low rank quaternion approximation (LRQA) model proposed in [5]. All of these motivate us to focus on quaternion matrix optimization (QMO), where quaternion matrices act as decision variables.

Compared with classical optimization over the real field, the optimization theory for QMO is far from complete, even though all involved objective and constraint functions are real-valued. The main difficulty comes from the calculus of quaternions. In this paper, we intend to provide the necessary theoretical foundations of optimality analysis for QMO, in order to enrich the optimization theory for QMO and to pave the way for further developments in algorithm design for solving QMO.

Various real-valued functions of quaternion matrix variables, such as Frobenius norms, nuclear norms, spectral norms, \(\ell _1\)-norms, traces, \(\ell _0\)-norms and rank functions of quaternion matrices, arise from engineering applications.

To optimize such real-valued functions, derivatives, subdifferentials and generalized subdifferentials are needed.

For the color image denoising problem, optimization models may involve \(\ell _0\)-norms and rank functions of quaternion matrix variables, which are discrete. In sparse optimization [2, 12], generalized subdifferentials of \(\ell _0\)-norms and rank functions of real matrix variables are used. This relies on tools from variational analysis [6, 8, 10, 15, 16]. In this paper, we aim to develop adequate analysis tools, namely derivatives, subdifferentials and generalized subdifferentials of real-valued functions of quaternion matrix variables, so that we can establish optimality conditions for such optimization models and pave the way for future work on convergence analysis of algorithms for them.

In the literature, there are some works on derivatives of real-valued functions of quaternion variables [14, 20], and on subgradients of norms of quaternion matrix variables [9]. There has been no discussion of generalized subdifferentials of \(\ell _0\)-norms and rank functions of quaternion matrix variables.

In [4], Chen, Qi, Zhang and Xu formulated the color image inpainting problem as an equality constrained optimization problem of real-valued functions in quaternion matrix variables, and proposed a lower rank quaternion decomposition (LRQD) algorithm to solve it. To conduct convergence analysis for their algorithm, they introduced a concise form

$$\begin{aligned} \nabla f(X) = {\partial f \over \partial X_0} + {\partial f \over \partial X_1}\mathbf {i}+ {\partial f \over \partial X_2}\mathbf {j}+ {\partial f \over \partial X_3}\mathbf {k}\end{aligned}$$

for the gradient of a real-valued function f in a quaternion matrix variable \(X=X_0 + X_1\mathbf {i}+ X_2\mathbf {j}+ X_3 \mathbf {k}\), where \(\mathbf {i}, \mathbf {j}\) and \(\mathbf {k}\) are the three imaginary units of quaternions. This form is different from the generalized HR calculus studied in [20]. With this form, the first-order necessary optimality condition of their quaternion least squares problem has a simple expression. With this tool, the convergence and convergence rate of their algorithm were established.

To explore the above approach further, we introduce the R-product and R-linear independence of quaternion matrix vectors, and establish the first-order necessary optimality condition for the QMO problem. We also present a method to calculate the second-order partial derivatives of real-valued functions in quaternion matrix variables, and establish the second-order necessary and sufficient optimality conditions for the QMO problem.

Norms of quaternion matrix variables may not be continuously differentiable, but they are always convex. In 2019, Jia, Ng and Song [9] introduced subgradients for norms of quaternion matrix variables.

In this paper, we show that our approach is consistent with the subgradient concept for norms of quaternion matrix variables, introduced in [9]. We also establish the relations between subgradients and derivatives of real-valued functions of quaternion matrix variables.

The \(\ell _0\)-norms and the ranks of quaternion matrices are not continuous, but are very useful in applications [5, 21]. In this paper, we develop the concepts of generalized subdifferentials of proper functions of quaternion matrices, and use them to analyze the optimality conditions of a sparse color image denoising (SCID) model. We show that the real representation set of low rank quaternion matrices is closed and semi-algebraic.

Such a generalized subdifferential calculus for quaternion matrices is new in the literature and useful in color image applications.

While some important real-valued functions of quaternion matrix variables, such as the squared Frobenius norm, are separable in the sense that they can be computed with respect to each real component matrix and then summed, the rank function of quaternion matrix variables is not separable. We treat it in a nontrivial way by considering the real representation of a quaternion matrix, and show that the real representation set of low rank quaternion matrices is semi-algebraic. This will be useful for convergence analysis of some first-order algorithms.

In the next section, we present some necessary preliminary knowledge on quaternions and quaternion matrices. The QMO problem and some of its special cases are presented in Sect. 3. In Sect. 4, we introduce R-product and R-linear independence as our key tools for defining first- and second-order derivatives of real-valued functions of quaternion matrix variables, and then present first- and second-order optimality conditions for QMO. In Sect. 5, we study convex functions of quaternion matrix variables, their subdifferentials, and their relations with derivatives of real-valued functions of quaternion matrix variables. We introduce the generalized subdifferentials of proper functions of quaternion matrices, and use them to analyze the optimality conditions of the SCID problem in Sect. 6. Some final remarks are made in Sect. 7.

2 Preliminaries

2.1 Quaternions

In general, we use the notation of [22, 24]. We denote the real field, the complex field and the quaternion algebra by \({{\mathbb {R}}}\), \({\mathbb {C}}\) and \({\mathbb {Q}}\), respectively. Scalars, vectors and matrices are denoted by lowercase letters, bold lowercase letters and capital letters, respectively. Vectors with matrix components are denoted by bold capital letters; for example, \(\mathbf{X} = (W, Y, Z)\). They are called matrix component vectors. We use \(\mathbf {0}, O\) and \(\mathbf{O}\) to denote the zero vector, the zero matrix and the zero matrix component vector of appropriate dimensions. The three imaginary units of quaternions are denoted by \(\mathbf {i}, \mathbf {j}\) and \(\mathbf {k}\), which satisfy

$$\begin{aligned} \mathbf {i}^2= & {} \mathbf {j}^2 = \mathbf {k}^2 =\mathbf {i}\mathbf {j}\mathbf {k}= -1,\\ \mathbf {i}\mathbf {j}= & {} -\mathbf {j}\mathbf {i}= \mathbf {k}, \ \mathbf {j}\mathbf {k}= - \mathbf {k}\mathbf {j}= \mathbf {i}, \mathbf {k}\mathbf {i}= -\mathbf {i}\mathbf {k}= \mathbf {j}. \end{aligned}$$

These rules, along with the distributive law, determine the product of any two quaternions. The multiplication of quaternions is noncommutative.

Let \(x = x_0 + x_1\mathbf {i}+ x_2\mathbf {j}+ x_3\mathbf {k}\in {{\mathbb {Q}}}\), where \(x_0, x_1, x_2, x_3 \in {{\mathbb {R}}}\). The conjugate of x is

$$\begin{aligned} x^* = x_0 - x_1\mathbf {i}- x_2\mathbf {j}- x_3\mathbf {k}, \end{aligned}$$

the modulus of x is

$$\begin{aligned} |x| = |x^*| = \sqrt{xx^*} = \sqrt{x^*x} = \sqrt{x_0^2 + x_1^2 + x_2^2 + x_3^2}, \end{aligned}$$

and if \(x \not = 0\), then \(x^{-1} = {x^* \over |x|^2}\).
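These operations are easy to verify numerically. The following Python sketch is our own illustration (not part of the paper): it stores a quaternion as a 4-tuple of reals, implements the Hamilton product from the multiplication rules above, and checks noncommutativity and the identity \(x^{-1}x = 1\).

```python
import numpy as np

def qmul(x, y):
    # Hamilton product of x = x0 + x1 i + x2 j + x3 k and y, stored as 4-tuples
    x0, x1, x2, x3 = x
    y0, y1, y2, y3 = y
    return (x0*y0 - x1*y1 - x2*y2 - x3*y3,
            x0*y1 + x1*y0 + x2*y3 - x3*y2,
            x0*y2 - x1*y3 + x2*y0 + x3*y1,
            x0*y3 + x1*y2 - x2*y1 + x3*y0)

def qconj(x):
    return (x[0], -x[1], -x[2], -x[3])

def qmod(x):
    return np.sqrt(sum(c * c for c in x))

def qinv(x):
    m2 = qmod(x) ** 2
    return tuple(c / m2 for c in qconj(x))

x = (1.0, 2.0, -0.5, 3.0)
y = (0.3, -1.0, 2.0, 0.7)
print(qmul(x, y) == qmul(y, x))                             # False: noncommutative
print(np.allclose(qmul(qinv(x), x), (1.0, 0.0, 0.0, 0.0)))  # True: x^{-1} x = 1
```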

2.2 Quaternion Matrices

The collections of real, complex and quaternion \(m \times n\) matrices are denoted by \({{\mathbb {R}}}^{m \times n}\), \({{\mathbb {C}}}^{m \times n}\) and \({{\mathbb {Q}}}^{m \times n}\), respectively.

A quaternion matrix \(A= (a_{ij}) \in {{\mathbb {Q}}}^{m \times n}\) can be denoted as

$$\begin{aligned} A = A_0 + A_1\mathbf {i}+ A_2\mathbf {j}+ A_3\mathbf {k}, \end{aligned}$$
(1)

where \(A_0, A_1, A_2, A_3 \in {{\mathbb {R}}}^{m \times n}\). The transpose of A is \(A^\top = (a_{ji})\). The conjugate of A is \(\bar{A} = (a_{ij}^*)\). The conjugate transpose of A is \(A^* = (a_{ji}^*) = {\bar{A}}^\top \). For \(A, B \in {{\mathbb {Q}}}^{m \times n}\), their inner product is defined as

$$\begin{aligned} \langle A, B \rangle = \mathrm{Tr}(A^*B), \end{aligned}$$

where \(\mathrm{Tr}(A^*B)\) denotes the trace of \(A^*B\). The Frobenius norm of A is

$$\begin{aligned} \Vert A\Vert _F = \sqrt{\langle A, A \rangle } = \sqrt{\mathrm{Tr}(A^*A)} = \sqrt{\sum _{i=1}^m \sum _{j=1}^n |a_{ij}|^2}, \end{aligned}$$

the \(\ell _1\)-norm of \(A = (a_{ij}) \in {{\mathbb {Q}}}^{m \times n}\) is defined by \(\Vert A\Vert _1 = \sum _{i=1}^m \sum _{j=1}^n |a_{ij}|\), and the \(\ell _\infty \)-norm of A is defined by \(\Vert A\Vert _\infty = \max _{i, j} |a_{ij}|\) [9]. Analogous to real/complex vectors or matrices, we can also introduce the so-called \(\ell _0\)-norm for a quaternion matrix to measure the sparsity among quaternion entries. Specifically, the \(\ell _0\)-norm of \(A\in {{\mathbb {Q}}}^{m \times n}\), termed as \(\Vert A\Vert _0\), is defined as the number of nonzero entries in A. It is not a true norm as it does not satisfy the positive homogeneity as required for norms, and it is discrete. However, it plays an important role in sparse color image processing [21]. In some papers, it is called the counting function [10].
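To make these definitions concrete, the following sketch is our own illustration (not part of the paper), assuming a quaternion matrix is stored as four real numpy arrays \(A_0, A_1, A_2, A_3\); it evaluates the Frobenius, \(\ell _1\), \(\ell _\infty \) and \(\ell _0\) norms through the entrywise moduli \(|a_{ij}|\).

```python
import numpy as np

def q_norms(A0, A1, A2, A3, tol=0.0):
    """Norms of A = A0 + A1 i + A2 j + A3 k via the entrywise moduli |a_ij|."""
    mod = np.sqrt(A0**2 + A1**2 + A2**2 + A3**2)   # |a_ij| for each entry
    fro  = np.sqrt(np.sum(mod**2))                 # Frobenius norm ||A||_F
    l1   = np.sum(mod)                             # l1-norm ||A||_1
    linf = np.max(mod)                             # l_infinity-norm ||A||_inf
    l0   = np.count_nonzero(mod > tol)             # l0-"norm" (tol guards rounding)
    return fro, l1, linf, l0

rng = np.random.default_rng(0)
A = [rng.standard_normal((3, 4)) for _ in range(4)]
print(q_norms(*A))
```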

The following theorem for the Quaternion Singular Value Decomposition (QSVD) of a quaternion matrix was proved by Zhang [24].

Theorem 2.1

(Zhang 1997) Any quaternion matrix \(A \in {{\mathbb {Q}}}^{m \times n}\) has the following QSVD form

$$\begin{aligned} A = U\left( {\begin{array}{cc} \Sigma _r &{} O \\ O &{} O \end{array}}\right) V^*, \end{aligned}$$
(2)

where \(U \in {{\mathbb {Q}}}^{m \times m}\) and \(V \in {{\mathbb {Q}}}^{n \times n}\) are unitary, and \(\Sigma _r = \mathrm{diag}\{ \sigma _1, \cdots , \sigma _r\}\) is a real positive \(r \times r\) diagonal matrix, with \(\sigma _1 \ge \cdots \ge \sigma _r > 0\) being the positive singular values of A.

Similar to the real/complex matrix case, the rank of a quaternion matrix A, denoted as rank(A), is defined as the number of its positive singular values, i.e., \(\mathrm{rank}(A) = r\) in (2). The spectral norm and the nuclear norm of A are defined by \(\Vert A\Vert _S = \max \{ \sigma _1, \cdots , \sigma _r \}\) and \(\Vert A\Vert _* = \sum _{i=1}^r \sigma _i\), respectively (see, e.g., [9, Definitions 1 & 2]). Similar to the case of real matrices, it is revealed in [9, Proposition 2] that the convex envelope of the quaternion matrix rank function on the unit ball of the spectral norm is exactly the nuclear norm, which indicates that the nuclear norm can be regarded as the best convex and continuous surrogate for the rank function of quaternion matrices.

For a quaternion matrix \(A = A_0 + A_1\mathbf {i}+ A_2\mathbf {j}+ A_3\mathbf {k}\in {{\mathbb {Q}}}^{m \times n}\), its real representation [22] is a \(4m\times 4n\) real matrix of the form

$$\begin{aligned} A^R = \left( \begin{array} {cccc} A_0 &{} -A_1 &{} -A_2 &{} -A_3 \\ A_1 &{} A_0 &{} -A_3 &{} A_2 \\ A_2 &{} A_3 &{} A_0 &{} -A_1 \\ A_3 &{} -A_2 &{} A_1 &{} A_0 \end{array}\right) . \end{aligned}$$
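The block structure of \(A^R\) is straightforward to assemble. The sketch below is our own illustration; it builds \(A^R\) with numpy and checks that \(\Vert A^R\Vert _F = 2\Vert A\Vert _F\) (our own remark, since each component matrix appears four times in \(A^R\)).

```python
import numpy as np

def real_representation(A0, A1, A2, A3):
    """4m x 4n real representation A^R of A = A0 + A1 i + A2 j + A3 k."""
    return np.block([[A0, -A1, -A2, -A3],
                     [A1,  A0, -A3,  A2],
                     [A2,  A3,  A0, -A1],
                     [A3, -A2,  A1,  A0]])

rng = np.random.default_rng(1)
A = [rng.standard_normal((2, 3)) for _ in range(4)]
AR = real_representation(*A)
fro_A = np.sqrt(sum(np.sum(Ai**2) for Ai in A))     # ||A||_F from the components
print(AR.shape)                                     # (8, 12)
print(np.isclose(np.linalg.norm(AR), 2 * fro_A))    # each block appears four times
```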

A color image dataset can be represented as a pure quaternion matrix

$$\begin{aligned} A = A_1\mathbf {i}+ A_2\mathbf {j}+ A_3\mathbf {k}\in {{\mathbb {Q}}}^{m \times n}, \end{aligned}$$

where \(A_1, A_2\) and \(A_3\) are real \(m \times n\) matrices. We use \(O_{m \times n}\) to denote the zero matrix in \({{\mathbb {Q}}}^{m \times n}\). More knowledge of quaternion matrices can be found in [22, 24].

3 The QMO and Its Motivation

Consider a matrix component vector \(\mathbf{X} := (W, Y, Z)\), where \(W \in {{\mathbb {Q}}}^{m_1 \times n_1}\), \(Y \in {{\mathbb {Q}}}^{m_2 \times n_2}\) and \(Z \in {{\mathbb {Q}}}^{m_3 \times n_3}\). In general, matrix component vectors may have more components; three components are enough to illustrate the problem. In this paper, we are interested in the QMO problem, which takes the following form

$$\begin{aligned} \min \left\{ f(\mathbf{X}) : h_j(\mathbf{X}) = 0, j = 1, \cdots , p, g_k(\mathbf{X}) \le 0, k = 1, \cdots , q \right\} , \end{aligned}$$
(3)

where \(f, h_j, g_k :{{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) are real-valued functions, for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\), with \({{\mathbb {H}}} := {{\mathbb {Q}}}^{m_1 \times n_1} \times {{\mathbb {Q}}}^{m_2 \times n_2} \times {{\mathbb {Q}}}^{m_3 \times n_3}\).

For \(\mathbf{X} = (W, Y, Z)\), suppose that

$$\begin{aligned} \begin{array}{rl} W &{} = W_0 + W_1\mathbf {i}+ W_2\mathbf {j}+ W_3\mathbf {k}, \\ Y &{} = Y_0 + Y_1\mathbf {i}+ Y_2\mathbf {j}+ Y_3\mathbf {k}, \\ Z &{} = Z_0 + Z_1\mathbf {i}+ Z_2\mathbf {j}+ Z_3\mathbf {k}, \end{array} \end{aligned}$$

where \(W_i \in {{\mathbb {R}}}^{m_1 \times n_1}\), \(Y_i \in {\mathbb R}^{m_2 \times n_2}\) and \(Z_i \in {{\mathbb {R}}}^{m_3 \times n_3}\), for \(i = 0, 1, 2, 3\). Denote

$$\begin{aligned} R(\mathbf{X}) := (W_0, W_1, W_2, W_3, Y_0, Y_1, Y_2, Y_3, Z_0, Z_1, Z_2, Z_3), \end{aligned}$$

and call it the real form of \(\mathbf{X}\). Note that f can be regarded as a function in \(R(\mathbf{X})\). To avoid notational confusion, we introduce the real representation of f, termed as \(f^R\), with the relation that

$$\begin{aligned} f(\mathbf{X}) = f^R(R(\mathbf{X})),~~\forall \mathbf{X}\in {{\mathbb {H}}}. \end{aligned}$$
(4)

Similarly, we introduce \({h_j^R}\) and \(g_k^R\) for the real representations of \(h_j\) and \(g_k\), respectively. Such real representations of real-valued quaternion matrix functions greatly facilitate the subsequent analysis in calculus. The first benefit is the characterization of differentiability. Given \(f:{{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\), we say that f is locally Lipschitz continuous (continuously differentiable, or twice continuously differentiable, respectively) in \(\mathbf{X}\) if \(f^R\) is locally Lipschitz continuous (continuously differentiable, or twice continuously differentiable, respectively) in \(R(\mathbf{X})\). Based on these three types of differentiability properties, we introduce the following three assumptions for the QMO model (3).

Normal Assumption: f, \(h_j\)’s and \(g_k\)’s in (3) are all locally Lipschitz continuous in \(\mathbf{X}\).

Middle Assumption: f, \(h_j\)’s and \(g_k\)’s in (3) are all continuously differentiable in \(\mathbf{X}\).

Strong Assumption: f, \(h_j\)’s and \(g_k\)’s in (3) are all twice continuously differentiable in \(\mathbf{X}\).

The QMO model (3) provides a unified form for many quaternion matrix optimization problems arising from color image processing. Four selected models are elaborated below for illustration.

3.1 LRQD: Color Image Inpainting

The LRQD model, arising from the color image inpainting [4, 5], takes the form of

$$\begin{aligned} \min \left\{ \left\| YZ-W\right\| _F^2 : W_\Omega = D_\Omega \right\} , \end{aligned}$$
(5)

where \(W \in {{\mathbb {Q}}}^{m \times n}\), \(Y \in {{\mathbb {Q}}}^{m \times r}\) and \(Z \in {{\mathbb {Q}}}^{r \times n}\) are decision variables, \(D\in {{\mathbb {Q}}}^{m \times n}\) is a given quaternion dataset matrix, \(\Omega \) denotes the index set of the observed part of D, and the equality \(W_\Omega = D_\Omega \) means that the corresponding entries of W and D are equal on the index set \(\Omega \). The inherent low-rankness of the quaternion matrix for the given color image is exploited in terms of the decomposition YZ with a relatively small integer \(r<\min \{m,n\}\).

Clearly, if we take \(\mathbf{X} =(W,Y,Z)\), \(f(\mathbf{X}) = \left\| YZ-W\right\| _F^2\), and \(h_j(\mathbf{X}) = w_{il}-d_{il}\), with \((i,l)\) the jth index in \(\Omega \), for \(j= 1,\cdots , |\Omega |\) (the cardinality of \(\Omega \)), then (3) reduces to (5). Thus, (5) is a special case of the QMO problem which satisfies the strong assumption.

3.2 RQMC: Color Image Inpainting

In [9], Jia, Ng and Song proposed the following RQMC model for color image inpainting:

$$\begin{aligned} \min \left\{ \left\| Y\right\| _*+\lambda \left\| Z\right\| _1 : (Y+Z)_\Omega = D_\Omega \right\} , \end{aligned}$$
(6)

where \(Y, Z, D \in {{\mathbb {Q}}}^{m \times n}\), D is a given quaternion dataset matrix, \(\Omega \) denotes the index set of the observed part of D, and \(\lambda \) is a positive parameter.

Clearly, the RQMC problem (6) is a special case of the QMO problem (3). The objective function here is not differentiable, but it is locally Lipschitz continuous, and hence the problem satisfies the normal assumption. It is worth mentioning that the objective function is convex, with the first term being a convex surrogate for the rank function and the second term being a convex relaxation of the \(\ell _0\)-norm.

3.3 LRQA: Color Image Denoising and Inpainting

In [5], Chen, Xiao and Zhou proposed the following LRQA model for color image denoising and inpainting:

$$\begin{aligned} \min \left\{ {1 \over 2}\left\| W - D\right\| _F^2+\lambda \sum _i \phi (\sigma _i(W), \gamma ) \right\} , \end{aligned}$$
(7)

where \(W \in {{\mathbb {Q}}}^{m \times n}\) is the decision variable, serving as a surrogate quaternion matrix for the given quaternion dataset matrix \(D\in {{\mathbb {Q}}}^{m \times n}\), \(\sigma _i(W)\) denotes the ith largest singular value of W, \(\phi \) is a nonconvex surrogate for the rank function, such as the Schatten-\(\gamma \) function, the logarithm-determinant function and the Laplace function, and \(\gamma \) is a nonnegative parameter related to the specific \(\phi \). As one can see, the LRQA problem (7) is also a special case of the QMO problem (3).

3.4 SCID: Color Image Denoising and Inpainting

Inspired by the promising performance of the greedy-type methods tackling the original \(\ell _0\)-norm and rank function in vector/matrix cases [13, 25], we formulate the color image denoising and inpainting by the following sparse color image denoising (SCID) model

$$\begin{aligned} \min \left\{ f(\mathbf{X}) := {1 \over 2}\Vert L(Y + Z) - D\Vert _F^2 + \lambda \Vert Z\Vert _0 : \mathrm{rank}(Y) \le r \right\} , \end{aligned}$$
(8)

where \(D \in {{\mathbb {Q}}}^{m \times n}\) is an observed quaternion matrix of the color image, \(Y \in {{\mathbb {Q}}}^{m \times n}\) is a low rank quaternion matrix approximating D, r is a prescribed integer bounding the rank of Y from above, \(Z \in {{\mathbb {Q}}}^{m \times n}\) is the color image noise to be detected, \(\lambda > 0\) is a prescribed parameter, \(L : {{\mathbb {Q}}}^{m \times n} \rightarrow {{\mathbb {Q}}}^{m \times n}\) is a linear operator, for example, a projection operator indicating the observed area in the color image inpainting problem, \(\mathbf{X} = (Y, Z)\) is the quaternion matrix vector variable, and \(f : {{\mathbb {M}}} := {{\mathbb {Q}}}^{m \times n} \times {{\mathbb {Q}}}^{m \times n} \rightarrow {{\mathbb {R}}}\) is the objective function. The SCID problem is also a special case of the QMO problem (3). Since the \(\ell _0\)-norm in the objective function and the rank function in the constraint are both discontinuous, the SCID problem does not even satisfy the normal assumption. Thus, a special treatment for the SCID problem will be provided in Sect. 6 by exploiting properties of the \(\ell _0\)-norm and the rank function of quaternion matrices.

Besides the aforementioned instances of QMO, more specific models in the class of QMO can be found in the color image recognition problem [26] and the color image classification problem [23], etc.

4 R-Product and R-Linear Independence

In this section, we introduce R-product and R-linear independence, use them to calculate the first-order derivatives and the second-order partial derivatives of real-valued functions in quaternion matrix variables, and establish first- and second-order optimality conditions of the QMO problem.

Suppose that \(A = (a_{ij})\) and \(E = (e_{ij})\) are two real matrices of the same dimension, say \(m \times n\). Define the R-product (real product) of A and E as

$$\begin{aligned} A \cdot E := \sum _{i=1}^m \sum _{j=1}^n a_{ij}e_{ij}. \end{aligned}$$

Consider the matrix component vector space \({{\mathbb {H}}}\).

For \(\mathbf{A} := (B, C, D) \in {{\mathbb {H}}}\) and \(\mathbf{H} := (E, F, G) \in {{\mathbb {H}}}\), we may define the inner product of \(\mathbf{A}\) and \(\mathbf{H}\) as

$$\begin{aligned} \langle \mathbf{A}, \mathbf{H} \rangle = \langle B, E \rangle + \langle C, F \rangle + \langle D, G \rangle . \end{aligned}$$

Then define the R-product (real product) of \(\mathbf{A}\) and \(\mathbf{H}\) as the real part of \(\langle \mathbf{A}, \mathbf{H} \rangle \):

$$\begin{aligned} \mathbf{A} \cdot \mathbf{H} := \mathrm{Re}\langle \mathbf{A}, \mathbf{H} \rangle . \end{aligned}$$

We have the following proposition.

Proposition 4.1

Suppose that \(\mathbf{A} = (B, C, D) \in {{\mathbb {H}}}\) and \(\mathbf{H} = (E, F, G) \in {{\mathbb {H}}}\). Assume that \(B = B_0 + B_1\mathbf {i}+ B_2\mathbf {j}+ B_3\mathbf {k}, \cdots , G = G_0 + G_1\mathbf {i}+ G_2\mathbf {j}+ G_3\mathbf {k}\), where \(B_i, C_i, D_i, E_i, F_i\) and \(G_i\) are real matrices with corresponding dimensions. Then the R-product of \(\mathbf{A}\) and \(\mathbf{H}\) has the following form:

$$\begin{aligned} \mathbf{A} \cdot \mathbf{H} = \sum _{i=0}^3 (B_i \cdot E_i + C_i \cdot F_i + D_i \cdot G_i). \end{aligned}$$

Proof

Let

$$\begin{aligned} B= & {} B_0 + B_1\mathbf {i}+ B_2\mathbf {j}+ B_3\mathbf {k},\ E = E_0 + E_1\mathbf {i}+ E_2\mathbf {j}+ E_3\mathbf {k},\\ B_i= & {} (b_{pqi}),\ E_i = (e_{pqi}),\ \mathrm{for}\ i = 0, 1, 2, 3. \end{aligned}$$

Then

$$\begin{aligned}&\langle B, E \rangle = \mathrm{Tr}(B^*E) \\&\quad = \sum _{p=1}^{m_1} \sum _{q=1}^{n_1} \left( b_{pq0} - b_{pq1}\mathbf {i}- b_{pq2}\mathbf {j}- b_{pq3}\mathbf {k}\right) \left( e_{pq0} + e_{pq1}\mathbf {i}+ e_{pq2}\mathbf {j}+ e_{pq3}\mathbf {k}\right) . \end{aligned}$$

For the real part of each such term, we have

$$\begin{aligned}&\mathrm{Re}\left[ \left( b_{pq0} - b_{pq1}\mathbf {i}- b_{pq2}\mathbf {j}- b_{pq3}\mathbf {k}\right) \left( e_{pq0} + e_{pq1}\mathbf {i}+ e_{pq2}\mathbf {j}+ e_{pq3}\mathbf {k}\right) \right] \nonumber \\&\quad = \sum _{i=0}^3 b_{pqi}e_{pqi}. \end{aligned}$$

This implies that the real part of \(\langle B, E \rangle \) is \(\sum _{i=0}^3 B_i \cdot E_i\). Hence, the conclusion holds.

\(\square \)

This proposition reveals the meaning of the R-product. In Sect. 5, it will be used to connect our derivative concept with the subgradient concept of Jia, Ng and Song [9].
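Proposition 4.1 can also be checked numerically. The sketch below is our own illustration: it multiplies quaternion matrices componentwise by the Hamilton rules, computes \(\mathrm{Re}\,\mathrm{Tr}(B^*E)\) directly, and compares it with \(\sum _{i=0}^3 B_i \cdot E_i\).

```python
import numpy as np

def qmatmul(A, B):
    """Product of quaternion matrices stored as lists [A0, A1, A2, A3]."""
    A0, A1, A2, A3 = A
    B0, B1, B2, B3 = B
    return [A0@B0 - A1@B1 - A2@B2 - A3@B3,
            A0@B1 + A1@B0 + A2@B3 - A3@B2,
            A0@B2 - A1@B3 + A2@B0 + A3@B1,
            A0@B3 + A1@B2 - A2@B1 + A3@B0]

def qconj_t(B):
    """Conjugate transpose B^*, componentwise."""
    B0, B1, B2, B3 = B
    return [B0.T, -B1.T, -B2.T, -B3.T]

rng = np.random.default_rng(2)
B = [rng.standard_normal((3, 5)) for _ in range(4)]
E = [rng.standard_normal((3, 5)) for _ in range(4)]

re_inner = np.trace(qmatmul(qconj_t(B), E)[0])            # Re Tr(B* E)
r_product = sum(np.sum(Bi * Ei) for Bi, Ei in zip(B, E))  # sum_i B_i . E_i
print(np.isclose(re_inner, r_product))                    # True
```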

Suppose that \(\mathbf{A^{(j)}} := (A^{(j)}, B^{(j)}, C^{(j)}) \in {{\mathbb {H}}}\) for \(j = 1, \cdots , p\). We say that \(\left\{ \mathbf{A^{(j)}} : j = 1, \cdots , p \right\} \) is R-linearly independent if there do not exist real numbers \(\alpha _j\), \(j = 1, \cdots , p\), not all zero, such that

$$\begin{aligned} \sum _{j=1}^p \alpha _j \mathbf{A^{(j)}} = \mathbf{O}. \end{aligned}$$
(9)

4.1 First-Order Derivatives

Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) satisfies the normal assumption, with \(\mathbf{X} = (W, Y, Z)\), and f is differentiable at \(\mathbf{X}\) with respect to \(W_i, Y_i\) and \(Z_i\) for \(i = 0, 1, 2, 3\). Then we define the partial derivatives and gradient of f at \(\mathbf{X}\) as:

$$\begin{aligned} {\partial \over \partial W} f(\mathbf{X})= & {} {\partial \over \partial W_0} f(\mathbf{X}) + {\partial \over \partial W_1} f(\mathbf{X})\mathbf {i}+ {\partial \over \partial W_2} f(\mathbf{X})\mathbf {j}+ {\partial \over \partial W_3} f(\mathbf{X})\mathbf {k},\\ {\partial \over \partial Y} f(\mathbf{X})= & {} {\partial \over \partial Y_0} f(\mathbf{X}) + {\partial \over \partial Y_1} f(\mathbf{X})\mathbf {i}+ {\partial \over \partial Y_2} f(\mathbf{X})\mathbf {j}+ {\partial \over \partial Y_3} f(\mathbf{X})\mathbf {k},\\ {\partial \over \partial Z} f(\mathbf{X})= & {} {\partial \over \partial Z_0} f(\mathbf{X}) + {\partial \over \partial Z_1} f(\mathbf{X})\mathbf {i}+ {\partial \over \partial Z_2} f(\mathbf{X})\mathbf {j}+ {\partial \over \partial Z_3} f(\mathbf{X})\mathbf {k},\\ \nabla f(\mathbf{X})= & {} \left( {\partial \over \partial W} f(\mathbf{X}), {\partial \over \partial Y} f(\mathbf{X}), {\partial \over \partial Z} f(\mathbf{X})\right) . \end{aligned}$$

We see that \(\nabla f(\mathbf{X}) \in {{\mathbb {H}}}\).

Define the directional derivative of f at \(\mathbf{X} = (W, Y, Z) \in {{\mathbb {H}}}\) in the direction \(\varvec{\Delta X} = (\Delta W, \Delta Y, \Delta Z) \in {{\mathbb {H}}}\) as

$$\begin{aligned} f'(\mathbf{X}; \varvec{\Delta X}) = {\mathop {\mathop {\lim }\limits _{t \rightarrow 0}}\limits _{t \in {\mathbb {R}}}} {f(\mathbf{X} + t\varvec{\Delta X}) - f(\mathbf{X}) \over t}. \end{aligned}$$

Note that while the gradient of f is in \({{\mathbb {H}}}\), the directional derivative of f is real. They are connected via the R-product operation in \({{\mathbb {H}}}\).

Proposition 4.2

Suppose that \(\mathbf{X} = (W, Y, Z) \in {{\mathbb {H}}}\), \(\varvec{\Delta X} = (\Delta W, \Delta Y, \Delta Z) \in {{\mathbb {H}}}\), and f satisfies the middle assumption. Then

$$\begin{aligned} f'(\mathbf{X}; \varvec{\Delta X}) = \nabla f(\mathbf{X}) \cdot \varvec{\Delta X}. \end{aligned}$$

Furthermore, we have

$$\begin{aligned} f(\mathbf{X} + \varvec{\Delta X}) = f(\mathbf{X}) + \nabla f(\mathbf{X}) \cdot \varvec{\Delta X} + o(\Vert \varvec{\Delta X}\Vert _F). \end{aligned}$$

Proof

We may regard \(f(\mathbf{X})\) as a function of \(W_i, Y_i\) and \(Z_i\), \(i = 0, 1, 2, 3\). Then

$$\begin{aligned}&f'(\mathbf{X}; \varvec{\Delta X})\\&\quad = \sum _{i=0}^3 \left[ {\partial \over \partial W_i}f(\mathbf{X})\Delta W_i + {\partial \over \partial Y_i}f(\mathbf{X})\Delta Y_i + {\partial \over \partial Z_i}f(\mathbf{X})\Delta Z_i\right] \\&\quad = \nabla f(\mathbf{X}) \cdot \varvec{\Delta X}. \end{aligned}$$

Furthermore,

$$\begin{aligned}&f(\mathbf{X} + \varvec{\Delta X}) - f(\mathbf{X})\\&\quad = \sum _{i=0}^3 \left[ {\partial \over \partial W_i}f(\mathbf{X})\Delta W_i + o(\Vert {\Delta W_i }\Vert ) \right. \nonumber \\&\qquad \left. + {\partial \over \partial Y_i}f(\mathbf{X})\Delta Y_i + o(\Vert {\Delta Y_i }\Vert ) + {\partial \over \partial Z_i}f(\mathbf{X})\Delta Z_i + o(\Vert {\Delta Z_i }\Vert ) \right] \\&\quad = \nabla f(\mathbf{X}) \cdot \varvec{\Delta X} + o(\Vert \varvec{\Delta X}\Vert _F). \end{aligned}$$

\(\square \)

Now, we may study the first-order optimality conditions of (3).

Theorem 4.3

Suppose that the functions \(f, h_j\) and \(g_k\) for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\) satisfy the middle assumption. Assume that \(\mathbf{X}^{\#} = \left( W^{\#}, Y^{\#}, Z^{\#}\right) \in {{\mathbb {H}}}\) is an optimal solution of (3). If

$$\begin{aligned} \left\{ \nabla h_j\left( \mathbf{X}^{\#}\right) , j = 1, \cdots , p \right\} \bigcup \left\{ \nabla g_k\left( \mathbf{X}^{\#}\right) : g_k(\mathbf{X^{\#}}) = 0, k = 1, \cdots , q \right\} \end{aligned}$$
(10)

is R-linearly independent, then there are Lagrangian multipliers \(\lambda _j, \mu _k \in {{\mathbb {R}}}\) for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\), such that

$$\begin{aligned}&\nabla f\left( \mathbf{X}^{\#}\right) + \sum _{j=1}^p \lambda _j \nabla h_j\left( \mathbf{X}^{\#}\right) + \sum _{k=1}^q \mu _k \nabla g_k\left( \mathbf{X}^{\#}\right) = \mathbf{O}, \end{aligned}$$
(11)
$$\begin{aligned}&h_j\left( \mathbf{X}^{\#}\right) = 0, \ \ j = 1, \cdots , p, \end{aligned}$$
(12)
$$\begin{aligned}&g_k\left( \mathbf{X}^{\#}\right) \le 0, \mu _k \ge 0, \mu _kg_k\left( \mathbf{X}^{\#}\right) =0, \ \ k = 1, \cdots , q. \end{aligned}$$
(13)

Proof

Again, we may regard \(f, h_j, g_k\) for \(j = 1, \cdots , p\), \(k = 1, \cdots , q\) as functions of \(W_i, Y_i\) and \(Z_i\), \(i = 0, 1, 2, 3\). Then (10) is equivalent to the linear independence constraint qualification (LICQ) of such an optimization problem in \(W_i, Y_i\) and \(Z_i\), \(i = 0, 1, 2, 3\). Then, from the first-order optimality conditions for real constrained optimization, we have (11)-(13). \(\square \)

Note that the Lagrangian multipliers are real numbers. We call \(\mathbf{X}^{\#} \in {{\mathbb {H}}}\), which satisfies (11), (12) and (13) with some Lagrangian multipliers, a stationary point of (3). The R-linear independence condition in Theorem 4.3 may be replaced by other constraint qualifications for nonlinear programs.

We now consider the LRQD problem (5). Let \(f(\mathbf{X}) = {1 \over 2}\Vert YZ-W\Vert _F^2\). Then by [4], we have

$$\begin{aligned} {\partial \over \partial W} f(\mathbf{X})= & {} W-YZ,\end{aligned}$$
(14)
$$\begin{aligned} {\partial \over \partial Y} f(\mathbf{X})= & {} (YZ-W)Z^*,\end{aligned}$$
(15)
$$\begin{aligned} {\partial \over \partial Z} f(\mathbf{X})= & {} Y^*(YZ-W) \end{aligned}$$
(16)

and the following theorem.

Theorem 4.4

Assume that \(\mathbf{X}^{\#} = \left( W^{\#}, Y^{\#}, Z^{\#}\right) \in {{\mathbb {H}}}\) is an optimal solution of (5). Then \(\mathbf{X}^{\#}\) is a stationary point of (5), i.e.,

$$\begin{aligned} \begin{array}{rl} W^{\#}_{\Omega ^C} &{} = (Y^{\#}Z^{\#})_{\Omega ^C}, \\ \left( Y^{\#}Z^{\#}-W^{\#}\right) \left( Z^{\#}\right) ^{*} &{} = O_{m \times r}, \\ \left( Y^{\#}\right) ^{*}\left( Y^{\#}Z^{\#}-W^{\#}\right) &{} = O_{r \times n}, \\ W^{\#}_{\Omega } &{} = D_{\Omega }, \end{array} \end{aligned}$$

where \(\Omega ^C\) is the complementary set of \(\Omega \).
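The partial derivative formula (14) can be sanity-checked by finite differences on the real components. The sketch below is our own illustration for \(f(\mathbf{X}) = {1 \over 2}\Vert YZ-W\Vert _F^2\); it compares \((W-YZ)_0\) with a forward-difference approximation of \({\partial \over \partial W_0} f\).

```python
import numpy as np

def qmatmul(A, B):
    """Product of quaternion matrices stored as lists [A0, A1, A2, A3]."""
    A0, A1, A2, A3 = A
    B0, B1, B2, B3 = B
    return [A0@B0 - A1@B1 - A2@B2 - A3@B3,
            A0@B1 + A1@B0 + A2@B3 - A3@B2,
            A0@B2 - A1@B3 + A2@B0 + A3@B1,
            A0@B3 + A1@B2 - A2@B1 + A3@B0]

def f(W, Y, Z):
    """f(X) = 0.5 * ||YZ - W||_F^2, computed from the real components."""
    R = [Pi - Wi for Pi, Wi in zip(qmatmul(Y, Z), W)]   # YZ - W, componentwise
    return 0.5 * sum(np.sum(Ri**2) for Ri in R)

rng = np.random.default_rng(3)
m, r, n = 4, 2, 5
W = [rng.standard_normal((m, n)) for _ in range(4)]
Y = [rng.standard_normal((m, r)) for _ in range(4)]
Z = [rng.standard_normal((r, n)) for _ in range(4)]

# Formula (14): the real component of dF/dW is (W - YZ)_0.
grad_W0 = W[0] - qmatmul(Y, Z)[0]

# Forward-difference approximation of dF/dW_0, entry by entry.
eps, fd = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Wp = [Wi.copy() for Wi in W]
        Wp[0][i, j] += eps
        fd[i, j] = (f(Wp, Y, Z) - f(W, Y, Z)) / eps

print(np.allclose(fd, grad_W0, atol=1e-4))   # True up to discretization error
```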

We now study the product rule and the chain rule of first-order derivatives.

Theorem 4.5

Suppose that \(f(\mathbf{X}), g(\mathbf{X}): {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) satisfy the middle assumption, with \(\mathbf{X} = (W, Y, Z)\). Then

$$\begin{aligned} \nabla (f(\mathbf{X})g(\mathbf{X})) = f(\mathbf{X})\nabla g(\mathbf{X}) + g(\mathbf{X})\nabla f(\mathbf{X}). \end{aligned}$$
(17)

Proof

We have

$$\begin{aligned}&{\partial \over \partial W} (f(\mathbf{X})g(\mathbf{X})) \\&\quad = {\partial \over \partial W_0} (f(\mathbf{X})g(\mathbf{X})) + {\partial \over \partial W_1} (f(\mathbf{X})g(\mathbf{X}))\mathbf {i}+ {\partial \over \partial W_2} (f(\mathbf{X})g(\mathbf{X}))\mathbf {j}+ {\partial \over \partial W_3} (f(\mathbf{X})g(\mathbf{X}))\mathbf {k}\\&\quad = \left[ f(\mathbf{X}){\partial \over \partial W_0}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W_0}f(\mathbf{X})\right] + \left[ f(\mathbf{X}){\partial \over \partial W_1}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W_1}f(\mathbf{X})\right] \mathbf {i}\\&\qquad + \left[ f(\mathbf{X}){\partial \over \partial W_2}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W_2}f(\mathbf{X})\right] \mathbf {j}+ \left[ f(\mathbf{X}){\partial \over \partial W_3}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W_3}f(\mathbf{X})\right] \mathbf {k}\\&\quad = f(\mathbf{X}){\partial \over \partial W}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W}f(\mathbf{X}). \end{aligned}$$

Similarly, we have

$$\begin{aligned} {\partial \over \partial Y} (f(\mathbf{X})g(\mathbf{X})) = f(\mathbf{X}){\partial \over \partial Y}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial Y}f(\mathbf{X}) \end{aligned}$$

and

$$\begin{aligned} {\partial \over \partial Z} (f(\mathbf{X})g(\mathbf{X})) = f(\mathbf{X}){\partial \over \partial Z}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial Z}f(\mathbf{X}). \end{aligned}$$

Then

$$\begin{aligned}&\nabla (f(\mathbf{X})g(\mathbf{X}))\\&\quad = \left( {\partial \over \partial W} (f(\mathbf{X})g(\mathbf{X})), {\partial \over \partial Y} (f(\mathbf{X})g(\mathbf{X})), {\partial \over \partial Z} (f(\mathbf{X})g(\mathbf{X}))\right) \\&\quad = \left( f(\mathbf{X}){\partial \over \partial W}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial W}f(\mathbf{X}), f(\mathbf{X}){\partial \over \partial Y}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial Y}f(\mathbf{X}),\right. \\&\left. \qquad f(\mathbf{X}){\partial \over \partial Z}g(\mathbf{X}) + g(\mathbf{X}){\partial \over \partial Z}f(\mathbf{X})\right) \\&\quad = f(\mathbf{X})\nabla g(\mathbf{X}) + g(\mathbf{X})\nabla f(\mathbf{X}). \end{aligned}$$

\(\square \)

Similarly, we may prove the following theorem.

Theorem 4.6

Suppose that \(f(\mathbf{X}): {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) satisfies the middle assumption, with \(\mathbf{X} = (W, Y, Z)\), and \(\phi : {\mathbb R} \rightarrow {{\mathbb {R}}}\) is continuously differentiable. Then

$$\begin{aligned} \nabla \phi (f(\mathbf{X})) = \phi '(f(\mathbf{X}))\nabla f(\mathbf{X}). \end{aligned}$$
(18)

In [4], the Kurdyka-Łojasiewicz inequality was established for real-valued functions of quaternion matrix variables.

4.2 Second-Order Derivatives

Suppose that \(f(\mathbf{X}) = f(W, Y, Z) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) is twice continuously differentiable in the sense of the strong assumption. Then the second-order derivative of f exists. We first consider the second-order partial derivatives of f. As an example, we consider \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\). It is more convenient to consider \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y\) and \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W\).

Let \({\partial \over \partial W}f(\mathbf{X})\) be as defined in the previous subsection. Let \(\Delta Y \in {{\mathbb {Q}}}^{m_2 \times n_2}\). Suppose that

$$\begin{aligned}&{\partial \over \partial W}f(W, Y+\Delta Y, Z) - {\partial \over \partial W}f(W, Y, Z) \\&\quad = \phi (W, Y+\Delta Y, Z) + \eta (W, Y+\Delta Y, Z), \end{aligned}$$

where \(\phi \) is R-linear in \(\Delta Y\) in the sense that for any \(\alpha , \beta \in {{\mathbb {R}}}\) and \(\Delta Y^{(1)}, \Delta Y^{(2)} \in {{\mathbb {Q}}}^{m_2 \times n_2}\),

$$\begin{aligned}&\phi \left( W, Y+\alpha \Delta Y^{(1)} + \beta \Delta Y^{(2)}, Z\right) \\&\quad = \alpha \phi \left( W, Y+\Delta Y^{(1)}, Z\right) + \beta \phi \left( W, Y+\Delta Y^{(2)}, Z\right) , \end{aligned}$$

and

$$\begin{aligned} \eta (W, Y+\Delta Y, Z) = o\left( \left\| \Delta Y\right\| _F\right) , \end{aligned}$$

i.e.,

$$\begin{aligned} \lim _{\left\| \Delta Y\right\| _F \rightarrow 0} {\eta (W, Y+\Delta Y, Z) \over \left\| \Delta Y\right\| _F} = O. \end{aligned}$$

Then we have

$$\begin{aligned} {\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y = \phi (W, Y+\Delta Y, Z). \end{aligned}$$

Later, we will see that we may use this approach to calculate \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y\) and \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W\), for f defined by the LRQD problem (5).

We may express the other second-order partial derivatives of f similarly.

Proposition 4.7

Under the strong assumption, we have

$$\begin{aligned}&{\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W = {\partial ^2 \over \partial W \partial Y}f(\mathbf{X})\Delta W \cdot \Delta Y,\\&{\partial ^2 \over \partial Z \partial W}f(\mathbf{X})\Delta Z \cdot \Delta W = {\partial ^2 \over \partial W \partial Z}f(\mathbf{X})\Delta W \cdot \Delta Z, \end{aligned}$$

and

$$\begin{aligned} {\partial ^2 \over \partial Y \partial Z}f(\mathbf{X})\Delta Y \cdot \Delta Z = {\partial ^2 \over \partial Z \partial Y}f(\mathbf{X})\Delta Z \cdot \Delta Y. \end{aligned}$$

Proof

Let \(\Delta W = \Delta W_0 + \Delta W_1\mathbf {i}+ \Delta W_2\mathbf {j}+ \Delta W_3\mathbf {k}\in {{\mathbb {Q}}}^{m_1 \times n_1}\) and \(\Delta Y = \Delta Y_0 + \Delta Y_1\mathbf {i}+ \Delta Y_2\mathbf {j}+ \Delta Y_3\mathbf {k}\in {{\mathbb {Q}}}^{m_2 \times n_2}\). Then \({\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W\) and \({\partial ^2 \over \partial W \partial Y}f(\mathbf{X})\Delta W \cdot \Delta Y\) are real. We have

$$\begin{aligned}&{\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W = \sum _{i, j=0}^3 {\partial ^2 \over \partial Y_i \partial W_j}f(\mathbf{X})\Delta Y_i \cdot \Delta W_j\\&\quad = \sum _{i, j=0}^3 {\partial ^2 \over \partial W_j \partial Y_i}f(\mathbf{X})\Delta W_j \cdot \Delta Y_i = {\partial ^2 \over \partial W \partial Y}f(\mathbf{X})\Delta W \cdot \Delta Y. \end{aligned}$$

The other two equalities can be proved similarly. \(\square \)

Then we may define \(\nabla ^2 f(\mathbf{X})\) by

$$\begin{aligned}&{1 \over 2}\nabla ^2 f(\mathbf{X}) \varvec{\Delta X} \cdot \varvec{\Delta X} = {\partial ^2 \over \partial Y \partial W}f(\mathbf{X})\Delta Y \cdot \Delta W + {\partial ^2 \over \partial Z \partial W}f(\mathbf{X})\Delta Z \cdot \Delta W \\&\qquad + {\partial ^2 \over \partial Y \partial Z}f(\mathbf{X})\Delta Y \cdot \Delta Z\\&\quad + {1 \over 2}{\partial ^2 \over \partial W^2}f(\mathbf{X})\Delta W \cdot \Delta W + {1 \over 2}{\partial ^2 \over \partial Y^2}f(\mathbf{X})\Delta Y \cdot \Delta Y + {1 \over 2}{\partial ^2 \over \partial Z^2}f(\mathbf{X})\Delta Z \cdot \Delta Z. \end{aligned}$$

Here, \(\varvec{\Delta X} = (\Delta W, \Delta Y, \Delta Z) \in {\mathbb H}\).

If \({1 \over 2}\nabla ^2 f(\mathbf{X}) \varvec{\Delta X} \cdot \varvec{\Delta X} \ge 0\) for any \(\varvec{\Delta X} \in {{\mathbb {H}}}\), then we say that \(\nabla ^2 f\) is positive semi-definite at \(\mathbf{X}\). If \({1 \over 2}\nabla ^2 f(\mathbf{X}) \varvec{\Delta X} \cdot \varvec{\Delta X} > 0\) for any \(\varvec{\Delta X} \in {{\mathbb {H}}}\) and \(\varvec{\Delta X} \not = \mathbf{O}\), then we say that \(\nabla ^2 f\) is positive definite at \(\mathbf{X}\).

Proposition 4.8

Under the strong assumption, we have

$$\begin{aligned} f(\mathbf{X} + \varvec{\Delta X}) = f(\mathbf{X}) + \nabla f(\mathbf{X}) \cdot \varvec{\Delta X} + {1 \over 2} \nabla ^2 f(\mathbf{X})\varvec{\Delta X} \cdot \varvec{\Delta X} + o(\Vert \varvec{\Delta X}\Vert _F^2). \end{aligned}$$
(19)

Proof

Again, we may regard \(f(\mathbf{X})\) as a function of \(W_i, Y_i\) and \(Z_i\), \(i = 0, 1, 2, 3\). Then

$$\begin{aligned}&f(\mathbf{X} + \varvec{\Delta X}) - f(\mathbf{X}) \\&\quad = \sum _{i=0}^3 \left[ {\partial \over \partial W_i}f(\mathbf{X})\Delta W_i + {\partial \over \partial Y_i}f(\mathbf{X})\Delta Y_i + {\partial \over \partial Z_i}f(\mathbf{X})\Delta Z_i\right] \\&\qquad + \sum _{i, j=0}^3 {\partial ^2 \over \partial Y_i \partial W_j}f(\mathbf{X})\Delta Y_i \cdot \Delta W_j\\&\quad + \sum _{i, j=0}^3 {\partial ^2 \over \partial Z_i \partial W_j}f(\mathbf{X})\Delta Z_i \cdot \Delta W_j + \sum _{i, j=0}^3 {\partial ^2 \over \partial Y_i \partial Z_j}f(\mathbf{X})\Delta Y_i \cdot \Delta Z_j \\&\qquad +{1 \over 2}\sum _{i, j=0}^3 {\partial ^2 \over \partial W_i \partial W_j}f(\mathbf{X})\Delta W_i \cdot \Delta W_j\\&\quad + {1 \over 2}\sum _{i, j=0}^3 {\partial ^2 \over \partial Y_i \partial Y_j}f(\mathbf{X})\Delta Y_i \cdot \Delta Y_j + {1 \over 2}\sum _{i, j=0}^3 {\partial ^2 \over \partial Z_i \partial Z_j}f(\mathbf{X})\Delta Z_i \cdot \Delta Z_j\\&\quad + \sum _{i=0}^3 \left[ o(\Vert \Delta W_i\Vert _F^2) + o(\Vert \Delta Y_i\Vert _F^2) + o(\Vert \Delta Z_i\Vert _F^2) \right] \\&\quad = \nabla f(\mathbf{X}) \cdot \varvec{\Delta X} + {1 \over 2} \nabla ^2 f(\mathbf{X})\varvec{\Delta X} \cdot \varvec{\Delta X} + o(\Vert \varvec{\Delta X}\Vert _F^2). \end{aligned}$$

We have (19). \(\square \)

We now consider the function f in the LRQD problem (5). Then, we have the following result.

Theorem 4.9

Suppose that \(f(\mathbf{X}) = {1 \over 2}\Vert YZ-W\Vert _F^2\), where \(W \in {\mathbb Q}^{m \times n}\), \(Y \in {{\mathbb {Q}}}^{m \times r}\) and \(Z \in {{\mathbb {Q}}}^{r \times n}\). Then,

$$\begin{aligned} {\partial ^2 \over \partial W^2}f(\mathbf{X}) \Delta W= & {} \Delta W,\end{aligned}$$
(20)
$$\begin{aligned} {\partial ^2 \over \partial Y \partial W}f(\mathbf{X}) \Delta Y= & {} - \Delta Y Z, \end{aligned}$$
(21)
$$\begin{aligned} {\partial ^2 \over \partial Z \partial W}f(\mathbf{X}) \Delta Z= & {} - Y \Delta Z, \end{aligned}$$
(22)
$$\begin{aligned} {\partial ^2 \over \partial W \partial Y}f(\mathbf{X}) \Delta W= & {} - \Delta W Z^*, \end{aligned}$$
(23)
$$\begin{aligned} {\partial ^2 \over \partial Y^2}f(\mathbf{X}) \Delta Y= & {} \Delta Y ZZ^*, \end{aligned}$$
(24)
$$\begin{aligned} {\partial ^2 \over \partial Z \partial Y}f(\mathbf{X}) \Delta Z= & {} (YZ-W)(\Delta Z)^* + Y(\Delta Z)Z^*, \end{aligned}$$
(25)
$$\begin{aligned} {\partial ^2 \over \partial W \partial Z}f(\mathbf{X}) \Delta W= & {} - Y^* \Delta W, \end{aligned}$$
(26)
$$\begin{aligned} {\partial ^2 \over \partial Y \partial Z}f(\mathbf{X}) \Delta Y= & {} (\Delta Y)^*(YZ-W) + Y^*(\Delta Y)Z, \end{aligned}$$
(27)
$$\begin{aligned} {\partial ^2 \over \partial Z^2}f(\mathbf{X}) \Delta Z= & {} Y^*Y \Delta Z. \end{aligned}$$
(28)

Proof

By (14), we have

$$\begin{aligned} {\partial \over \partial W} f(W, Y, Z) = W-YZ. \end{aligned}$$

Thus,

$$\begin{aligned}&{\partial \over \partial W} f(W+\Delta W, Y, Z) = W+\Delta W -YZ,\\&\quad {\partial \over \partial W} f(W+\Delta W, Y, Z) - {\partial \over \partial W} f(W, Y, Z) = \Delta W. \end{aligned}$$

This implies (20). Using (14) again, we also have

$$\begin{aligned}&{\partial \over \partial W} f(W, Y+\Delta Y, Z) = W -(Y+\Delta Y)Z,\\&\quad {\partial \over \partial W} f(W, Y+\Delta Y, Z) - {\partial \over \partial W} f(W, Y, Z) = - \Delta YZ. \end{aligned}$$

This implies (21). The formulas (22-28) can be proved similarly. \(\square \)
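Since \({\partial \over \partial W}f\) is affine in Y, the difference used in the proof of (21) is exact rather than merely first-order accurate. The short sketch below, our own illustration, confirms that \({\partial \over \partial W}f(W, Y+\Delta Y, Z) - {\partial \over \partial W}f(W, Y, Z) = -\Delta Y Z\) componentwise.

```python
import numpy as np

def qmatmul(A, B):
    """Product of quaternion matrices stored as lists [A0, A1, A2, A3]."""
    A0, A1, A2, A3 = A
    B0, B1, B2, B3 = B
    return [A0@B0 - A1@B1 - A2@B2 - A3@B3,
            A0@B1 + A1@B0 + A2@B3 - A3@B2,
            A0@B2 - A1@B3 + A2@B0 + A3@B1,
            A0@B3 + A1@B2 - A2@B1 + A3@B0]

def dW(W, Y, Z):
    """Formula (14): dF/dW = W - YZ, stored componentwise."""
    return [Wi - Pi for Wi, Pi in zip(W, qmatmul(Y, Z))]

rng = np.random.default_rng(4)
m, r, n = 4, 2, 5
W  = [rng.standard_normal((m, n)) for _ in range(4)]
Y  = [rng.standard_normal((m, r)) for _ in range(4)]
Z  = [rng.standard_normal((r, n)) for _ in range(4)]
dY = [rng.standard_normal((m, r)) for _ in range(4)]

Yp  = [Yi + dYi for Yi, dYi in zip(Y, dY)]                # Y + DeltaY
lhs = [a - b for a, b in zip(dW(W, Yp, Z), dW(W, Y, Z))]  # difference of (14)
rhs = [-Ci for Ci in qmatmul(dY, Z)]                      # formula (21): -DeltaY Z
print(all(np.allclose(a, b) for a, b in zip(lhs, rhs)))   # True
```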

We study the second-order optimality conditions of (3) now.

Suppose that \(\mathbf{X}^{\#} \in {{\mathbb {H}}}\) is a stationary point of (3), i.e., there exist some Lagrangian multipliers \(\lambda _j, \mu _k \in {{\mathbb {R}}}\) for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\), such that (11), (12) and (13) are satisfied. Let

$$\begin{aligned} I(\mathbf{X}^{\#}) = \left\{ k \in \{1, \cdots , q\} : g_k(\mathbf{X}^{\#})= 0 \right\} . \end{aligned}$$

We say that \(\varvec{\Delta X} \in {{\mathbb {H}}}\) is in the critical cone \(C(\mathbf{X}^{\#})\) of (3) at \(\mathbf{X}^{\#}\), if

$$\begin{aligned} \nabla f(\mathbf{X}^{\#}) \cdot \varvec{\Delta X}= & {} 0,\end{aligned}$$
(29)
$$\begin{aligned} \nabla h_j(\mathbf{X}^{\#}) \cdot \varvec{\Delta X}= & {} 0,\ \mathrm{for}\ j =1, \cdots , p,\end{aligned}$$
(30)
$$\begin{aligned} \nabla g_k(\mathbf{X}^{\#}) \cdot \varvec{\Delta X}= & {} 0,\ \mathrm{for}\ k \in I(\mathbf{X}^{\#}). \end{aligned}$$
(31)

By regarding \(f, h_j, g_k\) for \(j = 1, \cdots , p\), \(k = 1, \cdots , q\) as functions of \(W_i, Y_i\) and \(Z_i\), \(i = 0, 1, 2, 3\), from the second-order optimality necessary condition and sufficient condition for real constrained optimization, we have the following two theorems.

Theorem 4.10

Suppose that the functions \(f, h_j\) and \(g_k\) for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\) satisfy the strong assumption. Assume that \(\mathbf{X}^{\#} = \left( W^{\#}, Y^{\#}, Z^{\#}\right) \in {{\mathbb {H}}}\) is an optimal solution of (3), and (10) is R-linearly independent. Then for any \(\varvec{\Delta X} \in C(\mathbf{X}^{\#})\),

$$\begin{aligned} {1 \over 2}\nabla ^2 f(\mathbf{X}^{\#}) \varvec{\Delta X} \cdot \varvec{\Delta X} \ge 0. \end{aligned}$$

Theorem 4.11

Suppose that the functions \(f, h_j\) and \(g_k\) for \(j = 1, \cdots , p\) and \(k = 1, \cdots , q\) satisfy the strong assumption. Assume that \(\mathbf{X}^{\#} = \left( W^{\#}, Y^{\#}, Z^{\#}\right) \in {{\mathbb {H}}}\) is a stationary point of (3), and for any \(\varvec{\Delta X} \in C(\mathbf{X}^{\#})\) with \(\varvec{\Delta X} \not = \mathbf{O}\),

$$\begin{aligned} {1 \over 2}\nabla ^2 f(\mathbf{X}^{\#}) \varvec{\Delta X} \cdot \varvec{\Delta X} > 0. \end{aligned}$$

Then \(\mathbf{X}^{\#}\) is a strict local minimizer of (3).

5 Convex Functions of Quaternion Matrix Variables

In this section, we study convex functions of quaternion matrix variables.

Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\). A natural definition for f to be convex is as follows. We say that f is a convex function if for any \(\mathbf{X, {\hat{\mathbf{X}}}} \in {{\mathbb {H}}}\), and any \(t \in {{\mathbb {R}}}\), \(0 \le t \le 1\), we have

$$\begin{aligned} f(t\mathbf{X}+(1-t){{\hat{\mathbf{X}}}}) \le tf(\mathbf{X}) + (1-t)f({{\hat{\mathbf{X}}}}). \end{aligned}$$

It is possible to use the modern convex analysis terminology of epigraphs to define convex functions of quaternion matrix variables. Here, we use the classical definition, as it is more convenient for engineering readers.

Then, a question is: Is f a convex function if and only if \(f^R\) is convex?

Proposition 5.1

Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\). Then f is a convex function if and only if \(f^R\) is convex.

Proof

Let \(\mathbf{X} = \left( W, Y, Z\right) , {{\hat{\mathbf{X}}}} = \left( {\hat{W}}, {\hat{Y}}, {\hat{Z}}\right) \in {{\mathbb {H}}}\), \(t \in {{\mathbb {R}}}\) and \(0 \le t \le 1\). Then

$$\begin{aligned}&t\mathbf{X}+(1-t){{\hat{\mathbf{X}}}} \\&\quad = \left( tW+(1-t){\hat{W}}, tY+(1-t){\hat{Y}}, tZ+(1-t){\hat{Z}}\right) ,\\&\qquad tW+(1-t){\hat{W}} \\&\quad = \left( tW_0+(1-t){\hat{W}}_0\right) +\left( tW_1+(1-t){\hat{W}}_1\right) \mathbf {i}+\left( tW_2+(1-t){\hat{W}}_2\right) \mathbf {j}\\&\qquad +\left( tW_3+(1-t){\hat{W}}_3\right) \mathbf {k},\\&\qquad tY+(1-t){\hat{Y}} \\&\quad = \left( tY_0+(1-t){\hat{Y}}_0\right) +\left( tY_1+(1-t){\hat{Y}}_1\right) \mathbf {i}+\left( tY_2+(1-t){\hat{Y}}_2\right) \mathbf {j}\\&\qquad +\left( tY_3+(1-t){\hat{Y}}_3\right) \mathbf {k},\\&\qquad tZ+(1-t){\hat{Z}} \\&\quad = \left( tZ_0+(1-t){\hat{Z}}_0\right) +\left( tZ_1+(1-t){\hat{Z}}_1\right) \mathbf {i}+\left( tZ_2+(1-t){\hat{Z}}_2\right) \mathbf {j}\\&\qquad +\left( tZ_3+(1-t){\hat{Z}}_3\right) \mathbf {k}. \end{aligned}$$

These identities show that \(R(t\mathbf{X}+(1-t){{\hat{\mathbf{X}}}}) = tR(\mathbf{X}) + (1-t)R({{\hat{\mathbf{X}}}})\). Combining this with (4), we conclude that f is a convex function if and only if \(f^R\) is convex. \(\square \)

We now define subgradients and subdifferentials by R-product. Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\), and \({{\bar{\mathbf{X}}}} = \left( {\bar{W}}, {\bar{Y}}, {\bar{Z}}\right) \in {{\mathbb {H}}}\). Let \(\mathbf{G} = \left( A, B, C\right) \in {{\mathbb {H}}}\). We say that \(\mathbf{G}\) is a subgradient of f at \({{\bar{\mathbf{X}}}}\) if for any \(\mathbf{X} = \left( W, Y, Z\right) \in {{\mathbb {H}}}\), we have

$$\begin{aligned} f(\mathbf{X}) \ge f({{\bar{\mathbf{X}}}}) + \mathbf{G} \cdot (\mathbf{X}-{{\bar{\mathbf{X}}}}). \end{aligned}$$

The set of all subgradients of f at \({{\bar{\mathbf{X}}}}\) is called the subdifferential of f at \({{\bar{\mathbf{X}}}}\) and denoted as \(\partial f({{\bar{\mathbf{X}}}})\).

By the definition of R-product, we see that \(\mathbf{G}\) is a subgradient of f at \({{\bar{\mathbf{X}}}}\) if and only if \(R(\mathbf{G})\) is a subgradient of \(f^R\) at \(R({{\bar{\mathbf{X}}}})\).

This definition is slightly different in appearance from the definition of Jia, Ng and Song [9]. They used the real part of the inner product of \(\mathbf{G}\) and \(\mathbf{X}-{{\bar{\mathbf{X}}}}\), instead of the R-product here. By Proposition 4.1, these two definitions are the same. This also reveals that the subgradient concept introduced in [9] can be regarded as the subgradient concept for the corresponding functions of real matrix variables.

From Proposition 5.1, the definition of R-product and the knowledge of convex functions of real variables, we have the following proposition.

Proposition 5.2

Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) is a convex function. Then for any \({{\bar{\mathbf{X}}}} = \left( {\bar{W}}, {\bar{Y}}, \bar{Z}\right) \in {{\mathbb {H}}}\), the subdifferential \(\partial f({{\bar{\mathbf{X}}}})\) is a nonempty, convex and compact set in \({{\mathbb {H}}}\). The subdifferential \(\partial f({{\bar{\mathbf{X}}}})\) is a singleton if and only if f is differentiable at \({\bar{\mathbf{X}}}\). In this case, \(\nabla f({{\bar{\mathbf{X}}}})\) is the unique subgradient of f at \({\bar{\mathbf{X}}}\).
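As an illustration of the differentiable case in Proposition 5.2 (our own example, not taken from [9]): the \(\ell _1\)-norm is convex, and at a matrix with all entries nonzero it is differentiable; differentiating \(\sum _{i,j}\sqrt{a_{ij0}^2+\cdots +a_{ij3}^2}\) with respect to each real component suggests that the gradient has entries \(a_{ij}/|a_{ij}|\). The sketch below checks this against a finite difference and verifies the subgradient inequality at a random point.

```python
import numpy as np

def l1_norm(A):
    """l1-norm of A = A0 + A1 i + A2 j + A3 k, stored as a list of real arrays."""
    return np.sum(np.sqrt(sum(Ai**2 for Ai in A)))

rng = np.random.default_rng(5)
A = [rng.standard_normal((3, 4)) + 2.0 for _ in range(4)]   # keep entries away from 0
mod = np.sqrt(sum(Ai**2 for Ai in A))
G = [Ai / mod for Ai in A]          # candidate gradient: entries a_ij / |a_ij|

# Finite-difference check on the real component A0 at entry (0, 0).
eps = 1e-6
Ap = [Ai.copy() for Ai in A]
Ap[0][0, 0] += eps
fd = (l1_norm(Ap) - l1_norm(A)) / eps
print(np.isclose(fd, G[0][0, 0], atol=1e-4))   # True

# Subgradient inequality f(X) >= f(A) + G . (X - A) at a random X.
X = [rng.standard_normal((3, 4)) for _ in range(4)]
rprod = sum(np.sum(Gi * (Xi - Ai)) for Gi, Xi, Ai in zip(G, X, A))
print(l1_norm(X) >= l1_norm(A) + rprod - 1e-12)   # True by convexity
```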

By Proposition 5.1 and the knowledge of convex functions of real variables, we have the following proposition.

Proposition 5.3

Suppose that \(f(\mathbf{X}) : {{\mathbb {H}}} \rightarrow {{\mathbb {R}}}\) satisfies the strong assumption. Then f is convex if and only if \(\nabla ^2 f\) is positive semi-definite at any \(\mathbf{X} \in {{\mathbb {H}}}\).

In the RQMC problem (6), the objective function is convex and the constraints are linear. Thus, it is a convex QMO problem. Optimality analysis for it was carried out in [9], and numerical examples were reported there.

6 A Sparse Color Image Denoising Problem

The task of this section is to analyze the optimality conditions of (8) to pave the way for further study of similar color image problems. As the \(\ell _0\)-norm and the rank function are not continuous, we first have to develop a generalized subdifferential calculus for proper functions of quaternion matrix variables.

6.1 Generalized Subdifferentials of Lower Semi-Continuous Functions of Quaternion Matrix Variables

The generalized subdifferential calculus for real-valued functions of real variables can be found in standard references [6, 15, 16]. In this subsection, we extend it to proper functions of quaternion matrix variables.

Denote \({\bar{{\mathbb {R}}}} := {{\mathbb {R}}} \cup \{ +\infty \}\). For a function \(h(\mathbf{X}) : {{\mathbb {M}}} \rightarrow {\bar{{\mathbb {R}}}}\), denote its domain as dom\((h) = \left\{ \mathbf{X} \in {{\mathbb {M}}} : h(\mathbf{X}) < +\infty \right\} \). We say that h is a proper function if its domain is not empty.

Suppose that \(h(\mathbf{X}) : {{\mathbb {M}}} \rightarrow {\bar{{\mathbb {R}}}}\) is a proper function and \({{\bar{\mathbf{X}}}} = \left( {\bar{Y}}, {\bar{Z}}\right) \in \) dom(h). Let \(\mathbf{G} \in {{\mathbb {M}}}\). We say that \(\mathbf{G}\) is an F(réchet)-subgradient of h at \({{\bar{\mathbf{X}}}}\) if

$$\begin{aligned} \liminf _{\mathbf{X} \rightarrow {{\bar{\mathbf{X}}}}, \mathbf{X} \not = {{\bar{\mathbf{X}}}}} {h(\mathbf{X}) - h({{\bar{\mathbf{X}}}}) - \mathbf{G} \cdot (\mathbf{X}-{{\bar{\mathbf{X}}}}) \over \Vert \mathbf{X}- {{\bar{\mathbf{X}}}}\Vert } \ge 0. \end{aligned}$$

The set of all F-subgradients of h at \({{\bar{\mathbf{X}}}}\) is called the Fréchet subdifferential of h at \({{\bar{\mathbf{X}}}}\) and denoted as \(\partial ^F h({{\bar{\mathbf{X}}}})\). On the other hand, \(\mathbf{G} \in {{\mathbb {M}}}\) is a limiting subgradient of h at \({{\bar{\mathbf{X}}}}\) if

$$\begin{aligned} \mathbf{G} = \lim _{k \rightarrow \infty } \mathbf{G}^k,\ \mathbf{G}^k \in \partial ^F h(\mathbf{X}^k),\ \lim _{k \rightarrow \infty } \mathbf{X}^k = {{\bar{\mathbf{X}}}},\ \lim _{k \rightarrow \infty } h(\mathbf{X}^k) = h({{\bar{\mathbf{X}}}}). \end{aligned}$$

The set of all limiting subgradients of h at \({{\bar{\mathbf{X}}}}\) is called the limiting subdifferential of h at \({{\bar{\mathbf{X}}}}\) and denoted as \(\partial ^L h({{\bar{\mathbf{X}}}})\).

By the definition of the R-product, we see that \(\mathbf{G}\) is an F-subgradient (limiting subgradient) of h at \({{\bar{\mathbf{X}}}}\) if and only if \(R(\mathbf{G})\) is an F-subgradient (limiting subgradient) of \(h^R\) at \(R({{\bar{\mathbf{X}}}})\). Here, h is regarded as a function of \(Y_i\) and \(Z_i\) for \(i = 0, 1, 2, 3\), and this function is denoted as \(h^R\); for \(\mathbf{X} = (Y, Z)\), we denote \(R(Y) = (Y_0, Y_1, Y_2, Y_3)\), \(R(Z)= (Z_0, Z_1, Z_2, Z_3)\) and \(R(\mathbf{X}) = (R(Y), R(Z))\). By (3) of [10], we have

$$\begin{aligned} \partial ^F h({{\bar{\mathbf{X}}}}) \subset \partial ^L h({{\bar{\mathbf{X}}}}). \end{aligned}$$

By Theorem 1 of [10], we have the following theorem.

Theorem 6.1

Let \(A = (a_{ij}) \in {{\mathbb {Q}}}^{m \times n}\). Then

$$\begin{aligned} \partial ^F \Vert A \Vert _0 = \partial ^L \Vert A \Vert _0 = {{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_A}, \end{aligned}$$

where \({{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_A} = \left\{ B \in {\mathbb Q}^{m \times n} : B_{\Gamma _A} = O \right\} \), \(\Gamma _A = \left\{ (i, j) : \ a_{ij} \not = 0 \right\} \) is the support of A, and \(\Gamma ^C_A\) is the complementary set of \(\Gamma _A\).

By using Theorem 4 of [10], we may also characterize the generalized subdifferential of the rank function of A.

6.2 The Feasible Set of The SCID Problem

In this subsection, we study the feasible set of the SCID problem (8):

$$\begin{aligned} S = \left\{ Y \in {{\mathbb {Q}}}^{m \times n} : \mathrm{rank}(Y) \le r \right\} . \end{aligned}$$

Let

$$\begin{aligned} R(S) = \left\{ R(Y) : Y \in S \right\} . \end{aligned}$$

In this paper, we say that R(S) is the real representation set of S. Note that it is different from \(\left\{ Y^R : Y \in S \right\} \).

Recall that a set in a real finite-dimensional space is called a semi-algebraic set if it is defined by a set of polynomial equations and inequalities, and a real-valued function in that space is called semi-algebraic if its graph is semi-algebraic [1, 7].

Proposition 6.2

The set S is closed, and its real representation set R(S) is closed and semi-algebraic.

Proof

Suppose that \(Y^k \in S\) and \(Y^k \rightarrow Y\). Denote \(r_k:= {\text {rank}}(Y^k)\). Thus \(r_k\le r\). By Theorem 2.1, there exist unitary matrices \(U^k \in {{\mathbb {Q}}}^{m \times m}\) and \(V^k \in {{\mathbb {Q}}}^{n \times n}\), and real positive diagonal matrices \(\Sigma _{r_k}^k = {\text {diag}}\left( \sigma _{k,1}, \ldots , \sigma _{k,r_k}\right) \) such that

$$\begin{aligned} Y^k = U^k\left( {\begin{array}{cc}\Sigma _{r_k}^k &{} O \\ O &{} O\end{array}}\right) (V^k)^*. \end{aligned}$$

For each k, introduce a diagonal matrix \(\Sigma _r^k ={\text {diag}}\left( \sigma _{k,1}, \ldots , \sigma _{k,r_k}, 0,\ldots , 0\right) \in {{\mathbb {R}}}^{r\times r}\). We can rewrite \(Y^k\) as

$$\begin{aligned} Y^k = U^k\left( {\begin{array}{cc}\Sigma _{r}^k &{} O \\ O &{} O\end{array}}\right) (V^k)^*. \end{aligned}$$

This is also equivalent to

$$\begin{aligned} \left( U^k\right) ^*Y^k V^k = \left( {\begin{array}{cc} \Sigma _r^k &{} O \\ O &{} O\end{array}}\right) . \end{aligned}$$
(32)

Since \(U^k\) and \(V^k\) are unitary, \(\{ U^k \}\) and \(\{ V^k \}\) are bounded. Thus, \(\{ U^k \}\) and \(\{ V^k \}\) have limiting points. Without loss of generality, we may assume that \(U^k \rightarrow U\) and \(V^k \rightarrow V\). Then U and V are unitary. Taking the limit on both sides of (32) as \(k\rightarrow +\infty \), we can find some nonnegative diagonal matrix \(\Sigma _r \in {{\mathbb {R}}}^{r\times r}\) such that

$$\begin{aligned} U^*YV = \left( {\begin{array}{cc} \Sigma _r &{} O \\ O &{} O\end{array}}\right) . \end{aligned}$$

That is, \(Y = U\left( {\begin{array}{cc} \Sigma _r &{} O \\ O &{} O\end{array}}\right) V^*\), which indicates that rank\((Y) \le r\). Thus, \(Y\in S\), and the desired closedness of S is obtained.

Since S is closed, the set R(S) is also closed. By Theorem 1.8.4 of [22], if the rank of \(A \in {{\mathbb {Q}}}^{m \times n}\) is r, then the rank of its real representation \(A^R\) is 4r. Note that

$$\begin{aligned} R(S)= & {} \left\{ R(Y) : Y \in {{\mathbb {Q}}}^{m \times n}, \mathrm{rank}(Y) \le r \right\} \\= & {} \left\{ R(Y) : Y \in {{\mathbb {Q}}}^{m \times n}, \mathrm{rank}(Y^R) \le 4r \right\} . \end{aligned}$$

Since R(S) is characterized by the vanishing of all \((4r+1) \times (4r+1)\) minors of \(Y^R\), which are polynomials in the entries of \(R(Y)\), we conclude that the set R(S) is semi-algebraic. \(\square \)
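The rank relation used in this proof, \(\mathrm{rank}(A^R) = 4\,\mathrm{rank}(A)\), can be observed numerically. The sketch below is our own illustration: it builds a rank-r quaternion matrix as a product of random quaternion factors (using the fact that the real representation is multiplicative) and checks that its real representation has rank 4r.

```python
import numpy as np

def qmatmul(A, B):
    """Product of quaternion matrices stored as lists [A0, A1, A2, A3]."""
    A0, A1, A2, A3 = A
    B0, B1, B2, B3 = B
    return [A0@B0 - A1@B1 - A2@B2 - A3@B3,
            A0@B1 + A1@B0 + A2@B3 - A3@B2,
            A0@B2 - A1@B3 + A2@B0 + A3@B1,
            A0@B3 + A1@B2 - A2@B1 + A3@B0]

def real_representation(A):
    A0, A1, A2, A3 = A
    return np.block([[A0, -A1, -A2, -A3],
                     [A1,  A0, -A3,  A2],
                     [A2,  A3,  A0, -A1],
                     [A3, -A2,  A1,  A0]])

rng = np.random.default_rng(6)
m, n, r = 6, 7, 2
B = [rng.standard_normal((m, r)) for _ in range(4)]
C = [rng.standard_normal((r, n)) for _ in range(4)]
A = qmatmul(B, C)                    # generically, rank(A) = r

print(np.linalg.matrix_rank(real_representation(A)))   # 4*r = 8
```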

For a nonempty closed set \(\Omega \), the indicator function with respect to \(\Omega \), denoted as \(\delta _\Omega \), is defined by

$$\begin{aligned} \delta _\Omega (\mathbf {x}) = \left\{ \begin{array}{ll} 0, &{} \mathrm{if}\ \mathbf {x}\in \Omega ; \\ +\infty , &{} \mathrm{otherwise}. \end{array} \right. \end{aligned}$$

Let \(A \in {{\mathbb {Q}}}^{m \times n}\). Then there is \(\delta > 0\) such that \(\Vert B \Vert _0 \ge \Vert A \Vert _{0}\) for all \(B \in {{\mathbb {Q}}}^{m \times n}\) satisfying \(\Vert B-A \Vert _F \le \delta \). This shows that the \(\ell _0\)-norm function of quaternion matrices is lower semi-continuous.

Proposition 6.3

Suppose that \(p : {{\mathbb {M}}} \rightarrow {\bar{{\mathbb {R}}}}\) is defined by \(p(\mathbf{X}) = \lambda \Vert Z\Vert _0 + \delta _S(Y)\), where \(\mathbf{X} = (Y, Z)\). Let \(p^R(R(Y), R(Z)) := p(\mathbf{X})\). Then,

(i) p is a proper lower semi-continuous function, and \(p^R\) is a semi-algebraic function;

(ii) the limiting subdifferential of p takes the form of

$$\begin{aligned} \partial ^L p(Y, Z) = N_S({Y}) \times {{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_Z}, \end{aligned}$$

where \(N_S(Y)\) is the normal cone with respect to S at Y.

Proof

(i) For any \(Y \in S\) and \(Z \in {{\mathbb {Q}}}^{m \times n}\), \(p(\mathbf{X})\) is nonnegative and finite valued. Thus, p is proper. Since the \(\ell _0\)-norm function of quaternion matrices is lower semi-continuous, and S is closed by Proposition 6.2, p is lower semi-continuous. By Proposition 6.2, R(S) is semi-algebraic, so \(\delta _{R(S)}\) is a semi-algebraic function. The \(\ell _0\)-norm function is also semi-algebraic, since its graph is a finite union of sets defined by polynomial equations and inequalities; see [1]. Thus, \(p^R\), as a sum of semi-algebraic functions, is a semi-algebraic function.

(ii) Note that p is separable in Y and Z. Thus,

$$\begin{aligned} \partial ^L p(Y, Z)= & {} \partial ^L \delta _S(Y) \times \partial ^L(\lambda \Vert Z\Vert _0)\\= & {} N_S({Y}) \times {{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_Z}, \end{aligned}$$

where the second equality comes from Theorem 6.1. \(\square \)

The semi-algebraic property of \(p^R\) will not be used in this paper, yet it is very important in convergence analysis of first-order algorithms for solving (8) [1].

6.3 Stationarity

The SCID problem (8) is a rank-constrained sparse optimization problem in quaternion matrix variables. Inspired by sparse optimization [2] and rank-constrained optimization [12], we now introduce stationary point concepts for (8).

We first introduce the proximal mapping of a lower semi-continuous function \(h : {{\mathbb {M}}} \rightarrow {{\mathbb {R}}} \cup \{ +\infty \}\), denoted by \(\hbox {Prox}_h\) and defined as

$$\begin{aligned} \mathrm{Prox}_h(\mathbf{X}) : = {\mathrm{arg}\min }_{\mathbf{W} \in {{\mathbb {M}}}} \left\{ h(\mathbf{W}) + {1 \over 2}\Vert \mathbf{W} - \mathbf{X}\Vert _F^2 \right\} . \end{aligned}$$

Then we define \(\beta -\)stationary points and stationary points for (8). Let

$$\begin{aligned} h(\mathbf{X}) = {1 \over 2}\Vert L(Y+Z) - D\Vert _F^2. \end{aligned}$$
(33)

Let \(\beta \) be a positive number and \({{\hat{\mathbf{X}}}} = ({\hat{Y}}, {\hat{Z}}) \in {{\mathbb {M}}}\). We say that \({{\hat{\mathbf{X}}}}\) is a \(\beta -\)stationary point of (8) if

$$\begin{aligned} {\hat{Y}} \in \Pi _S({{\hat{Y}}}- \beta \nabla _Y h({{\hat{\mathbf{X}}}})),\ {\hat{Z}} \in \mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0}(\hat{Z} - \beta \nabla _Zh({{\hat{\mathbf{X}}}})); \end{aligned}$$
(34)

we say that \({{\hat{\mathbf{X}}}}\) is a stationary point of (8) if

$$\begin{aligned} \nabla _Y h({{\hat{\mathbf{X}}}}) \in N_S({{\hat{Y}}}),\ \nabla _Z h({{\hat{\mathbf{X}}}}) \in {{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_{{\hat{Z}}}}. \end{aligned}$$

Here, \(\Pi _S\) is the projection operator with respect to S.
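For the proximal mapping \(\mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0}\) appearing in (34), the minimizer can be computed entrywise: an entry is kept if its quaternion modulus exceeds \(\sqrt{2\beta \lambda }\), and set to zero otherwise (with ties resolvable either way). The following minimal Python/NumPy sketch illustrates this hard-thresholding computation; quaternion matrices are stored as four real arrays and the function name prox_l0 is ours.

```python
import numpy as np

def prox_l0(X, tau):
    """Prox of tau * ||.||_0 for a quaternion matrix stored as (X0, X1, X2, X3).

    Entrywise: keep an entry iff its squared quaternion modulus exceeds 2*tau,
    otherwise set it to zero; entries with modulus exactly sqrt(2*tau) may go either way."""
    X0, X1, X2, X3 = X
    keep = (X0**2 + X1**2 + X2**2 + X3**2) > 2.0 * tau
    return tuple(Xi * keep for Xi in (X0, X1, X2, X3))

# Tiny usage example with tau = beta * lambda = 0.5:
rng = np.random.default_rng(1)
Z = tuple(rng.standard_normal((3, 4)) for _ in range(4))
nnz = lambda Q: int(np.count_nonzero(Q[0]**2 + Q[1]**2 + Q[2]**2 + Q[3]**2))
print(nnz(Z), "->", nnz(prox_l0(Z, tau=0.5)))  # small-modulus entries are zeroed
```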

By (16), we have

$$\begin{aligned} \nabla h(\mathbf{X}) =\left[ {\partial \over \partial Y}h(\mathbf{X}), {\partial \over \partial Z}h(\mathbf{X})\right] = \left[ L^*\left( L(Y+Z)-D\right) , L^*\left( L(Y+Z)-D\right) \right] ,\nonumber \\ \end{aligned}$$
(35)

where \(L^*\) stands for the adjoint of L in the sense that

$$\begin{aligned} L(Y)\cdot W = Y \cdot L^*(W), ~~\forall ~ Y, W\in {{\mathbb {Q}}}^{m\times n}. \end{aligned}$$
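As a sanity check of the gradient formula (35), the following Python/NumPy sketch uses a hypothetical entrywise masking operator L (as in inpainting-type models), acting on each real component, for which \(L^* = L\); it compares one real component of \(\nabla _Y h\) from (35) against a central finite difference. The operator, variable names, and storage convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 5
M = (rng.random((m, n)) < 0.6).astype(float)           # hypothetical mask: L(Y) = M .* Y componentwise
D = tuple(rng.standard_normal((m, n)) for _ in range(4))
Y = [rng.standard_normal((m, n)) for _ in range(4)]     # quaternion matrices as 4 real components
Z = [rng.standard_normal((m, n)) for _ in range(4)]

def h(Y, Z):
    # h(X) = 1/2 || L(Y+Z) - D ||_F^2 for the masking operator L
    return 0.5 * sum(np.sum((M * (Yt + Zt) - Dt) ** 2) for Yt, Zt, Dt in zip(Y, Z, D))

def grad_Y(Y, Z):
    # (35): grad_Y h = L*( L(Y+Z) - D ), and L* = L for this masking operator
    return [M * (M * (Yt + Zt) - Dt) for Yt, Zt, Dt in zip(Y, Z, D)]

G = grad_Y(Y, Z)
t, i, j, eps = 2, 1, 3, 1e-6                            # perturb the j-component of Y at entry (1, 3)
Yp = [Yt.copy() for Yt in Y]; Yp[t][i, j] += eps
Ym = [Yt.copy() for Yt in Y]; Ym[t][i, j] -= eps
fd = (h(Yp, Z) - h(Ym, Z)) / (2 * eps)
print(abs(fd - G[t][i, j]))                             # agrees up to finite-difference error
```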

Proposition 6.4

There is a gradient Lipschitz constant \(L_h = 2\Vert L^*L\Vert > 0\) (here \(\Vert L^*L\Vert \) is the spectral norm of the linear operator \(L^*L\)) such that for any \(\mathbf{X} = (Y, Z), {{\hat{\mathbf{X}}}} = ({\hat{Y}}, {\hat{Z}}) \in {{\mathbb {M}}}\),

$$\begin{aligned} \Vert \nabla h(\mathbf{X}) - \nabla h({{\hat{\mathbf{X}}}}) \Vert _F \le L_h \Vert \mathbf{X} -{{\hat{\mathbf{X}}}}\Vert _F, \end{aligned}$$

i.e.,

$$\begin{aligned} \Vert \nabla h^R(R(\mathbf{X})) - \nabla h^R(R({{\hat{\mathbf{X}}}})) \Vert _F \le L_h \Vert R(\mathbf{X}) -R({{\hat{\mathbf{X}}}})\Vert _F. \end{aligned}$$

Proof

By (35), \(\nabla h(\mathbf{X}) - \nabla h({{\hat{\mathbf{X}}}}) = \left[ L^*L(W), L^*L(W)\right] \) with \(W = (Y - {\hat{Y}}) + (Z - {\hat{Z}})\). Hence \(\Vert \nabla h(\mathbf{X}) - \nabla h({{\hat{\mathbf{X}}}}) \Vert _F = \sqrt{2}\Vert L^*L(W)\Vert _F \le \sqrt{2}\Vert L^*L\Vert \left( \Vert Y - {\hat{Y}}\Vert _F + \Vert Z - {\hat{Z}}\Vert _F\right) \le 2\Vert L^*L\Vert \Vert \mathbf{X} - {{\hat{\mathbf{X}}}}\Vert _F\), which proves the first inequality. The second inequality then follows from the equivalence relation between \((h, \mathbf{X})\) and \((h^R, R(\mathbf{X}))\). \(\square \)

Theorem 6.5

For the SCID problem (8), we have the following conclusions:

(i) Any local minimizer is a stationary point;

(ii) Any \(\beta -\)stationary point for \(\beta > 0\) is a stationary point;

(iii) Any global minimizer \({\mathbf{X}^\#} = (Y^\#, Z^\#)\) is a \(\beta -\)stationary point for \(\beta \in (0, {1 \over L_h})\); furthermore, \(\Pi _S(Y^\#- \beta \nabla _Y h({\mathbf{X}^\#}))\times \) \(\hbox {Prox}_{\beta \lambda \Vert \cdot \Vert _0}(Z^\#- \beta \nabla _Zh({\mathbf{X}^\#}))\) is a singleton;

(iv) If \({\mathbf{X}^\#} = (Y^\#, Z^\#)\) is a \(\beta -\)stationary point and rank\((Y^\#)<r\), then \({\mathbf{X}^\#}\) is a local minimizer.

Proof

(i) Let \({{\hat{\mathbf{X}}}} = ({\hat{Y}}, {\hat{Z}})\) be a local minimizer of (8). Rewrite problem (8) as

$$\begin{aligned} \min _{\mathbf{X}} h(\mathbf{X}) + p(\mathbf{X}), \end{aligned}$$
(36)

where h is defined by (33), p is defined in Proposition 6.3. Note that the quaternion matrix optimization problem (36) is equivalent to the real matrix optimization problem

$$\begin{aligned} \min _{R(\mathbf{X})} h^R(R(\mathbf{X})) + p^R(R(\mathbf{X})). \end{aligned}$$
(37)

Apply the generalized Fermat rule [16, Theorem 10.1] to the real matrix optimization problem (37). Since \(\mathbf{G}\) is an F-subgradient (limiting subgradient) of h or p at \({{\hat{\mathbf{X}}}}\) if and only if \(R(\mathbf{G})\) is an F-subgradient (limiting subgradient) of \(h^R\) or \(p^R\) at \(R({{\hat{\mathbf{X}}}})\), we have the desired result.

(ii) Let \({{\hat{\mathbf{X}}}} = ({\hat{Y}}, {\hat{Z}})\) be a \(\beta -\)stationary point of (8). Note that

$$\begin{aligned} {\hat{Y}} \in \Pi _S({{\hat{Y}}}- \beta \nabla _Y h({{\hat{\mathbf{X}}}})) \end{aligned}$$

is equivalent to

$$\begin{aligned} R({\hat{Y}}) \in \Pi _{R(S)}(R({{\hat{Y}}})- \beta \nabla _{R(Y)} h^R(R({{\hat{\mathbf{X}}}}))), \end{aligned}$$

which is further equivalent to

$$\begin{aligned} R({\hat{Y}}) = \mathrm{argmin}_{R(Y)} \left\{ {1 \over 2}\left\| R(Y) - (R({{\hat{Y}}})- \beta \nabla _{R(Y)} h^R(R({{\hat{\mathbf{X}}}})))\right\| _F^2 + \delta _{R(S)}(R(Y))\right\} . \end{aligned}$$

Combining this with the generalized Fermat rule [16, Theorem 10.1] and the fact that

$$\begin{aligned} \partial _{R(Y)} \delta _{R(S)}(R({{\hat{Y}}})) = N_{R(S)}(R({{\hat{Y}}})) \end{aligned}$$

[16, Exercise 8.14], we conclude that

$$\begin{aligned} O \in R({\hat{Y}})- (R({{\hat{Y}}})- \beta \nabla _{R(Y)} h^R(R({{\hat{\mathbf{X}}}})))+ N_{R(S)}(R({{\hat{Y}}})), \end{aligned}$$

which is exactly

$$\begin{aligned} \nabla _{R(Y)} h^R(R({{\hat{\mathbf{X}}}})) \in N_{R(S)}(R({{\hat{Y}}})), \end{aligned}$$
(38)

due to the conic property of \(N_{R(S)}(R({{\hat{Y}}}))\). Note that (38) is equivalent to

$$\begin{aligned} \nabla _Y h({{\hat{\mathbf{X}}}}) \in N_S({{\hat{Y}}}). \end{aligned}$$
(39)

On the other hand, the relation

$$\begin{aligned} {\hat{Z}} \in \mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0}({\hat{Z}} - \beta \nabla _Zh({{\hat{\mathbf{X}}}})) \end{aligned}$$

indicates that

$$\begin{aligned} \nabla _Z h({{\hat{\mathbf{X}}}}) \in {{\mathbb {Q}}}^{m\times n}_{\Gamma ^C_{{\hat{Z}}}}. \end{aligned}$$

Combining with (39), we have Conclusion (ii).

(iii) Denote \(R_1^\# = \nabla _{R(Y)} h^R(R(Y^\#), R(Z^\#))\) and \(R_2^\# = \nabla _{R(Z)} h^R(R(Y^\#), R(Z^\#))\). By the Lipschitz continuity of \(\nabla h^R\) and the descent lemma [3], we have

$$\begin{aligned}&h^R(R(Y), R(Z)) \\&\quad \le h^R(R(Y'), R(Z')) + \langle \nabla h^R(R(Y'), R(Z')), (R(Y) - R(Y'), R(Z) - R(Z')) \rangle \\&\qquad + {L_h \over 2}\left\| (R(Y) - R(Y'), R(Z) - R(Z'))\right\| _F^2, \end{aligned}$$

for any \(Y, Y', Z, Z' \in {{\mathbb {Q}}}^{m \times n}\). This is equivalent to

$$\begin{aligned} h(Y, Z) \le h(Y', Z') + \nabla h(Y', Z') \cdot (Y-Y', Z-Z') + {L_h \over 2}\Vert (Y-Y', Z-Z')\Vert _F^2,\nonumber \\ \end{aligned}$$
(40)

for any \(Y, Y', Z, Z' \in {{\mathbb {Q}}}^{m \times n}\). We then prove both parts simultaneously. Assume on the contrary that \((Y^\#, Z^\#)\) is not a \(\beta -\)stationary point of problem (8) for some \(\beta < {1 \over L_h}\), or that the product set in question is not a singleton. In either case, there is \((Y_0, Z_0) \not = (Y^\#, Z^\#)\) such that

$$\begin{aligned} (Y_0, Z_0) \in \Pi _S(Y^\# - \beta R_1^\#)\times \mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0} (Z^\# - \beta R_2^\#). \end{aligned}$$

The inclusion \(Y_0 \in \Pi _S(Y^\#-\beta R_1^\#)\) indicates that

$$\begin{aligned} \Vert Y_0 - (Y^\#-\beta R_1^\#)\Vert _F^2 \le \Vert Y^\# - (Y^\#-\beta R_1^\#)\Vert _F^2. \end{aligned}$$

Thus,

$$\begin{aligned} R_1^\# \cdot (Y_0 - Y^\#) \le -{1 \over 2\beta }\Vert Y_0 - Y^\#\Vert _F^2. \end{aligned}$$
(41)

On the other hand, the relation \(Z_0 \in \mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0} (Z^\# - \beta R_2^\#)\) implies that

$$\begin{aligned} {1 \over 2}\Vert Z_0 - (Z^\# - \beta R_2^\#) \Vert _F^2 + \lambda \beta \Vert Z_0\Vert _0 \le {1 \over 2}\Vert Z^\# - (Z^\# - \beta R_2^\#) \Vert _F^2 + \lambda \beta \Vert Z^\#\Vert _0. \end{aligned}$$

After simplification, we have

$$\begin{aligned} \lambda \Vert Z_0\Vert _0 + {1 \over 2\beta }\Vert Z_0-Z^\#\Vert _F^2 + R_2^\# \cdot (Z_0-Z^\#) \le \lambda \Vert Z^\#\Vert _0. \end{aligned}$$
(42)

By (40)-(42), we have

$$\begin{aligned}&h(Y_0, Z_0) + \lambda \Vert Z_0\Vert _0 \nonumber \\&\qquad \le h(Y^\#, Z^\#) + \lambda \Vert Z^\#\Vert _0 + \left( {L_h \over 2} - {1 \over 2\beta }\right) \Vert (Y_0-Y^\#, Z_0-Z^\#)\Vert _F^2. \end{aligned}$$
(43)

Note that \(\beta < {1 \over L_h}\) and \(\delta _S(Y_0) = \delta _S(Y^\#)\). Then (43) contradicts the global optimality of \((Y^\#, Z^\#)\). Thus, \((Y^\#, Z^\#)\) is the unique element in \(\Pi _S(Y^\#- \beta \nabla _Y h({\mathbf{X}^\#}))\times \) \(\hbox {Prox}_{\beta \lambda \Vert \cdot \Vert _0}(Z^\#- \beta \nabla _Zh({\mathbf{X}^\#}))\). This proves Conclusion (iii).

(iv) Since the Eckart-Young-Mirsky low rank approximation theorem can be applied to the case of quaternion matrices [9], we can easily verify the implication below:

$$ \begin{aligned} Y\in \Pi _{S}(Y+T) ~~ \& ~~ \mathrm{rank}(Y)<r ~~\Longrightarrow ~~ T = O. \end{aligned}$$
(44)

If \({\mathbf{X}^\#} = (Y^\#, Z^\#)\) is a \(\beta -\)stationary point with rank\((Y^\#)<r\), then by the definition of \(\beta \)-stationarity in (34) and the implication (44), we have \(\nabla _Y h({\mathbf{X}^\#}) = O\); together with (35), this yields \(\nabla _Z h({\mathbf{X}^\#}) = \nabla _Y h({\mathbf{X}^\#}) = O\). Additionally, the convexity of h yields that for any \(\mathbf{X}\in {{\mathbb {M}}}\),

$$\begin{aligned} h(\mathbf{X}) \ge h({\mathbf{X}^\#}) + \nabla h({\mathbf{X}^\#}) \cdot (\mathbf{X}-{\mathbf{X}^\#}) = h({\mathbf{X}^\#}). \end{aligned}$$
(45)

Note that \(\Vert Z\Vert _0\) only takes integer values 0, 1, \(\ldots \), mn. Thus we can find some positive scalar \(\epsilon \) such that

$$\begin{aligned} \Vert Z\Vert _0\ge \Vert Z^\#\Vert _0,~~\forall Z\in N(Z^\#, \epsilon ), \end{aligned}$$

where \(N(Z^\#, \epsilon ):=\left\{ Z: \Vert Z-Z^\#\Vert _F \le \epsilon \right\} \). Combining with (45), we can conclude that for any feasible solution \(\mathbf{X} = (Y, Z)\) with \(\Vert \mathbf{X}-{\mathbf{X}^\#}\Vert _F \le \epsilon \) (so that \(Z \in N(Z^\#, \epsilon )\)), we have \(h(\mathbf{X})+p(\mathbf{X})\ge h({\mathbf{X}^\#})+p({\mathbf{X}^\#})\), that is, \({\mathbf{X}^\#}\) is a local minimizer. This completes the proof. \(\square \)
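By (34), a \(\beta \)-stationary point is precisely a fixed point of the update \(Y \leftarrow \Pi _S(Y - \beta \nabla _Y h)\), \(Z \leftarrow \mathrm{Prox}_{\beta \lambda \Vert \cdot \Vert _0}(Z - \beta \nabla _Z h)\). The following Python/NumPy sketch iterates this update for the illustrative masking operator of the previous sketch; it is not an algorithm proposed in this paper, only a hedged illustration of how the stationarity concepts could be used numerically. The projection \(\Pi _S\) is computed through a real representation whose block layout is our own convention, exploiting that the singular values of \(Y^R\) are those of Y, each with multiplicity four.

```python
import numpy as np

def real_rep(A):
    A0, A1, A2, A3 = A
    return np.block([[A0, -A1, -A2, -A3],
                     [A1,  A0, -A3,  A2],
                     [A2,  A3,  A0, -A1],
                     [A3, -A2,  A1,  A0]])

def from_real_rep(R, m, n):
    # read the four quaternion components back off the first block column
    return tuple(R[t * m:(t + 1) * m, :n] for t in range(4))

def proj_rank(Y, r):
    """Pi_S: best rank-r approximation, via a truncated SVD of the real representation."""
    m, n = Y[0].shape
    U, s, Vt = np.linalg.svd(real_rep(Y), full_matrices=False)
    s[4 * r:] = 0.0                       # keep the 4r largest singular values (r quaternion ones)
    return from_real_rep((U * s) @ Vt, m, n)

def prox_l0(X, tau):
    """Entrywise hard thresholding of the quaternion modulus at sqrt(2*tau)."""
    keep = (X[0]**2 + X[1]**2 + X[2]**2 + X[3]**2) > 2.0 * tau
    return tuple(Xi * keep for Xi in X)

rng = np.random.default_rng(3)
m, n, r, lam, beta = 6, 5, 2, 0.05, 0.4   # beta below 1/L_h = 1/(2||L*L||) = 0.5 for this mask
M = (rng.random((m, n)) < 0.7).astype(float)
D = tuple(rng.standard_normal((m, n)) for _ in range(4))
Y = tuple(np.zeros((m, n)) for _ in range(4))
Z = tuple(np.zeros((m, n)) for _ in range(4))

def step(Y, Z):
    G = tuple(M * (M * (Yt + Zt) - Dt) for Yt, Zt, Dt in zip(Y, Z, D))   # (35), with L* = L
    Y_new = proj_rank(tuple(Yt - beta * Gt for Yt, Gt in zip(Y, G)), r)
    Z_new = prox_l0(tuple(Zt - beta * Gt for Zt, Gt in zip(Z, G)), beta * lam)
    return Y_new, Z_new

for _ in range(200):
    Y, Z = step(Y, Z)

Y2, Z2 = step(Y, Z)
drift = sum(np.linalg.norm(a - b) for a, b in zip(Y2 + Z2, Y + Z))
print(drift)   # small drift means the iterate is numerically close to a beta-stationary point
```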

7 Final Remarks

In this paper, we proposed a general form of the QMO problem, presented its optimality analysis, introduced first- and second-order derivatives of real-valued functions of quaternion matrix variables, and established their calculation rules. Our approach is consistent with the subgradient concept for norms of quaternion matrix variables introduced in [9]. We established first- and second-order optimality conditions for the QMO problem. Optimization methods can be developed based on these.

One key tool of our approach is the R-product. It turns out that the R-product of two quaternion matrices is equal to the real part of the inner product of these two quaternion matrices. This is not by chance. Just as a quaternion matrix can be identified with a third-order real tensor, there is a converse tendency to cast the results of operations on quaternion matrices back into the real field, as happens with singular values and R-products.

Finally, we introduced the generalized subdifferentials of proper functions of quaternion matrices, and used them to analyze the optimality conditions of the SCID problem, a special case of the QMO problem. This combines the knowledge of quaternion matrices, color image processing and variational analysis.

Different QMO models have different structures. For example, the LRQD problem satisfies the strong assumption, so even second-order methods can be developed for it. On the other hand, the LRQA problem (7) involves all singular values of quaternion matrices, whose generalized subdifferentials need to be further studied.

We hope that our work is useful to people working with optimization models involving quaternion matrices.