1 Introduction

Super-resolution (SR), as its name suggests, aims at enhancing the resolution of a sensing system in which the resolution is limited by hardware such as lenses and sensors. It is closely related to interpolation [23] in the sense of filling in information on an unknown fine grid based on what is available on the coarse grid. Being particularly useful in imaging applications, such as high-definition television and the retina displays used in Apple products, SR is often cast as an image reconstruction problem, and several image reconstruction methods have been directly transplanted to SR, e.g., total variation [24], non-local means [31], and sparse dictionary representation [39]. For other SR methods, please refer to the two survey papers [3, 26] and references therein.

The super-resolution problem addressed in this paper is different from image zooming or magnification; instead, it aims to recover a real-valued signal from its low-frequency measurements. A mathematical model is expressed as

$$\begin{aligned} b_k=\frac{1}{\sqrt{N}}\sum _{t=0}^{N-1} x_t e^{-i2\pi kt/N}, \qquad |k|\le f_c, \end{aligned}$$
(1)

where \(x\in \mathbb R^N\) is a vector of interest, and \(b\in \mathbb C^n\) is the given low-frequency information with \(n=2f_c+1\ (n<N)\). This is related to super-resolution in the sense that the underlying signal x is defined on a fine grid with spacing 1 / N, while we only observe the lowest n Fourier coefficients, which implies that we can only expect to recover the signal on a coarser grid with spacing 1 / n. For simplicity, we use matrix notation to rewrite Eq. (1) as \(b=S_n\mathcal {F} x\), where \(S_n\) is the sampling matrix that collects the lowest n frequency coefficients, \(\mathcal {F}\) is the Fourier transform, and we denote \(\mathcal {F}_n=S_n\mathcal {F}\). The frequency cutoff induces a resolution limit inversely proportional to \(f_c\); below we set \(\lambda _c=1/f_c\), which is referred to as the Rayleigh length (a classical resolution limit of hardware [19]). Hence, a super-resolution factor (SRF) can be interpreted as the ratio between the spacing in the coarse and fine grids, i.e., \( \text{ SRF } = N/n\approx 0.5\lambda _cN. \)
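
For concreteness, the measurement model (1) is easy to simulate; the sketch below (ours, for illustration only; the function name and the test signal are not from the paper) uses NumPy's FFT convention \(X_k=\sum _t x_t e^{-i2\pi kt/N}\), so that \(b=S_n\mathcal {F}x\) is obtained by rescaling and collecting the indices \(|k|\le f_c\).

```python
import numpy as np

def lowfreq_measurements(x, fc):
    """Collect the n = 2*fc+1 lowest Fourier coefficients b_k, |k| <= fc, of x as in Eq. (1)."""
    N = len(x)
    X = np.fft.fft(x) / np.sqrt(N)     # X[k] = (1/sqrt(N)) * sum_t x_t * exp(-i*2*pi*k*t/N)
    ks = np.arange(-fc, fc + 1)        # frequencies -fc, ..., fc
    return X[ks % N]                   # a negative k wraps to index N + k

# example: N = 1000 fine grid, cutoff fc = 20, hence n = 41 low-frequency measurements
x = np.zeros(1000)
x[[100, 150, 480]] = [1.0, -0.5, 2.0]  # three point sources
b = lowfreq_measurements(x, fc=20)
```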

We are interested in super-resolving point sources. This is particularly useful in astronomy [32], where blurred images with point sources need to be cleaned or super-resolved. Suppose x is composed of point sources, i.e., \( x=\sum _{t_j\in T} c_j\delta _{t_j}, \) where \(\delta _{\tau }\) is a Dirac measure at \(\tau \), the spikes of x are located at points \(t_j\) in a set T, and \(c_j\) are coefficients. Denote by \(K=|T|\) the cardinality of the set T; the sparsity assumption means that \(K\ll N\). Recently, the sparse recovery problem has become popular due to rapid advances in compressive sensing (CS) [13]. The provable performance of CS methods relies on either the restricted isometry property (RIP) [4] or incoherent measurements [35, 36]. Unfortunately for SR, the sensing matrix \(\mathcal {F}_n\) is highly coherent [17]. Consequently, sparse SR deserves special attention, and studying it may lead to a better understanding of CS. For example, Demanet and Nguyen [11] discussed minimax recovery theory and error bounds by analyzing the restricted isometry constant (a CS concept).

In addition to sparsity, we also assume that the point sources are separated by a critical distance, referred to as the minimum separation (MS) [6] (cf. Definition 1). Theoretical results based on the analysis of an \(L_1\) certificate, i.e., interpolating trigonometric polynomials of sparse sign patterns [6], demonstrate that point sources can be exactly recovered in the noise-free case as long as any two spikes are at least the MS distance apart (cf. Theorem 1).

Definition 1

(Minimum Separation) Let \(\mathbb T\) be the circle obtained by identifying the endpoints of [0, 1] and \(\mathbb T^d\) the \(d-\)dimensional torus. For a family of points \(T\subset \mathbb T^d\), the minimum separation is defined as the smallest wrap-around distance between any two distinct elements of T,

$$\begin{aligned} \text{ MS } :=\triangle (T) := \inf _{t,t'\in T:\, t\ne t'} |t-t'|, \end{aligned}$$
(2)

where \(|t-t'|\) is the \(L_\infty \) distance (maximum deviation in any coordinate).

Theorem 1

[6, Corollary 1.4] Let \(T=\{t_j\}\) be the support of x. If the minimum separation obeys

$$\begin{aligned} \triangle (T)\ge 2\lambda _c N, \end{aligned}$$
(3)

then x is the unique solution to \(L_1\) minimization:

$$\begin{aligned} \min \Vert x\Vert _1 \quad \text{ s.t. } \quad \mathcal {F}_nx=b. \end{aligned}$$
(4)

If x is real-valued, then the minimum gap can be lowered to \(1.87\lambda _c N\).

We want to analyze the constant in front of \(\lambda _c N\) in Eq. (3), referred to as the minimum separation factor (MSF). Theorem 1 indicates that MSF\(\ge 2\) guarantees exact recovery by \(L_1\) minimization; a recent improvement lowers the constant to 1.26 [18] at the cost of the additional requirement that \(f_c\ge 1000\). This line of research originated with Donoho [12], who showed that \(\mathrm{MSF}>1\) is sufficient if the spikes are on the grid. Note that both aforementioned works [6, 18] are formulated in terms of off-grid spikes. Another article about off-grid spikes is [1] by Aubel et al., who also arrived at \(\mathrm{MSF}>1\) when windowed Fourier (or short-time Fourier transform) measurements are available. Furthermore, there are two works that do not require a minimum separation. De Castro and Gamboa [10] showed that K spikes can be resolved from \(2K+1\) Fourier samples; and, with the additional assumption that the point sources are positive, Donoho et al. [14] showed that 2K noiseless measurements suffice for exact recovery of K positive spikes. In addition to these exact recovery results, errors in spike detection and noise robustness are of great interest as well. Fernandez-Granda analyzed error bounds of constrained \(L_1\) minimization in [18], while the unconstrained version was addressed in [34] under a Gaussian noise model as well as in [2] for any sampling scheme. The robustness of spike detection was discussed in [15].

1.1 Our Contributions

We investigate the recovery performance of two nonconvex \(L_1\) based penalties, the difference of \(L_1\) and \(L_2\) norms (\(L_{1-2}\)) and the capped \(L_1\) (C\(L_1\)). The former was recently proposed in [22, 40] as an alternative to \(L_1\) for CS, and the latter is often used in statistics and machine learning [33, 41]. Numerical simulations show that \(L_1\) minimization often fails when \(\mathrm{MSF}<1\), in which case we demonstrate that both \(L_{1-2}\) and C\(L_1\) outperform the classical \(L_1\) method.

During the course of the simulation study, we observe that the rank property mostly holds for \(L_{1-2}\), i.e., the \(L_0\) norm of the reconstructed solution does not exceed n (the rank of A). We find that exact sparse recovery is unlikely when MSF is very small, but the reconstructed solution is still sparse, with sparsity at most n. In addition, we have the following relationships: \(\text{ MS }\cdot K\le N\), \(\hbox {MSF}=\hbox {MS}\cdot f_c/N\), and rank\((A)=n=2f_c+1\). Putting them together, we get \(K<0.5n/\)MSF. This inequality implies that we may reconstruct a vector sparser than the ground truth (cf. Fig. 4).
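
To make this bookkeeping concrete, the following small helpers (ours, not from the paper) compute the wrap-around minimum separation of a 1D support set in the grid-index convention used throughout, and the resulting MSF.

```python
import numpy as np

def min_separation(support, N):
    """Wrap-around minimum separation (in grid points) of a 1D support set on {0, ..., N-1}."""
    t = np.sort(np.asarray(support))
    gaps = np.diff(np.concatenate([t, [t[0] + N]]))  # consecutive gaps, including the wrap-around one
    return gaps.min()

def msf(support, N, fc):
    """Minimum separation factor: MSF = MS * fc / N."""
    return min_separation(support, N) * fc / N

# example: MS = 20 on a grid of N = 1000 with fc = 40 gives MSF = 0.8
print(msf([100, 120, 500], N=1000, fc=40))
```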

The rest of the paper is organized as follows. Section 2 reviews numerical methods for \(L_1\) minimization [6] and \(L_p\) (\(0<p<1\)) minimization [20]. Section 3 describes the proposed algorithms for the two non-convex functionals, \(L_{1-2}\) and C\(L_1\), in a unified way. We analyze the theoretical aspects of the two methods in Sect. 4, including the rank property, local minimizers, and stationary points. Experiments on both one-dimensional signals and two-dimensional images are presented in Sect. 5, followed by conclusions in Sect. 6.

2 Review on \(L_1\) and \(L_p\) Minimization

To make the paper self-contained, we briefly review two numerical algorithms: \(L_1\) minimization via a semi-definite program (SDP) in [6] and \(L_p\) minimization via iteratively reweighted least squares (IRLS) in [20], both of which will be examined in Sect. 5 as benchmarks for the proposed \(L_{1-2}\) and C\(L_1\) methods.

2.1 \(L_1\) via SDP

To recover the optimal solution of (4), Candès and Fernandez-Granda [6] considered the dual problem, i.e.,

$$\begin{aligned} \max _c \ \text{ Re } \langle b,c\rangle \quad \text{ s.t. } \quad \Vert \mathcal {F}_n^*c\Vert _\infty \le 1; \end{aligned}$$
(5)

the constraint says that the trigonometric polynomial \(\mathcal {F}_n^*c(t)=\sum _{|k|\le f_c} c_k e^{i2\pi kt}\) has a modulus uniformly bounded by 1 over the interval [0, 1]. As indicated in [6, Corollary 4.1], this constraint is equivalent to the existence of a Hermitian matrix \(Q\in \mathbb {C}^{n\times n}\) such that

$$\begin{aligned} \left[ \begin{array}{ll} Q &{} \quad c\\ c^* &{}\quad 1 \end{array} \right] \succeq 0, \quad \sum _{i=1}^{n-j}Q_{i,i+j}=\left\{ \begin{array}{ll} 1 &{}\quad j =0\\ 0 &{}\quad j = 1,2, \ldots , n-1. \end{array}\right. \end{aligned}$$
(6)

Therefore, the dual problem is equivalent to

$$\begin{aligned} \{\hat{c}, \hat{Q}\}=\arg \max _{c,Q} \text{ Re }\langle b,c\rangle \quad \text{ s.t. } \quad (6), \end{aligned}$$
(7)

which can be solved via SDP over the decision variables \(c\in \mathbb {C}^n, Q\in \mathbb {C}^{n\times n}\), in total \((n+1)^2/2\) variables. Once the optimal dual variables \(\hat{c}, \hat{Q}\) are obtained, a root-finding technique is used to retrieve a solution to the primal problem (4). In particular, the trigonometric polynomial,

$$\begin{aligned} p_{2n-2}(e^{i2\pi t}) = 1-|\mathcal {F}_n^*\hat{c}(t)|^2=1-\sum _{k=-2f_c}^{2f_c} u_k e^{i2\pi kt}, \quad u_k=\sum _j \hat{c}_j\bar{\hat{c}}_{j-k}, \end{aligned}$$
(8)

is a real-valued and nonnegative trigonometric polynomial by construction; and \(p_{2n-2}(e^{i 2\pi t})\) is either equal to zero everywhere or has at most \(n-1\) roots on the unit circle. Therefore, one simply locates the roots of \(p_{2n-2}\) on the unit circle in order to recover the support of the optimal solution to the \(L_1\) minimization (4); the amplitudes can then be estimated via least squares. The noise robustness of this algorithm was analyzed in a follow-up work [5].
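
As an illustration of this root-finding step, the sketch below (our own; it assumes the dual vector \(\hat{c}\) is stored with entries indexed by \(k=-f_c,\dots ,f_c\)) builds the coefficients \(u_k\) of (8) by autocorrelation and locates the roots of \(p_{2n-2}\) near the unit circle with numpy.roots; the tolerance and the rounding used to merge the numerically split double roots are heuristic.

```python
import numpy as np

def support_from_dual(c_hat, fc, tol=1e-4):
    """Recover candidate spike locations t in [0,1) from the SDP dual solution c_hat (cf. Eq. (8))."""
    # u_k = sum_j c_j * conj(c_{j-k}) for k = -2fc, ..., 2fc (numpy.correlate uses this convention)
    u = np.correlate(c_hat, c_hat, mode='full')
    # coefficients of z^{2fc} * p_{2n-2}(z) in descending powers of z
    coeffs = -u[::-1]
    coeffs[2 * fc] += 1.0                       # the constant term 1, shifted by z^{2fc}
    roots = np.roots(coeffs)
    on_circle = roots[np.abs(np.abs(roots) - 1.0) < tol]
    t = np.mod(np.angle(on_circle) / (2 * np.pi), 1.0)
    return np.unique(np.round(t, 6))            # merge double roots; grid indices are round(t * N)
```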

2.2 \(L_p\) via IRLS

The \(L_p\) quasi-norm is often used in CS as an alternative to \(L_1\) to approximate the \(L_0\) norm, see [7–9, 20, 38]. As it is nonconvex for \(p<1\), \(L_p\) minimization is generally NP-hard. In [20], the authors considered a smoothed \(L_p\) minimization, which is expressed as

$$\begin{aligned} \min \lambda \sum \limits _{j=1}^N (|x_j|^2+\epsilon ^2)^{p/2}+\frac{1}{2} \Vert Ax-b\Vert _2^2, \end{aligned}$$
(9)

for \(\epsilon >0\). Taking the gradient of (9) gives the first-order optimality condition,

$$\begin{aligned} \lambda \left[ \dfrac{p x_j}{(|x_j|^2+\epsilon ^2)^{1-p/2}}\right] _{1\le j\le N}+A^T(Ax-b)=0. \end{aligned}$$
(10)

Then an iterative scheme is formulated as

$$\begin{aligned} \left\{ \begin{array}{ll} x^{k+1} = \arg \min \frac{\lambda }{2}\sum \limits _{j=1}^N w_j^k |x_j|^2 +\frac{1}{2} \Vert Ax-b\Vert _2^2\\ w_j^{k+1} = p(|x_j^{k+1}|^2+\epsilon ^2)^{p/2-1}. \end{array}\right. \end{aligned}$$
(11)

Each x-subproblem in (11) amounts to a weighted least-squares equation,

$$\begin{aligned} (\lambda W^k + A^TA)x^{k+1} = A^Tb, \end{aligned}$$
(12)

where \(W^k\) is a diagonal matrix with diagonal elements \(\{w^k_j, j=1,\ldots ,N\}\). The parameter \(\epsilon \) should be chosen carefully so as to avoid local minima. The update for \(\epsilon \) in [20] is given as \(\epsilon ^{k+1}=\min \{\epsilon ^k,c\cdot r(x^{k+1})_{K+1}\}\), where \(c\in (0,1)\) is a constant, r(z) is the non-increasing rearrangement of the absolute values of \(z\in \mathbb R^N\) (so \(r(x^{k+1})_{K+1}\) is the \((K+1)\)-st largest magnitude), and K is the estimated sparsity of the vector x to be reconstructed. This method gives better results than the classical \(L_1\) approaches in the RIP and/or incoherent regimes, but it does not work as well for highly coherent CS, as observed in [21, 22, 40].
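
A minimal implementation of the IRLS iteration (11)–(12) with this \(\epsilon \) update might look as follows; this is our sketch with illustrative parameter values, written for a generic matrix A (the conjugate transpose reduces to the transpose when A is real), and it is not the reference code of [20].

```python
import numpy as np

def irls_lp(A, b, lam=1e-3, p=0.5, K=10, c=0.9, eps0=1.0, iters=100):
    """Smoothed L_p minimization (9) via IRLS, Eqs. (11)-(12), with the eps update of [20]."""
    N = A.shape[1]
    x = np.zeros(N)
    eps = eps0
    for _ in range(iters):
        w = p * (np.abs(x) ** 2 + eps ** 2) ** (p / 2 - 1)          # weights from (11)
        x = np.linalg.solve(lam * np.diag(w) + A.conj().T @ A,       # weighted least squares (12)
                            A.conj().T @ b)
        r = np.sort(np.abs(x))[::-1]                                 # non-increasing rearrangement
        eps = min(eps, c * r[K])                                     # (K+1)-st largest magnitude
    return x
```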

3 Nonconvex \(L_1\) Based Minimization via DCA

In this section, we describe a unified approach for solving two nonconvex \(L_1\) based minimization problems via a difference of convex algorithm (DCA) [28, 29]. The unconstrained minimization problem is formulated as follows,

$$\begin{aligned} \min F(x) := \lambda R(x)+\frac{1}{2} \Vert Ax-b\Vert _2^2, \end{aligned}$$
(13)

where R(x) is a regularization term, \(\lambda \) is a balancing parameter, and \(A=\mathcal {F}_n\). We consider two regularization terms:

$$\begin{aligned}&R_{L_{1-2}}(x)=\Vert x\Vert _1-\Vert x\Vert _2\qquad \quad \quad (L_{1-2}) \end{aligned}$$
(14)
$$\begin{aligned}&R_{CL_1}(x)= \sum _j\min \{|x_j|,\alpha \}, \qquad (CL_1) \end{aligned}$$
(15)

where \(\alpha \) in \(R_{CL_1}\) is a pre-defined parameter. A variant of C\(L_1\) is of the form \(\sum _j\min \{|x_j|/\alpha ,1\}\), referred to as a normalized capped \(L_1\) [27]. However, the normalized C\(L_1\) is computationally stiff in the super-resolution setting, whereas \(L_{1-2}\) is not and, moreover, is parameter-free.

The method of DCA decomposes \(F(x) = G(x)-H(x)\) where both G(x) and H(x) are convex. By linearizing H, we obtain an iterative scheme that starts with \(x^1\ne \mathbf {0}\),

$$\begin{aligned} \left\{ \begin{array}{l} y^{k} \in \partial H(x^k)\\ x^{k+1} ={\arg \min }_{x\in \mathbb {R}^N} G(x)-\big (H(x^k)+\langle y^k, x-x^k\rangle \big ), \end{array}\right. \end{aligned}$$
(16)

where \(y^k\) is a subgradient of H(x) at \(x^k\). The DC decomposition is

$$\begin{aligned} \left\{ \begin{array}{l} G_{L_{1-2}}(x) =\frac{1}{2}\Vert A\,x - b\Vert _2^2 + \lambda \Vert x\Vert _1 \\ H_{L_{1-2}}(x) = \lambda \Vert x\Vert _2, \end{array}\right. \left\{ \begin{array}{l} G_{CL_1}(x) =\frac{1}{2}\Vert A\,x - b\Vert _2^2 + \lambda \Vert x\Vert _1 \\ H_{CL_1}(x) = \lambda {\sum }_j\max (|x_j|-\alpha ,0), \end{array}\right. \end{aligned}$$
(17)

for \(L_{1-2}\) and C\(L_1\) respectively. Each subproblem in (16) amounts to an \(L_1\) regularized form

$$\begin{aligned} x^{k+1} =\arg \min _{x\in \mathbb {R}^N} \frac{1}{2}\Vert Ax-b\Vert _2^2 +\lambda \Vert x\Vert _1 -\langle y^k,x\rangle , \end{aligned}$$
(18)

where \(y_{L_{1-2}}^k=\lambda \frac{x^k}{\Vert x^k\Vert _2}\) for \(L_{1-2}\), and \(y_{CL_1}^k=\lambda \text{ sign }(x^k).*\max (|x^k|-\alpha ,0)\) for C\(L_1\) (here .* denotes elementwise multiplication). To solve (18), we consider the augmented Lagrangian

$$\begin{aligned} L_\delta (x,z,u) = \frac{1}{2}\Vert Ax-b\Vert _2^2 + \lambda \Vert z\Vert _1 -\langle y^k,x\rangle + \langle u, z-x\rangle + \frac{\delta }{2}\Vert x-z\Vert _2^2, \end{aligned}$$

where z is an auxiliary variable that enforces the constraint \(x=z\) through the dual variable u and the penalty parameter \(\delta >0\). ADMM iterates between minimizing \(L_{\delta }\) with respect to x and z, and updating the dual variable u. With the scaled dual variable \(u/\delta \) (still denoted u), the iterative scheme for solving the subproblem (18) goes as follows,

$$\begin{aligned} \left\{ \begin{array}{l} x_{l+1} = (A^TA+\delta I_d)^{-1}\big (A^Tb + y^k + \delta (z_l+u_l)\big )\\ z_{l+1} = \text{ shrink }(x_{l+1}-u_l,\lambda /\delta )\\ u_{l+1} = u_l + z_{l+1}-x_{l+1}, \end{array} \right. \end{aligned}$$
(19)

where the subscript l indexes the inner iterations and \(\text{ shrink }\) denotes componentwise soft-thresholding. Note that the matrix inversion \((A^TA+\delta I_d)^{-1}\) can be implemented efficiently by fast Fourier transforms, as A is the product of a sampling matrix and the Fourier matrix. The subproblem (18) is convex, and hence the iteration (19) is guaranteed to produce an optimal solution \(x_*\), which we take to be the solution of (18), i.e., \(x^{k+1} = x_*\).
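
Putting the pieces together, a bare-bones version of the DCA outer loop (16) with the ADMM inner loop (19) for the unconstrained \(L_{1-2}\) model might read as follows. This is our sketch: A is treated as a real dense matrix and the inverse is precomputed explicitly rather than applied via FFTs, the iteration counts and parameters are illustrative, and no stopping criteria are included.

```python
import numpy as np

def shrink(v, t):
    """Componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dca_l1_minus_l2(A, b, lam=1e-2, delta=1.0, outer=20, inner=200):
    """DCA (16) for the unconstrained L_{1-2} model (13), with ADMM (19) for each subproblem (18)."""
    N = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + delta * np.eye(N))         # dense inverse for clarity only
    x = np.zeros(N)
    for _ in range(outer):
        nrm = np.linalg.norm(x)
        y = lam * x / nrm if nrm > 0 else np.zeros(N)  # subgradient of H(x) = lam * ||x||_2
        z, u = x.copy(), np.zeros(N)
        for _ in range(inner):                         # ADMM inner loop (19)
            x = M @ (Atb + y + delta * (z + u))
            z = shrink(x - u, lam / delta)
            u = u + z - x
    return x
```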

For the constrained formulation,

$$\begin{aligned} \min R(x) \quad \text{ s.t. } \quad Ax=b, \end{aligned}$$
(20)

we apply a trick similar to that used in the unconstrained case and consider the following iterative scheme,

$$\begin{aligned} x^{k+1} = \arg \min _x \left\{ \Vert x\Vert _1 - \langle y^k, x\rangle \quad \text{ s.t. } \quad Ax = b\right\} . \end{aligned}$$
(21)

To solve (21), we introduce two dual variables u and v and define the augmented Lagrangian

$$\begin{aligned} L_\delta (x,z, u, v) = \Vert z\Vert _1 -\langle y^k,x\rangle + \langle u, x-z\rangle + \langle v, Ax-b \rangle + \frac{\delta }{2}\Vert x-z\Vert ^2 + \frac{\delta }{2}\Vert Ax-b\Vert ^2, \end{aligned}$$

where ADMM finds a saddle point \((x_*,z_*,u_*,v_*) \) satisfying

$$\begin{aligned} L_\delta (x_*,z_*,u,v) \leqslant L_\delta (x_*,z_*,u_*,v_*) \leqslant L_\delta (x,z,u_*,v_*) \quad \forall x,z,u,v. \end{aligned}$$

As a result, we take \(x^{k+1} = x_*\).

4 Theoretical Properties

In this section, we investigate the rank property, namely that the \(L_0\) norm of the reconstructed solution does not exceed n (the rank of A). First of all, we examine the probability of finding the exact solution over 100 random trials for \(L_{1-2}\), C\(L_1\) with \(\alpha =0.1\), and \(L_p\) with \(p=1/2\). The left panel of Fig. 1 illustrates that it is unlikely to find the exact solution when MSF is small (\(<0.8\)), which implies that multiple sparse vectors satisfying \(Ax=b\) do exist. On the other hand, we plot the probability of the rank property being satisfied in the right panel of Fig. 1, by counting how many times the \(L_0\) norm of the reconstructed solution is smaller than or equal to n. The results suggest that the rank property holds for both \(L_{1-2}\) and C\(L_1\) when MSF\(>1\) or when the \(L_1\) certificate holds. More importantly, in the worst case (MSF\(<0.8\)), \(L_{1-2}\) provides a sparse solution (sparsity \(\le n\)) while satisfying the constraint, which is the best one can do. It seems unlikely for \(L_p\) to have the rank property.

Fig. 1

Probability (%) of finding the exact solution (left) and of the rank property being satisfied (right) for both \(L_{1-2}\) and C\(L_1\) with \(\alpha =0.1\). There is over a 95% chance that the reconstructed solutions using \(L_{1-2}\) are \(n\)-sparse, even though the exact ground-truth solution is not found when MSF is small \((<0.8)\)

One deterministic result regarding the rank property is given in [40, Theorem 3.1]: there exists \(\lambda _n\) such that for any \(\lambda >\lambda _n:=\frac{\Vert A\Vert _2\Vert b\Vert }{\sqrt{n+1}-1}\), any stationary point of the unconstrained \(L_{1-2}\) minimization problem has at most n non-zero elements. In practice, we choose a \(\lambda \) much smaller than \(\lambda _n\), which usually yields a smaller residual and better recovery. As for C\(L_1\), we cannot derive such a result, since \(y_{CL_1}^k\) is not bounded a priori, and hence there is no upper bound on the sparsity in terms of \(\lambda \) for C\(L_1\). For the rest of this section, we give some theoretical analysis of the rank property that is independent of \(\lambda \).

4.1 Local Minimizers

It is shown in [40, Theorems 2.3–2.4] that any local minimizer of \(L_{1-2}\) has the rank property, as summarized in Theorem 2. With an additional assumption, we prove the rank property for C\(L_1\) in Theorem 3. Error bounds, holding with high probability, on the distance of a local minimizer of C\(L_1\) from the true solution are established in [37, 41] under sparse eigenvalue assumptions on the Hessian of the loss function (similar to RIP), which are unfortunately hard to verify.

Theorem 2

Suppose \(A\in \mathbb R^{n\times N}\) has full row rank. If \(x^*\) is a local minimizer of \(L_{1-2}\) in either the unconstrained (13) or the constrained (20) formulation, then the sparsity of \(x^*\) is at most n.

Theorem 3

Suppose \(A\in \mathbb R^{n\times N}\) has full row rank. If \(x^*\) is a local minimizer of C\(L_1\) in either the unconstrained (13) or the constrained (20) formulation and the objective function is not flat in a neighborhood of \(x^*\), then the sparsity of \(x^*\) is at most n.

Proof

We only provide the proof for the constrained case; the unconstrained version is almost the same. It suffices to show that the columns of \(A_{\varLambda ^*}\) are linearly independent, where \(\varLambda ^*\) denotes the support of \(x^*\). We argue by contradiction. Suppose there exists \(d\in \text{ ker }(A)\setminus \{0\}\) such that \(\text{ supp }(d)\subseteq \varLambda ^*\). Denote \(\varLambda ^*_{\alpha +} = \{j:|x_j^*|>\alpha \}\), \(\varLambda ^*_{\alpha -} = \{j:|x_j^*|<\alpha \}\), and \(\varLambda ^*_\alpha = \{j:|x_j^*|=\alpha \}\). For any fixed neighborhood of \(x^*\), we scale d so that

$$\begin{aligned} \left\{ \begin{array}{ll} |x_j^*\pm d_j|> \alpha &{}\quad j\in \varLambda ^*_{\alpha +}\\ |x_j^*\pm d_j|< \alpha &{}\quad j\in \varLambda ^*_{\alpha -}\\ \end{array} \right. \end{aligned}$$
(22)

Consider two feasible vectors in \(\mathbf {B}_r(x^*)\) (the ball of radius r centered at \(x^*\)): \(\hat{x}=x^*+d\) and \(\breve{x}=x^*-d\). Since \(\text{ supp }(d)\subseteq \varLambda ^*\) and \(d\in \text{ ker }(A)\), we have \(\text{ supp }(\hat{x})\subseteq \varLambda ^*\), \(\text{ supp }(\breve{x})\subseteq \varLambda ^*\), and \(A\hat{x} = A\breve{x}=Ax^*\). By analyzing \(R_{CL_1}(x^*)\) and \(R_{CL_1}(x^*\pm d)\), we get

$$\begin{aligned}&R_{CL_1}(x^*+d)+R_{CL_1}(x^*-d) -2 R_{CL_1}(x^*)\\&\quad = \sum _{j\in \varLambda ^*_{\alpha -}} \Big (|x_j^*+d_j|+|x_j^*-d_j|-2|x_j^*|\Big )+\sum _{j\in \varLambda ^*_{\alpha }} \Big (\min (|\alpha +d_j|,|\alpha -d_j|)-\alpha \Big ). \end{aligned}$$

The first term is zero for d sufficiently small, while the second term is negative if \(\varLambda ^*_{\alpha }\ne \emptyset \), so we have

$$\begin{aligned} R_{CL_1}(x^*) \ge \min \{R_{CL_1}(\hat{x}),R_{CL_1}(\breve{x})\}. \end{aligned}$$

As long as \(R_{CL_1}\) is not flat (i.e., constant) in \(\mathbf {B}_r(x^*)\), the above inequality is strict, which contradicts the assumption that \(x^*\) is a local minimizer in \(\mathbf {B}_r(x^*)\). \(\square \)

Remark

There are simple conditions ensuring that the objective function for C\(L_1\) is not constant near \(x^*\). For example, if the set \(\{j: -\alpha<x_j^*<0\}\) has a different cardinality from the set \(\{j:0<x_j^*<\alpha \}\), then \(R_{CL_1}(\hat{x} )\ne R_{CL_1}(\breve{x})\). Or if \(\varLambda ^*_{\alpha }\ne \emptyset \), then \(\sum _{j\in \varLambda ^*_{\alpha }} \Big (\min (|\alpha +d_j|,|\alpha -d_j|)-\alpha \Big )<0.\) In addition, the rank property of C\(L_1\) depends on \(\alpha \). If \(\alpha \) is small, then the set \(\varLambda ^*_{\alpha -}\) may be empty, and hence the rank property need not hold. Another interpretation is that if \(\alpha \) is too small, the problem is a small perturbation of the least-squares problem, in which sparsity is absent. If \(\alpha \) is too large, C\(L_1\) is no longer a good approximation of the \(L_0\) norm. Empirically, we find that an adaptive update of \(\alpha \) during the iterations works better than a fixed value; one advantage is that there is no need to tune this parameter. The analysis of adaptive \(\alpha \) is beyond the scope of this paper.

Applying the convergence properties of the general DCA studied in [29, 30] to C\(L_1\), we know that the limit point \(x^*\) is a local minimizer if no component of \(x^*\) is equal to \(\pm \alpha \). We numerically estimate the probability that the computed solution does not take the values \(\pm \alpha \), which implies that it is a local minimizer. For this purpose, we test 100 random sparse (ground-truth) vectors with Gaussian entries and 25 random choices of \(\alpha \) drawn uniformly from [0, 1], and count how many times the computed solution does not take the values \(\pm \alpha \). Finally, we plot the probability of the limit points being local minimizers in Fig. 2, which is nearly certain (\({\sim }99.6\%\)) at each MSF. The probabilities of exact recovery and of the rank property are also provided; they confirm that being a local minimizer does not by itself imply the rank property, consistent with the additional assumption in Theorem 3. Compared with Fig. 1, the rank property is more likely to hold for C\(L_1\) when \(\alpha \) is chosen randomly rather than fixed. This phenomenon again suggests that an adaptive \(\alpha \) may be better.

Fig. 2

Probability (%) of the computed solution of C\(L_1\) having no elements equal to \({\pm }\alpha \). The results are averaged over 100 random signals and 25 random choices of \(\alpha \) drawn from the uniform distribution on [0, 1] at each MSF

4.2 Second-Order Optimality Condition

By analyzing the second-order optimality condition, we show that either the stationary point \(x^*\) has at most n non-zero elements or there exists a vector in any neighborhood of \(x^*\) with a smaller objective value. We will need the following technical lemma.

Lemma 1

If \(\lambda < \min \{\frac{\Vert A^Tb\Vert _2}{\sqrt{N}+\Vert A\Vert ^2}, \frac{\Vert A^Tb\Vert _2}{\sqrt{N}+1} \}\), then any first-order stationary point \(x^*\in \mathbb R^N\) of the unconstrained \(L_{1-2}\) problem (13) satisfies \(\Vert x^*\Vert _2>\lambda \).

Proof

First, we show that \(x^*\) cannot be zero. Suppose \(x^* = 0\); then by the optimality condition,

$$\begin{aligned} \lambda (p^*-q^*) - A^Tb=0, \end{aligned}$$
(23)

where \(p^*\in \partial \Vert x^*\Vert _1\) is a subgradient of \(\Vert x\Vert _1\) at \(x^*\), and \(q^*\in \partial \Vert x^*\Vert _2\). When \(x^* = 0\), we have \(\Vert p^*\Vert _{\infty }\le 1\) and \(\Vert q^*\Vert _{2}\le 1\). By (23), we obtain

$$\begin{aligned} \Vert A^Tb\Vert _2=\lambda \Vert p^*-q^*\Vert _2\le \lambda (\sqrt{N}+1), \end{aligned}$$

or \(\lambda \ge \frac{\Vert A^Tb\Vert _2}{\sqrt{N}+1}\), which is a contradiction.

Therefore \(x^*\ne 0\), and

$$\begin{aligned} \lambda \left( p^*-\frac{x^*}{\Vert x^*\Vert _2}\right) +A^T(Ax^*-b)=0. \end{aligned}$$
(24)

It follows from (24) that

$$\begin{aligned} \Vert A\Vert ^2\Vert x^*\Vert _2&= \Vert A^TA\Vert \,\Vert x^*\Vert _2\ge \Vert A^TAx^*\Vert _2=\left\Vert -\lambda \left( p^* -\frac{x^*}{\Vert x^*\Vert _2}\right) +A^Tb\right\Vert _2 \nonumber \\&\ge \Vert A^Tb\Vert _2 - \lambda \left\Vert p^* -\frac{x^*}{\Vert x^*\Vert _2}\right\Vert _2. \end{aligned}$$
(25)

Let \(\varLambda ^*\) be the support of \(x^*\); then \(p_i^* = \text{ sign }(x_i^*)\) for \(i \in \varLambda ^*\), and \(|p_i^*|\le 1\) otherwise. So we have \(|p^* -\frac{x^*}{\Vert x^*\Vert _2}|_i<1\) for \(i \in \varLambda ^*\), and \(|p^* -\frac{x^*}{\Vert x^*\Vert _2}|_i\le 1\) otherwise. Using the assumption \(\lambda \le \frac{\Vert A^Tb\Vert _2}{\sqrt{N}+\Vert A\Vert ^2}\), it follows from (25) that

$$\begin{aligned} \Vert A\Vert ^2\Vert x^*\Vert _2 > \Vert A^Tb\Vert _2 - \lambda \sqrt{N}\ge \lambda \Vert A\Vert ^2, \end{aligned}$$

and thus \(\Vert x^*\Vert _2>\lambda \). \(\square \)

Theorem 4

Suppose \(\lambda < \min \{\frac{\Vert A^Tb\Vert _2}{\sqrt{N}+\Vert A\Vert ^2}, \frac{\Vert A^Tb\Vert _2}{\sqrt{N}+1} \}\). Let \(x^*\) be any limit point of the DCA \(L_{1-2}\) minimizing sequence. Then we have either \(\Vert x^*\Vert _0\le n\) (rank property) or there exists \(d\in \mathbb R^N\) such that \(F(x^*+d)< F(x^*)\) and \(x^* + d\) is sparser than \(x^*\).

Proof

Taking the difference of objective function values at \(x^*+d\) and \(x^*\), we get

$$\begin{aligned}&\frac{1}{\lambda }\Big (F(x^*+d)-F(x^*)\Big )\nonumber \\&\quad = \left( \sum _{j\in \varLambda ^*} \langle \text{ sign }(x^*_j),d_j\rangle + \sum _{j\in \varLambda ^*_c} |d_j| +\langle \frac{1}{\lambda }A^T(Ax^*-b)-\frac{x^*}{\Vert x^*\Vert _2}, d\rangle \right) \nonumber \\&\qquad +\,\frac{1}{2} d^T\left( \frac{1}{\lambda } A^TA-\frac{1}{\Vert x^*\Vert _2}I_d + \frac{x^*{x^*}^T}{\Vert x^*\Vert _2^3}\right) d + O(\Vert d\Vert _2^3). \end{aligned}$$
(26)

Note that \(x^*\) is a column vector in (26), and hence \(x^*{x^*}^T\) is a (rank-one) matrix. Since \(x^*\) is the limit point of the DCA sequence, it satisfies the first-order optimality condition (24). Therefore, the first term in (26) is equal to \(\sum _{j\in \varLambda _c^*} \big (|d_j|-\langle p^*_j, d_j\rangle \big )\), which is nonnegative since \(p^*_j\in [-1,1]\) for \(j\in \varLambda ^*_c\).

As for the Hessian matrix in (26), denoted as H, we have

$$\begin{aligned} H := \frac{1}{\lambda } A^TA-\frac{1}{\Vert x^*\Vert _2}I_d + \frac{x^*{x^*}^T}{\Vert x^*\Vert _2^3}=\frac{1}{\Vert x^*\Vert _2} \mathcal {F}^T\left( \frac{\Vert x^*\Vert _2}{\lambda } S_n^TS_n-I_d+ yy^T\right) \mathcal {F}, \end{aligned}$$

where \(y=\mathcal {F}x^*/\Vert x^*\Vert _2\) and \(A=S_n\mathcal {F}\) (\(S_n\) is a sampling matrix and \(\mathcal {F}\) is the Fourier matrix). Since \(S_n^TS_n\) is a diagonal matrix with entries either 1 or 0, the matrix \(D:=\frac{\Vert x^*\Vert _2}{\lambda } S_n^TS_n-I_d\) is also diagonal; its elements are \(\beta :=\frac{\Vert x^*\Vert _2}{\lambda }-1\) with multiplicity n, and \(-1\) with multiplicity \(N-n\). By Lemma 1, we have \(\beta >0\).

We want to analyze the eigenvalues of H, which is equivalent to analyzing a diagonal matrix D with rank-one perturbation \(yy^T\). Suppose u is an eigenvector of \(D+yy^T\) with corresponding eigenvalue \(\gamma \), then we have

$$\begin{aligned} (u^Ty)\cdot \left[ \begin{array}{c} y_n \\ y_{N-n} \end{array}\right] + \left[ \begin{array}{cc} (\beta -\gamma )I_n &{} 0 \\ 0 &{} (-1-\gamma )I_{N-n} \end{array}\right] \left[ \begin{array}{c} u_n \\ u_{N-n} \end{array}\right] =0, \end{aligned}$$
(27)

where \(y=[y_n,y_{N-n}]^T\) and \(u=[u_n,u_{N-n}]^T\). So the eigenvalues of \(D+yy^T\) are \(\beta \) with multiplicity \(n-1\), \(-1\) with multiplicity \(N-n-1\), and two more, denoted \(\gamma _1, \gamma _2\), satisfying \( \frac{\Vert y_n\Vert ^2}{\gamma -\beta } + \frac{\Vert y_{N-n}\Vert ^2}{\gamma +1}=1\), or \(\gamma ^2-\beta \gamma -(\beta +1)\Vert y_n\Vert ^2 =0, \) where we use \(\Vert y_n\Vert _2^2+\Vert y_{N-n}\Vert _2^2=1\). It follows from the quadratic formula that these two eigenvalues satisfy \(-1<\gamma _1<0<\beta <\gamma _2\) and \(\gamma _1+\gamma _2 = \beta \).
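
This eigenvalue structure is easy to verify numerically. The snippet below (ours, with small dimensions chosen arbitrarily) compares the spectrum of \(D+yy^T\) with \(\beta \), \(-1\), and the two roots of \(\gamma ^2-\beta \gamma -(\beta +1)\Vert y_n\Vert ^2=0\).

```python
import numpy as np

# check the spectrum of D + y y^T: beta (multiplicity n-1), -1 (multiplicity N-n-1), plus gamma_1, gamma_2
N, n, beta = 12, 5, 2.0
rng = np.random.default_rng(0)
y = rng.standard_normal(N)
y /= np.linalg.norm(y)                                       # unit vector, so ||y_n||^2 + ||y_{N-n}||^2 = 1
D = np.diag(np.concatenate([beta * np.ones(n), -np.ones(N - n)]))
eigs = np.sort(np.linalg.eigvalsh(D + np.outer(y, y)))
yn2 = np.sum(y[:n] ** 2)
g1, g2 = np.sort(np.roots([1.0, -beta, -(beta + 1) * yn2]))  # roots of the quadratic above
print(eigs)          # contains beta four times, -1 six times, and g1, g2
print(g1, g2)        # g1 in (-1, 0), g2 > beta, and g1 + g2 = beta
```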

Now we discuss eigenvectors and diagonalization. Each eigenvector for \(\beta \) has the form \([u_n,0]\) with \(u_n^Ty_n=0\). Denote by \(U_n\) a matrix whose columns are eigenvectors corresponding to \(\beta \); after Gram–Schmidt orthogonalization we may assume the columns of \(U_n\) are orthonormal. Similarly, each eigenvector for \(-1\) has the form \([0,u_{N-n}]\) with \(u_{N-n}^Ty_{N-n}=0\), and we denote by \(U_{N-n}\) a matrix with orthonormal columns composed of the corresponding eigenvectors. Therefore, an orthogonal matrix U that diagonalizes \(D+yy^T\) (and hence \(\mathcal {F}^TU\) diagonalizes H) can be expressed as

$$\begin{aligned} U=\left[ \begin{array}{cccc} U_n &{} 0 &{} \dfrac{y_n}{(\gamma _1-\beta )\alpha _1} &{} \dfrac{y_n}{(\gamma _2-\beta )\alpha _2}\\ &{} &{} &{} \\ 0 &{} U_{N-n} &{}\dfrac{y_{N-n}}{(\gamma _1+1)\alpha _1} &{} \dfrac{y_{N-n}}{(\gamma _2+1)\alpha _2} \end{array} \right] , \end{aligned}$$
(28)

where \(\alpha _1,\alpha _2\) are normalizing factors.

For any \(d\in \mathbb R^N\), we can decompose \(d=d^k+d^r,\) where \(d^k\in \text{ ker }(A)\) and \(d^r\in \text{ range }(A^T)\), and hence \(\mathcal {F}d^k, \mathcal {F}d^r\) have the forms \([0,g_{N-n}]^T, [g_n,0]^T\), respectively. Denote \(s_1:=\langle y_n, g_n\rangle , \ s_2:=\langle y_{N-n},g_{N-n}\rangle \). One can show that \(s_1,s_2\) are real numbers and \(s_1+s_2=\langle d, x^*\rangle /\Vert x^*\Vert _2\). Then, after a tedious calculation, (26) reduces to

$$\begin{aligned} F(x^*+d)-F(x^*)&= \lambda \sum _{j\in \varLambda _c^*}\Big (|d_j|-\langle p^*_j, d_j\rangle \Big )\nonumber \\&\quad +\,\beta \Vert U^T_ng_n\Vert ^2-\Vert U^T_{N-n}g_{N-n}\Vert ^2+ P_1s_1^2+P_2s_1s_2+N_0s_2^2, \end{aligned}$$
(29)

where \(P_1,P_2>0\) and \(N_0<0\) are constants independent of d.

If \(\Vert x^*\Vert _0>n\), then the columns of \(A_{\varLambda ^*}\) are linearly dependent, and hence there exists \(d\in \text{ ker }(A)\setminus \{0\}\) such that \(d\in S_{\varLambda ^*}\), where \(S_{\varLambda ^*}=\{x: \text{ supp }(x)\subseteq \varLambda ^*\}\). As \(d\in \text{ ker }(A)\), \(\mathcal {F} d=[0,g_{N-n}]^T\), i.e., \(g_n=0\) and \(s_1=0\). Since \(g_{N-n}\ne 0\), we have \(F(x^*+d)-F(x^*)<0\), i.e., \(x^*+d\) has a smaller objective value than \(x^*\). In addition, we can scale d to cancel one non-zero element of \(x^*\), and hence \(x^*+d\) is sparser than \(x^*\). \(\square \)

4.3 Stationary Points

It is shown in [21, 40] that any limit point of the DCA sequence is a stationary point; in Theorem 5, we give a tighter result, which states that the limit point is d-stationary rather than merely stationary. These stationarity concepts are related as follows: the set of local minimizers is contained in the set of d-stationary points, which in turn is contained in the set of stationary points. As we often observe that limit points of DCA are sparse (see Fig. 1), it is plausible that any d-stationary point has the rank property, which we leave to future work. We first give the definition of d-stationary points [16], and then prove that the DCA sequence converges to a d-stationary point in Theorem 5.

Definition 2

(D-stationary) We say that a vector \(\hat{x}\in X\) is a d(irectional) stationary point of the minimization of a function F(x), or in short, d-stationary, if

$$\begin{aligned} F'(\hat{x}; x-\hat{x}) \ge 0, \quad \forall x\in X, \end{aligned}$$

where the directional derivative is defined as one-sided derivative

$$\begin{aligned} F'(x;d):=\lim _{\tau \searrow 0}\frac{F(x+\tau d)-F(x)}{\tau }. \end{aligned}$$
(30)

For example, directional derivatives of \(L_1\) and \(L_2\) norms at \(x^*\) are

$$\begin{aligned} \Vert \cdot \Vert _1'(x^*;d) = \sum _{j\in \varLambda ^*} \langle \text{ sign }(x^*_j),d_j\rangle + \sum _{j\notin \varLambda ^*} |d_j|, \end{aligned}$$
(31)

and

$$\begin{aligned} \Vert \cdot \Vert _2'(x^*;d) = \left\{ \begin{array}{ll} \left\langle \frac{x^*}{\Vert x^*\Vert _2},d \right\rangle &{}\quad \text{ if } \ x^*\ne 0\\ \Vert d\Vert _2 &{}\quad \text{ if } \ x^*=0. \end{array}\right. \end{aligned}$$
(32)

As a result, the directional derivative of F(x) for \(L_{1-2}\) can be expressed as

$$\begin{aligned} F'(x^*;d) = \lambda \sum _{j\in \varLambda ^*} \langle \text{ sign }(x^*_j),d_j\rangle + \lambda \sum _{j\in \varLambda _c^*} |d_j|-\lambda \left\langle \frac{x^*}{\Vert x^*\Vert _2},d\right\rangle +\langle A^T(Ax^*-b), d\rangle , \end{aligned}$$
(33)

for \(x^*\ne 0\).
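
Formula (33) can be checked against a one-sided finite difference. The snippet below (ours, with random data) evaluates both expressions at a sparse point \(x^*\ne 0\) and a random direction d; the two printed numbers should agree closely.

```python
import numpy as np

# finite-difference check of (33) for F(x) = lam*(||x||_1 - ||x||_2) + 0.5*||Ax - b||^2
rng = np.random.default_rng(1)
n, N, lam = 5, 12, 0.1
A, b = rng.standard_normal((n, N)), rng.standard_normal(n)
x = np.zeros(N)
x[[2, 7]] = [1.0, -0.5]                                  # sparse point with support {2, 7}
d = rng.standard_normal(N)
F = lambda v: lam * (np.abs(v).sum() - np.linalg.norm(v)) + 0.5 * np.linalg.norm(A @ v - b) ** 2
supp = np.abs(x) > 0
formula = (lam * np.sign(x[supp]) @ d[supp] + lam * np.abs(d[~supp]).sum()
           - lam * (x / np.linalg.norm(x)) @ d + (A.T @ (A @ x - b)) @ d)
tau = 1e-7
print(formula, (F(x + tau * d) - F(x)) / tau)
```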

Theorem 5

Let \(\{x^k\}\) be the sequence of iterates generated by DCA (16), or the DCA sequence in short, for \(L_{1-2}\). Then any limit point \(x^*\) of \(\{x^k\}\) is a d-stationary point of F(x) defined in (13) with \(R(x)=R_{L_{1-2}}(x)\).

Proof

In [40], the DCA sequence \(\{x^k\}\) was shown to converge (along a subsequence) to a stationary point \(x^*\). We now prove that all the iterates (except for the initial one) and the limit point \(x^*\) are non-zero. Assume that the initial point is \(x^0=0\). Then it follows from (16) that

$$\begin{aligned} F(x^1)&= \lambda (\Vert x^1\Vert _1 - \Vert x^1\Vert _2 ) + \frac{1}{2} \Vert Ax^1 - b\Vert ^2 \end{aligned}$$
(34)
$$\begin{aligned}&\le \lambda \Vert x^1\Vert _1 + \frac{1}{2} \Vert Ax^1 - b\Vert ^2 \le \frac{1}{2} \Vert b\Vert ^2 = F(0). \end{aligned}$$
(35)

The last inequality is strict unless zero is a global minimizer of the \(L_1\) subproblem (which generically it is not). Therefore, the nonzero property is maintained during the descent iterations of DCA, i.e., \(x^k\ne 0\) for all \(k\ge 1\). In addition, \(x^* \not = 0\) since \(F(x^*) < F(0)\).

As \(x^*\) satisfies the first-order optimality condition (24), we can simplify the directional derivative \(F'(x^*;d)\) given in (33):

$$\begin{aligned} F'(x^*;d)&= \lambda \sum _{j\in \varLambda ^*} \langle \text{ sign }(x^*_j),d_j\rangle + \lambda \sum _{j\in \varLambda _c^*} |d_j|-\lambda \langle p^*,d\rangle \end{aligned}$$
(36)
$$\begin{aligned}&= \lambda \sum _{j\in \varLambda _c^*} \Big (|d_j|-\langle p^*_j, d_j\rangle \Big ), \end{aligned}$$
(37)

where \(p^*\in \partial \Vert x^*\Vert _1\). Since \(p_j^*\in [-1,1]\) for \(j\in \varLambda _c^*\), we have \(|d_j|-\langle p^*_j,d_j\rangle \ge 0\), and hence \(F'(x^*;d)\ge 0\) for all d, which means that \(x^*\ne 0\) is a d-stationary point. \(\square \)

5 Experimental Results

We numerically demonstrate that the proposed \(L_{1-2}\) and C\(L_1\) via DCA can recover signals beyond the Rayleigh length. Two existing methods, \(L_1\) via SDP [6] and \(L_{1/2}\) via IRLS [20], serve as benchmarks. For 2D image super-resolution, it is computationally expensive to solve the \(L_1\) minimization via SDP, so we adopt the ADMM approach instead.

5.1 One-Dimensional Signal Super-Resolution

We consider a 1000-dimensional sparse signal (the ground truth) \(x_g\) with MS = 20. We vary \(f_c\) from 31 to 60, thus \(\hbox {MSF}:=\varDelta (T) \cdot f_c/N=\hbox {MS}\cdot f_c/N=0.62:0.02:1.2\). Denote by \(x^*\) the signal reconstructed by any of the methods: SDP, and constrained and unconstrained \(L_{1-2}\). In Fig. 3, we plot the residual (\(\Vert Ax^*-b\Vert /\Vert b\Vert \)) and the relative reconstruction error (\(\Vert x^*-x_g\Vert /\Vert x_g\Vert \)) on a \(\log _{10}\) scale. As MSF decreases towards and below 1, \(L_{1-2}\) with DCA maintains fidelity (the constraints) much better, even in the unconstrained formulation. More importantly, we observe smaller relative errors for \(L_{1-2}\) than for SDP when MSF\(<1\), where a unique sparse solution is not guaranteed. The \(L_{1-2}\) approaches pick one solution from a solution pool, while errors in SDP's root finding tend to violate the fidelity \(Ax=b\). For MSF\(>1\), where the \(L_1\) certificate holds, \(L_{1-2}\) seems not as good as SDP in terms of reconstruction errors, which is due to the stopping conditions; on the other hand, relative errors on the order of \(10^{-5}\) are accurate enough.

Fig. 3

Error analysis of residuals (left) and relative reconstruction errors (right) on a \(\log _{10}\) scale. The underlying signal in this case has \(N=1000\) and \(\hbox {MS}=20\)

In Figs. 4 and 5, we examine one particular ground-truth vector and choose \(f_c\) to give \(\hbox {MSF}=0.4\) and 0.8, respectively, when all the methods fail to recover the ground truth. Figure 4 shows that the reconstructed solutions are sparser than the ground truth when \(\hbox {MSF}=0.4\), which is consistent with our intuition as discussed in Sect. 4. We see in Fig. 5 that the solution of C\(L_1\) is not sparse when \(\alpha =0.1\) (the result becomes sparse if we increase \(\alpha \) to 0.25). For the results in Fig. 5 (\(\hbox {MSF}=0.8\)), we observe “peak splitting”, i.e., all the methods miss one single peak in the ground truth and instead recover two nearby peaks, marked by blue squares. In addition, a peak shift, circled in green, is probably attributable to both peak splitting and peak merging, as there is a tiny peak to the left of the green circle in the ground-truth vector. Since there is no certificate guarantee in this regime, these are acceptable solutions provided the tolerance on the residual is satisfied. The large residual of SDP is clearly due to shifted peak locations, and there are very few small peaks in the SDP solution. In contrast, some small peaks appear in \(L_{1-2}\), which may be the cost of satisfying the constraint. No matter how peaks split or merge, the reconstructed solutions of \(L_{1-2}\) are sparse.

Fig. 4

Reconstruction comparison for \(\hbox {MSF}=0.4\), from top to bottom: ground truth, \(L_1\) via SDP (residual \({\sim } 10^{-1.6}\)), \(L_{1-2}\) (residual \({\sim } 10^{-7.7}\)), and C\(L_1\) (residual \({\sim } 10^{-5.0}\)). All the reconstruction methods yield vectors that are sparser than the ground truth

Fig. 5

Reconstruction comparison for \(\hbox {MSF}=0.8\), from top to bottom: ground-truth, \(L_1\) via SDP (residual \({\sim } 10^{-1.3}\)), \(L_{1-2}\) (residual \({\sim } 10^{-7.7}\)), and C\(L_1\) (residual \({\sim } 10^{-4.9}\))

We now look at success rates in two tests, with fixed MS and fixed \(f_c\), respectively. In the first case, we consider 100 random realizations of the same setting as used for Fig. 3 and compute the success rates of the three methods. An instance (a reconstructed signal \(x^*\)) is labeled “successful” if \(\Vert x^*-x_g\Vert /\Vert x_g\Vert <1.5e-3\) and \(\Vert Ax^*-b\Vert /\Vert b\Vert <5e-4\). In the second case, we fix \(f_c =20\), generate a sparse signal with \(\hbox {MS} = \hbox {MSF}\cdot N/f_c\), and repeat 100 random realizations to compute success rates. Both plots in Fig. 6 show clear advantages of \(L_{1-2}\) and C\(L_1\) over SDP when \(\hbox {MSF}<1\).
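
For reproducibility, the signal generation and the success criterion used here can be sketched as follows (our own helper functions; the jitter construction only guarantees a separation of at least MS, and one can pin one pair of spikes at exactly MS to hit a target MSF).

```python
import numpy as np

def random_spikes(N=1000, MS=20, K=25, rng=None):
    """Sparse ground truth whose support has wrap-around separation >= MS (requires K <= N // MS)."""
    rng = np.random.default_rng() if rng is None else rng
    S = N // K                                                     # nominal spacing between spikes
    locs = np.arange(K) * S + rng.integers(0, S - MS + 1, size=K)  # jitter keeps all gaps >= MS
    x = np.zeros(N)
    x[locs] = rng.standard_normal(K)
    return x

def is_success(x_rec, x_true, A, b):
    """Success criterion of Sect. 5.1: small relative error and small residual."""
    return (np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true) < 1.5e-3
            and np.linalg.norm(A @ x_rec - b) / np.linalg.norm(b) < 5e-4)
```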

Fig. 6

Success rates (%) of fixed \(\hbox {MS}=20\) (left) and fixed \(f_c=20\) (right) when \(N=1000\)

We examine the scalability of the algorithms for \(N=1000, 2000, 4000\), while keeping the SRF fixed, specifically \(N/f_c=50\). Roughly speaking, all the algorithms are scalable to some extent, as illustrated in Fig. 7. For SDP, the smaller N is, the smaller the MSF at which exact recovery is observed. As for \(L_{1-2}\) and C\(L_1\), the success rates diminish as N increases, which we attribute to the curse of dimensionality: when N is large, the iterative DCA scheme either does not converge (in the inner and/or outer loops) or takes too long to converge, so we have to stop early and obtain less accurate results within a reasonable amount of time.

Fig. 7

Scalability of the algorithms: \(N = 2000\) (top) and 4000 (bottom) when \(f_c=20\). The case \(N=1000\) is plotted on the right of Fig. 6

5.2 Two-Dimensional Image Super-Resolution

We present super-resolution results for 2D images. There are two types of minimum separation. Definition 1 [6] uses the \(L_{\infty }\) norm to measure the distance, while another definition, called Rayleigh regularity (RR) [12, 25], is given below.

Definition 3

(Rayleigh regularity) Fix N, n, and set \(\lambda _c = 1/f_c=2/(n-1)\). We say that the set of points \(\mathcal {T}\subset \{0, 1/N, \ldots , 1-1/N\}^d\subset \mathbb T^d\) is Rayleigh regular with parameters (t, r) and write \(\mathcal {T}\in \mathcal {R}_d(t,r;N,n)\) if it may be partitioned as \(\mathcal {T}=\mathcal {T}_1 \cup \cdots \cup \mathcal {T}_r\), where the \(\mathcal {T}_i\)'s are disjoint and each obeys a minimum separation constraint, that is, for all square subsets \(\mathcal {D}\subset \mathbb T^d\) of side length \(t\lambda _c/2\), \(|\mathcal {T}_i\cap \mathcal {D}|\le 1\).

Images under these two definitions are illustrated in Fig. 8, where Definition 3 (RR) allows more spikes than Definition 1 and is thus more challenging for image super-resolution. The theoretical guarantee is studied in [6] for MS and in [25] for RR, with the additional assumption of positive sources (the spikes have positive values). In both papers, the theoretical guarantee requires \(\hbox {MSF}\ge 2.38\), while we observe empirically that exact recovery via \(L_1\) minimization occurs at \(\hbox {MSF} = 1.7\). The problem setting is as follows: the image is of size \(100\times 100\), \(\hbox {MS}=10\), and \(f_c=4:20\), thus yielding \(\hbox {MSF}= 0.4:0.1:2\). We examine three regularization terms: the traditional \(L_1\), \(L_{1-2}\), and C\(L_1\), all in a constrained formulation. The success rates for these two cases are presented in Fig. 8. When MSF is below 1.7, the success rates for MS are much higher than those for RR. Both plots illustrate that \(L_{1-2}\) and C\(L_1\) (\(\alpha = 0.1\)) have advantages over \(L_1\) (solved by ADMM). Note that C\(L_1\) is better than \(L_{1-2}\) in the right plot of Fig. 8, i.e., in the 2D examples where the sources take only positive values.

Fig. 8

Image examples shown on the top row: positive spikes of size \(100\times 100\) and \(\hbox {MS}=10\), which is defined differently in [6] (left) and [25] (right). The corresponding success rates (%) of \(L_1\), \(L_{1-2}\), and C\(L_1\), all in a constrained formulation, are plotted on the bottom. An exact recovery via \(L_1\) minimization occurs at \(\hbox {MSF} = 1.7\) for both MS definitions

Finally, we show an image super-resolution example to illustrate the visual difference. We look at a particular point-source image similar to the upper right plot of Fig. 8, reminiscent of stars in a clear night sky. The image is of size \(100\times 100\) with \(\hbox {MS}=10\) in the RR sense, and we only take \(15\times 15\) (\(f_c=7\)) low-frequency measurements, thus yielding \(\hbox {MSF}=0.7\). Applying the inverse FFT to the zero-padded frequency data yields a very blurry reconstruction, as shown on the upper left of Fig. 9. All the \(L_1\) variants (\(L_1\), capped \(L_1\), and \(L_{1-2}\)) result in much sparser and clearer reconstructions. For a better visual comparison, we plot the error maps of the reconstructed results with respect to the ground truth for the \(L_1\)-type methods. All of them locate the spikes relatively well; \(L_1\) also picks up some neighboring pixels, and \(L_{1-2}\) has the smallest error.

Fig. 9

A particular example at \(\hbox {MSF} = 0.7\) using the RR definition (the top right plot of Fig. 8). a A reconstruction via a direct inverse FFT; b–d the differences of the reconstructed solutions using \(L_1\), capped \(L_1\), and \(L_{1-2}\) from the ground-truth image, with intensity window \([-0.1, 0.1]\). The root-mean-square errors for these results are 0.004 (\(L_1\)), 0.001 (capped \(L_1\)), and 0.0005 (\(L_{1-2}\))

6 Conclusions

We presented local properties (local minimizers, d-stationary points, and a sparsity upper bound) of computed solutions minimizing the \(L_{1-2}\) and capped-\(L_1\) penalties for super-resolution of point sources. With high probability, the limit point of the DCA capped-\(L_1\) algorithm is a local minimizer and is sparse if the intrinsic parameter of the capped-\(L_1\) penalty is suitably chosen. The limit point of the DCA \(L_{1-2}\) algorithm is a directional stationary point, which is observed numerically to have sparsity bounded by the rank of the sensing matrix with high probability. In numerical experiments in one and two dimensions, the two non-convex penalties produced better solutions, either in the relative accuracy of ground-truth recovery or in finding a sparse solution while maintaining the measurement constraints, when the peak distance of the sparse solution is below the Rayleigh length (the classical barrier).