
In the literature, the problem of compressed sensing in the presence of side information has been studied. In most cases, however, the side information is about the source itself, e.g., its structure or probability distribution. In this chapter, the problem of compressed sensing in the presence of side information about the feasible region is reviewed. We follow an approach similar to [1] to formulate the problem mathematically for a wider class of problems. Next, it is shown that the uniqueness and stability results of CS still hold in this formulation. Finally, an efficient recovery algorithm that incorporates the side information is derived.

3.1 Formulation

Consider the general compressed sensing problem (1.3). Assume null(A) satisfies the spherical section property with parameter \(\Delta \); consequently, Theorems 2.3.1 and 2.3.2 hold for this problem. From linear algebra, if \(\mathbf{s}_1\) is a particular solution of the system \(A\mathbf{s }=\mathbf{y }\), then the feasible region of the optimization problem (1.3) is:

$$\begin{aligned} F=\{\mathbf{s }_1+\mathbf{s }_2 \mid A\mathbf{s }_2=0\}=\{\mathbf{s }_1\}+\mathrm{null}(A). \end{aligned}$$
(3.1)

In the current work we adopt FISTA as the reconstruction algorithm. To find the unique solution, we start from an initial point and then search within the feasible region of (2.34). The size of this region is governed by the dimension of \(\mathrm{null}(A)\), which equals \(n-\mathrm{rank}(A)=n-m\), and intuitively we expect that the larger this space is, the harder the optimization problem is to solve. In other words, when the feasible region is small, (2.34) converges faster to the solution of (1.4). Thus, any side information about the feasible region is helpful.

Now consider cases in which we have side information about the feasible region. For instance, in derivative compressed sensing (DCS) [1], where the source signal is a gradient field, the side information results in the condition \(B\mathbf{s }=0\) on the source signal, where \(B\in {\mathbb{R }}^{\frac{n}{2}\times n}\) follows from an inherent property of gradient fields (this special case is discussed in more detail in Chap. 4). A more general case arises when the side information has the form:

$$\begin{aligned} B\mathbf{s }=\mathbf{b }, \end{aligned}$$
(3.2)

where \(B \in {\mathbb{R }}^{m^{\prime }\,{\times }\,n}\) is a full-rank matrix. Many constraints on a source can be formulated as (3.2). In such cases we have two types of information about the source. We call the first type the primary information, which is obtained through the measurements (\(A\mathbf{s }=\mathbf{y }\)). The secondary information comes from an inherent property of the source. Broadly speaking, we can assume that we have a general inverse problem (\(B\mathbf{s }=\mathbf{b }\)) together with a sparsity prior on the source, and we then apply CS as a regularization method to this problem. Several problems in image and signal processing, such as image super-resolution, image inpainting, and medical imaging, can be modeled in this framework. We expect that incorporating this side information improves CS reconstruction; for instance, we may be able to decrease the number of measurements needed to recover the source with similar accuracy, or gain some robustness to noise when the measurements are noisy.

Let \(A^{\prime }=\left[ \begin{array}{l}A\\ B\end{array}\right] \) (\(A^{\prime }\in {\mathbb{R }}^{(m+m^{\prime }) \times n}\) and full rank) and \(\mathbf{y }^{\prime }=\left[ \begin{array}{l}\mathbf{y }\\ \mathbf{b }\end{array}\right] \) (\(\mathbf{y }^{\prime }\in {\mathbb{R }}^{(m+m^{\prime })}\)). We then have the following equivalent problem:

$$\begin{aligned} \hat{\mathbf{s }}=\underset{\mathbf{s }}{\arg \min }{\Vert \mathbf{s }\Vert _0} \quad \text{ s.t. }\quad \mathbf{y }^{\prime }=A^{\prime } \mathbf{s }. \end{aligned}$$
(3.3)

We assume \(m+m^{\prime } \le n\), so that the system remains underdetermined and the new problem is exactly in the form of a CS problem. Now the questions are: Does this problem have a unique solution? Can we still replace the \(l_0\)-norm with the \(l_1\)-norm? To answer these questions, the new sensing matrix \(A^{\prime }\) must be studied. In the next section we show that the answer to both questions is positive.
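To make this construction concrete, the following minimal sketch (Python/NumPy) forms the augmented system of (3.3) by stacking the measurements and the side information; the dimensions, the random sensing matrix, and the placeholder source are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative dimensions (assumptions): m + m' <= n keeps the system underdetermined.
n, m, m_prime = 1024, 300, 25
rng = np.random.default_rng(0)

A = rng.standard_normal((m, n)) / np.sqrt(m)          # primary sensing matrix
B = np.zeros((m_prime, n))                             # hypothetical side-information matrix
B[np.arange(m_prime), rng.choice(n, m_prime, replace=False)] = 1.0

s = np.zeros(n)                                        # placeholder sparse source
s[rng.choice(n, 100, replace=False)] = rng.standard_normal(100) * 10

y = A @ s                                              # primary information   A s = y
b = B @ s                                              # secondary information B s = b

# Augmented problem (3.3): y' = A' s with A' = [A; B] and y' = [y; b].
A_prime = np.vstack([A, B])
y_prime = np.concatenate([y, b])
assert A_prime.shape == (m + m_prime, n) and m + m_prime <= n
```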

3.2 Uniqueness and Stability

Consider the optimization problem (3.3). Our assumption is that \(\mathrm{null}(A)\) has the \(\Delta \)-spherical section property, so (1.3) has a unique solution \(\hat{\mathbf{s }}\) and the \(l_1\)-\(l_0\) equivalence holds. We show that adding the secondary condition does not violate uniqueness and, furthermore, that the solutions of (1.3) and (3.3) coincide.

Lemma 1

The problem (3.3) has a unique solution \(\hat{\mathbf{s }}\), equal to the solution of (1.3). Furthermore, the \(l_1\)-\(l_0\) equivalence holds for this problem, i.e.:

$$\begin{aligned} \hat{{\varvec{s}}}=\underset{{\varvec{s}}}{\arg \min }{\Vert {{\varvec{s}}}\Vert _1} \qquad {\text{ s.t. }} \qquad {{\varvec{y}}}^{\prime }=A^{\prime } {{\varvec{s}}}. \end{aligned}$$
(3.4)

Proof 3

The proof is simple. First we show that \(A^{\prime }\) has the spherical section property with \(\Delta ^{\prime }=(1+\frac{m^{\prime }}{m})\Delta \). Note that \(\dim (\mathrm{null}(A^{\prime }))=n-(m+m^{\prime })\) and:

$$\begin{aligned} A^{\prime }{{\varvec{s}}}=0\rightarrow \left\{ \begin{array}{l} A{{\varvec{s}}}=0\\ B{{\varvec{s}}}=0\end{array}\right. \rightarrow {\mathrm{{null(A^{\prime })}}}={\mathrm{null(A)}}\cap {\mathrm{null(B)}}, \end{aligned}$$
(3.5)

thus \(\mathrm{null}(A^{\prime })\subset \mathrm{null}(A)\). Consequently, \(\forall \mathbf{s }\in \mathrm{null}(A^{\prime })\subset \mathrm{null}(A)\):

$$\begin{aligned} \frac{\Vert {{\varvec{s}}}\Vert _1}{\Vert {{\varvec{s}}}\Vert _2}\ge \sqrt{\frac{m}{\Delta }}=\sqrt{\frac{m+m^{\prime }}{\Delta (1+\frac{m^{\prime }}{m})}}, \end{aligned}$$
(3.6)

so, according to the definition of the SSP, \(\mathrm{null}(A^{\prime })\) has the spherical section property with \(\Delta ^{\prime }=(1+\frac{m^{\prime }}{m})\Delta \).

By assumption, \(\hat{\mathbf{s }}\) is the solution of (1.3) and also satisfies \(B\mathbf{s }=\mathbf{b }\), so \(\hat{\mathbf{s }}\) lies in the feasible region of (3.3). According to Theorem 2.3.1, \(\hat{\mathbf{s }}\) is the unique solution of (3.3) if \(\Vert {\hat{\mathbf{s }}}\Vert _0\le \frac{m+m^{\prime }}{2\Delta ^{\prime }}\). Note that \(\Vert {\hat{\mathbf{s }}}\Vert _0\le \frac{m}{2\Delta }\) since it is the unique solution of (1.3); then, from Theorem 2.3.1:

$$\begin{aligned} \Vert {\hat{{\varvec{s}}}}\Vert _0\le \frac{m}{2\Delta }= \frac{m(m+m^{\prime })}{2\Delta (m+m^{\prime })}= \frac{m+m^{\prime }}{2\Delta (1+\frac{m^{\prime }}{m})}= \frac{m+m^{\prime }}{2\Delta ^{\prime }}, \end{aligned}$$
(3.7)

which concludes the proof. Similarly, we can conclude that if the original primary CS problem has the \(l_1\)-\(l_0\) equivalence of Theorem 2.3.2, then (3.3) inherits this property. Also note that this unique solution satisfies \(A{{\varvec{s}}}={{\varvec{y}}}\) and thus equals the solution of (1.3).

This lemma states that we can add any side information of the form (3.2) to our problem without making the situation worse. This result is intuitive and expected, but the lemma also gives it a mathematical justification. For source reconstruction we could use any general-purpose CS reconstruction algorithm to find the unique solution of (3.3); however, this may not be efficient enough. In the next section, a more efficient algorithm is proposed to solve (3.3).

3.3 Numerical Solution Algorithm

As explained, the problem formulated in Sect. 3.1 can be written as (3.3), and we expect some improvement if we exploit the side information. In this section an efficient algorithm is derived for solving this problem. When Lemma 1 holds, (3.3) can be equivalently formulated as follows:

$$\begin{aligned} \hat{\mathbf{s }}=\underset{\mathbf{s }}{\arg \min }{\mu \Vert \mathbf{s }\Vert _1+\Vert A\mathbf{s }-\mathbf{y }\Vert ^2_2}\quad {\text{ s.t. }}\quad B\mathbf{s }=\mathbf{b }, \end{aligned}$$
(3.8)

so we now have our original CS reconstruction problem constrained by the side information. To solve optimization problems of this form, one can use operator splitting [2–4]. We give a quick review of this method and then use it to solve (3.8).

3.3.1 Bregman Iteration and Operator Splitting

Consider the following optimization problem:

$$\begin{aligned} \underset{\mathbf{s }}{\min }{J(\mathbf{s })}\quad \text{ s.t. }\quad H(\mathbf{s })=0, \end{aligned}$$
(3.9)

where \(H\) is a convex differentiable function, while \(J\) is convex but possibly non-differentiable. An efficient way to solve this type of problem is to use Bregman iterations [2].

To proceed we need the definitions of the sub-gradient and the Bregman distance.

Definition 6

Let \(J(\cdot ):{\mathbb{R }}^n\rightarrow {\mathbb{R }}^{+}\) be a convex and possibly non-differentiable function. The vector \({{\varvec{p}}}\in {\mathbb{R }}^n\) is called a sub-gradient of \(J\) at the point \({{\varvec{w}}}_0\) if:

$$\begin{aligned} \forall {{\varvec{w}}}\in {\mathbb{R }}^n: J({{\varvec{w}}})-J({{{\varvec{w}}}}_0)\ge \langle {{\varvec{p}}}, {{\varvec{w}}}-{{{\varvec{w}}}}_0\rangle . \end{aligned}$$
(3.10)

Also, the set of all such \({{\varvec{p}}}\)'s is called the sub-differential of \(J\) at the point \({{\varvec{w}}}_0\) and is denoted by \(\partial J({{\varvec{w}}}_0)\).

For a differentiable function, \(\partial J(\mathbf{w }_0)\) reduces to a singleton containing only the gradient vector, \(\nabla J(\mathbf{w }_0)\); the concept thus extends the notion of a gradient to convex but possibly non-differentiable functions. For instance, the sub-differential of \(J(w)=|w|\) at the point \(w=0\) is the interval \([-1,1]\).
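As a quick numerical illustration of Definition 6 (a sketch with arbitrarily chosen sample points and candidate sub-gradients), every \(p\in [-1,1]\) satisfies inequality (3.10) for \(J(w)=|w|\) at \(w_0=0\):

```python
import numpy as np

J = np.abs                                   # J(w) = |w|
w0 = 0.0
w = np.linspace(-5.0, 5.0, 1001)             # sample points (arbitrary choice)

for p in (-1.0, -0.3, 0.0, 0.7, 1.0):        # candidate sub-gradients at w0 = 0
    # Sub-gradient inequality (3.10): J(w) - J(w0) >= p * (w - w0) for all w.
    assert np.all(J(w) - J(w0) >= p * (w - w0) - 1e-12)
```

Next, we require the definition of the Bregman distance.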

Definition 7

The Bregman distance of a convex function \(J(\cdot ):{\mathbb{R }}^n\rightarrow {\mathbb{R }}^{+}\) between two points \({{\varvec{s}}}\) and \({{\varvec{w}}}\) is defined as:

$$\begin{aligned} D^{{\varvec{p}}}_J({{\varvec{s}}},{{\varvec{w}}})=J({{\varvec{s}}})-J({{\varvec{w}}})-\langle {{\varvec{p}}}, {{\varvec{s}}}-{{\varvec{w}}}\rangle , \end{aligned}$$
(3.11)

where \({{\varvec{p}}}\) is a sub-gradient of \(J\) at \({{\varvec{w}}}\).

Note that (3.11) is not symmetric; thus the Bregman distance is not a metric, but it nevertheless measures the closeness of the two points.
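The following short sketch evaluates Definition 7 for \(J=\Vert \cdot \Vert _1\); the choice of the sign vector as the sub-gradient and the specific points are illustrative assumptions. It also shows that the distance is not symmetric:

```python
import numpy as np

def bregman_distance_l1(s, w):
    """D_J^p(s, w) of (3.11) for J = ||.||_1, with the sub-gradient p = sign(w)
    (zero entries of w get the valid sub-gradient value 0)."""
    p = np.sign(w)
    return np.sum(np.abs(s)) - np.sum(np.abs(w)) - p @ (s - w)

s = np.array([1.0, -2.0, 0.0])
w = np.array([0.5, -1.0, 3.0])
print(bregman_distance_l1(s, w), bregman_distance_l1(w, s))   # 0.0 and 3.0: not symmetric
```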

Returning to our problem (3.9), it can be solved iteratively as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \mathbf{s }^{i+1}=\underset{\mathbf{s }}{\arg \min }{ D^\mathbf{p ^i}_J(\mathbf{s },\mathbf{s }^i)+\delta H(\mathbf{s })}\\ \mathbf{p }^{i+1}=\mathbf{p }^i-\delta \partial H(\mathbf{s }^{i+1}), \end{array}\right. \end{aligned}$$
(3.12)

where \(\delta \ge 0\) is a constant. It is shown in [4] that if the original problem (3.9) has a solution \(\hat{\mathbf{s }}\), then the iterates of (3.12) satisfy \(\mathbf{s }^i\rightarrow \hat{\mathbf{s }}\) as \(i\rightarrow \infty \).

We now apply this algorithm to (3.8), for which we set \(H(\mathbf{s })=\frac{1}{2}\Vert B\mathbf{s }-\mathbf{b }\Vert ^2_2\) and \(J(\mathbf{s })=\mu \Vert \mathbf{s }\Vert _1+\Vert A\mathbf{s }-\mathbf{y }\Vert ^2_2\). This reduces (3.12) to:

$$\begin{aligned} \left\{ \begin{array}{l} (\mathbf{s }^{i+1},\mathbf{b }^{i+1})=\underset{\mathbf{s },\mathbf{b }}{\arg \min }\;\mu \Vert \mathbf{s }\Vert _1+\frac{1}{2}\Vert A\mathbf{s }-\mathbf{y }\Vert ^2_2+\frac{\delta }{2}\Vert B\mathbf{s }-\mathbf{b }+\mathbf{p }^i\Vert ^2_2\\ \mathbf{p }^{i+1}=\mathbf{p }^i+\delta (B\mathbf{s }^{i+1}-\mathbf{b }^{i+1}). \end{array}\right. \end{aligned}$$
(3.13)

Note that the first update step in (3.13) has the form of a standard basis pursuit de-noising (BPDN) problem [5], which can be solved by a variety of optimization methods [6]. In the present work, we use the FISTA algorithm of [7] because of the simplicity of its implementation as well as its remarkable convergence properties. It should be noted that the algorithm does not require the matrices \(A\) and \(B\) to be defined explicitly; only multiplication by these matrices and their transposes is needed, which can be implemented in an implicit and computationally efficient manner. The main advantage of solving the problem using operator splitting is the much faster convergence of the thresholding algorithm.
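A minimal sketch of this procedure is given below (Python/NumPy). It is not the authors' implementation: dense matrices are used instead of implicit operators, the side-information vector \(\mathbf{b}\) is held fixed at its given value rather than updated jointly as in (3.13), and the parameters \(\mu \), \(\delta \), and the iteration counts are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_bpdn(M, z, mu, n_iter=200):
    """FISTA for the BPDN problem  min_s 0.5 * ||M s - z||_2^2 + mu * ||s||_1."""
    L = np.linalg.norm(M, 2) ** 2                      # Lipschitz constant of the smooth term
    s = np.zeros(M.shape[1])
    w, t = s.copy(), 1.0
    for _ in range(n_iter):
        grad = M.T @ (M @ w - z)
        s_next = soft_threshold(w - grad / L, mu / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        w = s_next + ((t - 1.0) / t_next) * (s_next - s)
        s, t = s_next, t_next
    return s

def cs_with_side_info(A, y, B, b, mu=0.05, delta=1.0, n_outer=20, n_inner=200):
    """Operator-splitting outer loop for (3.8); each inner step is a BPDN problem
    coupling the measurements and the side-information penalty, solved with FISTA."""
    p = np.zeros(B.shape[0])
    s = np.zeros(A.shape[1])
    M = np.vstack([A, np.sqrt(delta) * B])             # stacked operator, fixed across iterations
    for _ in range(n_outer):
        # Inner problem: mu*||s||_1 + 0.5*||A s - y||^2 + (delta/2)*||B s - b + p||^2.
        z = np.concatenate([y, np.sqrt(delta) * (b - p)])
        s = fista_bpdn(M, z, mu, n_inner)
        p = p + delta * (B @ s - b)                    # Bregman-type update of p, cf. (3.13)
    return s
```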

Equipped with this theoretical evidence and an efficient reconstruction algorithm, we continue with an experimental study of the potential advantages of our approach.

3.4 Experimental Study

To verify our analysis and algorithm, this section is devoted to an experimental study on synthetic data, whereas the next chapters focus on practical applications of the developed method.

3.4.1 Source Model

To simulate the source, we use a mixture-of-Gaussians model as the sparse source model:

$$\begin{aligned} \mathbf{s }\sim (1-p)\, N(0,\sigma _1)+ p\, N(0, \sigma _2), \end{aligned}$$
(3.14)

where \(N(0,\sigma _i)\) denotes a Gaussian distribution with zero mean and variance \(\sigma ^2_i\), \(\sigma _1\ll \sigma _2\), and \(p\) is the parameter of a Bernoulli distribution. This model has been used to represent sparse signals in the literature [8, 9]. Although it is not a proper model for some applications, it is adequate for our experimental study. Here, it is assumed that the source signal has two states: the first state corresponds to elements with large values (non-zero elements) and the second to elements with negligible values (approximately zero elements). The Bernoulli parameter \(p\) decides which state is active for each element, and thus controls the level of sparsity; each state is then modeled by a Gaussian distribution. Note that we only use this procedure to produce the source and assume that the user has no information about the source probability distribution.
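A short sketch of this source model follows; the parameter defaults match the values used in Sect. 3.4.2, while the random seed is an arbitrary assumption.

```python
import numpy as np

def generate_sparse_source(n=1024, p=0.1, sigma1=0.1, sigma2=10.0, seed=0):
    """Mixture-of-Gaussians source of (3.14): with probability p an element is in the
    large-value state N(0, sigma2^2), otherwise in the negligible state N(0, sigma1^2)."""
    rng = np.random.default_rng(seed)
    large = rng.random(n) < p                          # Bernoulli state selection
    s = np.where(large,
                 rng.normal(0.0, sigma2, n),           # large-value elements
                 rng.normal(0.0, sigma1, n))           # approximately zero elements
    return s, large

s, large = generate_sparse_source()
print(large.sum(), "large-value elements out of", s.size)
```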

We also need a type of side information about the source that can be modeled in the form of (3.2). For this purpose we assume that we have prior information about the positions and values of some of the large-value elements. This assumption provides a good test bed for the proposed method. We transform this information into the form of (3.2); an example makes the procedure clear. Assume we have a sparse source \(\mathbf{s }\in {\mathbb{R }}^{10}\) and we know that the second and the fourth elements are non-zero and both equal to 2. One then concludes:

$$\begin{aligned} B=\left[ \begin{array}{l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0\\ 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0\\ \end{array}\right] \end{aligned}$$
(3.15)

and:

$$\begin{aligned} \mathbf{b }=\left[ \begin{array}{l} 2 \\ 2 \end{array}\right] . \end{aligned}$$
(3.16)

In the general case, for \(B\in {\mathbb{R }}^{m^{\prime }\,{\times }\,n}\), where \(m^{\prime }\) is the number of known non-zero elements, in the \(j\mathrm{th}\) row we set the entry in the column corresponding to the position of the \(j\mathrm{th}\) known non-zero element to one and the remaining entries to zero. Trivially, \(\mathbf{b }_j\) equals the value of the \(j\mathrm{th}\) known non-zero element. We continue with experiments in this framework.
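A sketch of this construction is given below; the helper function name is an illustrative assumption, and the call reproduces the ten-dimensional example of (3.15)-(3.16) (0-based indices 1 and 3 for the second and fourth elements).

```python
import numpy as np

def side_info_matrices(n, known_idx, known_vals):
    """Build B and b of (3.2) from the positions and values of the known non-zero elements."""
    m_prime = len(known_idx)
    B = np.zeros((m_prime, n))
    B[np.arange(m_prime), known_idx] = 1.0             # one selecting entry per row
    b = np.asarray(known_vals, dtype=float)
    return B, b

# Second and fourth elements known to equal 2, as in (3.15)-(3.16).
B, b = side_info_matrices(10, known_idx=[1, 3], known_vals=[2.0, 2.0])
print(B)
print(b)
```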

3.4.2 Experiments

We set \(n=1024\), \(\sigma _1=0.1\), and \(\sigma _2=10\), with a sparsity level of \(p=0.1\), to generate the source signal throughout this subsection. This yields an approximately sparse source with about 100 large-value elements. The sensing matrix was chosen as a random matrix with i.i.d. Gaussian entries and applied to the source to produce the measurement vector \(\mathbf{y }\). This choice is standard in the CS literature because such matrices satisfy both the RIP and the SSP. Based on our assumption, the positions and values of a fraction of the large-value elements are known, from which (3.2) is formed. Throughout the experiments, we assume that one fourth of the large-value elements of the source are known (\(m^{\prime }\approx 25\)) and \(m=300\), unless stated otherwise.
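Putting the pieces together, the following end-to-end sketch of one experimental run reuses the helper functions from the earlier sketches; the regularization parameters and the SNR definition (in dB) are assumptions, not reported settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 1024, 300, 0.25

s_true, large = generate_sparse_source(n=n, p=0.1, sigma1=0.1, sigma2=10.0)   # Sect. 3.4.1
A = rng.standard_normal((m, n)) / np.sqrt(m)           # i.i.d. Gaussian sensing matrix
y = A @ s_true                                         # primary measurements

# Side information: a fraction r of the large-value elements is known exactly.
large_idx = np.flatnonzero(large)
known_idx = rng.choice(large_idx, int(r * large_idx.size), replace=False)
B, b = side_info_matrices(n, known_idx, s_true[known_idx])

s_hat = cs_with_side_info(A, y, B, b, mu=0.05, delta=1.0)                     # Sect. 3.3

snr = 10.0 * np.log10(np.sum(s_true ** 2) / np.sum((s_true - s_hat) ** 2))
print(f"reconstruction SNR: {snr:.1f} dB")
```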

Figure 3.1a depicts an instance of the generated source, in which the signal is plotted versus time (index). Figure 3.1b depicts the reconstruction result for classical CS. Visually, it can be seen that some approximately zero elements are estimated to be larger than their real values, and the quality of the reconstruction is poor. Figure 3.1c depicts the reconstruction result for the proposed method. As can be seen visually, our method outperforms the classical method, which is also confirmed numerically by the signal-to-noise ratio (SNR). This confirms that the proposed algorithm works properly and is able to incorporate the side information.

Fig. 3.1 a The source signal, b reconstruction using classical CS, \(\text{ SNR } = 14.4\), c reconstruction using the proposed method, \(\text{ SNR } = 26.0\)

To analyze the algorithm, two sets of experiments are performed. First, we study the effect of the number of known elements on the performance of the algorithm. Let \(0<r\le 1\) denote the fraction of known elements. Figure 3.2 depicts the reconstruction quality of the proposed algorithm (measured via SNR) versus \(r\) for \(r\in (0.1,1)\). As expected, as the amount of side information increases, the quality of the reconstruction improves, such that for \(r=1\) we have near-complete recovery.

Fig. 3.2 Output SNR versus the fraction of known large-value elements (\(r\))

Fig. 3.3 SNR of the source reconstruction obtained with different methods as a function of \(m\). The dashed and solid lines correspond to classical CS and the proposed method, respectively, and \(r=0.25\)

In the second set of experiments we consider the effect of the number of measurements, \(m\), on the reconstruction quality. Figure 3.3 depicts the output reconstruction SNR versus \(m\) for classical CS and the proposed method. As expected, the reconstruction quality improves as the number of measurements increases for both methods. It can also be seen that for a large number of measurements both methods saturate at high SNR values. This is not surprising: when the number of measurements is large enough, the source can be reconstructed almost perfectly, and the side information has a negligible effect on the quality of the reconstruction. For insufficient measurements, however, the side information becomes important and improves the quality of the reconstruction. This implies that the proposed method can be used either to improve the quality of the reconstruction or to decrease the number of required measurements without deteriorating the reconstruction quality. Overall, these experiments confirm the effectiveness of the proposed method. In the next chapters, this method is applied to two practical examples: image deblurring in optical imaging [10] and surface reconstruction in the gradient field [11]. In both applications the source signals are gradient fields and the side information can be formulated as (3.2), as in [1]. Further analysis is also carried out through these applications.