1 Introduction

Traditional video coding standards share a common characteristic: the encoder is far more complex than the decoder. For this reason, these standards are not suitable for every application. For example, in the rapidly developing Wireless Multimedia Sensor Networks (WMSN) [1], the lifetime of the large number of sensors has a great effect on the performance of the network, so a low-complexity transmitter with good coding efficiency must be considered to decrease power consumption. In contrast to traditional video coding, most of the video-processing burden is then shifted to the decoder, where abundant resources are available.

At present, one technique that applies the strategy mentioned above is Distributed Video Coding (DVC) [15]. In this technique, intra-frame encoding is implemented independently, without motion estimation (ME) or motion compensation (MC). In particular, for Wyner-Ziv (WZ) frames, a channel code is applied and only a portion of the resulting parity bits is transmitted. The decoder employs the received WZ bits and the side information (SI) generated from previously decoded frames to jointly reconstruct the WZ frames. Another technique aiming at low encoder complexity is Compressed Sensing (CS) [5, 9]. CS, a new branch of signal processing, can recover sparse or compressible signals from only a small number of non-adaptive linear measurements. CS breaks the limit of the Nyquist sampling theorem and needs no additional compression stage. Although video coding based on CS, often called Compressed Video Sensing (CVS), has considerable advantages over the techniques noted above, a gap still exists between CVS and practical application.

Existing CVS schemes encode each frame independently. According to the sampling mode, they can be classified into two categories: plain compressive sampling [2, 12, 19, 40] and hybrid sampling [8, 25, 27, 32]. In the plain compressive sampling mode, Kang and Lu [19] proposed the DCVS framework, which completes video recovery with the help of SI derived from the reconstructed key frames and a modified GPSR algorithm. Asif et al. [2] modeled temporal variations at the decoder with a motion-adaptive linear dynamical system and designed the corresponding recovery algorithm, which is only suitable for small frames. Two CVS schemes, MC-BCS-SPL and MH-BCS-SPL, are studied in [12]. In the first, ME/MC is incorporated into the reconstruction process through three main components: multi-hypothesis initialization, residual reconstruction, and forward/backward MC. The second takes another form of ME/MC, multi-hypothesis (MH) prediction, and proposes a Tikhonov regularization-based MH prediction method that cooperates with residual reconstruction. Ying et al. [40] enhanced the sparse representation of each frame block by generating Karhunen-Loève (KL) bases adaptively at the decoder; however, this approach can hardly meet real-time requirements. Their codec also differs in sampling form: all frames are treated equally, without a partition into key and non-key frames. The hybrid sampling mode uses a traditional video coding technique for key frames. The authors of [27] presented a practical acquisition process that combines compressive sampling with conventional sampling and applied it to non-key frames; the key to this process is a compressive sampling test that identifies which blocks within key frames are sparse. Prades-Nebot et al. [25] proposed a CS-based DVC technique in which three coding modes (SKIP, SINGLE, and L1) are utilized to improve the coding efficiency of video. Do et al. [8] designed the DISCOS codec, in which both block-based and frame-based measurements are acquired for non-key frames; their main contributions are the interframe sparsity model, a sparsity-constrained block prediction algorithm, and sparse recovery with decoder SI. Tzagkarakis et al. [32] proposed a CVS system for remote imaging applications that employs an iterative frame refinement process and a super-resolution step to enhance the quality of the reconstructed non-key frames. In brief, compared with plain compressive sampling without any other form of encoding, hybrid sampling is more complex and requires more computing resources, which makes it inappropriate for the low-cost, low-power encoding devices used in WMSN. For this reason, we employ plain compressive sampling on all frames.

It is worth noting that much attention has been paid to performance at low sampling rates (i.e., when fewer measurements are acquired), because low rates greatly reduce the power consumption of WMSN. In this paper, we introduce the elastic net regression, widely utilized in statistical learning, to CVS and propose an Elastic net-based hybrid hypothesis prediction technique named MS-wEnet. This technique selectively implements either the single-hypothesis (SH) mode or the multi-hypothesis (MH) weighted Elastic net (MH-wEnet) mode. Both modes operate in the compressed measurement domain at the decoder to obtain predictions of non-key frames; a residual reconstruction step then completes the final recovery. Simulation results and analyses show that, at low sampling rates, the proposed scheme is superior to other CVS schemes that apply MH prediction.

The remainder of this paper is organized as follows: Section 2 describes the background of Compressed Sensing and reviews its application in video processing. The architecture of the proposed CVS scheme is illustrated in Section 3. The MS-wEnet technique is described in detail in Section 4, and the performance evaluation is carried out in Section 5. Finally, Section 6 concludes the paper.

2 Background

2.1 Compressed sensing

Briefly, CS integrates signal acquisition and compression (dimensionality reduction), and under certain conditions a small number of linear measurements contains all the information necessary for exact recovery of the signal. Specifically, assume that a real-valued signal of interest \( x\in {R}^N \) has a sparse representation in a certain basis Ψ; the transform coefficient vector α can then be accurately recovered with high probability from a series of projections acquired through

$$ y=\varPhi x= A\alpha $$
(1)

where y is a measurement vector, \( \varPhi \in {R}^{M\times N} \) is a measurement matrix with M ≪ N, and A denotes ΦΨ. Recovering α from y involves searching for the sparse solution of an underdetermined linear system. Under the condition that A satisfies the restricted isometry property [5], this can be accomplished by iterative greedy algorithms such as OMP [31], StOMP [10], and CoSaMP [24]. In addition, under the same condition and with a convex relaxation applied, the sparse solution can also be obtained by solving the \( {\ell}_1 \)-minimization problem,

$$ \underset{\alpha }{ \min }{\left\Vert \alpha \right\Vert}_1,\kern1em s.t.\kern0.5em y= A\alpha $$
(2)
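As an illustration of the greedy route mentioned above, the following minimal OMP sketch (Python/NumPy; the function name and defaults are ours for illustration, not a reference implementation) estimates a k-sparse α from y = Aα by repeatedly picking the column of A most correlated with the current residual and re-fitting a least-squares problem on the active set:

```python
import numpy as np

def omp(A, y, k, tol=1e-6):
    """Minimal Orthogonal Matching Pursuit: estimate a k-sparse alpha with y ~ A @ alpha."""
    N = A.shape[1]
    residual = y.astype(float).copy()
    support = []
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit restricted to the active set.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    alpha = np.zeros(N)
    alpha[support] = coef
    return alpha
```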

In practice, the acquisition of CS measurements is often corrupted by noise,

$$ y=\varPhi x+e= A\alpha +e $$
(3)

where e represents the noise vector. In this case, to deal with noisy measurements, (2) should be relaxed to yield the following optimization problem,

$$ \underset{\alpha }{ \min}\frac{1}{2}{\left\Vert y- A\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_1 $$
(4)

known as the basis pursuit denoising (BPDN) [6] criterion. Here, λ is a non-negative parameter. Because they attain the global optimum, convex optimization approaches for solving (4) achieve better reconstruction quality than greedy algorithms, at the cost of higher computational complexity.
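For instance, (4) can be solved with plain iterative soft thresholding (ISTA), one of the simplest convex approaches; the following is a minimal sketch (Python/NumPy), assuming a fixed step size 1/L with L = ||A||₂² and no acceleration:

```python
import numpy as np

def ista(A, y, lam, n_iter=200):
    """Solve min_alpha 0.5*||y - A@alpha||_2^2 + lam*||alpha||_1 by soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    alpha = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ alpha - y)           # gradient of the data-fit term
        z = alpha - grad / L                   # gradient step
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of lam*||.||_1
    return alpha
```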

2.2 Video techniques based on CS

In many cases, natural video frames are compressible in the DCT or wavelet basis, and a random matrix is selected as Φ to guarantee incoherence with Ψ. A variety of solvers have been proposed for (4). For example, the FPC_AS [39] algorithm combines the characteristics of greedy algorithms with an improved fixed-point continuation (FPC) to deal with challenging problems arising in CS, such as high dynamic range; the ADMM [4] solver utilizes the simple but powerful alternating direction method of multipliers to handle large-scale problems efficiently. Another recovery algorithm, CGIST [16], has also been developed; it is based on iterative shrinkage/thresholding (IST) and exploits the conjugate gradient partan method to accelerate convergence while the active set remains constant.

When we employ the reconstruction algorithms noted above on video streams, the most straightforward approach is global frame-based acquisition with independent reconstruction of each frame. However, this increases storage at the encoder and recovery time at the decoder. To solve these problems, we adopt non-overlapping block splitting, which in turn can cause severe blocking artifacts. Borrowing from deblocking methods in traditional video processing, the BCS-SPL [22] algorithm incorporates a Wiener-filtering step into the reconstruction. In view of its fast implementation and good reconstruction quality, we adopt the BCS-SPL algorithm for the recovery of video frames.

For the specific case of dynamic magnetic resonance imaging (MRI), a number of approaches, such as LS-CS [34], KF-CS [33], modified-CS [35], modified-CS-residual [21], and k-t FOCUSS [18], have been suggested in the CVS community. However, compared with natural video sequences, dynamic MRI sequences exhibit less inter-frame motion and smaller temporal variations in content. So these approaches are not directly suitable for natural video scenes, but we can benefit from their idea of “prediction—residual reconstruction”. From the CVS schemes described in Section 1, we can conclude that various ME/MC techniques can be embedded efficiently into the CS video receiver/decoder and that the performance of CVS can be further improved by merging the respective features of CS and DVC. In addition, Matrix Completion methods such as MATRIX ALPS [20] and SpaRCS [38] can be applied to video surveillance environments, in which the static background is modeled as a low-rank matrix while the moving foreground objects produce a sparse matrix. One may also take the form of 3D processing [14, 36, 37], in which a 3D joint reconstruction is implemented; however, the computation and memory issues worsen as the amount of data increases.

In essence, all these CVS schemes strive to exploit redundancy (which exists in both intra-frame and inter-frame forms), especially temporal redundancy, to improve the performance of video reconstruction.

3 Architecture of the proposed CVS scheme

The block diagram of our proposed CVS system is presented in Fig. 1.

Fig. 1 Block diagram of the proposed CVS codec: (a) CVS encoder, (b) CVS decoder

Figure 1(a) shows the CVS encoder and Fig. 1(b) the CVS decoder; the decoder diagram presents the recovery procedure for one non-key frame. It is worth emphasizing that the number of non-key frames plays a decisive role in the amount of data to be processed: in WMSN, with acceptable visual quality, the more non-key frames there are, the less power is consumed by the encoding devices. From this block diagram, the proposed CVS scheme consists of four main stages. The first stage, plain compressive sampling, is carried out at the encoder; block-based measurements are acquired on key frames and non-key frames separately. Then, at the decoder, key frames are reconstructed first from their received and inverse-quantized measurements in the second stage. After the block sliding and projection process, the third and most important stage, MS-wEnet prediction, constructs predictions of the blocks of one non-key frame with the aid of the calculated distance sets. The last stage, residual recovery, reconstructs the difference between the prediction and the target non-key frame. The recovery procedure described in Fig. 1(b) extends to the other non-key frames.

3.1 Design of CVS encoder

As discussed above, to meet the low-cost and low-power demand, we apply plain compressive sampling to the video stream frame-by-frame and block-by-block to obtain measurements. Following the analysis of the distribution of measurements in [2, 12], we divide the video sequence into Groups-of-Pictures (GOPs), each consisting of one key frame and several non-key frames. For convenience, we assume \( x_{j,m}^K \) is the m-th (vectorized) block in the j-th key frame and \( x_{i,m}^{CS} \) is the m-th (vectorized) block in the i-th non-key frame. Moreover, \( {\varPhi}_B^K\in {R}^{M_B^K\times N} \) is defined as the block-based key-frame measurement matrix and \( {\varPhi}_B^{CS}\in {R}^{M_B^{CS}\times N} \) as the block-based non-key-frame measurement matrix. Here, we define \( M_B^K/N \) and \( M_B^{CS}/N \) as the Key-subrate and CS-subrate, respectively. In Fig. 1(a), all frames are split into blocks of the same size. Each (vectorized) block of a key frame is compressively sampled by projecting onto \( {\varPhi}_B^K \) to get the measurement vector \( y_{j,m}^K \), while non-key frames are sampled by \( {\varPhi}_B^{CS} \) to obtain the measurement vectors \( y_{i,m}^{CS} \). For simple and practical implementation, we adopt the first \( M_B^{CS} \) rows of \( {\varPhi}_B^K \) to form \( {\varPhi}_B^{CS} \).
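A minimal sketch of this sampling stage (Python/NumPy; function and variable names are ours for illustration): an orthonormalized Gaussian matrix is drawn once, \( {\varPhi}_B^{CS} \) is taken as the first \( M_B^{CS} \) rows of \( {\varPhi}_B^K \), and every vectorized B × B block is projected onto the appropriate matrix.

```python
import numpy as np

def make_matrices(B, key_subrate, cs_subrate, seed=0):
    """Block measurement matrices; Phi_cs reuses the first rows of Phi_key."""
    N = B * B
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))   # orthonormalized Gaussian
    M_key = int(round(key_subrate * N))
    M_cs = int(round(cs_subrate * N))
    Phi_key = Q[:M_key, :]
    Phi_cs = Phi_key[:M_cs, :]             # first M_cs rows, as in the text
    return Phi_key, Phi_cs

def sample_frame(frame, Phi, B):
    """Split a frame into non-overlapping BxB blocks and project each one.
    Assumes the frame dimensions are multiples of B (e.g., CIF with B = 16)."""
    H, W = frame.shape
    ys = []
    for r in range(0, H, B):
        for c in range(0, W, B):
            x = frame[r:r + B, c:c + B].reshape(-1)    # vectorized block
            ys.append(Phi @ x)
    return np.stack(ys)                    # one measurement vector per block
```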

3.2 Design of CVS decoder

The recovery of key frames is completed using the BCS-SPL algorithm. As in traditional video coding, to make full use of temporal redundancy, the two neighboring key frames are employed to carry out bi-directional prediction for the non-key frames between them. On the reconstructed key frames, a block sliding process is implemented: a block of the same size as \( x_{i,m}^{CS} \) slides pixel-by-pixel in all directions within the search space associated with \( x_{i,m}^{CS} \), and all blocks obtained are vectorized to generate the hypothesis set \( H_m \). Multiplied by \( {\varPhi}_B^{CS} \), \( H_m \) is projected to yield the measurement vector set \( Q_m \). The Euclidean distances between every column \( q_s \) in \( Q_m \) and \( y_{i,m}^{CS} \) (the measurement vector corresponding to \( x_{i,m}^{CS} \)) are calculated to form the distance vector \( D_{i,m} \), which is then used in the hybrid hypothesis prediction procedure. Specifically, the hybrid hypothesis prediction is a new method that selects either the SH or the MH-wEnet prediction mode in the projection domain to produce \( {\tilde{x}}_{i,m}^{CS} \) as the approximation of \( x_{i,m}^{CS} \); the two modes together complete the prediction task for non-key frames. After that, the predicted blocks are arranged via block grouping to form the whole prediction \( {\tilde{x}}_i^{CS} \), followed by the residual reconstruction step.
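The block sliding and projection process can be sketched as follows (illustrative Python/NumPy; boundary handling and search-space clipping are simplified):

```python
import numpy as np

def build_hypotheses(ref_frame, r0, c0, B, W):
    """Slide a BxB window pixel-by-pixel inside the (B+2W)x(B+2W) search space
    centered on the co-located block at (r0, c0); each position is one hypothesis."""
    cols = []
    Hf, Wf = ref_frame.shape
    for dr in range(-W, W + 1):
        for dc in range(-W, W + 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r and r + B <= Hf and 0 <= c and c + B <= Wf:
                cols.append(ref_frame[r:r + B, c:c + B].reshape(-1))
    return np.stack(cols, axis=1)          # columns h_s form the hypothesis set H_m

def distances(H_m, Phi_cs, y):
    """Project hypotheses into the measurement domain and compute D_{i,m}."""
    Q_m = Phi_cs @ H_m                     # projected hypothesis set
    d = np.linalg.norm(Q_m - y[:, None], axis=0)   # Euclidean distances to y
    return d, Q_m
```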

As pointed out above, the most essential part in our proposed CVS scheme is the MS-wEnet prediction stage, i.e., the Elastic net-based hybrid hypothesis prediction, and next we will illustrate it in detail.

4 Elastic net-based hybrid hypothesis prediction

In MS-wEnet, MH-wEnet is a new approach to measurement-domain MH prediction. It is therefore necessary to explain the MH prediction employed in CVS first. In the following, \( H_{i,m} \) denotes the hypothesis set associated with \( x_{i,m}^{CS} \), and \( Q_{i,m} \) denotes the projection of \( H_{i,m} \) by \( {\varPhi}_B^{CS} \).

4.1 MH prediction mode in CVS

This mode [8, 30], like the SH prediction mode, belongs to the family of ME/MC techniques and is better suited to complex motion. In CVS, ME/MC occurs only in the CS reconstruction of the video. In this process, we construct the prediction \( {\tilde{x}}_{i,m}^{CS} \) as an approximation of \( x_{i,m}^{CS} \) from the measurement vector \( y_{i,m}^{CS} \). As with ME/MC in traditional video coding, the more accurate the prediction \( {\tilde{x}}_{i,m}^{CS} \), the more efficiently the corresponding residual \( {\tilde{r}}_{i,m}^{CS} \) can be recovered from the residual measurements via reconstruction algorithms, finally giving \( {x}_{i,m}^{CS}={\tilde{x}}_{i,m}^{CS}+{\tilde{r}}_{i,m}^{CS} \). Therefore, the key to applying ME/MC to CVS successfully lies in the prediction method based on \( y_{i,m}^{CS} \).
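The “prediction—residual reconstruction” step itself is simple; in the sketch below, `cs_recover` is a placeholder for any block CS solver (e.g., BCS-SPL, which we do not reproduce here), and all names are illustrative:

```python
import numpy as np

def predict_then_residual(y, Phi, x_pred, cs_recover):
    """Recover a block as prediction + reconstructed residual.

    y         : measurement vector of the target block
    Phi       : block measurement matrix (Phi_B^CS)
    x_pred    : prediction of the block (e.g., from MS-wEnet)
    cs_recover: any CS solver mapping residual measurements back to pixels
    """
    y_res = y - Phi @ x_pred               # measurement-domain residual
    r_hat = cs_recover(y_res, Phi)         # reconstruct the residual
    return x_pred + r_hat                  # final block estimate
```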

In the CS reconstruction of video, both the MH and SH prediction modes can be implemented in the pixel domain or in the measurement domain. However, according to [3, 7, 12, 17], a prediction \( {\tilde{x}}_{i,m}^{CS} \) made in the measurement domain is probably closer to \( x_{i,m}^{CS} \) than one made in the pixel domain, by the Johnson-Lindenstrauss (JL) lemma. The idea of “prediction—residual reconstruction” is therefore better suited to the measurement domain, as confirmed in [12]. Besides, by selecting multiple candidate hypotheses (columns) \( h_s \) from \( H_{i,m} \) and executing an optimal linear combination, the MH mode produces a prediction superior to any constructed by the SH mode. Based on these facts, we conclude that predictions obtained by the MH mode in the measurement domain achieve the best accuracy.

In this mode, the MH coefficient vector \( w_{i,m}^{mh} \) is acquired in the projection domain first and then applied to all hypotheses to obtain the linear prediction \( {\tilde{x}}_{i,m}^{mh}={H}_{i,m}{w}_{i,m}^{mh} \). Because of the dimensionality reduction caused by \( {\varPhi}_B^{CS} \), a penalty term must be added to the optimization problem for \( w_{i,m}^{mh} \) to obtain a preferable solution. At present, two methods have been proposed for MH prediction in the projection domain.

(a) \( {\ell}_1 \) regularization-based MH prediction. In [8], the proposed sparsity-constrained block prediction scheme supposes that \( x_{i,m}^{CS} \) can be represented as a linear combination of a few \( h_s \). In other words, \( w_{i,m}^{mh} \) is assumed to be sparse, and the support used for the combination is a subset of that of \( H_{i,m} \). As a result, the \( {\ell}_1 \) regularization-based MH prediction involves solving

$$ {w}_{i,m}^{{\ell}_1}=\underset{w}{\arg \min}\;{\left\Vert {y}_{i,m}^{CS}-{Q}_{i,m}w\right\Vert}_2^2+\lambda {\left\Vert w\right\Vert}_1 $$
(5)

where \( {w}_{i,m}^{{\ell}_1} \) is a form of \( w_{i,m}^{mh} \).

(b) Tikhonov regularization-based MH prediction. This method, the main part of MH-BCS-SPL [12, 30] and called MH-Tikhonov in our work, assumes that the support used for the combination is exactly that of \( H_{i,m} \): all \( h_s \) contribute to the prediction. Based on the Euclidean distances between each column \( q_s \) in \( Q_{i,m} \) and \( y_{i,m}^{CS} \), the contributions can be further adjusted. With these distances as prior knowledge, the Tikhonov regularization-based MH prediction can be written as

$$ {w}_{i,m}^{Tik}=\underset{w}{\arg \min}\;{\left\Vert {y}_{i,m}^{CS}-{Q}_{i,m}w\right\Vert}_2^2+{\left\Vert \lambda \varGamma w\right\Vert}_2^2 $$
(6)

Here, Γ is a Tikhonov matrix containing the prior knowledge stated above, and \( w_{i,m}^{Tik} \) is another form of \( w_{i,m}^{mh} \). Whereas (5) is solved by iteration, a closed-form solution is available for (6).
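For reference, (6) admits the closed-form solution \( w=\left({Q}^{\top }Q+{\lambda}^2{\varGamma}^{\top}\varGamma \right)^{-1}{Q}^{\top }y \). A sketch follows (Python/NumPy, names illustrative), assuming, as the text indicates, a diagonal Γ built from the distances:

```python
import numpy as np

def mh_tikhonov(Q, y, d, lam=0.25):
    """Closed-form Tikhonov-regularized MH weights, cf. (6).

    Q   : projected hypothesis set (n x p)
    y   : target measurement vector
    d   : distances ||y - q_s||_2 used as prior knowledge in Gamma
    lam : regularization weight (0.25 in [12, 30])
    """
    Gamma = np.diag(d)                     # heavier penalty for distant hypotheses
    A = Q.T @ Q + (lam ** 2) * (Gamma.T @ Gamma)
    return np.linalg.solve(A, Q.T @ y)     # w = (Q^T Q + lam^2 G^T G)^{-1} Q^T y
```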

4.2 Elastic net

From the analysis above, the performance of prediction is dominated by the selected support and the associated coefficients. In fact, the \( {\ell}_1 \)-based and Tikhonov-based prediction methods in the measurement domain can be viewed as two limiting cases of the MH mode with respect to the support used in the linear combination. To achieve the best prediction, the hypotheses \( h_s \) that are highly correlated with \( x_{i,m}^{CS} \) should be selected as far as possible, and the others dropped to avoid corrupting the prediction. With this principle, we introduce and modify the Elastic net, widely used in statistical learning, to execute MH prediction in the projection domain for video data.

Combining an \( {\ell}_1 \) penalty with an \( {\ell}_2 \) penalty, the resulting optimization problem, commonly called the Elastic net [42], can be written as

$$ {w}_{i,m}^{Enet}=\left(1+\frac{\lambda_2}{n}\right)\underset{w}{\arg \min}\;{\left\Vert {y}_{i,m}^{CS}-{Q}_{i,m}w\right\Vert}_2^2+{\lambda}_1{\left\Vert w\right\Vert}_1+{\lambda}_2{\left\Vert w\right\Vert}_2^2 $$
(7)

with both λ1 and λ2 non-negative. \( w_{i,m}^{Enet} \) denotes the acquired MH coefficient vector. From (7), the connection between the Elastic net and \( {\ell}_1 \) or Tikhonov regularization is obvious. The factor \( \left(1+\frac{\lambda_2}{n}\right) \) is introduced to eliminate the double shrinkage caused by the two regularization terms. It is well known that elastic net regression combines the merits of the LASSO [28] and ridge regression. Specifically, the \( {\ell}_1 \) part of the elastic net penalty promotes sparsity in \( w_{i,m}^{Enet} \), while the \( {\ell}_2 \) part not only removes the limitation on the number of columns of \( Q_{i,m} \) that can be selected but also stabilizes the solution path when p ≫ n (\( Q_{i,m}\in {R}^{n\times p} \)). More importantly, the \( {\ell}_2 \) part encourages a grouping effect on \( w_{i,m}^{Enet} \), which lets the Elastic net handle many highly correlated columns of \( Q_{i,m} \) gracefully. Unlike the LASSO, which tends to select one column of a correlated group and ignore the others, such columns can be selected or dropped together. From the construction of \( H_{i,m} \), the correlations between some columns (hypotheses) are so high that they can be treated as a group in \( H_{i,m} \); by the JL lemma, the corresponding projections in \( Q_{i,m} \) are also highly correlated, so it is reasonable to handle these columns with the Elastic net.
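Problem (7) has the standard penalized least-squares form handled by off-the-shelf elastic-net solvers. As a hedged sketch (not the LARS-EN implementation used in this paper), the mapping of (λ1, λ2) onto scikit-learn's (alpha, l1_ratio) parametrization is:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def mh_enet(Q, y, lam1, lam2):
    """Elastic-net MH weights, cf. (7), via scikit-learn.

    sklearn minimizes  ||y - Qw||^2 / (2n) + alpha*l1_ratio*||w||_1
                       + 0.5*alpha*(1 - l1_ratio)*||w||_2^2,
    so (lam1, lam2) translate as below (n = number of measurements).
    """
    n = Q.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n
    l1_ratio = lam1 / (lam1 + 2 * lam2)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=5000)
    model.fit(Q, y)
    return (1 + lam2 / n) * model.coef_    # undo the double shrinkage, as in (7)
```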

To make the statements above more concrete, we take the first frame of the Foreman sequence (CIF format, 352 × 288) as the reference frame. On it, block-based compressive sampling with a block size of B = 16 and Key-subrate = 0.7 is executed, and the BCS-SPL algorithm is used to recover the frame. On the reconstructed reference frame, we extract the search space related to the 25th block of the non-key frames with a spatial window size W of 15 pixels to form \( H_{i,25} \) and \( Q_{i,25} \). Figure 2 illustrates the acquisition of the search space; the CS-subrate we set for \( Q_{i,25} \) is 0.2. The experimental data in Tables 1 and 2 present the correlations between columns 40 ~ 43 of \( H_{i,25} \) and \( Q_{i,25} \), respectively.

Fig. 2 A block of size B × B and the search space of size (B + 2W) × (B + 2W); W is the spatial window size

Table 1 The correlations between columns 40 ~ 43 in \( H_{i,25} \)
Table 2 The correlations between columns 40 ~ 43 in \( Q_{i,25} \)

From Tables 1 and 2, we can conclude that columns 40 ~ 43 are highly correlated, with all \( \rho_{a,b} \) above 0.8, especially in \( Q_{i,25} \). In this case, the Elastic net can process these columns effectively: they will either all be dropped, with their coefficients set to zero, or all be selected and assigned similar non-zero coefficients.

LARS-EN, an efficient algorithm developed in [42], obtains the entire Elastic net solution path through updating and downdating of a Cholesky decomposition; [26] describes this algorithm further and provides the accompanying Matlab software, SpaSM. In the situation where p ≫ n, we adopt early stopping [42], which means the optimal solution is reached before the algorithm runs to completion. Since LARS-EN requires \( O(k^3+pk^2) \) operations after k iterations, early stopping also eases the computational burden of acquiring the solution.
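Outside Matlab/SpaSM, the LARS-EN path with early stopping can be emulated through the standard reduction of the (naive) elastic net to a LASSO on augmented data [42]; the sketch below (Python/scikit-learn, illustrative) caps the number of LARS steps to realize the early stopping, omitting the \( \left(1+\frac{\lambda_2}{n}\right) \) rescaling of (7) for brevity:

```python
import numpy as np
from sklearn.linear_model import lars_path

def lars_en_early_stop(Q, y, lam2, max_steps):
    """Naive elastic net via LARS on augmented data, stopped after max_steps steps.

    Solving the lasso on [Q; sqrt(lam2)*I], [y; 0] minimizes
    ||y - Qw||^2 + lam2*||w||^2 + lam1*||w||_1 along the lam1 path [42].
    """
    n, p = Q.shape
    Q_aug = np.vstack([Q, np.sqrt(lam2) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    alphas, active, coefs = lars_path(Q_aug, y_aug, method='lasso',
                                      max_iter=max_steps)
    return coefs[:, -1]                    # early-stopped solution on the path
```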

4.3 Weighted Elastic net

Applying the Elastic net directly to video processing does not exploit all the available information. The adaptive Elastic net described in [43] combines the adaptive LASSO [41] with the Elastic net: embedding the pre-computed Elastic net estimator into the \( {\ell}_1 \) part as a penalty factor, it assigns a different weight to each coefficient. For the multi-hypothesis setting in the projection domain for video data, [30] employs a distance-weighted rule to form its Tikhonov regularization-based MH prediction: the greater the distance, the heavier the penalty on the corresponding coefficient. We adopt this rule for the Elastic net, replacing the pre-computed Elastic net estimator of the adaptive Elastic net with the distance vector \( D_{i,m} \) defined in Section 3.2. Compared with the adaptive Elastic net, the resulting weighted Elastic net (wEnet), as we call it, remains data-dependent, since \( D_{i,m} \) varies with the block index m and the frame index i, but is computationally cheaper because no Elastic net solution needs to be pre-computed. The wEnet also inherits the sparse representation and grouping-effect properties of the Elastic net mentioned above. In summary, the proposed Multi-Hypothesis weighted Elastic net (MH-wEnet), used in the measurement domain to produce the prediction \( {\tilde{x}}_{i,m}^{wEnet} \), has the following form:

$$ {w}_{i,m}^{wEnet}=\left(1+\frac{\lambda_2}{n}\right)\underset{w}{\arg \min}\;{\left\Vert {y}_{i,m}^{CS}-{Q}_{i,m}w\right\Vert}_2^2+{\lambda}_1{\left\Vert {\mathcal{D}}_{i,m}w\right\Vert}_1+{\lambda}_2{\left\Vert w\right\Vert}_2^2 $$
(8)
$$ {\tilde{x}}_{i,m}^{wEnet}={H}_{i,m}{w}_{i,m}^{wEnet} $$
(9)

where

$$ {\mathcal{D}}_{i,m}=\begin{pmatrix} d_1 & & & & 0\\ & \ddots & & & \\ & & d_s & & \\ & & & \ddots & \\ 0 & & & & d_{\varSigma} \end{pmatrix} $$

\( w_{i,m}^{wEnet} \) is the MH coefficient vector in MH-wEnet, and \( d_s={\left\Vert {y}_{i,m}^{CS}-{\varPhi}_B^{CS}{h}_s\right\Vert}_2 \) is also the s-th element of the distance vector \( D_{i,m} \). Σ denotes the total number of columns in \( Q_{i,m} \).

For the computation, the LARS-EN algorithm can still be employed to obtain the solution path efficiently, operating on \( Q_{i,m} \) adjusted by \( D_{i,m} \), and early stopping is again adopted.
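Concretely, and consistent with the column adjustment \( {q}_s^{\prime }={q}_s/{d}_s \) noted in Section 5.1, the wEnet (8) can be approximated by rescaling the columns of \( Q_{i,m} \) by the distances, solving a plain elastic net, and rescaling the solution back; note that in this reduction the \( {\ell}_2 \) term acts on the rescaled coefficients, a mild deviation from (8). A hedged sketch reusing `mh_enet` from the earlier sketch:

```python
import numpy as np

def mh_wenet(Q, y, d, lam1, lam2, eps=1e-11):
    """Weighted elastic net, cf. (8)-(9), via column rescaling.

    Substituting v_s = d_s * w_s turns lam1*||D w||_1 into lam1*||v||_1,
    with the adjusted columns q'_s = q_s / d_s (cf. Section 5.1).
    """
    d = np.maximum(d, eps)                 # guard the near-zero-distance case (Sec. 4.4)
    Q_adj = Q / d[None, :]                 # each column divided by its distance
    v = mh_enet(Q_adj, y, lam1, lam2)      # plain elastic net on adjusted data
    return v / d                           # map back: w_s = v_s / d_s
```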

4.4 MH/SH wEnet

We have designed the measurement-domain MH-wEnet mode to handle the problems of support selection and coefficient adjustment. However, from the construction of \( D_{i,m} \), the value of \( d_s \) becomes extremely close to zero when \( y_{i,m}^{CS} \) is approximately equal to \( q_s \), and in the special case \( y_{i,m}^{CS}={q}_s \), \( d_s \) is exactly zero. In either case, the LARS-EN algorithm used to solve the wEnet becomes unstable because of the Cholesky decomposition it performs. To avoid this and improve the robustness of the prediction procedure, we combine SH prediction with MH-wEnet to complete the prediction task for non-key frames.

In the CVS setting, the SH mode can be implemented in two ways. In the pixel domain [2, 19, 23, 32], it resembles the traditional ME/MC framework: an ME operation between an initial predicted frame and a reference frame yields a field of motion vectors, which is then applied to produce a more precise prediction. In the measurement domain [25], a single best-matching hypothesis \( h_s \) is chosen as \( {\tilde{x}}_{i,m}^{CS} \) according to some distortion criterion, such as the minimum mean square error (MMSE), evaluated between each measurement vector \( q_s \) and the target \( y_{i,m}^{CS} \). As noted in Section 4.1, predictions made in the measurement domain are more precise than those made in the pixel domain. Moreover, since MH-wEnet operates in the measurement domain, the SH mode should do the same to guarantee the efficiency of prediction.

Specifically, as Fig. 1(b) shows, after the calculation of \( D_{i,m} \), its minimum element \( d_{\theta} \) is found and compared with a suitable threshold T, determined by experience and requirements. According to the result of this comparison, either the SH or the MH-wEnet prediction mode in the projection domain is selected, which is why we call MS-wEnet a hybrid prediction method. Once the SH mode is chosen, the θ-th column of \( H_m \), i.e., \( H_{m,\theta} \), is taken directly as the prediction of the vectorized \( x_{i,m}^{CS} \) and then combined with the predictions of the other blocks via block grouping to constitute the whole prediction of \( x_i^{CS} \).
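A sketch of this per-block decision (reusing the illustrative helpers above; T as discussed in Section 5.1):

```python
import numpy as np

def predict_block(H_m, Q_m, y, d, T, lam1, lam2):
    """Hybrid SH / MH-wEnet prediction of one non-key block, cf. Fig. 1(b)."""
    theta = int(np.argmin(d))
    if d[theta] < T:
        return H_m[:, theta]               # SH mode: best-matching hypothesis
    w = mh_wenet(Q_m, y, d, lam1, lam2)    # MH-wEnet mode, cf. (8)
    return H_m @ w                         # linear combination, cf. (9)
```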

A possible concern is that, since SH prediction yields lower quality than MH prediction, its participation might reduce the final reconstruction quality. It should be emphasized that far fewer blocks go through SH prediction than through MH prediction, and that if the threshold T is small enough, \( x_{i,m}^{CS} \) is accurately approximated by \( H_{m,\theta} \) according to the JL lemma. As a result, the quality achieved by the hybrid hypothesis prediction method (MH/SH wEnet, MS-wEnet for short) is not degraded compared with the MH mode alone. Moreover, the prediction procedure is also sped up, because SH prediction is much faster than MH prediction; this is demonstrated in the simulation section.

We have now described the proposed CVS system in detail, especially its vital component, MS-wEnet, at the decoder. The performance of this scheme is shown and analyzed in the next section.

5 Experiments and results

For the performance evaluation, all experiments use the standard CIF-format test video sequences available at http://trace.eas.asu.edu/yuv/ and only the luminance component. A block size of B = 16 and an orthonormalized Gaussian \( \varPhi_B \) are applied for the block-based plain compressive sampling. The Key-subrate is 0.7, and the CS-subrate varies from 0.1 to 0.5. The dual-tree discrete wavelet transform (DDWT) with four levels of decomposition is used as the sparsity-inducing basis for reconstruction, and the residual reconstruction is executed only once, using the BCS-SPL algorithm. Moreover, in the schemes implementing MH/MS prediction, the spatial window size W, as shown in Fig. 2, is set to 15 pixels. All experiments are carried out under Windows 7 SP1 on an Intel Core i3-2330M CPU at 2.20 GHz with 2 GB RAM, using Matlab R2010b.

Before demonstrating the performance of the proposed CVS scheme, several issues still need to be addressed.

5.1 Parameters in MS-wEnet

In the MH-wEnet mode, the Matlab software SpaSM is adopted to solve the wEnet presented in (8) and (9). Two parameters, λ1 and λ2, need to be predetermined; in other words, we must first set the values of delta and stop for SpaSM (delta equals λ2, and −stop, which is related to λ1, specifies the number of hypotheses to be selected). From the JL lemma, \( {\left\Vert {y}_{i,m}^{CS}-{Q}_{i,m}w\right\Vert}_2 \) approaches \( {\left\Vert {x}_{i,m}^{CS}-{H}_{i,m}w\right\Vert}_2 \) with higher probability, and the correlations between each \( q_s \) in \( Q_{i,m} \) and \( y_{i,m}^{CS} \) are distinguished more reliably, as the CS-subrate increases. A higher CS-subrate therefore allows more \( h_s \) in \( H_{i,m} \) to contribute to the prediction. Based on this analysis and experimental observations over a large set of different frames, we find that delta ∈ [0.01, 0.2] and stop = −1000 × CS-subrate provide the better results; we thus use delta = 0.1 and stop = −1000 × CS-subrate from this point on.

Because of the WMSN application background, abundant computing resources are available at the receiver, so reconstruction quality matters more than reconstruction time. Next, we analyze in detail the effect of the threshold T on the reconstruction quality.

In fact, the lower limit of the threshold T is the minimum value of \( d_{\theta} \) for which the LARS-EN algorithm still solves (8) stably. We find that when \( d_{\theta} \) is lower than 1e-11, the adjusted \( Q_{i,m} \) (each adjusted column \( {q}_s^{\prime }={q}_s/{d}_s \)) is badly conditioned, which destabilizes LARS-EN. For this reason, we set the lower limit of T to 1e-11. Based on this lower limit, and without loss of generality, we use frames 1, 5, and 9 of the Foreman and News sequences to determine the range of T over which the reconstruction quality of MS-wEnet is optimal. Frames 1 and 9 are treated as key frames, and frame 5 is the non-key frame to be recovered. Figures 3 and 4 show the relationship between the threshold T and the reconstruction quality of frame 5 at different CS-subrates; the reconstruction quality is measured by the peak signal-to-noise ratio (PSNR).

Fig. 3 Reconstruction quality of frame 5 of the Foreman sequence with different T at various CS-subrates

Fig. 4 Reconstruction quality of frame 5 of the News sequence with different T at various CS-subrates

It can be observed that all curves follow a consistent trend. When T ∈ [1e-11, 1), the reconstruction quality remains unchanged, because few \( d_{\theta} \) fall in [1e-11, 1) and the blocks whose \( d_{\theta} \) does can be well approximated by SH prediction. When T ∈ [1, 500], the reconstruction quality drops rapidly with increasing T, indicating that most \( d_{\theta} \) lie in [1, 500]. When T > 500, all blocks employ SH-prediction-based reconstruction, so the quality no longer changes. By this analysis, to stably achieve the best reconstruction quality with MS-wEnet, T should be selected from [1e-11, 1); therefore, T = 1e-7 is used from this point on.

5.2 Comparison of various MH prediction methods

In our decoding procedure, the acquired \( H_{i,m} \) and \( Q_{i,m} \) can be viewed as two special dictionaries in the spatial domain and the measurement domain, respectively. Many techniques can be employed to solve the MH prediction problem based on \( Q_{i,m} \). Here, we consider five typical algorithms: OMP, CGIST, LARS [26], LARS-EN, and Tikhonov regularization [30]. OMP is an iterative greedy algorithm, while CGIST is a convex-optimization \( {\ell}_1 \)-regularization solver. LARS and LARS-EN are two fast algorithms based on Least Angle Regression [11] for linear regression, solving the LASSO and the Elastic net. Finally, Tikhonov regularization imposes an \( {\ell}_2 \)-penalty term containing the prior knowledge on the norm of \( w_{i,m}^{mh} \). All of these algorithms except Tikhonov regularization produce a sparse solution \( w_{i,m}^{mh} \). Additionally, we also extend the distance-weighted rule to the LASSO, yielding another MH prediction method, MH-LARS-wLASSO. In total, seven methods are compared in Fig. 5.

Fig. 5 Reconstruction quality of frame 5 of the Foreman sequence for various MH prediction methods

As a representative example, we take frames 1, 5, and 9 of the Foreman sequence; frames 1 and 9 are treated as key frames and frame 5 is the non-key frame to be recovered. In OMP, the maximum number of iterations K is 500; in CGIST, the parameter mu is set to 100. To solve the Enet (Elastic net) and the wEnet, we take delta = 0.1 and stop = −1000 × CS-subrate. For the LASSO and wLASSO, SpaSM is also utilized, with delta = 0 and stop = −1000 × CS-subrate; note that the number of selected hypotheses is limited by n when applying the LASSO or wLASSO. In MH-Tikhonov, lambda = 0.25, the same setting as [12, 30]. The comparison of reconstruction quality is shown in Fig. 5, and Table 3 presents the associated experimental data.

Table 3 PSNR(dB) and reconstruction time(s) of frame 5 of Foreman sequence for various MH methods

It should be emphasized that when MH-wEnet or MH-LARS-wLASSO is used on the three frames selected here, the unstable situation described in Section 4.4 does not occur; in other words, the SH mode makes no contribution to this reconstruction.

As Fig. 5 shows, MH-wEnet outperforms the other methods in reconstruction quality in almost all cases, while MH-OMP is the worst. The runners-up are MH-L1-CGIST and MH-LARS-wLASSO, whose curves lie very close to each other; MH-Tikhonov has a slight advantage over them at high CS-subrates. From the quality achieved by MH-LARS-LASSO and MH-Enet, we can see that PSNR does not always improve simply because more \( h_s \) are selected from the hypothesis set \( H_{i,m} \); it is essential to apply the prior knowledge contained in \( D_{i,m} \) to further differentiate the contribution each \( h_s \) makes to the linear combination.

According to the reconstruction time (CPU time) for the various MH methods in Table 3, MH-OMP is clearly the fastest method, while MH-L1-CGIST is the slowest and therefore unsuitable for real-time processing. MH-Tikhonov takes nearly the same recovery time at each CS-subrate. The remaining four methods need increasingly more reconstruction time as the CS-subrate grows. Additionally, applying the distance-weighted rule makes MH-LARS-wLASSO and MH-wEnet much faster than their unweighted counterparts.

By the analysis above, MH-wEnet, MH-Tikhonov, and MH-LARS-wLASSO are the strongest methods. For the WMSN application, however, attention focuses on performance at low sampling rates, and especially on reconstruction quality, given the abundant computing resources at the receiver. From Fig. 5 and Table 3, with MH-Tikhonov as the baseline, MH-wEnet is more competitive than MH-LARS-wLASSO in reconstruction quality at low CS-subrates, though it requires somewhat more time. In addition, MH-LARS-wLASSO not only ignores the grouping effect but also inherits the instability of the LASSO solution path when the \( q_s \) are highly correlated. Therefore, between the two proposed methods, MH-wEnet and MH-LARS-wLASSO, we select the former to compare with MH-Tikhonov, which to our knowledge is currently among the best methods in reconstruction quality. However, because MH-wEnet alone is not suitable for all video sequences, we adopt MS-wEnet instead of MH-wEnet.

5.3 Contribution comparison of MH-wEnet and SH to MS-wEnet

Because MS-wEnet involves mode selection, it is necessary to determine the contribution each mode makes to the reconstruction performance gain of MS-wEnet. This time, to make the SH mode play a role in MS-wEnet under the threshold T = 1e-7, we choose frames 1, 5, and 9 of the News sequence, treated as in Section 5.2. Figure 6 shows the reconstruction quality of the 5th frame using MS-wEnet and the SH method alone (with the MMSE distortion criterion). Table 4 lists the PSNR values, the reconstruction time, and the proportion of blocks handled by the SH mode in each method.

Fig. 6 Reconstruction quality of frame 5 of the News sequence for MS-wEnet/SH

Table 4 PSNR(dB), reconstruction time(s) and SH proportion for frame 5 of News sequence for MS-wEnet/SH

From these results, the reconstruction quality achieved by MS-wEnet is significantly better than that of the SH method alone, so in terms of prediction quality the MH-wEnet mode is much better than the SH mode. Table 4 also shows that the reconstruction time of the SH method alone is much less than that of MS-wEnet, so in terms of prediction speed the SH mode is much faster than the MH-wEnet mode. It should be emphasized that, in the reconstruction applying MS-wEnet, the proportion of blocks using the SH mode is only 0.313. Therefore, compared with the SH mode, the MH-wEnet mode contributes the dominant share of the reconstruction performance gain of MS-wEnet.

5.4 Performance of the proposed CVS scheme for video

We are most interested in the performance of the proposed CVS system on entire video sequences, so this section makes a comprehensive comparison between the proposed CVS scheme and the others. Note that all CVS schemes use the same plain compressive sampling, though different recovery strategies. As a result of Sections 5.2 and 5.3, MS-wEnet and MH-Tikhonov are the main participants in the final comparison. We also add MH-OMP and Dir-BCS-SPL, which applies BCS-SPL directly to reconstruction. MH-OMP can be considered a benchmark for the methods using MH prediction, and the comparison with Dir-BCS-SPL shows how effective MH prediction is at improving reconstruction quality.

In the experiments, we employ the first 88 frames of the commonly used test sequences Foreman, Coastguard, Mother and Daughter, Hall Monitor, and News. At the CVS encoder, we define a GOP size of P = 8. At the CVS decoder, for a fair comparison among the schemes that take the MH/MS prediction strategy, the same reconstructed key frames must be used for the recovery of non-key frames. Moreover, the more accurately the key frames are reconstructed, the larger the quality margin obtained for the non-key frames. Therefore the enhancement process [29], in which the temporally neighboring non-key frames are utilized to re-recover and boost key frames, is applied to MH-OMP, MH-Tikhonov, and MS-wEnet. The parameter settings are the same as in Sections 5.1 and 5.2, and the Matlab code for MH-Tikhonov is available at http://www.ece.msstate.edu/~ewt16/publications/.

Figures 7, 8, 9, 10 and 11 illustrate the performance of the four CVS schemes over varying CS-subrates, with the PSNR averaged over all 88 frames of each sequence. Table 5 lists the corresponding data, and Table 6 presents the average reconstruction time per frame for the various methods at each CS-subrate. Visual results for the reconstructed frame 15 of the Foreman sequence with CS-subrate = 0.1 are depicted in Fig. 12.

Fig. 7 Performance comparison with the first 88 frames of Foreman sequence

Fig. 8 Performance comparison with the first 88 frames of Coastguard sequence

Fig. 9 Performance comparison with the first 88 frames of Mother and Daughter sequence

Fig. 10 Performance comparison with the first 88 frames of Hall Monitor sequence

Fig. 11 Performance comparison with the first 88 frames of News sequence

Table 5 Average PSNR(dB) for the first 88 frames of several video sequences
Table 6 Average reconstruction time(s) per frame for all methods at each CS-subrate
Fig. 12 Reconstruction of frame 15 of Foreman sequence with CS-subrate = 0.1: (a) Dir-BCS-SPL, PSNR = 26.83 dB; (b) MH-OMP, PSNR = 26.93 dB; (c) MH-Tikhonov, PSNR = 33.31 dB; (d) MS-wEnet, PSNR = 35.86 dB

In terms of reconstruction quality, the CVS schemes based on MS-wEnet and MH-Tikhonov offer a large improvement over the remaining ones. Furthermore, at low sampling rates (CS-subrate < 0.3), MS-wEnet always outperforms MH-Tikhonov, while at high sampling rates, such as CS-subrate = 0.5, MH-Tikhonov has a very slight advantage over MS-wEnet for all sequences except Foreman. The comparison between MH-OMP and Dir-BCS-SPL shows that MH prediction greatly enhances reconstruction quality.

Besides reconstruction quality, we also consider the average reconstruction time per frame, measured as CPU time, although none of these schemes has been specially optimized for execution speed. As Table 6 shows, Dir-BCS-SPL reconstructs each frame in the shortest time, with MH-OMP taking somewhat longer. For MH-Tikhonov, the average recovery time varies little with CS-subrate for each sequence. For MS-wEnet, more time is needed as the CS-subrate increases, because of the linear relation between the parameter stop and the CS-subrate; nevertheless, compared with MH-Tikhonov, MS-wEnet is faster at low sampling rates. Moreover, in the special case of the News sequence, the reconstruction time drops further because more blocks are processed by the SH mode. As an extension, many measures could speed up the MS-wEnet-based CVS scheme: LARS-EN could be replaced with the faster glmnet [13] algorithm to solve the wEnet; in the enhancement process for key frames, MS-wEnet could be used to produce the temporally neighboring non-key frames at low sampling rates; and the threshold T in MS-wEnet could be increased substantially to reduce the number of blocks handled by MH-wEnet, at a small cost in reconstruction quality.

In conclusion, given the WMSN application background, low sampling rates draw the most attention, and at such rates the proposed CVS scheme based on MS-wEnet is superior to the one based on MH-Tikhonov in both reconstruction quality and recovery time.

6 Conclusions

In this paper, we have studied how to make efficient use of temporal redundancy to improve the performance of Compressed Video Sensing in WMSN. The proposed CVS scheme based on MS-wEnet performs hybrid hypothesis prediction in the measurement domain for each non-key frame: either the SH or the MH-wEnet prediction mode is selected, and both modes are implemented in the projection domain constructed by the measurement matrix. Once the prediction is complete, a residual reconstruction step using the measurement-domain residual accomplishes the recovery of the non-key frames. Comparison with various MH prediction methods shows that MH-wEnet provides better reconstruction quality, and comparison between MS-wEnet and MH-Tikhonov on full video sequences shows that the proposed CVS scheme outperforms the MH-Tikhonov-based one, especially at low sampling rates. As future work, to make the proposed CVS scheme more practical in WMSN, we will address the quantization of CS measurements, which remains largely underdeveloped in the CS community, and consider an adaptive compressive sampling strategy at the encoder to further reduce the amount of data.