1 Introduction

Over the past decade, convolutional neural networks (CNNs) have attracted great interest in both industry and academia and have emerged as the leading technique for visual recognition tasks. Despite the traction that CNNs have gained, a gap between expectations and reality persists: practical use of such networks in real-world applications requires high accuracy together with a short response time. However, to achieve higher accuracy, modern neural networks have become very deep and difficult to tune on large datasets. Meanwhile, the design of efficient (small and performant) neural networks has become a new research focus that has not yet matured [6]. Moreover, vision-based perception for embedded systems, such as robots or autonomous vehicles, with rapid response time and high accuracy is not easy to achieve. With limited computational power and weight budget, the real challenge is choosing sensors and algorithms able to run onboard in real time. Common computer vision approaches usually employ multiple sensors, including cameras, lasers, LIDARs, and RADARs. Nevertheless, these sensors have drawbacks: high cost, low spatial resolution, greater weight and power consumption, interference between the different sensing modalities, and preprocessing latency [40]. Cameras offer the best compromise between cost and quality at a reasonable preprocessing load [5]. Therefore, in the present work, we focus on accelerating camera-based computer vision models for image understanding. Most architectures proposed in the literature are designed to process RGB images, even though these images are compressed before or after acquisition for storage and transmission, using either engineered or machine-learned compression algorithms (Fig. 1a).
Hence, an additional decoding step, which is computationally costly and memory demanding, is required, and it complicates real-time and embedded deployment. A possible alternative is to design architectures capable of learning from compressed representations (Fig. 1b) rather than traditional RGB images, in order to speed up training and inference of deep networks [14]. Image compression algorithms have also benefited from the rise of CNNs. In addition to their outstanding compression performance, learned compression algorithms can be adapted to specific target domains, resulting in better compression rates than engineered ones. One well-known learned compression approach that has gained significant interest is compressive sensing (CS), which is tailored to image acquisition applications. A plausible idea is therefore to develop recognition architectures that work directly on CS-compressed representations and bypass the reconstruction phase.

Fig. 1 (a) RGB-based training/inference pipeline; (b) compressed-image training/inference pipeline

In this paper, we explore the use of CS with adapted CNN architectures to perform computer vision tasks. To validate the proposed approach, we choose object detection with the Single-Shot Detector (SSD) network, since it encompasses classification and regression tasks in a single network [25]. The proposed lightweight compressive-sensing architecture is inspired by [33] and is composed of two networks, one for sampling and one for reconstruction. Our approach takes the latent-space representations produced by the sampling network and uses them for training and inference in the adapted architectures. The major contributions of the paper are the following:

  • A lightweight CS network, based on the one contributed in [33], is proposed; it provides minimal latency for both sampling and reconstruction.

  • We propose a new approach that enables recognition networks to operate in CS-based ecosystems. By redesigning the early layers of the recognition network (backbone), the network exploits sampled data directly, without further reconstruction.

  • We validate the proposed approach for the detection task using the SSD object detector with heavy and light backbones, namely VGG and MobileNet, respectively.

  • Both the redesigned backbones and the SSD head are enhanced by adding Batch Normalization layers to permit training from scratch.

  • Our approach is validated on three different datasets, on both high-performance and small GPUs. The obtained results are promising: the models achieve competitive accuracy while being up to 30% faster on small GPUs, which makes them suitable for resource-constrained devices and embedded environments.

We focus on redesigning computer vision architectures for applications in embedded settings, and we validate our findings on object detection tasks. The remainder of this paper is organized as follows. Section 2 introduces the background on object detection and the CS paradigm, together with related work on compressed learning. Section 3 presents the details of the proposed approach. Section 4 provides the experimental setup, followed by the experimental results in Sect. 5. Section 6 concludes the paper.

2 Related works

2.1 Object detection

Since the rise of CNNs, deep learning-based object detection has outperformed traditional methods by a significant margin. Existing models are mainly based on RGB images and can be either two-stage or one-stage detectors. Two-stage models rely on region proposal followed by box classification, such as R-CNN [13] and its subsequent improved versions (Fast R-CNN [12], Faster R-CNN [32], Mask R-CNN [17] and Cascade R-CNN [7]). While models in this category have proven successful, they remain computationally expensive. On the other hand, the current state of the art in fast object detection is mainly driven by one-stage detectors such as YOLO [28] and SSD [25]. By reframing object detection as a single regression problem, a single network can simultaneously output bounding box coordinates and class scores [26]. For YOLO models, the prediction relies on global image features extracted from convolutional layers, which significantly improves detection speed at the cost of detection precision. Nonetheless, the improved YOLO versions [4, 29, 30] are among the fastest and most accurate object detectors, thanks to several enhancements such as multilabel object class prediction, prediction across scales, and the use of K-means clustering to determine box priors. Another popular one-stage detector is SSD, which considers a fixed set of default bounding boxes, each associated with a feature map, at different scales and aspect ratios. By coupling the box matching strategy with multi-scale features, SSD is significantly more accurate than the original YOLO network while also being faster. SSD-based models such as [15] are also among the most accurate and lightweight object detectors. Moreover, many attempts have been made to accelerate existing state-of-the-art models.
The authors of [23] proposed a general compression pipeline (through pruning, knowledge distillation, and quantization) for one-stage object detection networks to meet real-time requirements. In [16], a lightweight and fast object detector based on ShuffleNetV2 and a YOLO head is proposed. The model achieves competitive accuracy and speed while being lightweight. Likewise, many efforts have been made to accelerate object detection for specific applications. The work in [39] proposed a faster version of the SSD model based on parameter reduction and dilated convolution. The results showed that the proposed model is faster than the original one for specific applications (apple detection, bicycle detection, and vehicle detection). In [31], a real-time traffic sign detection network using DS-DetNet and a lite fusion FPN is proposed. The model achieves compelling accuracy at high speed. It should be noted that the methods mentioned above were proposed to speed up the inference stage using RGB-based architectures. The approach proposed herein aims at improving the network's inference speed by considering lightweight compressed inputs. Since this work focuses on embedded settings with limited resources, such as transportation systems and robots, the SSD one-stage object detector was chosen as the baseline architecture for the proposed solution. Nevertheless, combining our approach with other RGB-based classifiers, detectors, or segmentation networks could be of interest, as could combining it with the acceleration methods mentioned above.

2.2 Compressive sensing

Compressive sensing is a powerful sensing paradigm for sampling sparse signals with far fewer samples than the Shannon-Nyquist sampling limit requires [3, 21]. The inherent redundancy present in real signals such as images and videos allows significant data compression. CS exploits this redundancy and enables sampling at sub-Nyquist rates. This makes CS extremely useful for capturing images and videos in systems that cannot afford high data bandwidth. The ability to sample very few data points and still reconstruct the original signal helps to create imaging systems with lower power consumption. Existing CS methods are mainly classified into two categories: iterative optimization-based methods and neural network-based methods [24]. Refer to [24, 33] for a review of existing solutions. For methods in both categories, the theory behind CS guarantees that a signal that is sparse in some domain can be exactly recovered from far fewer measurements. Concretely, suppose that \(x \in R^{N \times 1}\) is a real-valued signal and \(\Phi \in R^{M \times N}\) is a sampling matrix, with \(M \ll N\). The acquisition of CS measurements is expressed as:

$$\begin{aligned} y = \Phi x \end{aligned}$$
(1)

where \(y \in R^{M \times 1}\) is the CS measurement. In general, because the number of unknowns is much larger than the number of observations [11], recovering the original signal x from its measurements y is not feasible. However, if the signal x is sparse in some domain \(\Psi\), CS theory shows that an exact recovery of x is possible. The CS reconstruction can be formulated as:

$$\begin{aligned} \min _x ||\Psi x||_p, \quad \mathrm{subject} \, \mathrm{to}: \quad y = \Phi x \end{aligned}$$
(2)

where \(\Psi x\) denotes the sparse coefficients of x in the domain \(\Psi\), and the subscript p is usually set to 1 or 0, characterizing the sparsity of the vector \(\Psi x\).
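To make Eqs. (1) and (2) concrete, the following NumPy sketch samples a synthetic sparse signal and recovers an estimate with iterative soft-thresholding (ISTA). It is an illustration under stated assumptions, not the method used in this paper: \(\Psi\) is taken to be the identity, \(\Phi\) is a random Gaussian matrix, and the equality-constrained problem with p = 1 is replaced by its common Lagrangian relaxation \(\min_x \frac{1}{2}||y - \Phi x||_2^2 + \lambda ||x||_1\). The names `ista` and `objective` and all numeric values are hypothetical choices made for the example.

```python
import numpy as np

def ista(y, Phi, lam=0.01, n_iter=500):
    """Iterative soft-thresholding for min_x 0.5*||y - Phi x||^2 + lam*||x||_1."""
    # Step size 1/L, where L is the largest eigenvalue of Phi^T Phi.
    L = np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)      # gradient of the data-fit term
        z = x - grad / L                  # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

def objective(x, y, Phi, lam):
    """Value of the relaxed objective (assumed form, see lead-in)."""
    return 0.5 * np.sum((y - Phi @ x) ** 2) + lam * np.sum(np.abs(x))

rng = np.random.default_rng(0)
N, M, k = 100, 40, 3          # signal length, measurements (M << N), sparsity
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)

Phi = rng.normal(size=(M, N)) / np.sqrt(M)   # random sampling matrix (assumption)
y = Phi @ x_true                              # Eq. (1): y = Phi x

x_hat = ista(y, Phi, lam=0.01)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print("measurements:", y.shape, "relative error:", rel_err)
```

With step size 1/L, each ISTA iteration does not increase the relaxed objective, which is why the sketch needs no line search; how close the estimate gets to the true signal depends on M, the sparsity k, and \(\lambda\).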

2.3 Neural networks and compression: compressed learning

Compressed learning was first introduced in [1], which showed that direct inference from compressed representations and measurements is feasible with high performance. In light of this approach, and given that training speed and particularly inference speed are critical, many works have focused on accelerating network computations by employing spatial frequency decomposition and other compressed image representations obtained with engineered codecs [8, 9, 14]. Other works have explored deep learned compressed data with promising results [1, 34]. In [14], the authors demonstrate that the DCT coefficients available in the JPEG image format are an effective input representation for CNNs. Instead of RGB pixels, the network is fed a JPEG-compressed image that is half-decompressed to \(8\times 8\) block DCT coefficients. The method achieves good performance while offering a significant speedup, primarily by replacing the first two blocks of the network with the already-computed JPEG DCT coefficients. Several approaches are evaluated to sample and place the DCT input into the network. However, few architectures that exploit features obtained from learned image compression algorithms exist in the literature. A relevant work is [34], where the authors explore the use of learned compressed image representations for solving two computer vision tasks (classification and semantic segmentation) without a decoding step. The compressed inputs are acquired using a heavy auto-encoder architecture. Compared with the results obtained using RGB images, the results are similar for classification and slightly better for segmentation (especially at aggressive compression rates). The authors also jointly train the compression and classification networks and show that this can enhance the results. However, the deep encoder constitutes a memory and computation burden and is inefficient. Another work is the one proposed in [41].
The author proposed a compressive convolutional network (CCN), which is a compressive-sensing-enabled CNN. The CCN optimizes and reuses the convolution operations of the first layers of the detector for recoverable data embedding and image compression, so no extra computational overhead is required for image compression. However, the detection task is still performed on RGB images, as usual. The approach we propose is most closely related to [34], since we aim to use deep learned compressed image representations directly to solve vision tasks.

3 Methodology

Traditionally, CNNs deployed in embedded systems such as robots or autonomous vehicles are fed with a flow of RGB images from a high-resolution acquisition device to perceive the surrounding environment better. Frame-by-frame processing of such an enormous amount of data requires hefty computational resources.

Additionally, motivated by the emergence of deep learning and compressive sensing, different learned sampling mechanisms have been developed to incorporate CS in image and video acquisition. In contrast to traditional compression algorithms such as JPEG, learned CS does not override the spatial structure of images, as it generates feature maps ready to be exploited by adapted architectures. Consequently, using such sampling methods for data acquisition and then exploiting the acquired data directly would be more efficient than traditional pipelines. In this scope, the present work proposes and explores an efficient learned compression method, namely learned CS with both real-time sampling and reconstruction, for the object detection task, and adapts state-of-the-art architectures accordingly. Figure 2 illustrates the proposed pipeline.

Fig. 2 The proposed architecture for joint object detection and image sampling

The proposed solution transforms the recognition problem into a multiple-output learning problem [37]. Given an input image, the proposed scheme simultaneously learns to predict two outputs. The first is the reconstructed image produced by the CS branch, and the second is the set of bounding boxes and scores produced by the detection branch. It offers two training procedures:

  • The entire network can be trained jointly, either from scratch or from a previously trained state of the CS network as an initialization;

  • The detection network can be trained using a fixed previously trained sampling network.

However, in this work, only the first training procedure is explored. Furthermore, during deployment and inference, the reconstruction part can be omitted, since a compressive camera using the learned sampling weights delivers representations ready to be exploited by the modified detection network. Hence, the gain for such architectures mainly stems from reducing the data transferred between CPU and GPU for image compression and decompression. As can be seen in Fig. 2, once the sampling network has generated the compressed representation, this representation is fed both to the reconstruction network, to optimize for the mean-squared reconstruction error, and to the new detection network (named CS_D), to optimize for the classification and bounding box regression losses. The total loss function is:

$$\begin{aligned} L_\mathrm{c} = \beta L_\mathrm{CS} + L_\mathrm{OD} \end{aligned}$$
(3)

where \(L_\mathrm{CS}\) is the loss term for the compressive sensing network (see Eq. 4), and \(L_\mathrm{OD}\) is the CS_D loss (see Eq. 5). \(\beta\) controls the trade-off between compression loss and detection loss.

$$\begin{aligned} L_\mathrm{CS} = \mathrm{MSE}\left( x,\hat{x}\right) \end{aligned}$$
(4)

where x is the input image and \(\hat{x}\) is the reconstructed image such that \(\hat{x} = R(S(x))\), with S and R denoting the sampling and reconstruction networks (Sect. 3.1).

$$\begin{aligned} L_\mathrm{OD} = L(x,c,l,g) = \frac{1}{N} \left( L_\mathrm{conf}(x,c) + \alpha L_\mathrm{loc}(x,l,g)\right) \end{aligned}$$
(5)

Similarly to [25], the CS_D loss function consists of two terms, \(L_\mathrm{conf}\) and \(L_\mathrm{loc}\), where N is the number of matched default boxes. \(L_\mathrm{loc}\) is the localization loss, which is the Smooth L1 loss between the predicted box l and the ground-truth box g. \(L_\mathrm{conf}\) is the confidence loss, which is the Softmax loss over the multi-class confidences c (\(\alpha\) is set to 1 by cross validation).
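As a rough illustration of how Eqs. (3) to (5) combine, the sketch below evaluates the three loss terms with NumPy. It makes several simplifying assumptions that are not taken from our implementation: boxes are assumed to be already matched and encoded, hard negative mining as used in SSD [25] is omitted, and the function names and the value of \(\beta\) in the usage lines are hypothetical.

```python
import numpy as np

def mse(x, x_hat):
    """Eq. (4): mean-squared reconstruction error."""
    return float(np.mean((x - x_hat) ** 2))

def smooth_l1(pred, target):
    """Smooth L1 loss summed over all box coordinates."""
    d = np.abs(pred - target)
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def softmax_cross_entropy(logits, labels):
    """Softmax loss summed over boxes; logits has shape (n_boxes, n_classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.sum(log_probs[np.arange(len(labels)), labels]))

def detection_loss(logits, labels, loc_pred, loc_gt, alpha=1.0):
    """Eq. (5): (L_conf + alpha * L_loc) / N, with N matched boxes."""
    n = len(labels)
    if n == 0:                      # as in SSD, the loss is 0 when nothing matches
        return 0.0
    return (softmax_cross_entropy(logits, labels)
            + alpha * smooth_l1(loc_pred, loc_gt)) / n

def total_loss(x, x_hat, logits, labels, loc_pred, loc_gt, beta=1.0, alpha=1.0):
    """Eq. (3): L_c = beta * L_CS + L_OD."""
    return beta * mse(x, x_hat) + detection_loss(logits, labels, loc_pred, loc_gt, alpha)

# Toy usage with two matched boxes and two classes (all values are made up).
x, x_hat = np.zeros((2, 2)), np.ones((2, 2))
logits, labels = np.zeros((2, 2)), np.array([0, 1])
loc_pred = np.array([[0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
loc_gt = np.zeros((2, 4))
print(total_loss(x, x_hat, logits, labels, loc_pred, loc_gt, beta=0.5))
```

Because both branches share the sampling network, a single backward pass through this scalar updates the sampling weights with gradients from reconstruction and detection at once; \(\beta\) sets their relative weight.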

3.1 Lightweight CNN for image compressive sensing: L_CSnet

The proposed lightweight CNN for image CS is inspired by [33]. It has a sampling network and a reconstruction network. The sampling network is used to learn the sampling matrix and acquire CS measurements. The reconstruction network, which is linear, learns an end-to-end mapping from the CS measurements to the reconstructed images (Fig. 3). In contrast to [33], the deep nonlinear reconstruction network used for quality refinement is removed.

Fig. 3 Proposed light compressive sensing CNN architecture

3.1.1 Sampling network

First, the image is divided into non-overlapping blocks of size \(B \times B \times l\), where l is the number of channels. Then, using a sampling matrix \(\Phi _B\) of size \(n_B \times l B^2\), the CS measurements are acquired. A convolution layer is used to imitate the compressed sampling process, with each row of the sampling matrix \(\Phi _B\) treated as a filter. The size of each filter in the sampling layer is \(B \times B \times l\) (matching the size of the image blocks), so that each filter outputs one measurement. For the sampling ratio \(\frac{M}{N}\), there are \(n_B = \frac{M}{N}l B^2\) rows in the sampling matrix \(\Phi _B\), yielding \(n_B\) CS measurements. Therefore, there are \(n_B\) filters of size \(B \times B \times l\) in this network. Formally, the sampling process S can be expressed as:

$$\begin{aligned} y = S(x) = W_{s} *x \end{aligned}$$
(6)

where \(*\) represents the convolution operation, x is the input image, y is the CS measurement, and \(W_s\) corresponds to the \(n_B\) filters of size \(B \times B \times l\). The output of the sampling network for an image block is a \(1 \times 1 \times n_B\) vector. The learned sampling matrix can efficiently exploit the characteristics of images and make the CS measurements retain more structural information for better reconstruction. In the application phase, the learned sampling matrix is used as an encoder to generate CS measurements.
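The equivalence between block-wise sampling with \(\Phi _B\) and a strided convolution (Eq. 6) can be checked with a short NumPy sketch. This is an illustration only: a random matrix stands in for the learned \(\Phi _B\), the function name `sample_blocks` is hypothetical, and a channels-last layout is assumed.

```python
import numpy as np

def sample_blocks(image, Phi_B, B):
    """Apply Phi_B to every non-overlapping B x B x l block of `image`.

    Equivalent to a convolution with n_B filters of size B x B x l,
    stride B, and no bias (Eq. 6).
    """
    h, w, l = image.shape
    assert h % B == 0 and w % B == 0, "image must tile exactly into blocks"
    # (h/B, B, w/B, B, l) -> (h/B, w/B, B, B, l) -> (h/B, w/B, l*B*B)
    blocks = image.reshape(h // B, B, w // B, B, l).transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(h // B, w // B, B * B * l)
    return blocks @ Phi_B.T        # one n_B-vector of measurements per block

B, l, ratio = 4, 3, 0.25
n_B = int(ratio * l * B * B)       # 0.25 * 3 * 16 = 12 filters
rng = np.random.default_rng(0)
Phi_B = rng.normal(size=(n_B, l * B * B))   # stand-in for the learned matrix
image = rng.random((300, 300, l))

y = sample_blocks(image, Phi_B, B)
print(y.shape)   # (75, 75, 12)
```

Each spatial position of the output holds the \(n_B\) measurements of one block, so the result keeps the block grid of the image and can be consumed by later convolutional layers as an ordinary feature map.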

3.1.2 Reconstruction network

Image reconstruction conventionally uses the pseudo-inverse of the sampling matrix \(\Phi _B\), following the works in [33, 38]. Given the CS measurements \(y_j\) of the \(j\)-th block, its reconstruction result is \(\overset{\sim }{{x_j}} = \overset{\sim }{{\Phi }}_B y_j\), where \(\overset{\sim }{{\Phi }}_B\) is a matrix of size \(l B^2 \times n_B\). Similar to the sampling process, a convolution layer with a specific kernel size and stride can be used to implement the reconstruction process. Similar to [33], the matrix \(\overset{\sim }{{\Phi }}_B\) is optimized instead of using the pseudo-inverse of \(\Phi _B\). The reconstruction process R can be expressed as:

$$\begin{aligned} \overset{\sim }{{R}}(y) = W_\mathrm{int} *y \end{aligned}$$
(7)

where y is the CS measurement and \(W_\mathrm{int}\) denotes the filters. Each of the \(l B^2\) convolution filters in the reconstruction layer has size \(1 \times 1 \times n_B\). The stride of this convolution layer is set to \(1 \times 1\) to reconstruct each block, and the bias is ignored. Each column of \(\overset{\sim }{{R}}(y)\) is a \(1 \times 1 \times l B^2\) vector corresponding to an image block of size \(B \times B \times l\). A combination layer, which contains a reshape and a concatenation function, is used to obtain the final reconstructed image.
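The linear reconstruction of Eq. (7), followed by the combination layer, can be sketched in NumPy as follows. The sketch is illustrative: the pseudo-inverse of a random \(\Phi _B\) stands in for the optimized matrix \(\overset{\sim }{{\Phi }}_B\), the measurements are random placeholders, and the function name `reconstruct_blocks` and the channels-last layout are assumptions.

```python
import numpy as np

def reconstruct_blocks(y, Phi_tilde_B, B, l):
    """Linear reconstruction: a 1 x 1 convolution followed by the combination layer.

    y           : (h/B, w/B, n_B) CS measurements
    Phi_tilde_B : (l*B*B, n_B) reconstruction matrix
    """
    hb, wb, _ = y.shape
    blocks = y @ Phi_tilde_B.T                    # (h/B, w/B, l*B*B), Eq. (7)
    blocks = blocks.reshape(hb, wb, B, B, l)      # reshape each vector to a block
    # Concatenate the blocks back into an image of size (h, w, l).
    return blocks.transpose(0, 2, 1, 3, 4).reshape(hb * B, wb * B, l)

B, l, n_B = 4, 3, 12
rng = np.random.default_rng(0)
Phi_B = rng.normal(size=(n_B, l * B * B))         # stand-in sampling matrix
Phi_tilde_B = np.linalg.pinv(Phi_B)               # pseudo-inverse as a stand-in

y = rng.normal(size=(75, 75, n_B))                # placeholder measurements
x_hat = reconstruct_blocks(y, Phi_tilde_B, B, l)
print(x_hat.shape)   # (300, 300, 3)
```

With the pseudo-inverse, re-sampling a reconstructed block returns its original measurements; in the proposed network this matrix is learned, so that property is not guaranteed and reconstruction quality is driven by the MSE loss instead.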

3.2 Recognition network

Our goal is to design a CNN for object detection starting from compressed representations of images obtained via the sampling network of the deep CS model described in the previous subsection. As mentioned earlier, the chosen baseline to design such a network is the state-of-the-art object detector SSD.

3.2.1 RGB baselines

Like most CNN object detectors (except YOLO and its variants), SSD relies on existing classification modules. Originally, SSD was designed to use the VGG network as a backbone. However, object detection networks are not restricted to a specific classification backbone and can easily use others, provided that the dimensions of the input images and network outputs are compatible. The proposed approach was implemented using two backbones, namely VGG and MobileNet [19], to evaluate its impact on both heavy and lightweight architectures. Other computer vision tasks, such as classification and segmentation, can also be used alongside the proposed approach. We first use the original SSD framework (based on VGG), then we study the effectiveness of BatchNorm for training SSD from scratch, following the approach proposed in [42] (denoted SSD_BN). We incorporate BatchNorm layers after each convolutional layer in the VGG backbone and the head (MobileNet already has BatchNorm layers). According to the authors of [42], BatchNorm renders the optimization landscape remarkably smoother, inducing more predictable and stable gradient behavior, which permits a larger search space and faster convergence.
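For readers unfamiliar with the operation inserted after each convolution in SSD_BN, the following NumPy sketch shows a training-mode batch normalization over a batch of feature maps. It is a generic illustration rather than our implementation: the channels-last layout, the function name `batch_norm_train`, and the parameter names `scale` and `shift` are assumptions, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_train(x, scale, shift, eps=1e-5):
    """Training-mode batch normalization for feature maps of shape (N, H, W, C).

    Statistics are computed per channel over the batch and spatial axes.
    """
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return scale * x_norm + shift

rng = np.random.default_rng(0)
feat = rng.normal(loc=5.0, scale=3.0, size=(8, 10, 10, 4))   # stand-in conv output
out = batch_norm_train(feat, scale=np.ones(4), shift=np.zeros(4))
print(out.mean(axis=(0, 1, 2)), out.std(axis=(0, 1, 2)))
```

Since the statistics are estimated from the current batch, their quality depends on the batch size, which is consistent with the large batches used when training the BatchNorm variants from scratch.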

3.2.2 CS-based architectures

The use of compressed representations requires the design of suitable CNNs. Figure 2 illustrates the proposed architecture, where the global model is composed of two subnetworks, the CS one and the SSD one. The SSD subnetwork (i.e. backbone and head) used in the CS-based models is SSD_BN. It is further adapted to allow compressed data processing and is denoted CS_D_i_j, where i stands for the block size B and j for the sampling rate \(\frac{M}{N}\) used in the CS subnetwork.

It is worth mentioning that the block sizes were chosen to produce compressed representations that could be fed to the CS_D models. In the literature [11, 27, 33], a block size of 32 was used, but in this work, other values were explored. For input images with dimensions \(h \times w \times l\), the sampling network of the CS model outputs a compressed representation with dimensions \(\frac{h}{B} \times \frac{w}{B} \times n_B\), where \(n_B\) is the number of produced feature maps (corresponding to the \(n_B\) filters described in Sect. 3.1.1). Therefore, variants of the backbone architectures are proposed to use these latent-space representations as input. Similar to [34], these networks are designed by cutting off the front layers of the regular RGB models, whose spatial dimensions are larger than those of the compressed representation.

The spatial input sizes investigated are \(75 \times 75\) and \(38 \times 38\). The first is obtained using a block size of \(B = 4\) and input images of \(300 \times 300\), and the second using a block size of \(B = 8\) and input images of \(304 \times 304\). For the VGG backbone, the first and second convolutional blocks are removed for the former inputs, and the compressed representations are fed to the third convolutional block. Similarly, for the latter inputs, the preceding blocks are removed and the compressed representations are fed directly to the fourth convolutional block. As for the MobileNet network, only inputs of \(75 \times 75\) are used, with the first three convolutional blocks removed.
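The arithmetic behind these input sizes can be written as a small helper. This is only a sketch: the function name `compressed_shape` is hypothetical, and rounding \(n_B\) to the nearest integer (with a minimum of 1) is an assumption of the example, since \(\frac{M}{N} l B^2\) is not an integer for every sampling ratio.

```python
def compressed_shape(h, w, l, B, ratio):
    """Shape of the sampling-network output for an h x w x l image.

    n_B = ratio * l * B^2 is rounded to the nearest integer (at least 1);
    the rounding rule is an assumption of this sketch.
    """
    if h % B or w % B:
        raise ValueError("image size must be a multiple of the block size")
    n_B = max(1, round(ratio * l * B * B))
    return h // B, w // B, n_B

print(compressed_shape(300, 300, 3, B=4, ratio=0.25))   # (75, 75, 12)
print(compressed_shape(304, 304, 3, B=8, ratio=0.25))   # (38, 38, 48)
```

Note that 300 is not a multiple of 8, whereas 304 is, which is consistent with the slightly larger input resolution used for \(B = 8\).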

4 Experimental setup

4.1 Evaluation metrics

To evaluate the proposed CS network, we use full-reference image quality assessment based on the Peak Signal-to-Noise Ratio [18] (PSNR) and Structural Similarity Index [36] (SSIM) metrics. The former is used because it correlates with the pixel-based loss used in the optimization of the CS branch (see Eq. 4). The latter is used to better measure the similarity of images as perceived by humans. Sampling and reconstruction latency are also analyzed. For the object detection task, we use the mean average precision (mAP) at different IoU thresholds, along with the number of frames per second (FPS) that the network can process [27]. In addition, the number of floating-point operations [20] (FLOPs), which represents the amount of computation of a model, is used to measure model complexity.
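For reference, two of the quantities above can be computed as in the following sketch, which uses the standard definitions of PSNR and of the intersection over union (IoU) between two boxes. The function names, the 8-bit peak value of 255 and the corner-based box format are assumptions of the example, not details of our evaluation code.

```python
import math
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=float)
    tst = np.asarray(test, dtype=float)
    err = np.mean((ref - tst) ** 2)           # mean-squared error
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7, about 0.143
```

A detection counts as correct at a given IoU threshold when its IoU with a ground-truth box of the same class reaches that threshold; mAP then averages the per-class average precision.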

4.2 Datasets

The experiments were carried out on the YYmnist, Pascal VOC and Mask datasets (Table 1). It should be mentioned that the classes of the first dataset are similar, as all of them are black and share a white background. As for Pascal VOC, the classes belong to sparse and heterogeneous categories (cow vs. tv monitor, for example), and the last dataset has only one class (with mask).

Table 1 Datasets characteristics

4.3 Training procedure

On the Pascal VOC dataset, we used the same training settings as the original SSD (the baseline for this study), including data augmentation and anchor settings, for all models (RGB-based SSD and CS-based SSD). When training the SSD_BN variants, obtained by adopting the approach proposed in [42], we use the configuration proposed there (learning rate, batch size, etc.) to ascertain the effectiveness of BatchNorm layers in training from scratch. For the YYmnist dataset and the Mask dataset, a simple data augmentation pipeline is used to accelerate the training process.

After hyperparameter tuning on NVIDIA Tesla V100 GPUs, all generated models are trained from scratch. A large batch size (128 images) is used for training to ensure stable BatchNorm statistics during the training phase. All models are trained for a fixed number of epochs with no early stopping to ensure a fair comparison of their test results. The loss function to minimize is described in Eq. (3).

5 Experimental results

5.1 Models complexity

Execution time required for a forward pass through a neural network depends on the number of floating-point operations (FLOPs). From Fig. 4, we can see that by applying our approach:

  • VGG backbone: for the set of models where a block size of 4 is used, the reduction in FLOPs is \(31\%\), and for those where a block size of 8 is used, it is up to \(57\%\);

  • MobileNet backbone: the number of FLOPs is reduced by \(23\%\) using a block size of 4.

Fig. 4 Model complexity (BFLOPS) for the different networks
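To give a feel for where such savings come from, the sketch below counts the FLOPs of the convolutions in the first two VGG blocks, which are removed when \(B = 4\), and of the sampling layer that replaces them. It rests on several assumptions and is not meant to reproduce Fig. 4: the standard VGG-16 layer widths at a \(300 \times 300\) input are assumed, one multiply-add is counted as 2 FLOPs, and BatchNorm, pooling, bias terms and the change in input channels of the first retained layer are ignored. The function name `conv_flops` is hypothetical.

```python
def conv_flops(h_out, w_out, c_in, c_out, k=3):
    """FLOPs of one convolution layer, counting a multiply-add as 2 FLOPs."""
    return 2 * k * k * c_in * c_out * h_out * w_out

# First two VGG-16 blocks at 300 x 300 input: (output size, c_in, c_out).
# These layer widths are the standard VGG-16 ones (assumed here).
removed_layers = [
    (300, 3, 64), (300, 64, 64),       # block 1
    (150, 64, 128), (150, 128, 128),   # block 2
]
removed = sum(conv_flops(s, s, ci, co) for s, ci, co in removed_layers)

# Sampling layer for B = 4 and ratio 0.25: 12 filters of size 4 x 4 x 3, stride 4.
added = conv_flops(75, 75, 3, 12, k=4)

print(f"removed: {removed / 1e9:.2f} GFLOPs, added: {added / 1e9:.4f} GFLOPs")
```

Under these assumptions, the early high-resolution layers account for a large number of operations, while the strided sampling layer that replaces them is comparatively cheap.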

5.2 Evaluation of the proposed lightweight CS network

Herein, we investigate the performance of the L_CSnet network in terms of both image reconstruction quality and running speed (Table 2). We compare it with the JPEG standard [35], as it is one of the most popular and effective compression algorithms, and with two state-of-the-art deep learning-based CS methods, namely ReconNet [22] and CSnet [33]. For this part of the evaluation, our model was trained on the DIV2K dataset [2] using \(256 \times 256\) grayscale images. For a fair comparison, we follow [22, 33] and use a block size of \(32\times 32\) and Set11 [2] as the evaluation dataset. Refer to section S1 (Supplementary Information) for a more detailed full-reference evaluation of the proposed CS network based on [10].

Table 2 Average running time and image quality metrics for the lightweight_CSNet, the JPEG standard, ReconNet [22] and CSnet [33] for a sampling ratio of \(\frac{M}{N}=0.1\) on Set11 test images

Compared with the JPEG standard, the deep learning-based CS methods are much faster (by 95.24%, 97.34% and 99.76% for ReconNet, CSnet and L_CSnet, respectively). However, the JPEG standard performs better on the image quality metrics (PSNR and SSIM). Among the learned CS methods, our model is the fastest while offering acceptable reconstruction performance (a gain of 1.56 dB over ReconNet and a loss of 2.53 dB relative to CSnet). We attribute this gain in speed to the linearity of the reconstruction branch of the proposed CS network (no enhancement step as in [33]). This is also the reason for the loss in image quality. Still, similar to [22, 33], the proposed solution is learned and can therefore easily be adapted to specific target domains, such as stereo, medical, and aerial imaging, leading to even better compression performance. Figure 5 and section S2 illustrate qualitative results of the proposed compressive sensing model.

Fig. 5 Sample from PASCAL VOC dataset: (left) original image, (right) reconstructed image using the sampling/reconstruction branch of the CS_D model

Even when employed in classical training and inference pipelines, L_CSnet remains attractive, since it offers shorter encoding/decoding times, mainly due to the linear reconstruction branch (Figure S2.4).

5.3 Analysis of Batch-normalization when training from scratch

To ascertain the contribution of the batch-normalization layers, we train SSD from scratch without BatchNorm as our baseline. As mentioned before, BatchNorm induces a remarkably smoother optimization landscape, permitting a larger search space and faster convergence. With the original SSD configuration, our baseline reaches 63.4% mAP on VOC 2007, which is 12.5 percentage points lower than the performance reached by the detector when it is initialized with a pre-trained classification network (i.e. 75.9%). The SSD_BN model, in turn, reaches an mAP equivalent to that of the detector initialized with the pre-trained VGG backbone (i.e. 75.8%). Refer to Table 3 for more details.

Table 3 Analysis of BatchNorm for SSD trained from scratch on VOC 2007 test set

5.4 Detection results

After hyperparameter tuning of the newly designed multi-output learning networks (CS_D), the Adam optimizer was used, since the network did not converge with the SGD optimizer, even though SGD was the one that permitted the SSD_BN network to converge from scratch. Also, higher learning rates adversely affect the CS part of the network, leading to divergence (a learning rate of 0.001 is used after tuning). Furthermore, the SSD_BN is retrained from scratch using this configuration for a fair comparison (refer to Table 3 for more details).

5.4.1 Detection results on the YYmnist dataset

Extensive experiments were conducted on the YYmnist dataset. For both backbones, the mAP obtained using compressed representations is similar to that of the networks using the full input (see Fig. 6). Consequently, it appears that full images are not critical to correctly detecting objects within images (refer to Figure S2.5 for a sample image alongside the 4 highest-entropy channels of the compressed representation and the boxes predicted from it). The obtained results align with those of [5, 14, 34], which claim that CNNs are resilient to image compression provided that the compression level is adequate. Moreover, from Table 4, we can see that the loss in accuracy is negligible compared to the gain in FPS, and it could be reduced by using more sophisticated augmentation pipelines. Also, as mentioned before, the classes of this dataset (MNIST classes) share many characteristics and are not sparse, which brings the results of the CS-based SSD models close to those of their original counterparts. After further training of the best-performing configurations, the CS_D models reached the accuracy of the baseline (refer to Table 4).

Fig. 6 mAP vs FPS for the CS-based detection networks

Table 4 FPS, mAP and image quality metrics for different backbones, sampling rates \(\frac{M}{N}\) (averaged over test images) and block sizes

For the VGG-based models, using a block size of \(B=8\) resulted in the largest improvements in speed. All the sampling ratios used resulted in almost the same speed, with slightly different drops in mAP. The worst performance is obtained for the sampling ratio of 0.01 (compression rate = \(99\%\)), with a decline in mAP of \(18.6\%\). Conversely, the CS_D_8_5 network is \(30\%\) faster than the original SSD while being only \(0.017\%\) less accurate on the mAP metric. This speed-up is due to the fact that the CS_D_8_5 branch used for detection has three fewer convolutional blocks than the original SSD and thus requires less time to process its inputs and produce predictions. Using a block size of \(B=4\) resulted in a smaller FPS improvement than using a block size of \(B=8\), for two reasons: first, there is an additional convolutional block in the CS_D_4_j models compared to the CS_D_8_j models; second, the CS measurement obtained with a block size of 4 is larger (\(75 \times 75 \times c\) for \(B=4\) vs \(38 \times 38 \times c\) for \(B=8\), where c is the number of channels and depends on the sampling ratio \(\frac{M}{N}\)) and thus requires more time to flow through the detection branch. The most significant improvement is for the sampling ratio of 0.01, with a drop in mAP of \(4.1\%\). According to the results reported in Table 4, even though the sampling ratios 0.5, 0.25 and 0.1 yield increasingly compressed representations (fewer channels with smaller sampling ratios), the FPS that the models can process seems to saturate, with identical performance or only a small drop. In general, the FPS ratio increased when reducing the sampling ratio (increasing the compression rate) and increasing the block size, except for the value 0.5 with a block size of 4. A possible explanation for this behaviour is that hyper-parameters tuned for one configuration are not the best for all configurations.
Thus, block size, compression ratio, and image resolution are new hyper-parameters to tune. Similarly, for the Mobilenet-based models, the CS_D model is 32.1% faster than the RGB one while being only 3.8% less accurate. Training for an additional 20 epochs reduces this gap to only 1%.
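The relation between block size, sampling ratio and the shape of the CS measurement can be made concrete with a short calculation. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the \(300 \times 300\) input resolution is an assumption (the usual SSD input size), and the rule used for the channel count \(c\) (rounding \(\frac{M}{N} \cdot 3B^2\)) is an assumption about block-based sampling rather than a value taken from the text.

```python
import math


def cs_measurement_shape(height, width, block_size, sampling_ratio, in_channels=3):
    """Return (rows, cols, channels) of a block-based CS measurement.

    Assumptions (not stated in the text): blocks do not overlap, the image is
    padded up to a multiple of the block size, and every block of
    block_size * block_size * in_channels values is mapped to
    round(sampling_ratio * that length) measurements (at least one).
    """
    rows = math.ceil(height / block_size)
    cols = math.ceil(width / block_size)
    block_len = block_size * block_size * in_channels
    channels = max(1, round(sampling_ratio * block_len))
    return rows, cols, channels


def compression_rate(sampling_ratio):
    """Compression rate in percent for a sampling ratio M/N."""
    return 100.0 * (1.0 - sampling_ratio)


if __name__ == "__main__":
    # Assumed 300 x 300 input; block sizes and ratios are those discussed above.
    for block_size in (4, 8):
        for ratio in (0.5, 0.25, 0.1, 0.01):
            shape = cs_measurement_shape(300, 300, block_size, ratio)
            print(block_size, ratio, shape, compression_rate(ratio))
```

Under these assumptions, the spatial size of the measurement is \(75 \times 75\) for \(B=4\) and \(38 \times 38\) for \(B=8\), which agrees with the shapes quoted above; only the channel count changes with the sampling ratio.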

5.4.2 Detection results on the PASCAL VOC dataset

The preliminary results obtained in the first experiments allowed us to choose the best configurations for validating the proposed approach on the PASCAL VOC dataset. Four VGG-based configurations are selected (CS_D_4_01, CS_D_8_5, CS_D_8_25, and CS_D_8_1) and compared with the SSD_BN network. The chosen models are those that delivered the best results in terms of speed-up and accuracy. The results obtained on the PASCAL VOC dataset follow the findings of the first experiments on the YYmnist dataset (Table 5). However, for the PASCAL VOC dataset, the loss in accuracy is more critical (9.1% for the best configuration). There may be two reasons for this: first, the sparsity and heterogeneity of the PASCAL VOC classes; second, the inability of the CS network used to encode the relevant features of such a dataset.

As mentioned before, the CS network proposed to validate our approach is single-scale, lightweight and linear (Fig. 3). Therefore, its ability to learn relevant features of many classes is limited (Figures S4.7 and S4.8 illustrate some detection results on the PASCAL VOC test set). To overcome these limits, we propose a variant of our approach using a multi-layer CS network, described in detail in S3. The rationale behind this step is the assumption that multi-layer sampling would enhance the ability of the CS network to represent sparse features. However, the results show that single-scale and multi-scale sampling networks perform equivalently when used in our approach (refer to S3 for detailed results). We believe the reason behind this performance is the absence of both nonlinearity and bias in the sampling network (to maintain compatibility with conventional CS). According to [34], the recognition network would perform better if an autoencoder architecture were used to obtain the compressed representations. However, the drawback of using an autoencoder is the memory and time the encoder needs to generate the feature maps, which makes it suitable for neither embedded systems nor real-time applications. Thus, to improve the performance, the focus should be on the recognition branch. Furthermore, from the obtained results (Fig. 7), we conclude that the detection branch performance is not affected by the image quality metrics, SSIM and PSNR, confirming that the enhancement block in the reconstruction branch is not needed in our approach.
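To illustrate what the absence of nonlinearity and bias means for the sampling network, the following minimal sketch applies a measurement matrix to a flattened image block and checks that the operation is linear. It is an illustration only: the function names are hypothetical, and the random Gaussian matrix is a stand-in for the sampling weights, which in the proposed approach are learned jointly with the detection weights.

```python
import random


def make_sampling_matrix(num_measurements, block_len, seed=0):
    """Random Gaussian stand-in for a (learned) sampling matrix Phi."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(block_len)]
            for _ in range(num_measurements)]


def sample_block(phi, block):
    """Compute y = Phi x for one flattened block: no bias, no activation."""
    return [sum(w * v for w, v in zip(row, block)) for row in phi]


if __name__ == "__main__":
    # A 4 x 4 single-channel block (16 values) sampled at ratio 0.25 (4 values).
    phi = make_sampling_matrix(num_measurements=4, block_len=16)
    x1 = [float(i) for i in range(16)]
    x2 = [float(16 - i) for i in range(16)]

    # Linearity: sampling a weighted sum equals the weighted sum of samples.
    mixed = [2.0 * u - 0.5 * v for u, v in zip(x1, x2)]
    lhs = sample_block(phi, mixed)
    rhs = [2.0 * p - 0.5 * q
           for p, q in zip(sample_block(phi, x1), sample_block(phi, x2))]
    print(all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs)))
```

Because every measurement is a plain weighted sum of pixel values, stacking several such layers without a nonlinearity between them still amounts to a single linear map, which is consistent with the observation that single-scale and multi-scale sampling perform equivalently.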

Fig. 7 Detection results, mAP, SSIM and PSNR values for the different trained SSD-based models on the Pascal VOC dataset

Table 5 mAP, image quality metrics, FPS, encoding and decoding times (averaged over test images) for the different CS-based SSD models on different GPUs

As for the speed-up gain, we have found that the proposed solution delivers larger accelerations on small GPUs. The same implementation is 41.66% faster on the Nvidia GTX 950M and 21.62% faster on the Tesla V100-SXM2. A possible explanation is that powerful GPUs have more RAM (4GB for the Nvidia GTX 950M vs. 32GB for the Tesla V100-SXM2) and can thus store input data, weight parameters and activations as an input propagates through the network. It is worth mentioning that our implementation was not optimized for GPUs; hence, an optimized version is expected to deliver better results in terms of speed-up.
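For readers who want to reproduce such percentages from measured frame rates, the following minimal sketch computes a relative speed-up. It assumes that "x% faster" denotes the relative increase in FPS over the baseline, which the text does not state explicitly; the function name and the numbers in the usage line are illustrative and are not taken from Table 5.

```python
def relative_speedup(fps_baseline, fps_model):
    """Percentage by which fps_model exceeds fps_baseline.

    Assumption: "faster" is measured as the relative increase in frames
    per second with respect to the baseline model.
    """
    if fps_baseline <= 0:
        raise ValueError("fps_baseline must be positive")
    return 100.0 * (fps_model - fps_baseline) / fps_baseline


if __name__ == "__main__":
    # Illustrative values only: a baseline at 20 FPS and a model at 25 FPS.
    print(relative_speedup(20.0, 25.0))  # prints 25.0
```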

5.4.3 Detection results on the Mask dataset

We have also tested one of our best-performing models on the Mask dataset, using both the VGG and Mobilenet backbones. For the CS-based models, a high compression rate of \(75\%\) is used, which corresponds to a sampling rate of \(\frac{M}{N} = 0.25\). Refer to Table 6 for the obtained results (Figure S4.9 shows some detection results obtained using compressed data).

Table 6 FPS, mAP, encoding and decoding times for different models on the Mask dataset after 40 epochs

5.5 Approach limitations

The proposed approach is particularly suited for homogeneous datasets and for settings where memory and storage constraints are critical. However, when applied to a sparse dataset, it fails to learn relevant features for the recognition branch. This weakness arises from the linearity of the CS network, which is also the strong point of our approach. Subsequent research will focus on enhancing the detection branch outputs while exploring CS-sampled data.

6 Conclusion

This paper proposes a new efficient approach to designing CNNs for real-time applications in embedded environments with limited resources. Although existing CNN models achieve state-of-the-art performance, they still suffer from limited efficiency. To cope with this limit and validate our approach, the single-shot object detector was redesigned to use compressed data according to the proposed method. During training, a lightweight CS network is merged with a truncated recognition network for joint learning of the sampling/reconstruction and detection weights. During deployment, only the truncated backbone and the detection neck are used to predict from compressed data delivered by a compressive device that uses the sampling network of the designed lightweight CS model. Through our experiments, we showed that the detection models benefit from a reduction in parameters obtained through channel deletion and convolution reduction. Also, the proposed training workflow permits the augmentation of the dataset during training, whereas augmentation adversely affected the approach proposed in [14]. Our approach is particularly well suited for embedded use, as demonstrated by our tests on the Nvidia GTX 950M. In the future, we think some topics need to be investigated in depth to improve the proposed approach. First, our approach could be combined with those in the literature that improve performance. Second, our implementation could be optimized for GPUs to achieve better results. Third, our approach could be combined with lighter and more efficient backbones such as the one proposed in [39]. Finally, the proposed approach should be validated on other computer vision tasks, such as segmentation and tracking.