1 Introduction

Deep convolutional neural networks (DCNNs) are sought after in mobile intelligence more than ever. Since 2012, deep learning techniques have propelled CNNs to mainstream status in the computer vision field [1]. The powerful local modeling ability of DCNNs makes them excel in computer vision tasks such as image classification, object detection, segmentation, and recognition. As we have witnessed, in the pursuit of better performance, deep networks have become increasingly deeper and wider. Thus, data-driven DNNs carry massive numbers of parameters to ensure performance on highly parallel hardware, which requires an enormous amount of hardware resources to train the parameter deluge. Inevitably, most deep CNN models are large-scale and computation-intensive. Specifically, AlexNet [1] consumes more than 200 MB of memory, VGGNet [2] takes up more than 500 MB, and ResNet50 [3] is about 95 MB. Due to these high resource demands, most models with excellent performance suffer from many limitations in real-world scenarios, especially in edge intelligence (EI), where application demand is widespread.

Table 1 Related surveys and their focus

At present, deep learning primarily adopts the cloud-end paradigm in practical applications. This approach exchanges information between cloud computing servers and mobile devices and employs deep learning algorithms to address real-world problems. In this process, the edge terminal devices send requests to the cloud computing center through the network, and the computing center returns the processed results to each corresponding terminal device. However, this paradigm heavily relies on network coverage and stability, making it time-consuming, tedious, and error-prone, and it poses potential security risks. Edge terminal scenarios, such as smartphones, autopilot systems, and drones, place stringent demands on the real-time response and security of visual applications, and traditional cloud-based models may not meet these demands. In fact, there are many redundant connections in the different layers of a DNN. Lightweight CNNs are therefore potential candidates for solving vision tasks in such edge scenarios, but pruning a large-scale network to fit a resource-constrained terminal device is full of challenges. Lightweight technology aims to explore and eliminate idle neurons without significantly decreasing performance; it generally refers to lightweight model design or model compression. In addition, certain optimizations are essential in specific mobile application scenarios, such as the selection of machine learning libraries and deployment on hardware platforms. In recent years, more and more industries and computer vision communities have ventured into this area, and this wave of interest has spurred a flood of impactful studies and breakthroughs.

Table 1 lists the surveys [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] that refer to lightweight strategies within the past 5 years. The works [4,5,6,7,8,9,10,11] concentrate on lightweight methods from the viewpoint of structural models and compression technologies, while the others [12,13,14,15,16,17,18,19] deem lightweight technology an essential part of EI. [4] reviews lightweight networks that are maturely applied in object detection, one branch of computer vision tasks. [5, 7, 11] only present some classical lightweight models and lack a comprehensive overview of lightweight techniques as well as recent improvements. [6] only briefly touches on lightweight technologies in terms of manual design, model compression, and architecture search. [8] mainly reviews convolution variants with high computational efficiency and does not cover other promising lightweight techniques within the scope of CNNs. [9, 10] emphasize model compression while neglecting its peer technologies such as model design. [12] focuses on techniques for compacting and accelerating DNN models; however, it does not cover the underlying support frameworks and, owing to its age, misses the latest techniques. [13, 14, 16] mainly emphasize EI and the relationship between edge computing and intelligent applications. [15, 17,18,19] start from the perspective of deep learning, which encompasses too broad a range of technologies and applications to provide specific guidance.

Although many works try to elaborate the lightweight paradigm, they either fail to cover the key technologies comprehensively or do not fully consider the characteristics of lightweight CNN architectures. Additionally, recent improvements and trends for future directions remain vague in those works. Therefore, this paper aims to bridge these gaps by comprehensively analyzing the state-of-the-art techniques adopted in lightweight DCNNs, incorporating the underlying supports. To establish a more complete and up-to-date resource on this pivotal topic for researchers and practitioners alike, we carefully retrieve technically representative articles for summarizing. We elaborate on the evolution of lightweight DCNNs, which has been driven by deployment to edge terminals. The concept of 'lightweight' is evaluated from aspects such as the number of parameters, computational complexity, and memory consumption. All the literature involved is categorized into two classes, on algorithms and on libraries: one focuses on algorithms designed specifically for DCNNs on resource-constrained edge devices, and the other on libraries optimized for the hardware components of DCNN inference pipelines in such environments. In the pursuit of lightweight models, the software algorithms are further divided into two categories: design from scratch and the compression methods widely used on large-scale DCNNs. The difficulties, limitations, merits, and drawbacks of applying these algorithms are discussed. Based on this review and analysis, some potential and promising directions for lightweight DCNNs are proposed.

Figure 1 describes the overall structure of this survey, which is organized as follows: Sect. 2 introduces the motivation for lightweight CNNs as well as the taxonomic perspective. The representative technical works on lightweight networks are reviewed in Sect. 3. The underlying framework support and common benchmark datasets are depicted in Sects. 4 and 5, respectively. The perspective on lightweight techniques is given in Sect. 6. Finally, conclusions are drawn in Sect. 7.

Fig. 1 Overview of this survey

2 Motivation and Category

This section first discusses the motivation for utilizing lightweight CNNs and then classifies the corresponding lightweight methods.

2.1 The Motivation of Utilizing Lightweight CNNs

For a long time, artificial neural networks (ANNs) were seriously hampered by insufficient dataset samples and low hardware performance. In 2009 the ImageNet database was released [20], and with sufficient hardware computing power available, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [21] was held from 2010 to 2017. AlexNet [1] in 2012 was a milestone that promoted the great success of deep NNs in the field of image recognition. During ILSVRC, DNNs became capable of identifying objects accurately by executing large-scale computation; the cost of being "deep" is high computational complexity and high memory consumption. However, such large-scale models are not friendly to edge computing, which is the mainstream of future general intelligence. Accuracy and real-time response are the two main requirements of edge intelligence, and they require intelligent algorithms and hardware to work in concert. Conventional large-scale intelligent algorithms with high accuracy can hardly be deployed on edge devices, while tiny algorithms are under-powered for the accuracy demands. Moreover, studies have shown a certain degree of redundancy in deep models; whether on edge devices or in cloud data centers, the extra cost is unnecessary. Finally, with the failing of Moore's Law, it is becoming more and more difficult to elevate hardware performance. Lightweight neural network design is a promising solution to these issues: a lightweight network model significantly reduces the number of parameters and the computational complexity. Certainly, achieving fair performance in practical applications also requires corresponding support from the underlying libraries and hardware. The handwritten digit recognition network proposed by LeCun et al. [22] remains a classic paradigm for exploring such techniques.

2.2 Category of Lightweight DCNNs Methods

As shown in Fig. 2, CNNs extract characteristics of the inputs via convolution and pooling operations and feed them to fully connected (FC) layers to yield outputs. The loss function is then used as an optimization criterion to update the weights of each layer, aiming to minimize the loss and match the expected output. With the learned weights, CNNs are capable of performing inference on previously unseen inputs.

Fig. 2 CNNs propagation pipeline. CNN training (also known as learning) iterates forward and backward propagation, whereas inference is a forward propagation that computes the output for unseen data using the parameters learned during training. Here, \(y_{i}\) denotes the activation output of each layer and serves as the input to the next layer (i.e., its \(x_{i}\)); \(\omega\) is the weight, b is the bias, L is the loss function, and \(\bar{y}\) is the output

Large-scale DCNNs may benefit particular visual tasks but certainly require vast hardware resources and are time-consuming. In addition, relevant research has shown that there is considerable redundancy when modeling a large amount of data with deep neural networks [23]. The computational complexity mainly originates from the convolution operations, while the number of parameters is mainly determined by the fully connected layers [24]. To slim the network and reduce its computational complexity, both the model itself and the underlying framework need to be optimized to fit particular hardware. Therefore, we classify lightweight network technology into two categories, model-related and hardware-related, as shown in Fig. 3, where the model-related techniques involve design and compression. The former mainly includes manual design and automatic model search to obtain an initially lightweight neural network. The latter tailors a bulky neural network into a lightweight one and mainly includes four approaches: model pruning, low-rank decomposition, weight quantization, and knowledge distillation. Hardware-related techniques mainly refer to the accelerated optimization of convolution operations and the underlying framework-level support when lightweight CNN models are deployed on mobile or embedded devices, such as TensorFlow Lite and TensorRT, as well as related techniques that guide hardware design. The number of references in each subcategory is shown in Fig. 4.

Fig. 3 Taxonomy of lightweight CNNs methods

Fig. 4 The number of references within each lightweight strategy and the related supports

3 Methods for Lightweight Convolutional Neural Networks

3.1 Architecture Design

Exploring the sparse hierarchical structure of CNNs without significantly reducing the accuracy of the network is the purpose of designing lightweight CNNs. There are two ways toward this goal.

3.1.1 Manual Design

Manual design reduces the number of parameters and the computational complexity by introducing specific convolution paradigms, such as group convolution, separable convolution, and dilated convolution. The Inception series, starting from GoogLeNet, has made continuous progress in improving accuracy while decreasing the computational complexity of the network [25,26,27,28]. In particular, in [28], the decoupling of 3D convolution kernels into a separable 2D paradigm along the channel direction, i.e., depthwise convolution (DWConv) or single intra-channel convolution [29, 30], has had a far-reaching impact on subsequent lightweight schemes.
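To make the depthwise separable paradigm concrete, the following minimal PyTorch sketch stacks a depthwise \(3\times 3\) convolution and a \(1\times 1\) pointwise convolution; the module name and channel sizes are illustrative and not taken from any cited model:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: one per-channel 3x3 filter (depthwise)
    followed by a 1x1 pointwise convolution that mixes channel information."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch applies exactly one 3x3 filter per input channel (DWConv)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 pointwise convolution recombines channels (PWConv)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Weight count for 128 -> 256 channels with a 3x3 kernel:
#   standard convolution:      3*3*128*256 = 294,912
#   depthwise separable:       3*3*128 + 1*1*128*256 = 33,920  (~8.7x fewer)
```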

Via sparse structure design, a large number of impressive works have emerged. Forrest Iandola et al. [31] proposed SqueezeNet, which introduces a sparse convolution module called Fire (Fig. 5a). The Fire module consists of two stages, squeeze and expand: the former uses \(1\times 1\) convolution filters [32] to reduce the number of feature channels, while the latter combines \(1\times 1\) and \(3\times 3\) convolution filters to support multiple resolutions. SqueezeNet has 50 times fewer parameters than AlexNet yet achieves comparable accuracy on ImageNet. The improved SqueezeNext [33] proposes a two-stage bottleneck structure to reduce the number of channels (Fig. 5b) and utilizes separable convolution to further reduce the parameters; the authors also used hardware simulation to determine the best baseline design. The MobileNet series [34, 35] is designed for mobile or embedded devices. MobileNet [34] makes full use of the depthwise separable convolution (DSConv), which applies a depthwise convolution to filter each input channel and a \(1\times 1\) pointwise convolution (PWConv) to combine the outputs of the depthwise convolution (Fig. 6), and controls the network size with hyperparameters. MobileNetV2 [35] adopts the bottleneck module of ResNet [3], combining depthwise separable convolution with residual connections. The authors use a linear transformation in the bottleneck to preserve complete information and place shortcuts directly between the bottlenecks (Fig. 7), which enables the module to perform inference with higher memory efficiency than standard modules in various neural architectures. ShuffleNet [36], as shown in Fig. 8, is an improvement on MobileNet and inherits the group convolutions of AlexNet to trade off representation ability against computational cost. In addition, a shuffle operation is introduced to facilitate information exchange across multiple group convolution layers. In ShuffleNetV2 [37], the authors observed that the numerous \(1\times 1\) group convolutions and shuffle operations actually increase the frequency of memory accesses; to solve this problem, channel splitting is employed instead of group-wise convolutions. In the work of [38], the authors pointed out that the inverted residual block induces information loss and gradient confusion; they therefore add depthwise convolutions at both ends of the residual path (Fig. 9), which extracts richer features. [39] exploited the correlation along the depth direction in DSConv and proposed the blueprint separable convolution (BSConv) (Fig. 10), which consistently improves on DSConv-based models without introducing any further complexity. ChannelNets [40] regards the fully connected pattern between channels as the main cause of excessive computational consumption, so three channel-wise convolution operations are proposed, which significantly reduce the number of parameters and the computational complexity without accuracy loss. The aforementioned strategies have also been incorporated into 3D CNNs [41] for video applications, which typically require higher computational resources.
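The channel shuffle operation of ShuffleNet [36] can be written in a few lines; the sketch below follows the commonly published reshape-transpose-flatten formulation and is purely illustrative:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: interleave channels across groups so
    that information can flow between otherwise isolated group convolutions."""
    n, c, h, w = x.size()
    assert c % groups == 0
    # (n, groups, c // groups, h, w) -> swap the group and channel axes
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: 8 channels in 2 groups; channel order 0..7 becomes 0,4,1,5,2,6,3,7
x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
```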

Fig. 5 The Fire module and SqueezeNext block. Here, M and N represent the number of channels, while W and H represent the size of the feature map

Fig. 6 Standard convolution and depthwise separable convolution

Fig. 7 The inverted residual block. This block expands a compressed input, filters it with a DWConv, and then projects the features back to a lower-dimensional representation using a linear convolution

Fig. 8 Channel shuffle with two stacked group convolutions. GConv indicates group convolution. The information between the layers GConv1 and GConv2 is fully communicated through channel shuffle

Fig. 9 The sandglass block. This block reverses the inverted residual block between bottlenecks and adds DWConvs (i.e., separated blocks) at both ends of the residual path, both of which are crucial for performance improvement

Fig. 10 The blueprint separable convolution. BSConv exploits correlations between CNN filter kernels along the depth dimension. It represents each filter using a single 2D blueprint kernel distributed across depth via a weight vector

Other lightweight CNNs concentrate on specific visual tasks [42,43,44,45,46] or dedicated hardware applications [47, 48], such as object detection [42], segmentation [43], and recognition [45, 49]. In [50, 51] (Fig. 11), dilated convolution is used to enlarge the receptive field without increasing the computation, memory, or power consumption, which dramatically benefits the semantic segmentation of high-resolution images. Techniques that leverage visualization to provide design insights are also meaningful. In [52], visualization revealed that redundant feature maps can be important for effective CNNs. This insight inspired the Ghost module (Fig. 12), which generates such redundant feature maps through cheap linear transformations to expose intrinsic information while remaining compatible with existing CNNs. The authors also put forward C-GhostNet and G-GhostNet for CPU-like and GPU-like devices, respectively, in their subsequent work [47]. In VGNetG [53], visual analysis motivated substituting edge operators for learnable operations in the lower layers, resulting in a parameter-efficient CNN architecture. More recently, works such as MobileOne [54] and FalconNet [55] have brought the reparameterization technique [56] into module design: linear branches present during training are re-parameterized into simpler blocks for inference. Concretely, the MobileOne block (Fig. 13) introduces over-parameterized branches to enhance representation capacity during training, which are then reparameterized into a slimmed-down form for inference, yielding improvements in both accuracy and latency.
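As an illustration of the Ghost idea discussed above, the sketch below is a simplified module inspired by [52]; the channel bookkeeping and cheap operations of the official GhostNet implementation differ in detail:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Simplified Ghost module: a small primary convolution produces a few
    intrinsic feature maps, and cheap depthwise operations generate the
    remaining 'ghost' maps, which are concatenated with the intrinsic ones."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_size=3):
        super().__init__()
        assert out_ch % ratio == 0, "out_ch must be divisible by ratio"
        init_ch = out_ch // ratio            # intrinsic feature maps
        ghost_ch = out_ch - init_ch          # maps produced by cheap operations
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(          # depthwise conv as the cheap operation
            nn.Conv2d(init_ch, ghost_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                  # intrinsic maps
        return torch.cat([y, self.cheap(y)], dim=1)
```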

Fig. 11 The block diagram of the efficient spatial pyramid (ESP) module. It consists of a pointwise convolution followed by a spatial pyramid convolution. The former reduces the computation while the latter enlarges the receptive field and removes gridding artifacts through hierarchical feature fusion (HFF)

Fig. 12 The Ghost module. It consists of a lightweight "ghost" branch and a heavier "feature" branch in parallel. The former generates feature maps via a series of cheap operations while the latter generates more complex ones. The output is obtained via a concatenation operator, balancing representational power and computational efficiency

Fig. 13 The MobileOne block. It has two different structures at train time and test time. Left: train-time MobileOne block with reparameterizable branches. Right: MobileOne block at inference, where the branches are reparameterized. Either ReLU or SEReLU is used as activation. The trivial over-parameterization factor k is a hyperparameter which is tuned for every variant

In [55], the authors abstracted the Meta Light Block (Fig. 14) from different lightweight modules. They introduced the Reparameterized Spatial Operator (RepSO) and the Reparameterized factorized Channel Operator (RefCO) to increase the sparsity of the spatial and channel dimensions, respectively. Both strategies leverage structural reparameterization to convert the diverse connections employed during training into equivalent inference units.
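The basic primitive behind such structural reparameterization is that linear train-time components can be folded into a single convolution for inference. The sketch below shows only Conv-BN folding, one step of the full branch merging used in MobileOne- and RepSO-style blocks, under the usual assumption that BatchNorm runs in inference mode:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution, producing a
    single convolution that is mathematically equivalent at inference time."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma/sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = torch.zeros_like(bn.running_mean) if conv.bias is None else conv.bias
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check on random data (BN must be in inference mode for equivalence);
# parallel branches (3x3, 1x1, identity) can likewise be merged by padding
# their fused kernels to a common size and summing weights and biases.
conv, bn = nn.Conv2d(8, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```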

Fig. 14 Meta Light Block with RepSO. The block consists of two \(1 \times 1\) Conv layers (with an expansion ratio \(\lambda\)) and a single spatial operator layer in between. Left: train-time block with reparameterizable branches. Right: inference-time block where the branches are reparameterized through RepSO

We summarize these impressive lightweight works in Table 2. It is clear that, besides depthwise and pointwise convolutions, residual connections, the rectified linear unit (ReLU) [57] or its variants (e.g., ReLU6 and PReLU) [35, 58], and linear operations are the most commonly involved operations. Furthermore, Squeeze-and-Excitation (SE) modules [59] (Fig. 15) are often inserted into the blocks as an attention mechanism to elevate the perception ability of depthwise convolution. In practice, depthwise convolution has low arithmetic intensity (Fig. 16), making it less efficient than expected [28, 36]. Other works circumvent this problem with different designs to achieve lightweight goals. ShiftNet [60] proposes a shift operation (Fig. 17) that requires no extra floating-point operations (FLOPs) or parameters and is readily implemented in current environments. The DiCE unit [61] utilizes dimension-wise convolutions and fusion, applying lightweight convolutions across input dimensions and fusing the dimension-wise representations compactly (Fig. 18). PeleeNet [44] utilizes conventional convolutions to achieve efficient and real-time object detection.
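A Squeeze-and-Excitation block of the kind inserted into these designs can be sketched as follows; the reduction ratio is an illustrative choice (common values are 4 or 16):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation block: global average pooling ("squeeze") followed
    by a two-layer bottleneck that outputs per-channel gates ("excitation")."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())                              # per-channel weights in (0, 1)

    def forward(self, x):
        return x * self.fc(self.pool(x))               # reweight the input channels
```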

Fig. 15 Illustration of the SE module. It involves a two-step process: squeezing global information by reducing spatial dimensions, and then excitation by learning channel-wise weights to amplify important features

Fig. 16 Throughput variation with the number of MACs per output for different types of NN operations [62]

Fig. 17 Shift operation. The operation shifts the pixel values of the input feature map by a specified distance, altering their spatial positions. In the given example with a 3\(\times\)3 shift matrix, the illuminated cell denotes 1 at that position and the white cells represent 0

Fig. 18 DiCE unit. The unit efficiently encodes the spatial and channel-wise information in the input tensor X using dimension-wise convolutions (DimConv) and dimension-wise fusion (DimFuse) to produce an output tensor Y. In practice, these three-dimensional kernels are executed simultaneously

Table 2 The comparison of lightweight modular operations

Overall, the careful design of blocks aims to improve the overall performance of the network and also benefits the definition of the search space for automated machine learning (AutoML) design.

3.1.2 AutoML Design

Manual design has an apparent drawback: it relies heavily on the knowledge and experience of specialists to compromise among various factors such as accuracy, efficiency, and computational consumption. As a result, a sub-optimal scheme tends to be obtained instead of an optimal one. Automatic methods typically utilize search algorithms to create an optimal network model while minimizing the need for human labor.

The automatic search approach primarily involves three aspects: the search space, the search strategy, and the evaluation strategy. To achieve a lightweight design for deep networks, focusing the limited search space on lightweight structures is crucial, and the granularity of the search space falls into three levels: cell-level, stage-level, and layer-level.

  1. Cell-level search explores network design in a one-shot manner: it searches only cell-level operations and repeats them across all network layers. Some early works [65,66,67] demonstrate that this strategy can deliver competitive accuracy with low FLOPs. However, the homogeneity across layers significantly reduces accuracy and increases the latency of the neural network. As ShuffleNetV2 [37] pointed out, cell-level operations are fragmented, resulting in frequent memory accesses, and may not be hardware-friendly. In addition, this strategy merely uses FLOPs as an approximate proxy for latency, which does not meet real-time requirements.

  2. Stage-level search builds a hierarchical search space with different blocks distributed over different stages of a network. MnasNet [68] searches over architectures composed of sequentially connected stages, each containing a variable number of repeated identical layers, and optimizes directly on the target mobile devices with the measured latency taken into account (a minimal sketch of such a latency-aware search objective is given after this list). Subsequently, MobileNetV3 [69] adopted a search algorithm similar to MnasNet, namely a hardware-aware neural architecture search (NAS) complemented by NetAdapt [70]. [71] is also a stage-level search work that considers GPU attributes in its search algorithm. Yet because the blocks within a stage are identical, there is still room for further performance gains.

  3. Compared with stage-level search, layer-level search takes the hierarchy of the search space a step further. [72] applied a modular search space over the intermediate layers of a predefined supernet, the key being to match the search granularity to the supernet layer level. Nine candidate modules are given in each search space to generate the FBNet series of networks, and latency and accuracy are treated as the contribution of each candidate to the searched architecture. In [73], the authors argued that the search space of the aforementioned modular search algorithms is limited and proposed a memory-efficient algorithm that greatly expands the search space along the spatial and channel dimensions within a layer.
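As a concrete illustration of the latency-aware objective used by stage-level searches such as MnasNet [68], the sketch below computes a soft multi-objective reward; the exponent and the example latencies are illustrative assumptions rather than values reported by the cited works:

```python
def latency_aware_reward(accuracy: float, latency_ms: float,
                         target_ms: float, w: float = -0.07) -> float:
    """Multi-objective reward in the spirit of hardware-aware NAS: accuracy is
    traded against measured on-device latency via a soft constraint. The
    exponent w < 0 penalizes architectures slower than the latency target."""
    return accuracy * (latency_ms / target_ms) ** w

# Two hypothetical candidates with the same accuracy but different latency:
print(latency_aware_reward(0.75, latency_ms=80.0, target_ms=100.0))   # rewarded
print(latency_aware_reward(0.75, latency_ms=130.0, target_ms=100.0))  # penalized
```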

Furthermore, there are approaches to automatically searching for lightweight CNNs that take finer-grained attributes of the search space into account [71, 74,75,76,77,78,79]. [74] built ChamNet, which considers the traits of the hardware platform and adjusts computing resources in the search algorithm to fit latency and energy constraints. The first mobile GPU-aware (MoGA) NAS was proposed by Chu et al. [71] for mobile applications. [75, 77] also take platform traits into account in the search space. The convolution kernel size dramatically affects network performance; MixNets [76] therefore proposed the mixed depthwise convolution (MixConv), which mixes multiple kernel sizes in a single convolution and integrates them into the AutoML search space to obtain better performance than previous mobile lightweight models. In addition to structural search, [78] expands the search space by introducing previously neglected hyperparameters, which makes it more flexible for obtaining lightweight networks. [79] argues that the task type should also be considered in the search space.

Table 3 The comparison of lightweight models on ImageNet

3.1.3 Analysis and Summary

We summarize the manually and automatically designed lightweight CNNs in Table 3. For comparability, all these works are evaluated on ImageNet and measured by the number of parameters and FLOPs, with top-1 accuracy as the main concern. However, these models focus on image classification on ImageNet and are taken for granted as appropriate for other tasks. [50, 79] adopt manual design and automatic search, respectively, while simultaneously taking specific task categories into account; for high-level tasks such as object detection, semantic segmentation, and fine-grained face recognition, such application properties are crucial for model design [42,43,44,45,46, 79]. Moreover, platform attributes [47, 48, 71, 72, 77] are another critical factor. Many works report their inference latency on hardware platforms, as shown in Table 3, column 6, but an objective assessment of the performance of these models is difficult: the most popular indicators, FLOPs and parameter counts, do not entirely reflect actual model efficiency [62]. In addition, most automatic methods rely on structural search with fixed hyperparameters [78], and the whole network still requires manual optimization [69].

3.2 Model Compression

Model compression explores the inherent over-parameterization and structural redundancy of a network and removes them, thus obtaining a lightweight form. According to the processing perspective, model compression is divided into network pruning, low-rank decomposition, low-bit quantization, and knowledge distillation.

3.2.1 Network Pruning

Network pruning eliminates the non-critical redundancy in a pre-trained model without significant performance impairment.

  1. Network pruning granularity

According to the redundant paradigm of the pre-trained model, pruning techniques can be classified, from fine to coarse granularity, into fine-grained (unstructured) pruning, strip-wise/group-wise pruning, pattern-based pruning, connectivity pruning, and structured (coarse-grained) pruning, as shown in Fig. 19.

  • Fine-grained pruning operates at the granularity of a single neuron or connection. As the sparsity of the model usually needs to be determined layer by layer, the practical compression is inconsistent with the theoretical one [82]. In addition, this kind of pruning requires customized hardware support [83].

  • Strip-wise/group-wise pruning prunes along vectors of the input/output dimensions of the parameter tensor. The pruned results are irregular, which also requires a customized hardware support system [84, 85].

  • Pattern-based pruning is executed on one layer or the whole of the model according to a set of fixed patterns [86, 87].

  • Connectivity pruning sets the weight values of the pruned filters to 0, thus cutting off the connection between the input and output of certain channels, which enables the model to be accelerated with existing rather than customized hardware [88].

  • Structured pruning removes unimportant convolution kernels in a set of channels and directly changes the width of the model. This approach can be accelerated directly using off-the-shelf machine learning libraries [89].

Fig. 19 Pruning granularity

  2. Pruning methods

Pruning can start from either the model structure or the training process. In terms of structure, the weights, the activation functions, the gradients in backpropagation, and the batch normalization parameters can all be tailored. From the perspective of model training, pruning approaches include reconstruction-error-based training and regularization-based training.

Li et al. [89] adopted the L1 norm to rank and prune the weight filters of the current layer and then removed the feature maps connected to these filters together with the corresponding weight channels of the next layer. Preserving features with large-norm coefficients follows the smaller-norm-less-informative hypothesis [90], which holds only conditionally. He et al. [91] consider the mutual relations between filters and prune, based on the geometric median, the most redundant filters rather than the relatively less important ones; this still works efficiently even when the norm-based criterion fails. Luo et al. [92] used reconstruction error to minimize the output deviation of the next layer and pruned the current layer according to statistical information from its next layer. DepGraph [93] explicitly models the dependencies between layers by automatically grouping tightly coupled parameters, enabling efficient generalization to diverse neural architectures without tedious case-by-case analysis.
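A minimal sketch of the L1-norm criterion of Li et al. [89] is given below; it only computes the keep-mask, while rebuilding the layers and fine-tuning (as discussed above) are omitted:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def l1_filter_pruning_mask(conv: nn.Conv2d, prune_ratio: float) -> torch.Tensor:
    """Rank the filters of a convolution by their L1 norm and mark the smallest
    fraction for removal (structured, channel-level pruning). A real pipeline
    would then rebuild this layer, drop the matching input channels of the next
    layer, and fine-tune to recover accuracy."""
    # L1 norm of each output filter: sum over input channels and kernel window
    norms = conv.weight.abs().sum(dim=(1, 2, 3))
    n_prune = int(prune_ratio * conv.out_channels)
    threshold = norms.sort().values[n_prune]          # keep filters above this norm
    return norms >= threshold                         # boolean keep-mask per filter

conv = nn.Conv2d(64, 128, kernel_size=3)
keep = l1_filter_pruning_mask(conv, prune_ratio=0.5)
print(int(keep.sum()), "of", conv.out_channels, "filters kept")
```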

Most of the above works adopt structured pruning and require fine-tuning to remedy the degradation in accuracy; this process is tedious, and the generated sub-models are usually sub-optimal. Therefore, [94] adopted a continuous control strategy over the compression ratio and designed a reinforcement learning reward mechanism for different scenarios. [95] proposed adaptive batch normalization (BN), whose parameters are updated over a few batches instead of full fine-tuning. The absence of standardized benchmarks and metrics has long confused researchers [96, 97]. [98] resorted to random search over channel configurations when pruning, which could serve as a baseline to properly evaluate different pruning methods.

3.2.2 Low-Rank Decomposition

Tensors are the fundamental components of CNNs. Low-rank decomposition attempts to reduce the hidden redundancy in tensors, thus decreasing the complexity of convolutional/fully connected layers in CNNs and speeding up model inference. One of the most commonly used low-rank decompositions is the singular value decomposition (SVD). Let \(A\in R^{m\times n}\) be a matrix (a second-order tensor); it can be rewritten as

$$\begin{aligned} A=USV^{T}, \end{aligned}$$
(1)

where \(U\in R^{m\times r}\) and \(V^T\in R^{r\times n}\) are orthogonal to each other, and \(S\in R^{r\times r}\) is a diagonal matrix containing the singular values of the original matrix A. If a rank \(k<<r\) can replace r, the decomposition complexity is reduced from O(mn) to \(O(k(m+n+1))\), and the original tensor is thereby compressed.
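As a concrete example of Eq. (1), the following sketch factorizes the weight matrix of a fully connected layer into two thinner layers via a truncated SVD; convolution kernels would first need to be reshaped into matrices, and the rank used here is an illustrative choice:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an m x n fully connected layer with two thinner layers
    (n -> k -> m) obtained from a rank-k truncated SVD of its weight matrix,
    reducing the weight count from m*n to k*(m + n)."""
    W = layer.weight                                   # shape (m, n)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(W.shape[1], rank, bias=False)    # projects onto top-k components
    second = nn.Linear(rank, W.shape[0], bias=True)
    first.weight.copy_(torch.diag(S[:rank]) @ Vh[:rank])   # S_k V_k^T
    second.weight.copy_(U[:, :rank])                        # U_k
    if layer.bias is not None:
        second.bias.copy_(layer.bias)
    else:
        second.bias.zero_()
    return nn.Sequential(first, second)

fc = nn.Linear(512, 256)
approx = svd_compress_linear(fc, rank=32)   # 512*256 = 131,072 -> 32*(512+256) = 24,576 weights
```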

In [99, 100], the authors decomposed the \(w\times h\) convolution kernel into \(w\times 1\) and \(1\times h\) kernels. In [101], Tucker decomposition is used to decompose the model weights, and the compressed model is then deployed on a mobile phone for experiments. Chen et al. [102] proposed a decomposition method combining Tucker and canonical polyadic (CP) decomposition and improved the efficiency of parameter utilization in the network.

The higher the tensor order, the better the decomposition results may be, but the space complexity increases. Therefore, [103] removed the high-order kernel tensor to avoid this deficiency, but the method ignores the key modeling role of the convolution layers and works effectively only for the fully connected layers. In [104], the tensor-train method is extended to the convolution layers; since a convolution kernel is already a compact fourth-order tensor compared with a fully connected layer, the achievable compression is very limited, and the subsequently extracted features may not be ideal. The DSConvs prevalent in lightweight CNNs make the situation even more complex because of their even more compact tensor shapes. [105] proposes an approach that integrates low-rank tensor decomposition with sparse pruning, fully leveraging both coarse and fine structures, which allows efficient compression of architectures that utilize DSConvs.

In fact, low-rank decomposition is similar to designing a lightweight compact structure; the difference is that the latter searches for a base model with a compact topology, while the former compresses a given base neural network model. At present, low-rank decomposition is more mature than other methods, and most works require layer-by-layer decomposition and compression. Although [106] considers the global optimization of all layers during compression to avoid trivial layer-wise decomposition issues, it still requires additional retraining. Given its high cost and limited compression capacity for the convolution layers, this approach has not attracted as much attention as might be expected.

3.2.3 Low-Bit Quantization

Low-bit quantization analyzes the numerical representation of a model and then maps the weights, activations, and even gradients of the network to a set of fixed values, compressing the numerical representation and improving inference efficiency. Without significantly impairing the accuracy of the network, the original 32-bit single-precision (or 16-bit half-precision) floating-point numbers are projected to a lower-bit representation, such as 8-bit, 4-bit, 2-bit, or even 1-bit. Low-bit representation is one of the most common model compression technologies in industry: it not only alleviates the amount of data transferred in hardware but also reduces multiply-and-accumulate (MAC) operations and energy consumption.

Let x be the input and \(x^{q}\) its quantized value:

$$\begin{aligned} x^{q}=clip(R(\frac{x}{s}+Z),\alpha ,\beta ), \end{aligned}$$
(2)

Equation (2) denotes that the real number x is mapped onto the quantization grid and clipped to the integer range \([\alpha ,\beta ]\), where \(R(\cdot )\) is a rounding function, s is the scaling (quantization) factor, Z is the zero point, and b in Eq. (3) is the quantization bit width.

$$\begin{aligned} s=\frac{\beta -\alpha }{2^b-1}, \end{aligned}$$
(3)

Equation (2) depicts an affine quantization; when \(Z=0\), it reduces to the so-called scale quantization. Through the inverse operation of Eq. (2), the quantized parameter is reconstructed as a "real value", but the recovered value is usually only an approximation of the original rather than equal to it.
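To make Eqs. (2) and (3) concrete, the following sketch of a hypothetical helper quantizes a tensor to b bits and reconstructs the approximate real values; the clipping range follows the common min-max convention, which may differ in detail from the notation above:

```python
import torch

def affine_quantize(x: torch.Tensor, num_bits: int = 8, symmetric: bool = False):
    """Uniform quantization in the spirit of Eqs. (2)-(3): choose a scale s (and
    zero point Z) from the tensor range, round, and clip to the integer grid."""
    if symmetric:                       # scale quantization, Z = 0
        alpha, beta = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        s = x.abs().max() / beta
        zero_point = 0
    else:                               # asymmetric (affine) quantization
        alpha, beta = 0, 2 ** num_bits - 1
        s = (x.max() - x.min()) / (beta - alpha)
        zero_point = torch.round(-x.min() / s)
    x_q = torch.clamp(torch.round(x / s + zero_point), alpha, beta)
    x_hat = (x_q - zero_point) * s      # "dequantized" value, approximates x
    return x_q, x_hat, s

x = torch.randn(4, 4)
x_q, x_hat, s = affine_quantize(x, num_bits=8)
print("max reconstruction error:", (x - x_hat).abs().max().item())
```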

  1. Quantization regimes

The quantization mechanism, represented by R in Eq. (2), mainly includes uniform and nonuniform quantization, as shown in Fig. 20. Nonuniform quantization assigns inputs to different step sizes by using a variable quantization factor s (Eq. 3) [107, 108]; the result matches real distributions better but tends to incur greater overhead, which is unfriendly to hardware implementation. In practice, binary quantization is also commonly used; in [109], quantization is cast as an optimization problem minimizing the difference between the input tensor and its binary expansion. Uniform quantization is represented by linear quantization, which allows the efficient implementation of fixed-point operations on hardware. For uniform quantization, it is vital to choose the projection range \([\alpha , \beta ]\), which determines the quantization factor s. Quantization can be symmetric or asymmetric: the former is easier to implement, but an imbalanced actual range may lead to a sub-optimal solution.

Fig. 20 Two forms of discrete distribution for quantization. By adjusting s, the distribution of quantization levels is changed between two modes: (1) uniform quantization with the same step sizes, and (2) nonuniform quantization with different step sizes

  2. Quantization granularity

CNNs have a hierarchical structure, and different components contribute unevenly to the model, so different quantization levels correspond to different granularities. In layer-wise quantization, all elements of a layer share the same quantization parameters [110]. However, consecutive layers interact with each other, so module-based quantization has been proposed to alleviate this problem [111]. Per-channel quantization [112] works on intermediate dimensions of the tensor. One can go further to per-row/per-column quantization, but this may impose a huge burden on inference time, so channel-level granularity is more commonly adopted [113].

  3. Performance recovery methods

Compressing a huge network into a compact one usually incurs accuracy degradation, so a performance-recovery strategy must be involved. The two most widely used methods are post-training quantization (PTQ) and quantization-aware training (QAT) [114].

The PTQ method is popular in the community and industry since it can restore the performance of the model without requiring the original training pipeline. The method converts the pre-trained network into a fixed-point network and then adjusts it offline to restore accuracy [111, 112, 115,116,117,118,119]. However, the commonly used rounding-to-nearest mechanism ignores the contribution of the non-diagonal elements of the Hessian matrix, resulting in a sub-optimal solution. Therefore, an improved per-layer rounding mechanism was proposed [115], and remarkable results have been achieved by limiting the weight precision of networks such as ResNet18 and ResNet50 to 4 bits while keeping the accuracy reduction within 1\(\%\). [111] modeled the inter-layer correlation ignored by AdaRound [115] to restrict the weight bit width of the quantized networks to 2 bits. In [119], the authors pointed out that the quantization order of weights and activations is crucial to network performance; by randomly dropping the quantization of activations during PTQ, 2-bit activations under PTQ were realized for the first time.

PTQ has many merits, such as high efficiency and speed in applications. However, it is not as effective at approaching full-precision accuracy as expected. QAT, on the other hand, performs better at lower bit widths, such as 4-bit representations, although it requires more training cost and data support. Many works have been published on this method [120,121,122,123,124,125]. Since QAT introduces quantization during training, the gradient updates may accumulate quantization errors, especially in low-precision representations; therefore, the straight-through estimator (STE) is required [123, 126]. In [121, 122], only the weights are binarized, and the gradients are still kept in full precision, which imposes a heavy computing cost on backpropagation. In [123], both weights and activations are deterministically quantized to low bit widths, while the gradients are stochastically quantized; thus, CNN training is accelerated as well as inference. PACT [127] learns the clipping ranges of activations during training rather than using the fixed ranges of [123]. [128] proposed quantization interval learning (QIL), which quantizes the network weights and activations to obtain the optimal quantizer. In [129], the quantization step is treated as a training parameter to better adapt to the quantization distribution, while its improved version [130] can learn to accommodate negative activations with asymmetric quantization. [131] proposes the CSQ (centered symmetric quantization) quantizer for extremely low-bit quantization (\(\le\) 3-bit), which is trainable using QAT methods and shows that a simple change of quantization levels can result in significant performance improvement.
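A minimal sketch of the fake-quantization idea behind QAT is shown below; the scale is taken from the running tensor maximum for simplicity, whereas practical methods calibrate or learn the clipping range (e.g., PACT [127], LSQ [129]):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantization-aware training primitive: quantize-dequantize in the forward
    pass, but pass gradients straight through the non-differentiable rounding
    (the straight-through estimator)."""
    @staticmethod
    def forward(ctx, x, num_bits: int = 8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None        # STE: identity gradient w.r.t. x

# Weights stay in full precision; the fake-quantized copy is used in the forward pass.
w = torch.randn(16, requires_grad=True)
loss = (FakeQuantSTE.apply(w, 4) ** 2).sum()
loss.backward()
print(w.grad.shape)
```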

Although current quantization methods have made great theoretical progress, many limitations remain. On the one hand, most unified quantization methods tend to induce accuracy degradation because the layers of a hierarchical network contribute unevenly. Mixed-precision quantization is designed to cope with this problem, but its cost is prohibitive since it requires traversing the mixed quantization space to solve the optimization problem [132, 133]. On the other hand, most quantization methods are evaluated on redundant baseline models and lack tests on lightweight models, let alone on real mobile devices. Quantization is closely tied to the underlying hardware support; customized hardware may suit it better but comes at a higher cost, as does hardware-aware quantization [134]. Therefore, more baseline and hardware testing would benefit the design of quantization methods.

3.2.4 Knowledge Distillation

Knowledge distillation (KD) [135, 136] acquires and transfers knowledge from an original large teacher model to a lightweight student model, as shown in Fig. 21. The "softmax" output of the teacher model is considered to contain information that is discarded by the one-hot form [136], in which the highest probability is taken as the correct output and all others as wrong. These "incorrect" outputs actually carry information vital to the learning process. Therefore, a "temperature" strategy is proposed to control the softness of the output probability distribution, as depicted in Eq. (4).

$$\begin{aligned} p_{i} = \frac{exp(z_{i}/T)}{{\textstyle \sum _{j}^{}exp(z_{j}/T)}}, \end{aligned}$$
(4)

where T is the temperature and the logit \(z_{i}\) computed for each category is converted into the probability \(p_{i}\) by comparing it with the other logits. In general, \(T = 1\); increasing T produces a softer probability distribution over the classes.
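The softened softmax of Eq. (4) is typically combined with a hard-label cross-entropy term. The sketch below is a common formulation of logits-based distillation; the temperature T and the weight alpha are illustrative hyperparameters, not values prescribed by [136]:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    """Logits-based KD: a KL term between the temperature-softened teacher and
    student distributions (Eq. 4), blended with cross-entropy on hard labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)        # p_i with temperature T
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```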

Fig. 21 Knowledge distillation framework

Knowledge distillation can be roughly classified into feature-based and logits-based methods according to the distillation location.

  • The knowledge transferred by logits-based methods was initially regarded as the conditional distribution of the output for a given input sample. From this point of view, the predictions or soft targets of the pre-trained teacher model play an important role in guiding the student model [137,138,139,140,141,142,143,144]. [139] can be viewed as a supplement to the principle of KD [136]: it decomposes the gradient caused by KD into a dark-knowledge item and a ground-truth component, and quantifies the contribution of dark knowledge to KD. [138] transfers knowledge through collaborative learning within a group of student models rather than one-way transfer from a predefined teacher model to a student model. [141] claimed that a large model is not necessarily a good teacher for guiding student training and proposed terminating teacher training early to alleviate the mismatch between the teacher and student models. [144] pointed out that a good teacher can train a student well as long as the teacher and student always receive the same inputs, aggressive data augmentation is used, and training is sufficiently long. [142] found that when there is a significant capacity gap between the teacher model and the student model, the performance of the student degrades greatly; to alleviate this, multi-level knowledge extraction with a medium-scale network is introduced to bridge the gap between them.

  • Feature-based methods may outperform logits-based ones at the expense of extra computation and memory for distilling deep features during training [145,146,147,148,149,150,151,152,153,154]. [145] defined the transferred knowledge according to the information flow between layers and obtained it by computing the inner product of features between two layers. [146] added a feature attention mechanism to the network, which helps the student model learn better from the teacher model. [147] proposed aligning the distributions of neuron selectivity patterns between teacher and student models to improve the performance of the student network. Heo et al. [148] found that transferring activation boundaries can greatly improve transfer efficiency and proposed a knowledge transfer method based on the activation boundaries of hidden neurons. [149] put forward a factor-based knowledge transfer method that inherits paraphrased information from the teacher network. [154] pointed out that students should preserve pairwise similarities in their own representation space rather than imitate the teacher's representation space. [153] developed a contrastive learning scheme that drives the teacher and student to project the same input to nearby representations and different inputs to distant ones.

Note that feature-based distillation requires extra hyperparameter tuning to balance the losses of different layers. Although supervised approaches explore the feature representations of intermediate teacher layers, it remains unclear which layers affect the student model and how. Compared with feature-based distillation, logits-based methods have a lower training cost, but their performance compares unfavorably with the former. It is generally agreed that larger models do not necessarily make better teachers, that feature-based distillation outperforms soft-label distillation, and that deeper student models outperform shallower ones. Interestingly, the soft-label distillation in [155], which uses higher-level semantic features, outperforms feature-based distillation. The knowledge distillation field is still in its infancy and requires further theoretical and experimental grounding.

Table 4 Comparison of some learning libraries

4 Underlying Frameworks Support

The approaches reviewed previously are top-down solutions to develop lightweight network models. However, practical applications require support from the underlying framework to truly obtain improvements in accuracy, speed, and energy efficiency. This underlying framework can be categorized into general learning libraries and hardware-based support.

  • General learning libraries [167], such as Caffe [163], MXNet [164], TensorFlow [156], TensorRT [166], and PyTorch [157], can be used for CNN training and inference, as shown in Table 4. Most of these libraries are open source and continuously improved. Among them, PyTorch and TensorFlow are two representative frameworks with strong communities and excellent documentation, and they are also the two most popular frameworks in academia and industry; both are supported by visualization tools (Visdom, TensorBoard) that facilitate development. They are popular for model training via GPU parallel computing but are not widely used for inference on edge devices. Library optimization mainly focuses on pipelining, resource management, and efficient compiler design [168]. TensorFlow Lite (TF-Lite) and TensorFlow Micro are two extensions of TensorFlow for edge inference (a minimal conversion sketch follows this list).

  • To accelerate the processing of deep network models in specific applications, numerous specialized hardware platforms have been developed. As discussed earlier, a reduction in the number of network parameters and multiply-and-accumulate (MAC) operations may not bring the expected performance improvement or energy savings. The core of hardware development lies in appropriate throughput and energy efficiency. MAC operations are easily parallelized and can be optimized in both spatial and temporal architectures [169]. Temporal architectures are usually implemented on CPUs or GPUs and enhance parallelism via SIMD or SIMT to increase throughput [169, 170]. Representative embedded GPUs include the NVIDIA Jetson TX2 [171] and the Intel Edison Kit [172]. In addition, data transmission contributes heavily to energy consumption. Spatial architectures are usually designed and implemented on application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs); these designs enhance data reuse and reduce data transmission as much as possible to ensure energy efficiency. Typical ASIC works include EIE [83] and Eyeriss [173, 174]. EIE is specially designed for pruned lightweight networks. In Eyeriss [173], the authors adopted two methods to improve energy efficiency: one reduces transmission through data reuse, and the other avoids unnecessary reads and computation via data statistics [174]. The main difference between the two lies in their structures, and the latter performs better in lightweight applications. More related works on FPGAs and ASICs are detailed in [16, 168, 169, 175].
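As an example of such edge-oriented tooling, the sketch below converts a small Keras model to TensorFlow Lite with default post-training quantization; the toy model is a placeholder, and the exact converter options depend on the TensorFlow version:

```python
import tensorflow as tf

# A tiny placeholder model; in practice this would be a trained lightweight CNN.
model = tf.keras.Sequential([
    tf.keras.layers.SeparableConv2D(32, 3, activation="relu",
                                    input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert to a TensorFlow Lite flat buffer with default post-training
# quantization, suitable for on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```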

5 Benchmark Datasets

Datasets are the drivers of neural networks as well as the prerequisite for verifying their performance.

CIFAR [176] is a commonly used image classification dataset with an image resolution of 32 \(\times\) 32 pixels. The training and test sets consist of 50 k and 10 k images, respectively. CIFAR-10 comprises 10 mutually exclusive classes, and CIFAR-100 comprises 100 mutually exclusive classes.

ImageNet [20] contains large-scale color images, commonly used at a resolution of 256 \(\times\) 256 pixels, across 1000 non-overlapping categories. It provides about 1.3 M training images, 100,000 test images (100 per class), and 50,000 validation images (50 per class).

The Stanford Dogs dataset [177], built for fine-grained image categorization, contains 20,580 images covering 120 dog breeds from around the world. The training set contains 14,580 images and the validation set 6000 images.

There are other public datasets for different types of tasks (e.g., object detection [178] and semantic segmentation [179]). However, current public datasets cannot satisfy every specific application requirement, and many private datasets held by academia, companies, or individuals remain semi-open or closed.

6 Future Research Directions

Based on the above analysis, current lightweight technologies have limitations for widespread applications to some extent. Future research should focus on the following promising directions:

  • Collaborative design and optimization. As described earlier, lightweight techniques require top-to-bottom cooperation, which involves the specific task in exploring lightweight approaches from all possible perspectives [180]. For example, when designing a lightweight network model, the task complexity and the platform attributes should be considered for tuning. Various compression methods are orthogonal to model design methods [181, 182], i.e., they can cooperate during the design. Progress in lightweight network algorithms also benefits the development of the underlying learning libraries and hardware. It is worth further exploring the combination [54, 55] of emerging techniques such as structural reparameterization with existing lightweight technologies.

  • Establishment of evaluation standards. On the one hand, to objectively depict the properties of a lightweight CNN model, the following metrics should be emphasized: the accuracy of the model on a public dataset, the number of parameters and FLOPs of the model, and the architecture-related parameters. Moreover, the deployment or development of a lightweight network model should pay attention to hardware metrics [62], such as latency and power consumption. On the other hand, the scarcity of high-quality datasets is another important hindrance to lightweight network development. Under current circumstances, one feasible approach is to gradually establish standard datasets for specific applications; another is to adopt alternative strategies that remedy the lack of data, such as transfer learning [183], active learning [184], and incremental learning.

  • Explainability and visualization. Model visualization technology has emerged recently and has significant potential to help researchers understand and improve models, as in the design of GhostNet [47, 52] and VGNet [80]. In the future, more fine-grained downstream vision tasks will still require fully mining and exploiting the redundancy of feature maps via visualization technology.

  • Activation functions and attention modules. Currently, ReLU is favored by many researchers, and future work on activation functions still has room for improvement. In addition, lightweight CNNs with plug-in attention mechanisms are full of promise [185,186,187,188,189]. In particular, lightweight CNNs that leverage the multi-head attention of Transformers to overcome CNNs' limitations in long-range modeling are gaining momentum [190,191,192,193,194].

7 Conclusions

DCNNs have gained remarkable attention in the field of computer vision, and their performance has even surpassed that of humans in many applications; they will continue to drive the development of artificial intelligence algorithms. Although much progress has been made, high computational complexity hinders their portability and performance in various mobile scenarios. Lightweight network technology has therefore naturally emerged to match the growth of mobile intelligence.

In this paper, the vital technologies of lightweight CNNs in recent years are reviewed, including but not limited to manual structural design, automatic architecture search, compression from structure to representation, and knowledge distillation. In addition, we emphasize the significance of datasets and hardware support for lightweight network deployment and outline several promising future trends in this field. We have tried to organize these remarkable works as clearly as possible, thereby providing valuable guidance for researchers engaged in this field.