1 Introduction

Deep convolutional neural networks (CNNs) have been successfully demonstrated on many computer vision tasks. However, CNNs still face many challenges in practical applications. They typically involve millions of parameters and billions of FLOPs of computation, which is problematic because such models consume large amounts of memory and computation, making them impractical for object recognition outside of high-performance computing environments. Existing work addresses the problem of inefficient models through network pruning (He et al. 2017; Li et al. 2019), low-bit quantization (Zhang et al. 2018; Zhou et al. 2016, 2017), decomposition/factorization (Jaderberg et al. 2014; Denton et al. 2014), distillation (Xu et al. 2018; Changyong et al. 2019), or efficient architecture design (Sandler et al. 2018; Zhang et al. 2018). In particular, quantization approaches represent network weights and activations with low bit-width fixed-point integers, enabling computation by efficient bit-wise operations. Binarization (Rastegari et al. 2016; Zhang et al. 2019; Liu et al. 2018) is an extreme quantization approach in which both weights and activations are \(+1\) or \(-1\), represented by a single bit. This paper designs highly compact binary neural networks (BNNs) from both the quantization and the network pruning perspectives.

Despite the progress made in 1-bit quantization and network pruning, few works have combined the two in a unified framework to reinforce each other. It is clearly necessary to introduce pruning techniques into 1-bit CNNs since not all filters and kernels are equally important or worth quantizing in the same way. One potential solution is to prune the network and then conduct 1-bit quantization over the remaining weights to produce a more compressed network. However, this solution fails to consider the difference between the binarized and full precision parameters during pruning. Therefore, a promising alternative is to prune the quantized network. However, designing a unified framework to combine quantization and pruning is still an open question.

Fig. 1

This figure shows the framework integrating the rectified binary convolutional network (RBCN) with generative adversarial network (GAN) learning. The full-precision model provides the “real” feature maps, while the 1-bit model (as the generator) provides the “fake” feature maps, to the discriminators that try to distinguish the “real” from the “fake”. Meanwhile, the generator tries to make the discriminators unable to work well. By repeating this process, both the full-precision feature maps and kernels (across all the convolutional layers) are sufficiently employed to enhance the capacity of the 1-bit model. Note that (1) the full-precision model is used only in learning but not in inference; (2) after training, the full-precision learned filters W are discarded, and only the binarized filters \({\hat{W}}\) and the shared learnable matrices C are kept in RBCN for the calculation of the feature maps in inference

To address these issues, we introduce a rectified binary convolutional network (RBCN) to train a BNN, in which a novel learning architecture is presented in a generative adversarial network (GAN) framework. Our motivation is based on the fact that GANs have a powerful ability to match two data distributions (here, the full precision and 1-bit networks). This can also be viewed as distilling/exploiting the full precision model for the benefit of its 1-bit counterpart. For training RBCN, the main binarization process is illustrated in Fig. 1, where the full-precision model and the 1-bit model (generator) respectively provide “real” and “fake” feature maps to the discriminators. The discriminators try to distinguish the “real” from the “fake”, and the generator tries to make the discriminators unable to work well. The result is a rectified process and a unique architecture with a more precise estimation of the full precision model. Pruning is also explored within the GAN framework to improve the applicability of the 1-bit models in practical scenarios. The main contributions of this paper are summarized as follows:

  • We introduce a novel BNN learning architecture, the Rectified Binary Convolutional Network (RBCN), to rectify the binarization process based on the full precision kernels and feature maps in a unified framework.

  • After binarization, we perform network pruning to further compress the deep models in a GAN framework.

  • We perform a comprehensive experiment on image classification, object tracking, face recognition, and person re-identification. Our models achieve state-of-the-art performance compared with other 1-bit CNNs on ImageNet, CIFAR, and LFW datasets.

This paper builds upon our previous conference paper (Liu et al. 2019) and significantly extends it by: (1) improving our method with pruned 1-bit CNNs learned using GANs, which remains largely unexplored in the existing literature, and (2) conducting extensive experiments on the tasks of classification, tracking, face recognition, person re-identification, and pruning. All of these contributions strongly support the effectiveness of our models.

2 Related Work

In this paper, we exploit Generative Adversarial Networks (GANs) to facilitate the training process of binarization and pruning, using GANs’ ability to match feature map distributions. In the next two subsections, we briefly review GANs and network compression methods.

2.1 Generative Adversarial Networks (GANs)

GANs (Goodfellow et al. 2014) are one of the most successful and widely used methods for training CNNs. The main idea of a GAN is to train two competing networks, the generator G(x, z) and the discriminator D(x), based on game-theoretic principles. The goal of a GAN is to train the generator G to match a data distribution x by transforming a noise vector z. The discriminator D is trained to distinguish the ‘fake’ samples G(x, z) generated by G from the ‘real’ samples x. The GAN training process is formulated as follows

$$\begin{aligned} \arg \min \limits _{G}\max \limits _{D}V(D,G)=E[log(D(x))]+E[log(1-D(G(x,z)))], \end{aligned}$$
(1)

where E denotes the expectation and V is the loss function.
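To make the minimax game in Eq. 1 concrete, the following is a minimal PyTorch-style sketch of the two alternating updates, assuming generic `generator` and `discriminator` modules whose discriminator outputs probabilities in (0, 1); it illustrates the standard GAN objective rather than the training code used in this paper.

```python
import torch

def gan_step(generator, discriminator, real_x, noise_z, opt_g, opt_d):
    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    fake_x = generator(noise_z).detach()          # stop gradients into G
    d_loss = -(torch.log(discriminator(real_x)).mean()
               + torch.log(1.0 - discriminator(fake_x)).mean())
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator by maximizing log D(G(z)).
    opt_g.zero_grad()
    g_loss = -torch.log(discriminator(generator(noise_z))).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```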

Recently, GANs have been harnessed in various domains, including image generation, compression, and domain transfer. Since GANs were first introduced, researchers have studied them vigorously, proposing many loss functions, regularization and normalization methods, and neural architecture choices (Goodfellow et al. 2014; Salimans et al. 2016; Tschannen et al. 2018; Gulrajani et al. 2017; Odena et al. 2017; Mao et al. 2017; Arjovsky et al. 2017). These works demonstrate the power of GANs and their ability to match two data distributions, which has inspired us to explore GANs to match the feature map distributions of the full precision and 1-bit models.

2.2 Network Compression

This section briefly reviews the primary approaches for deep network compression, dividing them into five categories: quantization, network pruning, decomposition/factorization, distillation, and compact network design.

2.2.1 Quantization

Recently, the superior memory and computational efficiency of quantized networks has stimulated considerable research. Quantization-based methods represent network weights and/or activations with low precision, typically reducing 32-bit values to 1, 2, or 4 bits. This reduces the model storage and computation cost, thereby generating highly compact CNN models. Our paper aims for extreme binary quantization. Among the state-of-the-art binarization methods, XNOR-Net (Rastegari et al. 2016) provides an efficient binarization method by reconstructing the full precision filters with a single scaling factor. Bi-Real Net (Liu et al. 2018) introduces a structure modification that preserves the real activations before the sign function. Besides, by exploiting multiple projections with discrete backward propagation, projection convolutional neural networks (PCNNs) (Gu et al. 2019) learn a set of diverse quantized kernels to reconstruct a full-precision kernel. HORQ (Li et al. 2017) proposes a high-order binarization scheme by recursively performing residual quantization. Recently, ReActNet (Liu et al. 2020) proposes a baseline network by modifying and binarizing a compact real-valued MobileNet with parameter-free shortcuts, bypassing all the intermediate convolutional layers to largely improve the representational capability of BNNs. Our analysis of the prior art suggests that how the full precision information is used is the key issue in optimizing BNNs.
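For intuition, the XNOR-Net style kernel reconstruction can be sketched in a few lines: each full-precision filter W is approximated by αB, with B = sign(W) and a single scaling factor α = mean(|W|). This is a simplified, hedged illustration of the idea, not the authors' implementation.

```python
import torch

def xnor_binarize(weight: torch.Tensor):
    """Approximate a filter bank by sign(W) times a per-filter scaling factor,
    in the spirit of XNOR-Net (Rastegari et al. 2016)."""
    # weight: (out_channels, in_channels, kH, kW)
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)  # one scalar per filter
    binary = torch.sign(weight)                              # {-1, +1} (exact zeros map to 0)
    return alpha * binary, alpha
```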

2.2.2 Network Pruning

Neural network pruning focuses on removing the less-important network connections in an unstructured or structured manner. Hassibi and Stork (1993) propose a saliency-like measurement, determined by the second-order derivative matrix of the loss function, to remove redundant weights. Moreover, in order to remove less-important weights with small absolute values, Han et al. (2015) propose an iterative thresholding method. In addition, Guo et al. (2016) provide a connection splicing method to avoid incorrectly pruning weights. Besides, by minimizing the reconstruction error, Yang et al. (2017) introduce an energy-aware pruning method that prunes the less-important weights layer-by-layer.

By contrast, in the aspect of structured pruning, Li et al. (2016) propose to calculate the \(\ell _{1}\)-norm of filters in a layer-wise manner to remove filters and their corresponding feature maps. Besides, Molchanov et al. (2016) propose a Taylor expansion-based criterion to iteratively prune filters. Unlike the above multi-stage and layer-wise pruning methods, Huang and Wang (2018) propose to prune the network in an end-to-end manner by using a sparse soft mask. Furthermore, the method in Lin et al. (2019) effectively solves the pruning optimization problem using generative adversarial learning to learn a sparse soft mask in a label-free, end-to-end manner. Inspired by this, we perform pruning for 1-bit CNNs within a GAN framework.

2.2.3 Decomposition/Factorization

The main idea of decomposition/factorization is to construct a low-rank basis of filters to reduce the weights effectively. By exploiting the redundancy that exists between different feature channels and filters, Jaderberg et al. (2014) propose an agnostic approach for rank-1 filters in the spatial domain. Moreover, Rigamonti et al. (2013) preserve the performance of CNNs while drastically reducing their cost by learning and using separable filters that approximate the non-separable filters. Also, Lebedev et al. (2014) propose a simple two-step approach for speeding up convolutional layers based on tensor decomposition and discriminative fine-tuning. In addition, Yang et al. (2015) introduce a deep fried convnet to re-parameterize the matrix-vector multiplication of fully connected layers. Denton et al. (2014) present several techniques designed for object recognition tasks to speed up the evaluation of large convolutional networks.

2.2.4 Distillation

Distillation is a knowledge transfer method that provides guidance information through large teacher models to guide the training of small student models. Belagiannis and Farshad (2018) propose a knowledge transfer method based on adversarial learning, which teaches a smaller student network with limited capacity to mimic the larger teacher network to obtain better performance. Furthermore, inspired by the assumption that a small network cannot perfectly mimic a large one due to the large gap in network scale, Changyong et al. (2019) consider effective intermediate supervision under the adversarial training framework to learn the student network. Xu et al. (2018) propose to use conditional adversarial networks to learn the loss function to transfer knowledge from the teacher to the student. Gao et al. (2020) propose residual knowledge distillation (RKD) to remedy the performance degradation caused by the substantial gap between the learning capacities of the student and the teacher network. Specifically, RKD introduces an assistant network to aid the training process by learning the residual error between the student and teacher models. Ahn et al. (2019) propose an information-theoretic framework for knowledge transfer, where the student network effectively learns the target task by minimizing the cross-entropy loss while maintaining high mutual information with the teacher network. By learning the distribution of activations in the teacher network, the mutual information can be maximized.

Like the methods described above, we adopt the idea of distillation to transfer information from a teacher model to a student model. But unlike those methods, we are the first to explore 1-bit network distillation using a GAN framework. In our approach, a full precision network is used as the teacher model, and the 1-bit network is the student model. Through the information transfer, the 1-bit student network mimics the output of the full precision teacher network to obtain better performance. Another difference is that our method adopts multiple discriminators, which fit the distributions between the student network and the full precision network at different stages. This is similar to the method in Ahn et al. (2019), but our learnable discriminators avoid the manual tuning required by the handcrafted module in Ahn et al. (2019).

2.2.5 Efficient Network Design

There has been a rising interest in designing efficient architectures to meet the requirements of both efficiency and accuracy. Recently, some approaches [e.g., GoogLeNet (Szegedy et al. 2015) and SqueezeNet (Iandola et al. 2016)] propose to replace \(3\times 3\) convolutional kernels with \(1\times 1\) kernels to reduce the computational complexity. Besides, Xception (Chollet 2017), MobileNet (Howard et al. 2017), and ShuffleNet (Zhang et al. 2018) employ depth-wise separable convolutions to build lightweight deep neural networks. Meanwhile, Neural Architecture Search (NAS) has attracted a lot of attention due to its considerable performance gains while avoiding handcrafted design. Instead of adopting the conventional proxy-based framework, ProxylessNAS (Cai et al. 2018) introduces a latency loss to search architectures for the target task. Also, EfficientNet (Tan and Le 2019) uniformly scales the dimensions of depth, width, and resolution to obtain more compact networks by employing a simple but highly effective compound coefficient. Furthermore, Binarized Neural Architecture Search (BNAS) (Chen et al. 2020) provides a promising way to obtain efficient binarized network architectures according to performance feedback.

3 Rectified Binary Convolutional Networks (RBCNs)

We design RBCNs with GANs to rectify BNNs in a unified framework. In our RBCNs, multi-cue information from the full precision feature maps and kernels is exploited to improve the performance of BNNs. The rectified convolutional layers are generic and flexible, and can be easily incorporated into existing CNNs such as WideResNets and ResNets. We furthermore prune RBCNs with minimal performance degradation using GANs. To begin, Table 1 summarizes some of the notation used in this paper.

Table 1 A brief description of the variables and operators used in this paper

3.1 Optimized RBCN

This section describes the optimization of RBCN, including the loss function and the update of the learnable parameters.

3.1.1 Loss Function

The rectification process combines the full precision kernels and feature maps to rectify the binarization process. It includes kernel approximation and adversarial learning. This learnable kernel approximation leads to a unique architecture with a precise estimation of the convolutional filters through minimizing a kernel loss. The discriminators \(D(\cdot )\) with filters Y are introduced to distinguish the feature maps R of the full precision model from those T of RBCN. The RBCN generator with filters W and learnable matrices C is trained together with Y using the knowledge from the supervised feature maps R. In summary, W, C and Y are learned by solving the following optimization problem:

$$\begin{aligned} \arg \min \limits _{W,{\hat{W}},C}\max \limits _{Y}{\mathscr {L}}={\mathscr {L}}_{Adv}(W,{\hat{W}},C,Y)+{\mathscr {L}}_{S}(W,{\hat{W}},C)+{\mathscr {L}}_{Kernel}(W,{\hat{W}},C), \end{aligned}$$
(2)

where \({\mathscr {L}}_{Adv}(W,{\hat{W}},C,Y)\) is the adversarial loss as

$$\begin{aligned} {\mathscr {L}}_{Adv}(W,{\hat{W}},C,Y)=log(D(R;Y))+log(1-D(T;Y)), \end{aligned}$$
(3)

where \(D(\cdot )\) consists of a series of basic blocks, each containing a linear layer and a LeakyReLU layer. We also use multiple discriminators to rectify the binarization training process.
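A minimal sketch of one such discriminator is given below, assuming flattened feature maps as input and a stack of linear + LeakyReLU blocks ending in a sigmoid; the exact layer sizes used in the paper are not specified here and the values chosen are illustrative.

```python
import torch.nn as nn

class FeatureMapDiscriminator(nn.Module):
    """Distinguishes 'real' (full-precision) from 'fake' (1-bit) feature maps."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feature_map):
        # Flatten each feature map into a vector before the linear layers.
        return self.net(feature_map.flatten(start_dim=1))
```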

In addition, \({\mathscr {L}}_{Kernel}(W,{\hat{W}},C)\) denotes the kernel loss between the learned full precision filters W and the binarized filters \({\hat{W}}\) and is defined as:

$$\begin{aligned} {\mathscr {L}}_{Kernel}(W,{\hat{W}},C)=\frac{\lambda _1}{2}||W-C{\hat{W}}||^{2}, \end{aligned}$$
(4)

where \(\lambda _1\) is a balance parameter. Finally, \({\mathscr {L}}_{S}\) is a traditional problem-dependent loss, such as softmax loss. The adversarial loss, kernel loss, and softmax loss are considered to be regularizations on \({\mathscr {L}}\).
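The kernel loss in Eq. 4 is a plain reconstruction term; a one-line sketch for a single layer, assuming C broadcasts over the filter dimensions and an illustrative value of \(\lambda _1\):

```python
import torch

def kernel_loss(w, w_hat, c, lambda1=1e-4):
    """L_Kernel = lambda1 / 2 * ||W - C * W_hat||^2 for one layer.
    lambda1 is a balance hyper-parameter; its value here is illustrative."""
    return 0.5 * lambda1 * torch.sum((w - c * w_hat) ** 2)
```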

For simplicity, the update of the discriminators is omitted in the following description until Algorithm 1. We also omit \(log(\cdot )\) and rewrite the optimization in Eq. 2 as Eq. 5.

$$\begin{aligned} \min \limits _{W,{\hat{W}},C} {\mathscr {L}}_S(W,{\hat{W}},C)+\frac{\lambda _1}{2} \sum _{l} \sum _{i} ||W_i^{l}-C^{l}{\hat{W}}_i^{l}||^{2}+\sum _{l}\sum _{i}||1-D(T_i^{l};Y)||^2, \end{aligned}$$
(5)

where i represents the ith channel and l represents the lth layer. In Eq. 5, the target is to obtain W, \({\hat{W}}\) and C with Y fixed, which is why the term D(R; Y) in Eq. 2 can be ignored. The update process for Y is found in Algorithm 1. The advantage of our formulation in Eq. 5 is that the loss function is trainable, meaning that it can be easily incorporated into existing learning frameworks.

Algorithm 1

3.1.2 Learning RBCNs

In RBCNs, the convolution is implemented using \(W^l\), \(C^l\) and \(F_{in}^l\) to calculate the output feature maps \(F_{out}^{l}\) as

$$\begin{aligned} F_{out}^{l} = RBConv(F_{in}^l; {\hat{W}}^l, C^l)=Conv(F_{in}^l,{\hat{W}}^l\odot {C^l}), \end{aligned}$$
(6)

where RBConv denotes the convolution operation implemented as a new module, \(F_{in}^l\) and \(F_{out}^{l}\) are the feature maps before and after the convolution, respectively. \(W^l\) are full precision filters, the values of \({\hat{W}}^l\) are 1 or \(-1\), and \(\odot \) is the element-by-element product operation.
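A possible RBConv module is sketched below, under the assumption that \(C^l\) is a learnable matrix broadcast over each binarized filter (Eq. 6); this is an illustrative reading of the layer, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RBConv(nn.Module):
    """Rectified binary convolution: Conv(F_in, sign(W) ⊙ C)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # One learnable matrix C per layer, broadcast over the filters (assumed shape).
        self.C = nn.Parameter(torch.ones(1, in_ch, k, k))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # In practice, torch.sign would be replaced by the approximate-sign
        # function of Eq. 8 (sketched in Sect. 3.1.2) so that W receives gradients.
        w_hat = torch.sign(self.weight)          # binarized filters in {-1, +1}
        return F.conv2d(x, w_hat * self.C, stride=self.stride,
                        padding=self.padding)
```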

During the backward propagation process of RBCNs, what needs to be learned and updated are the full precision filters W and the learnable matrices C. These two sets of parameters are jointly learned. In each convolutional layer, we update W first and then C.

Update W Let \({{\delta }_{W_i^l}}\) be the gradient of the full precision filter \(W_i^l\). During backpropagation, the gradients are passed to \({\hat{W}}_i^l\) first and then to \(W_i^l\). Thus,

$$\begin{aligned} {{\delta }_{W_i^l}}=\frac{\partial {\mathscr {L}}}{\partial W_i^l}=\frac{\partial {\mathscr {L}}}{\partial {\hat{W}}_i^l}\frac{\partial {\hat{W}}_i^l}{\partial W_i^l}, \end{aligned}$$
(7)

where

$$\begin{aligned} \frac{\partial {\hat{W}}_i^l}{\partial W_i^l}=\left\{ \begin{array}{ll} 2+2W_i^l,&{}\quad -1\le W_i^l<0, \\ 2-2W_i^l,&{}\quad 0\le W_i^l<1, \\ 0,&{}\quad \mathrm{{otherwise}}, \\ \end{array} \right. \end{aligned}$$
(8)

which is an approximation of \(2\times \) the Dirac delta function (Liu et al. 2018). Furthermore,

$$\begin{aligned} \begin{aligned} \frac{\partial {\mathscr {L}}}{\partial {\hat{W}}_{i}^{l}}=\frac{\partial {{{\mathscr {L}}}_{S}}}{\partial {\hat{W}}_{i}^{l}}+\frac{\partial {{{\mathscr {L}}}_{Kernel}}}{\partial {\hat{W}}_{i}^{l}}+\frac{\partial {{{\mathscr {L}}}_{Adv}}}{\partial {\hat{W}}_{i}^{l}}, \end{aligned} \end{aligned}$$
(9)

and

$$\begin{aligned} {{W_i^l}}\leftarrow {{W_i^l}}-\eta _1 {{\delta }_{W_i^l}}, \end{aligned}$$
(10)

where \(\eta _1\) is the learning rate. Then,

$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Kernel}}}{\partial {{\hat{W}}_i^{l}}}=-\lambda _1 ({W^{l}_i-C^{l}{\hat{W}}^{l}_i})C^{l}, \end{aligned}$$
(11)
$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Adv}}}{\partial {{\hat{W}}_i^{l}}}=-2({1-D(T_i^{l};Y)})\frac{\partial {D}}{\partial {{{\hat{W}}^{l}_i}}}. \end{aligned}$$
(12)

Update C We further update the learnable matrix \(C^l\) with \(W^l\) fixed. Let \({{\delta }_{C^l}}\) be the gradient of \(C^l\). Then we have

$$\begin{aligned} {{\delta }_{C^l}}=\frac{\partial {\mathscr {L}}_S}{\partial C^l}+\frac{\partial {\mathscr {L}}_{Kernel}}{\partial {C^l}}+\frac{\partial {\mathscr {L}}_{Adv}}{\partial {C^l}}, \end{aligned}$$
(13)

and

$$\begin{aligned} {{C^{l}}}\leftarrow {{C^{l}}}-\eta _2 {{\delta }_{C^{l}}}, \end{aligned}$$
(14)

where \(\eta _2\) is another learning rate. Furthermore,

$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Kernel}}}{\partial {{C^{l}}}}=-\lambda _1\sum _{i}({W^{l}_i-C^l{\hat{W}}^{l}_i}) {\hat{W}}^{l}_i, \end{aligned}$$
(15)
$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Adv}}}{\partial {{C^{l}}}}=-\sum _{i} 2({1-D(T^l_i;Y)})\frac{\partial {D}}{\partial {{C^{l}}}}. \end{aligned}$$
(16)

These derivations show that the rectified process is trainable in an end-to-end manner. The complete training process is summarized in Algorithm 1, including how to update the discriminators. As described in line 17 of Algorithm 1, we independently update the other parameters while fixing the parameters of the convolutional layers, to enhance the variety of the feature maps in every layer. In this way, we speed up the training convergence and fully explore the potential of 1-bit networks. In our implementation, all the values of \(C^l\) are replaced by their average during the forward process, so only a scalar, not a matrix, is involved in inference, which speeds up the computation.
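Putting the pieces together, one epoch of Algorithm 1 alternates between updating the generator (W and C through the total loss of Eq. 5) and the discriminators, and finally refreshing the non-convolutional layers with W and C fixed. The sketch below is heavily simplified and hedged: it assumes a hypothetical `rbcn` module that returns its per-layer feature maps together with the logits and exposes a `kernel_loss` helper summing Eq. 4 over layers.

```python
import torch
import torch.nn.functional as F

def train_epoch(rbcn, full_precision, discriminators, loader,
                opt_wc, opt_d, lambda1):
    """One (simplified) epoch of the RBCN game; interfaces are hypothetical."""
    for images, labels in loader:
        # 1) Update W and C with the discriminators Y fixed (Eq. 5).
        feats_T, logits = rbcn(images)                  # "fake" per-layer feature maps
        with torch.no_grad():
            feats_R, _ = full_precision(images)         # "real" per-layer feature maps
        loss = F.cross_entropy(logits, labels)          # L_S
        for t, d in zip(feats_T, discriminators):
            loss = loss + ((1.0 - d(t)) ** 2).mean()    # adversarial term of Eq. 5
        loss = loss + rbcn.kernel_loss(lambda1)         # kernel loss, Eq. 4
        opt_wc.zero_grad(); loss.backward(); opt_wc.step()

        # 2) Update the discriminators Y with W and C fixed (Eq. 3).
        d_loss = 0.0
        for t, r, d in zip(feats_T, feats_R, discriminators):
            d_loss = d_loss - torch.log(d(r)).mean() \
                            - torch.log(1.0 - d(t.detach())).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 3) After the epoch, the remaining layers (e.g., BN) are updated with the
    #    convolutional parameters fixed (line 17 of Algorithm 1); omitted here.
```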

3.2 Network Pruning

We further prune the 1-bit CNNs to increase the model efficiency and improve the flexibility of RBCNs in practical scenarios. This section describes the optimization of the pruning process, including the loss function and the update of the learnable parameters.

3.2.1 Loss Function

After binarizing the CNNs, we further prune the resulting 1-bit CNNs under the generative adversarial learning framework using the method described in Lin et al. (2019). We use a soft mask to remove the corresponding structures, such as filters, while obtaining accuracy close to the baseline. The discriminator \(D_p(\cdot )\) with weights \(Y_p\) is introduced to distinguish the output \(R_p\) of the baseline network from the output \(T_p\) of the pruned 1-bit network. The pruned network with weights \(W_p\), \({\hat{W}}_p\), \(C_p\) and soft mask \(M_p\) is learned together with \(Y_p\) using knowledge from the supervised features of the baseline. \(W_p\), \({\hat{W}}_p\), \(C_p\), \(M_p\) and \(Y_p\) are therefore learned by solving the following optimization problem:

$$\begin{aligned} \arg \min \limits _{W_p,{\hat{W}}_p,C_p,M_p}\max \limits _{Y_p}{\mathscr {L}}_p&={\mathscr {L}}_{Adv\_p}(W_p,{\hat{W}}_p,C_p,M_p,Y_p)+{\mathscr {L}}_{Kernel\_p}(W_p,{\hat{W}}_p,C_p)+{\mathscr {L}}_{S\_p}(W_p,{\hat{W}}_p,C_p)\\&\quad +{\mathscr {L}}_{Data\_p}(W_p,{\hat{W}}_p,C_p,M_p)+{\mathscr {L}}_{Reg\_p}(M_p,Y_p), \end{aligned}$$
(17)

where \({\mathscr {L}}_p\) is the pruning loss. The forms of \({\mathscr {L}}_{Adv\_p}(W_p,{\hat{W}}_p,C_p,M_p,Y_p)\) and \({\mathscr {L}}_{Kernel\_p}(W_p,{\hat{W}}_p,C_p)\) are

$$\begin{aligned} {\mathscr {L}}_{Adv\_p}(W_p,{\hat{W}}_p,C_p,M_p,Y_p)=log(D_p(R_p;Y_p))+log(1-D_p(T_p;Y_p)), \end{aligned}$$
(18)
$$\begin{aligned} {\mathscr {L}}_{Kernel\_p}(W_p,{\hat{W}}_p,C_p)=\frac{\lambda _1}{2}||W_p-C_p{\hat{W}}_p||^{2}. \end{aligned}$$
(19)

\({\mathscr {L}}_{S\_p}\) is a traditional problem-dependent loss such as the softmax loss. \({\mathscr {L}}_{Data\_p}\) is the data loss between the output features of the baseline and the pruned network and is used to align the outputs of the two networks. The data loss is expressed as an MSE loss:

$$\begin{aligned} \begin{aligned} {{{\mathscr {L}}}_{Data\_p}}(W_p,{\hat{W}}_p,C_p,M_p)= \frac{1}{2n}{{\left\| {{R_{p}}}-{{T_{p}}} \right\| }^{2}}, \end{aligned} \end{aligned}$$
(20)

where n is the size of the mini-batch.

\({\mathscr {L}}_{Reg\_p}(M_p,Y_p)\) is a regularizer on \(W_p\), \({\hat{W}}_p\), \(M_p\) and \(Y_p\), which can be split into two parts as follows:

$$\begin{aligned} {\mathscr {L}}_{Reg\_p}(M_p,Y_p) = {\mathscr {R}}_{\lambda }(M_p) + {\mathscr {R}}(Y_p), \end{aligned}$$
(21)

where \({\mathscr {R}}(Y_p)=log(D_p(T_{p};Y_p))\), and \({\mathscr {R}}_{\lambda }(M_p)\) is a sparsity regularizer with parameter \(\lambda \), i.e., \({\mathscr {R}}_{\lambda }(M_p)=\lambda ||M_p||_{l_1}\).

As with the binarization process, the update of the discriminators is omitted in the following description until Algorithm 2. We also omit \(log(\cdot )\) for simplicity and rewrite the optimization of Eq. 17 as

$$\begin{aligned} \min \limits _{W_p,{\hat{W}}_p,C_p,M_p}&\frac{\lambda _1}{2} \sum _{l}\sum _{i} ||W_{p,i}^{l}-C^{l}_{p}{\hat{W}}_{p,i}^{l}||^{2}+\sum _{l}\sum _{i}||1-D(T_{p,i}^{l};Y_p)||^2\\&+{\mathscr {L}}_{S\_p}(W_p,{\hat{W}}_p, C_p)+\frac{1}{2n}||R_{p}-T_{p}||^{2} + \lambda ||M_p||_{l_1}. \end{aligned}$$
(22)

3.2.2 Learning Pruned RBCNs

In pruned RBCNs, the convolution is implemented as

$$\begin{aligned} \begin{aligned} F_{out,p}^{l}&= RBConv(F_{in,p}^l; {\hat{W}}^l_p\circ M^l_p, C_p^l)\\&=Conv(F_{in,p}^l,({\hat{W}}^l_p\circ M^l_p)\odot {C^l_p}), \end{aligned} \end{aligned}$$
(23)

where \(\circ \) is an operator that applies the mask \(M_p\) to obtain the pruned weights. The rest of the forward propagation in the pruned RBCNs is the same as in RBCNs.
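The masked convolution in Eq. 23 can be sketched as follows, assuming (as an illustration) that the soft mask \(M_p\) holds one scalar per output filter that scales, and when driven toward zero by the \(l_1\) regularizer effectively removes, that filter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunedRBConv(nn.Module):
    """Pruned rectified binary convolution: Conv(F_in, (sign(W) ∘ M) ⊙ C)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.C = nn.Parameter(torch.ones(1, in_ch, k, k))
        self.mask = nn.Parameter(torch.ones(out_ch, 1, 1, 1))  # soft mask M_p
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w_hat = torch.sign(self.weight)                  # binarized filters
        w = (w_hat * self.mask) * self.C                 # apply mask, then C
        return F.conv2d(x, w, stride=self.stride, padding=self.padding)

    def sparsity_loss(self, lam):
        # R_lambda(M_p) = lambda * ||M_p||_1 from Eq. 21.
        return lam * self.mask.abs().sum()
```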

In pruned RBCNs, what needs to be learned and updated are the full precision filters \(W_p\), learnable matrices \(C_p\), and soft mask \(M_p\). In each convolutional layer, these three sets of parameters are jointly learned.

Update \(M_p\) \(M_p\) is updated by FISTA (Lin et al. 2018) with the initialization of \(\alpha _{(1)} = 1\). Then we obtain

$$\begin{aligned}&\alpha _{(k+1)} = \frac{1}{2}\left( 1+\sqrt{1+4\alpha ^2_{(k)}}\right) , \end{aligned}$$
(24)
$$\begin{aligned}&y_{(k+1)} = M_{p,(k)} +\frac{\alpha _{(k)}-1}{\alpha _{(k+1)}}(M_{p,(k)}-M_{p,(k-1)}), \end{aligned}$$
(25)
$$\begin{aligned}&M_{p,(k+1)} =prox_{\eta _{(k+1)}\lambda ||\cdot ||_1}\left( y_{(k+1)}-\eta _{(k+1)}\frac{\partial ({\mathscr {L}}_{Adv\_p}+{\mathscr {L}}_{Data\_p})}{\partial y_{(k+1)}}\right) , \end{aligned}$$
(26)

where \(\eta _{(k+1)}\) is the learning rate at iteration \(k+1\) and \(prox_{\eta _{(k+1)}\lambda ||\cdot ||_1}(z_i)=sign(z_i)\cdot (|z_i|-\eta _{(k+1)}\lambda )_{+}\); more details can be found in Lin et al. (2019).
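The proximal step in Eq. 26 is the usual soft-thresholding operator; a hedged sketch of one FISTA iteration for the mask, following Eqs. 24–26 (the gradient callback `grad_fn` stands in for \(\partial ({\mathscr {L}}_{Adv\_p}+{\mathscr {L}}_{Data\_p})/\partial y\)):

```python
import torch

def soft_threshold(z, thresh):
    # prox of thresh * ||.||_1 : sign(z) * max(|z| - thresh, 0)
    return torch.sign(z) * torch.clamp(z.abs() - thresh, min=0.0)

def fista_step(m_curr, m_prev, grad_fn, alpha, eta, lam):
    """One FISTA update of the soft mask M_p (Eqs. 24-26)."""
    alpha_next = 0.5 * (1.0 + (1.0 + 4.0 * alpha ** 2) ** 0.5)   # Eq. 24
    y = m_curr + (alpha - 1.0) / alpha_next * (m_curr - m_prev)  # Eq. 25
    m_next = soft_threshold(y - eta * grad_fn(y), eta * lam)     # Eq. 26
    return m_next, m_curr, alpha_next
```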

Update \(W_p\) Let \({{\delta }_{W_{p,i}^l}}\) be the gradient of the full precision filter \(W_{p,i}^l\). During backpropagation, the gradients pass to \({\hat{W}}_{p,i}^l\) first and then to \(W_{p,i}^l\). Furthermore,

$$\begin{aligned} {\delta }_{W_{p,i}^l}=\frac{\partial {{{\mathscr {L}}_p}}}{\partial {\hat{W}}_{p,i}^{l}}=\frac{\partial {{{\mathscr {L}}}_{S\_p}}}{\partial {\hat{W}}_{p,i}^{l}}+\frac{\partial {{{\mathscr {L}}}_{Adv\_p}}}{\partial {\hat{W}}_{p,i}^{l}}+\frac{\partial {{{\mathscr {L}}}_{Kernel\_p}}}{\partial {\hat{W}}_{p,i}^{l}}+\frac{\partial {{{\mathscr {L}}}_{Data\_p}}}{\partial {\hat{W}}_{p,i}^{l}}, \end{aligned}$$
(27)

and

$$\begin{aligned} {{W_{p,i}^l}}\leftarrow {{W_{p,i}^l}}-\eta _{p,1} {{\delta }_{W_{p,i}^l}}, \end{aligned}$$
(28)

where \(\eta _{p,1}\) is the learning rate, \(\frac{\partial {{\mathscr {L}}_{Kernel\_p}}}{\partial {{\hat{W}}_{p,i}^{l}}}\) and \(\frac{\partial {{\mathscr {L}}_{Adv\_p}}}{\partial {{\hat{W}}_{p,i}^{l}}}\) are

$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Kernel\_p}}}{\partial {{\hat{W}}_{p,i}^{l}}}=-\lambda _1 ({W^{l}_{p,i}-C^{l}_p{\hat{W}}^{l}_{p,i}})C^{l}_p, \end{aligned}$$
(29)
$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Adv\_p}}}{\partial {{\hat{W}}_{p,i}^{l}}}=- 2({1-D_p(T_{p,i}^{l};Y_p)})\frac{\partial {D_p}}{\partial {{{\hat{W}}^{l}_{p,i}}}}. \end{aligned}$$
(30)

Moreover,

$$\begin{aligned} \frac{\partial {{{\mathscr {L}}}_{Data\_p}}}{\partial {\hat{W}}_{p,i}^{l}}=-\frac{1}{n} (R_{p}-T_{p})\frac{\partial {T_{p}}}{\partial {{{\hat{W}}^{l}_{p,i}}}}. \end{aligned}$$
(31)

Update \(C_p\) We further update the learnable matrix \(C^l_p\) with \(W^l_p\) and \(M^l_p\) fixed. Let \({{\delta }_{C^l_p}}\) be the gradient of \(C^l_p\). Then we have

$$\begin{aligned} {{\delta }_{C^l_p}}=\frac{\partial {{{\mathscr {L}}_p}}}{\partial C^{l}_p}=\frac{\partial {{{\mathscr {L}}}_{S\_p}}}{\partial C^{l}_p}+\frac{\partial {{{\mathscr {L}}}_{Adv\_p}}}{\partial C^{l}_p}+\frac{\partial {{{\mathscr {L}}}_{Kernel\_p}}}{\partial C^{l}_p}+\frac{\partial {{{\mathscr {L}}}_{Data\_p}}}{\partial C^{l}_p}, \end{aligned}$$
(32)

and

$$\begin{aligned} {{C^{l}_p}}\leftarrow {{C^{l}_p}}-\eta _{p,2} {{\delta }_{C^{l}_p}}. \end{aligned}$$
(33)

where \(\eta _{p,2}\) is another learning rate, and \(\frac{\partial {{\mathscr {L}}_{Kernel\_p}}}{\partial {{C^{l}_p}}}\) and \(\frac{\partial {{\mathscr {L}}_{Adv\_p}}}{\partial {{C^{l}_p}}}\) are

$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Kernel\_p}}}{\partial {{C^{l}_p}}}=-\lambda _1\sum _{i}({W^{l}_{p,i}-C^l_p{\hat{W}}^{l}_{p,i}}) {\hat{W}}^{l}_{p,i}, \end{aligned}$$
(34)
$$\begin{aligned}&\frac{\partial {{\mathscr {L}}_{Adv\_p}}}{\partial {{C^{l}_p}}}=-\sum _{i} 2({1-D_p(T^l_{p,i};Y_p)})\frac{\partial {D}_p}{\partial {{C^{l}_p}}}. \end{aligned}$$
(35)

Further,

$$\begin{aligned} \frac{\partial {{{\mathscr {L}}}_{Data\_p}}}{\partial C^{l}_p}=\frac{1}{n} \sum _{i}(R_{p,i}-T_{p,i})\frac{\partial {T_{p,i}}}{\partial {{{C}^{l}_p}}}. \end{aligned}$$
(36)

The complete training process is summarized in Algorithm 2, including the update of the discriminators.

Algorithm 2

4 Experiments

Our RBCNs are evaluated first on object classification using the CIFAR10/100 (Krizhevsky 2009) and ILSVRC12 ImageNet (Russakovsky et al. 2015) datasets. They are then evaluated on various tasks such as object tracking, face recognition (FR), and person re-identification (Re-id). For object classification, WideResNet (WRN) (Zagoruyko and Komodakis 2016) and ResNet (He et al. 2016) are employed as the backbone networks.

4.1 Datasets and Implementation Details

Datasets CIFAR10 (Krizhevsky 2009) is a natural image classification dataset containing a training set of 50,000 and a testing set of 10,000 \(32\times 32\) color images. The dataset contains ten classes, including airplanes, dogs, birds, automobiles, cats, deer, frogs, ships, horses, and trucks, while CIFAR100 has 100 classes.

The ImageNet object classification dataset (Russakovsky et al. 2015) is more challenging due to its larger scale and greater diversity. This dataset contains 1000 classes, 1.2 million training images, and 50k validation images.

WRN Backbone WRN is a network structure similar to ResNet, with a depth factor k to control the feature map depth dimension expansion through three stages, within which the dimensions remain unchanged. For simplicity, we fix the depth factor to 1. Each WRN has a parameter i that indicates the channel dimension of the first stage, which we set to 16. \(\eta _1\) and \(\eta _2\) are both set to 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200 for CIFAR10/100. WRN22 is a network with 22 convolutional layers, and WRN40 is a network with 40 convolutional layers.

ResNet18 Backbone SGD is used as the optimization algorithm with a momentum of 0.9. On ImageNet, \(\eta _1\) and \(\eta _2\) are both set to 0.1 with a degradation of 10% for every 20 epochs before reaching the maximum epoch of 70. On CIFAR10/100, \(\eta _1\) and \(\eta _2\) are both set to 0.01 with a degradation of 10% for every 60 epochs before reaching the maximum epoch of 200. For the hyper-parameters specific to pruning, more details can be found in Lin et al. (2019).
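The schedules above correspond to a standard step decay; a hedged configuration sketch for the CIFAR10/100 setting (the stand-in model and the training loop body are placeholders, only the optimizer and scheduler arguments reflect the values quoted in the text):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the ResNet18/WRN backbone

# SGD with momentum 0.9; eta_1 = eta_2 = 0.01, decayed by 10x every 60 epochs
# before reaching the maximum epoch of 200 on CIFAR10/100.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # ... one training epoch would go here ...
    scheduler.step()
```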

4.2 Ablation Study

This section studies the performance contributions of the kernel approximation, the GAN, and the update strategy (fixing the parameters of the convolutional layers and updating the other layers). CIFAR100 and ResNet18 with different kernel stages are used in these experiments.

  1. We replace the convolution in Bi-Real Net with our kernel approximation (RBConv) and compare the results. As shown in the “Bi” and “R” columns of Table 2, RBCN achieves a \(1.62\%\) accuracy improvement over Bi-Real Net (\(56.54\%\) vs. \(54.92\%\)) using the same network structure as ResNet18. This significant improvement verifies the effectiveness of the learnable matrices.

  2. Using the GAN improves RBCN by \(2.59\%\) (\(59.13\%\) vs. \(56.54\%\)) with the kernel stage of 32-32-64-128, which shows that the GAN helps mitigate the problem of being trapped in poor local minima.

  3. We further improve RBCNs by updating the BN layers with W and C fixed after each epoch (line 17 in Algorithm 1). This further boosts the accuracy by \(2.51\%\) (\(61.64\%\) vs. \(59.13\%\)) on CIFAR100 with 32-32-64-128.

Table 2 Performance (accuracy, %) contributions of the components in RBCNs on CIFAR100, where Bi = Bi-Real Net, R = RBConv, G = GAN, and B = update strategy

4.3 Accuracy Comparison with the State-of-the-Art

CIFAR10/100 The same parameter settings are used for RBCNs on both CIFAR10 and CIFAR100. We first compare our RBCNs with the original ResNet18 using different kernel stages, followed by a comparison with the original WRNs with an initial channel dimension of 64 (Table 3). The rectified process leads to results on both datasets that are close to those of the full precision networks ResNet18 and WRN40. We compare our results with other state-of-the-art approaches such as Bi-Real Net (Liu et al. 2018), PCNN (Gu et al. 2019), and XNOR-Net (Rastegari et al. 2016). All of these BNNs have both binary filters and binary activations. We observe an accuracy improvement of up to \(6.72\%\) (\(=\) 61.64–54.92\(\%\)). We also plot the training and testing loss curves of XNOR-Net and RBCN in Fig. 2, which show that RBCN converges faster than XNOR-Net.

ImageNet Five methods are chosen for comparison on ImageNet: Bi-Real Net (Liu et al. 2018), BinaryNet (Courbariaux et al. 2016), XNOR-Net (Rastegari et al. 2016), PCNN (Gu et al. 2019), and ABC-Net (Lin et al. 2017). These networks binarize both network weights and activations and achieve state-of-the-art results. All the methods in Table 4 perform the binarization of ResNet18. We quote the results directly from the corresponding papers, except for BinaryNet, whose results are quoted from Lin et al. (2017). The comparison clearly indicates that the proposed RBCN outperforms the five binary networks by a considerable margin in both Top-1 and Top-5 accuracy. For Top-1 accuracy, RBCN outperforms BinaryNet and ABC-Net by over \(16\%\), and achieves an \(8.3\%\) improvement over XNOR-Net, \(3.1\%\) over Bi-Real Net, and \(2.2\%\) over PCNN. In Fig. 3, we plot the training and testing loss curves of XNOR-Net and RBCN, which clearly show that, using our rectified process, RBCN converges faster than XNOR-Net.

Table 3 Classification accuracy (%) based on ResNet18 and WRN40 on CIFAR10/100
Fig. 2

Training and testing error curves of RBCN and XNOR-Net based on WRN40 for the CIFAR10/100 experiments

Table 4 Classification accuracy (%) on ImageNet

4.4 Experiments on Different Tasks

In this section, we explore the effect of RBCNs on different tasks, including object tracking, person re-identification (reID), and face recognition (FR). These tasks help establish the applicability of binary networks in other fields.

4.4.1 Object Tracking

The key message conveyed by the proposed method is that although the conventional binary training method has limited model capability, the proposed rectified process can lead to a powerful model. We show that this framework can also be used in object tracking. In particular, we consider the problem of tracking an arbitrary object in videos, where the object is identified only by a rectangle in the first frame. For object tracking, it is necessary to update the weights of the network online, which severely compromises the system's speed. To directly apply the framework to tracking, we construct a binary convolution with the same structure to reduce the convolution time. In this way, RBCN can be used to further binarize the network and guarantee the tracking performance.

For these experiments, we use the SiamFC network as the backbone, yielding a Rectified Binary Convolutional SiamFC Network (RB-SF). We evaluate RB-SF on four datasets, UAV123 (Mueller et al. 2016), GOT-10K (Huang et al. 2018), OTB50 (Wu et al. 2013), and OTB100 (Wu et al. 2015), using average overlap (AO) and success rate (SR) as metrics. In the following, we briefly introduce the datasets used in our experiments.

Fig. 3

Training and testing error curves of RBCN and XNOR-Net based on the ResNet18 backbone on ImageNet

Table 5 Tracking performance comparison between XNOR-Net and RB-SF on different datasets

The long-term aerial tracking dataset UAV123 consists of 123 videos with more than 110K frames. All of the sequences are from an aerial viewpoint. The dataset contains a wide variety of scenes including buildings, urban landscape, fields, roads, beaches and a harbor/marina, targets including cars, trucks, boats, persons, groups, and aerial vehicles, and activities including walking, cycling, wakeboarding, driving, swimming, and flying.

GOT-10k includes 563 object classes and 87 motion patterns in the real world. It has over 10,000 segments and 1.5 million bounding boxes. The dataset is split into unified training, validation, and testing subsets, which span five subtrees: animal, artifact, person, natural object, and part. We use the testing subset containing 180 videos, 84 classes of moving objects, and 32 forms of motion.

The dataset OTB100 for tracker evaluation has 100 moving objects, containing basketball, birds, biker, etc. The full dataset consists of 100 sequences with more than 59K frames. OTB50 is a subset of OTB100 and consists of 51 sequences with more than 29k frames.

The results are shown in Table 5. Intriguingly, our framework achieves about \(7\%\) AO improvement over XNOR-Net, using the same network architecture as in SiamFC Network on GOT-10k.

4.4.2 Person Re-Identification

The task of person re-identification is to determine whether two images belong to the same subject. Specifically, the two images are usually captured by two disjoint cameras in practical applications. The performance of person re-identification is closely related to many other applications, such as behavior analysis, object retrieval, and cross-camera tracking. Although good performance has been achieved on many benchmarks, re-identification models are still difficult to deploy in the real world due to their large memory and computation costs. Therefore, we use the rectified method to compress the reID models in this paper to further improve the efficiency of CNNs. In the following, we briefly introduce the person re-identification datasets used in our paper.

Market-1501 (Zheng et al. 2015) contains 1501 identities captured from 6 different viewpoints, with 32,668 labeled bounding boxes, and is currently the largest image-based reID benchmark dataset. Specifically, the dataset contains 12,936 images with 751 identities for training, and 19,732 images with 750 identities for testing.

DukeMTMC-reID (Zheng et al. 2017) contains eight 85-minute high-resolution videos from 8 different cameras with available hand-drawn pedestrian bounding boxes.

The iLIDS dataset (Wang et al. 2014) is constructed from video images in an airport arrival hall. It has 119 pedestrians with 479 normalized images from non-overlapping cameras. In particular, the subjects in the iLIDS dataset are exposed to the challenge of large illumination changes and occlusions.

As shown in Table 6, our framework achieves an \(8.8\%\) (= 67.4–58.6\(\%\)) precision improvement over XNOR-Net, both using the same DenseNet121 architecture on Market-1501. Besides, RBCN also outperforms XNOR-Net by \(5.0\%\) (= 50.7–45.7\(\%\)) with DenseNet121 on the video dataset iLIDS, which confirms that our rectified networks can also obtain better performance on re-identification video datasets.

Table 6 Person re-identification (reID) accuracy (%) based on ResNet50, ResNet50_MID and DenseNet121 on different datasets

4.4.3 Face Recognition

Face recognition (FR) is important because it can be used for identity authentication. Although face recognition technology has already achieved very high performance, it is still challenging to deploy on lightweight devices because of the large memory and computation costs. Therefore, we employ our RBCN on face recognition models to improve their efficiency. In the following, we briefly introduce the datasets used in our paper.

Training Dataset We use CASIA-WebFace (Yi et al. 2014) to train our RBCN models. Specifically, CASIA-WebFace has 494,414 face images belonging to 10,575 different individuals, which are horizontally flipped for data augmentation.

Testing Dataset The LFW dataset (Huang et al. 2007) contains 13,323 web photos of 5749 celebrities, which are divided into 6000 face pairs in 10 splits. The CFP dataset (Sengupta et al. 2016) contains 7000 images of 500 subjects and is mainly used for evaluating pose variation. AgeDB (Moschoglou et al. 2017) contains 16,488 images of 568 distinct subjects (e.g., actors, writers, scientists).

As shown in Table 7, RBCN achieves up to an \(11.51\%\) (= 83.63–72.12\(\%\)) precision improvement over XNOR-Net using ResNet50 on AgeDB. RBCN with only the weights binarized outperforms XNOR-Net by a gap of over \(18\%\) (i.e., \(90.33\%\) vs. \(72.12\%\)) with ResNet50 on AgeDB. Furthermore, our framework with only binarized weights performs almost as well as the full precision ResNet50 network.

Table 7 Face recognition accuracy (%) based on ResNet50 on different datasets
Table 8 Pruning results of ResNet18 on CIFAR10. RBCN (0.6) means that we prune 40% parameters on 1-bit networks

4.5 Results on Network Pruning

In this section, we verify the performance of network pruning. Since BNNs have much information redundancy, we hope to use network pruning to obtain a more compact model without sacrificing much accuracy. In this implementation, we use ResNet18 as our backbone and change the pruning rates to validate the effectiveness of pruning on CIFAR10 dataset. The results are shown in Table 8. For ResNet18 with kernel stage 32-64-128-256, when we prune \(20\%\) channels of RBCN, we only sacrifice \(0.14\%\) accuracy compared with the baseline (i.e., 89.91–89.77\(\%\)). For ResNet18 with kernel stage 32-32-64-128, when we prune \(20\%\) of the channels of RBCN, we obtain the result of \(87.47\%\) with significant memory saving and computational efficiency. From the results, we see that when we prune the redundancy in RBCNs, our models tend to be more compact while maintaining performance.

4.6 Efficiency Analysis

The memory usage is computed as the summation of \(32\times \) the number of real-valued parameters and \(1\times \) the number of binary parameters in BNNs. Furthermore, following Liu et al. (2018), we use FLOPs to measure the speed. The FLOPs are calculated as the number of real-valued floating-point multiplications plus 1/64 of the number of 1-bit multiplications. From the results in Table 9, we can see that RBCN and XNOR-Net reduce the memory usage of the full precision ResNet18 by 11.10 times. For efficiency, both RBCN and XNOR-Net gain a \(10.86\times \) speedup over ResNet18. Note that the computational and storage costs brought by the learnable matrices C are negligible. Considering that we only introduce a new training strategy (i.e., GAN) to better calculate 1-bit CNNs, the network architecture remains the same as in other works (Liu et al. 2018; Gu et al. 2019). Therefore, we have the same real acceleration ratio as other methods.
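The memory and speedup figures follow from straightforward arithmetic; a short sketch of the bookkeeping described above (any concrete parameter and operation counts would be supplied by the user, none are assumed here):

```python
def model_cost(real_params, binary_params, real_flops, binary_ops):
    """Memory (bits) and effective FLOPs under the counting rules above:
    32 bits per real-valued parameter, 1 bit per binary parameter,
    and 1/64 of a FLOP per 1-bit multiplication (following Liu et al. 2018)."""
    memory_bits = 32 * real_params + 1 * binary_params
    effective_flops = real_flops + binary_ops / 64.0
    return memory_bits, effective_flops

# Dividing the full-precision cost by the binarized cost gives the
# memory-saving and speedup ratios reported in Table 9.
```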

Furthermore, we compare our RBCN with other model compression methods in Table 10 to evaluate its advantages. For example, the distillation-based method RKD (Gao et al. 2020) obtains higher accuracy than the baseline with the same number of parameters and FLOPs. The pruning methods SFP and FPGM reduce the number of parameters and FLOPs with a small performance drop. Compared with these compression methods, the advantage of binarization is that the parameters and FLOPs are extremely small, which gives it more potential for use on resource-limited devices.

Table 9 Comparison of memory usage & FLOPs calculation
Table 10 Memory usage, FLOPs and accuracy comparison for different model compression methods

5 Conclusion

This paper introduces Rectified Binary Convolutional Networks (RBCNs), which optimize BNNs by exploiting full precision kernels and feature maps in an end-to-end manner. We introduce GANs to train the binarized network with the guidance of its corresponding full precision model, significantly improving the performance of BNNs. RBCNs can be used not only for object classification but also for other tasks, including object tracking, face recognition, and person re-identification. Experiments demonstrate the superior performance of RBCNs over state-of-the-art binarized models. We also achieve promising performance on model pruning, which validates the generality of our method. In the future, we will combine our method with neural architecture search (NAS) to build data-adaptive 1-bit CNNs.