1 Introduction

Deep Neural Networks (DNNs) have achieved remarkable performance in many application domains, including speech recognition [1, 2] and computer vision, notably object recognition [3, 4, 6, 23] and object detection [7, 8, 10]. A particular type of network, the Convolutional Neural Network (CNN), is being deployed in real-world applications on smart phones and other embedded devices. However, it is difficult to deploy these computationally and memory-intensive CNNs on embedded devices, whose computational resources and storage are both limited.

1.1 Binary Weight Networks and Model Compression

To address these storage and computational issues [5, 21], methods that binarize the weights or activations of DNN models have been proposed. BinaryConnect [11] binarizes the weights to {+1, −1} with a single sign function. Binary Weight Networks [12] improve model capacity by adding an extra scaling factor on top of the previous method. BinaryNet [12] and XNOR-Net [13] extend these methods by binarizing not only the weights but also the activations. These models eliminate most of the multiplication operations in the forward and backward propagations [16] and achieve a compression rate of up to 32\(\times \), but they also suffer considerable accuracy loss.
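For illustration, the following minimal sketch shows how a weight tensor could be binarized with a single sign function and a layer-wise scaling factor in the spirit of [12, 13]; the function name and the use of NumPy are our own choices, not the original implementations.

```python
import numpy as np

def binarize_with_scale(w):
    """Sketch of binary weight quantization in the spirit of [12, 13]:
    b = sign(w) with values in {+1, -1}, scaled by alpha = mean(|w|)."""
    alpha = np.mean(np.abs(w))          # single layer-wise scaling factor
    b = np.where(w >= 0, 1.0, -1.0)     # sign function, mapping 0 to +1
    return alpha * b
```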

1.2 Ternary Weight Networks and Model Compression

Recently, increasing attention has been paid to 2-bit quantization of neural networks, especially ternary weight quantization. Ternary weight networks (TWNs) [14] constrain the weights to {\(-1, 0, +1\)} to maximize model compression while minimizing the loss of model precision as far as possible. Compared with the earlier binary quantization networks, the accuracy loss is clearly reduced because of the increased weight precision. Model capacity can be improved further by using different scaling factors for positive and negative weights.

We optimize the previous methods [14, 20] by proposing Segmented Asymmetric Ternary and Binary Weights Networks (SATB-Nets) to explore higher model capacity and compression rate. For each layer, we segment the weight vector space into several disjoint subspaces. In each subspace, we constrain the weights to three values {\(+W^{pt}_{ls}, 0, -W^{nt}_{ls}\)} for convolutional (CONV) layers and two values {\(+W^{pb}_{ls}, -W^{nb}_{ls}\)} for fully-connected (FC) layers, which can be encoded with two bits and a single bit, respectively. Compared with the TWN [14] and BWN [11] quantization methods, SATB-Nets exploit the local redundancy structure better and gain stronger expressive ability, leading to better performance. In addition, the fixed scaling factors {\(+W^{p*}_{ls}, 0, -W^{n*}_{ls}\)} open up more possibilities for accelerating computation.

2 Segmented Asymmetric Ternary and Binary Weights Networks

In this section we introduce in detail how to obtain Segmented Asymmetric Ternary and Binary Weights Networks (SATB-Nets) and how to train them efficiently.

2.1 Segmentation

Product quantization (PQ) [18] partitions the vector space into many disjoint subspaces to exploit redundancy in the structure of the vector space. The authors of [9] propose segmenting the weight matrix and then performing quantization in each subspace. Similarly, we partition the weight matrix into several submatrices to improve the expressive ability of the quantized networks:

$$\begin{aligned} W= [W^{1}, W^{2}, ... , W^{k}], \end{aligned}$$
(1)

where \(W\in R^{m*n}\) and \(W^{i}\in R^{m*(n/k)}\), assuming n is divisible by k. We can then quantize each submatrix \(W^{i}\) with ternary or binary values. More segments lead to higher model capacity but aggressively increase the codebook size. Therefore, using the same trick as described in [9], we fix the number of segments k to 8 to keep a satisfactory balance between the compression rate and the precision loss of the networks.
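As a minimal sketch (our own illustration, not the authors' code), the column-wise segmentation of (1) could be written as:

```python
import numpy as np

def segment_weights(W, k=8):
    """Split an (m x n) weight matrix column-wise into k submatrices
    W = [W^1, ..., W^k], each of shape (m, n/k), as in Eq. (1)."""
    m, n = W.shape
    assert n % k == 0, "n must be divisible by k"
    return np.split(W, k, axis=1)
```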

2.2 Asymmetric Binary Weights for Fully-Connected (FC) Layers

We constrain the full precision weights \(W_{lsi}\) (lth layer, sth segment and ith parameter) to binary weights with values belonging to {\(+W^{pb}_{ls}, -W^{nb}_{ls}\)}. The quantization function is shown in (2).

$$\begin{aligned} w^{b}_{lsi} = f_{b}(w_{lsi}) = \left\{ \begin{array}{ll} +W^{pb}_{ls} & \qquad w_{lsi} \ge 0\\ -W^{nb}_{ls} & \qquad w_{lsi}<0 \end{array} \right. \end{aligned}$$
(2)

Here 0 is the threshold and {\(W^{pb}_{ls}, W^{nb}_{ls}\)} are the scaling factors. To obtain performance as good as possible, we minimize the Euclidean distance between the floating-point weights \(W_{ls}\) and the binary weights \(W^{b}_{ls}\), so the optimization problem is transformed into (3):

$$\begin{aligned} \left\{ \begin{array}{ll} W^{*}_{ls} = arg\min J( W^{*}_{ls}) = arg\min { \left\| { W }_{ ls }-{ W }_{ ls }^{ b } \right\| }_{ 2 }^{ 2 }\\ s.t.\ W^{*}_{ls}>0;\ w^{b}_{lsi}\in \{+W^{pb}_{ls},-W^{nb}_{ls}\};\ i=1,2,...,n;\ s=1,2,...,k \end{array} \right. \end{aligned}$$
(3)

Substituting the binary function (2) into (3), we obtain the expression in (4):

$$\begin{aligned} J( { W }_{ ls }^{ * }) ={ \left\| { W }_{ ls }-{ W }_{ ls }^{ b } \right\| }_{ 2 }^{ 2 }=\sum _{ i\in { I }_{ * } }^{ { n }_{ s } }{ { \left| \left| { w }_{ lsi } \right| -{ W }_{ ls }^{ * } \right| }^{ 2 } } \end{aligned}$$
(4)

where \({ I }_{ * } = \){\({ I }_{ p },{ I }_{ n }\)}, \({ I }_{ p } = \{i|{ w }_{ lsi } \ge 0\}\) and \({ I }_{ n } = \{i|{ w }_{ lsi }<0\}\). According to (4), it is not complicated to obtain the optimal scaling factors from the floating-point weights, as in (5):

$$\begin{aligned} \begin{aligned} { W }_{ ls }^{ * }=\frac{ 1 }{ \left| { I }_{ * } \right| } \sum _{ i\in { I }_{ * } }^{ { n }_{ s } }{ \left| { w }_{ lsi } \right| } \end{aligned} \end{aligned}$$
(5)

where \(|{ I }_{ * }|\) denotes the number of elements in \({ I }_{ * }\) in each segment.
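Putting (2) and (5) together, a minimal sketch of the asymmetric binary quantization of one FC-layer segment might look as follows (a NumPy illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def binarize_segment(w):
    """Asymmetric binary quantization of one FC-layer segment, Eqs. (2) and (5):
    positive weights map to +W^pb, negative weights to -W^nb, where each
    scaling factor is the mean absolute value over its own index set."""
    pos = w >= 0
    W_pb = np.abs(w[pos]).mean() if pos.any() else 0.0        # Eq. (5) over I_p
    W_nb = np.abs(w[~pos]).mean() if (~pos).any() else 0.0    # Eq. (5) over I_n
    w_b = np.where(pos, W_pb, -W_nb)                          # Eq. (2)
    return w_b, (W_pb, W_nb)
```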

2.3 Asymmetric Ternary Weights for Convolutional (CONV) Layers

Similarly, we constrain the floating-point weights \(W_{lsi}\) (lth layer, sth segment and ith parameter) to ternary weights with values belonging to {\(+W^{pt}_{ls}, 0, -W^{nt}_{ls}\)}. The quantization function is shown in (6).

$$\begin{aligned} w^{t}_{lsi} = f_{t}(w_{lsi}|\varDelta ^{p}_{ls}, \varDelta ^{n}_{ls}) = \left\{ \begin{array}{ll} +W^{pt}_{ls} & \qquad {w_{lsi}>\varDelta ^{p}_{ls}}\\ 0 & \qquad {-\varDelta ^{n}_{ls} \le w_{lsi} \le \varDelta ^{p}_{ls}}\\ -W^{nt}_{ls} & \qquad {w_{lsi}<-\varDelta ^{n}_{ls}} \end{array} \right. \end{aligned}$$
(6)

Here {\(\varDelta ^{p}_{ls}, \varDelta ^{n}_{ls}\)} are the thresholds and {\(W^{pt}_{ls}, W^{nt}_{ls}\)} are the scaling factors. The optimization problem is formulated as (7):

$$\begin{aligned} \left\{ \begin{array}{ll} W^{*}_{ls}= arg\min J( W^{*}_{ls}) = arg\min { \left\| { W }_{ ls }-{ W }_{ ls }^{ t } \right\| }_{ 2 }^{ 2 }\\ s.t.\ W^{*}_{ls}>0;\ w^{t}_{lsi}\in \{+W^{pt}_{ls},0,-W^{nt}_{ls}\};\ i=1,2,...,n;\ s=1,2,...,k \end{array} \right. \end{aligned}$$
(7)

Substituting the ternary function (6) into (7), we obtain the expression in (8):

$$\begin{aligned} \begin{aligned} J( { W }_{ ls }^{ * })&={ \left\| { W }_{ ls }-{ W }_{ ls }^{ t } \right\| }_{ 2 }^{ 2 } = \left| { I }_{ { \varDelta }_{ ls }^{ * } } \right| * { ({ W }_{ ls }^{ * }) }^{ 2 } - 2*(\sum _{ i|i\in { I }_{ { \varDelta }_{ ls }^{ * } } }^{ { n }_{ s } }{ \left| { w }_{ lsi } \right| } )*({ W }_{ ls }^{ * }) + C\\ \end{aligned} \end{aligned}$$
(8)

where \( { I }_{ { \varDelta }_{ ls }^{ * } } = \){\({ I }_{ { \varDelta }_{ ls }^{ p } }, { I }_{ { \varDelta }_{ ls }^{ n } }\)}, \({ I }_{ { \varDelta }_{ ls }^{ p } } = \{i|{ w }_{ lsi }>{ \varDelta }_{ ls }^{ p }\}\), \({ I }_{ { \varDelta }_{ ls }^{ n } } = \{i|{ w }_{ lsi }<-{ \varDelta }_{ ls }^{ n }\}\) and \(|{ I }_{ { \varDelta }_{ ls }^{ * } }|\) denotes the number of elements of \({ I }_{ { \varDelta }_{ ls }^{ * } }\) in each segment. \({ \varDelta }_{ ls }^{ p }\) and \({ \varDelta }_{ ls }^{ n }\) are independent of each other. \(C = \sum _{ i }^{ { n }_{ s } }{ { \left| { w }_{ lsi } \right| }^{ 2 } } \) is a constant independent of {\(W^{pt}_{ls}, W^{nt}_{ls}\)}. Therefore, the problem for the scaling factors {\(W^{pt}_{ls}, W^{nt}_{ls}\)} can be simplified to:

$$\begin{aligned} \begin{aligned} { W }_{ ls }^{ * } = arg \min J( { W }_{ ls }^{ * }) = arg \min (\left| { I }_{ { \varDelta }_{ ls }^{ * } } \right| *{ ({ W }_{ ls }^{ * }) }^{ 2 }-2*(\sum _{ i|i\in { I }_{ { \varDelta }_{ ls }^{ * } } }^{ { n }_{ s } }{ \left| { w }_{ lsi } \right| } )*({ W }_{ ls }^{ * })) \end{aligned} \end{aligned}$$
(9)

According to (9), it is not complicated to obtain the ternary scaling factors from the floating-point weights, as in (10):

$$\begin{aligned} \begin{aligned}&{ W }_{ ls }^{ * } =\frac{ 1 }{ \left| { I }_{ { \varDelta }_{ ls }^{ * } } \right| } \sum _{ i|i \in { I }_{ { \varDelta }_{ ls }^{ * } } }^{ { n }_{ s } }{ \left| { w }_{ lsi } \right| } \\ \end{aligned} \end{aligned}$$
(10)

Here {\({ \varDelta }_{ ls }^{ p }, { \varDelta }_{ ls }^{ n }\)} are both positive values. As in [17], there is no straightforward closed-form solution for \({ \varDelta }_{ ls }^{ p } \) and \( { \varDelta }_{ ls }^{ n }\). However, since the weights are empirically drawn from a uniform or normal distribution, we adopt the method of [14] and set the thresholds as follows:

$$\begin{aligned} \begin{aligned}&{ \varDelta }_{ ls }^{ * } \approx 0.7 * \frac{ 1 }{ \left| { I }^{ * } \right| } \sum _{ i|i \in { I }^{ * } }^{ { n }_{ s } }{ \left| { w }_{ lsi } \right| } \end{aligned} \end{aligned}$$
(11)

where \({ I }^{ * } = \) {\({ I }^{ p },{ I }^{ n }\)}, \({ I }^{ p } = \{i \mid {w}_{lsi} \ge 0,\ i = 1,2,...,n_s\}\) and \({ I }^{ n } = \{i \mid {w}_{lsi} < 0,\ i = 1,2,...,n_s\}\). Finally, by substituting (10) and (11) into (6), the ternary weights can easily be obtained from the floating-point weights.
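Combining (6), (10) and (11), the asymmetric ternary quantization of one CONV-layer segment can be sketched as follows (again a NumPy illustration under assumed names, not the authors' code):

```python
import numpy as np

def ternarize_segment(w):
    """Asymmetric ternary quantization of one CONV-layer segment,
    Eqs. (6), (10) and (11)."""
    pos = w >= 0
    # Eq. (11): thresholds from the mean absolute value of each index set
    delta_p = 0.7 * np.abs(w[pos]).mean() if pos.any() else 0.0
    delta_n = 0.7 * np.abs(w[~pos]).mean() if (~pos).any() else 0.0
    above, below = w > delta_p, w < -delta_n
    # Eq. (10): scaling factors from the weights beyond each threshold
    W_pt = np.abs(w[above]).mean() if above.any() else 0.0
    W_nt = np.abs(w[below]).mean() if below.any() else 0.0
    # Eq. (6): assign {+W^pt, 0, -W^nt}
    w_t = np.zeros_like(w)
    w_t[above], w_t[below] = W_pt, -W_nt
    return w_t, (W_pt, W_nt, delta_p, delta_n)
```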

2.4 Heterogeneous Quantized Weights Structure

To achieve a good balance between compression rate and accuracy, we train CNNs with ternary weights for the convolutional layers and binary weights for the fully-connected layers. On the one hand, the dense and highly redundant fully-connected layers contain most of the parameters, so binarizing them removes more redundancy and yields a higher compression ratio. On the other hand, [21] shows that convolutional layers require more bits of precision than fully-connected layers, so ternary weights for the convolutional layers improve the expressive capacity. In addition, the zero quantization value in the convolutional layers removes multiplications and thus accelerates the networks.
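Building on the segmentation and the two quantizers sketched above, the heterogeneous scheme could be dispatched per layer roughly as follows; the layer naming convention and the reshaping of CONV weights to an (m x n) matrix are our assumptions:

```python
import numpy as np

def quantize_network(weights, k=8):
    """Quantize each layer segment-wise: ternary for CONV layers, binary for
    FC layers. `weights` maps layer names to 2-D (m x n) weight matrices."""
    quantized = {}
    for name, W in weights.items():
        quantize = ternarize_segment if name.startswith("conv") else binarize_segment
        segments = segment_weights(W, k)
        quantized[name] = np.concatenate([quantize(seg)[0] for seg in segments], axis=1)
    return quantized
```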

2.5 Train the SATB-Nets with Stochastic Gradient Descent (SGD) Method

The Stochastic Gradient Descent (SGD) algorithm is used to train SATB-Nets; more detail is shown in Algorithm 1.

[Algorithm 1: training SATB-Nets with SGD]

The whole training process is almost the same as the normal training method, except that segmented asymmetric ternary weights for the convolutional (CONV) layers and binary weights for the fully-connected (FC) layers are used in the forward propagation (step 1) and backward derivation (step 2), similar to the training method of BinaryConnect [11]. To overcome the difficulty of convergence when training with quantized weights, we keep the full precision floating-point weights for the weight update so that the tiny changes of each iteration are accumulated (step 3).

In addition, two useful tricks, Batch Normalization (BN) [24] and learning rate scaling, are adopted. We also use momentum for acceleration.
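A minimal sketch of one training iteration (steps 1-3 above) follows; the `model.forward`/`model.backward` interface, the function names and the hyper-parameters are placeholders of ours, not the authors' Caffe implementation:

```python
def sgd_step(model, fp_weights, velocity, batch, lr, momentum=0.9, k=8):
    """One SATB-Nets iteration (sketch). Steps 1-2: quantize the stored
    full precision weights and run forward/backward with the quantized copies.
    Step 3: update the full precision weights with the resulting gradients,
    so the tiny per-iteration changes accumulate (as in BinaryConnect [11])."""
    model.set_weights(quantize_network(fp_weights, k))    # steps 1-2
    loss = model.forward(batch)
    grads = model.backward(loss)                          # grads w.r.t. quantized weights
    for name in fp_weights:                               # step 3: momentum SGD update
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        fp_weights[name] += velocity[name]
    return loss
```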

3 Experiments

In this section, we benchmark SATB-Nets against full precision weight networks (FPWNs), binary weight networks (BWNs) and ternary weight networks (TWNs) on a small-scale dataset (CIFAR-10) and a large-scale dataset (ImageNet). We adopt the VGG [6] networks on CIFAR-10 and AlexNet [3] on ImageNet. For a fair comparison, the following settings are identical: network architecture, learning rate scaling procedure (multi-step), optimization method (SGD with momentum) and regularization method (L2 weight decay). We conjecture that SATB-Nets have sufficient expressiveness for deep networks, and we adopt data augmentation and sparse weights, which act like dropout [15], to prevent over-fitting. All the neural networks are implemented in the Caffe framework [25]. Detailed configurations are given in Table 1.

Table 1. Networks and some hyper-parameters of them on datasets

3.1 VGGNets on CIFAR-10

CIFAR-10 is a benchmark image classification dataset consisting of 60K \(32 \times 32\) color images; five sixths of them form the training set and the rest form the test set. To prevent over-fitting while training the VGG [6] networks, data augmentation is used following [4]: each image is padded with 4 pixels on each side and a random \(32 \times 32\) crop is taken from the padded image. The cropped images are used for training and the original images for testing. We first adopt the VGG16 [6] architecture for the experiment. Besides, to ease the difficulty of training such a deep neural network, we initialize the networks with a fully-trained full precision model.
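The padding-and-crop augmentation described above can be sketched as follows (a NumPy illustration with our own function name):

```python
import numpy as np

def random_crop_with_padding(img, pad=4, size=32):
    """Pad a (H, W, C) image with `pad` zero pixels on each side and take a
    random `size` x `size` crop, as in the augmentation of [4]."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    return padded[y:y + size, x:x + size, :]
```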

Fig. 1. Validation accuracy curves of VGG16 on CIFAR-10

We compare SATB-Nets with the FPWNs, BWNs and TWNs. The results (Fig. 1 and Table 2) show that SATB-Nets built from VGG16 outperform BWNs, TWNs and FPWNs by 2.52%, 1.09% and 0.65% respectively. Moreover, SATB-Nets built from VGG10, VGG13 and VGG16 consistently outperform BWNs and TWNs.

To our surprise, the SATB-Nets constrained from VGG13 and VGG16 outperform the full precision weight networks. We conjecture that our SATB-Nets have adequate expressive capacity and that the sparse weights prevent over-fitting, much like dropout [15].

Table 2. Validation accuracy of VGGNet on CIFAR-10 (%)

For more thorough experimental verification, we extend the experiment to VGG13, obtained by removing the last 3 convolutional layers of VGG16 [6], and VGG10, obtained by removing the last 6 convolutional layers. The results, listed in Table 2, meet our expectations. In addition, Table 3 shows the compression ratio of VGG-16.

Table 3. Compression ratio for VGG-16 (Byte)

3.2 AlexNet on ImageNet

We further examine the performance of SATB-Nets on the ImageNet ILSVRC-2012 dataset, which has over 1.2M training examples and 50K validation examples. We use the AlexNet Caffe model [26] as the reference model. Besides, to ease the difficulty of training such a deep neural network, we initialize the networks with a fully-trained full precision model.

Fig. 2. Validation accuracy curves of AlexNet on ImageNet

The training curves are shown in Fig. 2. The complete results (Fig. 2 and Table 4) show that SATB-Nets reach a top-1 validation accuracy of 56.57%, only 0.15% below the full precision counterpart.

Table 4. Validation accuracy of AlexNet on ImageNet (%)

Tables 3 and 5 show the compression ratios of VGG-16 and AlexNet. SATB-Nets achieve up to \(29\times \) and \(31\times \) model compression respectively, close to binary weight compression, without impacting accuracy.

Table 5. Compression statistics for AlexNet (Byte)
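As a rough sanity check of why the overall ratio approaches the \(32\times \) of pure binarization, assume AlexNet has roughly 2.3M convolutional and 58.6M fully-connected parameters (approximate, commonly cited figures, not taken from Table 5) and ignore the small overhead of the per-segment scaling factors. Then

$$\begin{aligned} \frac{32 \times (2.3 + 58.6)\,\mathrm{M}}{2 \times 2.3\,\mathrm{M} + 1 \times 58.6\,\mathrm{M}} \approx \frac{1948.8}{63.2} \approx 31\times , \end{aligned}$$

which is consistent with the ratio reported above: the fully-connected layers dominate the parameter count, so the 1-bit FC weights pull the overall ratio close to \(32\times \).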

4 Conclusion

In this paper, we formulate the ternary and binary weight network optimization problems and propose SATB-Nets, which nearly reach the compression ratio of binary networks. Experiments on the benchmarks demonstrate the superior performance of the proposed method. As the next step, we will apply the method to more datasets and models to further explore the relationship between network capacity and the quantized values.