Introduction

Facial landmark localization is a fundamental component of many face-related vision tasks, such as head pose estimation [1, 2], facial emotion recognition [3], and face verification [4, 5]. Recent deep neural networks (DNNs) have significantly improved the performance of facial landmark localization under extreme conditions [6,7,8]. However, an excellent localization method should not only be robust to facial deformation, expression, and illumination but also be efficient in storage and computation. DNNs generally have a huge number of parameters, which leads to high storage and time costs. This is especially undesirable when detecting landmarks on mobile devices. To address this issue, model miniaturization and acceleration techniques are greatly needed.

In this paper, an effective framework is proposed to solve the above problems. It integrates importance-based pruning and product quantization [9] to compress the model, and merges all pooling layers into their neighboring convolutional layers to accelerate it. More specifically, a VGG-like [11] baseline model is first trained to locate 68 facial landmarks. This dense network has excellent performance but a large model size. The network is then retrained iteratively, compressing one layer at a time. When compressing each layer, unimportant neurons or connections are pruned based on weight correlations, and product quantization is then applied to the absolute values of the remaining weights. To accelerate the model further, all pooling layers are removed and the strides of their neighboring convolutional layers are increased before retraining. Finally, we achieve 26× compression and 4× acceleration while the root mean squared error (RMSE) increases by just 3.6%.

Pruning aims at removing redundancy and constructing a sparser network. It shares some similarity with Dropout [13], which avoids overfitting by randomly setting neural outputs to zero. Pruning is a common method in dense network compression [14,15,16] and is usually used to remove connections between neurons. Although this kind of pruning can reduce the model size, the sparse weight matrix must be reconstructed during testing, so the computation complexity and memory cost stay the same. Instead of pruning connections, we introduce a neuron-pruning method, which is equivalent to building a smaller network with fewer neurons, as shown in Fig. 1. Product quantization decomposes the original high-dimensional weights into several low-dimensional Cartesian product subspaces which are then quantized separately. Compared with the scalar quantization in DeepCompression [16], product quantization needs fewer cluster indexes and codes for the same quantization error, so a higher compression ratio can be obtained without noticeable accuracy loss. After pruning and product quantization, we remove all pooling layers and increase the strides of their neighboring convolutional layers, which saves much of the time spent on im2col-based [10] convolutional computation.

Fig. 1

An illustration of differences between pruning connections and neurons. Pruning a neuron will remove all related connections

On the other hand, there are not enough public datasets for facial landmark detection to serve as training data. Only about 4000 images with 68 labeled landmarks can be collected from several public datasets [27,28,29,30], which means care must be taken to avoid overfitting. Fortunately, pruning and quantization can both be regarded as regularizers, so overfitting becomes less of a concern and a larger learning rate can be used without a dropout layer.

The rest of the paper is organized as follows. The "Related Work" section briefly reviews the related work. The "Our Method" section describes the three parts of our method: neuron or connection pruning, product quantization, and network retraining. In the "Experiments" section, the baseline model is introduced and compressed to evaluate the effectiveness of our method. Finally, the conclusion is given in the "Conclusion" section.

Related Work

To reduce the storage and time costs of DNNs, an increasing number of works have begun to explore network miniaturization and acceleration techniques.

Some works achieve this goal by carefully designing small network architectures. GoogLeNet-V3 [17] not only used 1 × 1 convolution kernels to reduce the feature dimension but also replaced the n × n convolution with a 1 × n convolution followed by an n × 1 convolution. In contrast, Courbariaux et al. [18] trained a binarized neural network with binary weights and activations, which drastically reduced memory consumption. Denil et al. [19] represented the weight matrix as a product of two low-rank factors; during training, they fixed one factor and only updated the other. Similarly, Sainath et al. [21] utilized low-rank matrix factorization to reduce the parameters of fully connected layers. However, directly training a network with a factorized representation usually performs poorly. Recently, Scardapane et al. [20] designed a new loss function to perform feature selection, network training, and weight compression simultaneously, but their work is not suitable for convolutional networks.

In addition to training small models directly, compressing a large model into a small one is another popular choice. Denton et al. [22] first considered singular value decomposition (SVD) to compress parameters. Gong et al. [23] systematically explored quantization methods for compressing the densely connected layers, including binarization, scalar quantization, product quantization, and residual quantization. Their experiments showed that product quantization is clearly superior to the other methods. However, these methods have no network retraining scheme and inevitably cause performance degradation. In our method, product quantization is combined with a retraining procedure, so a high compression ratio can be achieved with negligible performance loss.

Han et al. [16] compressed the network by combining connection pruning, scalar quantization, and Huffman coding, which has recently become a popular approach. In contrast, we prune not only connections but also neurons, so storing the sparse structure costs less and the inference speed is significantly improved at the same time. We also replace its scalar quantization with product quantization on absolute values and introduce an iterative way to retrain the network layer by layer. In addition, Sun et al. [15] found that weight magnitude is not a good indicator of the importance of neural connections, so they pruned network connections based on weight correlations. We improve this method and bring it into our work.

Moreover, some researchers have turned their attention to hardware for model compression and acceleration. Han et al. [24] designed a specific hardware accelerator called EIE, which can run a compressed network with sparse, weight-sharing neurons directly. The IBM team [25] applied structured kernels to convolutional layers on their TrueNorth hardware architecture, achieving a good trade-off between energy efficiency and classification accuracy. However, these techniques have not yet reached a practical level.

Our Method

We first train a dense network as our baseline and then compress it with the following steps.

Pruning Neurons and Connections

With the pre-trained dense network, a pruning ratio R (R > 1) is used to control the number of neurons or connections that will be pruned; concretely, only 1/R of the neurons or connections are preserved. The key problem is to decide which neurons or connections to keep. DeepCompression [16] removes all connections whose weights are below a specified threshold. However, weight magnitude does not indicate the importance of neural connections well. A fairly straightforward alternative is to iteratively drop the neuron with the minimum prediction error:

$$ {\Delta} y = ||\hat {W}x-Wx||^{2} \; . $$
(1)

where x is the input, Δy is the output error, and W and \(\hat W\) are the original and pruned weight matrices, respectively. To build \(\hat W\), all matrix columns corresponding to the pruned neurons are set to zero. However, this greedy algorithm is inefficient, especially when there are many neurons. Inspired by Sun et al. [15], we instead measure the importance of a neuron by the sum of its connection correlations. This contains two parts:

  (1)

    For fully connected layers, which have no weight sharing, the correlation coefficient between input neuron \(x_i\) and output neuron \(y_j\) is computed as follows:

    $$ r_{ij} = \frac{E[(x_{i}-\mu_{x_{i}})(y_{j}-\mu_{y_{j}})]}{\sigma_{x_{i}} \sigma_{y_{j}}} \; . $$
    (2)

    where \(\mu _{x_{i}}\) and \(\sigma _{x_{i}}\) denote the mean and standard deviation of all weights related to input neuron \(x_i\), and \(\mu _{y_{j}}\) and \(\sigma _{y_{j}}\) denote the mean and standard deviation of all weights related to output neuron \(y_j\). The importance of output neuron \(y_j\) is then

    $$ I_{j} = \sum\limits_{i=1} | r_{ij} | \; . $$
    (3)

    We keep the most important 1/R of the output neurons and remove the rest; a minimal sketch of this selection is given after this list.

  (2)

    For convolutional layers, which share weights, it is impracticable to prune neurons, so we prune connections instead. The correlation coefficients are first computed with the method described by Sun et al. [15] to indicate the importance of the weights. Then the same number of weights with minimum importance is removed from each convolution kernel. This specific pruning operation relates to our strategy for convolutional computation: we use im2col [10] to perform fast convolution, which reduces the problem to matrix-matrix multiplication. An aligned weight matrix after pruning therefore helps the reconstruction of the sparse matrix and makes it easy to apply product quantization further, as shown in Fig. 2 (a sketch of this row-aligned pruning follows the figure caption).
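As a concrete illustration of item (1), the following minimal NumPy sketch computes the importance scores of Eqs. (2)–(3) under one plausible reading of the text, in which the statistics are taken over the weights attached to each neuron; the function names and this interpretation are ours, not the paper's exact implementation.

import numpy as np

def neuron_importance(W):
    # W has shape (n_in, n_out); column j holds the weights of output neuron y_j.
    mu_x = W.mean(axis=1, keepdims=True)            # mean of weights related to x_i
    sd_x = W.std(axis=1, keepdims=True) + 1e-12
    mu_y = W.mean(axis=0, keepdims=True)            # mean of weights related to y_j
    sd_y = W.std(axis=0, keepdims=True) + 1e-12
    r = (W - mu_x) * (W - mu_y) / (sd_x * sd_y)     # correlation-like score r_ij
    return np.abs(r).sum(axis=0)                    # I_j = sum_i |r_ij|

def prune_output_neurons(W, R):
    # Keep the most important 1/R of the output neurons (columns), drop the rest.
    importance = neuron_importance(W)
    n_keep = max(1, W.shape[1] // R)
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return W[:, keep], keep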

Fig. 2

a Convolutional kernels. b Weight matrix from reshaping and merging all kernels. c Each row has the same number of weights left after pruning, where white squares mean the pruned connections (or weights). d A new weight matrix is established from remaining weights, then product quantization can be applied to it directly
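The row-aligned connection pruning of Fig. 2 can be sketched as follows; it assumes the kernels have already been reshaped into a 2-D weight matrix and that a per-weight importance score (e.g., from the correlations above) is available, so the function is an illustration of the idea rather than the exact implementation.

import numpy as np

def prune_rows_aligned(W, score, drop_per_row):
    # Remove the same number of lowest-scoring weights from every row, then
    # repack the survivors into a dense matrix (Fig. 2d) so that product
    # quantization can be applied to it directly.
    keep_idx = np.argsort(score, axis=1)[:, drop_per_row:]   # surviving columns per row
    keep_idx.sort(axis=1)                                    # keep the original column order
    packed = np.take_along_axis(W, keep_idx, axis=1)         # shape (rows, cols - drop_per_row)
    return packed, keep_idx                                  # keep_idx maps back to the sparse matrix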

Product Quantization

The network performance is sensitive to the pruning ratio. Although a larger pruning ratio yields a higher compression ratio, performance is severely degraded, as shown in Fig. 7. Therefore, product quantization [9] is applied to compress the network further. Product quantization is a popular vector quantization method: by decomposing the original high-dimensional space into several low-dimensional subspaces and quantizing each separately, the data distribution can be described well with fewer centroid codes.

Given a pruned weight matrix \(\hat W \in R^{m\times n}\), the sign of each weight is first recorded and \(\hat W\) is replaced by its element-wise absolute value. Then \(\hat W\) is split column-wise into S submatrices:

$$ \hat W = [\hat W^{1},\hat W^{2},\ldots, \hat W^{S}] \; . $$
(4)

For each submatrix \(\hat W^{i} \in R^{m\times D}\), \((D=n/S, i=1,{\dots}, S)\), k-means clustering is performed to generate a sub-codebook \(C_i\) with k codes. The whole code space is therefore defined as their Cartesian product:

$$\begin{array}{@{}rcl@{}} C &=& C_{1} \times C_{2} \times {\ldots} \times C_{S} \; .\\ C_{i} &=& [{C_{i}^{1}}, {C_{i}^{2}}, \ldots, {C_{i}^{k}}] . \end{array} $$
(5)

Each row \(\hat W_{r}\) of \(\hat W\) can then be reconstructed from the closest code vectors:

$$\begin{array}{@{}rcl@{}} \hat{W}_{r} &=& [\hat {W}_{r}^{1},\hat {W}_{r}^{2},\ldots, \hat {W}_{r}^{S}] \; .\\ \hat{W}_{r}^{i} &\leftarrow& {C_{i}^{j}}, \quad j = \underset{j}{\arg\min}\, \| \hat{W}_{r}^{i} - {C_{i}^{j}}\| \; . \end{array} $$
(6)

Supposing that all submatrices have the same cluster number k, we can use only \(kS\) sub-codes to represent a large codebook with \(k^{S}\) codes. This is why product quantization consumes less memory than scalar quantization for the same quantization error.

We further find that the distribution of the weight values is symmetric about zero, so we apply product quantization to the absolute values. This reduces the quantization error for the same number of quantization centroids, and the only extra expense is one bit per weight to record its sign.
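The quantization step of Eqs. (4)–(6), including the absolute-value trick, can be sketched with NumPy and scikit-learn's KMeans as below; the reliance on scikit-learn, the assumption that n is divisible by S, and the use of 8-bit indexes (valid only for k ≤ 256) are ours.

import numpy as np
from sklearn.cluster import KMeans

def pq_encode_abs(W_hat, S, k):
    # Record signs, then product-quantize |W_hat| with S column-wise submatrices
    # and k centroids per submatrix.
    signs = np.sign(W_hat).astype(np.int8)               # one extra bit per weight in practice
    A = np.abs(W_hat)
    m, n = A.shape
    D = n // S                                           # subspace dimensionality
    codebooks, indexes = [], []
    for s in range(S):
        sub = A[:, s * D:(s + 1) * D]                    # submatrix of shape (m, D)
        km = KMeans(n_clusters=k, n_init=10).fit(sub)
        codebooks.append(km.cluster_centers_)            # sub-codebook C_s, shape (k, D)
        indexes.append(km.labels_.astype(np.uint8))      # one code index per row
    return signs, codebooks, np.stack(indexes, axis=1)   # indexes has shape (m, S)

def pq_decode(signs, codebooks, indexes):
    # Reconstruct each row from its closest codes and restore the signs, as in Eq. (6).
    blocks = [codebooks[s][indexes[:, s]] for s in range(len(codebooks))]
    return signs * np.hstack(blocks)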

Network Retraining

The network is compressed layer by layer from back to front. In each iteration, the correlation coefficients are first calculated from the previously trained model, and then one additional layer is pruned and quantized. To retrain the network, we use the deep learning framework Caffe [26] and modify its convolutional and fully connected layers by adding two extra blobs to store indexes and codes. Before each forward pass, the weight matrix is reconstructed from the indexes and codes. In particular, an index of zero means the corresponding connection has been pruned; otherwise, the connection is recovered by looking up the code tables and restoring its sign, as shown in Fig. 3 (a simplified sketch of this lookup follows the figure caption). During back-propagation, the centroid codes are updated using the method described in DeepCompression [16]. Layers become very sparse after pruning and quantization, so dropout layers are no longer needed to restrict overfitting.

Fig. 3

Reconstructing weights from indexes and codes. Only if the index is nonzero is the corresponding weight recovered by looking up the code tables and applying its positive or negative sign
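A much-simplified sketch of the lookup in Fig. 3 is given below; it treats the stored index as a per-weight scalar index purely for illustration, whereas the actual Caffe blobs hold product-quantization indexes and codes, so the layout here is an assumption.

import numpy as np

def reconstruct_weights(indexes, codes, signs):
    # Index 0 marks a pruned connection; any nonzero index selects a centroid
    # from the code table, and the stored sign restores the weight's polarity.
    W = np.zeros(indexes.shape, dtype=np.float32)
    kept = indexes > 0
    W[kept] = signs[kept] * codes[indexes[kept] - 1]    # code table is 0-based, so index 1 -> codes[0]
    return W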

On the other hand, we measured the parameter size and computation time of all convolutional layers in our baseline model and found that they take up more than 44% of the operation time while holding less than 7% of the parameters. This is because im2col [10] spends much time rearranging all convolution patches into a large dense matrix. To optimize this process, we abandon all max-pooling layers and increase the strides of their neighboring convolutional layers; for example, after the first max-pooling layer is removed, the stride of the first convolutional layer increases from 2 to 6. Increasing the stride reduces both the number of patches gathered and the size of the dense matrix, so the convolution cost can be reduced by about two-thirds. A hedged sketch of this layer-merging step is given below.
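The following sketch folds each pooling layer's stride into the preceding convolution; the list-of-dicts representation is a stand-in for the real network definition, and the multiplicative stride rule is our assumption, acceptable here only because the network is retrained afterwards rather than required to be exactly equivalent.

def merge_pool_into_conv(layers):
    # layers: hypothetical list of dicts such as {"type": "conv", "stride": 2}
    # or {"type": "pool", "stride": 3}; pooling layers are dropped and their
    # strides folded into the preceding convolution (e.g., 2 and 3 become 6).
    merged = []
    for layer in layers:
        if layer["type"] == "pool" and merged and merged[-1]["type"] == "conv":
            merged[-1]["stride"] *= layer["stride"]
        else:
            merged.append(dict(layer))
    return merged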

Experiments

In this section, the baseline model is introduced and compared with SDM [12] and TCDCN [31]. Then the model is compressed, and the performance before and after compression is compared. Finally, the impact of some parameters on the trade-off between accuracy and compression ratio is discussed.

Baseline Model

There are 4025 images with 68 labeled landmarks collected from LFPW [27], AFW [28], HELEN [29], and 300W [30]. Of these, 425 images are randomly selected for testing and the rest are used for training. The training images are augmented through mirror transformation and geometric transformations such as shifting and rotating. The baseline model is simplified from VGG11-Net [11] by removing three convolutional layers. The number of outputs in the last fully connected layer is changed to 136 for facial landmark regression, and the original softmax loss layer is replaced with a modified Euclidean loss layer defined as follows:

$$ Loss = \sum\limits_{i=1}^{136}h_{i}(y_{i}-\hat y_{i})^{2} . $$
(7)
$$ h_{i} = \left\{ \begin{array}{rcl} 0 &\quad& {|y_{i}-\hat y_{i}|<\alpha} \\ 1 &\quad& {\text{otherwise}} \end{array} \right. ~. $$
(8)

where \(y_i\) is the ground truth and \(\hat y_{i}\) is the network output. There are 136 outputs in total for the 68 facial point coordinates, and all coordinate values are normalized to [−1, +1]. The parameter α is an adjustable tolerance: prediction errors smaller than it are ignored, mainly because labeling facial points is affected by subjective factors and the ground truth may not be entirely accurate. Besides, this loss helps to mine hard samples online and strengthens the network.
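The loss of Eqs. (7)–(8) can be written compactly as below; the default tolerance value is a placeholder, not the α used in the paper.

import numpy as np

def tolerant_euclidean_loss(y_true, y_pred, alpha=0.02):
    # Squared errors below the tolerance alpha are masked out (h_i = 0), so small
    # labeling noise in the 136 normalized coordinates does not drive the training.
    diff = y_pred - y_true                              # shape (136,) or (batch, 136)
    h = (np.abs(diff) >= alpha).astype(diff.dtype)
    return np.sum(h * diff ** 2, axis=-1)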

The architecture of our baseline model is shown in Table 1. There are five convolutional layers and three fully connected layers. The entire network model exceeds 379 MB and has about 99.4 M float parameters.

Table 1 The structure of our baseline model

In order to measure and compare the performance, two metrics are introduced:

  (1)

    Root mean squared error (RMSE):

    $$ RMSE = \frac{{\sum}_{i=1}^{68}\sqrt{(p_{i}-\hat p_{i})^{2}}}{68d}\; . $$
    (9)

    where d is the distance between the eyes, \(p_i\) is the ground truth position of the i-th facial landmark, and \(\hat p_{i}\) is the predicted position; a sketch of this computation is given after the list.

  (2)

    Cumulative error distribution (CED) curve: The CED curve plots the error tolerance on the X-axis versus the percentage of samples within the tolerance on the Y-axis.
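A minimal implementation of the RMSE of Eq. (9) is sketched below; the eye-landmark indices used to compute the inter-ocular distance d follow the common 68-point markup and are an assumption, not a detail stated in the paper.

import numpy as np

def rmse(pred, gt):
    # pred and gt have shape (68, 2): predicted and ground-truth landmark positions.
    d = np.linalg.norm(gt[36:42].mean(axis=0) - gt[42:48].mean(axis=0))  # assumed eye landmarks
    errors = np.linalg.norm(pred - gt, axis=1)                            # per-landmark distances
    return errors.sum() / (68.0 * d)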

The mean RMSE of our baseline model is 0.0526, while the mean RMSEs of SDM [12] and TCDCN [31] are 0.0837 and 0.0680, respectively. The CED curves are shown in Fig. 4, where a larger area under the curve (AUC) corresponds to a higher regression accuracy. It is clear that our baseline model performs significantly better than SDM and TCDCN.

Fig. 4

Performance of the compressed network. The “Compress all” represents the compressed model with the configuration in Table 2 and the “Compress fc8” represents the same model with fc8 layer compressed alone

Compressed Network

The baseline network is compressed iteratively with the method described in the "Our Method" section. The dropout layers are removed before retraining the network, since they are unnecessary and slow down retraining. The compression configurations are shown in Table 2.

Table 2 Configurations for compressed network

Connections and neurons are pruned in the convolutional layers and the fully connected layers, respectively. For example, the dense weight matrix in the fc6 layer becomes 18432 × 2014 after pruning neurons in the fc7 layer, and it becomes four times sparser after pruning connections in the fc6 layer itself. Since the parameters in the fully connected layers are mostly redundant, a larger compression ratio is used for them; the convolutional layers contribute a smaller part of the whole network and are harder to reduce, so a smaller (or zero) compression ratio is applied to them. By making the necessary trade-offs between performance and compression ratio, we achieve 26× compression of the baseline model while the mean RMSE increases by just 3.6%. More specifically, the compressed model has about 1.6 M float parameters and 3 M 8-bit quantization indexes. Including the additional memory for storing sparse indexes, the model is finally compressed from 379 to 14.6 MB. The comparisons between the baseline model and the compressed model are shown in Table 3. It is worth noting that the black line in Fig. 4 represents the same model with only the fc8 layer compressed, whose performance is surprisingly better than the baseline. This is mainly because an appropriate amount of sparsity can improve the generalization ability of DNN models.

Table 3 Comparisons between the baseline model and the compressed model

In retraining, all three max-pooling layers are removed and the strides of their neighboring convolutional layers are increased accordingly, which further improves the speed of the compressed network.

Some detection examples in Fig. 5 verify the robustness of the compressed model to deformation, expression, and illumination. Moreover, the RMSE of the compressed model is just 3.6% higher than that of the baseline model; for most testing images, the difference between the two models is hard to see with the naked eye. Several samples with obvious differences are selected to show the slight degradation in performance in Fig. 6.

Fig. 5

Some example detection results from our compressed network

Fig. 6

Some samples with obvious degradation in performance

Discussion About Different Configurations

Quantization can only reduce the parameter size, while pruning can both reduce the parameter size and improve inference speed. It therefore seems that pruning should be used heavily to get both a high compression ratio and a speed improvement. However, for facial landmark detection, we find that a higher pruning ratio leads to significant performance degradation. Figure 7 compares different pruning ratios on the fully connected layers, with the pruning method described in [15]. The results indicate that the method of [15] alone is not suitable for our facial landmark detection and that pruning too many connections severely degrades the performance.

Fig. 7

Comparison of different pruning ratios on fully connected layers. The pruning method is described in [16]. “Prune X” represents all fully connected layers are pruned with the same pruning ratio X

In this case, neurons and connections must be pruned carefully, and product quantization is used to compress more of the parameters. The performance of product quantization is positively correlated with the cluster number K in each subspace and negatively correlated with the dimensionality D of the subspaces. A higher compression ratio obviously requires a small K and a large D, but the performance suffers as a result. Fortunately, applying product quantization to the absolute values improves this situation, at the sole extra expense of one bit per value to record its sign. Figure 8 compares product quantization on the absolute values and on the original values of the fc7 layer. It demonstrates that our method can use fewer quantization centroids with less error and thus compress the network further. A small sketch of this comparison is given below.
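The comparison behind Fig. 8 can be reproduced in spirit with the following sketch, which measures the reconstruction error of product quantization on the original weights versus their absolute values (with signs restored); the synthetic weight matrix and the chosen S and k are illustrative, not the paper's fc7 configuration.

import numpy as np
from sklearn.cluster import KMeans

def pq_error(W, S, k, use_abs):
    # Mean squared reconstruction error of product quantization on W, optionally
    # applied to |W| with the signs restored afterwards.
    signs = np.sign(W)
    A = np.abs(W) if use_abs else W
    m, n = W.shape
    D = n // S
    recon = np.empty_like(A)
    for s in range(S):
        sub = A[:, s * D:(s + 1) * D]
        km = KMeans(n_clusters=k, n_init=10).fit(sub)
        recon[:, s * D:(s + 1) * D] = km.cluster_centers_[km.labels_]
    if use_abs:
        recon = signs * recon
    return float(np.mean((W - recon) ** 2))

# Roughly zero-symmetric weights, as observed for the trained layers.
W = (np.random.randn(512, 256) * 0.05).astype(np.float32)
print(pq_error(W, S=32, k=16, use_abs=False), pq_error(W, S=32, k=16, use_abs=True))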

Fig. 8

Comparison of product quantization on absolute values or original values with different dimensionality D of subspace. Product quantization can use fewer quantization centroids with less error on absolute values

Conclusion

In this paper, an efficient method is proposed to compress a large DNN model and reduce its runtime cost. First, unimportant neurons or connections are pruned based on weight correlations, which reduces both index storage and inference computation. Product quantization is then applied to achieve higher compression; in particular, the quantization is applied to the absolute values of the remaining weights, so the same number of quantization centroids yields a smaller quantization error. Additionally, all pooling layers are removed and the strides of their neighboring convolutional layers are increased during retraining, which accelerates the network further. The experiments on compressing a large dense model for facial landmark detection demonstrate the effectiveness of the proposed method, which achieves 26× compression and 4× acceleration while the mean RMSE increases by just 3.6%. In the future, we will apply our compression and acceleration technique to more existing algorithms for different tasks to further verify its effectiveness and robustness.