Abstract
We present a new PyTorch-based framework for neural network compression with fine-tuning named Neural Network Compression Framework (NNCF) (https://github.com/openvinotoolkit/nncf). It leverages recent advances in network compression methods and implements several of them, namely quantization, sparsity, filter pruning and binarization. These methods produce more hardware-friendly models that can run efficiently on general-purpose hardware computation units (CPU, GPU) or specialized deep learning accelerators. We show that the implemented methods and their combinations can be successfully applied to a wide range of architectures and tasks to accelerate inference while preserving the original model's accuracy. The framework can be used in conjunction with the supplied training samples or as a standalone package that can be seamlessly integrated into existing training code with minimal adaptations.
Appendix
Described below are the steps required to modify an existing PyTorch training pipeline so that it can be integrated with NNCF. The described use case assumes that a PyTorch pipeline reproducing model training in floating-point precision and a pre-trained model snapshot are already available. The objective of NNCF is to simulate, during training, the model compression that will be applied at inference time, so that the trainable parameters can adjust to the compressed inference conditions, and then to export the compressed version of the model to a format suitable for compressed inference. Once the NNCF package is installed, the user needs to introduce only minor changes to the training code to enable model compression. The steps needed to modify the PyTorch training pipeline are the following:
- Add the following imports at the beginning of the training sample, right after importing PyTorch:
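A minimal sketch of what these imports might look like; the exact module paths and symbol names (NNCFConfig, create_compressed_model, register_default_init_args) are assumptions and depend on the installed NNCF version:

```python
import torch

# NNCF imports (names assumed; check the API of the installed NNCF version)
import nncf
from nncf import NNCFConfig, create_compressed_model
from nncf.initialization import register_default_init_args
```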
- Once a model instance is created and the pre-trained weights are loaded, the model can be compressed using the helper methods, as in the sketch below. Some compression algorithms (e.g. quantization) additionally require initialization arguments (e.g. the train_loader for your training dataset) to be supplied at this stage in order to properly initialize the parameters of the compression modules (e.g. scale values for FakeQuantize layers). Here resnet50_int8.json is a JSON-formatted file containing all the options and hyperparameters of the compression methods (the format of the options is imposed by NNCF).
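A hedged sketch of this step, assuming the helper names NNCFConfig, register_default_init_args and create_compressed_model from common NNCF usage; model and train_loader come from the existing pipeline:

```python
# Load the JSON file with compression options and hyperparameters
nncf_config = NNCFConfig.from_json("resnet50_int8.json")

# Some algorithms (e.g. quantization) need a data loader to initialize
# compression parameters such as scale values for FakeQuantize layers;
# register_default_init_args is an assumed initialization helper
nncf_config = register_default_init_args(nncf_config, train_loader)

# Wrap the pre-trained model: returns a compression controller (comp_ctrl)
# and the model augmented with compression operations
comp_ctrl, compressed_model = create_compressed_model(model, nncf_config)
```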
- At this stage the model can optionally be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. If distributed training is used, call the compression controller's distributed() method after wrapping the model with DistributedDataParallel to signal the compression algorithms that distributed-specific internal handling of compression parameters is required, as sketched below.
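A sketch of the optional multi-GPU step; a standard PyTorch distributed setup (including local_rank) is assumed to be in place, and comp_ctrl is the compression controller created above:

```python
# Optional: wrap the compressed model for distributed multi-GPU training
compressed_model = torch.nn.parallel.DistributedDataParallel(
    compressed_model, device_ids=[local_rank]
)

# Signal the compression algorithms that distributed training is used
comp_ctrl.distributed()
```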
- The model can now be trained as a usual torch.nn.Module to fine-tune the compression parameters along with the model weights. To make full use of NNCF functionality, introduce the following changes to the training loop code (see the sketch after this item): 1) after model inference is done on the current training iteration, add the compression loss to the main task loss, such as the cross-entropy loss; 2) make the compression algorithm schedulers aware of batch and epoch steps by calling comp_ctrl.scheduler.step() after each training batch iteration and comp_ctrl.scheduler.epoch_step() after each training epoch.
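A sketch of a training loop with these NNCF-specific additions; optimizer, num_epochs and train_loader are placeholders from a typical existing PyTorch pipeline, and comp_ctrl.loss() is assumed to return the compression loss term:

```python
import torch.nn.functional as F

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = compressed_model(inputs)

        # 1) add the compression loss to the main task loss
        loss = F.cross_entropy(outputs, targets) + comp_ctrl.loss()
        loss.backward()
        optimizer.step()

        # 2) notify the compression scheduler after each batch iteration
        comp_ctrl.scheduler.step()

    # 2) notify the compression scheduler after each epoch
    comp_ctrl.scheduler.epoch_step()
```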
- When fine-tuning is done, export the model to ONNX by calling the compression controller's dedicated method, or save it in PyTorch's .pth format using the regular torch.save functionality:
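A sketch of the export step; export_model is assumed to be the controller's ONNX export method referred to above, and the file names are illustrative:

```python
# Export the compressed model to ONNX via the compression controller
comp_ctrl.export_model("compressed_model.onnx")

# Or save a regular PyTorch checkpoint
torch.save(compressed_model.state_dict(), "compressed_model.pth")
```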