
Neural Network Compression Framework for Fast Model Inference

  • Conference paper
Intelligent Computing

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 285)

Abstract

We present a new PyTorch-based framework for neural network compression with fine-tuning, named Neural Network Compression Framework (NNCF) (https://github.com/openvinotoolkit/nncf). It leverages recent advances in network compression methods and implements several of them, namely quantization, sparsity, filter pruning and binarization. These methods allow producing more hardware-friendly models that can be run efficiently on general-purpose hardware computation units (CPU, GPU) or on specialized deep learning accelerators. We show that the implemented methods and their combinations can be successfully applied to a wide range of architectures and tasks to accelerate inference while preserving the original model's accuracy. The framework can be used in conjunction with the supplied training samples or as a standalone package that can be seamlessly integrated into existing training code with minimal adaptations.



Author information

Correspondence to Alexander Kozlov.

Appendix

Described below are the steps required to modify an existing PyTorch training pipeline so that it can be integrated with NNCF. The described use case assumes that a PyTorch pipeline reproducing model training in floating-point precision and a pre-trained model snapshot are both available. The objective of NNCF is to simulate model compression at inference time, so that the trainable parameters can adjust to the compressed inference conditions, and then to export the compressed version of the model to a format suitable for compressed inference. Once the NNCF package is installed, the following minor changes to the PyTorch training code are enough to enable model compression:

  • Add the following imports at the beginning of the training sample, right after importing PyTorch:

    [Listing b: NNCF imports; not reproduced in this preview]
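    A minimal sketch of what this listing typically contains is given below. It assumes a recent NNCF release in which the PyTorch-specific helpers live under nncf.torch; older releases exported the same helpers directly from the top-level nncf package:

        import torch  # NNCF must be imported after PyTorch

        from nncf import NNCFConfig
        from nncf.torch import create_compressed_model, register_default_init_args
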
  • Once a model instance is created and the pre-trained weights are loaded, the model can be compressed using the helper methods. Some compression algorithms (e.g. quantization) require arguments (e.g. the train_loader for your training dataset) to be supplied to the initialize() method at this stage as well, in order to properly initialize the compression module parameters (e.g. scale values for FakeQuantize layers):

    [Listing c: wrapping the model with NNCF compression helpers; not reproduced in this preview]

    where resnet50_int8.json in this case is a JSON-formatted file containing all the options and hyperparameters of the compression methods (the format of these options is defined by NNCF).
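
    A minimal sketch of this step, under the same assumptions as above (the file name resnet50_int8.json and the names train_loader and comp_ctrl follow the surrounding text; exact helper signatures may differ between NNCF versions):

        # Load the compression configuration shipped with the sample
        nncf_config = NNCFConfig.from_json("resnet50_int8.json")

        # Register the training data loader so that data-dependent parameters
        # (e.g. FakeQuantize scales) can be initialized on real samples
        nncf_config = register_default_init_args(nncf_config, train_loader)

        # Insert compression operations into the model and obtain a controller
        # object (referred to as comp_ctrl / compression_algo in the text)
        comp_ctrl, model = create_compressed_model(model, nncf_config)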

  • At this stage the model can optionally be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. If distributed training is used, call the compression_algo.distributed() method after wrapping the model with DistributedDataParallel to signal to the compression algorithms that special distributed-specific internal handling of compression parameters is required, as sketched below.
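
    A minimal sketch of this optional step, assuming single-node multi-GPU training with torch.nn.parallel.DistributedDataParallel (local_rank is the pipeline's existing device index; the controller is the comp_ctrl object created above):

        from torch.nn.parallel import DistributedDataParallel

        # Wrap the compressed model for distributed training as usual
        model = DistributedDataParallel(model, device_ids=[local_rank])

        # Notify the compression algorithms that training is distributed
        comp_ctrl.distributed()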

  • The model can now be trained as a regular torch.nn.Module to fine-tune the compression parameters along with the model weights. To make full use of NNCF functionality, introduce the following changes to the training loop code: 1) after model inference is done on the current training iteration, the compression loss should be added to the main task loss, such as the cross-entropy loss:

    [Listing d: adding the compression loss to the task loss; see the training-loop sketch below]

    2) the compression algorithm schedulers should be made aware of the batch/epoch steps, so call comp_ctrl.scheduler.step() after each training batch iteration and comp_ctrl.scheduler.epoch_step() after each training epoch iteration. A sketch of the resulting training loop follows this item.
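
    A minimal sketch of the modified training loop, assuming a standard classification objective; criterion, optimizer, train_loader and num_epochs are the pipeline's existing objects, and the loop structure is illustrative:

        for epoch in range(num_epochs):
            for images, targets in train_loader:
                optimizer.zero_grad()
                outputs = model(images)

                # 1) add the compression loss to the main task loss
                loss = criterion(outputs, targets) + comp_ctrl.loss()
                loss.backward()
                optimizer.step()

                # 2) advance the compression schedulers after each batch ...
                comp_ctrl.scheduler.step()

            # ... and after each epoch
            comp_ctrl.scheduler.epoch_step()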

  • When fine-tuning is done, export the model to ONNX by calling the compression controller's dedicated method, or to PyTorch's .pth format by using the regular torch.save functionality:

    [Listing e: exporting the fine-tuned model; not reproduced in this preview]
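
    A minimal sketch of the export step, assuming the controller's export_model method and a regular state-dict checkpoint (the file names are illustrative):

        # Export an ONNX model with the simulated compression operations included
        comp_ctrl.export_model("resnet50_int8.onnx")

        # Alternatively, save a regular PyTorch checkpoint
        torch.save(model.state_dict(), "resnet50_int8.pth")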


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Kozlov, A., Lazarevich, I., Shamporov, V., Lyalyushkin, N., Gorbachev, Y. (2021). Neural Network Compression Framework for Fast Model Inference. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 285. Springer, Cham. https://doi.org/10.1007/978-3-030-80129-8_17
