
Neural Network Compression Framework for Fast Model Inference

  • Conference paper
Intelligent Computing

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 285)

Abstract

We present a new PyTorch-based framework for neural network compression with fine-tuning, named Neural Network Compression Framework (NNCF) (https://github.com/openvinotoolkit/nncf). It leverages recent advances in network compression methods and implements several of them, namely quantization, sparsity, filter pruning and binarization. These methods allow producing more hardware-friendly models that can be run efficiently on general-purpose hardware computation units (CPU, GPU) or on specialized deep learning accelerators. We show that the implemented methods and their combinations can be successfully applied to a wide range of architectures and tasks to accelerate inference while preserving the original model's accuracy. The framework can be used in conjunction with the supplied training samples or as a standalone package that can be seamlessly integrated into existing training code with minimal adaptations.



Author information

Correspondence to Alexander Kozlov.

Appendix

Described below are the steps required to modify an existing PyTorch training pipeline so that it can be integrated with NNCF. The described use case assumes that a PyTorch pipeline reproducing model training in floating-point precision and a pre-trained model snapshot are both available. The objective of NNCF is to simulate model compression at inference time, so that the trainable parameters can adjust to the compressed inference conditions, and then to export the compressed version of the model to a format suitable for compressed inference. Once the NNCF package is installed, the following minor changes to the PyTorch training code are enough to enable model compression:

  • Add the following imports at the beginning of the training sample, right after importing PyTorch:

    [Listing b: NNCF imports; not reproduced in this preview]
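    A minimal sketch of what this listing typically contains is given below. It assumes a recent NNCF release in which the PyTorch-specific helpers live under nncf.torch; older releases exported the same helpers directly from the top-level nncf package:

        import torch  # NNCF must be imported after PyTorch

        from nncf import NNCFConfig
        from nncf.torch import create_compressed_model, register_default_init_args
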
  • Once a model instance is created and the pre-trained weights are loaded, the model can be compressed using the helper methods. Some compression algorithms (e.g. quantization) require arguments (e.g. the train_loader for your training dataset) to be supplied to the initialize() method at this stage as well, in order to properly initialize the compression module parameters (e.g. scale values for FakeQuantize layers):

    [Listing c: wrapping the model with NNCF compression helpers; not reproduced in this preview]

    where resnet50_int8.json in this case is a JSON-formatted file containing all the options and hyperparameters of the compression methods (the format of these options is defined by NNCF).
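
    A minimal sketch of this step, under the same assumptions as above (the file name resnet50_int8.json and the names train_loader and comp_ctrl follow the surrounding text; exact helper signatures may differ between NNCF versions):

        # Load the compression configuration shipped with the sample
        nncf_config = NNCFConfig.from_json("resnet50_int8.json")

        # Register the training data loader so that data-dependent parameters
        # (e.g. FakeQuantize scales) can be initialized on real samples
        nncf_config = register_default_init_args(nncf_config, train_loader)

        # Insert compression operations into the model and obtain a controller
        # object (referred to as comp_ctrl / compression_algo in the text)
        comp_ctrl, model = create_compressed_model(model, nncf_config)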

  • At this stage the model can optionally be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. If distributed training is used, call the compression_algo.distributed() method after wrapping the model with DistributedDataParallel to signal to the compression algorithms that special distributed-specific internal handling of compression parameters is required, as sketched below.
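
    A minimal sketch of this optional step, assuming single-node multi-GPU training with torch.nn.parallel.DistributedDataParallel (local_rank is the pipeline's existing device index; the controller is the comp_ctrl object created above):

        from torch.nn.parallel import DistributedDataParallel

        # Wrap the compressed model for distributed training as usual
        model = DistributedDataParallel(model, device_ids=[local_rank])

        # Notify the compression algorithms that training is distributed
        comp_ctrl.distributed()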

  • The model can now be trained as a regular torch.nn.Module to fine-tune the compression parameters along with the model weights. To make full use of NNCF functionality, introduce the following changes to the training loop code: 1) after model inference is done on the current training iteration, the compression loss should be added to the main task loss, such as the cross-entropy loss:

    [Listing d: adding the compression loss to the task loss; see the training-loop sketch below]

    2) the compression algorithm schedulers should be made aware of the batch/epoch steps, so call comp_ctrl.scheduler.step() after each training batch iteration and comp_ctrl.scheduler.epoch_step() after each training epoch iteration. A sketch of the resulting training loop follows this item.
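
    A minimal sketch of the modified training loop, assuming a standard classification objective; criterion, optimizer, train_loader and num_epochs are the pipeline's existing objects, and the loop structure is illustrative:

        for epoch in range(num_epochs):
            for images, targets in train_loader:
                optimizer.zero_grad()
                outputs = model(images)

                # 1) add the compression loss to the main task loss
                loss = criterion(outputs, targets) + comp_ctrl.loss()
                loss.backward()
                optimizer.step()

                # 2) advance the compression schedulers after each batch ...
                comp_ctrl.scheduler.step()

            # ... and after each epoch
            comp_ctrl.scheduler.epoch_step()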

  • When fine-tuning is done, export the model to ONNX by calling the compression controller's dedicated method, or to PyTorch's .pth format by using the regular torch.save functionality:

    [Listing e: exporting the fine-tuned model; not reproduced in this preview]
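
    A minimal sketch of the export step, assuming the controller's export_model method and a regular state-dict checkpoint (the file names are illustrative):

        # Export an ONNX model with the simulated compression operations included
        comp_ctrl.export_model("resnet50_int8.onnx")

        # Alternatively, save a regular PyTorch checkpoint
        torch.save(model.state_dict(), "resnet50_int8.pth")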


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Kozlov, A., Lazarevich, I., Shamporov, V., Lyalyushkin, N., Gorbachev, Y. (2021). Neural Network Compression Framework for Fast Model Inference. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 285. Springer, Cham. https://doi.org/10.1007/978-3-030-80129-8_17
