
Empirical Study of Image Captioning Models Using Various Deep Learning Encoders

Conference paper in: Machine Learning and Computational Intelligence Techniques for Data Engineering (MISP 2022)

Abstract

Image captioning is the generation of a caption for a given image. It is a growing and challenging research topic in the field of computer vision, and many approaches have been proposed. Initially, template-based methods were used, in which a fixed template was filled in with detected image objects and their attributes. Retrieval-based approaches were also used, in which images similar to the query image were retrieved and a caption was composed from the captions of those images. Both approaches suffer from the limitation of missing important objects. The more recent approach to image captioning uses an encoder-decoder architecture, and such methods have produced remarkable results. However, it is not easy to isolate the impact of the encoder alone on the captioning task. In this paper, we compare the performance of image captioning models built with various image encoders, namely the Visual Geometry Group networks (VGG16 and VGG19), Residual Networks (ResNet), and InceptionV3, each paired with a Gated Recurrent Unit (GRU) decoder for text generation. The results are compared on the basis of the Bilingual Evaluation Understudy (BLEU) score on the Flickr8K dataset. ResNet provides a better BLEU score than VGG16, VGG19, and InceptionV3 when implemented on Flickr8K.
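The decoder shared by all the compared models is a GRU. As a minimal sketch of a single GRU step in numpy (the weight names `Wz`, `Uz`, etc. and the function `gru_step` are our own illustrative choices, not identifiers from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: x is the current input embedding, h the previous
    hidden state, p a dict of weights/biases for the update (z),
    reset (r), and candidate gates."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
    return (1 - z) * h + z * h_cand

# In an encoder-decoder captioner, h would be initialized from the CNN
# encoder's image features (e.g. ResNet's pooled output projected to the
# hidden size), and gru_step would be applied once per generated word.
```

Swapping VGG16, VGG19, ResNet, or InceptionV3 only changes how the initial image feature vector is produced; the decoder step above stays the same, which is what makes the encoder comparison controlled.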
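The models are scored with BLEU on Flickr8K, which provides several reference captions per image. As a simplified sketch of the unigram case (BLEU-1) with clipped precision and the brevity penalty, assuming plain whitespace tokenization (the function name `bleu1` is ours; the paper's evaluation would typically use a standard library implementation):

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """BLEU-1: modified (clipped) unigram precision times the brevity
    penalty. candidate is a token list; references is a list of token lists."""
    cand_counts = Counter(candidate)
    # Clip each candidate token's count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty uses the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * precision
```

Higher-order BLEU-n scores work the same way over n-grams, with the per-order precisions combined by a geometric mean.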



Author information

Correspondence to Gaurav.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Gaurav, Mathur, P. (2023). Empirical Study of Image Captioning Models Using Various Deep Learning Encoders. In: Singh, P., Singh, D., Tiwari, V., Misra, S. (eds) Machine Learning and Computational Intelligence Techniques for Data Engineering. MISP 2022. Lecture Notes in Electrical Engineering, vol 998. Springer, Singapore. https://doi.org/10.1007/978-981-99-0047-3_27

