Abstract
Image captioning is the task of generating a caption for a given image, a growing and challenging research topic in computer vision. Many approaches have been proposed. Early template-based methods filled a fixed template with detected image objects and their attributes. Retrieval-based methods matched the query image against a database of images and composed a caption from the captions of the matched images. Both of these approaches tend to miss important objects. Recent work uses an encoder-decoder architecture, which has produced remarkable results; however, it is not easy to isolate the impact of the encoder alone on the captioning task. In this paper, we compare the performance of image captioning models built with various image encoders, namely Visual Geometry Group networks (VGG16, VGG19), Residual Networks (ResNet), and InceptionV3, each paired with a Gated Recurrent Unit (GRU) decoder for text generation. The models are compared by their Bilingual Evaluation Understudy (BLEU) scores on the Flickr8K dataset. The results show that ResNet achieves a higher BLEU score than VGG16, VGG19, and InceptionV3 on Flickr8K.
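The BLEU metric used for comparison rewards n-gram overlap between a generated caption and a set of reference captions. As a minimal sketch (not the paper's evaluation code), the following computes a simplified unigram BLEU-1 with clipped precision and a brevity penalty; actual evaluations typically use corpus-level BLEU-1 through BLEU-4 with smoothing, e.g. via `nltk.translate.bleu_score`.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Simplified sentence-level BLEU-1.

    candidate: list of tokens; references: list of token lists.
    Illustrative only: clipped unigram precision times a brevity penalty.
    """
    cand_counts = Counter(candidate)
    # Clip each candidate token count by its maximum count in any reference
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty: penalize candidates shorter than the closest reference
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - len(candidate)), L))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision

cand = "a dog runs on the grass".split()
refs = ["a dog is running on the grass".split(),
        "the dog runs across the lawn".split()]
score = bleu1(cand, refs)
```

Here every unigram of the candidate appears in some reference and the candidate matches the length of the closest reference, so the sketch returns 1.0; shorter or less overlapping captions score lower.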
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Gaurav, Mathur, P. (2023). Empirical Study of Image Captioning Models Using Various Deep Learning Encoders. In: Singh, P., Singh, D., Tiwari, V., Misra, S. (eds) Machine Learning and Computational Intelligence Techniques for Data Engineering. MISP 2022. Lecture Notes in Electrical Engineering, vol 998. Springer, Singapore. https://doi.org/10.1007/978-981-99-0047-3_27
DOI: https://doi.org/10.1007/978-981-99-0047-3_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0046-6
Online ISBN: 978-981-99-0047-3
eBook Packages: Intelligent Technologies and Robotics (R0)