Abstract
Automated image captioning involves understanding the semantic content of an image and expressing it in natural language. Among the many approaches proposed, deep learning-based techniques have achieved state-of-the-art results on this task. This paper introduces and compares three distinct deep learning-based approaches: encoder-decoder frameworks, neuroevolution, and attention-based models. It covers their mechanisms, examines their performance, and highlights where they differ from one another. To conclude, the results of these approaches on benchmark datasets and evaluation metrics are presented.
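As a concrete illustration of the first approach, the following is a minimal sketch of an encoder-decoder captioner in the "Show and Tell" style: a CNN encodes the image into a feature vector, and an LSTM decodes it into a caption. The choice of PyTorch, the ResNet-50 backbone, and all module names and sizes are illustrative assumptions, not the exact configurations compared in the paper.

```python
# Minimal CNN-LSTM encoder-decoder captioning sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector."""

    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=None)  # pretrained weights assumed in practice
        # Drop the final classification layer, keep the pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():              # encoder is often frozen early in training
            feats = self.backbone(images)  # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))   # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generate a caption token by token, conditioned on the image feature."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)            # (B, T+1, vocab_size) logits


# Usage with dummy data: images (B, 3, 224, 224), captions (B, T) token ids.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=10000)
logits = decoder(encoder(torch.randn(2, 3, 224, 224)),
                 torch.randint(0, 10000, (2, 12)))
```

At inference time, captions are typically produced by greedy or beam-search decoding, feeding each predicted token back into the LSTM until an end-of-sequence token is emitted.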