Attention Is All You Need to Tell: Transformer-Based Image Captioning

  • Conference paper
  • First Online:
Advances in Distributed Computing and Machine Learning

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 427)

Abstract

Automatic image captioning is a task that involves two prominent areas of deep learning research: image processing and language generation. Over the years, deep learning architectures have achieved considerable success in generating syntactically and semantically meaningful descriptions. Recent studies incorporate an attention mechanism that lets the model attend to different regions of the image at each timestep while generating the caption. In this paper, we present a Transformer architecture that generates captions by relying solely on the attention mechanism. To understand the effect of the attention mechanism on model performance, we separately train two LSTM-based image captioning models for a comparative study with our architecture. The models are trained on the Flickr-8K dataset using the cross-entropy loss function. To evaluate the models, we compute CIDEr-R, BLEU, METEOR, and ROUGE-L scores for the captions generated on the test split. Results from our comparative study suggest that the Transformer architecture is a better approach to image captioning, and that meaningful descriptions can be generated even without traditional recurrent neural networks as decoders.
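
As a rough illustration of the approach summarized above, the sketch below (PyTorch) shows a caption decoder that relies only on attention: a Transformer decoder cross-attends over flattened CNN image features and is trained with cross-entropy loss on next-token prediction. The layer sizes, vocabulary size, feature dimension, and training values are illustrative assumptions, not the authors' reported configuration.

    # Minimal sketch of an attention-only captioning decoder trained with
    # cross-entropy. All dimensions and hyperparameters are illustrative
    # assumptions, not the configuration reported in the paper.
    import torch
    import torch.nn as nn

    class TransformerCaptioner(nn.Module):
        def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                     nhead=8, num_layers=3, max_len=40):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, d_model)       # project CNN region features
            self.tok_embed = nn.Embedding(vocab_size, d_model)  # caption token embeddings
            self.pos_embed = nn.Embedding(max_len, d_model)     # learned positional embeddings
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, img_feats, captions):
            # img_feats: (B, R, feat_dim) flattened CNN features; captions: (B, T) token ids
            memory = self.feat_proj(img_feats)                  # keys/values for cross-attention
            T = captions.size(1)
            pos = torch.arange(T, device=captions.device)
            tgt = self.tok_embed(captions) + self.pos_embed(pos)
            # causal mask so each position attends only to earlier caption tokens
            causal = torch.triu(torch.full((T, T), float('-inf'),
                                           device=captions.device), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal) # masked self- + cross-attention
            return self.out(hidden)                             # (B, T, vocab_size) logits

    # One cross-entropy training step on next-token prediction (dummy tensors).
    model = TransformerCaptioner(vocab_size=5000)
    criterion = nn.CrossEntropyLoss(ignore_index=0)             # index 0 assumed to be <pad>
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    img_feats = torch.randn(4, 64, 2048)                        # 4 images, 64 regions each
    caps = torch.randint(1, 5000, (4, 20))                      # 4 tokenised dummy captions
    logits = model(img_feats, caps[:, :-1])                     # predict token t from tokens < t
    loss = criterion(logits.reshape(-1, 5000), caps[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At inference time, captions would be generated autoregressively (e.g., greedy or beam search) by feeding previously generated tokens back into the decoder; the LSTM baselines mentioned in the abstract would replace the Transformer decoder with a recurrent decoder over the same image features.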



Author information

Corresponding author

Correspondence to Shreyansh Chordia.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chordia, S., Pawar, Y., Kulkarni, S., Toradmal, U., Suratkar, S. (2022). Attention Is All You Need to Tell: Transformer-Based Image Captioning. In: Rout, R.R., Ghosh, S.K., Jana, P.K., Tripathy, A.K., Sahoo, J.P., Li, KC. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-19-1018-0_52
