Attention Is All You Need to Tell: Transformer-Based Image Captioning

  • Conference paper
  • First Online:
Advances in Distributed Computing and Machine Learning

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 427)

Abstract

Automatic image captioning is a task that involves two prominent areas of deep learning research: image processing and language generation. Over the years, deep learning architectures have achieved considerable success in generating syntactically and semantically meaningful descriptions. Recent studies incorporate an attention mechanism that lets the model attend to different regions of the image at each timestep while generating the caption. In this paper, we present a Transformer architecture that generates captions by relying solely on the attention mechanism. To understand the effect of the attention mechanism on model performance, we separately train two LSTM-based image captioning models for a comparative study with our architecture. The models are trained on the Flickr-8K dataset using the cross-entropy loss function. To evaluate the models, we compute CIDEr-R, BLEU, METEOR, and ROUGE-L scores for the captions generated on the test split. Results from our comparative study suggest that the Transformer architecture is a better approach to image captioning, and that meaningful descriptions can be generated even without traditional recurrent neural networks as decoders.
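
As a rough illustration of the approach summarized above, the sketch below (PyTorch) shows a caption decoder that relies only on attention: a Transformer decoder cross-attends over flattened CNN image features and is trained with cross-entropy loss on next-token prediction. The layer sizes, vocabulary size, feature dimension, and training values are illustrative assumptions, not the authors' reported configuration.

    # Minimal sketch of an attention-only captioning decoder trained with
    # cross-entropy. All dimensions and hyperparameters are illustrative
    # assumptions, not the configuration reported in the paper.
    import torch
    import torch.nn as nn

    class TransformerCaptioner(nn.Module):
        def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                     nhead=8, num_layers=3, max_len=40):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, d_model)       # project CNN region features
            self.tok_embed = nn.Embedding(vocab_size, d_model)  # caption token embeddings
            self.pos_embed = nn.Embedding(max_len, d_model)     # learned positional embeddings
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, img_feats, captions):
            # img_feats: (B, R, feat_dim) flattened CNN features; captions: (B, T) token ids
            memory = self.feat_proj(img_feats)                  # keys/values for cross-attention
            T = captions.size(1)
            pos = torch.arange(T, device=captions.device)
            tgt = self.tok_embed(captions) + self.pos_embed(pos)
            # causal mask so each position attends only to earlier caption tokens
            causal = torch.triu(torch.full((T, T), float('-inf'),
                                           device=captions.device), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal) # masked self- + cross-attention
            return self.out(hidden)                             # (B, T, vocab_size) logits

    # One cross-entropy training step on next-token prediction (dummy tensors).
    model = TransformerCaptioner(vocab_size=5000)
    criterion = nn.CrossEntropyLoss(ignore_index=0)             # index 0 assumed to be <pad>
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    img_feats = torch.randn(4, 64, 2048)                        # 4 images, 64 regions each
    caps = torch.randint(1, 5000, (4, 20))                      # 4 tokenised dummy captions
    logits = model(img_feats, caps[:, :-1])                     # predict token t from tokens < t
    loss = criterion(logits.reshape(-1, 5000), caps[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At inference time, captions would be generated autoregressively (e.g., greedy or beam search) by feeding previously generated tokens back into the decoder; the LSTM baselines mentioned in the abstract would replace the Transformer decoder with a recurrent decoder over the same image features.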



Author information

Corresponding author

Correspondence to Shreyansh Chordia.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chordia, S., Pawar, Y., Kulkarni, S., Toradmal, U., Suratkar, S. (2022). Attention Is All You Need to Tell: Transformer-Based Image Captioning. In: Rout, R.R., Ghosh, S.K., Jana, P.K., Tripathy, A.K., Sahoo, J.P., Li, KC. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-19-1018-0_52
