
Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 837))


Abstract

A visual question answering (VQA) system combines computer vision (CV) and natural language processing (NLP): given an image and a natural language question about that image, it produces an answer in natural language. Such a system must therefore understand both the content of the image and the semantics of the question. This article discusses the limitations of several state-of-the-art VQA models, the datasets used by these models, the evaluation metrics associated with these datasets, and the limitations of the major datasets. Detailed failure cases of these models are also presented, along with future directions for achieving higher accuracy in answer generation.
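The image-plus-question-to-answer pipeline described above can be sketched as a toy joint-embedding model. Everything below (the feature extractors, the answer vocabulary, the element-wise fusion, the random weights) is an illustrative placeholder standing in for a trained CNN, question encoder, and classifier; it is not the architecture of any model surveyed in this paper.

```python
import numpy as np

# Toy VQA pipeline: fuse an image feature vector with a question
# embedding and classify over a small, fixed answer vocabulary.
# All components (features, vocabulary, weights) are illustrative.

rng = np.random.default_rng(0)

ANSWERS = ["yes", "no", "red", "two", "dog"]          # closed answer set
VOCAB = {"is": 0, "the": 1, "ball": 2, "red": 3, "how": 4, "many": 5}

D = 8                                                 # joint embedding size

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN backbone: project flattened pixels to D dims."""
    W = rng.standard_normal((D, image.size))
    return np.tanh(W @ image.ravel())

def encode_question(question: str) -> np.ndarray:
    """Stand-in for an RNN/LSTM encoder: average word embeddings."""
    E = rng.standard_normal((len(VOCAB), D))
    ids = [VOCAB[w] for w in question.lower().split() if w in VOCAB]
    return E[ids].mean(axis=0) if ids else np.zeros(D)

def answer(image: np.ndarray, question: str) -> str:
    v = encode_image(image)
    q = encode_question(question)
    fused = v * q                                     # element-wise fusion
    W_out = rng.standard_normal((len(ANSWERS), D))
    logits = W_out @ fused
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # softmax over answers
    return ANSWERS[int(np.argmax(probs))]

if __name__ == "__main__":
    img = rng.random((4, 4, 3))                       # fake 4x4 RGB image
    print(answer(img, "is the ball red"))
```

With random weights the predicted answer is arbitrary; the point is the data flow that real VQA models share, namely encode both modalities, fuse them into a joint representation, and classify over a fixed answer set.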



Author information

Correspondence to Himanshu Sharma.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sharma, H. (2022). A Critical Analysis of VQA Models and Datasets. In: Sanyal, G., Travieso-González, C.M., Awasthi, S., Pinto, C.M., Purushothama, B.R. (eds) International Conference on Artificial Intelligence and Sustainable Engineering. Lecture Notes in Electrical Engineering, vol 837. Springer, Singapore. https://doi.org/10.1007/978-981-16-8546-0_9
