
Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 837))


Abstract

A visual question answering (VQA) system combines computer vision (CV) and natural language processing (NLP): given an image and a natural language question about that image, it produces an answer in natural language. Such a system must therefore understand both the content of the image and the semantics of the question. This article discusses the limitations of several state-of-the-art VQA models, the datasets used by these models, the evaluation metrics associated with these datasets, and the limitations of the major datasets. Detailed failure cases of these models are also presented, along with future directions for achieving higher accuracy in answer generation.
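The image-plus-question-to-answer pipeline described above can be sketched as a toy joint-embedding model. Everything below (the feature extractors, the answer vocabulary, the element-wise fusion, the random weights) is an illustrative placeholder standing in for a trained CNN, question encoder, and classifier; it is not the architecture of any model surveyed in this paper.

```python
import numpy as np

# Toy VQA pipeline: fuse an image feature vector with a question
# embedding and classify over a small, fixed answer vocabulary.
# All components (features, vocabulary, weights) are illustrative.

rng = np.random.default_rng(0)

ANSWERS = ["yes", "no", "red", "two", "dog"]          # closed answer set
VOCAB = {"is": 0, "the": 1, "ball": 2, "red": 3, "how": 4, "many": 5}

D = 8                                                 # joint embedding size

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN backbone: project flattened pixels to D dims."""
    W = rng.standard_normal((D, image.size))
    return np.tanh(W @ image.ravel())

def encode_question(question: str) -> np.ndarray:
    """Stand-in for an RNN/LSTM encoder: average word embeddings."""
    E = rng.standard_normal((len(VOCAB), D))
    ids = [VOCAB[w] for w in question.lower().split() if w in VOCAB]
    return E[ids].mean(axis=0) if ids else np.zeros(D)

def answer(image: np.ndarray, question: str) -> str:
    v = encode_image(image)
    q = encode_question(question)
    fused = v * q                                     # element-wise fusion
    W_out = rng.standard_normal((len(ANSWERS), D))
    logits = W_out @ fused
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # softmax over answers
    return ANSWERS[int(np.argmax(probs))]

if __name__ == "__main__":
    img = rng.random((4, 4, 3))                       # fake 4x4 RGB image
    print(answer(img, "is the ball red"))
```

With random weights the predicted answer is arbitrary; the point is the data flow that real VQA models share, namely encode both modalities, fuse them into a joint representation, and classify over a fixed answer set.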



Author information

Correspondence to Himanshu Sharma.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sharma, H. (2022). A Critical Analysis of VQA Models and Datasets. In: Sanyal, G., Travieso-González, C.M., Awasthi, S., Pinto, C.M., Purushothama, B.R. (eds) International Conference on Artificial Intelligence and Sustainable Engineering. Lecture Notes in Electrical Engineering, vol 837. Springer, Singapore. https://doi.org/10.1007/978-981-16-8546-0_9
