Abstract
A visual question answering (VQA) system draws on both computer vision (CV) and natural language processing (NLP). Given an image and a natural language question about it, such a system produces an answer in natural language; to do so, it must understand both the content of the image and the semantics of the question. In this article, we discuss the limitations of several state-of-the-art VQA models, the datasets these models use, the evaluation metrics defined for those datasets, and the shortcomings of the major datasets. We also present detailed failure cases of these models and outline future directions for achieving higher accuracy in answer generation.
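To make the joint vision-and-language setup concrete, the sketch below shows the canonical VQA pipeline in PyTorch: a question encoder, pre-extracted CNN image features, element-wise fusion, and a classifier over a fixed answer vocabulary. This is a minimal illustration only; the class name, layer sizes, and the choice of element-wise-product fusion are assumptions for exposition, not the architecture of any specific model analyzed in the paper.

```python
# Minimal VQA pipeline sketch (illustrative, not a surveyed model):
# CNN image features + LSTM question encoding -> fusion -> answer classifier.
import torch
import torch.nn as nn


class SimpleVQA(nn.Module):
    def __init__(self, vocab_size: int, num_answers: int,
                 embed_dim: int = 300, hidden_dim: int = 512,
                 img_feat_dim: int = 2048):
        super().__init__()
        # Question branch: embed tokens, encode with an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image branch: project pre-extracted CNN features
        # (e.g., a pooled ResNet feature vector) to the same size.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Classify the fused representation over a fixed answer set
        # (the common "answer as class label" formulation).
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats: torch.Tensor,
                question_tokens: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                      # (batch, hidden_dim)
        v = torch.relu(self.img_proj(img_feats))
        fused = q * v                    # element-wise fusion
        return self.classifier(fused)    # answer logits


# Usage with random stand-in data:
model = SimpleVQA(vocab_size=10000, num_answers=3000)
img_feats = torch.randn(4, 2048)               # pooled CNN features
questions = torch.randint(0, 10000, (4, 12))   # question token ids
logits = model(img_feats, questions)           # shape: (4, 3000)
```

Most of the model families discussed in the paper (attention-based, bilinear-pooling, and co-attention approaches) can be read as replacing the simple element-wise fusion step above with a richer interaction between the image and question representations.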