Abstract
Visual question answering (VQA) is an integrative research problem in artificial intelligence. A VQA system takes an image and a textual query as input and finds the correct answer by combining visual information from the image with deductions drawn from the query. Accurately interpreting and answering visual reasoning queries is essential. Recent studies construct parse trees over the input queries, but their performance suffers from a lack of semantic interpretation. This work achieves comprehensive reasoning by attaching a semantic representation to the constructed parse tree. The proposed model, the semantic tree-based visual question answering system (STVQA), captures the inherent visual evidence of every word parsed from the textual query, combines it with the visual evidence of that word's child nodes, and passes the result up to the parent node in the parse tree. STVQA thus performs global reasoning over the image and the textual query. VQA is applicable to domains such as image retrieval and surveillance, and can act as an aid for visually impaired people. STVQA is evaluated on the publicly available, challenging CLEVR benchmark dataset, where the model proves both computationally efficient and data-efficient, achieving a new state-of-the-art accuracy of 90%.
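The bottom-up aggregation the abstract describes can be sketched as a recursion over a parse tree: each node holds the visual evidence for its own word, merges in the evidence returned by its children, and passes the result to its parent. The sketch below is illustrative only, not the authors' implementation; the node names, the element-wise-max combine rule, and the toy evidence vectors are all hypothetical assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParseNode:
    """A node of a query parse tree (hypothetical structure)."""
    word: str
    evidence: List[float]  # per-word visual evidence (toy vector)
    children: List["ParseNode"] = field(default_factory=list)


def aggregate(node: ParseNode) -> List[float]:
    """Combine a node's own evidence with its children's evidence,
    bottom-up, so the root ends up with global evidence.
    Element-wise max is an assumed stand-in for the combine rule."""
    result = list(node.evidence)
    for child in node.children:
        child_ev = aggregate(child)
        result = [max(a, b) for a, b in zip(result, child_ev)]
    return result


# Toy query "red cube left of sphere" with a hypothetical parse:
tree = ParseNode("left_of", [0.1, 0.2], children=[
    ParseNode("cube", [0.9, 0.1], children=[ParseNode("red", [0.8, 0.3])]),
    ParseNode("sphere", [0.2, 0.7]),
])
root_evidence = aggregate(tree)  # global evidence available at the root
```

In a real system the evidence vectors would come from grounding each word against image features, and the combine rule would be learned rather than fixed; the recursion structure is the point of the sketch.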
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Padmajaya Rekha, K., Chitrakala, S. (2022). Semantic Tree-Structured Representation for Visual Question Answering System. In: Saraswat, M., Roy, S., Chowdhury, C., Gandomi, A.H. (eds) Proceedings of International Conference on Data Science and Applications. Lecture Notes in Networks and Systems, vol 287. Springer, Singapore. https://doi.org/10.1007/978-981-16-5348-3_29
DOI: https://doi.org/10.1007/978-981-16-5348-3_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5347-6
Online ISBN: 978-981-16-5348-3
eBook Packages: Intelligent Technologies and Robotics