
Semantic Tree-Structured Representation for Visual Question Answering System

  • Conference paper
In: Proceedings of International Conference on Data Science and Applications

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 287)

Abstract

The visual question answering (VQA) system is an integrative research problem in the field of artificial intelligence. A VQA system takes an image and a textual query as input and tries to find the correct answer by combining the image with deductions drawn from the query. It is essential to interpret visual reasoning queries and retrieve accurate answers from them. Recent studies have used parse tree construction on input queries, which leads to poor performance because of the lack of semantic interpretation. This work aims to achieve comprehensive reasoning through a semantic representation of the constructed parse tree. The proposed model, the semantic tree-based visual question answering system (STVQA), captures the inherent visual evidence of every word parsed from the textual query and combines it with the visual evidence of its child nodes. The combined result is passed up to the parent nodes in the parse tree. Thus, the proposed STVQA system aims to achieve global reasoning over the image and textual query. The VQA system is applicable to various domains such as image retrieval and surveillance, and can act as an aid for visually impaired people. The STVQA system is evaluated on a publicly available, challenging benchmark dataset, CLEVR. The model is shown to be computationally efficient and data-efficient, achieving a new state-of-the-art accuracy of 90%.
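The bottom-up evidence propagation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar evidence scores, the additive combination rule, and the node structure are all assumptions made purely for illustration, since the abstract specifies only that each node's visual evidence is combined with that of its children and passed to the parent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParseNode:
    """A node in the parse tree of the textual query (hypothetical structure)."""
    word: str
    evidence: float                       # visual evidence score for this word
    children: List["ParseNode"] = field(default_factory=list)

def propagate_evidence(node: ParseNode) -> float:
    """Bottom-up pass: combine a node's own visual evidence with the
    combined evidence of its children, and return the result so the
    parent node can incorporate it in turn."""
    combined = node.evidence
    for child in node.children:
        combined += propagate_evidence(child)  # additive combination (assumption)
    return combined

# Tiny illustrative parse tree for a CLEVR-style query fragment,
# with made-up evidence scores.
tree = ParseNode("left of", 0.2, [
    ParseNode("red cube", 0.9),
    ParseNode("sphere", 0.7),
])
total = propagate_evidence(tree)
```

In the actual model the per-node evidence would come from grounding each word in the image, and the combination would likely be a learned operation rather than a sum; the sketch only shows the tree-structured flow of information from leaves to root.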




Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Padmajaya Rekha, K., Chitrakala, S. (2022). Semantic Tree-Structured Representation for Visual Question Answering System. In: Saraswat, M., Roy, S., Chowdhury, C., Gandomi, A.H. (eds) Proceedings of International Conference on Data Science and Applications. Lecture Notes in Networks and Systems, vol 287. Springer, Singapore. https://doi.org/10.1007/978-981-16-5348-3_29
