Abstract
This paper presents a methodology for visual question answering: generating answers to questions posed about the input images in a dataset. The proposed model has two major components, one for text and one for images, whose extracted features are combined to predict the answer. We built a pipeline that first preprocesses the dataset and then encodes the question and answer strings. The text is processed with NLP techniques such as tokenization and stemming to form a vocabulary. A second experiment, with a modified model and approach, was performed on the publicly available easy-VQA dataset. That model uses the bag-of-words technique to turn a question into a vector. It extracts text and image features in two separate components and merges them by element-wise multiplication to generate an answer. Both approaches use the softmax activation function in the output layer to produce the answer to the question. Compared with existing methodologies, the approach appears comparable and gives decent results.
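The second approach described above (bag-of-words question vector, separate image features, element-wise multiplication, softmax output) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tanh hidden activation, the layer sizes, the random untrained weights, and the three-value image feature vector standing in for a CNN output are all assumptions made for illustration.

```python
import math
import random


def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_vocab(questions):
    """Assign each distinct token an index in first-seen order."""
    vocab = {}
    for question in questions:
        for token in tokenize(question):
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(question, vocab):
    """Return a count vector over the vocabulary; unknown tokens are ignored."""
    vec = [0.0] * len(vocab)
    for token in tokenize(question):
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained layer: small random weights, zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(question, image_features, vocab, params):
    """Project both inputs to a shared size, multiply element-wise, classify."""
    q = [math.tanh(v)
         for v in dense(bag_of_words(question, vocab), *params["question"])]
    v = [math.tanh(u) for u in dense(image_features, *params["image"])]
    fused = [a * b for a, b in zip(q, v)]  # element-wise multiplication
    return softmax(dense(fused, *params["output"]))


rng = random.Random(0)
questions = ["What shape is this?", "Is the shape blue?"]
answers = ["circle", "yes", "no"]           # assumed answer classes
vocab = build_vocab(questions)
image_features = [0.2, 0.7, 0.1]            # placeholder for CNN features
hidden = 4                                  # assumed shared feature size
params = {
    "question": random_layer(hidden, len(vocab), rng),
    "image": random_layer(hidden, len(image_features), rng),
    "output": random_layer(len(answers), hidden, rng),
}
probs = predict(questions[0], image_features, vocab, params)
best_answer = answers[max(range(len(probs)), key=probs.__getitem__)]
```

Because the weights here are random, the predicted answer is meaningless; the sketch only shows how the two feature vectors must share a length before they can be multiplied element-wise, and how softmax turns the fused vector into a probability distribution over candidate answers.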
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Azade, A., Saini, R., Naik, D. (2023). Visual Question Answering Using Convolutional and Recurrent Neural Networks. In: Singh, P., Singh, D., Tiwari, V., Misra, S. (eds) Machine Learning and Computational Intelligence Techniques for Data Engineering. MISP 2022. Lecture Notes in Electrical Engineering, vol 998. Springer, Singapore. https://doi.org/10.1007/978-981-99-0047-3_3
DOI: https://doi.org/10.1007/978-981-99-0047-3_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0046-6
Online ISBN: 978-981-99-0047-3
eBook Packages: Intelligent Technologies and Robotics (R0)