ImageFuse: A Multi-view Image Featurization Framework for Visual Question Answering

Manmadhan, Sruthy; Kovoor, Binsu C.

doi:10.1007/978-3-030-71187-0_14

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1351)

Included in the following conference series: International Conference on Intelligent Systems Design and Applications


Abstract

Visual Question Answering (VQA) is a task in which a machine must produce a correct answer to a question asked about an image. This paper proposes a novel image featurization framework, ImageFuse, to improve VQA performance. Instead of directly adopting the common representations obtained from popular ImageNet-pretrained CNN models via transfer learning, it combines feature fusion networks to form a fine-grained image representation. Two parallel fusion networks are trained using Canonical Correlation Analysis (CCA) and Autoencoders (AE) to capture the linear and non-linear relationships, respectively, that exist among multiple views of the image. Extensive experiments on the DAQUAR VQA dataset show a significant improvement for the proposed framework over VQA systems based on a single image representation.
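To make the idea of two parallel fusion branches concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes two views of each image (for example, feature vectors from two different pretrained CNNs, simulated here with random data), fuses them linearly with a regularized CCA, fuses them non-linearly with a small one-hidden-layer tanh autoencoder, and concatenates both outputs. The function names (`cca_fit`, `ae_fit`, `fuse`), the dimensions, the learning rate, and the final concatenation step are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def cca_fit(X, Y, k, reg=1e-3):
    """Fit linear CCA; return centering means, projections, and correlations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return (V / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt.T[:, :k], s[:k]

def ae_fit(Z, k, epochs=300, lr=0.1, seed=0):
    """Train a one-hidden-layer tanh autoencoder by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    d = Z.shape[1]
    W1, b1 = rng.normal(0, 0.1, (d, k)), np.zeros(k)
    W2, b2 = rng.normal(0, 0.1, (k, d)), np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(Z @ W1 + b1)            # non-linear encoder
        E = H @ W2 + b2 - Z                 # reconstruction error of linear decoder
        losses.append(float(np.mean(E ** 2)))
        gR = 2.0 * E / E.size               # gradient of the MSE w.r.t. the reconstruction
        gA = (gR @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
        W2 -= lr * (H.T @ gR)
        b2 -= lr * gR.sum(axis=0)
        W1 -= lr * (Z.T @ gA)
        b1 -= lr * gA.sum(axis=0)
    return W1, b1, losses

def fuse(X, Y, k_cca, k_ae):
    """Concatenate a CCA-fused (linear) and an AE-fused (non-linear) code."""
    mx, my, Wx, Wy, corr = cca_fit(X, Y, k_cca)
    linear = np.hstack([(X - mx) @ Wx, (Y - my) @ Wy])
    Z = np.hstack([X, Y])
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-8)  # standardize AE input
    W1, b1, losses = ae_fit(Z, k_ae)
    nonlinear = np.tanh(Z @ W1 + b1)
    return np.hstack([linear, nonlinear]), corr, losses

# Simulated stand-ins for two CNN feature views sharing a 3-d latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
view_a = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
view_b = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(200, 12))

fused, corr, losses = fuse(view_a, view_b, k_cca=3, k_ae=8)
print(fused.shape)  # (200, 14): 3 + 3 CCA dimensions plus 8 autoencoder units
```

In a real VQA pipeline the fused vector would replace the single-CNN image feature that is passed on to the question-answering model; how the chapter combines the two branches may differ from the plain concatenation used in this sketch.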

Notes

1. μ: membership measure.
2. Aⁱ, Tⁱ: the i-th predicted answer and the i-th ground-truth answer.
3. WUP(a, b): similarity based on the depth of the two words 'a' and 'b' in the WordNet taxonomy.
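
The chapter's own equation is not reproduced on this page. Assuming these notes annotate the WUPS score that is commonly reported on DAQUAR (as defined by Malinowski and Fritz), the symbols would fit together as sketched below. Here N is the number of questions, and the threshold τ (typically 0.9 or 0.0) and the down-weighting factor 0.1 come from the usual WUPS definition rather than from this page.

```latex
\mathrm{WUPS}(A, T) = \frac{1}{N} \sum_{i=1}^{N}
  \min\left\{
    \prod_{a \in A^{i}} \max_{t \in T^{i}} \mu(a, t),\;
    \prod_{t \in T^{i}} \max_{a \in A^{i}} \mu(a, t)
  \right\},
\qquad
\mu(a, t) =
  \begin{cases}
    \mathrm{WUP}(a, t) & \text{if } \mathrm{WUP}(a, t) \ge \tau, \\
    0.1 \cdot \mathrm{WUP}(a, t) & \text{otherwise.}
  \end{cases}
```

The score is often reported multiplied by 100, so that it reads as a percentage.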

Author information

Authors and Affiliations

Division of Information Technology, Cochin University of Science and Technology, Kochi, Kerala, India
Sruthy Manmadhan & Binsu C. Kovoor

Editor information

Editors and Affiliations

Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs (MIR Labs), Auburn, WA, USA
Ajith Abraham
Department of Computer Science, Università degli Studi di Milano, Milan, Milano, Italy
Vincenzo Piuri
Machine Intelligence Research Labs (MIR Labs), Auburn, WA, USA
Niketa Gandhi
Campus Centre de Créteil, Université Paris-Est Créteil, Créteil, France
Patrick Siarry
Department of Construction Management and Real Estate, Vilnius Gediminas Technical University, Vilnius, Lithuania
Arturas Kaklauskas
School of Engineering, Instituto Superior de Engenharia do Porto, Porto, Portugal
Ana Madureira

About this paper

Cite this paper

Manmadhan, S., Kovoor, B.C. (2021). ImageFuse: A Multi-view Image Featurization Framework for Visual Question Answering. In: Abraham, A., Piuri, V., Gandhi, N., Siarry, P., Kaklauskas, A., Madureira, A. (eds) Intelligent Systems Design and Applications. ISDA 2020. Advances in Intelligent Systems and Computing, vol 1351. Springer, Cham. https://doi.org/10.1007/978-3-030-71187-0_14

DOI: https://doi.org/10.1007/978-3-030-71187-0_14
Published: 03 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71186-3
Online ISBN: 978-3-030-71187-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
