Abstract
Computer Vision has its own Turing test: can a machine describe the contents of an image or a video the way a human being would? In this paper, we analyze the progress of Deep Learning for image recognition in order to answer this question. In recent years, Deep Learning has considerably increased the accuracy of many tasks related to computer vision. Many datasets of labeled images are now available online, which has led to pre-trained models for many computer vision applications. In this work, we gather the latest techniques for image understanding and description. We conclude that combining Natural Language Processing (using Recurrent Neural Networks and Long Short-Term Memory) with Image Understanding (using Convolutional Neural Networks) could enable a new class of powerful and useful applications in which the computer answers questions about the content of images and videos. Building datasets of labeled images requires a great deal of work, and most of them are built through crowdsourcing. These new applications have the potential to raise human-machine interaction to new levels of usability and user satisfaction.
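To make the CNN-plus-RNN/LSTM combination concrete, the sketch below pairs a convolutional image encoder with an LSTM language decoder, the architecture family behind neural image captioners such as Show and Tell. It is a minimal illustration assuming PyTorch; the toy encoder, layer dimensions, vocabulary size, and random smoke-test data are placeholders of our own, not the implementation of any particular system, and a real captioner would use a pre-trained CNN (e.g., VGG or Inception) as the feature extractor.

```python
# Minimal sketch (PyTorch assumed) of a CNN encoder + LSTM decoder captioner:
# the encoder turns an image into a feature vector, and the LSTM generates a
# caption word by word conditioned on that vector.
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy convolutional encoder; a real system would reuse a network
        # pre-trained on a large labeled dataset such as ImageNet.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature to the caption embeddings so the LSTM
        # sees the image as its first "token".
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                      # per-step word scores

# Smoke test with random data and a hypothetical 1000-word vocabulary.
model = CaptioningModel(vocab_size=1000)
scores = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(scores.shape)  # torch.Size([2, 13, 1000])
```

At inference time the decoder would be run autoregressively, feeding each predicted word back in as the next input until an end-of-sentence token is produced.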
About this paper
Cite this paper
Jácome-Galarza, L.R., Realpe-Robalino, M.A., Chamba-Eras, L.A., Viñán-Ludeña, M.S., Sinche-Freire, J.F. (2020). Computer Vision for Image Understanding: A Comprehensive Review. In: Botto-Tobar, M., León-Acurio, J., Díaz Cadena, A., Montiel Díaz, P. (eds.) Advances in Emerging Trends and Technologies. ICAETT 2019. Advances in Intelligent Systems and Computing, vol. 1066. Springer, Cham. https://doi.org/10.1007/978-3-030-32022-5_24