Abstract
Computer Vision has its own Turing test: can a machine describe the contents of an image or a video the way a human being would? In this paper, we analyze the progress of Deep Learning for image recognition in order to answer this question. In recent years, Deep Learning has considerably increased the accuracy of many tasks related to computer vision. Many datasets of labeled images are now available online, which has led to pre-trained models for many computer vision applications. In this work, we gather the latest techniques for image understanding and description. We conclude that combining Natural Language Processing (using Recurrent Neural Networks and Long Short-Term Memory) with Image Understanding (using Convolutional Neural Networks) could enable a new class of powerful and useful applications in which the computer answers questions about the content of images and videos. Building datasets of labeled images requires a great deal of work, and most of them are built through crowdsourcing. These new applications have the potential to raise human-machine interaction to new levels of usability and user satisfaction.
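To make the CNN-plus-RNN/LSTM combination concrete, the sketch below pairs a convolutional image encoder with an LSTM language decoder, the architecture family behind neural image captioners such as Show and Tell. It is a minimal illustration assuming PyTorch; the toy encoder, layer dimensions, vocabulary size, and random smoke-test data are placeholders of our own, not the implementation of any particular system, and a real captioner would use a pre-trained CNN (e.g., VGG or Inception) as the feature extractor.

```python
# Minimal sketch (PyTorch assumed) of a CNN encoder + LSTM decoder captioner:
# the encoder turns an image into a feature vector, and the LSTM generates a
# caption word by word conditioned on that vector.
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy convolutional encoder; a real system would reuse a network
        # pre-trained on a large labeled dataset such as ImageNet.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature to the caption embeddings so the LSTM
        # sees the image as its first "token".
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                      # per-step word scores

# Smoke test with random data and a hypothetical 1000-word vocabulary.
model = CaptioningModel(vocab_size=1000)
scores = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(scores.shape)  # torch.Size([2, 13, 1000])
```

At inference time the decoder would be run autoregressively, feeding each predicted word back in as the next input until an end-of-sentence token is produced.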
About this paper
Cite this paper
Jácome-Galarza, L.R., Realpe-Robalino, M.A., Chamba-Eras, L.A., Viñán-Ludeña, M.S., Sinche-Freire, J.F. (2020). Computer Vision for Image Understanding: A Comprehensive Review. In: Botto-Tobar, M., León-Acurio, J., Díaz Cadena, A., Montiel Díaz, P. (eds.) Advances in Emerging Trends and Technologies. ICAETT 2019. Advances in Intelligent Systems and Computing, vol. 1066. Springer, Cham. https://doi.org/10.1007/978-3-030-32022-5_24