
1 Introduction

Automatic image captioning is the process of generating natural language captions for images without human intervention. The area is garnering attention from researchers because of the huge volume of unorganized multimedia data pouring in every second. It is a step ahead of automatic image tagging, where images are labeled with keywords relevant to their contents. Several definitions have been proposed. In [1], the authors define automatic image captioning as the process by which a computer system automatically assigns metadata, in the form of captions or keywords, to a digital image. Mathews et al. [12] define it as automatically describing the objects, people and scene in an image. Wang et al. [21] define it as recognizing the visual objects in an image and the semantic interactions between them, and translating that visual understanding into sensible sentence descriptions. Liu et al. [22] add that the generated language must be grammatically error-free and fluent. To sum up, image captioning can be defined as automatically generating short descriptions that represent the contents of an image (objects, scene and their interactions) in human-like language.

Automatic image captioning is viewed as an amalgamation of computer vision and natural language processing. The computer vision part recognizes the contents of an image, and the natural language processing part converts that recognition into sentences. Research has flourished in both fields: computer vision researchers try to understand the image better, while natural language processing researchers try to express it better. Because of this integration, automatic image captioning has emerged as a prominent field in artificial intelligence.

1.1 Applications

Automatic image captioning is an interesting area because of its applications in various fields. It can be used in image retrieval systems to organize and locate images of interest in a database, and it is similarly useful for video retrieval. It can aid the development of tools that help visually impaired individuals access pictorial information, and it finds application in query-response interfaces. Journalists can use it to find and caption images related to their articles. Human-machine interaction systems can also employ its results, for example to locate images verbally. Further uses include military intelligence generation, surveillance systems, goods annotation in warehouses and self-aware systems (Fig. 1).

Fig. 1. Example of automatically captioned images [3].

1.2 Scene Analysis

Scene analysis is a module in automatic image captioning that has gained importance recently. In image captioning, the output is generally the main object in the image, with no regard for the background. Ignoring the background makes the description of the image vague and unclear.

Consider an image in which a person is standing in front of a river, and another in which the background is a desert. If the focus is only on the object, both images will be captioned as a person. If the background scene is taken into consideration, the first image may be captioned as a person standing in front of a river and the second as a person in a desert. Suppose a journalist wants a sample image for her article and queries the image database with the keywords person, river. With object-only annotation both images will be retrieved, whereas with scene-aware annotation only the first image will be retrieved. Thus, scene analysis is very important for proper image captioning, which in turn leads to better image retrieval results (Figs. 2 and 3).
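The retrieval scenario above can be made concrete with a minimal sketch. Captions are reduced to keyword sets, and a query is answered by returning the images whose captions match the most query terms. The image identifiers and keywords are made up for illustration only.

```python
# Toy illustration of how scene keywords sharpen retrieval.
# Object-only captions cannot distinguish the two images;
# scene-aware captions can.
captions_object_only = {"img1": {"person"}, "img2": {"person"}}
captions_with_scene = {"img1": {"person", "river"},
                       "img2": {"person", "desert"}}

def retrieve(index, query):
    """Return the image ids whose keyword sets best match the query."""
    scores = {img: len(query & kws) for img, kws in index.items()}
    best = max(scores.values())
    return sorted(img for img, s in scores.items() if s == best and s > 0)

query = {"person", "river"}
print(retrieve(captions_object_only, query))  # ['img1', 'img2']
print(retrieve(captions_with_scene, query))   # ['img1']
```

With object-only captions both images score equally on the query, so both are returned; with scene keywords the first image matches more terms and is returned alone, mirroring the journalist example.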

Fig. 2. Without scene analysis: a person. With scene analysis: a person in front of a river. (Sample image taken from the internet.)

Fig. 3. Without scene analysis: a person. With scene analysis: a person in a desert. (Sample image taken from the internet.)

For scene analysis, the image needs to be broken down into segments. This brings in another image processing field: image segmentation. Various segmentation techniques exist, and new ones continue to emerge, aimed at segmenting images so that the machine understands them better and can generate better captions. Scene analysis also involves object recognition, which is itself a very broad research area.
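As a minimal sketch of what segmentation means here, the following splits a grayscale image into a foreground/background mask using a simple intensity threshold. The synthetic image and mean-based threshold are illustrative stand-ins; practical systems use far richer methods (e.g. Otsu thresholding, region growing, or learned segmentation).

```python
import numpy as np

def threshold_segment(gray, thresh=None):
    """Split a grayscale image into a binary foreground/background mask.

    If no threshold is given, the mean intensity is used - a crude
    stand-in for data-driven threshold selection.
    """
    if thresh is None:
        thresh = gray.mean()
    return gray > thresh

# A synthetic 6x6 "image": a bright 2x2 square on a dark background.
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0
mask = threshold_segment(img)
print(mask.sum())  # 4 foreground pixels
```

Each connected region of the mask would then be handed to the object recognition stage as a candidate segment.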

Object detection can be enhanced by adding contextual information, and scene analysis provides that context. As the number of scene types is finite, scene analysis can be treated as a scene classification problem. Since objects are related to scenes, the probability distribution of each object differs across scenes. Convolutional neural networks have, for example, been trained on millions of images of the Places dataset to predict over 200 scene types.
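The idea that scene context re-weights object hypotheses can be sketched with a toy re-ranking step. All numbers below are invented for illustration: a detector is unsure whether a region is a boat or a car, and an assumed scene-conditioned prior P(object | scene) resolves the ambiguity.

```python
# Hypothetical detector confidences for one ambiguous region.
detector_scores = {"boat": 0.45, "car": 0.55}

# Assumed priors P(object | scene); illustrative values only.
object_given_scene = {
    "river":  {"boat": 0.7, "car": 0.1},
    "street": {"boat": 0.05, "car": 0.8},
}

def rerank(scores, scene):
    """Weight detector scores by the scene-conditioned object prior,
    then renormalize so the results sum to 1."""
    prior = object_given_scene[scene]
    combined = {obj: s * prior.get(obj, 0.0) for obj, s in scores.items()}
    total = sum(combined.values())
    return {obj: v / total for obj, v in combined.items()}

print(rerank(detector_scores, "river"))   # "boat" now dominates
print(rerank(detector_scores, "street"))  # "car" now dominates
```

The same raw detections yield opposite decisions depending on the predicted scene, which is precisely why scene classification helps object detection.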

In a nutshell, scene analysis of an image is very important. Without this there is no scope for meaningful captioning of images.

2 Related Works

A lot of research has been done in the field of automatic image captioning. The procedure by which machines generate image captions follows a common framework, which is discussed below.

2.1 Framework

On the whole, the entire procedure can be subdivided into two parts: image processing and language processing. The image processing part includes image segmentation, feature extraction and classification; feature extraction and classification can together be referred to as object recognition.

After object recognition, we obtain keywords corresponding to the objects identified in the image. These keywords are then fed to the language processing unit, which forms meaningful captions.

Each of the three modules is independent and can be researched individually; the technique applied in one module does not constrain the technique used in another. This is beneficial, as each module can be studied and analyzed in isolation (Fig. 4).

Fig. 4. Steps in automatic image captioning
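The three-stage framework above can be sketched as a pipeline of stubbed modules. Every function body here is a placeholder: a real system plugs a concrete technique (CNN features, a trained classifier, a language model) into each stage without affecting the others, which is exactly the modularity the framework provides.

```python
# A skeletal view of the captioning pipeline: segmentation ->
# object recognition -> language generation. All region names,
# keywords and the template sentence are illustrative stubs.

def segment(image):
    """Image processing: split the image into candidate regions."""
    return ["region_1", "region_2"]  # placeholder regions

def recognize(regions):
    """Feature extraction + classification: regions -> object keywords."""
    lookup = {"region_1": "person", "region_2": "river"}
    return [lookup[r] for r in regions]

def generate_caption(keywords):
    """Language processing: keywords -> a sentence."""
    return "a " + " in front of a ".join(keywords)

caption = generate_caption(recognize(segment("photo.jpg")))
print(caption)  # a person in front of a river
```

Swapping in a better `segment` or `recognize` implementation changes the keywords but leaves the interface between stages untouched.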

2.2 Approaches

For segmentation and recognition, various techniques can be used: supervised learning, unsupervised learning, neural networks or probabilistic methods (Fig. 5).

Fig. 5. Various approaches applied in the image processing part of automatic image captioning

Table 1. A comparative study of different works in automatic image captioning

3 Comparative Study

See Table 1.

4 Issues and Challenges

A number of open research issues and challenges have been identified in this field. A few of them are listed below:

  1. Large collections of digital images exist without annotations.

  2. The quality and quantity of the training set is an important factor in determining the quality of the generated captions.

  3. Images with low resolution, low contrast or complex backgrounds, and text with multiple orientations, styles, colors and alignments, increase the complexity of image understanding.

  4. The training set must include as much variety as possible.

  5. Searching for the optimal method for each module is very expensive, and the choice has a major effect on the performance of the overall system.

  6. Capturing sentiment in the captions is a major challenge, as few datasets include sentiment-based annotations.

  7. Few datasets provide captions in multiple languages, and machine translation results are not always relevant.

5 Conclusion and Future Work

Automatic image captioning is an emerging area in artificial intelligence and computer vision, with real-life applications in various fields. It is an ensemble of several modules, each of which opens avenues for exploration. Better captions can be generated with proper segmentation, and descriptions can be enhanced through sentiment addition, activity recognition, background identification and scene analysis. Deep learning can be explored further for faster and more accurate results; where hardware resource cost is a limitation, traditional machine learning algorithms can also be investigated.