
1 Introduction

Automatic image captioning is the process of generating natural language captions for images without human intervention. The area is garnering attention from researchers because of the huge volume of unorganized multimedia data pouring in every second. It is a step ahead of automatic image tagging, where images are labeled with keywords relevant to their contents. Several definitions have been proposed. In [1], the authors define automatic image captioning as the process by which a computer system automatically assigns metadata, in the form of captions or keywords, to a digital image. Mathews et al. [12] define it as automatically describing the objects, people and scene in an image. Wang et al. [21] define it as recognizing the visual objects in an image and the semantic interactions between them, and translating that visual understanding into sensible sentence descriptions. Liu et al. [22] add that the generated language must be grammatically error-free and fluent. To sum up, image captioning can be defined as automatically generating short descriptions that represent the contents of an image (objects, scene and their interactions) in human-like language.

Automatic image captioning is viewed as an amalgamation of computer vision and natural language processing. The computer vision part recognizes the contents of an image, and the natural language processing part converts that recognition into sentences. Research has flourished in both fields: computer vision researchers try to understand the image better, while natural language processing researchers try to express it better. Because of this integration, automatic image captioning has emerged as a prominent field in artificial intelligence.

1.1 Applications

Automatic image captioning is an interesting area because of its applications in various fields. It can be used in image retrieval systems to organize and locate images of interest in a database, and it is similarly useful for video retrieval. It can aid the development of tools that help visually impaired individuals access pictorial information, and it finds application in query-response interfaces. Journalists can use it to find and caption images related to their articles. Human-machine interaction systems can also employ its results, for example to locate images verbally. Further uses include military intelligence generation, surveillance systems, goods annotation in warehouses and self-aware systems (Fig. 1).

Fig. 1. Example of automatically captioned images [3].

1.2 Scene Analysis

Scene analysis is a module in automatic image captioning that has gained importance recently. In image captioning, the output is generally the main object in the image, with no regard for the background. Ignoring the background makes the description of the image vague and unclear.

Consider an image in which a person is standing in front of a river, and another in which the background is a desert. If the focus is only on the object, both images will be captioned as a person. If the background scene is taken into consideration, the first image may be captioned as a person standing in front of a river and the second as a person in a desert. Suppose a journalist wants a sample image for her article and queries the image database with the keywords person, river. With object-only annotation both images will be retrieved, whereas with scene-aware annotation only the first image will be retrieved. Thus, scene analysis is very important for proper image captioning, which in turn leads to better image retrieval results (Figs. 2 and 3).
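The retrieval scenario above can be made concrete with a minimal sketch. Captions are reduced to keyword sets, and a query is answered by returning the images whose captions match the most query terms. The image identifiers and keywords are made up for illustration only.

```python
# Toy illustration of how scene keywords sharpen retrieval.
# Object-only captions cannot distinguish the two images;
# scene-aware captions can.
captions_object_only = {"img1": {"person"}, "img2": {"person"}}
captions_with_scene = {"img1": {"person", "river"},
                       "img2": {"person", "desert"}}

def retrieve(index, query):
    """Return the image ids whose keyword sets best match the query."""
    scores = {img: len(query & kws) for img, kws in index.items()}
    best = max(scores.values())
    return sorted(img for img, s in scores.items() if s == best and s > 0)

query = {"person", "river"}
print(retrieve(captions_object_only, query))  # ['img1', 'img2']
print(retrieve(captions_with_scene, query))   # ['img1']
```

With object-only captions both images score equally on the query, so both are returned; with scene keywords the first image matches more terms and is returned alone, mirroring the journalist example.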

Fig. 2. Without scene analysis: a person. With scene analysis: a person in front of a river. (Sample image taken from the internet.)

Fig. 3. Without scene analysis: a person. With scene analysis: a person in a desert. (Sample image taken from the internet.)

For scene analysis, the image needs to be broken down into segments. This brings in another image processing field: image segmentation. Various segmentation techniques exist, and new ones continue to emerge, aimed at segmenting images so that the machine understands them better and can generate better captions. Scene analysis also involves object recognition, which is itself a very broad research area.
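As a minimal sketch of what segmentation means here, the following splits a grayscale image into a foreground/background mask using a simple intensity threshold. The synthetic image and mean-based threshold are illustrative stand-ins; practical systems use far richer methods (e.g. Otsu thresholding, region growing, or learned segmentation).

```python
import numpy as np

def threshold_segment(gray, thresh=None):
    """Split a grayscale image into a binary foreground/background mask.

    If no threshold is given, the mean intensity is used - a crude
    stand-in for data-driven threshold selection.
    """
    if thresh is None:
        thresh = gray.mean()
    return gray > thresh

# A synthetic 6x6 "image": a bright 2x2 square on a dark background.
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0
mask = threshold_segment(img)
print(mask.sum())  # 4 foreground pixels
```

Each connected region of the mask would then be handed to the object recognition stage as a candidate segment.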

Object detection can be enhanced by adding contextual information, and scene analysis provides that context. As the number of scene types is finite, scene analysis can be treated as a scene classification problem. Since objects are related to scenes, the probability distribution of each object differs across scenes. Convolutional neural networks have, for example, been trained on millions of images of the Places dataset to predict over 200 scene types.
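The idea that scene context re-weights object hypotheses can be sketched with a toy re-ranking step. All numbers below are invented for illustration: a detector is unsure whether a region is a boat or a car, and an assumed scene-conditioned prior P(object | scene) resolves the ambiguity.

```python
# Hypothetical detector confidences for one ambiguous region.
detector_scores = {"boat": 0.45, "car": 0.55}

# Assumed priors P(object | scene); illustrative values only.
object_given_scene = {
    "river":  {"boat": 0.7, "car": 0.1},
    "street": {"boat": 0.05, "car": 0.8},
}

def rerank(scores, scene):
    """Weight detector scores by the scene-conditioned object prior,
    then renormalize so the results sum to 1."""
    prior = object_given_scene[scene]
    combined = {obj: s * prior.get(obj, 0.0) for obj, s in scores.items()}
    total = sum(combined.values())
    return {obj: v / total for obj, v in combined.items()}

print(rerank(detector_scores, "river"))   # "boat" now dominates
print(rerank(detector_scores, "street"))  # "car" now dominates
```

The same raw detections yield opposite decisions depending on the predicted scene, which is precisely why scene classification helps object detection.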

In a nutshell, scene analysis of an image is very important. Without this there is no scope for meaningful captioning of images.

2 Related Works

A lot of research has been done in the field of automatic image captioning. The procedure by which machines generate image captions follows a common framework, which is discussed below.

2.1 Framework

On the whole, the entire procedure can be subdivided into two parts: image processing and language processing. The image processing part includes image segmentation, feature extraction and classification; feature extraction and classification can together be referred to as object recognition.

After object recognition, we obtain keywords corresponding to the objects identified in the image. These keywords are then fed to the language processing unit, which forms meaningful captions.

Each of the three modules is independent and can be researched individually; the technique applied in one module does not constrain the technique used in another. This is beneficial, as each module can be studied and analyzed in isolation (Fig. 4).

Fig. 4. Steps in automatic image captioning
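The three-stage framework above can be sketched as a pipeline of stubbed modules. Every function body here is a placeholder: a real system plugs a concrete technique (CNN features, a trained classifier, a language model) into each stage without affecting the others, which is exactly the modularity the framework provides.

```python
# A skeletal view of the captioning pipeline: segmentation ->
# object recognition -> language generation. All region names,
# keywords and the template sentence are illustrative stubs.

def segment(image):
    """Image processing: split the image into candidate regions."""
    return ["region_1", "region_2"]  # placeholder regions

def recognize(regions):
    """Feature extraction + classification: regions -> object keywords."""
    lookup = {"region_1": "person", "region_2": "river"}
    return [lookup[r] for r in regions]

def generate_caption(keywords):
    """Language processing: keywords -> a sentence."""
    return "a " + " in front of a ".join(keywords)

caption = generate_caption(recognize(segment("photo.jpg")))
print(caption)  # a person in front of a river
```

Swapping in a better `segment` or `recognize` implementation changes the keywords but leaves the interface between stages untouched.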

2.2 Approaches

For segmentation and recognition, various techniques can be used: supervised learning, unsupervised learning, neural networks or probabilistic methods (Fig. 5).

Fig. 5. Various approaches applied in the image processing part of automatic image captioning

Table 1. A comparative study of different works in automatic image captioning

3 Comparative Study

See Table 1.

4 Issues and Challenges

A number of open research issues and challenges have been identified in this field. A few of them are listed below:

  1. Large collections of digital images exist without annotations.

  2. The quality and quantity of the training set is an important factor in determining the quality of the generated captions.

  3. Images with low resolution, low contrast or complex backgrounds, and text with multiple orientations, styles, colors and alignments, increase the complexity of image understanding.

  4. The training set must include as much variety as possible.

  5. Searching for the optimal method for each module is very expensive, and the choice has a major effect on the performance of the overall system.

  6. Capturing sentiment in the captions is a major challenge, as few datasets include sentiment-based annotations.

  7. Few datasets provide captions in multiple languages, and machine translation results are not always relevant.

5 Conclusion and Future Work

Automatic image captioning is an emerging area in artificial intelligence and computer vision, with real-life applications in various fields. It is an ensemble of several modules, each of which opens avenues for exploration. Better captions can be generated with proper segmentation, and descriptions can be enhanced through sentiment addition, activity recognition, background identification and scene analysis. Deep learning can be explored further for faster and more accurate results; where hardware resource cost is a limitation, traditional machine learning algorithms can also be investigated.