1 Introduction

In recent years, image captioning has received considerable attention. It involves observing the contents of an image and then describing them, and it has a broad range of application scenarios. Research in Natural Language Processing (NLP) and Computer Vision (CV) is advancing rapidly, and larger datasets for generating text from images and videos have become available, allowing deep neural network-based methods to achieve increasingly accurate results on image captioning. The task involves capturing an image, analyzing its contents, recognizing its most important features, and then generating a textual description from them. Deep learning algorithms have shown better results in handling many of the complex challenges of image captioning [1]. Image captioning can be categorized into three approaches: retrieval-based, template-based, and novel caption generation. A retrieval-based approach captions an image by selecting from a collection of already existing captions [2]. In the template-based approach, captions are generated from templates: a set of visual concepts is identified first and then connected through a sentence template to compose a sentence, as used by [3]. Novel caption generation, on the other hand, generates captions of an image from the visual space as well as from a multimodal space.

This paper starts with a discussion of different image captioning methods, categorized into two frameworks, in Sect. 1.1. Section 1.1.1 discusses the encoder-decoder framework along with five methods under it; the methods in [4,5,6,7,8] are based on an encoder-decoder architecture to generate a caption. Similarly, Sect. 1.1.2 discusses compositional architecture-based image captioning and five further methods; the methods in [9,10,11,12,13] are based on this second type of framework, where captions are generated by extracting components from relevant captions and later combining them to describe the image. Section 1.2 summarizes the various deep learning-based image captioning methods under the two frameworks.

1.1 Image Captioning Methods

Among the various methods based on deep learning, this paper considers the framework used to build a model that can generate a caption for, or describe, a given image, trained and tested on some of the benchmark datasets. The architectures considered are the encoder-decoder-based framework and the compositional-based framework.

1.1.1 Encoder-Decoder Framework

  1. (a)

    Encoder-Decoder pipeline: The main idea of this method is adopted from the neural machine translation concept given by [14], where a sentence is translated from one language into another; here the input is an image and the output is a sentence, as illustrated in Fig. 1 [4].

    Fig. 1 The encoder-decoder method proposed by [4]

Working

It contains two stages: an encoder and a decoder. First, the encoder builds a joint multimodal space in which images are ranked together with their descriptions. The encoder encodes the sentences using an LSTM model, following the machine translation idea [15], while image features are embedded using a CNN. The encoder minimizes a pairwise ranking loss, which helps it learn to rank images together with their descriptions. In the second stage, the method uses this multimodal representation to generate novel descriptions. The decoder uses a new type of neural network-based language method, named the Structure-Content Neural Language Method by [4], to generate novel descriptions.
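As a rough illustration of the encoder stage, the sketch below computes a max-margin pairwise ranking loss over image and sentence vectors that are assumed to have already been projected into the joint multimodal space. The function name, margin value, and batch construction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.2):
    """Max-margin ranking loss over a batch of matched image/caption
    embeddings (row i of each matrix is a matched pair), both assumed
    to be L2-normalized vectors in the joint multimodal space."""
    scores = img_emb @ cap_emb.T            # cosine similarities, shape (B, B)
    diag = np.diag(scores)                  # similarity of each matched pair
    # cost of ranking a mismatched caption above the matched one, and vice versa
    cost_cap = np.maximum(0, margin + scores - diag[:, None])
    cost_img = np.maximum(0, margin + scores - diag[None, :])
    np.fill_diagonal(cost_cap, 0)
    np.fill_diagonal(cost_img, 0)
    return cost_cap.sum() + cost_img.sum()

# toy usage: 4 matched image/caption pairs in a 10-d joint space
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 10)); img /= np.linalg.norm(img, axis=1, keepdims=True)
cap = rng.normal(size=(4, 10)); cap /= np.linalg.norm(cap, axis=1, keepdims=True)
print(pairwise_ranking_loss(img, cap))
```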

  1. (b)

    Neural Image Caption (NIC) Generator

This method is proposed by [5]; it uses a CNN as an encoder for image representation and an RNN as a decoder for generating captions of an image, as shown in Fig. 2. The encoder follows a novel approach in which the last hidden layer of the CNN is fed as input to the decoder [16].

Fig. 2 Neural image caption generator [5]

Working

Following the machine translation analogy, the encoder translates a variable-length input into a fixed-dimensional vector [5], and the decoder converts this representation into the required output, the description. The probability of the correct caption is maximized using Eq. 1 [17], where \(I\) is an image and \(S\) is a sentence whose length is unbounded.

$$ \theta^{*} = \mathop {\arg \max }\limits_{\theta } \mathop \sum \limits_{{\left( {I,S} \right)}} \log p\left( {S|I;\theta } \right) $$
(1)

Sampling is one of the approaches used in [17]: the first word is sampled according to p1, its embedding is supplied as input to sample p2, and so on, until the end-of-sentence token is sampled or a maximum length is reached. The second approach uses a search technique called Beam Search, which iteratively keeps the k best partial descriptions up to time t.
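The beam search described above can be sketched as follows; `step_logprobs` is a hypothetical stand-in for the trained decoder, returning log-probabilities of the next word given the image features and a partial caption.

```python
import numpy as np

def beam_search(step_logprobs, image_feat, k=3, max_len=20, bos_id=0, eos_id=1):
    """Keep the k best partial captions (log-prob, token list) at each step."""
    beams = [(0.0, [bos_id])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            logprobs = step_logprobs(image_feat, tokens)   # shape: (vocab_size,)
            top_words = np.argsort(logprobs)[-k:]          # k most likely next words
            for w in top_words:
                candidates.append((logp + logprobs[w], tokens + [int(w)]))
        # keep the k best expansions, moving completed captions aside
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, tokens in candidates[:k]:
            (finished if tokens[-1] == eos_id else beams).append((logp, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]
```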

  1. (c)

    gLSTM

This method is an extension of LSTM proposed by Jia et al. [6]. The authors used a concept known as guided LSTM (gLSTM) [6], which can create long sentence-form descriptions by adding global semantic information. This information is added to each of the LSTM's gates and cells, as shown in Fig. 3. The method also considers various normalization strategies to manage the caption length.

Fig. 3 gLSTM network proposed by [6]

Working

First, Cross-Modal Retrieval (CMR) is used to extract semantic information from the image for describing it; a multimodal embedding space can also be used to extract this information. Second, the semantic information is added to the computation of each gate and of the cell state. The information thus obtained jointly from the images and their descriptions serves as a guide in the procedure of generating the word sequence. In the LSTM method, the generation of a word generally depends on the embedded word at the current time step and the previous hidden state.

$$ i_{l}^{\prime } = \sigma \left( {W_{ix} x_{l} + W_{im} m_{l - 1} + W_{iq} g} \right) $$
(2)
$$ f_{l}^{\prime } = \sigma \left( {W_{fx} x_{l} + W_{fm} m_{l - 1} + W_{fq} g} \right)$$
(3)
$$ o_{l}^{\prime } = \sigma \left( {W_{ox} x_{l} + W_{om} m_{l - 1} + W_{oq} g} \right)$$
(4)
$$ c_{l}^{\prime } = f_{l}^{\prime } \odot c_{l - 1}^{\prime } + i_{l}^{\prime } \odot h\left( {W_{cx} x_{l} + W_{cm} m_{l - 1} + W_{cq} g} \right)$$
(5)
$$ m_{l} = o_{l}^{\prime } \odot c_{l}^{\prime } $$
(6)

In the above equations from [6], the vector representation of the semantic information is denoted by \(g\), \(\odot\) denotes element-wise multiplication, \(\sigma\) is the sigmoid function, and \(h\) is the hyperbolic tangent function. The variables \(i_{l}\), \(f_{l}\), \(o_{l}\), \(c_{l}\), and \(m_{l}\) denote the input gate, forget gate, output gate, memory cell, and hidden state, respectively.
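A minimal NumPy sketch of one gLSTM step, following Eqs. (2)–(6) with the recurrence \(c_{l}^{\prime } = f_{l}^{\prime } \odot c_{l-1}^{\prime } + \ldots\); the weight shapes, dictionary keys, and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glstm_step(x, m_prev, c_prev, g, W):
    """One gLSTM step (Eqs. 2-6): x is the current word embedding, m_prev the
    previous hidden state, c_prev the previous cell state, and g the global
    semantic guidance vector added to every gate."""
    i = sigmoid(W["ix"] @ x + W["im"] @ m_prev + W["ig"] @ g)   # input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m_prev + W["fg"] @ g)   # forget gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m_prev + W["og"] @ g)   # output gate
    c = f * c_prev + i * np.tanh(W["cx"] @ x + W["cm"] @ m_prev + W["cg"] @ g)
    m = o * c                                                   # hidden state (Eq. 6)
    return m, c

# toy usage with random weights: embedding/guide size 8, hidden size 6
rng = np.random.default_rng(0)
d_x, d_h = 8, 6
W = {k: rng.normal(scale=0.1, size=(d_h, d_x if k[1] in "xg" else d_h))
     for k in ["ix", "im", "ig", "fx", "fm", "fg", "ox", "om", "og", "cx", "cm", "cg"]}
m, c = glstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x), W)
```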

  1. (d)

    Referring expression

Mao et al. [7] proposed a new method based on referring expressions. A referring expression is a unique description of a particular object or region in a given image, as shown in Fig. 4, and the expression can in turn be interpreted to infer which object is being described [7]. This formulation has a well-defined performance metric and yields descriptions that are more detailed, and therefore more useful, than generic image captions.

Fig. 4 Illustration of referring expression proposed by [7]

Working

This method considers two tasks: description generation and description comprehension. In the first task, a text expression is generated that uniquely identifies a highlighted object or an emphasized region in the image. In the second task, given an expression, the method automatically selects the object that the expression refers to. Like other image captioning methods, it uses a CNN model to represent the image, followed by an LSTM, and it also computes features for the whole image to serve as context [7]. The experiments were carried out on a novel dataset built on the ReferIt dataset [18].
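The comprehension task can be sketched as picking the candidate region whose features maximize the model's probability of the given expression; `expression_logprob` is a hypothetical scoring function standing in for the trained CNN–LSTM model.

```python
def comprehend(expression_logprob, expression, image, candidate_regions):
    """Description comprehension: return the candidate region (e.g. a bounding
    box) to which the model assigns the highest probability for the expression,
    given the region features and the whole-image context."""
    return max(candidate_regions,
               key=lambda region: expression_logprob(expression, region, image))
```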

  1. (e)

    Variational Auto Encoder (VAE)

This method is proposed by [8] and uses a semi-supervised learning technique. The encoder is a deep CNN, and a Deep Generative Deconvolutional Neural Network (DGDN) serves as the decoder. The framework even allows unsupervised CNN learning based on images alone [8].

Working

The CNN is used as the image encoder for captioning, while a recognition method is learned for the DGDN decoder, which decodes the latent image features [8]. The encoder provides an approximate distribution over the latent DGDN features, which is then linked to generative methods for labels or captions: a Bayesian Support Vector Machine generates the labels of an image, and an RNN produces the captions. To generate a label or a caption for a new image, the method averages across the distribution of latent codes.
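To illustrate the last step, the sketch below averages decoder predictions over latent codes sampled from the encoder's approximate posterior; `encode` and `decode_probs` are hypothetical placeholders for the CNN encoder and the caption/label decoder, and the Gaussian posterior is an assumption for illustration.

```python
import numpy as np

def predict_by_averaging(image, encode, decode_probs, n_samples=50, rng=None):
    """Approximate E_q(z|image)[p(label_or_caption | z)] by Monte Carlo:
    sample latent codes from the approximate posterior and average the
    decoder's predictive distributions."""
    rng = rng or np.random.default_rng()
    mu, sigma = encode(image)                  # parameters of q(z | image)
    probs = [decode_probs(mu + sigma * rng.standard_normal(mu.shape))
             for _ in range(n_samples)]
    return np.mean(probs, axis=0)
```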

1.1.2 Compositional-Based Framework

The second type of architecture is mainly composed of several independent functional components [1]. This approach uses a CNN to extract visual concepts from an image, which a language method then turns into a caption, as illustrated in Fig. 5.

Fig. 5 Compositional-based framework [1]

This framework performs the following steps (a minimal code sketch follows the list):

  1. (i)

    Extract unique visual features from the image.

  2. (ii)

    Derive visual attributes from the extracted features.

  3. (iii)

    Use the visual features and the visual attributes in a language method to generate probable captions.

  4. (iv)

    Rank the probable captions using a deep multimodal similarity method to determine the most suitable caption.
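A minimal sketch of these four steps, with hypothetical component functions standing in for the trained feature extractor, attribute detector, language method, and similarity method:

```python
def compositional_caption(image, extract_features, detect_attributes,
                          generate_candidates, multimodal_similarity,
                          n_candidates=500):
    """Compositional pipeline: features -> attributes -> candidates -> ranking."""
    features = extract_features(image)                    # step (i)
    attributes = detect_attributes(features)              # step (ii): visual attributes
    candidates = generate_candidates(features, attributes, n_candidates)  # step (iii)
    # step (iv): rank candidates by their similarity to the image in a joint space
    return max(candidates, key=lambda caption: multimodal_similarity(image, caption))
```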

  1. (a)

    Generation-based image captioning from sample

This method is proposed by [9] and is composed of several components: (i) visual detectors, (ii) a language method, and (iii) a multimodal similarity method, which are trained on an image captioning dataset.

Working

Multiple Instance Learning (MIL) [19] is used to train visual detectors for words that commonly occur in captions, covering several parts of speech such as nouns, verbs, and adjectives. The method considers image sub-regions rather than the complete image, and the features extracted from the sub-regions are matched with the words likely to appear in the image captions. The outputs of the word detectors act as conditional inputs to a maximum-entropy language method. Finally, the captions are re-ranked using sentence-level features and a deep multimodal similarity method to capture the semantic information of the image [9].
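A common MIL formulation aggregates per-region word probabilities with a noisy-OR, as sketched below; this is an illustrative assumption about how a word detector scores an image from its sub-regions, not necessarily the exact formulation in [9].

```python
import numpy as np

def noisy_or_word_probability(region_probs):
    """Noisy-OR MIL: the image contains the word if at least one of its
    sub-regions does; region_probs[j] = p(word | region j)."""
    return 1.0 - np.prod(1.0 - np.asarray(region_probs))

# toy usage: three sub-regions, one fairly confident detection
print(noisy_or_word_probability([0.05, 0.8, 0.1]))   # approx. 0.829
```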

  1. (b)

    Generation of descriptions in the wild

This method is introduced by [10] and automatically describes images in the wild. It uses a compositional framework like [9]; in [10], the image captioning system is built from different components which are trained independently and later combined in the main structure shown in Fig. 6.

Fig. 6 Illustration of image caption pipeline [10]

Working

In this method, a deep residual network-based vision method identifies a comprehensive set of visual concepts, while an entity recognition method identifies celebrities and landmarks. A language method generates candidate captions, a deep multimodal semantic method ranks them, and a classifier estimates a confidence score for each output caption [10].

  1. (c)

    Generation of descriptions with structural words

A compositional network-based image captioning method is proposed by [11]. This method follows a structural word format: <object, attribute, activity, scene> [11]. These structural words are generated using a multi-task method comparable to MIL [19] and are used to produce semantically meaningful descriptions. An LSTM-based machine translation method [20] is then used to translate the structural words into image captions.

Working

Figure 7 shows the two-stage framework, which identifies structural words and then generates descriptions from the image. The structural word sequence <objects, attributes, activities, scene> is identified in the first stage. In the second stage, a deep RNN translates the word sequence recognized in the first stage into image captions that carry comparatively richer information.

Fig. 7 Stages of generating sentence [11]

  1. (d)

    Parallel-fusion RNN-LSTM architecture

Wang et al. [12] proposed a method based on deep convolutional networks and recurrent neural networks, shown in Fig. 8. The main idea is to combine the benefits of RNNs and LSTMs, which decreases complexity and increases performance. The RNN hidden units are composed of several equal-dimension components that work in parallel, and their outputs are merged with corresponding ratios to generate the final output.

Fig. 8 Method proposed by [12]

Working

This method follows the following strategies:

  1. 1.

    Split the hidden layer into two parts, which remain uncorrelated until the output unit.

  2. 2.

    Feed identical feature vectors from the source data to both hidden layers, along with the feedback outputs of the respective hidden layers from the previous time step.

  3. 3.

    Send the generated outputs of the RNN units to the \(y_t\) component of the overall output module.

    $$ h_{{1_{t} }} = \max \left( {W_{hx1} x_{t} + W_{hh1} h_{{1_{t - 1} }} + b_{h1} ,0} \right)$$
    (7)
    $$ h_{{2_{t} }} = \max \left( {W_{hx2} x_{t} + W_{hh2} h_{{2_{t - 1} }} + b_{h2} ,0} \right) $$
    (8)
    $$ y_{t} = {\text{softmax}} \left( {r_{1} W_{d1} h_{{1_{t} }} + r_{2} W_{d2} h_{{2_{t} }} + b_{d} } \right)$$
    (9)
    $$ dy_{1} = r_{1} \times dy $$
    (10)
    $$ dy_{2} = r_{2} \times dy $$
    (11)

As per [12], \(h_{1}\) and \(h_{2}\) are the hidden units; \(W_{hx1}\), \(W_{hx2}\), \(W_{hh1}\), \(W_{hh2}\), \(b_{h1}\), \(b_{h2}\), \(W_{d1}\), \(W_{d2}\), and \(b_{d}\) are the weight parameters; \(dy\) is the softmax derivative matrix; and \(r_{1}\), \(r_{2}\) are the fusion ratios.
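A NumPy sketch of one forward step of the parallel-fusion unit, following Eqs. (7)–(9); the \(\max(\cdot, 0)\) nonlinearity and fusion ratios come from the equations, while the layer sizes, dictionary keys, and random initialization are assumptions.

```python
import numpy as np

def parallel_fusion_step(x_t, h1_prev, h2_prev, P, r1=0.5, r2=0.5):
    """One time step: two uncorrelated hidden parts updated from the same
    input (Eqs. 7-8), then fused with ratios r1, r2 into a softmax output (Eq. 9)."""
    h1 = np.maximum(P["Whx1"] @ x_t + P["Whh1"] @ h1_prev + P["bh1"], 0)
    h2 = np.maximum(P["Whx2"] @ x_t + P["Whh2"] @ h2_prev + P["bh2"], 0)
    logits = r1 * (P["Wd1"] @ h1) + r2 * (P["Wd2"] @ h2) + P["bd"]
    y = np.exp(logits - logits.max()); y /= y.sum()        # softmax
    return y, h1, h2

# toy usage: input size 5, hidden size 4, vocabulary size 7
rng = np.random.default_rng(0)
P = {"Whx1": rng.normal(size=(4, 5)), "Whx2": rng.normal(size=(4, 5)),
     "Whh1": rng.normal(size=(4, 4)), "Whh2": rng.normal(size=(4, 4)),
     "bh1": np.zeros(4), "bh2": np.zeros(4),
     "Wd1": rng.normal(size=(7, 4)), "Wd2": rng.normal(size=(7, 4)), "bd": np.zeros(7)}
y, h1, h2 = parallel_fusion_step(rng.normal(size=5), np.zeros(4), np.zeros(4), P)
```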

  1. (e)

    Fusion-based Recurrent Multimodal (FRMM) method

This method, proposed by [13], introduces an end-to-end trainable Fusion-based Recurrent MultiModal (FRMM) method that can address multimodal applications while allowing each input modality to be independent with respect to architecture, parameters, and input sequence length, as shown in Fig. 9.

Fig. 9 Method proposed by [13]

Working

The method has separate stages whose outputs are mapped to a common description space so that they can be associated with one another during the fusion stage, which predicts the outputs based on this association. Supervised learning occurs in each stage. Figure 9 illustrates how the FRMM method works, taking a video description method as an example; the FRMM method learns the behavior in separate stages.
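A rough sketch of this idea: each modality stage maps its own input into the common description space, and the fusion stage predicts from the associated stage outputs. The stage and fusion functions are hypothetical placeholders for the trained sub-networks, assumed here only for illustration.

```python
def frmm_predict(inputs_by_modality, stage_models, fusion_model):
    """Fusion-based recurrent multimodal prediction: each stage processes its
    own modality independently (its own architecture, parameters, and sequence
    length), maps it into a common description space, and the fusion stage
    predicts the output from the associated stage outputs."""
    common = [stage(inputs_by_modality[name]) for name, stage in stage_models.items()]
    return fusion_model(common)
```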

1.2 Summary

Tables 1 and 2 show the different image captioning methods using deep neural algorithms. The tables are divided into two parts based on the framework used to generate a caption for an image, experimented on datasets such as Flickr8K, Flickr30K, PASCAL, UIUC PASCAL, and MSCOCO, considering only the BLEU and Recall@k evaluation metrics. The first category, the encoder-decoder framework, generates captions and is inspired by the concept of translating sentences from one language into another; under this framework, the authors of [4,5,6,7,8] have successfully generated captions of images. The second category is the compositional architecture used for image captioning by [9,10,11,12,13].

Table 1 Summary of generating image caption based on encoder-decoder framework on the dataset using evaluation metric
Table 2 Summary of generating image caption based on compositional-based framework on the dataset

In Table 1, among the methods that follow the encoder-decoder framework, the variational autoencoder [8] shows the highest BLEU-1 value of the methods evaluated on the MSCOCO dataset. In Table 2, among the compositional-based architectures, FRMM [13] has the highest BLEU-1 value on the MSCOCO dataset.

1.3 Conclusion

This paper carried out a comprehensive survey of image captioning methods based on deep learning, organized around two frameworks: the encoder-decoder framework and the compositional architecture. The encoder-decoder framework is used to generate captions from images: it first encodes an image into an intermediate representation and then generates a sentence word by word from that representation using the decoder. Compositional image captioning instead uses a method to detect concepts that visually appear in the input image; the detected concepts are then forwarded to the language method to generate various candidate captions, from which one probable caption is chosen as the final caption or description for the given input image.