
1 Introduction

Recent advances in deep learning, especially convolutional neural networks (CNNs), have motivated researchers to address the problem of image captioning. Image description generation is a demanding artificial intelligence problem in which a textual description must be produced for a given image. It requires both methods from computer vision to comprehend the content of the image and a language model from natural language processing to turn that understanding into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on this problem.

Describing images automatically has a wide range of applications and covers a variety of tasks. Autonomously generating semantically rich descriptions from image pixels has several benefits. A huge number of unlabeled images is uploaded every day, and to make the most of this data it must be annotated; earlier this was done manually by humans describing the images, which is very time-consuming and tedious for large databases. Automatic description generation is carried out by e-commerce companies, which leverage its capability to generate accurate descriptions of listed product images. The technology also contributes to image annotation for visually impaired people: it can assist people with impaired eyesight by providing descriptions of what is around them. To achieve this level of image caption generation, the system operates in stages, where the image is first converted into text and this text is finally read out by the system. In addition, various social media companies use image captioning technology to generate meaningful text for the images uploaded by their users, which helps their models refine themselves and learn to detect novel patterns and objects.

The capability to generate image descriptions autonomously can be of great importance in many fields and companies that extract information from videos for editing, subtitling, or summarization. Doing such a task manually can be very tedious; generating sentences from the ongoing scene in a video can be a boon for many video streaming sites and news websites that need headlines or descriptions of a viral video.

A human can easily comprehend the context behind an image and convey that information in the form of sentences, while a computer must combine two major fields of artificial intelligence, computer vision and natural language processing, to complete this task successfully. The challenge in computer-based language generation is to design an algorithm that can discern knowledge of the scene, colors, and objects from the image. The computer-generated captions should be error-free, remain coherent with the main subject, describe the state of the objects, and use a varied vocabulary to reduce redundancy.

To tackle the above-mentioned challenges, many models have been proposed and quite promising results have been achieved in the past few years. Work in the image captioning field can be categorized into three approaches, listed below.

  • A: Retrieval-based approach.

  • B: Template-based approach.

  • C: Deep-learning-based approach.

Primarily, image captioning was explored using retrieval-based approaches, and a plethora of early image captioning models rely on them. In this approach, the input image is compared against a large dataset in feature space by evaluating a Markov Random Field [1], and the caption is copied from the image with the least semantic distance. Ordonez et al. 2011 [2] obtain the caption for a query image by finding its nearest visual match in the dataset and transferring that image's caption to the input. A year later, Kuznetsova et al. 2012 [3] presented a more refined model using a slightly different method to rank the dataset images against the query image: they introduced visual context and scene extraction and performed a separate retrieval process for every entity or scene detected in the input image. Mason and Charniak 2014 [4] employed probability density estimation for language modeling to accurately describe what is in the image; this method outperforms many state-of-the-art methods by tackling the issue of grainy feature maps and remains effective even where entity detection systems perform poorly. The main issue with retrieval-based approaches is the need for a rich and varied image collection containing a close match to the query image; this constraint leads to vague and incomplete sentences when small datasets are used [5]. Due to this explicit and inevitable disadvantage of retrieval-based approaches, other researchers turned to template-based approaches [6,7,8]. In the template-based method, the image is visually inspected and the caption is generated by selecting words according to a predefined template, that is, a sentence with a number of fixed empty slots to be filled in order to produce a description of the image. Yang et al. [6] use this approach and find the most likely words to fill a specific template (scene, noun, verb, and preposition) using a hidden Markov model. Kulkarni et al. [7] employ a Conditional Random Field (CRF) to generate image features that highlight attributes, objects, and the spatial relationships between objects; at the last step, the caption template is filled according to this content. Similarly, Li et al. [8] use context information from the image to define a triplet template format (adjective, preposition, adjective). This approach of producing descriptions from a fixed template cannot yield variable-length descriptions and therefore loses information on a number of occasions. In some respects the retrieval-based approach is better than the template-based approach, but when it comes to grammatically correct sentences the template-based method performs fairly well. Motivated by advances in deep learning, deep models have replaced these early shallow models; the novel deep learning frameworks proposed by researchers for image captioning in numerous fields are discussed in detail in the next section.

The aim of this paper is to present a model that can accurately and concisely describe what is in an image. To do so, we design an encoder-decoder framework containing CNN and recurrent (LSTM/GRU) units. We extensively test combinations of high-level feature extractors (VGGNet, ResNet50V2, InceptionV3, and Xception) and language models (LSTM and GRU), and we find the best encoder and decoder modules by comparing the different seq2seq combinations against BLEU and CIDEr scores. Along with this, the paper provides a summary of image captioning approaches, from retrieval-based methods to recent ones.

2 Related Work

In the past few years, visual caption generation has garnered huge attention from researchers due to the huge volume of image data generated every day. Automatic image captioning is well suited to the task of labeling unseen data at large scale. In this section, we discuss automatic image captioning using the most recent (deep-learning-based) methods in detail.

In a related paper, the authors of [9] attempt to solve the image captioning problem without human interaction using an encoder-decoder architecture. A BLEU-4 score as high as 53.5 was achieved using a CNN feature extractor consisting of a pre-trained InceptionResNetV2 model as the encoder and a GRU as the decoder; the proposed model was compared with different architectures on the basis of BLEU-4 and METEOR scores. H. Chen et al. [10] explored the application of encoder-decoder models to crack detection and improved accuracy by using a switch between the encoder and decoder modules. Wang et al. [11] applied the seq2seq model to long-term traffic prediction; the experiments showed improved results using a hard attention layer, with the model learning long-time-step patterns more accurately, and the use of LSTM significantly reduced the effects of vanishing and exploding gradients during the learning phase.

Wang et al. [12] presented a model using bidirectional LSTM for image description. The model employs a CNN and two different LSTM networks, which allows it to produce more contextually rich sentences. Rennie et al. [13] optimized the image captioning task using novel self-critical sequence training (SCST), with quite commendable results on the MS COCO dataset; the baseline structure of their model contains ResNet50V2 and LSTM as the encoder-decoder modules. Aneja et al. [14] proposed a convolutional (CNN-based) language model for text generation to address the vanishing gradient issue in LSTM units; the model performed on par with the CNN-LSTM combination, and a significant improvement in training time was additionally recorded. Kiros et al. [15] used multimodal learning, in which the model is taught using images and text together, with AlexNet for image understanding; the proposed method removes the need for templates, structures, or constraints and instead uses a state-of-the-art feature extractor (AlexNet). Yao et al. [16] described a novel architecture based on GoogleNet incorporated with high-level attributes (LSTM-A). The authors predict the attributes and insert them into the LSTM, testing different LSTM-A variants to find the optimal node for insertion. This architecture extracts better semantic relationships and hence generates more contextually accurate descriptions.

The authors of [17] discussed an autonomous description generator that uses a manifold framework to detect human-object pairs. The proposed method uses a hybrid deep learning approach to gain insight from the image and outline the human-object pairs, and, combined with a probabilistic language model, it outperformed others on many benchmark datasets. Since captioning an image is not an easy task, scholars have also recommended a framework based on adversarial networks and reinforcement learning [18] to rectify the issue of bias in caption generation; the proposed model, trained on the COCO dataset, showed significant improvement in evaluation.

The authors of [19] applied image description generation to the medical domain: the deployed model generates descriptions of chest X-ray images using adversarial reinforcement learning and outstrips many previously employed methods. The proposed architecture comprises three main components: an encoder, a decoder, and a reward module. A VGG16 model was used for feature extraction in the encoder, whereas an LSTM with an attention mechanism was used in the decoder.

Paper [20] discussed the implementation of a VGG16-based encoder-decoder network (SegNet) for semantic segmentation, with a Conditional Random Field (CRF) layer added after the basic SegNet model. The model was trained on the CamVid dataset.

Encoder-Decoder Architecture:

The encoder-decoder framework made its appearance around 2014, introduced by Google for machine translation tasks [21]. Before that, tasks in which previous context is necessary to predict the target were handled in a very rudimentary way. A critical advantage of the encoder-decoder model is that the sizes of the input and output sequences can differ. This led researchers to continue working on tasks such as machine translation, text-to-image conversion, and image captioning, where the lengths of input and output are different.

Image captioning is a task in which one sequence (the image) is transformed into another sequence (the caption); therefore, the encoder-decoder framework fits the problem perfectly. Kiros et al. [15] were the first to employ an encoder-decoder architecture for the image captioning task. Similar to their work, Vinyals et al. [22] presented work on generating image descriptions using a CNN as the encoder and an LSTM as the decoder. A sequence-to-sequence (seq2seq) model has two main components: an encoder and a decoder.

The encoder-decoder model takes into consideration not only the current state but also its neighboring states when predicting the target output. The encoder consumes the image as input and produces a feature vector of the image called the context vector. The decoder then takes the context vector and predicts a word for each time step.

In general, an encoder-decoder model for image captioning starts with feature extraction from the input image. This feature map is obtained by employing a Convolutional Neural Network (CNN) as the encoder module. The CNN extracts the features of the input image from a fully connected layer (global features), and these global features are then fed into the decoder. The decoder is responsible for generating text from the input feature vector and can use a Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU) as its basic unit. The majority of encoder-decoder image captioning models use Maximum Likelihood Estimation (MLE) as the learning method, whereas a few use reinforcement learning [13] to reduce the exposure-bias issue; Generative Adversarial Network [23] and contrastive learning [24] methods have also been proposed (Fig. 1).

Fig. 1. Basic encoder-decoder architecture
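To make the general pipeline concrete, the following is a minimal sketch in TensorFlow/Keras of such an encoder-decoder captioning model, with the image feature vector injected as the initial state of the recurrent decoder. The vocabulary size, embedding dimension, and feature size are illustrative placeholders, not the exact configuration used in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000      # placeholder vocabulary size
EMBED_DIM = 200        # word-embedding dimension
UNITS = 512            # decoder state size
FEATURE_DIM = 2048     # size of the CNN global feature vector

# Encoder side: a pre-extracted global feature vector from a CNN.
image_features = layers.Input(shape=(FEATURE_DIM,), name="image_features")
img_embedding = layers.Dense(UNITS, activation="relu")(image_features)

# Decoder side: the caption generated so far (teacher forcing during training).
caption_in = layers.Input(shape=(None,), name="caption_tokens")
word_embedding = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)

# Initialise the recurrent decoder state with the image embedding (context vector).
decoder_out = layers.GRU(UNITS, return_sequences=True)(
    word_embedding, initial_state=img_embedding)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder_out)

model = Model(inputs=[image_features, caption_in], outputs=next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

During training, teacher forcing supplies the ground-truth caption tokens to the decoder input; at inference time the decoder is unrolled one step at a time, feeding each predicted word back until an end-of-sequence token is generated.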

3 Proposed Model

In this section we describe the proposed method and the study procedure followed for each encoder-decoder combination.

Feature Extractor:

To extract features from the input image, a CNN was employed, owing to the success of CNNs in feature extraction. The output of the last fully connected layer gives the global representation of the image; this vector is later fed to the decoder as input. In this paper we consider several different CNNs as encoders, such as VGGNet [25], ResNet50V2 [26], and Inception [27]. The best results were obtained with ResNet50V2, as discussed later in the results. The steps followed to obtain the feature representation vector from each encoder module are shown in Fig. 2.

This paper proposes the ResNet50V2 model as the encoder. It is a 50-layer deep neural network that trains well without suffering from the vanishing gradient problem. ResNet escapes the vanishing gradient problem in deep networks by cleverly implementing skip connections between layers, an idea introduced by ResNet in 2016. Skipping connections between layers helps the model by permitting the gradient to flow through an alternative route. Since then, ResNet-50 architectures have been widely explored in computer vision tasks. In terms of architecture, ResNet50 has over 23 million trainable parameters. The V2 variant is similar to V1 in all respects except that batch normalization and ReLU activation are applied before the convolution operation; ResNet50V2 is thus a modified and more efficient version of ResNet50.
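For illustration, a minimal sketch of this feature-extraction step using the Keras ResNet50V2 application is given below; here global average pooling over the last convolutional block stands in for the fully connected feature described above, and the image file name is a hypothetical example.

```python
import numpy as np
from tensorflow.keras.applications.resnet_v2 import ResNet50V2, preprocess_input
from tensorflow.keras.preprocessing import image

# Load ResNet50V2 without its classification head; global average pooling
# turns the last convolutional feature map into a single 2048-d vector.
encoder = ResNet50V2(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path: str) -> np.ndarray:
    """Return the global feature vector for one image."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(x)               # scale pixels as ResNet50V2 expects
    x = np.expand_dims(x, axis=0)         # add batch dimension
    return encoder.predict(x, verbose=0)[0]   # shape: (2048,)

# Example (hypothetical file name):
# features = extract_features("example.jpg")
```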

Language Model:

Given the input feature vector, the decoder is responsible for generating words. To do so, we tested LSTM [28] and GRU [29] as the decoder, or language model. Recurrent neural networks perform exceptionally well in sequence generation, but due to the major vanishing gradient issue associated with plain RNNs, we tested both LSTM and GRU as the basic unit, and the results were significantly impressive.

GRU: This version of the RNN is newer than the LSTM and was first introduced in 2014. A basic recurrent neural network suffers from a short-term memory problem, meaning that for longer sentences or sequences it forgets earlier data and remembers only the last few steps. This issue was first resolved by storing long-term memory in the cell state of the Long Short-Term Memory (LSTM) cell, whereas the hidden state stores the short-term memory. The GRU is a lighter, modified version of the LSTM that keeps both long- and short-term memory in the hidden state. A typical GRU cell contains two gates: an update gate and a reset gate. The update gate decides how much of the past memory to keep, while the reset gate performs the forget operation on the stored memory. The reset gate takes the previous hidden state \(h_{t-1}\) and the current word \(x_t\) and applies a sigmoid function (\(\sigma_g\)) on top to give the reset gate value \(r_t\), as in Eq. 2. The update gate performs the same kind of weighted operation, although with different weights, producing the value \(z_t\) (Eq. 1). The GRU then uses \(r_t\), \(z_t\), \(h_{t-1}\), and \(x_t\) to compute the new hidden state, which is also the output vector \(h_t\) (Eq. 3).

$$ z_t = \sigma_g\left(W_z x_t + U_z h_{t-1} + b_z\right) $$
(1)
$$ r_t = \sigma_g\left(W_r x_t + U_r h_{t-1} + b_r\right) $$
(2)
$$ h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \hat{h}_t $$
(3)

Here W, U, and b are the parameter matrices and bias vectors, and \(\odot\) denotes the Hadamard (element-wise) product in Eq. 3. The candidate hidden state \(\hat{h}_t\) is obtained in the standard way as \(\hat{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)\).
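To make the gate computations concrete, the following NumPy sketch evaluates one GRU step following Eqs. 1 to 3. The weight shapes and the tanh candidate activation follow the standard GRU formulation and are assumptions for illustration, not values taken from our trained model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU step following Eqs. 1-3; params holds the W_*, U_*, b_* arrays."""
    W_z, U_z, b_z = params["W_z"], params["U_z"], params["b_z"]
    W_r, U_r, b_r = params["W_r"], params["U_r"], params["b_r"]
    W_h, U_h, b_h = params["W_h"], params["U_h"], params["b_h"]

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate, Eq. 1
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate, Eq. 2
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                  # new hidden state, Eq. 3
    return h_t

# Toy usage with random parameters (input dim 4, hidden dim 3):
rng = np.random.default_rng(0)
dims = {"W": (3, 4), "U": (3, 3), "b": (3,)}
params = {f"{k}_{g}": rng.standard_normal(shape)
          for g in "zrh" for k, shape in dims.items()}
h = gru_step(rng.standard_normal(4), np.zeros(3), params)
```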

4 Model Training and Dataset

Different encoder-decoder combinations were created and trained to find the most effective one; the details are listed in the table below. Each model was first tested roughly with both LSTM and GRU in the decoder, the best one was selected based on the overall BLEU and CIDEr scores, and that architecture was then chosen for further analysis. All models were trained in the Google Colab environment, using the Adam optimizer for every training run. The experimental results show that the ResNet50V2 and GRU architecture outperforms the other models. The experiment was repeated for batch sizes between 8 and 32 and learning rates ranging from 0.01 to 0.0001. Word2Vec embeddings were used for word representation, generating a 200-dimensional vector for each token. During the experiment, the LSTM and GRU state size was initialized to 512. We used the Flickr8k [30] dataset, which includes more than 8000 image samples divided into training, testing, and validation sets. Every image in the dataset is paired with 5 captions, for a total of over 40,000 captions. The mean caption length is 11 words and the longest sequence contains 37 words. All image samples were collected from the photo-hosting website Flickr.
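The hyperparameter settings above can be summarized in a small configuration sketch; values are those stated in the text, while details not reported here (such as the number of training epochs) are deliberately omitted.

```python
import tensorflow as tf

# Hyperparameter ranges and fixed settings used in the experiments above.
config = {
    "optimizer": "adam",
    "batch_sizes": [8, 16, 32],             # swept range
    "learning_rates": [1e-2, 1e-3, 1e-4],   # swept range
    "embedding": "word2vec",                # 200-dimensional token vectors
    "embedding_dim": 200,
    "rnn_state_size": 512,                  # LSTM/GRU state size
    "dataset": "Flickr8k",                  # 5 reference captions per image
}

# One concrete optimizer instance from the swept learning-rate range.
optimizer = tf.keras.optimizers.Adam(learning_rate=config["learning_rates"][1])
print(config)
```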

Fig. 2. Model training process

5 Evaluation Metrics

BLEU Score [31]:

BLEU stands for Bilingual Evaluation Understudy, a score proposed by Kishore Papineni et al. in 2002. It ranges from 0 to 1; the higher the score, the better the generated text. The BLEU score measures the similarity between the generated text and the reference text: the effectiveness of the generated description is evaluated by counting the matches between the n-grams of the generated description and the n-grams of the original caption, irrespective of their position in the sentence. The BLEU score can evaluate the performance of many deep learning applications, for example machine translation, text summarization, image captioning, and many more. Its main drawback is its inability to grasp the context or grammar of the generated text; as a result, it sometimes assigns a high score to sentences that are grammatically incorrect or carry the opposite meaning.

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} W_n \log P_n\right)$$

Here BP is the brevity penalty, \(W_n\) are the n-gram weights, and \(P_n\) the modified n-gram precisions. The overall BLEU score is thus the weighted geometric mean of the n-gram precisions from 1 to 4, scaled by BP.
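As a usage illustration (not the evaluation code of this paper), the NLTK implementation of BLEU can be applied to tokenized captions as follows; the example sentences are made up.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption and its reference captions (tokenized, made-up example).
references = [[
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]]
candidates = ["a dog is running on the grass".split()]

smooth = SmoothingFunction().method1   # avoids zero scores for short texts
bleu4 = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```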

CIDEr Score [32]:

CIDEr (Consensus-based Image Description Evaluation) differs from the BLEU score, which was traditionally designed for machine translation, in that it was developed specifically for the image captioning task. The CIDEr metric shows higher agreement with human judgment scores. The CIDEr score is calculated by measuring how frequently n-grams in the generated description match the reference sentences, and the same is computed for non-matching words; n-grams that occur frequently across all captions are then down-weighted via TF-IDF weighting when computing the correlation for every n-gram.

$$\mathrm{CIDEr}_{n}(C_{i}, S_{i}) = \frac{1}{m}\sum_{j}\frac{g^{n}(C_{i}) \cdot g^{n}(S_{ij})}{\left\| g^{n}(C_{i}) \right\| \left\| g^{n}(S_{ij}) \right\|}$$

Here gn(a) is the vector formed by the TF-IDF scores of all n-grams in "a"; Ci is the generated caption for image i, Sij is the j-th of the m reference captions for that image (Tables 1 and 2).
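As a simplified sketch of how the per-order term above can be computed, the following Python functions build the TF-IDF n-gram vectors and average the cosine similarities over the references. It assumes the n-gram document frequencies over the reference corpus have already been counted; the official CIDEr-D metric additionally clips counts, combines n = 1..4, and applies a length penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_images):
    """g^n(a): TF-IDF weight for every n-gram of a tokenized caption."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, refs, n, doc_freq, num_images):
    """CIDEr_n for one image: mean cosine similarity between the candidate's
    TF-IDF n-gram vector and each reference caption's vector."""
    g_c = tfidf_vector(candidate, n, doc_freq, num_images)
    sims = [cosine(g_c, tfidf_vector(r, n, doc_freq, num_images)) for r in refs]
    return sum(sims) / len(sims)
```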

Results:

Table 3 shows the overall performance of the different models in their default settings. The optimum result was obtained by the ResNet50V2 and GRU model, which suggests that ResNet50V2 captures image features better than the other feature extractors. Furthermore, between GRU and LSTM, GRU provides better descriptions and pays more attention to the scene and objects than the LSTM-based models. Only the VGG variants perform better with LSTM; otherwise GRU outperforms it on every metric.

Table 1. Number of image samples in the dataset
Table 2. Various encoder-decoder combinations tested in our experiment
Table 3. Comparison of various architectures on the Flickr8k dataset
Table 4. Comparison of different batch sizes on the ResNet50V2+GRU architecture
Table 5. Comparison of different learning rates on the ResNet50V2+GRU architecture
Table 6. Different metric scores of the proposed fine-tuned model

Table 4 shows the effect of different batch sizes on the evaluation metrics. It is evident that increasing the batch size had a detrimental effect on both the BLEU-n and CIDEr scores; this was due to overfitting, which degrades the model's ability to generate effective captions. The best BLEU score was achieved for batch size 4, whereas the CIDEr score is highest for batch size 8. A significant drop in BLEU-4 was recorded, from 40.1 at batch size 4 to 19 at batch size 16. In Table 5, different learning rates were tested on the ResNet50V2 and GRU model. Overall, model performance decreases as the learning rate decreases. Various fluctuations were recorded, but the best learning rate comes out to be 0.001. The presented results suggest that a batch size of 4 and a learning rate of 0.001 can generally be expected to give the optimum description of the image.

After the parameter tuning was done, the test was repeated with the optimum parameters for the selected ResNet50V2 and GRU architecture, and the results achieved were quite impressive, as shown in Table 6: approximately a 15% increase in the CIDEr score was achieved, and similarly a 12.5% improvement was recorded in the BLEU-4 score.

6 Conclusion

Overall, in this paper we have discussed the image captioning task through different methods and proposed an encoder-decoder architecture based on ResNet50V2 with GRU for this task. Our model showed promising results compared to previous methods. The deep architecture of ResNet50V2, combined with the GRU's ability to retain long-term memory, powered this image captioning task; already at the initial model-selection stage this combination clearly stood out from the others. With very minimal refinement, we achieved a significant improvement in the quality of the descriptions (15% in the CIDEr score and 12.5% in the BLEU-4 score). Our encoder-decoder architecture retrieves image information quite efficiently, as shown by captions that correspond to the content of the images. Table 7 depicts some of the captions generated by the proposed model, classified by their quality (good, average, and bad captions).

Table 7. Generated captions from our proposed model on the Flickr8k dataset