
1 Introduction

Image processing is an essential area of computer science with substantial relevance across various fields, including object detection, scene interpretation, and visual recognition. Before the emergence of deep learning, researchers relied on dedicated hardware to execute imaging techniques and obtain acceptable results, particularly for rigid objects. Deep-learning-based CNNs and RNNs, however, have strongly influenced visual-to-text generation and have demonstrated remarkable progress in recent years.

The task of describing a scenario depicted in an image or video clip comes naturally to humans, but it poses significant challenges for machines. To tackle this issue, computer scientists are exploring methods to integrate the ability to comprehend human language with the capability to automatically extract and analyze visual data, thereby enabling machines to perform similar tasks. However, extracting objects and their actions from an image and producing crisp, relevant sentences requires substantially more work than a simple image recognition task.

Image and video caption generation primarily involves analyzing an image’s features and generating a corresponding textual description. Because this field demands proficiency in both visual and textual understanding, it blends CV and NLP techniques to translate image comprehension from feature vectors into words arranged in the proper sequence. The captioning method must capture the objects in the given scenario as well as their traits, actions, and interrelationships.

Therefore, the most common method for image captioning is the encoder-decoder architecture, which combines a Convolutional Neural Network (CNN) to encode image features with a Recurrent Neural Network (RNN) to generate a caption.

This design offers a clear separation of tasks: the CNN is responsible for encoding the image features, while the RNN is responsible for generating the caption. This separation makes the model easier to debug and analyze.

Overall, the Encoder-Decoder architecture is a popular choice for image captioning due to its effectiveness, flexibility, simplicity, and clear separation of tasks.
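
To make this division of labour concrete, the following is a minimal sketch of one common realization in Keras/TensorFlow, assuming pre-extracted 2048-d image features and illustrative values for the vocabulary size, maximum caption length, and layer widths (none of these values are taken from this paper):

```python
# Minimal encoder-decoder (merge-style) captioning model sketch in Keras.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000      # assumed vocabulary size
MAX_LEN = 34           # assumed maximum caption length (in tokens)
FEATURE_DIM = 2048     # size of the pre-extracted CNN feature vector

# Encoder branch: project the pre-extracted CNN features.
image_input = Input(shape=(FEATURE_DIM,))
image_dense = Dense(256, activation='relu')(Dropout(0.5)(image_input))

# Decoder branch: embed the partial caption and run it through an LSTM.
caption_input = Input(shape=(MAX_LEN,))
caption_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(Dropout(0.5)(caption_embed))

# Merge the two branches and predict the next word of the caption.
merged = add([image_dense, caption_lstm])
output = Dense(VOCAB_SIZE, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```

In this merge-style sketch the image features condition the decoder through element-wise addition; initializing the LSTM state with the image features is an equally common alternative.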

The remainder of this paper is organized as follows: related work on image captioning, the proposed methodology, results and discussion, limitations, and finally the conclusion and future work.

2 Related Work

Our research has involved an in-depth exploration of numerous studies on image captioning, encompassing a range of techniques, datasets, and evaluation methodologies. A CNN is often used to extract features from an image; these features are then fed to a language model that creates the image caption. CNNs are trained on large image datasets and can learn to recognize patterns and features in images [1, 3–5, 7, 8, 12, 15, 17]. The RNN takes as input the output of the previous step (a word embedding) together with the visual features that the CNN extracted from the image, and then generates the next word in the caption [3, 8].

Encoder-decoder models are a type of neural network architecture that leverages an encoder component to extract features from an input and a decoder component to generate an output. In the context of image captioning, the encoder is typically implemented using a convolutional neural network (CNN), which extracts salient features from the input image, while the decoder is usually implemented using a recurrent neural network (RNN), which generates the caption based on the features extracted by the encoder [2, 17]. This approach, introduced in 2015, has served as a starting point for subsequent research in the area of image captioning [18].

Subsequently, the author of [19] introduced a novel approach to simultaneously train a CNN and an RNN for generating captions by aligning image regions with their corresponding linguistic units. To facilitate their experimentation, the authors employed the COCO dataset, which has since emerged as a widely accepted benchmark for assessing the effectiveness of image captioning models. Of significance, this paper also introduced the CIDEr score, a widely used evaluation metric for image captioning models [19].

After that, an attention-based approach to image captioning was introduced [20], where the model learns to selectively attend to different image regions when generating captions. The authors showed that attention mechanisms improve caption quality and reduce ambiguity, proposing both a deterministic “soft” attention mechanism trainable with standard backpropagation and a stochastic “hard” attention mechanism [20]. Attention mechanisms play a vital role in enabling image captioning models to focus on the most pertinent aspects of an image during caption generation. Rather than solely depending on global image features, attention mechanisms allow these models to selectively concentrate on specific regions of the image that are most relevant to the current context of the caption being generated [2, 5, 13].

Thereafter, a bottom-up and top-down attention mechanism that combines object-level features with region-level features to generate captions was introduced [21]. This paper introduced a new dataset called Visual Genome, which contains more detailed object and attribute annotations than other datasets used for image captioning. It also introduced a new evaluation metric called SPICE, designed to measure the semantic similarity between generated and human captions [21].

Afterward, a new pre-training approach for image captioning that combines vision and language tasks to learn joint representations of images and captions was introduced [22]. The authors use a Transformer-based architecture that is pre-trained on a large corpus of image-caption pairs and show that their method achieves state-of-the-art performance on several benchmark datasets.

Following that, a Transformer-based architecture for image captioning that uses a meshed-memory mechanism to selectively attend to different regions of the image and the caption was introduced [23]. The authors show that their method outperforms other Transformer-based models and achieves state-of-the-art performance on the COCO dataset. Transformers are a relatively recent development in natural language processing and have proven highly successful in tasks such as text generation and machine translation, mainly because their self-attention mechanism allows them to process an input sequence in parallel, making them well suited to long input sequences such as captions. When generating captions, Transformer models are usually equipped with an encoder for extracting image features and a decoder for generating captions [11].

Visual question answering is one of the major applications of image captioning mentioned in [24]. This paper proposes a new pre-training approach for image captioning that uses a single encoder to encode images and captions jointly.

A new approach was proposed in [25] for generating image captions by parallelizing the decoding process to improve efficiency. The authors propose a hierarchical structure for the caption that allows the model to generate words in a parallel and efficient manner, and show that their method achieves state-of-the-art performance on the COCO dataset while being significantly faster than other models.

Throughout the years, numerous encoder-decoder techniques have emerged, employing different variations of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

3 Methodology

A combination of a CNN and an LSTM is commonly used in image captioning [3, 5–9, 12, 13, 15, 17]. In the proposed model, the Xception network is used for feature extraction: it takes an input image and outputs a vector of visual features, as expressed in Eq. (1).

$$ V = Xception \left( I \right) $$
(1)

Before applying the Xception model, a series of image preprocessing steps were carried out to adequately prepare the image; the Xception model was then used to extract the feature vector associated with the image. In Eq. (1), the variable “I” represents the input image, while “V” refers to the vector of visual features extracted from it.

The Xception model’s use of depthwise separable convolutions, in contrast to traditional CNN models, delivers notable improvements in computational efficiency and speed. This architectural choice enables the model to analyze spatial relationships and feature interactions more effectively while minimizing redundant computations. For these reasons, the Xception model is used instead of a traditional CNN to encode features from the image.
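
As a concrete illustration of Eq. (1), the following sketch extracts the feature vector V with the pre-trained Keras Xception model; the 299 × 299 input size and pixel scaling are standard Xception preprocessing, while the image path is purely illustrative:

```python
# A minimal sketch of Eq. (1): V = Xception(I), using Keras' pre-trained Xception.
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image

encoder = Xception(weights='imagenet', include_top=False, pooling='avg')

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))    # Xception expects 299x299 inputs
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))            # scale pixel values to [-1, 1]
    return encoder.predict(x)[0]                               # 2048-d visual feature vector V

V = extract_features('example.jpg')   # hypothetical image path
print(V.shape)                        # (2048,)
```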

Fig. 1. Proposed Model

On the decoding side, an LSTM, a type of RNN, is frequently used for image captioning because it is better suited to capturing long-term dependencies in sequential data such as natural language [3, 5–9, 12, 13, 15, 17]. Figure 1 depicts the flow of the proposed model with its training and testing branches.

In image captioning, a critical task involves the model’s ability to comprehend the context of the image and produce a fitting caption accordingly. To this end, the LSTM network is employed to process the image features extracted by the Xception encoder. The LSTM network generates a sequential set of words, which are then combined to form a grammatically sound and semantically coherent sentence.

Compared to conventional RNNs, LSTM networks possess an added memory cell capable of preserving and retrieving information over extended periods. This feature enables LSTMs to more effectively manage long-term dependencies, a challenge for traditional RNNs. Additionally, LSTMs mitigate the vanishing gradient problem commonly encountered in traditional RNNs, thereby increasing the effectiveness and efficiency of the network [5, 6, 9].

Moreover, LSTM networks can also selectively forget or remember information from the previous time step, making them well-suited for tasks where the model needs to maintain a context for a long period.

Therefore, we have used LSTM networks over traditional RNNs for image captioning tasks because of their ability to better capture the complex dependencies and long-term context of natural language data.

The LSTM takes the vector of visual features from the Xception encoder and generates a sequence of words that form the image caption. It does this by processing each word in the sequence one at a time and updating its internal state based on the previous words in the sequence. This is expressed in Eq. (2):

$${h}_{t}= LSTM\left(V, {h}_{t-1}\right)$$
(2)
$${y}_{t}= Softmax\left({W}_{hy}{h}_{t}+ {b}_{y}\right)$$
(3)

where V is the vector of visual features, h_t is the internal state of the LSTM at time t, y_t is the output probability distribution over the vocabulary at time t, W_hy is the weight matrix connecting the LSTM output to the vocabulary, and b_y is the bias term.

The softmax function converts the output of the LSTM into a probability distribution over the vocabulary, so that the network can predict the next word in the sequence based on the probability of each possible word, as per Eq. (3).
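
To make the notation of Eqs. (2) and (3) concrete, the sketch below evaluates the output projection and softmax of Eq. (3) for a single time step, using randomly initialized weights and illustrative dimensions rather than values from the trained model:

```python
# Numerical sketch of Eq. (3) for one time step, with illustrative sizes.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 256, 8000                    # assumed hidden size and vocabulary size

h_t = rng.standard_normal(HIDDEN)            # LSTM internal state at time t (from Eq. 2)
W_hy = rng.standard_normal((VOCAB, HIDDEN)) * 0.01   # output projection weights W_hy
b_y = np.zeros(VOCAB)                        # bias term b_y

logits = W_hy @ h_t + b_y
y_t = np.exp(logits - logits.max())          # softmax, stabilized by subtracting the max
y_t /= y_t.sum()                             # probability distribution over the vocabulary

next_word_id = int(np.argmax(y_t))           # greedy choice of the next word
print(round(y_t.sum(), 6), next_word_id)     # probabilities sum to 1.0
```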

In this paper, we have used the Flickr8k dataset [26]. It contains 8,000 images, each with five different captions provided by human annotators. The dataset is divided into training, validation, and test sets and is often used for evaluating image captioning models.
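
As an illustration, the sketch below groups the five captions per image, assuming the commonly distributed Flickr8k.token.txt layout in which each line pairs an "image_name#index" identifier with a caption separated by a tab; the file name and layout are assumptions about the dataset release, not details stated in this paper:

```python
# Group the five human captions per image from a Flickr8k-style token file.
from collections import defaultdict

def load_captions(token_file='Flickr8k.token.txt'):   # assumed file name and layout
    captions = defaultdict(list)
    with open(token_file, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split('\t', 1)
            image_name = image_id.split('#')[0]        # drop the "#0".."#4" suffix
            captions[image_name].append(caption.lower())
    return captions

captions = load_captions()
print(len(captions))   # roughly 8,000 images, five captions each
```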

Figure 2 depicts the output of our proposed model, where we have used the “start” and “end” keywords to indicate the beginning and end of each caption.
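
The sketch below illustrates how these keywords drive generation at inference time via greedy decoding; it assumes a trained model with the two inputs described above, a fitted Keras Tokenizer, and the same maximum caption length, all of which are illustrative assumptions:

```python
# Greedy decoding sketch using "start"/"end" keywords with an assumed trained model.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len):
    caption = 'start'                                   # every caption begins with the start keyword
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        y = model.predict([photo_features[np.newaxis, :], seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(y)))
        if word is None or word == 'end':               # stop once the end keyword is predicted
            break
        caption += ' ' + word
    return ' '.join(caption.split()[1:])                # drop the start keyword from the output
```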

Fig. 2. Results of the Xception-LSTM model on the Flickr30k dataset

4 Results and Discussion

Several evaluation methods are used for image captioning, including:

  1. BLEU: It is an evaluation metric used in natural language processing to measure the quality of machine-generated translations. It compares the n-gram overlap between the candidate and reference sentences. BLEU scores range from 0 to 1, with higher scores indicating a better-quality translation [2, 4, 5, 8, 11–17].

  2. ROUGE: The ROUGE suite consists of several metrics, including ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 calculates the overlap of unigrams (single words) between the generated and reference summaries, ROUGE-2 calculates the overlap of bigrams (pairs of adjacent words), while ROUGE-L measures the longest common subsequence between the generated and reference summaries [8, 10, 12, 15, 16].

  3. METEOR: It uses a combination of unigram precision, recall, and alignment-based metrics to evaluate the similarity between sentences. METEOR is designed to handle nuances of natural language such as synonyms, paraphrases, and word order variations [5, 6, 8, 10, 12, 13, 15–17].

It should be emphasized that a holistic assessment of image captioning systems cannot rely solely on a single metric. Instead, a blend of multiple evaluation techniques is usually employed to achieve a more comprehensive and precise evaluation of image captioning system performance; a small scoring sketch using these metrics is given after this list.
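
As a minimal sketch of how such scores can be computed, the example below evaluates one hypothetical generated caption against its reference captions using NLTK's BLEU-1 to BLEU-4 and METEOR implementations (recent NLTK versions expect pre-tokenized input, and METEOR requires the WordNet corpus); the captions shown are illustrative, not outputs of the proposed model:

```python
# Score one generated caption against its references with NLTK BLEU-1..4 and METEOR.
# METEOR needs the WordNet corpus: nltk.download('wordnet')
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

references = [['a', 'dog', 'runs', 'across', 'the', 'grass'],
              ['a', 'brown', 'dog', 'is', 'running', 'on', 'the', 'grass']]
candidate = ['a', 'dog', 'is', 'running', 'on', 'the', 'grass']

smooth = SmoothingFunction().method1
for n in range(1, 5):                                   # BLEU-1 to BLEU-4
    weights = tuple(1.0 / n for _ in range(n))          # uniform n-gram weights
    score = sentence_bleu(references, candidate, weights=weights, smoothing_function=smooth)
    print(f'BLEU-{n}: {score:.3f}')

print(f'METEOR: {meteor_score(references, candidate):.3f}')
```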

The quality of the captions generated by the Xception-LSTM model is evaluated using the BLEU and METEOR metrics, as shown in Fig. 3. BLEU is computed over n-gram overlaps with n ranging from 1 to 4, and the score decreases as n increases. METEOR additionally accounts for the ordering of the generated caption relative to the labeled caption.

Fig. 3. Evaluation metrics on the Flickr30k dataset using the Xception-LSTM model

5 Limitations

Despite significant advancements in the field of image captioning in recent years, there remain several challenges and issues that require attention and resolution. Here are a few examples:

  • Context – It can be challenging to generate an accurate caption that conveys the intended message of an image. A single image can be perceived in different ways, leading to ambiguity in generating a descriptive caption. For example, a picture of a person riding a bicycle might be captioned differently depending on the specific details in the picture. The caption could vary from “a man riding a bicycle”, “a woman riding a bicycle” or “men riding a vehicle” depending on the contextual information in the image.

  • Ambiguity – Creating a precise and descriptive caption for an image can be a difficult task due to the ambiguity and subjectivity of visual content. Images can be interpreted in various ways, making it difficult to generate a single caption that accurately represents the content. Additionally, there may be more than one valid caption for a single image due to the different interpretations that people may have.

  • Rare or Unseen Words – In some cases, image captioning models may generate captions that contain infrequent or unfamiliar words, making them difficult for people to comprehend. This issue can be particularly troublesome for individuals who do not have expertise in the language utilized in the caption.

  • Data Bias – The process of training image captioning models involves using extensive datasets of image-caption pairs. However, these datasets may occasionally exhibit a bias towards particular types of images or captions. Consequently, the trained model may generate less accurate or less descriptive captions for certain types of images due to this bias.

  • Evaluation – Assessing the quality of image captions lacks a single standard metric, and the suitability of various metrics varies based on the specific application. For instance, certain metrics may prioritize accuracy, whereas others may prioritize the diversity or originality of the generated captions.

6 Conclusion and Future Work

Compared to other models like VGG-LSTM and ResNet-LSTM, the Xception-LSTM model offers several advantages. For one, it boasts greater computational and memory efficiency, which makes it more suitable for training on larger datasets. Additionally, the LSTM-based language decoder employed by the Xception-LSTM model is capable of modeling long-term dependencies during the caption generation process. This is a crucial factor in generating coherent and semantically meaningful captions. Moreover, the Xception-LSTM model can be fine-tuned on other tasks such as visual question answering and image retrieval, which demonstrates its versatility and effectiveness in various applications. However, the Xception-LSTM model still faces some challenges such as handling rare words and dealing with the ambiguity and diversity in the caption generation process. Future research can focus on addressing these challenges and improving the performance of the Xception-LSTM model on image captioning.