
1 Introduction

A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. However, this remarkable ability has proven elusive for visual recognition models.

Automatic description of an image is a very challenging task, as the system must capture not only what is contained in an image but also how the objects in the image are related to each other, what actions they are involved in and how the scene changes from one image to the next. This task is harder than traditional image or object recognition, as it requires a language model to express semantic knowledge along with visual understanding.

Visual content is generally represented by images or videos associated with captions or tags in the form of text sentences. This provides a way to learn representations of multiple data types. The multimodal neural architecture used for implementation consists of a convolutional neural network for image description and a bidirectional recurrent neural network to represent the content flow of the text sentences.

A key requirement for any machine translation system that produces natural language sentences is the coherence of its output. Coherence relations must capture the relatedness between texts with respect to sentence transitions. Local coherence is necessary for global coherence. The entity grid representation of discourse abstracts a text into a set of entity transition sequences and records distributional, syntactic and referential information about discourse entities, thereby capturing the pattern of entity distribution in a text.

General users usually take a large number of images, and describing each image would produce detailed, paragraph-like descriptions that could be too long for users. The summarized description must be short, preserve the coherent content flow and follow the story sequence. However, shortening the story built over machine-translated natural language sentences has not been explored in earlier works.

The proposed work could have a great impact, for instance, in helping novice users better understand the context of an image available on the web, and it can be used for applications such as recalling memories of life, action recognition [1] over a series of events, digital storytelling, and other sequential or historical events captured through a stream of images.

The organization of this paper is as follows. Section 2 presents a literature survey of the work. Section 3 presents the detailed design. Section 4 deals with the experimental setup, and Sect. 5 presents the result analysis.

2 Related Works

In recent years, there has been a growing interest in exploring the relation between images and language. Simultaneous progress in the fields of Computer Vision (CV) and Natural Language Processing (NLP) [2,3,4] has led to impressive results in learning both image-to-text and text-to-image connections. Tasks such as automatic image captioning, image retrieval or image generation from sentences have shown good results.

Image Captioning: Image captioning, which aims to generate text sentences that describe an input image, has attracted considerable attention in the computer vision and machine learning communities. The most popular approach to text generation is to retrieve the best sentences from a system trained on an embedding between images and text [3, 5]. This work retrieves text from the learned model and generates a story based on a compatibility score between the language model and the coherence model.

Many earlier research attempts have exploited multimodal networks that combine deep Convolutional Neural Networks (CNN) [6, 7] for image description with Recurrent Neural Networks (RNN) [6, 8,9,10] for language sequence modelling. Many variants of such multimodal architectures have been used, including CNNs with bidirectional RNNs [11], long-term recurrent convolutional networks [6] and long short-term memory networks [7]. This work takes advantage of existing models and extends them to multiple dimensions of input and output.

Huang et al. [12] introduced the first dataset of sequential images with corresponding descriptions. They first collected storyable photo albums from Flickr and then crowdsourced the corresponding stories and descriptions using Amazon’s Mechanical Turk (AMT).

Retrieval of Image: Most existing work retrieves images by keyword for structured queries. Earlier works include image ranking and retrieval based on text sentences, multiple attributes and other structured objects such as graphs [13]. In [14], three data types, image, text and sketch, are combined as a query for image retrieval. Lin et al. [15] proposed a method for video search using a text sentence as a query. Hu et al. [16] perform natural language object retrieval, taking an image and associated texts and localizing, for each text query, the corresponding image region. Kong et al. [17] take a scene and a sentence as input to find the relatedness between image regions and text phrases. Similarly, in [18], the correspondence between regions and phrases is computed. This work is distinct in that it involves image sequences instead of a single image.

Entity-Based Approaches to Local Coherence: Entity-based accounts of local coherence have a long tradition within the linguistic and cognitive science literature. An entity-based representation [19] of discourse permits learning the properties of coherent texts from a corpus, without recourse to manual annotation or a predefined knowledge base. Entity-based theories capture coherence by characterizing the distribution of entities across discourse utterances, distinguishing between salient entities and the rest.

Text Summarization: Text summarization uses Natural Language Processing principles and algorithms to understand a larger text [20] and generate smaller, efficient summaries. Our approach uses certain linguistic elements [21] to identify the most relevant segments of a text, and it must capture the syntactic structure and coherent flow of the generated narrative descriptions while reducing them to a precise representation of the text.

3 Proposed Architecture

The image captioning technique applied over a sequence of images requires learning coherent meaning, and the summarization technique aims at generating a concise description of the image set. This motivates the multimodal architecture shown in Fig. 1.

Fig. 1 Overall architecture of the proposed system

Figure 1 is divided into four main components: a Convolutional Neural Network (CNN) used to describe each image, a Bidirectional Recurrent Neural Network (BRNN) for language modelling, a local coherence model to capture the smooth flow of sentences, and a rank-based summarization technique to produce a crisp story of the image sequence.

3.1 Text Descriptions

The text sentences associated with an image are represented in two ways: a paragraph vector to represent the text features and a parse tree to represent the grammatical roles within the text sentences.
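To make these two representations concrete, the following sketch builds a paragraph vector with gensim's Doc2Vec and reads off grammatical roles with spaCy's dependency parser. The libraries, hyperparameters and example captions are illustrative assumptions rather than the exact setup of this work, and a dependency parse is used here simply because it exposes subject/object roles directly.

```python
# A minimal sketch, assuming gensim and spaCy (with en_core_web_sm) are installed.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import spacy

captions = ["the family arrived at the beach",
            "the kids built a sand castle near the water"]

# Paragraph vectors: one fixed-length feature vector per caption (toy settings).
corpus = [TaggedDocument(words=c.split(), tags=[i]) for i, c in enumerate(captions)]
pv_model = Doc2Vec(corpus, vector_size=64, window=3, min_count=1, epochs=40)
p_t = pv_model.infer_vector(captions[0].split())   # text feature p_t for sentence t

# Grammatical roles: subject/object labels later reused by the entity grid.
nlp = spacy.load("en_core_web_sm")
roles = [(tok.text, tok.dep_) for tok in nlp(captions[1])]
# e.g. [('kids', 'nsubj'), ('built', 'ROOT'), ('castle', 'dobj'), ...]
```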

3.2 Bidirectional Recurrent Neural Network

The BRNN model represents the content flow of the text sequences. The bidirectional model takes both the previous and the next text sentences into account by combining forward and backward processing.

Initialize the weights \( W_{i}^{f}, W_{i}^{b}, W_{f}, W_{b}, W_{o} \) and the biases \( b_{i}^{f}, b_{i}^{b}, b_{f}, b_{b}, b_{o} \).

For each paragraph vector \( p_{t} \), set the activation function f to the Rectified Linear Unit (ReLU)

$$ f(x) = \max(0, x) $$
(1)

Compute the activation of the input units to the forward units \( x_{t}^{f} \)

$$ x_{t}^{f} = f\left( W_{i}^{f} p_{t} + b_{i}^{f} \right) $$
(2)

Compute the activation of the input units to the backward units \( x_{t}^{b} \)

$$ x_{t}^{b} = f\left( W_{i}^{b} p_{t} + b_{i}^{b} \right) $$
(3)

Compute the activation of the forward hidden units \( h_{t}^{f} \)

$$ h_{t}^{f} = f\left( x_{t}^{f} + W_{f} h_{t-1}^{f} + b_{f} \right) $$
(4)

Compute the activation of the backward hidden units \( h_{t}^{b} \)

$$ h_{t}^{b} = f\left( x_{t}^{b} + W_{b} h_{t+1}^{b} + b_{b} \right) $$
(5)

Compute the final activation of the BRNN output unit \( o_{t} \)

$$ o_{t} = W_{o}\left( h_{t}^{f} + h_{t}^{b} \right) + b_{o} $$
(6)
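The following NumPy sketch implements Eqs. (1)–(6) directly; the weight shapes, random initialization and input sizes are illustrative assumptions, not the trained model's settings.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                                  # Eq. (1)

def brnn_forward(P, Wi_f, bi_f, Wi_b, bi_b, Wf, bf, Wb, bb, Wo, bo):
    """Run the BRNN of Eqs. (2)-(6) over paragraph vectors P of shape (N, d)."""
    N, H = P.shape[0], Wf.shape[0]
    xf = [relu(Wi_f @ P[t] + bi_f) for t in range(N)]        # Eq. (2)
    xb = [relu(Wi_b @ P[t] + bi_b) for t in range(N)]        # Eq. (3)

    hf, hb = np.zeros((N, H)), np.zeros((N, H))
    for t in range(N):                                       # forward pass, Eq. (4)
        prev = hf[t - 1] if t > 0 else np.zeros(H)
        hf[t] = relu(xf[t] + Wf @ prev + bf)
    for t in reversed(range(N)):                             # backward pass, Eq. (5)
        nxt = hb[t + 1] if t < N - 1 else np.zeros(H)
        hb[t] = relu(xb[t] + Wb @ nxt + bb)

    # Output units combine both directions, Eq. (6).
    return np.stack([Wo @ (hf[t] + hb[t]) + bo for t in range(N)])

# Toy usage with assumed dimensions: 5 sentences, 300-d paragraph vectors.
rng = np.random.default_rng(0)
d, H, D = 300, 128, 128
P = rng.normal(size=(5, d))
shapes = [(H, d), (H,), (H, d), (H,), (H, H), (H,), (H, H), (H,), (D, H), (D,)]
params = [rng.normal(scale=0.01, size=s) for s in shapes]
O = brnn_forward(P, *params)                                 # (5, D) outputs o_t
```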

3.3 The Local Coherence Model

To learn coherence among the texts, this work includes a local coherence model. The parse trees of the sequenced sentences are concatenated, from which an entity grid for the whole sequence is built. Each text is represented by an entity grid, a two-dimensional array in which each row corresponds to a sentence and each column corresponds to a discourse entity. This representation captures the distribution of discourse entities across the text sentences.

Each grid column thus corresponds to a string over a set of categories reflecting the entity’s presence or absence in the sequence of sentences. Our set consists of four symbols: S (subject), O (object), X (neither subject nor object) and - (a gap, which signals the entity’s absence from a given sentence). The model first identifies the entity classes, then determines the constituent structure of each sentence to identify the syntactic roles, and finally fills the grid entries with the relevant syntactic information.

An entity transition is a sequence \( \{ S, O, X, - \}^{n} \) that represents entity occurrences and their syntactic roles in n adjacent sentences. Local transitions can be obtained directly from a grid as contiguous subsequences of each column. After the entity grid is constructed, the entity transitions are enumerated and the relative frequency of each transition is computed.

The zero-padded coherence representation is then fed to a Rectified Linear Unit (ReLU), which outputs the coherence model vector q.
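The sketch below illustrates the transition features on a toy entity grid: length-2 transitions are enumerated column by column and their relative frequencies form the fixed-length vector that is zero padded and passed to the ReLU layer. The grid contents are invented for illustration.

```python
from collections import Counter
from itertools import product

# Toy entity grid: one column per discourse entity, one role per sentence,
# with values drawn from {"S", "O", "X", "-"}.
grid = {
    "family": ["S", "S", "-"],
    "beach":  ["X", "-", "O"],
    "castle": ["-", "O", "S"],
}

def transition_features(grid, n=2):
    """Relative frequency of every length-n transition over all columns."""
    counts, total = Counter(), 0
    for roles in grid.values():
        for i in range(len(roles) - n + 1):
            counts[tuple(roles[i:i + n])] += 1
            total += 1
    vocab = list(product("SOX-", repeat=n))      # all 16 possible bigram transitions
    return [counts[t] / total if total else 0.0 for t in vocab]

features = transition_features(grid)             # fixed-length coherence features
```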

3.4 Multimodal Network

The outputs of the BRNN, \( \{ o_{t} \}_{t=1}^{N} \), and the coherence model vector q are given together as input to two fully connected layers to determine how well the language and coherence match. The dropout rates and the dimensions of the variables are set accordingly.

$$ W_{f2} W_{f1} \left[\, O \mid q \,\right] = \left[\, S \mid g \,\right] $$
(7)

where \( O = \left[\, o_{1} \mid o_{2} \mid \cdots \mid o_{N} \,\right] \) and \( S = \left[\, s_{1} \mid s_{2} \mid \cdots \mid s_{N} \,\right] \).
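A PyTorch sketch of the two fully connected layers in Eq. (7) is shown below. The hidden size, dropout rate, the ReLU between the layers and the per-time-step broadcasting of q are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultimodalHead(nn.Module):
    """Two fully connected layers mapping [O | q] to [S | g], cf. Eq. (7)."""
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, o, q):
        x = torch.cat([o, q], dim=-1)   # concatenate BRNN output and coherence vector
        return self.fc2(self.drop(torch.relu(self.fc1(x))))

# Toy usage: 5 BRNN outputs of size 512, coherence vector of size 16 repeated per step.
head = MultimodalHead(in_dim=512 + 16, hidden_dim=256, out_dim=300)
o = torch.randn(5, 512)
q = torch.randn(1, 16).expand(5, -1)
s_and_g = head(o, q)                    # rows hold the s_t (and g) embeddings
```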

3.5 Training and Retrieval of Sentences

To train the model, a compatibility score is defined between the images comprising an album and the corresponding text sequence. The algorithm considers the score between every possible sentence and image combination to find the best match.

Retrieval of the best sentence sequence for a given query image stream proceeds as follows:

1. Select the k-nearest images for each query image from the training database using Euclidean distance on the image features.

2. The sentences associated with the k-nearest images at each location are concatenated into a paragraph. These form the candidate sentences.

3. The compatibility score between an image stream and a paragraph sequence is computed as follows:

   a. The ordered, pairwise compatibility score between a sentence sequence and an image sequence is defined as

      $$ S_{t}^{k} * V_{t}^{l} $$
      (8)
   b. The coherence relevance between an image sequence and a text sequence is defined as

      $$ G^{k} * V_{t}^{l} $$
      (9)
   c. The score \( S_{kl} \) for a sentence sequence k and an image stream l is defined as

      $$ S_{kl} = \sum_{t=1}^{N} \left( S_{t}^{k} * V_{t}^{l} \right) + \left( G^{k} * V_{t}^{l} \right) $$
      (10)

where \( V_{t}^{l} \) denotes the 4096-dimensional CNN feature vector for the tth image of stream l, and \( G^{k} \) and \( S_{t}^{k} \) are the outputs of Eq. (7) for a sentence sequence k.

   d. The cost function used to train the model is defined as follows (a minimal sketch of this ranking loss is given after the list):

      $$ C(\theta) = \sum_{k} \Big[ \sum_{l} \max\left(0,\, 1 + S_{kl} - S_{kk}\right) + \sum_{l} \max\left(0,\, 1 + S_{lk} - S_{kk}\right) \Big] $$
      (11)
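The snippet below is a minimal PyTorch sketch of the ranking cost in Eq. (11), assuming a square matrix of precomputed compatibility scores \( S_{kl} \) whose diagonal holds the matched pairs; masking out the diagonal terms is an implementation assumption.

```python
import torch

def contrastive_ranking_cost(S):
    """Margin ranking cost of Eq. (11) over a (K, K) score matrix S with S[k, l] = S_kl."""
    diag = S.diag().unsqueeze(1)                 # column of S_kk values
    rows = torch.clamp(1 + S - diag, min=0)      # max(0, 1 + S_kl - S_kk)
    cols = torch.clamp(1 + S.t() - diag, min=0)  # max(0, 1 + S_lk - S_kk)
    mask = 1 - torch.eye(S.size(0))              # skip the matched pairs k = l
    return ((rows + cols) * mask).sum()

scores = torch.randn(8, 8, requires_grad=True)   # toy batch of 8 albums
cost = contrastive_ranking_cost(scores)
cost.backward()
```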

3.6 Text Summarization

The variant of the PageRank algorithm used for text summarization is known as TextRank. It is an unsupervised method for computing an extractive summary of a text. PageRank is applied over a sentence graph, where the graph is symmetric and the transition matrix is built from pairwise sentence similarity. A minimal sketch of the procedure is given after the list below.

1. Preprocess the text: remove stop words and stem the remaining words.

2. Create a graph whose vertices are sentences.

3. Connect each pair of sentences by an edge whose weight is the similarity of the two sentences.

4. Run the PageRank algorithm on the graph.

5. Pick the vertices (sentences) with the highest PageRank scores.
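A minimal power-iteration sketch of this procedure is shown below; the damping factor, iteration count and toy similarity matrix are illustrative assumptions.

```python
import numpy as np

def textrank(similarity, d=0.85, iters=50):
    """PageRank scores over a symmetric sentence-similarity matrix."""
    sim = similarity.copy()
    np.fill_diagonal(sim, 0.0)                 # no self-loops
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    M = sim / row_sums                         # row-stochastic transition matrix
    n = sim.shape[0]
    scores = np.ones(n) / n
    for _ in range(iters):                     # power iteration with damping
        scores = (1 - d) / n + d * M.T @ scores
    return scores

# Toy similarity matrix for three sentences; keep the top two as the summary.
sim = np.array([[0.0, 0.3, 0.1],
                [0.3, 0.0, 0.5],
                [0.1, 0.5, 0.0]])
summary_idx = np.argsort(-textrank(sim))[:2]
```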

4 Experiment

Dataset. The Visual Storytelling (VIST) dataset is the first dataset created specifically for sequential image-to-language tasks. It includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. The image streams are extracted from Flickr and the text stories are crowdsourced through Amazon Mechanical Turk (AMT).

4.1 Retrieval Task

For experimental evaluation, the dataset is split in an 8:1:1 ratio into training, validation and test sets, respectively. Each input query image stream is denoted Iq and the corresponding annotated text sentences form the groundtruth TG. For each query album, the algorithm retrieves text sequences from the training set that should match well with the groundtruth sentences.

Given an input album image and the text sequences, the algorithm computes the compatibility score as in Eq. (11). The lowest-cost text sequence is given the highest priority and is retrieved as the best-matched sequence. The generation tasks of our approach are evaluated using quantitative measures: the proposed work produces, for the test set, a narrative paragraph and the corresponding shortened story. This work uses two language similarity metrics, BLEU [22] and METEOR [23], which are popular in text generation; higher BLEU and METEOR values indicate better performance.
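For reference, the snippet below shows how a sentence-level BLEU score can be computed with NLTK; the example sentences are invented, and METEOR can be computed analogously with nltk.translate.meteor_score once the required WordNet data is installed.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the family spent the whole day at the beach".split()   # groundtruth T_G
candidate = "the family enjoyed a day on the beach".split()          # generated story

# Smoothing avoids zero scores on short sentences with missing higher-order n-grams.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
```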

Figure 2 shows examples of sentence sequences generated on the VIST dataset. Three different stories are generated for each query image stream: the image description represents the single-image context, the narrative story is generated based on the other images comprising the query image sequence, and the summarized story is generated from the corresponding narrative story. The difference between a single image description and the corresponding narrative story is the coherence among the sentences, indicated by the highlighted words.

Fig. 2 Examples of generated stories on the VIST dataset

5 Result and Discussion

The quantitative results of story generation are shown in Table 1. The methods in the proposed work are partitioned into three groups: (i) image captioning, corresponding to the implementation of the Recurrent Convolutional Network (RCN); (ii) generation of narrative paragraphs, corresponding to the RCN with the entity-based coherence model; and (iii) generation of summarized results.

Table 1 Evaluation of story sentence generation with language similarity metrics (BLEU and METEOR)

Figure 3 clearly demonstrates that adding the coherence model on top of the language model gives a significant gain in BLEU score and an improved METEOR score, while adding the summarization capability improves both the BLEU and METEOR scores.

Fig. 3 Comparison of scores for narrative and summarized paragraphs

The sequence of annotated text sentences for each test image sequence is taken as the groundtruth TG, and the generated summarized stories are evaluated with reference to TG. Since the retrieval method for the summarized story is based on the generated narrative story and the evaluations are performed against TG, the method captures the coherent meaning and can at best generate similar sentences. Figure 3 shows that the summarized story is the most similar to TG, and Fig. 2 shows, through the highlighted words, the coherent meaning preserved from the image descriptions to the generated narrative stories.

6 Conclusion

Capturing the coherent meaning of a set of images is an important task for generating narrative paragraphs, as opposed to retrieving a text sentence for each image of the set in isolation. Thus, the proposed work implements a method for generating a precise yet concise story that best describes a sequence of images. Through quantitative evaluation, this work demonstrates that generating a summarized story from the narrative description improves performance while preserving the context of the image set along with its syntactic and referential information.