
1 Introduction

Automatic caption/description generation from images is a challenging problem that requires a combination of visual and linguistic information, as illustrated in Fig. 1. In other words, it requires not only complete image understanding, but also sophisticated natural language generation [1,2,3,4]. This is what makes it such an interesting task, embraced by both the computer vision and natural language processing communities.

Fig. 1. Complete visual scene understanding is a holy grail in computer vision.

One of the most common models applied to automatic caption generation is a neural network composed of two sub-networks [5,6,7,8,9,10], where a convolutional neural network (CNN) [11] is used to obtain a feature representation of an image, while a recurrent neural network (RNN) is applied to encode and generate its caption. In particular, the Long Short-Term Memory (LSTM) model [12] has emerged as the most popular RNN architecture, as it is able to capture long-term dependencies and preserve sequence order. Although a sequential model is appropriate for processing sentential data, it does not capture any other syntactic structure of language. Nevertheless, it is undeniable that sentence structure is one of the prominent characteristics of language, and Victor Yngve, an influential contributor to linguistic theory, stated in 1960 that there is “language structure involving, in some form or other, a phrase-structure hierarchy, or immediate constituent organization” [13]. Moreover, Tai et al. [14] showed that a tree-structured LSTM model that incorporates a syntactic interpretation of sentence structure can learn the semantic relatedness between sentences better than a purely sequential LSTM alone. This raises the question of whether it is a good idea to disregard the syntax of language in the task of generating image descriptions.

Fig. 2. Model comparison: (a) conventional RNN language model, and (b) our proposed phrase-based model.

In this paper, we investigate the capability of a phrase-based language model in generating image captions, as compared to a sequential language model such as [6]. To this end, we design a novel phrase-based hierarchical LSTM model, namely phi-LSTM, which encodes an image description in three stages: chunking of the training caption, composition of image-relevant phrases into vector representations, and finally sentence encoding with the image, words and phrases. As opposed to conventional RNN language models, which process a sentence as a sequence of words, our proposed method takes a noun phrase as a unit in the sentence, and thus processes the sentential data as a sequence of both words and phrases. Figure 2 illustrates the difference between a conventional RNN language model and our proposal with an example. Phrases and sentences in our proposed model are learned with two different sets of LSTM parameters, each modeling the probability distribution of a word conditioned on the previous context and the image. This design is motivated by the observation that some words are more prone to appear within phrases, while others are more likely to be used to link phrases. To train the proposed model, a new perplexity-based cost function is defined. Experimental results on two publicly available datasets (Flickr8k [15] and Flickr30k [16]), and a comparison with the state-of-the-art methods [5,6,7, 9, 17], show the efficacy of our proposed method.

2 Related Works

The image description generation task is generally inspired by two lines of research: (i) the learning of cross-modal representations or transitions between image and language, and (ii) description generation approaches.

2.1 Multimodal Representation and Transition

To model the relationship between image and language, some works associate the two modalities by embedding their representations into a common space [18,19,20,21]. First, they obtain the image features using a visual model such as a CNN [19, 20], and the representation of the sentence with a language model such as a recursive neural network [20]. Then, both are embedded into a common multimodal space, and the whole model is learned with a ranking objective for the image and sentence retrieval task. This framework was also tested at the object level by Karpathy et al. [21] and shown to yield better results for the bi-directional image and sentence retrieval task. Besides that, there are works that learn the probability density over multimodal inputs using various statistical approaches, including Deep Boltzmann Machines [22], topic models [23], the log-bilinear neural language model [8, 24] and recurrent neural networks [5,6,7]. Such approaches fuse the different input modalities together to obtain a unified representation of the inputs. It is worth noting that there are also works which do not explicitly learn a multimodal representation between image and language, but transition between modalities with a retrieval approach. For example, Kuznetsova et al. [25] retrieve images similar to the query image from their database, and extract useful language segments (such as phrases) from the descriptions of the retrieved images.

2.2 Description Generation

On the other hand, caption generation approaches can generally be grouped into three categories:

Template-Based. These approaches generate sentences from a fixed template [26,27,28,29,30]. For example, Farhadi et al. [26] infer a single triplet of object, action and scene from an image and convert it into a sentence with a fixed template. Kulkarni et al. [27] use a complex graph of detections to infer the elements of a sentence with a conditional random field (CRF), but sentence generation is still based on a template. Mitchell et al. [29] and Gupta et al. [30] use more powerful language parsing models to produce image descriptions. Overall, these approaches generate descriptions that are syntactically correct but rigid and inflexible.

Composition Method. These approaches extract components related to the image and stitch them together to form a sentence [25, 31, 32]. Descriptions generated in this manner are broader and more expressive than those of the template-based approach, but generation is more computationally expensive at test time due to its non-parametric nature.

Neural Network. These approaches produce a description by modeling the conditional probability of a word given multimodal inputs. For instance, Kiros et al. [8, 24] developed a multimodal log-bilinear neural language model for sentence generation based on context and image features; however, it has a fixed context window. The other popular model is the recurrent neural network [5,6,7, 9, 33], owing to its ability to process sequential inputs of arbitrary length, such as a sequence of words. This model is usually connected to a deep CNN that generates image features, and the variants of how this sub-network is connected to the RNN have been investigated by different researchers. For instance, the multimodal recurrent neural network proposed by Mao et al. [5] introduces a multimodal layer at each time step of the RNN, before the softmax prediction of words. Vinyals et al. [6] treat sentence generation as a machine translation problem from image to English, and thus the image feature is fed in at the first step of the sequence trained with their LSTM RNN model.

2.3 Relation to Our Work

Automatic image captions generated via template-based [26,27,28,29,30] and composition methods [25, 31, 32] typically follow a two-stage approach, where relevant elements such as objects (noun phrases) and relations (verb and prepositional phrases) are generated first, before a full descriptive sentence is formed from the phrases. Given the capability of the LSTM model in processing long sequences of words, such a two-stage approach may seem unnecessary for neural network based methods. However, we are still interested in finding out how a sequential model that treats phrases as units of the sequence performs. The work closest to ours is that of Lebret et al. [17]. They obtain phrase representations by simple word vector addition and learn their relevance to an image by training with negative samples. A sentence is then generated as a sequence of phrases, predicted with a statistical framework conditioned on the previous phrases and their chunking tags. While their aim was to design a phrase-based model that is simpler than an RNN, we intend to compare an RNN phrase-based model with its sequential counterpart. Hence, our proposed model generates phrases and recomposes them into a sentence with two sub-networks of LSTM, which are linked to form a hierarchical structure as shown in Fig. 2(b).

3 Our Proposed phi-LSTM Model

This section details how the proposed method encodes an image description in three stages: (i) chunking of the image description, (ii) encoding of words and phrases into distributed representations, and (iii) encoding of the sentence with the phi-LSTM model.

3.1 Phrase Chunking

Fig. 3. Phrase chunking from dependency parse.

A quick overview of the structure of image descriptions reveals that the key elements making up the majority of captions are usually noun phrases describing the content of the image, which can be either objects or scene. These elements are linked by verb and prepositional phrases. Thus, noun phrases essentially cover over half of the corpus in a language model trained to generate image descriptions. Our idea in this paper is therefore to partition the learning of noun phrases and sentence structure so that they can be processed more evenly, compared to extracting all phrases without considering their part-of-speech tags.

To identify noun phrases in a training sentence, we adopt a dependency parse with refinement, using the Stanford CoreNLP tool [34], which provides a good semantic representation of a sentence through the structural relations between words. Though it does not chunk a sentence directly, as a constituency parse or other chunking tools would, the pattern of the extracted noun phrases is more flexible, as we can select the desirable structural relations. The relations we select are:

  • determiner relation (det),

  • numeric modifier (nummod),

  • adjectival modifier (amod),

  • adverbial modifier (advmod), selected only when it modifies the meaning of an adjective, e.g. “dimly lit room”,

  • compound (compound),

  • nominal modifier for possessive alteration (nmod:of & nmod:poss).

Note that the dependency parse only extracts triplets, each made up of a governor word and a dependent word linked by a relation. So, in order to form phrase chunks from the dependency parse, we make some refinements as illustrated in Fig. 3. The triplets of the selected relations in a sentence are first located, and consecutive words (as highlighted in the figure, e.g. “the”, “man”) are grouped into a single phrase, while standalone words (e.g. “in”) remain as individual units in the sentence.
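For illustration, the sketch below groups the selected dependency triplets into phrase chunks. It assumes the caption has already been tokenized and parsed into (relation, governor index, dependent index) triplets; all names are illustrative and the extra condition on advmod from the list above is omitted for brevity.

```python
SELECTED_RELATIONS = {"det", "nummod", "amod", "advmod", "compound",
                      "nmod:of", "nmod:poss"}

def chunk_phrases(words, triplets):
    """Group consecutive words linked by the selected relations into phrases.

    words    : list of tokens, e.g. ["the", "man", "in", "a", "red", "shirt"]
    triplets : list of (relation, governor_index, dependent_index)
    Returns a list of units, where each unit is a list of word indices:
    multi-word units are phrase chunks, single-word units are standalone words.
    """
    # Mark every word index that takes part in a selected relation.
    in_phrase = set()
    for rel, gov, dep in triplets:
        if rel in SELECTED_RELATIONS:
            in_phrase.update((gov, dep))

    # Merge runs of consecutive marked indices into phrase chunks.
    units, current = [], []
    for idx in range(len(words)):
        if idx in in_phrase:
            current.append(idx)
        else:
            if current:
                units.append(current)
                current = []
            units.append([idx])              # standalone word stays a unit
    if current:
        units.append(current)
    return units

words = ["the", "man", "in", "a", "red", "shirt"]
triplets = [("det", 1, 0), ("case", 5, 2), ("det", 5, 3), ("amod", 5, 4)]
print([[words[i] for i in u] for u in chunk_phrases(words, triplets)])
# -> [['the', 'man'], ['in'], ['a', 'red', 'shirt']]
```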

3.2 Compositional Vector Representation of Phrase

This section describes how compositional vector representation of a phrase is computed, given an image.

Image Representation. A 16-layer VggNet [35] pre-trained on the ImageNet [36] classification task is used to extract image features in this work. Let \(\mathbf {I} \in \mathbb {R}^D\) be an image feature; it is embedded into a K-dimensional vector \(\mathbf {v_p}\) with image embedding matrix \(\mathbf {W_{ip}} \in \mathbb {R}^{K \times D}\) and bias \(\mathbf {b_{ip}} \in \mathbb {R}^K\):

$$\begin{aligned} \mathbf {v_p} = \mathbf {W_{ip}} \mathbf {I} + \mathbf {b_{ip}}. \end{aligned}$$
(1)

Word Embedding. Given a dictionary \(\mathcal {W}\) with a vocabulary of V words, where \( w \in \mathcal {W}\) denotes a word in the dictionary, a word embedding matrix \(\mathbf {W_e} \in \mathbb {R}^{K \times V}\) is defined to encode each word into a K-dimensional vector representation \(\mathbf {x}\). Hence, an image description with words \( w _1 \cdots w _M\) corresponds to vectors \(\mathbf {x}_1 \cdots \mathbf {x}_M\).
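A minimal sketch of the two embeddings above (Eq. 1 and the word lookup); the dimensions are illustrative and the weights are random placeholders, not trained values.

```python
import numpy as np

K, D, V = 256, 4096, 2000              # embedding, image-feature and vocab sizes (illustrative)
W_ip = np.random.randn(K, D) * 0.01    # image embedding matrix
b_ip = np.zeros(K)                     # image embedding bias
W_e = np.random.randn(K, V) * 0.01     # word embedding matrix

I = np.random.randn(D)                 # CNN feature of one image
v_p = W_ip @ I + b_ip                  # Eq. (1): embedded image vector

word_id = 42                           # index of a word w in the dictionary
x = W_e[:, word_id]                    # its K-dimensional vector representation
```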

Fig. 4. Composition of phrase vector representation in the phi-LSTM model.

Composition of Phrase Vector Representation. For each phrase extracted from the sentence, an LSTM-based RNN model similar to [6] is used to encode its sequence, as shown in Fig. 4. As in [6], we treat the sequential modeling from image to phrasal description as a machine translation task, where the embedded image vector is fed into the RNN at the first time step, followed by a start token \(\mathbf {x_{sp}} \in \mathbb {R}^{K}\) indicating the start of the translation process. The model is trained to predict the next word at each time step by outputting \(\mathbf {p}_{{t_p}+1} \in \mathbb {R}^{V}\), the probability distribution over all words in the corpus. The last word of the phrase predicts an end token. So, given a phrase P made up of L words, the input \(\mathbf {x}_{t_p}\) at each time step is:

$$\begin{aligned} \mathbf {x_{t_p}} = {\left\{ \begin{array}{ll} \mathbf {v_p}, &{} \text {if}\ t_p=-1 \\ \mathbf {x_{sp}}, &{} \text {if}\ t_p=0 \\ \mathbf {W_e}w_{t_{p}}, &{} \text {for}\ t_p = {1...L}. \end{array}\right. } \end{aligned}$$
(2)

For an LSTM unit at time step \(t_p\), let \(\mathbf {i}_{t_p}, \mathbf {f}_{t_p}, \mathbf {o}_{t_p}, \mathbf {c}_{t_p}\) and \(\mathbf {h}_{t_p}\) denote the input gate, forget gate, output gate, memory cell and hidden state at that time step, respectively. The LSTM transition equations are:

$$\begin{aligned} \mathbf {i}_{t_p} = \sigma (\mathbf {W_i} \mathbf {x}_{t_p} + \mathbf {U_i} \mathbf {h}_{{t_p}-1}), \end{aligned}$$
(3)
$$\begin{aligned} \mathbf {f}_{t_p} = \sigma (\mathbf {W_f} \mathbf {x}_{t_p} + \mathbf {U_f} \mathbf {h}_{{t_p}-1}), \end{aligned}$$
(4)
$$\begin{aligned} \mathbf {o}_{t_p} = \sigma (\mathbf {W_o} \mathbf {x}_{t_p} + \mathbf {U_o} \mathbf {h}_{{t_p}-1}), \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {u}_{t_p} = tanh(\mathbf {W_u} \mathbf {x}_{t_p} + \mathbf {U_u} \mathbf {h}_{{t_p}-1}), \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {c}_{t_p} = \mathbf {i}_{t_p} \odot \mathbf {u}_{t_p} + \mathbf {f}_{t_p} \odot \mathbf {c}_{{t_p}-1}, \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {h}_{t_p} = \mathbf {o}_{t_p} \odot tanh(\mathbf {c}_{t_p}), \end{aligned}$$
(8)
$$\begin{aligned} \mathbf {p}_{{t_p}+1} = \text {softmax}(\mathbf {h}_{t_p}). \end{aligned}$$
(9)

Here, \(\sigma \) denotes the logistic sigmoid function, while \(\odot \) denotes elementwise multiplication. The LSTM parameters {\(\mathbf {W_i}, \mathbf {W_f}, \mathbf {W_o}, \mathbf {W_u}, \mathbf {U_i}, \mathbf {U_f}, \mathbf {U_o}, \mathbf {U_u}\)} are all matrices of dimension \(\mathbb {R}^{K \times K}\). Intuitively, each gating unit controls the extent to which information is updated, forgotten and forward-propagated, while the memory cell holds the unit's internal memory of the information processed up to the current time step. The hidden state is therefore a gated, partial view of the memory cell. At each time step, the output probability distribution over words is the conditional probability of a word given the previous words and the image, \(P(w_t | w_{1:t-1},I)\). The hidden state at the last time step L is used as the compositional vector representation of the phrase, \(\mathbf {z} \in \mathbb {R}^K\), where \(\mathbf {z} = \mathbf {h}_{L}\).
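The following NumPy sketch runs one step of Eqs. 3–9 for illustration. Biases are omitted as in the equations, the parameters are random placeholders, and the output projection W_out is an assumption of this sketch, added so that the softmax yields a distribution over the V vocabulary words.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

K, V = 256, 2000
params = {name: np.random.randn(K, K) * 0.01
          for name in ("W_i", "W_f", "W_o", "W_u", "U_i", "U_f", "U_o", "U_u")}
W_out = np.random.randn(V, K) * 0.01        # assumed projection to the vocabulary

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev)    # input gate,   Eq. (3)
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev)    # forget gate,  Eq. (4)
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev)    # output gate,  Eq. (5)
    u = np.tanh(p["W_u"] @ x + p["U_u"] @ h_prev)    # candidate,    Eq. (6)
    c = i * u + f * c_prev                           # memory cell,  Eq. (7)
    h = o * np.tanh(c)                               # hidden state, Eq. (8)
    logits = W_out @ h                               # cf. Eq. (9): softmax over
    prob = np.exp(logits - logits.max())             # the V vocabulary words
    prob /= prob.sum()
    return h, c, prob

# Encoding a phrase: feed v_p, the start token x_sp, then each word vector;
# the hidden state after the last word is the phrase vector z = h_L.
h, c = np.zeros(K), np.zeros(K)
for x in [np.random.randn(K) for _ in range(4)]:     # placeholders for the inputs
    h, c, prob = lstm_step(x, h, c, params)
z = h
```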

3.3 Encoding of Image Description

Fig. 5. Sentence encoding using the phi-LSTM model.

Once the compositional vectors of the phrases are obtained, they are linked with the remaining words in the sentence using another LSTM-based RNN model, as shown in Fig. 5. Another start token \(\mathbf {x_{ss}} \in \mathbb {R}^{K}\) and image representation \(\mathbf {v_s} \in \mathbb {R}^{K}\) are introduced, where

$$\begin{aligned} \mathbf {v_s} = \mathbf {W_{is}} \mathbf {I} + \mathbf {b_{is}}, \end{aligned}$$
(10)

with \(\mathbf {W_{is}} \in \mathbb {R}^{K \times D}\) and bias \(\mathbf {b_{is}} \in \mathbb {R}^K\) as embedding parameters. Hence, the inputs to the LSTM at this level are the image representation \(\mathbf {v_s}\) and the start token \(\mathbf {x_{ss}}\), followed by either the compositional vector of a phrase \(\mathbf {z}\) or a word vector \(\mathbf {x}\), in accordance with the sequence of the description.

For simplicity, the arranged input sequence is referred to as \(\mathbf {y}\). Therefore, given the example in Figs. 4 and 5, the LSTM input sequence of the sentence is {\(\mathbf {v_s}, \mathbf {x_{ss}}, \mathbf {y}_1 \ldots \mathbf {y}_N\)} where N = 8, which is equivalent to the sequence {\(\mathbf {v_s}, \mathbf {x_{ss}}, \mathbf {z}_1, \mathbf {x}_3, \mathbf {z}_2, \mathbf {x}_7, \mathbf {x}_8, \mathbf {x}_9, \mathbf {x}_{10}, \mathbf {z}_3\)}, as in Fig. 5. Note that a phrase token is added to the vocabulary, so that the model can predict it as an output when the next input is a noun phrase.

The encoding of the sentence is similar to the phrase vector composition. Equations 3–9 are applied here with \(\mathbf {y}_{t_s}\) as input instead of \(\mathbf {x}_{t_p}\), where \(t_p\) and \(t_s\) denote the time steps at phrase and sentence level respectively. A new set of model parameters of the same dimensions is used at this hierarchical level.
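As an illustration, the sketch below assembles the sentence-level input sequence from a chunked caption; phrase_vec and word_vec are stand-ins (assumptions) for the lower-level LSTM encoder and the W_e lookup described above.

```python
import numpy as np

K = 256
v_s, x_ss = np.random.randn(K), np.random.randn(K)       # embedded image, start token
words = ["the", "man", "in", "a", "red", "shirt"]
units = [[0, 1], [2], [3, 4, 5]]                          # output of phrase chunking
phrase_vec = lambda tokens: np.random.randn(K)            # placeholder for z = h_L
word_vec = lambda token: np.random.randn(K)               # placeholder for W_e lookup

inputs = [v_s, x_ss]
for unit in units:
    if len(unit) > 1:                                     # noun phrase -> its vector z
        inputs.append(phrase_vec([words[i] for i in unit]))
    else:                                                 # standalone word -> x
        inputs.append(word_vec(words[unit[0]]))
# inputs now holds v_s, x_ss, z_1, the vector for "in", and z_2,
# which are fed step by step into the upper-level LSTM.
```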

4 Training the phi-LSTM Model

The proposed phi-LSTM model is trained with a log-likelihood objective function computed from the perplexity of each sentence conditioned on its corresponding image in the training set. Given an image I and its description S, let R be the number of phrases in the sentence, \( P_i \) the number of LSTM blocks processed to obtain the compositional vector of phrase i, and Q the length of the composite sequence of sentence S, while \(\mathbf {p}_{t_p}\) and \(\mathbf {p}_{t_s}\) are the probability outputs of the LSTM block at time steps \( t_{p} -1\) and \( t_{s} -1\) at phrase and sentence level respectively. The perplexity of sentence S given its image I is

$$\begin{aligned} \log _2 \mathcal {PPL}(\mathbf {S}|\mathbf {I}) = -\frac{1}{N} \left[ \sum _{t_{s}=-1}^{Q} \log _2 \mathbf {p}_{t_{s}} + \sum _{i=1}^{R} \sum _{t_{p}=-1}^{P_i} \log _2 \mathbf {p}_{t_{p}} \right] , \end{aligned}$$
(11)

where

$$\begin{aligned} N = Q + \sum _{i=1}^{R} P_i. \end{aligned}$$
(12)

Hence, with M training samples, the cost function of our model is:

$$\begin{aligned} \mathcal {C}(\theta ) = \frac{1}{L} \sum _{j=1}^{M} \left[ N_j \log _2 \mathcal {PPL} (\mathbf {S_j}|\mathbf {I_j}) \right] + \lambda _\theta \cdot \parallel \theta \parallel _2^2, \end{aligned}$$
(13)

where

$$\begin{aligned} L = M\times \sum _{j=1}^{M} N_j. \end{aligned}$$
(14)

It is the negative average log-likelihood of the words given their previous context and the image described, summed with a regularization term \(\lambda _\theta \cdot \parallel \theta \parallel _2^2\) and averaged over the training samples. Here, \(\theta \) denotes the parameters of the model.
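A minimal sketch of Eqs. 11–14, assuming the probabilities assigned to the correct token at every time step have already been collected from both LSTM levels; the batch below is a toy example.

```python
import numpy as np

def log2_perplexity(p_sentence, p_phrases):
    """Eqs. (11)-(12): p_sentence holds the probability assigned to the correct
    token at each sentence-level step; p_phrases holds one such array per phrase."""
    log_sum = np.sum(np.log2(p_sentence)) + sum(np.sum(np.log2(p)) for p in p_phrases)
    N = len(p_sentence) + sum(len(p) for p in p_phrases)
    return -log_sum / N, N

def cost(batch, theta, lam):
    """Eqs. (13)-(14): batch is a list of (p_sentence, p_phrases) pairs, one per
    training sample; theta is a flat parameter vector."""
    ppl_and_N = [log2_perplexity(ps, pp) for ps, pp in batch]
    M = len(batch)
    L = M * sum(N for _, N in ppl_and_N)
    data_term = sum(N * ppl for ppl, N in ppl_and_N) / L
    return data_term + lam * np.sum(theta ** 2)

batch = [(np.array([0.5, 0.25, 0.4]), [np.array([0.3, 0.6])])]   # one toy sample
print(cost(batch, theta=np.zeros(10), lam=1e-4))
```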

This objective, however, does not discern the appropriateness of different inputs at each time step. So, given multiple possible inputs, it is unable to distinguish which phrase is the most probable input at a particular time step during the decoding stage. That is, when a phrase token is inferred as the next input, all possible phrases will be fed in at the next time step. The candidate sequences are then ranked according to their perplexity up to that time step, and only those with high probability are kept. Unfortunately, this is problematic because the subject in an image usually has a much lower perplexity than the object and the scene. Thus, such an algorithm will end up generating descriptions made up of only variants of subject noun phrases.

Fig. 6. Upper hierarchy of the phi-LSTM model with phrase selection objective.

To overcome this limitation, we introduce a phrase selection objective during the training stage. At each time step where the input is a phrase, H randomly selected phrases that differ from the ground truth input are fed into the phi-LSTM model, as shown in Fig. 6. The model then produces two outputs: the next-word prediction based solely on the actual input, and a classifier output that distinguishes the actual input from the rest. Though the number of inputs at these time steps increases, the memory cell and hidden state carried to the next time step keep only the information of the actual input. The cost function for the phrase selection objective of a sentence is

$$\begin{aligned} \mathcal {C}_{PS} = \sum _{t_{s} \in \mathcal {P}} \sum _{k=1}^{H+1} \kappa _{t_{s}k} \sigma (1-y_{t_{s}k}h_{t_{s}k}\mathbf {W_{ps}}), \end{aligned}$$
(15)

where \(\mathcal {P}\) is the set of all time steps at which the input is a phrase, \(h_{t_{s}k}\) is the hidden state output at time step \(t_{s}\) from input k, and \(y_{t_{s}k}\) is its label, which is +1 for the actual input and -1 for the false inputs. \(\mathbf {W_{ps}} \in \mathbb {R}^{K \times 1}\) holds the trainable parameters of the classifier, while \(\kappa _{t_{s}k}\) scales and normalizes the objective based on the number of actual and false inputs at each time step. The overall objective function is then

$$\begin{aligned} \mathcal {C}_{F}(\theta ) = \frac{1}{L} \sum _{j=1}^{M} \left[ N_j \log _2 \mathcal {PPL} (\mathbf {S_j}|\mathbf {I_j}) + \mathcal {C}_{PSj} \right] + \lambda _\theta \cdot \parallel \theta \parallel _2^2. \end{aligned}$$
(16)
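The sketch below evaluates the phrase-selection term of Eq. 15 at a single phrase time step; the particular normalization used for kappa (true and false inputs weighted equally) is an assumption, as the exact form is not specified above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phrase_selection_cost(h_states, labels, W_ps):
    """Eq. (15) at one time step: h_states is (H+1, K) hidden states produced
    from the true phrase and the H distractors, labels is +1/-1, W_ps is (K,).
    kappa weights true and false inputs equally here (assumed normalization)."""
    num_false = np.sum(labels == -1)
    kappa = np.where(labels == 1, 0.5, 0.5 / max(num_false, 1))
    scores = h_states @ W_ps                     # classifier score per input
    return np.sum(kappa * sigmoid(1.0 - labels * scores))

K, H = 256, 4
h_states = np.random.randn(H + 1, K)
labels = np.array([1] + [-1] * H)
print(phrase_selection_cost(h_states, labels, np.random.randn(K)))
```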

This cost function is minimized by backpropagation with the RMSprop optimizer [37], trained in minibatches of 100 image-sentence pairs per iteration. We cross-validate the learning rate and weight decay for each dataset, and dropout regularization [38] is applied to the LSTM parameters during training to avoid overfitting.

5 Image Caption Generation

Generating a textual description of an image with the phi-LSTM model is similar to other statistical language models, except that the image-relevant phrases are generated first at the lower hierarchical level of the proposed model. Here, the embedded image feature of the given image, followed by the phrase start token, is fed into the model as the initial context required for phrase generation. Then, the probability distribution of the next word over the vocabulary is obtained at each time step given the previous context, and the word with the maximum probability is picked and fed back into the model to predict the subsequent word. This process is repeated until the phrase end token is inferred. As multiple phrases are usually needed to generate a sentence, a beam search scheme is applied and the top K phrases generated are kept as candidates to form the sentence. To generate a description from the phrases, the upper hierarchical level of the phi-LSTM model is applied in a similar fashion. When a phrase token is inferred, the K phrases generated earlier are used as inputs for the next time step. Keeping only those phrases which yield a positive result with the phrase selection objective, inference of the next word given the previous context and the selected phrases is performed again. This process iterates until the end token is inferred by the model.

Some constraints are added here: (i) each predicted phrase may appear only once in a sentence, (ii) the maximum number of units (words or phrases) making up a sentence is limited to 20, (iii) the maximum number of words forming a phrase is limited to 10, and (iv) generated phrases with perplexity higher than a threshold T are discarded.
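A high-level sketch of this two-stage decoding under the constraints above; gen_phrases, next_unit and is_end abstract the lower-level beam search, the upper-level LSTM step and the end-token test, and are assumptions of this sketch rather than the released code.

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["text", "perplexity"])

def generate_caption(image, gen_phrases, next_unit, is_end,
                     K=5, T=5.2, max_units=20):
    # Stage 1: propose K candidate phrases, discard high-perplexity ones (iv).
    phrases = [p for p in gen_phrases(image, beam=K) if p.perplexity <= T]

    # Stage 2: stitch phrases and words into a sentence with the upper LSTM.
    sentence, used = [], set()
    while len(sentence) < max_units:                  # constraint (ii)
        candidates = [p for p in phrases if p.text not in used]
        unit = next_unit(image, sentence, candidates)
        if is_end(unit):
            break
        if isinstance(unit, Phrase):
            used.add(unit.text)                       # constraint (i)
        sentence.append(unit)
    return sentence
```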

6 Experiment

6.1 Datasets

The proposed phi-LSTM model is tested on two benchmark datasets, Flickr8k [15] and Flickr30k [16], and compared to the state-of-the-art methods [5,6,7, 9, 17]. These datasets consist of 8000 and 31000 images respectively, each annotated with five crowd-sourced ground truth descriptions. For both datasets, 1000 images are selected for validation and another 1000 for testing, while the rest are used for training. All sentences are converted to lower case, frequently occurring punctuation is removed, and words that occur fewer than 5 times (Flickr8k) or 8 times (Flickr30k) in the training data are discarded. The punctuation is removed so that the image descriptions are consistent with the data shared by Karpathy and Fei-Fei [7].

6.2 Results Evaluated with Automatic Metric

Sentences generated with the phi-LSTM model are evaluated with an automatic metric known as the bilingual evaluation understudy (BLEU) [39]. It computes the n-gram co-occurrence statistics between the generated description and multiple reference sentences by measuring n-gram precision. It is the most commonly used metric in this literature.
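The paper does not state which BLEU implementation is used; as an illustration, corpus-level BLEU-1 to BLEU-4 can be computed with NLTK on tokenized captions as follows.

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with its set of reference captions (tokenized, illustrative).
references = [[["a", "man", "in", "a", "red", "shirt", "rides", "a", "bike"],
               ["a", "cyclist", "wearing", "red", "is", "riding", "a", "bicycle"]]]
hypotheses = [["a", "man", "in", "a", "red", "shirt", "is", "riding", "a", "bike"]]

for n in range(1, 5):                     # BLEU-1 to BLEU-4
    weights = tuple([1.0 / n] * n)
    print(f"BLEU-{n}:", corpus_bleu(references, hypotheses, weights=weights))
```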

Table 1. BLEU scores of generated sentences on the Flickr8k and Flickr30k datasets.

Table 1 shows the performance of our proposed model in comparison with the current state-of-the-art methods. NIC [6], which is used as our baseline, is a reimplementation, and thus its BLEU scores reported here differ slightly from the original work. Our proposed model performs better than or comparably to the state-of-the-art methods on both Flickr8k and Flickr30k. In particular, we outperform our baseline on both datasets, as well as PbIC [17], a work very similar to ours, on the Flickr30k dataset by at least 5–10%.

Fig. 7. Effect of the perplexity threshold T and the maximum number of phrases K used for generating sentences on the BLEU score (best viewed in colour).

Table 2. Vocabulary size, word occurrence and average caption length in the training data, test data, and generated descriptions for the Flickr8k dataset.

As mentioned in Sect. 5, we generate K phrases from each image and discard those with perplexity higher than a threshold T when generating the image caption. To understand how these two parameters affect the generated sentences, we use different values of K and T to generate image captions with our proposed model trained on the Flickr30k dataset. The changes in BLEU score against T and K are plotted in Fig. 7. It shows that K does not have a significant effect on the BLEU score when T is below 5.5. On the other hand, the unigram and bigram BLEU scores improve with a lower perplexity threshold, in contrast to the trigram and 4-gram BLEU scores, which reach an optimum at T = 5.2. This is because the first few generated phrases with the lowest perplexity are usually variations of phrases describing the same entity, such as “a man” and “a person”. A sentence made of only such phrases has a higher chance of matching the reference descriptions, but it would hardly get a match on trigrams and 4-grams. In order to avoid generating captions made from repetitions of similar phrases, we select the T and K which yield the highest 4-gram BLEU score, which are T = 6.5 and K = 6 on the Flickr8k dataset, and T = 5.2 and K = 5 on the Flickr30k dataset. A few examples are shown in Fig. 8.

6.3 Comparison of the phi-LSTM Model with Its Sequence Model Counterpart

Table 3. Top 5 (a) least trained words found in, and (b) most trained words missing from, the generated captions on the Flickr8k dataset.

To compare a phrase-based hierarchical model with a pure sequence model in generating image captions, the phi-LSTM model and NIC [6] are both implemented using the same training strategy and parameter tuning. We are interested in how well the corpus is learned by both models. Using the Flickr8k dataset, we computed the corpus statistics of (i) the training data, (ii) the reference sentences in the test data and (iii) the generated captions, as tabulated in Table 2. We remove words that occur fewer than 5 times in the training data, which removes 4833 words; however, this reduction in terms of word count is only 2.48%. Furthermore, even though the model is evaluated against all reference sentences in the test data, 1228 words in the references are not in our training corpus. Thus, it is impossible for the model to predict those words, and this is a limitation of scoring against references in all language models. For a better comparison with the 1000 generated captions, we also compute another reference corpus based on the first sentence of each test image. From Table 2, it can be seen that even though there are at least 1187 possible words to be inferred from the images in the test set, the generated descriptions are made up of only 128 and 154 words for NIC [6] and the phi-LSTM model, respectively. These numbers show that the number of words actually learned by the two models is barely 10%, suggesting that more research is needed to improve the learning efficiency in this field. Nevertheless, it shows that introducing the phrase-based structure into the sequential model still improves the diversity of the generated captions.

To gain further insight into how word occurrence in the training corpus affects word prediction when generating captions, we record the top five most trained words that are missing from the corpus of generated captions, and the top five least trained words that are predicted by both models, as shown in Table 3. We consider only words that appear in the reference sentences, to ensure that these words are related to the images in the test data. It appears that the phrase-based model is able to infer more of the less trained words than the sequence model. As for the top five words that are not predicted despite their high occurrence in the training corpus, those words are either not very observable in the images, or more likely to be described with an alternative; for example, “the” is a more probable alternative to “another”.

A few examples of image descriptions generated with our proposed model and the NIC model [6] are shown in Fig. 9. Both models are qualitatively comparable. An interesting example is the first image, where our model mis-recognizes the statue as a person, but is able to infer the total number of “persons” in the image. The incorrect recognition stems from insufficient training data for the word “statue” in the Flickr8k dataset, as it occurs only 48 times, which is about 0.015% of the training corpus.

Fig. 8. Examples of phrases generated from images using the lower hierarchical level of the phi-LSTM model. Red fonts indicate that the perplexity of the phrase is below the threshold T.

Fig. 9. Examples of captions generated with the phi-LSTM model, in comparison to NIC [6].

7 Conclusion

In this paper, we present the phi-LSTM model, a neural network model trained to generate reasonable descriptions of images. The model consists of a CNN sub-network connected to a two-level hierarchical RNN, in which the lower level encodes noun phrases relevant to the image, while the upper level learns the sequence of words describing the image, with the phrases encoded at the lower level treated as units. A phrase selection objective is coupled with the sentence encoding and is designed to aid the generation of captions from relevant phrases. This design better preserves the syntax of a sentence by treating it as a sequence of phrases and words instead of a sequence of words alone. Such an adaptation also splits the content to be learned by the model into two parts, stored in two sets of parameters. Thus, it can generate sentences that are more accurate and drawn from a more diverse corpus, compared to a pure sequence model.