1 Introduction

As a challenging task that couples computer vision with natural language processing, modeling the correlation between images and sentences has attracted increasing research attention in recent years. It holds great potential not only for practical tasks such as cross-media retrieval, but also as a step towards the long-term goal of building intelligent machines. As a sub-problem that arises in numerous multimedia retrieval tasks, matching images and sentences remains the bottleneck of cross-media retrieval because of its difficulty and complexity. Retrieval between images and videos is comparatively straightforward, since a video frame can be treated as an image and the two modalities share similar features; likewise, speech recognition technology is mature enough to bridge the gap between text and voice. Matching images and sentences, however, is still far from being well addressed. In the long run, such a technique is a step towards artificial intelligence in which we interact with computers in natural language to carry out more complex tasks with the assistance of computer vision.

Correlating images and sentences on the basis of their expressed semantics is difficult because of the heterogeneity gap between the two modalities, which is widely regarded as the basic barrier to matching them. The core of modeling the correlation between images and sentences is to capture their semantic correspondences by learning a multimodal joint model. Previous models [1,2,3] study global-level matching relations between image and sentence by representing each as a single global vector, but they neglect the local fragments of the sentence and their correspondences to the image content. Fragment-level methods [4,5,6,7] address this problem by aligning sentence fragments with image regions. However, these methods usually represent image fragments as objects or instances, so the visual features of each object or instance must be extracted separately, which inevitably leads to high complexity.

In this work, we propose a novel end-to-end architecture for image-sentence matching. The proposed method is inspired by the fact that CNNs have demonstrated powerful abilities in learning visual representations of images. Since a feature map generated by a CNN captures partial semantics of the corresponding image, it is reasonable to represent image fragments by such feature maps. Unlike previous models, the proposed method represents image fragments as CNN feature maps, composes fragmental features of sentences with convolutional operations, and obtains the matching score by jointly encoding the features of image and sentence fragments. Because independent and specialized fragmental feature representations are allowed for images and texts, the proposed method can flexibly generate a joint abstraction of the two modalities by interlinking the intermediate fragmental features.

The key contributions of this work can be summarized as follows:

  • An end-to-end trainable system is proposed which composes the features of image and sentence fragments and then generates the matching score of an image-sentence pair by fusing their intermediate fragmental features.

  • Feature maps generated by a CNN are utilized as image fragments to build a fragment-level model for image-sentence matching, which is shown to be reasonable and advantageous.

  • The effectiveness of the proposed model is demonstrated by comprehensive experiments on two public datasets, Flickr8K [18] and Flickr30K [19]. The experimental results show performance competitive with state-of-the-art approaches.

2 Related Work

The proposed method relates to several research areas in natural language processing and computer vision. The work most relevant to this paper concerns matching images and sentences, which is reviewed below.

Modeling the semantic correlation between images and sentences has significant value for multimedia understanding and retrieval. Concepts shared across modalities play an important role in interlinking the interpretations of the two modalities. A large body of work has modeled this correlation for various tasks, such as image captioning, bidirectional image and sentence retrieval, and visual question answering.

In image captioning, the works of [10,11,12,13,14] focus on generating novel descriptive sentences for a query image. Vinyals et al. [10] presented an end-to-end system that automatically generates reasonable descriptions of images in plain English, using a CNN as the encoder for image representation and an RNN as the decoder for sentence generation. Ghosh et al. [11] used discourse representation structure, further modeled as a graphical structure, to obtain semantic information for image captions. Fang et al. [12] combined multi-instance learning with a traditional maximum-entropy language model for description generation. More recently, Wang et al. [13] proposed an end-to-end trainable deep bidirectional LSTM model consisting of a deep CNN and two separate LSTM networks for description generation. Word2VisualVec [14] projects vectorized sentences into a given visual feature space, which is advocated as a new shared representation of images and sentences.

In bidirectional image and sentence retrieval, global-level methods [1,2,3] study the matching relations between image and sentence by representing each as a global vector, while fragment-level methods [4,5,6,7] align sentence fragments with image regions in a finer-grained manner. Socher et al. [1] employed a semantic dependency-tree recursive neural network (SDT-RNN) to map sentences into the same semantic space as image representations, measuring the association as the distance in that space. Eisenschtat and Wolf [2] introduced a bidirectional neural network architecture for matching vectors from two data sources. Yan and Mikolajczyk [21] used deep canonical correlation analysis (DCCA) for matching images and texts.

Other related work, closer to our motivation in this paper, studies the local inter-modal correspondences between image and sentence fragments. Karpathy et al. [4] broke images and sentences into fragments and embedded these fragments in a common space. Plummer et al. [5] treated objects as image fragments and collected region-to-phrase (RTP) correspondences for richer image-to-sentence models. Ma et al. [6] proposed multimodal convolutional neural networks consisting of an image CNN that encodes the image content and a matching CNN that composes words into semantic fragments and learns the inter-modal relations between the image and the composed sentence fragments at different levels. Huang et al. [7] proposed sm-LSTM, which selects salient pairwise instances from image and sentence and aggregates local similarity measurements. These fragment-level methods share the characteristic of representing image fragments as objects or instances whose visual features must be extracted separately. Although this processing is reasonable and interpretable, it relies on sophisticated, purpose-built neural networks that are usually complex and hard to train. Instead, in this paper we represent image fragments as the feature maps directly generated by a CNN model (the VGG model with 16 weight layers), so the design complexity of the image processing can be significantly reduced.

Fig. 1. The framework of the proposed method, including the visual module for composing fragmental features of images, the textual module for composing fragmental features of sentences, and the fusional module for generating the final matching score of images and sentences.

3 Modeling Image and Sentence Correlation

In this section, a novel convolutional neural network architecture is proposed for image and sentence matching. With this architecture, we aim to learn reasonable matching scores for image-sentence pairs, i.e., to assign higher scores to semantically similar pairs. The proposed model can be trained on a set of images and sentences whose relationships are labeled as relevant or irrelevant.

As illustrated in Fig. 1, the proposed method takes images and sentences as inputs and generates the corresponding matching scores. It is advantageous in allowing independent and specialized fragmental feature extraction from both images and sentences, so that the intermediate fragmental features can be flexibly encoded at a higher level to provide better matching scores. This is achieved by designing the architecture with three components. The textual module is responsible for abstracting textual data into sentence fragments. The visual module is responsible for extracting intermediate image features as visual fragments. Finally, the fusional module combines the extracted features with a matching convolutional neural network and a multi-layer perceptron (MLP). The matching convolutional neural network handles the interaction between the sentence and image fragments generated by the textual and visual modules and produces the fusional representation of the textual and visual data. The MLP then produces the final matching score between image and sentence.

3.1 Visual Module

Inspired by vector-based semantic representation in natural language processing, the visual module uses CNN feature maps in a novel way. As a remarkable method for NLP problems, word2vec [8] flexibly models the high-level semantics of a word or sentence with a vector. Similarly, we use a vector to represent partial or entire high-level semantics of an image: we regard each feature map generated by a CNN as the representation of partial visual semantics (i.e., a fragment) of the given image. The feature map is then flattened to a vector representing visual semantics, in the same way that a word embedding vector represents textual semantics. As illustrated in Fig. 1, VGG [15] (16 weight layers) with all fully-connected layers removed is utilized to generate fragments of an image \(\mathbf{{v}}\). We add a flatten layer that flattens all feature maps of the top max-pooling layer of VGG into vectors, which are then aligned into a matrix \(\mathbf{{pv}}\) as the intermediate representation of the image. The functional role of the visual module can be summarized as taking an image \(\mathbf{{v}}\) as input and generating the matrix representation through Eq. (1):

$$\begin{aligned} \mathbf{{pv}} = flatten(VGG(\mathbf{{v}})) \end{aligned}$$
(1)
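For illustration, the following is a minimal sketch of the visual module in Eq. (1), assuming PyTorch and a recent torchvision (the original work does not specify a framework); the 224×224 input size and the resulting (512, 49) shape of \(\mathbf{{pv}}\) are illustrative assumptions for VGG-16.

```python
# A minimal sketch of Eq. (1): VGG-16 without fully-connected layers,
# flattening each feature map of the top max-pooling layer into a vector.
import torch
import torchvision.models as models

class VisualModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Keep only the convolutional trunk (all fully-connected layers removed).
        self.features = vgg.features

    def forward(self, v):                 # v: (batch, 3, 224, 224)
        maps = self.features(v)           # (batch, 512, 7, 7)
        # Flatten each 7x7 feature map to a 49-dim vector and stack them
        # into the matrix pv, one row per image fragment.
        pv = maps.flatten(start_dim=2)    # (batch, 512, 49)
        return pv

# pv = VisualModule()(torch.randn(1, 3, 224, 224))   # pv.shape == (1, 512, 49)
```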

3.2 Textual Module

In the textual module, the wide convolutional layers and pooling layers of the Dynamic Convolutional Neural Network [9] are employed to model sentences, in order to handle the varying width of intermediate layers that arises because the length of the input sentence is not fixed. In the proposed method, a sentence is represented as a sentence matrix \(\mathbf{{s}}\) whose columns are the word embeddings \({\mathbf{{w}}_i} \in {R^\mathrm{{d}}}\). The values in each embedding \({\mathbf{{w}}_i}\) are parameters optimized during training.

$$\begin{aligned} \mathbf{{s}} = {[{\mathbf{{w}}_1}} \ldots {{\mathbf{{w}}_s}}] \end{aligned}$$
(2)

where the subscript s is the length of the sentence.
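For concreteness, here is a minimal sketch of building the sentence matrix \(\mathbf{{s}}\) in Eq. (2), assuming PyTorch; the vocabulary size and word indices are hypothetical, and the 50-dimensional embeddings follow the configuration given in Sect. 3.4.

```python
# A minimal sketch of Eq. (2): one trainable embedding column per word.
import torch

embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=50)  # trainable
word_ids = torch.tensor([12, 7, 391, 5, 88, 2, 64])   # a hypothetical 7-word sentence
s = embedding(word_ids).T                              # (50, 7): one column per word
```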

As illustrated in Fig. 1, taking the sentence matrix \(\mathbf{{s}}\) as input, the textual module generates the intermediate abstraction of a sentence, \(\mathbf{{ps}}\) (i.e., the sentence fragments), through a seven-layer structure: \(\mathbf{{s}}\) as the first layer, two wide convolution layers, one dynamic k-max pooling layer, one k-max pooling layer, one folding layer, and a final flatten layer. The relationships between consecutive layers are detailed in Fig. 2. The wide convolution layers are obtained by a wide convolution operation, which ensures that all weights in the filter reach the entire sentence, including the words at the margins. A dynamic k-max pooling layer is a k-max pooling operation in which k is a function of the sentence length and the depth of the network in the textual module. The function is designed as in [9]:

$$\begin{aligned} {k_l} = \max \left( {k_{top}},\left\lceil \frac{{L - l}}{L}\,s \right\rceil \right) \end{aligned}$$
(3)

where l is the index of the current convolutional layer to which the pooling is applied, L is the total number of convolutional layers in the network, and \({k_{top}}\) is the fixed pooling parameter applied on top of the folding layer. The k-max pooling operator with \(k = {k_{top}}\) is applied after the folding layer to guarantee that the input to the flatten layer is independent of the length of the input sentence. The folding layer sits after a convolutional layer and before (dynamic) k-max pooling, and sums every two rows of a feature map component-wise.
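The following is a minimal sketch of dynamic k-max pooling as defined in Eq. (3), assuming PyTorch; the tensor layout (batch, feature maps, width) and the example values are illustrative only.

```python
# A minimal sketch of Eq. (3) and the k-max pooling operation.
import math
import torch

def dynamic_k(l, L, s, k_top):
    """k for the l-th of L convolutional layers, given sentence length s."""
    return max(k_top, math.ceil((L - l) / L * s))

def k_max_pooling(x, k, dim=-1):
    """Keep the k largest values along `dim`, preserving their original order."""
    dim = dim % x.dim()
    index = x.topk(k, dim=dim).indices.sort(dim=dim).values
    return x.gather(dim, index)

# Example: after the first of L = 2 convolutional layers, a 10-word sentence
# with k_top = 5 is pooled with k = max(5, ceil(1/2 * 10)) = 5.
x = torch.randn(1, 2, 10)
pooled = k_max_pooling(x, dynamic_k(l=1, L=2, s=10, k_top=5))   # shape (1, 2, 5)
```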

Fig. 2. The architecture of the textual module with a seven-word input sentence. The word embeddings have size 16. The network has two convolutional layers with two feature maps each. The widths of the filters at the two layers are 3 and 2, respectively. The (dynamic) k-max pooling layers have k values of 5 and 3.

3.3 Fusional Module

As shown in Fig. 1, the fusional module is composed of a matching convolutional neural network, which generates the final fusional representation \({v_{\mathbf{{fr}}}}\) of image and sentence by jointly encoding the intermediate textual and visual features, and a multi-layer perceptron (MLP), which produces the matching score of the image-sentence pair. We design an eight-layer architecture for the fusional module (one 1D convolution layer, two 2D convolution layers, three pooling layers, and two MLP layers).

In Layer-1 of the fusional module (shown in Fig. 1), we apply sliding windows to the intermediate features of both the sentence and the image, i.e., \(\mathbf{{ps}}\) and \(\mathbf{{pv}}\), and model all possible combinations of them through one-dimensional (1D) convolution. The 1D convolution first composes different segments of the image and the sentence respectively, and then generates the different local matchings of these compositions. For segment i on \(\mathbf{{ps}}\) and segment j on \(\mathbf{{pv}}\), we have the feature map:

$$\begin{aligned} \mathbf{{z}}_{i,j}^{(1,f)}\mathop = \limits ^{def} \mathbf{{z}}_{i,j}^{(1,f)}(\mathbf{{ps}},\mathbf{{pv}}) = \sigma ({\mathbf{{w}}^{(1,f)}}{} \mathbf{{\hat{z}}}_{i,j}^{(0)} + {{b}^{(1,f)}}) \end{aligned}$$
(4)

where \(\mathbf{{z}}_{i,j}^{(1,f)}\) is the output of the feature map of type f for location (i, j) in Layer-1 of the fusional module, \({\mathbf{{w}}^{(1,f)}}\) denotes the parameters for type f in Layer-1, and \(\mathbf{{\hat{z}}}_{i,j}^{(0)}\) simply concatenates the vectors of the image fragments and sentence segments from \(\mathbf{{ps}}\) and \(\mathbf{{pv}}\):

$$\begin{aligned} \mathbf{{\hat{z}}}_{i,j}^{(0)} = {[\mathbf{{ps}}_{i:i + {k_1} - 1}^\mathrm{T},\mathbf{{pv}}_{j:j + {k_1} - 1}^\mathrm{T}]^\mathrm{T}} \end{aligned}$$
(5)

where \({k_1}\) is the width of the window in 1D convolution.

Clearly, the 1D convolution preserves the location information of the sentence segments. In the following Layer-2, a two-dimensional (2D) max-pooling over non-overlapping \(2 \times 2\) windows (illustrated in Fig. 1) is carried out.
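To make the windowed concatenation concrete, here is a minimal sketch of Layer-1 (Eqs. (4) and (5)), assuming PyTorch and treating each row of \(\mathbf{{ps}}\) and \(\mathbf{{pv}}\) as one fragment vector; the dimensions, the window width \(k_1 = 3\), and the number of feature maps are illustrative assumptions, and the double loop is written for clarity rather than efficiency.

```python
# A minimal sketch of Eqs. (4)-(5): 1D convolution over all (i, j) window pairs.
import torch

def fusional_layer1(ps, pv, weight, bias, k1=3):
    """ps: (len_s, d_s), pv: (len_v, d_v), weight: (F, k1*(d_s+d_v)), bias: (F,).
    Returns z of shape (F, I, J), one value per feature-map type f and location (i, j)."""
    I = ps.size(0) - k1 + 1
    J = pv.size(0) - k1 + 1
    z = torch.empty(weight.size(0), I, J)
    for i in range(I):
        for j in range(J):
            # Eq. (5): concatenate the k1-wide windows from ps and pv.
            z_hat = torch.cat([ps[i:i + k1].flatten(), pv[j:j + k1].flatten()])
            # Eq. (4): affine map plus the ReLU activation used as sigma.
            z[:, i, j] = torch.relu(weight @ z_hat + bias)
    return z

# Example with toy sizes (d_s = 8, d_v = 49, F = 4 feature maps).
ps, pv = torch.randn(6, 8), torch.randn(10, 49)
w, b = torch.randn(4, 3 * (8 + 49)), torch.randn(4)
z1 = fusional_layer1(ps, pv, w, b)          # shape (4, 4, 8)
```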

After Layer-2, we obtain a low-level fusional representation of sentence and image fragments. The next step is to generate a high-level fusional representation that encodes information from both sentence and image fragments. This is performed with general 2D convolutions and 2D poolings. A general 2D convolution is formulated as:

$$\begin{aligned} \mathbf{{z}}_{i,j}^{(l,f)} = \sigma ({\mathbf{{w}}^{(l,f)}}{} \mathbf{{\hat{z}}}_{i,j}^{(l - 1)} + {b^{(l,f)}}),l = 3,5,... \end{aligned}$$
(6)

where \(\mathbf{{\hat{z}}}_{i,j}^{(l - 1)}\) concatenates the corresponding vectors from its 2D receptive field in Layer-(\(l - 1\)). Each 2D convolution layer is followed by a general 2D pooling layer, and this combination can be repeated if necessary. We apply the pooling strategy used in [26] to all pooling layers except the first one.

After the final fusional representation \({v_{\mathbf{{fr}}}}\) of the image and sentence is generated by the matching convolutional neural network, a two-layer multi-layer perceptron taking \({v_{\mathbf{{fr}}}}\) as input is applied to produce the matching score as in Eq. (7):

$$\begin{aligned} score = {\mathbf{{w}}_s}\,\sigma ({\mathbf{{w}}_f}{v_{\mathbf{{fr}}}} + {b_f}) + {b_s} \end{aligned}$$
(7)

where \(\sigma ( \cdot )\) is the ReLU activation function, \({\mathbf{{w}}_f}\) and \({b_f}\) are the parameters of the first MLP layer, and \({\mathbf{{w}}_s}\) and \({b_s}\) are the parameters of the second MLP layer.
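A minimal sketch of the two-layer scoring MLP in Eq. (7), assuming PyTorch; the input and hidden dimensions are illustrative, and \(\sigma\) is taken as ReLU as stated above.

```python
# A minimal sketch of Eq. (7): score = w_s * sigma(w_f * v_fr + b_f) + b_s.
import torch

class ScoreMLP(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=256):
        super().__init__()
        self.fc1 = torch.nn.Linear(in_dim, hidden_dim)   # w_f, b_f
        self.fc2 = torch.nn.Linear(hidden_dim, 1)        # w_s, b_s

    def forward(self, v_fr):
        return self.fc2(torch.relu(self.fc1(v_fr)))      # scalar matching score

# score = ScoreMLP(in_dim=512)(torch.randn(1, 512))       # shape (1, 1)
```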

Table 1. Results comparison in terms of R@K (the higher the better) and Med r (median rank) (the lower the better) on Flickr8K. Values in brackets correspond to performances worse than our proposed method.

3.4 Implementation Details

Configuration. In implementing our method, we use 50-dimensional word embeddings trained with Word2Vec [8] in the textual module. The value of \(k_{top}\) for the top k-max pooling is 5. The widths of the wide convolutional filters in the two wide convolution layers are 7 and 5, respectively. The first convolutional layer has 6 feature maps and the second has 12.

In the visual module, we use VGG [15] (16 weight layers) with all fully-connected layers removed, initialized with the architecture and the original parameters learnt from the ImageNet dataset [24]. In the fusional module, a 3-word window is used in the first 1D convolution layer throughout all experiments. Various numbers of feature maps (typically from 150 to 300) are tested to obtain the optimal performance. We use ReLU as the activation function for all models (both the convolutions and the MLP), which is validated to perform better than sigmoid-like functions.

Training. To optimize the model parameters, a max-margin objective is formulated as Eq. (8), forcing the matching scores of correlated image-sentence pairs to be greater than those of uncorrelated pairs:

$$\begin{aligned} \begin{array}{l} loss({x_i},{y_i},{y_j},\varTheta ) = \\ \max (0,margin + score({x_i},{y_j}) - score({x_i},{y_i})) \end{array} \end{aligned}$$
(8)

where \(\varTheta \) denotes the parameters to be optimized, margin is a hyper-parameter controlling the penalty on matching scores, \(({x_i},{y_i})\) denotes a correlated image-sentence pair, and \(({x_i},{y_j})\) is a randomly sampled uncorrelated image-sentence pair \((i \ne j)\). The notational meanings of x and y vary with the matching task: for image retrieval from a query sentence, x denotes a sentence and y denotes an image, and vice versa. Iterative optimization would obviously be time-consuming if all irrelevant image-sentence pairs were taken into account, so we use stochastic gradient descent (SGD) with mini-batches (100 to 200 in size) for optimization. To avoid over-fitting, early stopping [16] and dropout (with probability 0.5) [17] are both employed.
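A minimal sketch of the max-margin objective in Eq. (8), assuming PyTorch and that the matching scores of the correlated pairs and of the randomly sampled uncorrelated pairs have already been computed by the model; the default margin of 0.5 follows the training setting described below.

```python
# A minimal sketch of Eq. (8): hinge loss over sampled negative pairs.
import torch

def ranking_loss(pos_scores, neg_scores, margin=0.5):
    """Force correlated pairs (x_i, y_i) to outscore uncorrelated pairs (x_i, y_j)."""
    return torch.clamp(margin + neg_scores - pos_scores, min=0).mean()

# Example: scores for matched pairs and mismatched pairs drawn from the batch.
pos = torch.tensor([1.2, 0.8, 0.3])
neg = torch.tensor([0.1, 0.9, 0.2])
loss = ranking_loss(pos, neg)       # mean hinge loss over the sampled pairs
```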

During training, \(margin=0.5\) is used in the max-margin function defined by Eq. (8) to force semantically similar image-sentence pairs to receive higher matching scores. Once training is completed, all training data are discarded and the resulting network is evaluated on a separate test set, on which the image-sentence pairs are scored and ranked.

4 Experiments and Discussions

In this section, the effectiveness of the proposed method is evaluated on two public datasets for both image retrieval and sentence retrieval.

4.1 Datasets and Evaluation Metrics

Datasets. The two public datasets Flickr8K [18] and Flickr30K [19] are employed to evaluate our method, since both contain images with corresponding descriptive sentences. Flickr8K consists of 8,000 images collected from Flickr, each with 5 sentences describing its content, while Flickr30K is a larger dataset of 31,783 images, also collected from Flickr and likewise described by 5 sentences per image. For Flickr8K, we use the standard training, validation, and testing split provided with the dataset. For Flickr30K, the public training, validation, and testing split of [20] is used directly in our experiments.

Evaluation metrics. The popular recall and median rank scores are evaluated for both image and sentence retrieval. R@K is the percentage of queries for which the ground truth appears among the top K returned results, a useful performance metric for search engines and ranking systems. The median rank indicates the position k in the returned list at which the system achieves a recall of 50%.
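As a minimal sketch of how these metrics can be computed, assume the 1-based rank of the ground-truth item has been collected for every query; the example rank values are purely illustrative.

```python
# A minimal sketch of R@K and median rank over a list of ground-truth ranks.
import numpy as np

def recall_at_k(ranks, k):
    """Percentage of queries whose ground truth appears in the top k results."""
    ranks = np.asarray(ranks)
    return 100.0 * np.mean(ranks <= k)

def median_rank(ranks):
    return float(np.median(ranks))

# Example with five queries.
ranks = [1, 3, 7, 2, 15]
print(recall_at_k(ranks, 5), median_rank(ranks))   # 60.0 3.0
```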

4.2 Results and Discussions

We compare our model with several representative models for bidirectional image-sentence retrieval, including DCCA [21], FV [22], Deep Fragment [4], m-RNN [20], m-CNNs [6], Bi-LSTMV [13] and DTSN [23]. The performances of these models are reported as state-of-the-art in the corresponding papers, and the reported metric values are compared directly here. The performances of the proposed method and the other methods on Flickr8K and Flickr30K are shown in Tables 1 and 2, respectively.

Table 2. Results comparison in terms of R@K (the higher the better) and Med r (median rank) (the lower the better) on Flickr30K. Values in brackets correspond to performances worse than our proposed method.

As shown in Tables 1 and 2, our method is competitive and reaches state-of-the-art performance across different metrics on both datasets. With the larger number of training instances in Flickr30K, the proposed method performs considerably better than on Flickr8K. Compared with Deep Fragment [4] and RTP [5], which represent image fragments as objects or instances, the comparable results achieved by the proposed method demonstrate that CNN feature maps are also promising representations of image fragments. Breaking down both images and sentences into fragments and jointly encoding them yields competitive results on both datasets: our method improves on or equals m-CNNs [6], which instead encodes an image as a single vector that interacts with different levels of sentence fragments, in at least one metric.

To further investigate the performance of the proposed method, an extended evaluation is carried out in an inter-dataset scenario, i.e., training the model on one dataset and testing on the other. The corresponding results are shown in Table 3. Our model trained on Flickr30K and tested on Flickr8K shows performance similar to the intra-dataset scenarios in Tables 1 and 2. However, the performance is much worse in the other direction, i.e., training on Flickr8K and testing on Flickr30K. This implies that more training instances are valuable for improving the proposed model, a reasonable phenomenon shared by many approaches.

Table 3. Performance comparison in the inter-dataset scenarios between Flickr8K and Flickr30K.

4.3 Limitations and Future Work

Although comparable performance can be achieved with the proposed method, there is still considerable room for improvement. In the visual module, a basic CNN architecture, VGG, is employed to generate image fragments; many CNN models with stronger image-representation capability, e.g., attention models [25], have great potential to improve our method. Similarly, adopting more optimized structures in the textual and fusional modules can also improve the overall performance, which the proposed architecture readily supports since it allows fragmental feature extraction and combination to be processed separately. One direction of future work is to extend the proposed method with more effective image-representation CNN models to improve the overall performance.

5 Conclusions

In this work, we introduced a novel fragment-level method for bidirectional cross-media information retrieval. The proposed method includes three modules: the visual module for representing image fragments, the textual module for representing sentence fragments, and the fusional module for generating the final matching scores of image-sentence pairs. In combination with the fragmental abstraction of the textual input, this novel representation of image fragments is demonstrated to be reasonable and effective. Experimental results on benchmark datasets show that our method is promising for bidirectional image and sentence retrieval and achieves performance comparable to many state-of-the-art models.