
1 Introduction

Multimodal analysis has received ever-increasing research attention due to the explosive growth of multimodal data such as images, text, video, and audio. A core problem in multimodal analysis is mining the intrinsic correlation across different modalities. In this paper, we focus on image-text matching: for example, given a query image, our aim is to retrieve the texts in the database that best describe it. There are two major challenges in multimodal matching: (1) effectively extracting features from the multimodal data; (2) correlating the features across different modalities.

Previous works on multimodal matching preferred to adopt off-the-shelf models to extract features rather than learn modality-specific ones. For images, well-known hand-crafted feature extraction techniques such as SIFT [1] and GIST [2] were widely used. Inspired by recent breakthroughs of convolutional neural networks (CNNs) in visual recognition, CNN visual features were also introduced to multimodal matching [14]. For text, latent Dirichlet allocation (LDA) [3] and word2vec [18] were two typical choices for vectorization. Despite their contributions to multimodal matching, off-the-shelf models share a common weakness: they are not specifically designed for the task of multimodal matching, so the resulting features are not discriminative enough, which limits the final matching performance.

Fig. 1. Overview of the proposed two-stream convolutional neural network.

Another challenge is to correlate these multimodal features. Most deep learning based methods [4, 5] depend heavily on categorical information for network training. However, such high-level semantic information is absent in most scenarios and requires extensive manual labeling. Furthermore, the explosive growth of data makes it unrealistic to label every sample with a certain category. Fortunately, co-occurring data usually carries correlated information (i.e., image-text pair information). This pair information is relatively easy to obtain via web crawlers and should be fully exploited for multimodal matching.

To address the above issues, we propose a novel two-stream convolutional neural network, shown in Fig. 1, which extracts visual and textual representations and simultaneously performs multimodal matching. The similarity between images and texts can thus be measured directly from the learned representations. More specifically, CNNs serve as the backbones to extract features from the raw images and texts respectively. The outputs of the two streams are concatenated and followed by several shared fully connected layers. The final output of the network is the class probabilities after a softmax regression. To train the network, we adopt an extreme multiclass classification loss and a ranking loss, both based only on the pair information.

The remainder of this paper is structured as follows. Section 2 reviews the related work. Section 3 presents our two-stream network for multimodal matching and its learning process, followed by experimental results in Sect. 4. Section 5 draws an overall conclusion.

2 Related Work

The core issue for multimodal matching is to learn discriminative and joint image-text representations. Canonical correlation analysis (CCA) [7] and cross-modal factor analysis (CFA) [8] were two classic methods; they linearly projected vectors from the two views into a shared space in which their correlation is maximized. Andrew et al. proposed deep CCA (DCCA) [12] to learn nonlinear transformations through two deep networks whose outputs are maximally correlated. Yan et al. [13] further introduced DCCA into image-text matching.

Inspired by recent breakthroughs in visual recognition, CNNs have also been widely employed in multimodal matching. Wei et al. [14] provided a new baseline for cross-modal retrieval with CNN visual features instead of traditional SIFT [1] and GIST [2] features. CNNs have also shown powerful abilities in natural language processing. Hu et al. [10] proposed a CNN-based sentence matching model that represented sentences and captured their matching relation simultaneously. In [9], convolutional architectures were first employed to learn the correlation between image and sentence by encoding their separate representations into a joint one.

There are also several deep models related to our work. In [6], a three-stream deep convolutional network was proposed to generate a shared representation across the image, text, and sound modalities. Wang et al. [15] presented a two-branch network to learn a joint image-text embedding; the network was trained with an extended ranking constraint and took only pre-extracted feature vectors as input. Mao et al. [16] proposed a multimodal recurrent neural network (m-RNN) model for image captioning and cross-modal retrieval. In [17], a selective multimodal network was presented that incorporated attention and a recurrent selection mechanism based on long short-term memory (LSTM).

3 Two-Stream CNN

3.1 Network Architecture

Overall Architecture. As exhibited in Fig. 1, the overall architecture of the proposed network contains two parts. The colored part with two streams performs feature extraction from the raw image and text. The gray part integrates the feature vectors from the different modalities through fully connected layers whose weights are shared, for further multimodal matching. In general, to generate a joint representation, the colored part is modality-specific while the gray part is shared across modalities.
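To make the data flow concrete, the following minimal sketch (in PyTorch, which the paper does not prescribe; the layer sizes, the ReLU non-linearity, and the interpretation of the concatenation as stacking the two modality batches before the shared layers are all assumptions) mirrors the colored/gray split of Fig. 1:

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Schematic of Fig. 1: modality-specific streams (colored part) followed by
    fully connected layers whose weights are shared across modalities (gray part)."""
    def __init__(self, img_dim=2048, txt_dim=2048, hid_dim=1024, num_docs=29783):
        super().__init__()
        self.image_stream = nn.Linear(img_dim, hid_dim)  # stands in for the ResNet-50 stream
        self.text_stream = nn.Linear(txt_dim, hid_dim)   # stands in for the textual CNN stream
        self.shared = nn.Sequential(                     # shared fully connected layers
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, num_docs))                # logits over pseudo classes

    def forward(self, img, txt):
        # each modality goes through its own stream, then the same shared layers
        z_img = self.shared(self.image_stream(img))
        z_txt = self.shared(self.text_stream(txt))
        return z_img, z_txt

net = TwoStreamNet()
z_img, z_txt = net(torch.randn(4, 2048), torch.randn(4, 2048))   # (4, 29783) each
```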

Image Stream. We adopt a 50-layer ResNet model [11] pretrained on the ImageNet classification task as the visual CNN and discard its top fully connected layer designed for ImageNet. Thus, given a raw image resized to \(224\,\times \,224\), the model produces a 2048-\(\textit{dim}\) vector after average pooling, which is taken as the image representation.
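A minimal sketch of this step, assuming PyTorch and torchvision (not mandated by the paper), which strips the ImageNet classification layer and keeps the 2048-dim pooled feature:

```python
import torch
import torchvision.models as models

# ImageNet-pretrained ResNet-50 with the top fully connected layer removed;
# what remains ends in global average pooling.
resnet = models.resnet50(pretrained=True)
image_backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

images = torch.randn(8, 3, 224, 224)      # a batch of images resized to 224 x 224
with torch.no_grad():
    feats = image_backbone(images)        # shape (8, 2048, 1, 1)
img_vectors = feats.flatten(1)            # shape (8, 2048): one vector per image
```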

Fig. 2. Overview of the textual CNN stream.

Text Stream. Since each image can be represented by a fixed-length vector with a CNN, we also design a textual CNN with three convolutional layers to vectorize the text, as shown in Fig. 2. The text is first encoded into a \(1\,\times \,n\,\times \,d\) numerical matrix \(\mathbf {T}\), where n is the length of the sentence and d is the size of the vocabulary. The vocabulary contains all tokens appearing in the corpus. Let \(w_i\) be the i-th word in the vocabulary; \(w_i\) can then be converted into a high-dimensional sparse one-hot vector \(\mathbf {v_i}\) whose i-th element is set to 1 and whose remaining elements are 0. The embedding layer then turns each \(\mathbf {v_i}\) into a low-dimensional dense word embedding \(\mathbf {e_i}\) of length k via a lookup table. Thus, each sentence is encoded into a \(1\times n\times k\) matrix.
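For illustration, the lookup described above can be written as follows (the word indices are placeholders; d and k match the values reported in Sect. 4.2):

```python
import torch
import torch.nn as nn

d, k, n = 20074, 300, 32                  # vocabulary size, embedding size, sentence length
embedding = nn.Embedding(num_embeddings=d, embedding_dim=k)

# A sentence given as indices of its words in the vocabulary (placeholder values).
word_ids = torch.randint(0, d, (1, n))    # shape (1, n)
E = embedding(word_ids)                   # shape (1, n, k): one dense k-dim embedding per word
```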

Though the embedding layer encodes the semantic information of each word into a vector, simply concatenating the word vectors ignores many subtleties of a good representation, e.g., word ordering. Therefore, the subsequent convolutional layers are employed to capture the sequence information of the words. In each convolutional layer, the context within the sentence is modeled using two convolution kernels of size \(1\,\times \,2\) and \(1\,\times \,3\), respectively, and the outputs of the two convolution operations are concatenated directly and fed into the following layers. At the end of the network, a pooling layer with dropout produces the final output, whose size matches that of the image feature. The convolutional layers, combined with the word embedding, ensure that the output feature carries the information necessary to effectively represent sentences for further multimodal matching.
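The paper does not report channel widths or the exact layout of the three layers, so the sketch below (with assumed sizes) shows only one such convolutional layer: two kernels of size 1x2 and 1x3 whose outputs are concatenated along the channel dimension, followed by a pooling step:

```python
import torch
import torch.nn as nn

class TextConvBlock(nn.Module):
    """One convolutional layer of the textual CNN: kernels of size 1x2 and 1x3
    over the word dimension, outputs concatenated along channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 2), padding=(0, 1))
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))

    def forward(self, x):                               # x: (B, C, 1, n)
        a, b = self.conv2(x), self.conv3(x)
        L = min(a.size(-1), b.size(-1))                 # crop to a common length
        return torch.relu(torch.cat([a[..., :L], b[..., :L]], dim=1))

block = TextConvBlock(in_ch=300, out_ch=256)            # channel widths are assumptions
E = torch.randn(8, 32, 300)                             # (B, n, k) word embeddings
x = E.permute(0, 2, 1).unsqueeze(2)                     # -> (B, k, 1, n) for 2-D convolution
h = block(x)                                            # (B, 512, 1, 32)
text_vec = nn.functional.adaptive_max_pool2d(h, 1).flatten(1)   # pooled text vector
```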

3.2 Network Learning

Objective Function. Supervised semantic labels usually play an important role in deep neural network learning. However, the lack of such labels poses a unique challenge to multimodal matching: how to effectively utilize the image-text pair information, which is the only supervision available. In this paper, we transform multimodal matching into an extreme multiclass classification task, where matching amounts to accurately classifying an instance among tens of thousands of classes. Here, each multimodal document, consisting of an image and its corresponding text, is viewed as a pseudo class. Given an instance \(x^{i}\), we apply the softmax function to the output of the network \(\mathbf {z}\in \mathfrak {R}^{1\times N}\) (N is the number of multimodal documents). Thus, we obtain the posterior probability of the instance being classified into the correct category c, formally written as Eq. (1).

$$\begin{aligned} P(c|x^{i}) = \textit{softmax}(\mathbf {z})_{c} = \frac{e^{\mathbf {z}_c}}{\sum _{j=1}^{N}e^{\mathbf {z}_j}}. \end{aligned}$$
(1)

Then we minimize the negative log-likelihood of \(P(c|x^{i})\), defined as Eq. (2).

$$\begin{aligned} L_{cls} = -\log P(c|x^{i}). \end{aligned}$$
(2)
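In practice, Eqs. (1) and (2) together are the standard cross-entropy loss over document indices; a minimal sketch (the batch and the placeholder network output are assumptions; the number of pseudo classes equals the 29,783 training documents of Sect. 4.1):

```python
import torch
import torch.nn as nn

num_docs = 29783                            # one pseudo class per training image-text document
criterion = nn.CrossEntropyLoss()           # softmax + negative log-likelihood, i.e. Eqs. (1)-(2)

z = torch.randn(8, num_docs)                # placeholder for the network outputs z of a batch
doc_ids = torch.randint(0, num_docs, (8,))  # pseudo-class index c of each instance
L_cls = criterion(z, doc_ids)               # averaged over the batch
```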

To obtain more discriminative representations, we also perform metric learning based on a ranking constraint. For an anchor \(x^a\), the distance to a positive instance \(x^p\) and the distance to a negative instance \(x^n\) should be separated by a margin \(\alpha \) (\(\alpha \) = 0.1 in our case), i.e., \(d(x^a,x^p) + \alpha < d(x^a,x^n)\). Instances sharing the same pseudo class as \(x^a\) are defined as \(x^p\); all others are \(x^n\). We compute the cosine distance between the feature vectors \((\mathbf {v}_i,\mathbf {v}_j)\) of two instances \((x^i,x^j)\) as \(d(x^i,x^j) = 1- \frac{\mathbf {v}_{i}\cdot \mathbf {v}_j}{{\parallel \mathbf {v}_i\parallel }_2{\parallel \mathbf {v}_j\parallel }_2}.\) We further define the bi-directional ranking constraint with a hinge loss for the image reference \((x_{img}^a, x_{txt}^p, x_{txt}^n)\) and the text reference \((x_{txt}^a, x_{img}^p, x_{img}^n)\) respectively, as Eq. (3).

$$\begin{aligned} {\begin{matrix} L_{rank} = \max \{0,d(x_{img}^a, x_{txt}^p)-d(x_{img}^a, x_{txt}^n)+\alpha \}\\ + \max \{0,d(x_{txt}^a, x_{img}^p)-d(x_{txt}^a, x_{img}^n)+\alpha \}. \end{matrix}} \end{aligned}$$
(3)
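A sketch of the bi-directional hinge loss of Eq. (3), using the cosine distance defined above and the paper's margin \(\alpha = 0.1\) (the batched formulation and the mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def cosine_distance(u, v):
    """d(x_i, x_j) = 1 - cosine similarity, computed row-wise on feature vectors."""
    return 1.0 - F.cosine_similarity(u, v, dim=1)

def bidirectional_ranking_loss(img_a, txt_p, txt_n, txt_a, img_p, img_n, margin=0.1):
    # image as anchor: the matched text must be closer than a mismatched one by the margin
    l_img = torch.clamp(cosine_distance(img_a, txt_p) - cosine_distance(img_a, txt_n) + margin, min=0)
    # text as anchor: the matched image must be closer than a mismatched one by the margin
    l_txt = torch.clamp(cosine_distance(txt_a, img_p) - cosine_distance(txt_a, img_n) + margin, min=0)
    return (l_img + l_txt).mean()

# Placeholder feature vectors for a batch of 8 triplets per direction.
v = lambda: torch.randn(8, 2048)
L_rank = bidirectional_ranking_loss(v(), v(), v(), v(), v(), v())
```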

The final objective function is a weighted combination of the classification loss and ranking loss as Eq. (4).

$$\begin{aligned} L = \lambda _1L_{cls} + \lambda _2L_{rank}. \end{aligned}$$
(4)

Training Scheme. Network training is done in three steps. First, we fix the image stream and train the remaining part using the classification loss (\(\lambda _2 = 0\); only text data is used). The reason is that the image stream can be initialized with weights pretrained on ImageNet, whereas the remaining parts have to be learned from scratch. Second, after step 1 converges, we update the weights of the entire network (\(\lambda _2 = 0\); both text and image data are used). Considering that the ranking loss usually converges very slowly, or even fails to converge, especially in two-stream network learning, we fine-tune the entire network with the combination of the classification loss and the ranking loss (\(\lambda _1 = 1, \lambda _2 = 1\)) only in the last step.
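The three stages can be implemented by toggling which parameters are optimized and which loss terms are active; a schematic sketch with placeholder modules (learning rates follow Sect. 4.2; module names and sizes are assumptions):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three parts of the network
# (the real image stream is ResNet-50, the text stream is the CNN of Fig. 2).
image_stream = nn.Linear(2048, 2048)
text_stream = nn.Linear(300, 2048)
shared_head = nn.Linear(2048, 29783)

def make_sgd(params, lr):
    return torch.optim.SGD(params, lr=lr, momentum=0.9)

# Stage 1: freeze the ImageNet-pretrained image stream; train the rest with L_cls only.
for p in image_stream.parameters():
    p.requires_grad = False
opt1 = make_sgd(list(text_stream.parameters()) + list(shared_head.parameters()), lr=1e-3)

# Stage 2: unfreeze everything; still L_cls only (lambda_2 = 0), lower learning rate.
for p in image_stream.parameters():
    p.requires_grad = True
all_params = [*image_stream.parameters(), *text_stream.parameters(), *shared_head.parameters()]
opt2 = make_sgd(all_params, lr=1e-4)

# Stage 3: fine-tune with L = L_cls + L_rank (lambda_1 = lambda_2 = 1), lowest rate.
opt3 = make_sgd(all_params, lr=5e-5)
```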

4 Experiment

4.1 Datasets and Evaluation Metrics

We choose the widely-used Flickr30k dataset [19] for our experiments. Flickr30k contains 31,783 images collected from the Flickr website, each described with five sentences. We follow the partition scheme in [16, 17], where 29,783, 1,000, and 1,000 images are used for training, validation, and testing respectively. R@k and Med r are adopted as evaluation metrics. R@k is the average recall rate over all queries in the test set: given a query, the recall is 1 if at least one ground truth occurs in the top-k returned results and 0 otherwise. Med r is the median rank of the closest ground truth in the ranking list.
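A small sketch of how R@k and Med r can be computed from a score matrix (simplified to one ground-truth item per query; the scores are random placeholders):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose closest ground truth appears in the top-k results."""
    return float(np.mean(ranks <= k))

def med_r(ranks):
    """Median rank of the closest ground truth over all queries."""
    return float(np.median(ranks))

# Placeholder similarity scores between 1,000 queries and 1,000 candidates,
# where candidate i is assumed to be the ground truth of query i.
sim = np.random.rand(1000, 1000)
order = np.argsort(-sim, axis=1)                          # candidates sorted by decreasing score
ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1  # 1-based rank of the ground truth
                  for i in range(sim.shape[0])])

print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), recall_at_k(ranks, 10), med_r(ranks))
```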

4.2 Implementation Details

For Flickr30k, the vocabulary size d is 20,074, and each word is encoded into a 300-\(\textit{dim}\) dense vector. To ensure that each input sentence has the same length of 32, we pad shorter sentences with zero vectors, and we initialize our embedding layer with the pre-trained vectors of the word2vec [18] model. The network is optimized by backpropagation and mini-batch stochastic gradient descent with the momentum fixed to 0.9. For the three training steps, the learning rate is set to 0.001, 0.0001, and 0.00005 respectively, and the maximum number of epochs is set to 180, 60, and 20 accordingly. In our experiments, we observe convergence within 150, 30, and 10 epochs respectively.
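A sketch of the padding and word2vec initialization described above (the word2vec matrix here is a random placeholder aligned with the vocabulary; treating index 0 as the padding token is an assumption):

```python
import numpy as np
import torch
import torch.nn as nn

n, d, k = 32, 20074, 300
embedding = nn.Embedding(d, k, padding_idx=0)

# Placeholder for pre-trained word2vec vectors, one row per vocabulary entry.
word2vec_matrix = np.random.randn(d, k).astype("float32")
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(word2vec_matrix))
    embedding.weight[0].zero_()              # keep the padding embedding at zero

def pad_to_length(word_ids, length=n):
    """Right-pad a list of word indices with 0 (padding) to the fixed sentence length."""
    return word_ids[:length] + [0] * max(0, length - len(word_ids))

sentence = pad_to_length([17, 4021, 95])     # hypothetical word indices
E = embedding(torch.tensor([sentence]))      # shape (1, 32, 300)
```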

4.3 Experimental Results

We consider two basic multimodal tasks: Img2Txt (an image query retrieving texts) and Txt2Img (a text query retrieving images). Table 1 presents the results of different methods in terms of R@k and Med r. The proposed network outperforms the other methods in the Img2Txt task with the highest R@1 of 48.4%. In the Txt2Img task, the R@1 obtained by our method is only 0.7% lower than that of the best method, RBF-Net [20]. These results indicate that the learned features are effective for multimodal matching. The superiority of our network can be attributed to two aspects: (1) we perform feature extraction and multimodal matching simultaneously, so the learned features are tailored to the matching task rather than being generic off-the-shelf representations; (2) we fully exploit the image-text pair information via the classification and ranking losses to generate more discriminative representations.

We also conduct experiments to analyze the effect of the training scheme. Step 1 trains only the text stream using the classification loss and directly adopts the image features extracted from the pre-trained ResNet-50. Step 2 trains the entire network using the classification loss, which encourages instances from the same document to fall into one category; as a result, step 2 gains a large increase of about 10% and 6% in R@1 on the two retrieval directions respectively. Step 3 adds the ranking constraint to further fine-tune the network, which yields a higher performance for the final model.

Table 1. Bidirectional image and text retrieval results on Flickr30K.

Another point worth noting is that the improvement brought by step 3 is not as impressive as that brought by step 2. On the one hand, this illustrates the effectiveness of posing multimodal matching as a classification problem. On the other hand, considering the effectiveness of the ranking loss in previous works, there is likely still room for improvement in our network, especially regarding the weaker R@5 and R@10. The ranking loss requires a careful strategy for sampling triplets from the extremely unbalanced positive and negative pairs, which points out the direction of our future work.

5 Conclusion

This paper addresses the problem of multimodal matching via a novel two-stream convolutional neural network. The proposed network extracts features directly from the raw image and text. To ensure that the features are shared between the different modalities, a classification loss and a ranking constraint are adopted for network learning by utilizing the pair information. Experimental results on the Flickr30k dataset demonstrate the effectiveness of viewing each multimodal document as a discrete class. In future work, the ranking constraint will be refined to perform more effective metric learning, and more detailed experiments on the Microsoft COCO dataset will be conducted to further validate our network.