1 Introduction

With the development of deep learning, neural networks have achieved great progress in computer vision and natural language processing. Cross-modal tasks, mainly between images and texts, are gaining more and more attention [5]. In this work, we focus on one major task in cross-modal learning: image-text matching.

The goal of the image-text matching task is to find the best-matching pairs among a large number of given images and texts. Thus, in real-time applications, it is vital to retrieve the best matches for a given image or text efficiently.

Fig. 1. Different encoding methods

Traditional deep learning solutions find a shared latent space [25] by encoding image and text features separately. Normally, convolution-based networks [11] are used to encode images, while RNN-based networks [8] such as LSTM [12] are applied for text encoding. Distance measurements like cosine similarity are then used to calculate the similarity of the pooled vectors from the two modalities, and a triplet ranking loss [25] is applied to train the neural network to find the most similar pairs across modalities. These architectures are illustrated in Fig. 1(a). As shown, features across modalities are isolated since they are separately encoded.
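As a rough illustration of this separate-encoding setup, the hinge-based triplet ranking objective over cosine similarities of pooled vectors can be sketched as follows (PyTorch-style; all names and the margin value are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_vecs, txt_vecs, margin=0.2):
    """Hinge-based triplet ranking loss for a batch of matched (image, text)
    pairs; the i-th image and i-th text are assumed to be the positive pair."""
    img_vecs = F.normalize(img_vecs, dim=-1)
    txt_vecs = F.normalize(txt_vecs, dim=-1)
    scores = img_vecs @ txt_vecs.t()                      # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                       # similarity of matched pairs

    cost_i2t = (margin + scores - pos).clamp(min=0)       # other texts as negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)   # other images as negatives

    # Do not penalize the positive pair itself.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()
```

Hard-negative variants such as [9] keep only the hardest negative in each direction instead of summing over all negatives.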

Recently, much progress has been achieved with the development of pre-trained models in different modalities. These improvements make it possible to joint-encode features across modalities and learn a joint representation of vision and language.

Pre-trained models push the state-of-the-art performance of many tasks to a new level. In the CV field, pre-trained models such as VGG [24] and ResNet [11] have served as backbone models that extract visual features for downstream tasks. In the NLP field, pre-trained models, exemplified by ELMo [19], GPT [21] and BERT [6], use the fine-tuning paradigm to achieve new state-of-the-art performance on downstream tasks like natural language inference [2].

The rise of pre-trained encoders allows separate encoding to produce single-modal features of higher representation quality. However, the output distributions of pre-trained encoders differ across modalities, so the traditional bottom-up structures (Fig. 1(a)) have difficulty projecting cross-modal features into a shared latent space.

In the cross-modal field, following the idea of applying large-scale data to create pre-trained models, joint-encoding methods are later built on large-scale image-text paired data [15]. This architecture, shown in Fig. 1(b), feeds texts and images together into an attention-based encoder [27] to learn joint representations across the two modalities. These models, exemplified by Unicoder-VL [15] and UNITER [4], achieve new state-of-the-art results on many cross-modal tasks such as VQA [1], image captioning [3], and image-text matching [25]. In image captioning and VQA, the goal is to generate corresponding captions or to find answer spans, which requires images and texts to be entangled with each other; joint-encoding models boost these tasks to a whole new level.

However, in the image-text matching task, the goal is to find the best-matching pairs from a large number of images and texts. Since joint-encoding methods take combined texts and images as input, during inference these models must run the pre-trained structure over all possible pairs, which incurs massive computational cost. We name this unacceptable cost the Inference Disaster. It constrains these models in real-time usage despite their outstanding performance.

As illustrated above, in the image-text matching task, traditional methods are relatively weak in representation encoding compared with joint-encoding methods pre-trained on large-scale image-text pairs; meanwhile, the joint-encoding methods suffer from the inference disaster.

In this work, in order to maintain retrieval efficiency while promoting model performance, we propose an Enhanced Separate Encoding Framework that modifies the separate encoding framework, focusing on excavating multi-layer features of separately pre-trained visual and textual encoders and projecting them into a common subspace.

Our proposed framework is built on separate encoding models and is therefore very efficient during inference compared with joint-encoding methods.

We attach extra encoding modules to align and project features across modalities. These extra modules extract features from the entire pre-trained encoder in each modality and project them into a shared latent space, so the representations across modalities are less distant than the raw separately pre-trained features.

Experiments show that our proposed framework achieves competitive performance against joint-encoding methods without using large-scale image-text pairs for pre-training, and outperforms all previous separate-encoding methods on the Flickr30K and MS-COCO datasets.

To summarize, our contributions are:

  (a) We analyze the traditional separate-encoding methods as well as recent joint-encoding methods, pointing out the importance of both performance and efficiency in the image-text matching task.

  (b) We propose a framework that breaks the limits of separate encoding methods. The framework outperforms all previous separate-encoding methods and achieves competitive performance against joint-encoding methods, while not using large-scale image-text pairs for pre-training.

2 Related Work

2.1 Traditional Methods in Image-Text Matching

Encoding features from different modalities separately has been the dominant approach. The goal is to find a better shared latent space for image and text features. The triplet ranking loss was introduced by [25] and is used to narrow the distance between matching pairs. [9] incorporated hard negative mining to focus on the maximum-violating negative pairs, which is widely adopted by later works. More recently, [14, 28] introduced the Faster R-CNN network to use regional semantic features to enhance image encoding quality. Other approaches, such as incorporating knowledge graphs [23] and using graph networks [16, 29], have been explored to further boost performance. Most of these methods encode image features with pre-trained models such as ResNet and Faster R-CNN, while encoding text features with RNNs. Thus, when pre-trained text encoders are incorporated, it becomes more difficult to learn a shared latent space over the two different distributions produced by the pre-trained encoders of the two modalities.

2.2 Pre-trained Models and Joint-Encoding

In the computer vision field, ResNet [11] and VGG [24] are widely used as backbones of vision models. These convolution-based models are trained on image classification data such as ImageNet. Models like Fast R-CNN [10] and Faster R-CNN [22] are built on these backbones and target detection and segmentation tasks.

The recent rise of pre-trained models in natural language processing started with ELMo [19], which uses unsupervised data to train language models. GPT [21] and BERT [6] introduce the attention-based structure called the transformer [27], taking NLP research into a new era of pre-training. These successes motivate researchers to construct cross-modal pre-trained models using large-scale cross-modal datasets. Such models combine pre-computed regional features with text sequences to create joint-encoded features, exemplified by UNITER [4], Unicoder-VL [15] and LXMERT [26]. They achieve strong performance on cross-modal tasks such as VQA and image captioning; yet in the image-text matching task, their inference efficiency is limited by their joint-encoding nature.

3 Limits of Previous Encoding Methods

3.1 Different Distribution in Separate Encoding

When both modalities are equipped with pre-trained encoders, exemplified by ResNet for images and BERT for texts, their feature distributions are inherently different, making it difficult for previous methods to project the two modalities into a shared latent space.

3.2 Inference Disaster in Joint Encoding

Joint-encoding models are pre-trained with large-scale image-text paired data [4, 15, 17, 26, 30].

Fig. 2. Structure of enhanced separate encoding framework

Most of these methods first encode input image region features extracted by an R-CNN model trained following [13]. These regional features from the original images serve as tokens in a sequence.

Despite the excellent performances on downstream tasks, such structures face a massive computational cost during inference in the matching task. Suppose N sequences (captions) and M images are to be examined, giving \(M \times N\) entangled pairs in total, the inference time for each pair is T, and the batch size is B. The model needs to run \(M \times N\) forward passes, resulting in a time cost of \( \frac{M \times N \times T}{B}\), i.e., an \(O(n^2)\) time complexity. In contrast, inference with separately encoded features only needs to compute cosine similarities between pairs, which has an O(n) time complexity.
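The difference can be made concrete with a small sketch (illustrative code, not the paper's implementation; `joint_model` is a hypothetical pair-scoring function):

```python
import numpy as np

def separate_encoding_scores(img_vecs, txt_vecs):
    """Separate encoding: each image/caption is encoded once, so matching
    reduces to one (M, N) matrix of cosine similarities."""
    img_vecs = img_vecs / np.linalg.norm(img_vecs, axis=1, keepdims=True)
    txt_vecs = txt_vecs / np.linalg.norm(txt_vecs, axis=1, keepdims=True)
    return img_vecs @ txt_vecs.T

def joint_encoding_scores(images, captions, joint_model):
    """Joint encoding: every (image, caption) pair needs its own forward pass
    through the full pre-trained model, i.e. M * N expensive calls."""
    scores = np.zeros((len(images), len(captions)))
    for i, img in enumerate(images):
        for j, cap in enumerate(captions):
            scores[i, j] = joint_model(img, cap)
    return scores
```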

4 Framework Construction

Separate encoding methods are less effective when applying pre-trained encoders in both modalities, because features from the two modalities follow different distributions; meanwhile, joint-encoding methods, though encoding jointly, suffer from an inefficient inference process. Weighing these advantages and disadvantages, we propose an enhanced separate encoding method that aims to narrow the distance between features from two different pre-trained encoders. The core motivation is to allow separately pre-trained features to be further encoded by non-pre-trained modules; since these modules are trained jointly from scratch, the resulting features become more similar in nature.

Therefore, we construct extra modules to align and extract pre-trained cross-modality multi-layer features and train these modules from scratch to learn a shared latent space (Fig. 2).

The entire enhanced separate encoding framework consists of three steps: feature encoding, feature alignment, and feature projection.

4.1 Feature Encoding

First, we obtain the multi-layer features of separate pre-trained encoders.

Separately pre-trained encoders are trained on different types of corpora. In image pre-training, ResNet is trained with image classification data, and its feature maps serve as the backbone for downstream tasks; the Faster R-CNN model is trained with object detection or semantic segmentation data, and its outputs are regional features of a given image. In text pre-training, BERT is trained with a masked language modeling objective on a large-scale Wikipedia corpus; based on the transformer structure, its outputs are multi-layer token-level features.

We combine all levels of separately pre-trained features to find better cross-modal representations. In image encoding, we denote the \(i^{th}\) layer of the feature map from ResNet as \(H_i \in \mathbb {R}^{W_i\times H_i \times D_i}\), where \(W_i\) and \(H_i\) are the width and height of the convolution output. We denote the regional features from Faster R-CNN as \(H_r \in \mathbb {R}^{N^r \times D_r}\), where \(N^r\) is the number of regions. In text encoding, we denote the \(j^{th}\) transformer block output from BERT as \(S_j \in \mathbb {R}^{L \times D_j} \), where L is the sequence length.
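A minimal sketch of this extraction step, assuming torchvision's ResNet-152 and Hugging Face's `bert-base-uncased` (the regional features \(H_r\) are assumed to be pre-computed by a Faster R-CNN pipeline), could look like:

```python
import torch
import torchvision.models as models
from transformers import BertModel, BertTokenizer

resnet = models.resnet152(pretrained=True).eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def resnet_multilayer_features(images):           # images: (B, 3, 224, 224)
    """Collect the intermediate feature maps H_i from the four ResNet stages."""
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(images))))
    feats = []
    for stage in (resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4):
        x = stage(x)
        feats.append(x)                           # (B, D_i, H_i, W_i)
    return feats

@torch.no_grad()
def bert_multilayer_features(sentences):
    """Collect the per-layer token features S_j from all 12 BERT blocks."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=32, return_tensors="pt")
    out = bert(**enc, output_hidden_states=True)
    return out.hidden_states[1:]                  # 12 tensors of shape (B, L, D_j)
```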

These features are obtained separately from pre-trained models and are thus quite different across modalities.

4.2 Feature Alignment

In text encoding, the output features are token-level, specifically sub-word level in BERT. In image encoding, the output features are feature maps from ResNet and regional features from the R-CNN network. Therefore, it is difficult to directly project these features, which have different layers and different dimensions, into a shared latent space. We convert the different layers of features into aligned regional features across modalities by reshaping them via feature concatenation and average pooling.
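As a hedged sketch of this alignment step (the pooling grid and layer grouping follow the settings reported later in Sect. 5.2; function names are ours):

```python
import torch
import torch.nn.functional as F

def align_image_features(feature_maps, grid=7):
    """Average-pool every ResNet feature map to a fixed grid and flatten it,
    so each map becomes (B, grid*grid, D_i) 'regional' features."""
    aligned = []
    for fmap in feature_maps:                            # (B, D_i, H_i, W_i)
        pooled = F.adaptive_avg_pool2d(fmap, grid)       # (B, D_i, grid, grid)
        aligned.append(pooled.flatten(2).transpose(1, 2))
    return aligned

def align_text_features(hidden_states, group=3):
    """Concatenate every `group` BERT layers along the hidden dimension so the
    text side also yields a small stack of multi-level token features; a linear
    transformation (Sect. 5.2) then maps each level to its target dimension."""
    return [torch.cat(hidden_states[s:s + group], dim=-1)
            for s in range(0, len(hidden_states), group)]
```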

4.3 Feature Projection

After feature alignment, we have multi-level regional image features and multi-level sub-word textual features. The feature projection is a two-phase process:

Region/Token-Wise Projection. First, we project both the region features from image encoding and the token features from text encoding into a similar latent space. Token-region matching can be better encoded with attention-based modules, as explored by [6, 14, 15]; thus we construct a self-attention based encoder to encode these aligned features.

The encoder F(X) follows a standard transformer structure [27].

$$\begin{aligned} A = \mathrm {Softmax}\left(\frac{(XW_q)(XW_k)^{T}}{\sqrt{d}}\right)(XW_v)\end{aligned}$$
(1)
$$\begin{aligned} F(X) = \mathrm {LayerNorm}(X + A + \mathrm {FFN}(A)) \end{aligned}$$
(2)

We feed the aligned features \(\widehat{H}_i\) and \(\widehat{H}_r\) from the image encoder and \(\widehat{S}_j\) from the text encoder into corresponding transformer blocks to obtain token/region level features. Since we combine both ResNet features and Faster R-CNN features, we duplicate the last layer \(\widehat{S}_k\) to create \(\widehat{S}_r\), matching the corresponding \(\widehat{H}_r\). We then apply average pooling over the region/token level representations to obtain vectors for the given image and text.

$$\begin{aligned} \vec {H}_i = AvgPool(F_i(\widehat{H}_i)) , \vec {H}_r = AvgPool(F_r(\widehat{H}_r)) , \vec {S}_k = AvgPool(F_k(\widehat{S}_k)) \end{aligned}$$
(3)
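A possible implementation of this projection, sketched with PyTorch's built-in multi-head attention (hyperparameters follow Sect. 5.2; this is our reading of Eqs. 1-3, not released code):

```python
import torch.nn as nn

class RegionTokenProjection(nn.Module):
    """One-layer transformer-style encoder F(X) (Eqs. 1-2) followed by
    average pooling over regions/tokens (Eq. 3)."""
    def __init__(self, dim, heads=8, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                   # x: (B, N, dim) aligned features
        a, _ = self.attn(x, x, x)           # self-attention A, Eq. 1
        h = self.norm(x + a + self.ffn(a))  # residual + FFN + LayerNorm, Eq. 2
        return h.mean(dim=1)                # pooled vector, Eq. 3
```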

Layer-Wise Projection. As mentioned in feature alignment, we use layer concatenation to align multi-level features, which is rigid in nature. We do not know in advance which levels of features across modalities may be encoded most similarly, so we fully connect these vectors, allowing different levels of features to match their potentially similar counterparts across modalities.

$$\begin{aligned} \vec {V_H} = Linear(Concat([\vec {H}_0, \cdots , \vec {H}_i, \cdots ],\vec {H}_r))\end{aligned}$$
(4)
$$\begin{aligned} \vec {V_S} = Linear(Concat([\vec {S}_0, \cdots , \vec {S}_k, \cdots ], \vec {S}_r)) \end{aligned}$$
(5)
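In code, this layer-wise projection amounts to a concatenation followed by a single linear layer (a sketch of Eqs. 4-5 under our own naming):

```python
import torch
import torch.nn as nn

class LayerWiseProjection(nn.Module):
    """Concatenate the pooled vectors from all levels (plus the regional-level
    vector) and mix them with one fully connected layer, as in Eqs. 4-5."""
    def __init__(self, level_dims, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(sum(level_dims), out_dim)

    def forward(self, level_vecs):          # list of (B, D_level) pooled vectors
        return self.fc(torch.cat(level_vecs, dim=-1))
```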

These two steps of feature projection encode features that are inherently different into a similar latent space. Since jointly encoding the concatenated token and region features is not feasible in separate encoding, we decompose the separately encoded features region/token-wise and layer-wise, and align them so that they can be encoded into a more similar latent space.

After acquiring the separate encoded vectors \(\vec {V_H}\) and \(\vec {V_S}\) from two modalities, we use triplet ranking loss to train the entire model.

5 Experiment

5.1 Datasets

We use the Flickr30K [20] and MS-COCO [18] datasets to test our enhanced separate encoding framework.

Flickr30K contains 31,783 images with 5 captions each, and MS-COCO 2014 contains 123,287 images with 5 captions per image. We follow [9] for the train-valid-test split: a 1k test set for Flickr30K, and 1k and 5k test sets for MS-COCO. This results in 113,287 training, 5,000 validation, and 5,000 testing images for MS-COCO; Flickr30K is split into 29,783 training, 1,000 validation, and 1,000 testing images. For the MS-COCO 1k setting, our results are averaged over 5 folds of 1k test images; the 5k setting uses the full 5,000 test images. We use recall at K (R@K), defined as the fraction of queries for which the correct item is retrieved among the K points closest to the query.
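For clarity, R@K can be computed from a query-candidate similarity matrix as follows (a simplified sketch that assumes one ground-truth candidate per query; with 5 captions per image the ground-truth mapping has to be adjusted accordingly):

```python
import numpy as np

def recall_at_k(scores, k):
    """scores[i, j]: similarity between query i and candidate j; candidate i
    is assumed to be the correct match for query i."""
    top_k = np.argsort(-scores, axis=1)[:, :k]   # indices of the k closest candidates
    hits = [i in top_k[i] for i in range(scores.shape[0])]
    return float(np.mean(hits))
```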

5.2 Implementation Details

For both the Flickr30K and MS-COCO datasets, we use ResNet-152 and Faster R-CNN with a ResNet-101 backbone as image encoding models. The Faster R-CNN features are extracted following [30], with 100 regions and hidden size 2048. The dimensions of the 4 feature maps in ResNet-152 are [56, 56, 256], [28, 28, 512], [14, 14, 1024] and [7, 7, 2048]. We apply average pooling with pooling windows [8, 8], [4, 4], [2, 2] and [1, 1]. After merging and linear transformation, the output features of the 4 feature maps are [49, 256], [49, 256], [49, 512] and [49, 1024], and the region feature is [100, 1024]. We use BERT-base as the text encoding model, which contains 12 layers with hidden dimension 768, and set the maximum sequence length to 32. During feature alignment, we concatenate every 3 layers of BERT outputs and use a linear transformation to obtain 4 layers of features with dimensions [32, 256], [32, 256], [32, 512] and [32, 1024]. We duplicate the last layer to align with the region features from Faster R-CNN. The transformer block is a 1-layer transformer with 8 heads and intermediate size 1024.

During training, we use NVIDIA 1080Ti GPUs to train the entire model, with the learning rate set to 2e-5 and batch size 128 for Flickr30K and 320 for MS-COCO. We also ensemble two single models to create an ensemble version of the enhanced separate encoding framework to boost performance.
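For convenience, the hyperparameters listed above can be gathered into a single configuration; the key names below are our own shorthand, not part of the paper's code:

```python
# Settings as reported in Sect. 5.2 (key names are illustrative).
CONFIG = {
    "image_encoders": {"cnn": "ResNet-152", "rcnn_backbone": "ResNet-101"},
    "num_regions": 100,
    "region_hidden_size": 2048,
    "text_encoder": "BERT-base",                 # 12 layers, hidden size 768
    "max_seq_length": 32,
    "bert_layer_group": 3,                       # concatenate every 3 layers
    "projection_block": {"layers": 1, "heads": 8, "intermediate_size": 1024},
    "learning_rate": 2e-5,
    "batch_size": {"flickr30k": 128, "mscoco": 320},
}
```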

5.3 Experiment Setup

We establish baselines testing both matching results and inference cost. We implement joint-encoding approaches based on two different joint-encoding structures. For the Unicoder-VL structure, we follow the implementation in [15]. For the LXMERT structure, the core idea is encoding features across modalities jointly only in the higher layers: we use the first 8 layers of the BERT-base structure for text encoding and region features from Faster R-CNN for image encoding, then concatenate the image and text features, feed them into the last 4 layers of the BERT-base structure, and use the special [CLS] token for similarity score learning.

The inference cost is tested on a single NVIDIA 1080Ti GPU. We set batch size 128 when evaluating our enhanced separate encoding framework. When evaluating joint-encoding methods on the 1k test set of Flickr30K, we use batch size 5,000, which is the number of captions, and iterate over each image to calculate the similarity scores of the matching pairs.

Table 1. Performances on Flickr30K dataset. Unicoder-VL\(^*\) is further pre-trained with large-scale image-text pairs.
Table 2. Results on MS-COCO dataset.

5.4 Experiment Result

As seen in Tables 1 and 2, our enhanced separate encoding framework outperforms previous separate encoding approaches by a large margin, and also outperforms joint-encoding methods trained without image-text pair pre-training.

The computational cost during inference, as seen in Table 1, is enormous for joint-encoding methods. Even when we use 8 GPUs with a very large batch size to run joint-encoding inference, the time cost remains unbearable. Meanwhile, without pre-training, the performance of joint encoding is not superior to separate encoding methods.

The joint-encoding model further pre-trained with large-scale image-text pairs achieves strong performance, but is much less competitive when trained only with the image-text pairs of the given task. This indicates that the joint-encoding method relies on large-scale image-text pairs to enhance the model. Therefore, we believe that separate encoding with our enhanced framework is both effective and efficient.

Table 3. Projection study on Flickr30K dataset; R/T-P is region/token-wise projection; L-P is layer-wise projection.

6 Ablation Studies

6.1 Effectiveness of Feature Projection

The motivation of our enhanced separate encoding framework is to project separately pre-trained features into a similar latent space. Therefore, we conduct ablation studies showing that the feature projection modules play vital roles in our framework.

We establish baselines on both the Flickr30K and MS-COCO datasets. We concatenate the pooled \(\widehat{H}\) and \(\widehat{S}\) without using the F(X) region/token-wise projection or the layer-wise linear transformation projection. That is, we run baselines without feature projection, simply using the concatenated output features from the feature alignment process.

As seen in Table 3, the F(X) projection (R/T-P) and the linear transformation (L-P) are important for projecting features to be more similar, indicating that although the pre-trained features possess abundant information, they are inherently different across modalities. Therefore, although both projection methods are simple to construct, the idea of allowing separately pre-trained features to be aligned and further encoded is extremely effective.

7 Conclusions and Future Work

In this paper, we focus on the image-text matching task. Firstly, we analyze the traditional separate encoding methods as well as recent joint-encoding methods based on pre-training with large-scale image-text pairs. We discuss the problems that constrain these methods, then we propose a framework to leverage the advantages and disadvantages of these methods, achieving competitive results while maintaining a minimal inference cost.

In the future, following our analysis, we hope to apply large-scale image-text pairs to train the projection modules, taking the performance of the image-text matching task to a higher level, and to explore different languages.