1 Introduction

Over the past few years, multi-modal data, i.e. media data of various types that describe the same topic, has been growing rapidly with the development of social media websites (e.g., Twitter, Facebook, YouTube, Instagram), where users are allowed to retrieve information from these heterogeneous data using their preferred queries [22, 26, 28, 29, 49]. In order to maximally benefit from the richness of multimedia data and make optimal use of rapidly developing multimedia technology, automated mechanisms are needed to establish a similarity link between multimedia items that are related to each other, independent of the modalities, such as text, visual or audio, present in the items. To answer this challenge, research towards reliable cross-modal retrieval solutions that can operate across modality boundaries has recently gained significant attention.

The primary issue in cross-modal retrieval lies in the fact that features of different modalities have very different statistical characteristics, which makes it impossible to compare features of different modalities directly. Current research has focused on two aspects: correlation maximization [9, 22, 48, 49] and feature selection [2, 35, 40, 42, 43, 45, 46]. Subspace learning and dictionary learning are popular approaches. With subspace learning, a common subspace and the corresponding transforms are learned so that the transformed features are maximally correlated [22]. With dictionary learning, multiple dictionaries are jointly learned by correlating the sparse coefficients obtained on the training data [49]. Mixed-norm regularization has been added to improve feature selection [9, 35, 42, 43]. These methods achieve considerable performance; however, most of them are supervised and require labeled data, which can be hard to obtain in the real world.

In the deep learning realm, several unsupervised models based on canonical correlation analysis (CCA) [10] or autoencoders have been proposed to learn modality-invariant features [1, 5, 31, 44] without label supervision. These models generate representations in an embedding space shared by different modalities, and optimization is performed to maximize the correlation of the shared representations. The core of these approaches is to close the gap between different modalities by finding transforms under which the transformed features are maximally correlated. These transforms are expected to be modality invariant, so that the transformed features have similar statistical characteristics and cannot be distinguished from each other. However, existing approaches fail to explicitly address the statistical aspect of the transformed features, especially the intra-modal discriminativeness and the inter-modal consistency; hence these features can still be statistically different.

In order to address the statistical aspect of the feature transforms, we propose a novel DNN-based approach, termed Deep Adversarial Metric Learning (DAML), for the cross-modal retrieval task. DAML is inspired by recent advances in domain adaptation [6], where adversarial learning is utilized to avoid domain shift and to facilitate the generation of domain-invariant features. Moreover, to enforce statistical similarity between the transformed features of different modalities, the similarity between their distributions must be measured in a suitable way. In our proposed DAML, we therefore also employ the coupled metric learning technique [15] to learn an appropriate similarity measure that preserves the statistical similarity between transformed features of different modalities.

Figure 1 illustrates the general framework of DAML. Similar to [1, 5, 38, 44], we adopt two feed-forward networks as the image and text feature mappings in DAML, which nonlinearly transform the respective features into a common subspace in which the intra-class variation is minimized and the inter-class variation is maximized. In addition to requiring the transformed features to be maximally correlated, we also require them to be statistically indistinguishable in the subspace, i.e. the difference between each pair of samples of the same class captured from the two modalities is minimized. To achieve this, we introduce a modality classifier that identifies the source modality of a transformed feature. These components are trained under the adversarial learning framework. This is quite different from previous methods, where no requirement is placed on the statistical characteristics of the transformed features. By doing so, we explicitly require that mapped features of different modalities have similar statistical distributions. The adversary introduced by the modality classifier can be seen as a regularization term in the subspace learning procedure of the proposed method. Therefore, it ensures that the transformed features of different modalities can be directly compared in the subspace while their intrinsic characteristics are well preserved.

Figure 1

The general architecture of the proposed DAML consists of four major components: image feature projection, text feature projection, a modality classifier and a cross-modal similarity metric, which together form a standard feed-forward architecture. The image and text features are mapped to the common subspace in a successive two-fold procedure. One branch, termed the cross-modal similarity metric, performs feature discrimination and feature correlation jointly in the subspace, while the other branch, termed the modality classifier, accounts for the diversity between the representations of different modalities in the subspace. The two branches are jointly optimized during training in an adversarial learning manner

This paper is an extension and improvement of our previous method, termed UCAL, presented in [11]. The main differences between the proposed DAML and the previous UCAL can be summarized in the following three aspects: 1) our proposed DAML is a supervised cross-modal learning approach that incorporates the discriminativeness of class labels into the learned transformed features, while UCAL is an unsupervised method that only maximizes the correlation of inter-modal data; 2) our proposed DAML also employs the coupled metric learning technique to learn an appropriate distance metric that preserves the statistical distribution of multimodal data; 3) the parameter learning algorithm that learns the optimal neural network weights developed for the proposed DAML is also different from that in UCAL, since the weights play the roles of both the transformations and the distance metric in DAML. Comprehensive evaluation on three benchmark datasets illustrates that our proposed DAML significantly outperforms the previous UCAL as well as several other state-of-the-art cross-modal retrieval approaches.

The rest of the paper is organized as follows. In Section 2, we discuss previous work on cross-modal retrieval and adversarial learning. We describe the details of the proposed method in Section 3 and present the experimental results in Section 4. Finally, we conclude the paper in Section 5.

2 Related work

2.1 Cross-modal retrieval

Among traditional cross-modal retrieval methods, one popular group is subspace learning based methods, such as Canonical Correlation Analysis (CCA) [10] and its extensions [7, 22, 23, 48]. Assuming that the representations in different feature spaces are correlated through certain common information, Rasiwasia et al. [22] proposed to learn the subspace by maximizing the correlation between the image and text feature spaces through CCA. Sharma et al. [23] proposed multiview extensions to CCA, LDA and Marginal Fisher Analysis (MFA), i.e. Generalized Multiview Analysis (GMA), Generalized Multiview LDA (GMLDA) and Generalized Multiview MFA (GMMFA), and showed that they perform well on cross-modal retrieval problems.

Dictionary learning has been introduced to address the fact that the subspace assumption can be restrictive for some real-world multimodal data. Zhuang et al. [49] extended the unimodal dictionary learning framework to multimodal data: instead of independently learning the dictionary and corresponding coefficients for a single modality, the coefficients for different modalities are correlated using a linear mapping, and an l1,2 norm is used to discover inter-modality structures. As pointed out by Gu et al. [9], both subspace and dictionary learning have problems with feature selection: either all features are linearly combined or only some components are selected from a feature vector. To tackle this, they formulated subspace learning using graph embedding and applied l2,1 regularization to jointly perform feature selection and subspace learning. Tian et al. [32] explored the intrinsic manifold structures in different modalities and developed a so-called correlation component manifold space learning method to capture the correlations residing in the heterogeneous data. Wang et al. [35] proposed to explicitly learn two projections that map the two modalities into a coupled common subspace and adopted an l2,1 norm on the learned projections to perform feature selection. Xu et al. [42, 43] further introduced dictionary learning into the coupled feature mapping framework, forming a two-step framework: two dictionaries are learned jointly in a way similar to [49], and the learned sparse representations are then mapped into a common subspace.

Meanwhile, neural networks have also been applied to cross-modal retrieval. Srivastava et al. [31] applied autoencoders and the Restricted Boltzmann Machine (RBM) to multimodal data, following a similar pattern of adding a shared representation layer to correlate the modalities. Another autoencoder based model is the Correspondence Autoencoder (Corr-AE) [5]. Instead of reconstructing via shared representations, Corr-AE correlates the representations learned by each autoencoder through a predefined similarity measure. The model is trained to minimize the reconstruction error for each modality and the pairwise discrepancy between the learned representations. Wang et al. [37] further adopted stacked autoencoders to form deeper nonlinear embeddings for different modalities, showing the capability of learning more effective mapping functions and shared representations. Andrew et al. [1] proposed a direct extension to CCA, namely DCCA, which uses two feed-forward networks to transform the features of each modality and trains the networks to maximize the correlation between the transformed features over all the data. Yan et al. [44] further proposed an end-to-end learning framework based on DCCA. Although these methods try to maximally correlate different modalities and to better choose features, none of them explicitly addresses the statistical aspect of the representations learned from different modalities. The transformed features are not guaranteed to possess similar statistical properties, which can leave them statistically separable. In this paper, we explicitly address this issue through adversarial learning.

Moreover, several coupled metric learning algorithms have been proposed for cross-modal matching, such as Cross-Modal Metric Learning (CMML) [17], Cross-Modal Similarity Learning (CMSL) [12], Coupled Marginal Fisher Analysis (CMFA) [24] and Online Asymmetric Similarity Learning (OASL) [39]. These methods only learn a pair of linear transformations to map cross-modal samples into a new common feature space, which is not effective enough to discover the nonlinear relationships among samples. Later, Liong et al. [15] proposed Deep Coupled Metric Learning (DCML), a metric learning approach that learns two sets of nonlinear transformations to map data samples into a common space while considering the variation of different classes. Different from DCML, our proposed DAML is based on adversarial learning and makes full use of category information to preserve the inter-modal and intra-modal structures simultaneously, thus ensuring that the learned subspace representations are both discriminative within each modality and modality-invariant.

Lastly, it is worth mentioning that a number of hashing based approaches, such as [27, 40, 41, 50], have been proposed for the cross-modal retrieval problem; we refer the reader to the latest literature review [33] for more related work. These cross-modal hashing methods find linear projections that embed the heterogeneous data into a common Hamming space, where the multi-modal features are represented by low-dimensional binary codes. Different from the hashing based methods, we focus on the traditional cross-modal retrieval task and aim to learn compact real-valued subspace representations rather than binary codes.

2.2 Adversarial learning

Adversarial learning was recently proposed by Goodfellow et al. [8] in GANs for image generation. The framework consists of two major components, namely the generator and the discriminator, which have opposite training goals: the generator is trained to generate samples that the discriminator cannot distinguish from the source data, while the discriminator is trained to correctly identify the samples produced by the generator. Eventually, the generator learns to duplicate the source distribution. Besides its extensive application in image generation [8, 20], researchers have also used adversarial learning as a regularizer [6]. Makhzani et al. [16] introduced adversarial learning into the autoencoder by regularizing its intermediate representation with a prior distribution through an adversarial loss: a classifier is introduced to identify whether a sample is drawn directly from the prior distribution, and the encoder is trained to fool the classifier so that the learned representations have a similar distribution to the prior. Larsen et al. [14] combined an adversarial network with the Variational Autoencoder (VAE) [13]; from the perspective of the VAE, the adversarial part provides an additional adversarial loss, yielding a regularized VAE whose effectiveness was demonstrated on image reconstruction and manipulation. A closely related work is that of Ganin et al. [6], who applied adversarial learning to domain adaptation by regularizing the feature extractor with an adversarial network to generate domain-invariant features, achieving impressive performance. Yet, no attempt has been made to apply adversarial learning to cross-modal retrieval.

Inspired by these works, we introduce adversarial learning as a regularization into cross-modal retrieval for image and text. Similar to the neural network based methods, we use neural networks for the feature transforms. However, we not only maximize the correlation between the transformed features but also regularize their distributions through the introduction of a modality classifier, which predicts the source modality of a transformed feature and thus brings in the adversary.

3 Proposed method

3.1 Problem formulation

Let \(\mathcal {D} = \{I_{1}, ..., I_{n}\}\) be a collection of n instances, with each instance Ii = (vi, ti) consisting of a dV-dimensional visual feature vi and a dT-dimensional text feature ti. We also define the feature matrices of the two modalities as V = {v1, ..., vn} and T = {t1, ..., tn}. In practice, the visual features and the text features are represented in different high-dimensional spaces with diverse statistical properties; therefore they cannot be directly compared against each other. Suppose we have two mappings fV(⋅;𝜃V) and fT(⋅;𝜃T) that respectively transform the visual feature vi and the text feature ti into d-dimensional vectors sV and sT of the same dimension.

Although the transformed features have the same dimensionality, they are not guaranteed to be directly comparable, since their statistical properties are still unknown. These transformed features can still follow unknown yet complex distributions, which prohibits effective cross-modal retrieval. Yet, existing methods, whether based on subspace learning or on deep neural networks, focus on maximizing the correlation in the transformed space or on choosing better features. No explicit requirements are imposed on the statistical aspect.

To make the features directly comparable, we have two objectives: 1) to exploit more discriminative information from the training samples; and 2) to reduce the modality gap between paired data from different modalities. We use feed-forward networks to learn a nonlinear transformation for each modality under the adversarial learning framework, which allows us to place an additional restriction on the statistical properties of the transformed features.

3.2 Deep adversarial metric learning

As shown in Figure 1, our proposed DAML first conducts image and text feature projection to obtain the transformed representations sV and sT, while the constraints of the intra-modal and inter-modal similarity metric and the modality classifier restrain the learned subspace representations to be discriminative and modality-invariant. In the second stage, we construct a multi-task learning architecture to learn discriminative and modality-invariant subspace representations jointly. Specifically, in the following subsections, we decompose the subspace learning procedure into three loss terms: 1) an adversarial loss, which minimizes the "modality gap" between the two unknown distributions of representations from different modalities to promote modality invariance; 2) a feature discrimination loss, which models the intra-modal similarity using category information and ensures that the learned representations are discriminative; 3) a feature correlation loss, which minimizes the distances among intra-class cross-modality samples and maximizes the distances among inter-class cross-modality samples.

3.2.1 Adversarial loss

To enforce the statistical requirement and close the "heterogeneity gap" discussed above, a modality classifier D with parameters 𝜃A is introduced, which acts as the "discriminator" in a GAN. Mapped features from the image modality are assigned the label 01, while mapped features from the text modality are assigned the label 10. The goal of the modality classifier is to identify the source modality of an unknown mapped feature as precisely as possible. We implement the classifier as a 3-layer feed-forward neural network with parameters 𝜃A (see Section 3.3 for implementation details). The adversarial loss Ladv can now formally be defined as:

$$ \mathit{L}_{adv}(\theta_{V}, \theta_{T}, \theta_{A}) = -\frac{1}{\mathit{n}}\sum\limits_{i = 1}^{n}\left(\mathbf{m}_{i}\cdot\left(\log D(f_{V}(\mathbf{v}_{i}); \theta_{A}) + \log\left(\mathbf{1} - D(f_{T}(\mathbf{t}_{i}); \theta_{A})\right)\right)\right). $$
(1)

Essentially, Ladv denotes the cross-entropy loss of modality classification over all instances oi, i = 1, ..., n, used per iteration of training. Here mi is the ground-truth modality label of each instance, expressed as a one-hot vector, while D(⋅;𝜃A) is the predicted modality probability for each item (image or text) of the instance oi.
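As a concrete illustration (not part of the original implementation), the following minimal TensorFlow 2 sketch computes the adversarial loss of Eq. (1) for a batch of mapped features. The function name and the use of sparse integer labels instead of the one-hot vectors mi are illustrative choices; the same convention (image → "01", text → "10") is kept.

```python
import tensorflow as tf

def adversarial_loss(s_v, s_t, modality_classifier):
    """Cross-entropy of the modality classifier D(.; theta_A), cf. Eq. (1).

    s_v, s_t: batches of transformed image / text features, shape (batch, d).
    Image features carry modality class 1 (one-hot "01"), text features
    class 0 (one-hot "10"); the sparse form below is equivalent.
    """
    logits_v = modality_classifier(s_v)   # (batch, 2) modality logits
    logits_t = modality_classifier(s_t)
    labels_v = tf.ones_like(logits_v[:, 0], dtype=tf.int32)
    labels_t = tf.zeros_like(logits_t[:, 0], dtype=tf.int32)
    loss_v = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels_v, logits=logits_v)
    loss_t = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels_t, logits=logits_t)
    return tf.reduce_mean(loss_v) + tf.reduce_mean(loss_t)
```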

3.2.2 Feature discrimination loss

In order to ensure that the intra-modal discrimination in the data is preserved after feature projection, a classifier is deployed to predict the semantic labels of the items projected into the common subspace. For this purpose, a feed-forward network with a softmax output is added on top of the subspace embedding networks. This classifier takes the projected features of the coupled image and text instances oi as training data and outputs a probability distribution over the semantic categories for each item.

Let li be the ground-truth label of each representation, expressed as a one-hot vector, and let \(\hat {\mathit {p}}_{i}\) denote the probability distribution predicted by the label classifier. The intra-modal objective function can then be written as follows, regardless of which modality the transformed feature representations come from.

$$\begin{array}{@{}rcl@{}} \mathit{L}_{dis}(\theta_{V}, \theta_{T}, \theta_{D}) = -\frac{1}{\mathit{n}}\sum\limits_{i = 1}^{n}\left(\mathit{l}_{i}\cdot\left(\log \hat{\mathit{p}}_{i}(f_{V}(\mathit{v}_{i})) + \log \hat{\mathit{p}}_{i}(f_{T}(\mathit{t}_{i}))\right)\right). \end{array} $$
(2)
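A matching sketch of the feature discrimination loss in Eq. (2) is given below. It assumes a single label classifier shared by both modalities (consistent with the use of the same \(\hat{p}_{i}\) for images and texts) and one-hot ground-truth labels; the names are illustrative.

```python
import tensorflow as tf

def discrimination_loss(s_v, s_t, labels_onehot, label_classifier):
    """Softmax cross-entropy of the shared label classifier on the projected
    features of both modalities, cf. Eq. (2).

    s_v, s_t: transformed image / text features, shape (batch, d).
    labels_onehot: ground-truth semantic labels l_i, shape (batch, C).
    label_classifier: feed-forward network returning C-way logits.
    """
    loss_v = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels_onehot, logits=label_classifier(s_v))
    loss_t = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels_onehot, logits=label_classifier(s_t))
    return tf.reduce_mean(loss_v + loss_t)
```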

3.2.3 Feature correlation loss

For the inter-modal structure, we utilize a correlation loss motivated by coupled metric learning. The loss aims to minimize the intra-class variation and maximize the inter-class variation of the feature representations of different modalities. Specifically, for each pair of training samples vi and tj from the two modalities, we compute their squared distance as \(d(v_{i}, t_{j}) = \|f_{V}(v_{i}) - f_{T}(t_{j})\|_{2}^{2}\). We expect d(vi, tj) to be as small as possible if vi and tj are of the same class and as large as possible otherwise. This can be formulated as the following constraints:

$$\begin{array}{@{}rcl@{}} d(v_{i}, t_{j}) &\leq& \xi_{1}, ~~~~ if~~l_{v_{i}, t_{j}} = 1, \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} d(v_{i}, t_{j}) &\geq& \xi_{2}, ~~~~ if~~l_{v_{i}, t_{j}} = -1, \end{array} $$
(4)

where \(l_{v_{i}, t_{j}} = 1\) indicates that vi and tj belong to the same class, and \(l_{v_{i}, t_{j}} = -1\) otherwise; ξ1 and ξ2 are the small and large thresholds, respectively. We follow [15] and integrate the constraints into a large margin optimization objective:

$$\begin{array}{@{}rcl@{}} \mathit{L}_{cor}(\theta_{V}, \theta_{T}, \theta_{C}) &=& \sum\limits_{i, j} s\left(1 - l_{v_{i}, t_{j}} \left(\xi - d(v_{i}, t_{j})\right)\right) + \sum\limits_{i} \left\lVert f_{V}(\mathbf{v}_{i}) - f_{T}(\mathbf{t}_{i})\right\rVert_{2}, \end{array} $$
(5)

where s(⋅) is a generalized logistic loss function, ξ1 = ξ − 1 and ξ2 = ξ + 1. In (5), the second term is similar to the correlation loss term in [11], which minimizes the difference between each pair of data of the same class captured from different modalities.
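The correlation loss of Eq. (5) can be sketched as follows. The generalized logistic loss is assumed to take the softplus-like form s(z) = (1/γ) log(1 + exp(γz)) as in [15]; the parameter names xi and gamma, and the batch-wise pairwise formulation, are illustrative rather than the authors' exact implementation.

```python
import tensorflow as tf

def correlation_loss(s_v, s_t, labels_v, labels_t, xi=1.0, gamma=10.0):
    """Coupled-metric correlation loss, cf. Eq. (5).

    s_v, s_t: transformed features of a batch of n coupled (image, text)
    training pairs, each of shape (n, d); labels_v, labels_t: integer class
    labels. xi is the margin parameter (xi_1 = xi - 1, xi_2 = xi + 1).
    """
    # Pairwise squared distances d(v_i, t_j) between the two modalities.
    sq_v = tf.reduce_sum(tf.square(s_v), axis=1, keepdims=True)       # (n, 1)
    sq_t = tf.reduce_sum(tf.square(s_t), axis=1, keepdims=True)       # (n, 1)
    d = sq_v - 2.0 * tf.matmul(s_v, s_t, transpose_b=True) + tf.transpose(sq_t)

    # l_{v_i, t_j} = +1 for same-class pairs, -1 otherwise.
    same = tf.cast(tf.equal(labels_v[:, None], labels_t[None, :]), tf.float32)
    l = 2.0 * same - 1.0

    # Generalized logistic loss s(z) = softplus(gamma * z) / gamma.
    hinge = tf.nn.softplus(gamma * (1.0 - l * (xi - d))) / gamma
    pair_term = tf.reduce_sum(tf.norm(s_v - s_t, axis=1))             # sum_i ||f_V(v_i) - f_T(t_i)||_2
    return tf.reduce_sum(hinge) + pair_term
```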

3.3 Optimization

As demonstrated above, we combine the three loss terms in (1), (2) and (5), which can be optimized through SGD. The optimization goals of the two objective functions below are opposite, so training can be formally described as a min-max game, as in [8]:

$$\begin{array}{@{}rcl@{}} (\hat{\theta}_{V}, \hat{\theta}_{T}, \hat{\theta}_{D}, \hat{\theta}_{C}) &=& \underset{\theta_{V}, \theta_{T}, \theta_{D}, \theta_{C}}{\arg\min}~\alpha\mathit{L}_{dis}(\theta_{V}, \theta_{T}, \theta_{D}) \\ &&+\, \beta\mathit{L}_{cor}(\theta_{V}, \theta_{T}, \theta_{C}) - \sigma\mathit{L}_{adv}(\theta_{V}, \theta_{T}, \hat{\theta}_{A}), \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} \hat{\theta}_{A} &=& \underset{\theta_{A}}{\arg\max}~\left(\alpha\mathit{L}_{dis}(\hat{\theta}_{V}, \hat{\theta}_{T}, \hat{\theta}_{D}) + \beta\mathit{L}_{cor}(\hat{\theta}_{V}, \hat{\theta}_{T}, \hat{\theta}_{C}) - \sigma\mathit{L}_{adv}(\hat{\theta}_{V}, \hat{\theta}_{T}, \theta_{A})\right). \end{array} $$
(7)

Here the feature discrimination loss Ldis corresponds to a classifier that predicts the semantic labels of the items projected into the common subspace, thus incorporating label discrimination into the common subspace; the feature correlation loss Lcor aims to minimize the intra-class variation and maximize the inter-class variation of the feature representations of different modalities; and the adversarial loss Ladv is the cross-entropy loss of the modality classifier, which identifies the source modality (image or text). Parameters α and β are the weight coefficients of the feature discrimination loss Ldis and the feature correlation loss Lcor, respectively, and σ is the ratio between these two loss terms and the adversarial loss Ladv, which controls the balance between the feature projection branches and the adversary. One way to train such an architecture was proposed in [6], which adds the adversarial loss Ladv to the embedding loss and utilizes a Gradient Reversal Layer (GRL) (as shown in Figure 1) to implement the min-max optimization. If a Gradient Reversal Layer is added before the first layer of the modality classifier, the min-max optimization can be performed simultaneously with ordinary back-propagation, as summarized in Algorithm 1.
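As an illustration of how such a GRL can be realized, here is a minimal TensorFlow 2 sketch in the spirit of [6]; it is an assumption about the implementation, not the authors' exact code. The scaling factor lam plays the role of the weight σ on the adversarial loss.

```python
import tensorflow as tf

def gradient_reversal(x, lam=1.0):
    """Identity in the forward pass; scales the gradient by -lam in the
    backward pass. Placed before the modality classifier, it lets a single
    SGD pass minimize L_adv for the classifier while the feature mappings
    receive reversed (i.e. maximizing) gradients for L_adv."""
    @tf.custom_gradient
    def _reverse(t):
        def grad(dy):
            return -lam * dy
        return tf.identity(t), grad
    return _reverse(x)
```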

Algorithm 1
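For completeness, the sketch below shows one plausible joint training step combining the three losses with the gradient reversal layer above. It is a hedged reconstruction of what Algorithm 1 does, not the authors' exact procedure; it reuses the loss functions sketched in Section 3.2, and all model objects and hyper-parameter values are placeholders.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)   # illustrative choice

def train_step(v_batch, t_batch, labels_onehot, labels_int,
               f_v, f_t, label_clf, modality_clf,
               alpha=0.01, beta=0.1, sigma=1.0):
    """One joint update of the feature mappings, label classifier and
    modality classifier; the GRL lets the min-max game in (6)-(7) be
    carried out by a single gradient step."""
    with tf.GradientTape() as tape:
        s_v, s_t = f_v(v_batch), f_t(t_batch)              # feature projections
        l_dis = discrimination_loss(s_v, s_t, labels_onehot, label_clf)
        l_cor = correlation_loss(s_v, s_t, labels_int, labels_int)
        # The GRL reverses the adversarial gradient flowing back into
        # f_v / f_t, while the modality classifier still minimizes L_adv.
        l_adv = adversarial_loss(gradient_reversal(s_v, sigma),
                                 gradient_reversal(s_t, sigma),
                                 modality_clf)
        total = alpha * l_dis + beta * l_cor + l_adv
    variables = (f_v.trainable_variables + f_t.trainable_variables +
                 label_clf.trainable_variables + modality_clf.trainable_variables)
    grads = tape.gradient(total, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return total
```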

4 Experiments

4.1 Experimental setup

4.1.1 Datasets and features

We conduct experiments on three widely-used cross-modal datasets: Wikipedia [4], NUS-WIDE-10k [3] and Pascal Sentence [21]. In these datasets, each image-text pair is associated with a single class label. Here we briefly introduce the three datasets adopted in the experiments.

  • Wikipedia is the most widely used dataset for the cross-modal retrieval task. It consists of 2,866 image/text pairs from 10 categories and is randomly divided as follows: 2,173 pairs for training, 231 pairs for validation and 462 pairs for testing.

  • Pascal Sentence is generated from the 2008 PASCAL development kit. This dataset contains 1,000 images evenly divided into 20 categories, and each image has 5 corresponding sentences that make up one document. For each category, 40 documents are selected for training, 5 for testing and 5 for validation.

  • NUS-WIDE-10k is generated from the NUS-WIDE dataset, which consists of about 270,000 images whose tags are categorized into 81 categories. The NUS-WIDE-10k dataset contains 10,000 image/text pairs selected evenly from the 10 largest categories of NUS-WIDE (animal, cloud, flower, food, grass, person, sky, toy, water and window). The dataset is split into three subsets: a training set with 8,000 pairs, a test set with 1,000 pairs and a validation set with 1,000 pairs.

For a fair and objective comparison, we exactly follow the dataset partition and feature extraction strategies of [19, 36] in the experiments. The general statistics of the three datasets are summarized in Table 1.

Table 1 General statistics of the three datasets used in our experiments, where "*/*" in the "Instance" column stands for the number of training/test image-text pairs

It is worth mentioning that for all datasets, we mainly use image features extracted from a deep Convolutional Neural Network (CNN) to represent an image, as deep visual features have shown strong representational ability and are widely used for image representation. Specifically, the adopted deep feature is the 4,096-dimensional vector extracted from the fc7 layer of VGGNet [25], for all compared methods on all datasets. Regarding the text feature, we use the traditional bag-of-words (BoW) vector with the TF-IDF weighting scheme to represent each text instance; the dimension of the BoW vector for each dataset is also listed in Table 1. In addition, to make a fair comparison with several earlier cross-modal retrieval approaches on the Wikipedia dataset, we also adopt the publicly available 128-dimensional SIFT feature for images and the 10-dimensional LDA feature for texts, respectively.

4.1.2 Implementation details

On all datasets, we set the dimension of the transformed features to 200 and build our DAML model from fully connected layers for both the image and text modalities. We use a three-layer network 4096 → 2048 → 1024 → 200 for the image feature transform and a single-layer network 300 → 200 for the text feature transform. For the modality classifier, we use a three-layer network 200 → 100 → 50 → 2 trained with a binary cross-entropy loss. While training our model we noticed that an overly strong modality classifier can, on the contrary, worsen the performance. To alleviate this, we update the modality classifier less often than the feature transforms.
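The layer sizes above can be written down, for instance, as the following Keras models. Activation functions are not specified in the text, so the ReLU choices here are assumptions; this is a sketch, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

image_mapping = Sequential([                 # f_V: 4096 -> 2048 -> 1024 -> 200
    layers.Dense(2048, activation='relu', input_shape=(4096,)),
    layers.Dense(1024, activation='relu'),
    layers.Dense(200),
])
text_mapping = Sequential([                  # f_T: 300 -> 200 (single layer)
    layers.Dense(200, input_shape=(300,)),
])
modality_classifier = Sequential([           # D: 200 -> 100 -> 50 -> 2 logits
    layers.Dense(100, activation='relu', input_shape=(200,)),
    layers.Dense(50, activation='relu'),
    layers.Dense(2),
])
```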

During the training procedure, the batch size is set to 64 for our DAML on all datasets. We tune the model parameters α, β and σ using grid search over the range [0.001, 100] with a multiplicative step of 10. In our experiments, the three parameters are empirically set to 0.01, 0.1 and 1.0, respectively, which gives stable performance across the different datasets. In addition, to make a fair comparison with the state-of-the-art methods, we not only refer to the results published in the corresponding papers but also re-evaluate some of these methods using implementations provided by the respective authors, in order to obtain an objective assessment.
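The grid search described above amounts to trying every combination of α, β and σ over a multiplicative grid, roughly as sketched below; evaluate_on_validation is a hypothetical helper (not from the paper) that trains the model with the given weights and returns validation mAP.

```python
import itertools

grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]   # factor-of-10 steps over [0.001, 100]

best = None
for alpha, beta, sigma in itertools.product(grid, repeat=3):
    val_map = evaluate_on_validation(alpha, beta, sigma)   # hypothetical helper
    if best is None or val_map > best[0]:
        best = (val_map, alpha, beta, sigma)
# The reported setting is alpha = 0.01, beta = 0.1, sigma = 1.0.
```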

4.1.3 Evaluation metric

We apply the proposed DAML to two cross-modal retrieval tasks, i.e. image retrieval by text (Img2Txt) and text retrieval by image (Txt2Img). To evaluate the performance, we use the standard measures of mean average precision (mAP) and the precision-scope curve, which have been widely adopted in the literature [1, 5, 22, 35]. To calculate mAP, we first evaluate the average precision (AP) of the retrieval result for each query and then average the AP values over the query set. We implement the proposed model using TensorFlow and run the experiments on a desktop machine with a 4-core CPU at 4 GHz, 32 GB memory and a GeForce Titan X GPU.
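As a reference for how mAP is computed here, the NumPy sketch below averages per-query AP over the query set. Note that some works compute AP only over the top-R retrieved items (the "scope"); this sketch uses the full ranking, which is one common convention and may differ from the exact protocol used in the paper.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP of one query; ranked_relevance is a binary vector in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(query_labels, gallery_labels, distances):
    """mAP over a query set; distances has shape (num_queries, num_gallery),
    query_labels / gallery_labels are NumPy arrays of class labels."""
    aps = []
    for q, row in zip(query_labels, distances):
        order = np.argsort(row)                    # nearest gallery items first
        aps.append(average_precision(gallery_labels[order] == q))
    return float(np.mean(aps))
```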

4.2 Comparison with existing methods

We first compare our DAML approach with 10 state-of-the-art methods on the Wikipedia dataset, which has been widely adopted as a benchmark in the literature. The compared methods are: 1) CCA [10], CCA-3V [7], LCFS [35], JRL [47] and JFSSL [34], which are traditional cross-modal retrieval methods; and 2) Multimodal-DBN [30], Bimodal-AE [18], Corr-AE [5] and CMDN [19], which are DNN based.

Table 2 shows the mAP of our DAML and the compared methods on the Wikipedia dataset using shallow and deep features, respectively. From Table 2, we can draw the following observations. 1) Our DAML significantly outperforms both the traditional and the DNN based cross-modal retrieval methods. In particular, compared to CMDN, which achieves the best retrieval accuracy among all compared methods, our DAML gains a further improvement of 4.66% and 5.05% on average using shallow and deep features, respectively. It is worth mentioning that CMDN also models inter-modal invariance and intra-modal discrimination jointly in a multi-task learning framework, while the adversarial learning enables our DAML to better balance inter-modal invariance and intra-modal discrimination and thus obtain more effective cross-modal representations. 2) Our DAML is superior to CCA, Bimodal-AE, Corr-AE, CMDL and CMDN, which use a correlation loss based on coupled samples to model the inter-modal similarity. The reason is that the proposed double triplet constraints effectively leverage the cues of both similar and dissimilar pairs according to their discriminant labels, which helps DAML model the inter-modal similarity. This consistently indicates that our DAML is more effective at exploring the inter-modal similarity than DCML. 3) Our DAML also outperforms LCFS, CDLFM, LGCFL, JRL and JFSSL, which also leverage class label information to model the intra-modal discrimination. Different from these methods, our DAML formulates the feature discrimination and correlation losses, which model the intra-modal discrimination and inter-modal invariance and jointly obtain better category separation across different modalities.

Table 2 Cross-modal retrieval comparison on Wikipedia dataset. Here “–” denotes that no experimental results with same settings are available

Figure 2 shows three examples of text queries and the top five images retrieved by the proposed DAML for the Text2Img task on the Wikipedia dataset. It can be observed that our method finds the closest matches in the image modality at the semantic level for the text queries, and the retrieved images all belong to the same labels as the text queries, e.g., "warfare" and "literature".

Figure 2

Typical examples of the Text2Img task obtained by our proposed DAML on the Wikipedia dataset with CNN features. In each example, the text query and the top five retrieved images are listed in the corresponding columns

Moreover, the retrieval results on the Pascal Sentence and NUS-WIDE-10k datasets are shown in Table 3. We can see that our DAML consistently achieves the best performance compared to its counterparts. Specifically, our DAML outperforms the best counterpart, CMDN, in terms of average mAP score by 0.001 and 0.017, respectively.

Table 3 Cross-modal retrieval comparison in terms of mAP on the Pascal Sentence and NUS-WIDE-10k datasets. Here "–" denotes that no experimental results with the same settings are available

4.3 Further analysis on DAML

4.3.1 Visualization of learned adversarial representation

We further investigate the effectiveness of the cross-modal representations learned by our DAML. In particular, for each of the image and text modalities, we randomly choose 1,000 transformed features from the test set to form a total of 2,000 features; the chosen features do not necessarily form image-text pairs. We then use t-SNE to visualize the distribution of these features.
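The visualization procedure can be reproduced roughly as follows; s_v and s_t stand for the transformed image and text features (as NumPy arrays) and are placeholders for whatever the trained mappings output, so this is a sketch rather than the exact script used for Figure 3.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# s_v, s_t: transformed image / text features of the evaluation set (NumPy arrays).
idx_v = np.random.choice(len(s_v), 1000, replace=False)
idx_t = np.random.choice(len(s_t), 1000, replace=False)
feats = np.vstack([s_v[idx_v], s_t[idx_t]])        # 2,000 features in total

emb = TSNE(n_components=2).fit_transform(feats)    # 2-D t-SNE embedding
plt.scatter(emb[:1000, 0], emb[:1000, 1], c='red', s=5, label='image')
plt.scatter(emb[1000:, 0], emb[1000:, 1], c='blue', s=5, label='text')
plt.legend()
plt.show()
```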

Figure 3 shows the t-SNE embedding of the data distribution on the Wikipedia dataset. Figure 3a shows the features learned without the adversarial loss and Figure 3b shows those learned with it. We can see that without the adversarial loss, the transformed features in Figure 3a remain scattered by modality, whereas in Figure 3b the transformed features tend to form a single cluster, showing that the adversarial loss indeed effectively closes the gap between different modalities. This indicates that adversarial learning, used as a regularization, works as expected to close the statistical gaps between modalities and is an effective tool for processing multimodal data.

Figure 3

t-SNE visualization for the chosen data in Wiki. Red represents visual features and blue represents text features

4.3.2 Balance of label predicting and structure preserving

Furthermore, the adversarial learning in our DAML is also beneficial for balancing the feature discrimination and feature correlation processes, which model intra-modal discrimination and inter-modal invariance, respectively. To investigate the contributions of these two processes, we develop two variations of DAML: DAML with the feature discrimination loss \(\mathcal {L}_{dis}\) only, and DAML with the feature correlation loss \(\mathcal {L}_{cor}\) only. The optimization procedure is similar to that of DAML. Table 4 shows the performance of DAML and its two variations on the Wikipedia and Pascal Sentence datasets. We see that both the intra-modal discrimination and inter-modal invariance terms contribute to the final retrieval accuracy, indicating that optimizing the \(\mathcal {L}_{dis}\) and \(\mathcal {L}_{cor}\) terms simultaneously performs better than optimizing only one of them. We also see that the intra-modal discrimination term contributes more to the overall performance than the inter-modal invariance term, since in practice the consistent relation across different modalities is difficult to capture.

Table 4 Performance of cross-modal retrieval with full DAML method, DAML method with \(\mathcal {L}_{dis}\) only, and DAML method with \(\mathcal {L}_{cor}\) only

5 Conclusion

In this paper, we proposed a novel approach, Deep Adversarial Metric Learning (DAML), for cross-modal retrieval, which aims to learn discriminative (intra-modal) and invariant (inter-modal) representations in a common subspace. We decompose the whole problem into three loss terms: 1) an adversarial loss, which minimizes the "modality gap" between the two unknown distributions of representations from different modalities to promote modality invariance; 2) a feature discrimination loss, which models the intra-modal similarity using category information and ensures that the learned representations are discriminative; 3) a feature correlation loss, which models the inter-modal similarity by minimizing the distances among intra-class cross-modality samples and maximizing the distances among inter-class cross-modality samples. The experimental results on three widely used multimodal datasets show that the proposed DAML outperforms several state-of-the-art methods on cross-modal retrieval tasks.