1 Introduction

In social networks, large amounts of multi-modal data, such as images, videos, text and audio, are mixed together and carry semantic correlations. Many single-modality approaches have been proposed to understand such data, including image classification or retrieval [1, 2], sentence semantic matching [3] and answer selection [4, 5]. There is also an immediate need to analyze these data across modalities, for example by retrieving similar instances from other modalities given a query from one modality [6], i.e., cross-modal retrieval. In recent years, cross-modal retrieval has attracted considerable attention [7,8,9,10]. Its main challenge is how to measure the similarity between data from different modalities, given the semantic gap, heterogeneity and diversity among them.

To mitigate this problem, an intuitive way is to learn a common latent representation space in which the similarity between data from different modalities can be measured directly. Classical methods learn linear projections by maximizing the correlations between pair-wise data from different modalities. For example, canonical correlation analysis (CCA) [11] was adopted to project text features and image features into a low-dimensional common subspace for cross-modal retrieval [12]. Furthermore, kernel-based extensions [13, 14] have been proposed to model more complex correlations among modalities. However, the main drawback of these methods is that they simply project the original representations into a common space and neglect the unique properties of each modality.

Thanks to the successful applications of deep neural networks (DNNs), a large number of DNN-based methods have been proposed for cross-modal retrieval. For instance, deep canonical correlation analysis (DCCA) [15] combined deep networks with CCA for cross-modal retrieval. The correspondence autoencoder (Corr-AE) [16] was proposed to model the correlations between modalities by incorporating representation learning and correlation learning. Cross-modal multiple deep networks (CMDN) [17] constructs a hierarchical network structure to model both inter-modality and intra-modality correlations.

Recently, some GAN-based cross-modal retrieval methods [6, 18] have been proposed. For example, adversarial cross-modal retrieval (ACMR) [6] learns discriminative and modality-invariant common representations by adversarial learning. Cross-modal generative adversarial networks for common representation learning (CM-GANs) [18] exploit the cross-modal correlation through a weight-sharing constraint. However, most of the existing DNN-based [16, 17, 19,20,21,22] and GAN-based works [6, 18] simply project the original representations into a common representation space and ignore the modality-specific information, even though data from different modalities share some common characteristics while also having private characteristics and inconsistent distributions. An intuitive remedy is to introduce a private subspace to capture modality-specific properties and a common subspace to capture the properties shared across modalities [23].

Fig. 1

The flowchart of the proposed representation separation adversarial networks (ReSAN) for cross-modal retrieval, which includes two sub-networks. The upper sub-network is the image representation learning network, while the lower one is the text representation learning network

In this paper, we propose a novel cross-modal retrieval method, called Representation Separation Adversarial Networks (ReSAN), which separates the original representations into common representations and private representations. Figure 1 shows the framework, which includes two sub-networks, i.e., an image sub-network and a text sub-network. First, to separate the original representations, we minimize the correlations between the common and private representations to encourage them to be independent. As shown in the red box with dotted lines in Fig. 1, we expect the common representations to contain only the components shared by different modalities, while the private representations contain only the unique components of each modality. Then, we reconstruct the original representations by exchanging the common representations to encourage information swap across modalities. Finally, we use semantic information to make the common representations discriminative and modality-invariant. The main contributions of this work can be summarized as follows:

  • We propose representation separation adversarial networks (ReSAN) for cross-modal retrieval, which explicitly split the original representations into common representations and private representations.

  • We propose a modality-invariant common representation learning strategy, which exchanges information among modalities during the learning process.

  • We evaluate the proposed ReSAN on cross-modal retrieval tasks, and the results demonstrate that it achieves better performance than most existing methods.

The rest of the paper is organized as follows. We first briefly review related work in Sect. 2 and present the proposed method in Sect. 3. Then, we derive the algorithm in Sect. 4 and conduct experiments in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Related works

Since generative adversarial networks (GANs) [24] were proposed in 2014, they have been used in a wide range of applications, such as image style transfer [25, 26], image synthesis [27, 28], object tracking [29] and zero-shot learning [30, 31]. The original GAN consists of a generative model G and a discriminative model D. The generative model aims to generate fake data and capture the distribution of the real data, while the discriminative model aims to distinguish real data from generated data. G and D play the following minimax game on V(G, D):

$$\begin{aligned} {\min _{{G}}}\max _{{D}}{V(G,D)} = {E_{{x}\sim {{p}_{data}(x)}}[logD(x)]}+{E_{{z}\sim {{p}_{z}(z)}}[log(1 - D(G(z)))]} \end{aligned}$$
(1)

where x denotes the real data and z is the noise input. Wang et al. [6] first introduced GANs into cross-modal retrieval and proposed adversarial cross-modal retrieval (ACMR), which learns modality-invariant and discriminative common representations through adversarial learning. Wu et al. [32] proposed cycle-consistent deep generative hashing for cross-modal retrieval, which learns a couple of hash mappings by cycle-consistent adversarial learning without paired input-output examples. Peng et al. [18] proposed cross-modal generative adversarial networks for common representation learning (CM-GANs), which consider both inter-modality and intra-modality adversarial learning to learn common representations more effectively. Although these methods exploit cross-modal correlations through adversarial learning, they ignore the private components of the original representations.

Recently, domain separation networks (DSNs) [33] were proposed for transfer learning [34]. They explicitly separate the representations of different domains into two parts: a private component and a component shared across domains. The experimental results demonstrate their success in unsupervised domain adaptation scenarios. Yang et al. [23] proposed shared predictive cross-modal deep quantization (SPDQ), which constructs a shared subspace and two private subspaces to adequately exploit the intrinsic correlations among multiple modalities. Inspired by these works, this paper explicitly separates the original representations into common and private representations and exploits common latent semantic representations. Different from [23], we achieve this under the framework of generative adversarial networks.

3 Proposed method

3.1 Notations and problem statement

3.1.1 Notations

To simplify the notations, we focus on two modalities, i.e., image modality and text modality. Assuming N instances of image-text pairs, we denote the whole dataset as \({\mathscr{O}} = \{(\mathbf{v}_{i},\mathbf{t}_{i})\}_{i = 1}^{N}\), where \(\mathbf{v}_{i}\) is the i-th image feature vector, \(\mathbf{t}_{i}\) is the i-th text feature vector. For each pair of data \((\mathbf{v}_{i},\mathbf{t}_{i})\), the semantic label is assigned by vector \(\mathbf{y}_i = [y_i^1, y_i^2, \ldots , y_i^d]\), where d is the total number of categories, \(y_i^j = 1\) if \((\mathbf{v}_{i},\mathbf{t}_{i})\) belongs to the j-th class while \(y_i^j = 0\) otherwise.

3.1.2 Problem statement

Since the image feature vectors and text feature vectors typically have different statistical properties, they cannot be compared directly. To address this problem, we propose to learn transform functions \(\mathbf{c}_i^v = {f}(\mathbf{v}_{i}; \varUpsilon _v)\) for the image modality and \(\mathbf{c}_i^t = {g}(\mathbf{t}_{i}; \varUpsilon _t)\) for the text modality, where \(\mathbf{c}_i^v\) and \(\mathbf{c}_i^t\) denote the image and text representations in the common representation space, and \(\varUpsilon _v\) and \(\varUpsilon _t\) are the parameters of the two functions, respectively. After that, we can measure their similarity by calculating the cosine distance between \(\mathbf{c}_i^v\) and \(\mathbf{c}_i^t\). We expect the cosine distance of semantically similar image-text pairs to be smaller than that of semantically dissimilar pairs.
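
To make the retrieval criterion concrete, the sketch below ranks a gallery of text representations for one image query by cosine similarity; the function name and tensor shapes are illustrative assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F

def rank_texts_for_image(c_v, c_t_all):
    """Rank all text common representations for one image query by cosine
    similarity (illustrative helper; shapes are assumptions).
    c_v: (d,) image representation, c_t_all: (N, d) text representations."""
    sims = F.cosine_similarity(c_v.unsqueeze(0), c_t_all, dim=1)  # (N,) similarities
    return torch.argsort(sims, descending=True)                   # most similar texts first
```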

3.2 Model

Inspired by GANs’ strong ability in modelling data distribution and learning discriminative representations, we utilize GANs to model the distribution over the data of different modalities and to learn the common representations. In this paper, we introduce two generative adversarial networks: GAN\(_v\) for image modality and GAN\(_t\) for text modality.

3.2.1 Generative model

The image representation generator \(G_v\) and the text representation generator \(G_t\) take the paired image and text features \(\mathbf{h}^v\) and \(\mathbf{h}^t\) as inputs, respectively. Through several fully-connected layers, representations \(\mathbf{r}^v\) and \(\mathbf{r}^t\) of the same length are obtained for the image and text modalities. Then, the image representation \(\mathbf{r}^v\) is separated into a common representation \(\mathbf{c}^v\) and a private representation \(\mathbf{p}^v\), while the text representation \(\mathbf{r}^t\) is separated into a common representation \(\mathbf{c}^t\) and a private representation \(\mathbf{p}^t\), as shown in Fig. 1.

Ideally, we expect the common representations to include only the semantic information shared by images and texts, while the private representations contain only the modality-specific information. To achieve this, we minimize the squared inner products between the common and private representations of each modality, encouraging the two parts to be orthogonal:

$$\begin{aligned} {{L}}_{Space_*}={\frac{1}{K}\sum _{i=1}^{K}<\mathbf{c}_i^*, \mathbf{p}_i^*>^2} \end{aligned}$$
(2)

where \(*\in \{v, t\}\) indexes the modality, \(<\mathbf{a}, \mathbf{b}>\) is the inner product of \(\mathbf{a}\) and \(\mathbf{b}\), and K is the number of instances in one batch.
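
A minimal sketch of Eq. (2) in PyTorch, assuming batched common and private representations of shape (K, d):

```python
import torch

def space_loss(c, p):
    """Eq. (2): mean squared inner product between the common (c) and
    private (p) representations of one modality; minimizing it pushes
    the two parts towards orthogonality. c, p: tensors of shape (K, d)."""
    inner = (c * p).sum(dim=1)      # <c_i, p_i> for each instance in the batch
    return (inner ** 2).mean()      # (1/K) * sum of squared inner products
```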

To ensure the effectiveness of the separation and to encourage information swap among modalities, we reconstruct the original representations by exchanging the common representations between modalities. Specifically, we concatenate the image private representation \(\mathbf{p}^v\) and the text common representation \(\mathbf{c}^t\) as the input of several fully-connected layers to reconstruct the image representation \(\hat{\mathbf{r}}^v\). Similarly, we concatenate the text private representation \(\mathbf{p}^t\) and the image common representation \(\mathbf{c}^v\) as the input of several fully-connected layers to reconstruct the text representation \(\hat{\mathbf{r}}^t\). The reconstruction loss is formulated as follows:

$$\begin{aligned} {{L}}_{Recon_*}= \frac{1}{K}\sum _{i=1}^{K}(\hat{\mathbf{r}}_i^* - \mathbf{r}_i^*)^2 \end{aligned}$$
(3)

where \(*\in \{v, t\}\).
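
The exchanged reconstruction can be sketched as follows, reading the squared difference in Eq. (3) as a mean squared error; the decoder modules dec_v and dec_t (fully-connected layers) are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def recon_losses(dec_v, dec_t, p_v, c_v, p_t, c_t, r_v, r_t):
    """Eq. (3) with exchanged common representations: the image decoder sees
    (image private, text common) and the text decoder sees (text private,
    image common); each reconstruction is compared with the original r."""
    r_v_hat = dec_v(torch.cat([p_v, c_t], dim=1))   # reconstruct image representation
    r_t_hat = dec_t(torch.cat([p_t, c_v], dim=1))   # reconstruct text representation
    return mse(r_v_hat, r_v), mse(r_t_hat, r_t)
```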

To exploit the common semantics within and across modalities, we denote the intra-modality similarity matrices of the image and text modalities as \(S^{vv}\) and \(S^{tt}\), and the inter-modality similarity matrix as \(S^{vt}\). For the image modality, we set \(S_{ij}^{vv}= 1\) if image \(v_i\) and image \(v_j\) belong to the same class, and \(S_{ij}^{vv}= 0\) otherwise. Similarly, for the text modality, \(S_{ij}^{tt}= 1\) if text \(t_i\) and text \(t_j\) belong to the same class, and \(S_{ij}^{tt}= 0\) otherwise. For \(S^{vt}\), we define \(S_{ij}^{vt}= 1\) if image \(v_i\) and text \(t_j\) belong to the same class, and \(S_{ij}^{vt}= 0\) otherwise.
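
The three similarity matrices can be built directly from the label vectors defined in Sect. 3.1.1. The sketch below marks two instances as similar when they share at least one class, which reduces to the single-label definition above.

```python
import torch

def similarity_matrices(y_v, y_t):
    """Build S^vv, S^tt and S^vt from label matrices y_v, y_t of shape (K, d):
    an entry is 1 when the two instances share at least one class, 0 otherwise."""
    S_vv = (y_v @ y_v.t() > 0).float()
    S_tt = (y_t @ y_t.t() > 0).float()
    S_vt = (y_v @ y_t.t() > 0).float()
    return S_vv, S_tt, S_vt
```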

Based on the above notations and discussion, the objective function of image modality is defined as follows:

$$\begin{aligned} {{L}}_{S_v}= -{\frac{1}{K}\sum _{i, j=1}^{K}(S_{ij}^{vt}\varTheta _{ij} - log(1 + e^{\varTheta _{ij}}))} -{\frac{1}{K}\sum _{i, j=1}^{K}(S_{ij}^{vv}\varGamma _{ij} - log(1 + e^{\varGamma _{ij}}))} \end{aligned}$$
(4)

where \(\varTheta _{ij} = cos(\mathbf{c}_i^v, \mathbf{c}_j^t)\), \(\varGamma _{ij} = cos(\mathbf{c}_i^v, \mathbf{c}_j^v)\), \(\cos (\mathbf{a}, \mathbf{b})\) is the cosine function used to compute the similarity between \(\mathbf{a}\) and \(\mathbf{b}\).

Similarly, the objective function of text modality can be formulated as follows:

$$\begin{aligned} {{L}}_{S_t}= -{\frac{1}{K}\sum _{i, j=1}^{K}(S_{ij}^{vt}\varTheta _{ij} - log(1 + e^{\varTheta _{ij}}))} -{\frac{1}{K}\sum _{i, j=1}^{K}(S_{ij}^{tt}\varPhi _{ij} - log(1 + e^{\varPhi _{ij}}))} \end{aligned}$$
(5)

where \(\varPhi _{ij} = cos(\mathbf{c}_i^t, \mathbf{c}_j^t)\). The first term in equations (4) and (5) is the negative log likelihood of the cross-modal similarities with the likelihood function defined as follows:

$$\begin{aligned} P(S_{ij}^{vt}| \mathbf{c}_i^v, \mathbf{c}_j^t) = \left\{ \begin{array}{ll} \sigma (\varTheta _{ij})&{} \quad \text{when} \quad {S_{ij}^{vt} = 1}\\ 1 - \sigma (\varTheta _{ij})&{} \quad \text{when} \quad {S_{ij}^{vt} = 0} \end{array} \right. \end{aligned}$$

where \(\sigma (\varTheta _{ij}) = \frac{1}{1 + e^{-\varTheta _{ij}}}\).

It is easy to see that minimizing this negative log likelihood is equivalent to maximizing the likelihood, which makes the similarity between \(\mathbf{c}_i^v\) and \(\mathbf{c}_j^t\) large when \(S_{ij}^{vt} = 1\) and small when \(S_{ij}^{vt} = 0\). The second terms in (4) and (5) measure the intra-modality similarity of the image and text modalities, respectively. Therefore, Eqs. (4) and (5) encourage the model to learn more discriminative common representations.
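
A sketch of Eq. (4), using the identity \(log(1 + e^x) = \text{softplus}(x)\) and assuming batched common representations of shape (K, d):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b, both (K, d)."""
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

def semantic_loss_v(c_v, c_t, S_vt, S_vv):
    """Eq. (4): negative log likelihood over the cross-modal similarities
    Theta_ij = cos(c_i^v, c_j^t) and the intra-modal similarities
    Gamma_ij = cos(c_i^v, c_j^v); the 1/K scaling follows the paper."""
    K = c_v.size(0)
    theta = pairwise_cosine(c_v, c_t)
    gamma = pairwise_cosine(c_v, c_v)
    loss_cross = -(S_vt * theta - F.softplus(theta)).sum() / K
    loss_intra = -(S_vv * gamma - F.softplus(gamma)).sum() / K
    return loss_cross + loss_intra
```

Calling semantic_loss_v(c_t, c_v, S_vt.t(), S_tt) with the arguments swapped gives the text-side loss of Eq. (5).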

3.2.2 Discriminative model

Two discriminators are designed to distinguish the representations in the common representation space. The image representation discriminator \(D_v\) tries to distinguish the image representations \(\mathbf{c}^v\), treated as real data, from the text representations \(\mathbf{c}^t\), treated as fake data. The text discriminator \(D_t\) tries to distinguish the representations \(\mathbf{c}^t\), treated as real data, from the representations \(\mathbf{c}^v\), treated as fake data. Based on this, the adversarial loss for the image modality is defined as follows:

$$\begin{aligned} {{L}}_{adv_v}={E_{{c^v}\sim {{P}_{c^v}}}[log(D_v(\mathbf{c}^v))]}+{E_{{c^t}\sim {{P}_{c^t}}}[log(1 - D_v(\mathbf{c}^t))]} \end{aligned}$$
(6)

Similarly, the adversarial loss for the text modality is defined as follows:

$$\begin{aligned} {{L}}_{adv_t}={E_{{c^t}\sim {{P}_{c^t}}}[log(D_t(\mathbf{c}^t))]}+{E_{{c^v}\sim {{P}_{c^v}}}[log(1 - D_t(\mathbf{c}^v))]} \end{aligned}$$
(7)

After adversarial learning, the discriminators ultimately cannot identify which modality the representations come from. In this way, the cross-modal correlations are well captured and the discriminative common properties are learned simultaneously.
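
Eqs. (6) and (7) can be written as a short helper; the small eps and the assumption that both discriminators end with a sigmoid are illustrative choices for numerical stability, not stated in the paper.

```python
import torch

def adversarial_losses(D_v, D_t, c_v, c_t, eps=1e-8):
    """Eqs. (6)-(7): each discriminator treats its own modality's common
    representation as real and the other modality's as fake; D_v and D_t
    are assumed to output probabilities in (0, 1)."""
    L_adv_v = (torch.log(D_v(c_v) + eps) + torch.log(1 - D_v(c_t) + eps)).mean()
    L_adv_t = (torch.log(D_t(c_t) + eps) + torch.log(1 - D_t(c_v) + eps)).mean()
    return L_adv_v, L_adv_t
```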

3.3 Objective function

With the above definitions, the whole objective function can be formulated as follows:

$$\begin{aligned} \min _{G_v, G_t}\max _{D_v, D_t}L_{GAN_v}(G_v, G_t, D_v) + L_{GAN_t}(G_v, G_t, D_t) \end{aligned}$$
(8)

where \({{L}}_{GAN_*}={L}_{adv_*} + \alpha {L}_{Space_*} + \beta {L}_{Recon_*} +\gamma {L}_{S_*}\), \(*\in \{v, t\}\), \(\alpha\), \(\beta\) and \(\gamma\) are the regularization parameters.

4 Algorithm

4.1 Optimizing discriminative model

Following [6], we adopt the stochastic gradient method to optimize the discriminators. For the image pathway, the image discriminator is trained to maximize the log-likelihood of correctly discriminating the common representations. It is trained by ascending its stochastic gradient according to the following equation:

$$\begin{aligned} \theta _{D_v} \leftarrow \theta _{D_v} + \mu \cdot \nabla _{\theta _{D_v}}\frac{1}{K}\sum _{i=1}^{K}{[log(D_{v}(\mathbf{c}_i^v)) + log(1 - D_{v}(\mathbf{c}_i^t))]} \end{aligned}$$
(9)

where \(\theta _{D_v}\) are the parameters of the image discriminative model and \(\mu\) is the learning rate. Similarly, the text discriminator in the text pathway is trained by ascending its stochastic gradient according to the following equation:

$$\begin{aligned} \theta _{D_t} \leftarrow \theta _{D_t} + \mu \cdot \nabla _{\theta _{D_t}}\frac{1}{K}\sum _{i=1}^{K}{[log(D_{t}(\mathbf{c}_i^t)) + log(1 - D_{t}(\mathbf{c}_i^v))]} \end{aligned}$$
(10)

where \(\theta _{D_t}\) are the parameters of the text discriminative model.

4.2 Optimizing generative model

The image generator is trained by descending its stochastic gradient according to the following equation:

$$\begin{aligned} \theta _{G_v} \leftarrow \theta _{G_v} - \mu \cdot \nabla _{\theta _{G_v}}\left[ \frac{1}{K}\sum _{i=1}^{K}log(D_{t}(\mathbf{c}_i^v)) + \alpha {L}_{Space_v} + \beta {L}_{Recon_v} + \gamma {L}_{S_v}\right] \end{aligned}$$
(11)

Similarly, the parameters of the text generator are updated by descending its stochastic gradient as follows:

$$\begin{aligned} \theta _{G_t} \leftarrow \theta _{G_t} - \mu \cdot \nabla _{\theta _{G_t}}\left[ \frac{1}{K}\sum _{i=1}^{K}log(D_{v}(\mathbf{c}_i^t)) + \alpha {L}_{Space_t} + \beta {L}_{Recon_t} + \gamma {L}_{S_t}\right] \end{aligned}$$
(12)

The details of the whole procedure are summarised in Algorithm 1.

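A sketch of one alternating training step is given below, combining the loss helpers sketched in Sect. 3 with the updates of Eqs. (9)-(12); the module interfaces (each generator returning the common, private and full representations, and opt_G covering the generator and decoder parameters) are assumptions for illustration, not the authors' code.

```python
import torch

def train_step(G_v, G_t, dec_v, dec_t, D_v, D_t, opt_D, opt_G,
               h_v, h_t, y, alpha=0.1, beta=0.1, gamma=1.0, eps=1e-8):
    """One alternating ReSAN update (illustrative sketch)."""
    # Discriminator step: ascend Eqs. (9)-(10), implemented as descent on the negative.
    with torch.no_grad():
        c_v, p_v, r_v = G_v(h_v)
        c_t, p_t, r_t = G_t(h_t)
    d_loss = -(torch.log(D_v(c_v) + eps) + torch.log(1 - D_v(c_t) + eps)
               + torch.log(D_t(c_t) + eps) + torch.log(1 - D_t(c_v) + eps)).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: descend Eqs. (11)-(12) with the weighted terms of Eq. (8).
    c_v, p_v, r_v = G_v(h_v)
    c_t, p_t, r_t = G_t(h_t)
    S_vv, S_tt, S_vt = similarity_matrices(y, y)          # labels are shared by each pair
    rec_v, rec_t = recon_losses(dec_v, dec_t, p_v, c_v, p_t, c_t, r_v, r_t)
    g_loss = (torch.log(D_t(c_v) + eps).mean() + torch.log(D_v(c_t) + eps).mean()
              + alpha * (space_loss(c_v, p_v) + space_loss(c_t, p_t))
              + beta * (rec_v + rec_t)
              + gamma * (semantic_loss_v(c_v, c_t, S_vt, S_vv)
                         + semantic_loss_v(c_t, c_v, S_vt.t(), S_tt)))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```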

5 Experiments and results

To evaluate the proposed method, we conduct experiments on the Wikipedia and NUSWIDE-10k datasets. In Sect. 5.1, we describe the datasets and evaluation protocol, followed by the implementation details in Sect. 5.2. In Sect. 5.3, we present the experimental results and analysis.

5.1 Datasets and evaluation

5.1.1 Datasets

The Wikipedia [35] and NUSWIDE-10k [36] datasets are widely used for cross-modal retrieval. The Wikipedia dataset consists of 10 categories and 2866 instances (image-text pairs), of which 2173 pairs are randomly selected for training and the remaining 693 pairs are used for testing. NUSWIDE-10k includes 10 categories and contains 10,000 image-text pairs, of which 8000 pairs are randomly selected for training and 2000 pairs for testing. We adopt 4096-dimensional vectors extracted from the fc7 layer of VGGNet [37] as image features; text features are 3000-dimensional bag-of-words (BoW) vectors for Wikipedia and 1000-dimensional BoW vectors for NUSWIDE-10k. The statistics of the two datasets are summarised in Table 1.

Table 1 The details of two datasets, where “/” in column “Instances” represents the number of training/test image-text pairs

5.1.2 Evaluation

In this paper, we use mean Average Precision (mAP) to evaluate the cross-modal retrieval performances.

$$\begin{aligned} mAP = \frac{1}{N}\sum _{i=1}^{N}{AP(q_i)} \end{aligned}$$

where AP(\(\cdot\)) computes the average precision, N is the number of query samples and \(q_i\) represents the i-th query sample. The larger the mAP value, the better the retrieval performance. We conduct two tasks, retrieving text using an image as the query (Img2Txt) and retrieving images using text as the query (Txt2Img), and report the mAP for each. The results of ACMR are obtained by running the code provided by the authors, while the others are taken from the published papers.
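
For reference, a common way to compute mAP from a similarity matrix and integer class labels is sketched below; whether all retrieved items or only the top-R are scored varies across papers, and this sketch scores the full ranked list.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given a binary relevance list in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(sim, query_labels, gallery_labels):
    """mAP over all queries; sim[i, j] is the similarity between query i and
    gallery item j, and an item is relevant if its class matches the query."""
    aps = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                              # most similar first
        aps.append(average_precision(gallery_labels[order] == query_labels[i]))
    return float(np.mean(aps))
```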

5.1.3 Compared methods

To show the effectiveness of our method, we selected the following representative methods for comparison, including traditional, DNN-based and GAN-based methods.

Traditional methods:

  • CCA [38]: It learns linear projection by maximizing the correlation between pairwise data of different modalities.

  • LCFS [39]: It learns two projection matrices with sparsity penalties to select relevant and discriminative features from the coupled feature spaces simultaneously.

  • JRL [40]: It integrates graph regularization and semi-supervised information to jointly learn representations for different modalities.

DNN-based methods:

  • Bimodal-AE [41]: It proposes a novel application of deep networks to learn features over multiple modalities.

  • Corr-AE [16]: It models jointly the cross-modal correlation and reconstruction information.

  • CMDN [17]: It models inter-modal invariance and intra-modal discrimination jointly in a multi-task learning framework.

GANs-based method:

  • ACMR [6]: It seeks an effective common subspace based on adversarial learning.

5.2 Implementation details

The proposed method consists of two sub-networks, one for the image modality and the other for the text modality. As shown in Fig. 2, the generative model of the image modality is a network of four fully connected layers with Leaky ReLU activations, which projects the raw image features into the common subspace. A fully connected layer maps the 2048-dimensional vector from the middle layer to the image private representation space. Then, we concatenate the image private representation and the text common representation and reconstruct the image representation with a fully connected layer. The generative model for the text modality is similar. Each discriminative model consists of two fully connected layers: the first layer has 128 neurons and the second has 1. The mini-batch size is set to 64. Moreover, \(\alpha\), \(\beta\) and \(\gamma\) are empirically set to 0.1, 0.1 and 1, respectively.
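
The description above can be sketched as follows; the hidden-layer sizes other than the 2048-dimensional middle layer, the Leaky ReLU slope, the representation dimensions and the sigmoid output of the discriminator are assumptions not specified in the paper.

```python
import torch.nn as nn

class ImageGenerator(nn.Module):
    """Sketch of the image pathway: four fully connected layers with Leaky ReLU
    produce the common representation, and a branch from the 2048-dimensional
    middle layer gives the private representation (sizes partly assumed)."""
    def __init__(self, in_dim=4096, common_dim=128, private_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 2048), nn.LeakyReLU(0.2))
        self.common_head = nn.Sequential(
            nn.Linear(2048, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, common_dim))
        self.private_head = nn.Linear(2048, private_dim)  # middle layer -> private space

    def forward(self, h_v):
        r = self.trunk(h_v)                               # plays the role of r^v
        return self.common_head(r), self.private_head(r), r

class Discriminator(nn.Module):
    """Two fully connected layers (128 neurons, then 1) with a sigmoid output."""
    def __init__(self, in_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.LeakyReLU(0.2),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, c):
        return self.net(c)
```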

Fig. 2

The structural details of the representation separation adversarial networks (ReSAN)

5.3 Experimental results

5.3.1 Results on Wikipedia dataset

Table 2 shows the mAP of different methods on this dataset. We can see that LCFS and JRL obtain similar results, which are much better than the traditional CCA method. GAN-based methods such as ACMR and ReSAN obtain better results than the traditional and DNN-based methods. Among them, our method achieves the best performance on this dataset.

Table 2 The mAP of different methods on the Wikipedia dataset for Img2Txt and Txt2Img

Furthermore, we show the top-10 results retrieved by CCA, ACMR and ReSAN on the Wikipedia dataset in Fig. 3.

Fig. 3

Examples of bi-modal retrieval results on the Wikipedia dataset by CCA, ACMR and ReSAN. Results with green borders are correct, while those with red dotted borders are wrong (Color figure online)

We can see that four of the CCA results are wrong, while ACMR and our method obtain good retrieval performance on the Img2Txt task. For the Txt2Img task, CCA retrieves some wrong images, which do not belong to the geography category, and ACMR returns one wrong result. Compared with ACMR, the results retrieved by ReSAN are more related to the semantic category, although there is also one wrong result. The reason might be that CCA only measures the global correlation between data from different modalities, whereas our method and ACMR explore the semantic information of each modality with deep networks, which reduces the semantic gap effectively and generates more discriminative representations.

Furthermore, we visualize the representations learned by ACMR and ReSAN on the Wikipedia dataset using t-SNE [42] in Fig. 4. We apply min-max normalization to make the distribution of each category clearer. From the figure, we conclude that the representations learned by our method are more discriminative than those of the other method. Specifically, the representations generated for images of some classes (biology, geography, sport, warfare) are relatively concentrated. This is because our method separates the original representation into a common representation and a private representation; compared with the original representation, the common representation retains more modality-independent semantic information, which helps reduce the modality gap. This demonstrates the effectiveness of the representation separation.

Fig. 4

t-SNE visualization for ten semantic categories in the Wikipedia test dataset. Different numbers represent different semantic categories

5.3.2 Results on NUSWIDE-10k dataset

The results on NUSWIDE-10k are presented in Table 3. From this table, we make the following observations: (1) JRL achieves the best performance among the traditional methods, demonstrating the advantage of jointly using supervised information and graph regularization; (2) among the deep learning-based methods, ACMR and our method obtain the best performance.

Table 3 The mAP of different methods on the NUSWIDE-10k dataset for Img2Txt and Txt2Img

Figure 5 shows the PR curves of ACMR and ReSAN on the Wikipedia and NUS-WIDE-10k datasets. We can see that the PR curve of ReSAN is better than that of ACMR on the Wikipedia dataset, while on the NUS-WIDE-10k dataset ReSAN achieves results comparable to ACMR.

Fig. 5

The PR curves of ACMR and ReSAN on the Wikipedia and NUS-WIDE-10k datasets

5.3.3 The effectiveness of different terms

In the proposed method, adversarial learning aims to model the joint data distribution of different modalities, while representation separation learns common semantic representations for cross-modal retrieval. To demonstrate their contributions to the retrieval performance, we derive three variants of ReSAN: one without adversarial learning (ReSAN-D), one without representation separation (ReSAN-P), and one without reconstruction (ReSAN-C). The results are shown in Table 4. We can see that the three components improve the retrieval performance to different degrees; among them, the representation separation and the reconstruction play the most important roles.

Table 4 The contributions of different terms in ReSAN on the Wikipedia and NUSWIDE-10k datasets

5.3.4 Running time of ReSAN

Figure 6 shows the running time of ReSAN and ACMR. Our method takes longer than ACMR to compute representations, but this step can be done off-line. During the retrieval stage, our method is faster than ACMR, which is important in real applications.

Fig. 6

Running time of ReSAN and ACMR

6 Conclusion

In this paper, we proposed representation separation adversarial networks (ReSAN) for cross-modal retrieval, which explicitly split the original representations into a common representation and a private representation for each modality. To learn modality-invariant common representations, we proposed a strategy that exchanges the common representations among different modalities. Furthermore, we adopt the label information to increase the discriminative ability of the common representations. The experimental results on two widely used datasets demonstrate that modeling the modality-specific part of each modality can effectively improve the robustness of the common representations. In the future, we will extend this representation separation approach to unsupervised scenarios.