
1 Introduction

With the development of science and technology, more and more multimedia data, such as images and texts, appear on the Internet. Owing to the explosive growth of these data, the demand for cross-modal retrieval has increased sharply. Cross-modal retrieval aims to search for semantically related images (texts) given a text (image) query, and vice versa. Image retrieval with hashing is a long-established research task that retrieves images with similar content [17]; images are commonly processed with VGG [19] or other neural networks. For text, Word2Vec is widely used, which also tries to exploit latent semantics [23]. One of the biggest challenges of cross-modal retrieval is how to bridge the heterogeneous gap between two different modalities, which arises from the different distributions of features from different modalities. Data within a single modality also carry heterogeneous information, which can be tackled from multiple views [5]. To address the heterogeneous gap between modalities, many cross-modal hashing methods have been proposed; by mapping data into binary codes, they offer low storage cost and high query speed.

The development of cross-modal retrieval can be divided into two phases: shallow cross-modal hashing and deep learning-based cross-modal hashing. Shallow cross-modal hashing is based on hand-crafted features and learns hash codes with linear functions. These methods are easy to implement, but they cannot fully exploit the semantic information of the two modalities. Recently, with the development of deep learning, deep neural networks (DNNs) have been applied to cross-modal hashing. DNN-based cross-modal hashing can be divided into two categories: supervised hashing and unsupervised hashing. Supervised methods, which use label information such as tags, usually perform remarkably well. However, in real-world settings, obtaining labels for image-text pairs is time-consuming. Unsupervised methods, which do not use label information in the training phase, have also shown remarkable performance in recent years. Because unsupervised methods rely more on the information in raw features, the quality of the hash codes used in the retrieval task depends heavily on the feature learning stage.

However, some issues remain to be tackled. First, at the feature extraction phase, existing methods focus only on a single-source feature, neglecting the rich semantic information that can be gained from multiple views. Second, a generic similarity matrix over features cannot bridge the heterogeneous gap well, because distribution information and similarities at different scales are not considered. In this paper, we propose a novel self-auxiliary hashing (SAH) method for unsupervised cross-modal retrieval. SAH provides a two-branch network for each modality, consisting of a uniform branch and an auxiliary branch; each branch generates its own features and hash codes. Based on the features and hash codes of the two branches, we construct multiple inter-modality and intra-modality similarity matrices, which are used to preserve more semantic and similarity information. Extensive experiments demonstrate the superior performance of our method.

2 Related Work

Cross-modal hashing can be roughly divided into supervised and unsupervised cross-modal hashing. The task of cross-modal retrieval is to retrieve images (or texts) that are semantically similar to an input text (or image). Shallow cross-modal hashing methods [12, 13, 15, 16] and deep cross-modal hashing methods [1, 2, 11, 22] represent two stages in the development of cross-modal hashing. Shallow cross-modal hashing uses hand-crafted features to learn the projection that maps instances to binary vectors. However, most shallow cross-modal hashing methods process features with only a single layer and map data in a linear or simple nonlinear way. In recent years, deep learning algorithms have been applied to cross-modal retrieval. Deep cross-modal retrieval [18] can also be divided into unsupervised and supervised methods.

Supervised hashing methods [7, 10, 18, 22] exploit relational information, such as semantic information carried by labels or tags, to enhance cross-modal retrieval. Deep cross-modal hashing (DCMH) [7] is an end-to-end method with deep neural networks that jointly learns hash codes and features. Generative adversarial networks (GANs) have also been used for adversarial learning in deep cross-modal hashing: Self-Supervised Adversarial Hashing Networks (SSAH) [10] and Wang et al. [22] use image and text adversarial networks to generate hash codes for both modalities, where the learned features keep the semantic relevance and preserve the semantics across modalities.

Although supervised methods perform well in practical applications, supervised information such as labels is hard to collect, which limits their applicability in reality.

Unsupervised hashing methods aim to learn hash functions without supervised information such as labeled data. For example, inter-media hashing (IMH) [20] considers inter-media and intra-media consistency with linear hash functions and jointly learns the hash functions of the image and text modalities. CVH [9] proposes a principled method to learn hash functions for instances of different modalities. Collective Matrix Factorization Hashing (CMFH) [4] learns the hash codes of an instance from two modalities and derives upper and lower bounds. Latent Semantic Sparse Hashing (LSSH) [26] processes image and text instances with different techniques and performs search via sparse coding and matrix factorization. Unsupervised Deep Cross-Modal Hashing (UDCMH) [24] combines deep learning with matrix factorization, considering neighbourhood information and weight assignment in the optimization stage. Deep joint semantics reconstructing hashing (DJSRH) [21] considers the neighborhood information of different modalities.

Although the performance of these methods is remarkable, the features they rely on are not comprehensive. Moreover, they neglect the deep similarity information between the two modalities and perform poorly at bridging the “heterogeneity gap”.

3 Proposed Method

3.1 Problem Formulation

Assume the training set of our method is a collection of pairwise image-text instances, written as O = (X, Y), where X denotes the image-modality instances and Y the text-modality instances. Each modality contains n instances. The goal of our method is to learn modality-specific hash functions for the image and text modalities that generate hash codes with rich semantic information. For each modality, two branches (a uniform branch and a modality-specific auxiliary branch) are used to generate different features. MF\(_{*_i}\) denotes the i-th multi-scale feature of the image or text modality, generated from the auxiliary branch with a specific dimension, and MH\(_{*_{i}}\) denotes the i-th multi-scale hash code generated from MF\(_{*_i}\). \(F_{*}\) and \(H_{*}\) are the feature and hash code obtained from the uniform branch, which follows the same design as other unsupervised methods. The notations used in SAH are summarized in Table 1.

Table 1. Notations and their descriptions.

3.2 Network Architecture

Figure 1 shows the flowchart of SAH. Our method is composed of two networks, an image network and a text network, each of which is divided into a uniform branch and a modality-specific auxiliary branch. For the image network, the uniform branch is AlexNet [8]; the image auxiliary branch, shown in Fig. 2, processes the input image with a fully connected layer to obtain the auxiliary data. For the text network, the uniform branch is an MLP; the text auxiliary branch, shown in Fig. 3, first applies a pooling layer to the input text data to obtain auxiliary data that is convenient for later processing.

Fig. 1. The overview of our proposed SAH.

Fig. 2. Image auxiliary branch.

Fig. 3. Text auxiliary branch.

Feature Extraction. For the image modality, we adopt the pre-trained AlexNet, widely used in unsupervised methods, as the uniform feature extractor. However, features obtained from AlexNet alone are not comprehensive enough, which is a common shortcoming of previous works. Features obtained at a single scale often come from the same measurement perspective and ignore details that may be captured from other perspectives. Benefiting from feature learning at multiple scales, multi-scale features can better represent the semantics of instances.

To obtain multi-scale features, we process the image input with a fully connected layer and three pooling layers, which yields image features of three sizes that we call the auxiliary data, and we then resize them to the same size. These three pieces of auxiliary data are single-channel, so we expand them to three channels. The auxiliary data of the image modality are then processed by a network of five convolutional layers and three fully connected layers to obtain three multi-scale features MF\(_{I_{1}}\), MF\(_{I_{2}}\) and MF\(_{I_{3}}\). For the text modality, we apply four pooling layers to the input text data to obtain auxiliary data of four different sizes. Since text data are sparse, we process these four pieces of auxiliary data with only a fully connected layer and obtain four multi-scale features MF\(_{T_{1}}\), MF\(_{T_{2}}\), MF\(_{T_{3}}\) and MF\(_{T_{4}}\) of the text modality. A minimal sketch of the two auxiliary branches is given below.
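The following PyTorch sketch illustrates one way the two auxiliary branches could be implemented under the description above and the scales reported in Sec. 4.3; the common resize resolution (224), the layer widths, and the feature dimension are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ImageAuxBranch(nn.Module):
    """Sketch of the image auxiliary branch: three pooling layers produce
    auxiliary data at 1024/512/256 resolution, each single-channel map is
    resized to a common size, expanded to three channels, and fed to a
    small 5-conv / 3-FC network to obtain MF_I1, MF_I2, MF_I3."""

    def __init__(self, feat_dim=4096, common_size=224):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AdaptiveAvgPool2d(s) for s in (1024, 512, 256)])
        self.resize = nn.AdaptiveAvgPool2d(common_size)
        self.convs = nn.Sequential(              # five convolutional layers
            nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(6),
        )
        self.fcs = nn.Sequential(                # three fully connected layers
            nn.Linear(256 * 6 * 6, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):                        # x: (B, 1, H, W) single-channel map
        multi_scale_feats = []
        for pool in self.pools:
            a = self.resize(pool(x))             # auxiliary data, resized to a common size
            a = a.expand(-1, 3, -1, -1)          # expand single channel to three channels
            multi_scale_feats.append(self.fcs(self.convs(a).flatten(1)))
        return multi_scale_feats                 # [MF_I1, MF_I2, MF_I3]


class TextAuxBranch(nn.Module):
    """Sketch of the text auxiliary branch: four pooling layers
    (output lengths 1024/512/256/128) followed by a fully connected layer."""

    def __init__(self, feat_dim=4096):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AdaptiveAvgPool1d(s) for s in (1024, 512, 256, 128)])
        self.fcs = nn.ModuleList(
            [nn.Linear(s, feat_dim) for s in (1024, 512, 256, 128)])

    def forward(self, y):                        # y: (B, bow_dim) BoW vector
        return [fc(pool(y.unsqueeze(1)).squeeze(1))
                for pool, fc in zip(self.pools, self.fcs)]  # [MF_T1, ..., MF_T4]
```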

The image and text modalities use different numbers of auxiliary data because an image usually contains more comprehensive information than a text, so the semantic information of text should be explored more thoroughly. In this way, we obtain rich semantic features for each modality, which are used to construct similarity matrices and guide hash code learning. We compute the similarity matrices of the uniform features with cosine similarity: \(S^{F}_{II}\) and \(S^{F}_{TT}\) are intra-modal similarity matrices and \(S^{F}_{IT}\) is the inter-modal similarity matrix, defined as follows:

$$\begin{aligned} \begin{aligned} S^{F}_{x, y}&= \cos (F_{x}, F_{y}),\\ s.t.\ (x,y)\in&\ \{(I,I),(T,T),(I,T)\}. \end{aligned} \end{aligned}$$
(1)
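A minimal PyTorch helper for Eq. (1). The feature dimension in the usage example is an assumption, and the cross-modal case assumes the uniform features of both modalities share the same dimension.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between two feature batches (Eq. 1).
    a: (n, d) features of one modality, b: (n, d) of another;
    returns an (n, n) matrix S with S[i, j] = cos(a_i, b_j)."""
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()


# Example: S^F_II, S^F_TT and S^F_IT from the uniform features F_I, F_T
# (the dimension 4096 is illustrative).
F_I, F_T = torch.randn(8, 4096), torch.randn(8, 4096)
S_F_II = cosine_similarity_matrix(F_I, F_I)
S_F_TT = cosine_similarity_matrix(F_T, F_T)
S_F_IT = cosine_similarity_matrix(F_I, F_T)
```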

Hash Code Generation. We generate two kinds of hash codes for each modality: a uniform hash code and a comprehensive hash code. The uniform hash codes (\(H_{I}\) and \(H_{T}\)) are obtained from the uniform features in the uniform branch with a simple hash layer for each modality.

For the image modality, the auxiliary branch processes the three auxiliary image features and produces three hash codes of the same size, MH\(_{I_{1}}\), MH\(_{I_{2}}\), and MH\(_{I_{3}}\). We concatenate these three hash codes and pass them through the hash layer HILayer to obtain the comprehensive hash code \(H_{I\_com}\), which contains multi-scale semantics. The concatenation does not change the semantics of each bit dramatically; it can be seen as a form of data enhancement.

$$\begin{aligned} {H_{I\_com} } = HILayer(ConCat({MH_{I_{1} } } ,{MH_{I_{2} } } ,{MH_{I_{3} } } )), \end{aligned}$$
(2)

where ConCat() denotes the concatenation of vectors.
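A hedged sketch of Eq. (2) for the image modality: per-scale hash layers produce MH\(_{I_{1}}\)-MH\(_{I_{3}}\), which are concatenated and passed through HILayer. The layer widths, the code length, and the tanh surrogate for sign (anticipating Sec. 3.3) are assumptions.

```python
import torch
import torch.nn as nn


class ImageHashHead(nn.Module):
    """Sketch of Eq. (2): three same-size multi-scale hash codes are
    concatenated and fused by the hash layer HILayer into H_I_com."""

    def __init__(self, feat_dim=4096, code_len=128):
        super().__init__()
        # one hash layer per multi-scale feature: MF_Ii -> MH_Ii
        self.ms_hash = nn.ModuleList(
            [nn.Linear(feat_dim, code_len) for _ in range(3)])
        # HILayer: fuses the concatenated multi-scale hash codes
        self.hi_layer = nn.Linear(3 * code_len, code_len)

    def forward(self, multi_scale_feats):
        mh = [torch.tanh(h(f)) for h, f in zip(self.ms_hash, multi_scale_feats)]
        h_i_com = torch.tanh(self.hi_layer(torch.cat(mh, dim=1)))  # Eq. (2)
        return mh, h_i_com
```

The text head (Eq. 3) follows the same pattern with four hash layers and HTLayer.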

For the text modality, we have four auxiliary features, which are processed by four different hash layers to produce four auxiliary hash codes of the same size, MH\(_{T_{1}}\), MH\(_{T_{2}}\), MH\(_{T_{3}}\), and MH\(_{T_{4}}\). We concatenate the four codes and feed the result into the hash layer HTLayer to obtain the comprehensive hash code \(H_{T\_com}\).

$$\begin{aligned} {H_{T\_com} } = HTLayer(ConCat({MH_{T_{1} } } ,{MH_{T_{2} } } ,{MH_{T_{3} } },{MH_{T_{4} } } )). \end{aligned}$$
(3)

Concatenation compresses the semantic information while preserving the semantics and similarities at different scales. Furthermore, we fuse the uniform hash code and the comprehensive hash code with a mixing ratio \(\mu \) (\(0<\mu <1\)) to obtain the mixed hash code, which retains more semantic information than the uniform hash code of each modality.

$$\begin{aligned} H_{img\_mix} = \mu H_{I\_com} + (1 - \mu )H_{I}. \end{aligned}$$
(4)
$$\begin{aligned} H_{txt\_mix} = \mu H_{T\_com} + (1 - \mu )H_{T}. \end{aligned}$$
(5)
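A one-line sketch of the fusion in Eqs. (4)-(5), with the reported value \(\mu = 0.1\) (Sec. 4.3) as the default.

```python
import torch


def mix_hash_codes(h_uniform: torch.Tensor,
                   h_comprehensive: torch.Tensor,
                   mu: float = 0.1) -> torch.Tensor:
    """Eqs. (4)-(5): fuse the comprehensive and uniform hash codes
    with mixing ratio mu."""
    return mu * h_comprehensive + (1.0 - mu) * h_uniform
```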

The similarity matrices of the uniform hash codes are calculated with the same similarity function:

$$\begin{aligned} \begin{aligned} S^{H}_{x,y}&= \cos (H_{x}, H_{y}), \\ s.t.\ (x,y)\in&\ \{(I,I),(T,T),(I,T)\}. \end{aligned} \end{aligned}$$
(6)

Similarity Matrices Learning. Since the dimension reduction from features to hash codes causes some semantic loss, we aim to keep the semantic consistency of instance pairs. To this end, we introduce a loss function that measures the semantic consistency between hash codes and features within and across modalities. The loss function \( L_{1}\) can be written as follows:

$$\begin{aligned} \begin{aligned} L_{1} =&\sum _{i=1}^{n}\sum _{j=1}^{n} \parallel S^{H}_{x,y}(i, j) - S^{F}_{x,y}(i,j) \parallel . \end{aligned} \end{aligned}$$
(7)
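A minimal sketch of the consistency term in Eq. (7); interpreting \(\parallel \cdot \parallel \) as an element-wise absolute error is an assumption.

```python
import torch


def similarity_consistency_loss(S_H: torch.Tensor, S_F: torch.Tensor) -> torch.Tensor:
    """Eq. (7): penalize the element-wise discrepancy between a hash-code
    similarity matrix and the corresponding feature similarity matrix."""
    return (S_H - S_F).abs().sum()

# L1 accumulates this term over the (I,I), (T,T) and (I,T) pairs, e.g.
# L1 = sum(similarity_consistency_loss(S_H, S_F)
#          for S_H, S_F in [(S_H_II, S_F_II), (S_H_TT, S_F_TT), (S_H_IT, S_F_IT)])
```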

For intra-modality learning, the multi-scale hash codes and the uniform hash codes are generated from features at different scales, so they preserve richer semantic information from different views. The uniform feature similarity matrix \(S^{F}\) indicates the degree of similarity among instances within a single modality, and loss \(L_1\) also helps the multi-scale hash codes retain semantic consistency. To ensure the accuracy of the hash codes, the similarity matrix of the hash codes should approximate the feature similarity matrix. Therefore, we minimize the distance between the similarity matrix of each multi-scale hash code and the intra-modality feature similarity of its modality. The loss function \(L_{2}\) can be written as follows:

$$\begin{aligned} \begin{aligned} \ L_{2} = \sum _{ni = 1}^{3} \parallel S^{MH}_{I_{ni} } -S^{F}_{I,I} \parallel +\sum _{mi = 1}^{4} \parallel S^{MH}_{T_{mi} } -S^{F}_{T,T}\parallel . \end{aligned} \end{aligned}$$
(8)
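A hedged sketch of Eq. (8): every multi-scale hash-code similarity matrix is pulled toward the intra-modal feature similarity matrix of its modality. As above, the element-wise absolute error is an assumed interpretation of the norm.

```python
import torch


def multi_scale_intra_loss(ms_hash_sims_img, ms_hash_sims_txt, S_F_II, S_F_TT):
    """Eq. (8): ms_hash_sims_img is the list of S^MH_{I_ni} (3 matrices),
    ms_hash_sims_txt the list of S^MH_{T_mi} (4 matrices)."""
    loss = sum((s - S_F_II).abs().sum() for s in ms_hash_sims_img)
    loss = loss + sum((s - S_F_TT).abs().sum() for s in ms_hash_sims_txt)
    return loss
```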

For inter-modality learning, the similarity matrix of features should also contain the inherent pair-wise information: the feature similarity of a paired instance across modalities should converge to the maximum value of the cosine similarity. In addition, the complex hash codes generated from the mixed hash codes should retain the similarity consistency. To this end, the loss functions \(L_{3}\) and \({L_{4}}\) can be written as:

$$\begin{aligned} L_{3}=\sum _{i = 1}^{n} \mid S^{CH}_{I,T} -E\mid + \sum _{i = 1}^{n} \mid S^{CH}_{I,T} - S^{F}_{I, T}\mid . \end{aligned}$$
(9)
$$\begin{aligned} L_{4}=\sum _{i = 1}^{n} \mid S^{F}_{I, T} -E\mid . \end{aligned}$$
(10)

where E is an identity matrix.
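A hedged sketch of Eqs. (9)-(10); interpreting the sums as element-wise absolute errors over the \(n \times n\) matrices is an assumption.

```python
import torch


def inter_modal_losses(S_CH_IT: torch.Tensor, S_F_IT: torch.Tensor):
    """Eqs. (9)-(10): S_CH_IT is the cross-modal similarity matrix of the
    complex (mixed) hash codes, S_F_IT the inter-modal feature similarity
    matrix, and E the identity (paired instances should be most similar)."""
    E = torch.eye(S_CH_IT.size(0), device=S_CH_IT.device)
    L3 = (S_CH_IT - E).abs().sum() + (S_CH_IT - S_F_IT).abs().sum()
    L4 = (S_F_IT - E).abs().sum()
    return L3, L4
```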

3.3 Optimization

As mentioned above, the final loss function can be written as follows:

$$\begin{aligned} \min \ \alpha L_{1} +\beta L_{2} + \gamma L_{3} +\delta L_{4}. \end{aligned}$$
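The weighted combination can be expressed directly, using the hyper-parameter values reported in Sec. 4.3 as defaults.

```python
def sah_objective(L1, L2, L3, L4,
                  alpha=1.0, beta=0.1, gamma=1.0, delta=1.0):
    """Final objective: weighted sum of the four loss terms."""
    return alpha * L1 + beta * L2 + gamma * L3 + delta * L4
```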

The goal of our method is to generate hash codes, which are discrete, so the optimization of our objective must satisfy the discrete constraint. The sign function maps its input to \(-1\) or 1, but its gradient is zero for all non-zero inputs, which prevents effective backpropagation. We therefore relax it with a scaled tanh function:

$$\begin{aligned} \lim _{\eta \rightarrow \infty }\tanh (\eta x) = sgn(x), \end{aligned}$$
(11)

where \(\eta \) is a hyper-parameter that increases during network training.

By using \(tanh(\eta x)\) as the activation of the hash layer and gradually increasing \(\eta \), the network converges toward the discrete sign function, turning the original problem into a sequence of smoothed optimization problems.
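A small sketch of the scaled-tanh relaxation; the multiplicative growth schedule for \(\eta \) is an illustrative assumption.

```python
import torch
import torch.nn as nn


class ScaledTanhHash(nn.Module):
    """Smoothed sign activation for the hash layer (Eq. 11): tanh(eta * x)
    approaches sgn(x) as eta grows, so eta is increased over training to
    anneal the continuous codes toward binary values."""

    def __init__(self, eta_init: float = 1.0, growth: float = 1.005):
        super().__init__()
        self.eta = eta_init
        self.growth = growth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.eta * x)

    def step(self):
        # called once per epoch to increase eta during training
        self.eta *= self.growth

# At retrieval time the final binary codes are obtained with sign: B = torch.sign(H)
```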

4 Experiment

4.1 Datasets

MIRFlickr25k [6] contains 25,000 image-text pairs collected from the image-sharing website Flickr, annotated with 24 categories. The images are represented by SIFT features, and each text is represented by a 1,386-dimensional bag-of-words (BoW) vector.

NUS-WIDE [3] consists of 269,648 image-text pairs annotated with 81 label categories. We use only the 10 most frequent categories, which leaves 186,577 usable image-text pairs. The setup for this dataset follows the compared methods, and each text is represented by a 500-dimensional BoW vector.

4.2 Baselines and Evaluation

We compare our SAH with 6 baseline methods, including CVH [9], IMH [20], CMFH [4], LSSH [25], UDCMH [24], and DJSRH [21].

Evaluation Criteria. Mean average precision (mAP) [14] and top-K precision curves are used to evaluate the performance of the proposed SAH and the baselines. Two instances of different modalities are considered semantically similar if they share the same label.
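For concreteness, the sketch below shows one common way to compute mAP@K for \(\pm 1\) binary codes under a multi-label relevance criterion (relevant if at least one label is shared); it is an illustrative assumption, not the authors' evaluation code.

```python
import torch


def map_at_k(query_codes, retrieval_codes, query_labels, retrieval_labels, k=50):
    """mAP@K sketch for binary codes in {-1, +1}; labels are multi-hot
    matrices. Assumes the retrieval set contains at least k items."""
    n_bits = query_codes.size(1)
    # Hamming distance recovered from the inner product of {-1,+1} codes
    dist = 0.5 * (n_bits - query_codes @ retrieval_codes.t())
    ap_sum, n_valid = 0.0, 0
    for i in range(query_codes.size(0)):
        order = torch.argsort(dist[i])[:k]
        relevant = (query_labels[i].float()
                    @ retrieval_labels[order].float().t() > 0).float()
        if relevant.sum() == 0:
            continue
        ranks = torch.arange(1, k + 1, dtype=torch.float, device=relevant.device)
        precision_at_rank = torch.cumsum(relevant, dim=0) / ranks
        ap_sum += (precision_at_rank * relevant).sum() / relevant.sum()
        n_valid += 1
    return ap_sum / max(n_valid, 1)
```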

4.3 Implementation Details

The network of each modality is composed of a uniform branch and an auxiliary branch. For the image modality, the uniform branch is AlexNet, the same as in UDCMH [24] for fairness. The auxiliary branch processes the input image and produces auxiliary data at three scales: 1024 \(\times \) 1024, 512 \(\times \) 512, and 256 \(\times \) 256. For the text modality, the uniform branch is an MLP, and the auxiliary branch produces text auxiliary data of four lengths: 1024, 512, 256, and 128. For the hyper-parameters, we set \(\alpha = 1\), \(\beta = 0.1\), \(\gamma = 1\), \(\delta = 1\), and \(\mu = 0.1\) to achieve the best performance. We implement our method in PyTorch on an NVIDIA GTX 1660 Ti. The batch size is fixed at 32, and the learning rate for both the image and text networks is 0.005. During optimization, we use mini-batch optimization for the networks of both modalities.
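A hedged sketch of this training configuration: the batch size, learning rate, and hyper-parameter values follow the reported setup, while the optimizer choice (SGD), the toy networks, and the stand-in loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

# toy stand-ins for the image and text hashing networks
image_net = nn.Sequential(nn.Linear(4096, 128), nn.Tanh())
text_net = nn.Sequential(nn.Linear(1386, 128), nn.Tanh())
optimizer = torch.optim.SGD(
    list(image_net.parameters()) + list(text_net.parameters()), lr=0.005)

images = torch.randn(32, 4096)   # one dummy mini-batch of image features (batch size 32)
texts = torch.randn(32, 1386)    # one dummy mini-batch of BoW text vectors

h_i, h_t = image_net(images), text_net(texts)
# stand-in for the full objective alpha*L1 + beta*L2 + gamma*L3 + delta*L4 (Sec. 3.3)
loss = (h_i - h_t).abs().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```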

4.4 Comparison with Existing Methods

Results on MIRFlickr25k. Table 2 reports the MAP@50 of our proposed SAH and the previous methods on the MIRFlickr25K dataset. As can be seen, SAH significantly outperforms the baselines. We also plot the top-K precision curves for 128-bit hash codes, where SAH again performs best. For I\(\rightarrow \)T retrieval, SAH improves MAP by more than \(50\%\) over CVH at 128 bits and by \(3.1\%\) over the latest method DJSRH at 128 bits. For T\(\rightarrow \)I retrieval, SAH also achieves superior performance compared with the baselines. Moreover, the gap between the I\(\rightarrow \)T and T\(\rightarrow \)I results is smaller than in other works, which indicates that the auxiliary data helps bridge the heterogeneous gap (Fig. 4).

Fig. 4. Precision@top-K curves on two datasets at 128 bits.

Table 2. Mean average precision (MAP@50) comparison results.

Results on NUS-WIDE. Table 2 also reports the MAP@50 on the NUS-WIDE dataset, where SAH again performs better than the other methods. SAH achieves the best performance for all four code lengths on both datasets, which demonstrates that our method is effective for cross-modal retrieval. The results indicate that the auxiliary data of both modalities mines more latent information and retains the similarity consistency.

4.5 Ablation Study

We verify our method with three variants as diverse baselines of SAH:

(a) SAH-1 is built by removing the intra-modality multi-scale hash code semantic enhancement;

(b) SAH-2 is built by removing the similarity matrix difference between the uniform hash codes and the uniform features;

(c) SAH-3 is built by removing the consistency between the complex hash code similarity and the uniform feature similarity.

Table 3 shows the ablation results on the MIRFlickr25K dataset for 64-bit and 128-bit codes. From the results we observe that each component is important to our method, especially the similarity consistency between features and complex hash codes, which ensures semantic consistency.

Table 3. The mAP@50 results for ablation analysis on MIRFlickr25k.

5 Conclusion

In this paper, we propose a novel unsupervised deep hashing model named self-auxiliary hashing (SAH). We design a two-branch network for each modality and mix the uniform hash codes with the comprehensive hash codes, which preserves richer semantic information and bridges the gap between modalities. Moreover, we make full use of the inter-modality similarity matrices and the multi-scale intra-modality similarity matrices to learn similarity information. Extensive experiments on two datasets show that SAH outperforms several baseline methods for cross-modal retrieval.