1 Introduction

With the explosive growth of multimedia data, multimodal applications such as Image Captioning [17], Visual Question Answering [39], and Cross-modal Retrieval [27, 32, 44, 56] are attracting increasing attention. Unlike unimodal retrieval [1], cross-modal retrieval aims to search for semantically similar samples in one modality given a query from another modality (e.g., retrieving text with an image query). However, exact nearest neighbor search is practically infeasible for real-world large-scale multimedia data because of its time and storage costs, which motivates Approximate Nearest Neighbor (ANN) search. Hashing is one of the most promising ANN techniques [35, 45] and has recently become popular because of its low computational cost and fast retrieval speed. The aim of hashing is to learn hash functions that project data instances from the original feature space into a compact Hamming space, where instances from different modalities are represented by binary hash codes of a fixed length. In the Hamming space, the similarity between original data instances is measured by the Hamming distance, which requires the hash codes to preserve the similarity relationships between the data instances in the original feature space.

Data instances from different modalities are heterogeneous, and their features may follow different distributions. To narrow this gap, many cross-modal hashing methods have been proposed to mine the semantic relationships between modalities and preserve them in a common Hamming space. According to whether supervised information (i.e., semantic labels) is used, existing methods can be broadly divided into unsupervised hashing [46] and supervised hashing [55]. Supervised methods usually achieve superior performance to unsupervised methods by learning the correlations between multimodal data via semantic labels. However, most of them are shallow hashing methods that use hand-crafted features, which may not be optimal discriminative representations of the data instances.

Deep learning based methods have meanwhile achieved promising performance. Typical deep methods, such as DVSH [5], CMDVH [16] and TVDH [42], train hash codes and hash functions simultaneously; they are also called one-stage hashing methods with an end-to-end training scheme. One advantage of such methods is that the joint optimization of hash codes and hash functions can lead to better retrieval accuracy than two-stage methods. However, the hash codes must be retrained whenever the hash functions are changed. Moreover, the alternating training of hash codes and hash functions makes them affect each other, potentially leading to suboptimal performance. Their pairwise-based [19, 21, 55] or triplet-based [12] training strategies also limit them to preserving only local similarities due to computational constraints.

To improve flexibility and address the above limitations, two-stage methods have been proposed, which decouple the training of hash codes and hash functions [22, 26, 31, 48]. Hash codes are first learned independently, and hash functions are then trained to map new queries to the pre-learned hash codes, which makes it easier to optimize the two separately. However, the retrieval accuracy of such methods often falls short of that of one-stage methods. Nearly all two-stage methods use only a coarse-grained similarity matrix \(\varvec{S}^c=\{s^c_{ij}\}\), where \(s^c_{ij}=1\) means that the data instances \(x^c_i\) and \(x^c_j\) are similar and \(s^c_{ij}=0\) means they are dissimilar. This causes a substantial loss of information before the hash codes are generated, because such a matrix cannot capture the differentiated similarity relationships between the training data at the first stage. At the second stage, most two-stage methods treat the learning of hash functions as a multi-binary classification problem. In fact, each bit of a hash code does not necessarily correspond to a partition of the feature space, so the learning of hash functions at this stage should focus on preserving as much discriminative information as possible; that is, it should result from similarity calculation rather than classification. In addition, separate training is no longer limited to mini-batches, enabling the preservation of global similarities.

Based on the above observations, we propose a novel two-stage method for cross-modal retrieval, called Deep Cross-modal Hashing with Fine-grained Similarity (DFGH). It effectively explores the differentiated similarity relationships between the training data and utilizes them to assist the learning of hash codes and hash functions. When performing hash code inference at the first stage, we use semantic labels to build a fine-grained similarity matrix by assuming two instances are more similar if they share more labels. In this way, we can represent the differentiated similarity relationships between the training data. When performing hash code mapping at the second stage, we design a similarity sensitivity strategy to learn the hash functions. In this strategy, we employ a mapping loss to fit the hash codes and a classification loss to retain the semantic information. Importantly, we further design a similarity preserving loss based on the above fine-grained similarity matrix to preserve the differentiated similarity information.

The main contributions of our work are summarized as follows:

1) We propose a novel two-stage hashing method for cross-modal retrieval based on the fine-grained similarity matrix. By utilizing this matrix, the learning of hash codes and hash functions is linked without the two stages affecting each other.

2) At the first stage, we preserve the distinguishable similarity relationships between data instances with the fine-grained similarity matrix. At the second stage, we design a similarity-preserving loss so that the hash functions also preserve the fine-grained similarity information.

3) We conduct extensive experiments on three benchmark datasets to demonstrate the effectiveness of the proposed DFGH. The results show its superiority over state-of-the-art methods, especially on large-scale image datasets (e.g., MS-COCO).

2 Related work

Different from unimodal hashing, cross-modal hashing aims to enable retrieval when the queries and the samples to be retrieved come from different modalities. Narrowing the semantic gap between modalities is crucial when extending unimodal hashing to the cross-modal setting. Cross-modal hashing methods can be divided into two categories according to whether supervised information is used during learning, i.e., unsupervised and supervised methods.

Unsupervised methods learn hash functions by capturing the correlations between different modalities without supervised information. One common way is to preserve the structure of the original features. For example, Inter-Media Hashing (IMH) [43] learned hash functions by exploring both intra-media and inter-media consistency. PDH [40] maintained the predictability of binary hash codes and optimized the objective function with an iterative method. CMFH [14] learned hash codes by collective matrix factorization with a latent factor model across modalities. LSSH [57] used sparse coding to capture the salient structure of images and matrix factorization to obtain the latent concepts of texts, and then projected the learned semantic features into a common space. MSAE [47] learned hash functions by capturing the intra-media and inter-media semantic relationships between data from heterogeneous sources. LCMH [58] aimed at learning hash functions from large-scale training datasets while considering both the intra-similarity within each modality and the inter-similarity across modalities. Fusion Similarity Hashing (FSH) [32] modeled the fusion similarity among different modalities by constructing an undirected asymmetric graph. Unsupervised Knowledge Distillation (UKD) [18] also utilized a similarity matrix; being unsupervised, it estimated the matrix from the original image and text features, i.e., from the distances between instance features. Our similarity matrix is different from theirs, as our method focuses on utilizing supervised information, i.e., the labels of instances.

Supervised methods explore the correlations between heterogeneous data to train hash functions using supervised information, and thus usually obtain better accuracy than unsupervised methods. For example, Semantic Correlation Maximization (SCM) [53] trained hash codes with a semantic similarity matrix calculated from label vectors. Cross-Modality Similarity-Sensitive Hashing (CMSSH) [4] modeled the learning of hash functions as a binary classification problem and used a boosting algorithm during the learning process. CVH [24] learned hash functions by projecting similar data instances from different modalities into similar hash codes.

Many deep hashing methods have been proposed recently. Most of them are one-stage methods, which combine the learning of hash codes and hash functions in an end-to-end manner. For example, Deep Cross-Modal Hashing (DCMH) [21] integrated the learning of deep features and hash codes into an end-to-end framework trained with a pairwise objective function. PRDH [52] integrated different types of pairwise constraints to learn the hash codes. Deep Semantic-Preserving Ordinal Hashing (DSPOH) [23] learned hash functions by using deep neural networks to explore the ranking structure of feature dimensions. Self-Supervised Adversarial Hashing (SSAH) [25] used two adversarial networks to maximize the semantic correlation and consistency of representations between different modalities. Asymmetric Supervised Consistent and Specific Hashing (ASCSH) [37] presented a multimodal mapping learning strategy and a discrete asymmetric learning framework, exploiting pairwise similarity and semantic labels to make full use of the supervised information. Multi-Level Correlation Adversarial Hashing (MLCAH) [36] designed a multi-level correlation hashing model with a label-consistency attention mechanism to excavate local and global correlation information. Deep Multiscale Fusion Hashing (DMFH) [38] adopted multiscale fusion models to enhance semantic relevance and improve hash code representativeness. Asymmetric Correlation Quantization Hashing (ACQH) [37] used real-valued query embeddings and stacked database quantization embeddings for improved retrieval. Fast Discriminative Discrete Hashing (FDDH) [33] presented an efficient closed-form solution and an online strategy to learn discriminative hash codes; it introduced an orthogonal basis to regress hash codes, relaxed label values to minimize the quantization loss, and used an orthogonal transform to ensure semantic consistency. Modality-Invariant Asymmetric Networks (MIAN) [54] introduced probabilistic alignment to capture intra- and inter-modal similarities, bridging the semantic and heterogeneity gaps. Multi-Label Semantic Supervised Graph Attention Hashing (MS\(^2\)GAH) [15] designed an end-to-end framework that integrates graph attention networks and multi-label annotations to bridge the semantic modality gap.

Different from one-stage methods, two-stage methods divide the learning of hash codes and hash functions into two stages. The pioneering framework [28] first decomposed hash learning in this way. Semantics-Preserving Hashing (SePH) [31] generated unified binary hash codes by minimizing the KL-divergence, and then learned hash functions by kernel logistic regression with a sampling strategy. Supervised Discrete Manifold-Embedded Cross-Modal Hashing (SDMCH) [34] generated hash codes using semantic labels; it exploited the nonlinear manifold structure of data and constructed correlations among multiple heterogeneous modalities. Two-Step Cross-modal Hashing (TECH) [10] learned hash codes by preserving the similarity in the original space and exploiting the label correlations in the label space at the first stage, and then learned hash functions by leveraging the similarity relationships between training instances at the second stage. Discrete Latent Factor Hashing (DLFH) [22] designed a discrete latent factor model to learn the hash codes and proposed an out-of-sample extension to obtain the hash functions. Fast Cross-Modal Hashing (FCMH) [49] used global and local similarity embedding and efficient discrete optimization for improved accuracy and scalability. Compared with one-stage methods, two-stage methods are flexible enough to change the hash functions when needed, take less time to train hash codes and hash functions, and can achieve comparable retrieval performance. However, existing two-stage methods attempt to map similar data instances to the same hash code and assume that two data instances are similar as long as they share one label. Such a coarse-grained similarity definition produces suboptimal hash codes and hurts retrieval accuracy. We propose DFGH, which defines a fine-grained similarity, considering that the more labels two data instances share, the more similar they are, to generate better hash codes and hash functions at the first and second stage, respectively.

In many one-stage methods, the cross-entropy loss [21] is adopted to preserve the pairwise similarity relationships between data instances in the Hamming space, and other types of loss functions have also been proposed. For example, Transfer Adversarial Hashing (TAH) [7] adopted a hybrid deep architecture that incorporated a pairwise t-distribution cross-entropy loss to learn hash codes. HashGAN [6] used a cosine cross-entropy loss to learn hash functions. However, most of these methods treat hard and easy examples equally and apply the same penalty to both, which may lead to suboptimal hash codes. To address this issue, some previous methods [8, 51] attempted to assign higher weights to more important examples, but they assigned the priorities only according to coarse-grained similarity information. In comparison, our DFGH transfers the cross-entropy loss from the one-stage to the two-stage framework to preserve the similarity relationships in the Hamming space, and designs a fine-grained similarity matrix to assign different weights to different pairs, making the learning of hash functions more sensitive to similar pairs.

Fig. 1: The overall architecture of our proposed DFGH model. Refer to Section 3 for details.

3 Methodology

3.1 Formulation

Without loss of generality, we focus on cross-modal retrieval for bimodal data (i.e., image and text), as other modalities should be similar. Firstly, we introduce the notations and problem definition. We use bold uppercase letter \(\varvec{W}\) to represent a matrix and bold lowercase letter \(\varvec{w}\) to represent a vector. \(\varvec{W}^\textrm{T}\) represents the transpose of the matrix \(\varvec{W}\); \(|| \cdot ||_F\) represents the Frobenius norm of a matrix; \(\varvec{W}_{i*}\) denotes the i-th row of \(\varvec{W}\); \(\textbf{1}\) denotes a vector with all elements being 1; and \(sign(\cdot )\) denotes a sign function, which is defined as:

$$\begin{aligned} sign(x)= {\left\{ \begin{array}{ll} -1&{} x<0\\ +1&{} x \ge 0 \end{array}\right. } \end{aligned}$$
(1)

Let \(O=\{o_i\}_{i=1}^n\) denote a cross-modal dataset with n instances, where \(o_i=(\varvec{x}_i,\varvec{y}_i,\varvec{l}_i)\) is the i-th instance. Specifically, \(\varvec{x}_i\in \mathbb {R}^{1\times d_x}\) represents the original image feature; \(\varvec{y}_i\in \mathbb {R}^{1\times d_y}\) represents the original text feature; and \(\varvec{l}_i=[\varvec{l}_{i1},\ldots ,\varvec{l}_{ic}]\) is the multi-label assigned to \(o_i\), where c is the number of classes. If \(o_i\) belongs to the j-th class, \(\varvec{l}_{ij}=1\), otherwise \(\varvec{l}_{ij}=0\). Hence, we can use \(\varvec{X},\ \varvec{Y},\ \varvec{L}\) to represent the image-feature matrix, text-feature matrix, and semantic label matrix, respectively. We use \(\varvec{B}^{\varvec{x}}\) and \(\varvec{B}^{\varvec{y}}\) to represent the matrices composed of the binary hash codes corresponding to the image and text feature, respectively. Table 1 summarizes the main notations used in this paper.

Table 1 Main notations

The objective of DFGH is to learn two sets of modality-specific hash functions to map the features from different modalities into a common Hamming space, i.e., \(\varvec{f}^{\varvec{x}}\left( \varvec{x};\theta _x\right) \in \{-1,1\}^k\) for images and \(\varvec{f}^{\varvec{y}}\left( \varvec{y};\theta _y\right) \in \{-1,1\}^k\) for texts, where k denotes the length of the hash codes and \(\theta \) denotes the parameters of the hash functions to be learned. In the Hamming space, the similarity between two binary hash codes is evaluated by the Hamming distance. For two binary hash codes \(\varvec{b}_i\) and \(\varvec{b}_j\), there is a natural relationship between the Hamming distance and their inner product, which is defined as:

$$\begin{aligned} H_{dist}\left( \varvec{b}_i,\varvec{b}_j\right) =\frac{1}{2}\left( k-\varvec{b}_i\varvec{b}_j^{\textrm{T}}\right) \end{aligned}$$
(2)

The larger the inner product is, the smaller the Hamming distance is. Thus, we can use the inner product instead of the Hamming distance to quantify the similarity between two binary hash codes in the Hamming space.
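As a quick sanity check of this relation (our own illustration, not part of the proposed method), the following NumPy snippet verifies (2) for two random binary codes:

```python
import numpy as np

k = 64                              # hash code length
rng = np.random.default_rng(0)
b_i = rng.choice([-1, 1], size=k)   # two random codes in {-1, +1}^k
b_j = rng.choice([-1, 1], size=k)

hamming = int(np.sum(b_i != b_j))   # Hamming distance counted directly
from_inner = 0.5 * (k - b_i @ b_j)  # distance recovered from the inner product, Eq. (2)
assert hamming == from_inner
```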

3.2 Overview

The architecture of DFGH is shown in Fig. 1. It contains two main components, i.e., Hash Code Inference with Fine-grained Similarity and Hash Code Mapping with Similarity Sensitivity. In essence, we generate hash codes for the training set at the first stage and then, at the second stage, use these hash codes to supervise the training of the hash functions. At the first stage, a fine-grained similarity matrix is designed to generate a set of discriminative hash codes that preserve the differentiated similarity relationships between the training data. In addition, we use an Autoencoder to encourage the hash codes to preserve the semantic information of labels. At the second stage, we employ two deep neural networks (ImgNet and TxtNet) to learn hash functions for the image and text modality, respectively; the output dimension of both networks is the hash code length. We also design a similarity-preserving loss based on the similarity matrix to make the training of hash functions more sensitive to pairs with a higher degree of similarity.

3.3 Hash code inference with fine-grained similarity

Fine-grained similarity mining

At this stage, we need to infer a set of hash codes that best preserve the differentiated similarities between the original training data. Firstly, we construct a fine-grained similarity matrix from the semantic labels to represent these differentiated similarities. As analyzed previously, most hashing methods use only a coarse-grained similarity matrix \(\varvec{\bar{S}}\) to describe the similarities between the original data instances. Specifically, in a multi-label setting where a data instance \(o_i\) can be associated with multiple labels, such a matrix defines \(\varvec{\bar{S}}_{ij}=1\) if \(o_i\) and \(o_j\) share at least one label, indicating they are semantically similar, and \(\varvec{\bar{S}}_{ij}=0\) otherwise, where \(i,j \in \left\{ 1,\ldots ,n\right\} \). However, such binary values cannot fully describe the differentiated similarity information. As shown in Fig. 2, when data instances A and B have common labels, they will be mapped to the same point in the Hamming space even if their semantic label vectors differ. Moreover, although a data instance C shares more labels with A than B does, C and B will still be mapped to the same point in the Hamming space. Consequently, the similarity between C and A equals the similarity between B and A in the Hamming space, even though C is semantically closer to A in the original feature space.

Fig. 2: An illustration of using coarse-grained similarity.

To tackle these issues, we introduce a fine-grained similarity matrix \(\varvec{\hat{S}}\), with its elements defined as follows:

$$\begin{aligned} \varvec{\hat{S}}_{ij}=\frac{\varvec{l}_i\varvec{l}_j^{\textrm{T}}}{\textbf{1}\varvec{l}_i^{\textrm{T}}+\textbf{1}\varvec{l}_j^{\textrm{T}}-\varvec{l}_i\varvec{l}_j^{\textrm{T}}} \end{aligned}$$
(3)

where \(\varvec{l}_i\) and \(\varvec{l}_j\) are the semantic label vectors of data instances \(o_i\) and \(o_j\), respectively. The value of \(\varvec{\hat{S}}_{ij}\) is larger if \(o_i\) and \(o_j\) share more labels, which indicates a higher degree of similarity, and it is 0 if \(o_i\) and \(o_j\) share no common labels. For the convenience of calculation, the hash codes take values in \(\{-1,1\}^k\), so the (normalized) inner product of a dissimilar pair should be close to \(-1\) rather than 0. To relate the similarity matrix to the inner product of hash codes, we therefore convert \(\varvec{\hat{S}}_{ij}\) from [0, 1] to [\(-1\), 1] and obtain the final fine-grained similarity matrix \(\varvec{S}\) as:

$$\begin{aligned} \varvec{S}_{ij}= {\left\{ \begin{array}{ll} \varvec{\hat{S}}_{ij}&{} \varvec{\hat{S}}_{ij}>0\\ -1&{} \varvec{\hat{S}}_{ij}=0 \end{array}\right. } \end{aligned}$$
(4)

When the instances \(o_i\) and \(o_j\) share at least one label, which means that they have a certain degree of similarity, we set the value of \(\varvec{S}_{ij}\) to be greater than 0 and expect that the Hamming distance between the two hash codes is not too large. Conversely, if no labels are shared, we aim for a maximized Hamming distance between the hash codes. We use the inner product of two hash codes to reconstruct the original relationship and define the similarity reconstruction loss as:

$$\begin{aligned} \min \limits _{\varvec{B}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^{\textrm{T}}-\varvec{S}\right\| _F^2} \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k} \end{aligned}$$
(5)

The designed loss pushes the generated hash codes to be discriminative and to preserve the differentiated similarities between the training data.
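For concreteness, a minimal NumPy sketch of this construction follows (our own illustration; the names `L` for the binary label matrix and `B` for a candidate code matrix are chosen here for exposition, and every instance is assumed to carry at least one label):

```python
import numpy as np

def fine_grained_similarity(L):
    """Fine-grained similarity matrix S from a binary label matrix L (n x c), Eqs. (3)-(4)."""
    inter = L @ L.T                               # |l_i AND l_j| for every pair
    counts = L.sum(axis=1, keepdims=True)         # |l_i| for every instance
    union = counts + counts.T - inter             # |l_i OR l_j|
    S_hat = inter / np.maximum(union, 1e-12)      # Jaccard-style similarity in [0, 1], Eq. (3)
    return np.where(S_hat > 0, S_hat, -1.0)       # map pairs sharing no label to -1, Eq. (4)

def similarity_reconstruction_loss(B, S, k):
    """Loss of Eq. (5): ||BB^T / k - S||_F^2 for hash codes B in {-1, 1}^(n x k)."""
    return np.linalg.norm(B @ B.T / k - S, 'fro') ** 2
```

For a toy label matrix, `fine_grained_similarity` returns 1.0 on the diagonal, values in (0, 1] for pairs that share labels, and -1 for pairs that share none.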

Semantic information mining

To enable the hash codes to preserve the semantic information of labels, we establish a connection between the hash codes and the semantic labels through a linear Autoencoder:

$$\begin{aligned} \varvec{B}=\varvec{LW},\ \ \ \varvec{L}=\varvec{B}\varvec{W}^{\textrm{T}},\ \ \ \varvec{W}\in \mathbb {R}^{c\times k} \end{aligned}$$
(6)

This additional Autoencoder pushes the hash codes to preserve the information contained in the original semantic labels. Moreover, because the learning of hash codes is also supervised by the fine-grained similarity reconstruction loss, there are no adverse consequences even when the number of nodes in the hidden layer (i.e., the hash code length k) exceeds the number of nodes in the input layer (i.e., the label dimension c) within the Autoencoder. Our aim is to minimize the reconstruction error, and the objective function is formulated as:

$$\begin{aligned} \min \limits _{\varvec{W}} \ \ {\left\| \varvec{L}-\varvec{LW}\varvec{W}^\textrm{T}\right\| _F^2} \ \ \varvec{s.t.} \ \ \varvec{LW}=\varvec{B} \end{aligned}$$
(7)

This enables the generated hash codes to preserve the semantic information of labels. We use a trade-off parameter \(\alpha \) to control the importance of the reconstruction term. The total loss function of the first stage is then defined as:

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^{\textrm{T}}-\varvec{S}\right\| _F^2} + \alpha \left\| \varvec{L}-\varvec{LW}\varvec{W}^\textrm{T}\right\| _F^2 \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k},\ \varvec{LW}=\varvec{B} \end{aligned}$$
(8)

Thus we obtain optimal hash codes which preserve the differentiated similarity relationships between training data and the semantic information of labels.
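As a compact reference (our own sketch, not the paper's code), the first-stage objective can be evaluated in its relaxed form, with the constraint \(\varvec{LW}=\varvec{B}\) handled as a penalty as in the optimization below and \(\varvec{B}\) treated as a real-valued relaxation of the binary codes:

```python
import numpy as np

def first_stage_objective(B, W, S, L, k, alpha=1.0, lam=1.0):
    """Similarity reconstruction plus the label-Autoencoder terms; the encoding
    constraint LW = B enters as the penalty lam * ||B - LW||_F^2."""
    sim = np.linalg.norm(B @ B.T / k - S, 'fro') ** 2   # similarity reconstruction, Eq. (5)
    dec = np.linalg.norm(L - B @ W.T, 'fro') ** 2       # decode hash codes back to labels
    enc = np.linalg.norm(B - L @ W, 'fro') ** 2         # encode labels into hash codes
    return sim + alpha * dec + lam * enc
```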

Optimization. Next, we proceed to derive the optimization process for the objective function. To optimize (8), we first rewrite it as:

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2} + \alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2 +\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k} \end{aligned}$$
(9)

where \(\lambda \) is a parameter used to adjust the weight of the last term. Equation (9) is a Mixed Integer Programming (MIP) problem, which is usually NP-hard due to the binary constraints. To address this issue, we convert the binary constraint \(\varvec{B}\in {\{-1,1\}}^{n \times k}\) into two conditions \(\varvec{B}\in S_b\) and \(\varvec{B}\in S_p\), where \(S_b\) and \(S_p\) are the sets \([-1,1]^{n\times k}\) and \(\left\{ B\,|\,\left\| B\right\| _F^2=nk\right\} \), respectively [41]. We introduce two auxiliary variables \(\varvec{Z}_1\) and \(\varvec{Z}_2\) to absorb these conditions, so the constraints in (9) become \(\varvec{B}=\varvec{Z}_1,\ \varvec{Z}_1\in S_b\) and \(\varvec{B}=\varvec{Z}_2,\ \varvec{Z}_2\in S_p\). Building on this transformation, we design an Alternating Direction Method of Multipliers (ADMM) algorithm [3] and iteratively solve the augmented Lagrangian function of (9):

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}, \varvec{Z}_1, \varvec{Z}_2, \varvec{Y}_1, \varvec{Y}_2}&\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2+\alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2+\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \nonumber \\&+\delta _{S_b}\left( \varvec{Z}_1\right) +\delta _{S_p}\left( \varvec{Z}_2\right) +\frac{\rho _1}{2}\left\| \varvec{B}-\varvec{Z}_1\right\| _F^2+\frac{\rho _2}{2}\left\| \varvec{B}-\varvec{Z}_2\right\| _F^2 \nonumber \\&+\textrm{Tr}\left( \varvec{Y}_1^\textrm{T}\left( \varvec{B}-\varvec{Z}_1\right) \right) +\textrm{Tr}\left( \varvec{Y}_2^\textrm{T}\left( \varvec{B}-\varvec{Z}_2\right) \right) \end{aligned}$$
(10)

where \(\delta _S\left( \varvec{Z}\right) \) is an indicator function that equals zero if \(\varvec{Z}\in S\) and \(+\infty \) otherwise; \(\varvec{Y}_1\) and \(\varvec{Y}_2\) are the dual variables; and \(\rho _1\) and \(\rho _2\) are the penalty parameters. The specific process for updating these variables is shown below.

1) \(\varvec{B}\)-Step: Through fixing all the variables except \(\varvec{B}\), we can update \(\varvec{B}\) by minimizing the following objective function:

$$\begin{aligned} \min \limits _{\varvec{B}} \mathcal {L}=&\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2+\alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2 +\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \nonumber \\&+\delta _{S_b}\left( \varvec{Z}_1\right) +\delta _{S_p}\left( \varvec{Z}_2\right) +\frac{\rho _1}{2}\left\| \varvec{B}-\varvec{Z}_1\right\| _F^2+\frac{\rho _2}{2}\left\| \varvec{B}-\varvec{Z}_2\right\| _F^2 \nonumber \\&+\textrm{Tr}\left( \varvec{Y}_1^\textrm{T}\left( \varvec{B}-\varvec{Z}_1\right) \right) +\textrm{Tr}\left( \varvec{Y}_2^\textrm{T}\left( \varvec{B}-\varvec{Z}_2\right) \right) \end{aligned}$$
(11)

Such a sub-problem can be iteratively solved by the L-BFGS-B method, with the gradient calculated as:

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \varvec{B}}=&\frac{4}{k^2}\varvec{B}\varvec{B}^\textrm{T}\varvec{B}-\frac{4}{k}\varvec{SB}+2\alpha \left( \varvec{B}\varvec{W}^\textrm{T}-\varvec{L}\right) \varvec{W} +2\lambda \left( \varvec{B}-\varvec{LW}\right) \nonumber \\&+\left( \rho _1+\rho _2\right) \varvec{B}+\varvec{Y}_1+\varvec{Y}_2 -\rho _1\varvec{Z}_1-\rho _2\varvec{Z}_2 \end{aligned}$$
(12)
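The following sketch shows how this B-step could be carried out with SciPy's L-BFGS-B solver, using the terms of (11) that depend on \(\varvec{B}\) and the gradient of (12) (our own illustration; the variable names are ours, and the indicator terms are dropped because they do not involve \(\varvec{B}\)):

```python
import numpy as np
from scipy.optimize import minimize

def b_step(B0, W, S, L, Z1, Z2, Y1, Y2, k, alpha, lam, rho1, rho2):
    """One B-step of the ADMM scheme: minimize Eq. (11) over a real-valued B."""
    n = B0.shape[0]

    def loss(B):
        return (np.linalg.norm(B @ B.T / k - S, 'fro') ** 2
                + alpha * np.linalg.norm(L - B @ W.T, 'fro') ** 2
                + lam * np.linalg.norm(B - L @ W, 'fro') ** 2
                + 0.5 * rho1 * np.linalg.norm(B - Z1, 'fro') ** 2
                + 0.5 * rho2 * np.linalg.norm(B - Z2, 'fro') ** 2
                + np.trace(Y1.T @ (B - Z1)) + np.trace(Y2.T @ (B - Z2)))

    def grad(B):                                        # gradient of Eq. (12)
        return ((4.0 / k ** 2) * B @ B.T @ B - (4.0 / k) * S @ B
                + 2 * alpha * (B @ W.T - L) @ W + 2 * lam * (B - L @ W)
                + (rho1 + rho2) * B + Y1 + Y2 - rho1 * Z1 - rho2 * Z2)

    def fun(b_flat):                                    # L-BFGS-B works on a flat vector
        B = b_flat.reshape(n, k)
        return loss(B), grad(B).ravel()

    res = minimize(fun, B0.ravel(), jac=True, method='L-BFGS-B')
    return res.x.reshape(n, k)
```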

2) \(\varvec{Z}\)-Step: Through fixing all the variables except \(\varvec{Z}_1\) and \(\varvec{Z}_2\), we can update them by proximal minimization. In particular, we first eliminate \(\delta _{S_b}\left( \varvec{Z}_1\right) \) and \(\delta _{S_p}\left( \varvec{Z}_2\right) \) in (10) and set the derivatives with respect to \(\varvec{Z}_1\) and \(\varvec{Z}_2\) to 0. We then obtain the closed-form solutions of \(\varvec{Z}_1\) and \(\varvec{Z}_2\), and project them onto the sets \(S_b\) and \(S_p\), respectively. The updating process for \(\varvec{Z}_1\) and \(\varvec{Z}_2\) is formulated as:

$$\begin{aligned} \varvec{Z}_1=\mathbf {\Pi }_{S_b}\left( \varvec{B}+\frac{1}{\rho _1}\varvec{Y}_1\right) \end{aligned}$$
(13)
$$\begin{aligned} \varvec{Z}_2=\mathbf {\Pi }_{S_p}\left( \varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2\right) =\sqrt{nk}\frac{\varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2}{\left\| \varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2\right\| _F} \end{aligned}$$
(14)

where \(\mathbf {\Pi }_{S_b}\) and \(\mathbf {\Pi }_{S_p}\) are two projection operators: \(\mathbf {\Pi }_{S_b}\) clips entries greater than 1 to 1 and entries less than \(-1\) to \(-1\), and \(\mathbf {\Pi }_{S_p}\) rescales \(\varvec{Z}_2\) to satisfy the condition \(\left\| \varvec{Z}_2\right\| _F^2=nk\).
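In NumPy, these two projections could look as follows (a small sketch under our own naming):

```python
import numpy as np

def project_box(M):
    """Pi_{S_b}: clip every entry of M into [-1, 1], as used in Eq. (13)."""
    return np.clip(M, -1.0, 1.0)

def project_sphere(M, n, k):
    """Pi_{S_p}: rescale M so that ||M||_F^2 = n * k, as used in Eq. (14)."""
    return np.sqrt(n * k) * M / np.linalg.norm(M, 'fro')

# Z-step updates, with Y1, Y2 the dual variables and rho1, rho2 the penalty parameters:
#   Z1 = project_box(B + Y1 / rho1)
#   Z2 = project_sphere(B + Y2 / rho2, n, k)
```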

3) \(\varvec{Y}\)-Step: Through fixing all the variables except \(\varvec{Y}\), we can update \(\varvec{Y}_1\) and \(\varvec{Y}_2\) by performing gradient ascent on the dual problem, which is formulated as:

$$\begin{aligned} \varvec{Y}_1=\varvec{Y}_1+\rho _1\left( \varvec{B}-\varvec{Z}_1\right) \end{aligned}$$
(15)
$$\begin{aligned} \varvec{Y}_2=\varvec{Y}_2+\rho _2\left( \varvec{B}-\varvec{Z}_2\right) \end{aligned}$$
(16)

4) \(\varvec{W}\)-Step: Through fixing all the variables except \(\varvec{W}\), we can update \(\varvec{W}\) by taking the derivative of (10) with respect to \(\varvec{W}\) and setting it to 0. Thus, we obtain the following equation:

$$\begin{aligned} \alpha \varvec{W}\varvec{B}^\textrm{T} \varvec{B}+\lambda \varvec{L}^\textrm{T} \varvec{LW} = \alpha \varvec{L}^\textrm{T} \varvec{B}+\lambda \varvec{L}^\textrm{T} \varvec{B} \end{aligned}$$
(17)

We can solve such a Sylvester equation efficiently by the Bartels-Stewart algorithm [2].
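For reference, (17) can be rearranged into the standard Sylvester form \((\lambda \varvec{L}^\textrm{T}\varvec{L})\varvec{W}+\varvec{W}(\alpha \varvec{B}^\textrm{T}\varvec{B})=(\alpha +\lambda )\varvec{L}^\textrm{T}\varvec{B}\) and handed to SciPy, whose solver uses the Bartels-Stewart algorithm internally (a sketch with our own function name):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def w_step(B, L, alpha, lam):
    """W-step: solve Eq. (17) as (lam*L^T L) W + W (alpha*B^T B) = (alpha + lam) L^T B."""
    A = lam * L.T @ L                  # c x c
    C = alpha * B.T @ B                # k x k
    Q = (alpha + lam) * L.T @ B        # c x k
    return solve_sylvester(A, C, Q)    # solves A W + W C = Q
```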

The above iterative optimization is summarized in Algorithm 1. The sign function is not applied until the final hash codes are obtained in the last step; therefore, the derivative of the sign function is not involved in this process.

Algorithm 1: The learning strategy of Hash Code Inference.

3.4 Hash code mapping with similarity sensitivity

Similarity sensitivity learning

At the second stage, we aim to learn hash functions to project new query samples onto appropriate hash codes. Following the conventions within the cross-modal hashing field, we choose a Convolutional Neural Network (CNN) for images and a Multi-Layer Perceptron (MLP) for texts to perform the mapping. The outputs of these two networks can be denoted by \(\varvec{F}_{i*}^x=f^x(\varvec{x}_i;\theta _x)\in \mathbb {R}^{1\times k}\) and \(\varvec{F}_{i*}^y=f^y(\varvec{y}_i;\theta _y)\in \mathbb {R}^{1\times k}\), where \(\theta _x\) and \(\theta _y\) are the parameters of the image network and the text network, respectively. The loss function of the second stage can be divided into the mapping loss, the classification loss and the similarity preserving loss.

Considering that the two networks (i.e., the CNN for images and the MLP for texts) should map the images and texts into a common representation space, we employ the following mapping loss to measure the error between these representations and the hash codes generated at the first stage:

$$\begin{aligned} \mathcal {J}_1={\left\| \varvec{F}^{\varvec{u}}-\varvec{B}\right\| }_F^2 \end{aligned}$$
(18)

where u can be x or y, denoting the image modality and the text modality, respectively.

Just as an Autoencoder is employed to ensure the preservation of semantic label information within hash codes during the first stage, we aspire to maintain the distinctiveness between data instances with different semantic labels in the common space. To predict the labels of the representations (i.e., the hash codes), two linear classifiers are attached to the top of the image network and the text network, respectively. The classification loss in the label space is defined as:

$$\begin{aligned} \mathcal {J}_2={\left\| \varvec{F}^{\varvec{u}}\varvec{V}-\varvec{L}\right\| }_F^2 , \varvec{V} \in \mathbb {R}^{k \times c} \end{aligned}$$
(19)

where \(\varvec{V}\) is the parameter matrix of the linear layer, which maps the hash codes to class labels for the loss calculation.

The retrieval performance can be improved when the output of the network is close to the outputs of semantically similar data instances but far away from those of dissimilar ones in the common representation space. Therefore, we introduce a similarity preserving loss. Firstly, we convert \(\varvec{S}_{ij}\) from the range \([-1, 1]\) back to the range [0, 1], i.e., we convert \(-1\) to 0 in \(\varvec{S}\). A coarse-grained matrix \(\widetilde{\varvec{S}}\) is then used, whose likelihood conditioned on \(\varvec{F}^u\) is formulated as:

$$\begin{aligned} \varvec{p}({\widetilde{\varvec{S}}}_{ij}|\varvec{F}_{i*}^u,\varvec{F}_{j*}^u)= {\left\{ \begin{array}{ll} \varphi (\psi _{ij}) &{} {\widetilde{\varvec{S}}}_{ij}=1 \\ 1-\varphi (\psi _{ij}) &{} {\widetilde{\varvec{S}}}_{ij}=0 \end{array}\right. } \end{aligned}$$
(20)

where \(\varphi \left( x\right) =\frac{1}{1+e^{-x}}\) and \(\psi _{ij}=\frac{1}{2}(\varvec{F}_{i*}^u){(\varvec{F}_{j*}^u)}^{\textrm{T}}\); \({\widetilde{\varvec{S}}}_{ij}=1\) means that the data instances \(o_i\) and \(o_j\) form a similar pair, and \({\widetilde{\varvec{S}}}_{ij}=0\) means \(o_i\) and \(o_j\) form a dissimilar pair. The larger the inner product of two data instances is, the more similar they are. Thus, we formulate the negative log likelihood of the similarities as:

$$\begin{aligned} \widetilde{\mathcal {J}_3}=-\sum _{i,j=1}^{n}\left( {\widetilde{\varvec{S}}}_{ij}\log {\varphi (\psi _{ij})}+(1-{\widetilde{\varvec{S}}}_{ij})\log {(1-\varphi (\psi _{ij}))}\right) \end{aligned}$$
(21)

Motivated by Focal Loss (FL) [30], we assign different weights to different data pairs. For a similar pair, the more similar the pair is, the larger the weight it is assigned. For a dissimilar pair, whose loss grows as the Hamming distance shrinks, we assign a larger weight when the Hamming distance between the two outputs is small, i.e., to hard dissimilar pairs. Therefore, the loss is more sensitive to pairs with a higher degree of similarity and to hard pairs, guiding the model to produce close representations for similar instances. To meet these expectations, the weight is defined as:

$$\begin{aligned} \varvec{W}_{ij}=\left( {\gamma _1\varvec{S}}_{ij}\right) ^{{\widetilde{\varvec{S}}}_{ij}}{\left( \frac{k+\mu _{ij}}{k}\gamma _2\right) }^{1-{\widetilde{\varvec{S}}}_{ij}} \end{aligned}$$
(22)

where \(\mu _{ij}=sign\left( \varvec{F}_{i*}^u\right) {(sign(\varvec{F}_{j*}^u))}^{\textrm{T}}\); \(\gamma _1\) and \(\gamma _2\) are hyper-parameters that adjust the weights \(\varvec{W}\) and deal with the class imbalance problem. By taking the priority weight in (22), the similarity preserving loss is rewritten as a weighted cross-entropy loss:

$$\begin{aligned} \mathcal {J}_3=-\sum _{i,j=1}^{n}\varvec{W}_{ij}\left( {\widetilde{\varvec{S}}}_{ij}\log {\varphi (\psi _{ij})}+(1-{\widetilde{\varvec{S}}}_{ij})\log {(1-\varphi (\psi _{ij}))}\right) \end{aligned}$$
(23)

With this loss function, the mapping is able to preserve the differentiated similarity relationships between the original data. In addition, the weights used in the cross-entropy make the training process sensitive to hard pairs. The whole loss function of the second stage is therefore:

$$\begin{aligned} \mathcal {J}=\mathcal {J}_1+\beta _1\mathcal {J}_2+\beta _2\mathcal {J}_3 \end{aligned}$$
(24)

where \(\beta _1\) and \(\beta _2\) are hyper-parameters that adjust the weights of \(\mathcal {J}_2\) and \(\mathcal {J}_3\). \(\mathcal {J}_2\) is related to the hash code length k and the number of classes c: the longer the hash code is, the more easily it can preserve the classification information, whereas the larger the number of classes is, the more difficult it is for the hash code to predict the correct class. In other words, a longer hash code and a smaller number of classes lead to a smaller classification loss, so we put k in the numerator and c in the denominator. Finally, with the scale of this ratio controlled by a logarithm, \(\beta _1\) is empirically set to \(\ln {\left( e+\left( \frac{k}{c}\right) ^2\right) }-1\).
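Under our own naming, a PyTorch sketch of this second-stage objective (Eqs. (18)-(24)) could look as follows; \(\widetilde{\varvec{S}}\) is assumed to be a 0/1 float tensor, and the weight of (22) is written out for that binary case:

```python
import math
import torch
import torch.nn.functional as F

def second_stage_loss(F_u, B, L, V, S, S_tilde, k, c, gamma1=1.0, gamma2=0.05, beta2=10.0):
    """F_u: network outputs (n x k); B: first-stage hash codes (n x k); L: labels (n x c);
    V: linear classifier weights (k x c); S: fine-grained similarities of Eq. (4);
    S_tilde: coarse 0/1 similarities."""
    J1 = ((F_u - B) ** 2).sum()                      # mapping loss, Eq. (18)
    J2 = ((F_u @ V - L) ** 2).sum()                  # classification loss, Eq. (19)

    psi = 0.5 * F_u @ F_u.t()                        # pairwise inner products
    mu = torch.sign(F_u) @ torch.sign(F_u).t()       # binarized agreement used in Eq. (22)
    W = torch.where(S_tilde > 0,                     # Eq. (22) for S_tilde in {0, 1}
                    gamma1 * S,
                    (k + mu) / k * gamma2)
    # Weighted pairwise cross-entropy, Eq. (23); softplus(psi) - S_tilde*psi equals
    # -[S_tilde*log(sigmoid(psi)) + (1 - S_tilde)*log(1 - sigmoid(psi))].
    J3 = (W * (F.softplus(psi) - S_tilde * psi)).sum()

    beta1 = math.log(math.e + (k / c) ** 2) - 1      # the empirical setting described above
    return J1 + beta1 * J2 + beta2 * J3
```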

Optimization. At the second stage, we need to learn the parameters \(\theta _x\), \(\theta _y\) and \(\varvec{V}\), which are updated by the Back-propagation (BP) algorithm with the Stochastic Gradient Descent (SGD) method. The derivatives of the loss function are computed as:

$$\begin{aligned} \frac{\partial \mathcal {J}}{\partial \varvec{F}_{i*}^u}=&2\left( \varvec{F}_{i*}^u-\varvec{B}_{i*}\right) +\frac{1}{2}\beta _2\sum \limits _{j:{\widetilde{\varvec{S}}}_{ij}\in \widetilde{\varvec{S}}}{\varvec{W}_{ij}(\varphi (\psi _{ij})-{\widetilde{\varvec{S}}}_{ij})\varvec{F}_{j*}^u}\nonumber \\&+\frac{1}{2}\beta _2\sum \limits _{j:{\widetilde{\varvec{S}}}_{ji}\in \widetilde{\varvec{S}}}{\varvec{W}_{ji}(\varphi (\psi _{ji})-{\widetilde{\varvec{S}}}_{ji})\varvec{F}_{j*}^u} \end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial \mathcal {J}}{\partial \varvec{V}}= 2\beta _1\left( \left( \varvec{F}^u\right) ^\textrm{T}\varvec{F}^u\varvec{V}-\left( \varvec{F}^u\right) ^\textrm{T}\varvec{L}\right) \end{aligned}$$
(26)

The gradients are then propagated back to update the network parameters. The derivative of the sign function is not involved in the back-propagation process.

4 Experiments

4.1 Datasets

MIRFlickr-25K

[20] consists of 25,000 images collected from Flickr. Each image is associated with a number of textual tags and labeled with at least one of the 24 unique semantic labels. We select the images with at least 20 textual tags and obtain 20,015 image-tag pairs. For each pair, the textual instance is represented as a 1,386-dimensional Bag-of-words (BOW) vector. We randomly sample 2,000 pairs as the query set, and the remaining 18,015 pairs are used as the database, from which we randomly sample 10,000 pairs as the training set. It is worth mentioning that such a split strategy is widely used for this task and that random selection does not have a significant impact on the results.

NUS-WIDE

[11] is a real-world image dataset containing 269,648 images. Each image is associated with a number of textual tags and labeled with at least one of the 81 unique concept labels. We only select the image-tag pairs that belong to the 21 most frequent concepts. For each pair, the textual instance is represented as a 1,000-dimensional BOW vector. We randomly sample 100 pairs per concept label as the query set and 500 pairs per concept label from the remaining data as the training set. In total, there are 2,100 pairs in the query set and 10,500 pairs in the training set.

MS-COCO

[29] contains 82,783 training images and 40,504 validation images and is also a multi-label dataset. In our experiments, 117,218 data instances are used. Each text is represented as a 2,000-dimensional BOW vector. We randomly sample 5,000 instances as the query set, and the remaining instances are used as the database, from which we randomly sample 10,000 instances as the training set.

4.2 Evaluation metrics

In general, we experiment with two typical retrieval tasks for cross-modal hashing methods, i.e., Image-to-Text Retrieval (\(\varvec{I}\rightarrow \varvec{T}\)) and Text-to-Image Retrieval (\(\varvec{T}\rightarrow \varvec{I}\)). We adopt two commonly used protocols in cross-modal retrieval, i.e., Hamming ranking and hash lookup. Hamming ranking sorts the data instances in a database and returns the data instances with the highest similarity to a given query. The top-N-precision curve and Mean Average Precision (MAP) are widely used to measure the Hamming ranking task. Hash lookup returns all the data instances within a certain Hamming radius given the query instance. The precision-recall curve is widely used to measure the hash lookup task. To obtain the MAP score, we need to calculate the Average Precision (AP) for each query, which is defined as:

$$\begin{aligned} AP=\frac{1}{N}\sum \limits _{r=1}^{R}{p\left( r\right) {\times rel}_r} \end{aligned}$$
(27)

where N is the number of relevant data instances in the database; p(r) is the precision of the top-r returned data instances; and \({rel}_r=1\) if the r-th returned data instance is relevant to the query and \({rel}_r=0\) otherwise. The MAP score is then calculated by averaging the APs over all queries. Here, we set \(R=500\), i.e., MAP@500.
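A small NumPy sketch of this metric follows (our own implementation choice: N is taken as the number of relevant items within the top R, a common convention when computing MAP@R):

```python
import numpy as np

def average_precision(retrieved_relevance, R=500):
    """AP over the top-R returned items, in the spirit of Eq. (27).
    `retrieved_relevance` is a 0/1 array ordered by decreasing similarity to the query."""
    rel = np.asarray(retrieved_relevance[:R], dtype=float)
    N = rel.sum()
    if N == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # p(r) for r = 1..R
    return float((precision_at_r * rel).sum() / N)

# MAP@500 is the mean of average_precision(.) over all queries.
```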

4.3 Implementation details

For the image modality, following GCH [50], we adopt CNN-F [9] due to its excellent capability for image feature extraction. It is pre-trained on ImageNet [13]. We keep the five convolutional layers conv1-conv5 and the next two fully-connected layers fc6-fc7 unchanged, and replace the eighth layer fc8 with a new fully-connected layer with k nodes to map the deep image features into the Hamming space.

For the text modality, we first use the BOW representation to transform each text into a vector, and then use a simple MLP to map the BOW vectors into the Hamming space. The MLP consists of three fully-connected layers with 512, 512 and k nodes, respectively.
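For illustration, the two networks could be set up in PyTorch roughly as follows (a sketch under our assumptions: AlexNet from torchvision stands in for CNN-F, which torchvision does not provide, and the tanh output activation is our choice rather than a detail specified above):

```python
import torch.nn as nn
from torchvision import models

class TxtNet(nn.Module):
    """Text network: a three-layer MLP (512, 512, k) mapping BOW vectors to hash outputs."""
    def __init__(self, bow_dim, k):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bow_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, k), nn.Tanh())            # keeps outputs near [-1, 1]

    def forward(self, y):
        return self.mlp(y)

def build_imgnet(k):
    """Image network: a pre-trained backbone whose final classification layer is
    replaced by a k-node hashing layer (the role played by fc8 in the text above)."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(4096, k)
    return nn.Sequential(net, nn.Tanh())
```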

For our proposed method, we set the hyper-parameters as \(\alpha =\lambda =1\), \(\rho _1=\rho _2=0.01\), and \(\beta _2=10\), with \(\gamma _1=1\), \(\gamma _2=0.05\) for MIRFlickr-25K, \(\gamma _1=0.5\), \(\gamma _2=0.1\) for NUS-WIDE, and \(\gamma _1=0.1\), \(\gamma _2=0.5\) for MS-COCO. To learn the neural network parameters at the second stage, we adopt the Adam solver, set the learning rate between \({10}^{-4}\) and \({10}^{-3}\), and use a batch size of 64. We implement our method with PyTorch on a single NVIDIA GTX 1080Ti GPU.

4.4 Comparison with state-of-the-art approaches

We compare our proposed method with state-of-the-art cross-modal hashing methods, including CMSSH [4], SCM [53], STMH [46], SePH [31], DCMH [21], SSAH [25], GCH [50], DLFH [22], MLCAH [36], ASCSH [37], FCMH [49], MIAN [54] and MS\(^2\)GAH [15]. Among these methods, CMSSH, SCM, STMH, SePH and DLFH are shallow methods, and the rest are deep methods. The deep networks used in our proposed DFGH are the same as in previous work: CNN-F for the image modality and an MLP for the text modality. Besides, SePH, DLFH and FCMH are also two-stage methods.

Table 2 MAP scores of compared methods with different lengths of hash codes on three benchmark datasets
Fig. 3: The top-N-precision and precision-recall curves with 64-bit hash codes on MIRFlickr-25K.

Table 2 reports the MAP scores of all compared methods and our DFGH for the two cross-modal retrieval tasks with hash codes varying from 16 to 64 bits. For a fair comparison, we follow the setting of GCH in [50], as it reported the highest metric scores among the comparison methods; the authors of [50] reported the scores of CMSSH, SCM, STMH, SePH and DCMH, so the comparisons between our method and these methods are under the same settings. We also discuss other methods, e.g., DLFH, MLCAH, ASCSH, FCMH, MIAN and MS\(^2\)GAH. To the best of our knowledge, these methods did not release their source code, making it hard to reproduce the results under the same settings. Fortunately, according to existing works [25, 36, 50], the random sampling strategy has little influence on the results: under a similar partition of the training and test sets, the results of the same methods are almost identical. Therefore, we cite their results directly from the original papers. Furthermore, we use “SUM” to denote the sum of the MAP scores for the Image-to-Text and Text-to-Image Retrieval tasks. Figures 3, 4 and 5 show the top-N-precision and precision-recall curves with 64-bit hash codes on the three datasets. Because SSAH, MLCAH and MS\(^2\)GAH have not released their source code and the original papers did not report results under the same settings as those used in this work, there are no corresponding curves for these methods in Figs. 3-5. Similarly, there are no corresponding curves for DLFH and ASCSH in Fig. 5.

Fig. 4: The top-N-precision and precision-recall curves with 64-bit hash codes on NUS-WIDE.

As shown in Table 2 and Figs. 3-5, our DFGH outperforms most of the compared methods across datasets and hash code lengths. Specifically, on MIRFlickr-25K, compared to GCH and MS\(^2\)GAH, which rank among the top three in “SUM”, DFGH achieves slightly lower MAP scores with 32-bit and 64-bit hash codes but a significant improvement with 16-bit hash codes. This suggests that reducing the hash code length has less impact on DFGH, because DFGH uses a fine-grained similarity matrix and a similarity preserving loss to capture the differentiated similarity relationships, which remains effective for short hash codes. On both NUS-WIDE and MS-COCO, DFGH significantly outperforms the other methods. Compared with MIRFlickr-25K, NUS-WIDE and MS-COCO are more complex and contain larger amounts of data, so the tasks are more challenging. The excellent performance on both datasets demonstrates that DFGH performs well in complex scenarios where it is more demanding to capture the differentiated similarity relationships between the original data instances and preserve them in the Hamming space.

On these three benchmark datasets, for the one-stage deep methods DCMH and GCH, the results of Text-to-Image Retrieval are generally better than those of Image-to-Text Retrieval, indicating that it is more difficult to capture the semantic similarity relationships between the image and text modalities via an image query. For one-stage methods, the learning of hash codes is affected by both image features and text features, and the generated hash codes may contain more semantic information from the text modality, because it is more difficult to capture the hidden semantic information from an image than from text. Different from the one-stage methods, the results of our DFGH for Image-to-Text Retrieval are generally slightly better than those for Text-to-Image Retrieval on MIRFlickr-25K. The reason may be that two-stage methods generate hash codes only from the semantic labels, without using the image or text features, at the first stage. Moreover, the training set is almost half the size of the entire dataset on MIRFlickr-25K, which gives the image and text networks a similar ability to map the original data to hash codes. However, on NUS-WIDE, the training set is only a small portion (less than \(4\%\)) of the entire dataset, making it more difficult for the image modality to preserve the similarity relationships between different modalities. In the results of another recent two-stage method (i.e., FCMH), we can also observe a similar outcome, while our approach further improves the performance. Additionally, with texts composed of sentences, MS-COCO has a higher-quality text modality than the other two datasets, so it is much easier to capture the hidden semantic information from texts in MS-COCO. Different from SePH, our DFGH uses an MLP for the text modality, which has a stronger ability to extract text features.

Fig. 5: The top-N-precision and precision-recall curves with 64-bit hash codes on MS-COCO.

Table 3 MAP scores with different lengths of hash codes for TECH and DFGH on MIRFlickr-25K

The above experimental results show that the proposed DFGH outperforms the state-of-the-art cross-modal hashing methods. We now compare DFGH with the two-stage method TECH [10] on MIRFlickr-25K. Because TECH used different settings from the other methods, we conduct another experiment following the experimental protocol provided in TECH: we randomly select 2,000 data instances as the query set and the remaining 18,015 instances as the training and retrieval set, extract deep CNN-F features for TECH, and report MAP scores over all query instances. The results are reported in Table 3. DFGH outperforms TECH in most cases. From the MAP scores of DFGH, we can also see that the extracted text representations preserve the cross-modal similarity relationships better than the image representations. The greater performance boost at shorter code lengths again demonstrates the effectiveness of the preserved similarity relationships. Notably, in the text-to-image task, our method outperforms TECH significantly, surpassing it by 0.040, 0.038, and 0.026 with 16-bit, 32-bit, and 64-bit codes, respectively. In practice, the text-to-image task is far more widely used than the image-to-text task, for which image captioning techniques are more common. In terms of computational complexity, the hash code length is the determining factor, and it is worthwhile to spend more training time to obtain higher accuracy.

Table 4 MAP scores of DFGH and its variants on three benchmark datasets
Table 5 MAP scores of models without \(\mathcal {J}_1\), \(\mathcal {J}_2\) and \(\mathcal {J}_3\) on three benchmark datasets

4.5 Ablation study

Effects of main components

To verify the effectiveness of the main components in our DFGH model, we conduct the ablation study experiments on three benchmark datasets, as shown in Table 4. We aim to explore four variants of DFGH: 1) DFGH-NAE is a DFGH variant without Autoencoder at the first stage; 2) DFGH-UWXE is a DFGH variant that uses unweighted cross-entropy loss at the second stage; 3) DFGH-MBC is a DFGH variant that uses multi-binary classification to train the hash functions at the second stage; and 4) DFGH-CGM is a DFGH variant that uses the coarse-grained similarity matrix at the first stage and unweighted cross-entropy loss at the second stage.

Fig. 6: Parameter analysis of \(\alpha \) and \(\lambda \) on MIRFlickr-25K and NUS-WIDE.

As expected, DFGH outperforms DFGH-NAE, which verifies the effectiveness of the added Autoencoder at the first stage. Comparing DFGH with DFGH-UWXE, we find that the unweighted pairwise cross-entropy loss may lead to suboptimal performance. The main reason is that this loss assigns the same weight to each pair, causing the learning process to be equally sensitive to each pair. In contrast, our DFGH model uses the fine-grained similarity matrix to assign different weights to different pairs, making the training process sensitive to similar and hard pairs and further enabling the hash codes in Hamming space to preserve the differentiated similarity relationships between the original data. DFGH-UWXE significantly outperforms DFGH-CGM. The only difference between these two variants is the similarity matrix used at the first stage. This comparison shows the effectiveness of the fine-grained similarity matrix and further demonstrates the importance of fully capturing the similarity information.

Effects of loss functions in Stage 2

Further, to verify the effectiveness of the components of the second-stage loss function, we conduct ablation experiments on the three benchmark datasets, as shown in Table 5, reporting the MAP scores of the model without \(\mathcal {J}_1\), \(\mathcal {J}_2\) and \(\mathcal {J}_3\), respectively. On the whole, \(\mathcal {J}_1\) is the most crucial loss: there is a dramatic drop in performance without it, which confirms the effectiveness of our two-stage pipeline, as the most important guidance for hash function learning comes from the hash codes inferred at the first stage. \(\mathcal {J}_2\) and \(\mathcal {J}_3\) each bring an average improvement of about 0.02 per task. Note that the improvement brought by doubling the hash code length is also about 0.02 here, so such an improvement is significant. This indicates that the classification loss and the weighted similarity preserving loss can substantially boost retrieval performance, proving the effectiveness of preserving the semantic information of labels and the similarity information in the hash codes.

Fig. 7: Parameter analysis of \(\gamma _1\) and \(\gamma _2\) on MIRFlickr-25K and NUS-WIDE.

Fig. 8: Parameter analysis of \(\beta _2\) on MIRFlickr-25K and NUS-WIDE.

4.6 Parameter sensitivity analysis

We analyze parameter sensitivity on the two most widely used datasets, MIRFlickr-25K and NUS-WIDE. We run both retrieval tasks and empirically fix the hash code length to 64 bits. There are eight hyper-parameters in DFGH, i.e., \(\alpha \), \(\lambda \), \(\gamma _1\), \(\rho _1\), \(\rho _2\), \(\gamma _2\), \(\beta _1\) and \(\beta _2\). As mentioned before, \(\beta _1\) is related to the hash code length k and the number of classes c, so its value is set empirically. In addition, we follow the previous work SADH [41] and set the two penalty parameters \(\rho _1\) and \(\rho _2\) to 0.01. Therefore, we only present the experimental results for \(\alpha \), \(\lambda \), \(\gamma _1\), \(\gamma _2\), and \(\beta _2\) in Figs. 6, 7 and 8. Here, we use “Average” to represent the average score over Image-to-Text and Text-to-Image Retrieval. Overall, the influence of these parameters on the results is clear and stable, which means that suitable parameters can be found without much effort during parameter selection.

Parameters \(\alpha \) and \(\lambda \)

represent the importance of the Autoencoder loss. From Fig. 6, we observe that the average MAP scores of DFGH first increase and then drop as \(\alpha \) and \(\lambda \) vary in [0.5, 10] on MIRFlickr-25K, but remain relatively stable on NUS-WIDE. This indicates that the semantic information of labels helps improve retrieval performance, but too much intervention from the Autoencoder may harm hash code learning on a small-scale dataset.

Parameters \(\gamma _1\) and \(\gamma _2\)

adjust the pair weights and deal with the class imbalance problem. From Fig. 7, we observe that on MIRFlickr-25K the average MAP scores first increase steadily with \(\gamma _1\) in [0.05, 0.7] and then remain stable with \(\gamma _1\) in [0.7, 1.0], while they remain relatively stable with \(\gamma _2\) in [0.7, 1.0] and then drop rapidly. Different from MIRFlickr-25K, the average MAP scores on NUS-WIDE are not sensitive to \(\gamma _1\) and \(\gamma _2\) in [0.05, 0.7], and drop when \(\gamma _1\) and \(\gamma _2\) are greater than 0.7.

Parameter \(\beta _2\)

controls the similarity preserving loss. As shown in Fig. 8, on MIRFlickr-25K the average MAP scores first increase steadily with \(\beta _2\) in [1, 10], indicating that the similarity preserving loss is necessary for learning the hash functions; however, when \(\beta _2\) becomes too large, the average MAP scores drop. Different from MIRFlickr-25K, the average MAP scores on NUS-WIDE first increase steadily with \(\beta _2\) in [1, 10] and then remain stable over a wide range of values. Note that the best retrieval performance is achieved under different settings on different datasets.

5 Conclusion

In this paper, we introduce a two-stage deep hashing method, DFGH, for cross-modal retrieval. To generate better hash codes, we design a fine-grained similarity matrix to explore the differentiated similarity relationships between data instances, and an Autoencoder is utilized to preserve the semantic information of labels. To obtain better hash functions, we design a similarity sensitivity learning strategy and a similarity preserving loss, which enable the hash codes generated by the hash functions to preserve the differentiated similarity relationships in the Hamming space. The adoption of such a fine-grained similarity matrix clearly improves retrieval performance, demonstrating the importance of exploring fine-grained similarity information. In the future, it is promising to utilize information beyond labels to mine richer similarity information for constructing the similarity matrix.