1 Introduction

With the explosive growth of multimedia data, multimodal applications such as Image Captioning [17], Visual Question Answering [39], and Cross-modal Retrieval [27, 32, 44, 56] are attracting increasing attention. Unlike unimodal retrieval [1], cross-modal retrieval aims to search for semantically similar samples in one modality given a query from another modality (e.g., retrieving text with an image query). However, exact nearest neighbor search is practically infeasible for real-world large-scale multimedia data because of its time and storage costs, which motivates Approximate Nearest Neighbor (ANN) search. Hashing is one of the most promising ANN techniques [35, 45] and has recently become popular because of its low computational cost and fast retrieval speed. The aim of hashing is to learn hash functions that project data instances from the original feature space into a compact Hamming space, where instances from different modalities are represented by binary hash codes of a fixed length. In the Hamming space, the similarity between original data instances is measured by the Hamming distance, which requires the hash codes to preserve the similarity relationships between the data instances in the original feature space.

Data instances from different modalities are heterogeneous, and their features may follow different distributions. To narrow this gap, many cross-modal hashing methods have been proposed to mine the semantic relationships between modalities and preserve them in a common Hamming space. According to whether supervised information (i.e., semantic labels) is used, existing methods can be broadly divided into unsupervised hashing [46] and supervised hashing [55]. Supervised methods usually achieve superior performance to unsupervised methods by learning the correlations between multimodal data via semantic labels. However, most of them are shallow hashing methods that use hand-crafted features, which may not be optimal discriminative representations of the data instances.

Deep learning based methods have meanwhile achieved promising performance. Typical deep methods, such as DVSH [5], CMDVH [16] and TVDH [42], train hash codes and hash functions simultaneously; they are also called one-stage hashing methods with an end-to-end training scheme. One advantage of such methods is that the joint optimization of hash codes and hash functions can lead to better retrieval accuracy than two-stage methods. However, the hash codes must be retrained whenever the hash functions are changed. Moreover, the alternating training of hash codes and hash functions makes them affect each other, potentially leading to suboptimal performance. Their pairwise-based [19, 21, 55] or triplet-based [12] training strategies also limit them to preserving only local similarities due to computational constraints.

To improve flexibility and address the above limitations, two-stage methods have been proposed, which decouple the training of hash codes and hash functions [22, 26, 31, 48]. Hash codes are first learned independently, and hash functions are then trained to map new queries to the pre-learned hash codes, which makes it easier to optimize the two separately. However, the retrieval accuracy of such methods often falls short of that of one-stage methods. Nearly all two-stage methods use only a coarse-grained similarity matrix \(\varvec{S}^c=\{s^c_{ij}\}\), where \(s^c_{ij}=1\) means that the data instances \(x^c_i\) and \(x^c_j\) are similar and \(s^c_{ij}=0\) means they are dissimilar. This causes a substantial loss of information before the hash codes are generated, because such a matrix cannot capture the differentiated similarity relationships between the training data at the first stage. At the second stage, most two-stage methods treat the learning of hash functions as a multi-binary classification problem. In fact, each bit of a hash code does not necessarily correspond to a partition of the feature space, so the learning of hash functions at this stage should focus on preserving as much discriminative information as possible; that is, it should result from similarity calculation rather than classification. In addition, separate training is no longer limited to mini-batches, enabling the preservation of global similarities.

Based on the above observations, we propose a novel two-stage method for cross-modal retrieval, called Deep Cross-modal Hashing with Fine-grained Similarity (DFGH). It effectively explores the differentiated similarity relationships between the training data and utilizes them to assist the learning of hash codes and hash functions. When performing hash code inference at the first stage, we use semantic labels to build a fine-grained similarity matrix by assuming two instances are more similar if they share more labels. In this way, we can represent the differentiated similarity relationships between the training data. When performing hash code mapping at the second stage, we design a similarity sensitivity strategy to learn the hash functions. In this strategy, we employ a mapping loss to fit the hash codes and a classification loss to retain the semantic information. Importantly, we further design a similarity preserving loss based on the above fine-grained similarity matrix to preserve the differentiated similarity information.

The main contributions of our work are summarized as follows:

1) We propose a novel two-stage hashing method for cross-modal retrieval based on the fine-grained similarity matrix. By utilizing this matrix, the learning of hash codes and hash functions is linked without the two stages affecting each other.

2) At the first stage, we preserve the distinguishable similarity relationships between data instances with the fine-grained similarity matrix. At the second stage, we design a similarity-preserving loss so that the hash functions also preserve the fine-grained similarity information.

3) We conduct extensive experiments on three benchmark datasets to demonstrate the effectiveness of the proposed DFGH. The results show its superiority over state-of-the-art methods, especially on large-scale image datasets (e.g., MS-COCO).

2 Related work

Different from unimodal hashing, cross-modal hashing aims to enable retrieval when the queries and the samples to be retrieved come from different modalities. Narrowing the semantic gap between modalities is crucial when extending unimodal hashing to the cross-modal setting. Cross-modal hashing methods can be divided into two categories according to whether supervised information is used during learning, i.e., unsupervised and supervised methods.

Unsupervised methods learn hash functions by capturing the correlations between different modalities without supervised information. One common way is to preserve the structure of the original features. For example, Inter-Media Hashing (IMH) [43] learned hash functions by exploring both intra-media and inter-media consistency. PDH [40] maintained the predictability of binary hash codes and optimized the objective function with an iterative method. CMFH [14] learned hash codes by collective matrix factorization with a latent factor model across modalities. LSSH [57] used sparse coding to capture the salient structure of images and matrix factorization to obtain the latent concepts of texts, and then projected the learned semantic features into a common space. MSAE [47] learned hash functions by capturing the intra-media and inter-media semantic relationships between data from heterogeneous sources. LCMH [58] aimed at learning hash functions from large-scale training datasets while considering both the intra-similarity within each modality and the inter-similarity across modalities. Fusion Similarity Hashing (FSH) [32] modeled the fusion similarity among different modalities by constructing an undirected asymmetric graph. Unsupervised Knowledge Distillation (UKD) [18] also utilized a similarity matrix; being unsupervised, it estimated the matrix from the original image and text features, i.e., from the distances between instance features. Our similarity matrix is different from theirs, as our method focuses on utilizing supervised information, i.e., the labels of instances.

Supervised methods explore the correlations between heterogeneous data to train hash functions using supervised information, and thus usually obtain better accuracy than unsupervised methods. For example, Semantic Correlation Maximization (SCM) [53] trained hash codes with a semantic similarity matrix calculated from label vectors. Cross-Modality Similarity-Sensitive Hashing (CMSSH) [4] modeled the learning of hash functions as a binary classification problem and used a boosting algorithm during the learning process. CVH [24] learned hash functions by projecting similar data instances from different modalities into similar hash codes.

Many deep hashing methods have been proposed recently. Most of them are one-stage methods, which combine the learning of hash codes and hash functions in an end-to-end manner. For example, Deep Cross-Modal Hashing (DCMH) [21] integrated the learning of deep features and hash codes into an end-to-end framework trained with a pairwise objective function. PRDH [52] integrated different types of pairwise constraints to learn the hash codes. Deep Semantic-Preserving Ordinal Hashing (DSPOH) [23] learned hash functions by using deep neural networks to explore the ranking structure of feature dimensions. Self-Supervised Adversarial Hashing (SSAH) [25] used two adversarial networks to maximize the semantic correlation and consistency of representations between different modalities. Asymmetric Supervised Consistent and Specific Hashing (ASCSH) [37] presented a multimodal mapping learning strategy and a discrete asymmetric learning framework, exploiting pairwise similarity and semantic labels to make full use of the supervised information. Multi-Level Correlation Adversarial Hashing (MLCAH) [36] designed a multi-level correlation hashing model with a label-consistency attention mechanism to excavate local and global correlation information. Deep Multiscale Fusion Hashing (DMFH) [38] adopted multiscale fusion models to enhance semantic relevance and improve hash code representativeness. Asymmetric Correlation Quantization Hashing (ACQH) [37] used real-valued query embeddings and stacked database quantization embeddings for improved retrieval. Fast Discriminative Discrete Hashing (FDDH) [33] presented an efficient closed-form solution and an online strategy to learn discriminative hash codes; it introduced an orthogonal basis to regress hash codes, relaxed label values to minimize the quantization loss, and used an orthogonal transform to ensure semantic consistency. Modality-Invariant Asymmetric Networks (MIAN) [54] introduced probabilistic alignment to capture intra- and inter-modal similarities, bridging the semantic and heterogeneity gaps. Multi-Label Semantic Supervised Graph Attention Hashing (MS\(^2\)GAH) [15] designed an end-to-end framework that integrates graph attention networks and multi-label annotations to bridge the semantic modality gap.

Different from one-stage methods, two-stage methods divide the learning of hash codes and hash functions into two stages. The pioneering framework [28] first decomposed hash learning in this way. Semantics-Preserving Hashing (SePH) [31] generated unified binary hash codes by minimizing the KL-divergence, and then learned hash functions by kernel logistic regression with a sampling strategy. Supervised Discrete Manifold-Embedded Cross-Modal Hashing (SDMCH) [34] generated hash codes using semantic labels; it exploited the nonlinear manifold structure of data and constructed correlations among multiple heterogeneous modalities. Two-Step Cross-modal Hashing (TECH) [10] learned hash codes by preserving the similarity in the original space and exploiting the label correlations in the label space at the first stage, and then learned hash functions by leveraging the similarity relationships between training instances at the second stage. Discrete Latent Factor Hashing (DLFH) [22] designed a discrete latent factor model to learn the hash codes and proposed an out-of-sample extension to obtain the hash functions. Fast Cross-Modal Hashing (FCMH) [49] used global and local similarity embedding and efficient discrete optimization for improved accuracy and scalability. Compared with one-stage methods, two-stage methods are flexible enough to change the hash functions when needed, take less time to train hash codes and hash functions, and can achieve comparable retrieval performance. However, existing two-stage methods attempt to map similar data instances to the same hash code and assume that two data instances are similar as long as they share one label. Such a coarse-grained similarity definition produces suboptimal hash codes and hurts retrieval accuracy. We propose DFGH, which defines a fine-grained similarity, considering that the more labels two data instances share, the more similar they are, to generate better hash codes and hash functions at the first and second stage, respectively.

In many one-stage methods, the cross-entropy loss [21] is adopted to preserve the pairwise similarity relationships between data instances in the Hamming space, and other types of loss functions have also been proposed. For example, Transfer Adversarial Hashing (TAH) [7] adopted a hybrid deep architecture that incorporated a pairwise t-distribution cross-entropy loss to learn hash codes. HashGAN [6] used a cosine cross-entropy loss to learn hash functions. However, most of these methods treat hard and easy examples equally and apply the same penalty to both, which may lead to suboptimal hash codes. To address this issue, some previous methods [8, 51] attempted to assign higher weights to more important examples, but they assigned the priorities only according to coarse-grained similarity information. In comparison, our DFGH transfers the cross-entropy loss from the one-stage to the two-stage framework to preserve the similarity relationships in the Hamming space, and designs a fine-grained similarity matrix to assign different weights to different pairs, making the learning of hash functions more sensitive to similar pairs.

Fig. 1: The overall architecture of our proposed DFGH model. Refer to Section 3 for details.

3 Methodology

3.1 Formulation

Without loss of generality, we focus on cross-modal retrieval for bimodal data (i.e., image and text), as other modalities should be similar. Firstly, we introduce the notations and problem definition. We use bold uppercase letter \(\varvec{W}\) to represent a matrix and bold lowercase letter \(\varvec{w}\) to represent a vector. \(\varvec{W}^\textrm{T}\) represents the transpose of the matrix \(\varvec{W}\); \(|| \cdot ||_F\) represents the Frobenius norm of a matrix; \(\varvec{W}_{i*}\) denotes the i-th row of \(\varvec{W}\); \(\textbf{1}\) denotes a vector with all elements being 1; and \(sign(\cdot )\) denotes a sign function, which is defined as:

$$\begin{aligned} sign(x)= {\left\{ \begin{array}{ll} -1&{} x<0\\ +1&{} x \ge 0 \end{array}\right. } \end{aligned}$$
(1)

Let \(O=\{o_i\}_{i=1}^n\) denote a cross-modal dataset with n instances, where \(o_i=(\varvec{x}_i,\varvec{y}_i,\varvec{l}_i)\) is the i-th instance. Specifically, \(\varvec{x}_i\in \mathbb {R}^{1\times d_x}\) represents the original image feature; \(\varvec{y}_i\in \mathbb {R}^{1\times d_y}\) represents the original text feature; and \(\varvec{l}_i=[\varvec{l}_{i1},\ldots ,\varvec{l}_{ic}]\) is the multi-label assigned to \(o_i\), where c is the number of classes. If \(o_i\) belongs to the j-th class, \(\varvec{l}_{ij}=1\), otherwise \(\varvec{l}_{ij}=0\). Hence, we can use \(\varvec{X},\ \varvec{Y},\ \varvec{L}\) to represent the image-feature matrix, text-feature matrix, and semantic label matrix, respectively. We use \(\varvec{B}^{\varvec{x}}\) and \(\varvec{B}^{\varvec{y}}\) to represent the matrices composed of the binary hash codes corresponding to the image and text feature, respectively. Table 1 summarizes the main notations used in this paper.

Table 1 Main notations

The objective of DFGH is to learn two sets of modality-specific hash functions to map the features from different modalities into a common Hamming space, i.e., \(\varvec{f}^{\varvec{x}}\left( \varvec{x};\theta _x\right) \in \{-1,1\}^k\) for images and \(\varvec{f}^{\varvec{y}}\left( \varvec{y};\theta _y\right) \in \{-1,1\}^k\) for texts, where k denotes the length of the hash codes and \(\theta \) denotes the parameters of the hash functions to be learned. In the Hamming space, the similarity between two binary hash codes is evaluated by the Hamming distance. For two binary hash codes \(\varvec{b}_i\) and \(\varvec{b}_j\), there is a natural relationship between the Hamming distance and their inner product, which is defined as:

$$\begin{aligned} H_{dist}\left( \varvec{b}_i,\varvec{b}_j\right) =\frac{1}{2}\left( k-\varvec{b}_i\varvec{b}_j^{\textrm{T}}\right) \end{aligned}$$
(2)

The larger the inner product is, the smaller the Hamming distance is. Thus, we can use the inner product instead of the Hamming distance to quantify the similarity between two binary hash codes in the Hamming space.
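As a quick sanity check of this relation (our own illustration, not part of the proposed method), the following NumPy snippet verifies (2) for two random binary codes:

```python
import numpy as np

k = 64                              # hash code length
rng = np.random.default_rng(0)
b_i = rng.choice([-1, 1], size=k)   # two random codes in {-1, +1}^k
b_j = rng.choice([-1, 1], size=k)

hamming = int(np.sum(b_i != b_j))   # Hamming distance counted directly
from_inner = 0.5 * (k - b_i @ b_j)  # distance recovered from the inner product, Eq. (2)
assert hamming == from_inner
```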

3.2 Overview

The architecture of DFGH is shown in Fig. 1. It contains two main components, i.e., Hash Code Inference with Fine-grained Similarity and Hash Code Mapping with Similarity Sensitivity. In essence, we generate hash codes for the training set at the first stage and then, at the second stage, use these hash codes to supervise the training of the hash functions. At the first stage, a fine-grained similarity matrix is designed to generate a set of discriminative hash codes that preserve the differentiated similarity relationships between the training data. In addition, we use an Autoencoder to encourage the hash codes to preserve the semantic information of labels. At the second stage, we employ two deep neural networks (ImgNet and TxtNet) to learn hash functions for the image and text modality, respectively; the output dimension of both networks is the hash code length. We also design a similarity-preserving loss based on the similarity matrix to make the training of hash functions more sensitive to pairs with a higher degree of similarity.

3.3 Hash code inference with fine-grained similarity

Fine-grained similarity mining

At this stage, we need to infer a set of hash codes that best preserve the differentiated similarities between the original training data. Firstly, we construct a fine-grained similarity matrix from the semantic labels to represent these differentiated similarities. As analyzed previously, most hashing methods use only a coarse-grained similarity matrix \(\varvec{\bar{S}}\) to describe the similarities between the original data instances. Specifically, in a multi-label setting where a data instance \(o_i\) can be associated with multiple labels, such a matrix defines \(\varvec{\bar{S}}_{ij}=1\) if \(o_i\) and \(o_j\) share at least one label, indicating they are semantically similar, and \(\varvec{\bar{S}}_{ij}=0\) otherwise, where \(i,j \in \left\{ 1,\ldots ,n\right\} \). However, such binary values cannot fully describe the differentiated similarity information. As shown in Fig. 2, when data instances A and B have common labels, they will be mapped to the same point in the Hamming space even if their semantic label vectors differ. Moreover, although a data instance C shares more labels with A than B does, C and B will still be mapped to the same point in the Hamming space. Consequently, the similarity between C and A equals the similarity between B and A in the Hamming space, even though C is semantically closer to A in the original feature space.

Fig. 2: An illustration of using coarse-grained similarity.

To tackle these issues, we introduce a fine-grained similarity matrix \(\varvec{\hat{S}}\), with its elements defined as follows:

$$\begin{aligned} \varvec{\hat{S}}_{ij}=\frac{\varvec{l}_i\varvec{l}_j^{\textrm{T}}}{\textbf{1}\varvec{l}_i^{\textrm{T}}+\textbf{1}\varvec{l}_j^{\textrm{T}}-\varvec{l}_i\varvec{l}_j^{\textrm{T}}} \end{aligned}$$
(3)

where \(\varvec{l}_i\) and \(\varvec{l}_j\) are the semantic label vectors of data instances \(o_i\) and \(o_j\), respectively. The value of \(\varvec{\hat{S}}_{ij}\) is larger if \(o_i\) and \(o_j\) share more labels, which indicates a higher degree of similarity, and it is 0 if \(o_i\) and \(o_j\) share no common labels. For the convenience of calculation, the hash codes take values in \(\{-1,1\}^k\), so the (normalized) inner product of a dissimilar pair should be close to \(-1\) rather than 0. To relate the similarity matrix to the inner product of hash codes, we therefore convert \(\varvec{\hat{S}}_{ij}\) from [0, 1] to [\(-1\), 1] and obtain the final fine-grained similarity matrix \(\varvec{S}\) as:

$$\begin{aligned} \varvec{S}_{ij}= {\left\{ \begin{array}{ll} \varvec{\hat{S}}_{ij}&{} \varvec{\hat{S}}_{ij}>0\\ -1&{} \varvec{\hat{S}}_{ij}=0 \end{array}\right. } \end{aligned}$$
(4)

When the instances \(o_i\) and \(o_j\) share at least one label, which means that they have a certain degree of similarity, we set the value of \(\varvec{S}_{ij}\) to be greater than 0 and expect that the Hamming distance between the two hash codes is not too large. Conversely, if no labels are shared, we aim for a maximized Hamming distance between the hash codes. We use the inner product of two hash codes to reconstruct the original relationship and define the similarity reconstruction loss as:

$$\begin{aligned} \min \limits _{\varvec{B}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^{\textrm{T}}-\varvec{S}\right\| _F^2} \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k} \end{aligned}$$
(5)

The designed loss pushes the generated hash codes to be discriminative and to preserve the differentiated similarities between the training data.
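For concreteness, a minimal NumPy sketch of this construction follows (our own illustration; the names `L` for the binary label matrix and `B` for a candidate code matrix are chosen here for exposition, and every instance is assumed to carry at least one label):

```python
import numpy as np

def fine_grained_similarity(L):
    """Fine-grained similarity matrix S from a binary label matrix L (n x c), Eqs. (3)-(4)."""
    inter = L @ L.T                               # |l_i AND l_j| for every pair
    counts = L.sum(axis=1, keepdims=True)         # |l_i| for every instance
    union = counts + counts.T - inter             # |l_i OR l_j|
    S_hat = inter / np.maximum(union, 1e-12)      # Jaccard-style similarity in [0, 1], Eq. (3)
    return np.where(S_hat > 0, S_hat, -1.0)       # map pairs sharing no label to -1, Eq. (4)

def similarity_reconstruction_loss(B, S, k):
    """Loss of Eq. (5): ||BB^T / k - S||_F^2 for hash codes B in {-1, 1}^(n x k)."""
    return np.linalg.norm(B @ B.T / k - S, 'fro') ** 2
```

For a toy label matrix, `fine_grained_similarity` returns 1.0 on the diagonal, values in (0, 1] for pairs that share labels, and -1 for pairs that share none.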

Semantic information mining

To enable the hash codes to preserve the semantic information of labels, we establish a connection between the hash codes and the semantic labels through a linear Autoencoder:

$$\begin{aligned} \varvec{B}=\varvec{LW},\ \ \ \varvec{L}=\varvec{B}\varvec{W}^{\textrm{T}},\ \ \ \varvec{W}\in \mathbb {R}^{c\times k} \end{aligned}$$
(6)

This additional Autoencoder pushes the hash codes to preserve the information contained in the original semantic labels. Moreover, because the learning of hash codes is also supervised by the fine-grained similarity reconstruction loss, there are no adverse consequences even when the number of nodes in the hidden layer (i.e., the hash code length k) exceeds the number of nodes in the input layer (i.e., the label dimension c) within the Autoencoder. Our aim is to minimize the reconstruction error, and the objective function is formulated as:

$$\begin{aligned} \min \limits _{\varvec{W}} \ \ {\left\| \varvec{L}-\varvec{LW}\varvec{W}^\textrm{T}\right\| _F^2} \ \ \varvec{s.t.} \ \ \varvec{LW}=\varvec{B} \end{aligned}$$
(7)

This enables the generated hash codes to preserve the semantic information of labels. We use a trade-off parameter \(\alpha \) to control the importance of the reconstruction term. The total loss function of the first stage is then defined as:

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^{\textrm{T}}-\varvec{S}\right\| _F^2} + \alpha \left\| \varvec{L}-\varvec{LW}\varvec{W}^\textrm{T}\right\| _F^2 \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k},\ \varvec{LW}=\varvec{B} \end{aligned}$$
(8)

Thus we obtain optimal hash codes which preserve the differentiated similarity relationships between training data and the semantic information of labels.
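As a compact reference (our own sketch, not the paper's code), the first-stage objective can be evaluated in its relaxed form, with the constraint \(\varvec{LW}=\varvec{B}\) handled as a penalty as in the optimization below and \(\varvec{B}\) treated as a real-valued relaxation of the binary codes:

```python
import numpy as np

def first_stage_objective(B, W, S, L, k, alpha=1.0, lam=1.0):
    """Similarity reconstruction plus the label-Autoencoder terms; the encoding
    constraint LW = B enters as the penalty lam * ||B - LW||_F^2."""
    sim = np.linalg.norm(B @ B.T / k - S, 'fro') ** 2   # similarity reconstruction, Eq. (5)
    dec = np.linalg.norm(L - B @ W.T, 'fro') ** 2       # decode hash codes back to labels
    enc = np.linalg.norm(B - L @ W, 'fro') ** 2         # encode labels into hash codes
    return sim + alpha * dec + lam * enc
```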

Optimization. Next, we proceed to derive the optimization process for the objective function. To optimize (8), we first rewrite it as:

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}} \ \ {\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2} + \alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2 +\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \ \ \varvec{s.t.} \ \ \varvec{B}\in \left\{ -1,\ 1\right\} ^{n\times k} \end{aligned}$$
(9)

where \(\lambda \) is a parameter used to adjust the weight of the last term. Equation (9) is a Mixed Integer Programming (MIP) problem, which is usually NP-hard due to the binary constraints. To address this issue, we convert the binary constraint \(\varvec{B}\in {\{-1,1\}}^{n \times k}\) into two conditions \(\varvec{B}\in S_b\) and \(\varvec{B}\in S_p\), where \(S_b\) and \(S_p\) are the sets \([-1,1]^{n\times k}\) and \(\left\{ B\,|\,\left\| B\right\| _F^2=nk\right\} \), respectively [41]. We introduce two auxiliary variables \(\varvec{Z}_1\) and \(\varvec{Z}_2\) to absorb these conditions, so the constraints in (9) become \(\varvec{B}=\varvec{Z}_1,\ \varvec{Z}_1\in S_b\) and \(\varvec{B}=\varvec{Z}_2,\ \varvec{Z}_2\in S_p\). Building on this transformation, we design an Alternating Direction Method of Multipliers (ADMM) algorithm [3] and iteratively solve the augmented Lagrangian function of (9):

$$\begin{aligned} \min \limits _{\varvec{B}, \varvec{W}, \varvec{Z}_1, \varvec{Z}_2, \varvec{Y}_1, \varvec{Y}_2}&\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2+\alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2+\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \nonumber \\&+\delta _{S_b}\left( \varvec{Z}_1\right) +\delta _{S_p}\left( \varvec{Z}_2\right) +\frac{\rho _1}{2}\left\| \varvec{B}-\varvec{Z}_1\right\| _F^2+\frac{\rho _2}{2}\left\| \varvec{B}-\varvec{Z}_2\right\| _F^2 \nonumber \\&+\textrm{Tr}\left( \varvec{Y}_1^\textrm{T}\left( \varvec{B}-\varvec{Z}_1\right) \right) +\textrm{Tr}\left( \varvec{Y}_2^\textrm{T}\left( \varvec{B}-\varvec{Z}_2\right) \right) \end{aligned}$$
(10)

where \(\delta _S\left( \varvec{Z}\right) \) is an indicator function that equals zero if \(\varvec{Z}\in S\) and \(+\infty \) otherwise; \(\varvec{Y}_1\) and \(\varvec{Y}_2\) are the dual variables; and \(\rho _1\) and \(\rho _2\) are the penalty parameters. The specific process for updating these variables is shown below.

1) \(\varvec{B}\)-Step: Through fixing all the variables except \(\varvec{B}\), we can update \(\varvec{B}\) by minimizing the following objective function:

$$\begin{aligned} \min \limits _{\varvec{B}} \mathcal {L}=&\left\| \frac{1}{k}\varvec{B}\varvec{B}^\textrm{T}-\varvec{S}\right\| _F^2+\alpha \left\| \varvec{L}-\varvec{B}\varvec{W}^\textrm{T}\right\| _F^2 +\lambda \left\| \varvec{B}-\varvec{LW}\right\| _F^2 \nonumber \\&+\delta _{S_b}\left( \varvec{Z}_1\right) +\delta _{S_p}\left( \varvec{Z}_2\right) +\frac{\rho _1}{2}\left\| \varvec{B}-\varvec{Z}_1\right\| _F^2+\frac{\rho _2}{2}\left\| \varvec{B}-\varvec{Z}_2\right\| _F^2 \nonumber \\&+\textrm{Tr}\left( \varvec{Y}_1^\textrm{T}\left( \varvec{B}-\varvec{Z}_1\right) \right) +\textrm{Tr}\left( \varvec{Y}_2^\textrm{T}\left( \varvec{B}-\varvec{Z}_2\right) \right) \end{aligned}$$
(11)

Such a sub-problem can be iteratively solved by the L-BFGS-B method, with the gradient calculated as:

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \varvec{B}}=&\frac{4}{k^2}\varvec{B}\varvec{B}^\textrm{T}\varvec{B}-\frac{4}{k}\varvec{SB}+2\alpha \left( \varvec{B}\varvec{W}^\textrm{T}-\varvec{L}\right) \varvec{W} +2\lambda \left( \varvec{B}-\varvec{LW}\right) \nonumber \\&+\left( \rho _1+\rho _2\right) \varvec{B}+\varvec{Y}_1+\varvec{Y}_2 -\rho _1\varvec{Z}_1-\rho _2\varvec{Z}_2 \end{aligned}$$
(12)
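The following sketch shows how this B-step could be carried out with SciPy's L-BFGS-B solver, using the terms of (11) that depend on \(\varvec{B}\) and the gradient of (12) (our own illustration; the variable names are ours, and the indicator terms are dropped because they do not involve \(\varvec{B}\)):

```python
import numpy as np
from scipy.optimize import minimize

def b_step(B0, W, S, L, Z1, Z2, Y1, Y2, k, alpha, lam, rho1, rho2):
    """One B-step of the ADMM scheme: minimize Eq. (11) over a real-valued B."""
    n = B0.shape[0]

    def loss(B):
        return (np.linalg.norm(B @ B.T / k - S, 'fro') ** 2
                + alpha * np.linalg.norm(L - B @ W.T, 'fro') ** 2
                + lam * np.linalg.norm(B - L @ W, 'fro') ** 2
                + 0.5 * rho1 * np.linalg.norm(B - Z1, 'fro') ** 2
                + 0.5 * rho2 * np.linalg.norm(B - Z2, 'fro') ** 2
                + np.trace(Y1.T @ (B - Z1)) + np.trace(Y2.T @ (B - Z2)))

    def grad(B):                                        # gradient of Eq. (12)
        return ((4.0 / k ** 2) * B @ B.T @ B - (4.0 / k) * S @ B
                + 2 * alpha * (B @ W.T - L) @ W + 2 * lam * (B - L @ W)
                + (rho1 + rho2) * B + Y1 + Y2 - rho1 * Z1 - rho2 * Z2)

    def fun(b_flat):                                    # L-BFGS-B works on a flat vector
        B = b_flat.reshape(n, k)
        return loss(B), grad(B).ravel()

    res = minimize(fun, B0.ravel(), jac=True, method='L-BFGS-B')
    return res.x.reshape(n, k)
```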

2) \(\varvec{Z}\)-Step: Through fixing all the variables except \(\varvec{Z}_1\) and \(\varvec{Z}_2\), we can update them by proximal minimization. In particular, we first eliminate \(\delta _{S_b}\left( \varvec{Z}_1\right) \) and \(\delta _{S_p}\left( \varvec{Z}_2\right) \) in (10) and set the derivatives with respect to \(\varvec{Z}_1\) and \(\varvec{Z}_2\) to 0. We then obtain the closed-form solutions of \(\varvec{Z}_1\) and \(\varvec{Z}_2\), and project them onto the sets \(S_b\) and \(S_p\), respectively. The updating process for \(\varvec{Z}_1\) and \(\varvec{Z}_2\) is formulated as:

$$\begin{aligned} \varvec{Z}_1=\mathbf {\Pi }_{S_b}\left( \varvec{B}+\frac{1}{\rho _1}\varvec{Y}_1\right) \end{aligned}$$
(13)
$$\begin{aligned} \varvec{Z}_2=\mathbf {\Pi }_{S_p}\left( \varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2\right) =\sqrt{nk}\frac{\varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2}{\left\| \varvec{B}+\frac{1}{\rho _2}\varvec{Y}_2\right\| _F} \end{aligned}$$
(14)

where \(\mathbf {\Pi }_{S_b}\) and \(\mathbf {\Pi }_{S_p}\) are two projection operators: \(\mathbf {\Pi }_{S_b}\) clips entries greater than 1 to 1 and entries less than \(-1\) to \(-1\), and \(\mathbf {\Pi }_{S_p}\) rescales \(\varvec{Z}_2\) to satisfy the condition \(\left\| \varvec{Z}_2\right\| _F^2=nk\).
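In NumPy, these two projections could look as follows (a small sketch under our own naming):

```python
import numpy as np

def project_box(M):
    """Pi_{S_b}: clip every entry of M into [-1, 1], as used in Eq. (13)."""
    return np.clip(M, -1.0, 1.0)

def project_sphere(M, n, k):
    """Pi_{S_p}: rescale M so that ||M||_F^2 = n * k, as used in Eq. (14)."""
    return np.sqrt(n * k) * M / np.linalg.norm(M, 'fro')

# Z-step updates, with Y1, Y2 the dual variables and rho1, rho2 the penalty parameters:
#   Z1 = project_box(B + Y1 / rho1)
#   Z2 = project_sphere(B + Y2 / rho2, n, k)
```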

3) \(\varvec{Y}\)-Step: Through fixing all the variables except \(\varvec{Y}\), we can update \(\varvec{Y}_1\) and \(\varvec{Y}_2\) by performing gradient ascent on the dual problem, which is formulated as:

$$\begin{aligned} \varvec{Y}_1=\varvec{Y}_1+\rho _1\left( \varvec{B}-\varvec{Z}_1\right) \end{aligned}$$
(15)
$$\begin{aligned} \varvec{Y}_2=\varvec{Y}_2+\rho _2\left( \varvec{B}-\varvec{Z}_2\right) \end{aligned}$$
(16)

4) \(\varvec{W}\)-Step: Through fixing all the variables except \(\varvec{W}\), we can update \(\varvec{W}\) by taking the derivative of (10) with respect to \(\varvec{W}\) and setting it to 0. Thus, we obtain the following equation:

$$\begin{aligned} \alpha \varvec{W}\varvec{B}^\textrm{T} \varvec{B}+\lambda \varvec{L}^\textrm{T} \varvec{LW} = \alpha \varvec{L}^\textrm{T} \varvec{B}+\lambda \varvec{L}^\textrm{T} \varvec{B} \end{aligned}$$
(17)

We can solve such a Sylvester equation efficiently by the Bartels-Stewart algorithm [2].
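For reference, (17) can be rearranged into the standard Sylvester form \((\lambda \varvec{L}^\textrm{T}\varvec{L})\varvec{W}+\varvec{W}(\alpha \varvec{B}^\textrm{T}\varvec{B})=(\alpha +\lambda )\varvec{L}^\textrm{T}\varvec{B}\) and handed to SciPy, whose solver uses the Bartels-Stewart algorithm internally (a sketch with our own function name):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def w_step(B, L, alpha, lam):
    """W-step: solve Eq. (17) as (lam*L^T L) W + W (alpha*B^T B) = (alpha + lam) L^T B."""
    A = lam * L.T @ L                  # c x c
    C = alpha * B.T @ B                # k x k
    Q = (alpha + lam) * L.T @ B        # c x k
    return solve_sylvester(A, C, Q)    # solves A W + W C = Q
```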

The above iterative optimization is summarized in Algorithm 1. The sign function is not applied until the final hash codes are obtained in the last step; therefore, the derivative of the sign function is not involved in this process.

Algorithm 1: The learning strategy of Hash Code Inference.

3.4 Hash code mapping with similarity sensitivity

Similarity sensitivity learning

At the second stage, we aim to learn hash functions to project new query samples onto appropriate hash codes. Following the conventions within the cross-modal hashing field, we choose a Convolutional Neural Network (CNN) for images and a Multi-Layer Perceptron (MLP) for texts to perform the mapping. The outputs of these two networks can be denoted by \(\varvec{F}_{i*}^x=f^x(\varvec{x}_i;\theta _x)\in \mathbb {R}^{1\times k}\) and \(\varvec{F}_{i*}^y=f^y(\varvec{y}_i;\theta _y)\in \mathbb {R}^{1\times k}\), where \(\theta _x\) and \(\theta _y\) are the parameters of the image network and the text network, respectively. The loss function of the second stage can be divided into the mapping loss, the classification loss and the similarity preserving loss.

Considering that the two networks (i.e., the CNN for images and the MLP for texts) should map the images and texts into a common representation space, we employ the following mapping loss to measure the error between these representations and the hash codes generated at the first stage:

$$\begin{aligned} \mathcal {J}_1={\left\| \varvec{F}^{\varvec{u}}-\varvec{B}\right\| }_F^2 \end{aligned}$$
(18)

where u can be x or y, denoting the image modality and the text modality, respectively.

Just as an Autoencoder is employed to ensure the preservation of semantic label information within hash codes during the first stage, we aspire to maintain the distinctiveness between data instances with different semantic labels in the common space. To predict the labels of the representations (i.e., the hash codes), two linear classifiers are attached to the top of the image network and the text network, respectively. The classification loss in the label space is defined as:

$$\begin{aligned} \mathcal {J}_2={\left\| \varvec{F}^{\varvec{u}}\varvec{V}-\varvec{L}\right\| }_F^2 , \varvec{V} \in \mathbb {R}^{k \times c} \end{aligned}$$
(19)

where \(\varvec{V}\) is the parameter matrix of the linear layer, which maps the hash codes to class labels for the loss calculation.

The retrieval performance can be improved when the output of the network is close to the outputs of semantically similar data instances but far away from those of dissimilar ones in the common representation space. Therefore, we introduce a similarity preserving loss. Firstly, we convert \(\varvec{S}_{ij}\) from the range \([-1, 1]\) back to the range [0, 1], i.e., we convert \(-1\) to 0 in \(\varvec{S}\). A coarse-grained matrix \(\widetilde{\varvec{S}}\) is then used, whose likelihood conditioned on \(\varvec{F}^u\) is formulated as:

$$\begin{aligned} \varvec{p}({\widetilde{\varvec{S}}}_{ij}|\varvec{F}_{i*}^u,\varvec{F}_{j*}^u)= {\left\{ \begin{array}{ll} \varphi (\psi _{ij}) &{} {\widetilde{\varvec{S}}}_{ij}=1 \\ 1-\varphi (\psi _{ij}) &{} {\widetilde{\varvec{S}}}_{ij}=0 \end{array}\right. } \end{aligned}$$
(20)

where \(\varphi \left( x\right) =\frac{1}{1+e^{-x}}\) and \(\psi _{ij}=\frac{1}{2}(\varvec{F}_{i*}^u){(\varvec{F}_{j*}^u)}^{\textrm{T}}\); \({\widetilde{\varvec{S}}}_{ij}=1\) means that the data instances \(o_i\) and \(o_j\) form a similar pair, and \({\widetilde{\varvec{S}}}_{ij}=0\) means \(o_i\) and \(o_j\) form a dissimilar pair. The larger the inner product of two data instances is, the more similar they are. Thus, we formulate the negative log likelihood of the similarities as:

$$\begin{aligned} \widetilde{\mathcal {J}_3}=-\sum _{i,j=1}^{n}\left( {\widetilde{\varvec{S}}}_{ij}\log {\varphi (\psi _{ij})}+(1-{\widetilde{\varvec{S}}}_{ij})\log {(1-\varphi (\psi _{ij}))}\right) \end{aligned}$$
(21)

Motivated by Focal Loss (FL) [30], we assign different weights to different data pairs. For a similar pair, the more similar the pair is, the larger the weight it is assigned. For a dissimilar pair, whose loss grows as the Hamming distance shrinks, we assign a larger weight when the Hamming distance between the two outputs is small, i.e., to hard dissimilar pairs. Therefore, the loss is more sensitive to pairs with a higher degree of similarity and to hard pairs, guiding the model to produce close representations for similar instances. To meet these expectations, the weight is defined as:

$$\begin{aligned} \varvec{W}_{ij}=\left( {\gamma _1\varvec{S}}_{ij}\right) ^{{\widetilde{\varvec{S}}}_{ij}}{\left( \frac{k+\mu _{ij}}{k}\gamma _2\right) }^{1-{\widetilde{\varvec{S}}}_{ij}} \end{aligned}$$
(22)

where \(\mu _{ij}=sign\left( \varvec{F}_{i*}^u\right) {(sign(\varvec{F}_{j*}^u))}^{\textrm{T}}\); \(\gamma _1\) and \(\gamma _2\) are hyper-parameters that adjust the weights \(\varvec{W}\) and deal with the class imbalance problem. By taking the priority weight in (22), the similarity preserving loss is rewritten as a weighted cross-entropy loss:

$$\begin{aligned} \mathcal {J}_3=-\sum _{i,j=1}^{n}\varvec{W}_{ij}\left( {\widetilde{\varvec{S}}}_{ij}\log {\varphi (\psi _{ij})}+(1-{\widetilde{\varvec{S}}}_{ij})\log {(1-\varphi (\psi _{ij}))}\right) \end{aligned}$$
(23)

With this loss function, the mapping is able to preserve the differentiated similarity relationships between the original data. In addition, the weights used in the cross-entropy make the training process sensitive to hard pairs. The whole loss function of the second stage is therefore:

$$\begin{aligned} \mathcal {J}=\mathcal {J}_1+\beta _1\mathcal {J}_2+\beta _2\mathcal {J}_3 \end{aligned}$$
(24)

where \(\beta _1\) and \(\beta _2\) are hyper-parameters that adjust the weights of \(\mathcal {J}_2\) and \(\mathcal {J}_3\). \(\mathcal {J}_2\) is related to the hash code length k and the number of classes c: the longer the hash code is, the more easily it can preserve the classification information, whereas the larger the number of classes is, the more difficult it is for the hash code to predict the correct class. In other words, a longer hash code and a smaller number of classes lead to a smaller classification loss, so we put k in the numerator and c in the denominator. Finally, with the scale of this ratio controlled by a logarithm, \(\beta _1\) is empirically set to \(\ln {\left( e+\left( \frac{k}{c}\right) ^2\right) }-1\).
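Under our own naming, a PyTorch sketch of this second-stage objective (Eqs. (18)-(24)) could look as follows; \(\widetilde{\varvec{S}}\) is assumed to be a 0/1 float tensor, and the weight of (22) is written out for that binary case:

```python
import math
import torch
import torch.nn.functional as F

def second_stage_loss(F_u, B, L, V, S, S_tilde, k, c, gamma1=1.0, gamma2=0.05, beta2=10.0):
    """F_u: network outputs (n x k); B: first-stage hash codes (n x k); L: labels (n x c);
    V: linear classifier weights (k x c); S: fine-grained similarities of Eq. (4);
    S_tilde: coarse 0/1 similarities."""
    J1 = ((F_u - B) ** 2).sum()                      # mapping loss, Eq. (18)
    J2 = ((F_u @ V - L) ** 2).sum()                  # classification loss, Eq. (19)

    psi = 0.5 * F_u @ F_u.t()                        # pairwise inner products
    mu = torch.sign(F_u) @ torch.sign(F_u).t()       # binarized agreement used in Eq. (22)
    W = torch.where(S_tilde > 0,                     # Eq. (22) for S_tilde in {0, 1}
                    gamma1 * S,
                    (k + mu) / k * gamma2)
    # Weighted pairwise cross-entropy, Eq. (23); softplus(psi) - S_tilde*psi equals
    # -[S_tilde*log(sigmoid(psi)) + (1 - S_tilde)*log(1 - sigmoid(psi))].
    J3 = (W * (F.softplus(psi) - S_tilde * psi)).sum()

    beta1 = math.log(math.e + (k / c) ** 2) - 1      # the empirical setting described above
    return J1 + beta1 * J2 + beta2 * J3
```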

Optimization. At the second stage, we need to learn the parameters \(\theta _x\), \(\theta _y\) and \(\varvec{V}\), which are updated by the Back-propagation (BP) algorithm with the Stochastic Gradient Descent (SGD) method. The derivatives of the loss function are computed as:

$$\begin{aligned} \frac{\partial \mathcal {J}}{\partial \varvec{F}_{i*}^u}=&2\left( \varvec{F}_{i*}^u-\varvec{B}_{i*}\right) +\frac{1}{2}\beta _2\sum \limits _{j:{\widetilde{\varvec{S}}}_{ij}\in \widetilde{\varvec{S}}}{\varvec{W}_{ij}(\varphi (\psi _{ij})-{\widetilde{\varvec{S}}}_{ij})\varvec{F}_{j*}^u}\nonumber \\&+\frac{1}{2}\beta _2\sum \limits _{j:{\widetilde{\varvec{S}}}_{ji}\in \widetilde{\varvec{S}}}{\varvec{W}_{ji}(\varphi (\psi _{ji})-{\widetilde{\varvec{S}}}_{ji})\varvec{F}_{j*}^u} \end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial \mathcal {J}}{\partial \varvec{V}}= 2\beta _1\left( \left( \varvec{F}^u\right) ^\textrm{T}\varvec{F}^u\varvec{V}-\left( \varvec{F}^u\right) ^\textrm{T}\varvec{L}\right) \end{aligned}$$
(26)

The gradients are then propagated back to update the network parameters. The derivative of the sign function is not involved in the back-propagation process.

4 Experiments

4.1 Datasets

MIRFlickr-25K

[20] consists of 25,000 images collected from Flickr. Each image is associated with a number of textual tags and labeled with at least one of the 24 unique semantic labels. We select the images with at least 20 textual tags and obtain 20,015 image-tag pairs. For each pair, the textual instance is represented as a 1,386-dimensional Bag-of-words (BOW) vector. We randomly sample 2,000 pairs as the query set, and the remaining 18,015 pairs are used as the database, from which we randomly sample 10,000 pairs as the training set. It is worth mentioning that such a split strategy is widely used for this task and that random selection does not have a significant impact on the results.

NUS-WIDE

[11] is a real-world image dataset containing 269,648 images. Each image is associated with a number of textual tags and labeled with at least one of the 81 unique concept labels. We only select the image-tag pairs that belong to the 21 most frequent concepts. For each pair, the textual instance is represented as a 1,000-dimensional BOW vector. We randomly sample 100 pairs per concept label as the query set and 500 pairs per concept label from the remaining data as the training set. In total, there are 2,100 pairs in the query set and 10,500 pairs in the training set.

MS-COCO

[29] contains 82,783 training images and 40,504 validation images and is also a multi-label dataset. In our experiments, 117,218 data instances are used. Each text is represented as a 2,000-dimensional BOW vector. We randomly sample 5,000 instances as the query set, and the remaining instances are used as the database, from which we randomly sample 10,000 instances as the training set.

4.2 Evaluation metrics

In general, we experiment with two typical retrieval tasks for cross-modal hashing methods, i.e., Image-to-Text Retrieval (\(\varvec{I}\rightarrow \varvec{T}\)) and Text-to-Image Retrieval (\(\varvec{T}\rightarrow \varvec{I}\)). We adopt two commonly used protocols in cross-modal retrieval, i.e., Hamming ranking and hash lookup. Hamming ranking sorts the data instances in a database and returns the data instances with the highest similarity to a given query. The top-N-precision curve and Mean Average Precision (MAP) are widely used to measure the Hamming ranking task. Hash lookup returns all the data instances within a certain Hamming radius given the query instance. The precision-recall curve is widely used to measure the hash lookup task. To obtain the MAP score, we need to calculate the Average Precision (AP) for each query, which is defined as:

$$\begin{aligned} AP=\frac{1}{N}\sum \limits _{r=1}^{R}{p\left( r\right) {\times rel}_r} \end{aligned}$$
(27)

where N is the number of relevant data instances in the database; p(r) is the precision of the top-r returned data instances; and \({rel}_r=1\) if the r-th returned data instance is relevant to the query and \({rel}_r=0\) otherwise. The MAP score is then calculated by averaging the APs over all queries. Here, we set \(R=500\), i.e., MAP@500.
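A small NumPy sketch of this metric follows (our own implementation choice: N is taken as the number of relevant items within the top R, a common convention when computing MAP@R):

```python
import numpy as np

def average_precision(retrieved_relevance, R=500):
    """AP over the top-R returned items, in the spirit of Eq. (27).
    `retrieved_relevance` is a 0/1 array ordered by decreasing similarity to the query."""
    rel = np.asarray(retrieved_relevance[:R], dtype=float)
    N = rel.sum()
    if N == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # p(r) for r = 1..R
    return float((precision_at_r * rel).sum() / N)

# MAP@500 is the mean of average_precision(.) over all queries.
```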

4.3 Implementation details

For the image modality, following GCH [50], we adopt CNN-F [9] due to its excellent capability for image feature extraction. It is pre-trained on ImageNet [13]. We keep the five convolutional layers conv1-conv5 and the next two fully-connected layers fc6-fc7 unchanged, and replace the eighth layer fc8 with a new fully-connected layer with k nodes to map the deep image features into the Hamming space.

For the text modality, we first use the BOW representation to transform each text into a vector, and then use a simple MLP to map the BOW vectors into the Hamming space. The MLP consists of three fully-connected layers with 512, 512 and k nodes, respectively.
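For illustration, the two networks could be set up in PyTorch roughly as follows (a sketch under our assumptions: AlexNet from torchvision stands in for CNN-F, which torchvision does not provide, and the tanh output activation is our choice rather than a detail specified above):

```python
import torch.nn as nn
from torchvision import models

class TxtNet(nn.Module):
    """Text network: a three-layer MLP (512, 512, k) mapping BOW vectors to hash outputs."""
    def __init__(self, bow_dim, k):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bow_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, k), nn.Tanh())            # keeps outputs near [-1, 1]

    def forward(self, y):
        return self.mlp(y)

def build_imgnet(k):
    """Image network: a pre-trained backbone whose final classification layer is
    replaced by a k-node hashing layer (the role played by fc8 in the text above)."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(4096, k)
    return nn.Sequential(net, nn.Tanh())
```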

For our proposed method, we set the hyper-parameters as \(\alpha =\lambda =1\), \(\rho _1=\rho _2=0.01\), and \(\beta _2=10\), with \(\gamma _1=1\), \(\gamma _2=0.05\) for MIRFlickr-25K, \(\gamma _1=0.5\), \(\gamma _2=0.1\) for NUS-WIDE, and \(\gamma _1=0.1\), \(\gamma _2=0.5\) for MS-COCO. To learn the neural network parameters at the second stage, we adopt the Adam solver, set the learning rate between \({10}^{-4}\) and \({10}^{-3}\), and use a batch size of 64. We implement our method with PyTorch on a single NVIDIA GTX 1080Ti GPU.

4.4 Comparison with state-of-the-art approaches

We compare our proposed method with state-of-the-art cross-modal hashing methods, including CMSSH [4], SCM [53], STMH [46], SePH [31], DCMH [21], SSAH [25], GCH [50], DLFH [22], MLCAH [36], ASCSH [37], FCMH [49], MIAN [54] and MS\(^2\)GAH [15]. Among these methods, CMSSH, SCM, STMH, SePH and DLFH are shallow methods, and the rest are deep methods. The deep networks used in our proposed DFGH are the same as in previous work: CNN-F for the image modality and an MLP for the text modality. Besides, SePH, DLFH and FCMH are also two-stage methods.

Table 2 MAP scores of compared methods with different lengths of hash codes on three benchmark datasets
Fig. 3: The top-N-precision and precision-recall curves with 64-bit hash codes on MIRFlickr-25K.

Table 2 reports the MAP scores of all compared methods and our DFGH for the two cross-modal retrieval tasks with hash codes varying from 16 to 64 bits. For a fair comparison, we follow the setting of GCH in [50], as it reported the highest metric scores among the comparison methods; the authors of [50] reported the scores of CMSSH, SCM, STMH, SePH and DCMH, so the comparisons between our method and these methods are under the same settings. We also discuss other methods, e.g., DLFH, MLCAH, ASCSH, FCMH, MIAN and MS\(^2\)GAH. To the best of our knowledge, these methods did not release their source code, making it hard to reproduce the results under the same settings. Fortunately, according to existing works [25, 36, 50], the random sampling strategy has little influence on the results: under a similar partition of the training and test sets, the results of the same methods are almost identical. Therefore, we cite their results directly from the original papers. Furthermore, we use “SUM” to denote the sum of the MAP scores for the Image-to-Text and Text-to-Image Retrieval tasks. Figures 3, 4 and 5 show the top-N-precision and precision-recall curves with 64-bit hash codes on the three datasets. Because SSAH, MLCAH and MS\(^2\)GAH have not released their source code and the original papers did not report results under the same settings as those used in this work, there are no corresponding curves for these methods in Figs. 3-5. Similarly, there are no corresponding curves for DLFH and ASCSH in Fig. 5.

Fig. 4: The top-N-precision and precision-recall curves with 64-bit hash codes on NUS-WIDE.

As shown in Table 2 and Figs. 3-5, our DFGH outperforms most of the compared methods across datasets and hash code lengths. Specifically, on MIRFlickr-25K, compared to GCH and MS\(^2\)GAH, which rank among the top three in “SUM”, DFGH achieves slightly lower MAP scores with 32-bit and 64-bit hash codes but a significant improvement with 16-bit hash codes. This suggests that reducing the hash code length has less impact on DFGH, because DFGH uses a fine-grained similarity matrix and a similarity preserving loss to capture the differentiated similarity relationships, which remains effective for short hash codes. On both NUS-WIDE and MS-COCO, DFGH significantly outperforms the other methods. Compared with MIRFlickr-25K, NUS-WIDE and MS-COCO are more complex and contain larger amounts of data, so the tasks are more challenging. The excellent performance on both datasets demonstrates that DFGH performs well in complex scenarios where it is more demanding to capture the differentiated similarity relationships between the original data instances and preserve them in the Hamming space.

On these three benchmark datasets, for the one-stage deep methods DCMH and GCH, the results of Text-to-Image Retrieval are generally better than those of Image-to-Text Retrieval, indicating that it is more difficult to capture the semantic similarity relationships between the image and text modalities via an image query. For one-stage methods, the learning of hash codes is affected by both image features and text features, and the generated hash codes may contain more semantic information from the text modality, because it is more difficult to capture the hidden semantic information from an image than from text. Different from the one-stage methods, the results of our DFGH for Image-to-Text Retrieval are generally slightly better than those for Text-to-Image Retrieval on MIRFlickr-25K. The reason may be that two-stage methods generate hash codes only from the semantic labels, without using the image or text features, at the first stage. Moreover, the training set is almost half the size of the entire dataset on MIRFlickr-25K, which gives the image and text networks a similar ability to map the original data to hash codes. However, on NUS-WIDE, the training set is only a small portion (less than \(4\%\)) of the entire dataset, making it more difficult for the image modality to preserve the similarity relationships between different modalities. In the results of another recent two-stage method (i.e., FCMH), we can also observe a similar outcome, while our approach further improves the performance. Additionally, with texts composed of sentences, MS-COCO has a higher-quality text modality than the other two datasets, so it is much easier to capture the hidden semantic information from texts in MS-COCO. Different from SePH, our DFGH uses an MLP for the text modality, which has a stronger ability to extract text features.

Fig. 5: The top-N-precision and precision-recall curves with 64-bit hash codes on MS-COCO.

Table 3 MAP scores with different lengths of hash codes for TECH and DFGH on MIRFlickr-25K

The above experimental results show that the proposed DFGH outperforms the state-of-the-art cross-modal hashing methods. We now compare DFGH with the two-stage method TECH [10] on MIRFlickr-25K. Because TECH used different settings from the other methods, we conduct another experiment following the experimental protocol provided in TECH: we randomly select 2,000 data instances as the query set and the remaining 18,015 instances as the training and retrieval set, extract deep CNN-F features for TECH, and report MAP scores over all query instances. The results are reported in Table 3. DFGH outperforms TECH in most cases. From the MAP scores of DFGH, we can also see that the extracted text representations preserve the cross-modal similarity relationships better than the image representations. The greater performance boost at shorter code lengths again demonstrates the effectiveness of the preserved similarity relationships. Notably, in the text-to-image task, our method outperforms TECH significantly, surpassing it by 0.040, 0.038, and 0.026 with 16-bit, 32-bit, and 64-bit codes, respectively. In practice, the text-to-image task is far more widely used than the image-to-text task, for which image captioning techniques are more common. In terms of computational complexity, the hash code length is the determining factor, and it is worthwhile to spend more training time to obtain higher accuracy.

Table 4 MAP scores of DFGH and its variants on three benchmark datasets
Table 5 MAP scores of models without \(\mathcal {J}_1\), \(\mathcal {J}_2\) and \(\mathcal {J}_3\) on three benchmark datasets

4.5 Ablation study

Effects of main components

To verify the effectiveness of the main components in our DFGH model, we conduct the ablation study experiments on three benchmark datasets, as shown in Table 4. We aim to explore four variants of DFGH: 1) DFGH-NAE is a DFGH variant without Autoencoder at the first stage; 2) DFGH-UWXE is a DFGH variant that uses unweighted cross-entropy loss at the second stage; 3) DFGH-MBC is a DFGH variant that uses multi-binary classification to train the hash functions at the second stage; and 4) DFGH-CGM is a DFGH variant that uses the coarse-grained similarity matrix at the first stage and unweighted cross-entropy loss at the second stage.

Fig. 6: Parameter analysis of \(\alpha \) and \(\lambda \) on MIRFlickr-25K and NUS-WIDE.

As expected, DFGH outperforms DFGH-NAE, which verifies the effectiveness of the added Autoencoder at the first stage. Comparing DFGH with DFGH-UWXE, we find that the unweighted pairwise cross-entropy loss may lead to suboptimal performance. The main reason is that this loss assigns the same weight to each pair, causing the learning process to be equally sensitive to each pair. In contrast, our DFGH model uses the fine-grained similarity matrix to assign different weights to different pairs, making the training process sensitive to similar and hard pairs and further enabling the hash codes in Hamming space to preserve the differentiated similarity relationships between the original data. DFGH-UWXE significantly outperforms DFGH-CGM. The only difference between these two variants is the similarity matrix used at the first stage. This comparison shows the effectiveness of the fine-grained similarity matrix and further demonstrates the importance of fully capturing the similarity information.

Effects of loss functions in Stage 2

Further, to verify the effectiveness of the components of the second-stage loss function, we conduct ablation experiments on the three benchmark datasets, as shown in Table 5, reporting the MAP scores of the model without \(\mathcal {J}_1\), \(\mathcal {J}_2\) and \(\mathcal {J}_3\), respectively. On the whole, \(\mathcal {J}_1\) is the most crucial loss: there is a dramatic drop in performance without it, which confirms the effectiveness of our two-stage pipeline, as the most important guidance for hash function learning comes from the hash codes inferred at the first stage. \(\mathcal {J}_2\) and \(\mathcal {J}_3\) each bring an average improvement of about 0.02 per task. Note that the improvement brought by doubling the hash code length is also about 0.02 here, so such an improvement is significant. This indicates that the classification loss and the weighted similarity preserving loss can substantially boost retrieval performance, proving the effectiveness of preserving the semantic information of labels and the similarity information in the hash codes.

Fig. 7: Parameter analysis of \(\gamma _1\) and \(\gamma _2\) on MIRFlickr-25K and NUS-WIDE.

Fig. 8: Parameter analysis of \(\beta _2\) on MIRFlickr-25K and NUS-WIDE.

4.6 Parameter sensitivity analysis

We analyze parameter sensitivity on the two most widely used datasets, MIRFlickr-25K and NUS-WIDE. We run both retrieval tasks and empirically fix the hash code length to 64 bits. There are eight hyper-parameters in DFGH, i.e., \(\alpha \), \(\lambda \), \(\gamma _1\), \(\rho _1\), \(\rho _2\), \(\gamma _2\), \(\beta _1\) and \(\beta _2\). As mentioned before, \(\beta _1\) is related to the hash code length k and the number of classes c, so its value is set empirically. In addition, we follow the previous work SADH [41] and set the two penalty parameters \(\rho _1\) and \(\rho _2\) to 0.01. Therefore, we only present the experimental results for \(\alpha \), \(\lambda \), \(\gamma _1\), \(\gamma _2\), and \(\beta _2\) in Figs. 6, 7 and 8. Here, we use “Average” to represent the average score over Image-to-Text and Text-to-Image Retrieval. Overall, the influence of these parameters on the results is clear and stable, which means that suitable parameters can be found without much effort during parameter selection.

Parameters \(\alpha \) and \(\lambda \)

represent the importance of the Autoencoder loss. From Fig. 6, we observe that the average MAP scores of DFGH first increase and then drop as \(\alpha \) and \(\lambda \) vary in [0.5, 10] on MIRFlickr-25K, but remain relatively stable on NUS-WIDE. This indicates that the semantic information of labels helps improve retrieval performance, but too much intervention from the Autoencoder may harm hash code learning on a small-scale dataset.

Parameters \(\gamma _1\) and \(\gamma _2\)

adjust the pair weights and deal with the class imbalance problem. From Fig. 7, we observe that on MIRFlickr-25K the average MAP scores first increase steadily with \(\gamma _1\) in [0.05, 0.7] and then remain stable with \(\gamma _1\) in [0.7, 1.0], while they remain relatively stable with \(\gamma _2\) in [0.7, 1.0] and then drop rapidly. Different from MIRFlickr-25K, the average MAP scores on NUS-WIDE are not sensitive to \(\gamma _1\) and \(\gamma _2\) in [0.05, 0.7], and drop when \(\gamma _1\) and \(\gamma _2\) are greater than 0.7.

Parameter \(\beta _2\)

controls the similarity preserving loss. As shown in Fig. 8, on MIRFlickr-25K the average MAP scores first increase steadily with \(\beta _2\) in [1, 10], indicating that the similarity preserving loss is necessary for learning the hash functions; however, when \(\beta _2\) becomes too large, the average MAP scores drop. Different from MIRFlickr-25K, the average MAP scores on NUS-WIDE first increase steadily with \(\beta _2\) in [1, 10] and then remain stable over a wide range of values. Note that the best retrieval performance is achieved under different settings on different datasets.

5 Conclusion

In this paper, we introduce a two-stage deep hashing method, DFGH, for cross-modal retrieval. To generate better hash codes, we design a fine-grained similarity matrix to explore the differentiated similarity relationships between data instances, and an Autoencoder is utilized to preserve the semantic information of labels. To obtain better hash functions, we design a similarity sensitivity learning strategy and a similarity preserving loss, which enable the hash codes generated by the hash functions to preserve the differentiated similarity relationships in the Hamming space. The adoption of such a fine-grained similarity matrix clearly improves retrieval performance, demonstrating the importance of exploring fine-grained similarity information. In the future, it is promising to utilize information beyond labels to mine richer similarity information for constructing the similarity matrix.