1 Introduction

In recent years, the rapid development of the information society has produced an explosion of data in multiple media types, such as text, images, audio and video, which in turn has boosted the demand for effective data storage and similarity search techniques. To meet this demand, great efforts have been made [1, 9, 32, 42, 51, 60], and one representative solution is learning to hash. Hashing techniques project data from an arbitrary high-dimensional space into a binary space while preserving the similarity relationships of the data in the original space. As data is represented by compact binary codes, storage and search costs are reduced dramatically. By virtue of these advantages, hashing-based methods have attracted much attention and a great variety of excellent work has been presented [6, 14, 31, 34, 35, 39, 54, 58].

Most prior work focuses on data retrieval within a single modality [11, 23, 25, 28, 43, 44, 53, 55], for instance, using texts to search for similar texts and images to find relevant images. In practice, however, it is more common to search data across modality borders, e.g., using several words or a sentence to retrieve relevant videos or images. Handling such cross-modal retrieval tasks is challenging due to the heterogeneity between different modalities. To cope with it, cross-modal hashing (CMH) has been investigated, which aims to learn effective hash codes for multi-modal data while maintaining the original intra- and inter-modality correlations [37, 40, 52, 56, 59, 61].

Depending on whether supervised information is involved, existing cross-modal hashing methods can be roughly categorized into supervised and unsupervised methods. Commonly, supervised methods [19, 20, 24, 45] utilize annotated labels or a pre-computed similarity matrix as a uniform guidance for all modalities to ensure intra- and inter-modality consistency during binary code learning, thereby obtaining precise binary codes and effective mapping functions. Unsupervised methods [7, 8, 36, 57] can only analyse raw features to uncover their intrinsic relations to guide hash learning. Therefore, the quality of the learnt binary codes and functions highly depends on how much useful information can be extracted from the original datasets. Generally, supervised hashing achieves better performance because more information is available, while unsupervised counterparts are more practical in real applications, where data often comes without tags and manually annotating large collections of data is time-consuming and difficult.

Recently, owing to its strong power in extracting insightful non-linear features, deep learning has shown its superiority and become prevalent in many fields, including information retrieval and computer vision. Combining deep learning with hashing techniques has become increasingly common, and a large family of proposed methods [2,3,4, 13, 21, 47] has exhibited encouraging performance compared to conventional non-deep hashing methods. Among them, existing deep cross-modal hashing methods mainly focus on supervised scenarios, while unsupervised learning has received relatively little attention even though, as mentioned above, it is crucial in practice. In this work, we concentrate on improving the performance of deep cross-modal hashing in unsupervised search scenarios.

Technically, the key objectives of deep unsupervised cross-modal learning are to obtain reliable guidance information, relieve the modality discrepancy and build effective connections between multi-modal data. To achieve these goals, many existing methods construct affinity graphs by various approaches to capture the underlying similarity relationships, and then learn binary codes by approximating these affinity relationships [17, 38, 49].

However, some problems still need to be considered. First, current methods widely compute a pair-wise distance, such as the Euclidean or Cosine distance, between samples to indicate their similarity, which is then used as the guidance for learning. Nevertheless, such similarity measures consider each pair of points independently, neglecting their relationships with other points (i.e., their neighbours). Such a local-view affinity construction makes the obtained similarity information insufficient for the desired results. Second, conventional unsupervised methods treat each modality separately, constructing modality-specific affinity matrices, and rarely take into consideration the diversity and complementarity of different modalities. In practice, different modalities contain different information, so it is highly likely that the intrinsic information carried by multiple modalities is complementary. Intuitively, an interactive learning strategy that incorporates multi-modality information can be expected to yield competitive results. Third, the similarity-preserving strategy used for learning has not been studied adequately by previous methods, which limits the learning performance.

To address the above issues, in this paper we propose a novel unsupervised deep hashing method, called High-order Nonlocal Hashing (HNH), for cross-modal retrieval. An overview of the proposed method is illustrated in Figure 1 and the contributions are summarised as follows.

  • The proposed HNH deeply explores the underlying semantic structure of multi-modal data: it distills similarity information in both the local and nonlocal views, and integrates multi-modal correlation information to construct a high-order unified similarity matrix as the learning supervision. In this way, it can significantly mitigate the heterogeneity problem and provide more precise guidance for cross-modal learning.

  • To reduce the modality discrepancy and improve performance, HNH preserves intra-consistency and inter-affinity both implicitly and explicitly. This is implemented by introducing a common representation for all modalities and reconstructing similarity relationships with the common representation and modality-specific embeddings.

  • Extensive experiments demonstrate that the proposed algorithm outperforms baseline methods for cross-modal hashing retrieval.

Fig. 1 Illustration of the proposed HNH framework, which consists of three main parts: similarity matrix construction, feature learning and hash code learning

The rest of the paper is organized as follows. In Section 2, we give a brief review of related work. In Section 3, we elaborate on the proposed method, including the notation, framework and optimization scheme, followed by comprehensive experiments in Section 4. The conclusion is given in Section 5.

2 Related work

In the literature, existing cross-modal hashing methods can be divided into shallow (i.e., non-deep) hashing and deep hashing according to whether deep neural networks are used in the learning procedure. In this section, we briefly review the related work of these two categories.

2.1 Shallow cross-modal hashing

Early work in hash learning usually learns binary codes and embedding functions by designing effective algorithms for hand-crafted features. Based on whether supervised information is utilized, existing shallow cross-modal hashing methods can be roughly categorized into unsupervised and supervised ones. Usually, the former generates hash codes or projection functions by exploiting the intra- and inter-modality similarity structures of the training data. For instance, Inter-Media Hashing (IMH) [36] learns consistent binary representations for multimedia data by exploring intra-media and inter-media connections and consistencies. To eliminate the semantic gap and capture more semantic information, Latent Semantic Sparse Hashing (LSSH) [57] adopts sparse coding and matrix factorization to deal with images and texts, respectively. Collective Matrix Factorization Hashing (CMFH) [7] learns unified hash codes of instances by collective matrix factorization with a latent factor model over different views. Composite Correlation Quantization (CCQ) [26] introduces an isomorphic latent space in which different modalities can be correlated more effectively.

In comparison, supervised cross-modal hashing methods can additionally utilize precise semantic information, e.g., semantic labels/tags. Various methods have been explored and have shown excellent performance, such as Cross View Hashing (CVH) [19], Semantics Preserving Hashing (SePH) [22], Discrete Cross-modal Hashing (DCH) [48] and Discrete Manifold-Embedded Cross-Modal Hashing (SDMCH) [27]. More specifically, CVH is the first work that considers hash learning in the multi-view setting, learning binary codes by minimizing the similarity-weighted Hamming distance. SePH learns hash representations by minimizing the KL-divergence of a probability distribution derived from the given semantic affinities. DCH introduces a classification framework with labels as supervision to directly learn discriminative hash codes without discarding the discrete constraints. SDMCH generates binary codes by preserving manifold correlations and leveraging various kinds of supervised information.

2.2 Deep cross-modal hashing

By means of the prominent nonlinear modeling ability of deep networks, deep hashing outperforms non-deep methods by a large margin, as it can learn high-level features from the original data. This learning pattern has become the mainstream in the hashing field and, currently, more effort is devoted to supervised settings. Representative supervised deep hashing methods include Deep Visual-Semantic Hashing (DVSH) [2], Deep Cross-Modal Hashing (DCMH) [16], Pairwise Relation Guided Deep Hashing (PRDH) [50] and Subspace Relation Learning for Cross-modal Hashing (SRLCH) [33]. DVSH takes into consideration the specific properties of different modalities and performs joint visual-text representation learning by utilizing convolutional neural networks (CNN) and Long Short-Term Memory (LSTM) to handle the two modalities. DCMH constructs an end-to-end deep discrete hash learning framework that enables deep feature learning and hash learning in one step. PRDH employs various kinds of pairwise constraints to capture the correlations across different modalities and imposes a decorrelation restriction to pursue more precise and discriminative binary codes. In order to learn more discriminative hash codes, SRLCH directly exploits the relation information of semantic labels by transforming class labels into a subspace.

On the other hand, solving the unsupervised cross-modal problem is relatively harder because only the original features are available. For instance, Deep Binary Reconstruction (DBRC) [12] proposes a binary reconstruction framework with the Adaptive Tanh (ATanh) activation, so that discrete hash codes can be learnt directly. Unsupervised Deep Cross-Modal Hashing (UDCMH) [46] performs feature learning and binarization simultaneously under a novel Laplacian- and discrete-constraint-based framework. Deep Joint-Semantics Reconstructing Hashing (DJSRH) [38] designs a joint-semantics affinity matrix construction strategy to capture the underlying semantic relationships among data, and binary codes are obtained by reconstructing the built joint affinity.

3 Method

3.1 Notation and problem definition

In this paper, we use boldface uppercase letters, e.g., U, to represent matrices and boldface lowercase letters, e.g., u, to denote vectors. Ui and Uj represent the i-th row and the j-th column of U, respectively, and Uij is the element at position (i,j) of matrix U. The transpose of U is denoted by UT and U−1 denotes the inverse of U. In addition, Ic indicates an identity matrix of dimensionality c, and \(\left \lVert \cdot \right \rVert _{F}\) denotes the Frobenius norm of a matrix. The Hadamard matrix product ⊗ (i.e., element-wise product) between any two matrices, e.g., \(\textbf {U}= \{\textbf {U}_{i j}\}_{i, j=1}^{n}\) and \(\textbf {V} = \{\textbf {V}_{i j}\}_{i, j=1}^{n}\), is defined as:

$$ \mathbf{U} \otimes \mathbf{V} = \left\{\begin{array}{cccc}\mathbf{U}_{11}\cdot \mathbf{V}_{11} & {\mathbf{U}_{12}\cdot \mathbf{V}_{12}} & {...} & {\mathbf{U}_{1n}\cdot \mathbf{V}_{1n}} \\\mathbf{U}_{21}\cdot \mathbf{V}_{21} & {\mathbf{U}_{22}\cdot \mathbf{V}_{22}} & {...} & {\mathbf{U}_{2n}\cdot \mathbf{V}_{2n}} \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \mathbf{U}_{n1}\cdot \mathbf{V}_{n1} & {\mathbf{U}_{n2}\cdot \mathbf{V}_{n2}} & {...} & {\mathbf{U}_{nn}\cdot \mathbf{V}_{nn}} \end{array}\right\} $$
(1)

sign(⋅) is an element-wise sign function which is defined as follows:

$$ sign(\mathbf{u}) = \left\{\begin{array}{cccc} 1\ &&\mathbf{u}>0 & \\ -1\ &&\mathbf{u}\leq0 & . \end{array} \right. $$
(2)

Assume there is a multi-modal dataset with n image-text pair instances, denoted as \(\mathcal {O}=(\mathbf {X},\mathbf {Y})\), where \(\mathbf {X}=[\mathbf {x}_{1},\mathbf {x}_{2},...,\mathbf {x}_{n}] \in { \mathbb {R}^{d_{1} \times n}}\) and \(\mathbf {Y}=[\mathbf {y}_{1},\mathbf {y}_{2},...,\mathbf {y}_{n}] \in {\mathbb {R}^{d_{2} \times n}}\) denote the d1-dimensional image feature matrix and the d2-dimensional text feature matrix (usually \(d_{1} \neq d_{2}\)) of the n training points, respectively. In our experiments, we assume that samples from both modalities are observed for every training instance.

Given the training data and a specific code length c, the purpose of our method is to learn modality-specific projection functions f(x;𝜃x) and g(y;𝜃y) for images and texts, respectively, where 𝜃x and 𝜃y are network parameters, and to obtain the corresponding binary representations \(\mathbf {B}_{\mathbf {x}} \in {\{-1,1\}^{c \times n}}\) and By ∈{− 1,1}c×n. In principle, Bx and By learnt by the mapping functions should well preserve the similarities in the original multi-modal spaces.

3.2 Framework overview

As shown in Figure 1, our model is an end-to-end joint learning framework that mainly contains two subnetworks, INet and TNet, which perform image and text feature encoding, respectively. As the original images and texts have distinct characteristics, INet and TNet are built on different components. More concretely, INet is adapted from AlexNet [18] pre-trained on ImageNet, which is composed of five convolution layers and three fully-connected (fc) layers; we remove the last layer and add an fc layer with c hidden units on top of the remaining layers. TNet is a three-layer Multi-Layer Perceptron (MLP) [30].

Given the batch of input image-text pairs, at the very beginning of each iteration, we extract 4,096-dimensional vectors from the fc-7 layer of the pre-trained AlexNet as the original high-level image features, and use the original text features, e.g., BoW features, as the original text representations. Both features are then used to construct a unified similarity matrix for the two modalities with the newly designed strategy elaborated below. We feed the original image and text features from the datasets into our network, which outputs the corresponding representative descriptors. To ensure that the learnt representations effectively preserve the original similarity relationships, we constantly optimize the whole network by minimizing the objective loss function defined in (11).
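For concreteness, a minimal PyTorch sketch of the two subnetworks is given below (the implementation described in Section 4.3 is PyTorch-based). The TNet hidden size, the use of torchvision's pre-trained AlexNet and all variable names are our own assumptions rather than details prescribed by the paper; the tanh at the outputs corresponds to the relaxation introduced later in (8).

```python
import torch
import torch.nn as nn
from torchvision import models


class INet(nn.Module):
    """Image subnetwork: pre-trained AlexNet with its 1000-way classifier
    removed and a c-unit fully-connected hash layer added on top."""

    def __init__(self, code_len):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)           # ImageNet weights
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep the classifier up to (and including) the 4096-d fc-7 layer
        self.fc = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.hash_layer = nn.Linear(4096, code_len)

    def forward(self, x):
        x = torch.flatten(self.avgpool(self.features(x)), 1)
        x = self.fc(x)
        return torch.tanh(self.hash_layer(x))                # relaxed codes in [-1, 1]


class TNet(nn.Module):
    """Text subnetwork: three-layer MLP (the hidden size is an assumption)."""

    def __init__(self, text_dim, code_len, hidden_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, code_len),
        )

    def forward(self, y):
        return torch.tanh(self.mlp(y))                        # relaxed codes in [-1, 1]
```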

3.3 High-order nonlocal affinity matrix

Much previous work [10, 38, 49] has shown the feasibility of using features extracted from pre-trained deep architectures to construct affinity matrices as the learning guidance. In light of this, in each epoch of the training stage, we randomly select m \((m \ll n)\) instance pairs \(\{(\textbf {x}_{i},\textbf {y}_{i} )\}_{i=1}^{m}\) from the two modalities as a batch input. We further extract their features \(\{(\textbf {x}^{\prime }_{i},\textbf {y}^{\prime }_{i} )\}_{i=1}^{m}\), where \(\textbf {x}^{\prime }_{i} \in \mathbb {R}^{d_{x} \times 1}\) and \(\textbf {y}^{\prime }_{i} \in \mathbb {R}^{d_{y} \times 1}\), from the pre-trained deep neural network and the original text dataset, respectively. We then calculate the pair-wise Cosine similarity between these features to construct real-valued modality-specific affinity matrices Ax and Ay ∈ [0,1]m×m for images and texts, respectively, defined as follows:

$$ \mathbf{A}_{\mathbf{x}}[i,j] = d_{c o s} (\mathbf{x}^{\prime}_{i}, \mathbf{x}^{\prime}_{j}) = \frac{{\mathbf{x}^{\prime}_{i}}^{T} \mathbf{x}^{\prime}_{j}}{\|{\mathbf{x}^{\prime}_{i}}{\|}_{2}\| \mathbf{x}^{\prime}_{j} {\|}_{2}}, $$
(3)
$$ \mathbf{A}_{\mathbf{y}}[i,j] = d_{c o s} (\mathbf{y}^{\prime}_{i}, \mathbf{y}^{\prime}_{j}) = \frac{{\mathbf{y}^{\prime}_{i}}^{T} \mathbf{y}^{\prime}_{j}}{\|{\mathbf{y}^{\prime}_{i}}{\|}_{2}\| \mathbf{y}^{\prime}_{j} {\|}_{2}}, $$
(4)

where Ax[i,j] and Ay[i,j] represent the (i,j)-th element of Ax and Ay.

After constructing the above affinity matrices, previous methods directly use them to supervise hash code and function learning, but two potential problems remain. On the one hand, the above similarity construction exploits the relation of every two points only in a local view, which means that only the two points themselves are involved when computing their similarity. As their relationships with other points are neglected, sufficient semantic similarity information can hardly be captured. For example, suppose we have three points a, b and c, and the distance between a and b equals that between a and c. According to the above similarity measure, the two pairs have the same similarity. However, this may be inaccurate because their neighbourhood relationships have not been considered: if a and c share more similar neighbours than a and b, then a and c should be regarded as closer than the other pair. Such high-order nonlocal relationships cannot be observed when only the local-view pair-wise distance between samples is computed independently.

On the other hand, the above similarity matrices are constructed purely within one modality, i.e., the similarity matrix for one modality is built only from the information in that modality. When the amount of information within a modality is limited, extracting sufficient semantic information from it alone is not always feasible. Besides, learning binary representations and functions by approximating these independent similarity matrices separately suffers from the discrepancy between modalities, leading to inferior performance. In reality, the diverse and complementary information carried by different modalities can reinforce each other. Therefore, in cross-modal learning, merging information from various modalities is an effective way to capture rich semantics and consequently obtain a high-quality similarity matrix.

Considering this, we first re-weight the original similarities by exploiting pair-wise relationships from a holistic view, which is defined as follows:

$$ \begin{array}{@{}rcl@{}} &&\widetilde{\mathbf{A}}_{\mathbf{*}} = \mathbf{A}_{\mathbf{*}} \otimes \mathbf{\Psi}_{\mathbf{*}},\\ &&\mathbf{\Psi}_{\mathbf{*}}=\mathbf{A}_{\mathbf{*}}^{T} \mathbf{A}_{\mathbf{*}},\\ &&s.t. \ \ \mathbf{*} \in \{\mathbf{x}, \mathbf{y}\}, \end{array} $$
(5)

where the symbol ⊗ represents the Hadamard matrix product (i.e., element-wise product). Since each row or column of A∗, ∗∈{x,y}, indicates the relation between one point and all the other points, \(\mathbf {A}_{\mathbf {*}}^{T} \mathbf {A}_{\mathbf {*}}\) reflects how similar two samples are based on their neighbours. It is common that if two points share more neighbours with larger similarity values, they should be closer than other points, and vice versa. For instance, as shown in Figure 2, points a and b have 3 common neighbours {a,b,d}, while a and c have 4 common neighbours {a,c,e,f}. According to the above method, the nonlocal similarity for (a,b) is 1 × 0.2 + 0.2 × 1 + 0.1 × 0.1 = 0.41, while that for (a,c) is 1 × 0.2 + 0.2 × 1 + 0.1 × 0.1 + 0.3 × 0.5 = 0.56 > 0.41, so a and c are more similar than a and b, which cannot be observed via the traditional local-view similarity matrix construction. In a word, our similarity computation helps capture more reliable and precise relationships between two points by considering both local and nonlocal information. In addition, we further rescale \(\widetilde {\mathbf {A}}_{\mathbf {*}} = k_{\mathbf {*}} \cdot \widetilde {\mathbf {A}}_{\mathbf {*}} - 1\) to give a flexible quantization range for the subsequent quantization procedure.

Fig. 2 An example of neighbour relationships. Points a and b have the same Cosine similarity as points a and c, but a and c share more common neighbours with larger similarity values than a and b, so a is more similar to c

After obtaining the two high-order nonlocal similarity matrices \(\widetilde {\mathbf {A}}_{\mathbf {x}}\) and \(\widetilde {\mathbf {A}}_{\mathbf {y}}\), we set out to solve the second problem. As mentioned above, the information in one modality can complement and reinforce the other, so we integrate \(\widetilde {\mathbf {A}}_{\mathbf {x}}\) and \(\widetilde {\mathbf {A}}_{\mathbf {y}}\) to obtain a new joint similarity matrix:

$$ \begin{array}{@{}rcl@{}} &&\widetilde{\mathbf{S}}=\gamma \widetilde{\mathbf{A}}_{\mathbf{x}} + (1-\gamma) \widetilde{\mathbf{A}}_{\mathbf{y}},\\ &&s.t. \ \ \gamma \in[0,1], \end{array} $$
(6)

where γ is the balance parameter.

Unlike previous methods, our similarity matrix construction strategy is not limited to a local view that considers only two points independently; it also captures the nonlocal similarity structure. Moreover, the bimodal similarities are merged in an elegant way, so that sufficient information about data from different modalities can be obtained, making the subsequent hash learning more effective.
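To make the construction concrete, the following sketch builds the joint similarity matrix of (3)-(6) for one mini-batch in PyTorch. The function names and default values (γ and k∗ taken from the MIRFlickr-25K settings in Section 4.3) are ours; non-negative input features (e.g., ReLU activations and BoW vectors) are assumed so that the cosine affinities lie in [0,1] as stated above.

```python
import torch
import torch.nn.functional as F


def cosine_affinity(feat):
    """Pair-wise cosine similarities, Eqs. (3)-(4); feat has shape (m, d)."""
    feat = F.normalize(feat, dim=1)
    return feat @ feat.t()                       # (m, m) affinity matrix


def nonlocal_affinity(A, k):
    """High-order nonlocal re-weighting of Eq. (5), followed by the
    rescaling A_tilde = k * A_tilde - 1 used to widen the quantization range."""
    A_tilde = A * (A.t() @ A)                    # Hadamard product with Psi = A^T A
    return k * A_tilde - 1.0


def joint_similarity(img_feat, txt_feat, gamma=0.9, k_x=2.0, k_y=2.0):
    """Unified similarity matrix S_tilde of Eq. (6) for one mini-batch."""
    A_x = nonlocal_affinity(cosine_affinity(img_feat), k_x)
    A_y = nonlocal_affinity(cosine_affinity(txt_feat), k_y)
    return gamma * A_x + (1.0 - gamma) * A_y
```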

3.4 Objective function

After constructing the similarity matrix for both modalities, we turn to binary code and mapping function learning. We treat the similarity matrix \(\widetilde {\mathbf {S}}\) as the supervisory signal to train the network. Specifically, we denote the features extracted from the last hidden layer of INet as f(X;𝜃x) and the text representations generated by TNet as g(Y;𝜃y). We then binarize them via the sign(⋅) function to obtain the binary representations Bx and By. They are further normalised; for simplicity, this step is not shown in the following objective functions and optimization procedure. To eliminate the modality gap, we first introduce a common representation \(\mathbf {U} \in { \mathbb {R}^{c \times m}}\) for both modalities and minimise its distance to the binarized representations Bx and By, respectively. Based on this, the common space loss is formulated as:

$$ \begin{array}{@{}rcl@{}} && \mathcal{J}_{1}= {\min_{\mathbf{U}, {\theta}_{\mathbf{x}}, {\theta}_{\mathbf{y}}}} \|\mathbf{U} - \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \|\mathbf{U} - \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= sign\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {\{-1,1\}^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \mathbf{B}_{\mathbf{y}} = sign\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {\{-1,1\}^{c \times m}}. \end{array} $$
(7)

However, in the above formulation, adopting the sign(⋅) function as the activation function inevitably leads to an intractable gradient back-propagation problem. To handle this, the tanh(⋅) function is applied to approximate sign(⋅) as:

$$ \begin{array}{@{}rcl@{}} &&\mathcal{J}_{1}= \underset{\mathbf{U}, {\theta}_{\mathbf{x}}, {\theta}_{\mathbf{y}}}{\min} \|\mathbf{U}-\mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \|\mathbf{U}-\mathbf{B}_{\mathbf{y}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= tanh\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {[-1,1]^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \mathbf{B}_{\mathbf{y}} = tanh\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {[-1,1]^{c \times m}}. \end{array} $$
(8)

This objective encourages the binary representations of the image and text modalities to be as consistent as possible, so that the gap between the two modalities can be bridged. Moreover, the common representation U can also act as a representation of each modality, as Bx and By do, so it can be further utilized to keep the intra-consistency, as the following formulation defines:

$$ \begin{array}{@{}rcl@{}} &&\mathcal{J}_{2}= {\min_{\mathbf{U}, {\theta}_{\mathbf{x}}, {\theta}_{\mathbf{y}}}} \underbrace{\|\widetilde{\mathbf{S}}- \mathbf{U}^{T} \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2}}_{\text {image modality }} + \underbrace{\|\widetilde{\mathbf{S}} - \mathbf{U}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2}}_{\text {text modality }},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= tanh\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {[-1,1]^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \mathbf{B}_{\mathbf{y}} = tanh\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {[-1,1]^{c \times m}}. \end{array} $$
(9)

The two terms in (9) represent the common space reconstruction losses for the image and text modalities, respectively. At the same time, as the common representation U is consistent with each modality's representation, the cross-modal similarity is also preserved implicitly, which further builds the connection between different modalities.

Next, we explicitly preserve the inter-modality similarity by defining the following function:

$$ \begin{array}{@{}rcl@{}} && \mathcal{J}_{3}= {\min_{{\theta}_{\mathbf{x}}, {\theta}_{\mathbf{y}}}} \|\widetilde{\mathbf{S}}- \mathbf{B}_{\mathbf{x}}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= tanh\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {[-1,1]^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \mathbf{B}_{\mathbf{y}} = tanh\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {[-1,1]^{c \times m}}. \end{array} $$
(10)

Combining (8), (9) and (10) together, the final objective function is formed as:

$$ \begin{array}{@{}rcl@{}} && \mathcal{J}= {\min_{\mathbf{U}, {\theta}_{\mathbf{x}}, {\theta}_{\mathbf{y}}}} \alpha \mathcal{J}_{1} + \beta \mathcal{J}_{2} + \lambda \mathcal{J}_{3} \\ && \ \ \ \ \ \ \ \ \ \ \ \ = \alpha (\|\mathbf{U} - \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \|\mathbf{U} - \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2})\\ && \ \ \ \ \ \ \ \ \ \ \ \ + \beta (\underbrace{\|\widetilde{\mathbf{S}}- \mathbf{U}^{T} \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2}}_{\text {image modality }} + \underbrace{\|\widetilde{\mathbf{S}}- \mathbf{U}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2}}_{\text {text modality }})\\ && \ \ \ \ \ \ \ \ \ \ \ \ + \lambda \underbrace{\|\widetilde{\mathbf{S}}- \mathbf{B}_{\mathbf{x}}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2}}_{\text {cross-modality}},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= tanh\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {[-1,1]^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \mathbf{B}_{\mathbf{y}} = tanh\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {[-1,1]^{c \times m}}, \end{array} $$
(11)

where α, β and λ are trade-off parameters that balance the loss terms.
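Translated into code, the objective (11) is a sum of squared Frobenius norms over the tanh-relaxed outputs; a sketch with our own naming is shown below, where U is treated as a constant inside the loss because it is updated in closed form in Section 3.5.

```python
import torch


def hnh_loss(U, B_x, B_y, S_tilde, alpha=40.0, beta=1.0, lam=1.0):
    """Objective of Eq. (11); U, B_x, B_y have shape (c, m), S_tilde is (m, m)."""
    j1 = (U - B_x).pow(2).sum() + (U - B_y).pow(2).sum()        # common space loss, Eq. (8)
    j2 = ((S_tilde - U.t() @ B_x).pow(2).sum()
          + (S_tilde - U.t() @ B_y).pow(2).sum())               # intra-consistency, Eq. (9)
    j3 = (S_tilde - B_x.t() @ B_y).pow(2).sum()                 # inter-affinity, Eq. (10)
    return alpha * j1 + beta * j2 + lam * j3
```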

The proposed scheme has several advantages. First, we construct a high-order nonlocal similarity matrix by deeply and reliably exploiting the underlying semantic affinity among data, including both local-view and nonlocal-view similarity relationships. On this basis, we further leverage as much useful information from the different modalities as possible. Consequently, the matrix-based similarity-preserving hash learning can produce high-quality hash codes and functions. Second, we introduce a common representation for the different modalities to effectively relieve the modality discrepancy. By approximating the high-order similarity matrix with the binary representations of different modalities and the common representation, intra- and inter-modality consistency is preserved both explicitly and implicitly, which further helps improve performance.

3.5 Optimization algorithm

An alternating optimization scheme is designed to solve problem (11) iteratively, which is summarized in Algorithm 1. In the following paragraphs, we elaborate on the solving algorithm.

Step 1: Updating U with variables 𝜃 x and 𝜃 y fixed.

We first rewrite (11) as:

$$ \begin{array}{@{}rcl@{}} &&\mathcal{J} = {\min_{\mathbf{U}}} \ \alpha (\|\mathbf{U} - \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \|\mathbf{U} - \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2})\\ && \ \ \ \ \ \ \ \ \ + \beta (\|\widetilde{\mathbf{S}} - \mathbf{U}^{T} \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \|\widetilde{\mathbf{S}} - \mathbf{U}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2}). \end{array} $$
(12)

For brevity, we expand each term and drop the terms irrelevant to U. Thereafter, (12) is reformulated as follows:

$$ \begin{array}{@{}rcl@{}} &&\min_{\mathbf{U}} -2Tr(\mathbf{U}(\alpha\mathbf{B}_{\mathbf{x}}^{T} + \alpha \mathbf{B}_{\mathbf{y}}^{T} + \beta \widetilde{\mathbf{S}}\mathbf{B}_{\mathbf{x}}^{T}+\beta \widetilde{\mathbf{S}}\mathbf{B}_{\mathbf{y}}^{T}))\\ &&\ \ \ + 2 \alpha Tr(\mathbf{U}^{T} \mathbf{U}) + \beta (\| \mathbf{U}^{T}\mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \| \mathbf{U}^{T}\mathbf{B}_{\mathbf{y}}{\|}_{F}^{2}). \end{array} $$
(13)

By setting the derivative of (13) w.r.t. U to zero, we obtain:

$$ \textbf{U} = \left(2 \mathbf{I}_{c} + \frac{\beta}{\alpha} \mathbf{B}_{\mathbf{x}}\mathbf{B}_{\mathbf{x}}^{T}+ \frac{\beta}{\alpha} \mathbf{B}_{\mathbf{y}}\mathbf{B}_{\mathbf{y}}^{T}\right)^{-1}\left[(\mathbf{B}_{\mathbf{x}}+\mathbf{B}_{\mathbf{y}})\left(\mathbf{I}_{m}+ \frac{\beta}{\alpha}\widetilde{\mathbf{S}}\right)\right]. $$
(14)
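In code, the update (14) amounts to solving a small c × c linear system rather than forming the inverse explicitly; a sketch (our own naming, assuming S̃ is symmetric and the relaxed codes are detached from the computation graph) is:

```python
import torch


def update_U(B_x, B_y, S_tilde, alpha=40.0, beta=1.0):
    """Closed-form solution of Eq. (14); B_x, B_y are (c, m), S_tilde is (m, m)."""
    c, m = B_x.shape
    r = beta / alpha
    lhs = 2.0 * torch.eye(c, device=B_x.device) + r * (B_x @ B_x.t() + B_y @ B_y.t())
    rhs = (B_x + B_y) @ (torch.eye(m, device=B_x.device) + r * S_tilde)
    return torch.linalg.solve(lhs, rhs)          # solves lhs @ U = rhs for U
```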

Step 2: Updating 𝜃 x with other variables fixed.

When other variables are fixed, (11) can be reformulated as follows:

$$ \begin{array}{@{}rcl@{}} && \mathcal{J} = {\min_{{\theta}_{\mathbf{x}}}} \ \alpha\|\mathbf{U} - \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \beta \|\widetilde{\mathbf{S}}- \mathbf{U}^{T} \mathbf{B}_{\mathbf{x}}{\|}_{F}^{2} + \lambda \|\widetilde{\mathbf{S}}- \mathbf{B}_{\mathbf{x}}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{x}}= tanh\left( f\left( \mathbf{X} ; {\theta}_{\mathbf{x}}\right)\right) \in {[-1,1]^{c \times m}}. \end{array} $$
(15)

We can update the parameter 𝜃x by using mini-batch stochastic back-propagation.

$$ {\theta}_{\mathbf{x}} \quad \longleftarrow \quad {\theta}_{\mathbf{x}}-\mu_{\mathbf{x}} \frac{\partial \mathcal{J}}{\partial {\theta}_{\mathbf{x}}} $$
(16)

where μx is the learning rate.

Step 3: Updating 𝜃 y with other variables fixed.

When other variables are fixed, (11) can be reformulated as follows:

$$ \begin{array}{@{}rcl@{}} && \mathcal{J} = {\min_{{\theta}_{\mathbf{y}}}} \ \alpha\|\mathbf{U} - \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2} + \beta \|\widetilde{\mathbf{S}} - \mathbf{U}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2} + \lambda \|\widetilde{\mathbf{S}} - \mathbf{B}_{\mathbf{x}}^{T} \mathbf{B}_{\mathbf{y}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{\mathbf{y}}= tanh\left( g\left( \mathbf{Y} ; {\theta}_{\mathbf{y}}\right)\right) \in {[-1,1]^{c \times m}}. \end{array} $$
(17)

Similarly, we can solve the above equation with Stochastic Gradient Descent (SGD) by mini-batch back-propagation.

$$ {\theta}_{\mathbf{y}} \quad \longleftarrow \quad {\theta}_{\mathbf{y}}-\mu_{\mathbf{y}} \frac{\partial \mathcal{J}}{\partial {\theta}_{\mathbf{y}}} $$
(18)

where μy is the learning rate.

Algorithm 1 High-order Nonlocal Hashing (HNH) for cross-modal retrieval.

Input: The training data \(\mathcal {O}=(\mathbf {X},\mathbf {Y})\), mini-batch size m, hash code length c, number of iterations t, balance parameters α, β, γ, λ.
Output: Network parameters 𝜃x and 𝜃y.
Procedure: Randomly initialize the network parameters 𝜃x and 𝜃y;
Repeat:
1. Randomly select m image-text pairs from the dataset to construct a mini-batch;
2. Extract image features and text features to construct the similarity matrix \(\widetilde {\mathbf {S}}\);
3. Calculate the loss of the objective function in (11);
4. Update U using (14);
5. Update the parameter 𝜃x by back-propagation;
6. Update the parameter 𝜃y by back-propagation;
Until convergence.
Return: 𝜃x and 𝜃y.
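One training iteration of Algorithm 1 could then be sketched as follows, reusing the hypothetical helpers joint_similarity, hnh_loss and update_U introduced above; the optimizer choice (SGD with momentum) and the hyper-parameters follow Sections 3.5 and 4.3, while the data-loader format is our assumption.

```python
import torch


def train_hnh(inet, tnet, loader, gamma, alpha=40.0, beta=1.0, lam=1.0,
              epochs=40, device="cuda"):
    """Simplified sketch of Algorithm 1. The loader is assumed to yield
    (raw images, pre-extracted fc-7 image features, BoW text features)."""
    opt_x = torch.optim.SGD(inet.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-4)
    opt_y = torch.optim.SGD(tnet.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
    inet.to(device)
    tnet.to(device)
    for _ in range(epochs):
        for images, img_feat, txt_feat in loader:
            images, img_feat, txt_feat = images.to(device), img_feat.to(device), txt_feat.to(device)
            S = joint_similarity(img_feat, txt_feat, gamma)            # step 2, Eqs. (3)-(6)
            B_x = inet(images).t()                                     # (c, m) relaxed codes
            B_y = tnet(txt_feat).t()
            U = update_U(B_x.detach(), B_y.detach(), S, alpha, beta)   # step 4, Eq. (14)
            loss = hnh_loss(U, B_x, B_y, S, alpha, beta, lam)          # step 3, Eq. (11)
            opt_x.zero_grad()
            opt_y.zero_grad()
            loss.backward()                                            # steps 5-6
            opt_x.step()
            opt_y.step()
    return inet, tnet
```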

3.6 Extensions

The out-of-sample extension can be achieved easily. Specifically, for an unseen image instance \(\textbf {x}_{o} \in \mathbb {R}^{d_{1} \times 1}\) or an unseen text instance \(\textbf {y}_{o} \in \mathbb {R}^{d_{2} \times 1}\), we obtain its hash code as:

$$ \begin{array}{@{}rcl@{}} && \textbf{b}_{\textbf{x}_{o}} = sign\left( f\left( \mathbf{x}_{o} ; {\theta}_{\mathbf{x}}\right)\right),\\ && \textbf{b}_{\textbf{y}_{o}} = sign\left( g\left( \mathbf{y}_{o} ; {\theta}_{\mathbf{y}}\right)\right). \end{array} $$
(19)
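In practice, this out-of-sample encoding is a single forward pass followed by binarization; a minimal sketch (the function name is ours) is:

```python
import torch


@torch.no_grad()
def encode(net, x):
    """Hash code of an unseen instance, Eq. (19); the sign convention of
    Eq. (2) is followed, i.e., zero activations map to -1."""
    net.eval()
    out = net(x)
    return torch.where(out > 0, torch.ones_like(out), -torch.ones_like(out))
```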

Besides, the proposed method can easily be extended to cases with more modalities by adding a new subnetwork for each new modality and slightly modifying the objective function (11) as

$$ \begin{array}{@{}rcl@{}} &&\mathcal{J}= {\min_{\mathbf{U}, \{{\theta}_{i}\}_{i \in W}} } \alpha \mathcal{J}_{1} + \beta \mathcal{J}_{2} + \lambda \mathcal{J}_{3} \\ && \ \ \ = \alpha \sum\limits_{i \in W} \|\mathbf{U} - \mathbf{B}_{{i}}{\|}_{F}^{2} + \beta \sum\limits_{i \in W} \|\widetilde{\mathbf{S}}- \mathbf{U}^{T} \mathbf{B}_{{i}}{\|}_{F}^{2}\\ && \ \ \ + \lambda \sum\limits_{i \in W} {\sum}_{j \in W} \|\widetilde{\mathbf{S}}- \mathbf{B}_{{i}}^{T} \mathbf{B}_{{j}}{\|}_{F}^{2},\\ && \ \ s.t. \ \ \mathbf{B}_{{i}}= tanh\left( f_{i}\left( \mathbf{X}_{i} ; {\theta}_{i}\right)\right) \in {[-1,1]^{c \times m}}, \\ && \ \ \ \ \ \ \ \ \ \ W = \{ X, Y, ... \}, \end{array} $$
(20)

where fi and 𝜃i represent the projecting function and network parameters for modality i ∈ W = {X,Y,...}. The higher-order similarity matrix can be obtained by slightly adjusting our high-order similarity construction (6) as:

$$ \begin{array}{@{}rcl@{}} &&\widetilde{\mathbf{S}}= \sum\limits_{i \in W} \gamma_{i} \widetilde{\mathbf{A}}_{i},\\ &&s.t. \ \ \sum\limits_{i \in W} \gamma_{i} =1, \ \ W = \{ X, Y, ... \}, \end{array} $$
(21)

where W denotes the set of modalities and |W| is their number.

The optimization problem of (20) can be easily solved by Algorithm 1.

3.7 Computational complexity analysis

In this section, we analyse the computational complexity of the proposed HNH. In each iteration of the training stage, we first construct the unified matrix \(\widetilde {\mathbf {S}}\), which costs \(O(nm^{2})\). Then, it takes \(O(n(mc + c^{2}))\) to update the intermediate representation U. In total, the computational cost of the iterative process is \(O(n(m^{3} + mc + c^{2})t)\) with \(m, c, t \ll n\), which is linear in the size of the dataset. This competitive computational efficiency shows the practicality and effectiveness of the proposed algorithm when dealing with large-scale real-world cross-modal search tasks.

4 Experiments

To evaluate the effectiveness of our proposed HNH, we conduct extensive experiments on three widely-used benchmark datasets for cross-modal retrieval, i.e., MIRFlickr-25K [15], NUS-WIDE [5] and Wiki [29].

4.1 Datasets

MIRFlickr-25K

[15] is a multi-label dataset consisting of 25,000 images with corresponding textual tags, annotated with 24 unique labels. All instance pairs are crawled from the Flickr website. The dataset also provides a 1,386-dimensional feature vector as the descriptor for each image and a 100-dimensional textual feature, derived by applying PCA to the binary BoW tagging vector, to represent the corresponding textual content.

NUS-WIDE

[5] is a real-world multi-modal image dataset composed of 269,648 images and their corresponding textual tags, collected from Flickr. In our experiments, we select a subset of 186,577 image-tag instance pairs belonging to the 10 most frequently used concepts as the final experimental dataset. The dataset provides a 1,000-dimensional BoW feature for each text.

Wiki

[29] is a single-label dataset with 2,866 coupled image-text instances. All pairs are labeled with one of ten semantic classes. The dataset has been divided into 2,173 training pairs and 693 testing pairs. We summarize the statistics of the datasets in Table 1.

Table 1 Statistics of the benchmark datasets

4.2 Baselines and evaluation metric

We compare HNH with nine state-of-the-art unsupervised baselines, including shallow models: CVH [19], IMH [36], LCMH [61], CMFH [7], LSSH [57] and RFDH [41], and deep learning based methods: DBRC [12], UDCMH [46] and DJSRH [38].

We choose the widely-used Mean Average Precision (MAP), which reflects both ranking information and precision, to evaluate the performance of all methods.

More specifically, for a set of queries Q = [q1,q2,...,qp], the Mean Average Precision (MAP) is defined as follows:

$$ MAP=\frac{1}{p}\sum\limits_{i=1}^{p} \underbrace{\Bigg(\frac{1}{N_{g}}\sum\limits_{r=1}^{n}P_{i}(r)\zeta_{i}(r)\Bigg)}_{\text {Average Precision (AP)}}, $$
(22)

where p is the size of the query set Q, Ng is the number of ground-truth neighbours of the query qi in the database, n is the number of entities in the database, Pi(r) denotes the precision of the top r retrieved entities, and ζi(r) = 1 if the r-th retrieved entity is a ground-truth neighbour and ζi(r) = 0 otherwise. The ground-truth neighbours are defined as those sharing at least one semantic label.
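For reference, a straightforward implementation of (22) with Hamming-distance ranking might look as follows; it is a sketch with our own naming, and when a top-N cut is applied (e.g., N = 50 as in Section 4.3), Ng is taken as the number of relevant items within the returned list, a common MAP@N convention.

```python
import torch


def mean_average_precision(query_codes, db_codes, query_labels, db_labels, top_n=None):
    """MAP as in Eq. (22). Codes are {-1, +1} float tensors of shape (num, c);
    labels are multi-hot tensors, and two items are ground-truth neighbours
    if they share at least one label."""
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = 0.5 * (q_code.numel() - db_codes @ q_code)     # Hamming distances to the query
        order = torch.argsort(hamming)
        if top_n is not None:
            order = order[:top_n]
        relevant = (db_labels[order].float() @ q_label.float() > 0).float()   # zeta_i(r)
        n_g = relevant.sum()
        if n_g == 0:
            aps.append(torch.zeros((), device=relevant.device))
            continue
        ranks = torch.arange(1, relevant.numel() + 1, device=relevant.device, dtype=relevant.dtype)
        precision_at_r = torch.cumsum(relevant, dim=0) / ranks   # P_i(r)
        aps.append((precision_at_r * relevant).sum() / n_g)      # average precision of one query
    return torch.stack(aps).mean()
```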

We also adopt top-N precision curves as an additional criterion to further evaluate our method and all compared methods.

4.3 Implementation details

We implement the proposed HNH in PyTorch on a workstation (CPU: Intel XEON E5-2650 v3 @ 2.60GHz, RAM: 128GB, GPU: NVIDIA 1080Ti). Regarding the parameters of HNH, we set the batch size to 32, the momentum to 0.9, the weight decay to 0.0005, and the learning rates to 0.0001 and 0.01 for INet and TNet, respectively. The other balance parameters are selected by a validation procedure: we set γ = 0.9, α = 40, β = 1, λ = 1 and k∗ = 2 (∗ ∈ {x, y}) for MIRFlickr-25K; γ = 0.6, α = 40, β = 1, λ = 1 and k∗ = 2 for NUS-WIDE; and γ = 0.8, α = 40, β = 0.3, λ = 0.01, kx = 2 and ky = 0.2 for Wiki. The number of iterations is fixed at 40, 80 and 200 for MIRFlickr-25K, NUS-WIDE and Wiki, respectively. In the MAP evaluation, we set the number of retrieved points to 50. All experiments are repeated several times with random data partitions, and the averaged results are reported.

For a fair comparison, following previous work [38, 46], on MIRFlickr-25K and NUS-WIDE, 5,000 samples are randomly chosen for training and 2,000 samples are selected for testing. All data of Wiki are used in the experiments. For all compared non-deep methods, the nonlinear image features from the pre-trained AlexNet and the BoW text features are used as the original features.

4.4 Results and discussions

4.4.1 MAP results

We compare the proposed HNH with all baselines on the “Image-to-Text” and “Text-to-Image” search tasks with code lengths varying from 32 bits to 128 bits and summarize the MAP values in Table 2. From the results, we have the following observations.

  • On MIRFlickr-25K and NUS-WIDE, HNH outperforms all compared methods at every binary code length, demonstrating its superiority. In particular, our method achieves 3% - 5.6% and 2.3% - 9.3% improvements in the “Image-to-Text” task on MIRFlickr-25K and NUS-WIDE, respectively, while in the “Text-to-Image” task, HNH achieves average improvements of 4.3% and 3.6% over the best results of the compared baselines on the two datasets, respectively. It is also noticeable that our performance on Wiki is not as promising as on the other datasets; a potential reason is that this dataset has fewer training samples and more noise. Nevertheless, our method still achieves competitive performance compared to the baseline methods.

  • Most methods achieve better performance with relatively long code lengths, which indicates that longer hash codes can incorporate more useful information.

  • In many cases, deep models outperform shallow models even when the latter are fed deep features, which reflects that an end-to-end structure facilitates model training through back-propagation.

Table 2 The mAP results of all methods on three datasets

Notably, DJSRH also aims to construct a high-order similarity matrix by treating the original affinity matrix as a high-level feature matrix, so we further compare it with our proposed method in terms of MAP with different numbers of top returned points. The results are plotted in Figure 3. We can see that our method performs better than DJSRH in all cases, which verifies that our similarity construction scheme can better capture the unified intrinsic manifold structure of different modalities.

Fig. 3 The MAP of DJSRH and HNH with different numbers of top returned points with 128-bit codes on MIRFlickr-25K

4.4.2 Top-N precision curves

The top-N precision curves with 128-bit codes on the three datasets are plotted in Figure 4. From these results, we have the following observations. First, HNH maintains the best overall performance among both the traditional learning methods and the deep neural network based methods, consistent with the MAP results. In addition, on the NUS-WIDE dataset, our method achieves better precision even with more retrieved points (i.e., 5,000) than other methods with fewer retrieved points, which is very important in retrieval tasks. All these results further verify the superiority of our learning strategy on unsupervised cross-modal retrieval tasks.

Fig. 4 The top-N precision curves of various models with 128-bit codes on three datasets

4.4.3 Ablation study

The proposed HNH improves retrieval performance mainly by constructing the high-order affinity matrix and by introducing the common representation. To validate the effectiveness of these two strategies, we investigate two variants of HNH, i.e., ‘HNH-1’ and ‘HNH-2’. Specifically, ‘HNH-1’ denotes the variant that does not consider nonlocal-view relationships in the similarity matrix construction and only adopts the low-level affinity matrix S = γAx + (1 − γ)Ay as the supervision. ‘HNH-2’ is obtained by removing the common representation U from the proposed HNH; its final objective function is \(\mathcal {J} = \alpha \|\widetilde {\mathbf {S}}- \mathbf {B}_{\mathbf {x}}^{T} \mathbf {B}_{\mathbf {x}}{\|}_{F}^{2} + \beta \|\widetilde {\mathbf {S}}- \mathbf {B}_{\mathbf {y}}^{T} \mathbf {B}_{\mathbf {y}}{\|}_{F}^{2} + \lambda \|\widetilde {\mathbf {S}}- \mathbf {B}_{\mathbf {x}}^{T} \mathbf {B}_{\mathbf {y}}{\|}_{F}^{2}\). We conduct experiments on the MIRFlickr-25K dataset and measure the corresponding performance with MAP. The results are reported in Table 3. It can be seen that removing either component decreases the performance of the proposed method, which means that both components contribute to improving retrieval accuracy. In addition, both variants still perform better than all of the compared methods in the previous experiments, demonstrating the superiority of our method.

Table 3 The ablation comparison among different variants of HNH on MIRFlickr-25K

4.4.4 Parameter sensitivity analysis

Several parameters may influence the performance of the proposed HNH. To analyse this, we conduct experiments on MIRFlickr-25K. The MAP curves for 128-bit codes obtained by varying α, β, λ and γ are plotted in Figure 5. Specifically, the MAP value increases as α grows from 1 to 20 and then remains stable. The MAP value decreases rapidly when β is larger than 5. As for λ, our method achieves promising results when λ lies in [0.01, 1] and declines afterwards. γ controls the weights of the similarity matrices of the two modalities; our method achieves the best overall performance when γ is larger than 0.6. In addition, when γ = 1 and γ = 0, i.e., when only \(\widetilde {\mathbf {A}}_{\mathbf {x}}\) or only \(\widetilde {\mathbf {A}}_{\mathbf {y}}\) is used, the performance is inferior to that of our integrated method but still better than the other compared methods, which demonstrates the effectiveness of the proposed HNH and shows that integrating information from different modalities during similarity construction is valid.

Fig. 5 Sensitivity analysis of parameters α, β, λ and γ with 128-bit codes on MIRFlickr-25K

4.4.5 Visualization

In Figure 6, we present examples of the top-10 retrieval results on MIRFlickr-25K. Correctly retrieved data is marked in green, while incorrect results are marked in red. From the results, we can see that, given a query, the proposed HNH returns relevant data sharing the same label with high probability. Although a few irrelevant results are returned (e.g., ‘lion’ for the first query), they appear visually similar to the correctly retrieved data.

Fig. 6 Example retrieval results

5 Conclusion

In this paper, we propose a novel deep unsupervised method, i.e., High-order Nonlocal Hashing (HNH), for cross-modal retrieval. We focus on capturing high-order supervisory signals to facilitate searching across modalities. To this end, apart from capturing local similarity information, the relation between samples from an overall point of view is also taken into consideration. On this basis, we further leverage the complementary advantages of multi-modal data to build a unified similarity matrix that carries comprehensive semantics. In addition, a common space representation is introduced to establish connections between different modalities, which helps to eliminate the modality discrepancy. Finally, by preserving similarity among the image, text and common space representations, we learn high-quality binary codes and effective hash functions. Extensive experiments demonstrate the superiority of the proposed HNH in cross-modal retrieval tasks.