1 Introduction

Cross-modal retrieval aims to retrieve relevant items from one modality using queries from another, e.g. retrieving images by text, and plays an essential role in multimedia search and recommendation. The ever-increasing volume of multi-modal data raises concerns about search efficiency. Hashing has become a popular indexing solution because Hamming descriptors significantly reduce index storage and accelerate distance computation with fast XOR operations (Li et al., 2020; Song et al., 2020; Liu et al., 2019; Zhang et al., 2023b). Existing efforts on cross-modal hashing (CMH) can be roughly categorized into unsupervised and supervised methods. Unsupervised methods (Yu et al., 2021a; Su et al., 2019; Hu et al., 2020; Tu et al., 2023) exploit natural multi-modal co-occurrence for hashing. In contrast, supervised methods (Chen et al., 2021; Cao et al., 2018; Jiang and Li, 2017; Zhang et al., 2023a) can leverage full supervision, e.g. ground-truth labels, to better preserve semantic information in hash codes. While supervised methods usually show impressive performance, they are less favored in real-world applications due to the expensive annotation cost. Therefore, we focus on the unsupervised methodology instead.

Fig. 1

Semantic alignment paradigms for transformer-based unsupervised cross-modal hashing. a Handshaking means aligning global representations of class tokens (i.e., [CLS]). b Hugging further exploits fine-grained alignment using content tokens. Fine-grained alignment serves as an auxiliary task to bridge the modality gap and can effectively improve global hash code generation without extra test-time overhead. Additionally, we further extend hugging to produce local quantization codes for efficient reranking, which is a bonus from fine-grained alignment

The performance of unsupervised cross-modal hashing (UCMH) is inherently bounded by the quality of multimedia understanding. Although deep neural networks have made remarkable progress in hashing, the advances have yet to be fully exploited. State-of-the-art approaches (Wang et al., 2020b; Hu et al., 2020; Su et al., 2019; Yu et al., 2021a; Zhu et al., 2023) mainly adopted classic convolutional neural networks (CNNs), e.g. VGGNet (Simonyan and Zisserman, 2015) and AlexNet (Krizhevsky et al., 2012), to extract visual features and used multilayer perceptrons (MLPs) to encode text information. These designs are sub-optimal for capturing semantic information from the visual and language modalities, and they also suffer from limited generalizability. To improve UCMH and keep pace with the development of deep learning, one promising direction is to take advantage of transformers (Vaswani et al., 2017).

In the past few years, transformers have achieved inspiring successes in natural language processing (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019) and computer vision (Dosovitskiy et al., 2021; Carion et al., 2020; Liu et al., 2021b), sparking explorations toward better content understanding and semantic extraction. Pre-trained on large-scale corpora (Krizhevsky et al., 2012; Zhu et al., 2015), transformers can serve as versatile experts that are effective and generalizable for various downstream tasks (Shin et al., 2022). Recent progress in transformer-based image (Li et al., 2022; Lu et al., 2021; Dubey et al., 2022) and video hashing (Li et al., 2021b; He et al., 2021; Wang et al., 2023) has verified the efficacy of transformers in hashing. Nevertheless, transformer-based cross-modal hashing remains under-explored. Although Yu et al. (2022) recently proposed a CLIP-based (Radford et al., 2021) approach with impressive results, its success is mainly attributed to off-the-shelf, well-aligned transformers. Differently, in this paper, we pursue a general solution with unaligned transformers.

Pre-trained transformers provide solid semantic extraction for each modality, but UCMH remains non-trivial. The main challenge is to bridge heterogeneous modalities so that the hash codes can be well aligned. Analogous to existing CNN-MLP-based UCMH models, we can train transformer-based models to produce hash codes via the global representation tokens (i.e., [CLS]). A simple way to learn is to align the global tokens using the objectives of existing UCMH methods. We liken this global alignment strategy to handshaking, as shown in Fig. 1a. In practice, handshaking is effective as expected but can be improved to further reduce the modality gap. Note that the transformer is a sequential architecture that arranges inputs as token sequences. It naturally provides a set of content tokens (e.g. words of a text or patches of an image) with fine-grained and structural semantics, which can capture heterogeneous modality knowledge but is usually overlooked.

To enhance transformer-based UCMH learning, we present a multi-granularity alignment framework dubbed hugging, as illustrated in Fig. 1b. Besides global alignment using hashing representations from [CLS] tokens, we further develop fine-grained alignment based on the content tokens. Particularly, we construct another shared latent space with semantic structure via a GhostVLAD (Zhong et al., 2018) module. Content tokens in this space are softly aggregated into a series of parameterized clusters, each representing a latent topic or semantic concept. Local (i.e., cluster-wise) contrastive (Chen et al., 2020a; He et al., 2020) alignment serves as an auxiliary objective for model training without requiring external supervision or hints (e.g. object regions in a picture or parsed components in a text). In contrast with handshaking, hugging enables synergy between global and fine-grained alignment during training, providing effective regularization to enhance the cross-modal consistency of hash codes. Interestingly, we find it improves retrieval performance and transferability, even though the fine-grained parts do not participate in the forward pass of the global branch at inference. In other words, hugging brings no extra inference overhead to global hash code generation. We provide in-depth experimental analyses of the effectiveness of fine-grained alignment in Sect. 4.3.2.

Moreover, we extend the hugging framework to make the best of fine-grained representations. Note that in the above version of hugging, fine-grained alignment is discarded in inference to maintain the efficiency of hash-based retrieval, which is a major limitation on the usage of fine-grained representations. We left the solution as future work in our preliminary work (Wang et al., 2022b). Fortunately, we now find an effective quantization-based solution that puts fine-grained representations to use while preserving efficient retrieval. To be specific, for each locally aggregated cluster, we incorporate a learnable optimized quantization module to compress continuous embeddings while retaining maximal semantics. We adopt asymmetric-quantized contrastive learning to align quantized representations, achieving fine-grained alignment and quantization learning with one objective. We call this quantization-based adaptation hugging\(^+\) to avoid naming confusion. Despite the lossy compression of fine-grained semantics, hugging\(^+\) still keeps the benefit of enhancing global hash codes. Additionally, it allows for generating fine-grained quantization codes during inference, which can be taken as another bonus to improve retrieval performance. In particular, we adopt the common practice of appending a reranking stage (Zhong et al., 2017; Ye et al., 2022) to the retrieval pipeline, in which we first retrieve a moderate subset of relevant items using global hash codes and then rerank the subset according to the fine-grained similarity computed with quantized representations. We highlight two merits of hugging\(^+\) regarding practical two-stage retrieval systems: (i) It enables more efficient reranking because quantization largely reduces the storage overhead for fine-grained representations and also accelerates similarity computation. (ii) It learns a holistic model that produces multi-granularity representations for different stages in just one pass, saving the effort of independent development for multiple retrieval stages. Besides keeping consistent with common practice, we choose to rerank with fine-grained representations rather than fuse multi-granularity similarity in one-stage retrieval for two reasons: (i) Decoupling global and fine-grained ranking improves retrieval efficiency, as fine-grained similarity computation iterates over a small relevant subset rather than the entire database. (ii) Two-stage retrieval is more robust than one-stage retrieval with the fusion-based strategy. Empirically, we find one-stage retrieval sensitive to the fusion weights of global and fine-grained similarities, whereas this weighting issue is naturally avoided in two-stage retrieval.

We instantiate hugging and hugging\(^+\) strategies by building HuggingHash and HuggingHash\(^+\) models, respectively, with a simple contrastive learning objective for global alignment. Experiments on text-image and text-video retrieval datasets show that HuggingHash and HuggingHash\(^+\) can outperform state-of-the-art UCMH methods integrating transformers in handshaking style. Besides, we adapt several state-of-the-art approaches with hugging and hugging\(^+\), demonstrating their flexibility and general effectiveness on UCMH when utilizing transformers.

Our contributions are summarized as follows.

  • We highlight the significance and provide a detailed study on transformer-based UCMH, which can serve as a research basis for this promising new direction.

  • Unlike straightforward ideas that only align global hash codes (i.e., handshaking in Fig. 1), we propose hugging with multi-granularity alignment for transformer-based UCMH. It shows a positive synergy for cross-modal alignment during training and improves retrieval performance and transferability of global hash codes without extra inference overhead.

  • We further extend hugging to hugging\(^+\) that combines a novel optimized quantization with fine-grained alignment. It not only keeps the benefit of improving global hash codes but also enables fine-grained quantization code generation as a gift. By reranking with fine-grained quantization codes, the inference benefits more from the hugging\(^+\) training strategy while still enjoying high retrieval efficiency.

  • We conduct extensive experiments on text-image and text-video retrieval datasets, showing that hugging and hugging\(^+\) help to outperform state-of-the-art baselines with handshaking. Moreover, we demonstrate the efficacy and flexibility of hugging and hugging\(^+\) when applied to existing UCMH methods that utilize transformers.

Extension Notes Compared with our preliminary work (Wang et al., 2022b), this paper improves and extends the study in three aspects. (i) Technical extension. First, we further investigate how fine-grained alignment promotes global hash code learning. The details can be found in Sect. 4.3.2, including the comparison of handshaking and hugging in three aspects: training dynamics, visualization and clustering analysis of the fine-grained latent space, and attention visualization. We believe they provide readers with a deeper understanding of hugging’s efficacy. Second, hugging drops the fine-grained parts in inference and hence falls short of leveraging fine-grained representations. This limitation was unresolved in our preliminary paper. In this paper, we provide an effective solution by extending hugging to hugging\(^+\), where we first integrate quantization learning with fine-grained alignment in training and then take fine-grained quantization codes as a bonus for efficient reranking in inference. Our new solution obtains a better tradeoff between retrieval efficacy and efficiency. Third, we contribute a new approach to fine-grained quantization by transforming the classical solution of the orthogonal matrix in Optimized Product Quantization (OPQ) (Ge et al., 2013) into a series of learnable Householder matrices. This integration seamlessly incorporates OPQ into the deep learning pipeline, enhancing fine-grained semantic preservation and quantization quality. This contribution can benefit various existing quantization approaches. (ii) Experimental extension. We include new results on one text-image retrieval dataset (Chua et al., 2009) and two text-video retrieval datasets (Xu et al., 2016; Chen and Dolan, 2011). We include two state-of-the-art open-sourced approaches (Zhu et al., 2023; Tu et al., 2023) in the investigation of transformer-based UCMH. We also conduct qualitative analysis and extensive ablation studies to justify the efficacy of our design. (iii) Survey extension. We update the survey of related work and add more discussion on the relations and differences to the latest papers.

2 Related Work

2.1 Hashing for Fast Visual Search

Hashing aims to transform high-dimensional data into compact binary codes while preserving semantic information, and has been extensively studied in visual search due to fast retrieval speed and low storage cost (Wang et al., 2016, 2018). According to the strategy for distance (or similarity) computation, it can be subdivided into binary hashing and quantization. Specifically, binary hashing (Datar et al., 2004) transforms continuous embeddings into the Hamming space such that distances can be quickly computed with bitwise operators. Quantization (Jégou et al., 2011) divides the embedding space into disjoint clusters and approximates each data point by the nearest centroid. By pre-computing the inter-centroid distances in a lookup table, the search speed can be greatly accelerated.

To preserve semantic information, traditional binary hashing (Datar et al., 2004; Weiss et al., 2008; Heo et al., 2012; Gong et al., 2013) and quantization (Ge et al., 2013; Babenko and Lempitsky, 2014; Zhang et al., 2014; Kalantidis and Avrithis, 2014; Martinez et al., 2016) approaches adopted various heuristic strategies that are sensitive to the statistical property of application data and inflexible to adapt to new data. More effectively, deep binary hashing (Liong et al., 2015; Zhu et al., 2016; Liu et al., 2019) and quantization (Cao et al., 2017; Yu et al., 2020) approaches jointly optimized deep feature extraction and hashing (or quantization) in an end-to-end fashion, which have been shown to outperform traditional approaches in image retrieval (Shen et al., 2019; Zhang et al., 2019; Wang et al., 2021; Cui et al., 2021; Sun et al., 2022), video retrieval (Li et al., 2021b, 2022a; Zeng et al., 2022; Wang et al., 2023), multi-modal retrieval (Zheng et al., 2020; Tan et al., 2022), and cross-modal retrieval (Chen et al., 2018; Sun et al., 2019; Shen et al., 2021; Li et al., 2021; Tu et al., 2022).

From a practical perspective, we present a hybrid hashing framework that combines binary hashing and quantization for efficient two-stage retrieval. Here we discuss the relations to existing work from different aspects and highlight the insights behind our design. (i) Shi and Chung (2021) proposed a two-stage inference pipeline similar to our hugging\(^+\), learning binary hashing and quantization at once for preliminary ranking and reranking, respectively. Nevertheless, their approach was not designed for transformer-based hashing. Besides, both binary hashing and quantization in their method are built on the same global features, which tends to homogenize the two stages and thereby limits the performance gain. Differently, hugging\(^+\) develops global hashing and fine-grained quantization, leveraging heterogeneous, coarse-to-fine semantics to maximize the gain. In addition, the global and fine-grained components mutually promote each other in hugging-style training, leading to a win-win situation in inference. (ii) The fine-grained quantization in hugging\(^+\) follows the common practice in deep quantization (Klein and Wolf, 2019; Yu et al., 2020; Jang and Cho, 2021; Wang et al., 2022a) that uses a differentiable trick (Chen et al., 2020b) to relax the optimization of product quantization (PQ) (Jégou et al., 2011), making the whole model end-to-end back-propagable. Furthermore, we revisit optimized product quantization (OPQ) (Ge et al., 2013) and introduce an isometric matrix to minimize quantization distortion. Different from OPQ, which solves the matrix by SVD and therefore cannot be trained end to end, we propose a differentiable solution based on the Householder transformation, enabling end-to-end deep learning.

2.2 Unsupervised Cross-Modal Hashing (UCMH)

Traditional UCMH approaches learned to transform hand-crafted features (Lowe, 2004) into binary codes by solving linear problems, e.g. matrix factorization (Zhou et al., 2014; Ding et al., 2016; Hu et al., 2019) and spectral decomposition (Kumar and Udupa, 2011; Song et al., 2013). The shallow features and linear solutions limited their performance and scalability. In contrast, by leveraging deep neural networks (DNNs), deep UCMH approaches can capture richer semantic information and generate better hash codes. Early deep approaches (Hu et al., 2019; Wu et al., 2018; Hoang et al., 2020) replaced hand-crafted features with deep features and applied linear solutions as in some shallow approaches (Kumar and Udupa, 2011; Zhu et al., 2013; Song et al., 2013; Hu et al., 2019). To better estimate the pairwise similarity that guides hash learning during training, later deep approaches (Su et al., 2019; Liu et al., 2020; Wang et al., 2020b; Yang et al., 2020; Zhang et al., 2022; Yu et al., 2021a; Zhang et al., 2021; Zhu et al., 2023; Tu et al., 2023) delved into fusing multi-modal affinities to estimate a precise similarity matrix. Besides, some recent deep approaches tried to narrow the modality gap by adversarial learning (Zhang et al., 2018; Li et al., 2019; Zhang and Peng, 2020), knowledge distillation (Hu et al., 2020; Li and Wang, 2021), or self-supervised learning (Wang et al., 2020, 2021; Hoang et al., 2023; Mikriukov et al., 2022), showing promising results.

Note that existing deep approaches mainly used the classic AlexNet (Krizhevsky et al., 2012) or VGGNets (Simonyan and Zisserman, 2015) to extract visual features and used MLPs to encode text information, which fell short in semantic extraction and limited the quality of hash representations. Instead, our paper highlights the significance of transformers for UCMH and conducts a detailed study on transformer-based UCMH. We examine the efficacy of transformers for UCMH with five representative, open-sourced approaches (Su et al., 2019; Yang et al., 2020; Yu et al., 2021a; Zhu et al., 2023; Tu et al., 2023). More importantly, we also propose effective multi-granularity hash learning strategies that provide new insights into transformer-based UCMH.

2.3 Transformers for Multimedia Retrieval

Recently, transformers (Vaswani et al., 2017) have made remarkable progress in CV (Dosovitskiy et al., 2021; Carion et al., 2020; Liu et al., 2021b) and NLP (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019) tasks, triggering the surge toward better multimedia understanding. In cross-modal retrieval, the potential of transformers has been widely explored (Shin et al., 2022). Single-stream retrieval methods (Lu et al., 2019; Song and Soleymani, 2019; Li et al., 2020a; Gao et al., 2020; Yu et al., 2021b; Bao et al., 2022b; Radenovic et al., 2023) designed unified models with full cross-modal interaction. Despite their superior performance, the quadratic complexity of pairwise interaction limited their retrieval efficiency. Dual-stream methods developed separate encoders for modality-specific representations, where the core is to align representations across modalities. Most dual-stream methods (Liu et al., 2019b; Gabeur et al., 2020; Radford et al., 2021; Liu et al., 2021a; Yang et al., 2021; Patrick et al., 2021; Bain et al., 2021; Luo et al., 2022; An et al., 2023; Liu et al., 2023; Lin et al., 2023) globally aligned the aggregation tokens (e.g. [CLS]) with metric learning (Liu, 2009) or contrastive learning (Chen et al., 2020a), where negative sampling strategies have been extensively investigated to enhance representation learning. To reduce the modality gap, many recent works further excavated fine-grained semantics from content tokens (e.g. text words or image patches) and designed various fine-grained interaction strategies, augmenting cross-modal learning and relevance estimation. For instance, Messina et al. (2021); Yao et al. (2022); Wu et al. (2023) measured cross-modal token-wise relevance and aggregated the fine-grained scores by pooling strategies. Wang et al. (2021b) developed a global–local fusion approach that combines global and local similarities for alignment and retrieval. Fang et al. (2023) proposed an uncertainty-adaptive approach that models cross-modal interaction as a distribution matching procedure. Jin et al. (2023b) designed a text-frame attention module to enhance cross-modal interaction and proposed a novel generative retrieval approach based on the diffusion mechanism. Zala et al. (2023); Jin et al. (2023a) adopted a hierarchical perspective to model cross-modal interaction at different levels, such as entity, action, and event, where Jin et al. (2023a) also introduced a novel Banzhaf Interaction mechanism to refine cross-modal correspondence. In general, these fine-grained approaches help to achieve better performance but sacrifice retrieval efficiency, because their test-time efficacy essentially relies on complex similarity computation.

In the specific field of hashing-based retrieval, recent advances in image (Lu et al., 2021; Li et al., 2022; Dubey et al., 2022; Chen et al., 2022) and video hashing (Li et al., 2021b; He et al., 2021; Zeng et al., 2022; Wang et al., 2023) have also revealed the effectiveness of uni-modal transformer hashing. In the cross-modal scenario, CLIP-based (Radford et al., 2021) hashing approaches have recently emerged for text-image and text-video retrieval. SACH (Yu et al., 2022) and DCMHT (Tu et al., 2022) are two recent approaches for supervised and unsupervised text-image hashing. Despite impressive results, their successes mainly relied on off-the-shelf, well-aligned transformers. Besides, Tu et al. (2022) further leveraged ground-truth labels to guide hash learning. CLIP4Hashing (Zhuo et al., 2022) adapted CLIP for text-video UCMH, where the text and video encoders were only aligned via global hash representations. In contrast, we propose effective and general solutions for UCMH with unaligned transformers. By adopting multi-granularity alignment in training, we effectively enhance cross-modal consistency, producing better hash representations than CLIP4Hashing.

3 Methodology

Fig. 2

The pipeline of HuggingHash. During training, it integrates global alignment based on hash codes and fine-grained alignment with a GhostVLAD (Zhong et al., 2018) module. We design embedding-based (Sect. 3.3.2) and quantization-based (Sect. 3.3.3) objectives as alternatives for fine-grained alignment. The synergy of global and fine-grained alignment shows a positive effect on both components during learning. In inference, we obtain hash representations for retrieval via the global branch. Quantization-based fine-grained alignment further enables fine-grained quantization code generation as a bonus, which can be exploited to refine retrieval results by efficient reranking. Best viewed in color

3.1 Problem Formulation and Model Overview

Given an unlabeled training set \(\mathcal {D}\) of \(N_\mathcal {D}\) naturally co-occurring dual-modal (e.g. text-image or text-video) pairs, our goal is to learn a pair of modality-specific hash encoders that encode texts and images (or videos) as L-bit semantic-preserving binary codes for cross-modal retrieval.

To this end, we construct a HuggingHash model using the hugging framework, as illustrated in Fig. 2. Specifically, given a training pair, we first preprocess the text and the image (or video) as the input tokens for transformers. Then, we extract features with transformers and get the output embeddings from the [CLS] and content tokens (Sect. 3.2). Next, we forward the [CLS] tokens to the hash modules and produce text and image (or video) hash code vectors. Meanwhile, we project the embeddings of content tokens to a cross-modal latent space. Finally, we conduct multi-granularity alignment, including global alignment based on the hash codes (Sect. 3.3.1) and fine-grained alignment based on latent local representations (Sect. 3.3.2), to bridge text and visual modalities. We further design learnable optimized quantization collaborating with the fine-grained alignment to make the best of fine-grained representations (Sect. 3.3.3). In inference, we obtain hash codes through the global branch and enable the generation of fine-grained quantization codes (Sect. 3.4.1). We take global hash codes for preliminary ranking (Sect. 3.4.2) and also provide optional reranking with fine-grained quantization codes (Sect. 3.4.3).

3.2 Base Encoders

3.2.1 Text Encoder

For each text sample, we tokenize it into word pieces and construct content tokens. Then we append a [CLS] token and form the text input. Denote the token sequence of the ith text in a mini-batch \(\mathcal {B}\) by \(\mathcal {T}_i=\{t_{i,\texttt {[CLS]}}, t_{i,1}, t_{i,2}, \cdots ,t_{i,K^\text {t}_i}\}\), where \(K^\text {t}_i\) is the number of content tokens for the ith text. We pad the sequence to a fixed length, add position embeddings, and forward it to the BERT (Devlin et al., 2019) encoder \(f^\text {t}(\cdot ; \varvec{\theta }_f^\text {t})\) to compute the token embeddings, namely \(\varvec{x}^\text {t}_{i,\texttt {[CLS]}}\in \mathbb {R}^{D^\text {t}}\) and \(\{\varvec{x}^\text {t}_{i,k}\}_{k=1}^{K^\text {t}_i}\subset \mathbb {R}^{D^\text {t}}\). We can formulate the whole process by

$$\begin{aligned} \varvec{x}_{i,k}^\text {t}=f^\text {t}(\mathcal {T}_i; \varvec{\theta }_f^\text {t})_k,\, k=\texttt {[CLS]},{1,2,\cdots ,K^\text {t}_i}. \end{aligned}$$
(1)

3.2.2 Image Encoder

For each image sample, we use the ViT (Dosovitskiy et al., 2021) pre-processor to patchify it into a fixed number (e.g. 196) of content tokens and add a [CLS] token to form the image input. We denote the token sequence of the ith image in a mini-batch \(\mathcal {B}\) by \(\mathcal {V}_i=\{v_{i,\texttt {[CLS]}}, v_{i,1}, v_{i,2}, \cdots ,v_{i,K^\text {v}}\}\), where \(K^\text {v}\) is the number of content tokens. We then add position embeddings and forward them to the ViT \(f^\text {v}(\cdot ; \varvec{\theta }_f^\text {v})\) to compute embeddings, namely \(\varvec{x}^\text {v}_{i,\texttt {[CLS]}}\in \mathbb {R}^{D^\text {v}}\) and \(\{\varvec{x}^\text {v}_{i,k}\}_{k=1}^{K^\text {v}}\subset \mathbb {R}^{D^\text {v}}\). Analogous to the text side, we summarize the image feature extraction process by

$$\begin{aligned} \varvec{x}_{i,k}^\text {v}=f^\text {v}(\mathcal {V}_i; \varvec{\theta }_f^\text {v})_k,\, k=\texttt {[CLS]},{1,2,\cdots ,K^\text {v}}. \end{aligned}$$
(2)

3.2.3 Video Encoder

In addition to text-image retrieval, HuggingHash also supports hash-based text-video retrieval tasks. To be concise, we reuse the notations in Sect. 3.2.2 and assign analogous definitions to them. Given the ith video in a mini-batch \(\mathcal {B}\), we first adopt ViT as the spatial encoder to extract frame-level features \(\varvec{v}_{i,1}, \varvec{v}_{i,2}, \cdots ,\varvec{v}_{i,K^\text {v}}\), where \(K^\text {v}\) is the number of sampled frames. We append a temporal [CLS] token to the frame features and add temporal position embeddings, forming the input token sequence \(\mathcal {V}_i=\{\varvec{v}_{i,\texttt {[CLS]}}, \varvec{v}_{i,1}, \varvec{v}_{i,2}, \cdots ,\varvec{v}_{i,K^\text {v}}\}\). Then, we build a lightweight self-attention layer upon \(\mathcal {V}_i\) to capture temporal semantics, producing output embeddings \(\varvec{x}^\text {v}_{i,\texttt {[CLS]}}\in \mathbb {R}^{D^\text {v}}\) and \(\{\varvec{x}^\text {v}_{i,k}\}_{k=1}^{K^\text {v}}\subset \mathbb {R}^{D^\text {v}}\). We denote the whole video encoder by \(f^\text {v}(\cdot ; \varvec{\theta }_f^\text {v})\) and formulate the encoding process by

$$\begin{aligned} \varvec{x}_{i,k}^\text {v}=f^\text {v}(\mathcal {V}_i; \varvec{\theta }_f^\text {v})_k,\, k=\texttt {[CLS]},{1,2,\cdots ,K^\text {v}}. \end{aligned}$$
(3)
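
To make the base encoders concrete, the following is a minimal sketch (not the authors' released code) of how the [CLS] and content-token embeddings in Eqs. (1) and (2) can be obtained with off-the-shelf Hugging Face checkpoints. The checkpoint identifiers mirror the defaults in Sect. 4.1.2 (we assume 'google/vit-base-patch16-224' as the hub counterpart of 'vit-base-patch16-224'); the CLIP-initialized video encoder with its temporal attention layer is omitted for brevity.

```python
# Minimal sketch: token embeddings from pre-trained transformers, as in Eqs. (1)-(2).
import torch
from transformers import BertModel, BertTokenizer, ViTModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

@torch.no_grad()
def encode_text(texts):
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=128, return_tensors="pt")
    out = text_encoder(**batch).last_hidden_state        # (B, max_len, 768)
    x_cls, x_content = out[:, 0], out[:, 1:]             # [CLS] and word-piece tokens
    mask = batch["attention_mask"][:, 1:]                # content positions ([SEP] kept)
    return x_cls, x_content, mask

@torch.no_grad()
def encode_image(pixel_values):                          # (B, 3, 224, 224)
    out = image_encoder(pixel_values=pixel_values).last_hidden_state
    return out[:, 0], out[:, 1:]                         # [CLS] and 196 patch tokens
```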

3.3 Hugging: Multi-granularity Alignment with Transformers for Hash Learning

Transformers provide better multimedia understanding to facilitate hash learning, but aligning heterogeneous knowledge between different modalities is challenging. We present hugging, a multi-granularity alignment framework to tackle the challenge. In addition to the global alignment for hash codes, we design fine-grained alignment using GhostVLAD (Zhong et al., 2018) with content tokens. The global alignment provides direct guidance on hash code learning, while the fine-grained alignment supplies effective regularization to reduce the modality gap.

3.3.1 Global Alignment

We apply global alignment to the hash codes. First, we project and convert the output embeddings of the aggregation tokens (i.e., [CLS]) into binary hash codes:

$$\begin{aligned} \varvec{h}^\text {t}_i= & {} \tanh \left( \alpha \cdot \phi ^\text {t}(\varvec{x}^\text {t}_{i,\texttt {[CLS]}})\right) \in [-1,+1]^L, \end{aligned}$$
(4)
$$\begin{aligned} \varvec{h}^\text {v}_i= & {} \tanh \left( \alpha \cdot \phi ^\text {v}(\varvec{x}^\text {v}_{i,\texttt {[CLS]}})\right) \in [-1,+1]^L,\end{aligned}$$
(5)
$$\begin{aligned} \varvec{b}^\text {t}_i= & {} \varvec{h}^\text {t}_i-\text {sg}\left( \varvec{h}^\text {t}_i-\text {sgn}(\varvec{h}^\text {t}_i)\right) \in \{-1,+1\}^L,\end{aligned}$$
(6)
$$\begin{aligned} \varvec{b}^\text {v}_i= & {} \varvec{h}^\text {v}_i-\text {sg}\left( \varvec{h}^\text {v}_i-\text {sgn}(\varvec{h}^\text {v}_i)\right) \in \{-1,+1\}^L, \end{aligned}$$
(7)

where \(\phi ^\text {t}\) and \(\phi ^\text {v}\) are modality-specific projections that transform \(D^\text {t}\)-dimensional text features and \(D^\text {v}\)-dimensional visual features into an L-dimensional shared latent space, respectively. \(\alpha >0\) is a factor controlling the smoothness of the \(\tanh \) outputs. \(\varvec{h}^\text {t}_i\) and \(\varvec{h}^\text {v}_i\) are smoothed hash codes. \(\text {sgn}(\cdot )\) is the element-wise sign function that outputs \(+1\) for positive input and \(-1\) otherwise. \(\text {sg}(\cdot )\) is the stop-gradient operator, which acts as the identity function in the forward pass but blocks gradients for the variables inside it during the backward pass. Equations (6) and (7) allow us to pass the gradient straight through (Bengio et al., 2013) the binary hash codes, i.e., \(\varvec{b}^\text {t}_i\) and \(\varvec{b}^\text {v}_i\). In HuggingHash, we adopt the contrastive learning loss (van den Oord et al., 2018; He et al., 2020; Chen et al., 2020a) for global alignment, namely

$$\begin{aligned} \ell _{\text {GA}, i}^{\mathbf {\text {tv}}}= & {} -\log \frac{\exp (M_{ii}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert }\exp (M_{ij}/\tau )}, \end{aligned}$$
(8)
$$\begin{aligned} \ell _{\text {GA}, i}^{\mathbf {\text {vt}}}= & {} -\log \frac{\exp (M_{ii}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert }\exp (M_{ji}/\tau )}, \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {L}_\text {GA}= & {} \frac{1}{2\vert \mathcal {B}\vert }\sum _{i=1}^{\vert \mathcal {B}\vert }\left( \ell _{\text {GA}, i}^{\mathbf {\text {tv}}}+\ell _{\text {GA}, i}^{\mathbf {\text {vt}}}\right) , \end{aligned}$$
(10)

where \(\mathcal {B}\) denotes a mini-batch. \(M_{ij}=\cos (\varvec{b}^\text {t}_i,\varvec{b}^\text {v}_j)\). \(\tau >0\) is the temperature hyper-parameter.
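
As a minimal PyTorch sketch of Sect. 3.3.1, the snippet below implements the smoothed codes and straight-through binarization of Eqs. (4)-(7) and the symmetric contrastive loss of Eqs. (8)-(10); the assumption that \(\phi^\text{t}\) and \(\phi^\text{v}\) are single linear layers and the illustrative dimensions are ours.

```python
import torch
import torch.nn.functional as F

def hash_codes(x_cls, proj, alpha=0.5):
    """Eqs. (4)-(7): smoothed codes h and straight-through binary codes b."""
    h = torch.tanh(alpha * proj(x_cls))            # in [-1, 1]^L
    b = h - (h - torch.sign(h)).detach()           # sign forward, identity gradient
    return h, b

def contrastive_alignment(b_t, b_v, tau=0.2):
    """Eqs. (8)-(10): symmetric InfoNCE over cosine similarities M_ij."""
    M = F.normalize(b_t, dim=1) @ F.normalize(b_v, dim=1).T   # (B, B)
    labels = torch.arange(M.size(0), device=M.device)
    loss_tv = F.cross_entropy(M / tau, labels)     # text -> vision, Eq. (8)
    loss_vt = F.cross_entropy(M.T / tau, labels)   # vision -> text, Eq. (9)
    return 0.5 * (loss_tv + loss_vt)               # Eq. (10)

# Usage with assumed linear projections phi^t, phi^v and toy inputs:
L, B = 64, 32
phi_t, phi_v = torch.nn.Linear(768, L), torch.nn.Linear(768, L)
h_t, b_t = hash_codes(torch.randn(B, 768), phi_t)
h_v, b_v = hash_codes(torch.randn(B, 768), phi_v)
loss_ga = contrastive_alignment(b_t, b_v)
```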

3.3.2 Fine-Grained Alignment with Locally Aggregated Descriptors

We present a clustering-based strategy with GhostVLAD (Zhong et al., 2018) for fine-grained alignment. Our basic idea is to exploit concept-aware semantics and enable concept-aware alignment in the latent space. Specifically, we first project the output embeddings of the content tokens into a shared latent space, namely,

$$\begin{aligned} \varvec{z}^\text {t}_{i,k}= & {} \psi ^\text {t}(\varvec{x}^\text {t}_{i,k}),\quad k={1,2,\cdots ,K^\text {t}_i}, \end{aligned}$$
(11)
$$\begin{aligned} \varvec{z}^\text {v}_{i,k}= & {} \psi ^\text {v}(\varvec{x}^\text {v}_{i,k}),\quad k={1,2,\cdots ,K^\text {v}}, \end{aligned}$$
(12)

where \(\psi ^\text {t}\) and \(\psi ^\text {v}\) are the projections that transform \(D^\text {t}\)-dimensional text features and \(D^\text {v}\)-dimensional visual features into a D-dimensional fine-grained shared latent space, respectively. We denote the collections of latent embeddings for text and visual content tokens as \(\mathcal {Z}^\text {t}_i=\{\varvec{z}^\text {t}_{i,k}\}_{k=1}^{K^\text {t}_i}\) and \(\mathcal {Z}^\text {v}_i=\{\varvec{z}^\text {v}_{i,k}\}_{k=1}^{K^\text {v}}\), respectively.

Fig. 3

Embedding-based fine-grained alignment

Then, we use a GhostVLAD module to learn \(N_\text {c}+1\) cluster centroids, \(\left\{ \varvec{c}_0,{\varvec{c}_1 , \varvec{c}_2 , \cdots , \varvec{c}_{N_\text {c}}}\right\} \). In particular, we designate \(\varvec{c}_0\) as the “ghost” centroid to filter noise, e.g. uninformative words in a sentence and background features in an image or a video. Each of the other centroids is expected to represent a latent concept or attribute, such as color or scene, contributing to a partial description of the object. We forward \(\mathcal {Z}_i^\text {t}\) and \(\mathcal {Z}_j^\text {v}\) to the GhostVLAD module, where each token is viewed as a composite of different concepts and is softly assigned to all clusters. We use the assignment scores to estimate the relevance of a token to all concepts or attributes. For instance, the assignment score of the text token \(\varvec{z}_{i, k}^\text {t}\) w.r.t. the nth cluster is computed by

$$\begin{aligned} a^\text {t}_{i,k,n}=\frac{\exp (\text {BatchNorm}(\varvec{w}_n^\top \varvec{z}_{i, k}^\text {t}))}{\sum _{n'=0}^{N_\text {c}}\exp (\text {BatchNorm}(\varvec{w}_{n'}^\top \varvec{z}_{i, k}^\text {t}))}, \end{aligned}$$
(13)

where \(\text {BatchNorm}(\cdot )\) is batch normalization (Ioffe and Szegedy, 2015) and \(\varvec{W}=[\varvec{w}_0,\varvec{w}_1,\cdots ,\varvec{w}_{N_\text {c}}]\) is a trainable parameter matrix. Suppose the nth cluster is associated with the latent concept ‘animal’; then \(a^\text {t}_{i,k,n}\) indicates the relevance of the text token \(\varvec{z}^\text {t}_{i,k}\) to ‘animal’, e.g. 70%.

After clustering, we aggregate modality-wise residual embeddings at each cluster except the “ghost” cluster. For the nth cluster, we aggregate residual embeddings of \(\mathcal {Z}_i^\text {t}\) and \(\mathcal {Z}_j^\text {v}\) respectively by

$$\begin{aligned} \varvec{r}^\text {t}_{i,n}= & {} \sum _{k=1}^{K^\text {t}_i}a^\text {t}_{i,k,n}\cdot (\varvec{z}_{i, k}^\text {t}-\varvec{c}_n), \end{aligned}$$
(14)
$$\begin{aligned} \varvec{r}^\text {v}_{j,n}= & {} \sum _{k=1}^{K^\text {v}}a^\text {v}_{j,k,n}\cdot (\varvec{z}_{j, k}^\text {v}-\varvec{c}_n). \end{aligned}$$
(15)

In Eqs. (14) and (15), each residual embedding represents the token semantics conditioned on the concept associated with the cluster. For instance, \((\varvec{z}^\text {t}_{i,k}-\varvec{c}_n)\) may represent that, given the concept ‘animal’, the token is about a ‘cat’. Notably, assignment scores and residual embeddings are designed to capture different information, i.e., concept relevance and concept-aware semantics, respectively. So we decouple them by using independent parameters for the assigner and the centroids. Tokens with high relevance scores do not necessarily approach the centroid. The aggregated embedding can then depict the overall semantics of the whole token sequence w.r.t. the concept associated with the cluster. For example, \(\varvec{r}^{\text {t}}_{i,n}\) may indicate how relevant the text input \(\mathcal {Z}^\text {t}_i\) is to ‘animal’ and which ‘animal’ it describes.
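
The sketch below illustrates one possible implementation of the soft assignment in Eq. (13) and the residual aggregation in Eqs. (14) and (15); the single ghost cluster, the random centroid initialization, and the masking of padded text tokens are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class GhostVLAD(nn.Module):
    """Sketch of Eqs. (13)-(15): soft assignment over N_c active clusters plus
    one ghost cluster, followed by per-cluster residual aggregation."""
    def __init__(self, dim=128, n_clusters=7, n_ghost=1):
        super().__init__()
        self.n_ghost = n_ghost
        self.assigner = nn.Linear(dim, n_clusters + n_ghost, bias=False)  # rows of W
        self.bn = nn.BatchNorm1d(n_clusters + n_ghost)
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))       # c_1..c_Nc

    def forward(self, z, mask=None):
        # z: (B, K, D) latent content tokens; mask: (B, K) with 1 for valid tokens.
        B, K, D = z.shape
        logits = self.bn(self.assigner(z).reshape(B * K, -1)).reshape(B, K, -1)
        a = logits.softmax(dim=-1)                          # Eq. (13), incl. ghost
        if mask is not None:                                # drop padded text tokens
            a = a * mask.unsqueeze(-1).float()
        a = a[..., self.n_ghost:]                           # discard ghost assignments
        residuals = z.unsqueeze(2) - self.centroids.view(1, 1, -1, D)     # z - c_n
        return torch.einsum("bkn,bknd->bnd", a, residuals)                # Eqs. (14)-(15)

# vlad = GhostVLAD(); r_t = vlad(z_t, mask_t); r_v = vlad(z_v)
# Both modalities are forwarded through the same module, so the clusters are shared.
```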

Finally, we define fine-grained alignment as aligning aggregated local representations from text and vision cluster by cluster, and introduce the cluster-wise contrastive learning loss as

$$\begin{aligned} \ell _{\text {FA}(n), i}^{\mathbf {\text {tv}}}= & {} -\log \frac{\exp (m^n_{ii}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert }\exp (m^n_{ij}/\tau )}, \end{aligned}$$
(16)
$$\begin{aligned} \ell _{\text {FA}(n), i}^{\mathbf {\text {vt}}}= & {} -\log \frac{\exp (m^n_{ii}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert }\exp (m^n_{ji}/\tau )}, \end{aligned}$$
(17)
$$\begin{aligned} \mathcal {L}_{\text {FA}(n)}= & {} \frac{1}{2\vert \mathcal {B}\vert }\sum _{i=1}^{\vert \mathcal {B}\vert }\left( \ell _{\text {FA}(n), i}^{\mathbf {\text {tv}}}+\ell _{\text {FA}(n), i}^{\mathbf {\text {vt}}}\right) , \end{aligned}$$
(18)
$$\begin{aligned} \mathcal {L}_\text {FA}= & {} \frac{1}{N_c}\sum _{n=1}^{N_c}\mathcal {L}_{\text {FA}(n)}, \end{aligned}$$
(19)

where \(m^n_{ij}=\cos (\varvec{r}^\text {t}_{i,n}, \varvec{r}^\text {v}_{j,n})\) is the fine-grained similarity of \(\mathcal {Z}_i^\text {t}\) and \(\mathcal {Z}_j^\text {v}\) w.r.t. the nth cluster. We illustrate this embedding-based alignment in Fig. 3. As each aggregated representation is expected to capture both concept relevance (the distribution of assignment scores) and concept-aware semantics (the residual embedding), cluster-wise alignment serves as a strong signal to regularize cross-modal consistency concept by concept, thus fulfilling fine-grained alignment.
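
For completeness, a small sketch of the cluster-wise loss in Eqs. (16)-(19); it reuses the same InfoNCE form as the global objective and simply averages over the \(N_\text{c}\) active clusters.

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(r_t, r_v, tau=0.2):
    """Eqs. (16)-(19): InfoNCE on cluster-wise cosine similarities m^n_ij,
    averaged over the N_c active clusters. r_t, r_v: (B, N_c, D)."""
    B, n_c, _ = r_t.shape
    labels = torch.arange(B, device=r_t.device)
    loss = 0.0
    for n in range(n_c):
        m = F.normalize(r_t[:, n], dim=1) @ F.normalize(r_v[:, n], dim=1).T
        loss = loss + 0.5 * (F.cross_entropy(m / tau, labels)       # Eq. (16)
                             + F.cross_entropy(m.T / tau, labels))  # Eq. (17)
    return loss / n_c                                               # Eqs. (18)-(19)
```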

Fig. 4

The training and inference processes of quantization-based fine-grained alignment

Remark: Criteria for choosing the clustering algorithm The choice of clustering algorithm is crucial for fine-grained alignment. Our selection criteria include the following: (i) End-to-end optimization capability is mandatory, as we fundamentally train neural networks to achieve cross-modal alignment. Unfortunately, we find that most classic clustering algorithms, such as DBSCAN (Ester et al., 1996), do not meet this requirement. Other preferences, in decreasing order of importance, include: (ii) Interpretability in the latent space, such as latent topics for text and visual concepts for images. It is preferable to capture the correlation between samples and latent information. (iii) Adequate robustness to noise, since we often lack detailed fine-grained annotations, such as correspondences between text words and image regions, which necessitates noise reduction and key information extraction. (iv) Simplicity and efficiency, as slow or complex algorithms will hinder training efficiency. (v) The output of the clustering algorithm should also support concise and efficient fine-grained cross-modal alignment.

We opt for GhostVLAD because it meets all these criteria. Specifically, (i) GhostVLAD is derived from NetVLAD (Arandjelovic et al., 2016), itself a trainable neural network module successfully applied to tasks like place recognition (Arandjelovic et al., 2016), image retrieval (Humenberger et al., 2022), and face recognition (Zhong et al., 2018). (ii) GhostVLAD produces interpretable results. It learns a fixed number of cluster prototypes in the latent space, serving as indicators of fine-grained concepts. By adaptively aggregating residuals of each content token w.r.t. each cluster centroid, it provides representations w.r.t. different latent concepts that aid comprehensive sample descriptions. (iii) GhostVLAD employs an information aggregation process that reduces sensitivity to noise introduced by individual tokens. Besides, compared to NetVLAD, it introduces ghost (i.e., idle) clusters, enhancing noise filtering through end-to-end training. (iv) The design of GhostVLAD is GPU-friendly, consisting of common deep learning operators and only requiring a single pass for clustering, as opposed to multiple iterations in k-means. (v) Utilizing GhostVLAD’s output for cross-modal alignment is simple and efficient. The output comprises fixed-size representations w.r.t. the number of local clusters, irrespective of the number of content tokens. Alignment is achieved through one-to-one matching w.r.t. each local cluster, avoiding exhaustive cross-interaction between two embedding sets and thus promoting conciseness and efficiency.

3.3.3 Optimized Quantization Learning for Fine-Grained Representations

When the training is equipped with embedding-based fine-grained alignment, the fine-grained parts have to be discarded in inference to maintain the efficiency of hash-based retrieval. This falls short of leveraging fine-grained representations and was left unresolved in our preliminary work (Wang et al., 2022b). Here we provide a quantization-based solution to this problem, which aims to align fine-grained cross-modal correspondence and learn quantized representations jointly.

Suppose at the nth local head of fine-grained alignment, we have a D-dimensional continuous-valued embedding \(\varvec{r}_n\) to be quantized with M sub-codebooks, namely \(\varvec{D}^1_n,\varvec{D}^2_n,\cdots ,\varvec{D}^M_n\). The mth sub-codebook \(\varvec{D}^m_n\) consists of K sub-codewords \(\varvec{d}^m_{n,0},\varvec{d}^m_{n,1},\cdots ,\varvec{d}^m_{n,K-1}\in \mathbb {R}^d\). The problem of product quantization (Jégou et al., 2011) is given by

$$\begin{aligned} {\hat{\varvec{r}}}_n={\mathop {\arg \textrm{min}}_{\varvec{d}_n\in \varvec{D}^1_n\times \varvec{D}^2_n\times \cdots \times \varvec{D}^M_n}\,}\left\Vert \varvec{r}_n-\varvec{d}_n\right\Vert _2^2. \end{aligned}$$
(20)

Let \(D=Md\), where M and d are both positive integers. We divide \(\varvec{r}_n\) into M equal-length d-dimensional segments, i.e., \(\varvec{r}_n\equiv [\varvec{r}^1_n,\varvec{r}^2_n,\cdots ,\varvec{r}^M_n]\). Then, the original problem in Eq. (20) can be re-formulated into M independent sub-problems. For example, the mth sub-problem is defined as

$$\begin{aligned} {\hat{\varvec{r}}}^m_n={\mathop {\arg \textrm{min}}_{\varvec{d}^m_{n,k}\in \varvec{D}^m_n}\,} \left\Vert \varvec{r}^m_n-\varvec{d}^m_{n,k}\right\Vert _2^2. \end{aligned}$$
(21)

By imposing \(\left\Vert \varvec{r}_n^m\right\Vert _2=\left\Vert \varvec{d}_{n,k}^m\right\Vert _2\) for all codewords (e.g. via L2 normalization), Eq. (21) can be re-written as

$$\begin{aligned} {\hat{\varvec{r}}}^m_n={\mathop {\arg \textrm{max}}_{\varvec{d}^m_{n,k}\in \varvec{D}^m_n}\,} \left<\varvec{r}^m_n,\varvec{d}^m_{n,k}\right>, \end{aligned}$$
(22)

where \(\left<\cdot ,\cdot \right>\) denotes the inner product operator.

To enable end-to-end deep learning, we follow the common practice of deep quantization approaches (Klein and Wolf, 2019; Yu et al., 2020) and relax Eq. (22) with the softmax trick, producing

$$\begin{aligned} {\hat{\varvec{r}}}^m_n= & {} {{\sum _{k=0}^{K-1}p_{n,k}^m\varvec{d}^m_{n,k}}}, \end{aligned}$$
(23)
$$\begin{aligned} p^m_{n,k}= & {} \frac{\exp (\beta \cdot \langle \varvec{r}^m_n,\varvec{d}^m_{n,k}\rangle )}{{{\sum _{k'=0}^{K-1} \exp (\beta \cdot \langle \varvec{r}^m_n,\varvec{d}^m_{n,k'}\rangle )}}}. \end{aligned}$$
(24)

\(p_{n,k}^m\in [0,1]\) is the probability of selecting the kth codeword from \(\varvec{D}_n^m\). \(\beta >0\) is a scaling factor such that Eq. (23) approximates Eq. (22) when \(\beta \rightarrow +\infty \).
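
A minimal sketch of the soft quantizer in Eqs. (23) and (24) is given below; L2-normalizing both segments and codewords is one way to satisfy the equal-norm condition assumed before Eq. (22), and the tensor shapes are our own convention.

```python
import torch
import torch.nn.functional as F

def soft_quantize(r, codebooks, beta=1.0):
    """Eqs. (23)-(24): differentiable product quantization of local embeddings.
    r: (B, D) with D = M * d; codebooks: (M, K, d) trainable sub-codewords."""
    M, K, d = codebooks.shape
    segments = F.normalize(r.view(r.size(0), M, d), dim=-1)      # r^m_n
    codewords = F.normalize(codebooks, dim=-1)                   # d^m_{n,k}
    sims = torch.einsum("bmd,mkd->bmk", segments, codewords)     # <r^m, d^m_k>
    p = torch.softmax(beta * sims, dim=-1)                       # Eq. (24)
    r_hat = torch.einsum("bmk,mkd->bmd", p, codewords)           # Eq. (23)
    return r_hat.reshape(r.size(0), -1)                          # (B, D)

# e.g. codebooks = torch.nn.Parameter(torch.randn(4, 256, 32)) for D=128, M=4, K=256
```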

As pointed out by a classic approach, optimized product quantization (OPQ) (Ge et al., 2013), the way to divide \(\varvec{r}_n\) into \(\varvec{r}^1_n,\varvec{r}^2_n,\cdots ,\varvec{r}^M_n\) (i.e., sub-space partition) is important to the quantization quality. However, existing deep quantization practices have not considered this aspect, which may result in large quantization distortions. Inspired by OPQ, we further design optimized quantization learning that introduces an isometric rotation matrix \(\varvec{P}_n\in \mathbb {R}^{D\times D}\) to optimize the sub-space partition in Eq. (20), leading to

$$\begin{aligned} \begin{aligned} {\hat{\varvec{r}}}_n={\mathop {\arg \textrm{min}}_{\varvec{d}_n\in \varvec{D}^1_n\times \varvec{D}^2_n\times \cdots \times \varvec{D}^M_n}\,}\left\Vert \varvec{P}_n\varvec{r}_n-\varvec{d}_n\right\Vert _2^2,\\ \text {s.t.}\quad \varvec{P}_n^\top \varvec{P}_n=\varvec{I}. \end{aligned} \end{aligned}$$
(25)

\(\varvec{I}\) denotes the identity matrix.

Note that in OPQ, the problem of Eq. (25) is solved by SVD, which is not amenable to an end-to-end deep learning pipeline. Differently, we design a back-propagatable solution to Eq. (25). First, we set a series of trainable parameters \(\varvec{u}_{n,1},\varvec{u}_{n,2},\cdots ,\varvec{u}_{n,N_\text {h}}\) and transform them into orthogonal Householder matrices, namely

$$\begin{aligned} \varvec{H}_{n,h}= \varvec{I}-2\frac{\varvec{u}_{n,h}\varvec{u}_{n,h}^\top }{\left\Vert \varvec{u}_{n,h}\right\Vert _2^2},\ 1\le h\le N_\text {h}\le D. \end{aligned}$$
(26)

Then, we parameterize \(\varvec{P}_n\) by a product of \(N_\text {h}\) Householder matrices, giving

$$\begin{aligned} \varvec{P}_n = \prod _{h=1}^{N_\text {h}}\varvec{H}_{n,h}. \end{aligned}$$
(27)

In practice, we implement \(\varvec{P}_n\varvec{r}_n\) in Eq. (25) by applying \(N_\text {h}\) iterations:

$$\begin{aligned} \varvec{r}_{n,0}:= & {} \varvec{r}_n, \end{aligned}$$
(28)
$$\begin{aligned}{} & {} \begin{aligned} \varvec{r}_{n,h}:= \varvec{r}_{n,h-1}-\frac{2\varvec{u}_{n,h}}{\left\Vert \varvec{u}_{n,h}\right\Vert _2^2}\cdot \varvec{u}_{n,h}^\top \varvec{r}_{n,h-1},\\ 1\le h\le N_\text {h}\le D, \end{aligned} \end{aligned}$$
(29)
$$\begin{aligned} \varvec{P}_n\varvec{r}_n:= & {} \varvec{r}_{n,N_\text {h}}. \end{aligned}$$
(30)

After the isometric rotation, we apply Eqs. (23) and (24) as trainable quantization.
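
A compact sketch of Eqs. (26)-(30) is shown below; it applies the reflections iteratively and never materializes \(\varvec{P}_n\), and the random initialization of the vectors \(\varvec{u}_{n,h}\) is our assumption.

```python
import torch
import torch.nn as nn

class HouseholderRotation(nn.Module):
    """Eqs. (26)-(30): an isometric matrix P_n parameterized as a product of
    N_h Householder reflections, applied iteratively without forming P_n."""
    def __init__(self, dim=128, n_house=16):
        super().__init__()
        self.u = nn.Parameter(torch.randn(n_house, dim))             # u_{n,1..N_h}

    def forward(self, r):                                            # r: (B, D)
        for h in range(self.u.size(0)):
            u = self.u[h]
            r = r - (2.0 / u.pow(2).sum()) * torch.outer(r @ u, u)   # Eq. (29)
        return r                                                     # = P_n r, Eq. (30)
```

Because each step is an exact Householder reflection, the composed transform is orthogonal by construction, so the constraint in Eq. (25) holds without any extra regularization.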

Analogous to embedding-based fine-grained alignment, here we introduce how to align fine-grained representations and train the optimized quantization module simultaneously. We denote the quantized representations of \(\varvec{r}^\text {t}_{i,n}\) and \(\varvec{r}^\text {v}_{i,n}\) by \({\hat{\varvec{r}}}^\text {t}_{i,n}\) and \({\hat{\varvec{r}}}^\text {v}_{i,n}\), respectively. Besides, we define the asymmetric-quantized similarity by

$$\begin{aligned} {\hat{m}}_{ij}^{n,\mathbf {\text {tv}}}= & {} \cos (\varvec{r}^\text {t}_{i,n}, {\hat{\varvec{r}}}^\text {v}_{j,n}), \end{aligned}$$
(31)
$$\begin{aligned} {\hat{m}}_{ij}^{n,\mathbf {\text {vt}}}= & {} \cos (\varvec{r}^\text {v}_{i,n}, {\hat{\varvec{r}}}^\text {t}_{j,n}). \end{aligned}$$
(32)

Finally, as illustrated in Fig. 4a, we define the asymmetric-quantized contrastive learning loss for fine-grained alignment as

$$\begin{aligned} {\hat{\ell }}_{\text {FA}(n), i}^{\mathbf {\text {tv}}}= & {} -\log \frac{\exp ({\hat{m}}_{ii}^{n,\mathbf {\text {tv}}}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert } \exp ({\hat{m}}_{ij}^{n,\mathbf {\text {tv}}}/\tau )}, \end{aligned}$$
(33)
$$\begin{aligned} {\hat{\ell }}_{\text {FA}(n), i}^{\mathbf {\text {vt}}}= & {} -\log \frac{\exp ({\hat{m}}_{ii}^{n,\mathbf {\text {vt}}}/\tau )}{\sum _{j=1}^{\vert \mathcal {B}\vert } \exp ({\hat{m}}_{ij}^{n,\mathbf {\text {vt}}}/\tau )}, \end{aligned}$$
(34)
$$\begin{aligned} {\hat{\mathcal {L}}}_{\text {FA}(n)}= & {} \frac{1}{2\vert \mathcal {B}\vert }\sum _{i=1}^{\vert \mathcal {B}\vert } \left( {\hat{\ell }}_{\text {FA}(n), i}^{\mathbf {\text {tv}}}+{\hat{\ell }}_{\text {FA}(n), i}^{\mathbf {\text {vt}}}\right) , \end{aligned}$$
(35)
$$\begin{aligned} {\hat{\mathcal {L}}}_\text {FA}= & {} \frac{1}{N_c}\sum _{n=1}^{N_c}{\hat{\mathcal {L}}}_{\text {FA}(n)}. \end{aligned}$$
(36)

We choose the asymmetric-quantized loss (Jang and Cho, 2021) rather than the symmetric one (Wang et al., 2022a) so as to directly optimize AQS-based retrieval (i.e., retrieval based on asymmetric quantization similarity, as shown in Fig. 5). To distinguish the quantization-based fine-grained alignment from the embedding-based counterpart, we refer to hugging with optimized quantization learning as hugging\(^+\). Accordingly, the instantiation of hugging\(^+\) is dubbed HuggingHash\(^+\).

3.3.4 Learning Objectives

Here we summarize the learning objectives of HuggingHash and HuggingHash\(^+\) as follows:

$$\begin{aligned}{} & {} \mathcal {L}_{\textsc {HuggingHash}{}}=\mathcal {L}_\text {GA}+\lambda \mathcal {L}_\text {FA}+\gamma \mathcal {R}_\text {quant}, \end{aligned}$$
(37)
$$\begin{aligned}{} & {} \mathcal {L}_{{\textsc {HuggingHash}{}}^+} = \mathcal {L}_\text {GA}+\lambda {\hat{\mathcal {L}}}_\text {FA}+\gamma \mathcal {R}_\text {quant}, \nonumber \\{} & {} \mathcal {R}_\text {quant} = \frac{1}{2\vert \mathcal {B}\vert }\sum _{i=1}^{\vert \mathcal {B}\vert }\left( \left\Vert \varvec{b}_i^\text {t} -\varvec{h}_i^\text {t}\right\Vert _2^2+\left\Vert \varvec{b}_i^\text {v}-\varvec{h}_i^\text {v}\right\Vert _2^2\right) . \end{aligned}$$
(38)

\(\mathcal {R}_\text {quant}\) is the quantization loss of the hash codes. \(\lambda ,\gamma >0\) are hyper-parameters that balance the loss terms. The hugging and hugging\(^+\) frameworks are flexible and compatible with other hashing objectives: by replacing \(\mathcal {L}_\text {GA}\) (Eq. (10)), we can easily extend them to other UCMH methods.
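
The sketch below assembles the objectives of Eqs. (37) and (38) with the default weights from Sect. 4.1.2; we write \(\mathcal {R}_\text {quant}\) in terms of \(\text {sgn}(\varvec{h})\), which equals \(\varvec{b}\) in value but carries no straight-through gradient, so that the penalty pushes the smoothed codes toward their binary values (our reading of Eq. (38)).

```python
import torch

def quantization_regularizer(h_t, h_v):
    """R_quant in Eq. (38): pushes smoothed codes h toward their binary signs.
    torch.sign carries no gradient, so the penalty only acts on h."""
    return 0.5 * (((torch.sign(h_t) - h_t) ** 2).sum(dim=1).mean()
                  + ((torch.sign(h_v) - h_v) ** 2).sum(dim=1).mean())

def total_loss(loss_ga, loss_fa, h_t, h_v, lam=0.2, gamma=1.0):
    """Eqs. (37)-(38): L = L_GA + lambda * L_FA (or hat-L_FA) + gamma * R_quant."""
    return loss_ga + lam * loss_fa + gamma * quantization_regularizer(h_t, h_v)
```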

Fig. 5

The hash-based retrieval pipeline. Given a query text, we first use global hash codes to rank the database. We can directly take the top-ranked results or enable a further reranking stage with fine-grained quantization codes

3.4 Indexing and Retrieval

3.4.1 Encoding Global and Fine-Grained Indices

Without loss of generality, we take text-to-image retrieval as an example to describe how HuggingHash and HuggingHash\(^+\) produce indices (i.e., hash-based representations) in inference.

We encode database images with the image hash encoder, which comprises a patchifier, a ViT, and the image hash module. We denote the global hash codes of the ith image by \(\varvec{b}^\text {v}_i\in \{-1,+1\}^L\), which can be taken as the global index.

Meanwhile, if the hugging\(^+\) training framework is adopted, we can take fine-grained representations as a gift from multi-granularity alignment. No extra pass beyond the forward process of global index generation is needed to obtain the fine-grained indices of the same instance. Instead, in the same forward pass, we retain the output embeddings of the local heads and compress them with the corresponding quantization modules. As illustrated in Fig. 4b, at the nth local head, we first apply the isometric rotation matrix \(\varvec{P}_n\) to the local image embedding \(\varvec{r}^\text {v}_{i,n}\), which can be efficiently implemented by Eqs. (28) to (30), and obtain the rotated embedding \({\tilde{\varvec{r}}}^\text {v}_{i,n}\). We divide \({\tilde{\varvec{r}}}^\text {v}_{i,n}\) into M segments, namely \({\tilde{\varvec{r}}}^{\text {v},1}_{i,n},{\tilde{\varvec{r}}}^{\text {v},2}_{i,n},\cdots ,{\tilde{\varvec{r}}}^{\text {v},M}_{i,n}\). Then, we find the sub-codeword index of each segment. Take the mth segment as an example:

$$\begin{aligned} k^{\text {v},m}_{i,n}={\mathop {\arg \textrm{max}}_{0\le k< K}\,}{\langle {{{\tilde{\varvec{r}}}^{\text {v},m}_{i,n}},\varvec{d}^m_{n,k}}\rangle }. \end{aligned}$$
(39)

Next, we collect the indices \(\{ k^{\text {v},m}_{i,n} \}_{m=1}^M\) and convert them into a binary code vector \(\varvec{q}^{\text {v}}_{i,n}\). We take it as quantization codes w.r.t. the nth local head.
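
A minimal sketch of the hard assignment in Eq. (39) follows; normalizing segments and codewords keeps it consistent with the training-time quantizer sketched in Sect. 3.3.3, and the uint8 packing assumes \(K\le 256\) as in Sect. 4.1.2.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_pq_codes(r_rot, codebooks):
    """Eq. (39): hard codeword assignment for rotated local embeddings.
    r_rot: (B, D) with D = M * d; codebooks: (M, K, d).
    Returns (B, M) codeword indices; with K = 256 each row packs into M bytes."""
    M, K, d = codebooks.shape
    segments = F.normalize(r_rot.view(r_rot.size(0), M, d), dim=-1)
    codewords = F.normalize(codebooks, dim=-1)      # consistent with the training sketch
    sims = torch.einsum("bmd,mkd->bmk", segments, codewords)
    return sims.argmax(dim=-1).to(torch.uint8)      # k^{v,m}_{i,n}
```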

3.4.2 Ranking in the Hamming Space

Given a text query, we forward it to the text hash encoder, which comprises a tokenizer, a text transformer, and the text hash module. We denote the query hash codes as \(\varvec{b}^\text {t}_\text {q}\in \{-1,+1\}^L\). Hamming distance between \(\varvec{b}^\text {t}_\text {q}\) and \(\varvec{b}^\text {v}_i\) is defined by

$$\begin{aligned} d_{\mathbb {H}}(\varvec{b}^\text {t}_\text {q}, \varvec{b}^\text {v}_i)=\frac{1}{2}\left( L-\varvec{b}^{\text {t}\top }_\text {q}\varvec{b}^\text {v}_i\right) . \end{aligned}$$
(40)

By leveraging bit-wise operators (i.e., XOR), retrieval efficiency can be largely improved.

We rank the database images according to the Hamming distance. The smaller the distance, the higher the ranking. As shown in Fig. 5, if we only consider single-stage retrieval or adopt the vanilla hugging strategy, we directly return the top-ranked IDs w.r.t. Hamming distance. Otherwise, we reserve a portion of the top-ranked images in the whole database, \(N_{\mathcal {D}'}\) in total, for a fine-grained reranking stage.
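
For illustration, a NumPy sketch of Eq. (40) on bit-packed codes; production systems would typically use SIMD popcount instructions, which this simple version does not.

```python
import numpy as np

def pack_codes(b):
    """Pack rows of {-1,+1}^L codes into uint8 bit strings of length L/8."""
    return np.packbits((b > 0).astype(np.uint8), axis=1)

def hamming_distances(query_packed, db_packed):
    """Eq. (40) via XOR + popcount: query_packed (L/8,), db_packed (N, L/8)."""
    xor = np.bitwise_xor(db_packed, query_packed[None, :])
    return np.unpackbits(xor, axis=1).sum(axis=1)   # number of differing bits

# ranking = np.argsort(hamming_distances(q, db))    # smaller distance ranks higher
```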

Table 1 Dataset information and settings

3.4.3 Reranking with Fine-Grained Quantization Codes

Suppose the text query \(\mathcal {T}_\text {q}\) produces an embedding \(\varvec{r}^\text {t}_{\text {q},n}\) at the nth local head. We first apply the isometric rotation matrix \(\varvec{P}_n\) to \(\varvec{r}^\text {t}_{\text {q},n}\) by Eqs. (28) to (30), and obtain the rotated embedding \({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}\). Then, we divide \({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}\) into M segments, namely \({\tilde{\varvec{r}}}^{\text {t},1}_{\text {q},n},{\tilde{\varvec{r}}}^{\text {t},2}_{\text {q},n},\cdots ,{\tilde{\varvec{r}}}^{\text {t},M}_{\text {q},n}\). Next, we adopt Asymmetric Quantization Similarity (AQS) (Jégou et al., 2011) as the metric, which computes the similarity between \({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}\) and the quantized representation of the ith database item, \({{\hat{\varvec{r}}}^{\text {v}}_{i,n}}\), by

$$\begin{aligned} \text {AQS}({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}, {\hat{\varvec{r}}}^{\text {v}}_{i,n})&= \sum _{m=1}^M\frac{ \langle {{\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}},{\hat{\varvec{r}}}^{\text {v},m}_{i,n}\rangle }{\left\Vert {\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}\right\Vert _2} \end{aligned}$$
(41)
$$\begin{aligned}&=\sum _{m=1}^M\frac{ \langle {{\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}},\varvec{d}^m_{n,k^{\text {v}, m}_{i,n}}\rangle }{\left\Vert {\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}\right\Vert _2}, \end{aligned}$$
(42)

where \(k^{\text {v}, m}_{i,n}\) is the sub-codeword index of \({\hat{\varvec{r}}}^{\text {v},m}_{i,n}\) in the mth sub-codebook, obtained by Eq. (39). We can set up a lookup table \(\varvec{T}_{\text {q},n}\in \mathbb {R}^{M\times K}\) for each \({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}\), which stores the pre-computed similarities between the segments of \({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}\) and all sub-codewords. Specifically, \(T^m_{q,n,k}=\langle {{\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}},\varvec{d}_{n,k}^m\rangle /\Vert {\tilde{\varvec{r}}}^{\text {t},m}_{\text {q},n}\Vert _2\). Hence, AQS can be efficiently computed by summing the corresponding entries of the lookup table according to the indices \(\{ k^{\text {v},m}_{i,n} \}_{m=1}^M\) converted from the quantization codes \(\varvec{q}_{i,n}^\text {v}\), i.e.,

$$\begin{aligned} \text {AQS}({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}, {\hat{\varvec{r}}}^{\text {v}}_{i,n})=\sum _{m=1}^MT^m_{q,n,k^{\text {v}, m}_{i,n}}. \end{aligned}$$
(43)

The fine-grained similarity between the text query \(\mathcal {T}_\text {q}\) and fine-grained quantization codes of the ith database images, \(\{\varvec{q}^\text {v}_{i,n}\}_{n=1}^{N_\text {c}}\), is computed by summing up \(N_\text {c}\) local similarity scores, namely

$$\begin{aligned} s_{\mathbb {Q}}( \mathcal {T}_\text {q}, \{\varvec{q}^\text {v}_{i,n}\}_{n=1}^{N_\text {c}} ) = \sum _{n=1}^{N_\text {c}}\text {AQS}({\tilde{\varvec{r}}}^\text {t}_{\text {q},n}, {\hat{\varvec{r}}}^{\text {v}}_{i,n}). \end{aligned}$$
(44)

As shown in Fig. 5, if the hugging\(^+\) strategy is adopted, we can further rerank the filtered subset \(\mathcal {D}'\) obtained from Sect. 3.4.2 according to Eq. (44). The higher the score, the higher the ranking. Finally, we return the top-ranked images in the reranked list as retrieval results. Since \(N_{\mathcal {D}'}\ll N_\mathcal {D}\) and the reranking is quantization-based, this stage preserves the high efficiency of hash-based retrieval.
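
To illustrate the lookup-table computation in Eqs. (41)-(43), here is a small sketch for a single local head; summing the per-head scores over the \(N_\text {c}\) heads then gives Eq. (44). Function names and tensor layouts are ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def aqs_scores(r_q_rot, codebooks, db_codes):
    """Eqs. (41)-(43) for one local head: build the lookup table T_{q,n} and
    gather entries by the stored codeword indices.
    r_q_rot: (D,) rotated query embedding; codebooks: (M, K, d);
    db_codes: (N', M) codeword indices of the candidate subset."""
    M, K, d = codebooks.shape
    seg = F.normalize(r_q_rot.view(M, d), dim=-1)      # handles the 1/||.||_2 factor
    table = torch.einsum("md,mkd->mk", seg, F.normalize(codebooks, dim=-1))
    return table.gather(1, db_codes.long().T).sum(dim=0)   # (N',) AQS scores

# Eq. (44): sum aqs_scores over the N_c local heads, then rerank the subset
# D' by descending total score.
```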

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

We conduct experiments on four commonly used text-image datasets in cross-modal hashing:

Flickr25K (Huiskes and Lew, 2008) It contains 25,000 image-text pairs with 24 annotated labels. Each pair contains an image and the associated textual tags. We filter out data without labels and use 20,015 pairs in our experiment. The tag information for each image is represented as a 1,386-dimensional bag-of-words vector.

NUSWIDE (Chua et al., 2009) It provides 186,577 image-tag pairs from the top-10 concepts. The tag information is represented as a 1,000-dimensional bag-of-words vector.

MSCOCO (Lin et al., 2014) It consists of 123,558 image-sentence pairs from 80 object categories. Each image is associated with 4 sentences describing its content. Each text is represented as a 2,000-dimensional bag-of-words vector.

Wiki (Rasiwasia et al., 2010) It comprises 2,866 documents from 10 categories. Each document contains an image and a text with at least 70 words. A 128-dimensional SIFT feature vector is provided for each image, and each text is represented as a 10-dimensional topic vector.

In addition, following (Zhuo et al., 2022), we conduct experiments on two text-video datasets:

MSRVTT (Xu et al., 2016) It consists of 10,000 video clips, each annotated with 20 captions. The average video duration is 15 s and the frame rate is 30 FPS. We follow the setting of Zhuo et al. (2022) and report the results on the 1K-A test set (Yu et al., 2018).

MSVD (Chen and Dolan, 2011) It contains 1,970 video clips and each clip is associated with about 40 captions. The dataset is split into train, validation, and test sets with 1,200, 100, and 670 clips, respectively. In testing, we follow Zhuo et al. (2022) to select the fifth caption for each clip, resulting in bidirectional one-to-one retrieval.

The data splits are described in Table 1.

4.1.2 Implementation Details

Our implementation is based on PyTorch (Paszke et al., 2019) with 4 NVIDIA RTX 3080 Ti (12 GB) GPUs. We adopt the standard metric, mean average precision (MAP@N), to evaluate text-image retrieval tasks. For text-video retrieval tasks, we use recall (R@N) and median rank (MdR) of the ground-truth items as the evaluation metrics. For comparison, shallow approaches take bag-of-words features and hand-crafted visual descriptors (e.g. SIFT (Lowe, 2004)) as text and image inputs, respectively. Deep approaches use CNN (e.g. AlexNet (Krizhevsky et al., 2012)) features as image inputs. For transformer-based approaches, we use pretrained BERT (Devlin et al., 2019) (‘bert-base-uncased’) and ViT (Dosovitskiy et al., 2021) (‘vit-base-patch16-224’) as default transformers on text-image retrieval tasks. For text-video retrieval tasks, we follow Zhuo et al. (2022); Luo et al. (2022), who use the pre-trained CLIP (Radford et al., 2021) (‘clip-vit-base-patch32’) to initialize the text and frame encoders. The dimensions of token embeddings are \(D^\text {t}=D^\text {v}=768\). The maximum number of text tokens is set to 128 for text-image retrieval tasks and 32 for text-video retrieval tasks. The dimension of the fine-grained alignment space, D, is set to 128 for text-image retrieval tasks and 512 for text-video retrieval tasks. The batch size is set to 32. Following the practice of Luo et al. (2022), we take Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 1e-7 for the pretrained text and image transformers and a learning rate of 1e-4 for the other modules. For text-video retrieval tasks, we uniformly select 1 frame per second from a raw video and randomly sample 12 frames from the selected ones as the video input. Other default settings are as follows: (i) The loss weights in Eqs. (37) and (38) are \(\lambda =0.2\) and \(\gamma =1\). (ii) The smoothness factor in Eqs. (4) and (5) is \(\alpha =0.5\). (iii) The scaling factor in Eqs. (23) and (24) is \(\beta =1\). (iv) The temperature factor in contrastive learning objectives is \(\tau =0.2\). (v) The number of active clusters in GhostVLAD is \(N_\text {c}=7\). (vi) For the quantization module, we set the codeword number of each sub-codebook to \(K=256\), such that each local embedding is encoded by \(M\log _2K=8M\) bits (i.e., M bytes). (vii) For text-image retrieval tasks, we set \(M=4\) sub-codebooks in each quantization module, producing 32-bit quantization codes at each local head. For text-video retrieval tasks, we set \(M=32\) sub-codebooks in each quantization module, leading to 256-bit quantization codes at each local head. (viii) The number of Householder transformations in Eq. (27) is \(N_\text {h}=D/8=16\). (ix) In the two-stage retrieval pipeline, we set the reranking size \(N_{\mathcal {D}'}=0.1N_\mathcal {D}\).
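As a minimal sketch of the two-learning-rate setup described above, the snippet below groups parameters for Adam; the module prefixes (`text_encoder`, `image_encoder`) are illustrative placeholders rather than the names used in our code.

```python
import torch

def build_optimizer(model, lr_backbone=1e-7, lr_head=1e-4):
    """Adam with a small learning rate for the pretrained transformers and a
    larger one for the newly added modules. The attribute prefixes below
    ('text_encoder', 'image_encoder') are illustrative placeholders."""
    backbone_params, head_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith(('text_encoder.', 'image_encoder.')):
            backbone_params.append(param)   # pretrained BERT / ViT weights
        else:
            head_params.append(param)       # hashing, VLAD, quantization heads
    return torch.optim.Adam([
        {'params': backbone_params, 'lr': lr_backbone},
        {'params': head_params, 'lr': lr_head},
    ])
```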

4.2 Comparison with State-of-the-art Approaches

Table 2 Text-image retrieval mean average precision (MAP) results for different numbers of bits on the three datasets
Fig. 6
figure 6

Top-N precision curves of 64-bit hashing methods

Fig. 7
figure 7

Multi-dataset (Flickr25K-MSCOCO) evaluation with different 64-bit UCMH methods. Transformers and the hugging improve generalizability and robustness. Besides, HuggingHash shows the best zero-shot results

4.2.1 On Text-Image Retrieval

The comparison is with 15 UCMH baselines: (i) 5 shallow methods: CVH (Kumar and Udupa, 2011), IMH (Song et al., 2013), CMFH (Ding et al., 2016), FSH (Liu et al., 2017), ACQ (Irie et al., 2015). (ii) 10 SOTA deep methods: DBRC (Hu et al., 2019), UGACH (Zhang et al., 2018), UCH (Li et al., 2019), DJSRH (Su et al., 2019), UKD-SS (Hu et al., 2020), SRCH (Wang et al., 2020b), DSAH (Yang et al., 2020), DGCPN (Yu et al., 2021a), CIRH (Zhu et al., 2023), UCHSTM (Tu et al., 2023). To explore the impact of transformers on UCMH, we adapt the open-source implementations of 5 representative baselines, i.e., DJSRH, DSAH, DGCPN, CIRH, and UCHSTM, by using the same backbones as HuggingHash and HuggingHash\(^+\). There are two differences between HuggingHash and HuggingHash\(^+\): (i) In training, HuggingHash adopts the embedding-based fine-grained alignment objective (i.e., Eq. (19)) while HuggingHash\(^+\) adopts the quantization-based objective (i.e., Eq. (36)). (ii) In inference, HuggingHash only produces global hash codes, while HuggingHash\(^+\) further supplies fine-grained quantization codes. HuggingHash executes retrieval by ranking with hash codes only, while HuggingHash\(^+\) further enables an efficient reranking stage (as shown in Fig. 5) to refine the retrieval results.

Performance Table 2 reports the MAP results under different numbers of hash bits. Methods in the ‘Transformers’ block outperform their counterparts in the ‘CNN + MLP’ block by considerable margins. It verifies that pre-trained transformers provide better modality understanding than CNNs and MLPs, thus contributing to high-quality hash codes. Besides, in all settings, HuggingHash outperforms transformer-based baselines that only consider global alignment. Although HuggingHash adopts a contrastive learning objective for global alignment (i.e., Eq. (10)) that is much simpler than those of the baselines, its superior results indicate the effectiveness of exploring multi-granularity alignment with transformers beyond global alignment itself. Moreover, by making better use of fine-grained quantization codes, HuggingHash\(^+\) can further surpass HuggingHash, especially at top positions, as we can learn from the MAP@50 results on the Wiki dataset. It suggests the value of leveraging fine-grained semantics in transformers. We also illustrate the precision curves of different approaches to give a more intuitive understanding. As shown in Fig. 6, HuggingHash reaches significantly higher precision than the other baselines. Within the reranking range, HuggingHash\(^+\) refines the retrieval results such that matched items are ranked higher, so we can see a further gain in that range. Note that HuggingHash\(^+\) provides both coarse-grained and fine-grained representations by learning one unified model. This property is beneficial to practical search systems (Asadi and Lin, 2013) where ranking is multi-stage and consists of a series of independent models. On the other hand, as reranking consumes more memory (or storage) and computation than one-stage hash-based retrieval, the efficacy-efficiency tradeoff should depend on the application scenario.

Transferability Transferability is an important but often ignored target in practice, reflecting the domain generalizability from offline training to online serving. Here we conduct a multi-dataset (Flickr25K-MSCOCO) evaluation with five representative baselines and the proposed models. We compare standard (i.e., train and test on the same dataset) and zero-shot performance (i.e., train and test on different datasets). We also investigate different backbones. To deal with the different vocabularies of the two datasets, we replace the bag-of-words features with word2vec (Le and Mikolov, 2014) features as the text inputs. “BERT+ViT” indicates transformer-based variants using global alignment. “BERT+ViT+Hugging” and “BERT+ViT+Hugging\(^+\)” indicate the variants with hugging or hugging\(^+\), respectively.

The results are shown in Fig. 7. We can learn that transformers not only boost in-domain performance but also improve generalizability. The proposed hugging and hugging\(^+\) strategies can further improve the zero-shot performance in most cases. Besides, although combining hugging with SOTA baselines yields competitive results, we notice that HuggingHash and HuggingHash\(^+\) show better zero-shot performance. We conjecture that contrastive learning objectives help to produce more transferable hash codes.

Table 3 Text-video retrieval recall and median rank results for different numbers of bits on the two datasets

4.2.2 On Text-Video Retrieval

The comparison is with 14 baselines: (i) 7 representative embedding-based text-video retrieval methods: CE (Liu et al., 2019b), MMT (Gabeur et al., 2020), Support-Set (Patrick et al., 2021), HiT (Liu et al., 2021a), CLIP (Radford et al., 2021), T2VLAD (Wang et al., 2021b), and CLIP4Clip (Luo et al., 2022). (ii) 1 hash-based text-video retrieval method with a non-transformer backbone: S\(^2\)Bin (Qi et al., 2021). (iii) 5 UCMH methods originally designed for text-image retrieval: DJSRH (Su et al., 2019), DSAH (Yang et al., 2020), DGCPN (Yu et al., 2021a), CIRH (Zhu et al., 2023), and UCHSTM (Tu et al., 2023). (iv) 1 state-of-the-art transformer-based UCMH method tailored for text-video retrieval: CLIP4Hashing (Zhuo et al., 2022). In particular, we follow Zhuo et al. (2022) to adopt pre-trained CLIP (Radford et al., 2021) as the text and video frame encoder, which is also a popular practice in the text-video retrieval literature (Luo et al., 2022). Although pretrained CLIP is well-aligned for text-image tasks, we argue that it is not aligned for text-video retrieval: it needs to be further adapted to understand the correspondence between language and temporal dynamics, e.g. whether an object is moving from left to right or the reverse. In HuggingHash and HuggingHash\(^+\), we design a temporal self-attention layer based on the image-level frame embeddings to enhance this aspect. We also adapt the UCMH baselines in (iii) by using the same backbones as HuggingHash and HuggingHash\(^+\).

Performance Table 3 presents the results under different numbers of hash bits. Some findings are as below. (i) CLIP provides useful cross-modal knowledge to bridge the modality gap but is not well-aligned for text-video retrieval. Though vanilla CLIP alone can outperform CE and Support-Set, it is insufficient to understand text-video correspondence and shows inferior results to HiT on most metrics. Through temporal-aware training, CLIP4Clip shows higher recall than the other embedding-based approaches; 2048-bit CLIP4Hashing outperforms vanilla CLIP. (ii) The text-image hashing objectives are inadequate for one-to-one text-video retrieval because their basic assumption is not satisfied. Despite using the same backbone, all the compared baselines from text-image retrieval achieve significantly lower recall than HuggingHash. One of the reasons is that most text-image hashing objectives assume that one query is associated with multiple matched items sharing the same labels. Differently, in text-video retrieval, one query is expected to match only one item, which leads to an extremely sparse similarity matrix, hence limiting the performance of text-image hashing approaches. (iii) Hugging is more effective than handshaking. The strong baseline, CLIP4Hashing, can be regarded as a handshaking approach as it only designs global alignment. Our HuggingHash outperforms it by considerable margins. Besides, T2VLAD develops multi-granularity similarity in training and inference, which can also be regarded as a practice of hugging. Without large-scale cross-modal pretraining, T2VLAD even outperforms vanilla CLIP. The success highlights the efficacy of multi-granularity similarity. Nevertheless, it depends on high-dimensional continuous embeddings of global and local semantics, which incurs much more computation and storage overhead. Efficiency is a major weakness of T2VLAD, as we will show below. (iv) Hugging\(^+\) can further improve retrieval performance beyond hugging. Similar to the observation on text-image retrieval, HuggingHash\(^+\) shows relatively better results than HuggingHash in most cases. The advantage of reranking is more pronounced for shorter code lengths, e.g. 256 and 512 bits, while fading for longer ones, e.g. 2048 bits. Two main factors contribute to this observation. First, longer hash codes carry more semantic information, making fine-grained reranking less necessary for global retrieval. Second, the complementary value of the fixed-length fine-grained quantization codes diminishes as the hash codes become longer, which is particularly evident with 2048-bit hash codes.

Fig. 8
figure 8

Storage (memory) overhead and average query time of different text-to-video retrieval models

Table 4 Comparison of HuggingHash, HuggingHash\(^+\), and T2VLAD (Wang et al., 2021b) on the MSRVTT dataset

Comparison with T2VLAD Using the Same Encoders Since HuggingHash, HuggingHash\(^+\), and T2VLAD share similar network modules, comparing them using the same text encoder and visual encoder is worthwhile. We report the results in Table 4. As the source code for T2VLAD is currently unavailable, we replicate the model based on our HuggingHash implementation. Specifically, since we only use CLIP features for video, we configure the global alignment with a single expert. Notably, the original T2VLAD paper employed a bidirectional max-margin ranking loss with a margin of 0.02 for training. For a consistent comparison, we also present results using the contrastive learning loss for training, denoted as T2VLAD\(^*\).

We can learn from Table 4 that T2VLAD outperforms HuggingHash (or HuggingHash\(^+\)) notably for lower values of D. This outcome reflects the higher capacity of T2VLAD’s representation compared to HuggingHash (or HuggingHash\(^+\)). For instance, at \(D=256\), T2VLAD uses higher-dimensional embeddings, including a 256-dimensional floating-point global embedding and 3584-dimensional local embeddings, whereas HuggingHash (or HuggingHash\(^+\)) relies on 256-bit binary global hash codes. As D increases, HuggingHash and HuggingHash\(^+\) gradually approach or even surpass the T2VLAD variants. For instance, at \(D=2048\), HuggingHash and HuggingHash\(^+\) surpass T2VLAD, and T2VLAD\(^*\) slightly outperforms them. This can be attributed to the risk of overfitting in T2VLAD’s high-dimensional feature space and the potential impact of the global–local similarity balance on T2VLAD’s performance. In contrast, HuggingHash and HuggingHash\(^+\) decouple the global and fine-grained representations, avoiding such competition.

In terms of global retrieval, HuggingHash (or HuggingHash\(^+\)) employs D bits per item and XOR operations over D bits for similarity calculation. In contrast, T2VLAD utilizes a D-dimensional global embedding and seven 512-dimensional local embeddings, incurring over 32 times the storage and memory overhead of hash-based methods. T2VLAD also involves more time-consuming floating-point multiplications for similarity calculation. Consequently, T2VLAD exhibits significantly longer retrieval times. For real-world large-scale retrieval scenarios, efficiency is crucial alongside performance. Notably, HuggingHash\(^+\) reduces overhead by using lightweight quantization codes instead of embeddings, and reranks only a small portion (e.g. the top 10%) of the hash-ranked list based on the quantized representations, resulting in faster retrieval.

Table 5 Ablation study on MSCOCO and MSRVTT datasets.

Although HuggingHash and T2VLAD share similar network modules and aligned model settings in Table 4, the results serve as references rather than strictly fair comparisons. A fair comparison remains challenging due to differing design considerations and operational mechanisms. HuggingHash and HuggingHash\(^+\) prioritize efficient retrieval, focusing on representation size constraints and semantic preservation within limited resources. Our fine-grained alignment enhances global alignment for improved cross-modal semantics in hash codes without altering hash code generation or retrieval. The optional quantization enables reranking that refines performance while maintaining efficiency. T2VLAD prioritizes performance optimization, resulting in disparities in representation and efficiency: its straightforward global–local fusion employs fine-grained alignment and captures multi-granularity semantics, but at the cost of efficiency.

Efficiency of hash-based retrieval Efficiency is an important target for retrieval since it highly relates to the scalability of the search system. Here we show results on this rarely investigated aspect of text-video retrieval. The comparison is with 6 embedding-based text-to-video retrieval methods, CE, MMT, Support-Set, HiT, T2VLAD, and CLIP4Clip, on the MSRVTT dataset. The representation schemes of the models are as follows: (i) CE represents an instance by nine 768-dimensional embeddings and a 9-dimensional weighting vector; (ii) MMT represents an instance by seven 512-dimensional embeddings and a 7-dimensional weighting vector; (iii) Support-Set represents an instance by a 1024-dimensional embedding; (iv) HiT represents an instance by a 2048-dimensional embedding; (v) T2VLAD represents an instance by nine 768-dimensional local embeddings and a 768-dimensional global embedding; (vi) CLIP4Clip represents an instance by a 512-dimensional embedding; (vii) Hash-based approaches represent an instance by a fixed-length hash code; we take HuggingHash as a showcase. A 256-bit code can be stored in 32 bytes; (viii) HuggingHash\(^+\) represents an instance by a fixed-length global hash code and seven 256-bit local quantization codes. Since MSRVTT only contains 10k videos, we duplicate the videos to simulate a large database. We evaluate: (i) the average query time, including the time for text encoding on GPU and nearest neighbor search on CPU; and (ii) the storage overhead for offline-computed video representations. We experiment with an NVIDIA RTX 3080 Ti (12 GB) GPU and an Intel\(\circledR \) Xeon\(\circledR \) Platinum 8269CY CPU @ 2.50GHz (104 cores). The results are shown in Fig. 8. We can see that hash-based approaches reduce storage (or memory) overhead and accelerate retrieval, especially on large-scale data. For example, on a 1M-size database, HuggingHash (2048-bit) consumes \(\sim \)241\(\times \) and \(\sim \)16\(\times \) less storage and achieves \(\sim \)23.6\(\times \) and \(\sim \)1.5\(\times \) faster queries than T2VLAD and CLIP4Clip, respectively. Compared with vanilla hash-based retrieval, HuggingHash\(^+\) consumes more computation and storage due to the fine-grained quantization-based reranking, but overall it still achieves a good trade-off between efficacy and efficiency.
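The storage and speed advantages of hash codes come from byte packing and XOR-based Hamming ranking. The following NumPy sketch (with assumed array shapes; not our actual retrieval code) illustrates why a 256-bit code occupies only 32 bytes and how ranking reduces to XOR plus popcount.

```python
import numpy as np

# popcount of every possible byte value, used to count differing bits
_POPCOUNT = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)

def pack_codes(bits):
    """Pack {0,1} hash codes of shape (N, n_bits) into bytes; a 256-bit
    code then occupies 32 bytes, as noted above."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance via XOR + popcount.

    query_code: (n_bytes,) packed query hash code.
    db_codes:   (N, n_bytes) packed database hash codes.
    """
    dist = _POPCOUNT[np.bitwise_xor(db_codes, query_code)].sum(axis=1)
    return np.argsort(dist)  # smaller Hamming distance ranks first
```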

4.3 Model Analyses

4.3.1 Ablation Study

Recall that HuggingHash is trained with multi-granularity alignment objectives (Eq. (37)) and uses global hash codes to rank and retrieve items. HuggingHash\(^+\) extends HuggingHash by replacing the embedding-based fine-grained alignment loss with a quantization-based loss (Eq. (38)) and further reranks the top-10% hash-ranked results with fine-grained quantization codes (Fig. 5).

To understand the contributions of different modules in HuggingHash and HuggingHash\(^+\), we construct 8 variants for analysis. The modifications of these variants are listed below. 3 variants of HuggingHash: (i) HuggingHash\(_{\setminus \mathcal {L}_\text {FA}}\) removes the embedding-based fine-grained alignment loss (Eq. (19)), producing a handshaking model. (ii) HuggingHash\(_{\cup \text {Emb\_ReR}}\) enables top-10% reranking in retrieval by leveraging fine-grained embeddings. (iii) HuggingHash\(_{\cup \text {PQ\_ReR}}\) enables top-10% reranking in retrieval by leveraging fine-grained embeddings and OPQ (Ge et al., 2013). We adopt the FAISS (Johnson et al., 2021) implementation of OPQ as a post-compression on the extracted embeddings (a sketch of this post-compression is given after this paragraph). 5 variants of HuggingHash\(^+\): (i) HuggingHash\(^+_{\setminus \mathcal {L}_\text {GA}}\) removes the global hash alignment loss (Eq. (10)) and only trains a fine-grained quantization model. Since it does not produce global hash codes, in inference we use the quantization codes to rank the whole database. (ii) HuggingHash\(^+_\text {FG\_R}\) follows the hugging\(^+\) training strategy but changes the retrieval process: instead of the two-stage ranking, it leverages fine-grained quantization codes to rank the whole database. (iii) HuggingHash\(^+_{\setminus \text {ReR}}\) follows the hugging\(^+\) training strategy but disables the top-10% reranking in inference. It slightly differs from standard HuggingHash in the fine-grained alignment loss used for training: it adopts the quantization-based loss (Eq. (36)) while HuggingHash uses the embedding-based loss (Eq. (19)). (iv) HuggingHash\(^+_{\setminus \varvec{P}_n}\) removes the rotation matrix \(\varvec{P}_n\) in Eq. (25) from each local quantization module. (v) HuggingHash\(^+_\text {Fuse\_R}\) only changes the retrieval process: instead of the two-stage ranking, it directly fuses multi-granularity similarity scores with the (IV) strategy in Fig. 10 to rank the whole database. In addition, to further investigate the global–local bit allocation strategy, we reallocate all the bits from the fine-grained quantization in the hugging\(^+\) mechanism to global hashing, thereby deriving a long-bit handshaking model whose model ID we designate as 2\(^*\). We also categorize the variants into 5 types according to the retrieval process: (A) Rank the whole database with global hash codes. (B) Rank the whole database with fine-grained quantization codes. (C) Rank the whole database with global hash codes, and then rerank the top-10% results with fine-grained quantization codes. (D) Rank the whole database with global hash codes, and then rerank the top-10% results with fine-grained embeddings. (E) Rank the whole database by fusing multi-granularity similarities from global hash codes and fine-grained quantization codes.
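For reference, a hedged sketch of how the OPQ post-compression in variant (iii) could be realized with FAISS is given below; the codebook sizes, metric, and function name are illustrative assumptions, and the exact settings used in our experiments may differ.

```python
import numpy as np
import faiss

def opq_post_compress(embeddings, M=4, nbits=8):
    """Post-hoc OPQ compression of already-learned fine-grained embeddings
    for one local head. Unlike hugging+, the rotation and codebooks here are
    fitted after training, not jointly with the alignment loss.

    embeddings: (N, D) local embeddings of the database items, with D divisible by M.
    """
    x = np.ascontiguousarray(embeddings, dtype=np.float32)
    d = x.shape[1]
    opq = faiss.OPQMatrix(d, M)                              # learned rotation for PQ
    pq = faiss.IndexPQ(d, M, nbits, faiss.METRIC_INNER_PRODUCT)
    index = faiss.IndexPreTransform(opq, pq)
    index.train(x)                                           # fit rotation + codebooks
    index.add(x)                                             # encode the database
    return index

# usage sketch: approximate inner-product scores for a query embedding
# scores, ids = index.search(query.reshape(1, -1).astype('float32'), k=100)
```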

Fig. 9
figure 9

Storage (memory) overhead and average query time of different types of HuggingHash (or HuggingHash\(^+\)) variants. Showcased task: text-to-video retrieval. The lengths of hash codes and quantization codes are 2048-bit and \(7\times 256\)-bit, respectively. The results under 10 million items are extrapolated

Performance results on MSCOCO (text-image) and MSRVTT (text-video) are reported in Table 5. We number the variants with IDs for easy reference. Similar to Fig. 8, we also measure the storage and computational efficiency of different variants to give a more comprehensive comparison, as shown in Fig. 9. We can draw several conclusions from the ablation results. (i) Global alignment and fine-grained alignment promote each other; both of them are indispensable for achieving a positive synergy in hugging and hugging\(^+\). Comparing Models 1 and 2, we can learn that the fine-grained alignment loss is significant in improving the quality of hash codes. Note that Models 1 and 2 share the same retrieval type and are consistent in retrieval efficiency; in other words, hugging will not bring extra inference overhead to global hash-based retrieval. In Sect. 4.3.2, we conduct more empirical analyses to understand this implicit efficacy of hugging. Besides, Models 6 and 7 share the same type that ranks the whole database with 224-bit (\(7\times 32\)) fine-grained quantization codes on MSCOCO and with 1792-bit (\(7\times 256\)) codes on MSRVTT, whereas Model 6 learns by fine-grained alignment only and Model 7 learns by hugging\(^+\). As Model 7 outperforms Model 6 by considerable margins, we learn that global alignment can improve fine-grained alignment as well. (ii) Quantization in hugging\(^+\) facilitates the efficient usage of fine-grained semantics. In contrast to Model 1, Model 5 leverages fine-grained semantics and improves retrieval performance effectively. Although it relatively underperforms Model 3, the major benefit of high efficiency from integrating quantization should not be overlooked. Taking text-to-video reranking as a showcase, as shown in Fig. 9, Model 5 consumes only \(\sim 1/60\) of the storage (memory) overhead and half of the retrieval time that Model 3 requires on a database of 10 million items. (iii) Jointly learning quantization and fine-grained alignment in hugging\(^+\) helps to achieve a better trade-off between efficacy and efficiency. By learning fine-grained embeddings and using OPQ to post-compress the learned embeddings, Model 4 shows slight performance gains over Model 1 in most scenarios, but is still inferior to Model 5, which jointly learns optimized quantization and fine-grained alignment. The underlying reason is that end-to-end learning of quantized representations can reduce semantic distortion, thus producing better fine-grained quantization codes while maintaining high efficiency. (iv) In some scenarios, quantization can serve as another regularization to enhance the synergy between global and fine-grained alignment. For example, Model 8 with the quantization-based fine-grained alignment loss shows superior results to Model 1 on MSCOCO, but this phenomenon does not hold on MSRVTT. (v) Shorter bit-length retrieval benefits more from the fine-grained reranking. Comparing Models 5 and 8, we learn that reranking improves retrieval performance in most cases. In particular, on MSRVTT, we observe that reranking shows more performance gains at shorter bit-lengths, e.g. 256 and 512 bits, while its efficacy fades at longer ones, e.g. 2048-bit video-to-text retrieval. The reason is that we fix the bit-length of the fine-grained quantization codes by default, e.g. \(7\times 256\) bits on MSRVTT, which provides less complementary semantics once the global hash codes are already long and sufficient.
(vi) The two-stage pipeline is more efficient and robust than multi-granularity fusion-based retrieval. To exploit multi-granularity semantics for retrieval, another natural design is Model 10, which fuses global and fine-grained similarity scores for a more precise estimation. However, we can see from Fig. 9 that fusion-based retrieval requires more than twice the time of two-stage retrieval, which becomes expensive in large-scale scenarios. Moreover, the retrieval performance is also sensitive to the adopted fusion strategy. As shown in Fig. 10, the efficacy of different fusion strategies varies across cases, which increases the burden of careful selection. Under the default fusion strategy (i.e., (IV) in Fig. 10), the gains from fusion are still uncertain. For example, compared with Model 8, Model 10 does not gain performance in half of the retrieval tasks on MSCOCO. On MSRVTT, while it brings remarkable improvements under 256 bits thanks to the fine-grained similarity scores, the advantages of Model 10 over Model 8 shrink rapidly as the bit-length increases and even turn into clear disadvantages under 2048 bits due to biased similarity estimation. In contrast, Model 5 with the two-stage strategy shows more stable improvements over Model 8 and is thus preferable to the fusion-based strategy. (vii) Hugging and hugging\(^+\) help semantic preservation within limited bit lengths. Model 2\(^*\) demonstrates significantly better performance across most metrics compared to Models 1 and 5. The superior performance is attributed to the longer global hash code used by Model 2\(^*\), which stores more semantic information. Although Model 5 benefits from reranking, it only optimizes the top portion of the sorted list rather than the entire list, limiting its performance improvement. The comparison is illustrative, considering the inherent differences. Interestingly, Model 2\(^*\)’s dominance diminishes when the hash code lengths of Models 1 and 5 are increased to 2048 bits on the MSRVTT dataset, indicating diminishing returns from bit-length extension and the necessity of more intelligent learning objectives. (viii) Introducing an optimized rotation matrix can improve fine-grained quantization. In contrast to Model 5, Model 9 without \(\varvec{P}_n\) in each local quantization module shows slight but observable performance decays, indicating that optimizing the subspace partition for learnable quantization can lead to better quantization codes.

Fig. 10
figure 10

Average MAP@All of text-to-image and image-to-text retrieval w.r.t. various bit-lengths and fusion strategies. \(S_{ij}\) denotes the global similarity and \(s_{ij}^n\) denotes the nth local similarity. Under each bit-length, the result of the best fusion strategy except (I) and (II) is highlighted in a red box. Model 10 in Table 5 adopts (IV) by default

4.3.2 Understanding the Effectiveness of Fine-grained Alignment

Until now, we have learned that hugging can effectively enhance global hash code learning and boost retrieval performance, but how fine-grained alignment helps to learn better global representations remains unclear. In this section, we delve into this phenomenon and aim to understand the underlying mechanism more comprehensively.

Without loss of generality, we take HuggingHash as the analysis object in this section. First, we investigate the training dynamics to examine the differences between hugging and handshaking during training. Then, from the macro view, we visualize the fine-grained latent space to show what each cluster looks like. Finally, from the micro view, we analyze visual attention maps to facilitate an intuitive understanding of how local clusters contribute to global hash learning.

Fig. 11
figure 11

Dynamics of the performance on the test set and global hash alignment loss on the train set. The recall on MSRVTT is computed by geometric mean of R@{1,5,10}. We compare two strategies with HuggingHash

Analysis of training dynamics We study the dynamics of the training loss and testing performance of HuggingHash and a handshaking baseline without fine-grained alignment. The results on MSCOCO and MSRVTT are illustrated in Fig. 11. On MSCOCO, we observe that fine-grained alignment helps to reduce the modality gap of the global representations; hence hugging reaches a better optimum than handshaking. Although both hugging and handshaking tend to overfit and degrade after reaching the optimum, hugging still retains better performance than handshaking. On MSRVTT, the pre-trained CLIP provides an easy start for aligning the texts and videos, so hugging and handshaking have similar training-loss dynamics. Nevertheless, handshaking still underperforms hugging on the test set, which suggests the significance of fine-grained alignment for model generalizability.

Basically, our intuition is that global alignment provides direct guidance on the cross-modal correspondence, while fine-grained alignment constructs a structural space with (latent) concept-level correspondence to regularize cross-modal learning. The synergy of global and fine-grained alignment turns out to reach a better optimum.

Fig. 12
figure 12

Visualization of fine-grained latent space and fine-grained cross-modal correspondence in hugging

Analysis of fine-grained latent space As a case study, we pick some text-image pairs associated with 3 single labels, cat, dog and truck, from the MSCOCO dataset. Each text or image is represented by multiple content tokens in the fine-grained latent space. Figure 12a visualizes the token distribution by labels and modalities, where the black stars mark the centroids of clusters. For quantitative analysis, we compute the class-specific and modality-specific average assignment scores w.r.t. all clusters and present them in Fig. 12b. The general consistency between the text and the image assignments w.r.t. each class implies the effectiveness of cross-modal fine-grained alignment. Besides, we observe that the assignment pattern can vary among different labels. While truck exhibits a uniform pattern, cat and dog concentrate on clusters 1, 5 and 7. In particular, we hypothesize a connection between cluster 7 and the concept of animals.

Fig. 13
figure 13

Attention map visualization on MSCOCO. Left: attention maps w.r.t. [CLS] in handshaking and hugging. Hugging helps to focus on more comprehensive semantic regions. Right: attention maps w.r.t. different clusters. We can see the text-visual semantic correspondence

Fig. 14
figure 14

Attention map visualization on MSRVTT. Left: attention maps on different frames w.r.t. the temporal [CLS] token in handshaking and hugging. Hugging helps to focus on more precise semantic regions (e.g. the people) in frame#1 and more comprehensive semantic regions in frame#2 (e.g. the dog). Right: attention maps w.r.t. different VLAD clusters. We can see some correspondence between text and visual semantics

Fig. 15
figure 15

Hyper-parameter Sensitivities on MSCOCO dataset. Without loss of generality, the results in (a)–(f) are obtained with a 64-bit HuggingHash. The quantization- and reranking-related results in (g)–(i) are obtained with a 64-bit HuggingHash\(^+\). Default settings are marked in bold. The dotted lines mark the MAP results under default settings

Analysis of visual attention Apart from Fig. 12, which provides a macro view of the cross-modal alignment in the fine-grained latent space, here we present a more intuitive visualization of the text-visual semantic correspondence and how it contributes to the global representations.

We use GradCAM (Selvaraju et al., 2017) to visualize the attention map w.r.t. the [CLS] token, and the attention maps w.r.t. different clusters in hugging. For cluster-level attention, we aggregate the attention maps of the associated content tokens weighted by their assignment scores. The results on the MSCOCO and MSRVTT datasets are illustrated in Figs. 13 and 14, respectively. We can see that fine-grained alignment improves global hash representations in two regards. (i) Fine-grained alignment is concept-aware and helps to capture more complete cues. As shown in Fig. 13b, clusters \(\varvec{c}_5\) and \(\varvec{c}_7\) are relevant to cat and dog respectively, and their attention maps concentrate on the corresponding areas in the image. Therefore, in Fig. 13a, the [CLS] attention map of hugging demonstrates a comprehensive grasp of the image, whereas the [CLS] attention map of handshaking misses the dog, leading to inferior hash codes. A similar phenomenon can be learned from Fig. 14, where handshaking fails to focus on the dog in frame#2. (ii) Fine-grained alignment regularizes global learning and helps to focus on details more precisely. As shown in Fig. 14a, while handshaking spreads the attention over larger and vaguer areas, hugging attends to the people in frame#1 more accurately so that we can see their outlines. As a result, hugging helps to produce more discriminative hash codes.
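The cluster-level aggregation described above can be sketched as a weighted sum of token-level GradCAM maps; the tensor shapes and names below are assumptions for illustration only.

```python
import torch

def cluster_attention(token_maps, assignments, cluster_id):
    """Aggregate token-level GradCAM maps into one cluster-level map.

    token_maps:  (L, H, W) attention maps of the L visual content tokens.
    assignments: (L, N_c) soft assignment scores of tokens to clusters.
    cluster_id:  index of the cluster to visualize.
    """
    weights = assignments[:, cluster_id]
    weights = weights / (weights.sum() + 1e-12)       # normalize the weights
    return torch.einsum('l,lhw->hw', weights, token_maps)
```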

Table 6 Retrieval performance of UCMH approaches on MSCOCO and MSRVTT datasets
Fig. 16
figure 16

Top-10 ranked images using different strategies on the Flickr25K dataset. For Hugging\(^+\) w/ Reranking, we further rerank the top 10 ranked images. The ground-truth labels of query #4572 are ‘sky’, ‘structure’, and ‘water’. Bounding boxes in different colors mark positive and negative images, respectively. The evaluation metric is the average precision at 10, i.e., AP@10 (Color figure online)

Fig. 17
figure 17

Top-10 ranked videos using different strategies on the MSRVTT dataset. For Hugging\(^+\) w/ Reranking, we further rerank the top 5 ranked videos. We show two search cases. The bounding box marks the ground-truth video w.r.t. the text query. The evaluation metrics are the recalls at 1, 5, and 10, i.e., R@{1,5,10} (Color figure online)

4.3.3 Hyper-Parameter Analysis

In this section, we analyze how hyper-parameters influence HuggingHash and HuggingHash\(^+\). We conduct detailed experiments on MSCOCO since it is a standard benchmark dataset for vision-language tasks. Without loss of generality, we analyze most hyper-parameters with HuggingHash and analyze the quantization-related and reranking-related hyper-parameters with HuggingHash\(^+\). Results are illustrated in Fig. 15.

Effects of loss terms \(\lambda \) is the weight of \(\mathcal {L}_\text {FA}\) (i.e., the fine-grained alignment loss) and essentially controls its gradient contribution to the learning process. Adjusting \(\lambda \) from 0 to 0.2 boosts the performance, verifying that \(\mathcal {L}_\text {FA}\) is beneficial. However, the gain drops as \(\lambda \) increases beyond 0.2, because the auxiliary task of fine-grained alignment starts to dominate the learning and even adversely restricts the main task of aligning hash codes. \(\gamma \) controls the strength of the regularization \(\mathcal {R}_\text {quant}\), and a proper range is [0.5, 1].
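Schematically, and under the assumption that Eqs. (37) and (38) combine the three terms linearly, the role of \(\lambda \) and \(\gamma \) can be illustrated as follows; the exact loss definitions are given by the referenced equations.

```python
def total_loss(loss_global, loss_fine, reg_quant, lam=0.2, gamma=1.0):
    """Schematic combination of the training objectives discussed above.
    The exact terms are defined by Eqs. (10), (19)/(36) and (37)/(38);
    this function only illustrates how lambda and gamma weight them."""
    # lam scales the auxiliary fine-grained alignment task; too large a value
    # lets it dominate the main hash-alignment objective.
    return loss_global + lam * loss_fine + gamma * reg_quant
```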

Effects of \(\tau \) The temperature factor \(\tau \) controls the penalty strength in contrastive learning and is a sensitive hyper-parameter. Figure 15c illustrates its effect. We can learn that a suitable range for \(\tau \) is [0.2, 0.25].
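For illustration, a standard symmetric InfoNCE-style objective with temperature \(\tau \) is sketched below; the paper's Eq. (10) is defined on the hash representations and may differ in its exact form.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_text, z_image, tau=0.2):
    """Symmetric InfoNCE over a batch of paired text/image representations.
    A common formulation used only to illustrate the role of tau."""
    z_text = F.normalize(z_text, dim=1)
    z_image = F.normalize(z_image, dim=1)
    logits = z_text @ z_image.t() / tau               # (B, B) pairwise similarities
    targets = torch.arange(z_text.size(0), device=logits.device)
    # a smaller tau sharpens the distribution and penalizes hard negatives more
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```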

Effects of cluster number Figure 15d shows the effect of the active cluster (i.e., latent concept) number \(N_c\) in GhostVLAD. While 15 and 7 are both reasonable choices for \(N_c\), we set \(N_c=7\) by default for higher training efficiency.
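A minimal GhostVLAD-style aggregation is sketched below to illustrate the role of the active clusters; the parameterization (a linear assignment head, learned centroids, a single ghost cluster) is an assumption that simplifies the module actually used.

```python
import torch
import torch.nn as nn

class GhostVLAD(nn.Module):
    """Minimal GhostVLAD-style aggregation: tokens are softly assigned to
    N_c active + N_g ghost clusters; ghost residuals are discarded.
    A simplified sketch, not the exact module used in the paper."""

    def __init__(self, dim, n_active=7, n_ghost=1):
        super().__init__()
        self.n_active = n_active
        total = n_active + n_ghost
        self.centroids = nn.Parameter(torch.randn(total, dim) * 0.02)
        self.assign = nn.Linear(dim, total)           # soft-assignment logits

    def forward(self, tokens):                        # tokens: (B, L, D)
        a = self.assign(tokens).softmax(dim=-1)       # (B, L, total)
        a = a[..., :self.n_active]                    # drop ghost assignments
        residual = tokens.unsqueeze(2) - self.centroids[:self.n_active]  # (B, L, N_c, D)
        return (a.unsqueeze(-1) * residual).sum(dim=1)                    # (B, N_c, D)
```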

Effects of used transformers We equip HuggingHash with different transformers (Sanh et al., 2019; Liu et al., 2019c; Yang et al., 2019; Touvron et al., 2021; Liu et al., 2021b; Bao et al., 2022). Figure 15e and f show that large BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) variants are good choices for the text transformer, and Swin Transformer (Liu et al., 2021b) is a good choice for the image encoder.

Effects of optimized quantization Note that the quantization codes are used to rerank the top-10% hash-ranked items by default, while we report the MAP metric w.r.t. the whole database. Hence, considering that the influence of different quantization settings is smoothed, we focus on the trends rather than the magnitudes. \(N_\text {h}\) denotes the number of Householder matrices in Eq. (27) used to approximate the optimal rotation matrix. The effect of \(N_\text {h}\) is shown in Fig. 15g, where \(N_\text {h}=0\) means disabling the rotation matrix, such that Eq. (25) reduces to Eq. (20). We can see that \(N_\text {h}\ge 16\) yields converged results. We take \(N_\text {h}=16\) by default to slightly reduce computation while maintaining satisfactory performance. The number of quantization sub-codebooks in each local head, M, controls the distortion and also the bit-length of the quantization codes. Figure 15h illustrates the sensitivity of M. The retrieval performance continues to improve as M increases up to 8. We set \(M=4\) by default for efficiency concerns.
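To illustrate the parameterization referenced by Eq. (27), the sketch below composes \(N_\text {h}\) Householder reflections into an orthogonal matrix; the initialization and exact composition order are assumptions rather than the implemented module.

```python
import torch

def householder_rotation(vs):
    """Compose N_h Householder reflections into a D x D orthogonal matrix.

    vs: (N_h, D) learnable vectors; each defines a reflection
        H_i = I - 2 v_i v_i^T / ||v_i||^2, and P = H_1 H_2 ... H_{N_h}.
    With N_h < D the product spans a restricted but cheap-to-optimize
    family of orthogonal matrices.
    """
    n_h, d = vs.shape
    P = torch.eye(d, dtype=vs.dtype, device=vs.device)
    for v in vs:
        v = v / (v.norm() + 1e-12)
        P = P - 2.0 * torch.outer(P @ v, v)           # P <- P (I - 2 v v^T)
    return P
```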

Effects of reranking range As illustrated in Fig. 5, we narrow the reranking range to the top-\(N_{\mathcal {D}'}\) items of the hash-based ranking. We investigate the effect of the reranking proportion, \(N_{\mathcal {D}'}/N_\mathcal {D}\), in Fig. 15i, where “0” indicates global hash-based ranking only and “1” means (re)ranking the whole database with fine-grained quantization codes. The performance keeps improving as we increase \(N_{\mathcal {D}'}/N_\mathcal {D}\) from 0 to 75% and tends to converge beyond 75%. We rerank the top-10% of items by default.

4.3.4 Retrieval Visualization Analysis

We visualize the top-10 retrieval results on two datasets: Flickr25K for image-text retrieval and MSRVTT for video-text retrieval. We compare three different methods: Handshaking, Hugging\(^+\) w/o Reranking, and Hugging\(^+\) w/ Reranking, as illustrated in Figs. 16 and 17.

In Fig. 16, we observe that Hugging\(^+\) yields a more accurate result set than Handshaking. Furthermore, with fine-grained reranking, the images most relevant to the query, specifically those involving landscapes, are ranked higher. At the same time, a museum image that initially appeared at Rank 3 and was deemed a negative sample, because it failed to match the labels despite containing a ‘lamb’, is demoted, thereby improving AP@10.

In Fig. 17, two cases are presented. For query #9340, the Hugging\(^+\) model successfully highlights the "trying to comfort" detail within the query by training fine-grained alignment. This pushes the videos with matching scenarios to the forefront. Reranking further exploits this fine-grained information, effectively improving the rank of videos closely aligned with the query. For query #8914, although not exemplary in terms of evaluation metrics, it serves to demonstrate the generalizability of Hugging\(^+\) and fine-grained reranking. Despite the absence of ground-truth videos at higher ranks, the reranked list still features videos somewhat relevant to the query text. Specifically, reranking elevates videos featuring ‘models’ and ‘runways’, which might be considered false negatives due to limited annotations in the evaluation dataset.

4.3.5 Improving Existing Approaches with Hugging and Hugging\(^+\)

In Fig. 7, we have illustrated how transformers, hugging and hugging\(^+\) help to improve the in-domain performance and cross-domain generalizability, respectively. In this section, we conduct more detailed experiments on both text-image (i.e., MSCOCO) and text-video (i.e., MSRVTT) datasets to investigate the compatibility and effectiveness of our proposed designs with existing UCMH approaches. We integrate the transformers, hugging and hugging\(^+\), with each of the 5 state-of-the-art UCMH approaches, namely DJSRH, DSAH, DGCPN, CIRH, and UCHSTM.

We report the results in Table 6, from which we see consistent improvements from the three designs. To be specific, on MSCOCO, when equipped with the same designs, several existing approaches show competitive or even superior performance to HuggingHash (or HuggingHash\(^+\)), e.g. UCHSTM + hugging\(^+\) vs. HuggingHash\(^+\). This is acceptable because HuggingHash and HuggingHash\(^+\) are two simple instantiations; improving them with more sophisticated global alignment mechanisms is a reasonable direction.

Quite differently, on MSRVTT, the baselines still fall behind HuggingHash (or HuggingHash\(^+\)) by considerable gaps even with the same proposed designs. The phenomenon is partially because their basic assumption is not satisfied, as discussed in Sect. 4.2.2. Despite the performance gaps, we can learn that hugging and hugging\(^+\) significantly enhance the retrieval results. By introducing fine-grained alignment based on contrastive learning, hugging brings large performance gains to the handshaking (i.e., only using transformers) models, which confirms the positive synergy between global and fine-grained alignment in training. Besides, hugging\(^+\) with reranking further improves the recall beyond hugging, while the scale of the improvement varies with the bit-length. Under shorter bit-lengths, e.g. 256 bits, the gain of reranking is relatively mild because the global hash-based ranking misses many positive items in the top-10% subset, so reranking cannot recover them. Under medium bit-lengths, e.g. 512 or 1024 bits, the global hash-based ranking includes more positive items in the top-10% subset but is still insufficient to rank them precisely; the fine-grained reranking compensates for this shortcoming, and we observe large relative improvements of about 20% to 40%. Under longer bit-lengths, e.g. 2048 bits, the gain begins to shrink because the hash-based ranking has become more accurate.

5 Conclusions

This paper studies the new and practical problem of transformer-based unsupervised cross-modal hashing (UCMH). We propose a hugging framework that unifies multi-granularity cross-modal alignment as solid self-supervision for hash learning. Besides, we extend hugging to hugging\(^+\), which learns optimized quantized representations and aligns fine-grained cross-modal correspondence simultaneously. It retains the benefit of improving global hash codes and also provides fine-grained quantization codes as a bonus. Reranking with fine-grained quantization codes effectively boosts retrieval performance while enjoying high efficiency. We build HuggingHash and HuggingHash\(^+\) to instantiate hugging and hugging\(^+\), respectively, and show their advantages on text-image and text-video retrieval. We conduct extensive experiments to investigate the efficacy of different components in our design. We also integrate the proposed hugging and hugging\(^+\) into several state-of-the-art UCMH approaches, showing that our design is flexible and compatible with existing UCMH methods when transformers are chosen as backbones.